# bigforecast

This notebook serves as the front-end for [bigforecast](https://github.com/jameslamb/bigforecast), a project completed as part of the Summer 2017 session of "W251: Scaling Up! Really Big Data", a course in UC Berkeley's Master of Information and Data Science (MIDS) program.

The code blocks below let you explore a dataset we've created for macroeconomic research. Like traditional datasets in economics and finance, it includes time series of globally important macro variables such as blue-chip stock prices and the price of oil. We have also ingested (and are still ingesting!) a corpus of global news articles from the [GDELT Event Database](https://www.gdeltproject.org/). This lets researchers follow their curiosity and build unorthodox features to explain financial phenomena, for example:

* Count of articles-per-hour (globally) with the words "financial", "crisis", or "bailout" in the title
* Average tone/sentiment of Wall Street Journal articles over time
* Average tone/sentiment of non-US articles mentioning the United States Federal Reserve
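
To make the first idea concrete, here is a minimal pandas sketch of an articles-per-hour keyword count. It does not use the project's Elasticsearch index: the sample records and the column names `published` and `title` are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical article records; the column names ("published", "title")
# are assumptions, not the project's actual schema.
articles = pd.DataFrame(
    {
        "published": pd.to_datetime(
            [
                "2017-08-20 06:05:00",
                "2017-08-20 06:40:00",
                "2017-08-20 07:15:00",
                "2017-08-20 07:20:00",
            ]
        ),
        "title": [
            "Bank bailout talks resume",
            "Local team wins final",
            "Financial markets slide",
            "Crisis meeting called",
        ],
    }
)

keywords = ["financial", "crisis", "bailout"]
pattern = "|".join(keywords)

# Flag titles containing any of the keywords (case-insensitive)
matches = articles["title"].str.contains(pattern, case=False, regex=True)

# Count the matching articles in each hourly bucket
hourly_counts = (
    articles.loc[matches]
    .set_index("published")
    .resample("1h")
    .size()
)
print(hourly_counts)
```

The resulting series is indexed by hour, so it could be joined to the price series on a shared time index.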

## 1. Load Dependencies

Most of the code needed to interact with this system is available inside the `bigforecast` package. Anything else added below is specific to interactive work with the databases we've created.

In [7]:
import bigforecast.influx as bgfi
import os
import pandas as pd

## 2. Search for Data

There are two sources of data in this project: macro data stored in InfluxDB and news articles stored in Elasticsearch. In this section, we'll show how to hit these data sources to create a training dataset.

### a. InfluxDB

InfluxDB holds time-series data from Yahoo Finance. Field names in this database correspond to ticker symbols on Yahoo Finance. To list the available tickers, use the functions provided in `bigforecast`:

In [3]:
# Connect to the DB (host is hardcoded; alternatively, read it from os.environ['HOSTNAME'])
influxDB = bgfi.db_connect(host="169.53.56.26",
                           database="modeldb",
                           client_type="dataframe")

# List available fields:
bgfi.list_series(db_client=influxDB)

Series available in this DB:


['aapl', 'cad', 'goog', 'uso']

Looks like Google's and Apple's stock prices are available! Let's pull both series into a DataFrame to explore them.

In [5]:
# Get a windowed dataset
trainDF = bgfi.build_dataset(db_client=influxDB,
                             var_list=['aapl', 'goog'],
                             start_time='2017-08-19 18:00:00',
                             end_time='2017-08-22',
                             window_size='30s')

# Drop NAs in the resulting DataFrame
trainDF = trainDF.dropna(axis=0, how='any')

# Print the first few rows
trainDF.head()

| | goog | aapl |
|---|---|---|
| 2017-08-20 06:37:30+00:00 | 915.6071 | 164.717067 |
| 2017-08-20 06:38:00+00:00 | 916.216933 | 165.77132 |
| 2017-08-20 06:38:30+00:00 | 914.63346 | 163.5506 |
| 2017-08-20 06:39:00+00:00 | 915.396017 | 160.439 |
| 2017-08-20 06:39:30+00:00 | 916.704067 | 164.08295 |
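
The internals of `build_dataset` are not shown in this notebook. As a rough analogue of what 30-second windowing followed by `dropna` does, here is a pandas-only sketch. It assumes each window is aggregated with a mean, which may not match the package's behavior, and the tick values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw ticks at irregular times; values are made up.
ticks = pd.DataFrame(
    {
        "goog": [915.0, 917.0, None, 914.0],
        "aapl": [164.0, None, 166.0, 163.0],
    },
    index=pd.to_datetime(
        [
            "2017-08-20 06:37:35",
            "2017-08-20 06:37:50",
            "2017-08-20 06:38:10",
            "2017-08-20 06:38:40",
        ],
        utc=True,
    ),
)

# Average each series within 30-second windows (assumed aggregation;
# missing values are skipped when computing the mean)
windowed = ticks.resample("30s").mean()

# Keep only the windows in which every series has a value
complete = windowed.dropna(axis=0, how="any")
print(complete)
```

In this sketch the 06:38:00 window is dropped because `goog` has no observation in it, which is why a windowed dataset can have gaps in its index after `dropna`.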


### b. Elasticsearch

## 3. Build a Model

## 4. Produce Forecasts

## 5. References