# bigforecast

This notebook serves as the front-end for [bigforecast](https://github.com/jameslamb/bigforecast), a project completed as part of the Summer 2017 session of "W251: Scaling Up! Really Big Data", a course in UC Berkeley's Master of Information and Data Science (MIDS) program.

The code blocks below let you explore a dataset we've created for macroeconomic research. Like traditional datasets in economics and finance, it includes time series of globally important macro variables such as blue-chip stock prices and the price of oil. However, we have also ingested (and are still ingesting!) a corpus of global news articles from the [GDELT Event Database](https://www.gdeltproject.org/). This lets researchers follow their curiosity to new and exciting places. Some unorthodox features that could be used to explain financial phenomena include:

* Count of articles-per-hour (globally) with the words "financial", "crisis", or "bailout" in the title
* Average tone/sentiment of Wall Street Journal articles over time
* Average tone/sentiment of non-US articles mentioning the United States Federal Reserve
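
To make the first two features concrete, here is a minimal pandas sketch of how they might be computed. It is not the project's pipeline: the `articles` DataFrame, its column names (`timestamp`, `title`, `tone`), the sample rows, and the helper functions are all hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical article records. The column names and values are assumptions,
# not the schema used by bigforecast or GDELT.
articles = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-08-01 09:05", "2017-08-01 09:40",
        "2017-08-01 10:15", "2017-08-01 11:30",
    ]),
    "title": [
        "Financial markets rally",
        "Local team wins final",
        "Bailout talks stall amid crisis",
        "Weather update",
    ],
    "tone": [1.5, 3.0, -4.0, 0.5],
})

KEYWORDS = ["financial", "crisis", "bailout"]


def keyword_counts_per_hour(df, keywords):
    """Count articles per hour whose title contains any of the keywords."""
    pattern = "|".join(keywords)
    mask = df["title"].str.contains(pattern, case=False, regex=True)
    hits = df.loc[mask].set_index("timestamp")
    # "60min" bins are hourly; hours between hits get a count of 0
    return hits["title"].resample("60min").count()


def mean_tone_per_hour(df):
    """Average the tone score of all articles within each hour."""
    return df.set_index("timestamp")["tone"].resample("60min").mean()


counts = keyword_counts_per_hour(articles, KEYWORDS)
tone = mean_tone_per_hour(articles)
print(counts)
print(tone)
```

Either resulting series is indexed by time, so it could in principle be joined to the price series as an extra explanatory variable.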

## 1. Load Dependencies

Most of the code needed to interact with this system is available inside the `bigforecast` package. Anything else added below is specific to interactive work with the databases we've created.

In [None]:
import bigforecast.influx as bgfi
#import bigforecast.timeseries as bgfts
import os
import pandas as pd
import math

# Modelling stuff:
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.vector_ar.var_model import VAR
from sklearn.metrics import mean_squared_error

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

## 2. Search for Data

There are two sources of data in this project: macro data stored in InfluxDB and news articles stored in Elasticsearch. In this section, we'll show how to hit these data sources to create a training dataset.

### a. InfluxDB

InfluxDB holds time-series data from Yahoo Finance. Field names in this database correspond to Yahoo Finance ticker symbols. To list the available tickers, use the functions in `bigforecast`:

In [None]:
# Connect to the DB
influxDB = bgfi.db_connect(host=os.environ['HOSTNAME'],
                           database="modeldb",
                           client_type="dataframe")

# List available fields:
bgfi.list_series(db_client=influxDB)

The listing shows which tickers are available. Let's pull a DataFrame covering three of them (`BP`, `PTR`, and `OJSCY`) to explore.

In [None]:
# Get a windowed dataset
trainDF = bgfi.build_dataset(db_client=influxDB,
                             var_list=['BP', 'PTR', 'OJSCY'],
                             start_time='2016-08-01 00:00:00',
                             end_time='2017-08-22 00:00:00',
                             window_size='15m')

# Drop NAs in the resulting DataFrame
trainDF = trainDF.dropna(axis=0, how='any')

# Print the dimensions, then preview the first few rows
print(trainDF.shape)
trainDF.head()
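
For intuition about what a windowed dataset looks like, the sketch below does something similar in plain pandas. It is not the implementation of `build_dataset`, which isn't shown here; it assumes that windowing means averaging observations within fixed 15-minute bins, and the `raw` DataFrame and `window_and_clean` helper are hypothetical. Note that pandas spells this frequency `"15min"`, whereas the `'15m'` passed above follows InfluxDB's duration syntax.

```python
import pandas as pd

# Hypothetical irregularly-timed price observations (values are made up)
raw = pd.DataFrame(
    {"BP": [35.0, 35.2, 35.1, 35.4], "PTR": [64.0, None, 64.3, 64.1]},
    index=pd.to_datetime([
        "2017-08-01 09:31", "2017-08-01 09:38",
        "2017-08-01 09:47", "2017-08-01 10:17",
    ]),
)


def window_and_clean(df, window="15min"):
    """Average each column within fixed windows, then drop incomplete rows."""
    windowed = df.resample(window).mean()
    # Windows with no observations for some column come back as NaN
    return windowed.dropna(axis=0, how="any")


windowed = window_and_clean(raw)
print(windowed.shape)
print(windowed)
```

The 10:00 window contains no observations, so it is removed by `dropna`, just as rows with missing values are removed from `trainDF` above.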

### b. Elasticsearch

## 3. Build a Model

### a. Model Parameters

In this section, we'll train a few time-series models to predict the variable of interest and compare their performance. To accomplish this, you need to specify a few parameters below.

- `TARGET` is a string identifying the name of the variable you want to forecast.
- `X_VARS` is a list of strings identifying the explanatory variables you want to use.

In [None]:
# Set up model params

# Name of the variable you want to forecast
TARGET = 'PTR'

# (Optional) List of names of exogenous variables
X_VARS = ['OJSCY', 'BP']

In [None]:
# Train a model
mod = bgfi.train_model(trainDF[TARGET], model_type="ARIMA")

# Examine model fit
# mod.summary2()

# Evaluate out-of-sample accuracy
# 

# Fit data
bgfi.prediction_plots(mod, trainDF, TARGET)
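
The "Evaluate out-of-sample accuracy" step above is left as a placeholder. One common way to fill it is a chronological hold-out: fit on the early part of the series, forecast the remainder, and score the forecast with root mean squared error (RMSE). The sketch below illustrates the idea with a naive persistence baseline standing in for the trained model. The helper functions and sample values are hypothetical, and nothing here uses the `bigforecast` API.

```python
import math

import pandas as pd


def train_test_split_ts(series, test_frac=0.25):
    """Split chronologically: the last `test_frac` of the points are held out."""
    n_test = max(1, int(len(series) * test_frac))
    return series.iloc[:-n_test], series.iloc[-n_test:]


def naive_forecast(train, horizon):
    """Persistence baseline: repeat the last training value `horizon` times."""
    return [float(train.iloc[-1])] * horizon


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values standing in for trainDF[TARGET]
series = pd.Series([10.0, 11.0, 12.0, 13.0, 12.0, 14.0, 13.0, 15.0], name="PTR")

train, test = train_test_split_ts(series)
preds = naive_forecast(train, horizon=len(test))
score = rmse(test, preds)
print(f"Hold-out RMSE of the naive baseline: {score:.3f}")
```

Shuffling is deliberately avoided: with time series, the test set should come strictly after the training set so the score reflects genuine forecasting. A baseline score like this also gives a reference point for judging whether a fitted model adds anything.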

## 4. Produce Forecasts

## 5. References

See [the bigforecast repo](https://github.com/jameslamb/bigforecast/blob/dev/docs/references.md) for a full list of references.