# bigforecast

This notebook serves as the front end for [bigforecast](https://github.com/jameslamb/bigforecast), a project completed as part of the Summer 2017 session of "W251: Scaling Up! Really Big Data", a course in UC Berkeley's Master of Information and Data Science (MIDS) program.

The code blocks below let you explore a dataset we've created for macroeconomic research. Like traditional datasets in economics and finance, it includes time series of globally important macro variables such as blue-chip stock prices and the price of oil. We have also ingested (and are still ingesting!) a corpus of global news articles from the [GDELT Event Database](https://www.gdeltproject.org/). This lets researchers build unorthodox features that might help explain financial phenomena, for example:

* Count of articles-per-hour (globally) with the words "financial", "crisis", or "bailout" in the title
* Average tone/sentiment of Wall Street Journal articles over time
* Average tone/sentiment of non-US articles mentioning the United States Federal Reserve
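
To make the first feature concrete, here is a minimal sketch of an articles-per-hour keyword count. The record layout (`published`, `title`) and the sample articles are hypothetical and are not the real GDELT schema. The sketch uses only the standard library so that it runs without access to the project's databases.

```python
from collections import Counter
from datetime import datetime

KEYWORDS = ("financial", "crisis", "bailout")

# Hypothetical article records; the real GDELT fields will differ.
articles = [
    {"published": "2017-08-19 18:05:00", "title": "Financial markets rally"},
    {"published": "2017-08-19 18:40:00", "title": "Bailout talks stall"},
    {"published": "2017-08-19 18:55:00", "title": "Local team wins final"},
    {"published": "2017-08-19 19:10:00", "title": "Crisis deepens as banks wobble"},
]


def keyword_counts_per_hour(records, keywords=KEYWORDS):
    """Count articles per hour whose title contains any keyword (case-insensitive)."""
    counts = Counter()
    for rec in records:
        title = rec["title"].lower()
        if any(word in title for word in keywords):
            ts = datetime.strptime(rec["published"], "%Y-%m-%d %H:%M:%S")
            # Truncate the timestamp to the start of its hour
            counts[ts.replace(minute=0, second=0)] += 1
    return dict(sorted(counts.items()))


hourly_counts = keyword_counts_per_hour(articles)
for hour, n in hourly_counts.items():
    print(hour, n)
```

The matching here is plain substring search on the title; a real feature would probably rely on the search engine's own text analysis instead.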

## 1. Load Dependencies

Most of the code needed to interact with this system is available inside the `bigforecast` package. Anything else added below is specific to interactive work with the databases we've created.

In [None]:
import bigforecast.influx as bgfi
import os
import pandas as pd

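# Modelling stuff:
# Note: the legacy statsmodels.tsa.arima_model module was removed in statsmodels 0.13,
# so this import needs an older statsmodels release
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.api import VAR  # passed to VARmodel() in section 3
from sklearn.metrics import mean_squared_error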

## 2. Search for Data

There are two sources of data in this project: macro data stored in InfluxDB and news articles stored in Elasticsearch. In this section, we'll show how to hit these data sources to create a training dataset.

### a. InfluxDB

InfluxDB holds time-series data from Yahoo Finance. Field names in this database correspond to Yahoo Finance ticker symbols. To list the available tickers, use the functions provided by `bigforecast`:

In [None]:
# Connect to the DB
influxDB = bgfi.db_connect(host="169.53.56.26", #os.environ['HOSTNAME']
                           database="modeldb",
                           client_type="dataframe")

# List available fields:
bgfi.list_series(db_client=influxDB)

Looks like Google and Apple's stock prices are available in there! Let's pull out a dataframe to explore these two series.

In [None]:
# Get a windowed dataset
trainDF = bgfi.build_dataset(db_client=influxDB,
                             var_list=['aapl', 'goog'],
                             start_time='2017-08-19 18:00:00',
                             end_time='2017-08-22',
                             window_size='30s')

# Drop NAs in the resulting DataFrame
trainDF = trainDF.dropna(axis=0, how='any')

# Print the first few rows
trainDF.head()
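
For readers without access to the database, the sketch below illustrates what a windowed dataset with incomplete rows dropped might look like. It assumes each series is averaged within fixed 30-second windows; that aggregation is an assumption made for illustration and has not been checked against `build_dataset`. The helper names and tick values are hypothetical.

```python
from datetime import datetime, timedelta


def window_start(ts, seconds):
    """Floor a timestamp to the start of its fixed-width window."""
    epoch = datetime(1970, 1, 1)
    offset = int((ts - epoch).total_seconds()) // seconds * seconds
    return epoch + timedelta(seconds=offset)


def windowed_means(ticks, seconds=30):
    """Average each series per window; keep only windows where every series has data."""
    sums = {}  # window start -> series name -> [running total, count]
    names = set()
    for ts, name, value in ticks:
        names.add(name)
        cell = sums.setdefault(window_start(ts, seconds), {}).setdefault(name, [0.0, 0])
        cell[0] += value
        cell[1] += 1
    rows = {}
    for win in sorted(sums):
        if set(sums[win]) == names:  # analogous to dropna(axis=0, how='any')
            rows[win] = {n: total / count for n, (total, count) in sums[win].items()}
    return rows


# Made-up ticks: (timestamp, series name, value)
t0 = datetime(2017, 8, 21, 14, 0, 0)
ticks = [
    (t0 + timedelta(seconds=5), "aapl", 157.0),
    (t0 + timedelta(seconds=20), "aapl", 158.0),
    (t0 + timedelta(seconds=10), "goog", 910.0),
    (t0 + timedelta(seconds=40), "aapl", 159.0),  # no goog tick in this window
]
rows = windowed_means(ticks)
for win, values in rows.items():
    print(win, values)
```

Only the first window survives, because the second window has no `goog` observation. This mirrors the effect of the `dropna` call above.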

### b. Elasticsearch

## 3. Build a Model

In [None]:
# Train an ARIMA model, choosing the order (p, d, q) by hold-out error
def ARIMAmodel(RawData, lags_to_forecast):
    RawData = RawData.values
    size = int(len(RawData) * 0.8)
    forecast_size = len(RawData) - size
    # Split the data chronologically into training and test sets,
    # flattening the first column into plain lists
    train = [x[0] for x in RawData[0:size]]
    test = [x[0] for x in RawData[size:len(RawData)]]

    best_order = None
    best_error = float("inf")
    for p in range(0, 5):
        for d in range(0, 5):
            for q in range(0, 5):
                try:
                    model = ARIMA(train, order=(p, d, q))
                    model_fit = model.fit(disp=0, transparams=True)
                    # forecast() returns (forecast, stderr, conf_int); keep the point forecasts
                    forecasted_values = list(model_fit.forecast(forecast_size)[0])
                    error = mean_squared_error(y_true=test, y_pred=forecasted_values)
                except Exception:
                    # Some orders are invalid or fail to converge; skip them
                    continue
                # Track the best score so that later candidates must beat it
                if error < best_error:
                    best_error = error
                    best_order = (p, d, q)

    if best_order is None:
        print("no valid ARIMA model")
        return None

    print("Best ARIMA model (p,d,q) : ({},{},{})".format(*best_order))

    # Refit on the full series and predict the next few lags
    model = ARIMA(RawData, order=best_order)
    model_fit = model.fit(disp=0, transparams=True)
    return list(model_fit.forecast(lags_to_forecast)[0])
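
The selection logic used for the ARIMA search is a chronological 80/20 split plus a running "best so far" error. The sketch below shows that pattern in isolation. To stay dependency-free it swaps in simple exponential smoothing for ARIMA, so `ses_forecast`, `select_alpha`, and the toy series are illustrative stand-ins rather than part of `bigforecast`.

```python
def ses_forecast(history, alpha, steps):
    """Simple exponential smoothing: flat forecast at the final smoothed level."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * steps


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def select_alpha(series, candidates):
    """Pick the candidate with the lowest MSE on the final 20% of the series."""
    size = int(len(series) * 0.8)
    train, test = series[:size], series[size:]  # chronological split, no shuffling
    best_alpha, best_error = None, float("inf")
    for alpha in candidates:
        error = mse(test, ses_forecast(train, alpha, len(test)))
        if error < best_error:
            # Update both values, otherwise every later candidate is compared
            # against the initial error and the last one tried always wins
            best_alpha, best_error = alpha, error
    return best_alpha, best_error


# Toy series with a level shift; a fast-adapting alpha should win
series = [10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0]
best_alpha, best_error = select_alpha(series, [0.1, 0.5, 0.9])
print(best_alpha, best_error)
```

Starting `best_error` at infinity rather than a large constant means the search still works when errors are on a large scale, such as squared errors on index-level prices.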

In [None]:
# Train a VAR model, choosing the lag order p by hold-out error
def VARmodel(RawData, VAR, lags_to_forecast):
    size = int(len(RawData) * 0.8)
    forecast_size = len(RawData) - size
    # Position of the target column in the forecast output (assumes a "price" column)
    price_idx = list(RawData.columns).index("price")

    # Split the data chronologically into training and test sets
    train, test = RawData[0:size], RawData[size:len(RawData)]

    bestp = -1
    best_error = float("inf")
    for p in range(1, min(size, 25)):  # start at 1: a VAR needs at least one lag
        try:
            var_model = VAR(train).fit(p)
            lag_order = var_model.k_ar
            forecasted_values = [x[price_idx] for x in var_model.forecast(train[-lag_order:].values, steps=forecast_size)]
            error = mean_squared_error(y_true=test["price"], y_pred=forecasted_values)
        except Exception:
            # Skip lag orders that cannot be fitted
            continue
        # Track the best score so that later candidates must beat it
        if error < best_error:
            best_error = error
            bestp = p

    if bestp != -1:
        # Refit on the full dataset and predict the next few lags
        var_model = VAR(RawData).fit(bestp)
        lag_order = var_model.k_ar
        forecasted_values = [x[price_idx] for x in var_model.forecast(RawData[-lag_order:].values, steps=lags_to_forecast)]
        return forecasted_values
    else:
        print("No VAR Model fitted")
        return None

In [None]:
# Make predictions
def predict(varlist=None, ticker="uso", lags_to_forecast=5, table_name="stockpricedemo",
            start_time="2017-08-17", end_time="2017-08-19", interval_length="15m"):
    varlist = varlist or []  # avoid a mutable default argument
    print("pulling data")
    # NOTE: pull_influx_data and process_list_to_dataframe are not defined in this notebook;
    # they must be defined or imported before predict() is called.
    # VAR is expected to be statsmodels.tsa.api.VAR.
    list_of_data = pull_influx_data(table_name, ticker, varlist, start_time, end_time, interval_length)
    print("making dataframe")
    dataframe_to_model = process_list_to_dataframe(list_of_data, varlist)
    if not varlist:
        # No extra variables: use the univariate ARIMA model
        model_name = "ARIMA"
        print("Running ARIMA model")
        predicted_values = ARIMAmodel(dataframe_to_model, lags_to_forecast)
    else:
        # Extra variables supplied: use the VAR model
        model_name = "VAR"
        print("Running VAR model")
        predicted_values = VARmodel(dataframe_to_model, VAR, lags_to_forecast)
    print("----------------------")
    print(model_name, "model predicts the next", lags_to_forecast, "lags to be:", predicted_values)
    print("----------------------")
    return predicted_values

## 4. Produce Forecasts

## 5. References