#### Stock Market Prediction

The goal is to build a model that predicts financial market movements. The forecasting algorithm aims to predict whether tomorrow's closing price will be higher or lower than today's.

Next steps will include developing a trading strategy on top of these predictions and backtesting it against a benchmark.

In [6]:
# Standard library
import cPickle  # Python 2 module; Python 3 uses `pickle`
import operator
import re
from datetime import datetime

# Third-party
import numpy as np
import pandas as pd
import pandas.io.data  # removed in later pandas releases; superseded by pandas_datareader
#from pandas_datareader import data, wb
from dateutil import parser
from sklearn import neighbors
from sklearn import preprocessing
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.qda import QDA  # later moved to sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
#from backtest import Strategy, Portfolio

The aim of the project is to predict whether future daily returns of the S&P 500 are going to be positive or negative.

This is a binary classification problem.

The metric used will be the daily return:

$Return_i=\dfrac{AdjClose_i-AdjClose_{i-1}}{AdjClose_{i-1}}$

The return on the $i^{th}$ day is equal to the *Adjusted Close Price* on the $i^{th}$ day minus the Adjusted Close Price on the $(i-1)^{th}$ day, divided by the Adjusted Close Price on the $(i-1)^{th}$ day. The Adjusted Close Price of a stock is its close price adjusted for dividends. It is common practice to use this metric in return computations.
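
As a quick check of the formula, the sketch below computes daily returns by hand and with pandas `pct_change`, then derives an up/down label for the following day. The prices are made-up illustrative numbers, and the names `adj_close`, `returns`, and `next_day_up` are hypothetical rather than taken from the project code. Treating a flat day (return of exactly zero) as "down" is also an assumption.

```python
import numpy as np
import pandas as pd

# Made-up adjusted close prices, for illustration only.
adj_close = pd.Series(
    [100.0, 102.0, 101.0, 101.0, 104.03],
    index=pd.date_range("2015-01-05", periods=5, freq="B"),
    name="AdjClose",
)

# Return_i = (AdjClose_i - AdjClose_{i-1}) / AdjClose_{i-1}
manual = (adj_close - adj_close.shift(1)) / adj_close.shift(1)

# pandas implements the same formula as pct_change().
returns = adj_close.pct_change()
assert np.allclose(manual.dropna(), returns.dropna())

# Binary target: 1 if the NEXT day's return is positive, else 0.
# shift(-1) moves tomorrow's return onto today's row; the last row has
# no "tomorrow" and is dropped.
next_day_up = (returns.shift(-1) > 0).astype(int).iloc[:-1]
print(next_day_up)
```

The first return is undefined (there is no previous day), which is why `pct_change` leaves a `NaN` in the first row.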

From the beginning I decided to focus only on the S&P 500, a stock market index based on the market capitalizations of 500 large companies with common stock listed on the NYSE (New York Stock Exchange) or NASDAQ. Being such a diversified portfolio, the S&P 500 index is typically used as a market benchmark, for example to compute the betas of companies listed on the exchange.
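
For readers unfamiliar with beta, here is a minimal sketch of the textbook definition: the covariance of a stock's returns with the market's returns, divided by the variance of the market's returns. The return series are synthetic, and `beta` is a hypothetical helper, not part of this project.

```python
import numpy as np

def beta(stock_returns, market_returns):
    """Textbook beta: Cov(stock, market) / Var(market)."""
    stock_returns = np.asarray(stock_returns, dtype=float)
    market_returns = np.asarray(market_returns, dtype=float)
    # 2x2 sample covariance matrix of the two series.
    cov = np.cov(stock_returns, market_returns, ddof=1)
    return cov[0, 1] / cov[1, 1]

# Synthetic example: a stock that moves exactly twice as much as the market.
market = np.array([0.01, -0.02, 0.015, 0.003, -0.007])
stock = 2.0 * market
print(beta(stock, market))
```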

##### Feature Analysis

The main idea is to use the world's major stock indices as input features for the machine-learning-based predictor. The intuition behind this approach is that globalization has deepened the interaction between financial markets around the world. The shock wave of the US financial crisis (triggered by the Lehman Brothers collapse) hit the economy of almost every country, and the debt crisis that originated in Greece brought down all major stock indices. Nowadays, no financial market is isolated. Economic data, political turmoil, and other overseas events can cause dramatic fluctuations in domestic markets. A "bad day" on the Australian or Japanese exchange is likely to heavily affect Wall Street's opening and trend. In light of these considerations, the following predictors have been selected:

- NASDAQ Composite (^IXIC, Yahoo Finance)
- Dow Jones Industrial Average (^DJI, Quandl)
- Frankfurt DAX (^GDAXI, Yahoo Finance)
- London FTSE-100 (^FTSE, Yahoo Finance)
- Paris CAC 40 (^FCHI, Yahoo Finance)
- Tokyo Nikkei-225 (^N225, Yahoo Finance)
- Hong Kong Hang Seng (^HSI, Yahoo Finance)
- Australia ASX-200 (^AXJO, Yahoo Finance)

Historical daily prices for these indices are easy to obtain. Python libraries handle the download: the data can be pulled from Yahoo Finance or Quandl and cleanly formatted into a DataFrame with the following columns:

- Date: the trading day
- Open: price of the stock at the opening of trading (in US dollars)
- High: highest price of the stock during the trading day (in US dollars)
- Low: lowest price of the stock during the trading day (in US dollars)
- Close: price of the stock at the close of trading (in US dollars)
- Volume: number of shares traded
- Adj Close: price of the stock at the close of trading, adjusted for dividends (in US dollars)

The following is a screenshot of the Yahoo Finance website showing a subset of NASDAQ Composite historical prices. This is exactly what a pandas DataFrame looks like after the data has been downloaded.

In [7]:
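def getStock(symbol, start, end):
    """
    Downloads Stock from Yahoo Finance.
    Computes daily Returns based on Adj Close.
    Returns pandas dataframe.
    """
    # Note: pandas.io.data was later moved out of pandas into pandas_datareader.
    df = pd.io.data.get_data_yahoo(symbol, start, end)

    # The last column is the adjusted close. Rename it through the public API
    # rather than mutating df.columns.values in place.
    df = df.rename(columns={df.columns[-1]: 'AdjClose'})
    # Suffix every column with the symbol so frames stay distinguishable.
    df.columns = df.columns + '_' + symbol
    df['Return_%s' % symbol] = df['AdjClose_%s' % symbol].pct_change()

    return df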
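
Since the download itself needs network access, the column handling in `getStock` can be tried offline. The sketch below applies the same steps (rename the last column, suffix every column with the symbol, compute returns) to a small hand-made frame, using `DataFrame.rename` for the first step. The numbers are invented, `add_symbol_suffix` is a hypothetical helper name, and the column layout is an assumption based on the Yahoo Finance columns listed earlier.

```python
import pandas as pd

def add_symbol_suffix(df, symbol):
    """Mimic getStock's post-processing on an already-downloaded frame."""
    # Rename the last column (assumed to be the adjusted close).
    df = df.rename(columns={df.columns[-1]: "AdjClose"})
    # Tag every column with the symbol so frames can be joined later.
    df.columns = df.columns + "_" + symbol
    df["Return_%s" % symbol] = df["AdjClose_%s" % symbol].pct_change()
    return df

# Invented prices, for illustration only.
raw = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 10.2],
        "High": [10.6, 10.8, 10.4],
        "Low": [9.9, 10.1, 10.0],
        "Close": [10.5, 10.2, 10.3],
        "Volume": [1000, 1200, 900],
        "Adj Close": [10.0, 11.0, 9.9],
    },
    index=pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-07"]),
)

demo = add_symbol_suffix(raw, "^IXIC")
print(demo.columns.tolist())
```

Because `rename` returns a new frame, the input `raw` keeps its original column names.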

In [10]:
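def getStockFromQuandl(symbol, name, start, end):
    """
    Downloads Stock from Quandl.
    Computes daily Returns based on Adj Close.
    Returns pandas dataframe.
    """
    import Quandl
    # Replace "your token" with a personal Quandl API token.
    df = Quandl.get(symbol, trim_start=start, trim_end=end, authtoken="your token")

    # The last column is the adjusted close. Rename it through the public API
    # rather than mutating df.columns.values in place.
    df = df.rename(columns={df.columns[-1]: 'AdjClose'})
    # Suffix every column with the given name so frames stay distinguishable.
    df.columns = df.columns + '_' + name
    df['Return_%s' % name] = df['AdjClose_%s' % name].pct_change()

    return df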

In [9]:
def getStockDataFromWeb(fout, start_string, end_string):
    """
    Collects predictors data from Yahoo Finance and Quandl.
    Returns a list of dataframes.
    """
    # Yahoo expects datetime objects; Quandl accepts the raw date strings.
    start = parser.parse(start_string)
    end = parser.parse(end_string)

    nasdaq = getStock('^IXIC', start, end)
    frankfurt = getStock('^GDAXI', start, end)
    london = getStock('^FTSE', start, end)
    paris = getStock('^FCHI', start, end)
    hkong = getStock('^HSI', start, end)
    nikkei = getStock('^N225', start, end)
    australia = getStock('^AXJO', start, end)

    djia = getStockFromQuandl("YAHOO/INDEX_DJI", 'Djia', start_string, end_string)

    # The series to predict gets the fixed suffix '_Out' instead of its symbol.
    out = pd.io.data.get_data_yahoo(fout, start, end)
    out = out.rename(columns={out.columns[-1]: 'AdjClose'})
    out.columns = out.columns + '_Out'
    out['Return_Out'] = out['AdjClose_Out'].pct_change()

    return [out, nasdaq, djia, frankfurt, london, paris, hkong, nikkei, australia]
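
One possible way to use the list returned by `getStockDataFromWeb` is to line up the `Return_*` columns of all frames on a common date index. The sketch below does this on two tiny synthetic frames. `combine_returns` is a hypothetical helper, and the choice of an outer join followed by dropping incomplete rows is an assumption, not necessarily what this project does. Markets observe different holidays, so some policy for missing days is needed.

```python
import numpy as np
import pandas as pd

def combine_returns(frames):
    """Join the Return_* columns of several frames on their date index."""
    return_cols = [df.filter(regex=r"^Return_") for df in frames]
    merged = pd.concat(return_cols, axis=1, join="outer").sort_index()
    # Keep only days on which every market has a return.
    return merged.dropna()

dates_a = pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-07"])
dates_b = pd.to_datetime(["2015-01-05", "2015-01-07"])  # 6 Jan missing (holiday)

# Synthetic frames shaped like the output of getStock; values are invented.
out = pd.DataFrame({"AdjClose_Out": [1.0, 1.1, 1.2],
                    "Return_Out": [np.nan, 0.10, 0.09]}, index=dates_a)
nikkei = pd.DataFrame({"AdjClose_^N225": [5.0, 5.5],
                       "Return_^N225": [0.01, 0.10]}, index=dates_b)

features = combine_returns([out, nikkei])
print(features)
```

Dropping incomplete rows is the simplest policy but discards data. Forward-filling the last known value is a common alternative, at the cost of repeating stale returns.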


In [12]:
# Several libraries (zipline, pandas, and even matplotlib) can download data from Yahoo Finance; pandas is used here:

from pandas.io.data import DataReader
from datetime import datetime

goog = DataReader("GOOG",  "yahoo", datetime(2000,1,1), datetime(2012,1,1))
goog["Adj Close"]

Date
2004-08-19     50.119968
2004-08-20     54.100990
2004-08-23     54.645447
2004-08-24     52.382705
2004-08-25     52.947145
2004-08-26     53.901190
2004-08-27     53.022069
2004-08-30     50.954132
2004-08-31     51.133953
2004-09-01     50.075011
2004-09-02     50.704382
2004-09-03     49.955130
2004-09-07     50.739348
2004-09-08     51.098988
2004-09-09     51.103983
2004-09-10     52.612476
2004-09-13     53.696398
2004-09-14     55.689407
2004-09-15     55.944152
2004-09-16     56.928171
2004-09-17     58.686414
2004-09-20     59.620484
2004-09-21     58.861240
2004-09-22     59.130972
2004-09-23     60.349754
2004-09-24     59.855246
2004-09-27     59.071031
2004-09-28     63.366744
2004-09-29     65.474638
2004-09-30     64.735377
                 ...    
2011-11-17    300.135389
2011-11-18    297.143366
2011-11-21    290.180320
2011-11-22    289.710772
2011-11-23    284.770724
2011-11-25    281.219276
2011-11-28    293.801692
2011-11-29    291.174315
2011-11-30    299.39