
# Predicting the Dow Jones from Daily News

* Author:   Jay Huang
* E-mail:   askjayhuang at gmail dot com
* GitHub:   https://github.com/jayhuang1
* Created:  2017-11-26

This workshop predicts movement in the Dow Jones Industrial Average (DJIA) from the top 25 news headlines of the day on the [Reddit World News subreddit](https://www.reddit.com/r/worldnews/?hl=). Natural language processing is used to tokenize the headlines of a given day, build a vocabulary, and count the occurrences of each token in that vocabulary. A binary classifier then predicts whether the DJIA adjusted close increased or decreased on a given day.

## Data Ingestion

The data set was downloaded from [Kaggle](https://www.kaggle.com). The data set contains the top 25 news headlines of the day from August 8th, 2008 to July 1st, 2016. Each instance (the news headlines of a given day) is labelled by a binary value that indicates whether the Dow Jones Industrial Average increased or decreased in value:
* '1' when the DJIA Adjusted Close increased or stayed the same;
* '0' when the DJIA Adjusted Close decreased.
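
The labelling rule can be sketched in a few lines. This is a minimal illustration, assuming each day's adjusted close is compared with the previous trading day's adjusted close; the prices and the helper name `label_days` are made up for the example and are not taken from the data set.

```python
def label_days(adj_close):
    """Return one label per day after the first: 1 if the adjusted close
    rose or stayed the same versus the previous day, otherwise 0."""
    return [1 if today >= prev else 0
            for prev, today in zip(adj_close, adj_close[1:])]

# Hypothetical adjusted close values, for illustration only
prices = [100.0, 101.5, 101.5, 99.0]
labels = label_days(prices)
print(labels)  # [1, 1, 0]
```

Note that an unchanged close counts as '1', matching the first bullet above.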

The data was downloaded from Kaggle as a csv file and read into a pandas DataFrame.

In [1]:
import pandas as pd

CSV_PATH = 'data/Combined_News_DJIA.csv'

# Read data from csv file and clean DataFrame
df = pd.read_csv(CSV_PATH, index_col=0)
df.index = pd.DatetimeIndex(df.index)

## Data Exploration

Let's explore the DataFrame. Note that each headline is stored in its own separate column.

The DataFrame contains 1989 rows and 26 columns. Twenty-five columns hold the headlines, while the 'Label' column contains a binary value that indicates whether the DJIA increased or decreased.

In [2]:
df.head()

Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",b'Georgian troops retreat from S. Osettain cap...,...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,b'Welcome To World War IV! Now In High Definit...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...","b""The US military was surprised by the timing ...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',"b""The commander of a Navy air reconnaissance s...",...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",b'Russia exaggerating South Ossetian death tol...,...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


In [3]:
df.shape

(1989, 26)

## Data Wrangling

In order for Scikit Learn to process a vocabulary for a given day, the top 25 headlines of a given day need to be combined together into a single string. The following function is used to combine the headlines:

In [4]:
def combine_text_columns(df):
    """Combine all text columns in a row of a DataFrame."""
    # Replace missing (null) values with an empty string; fillna returns
    # a new DataFrame, so the caller's data is not modified
    df = df.fillna("")

    # Join all text columns in a row with a space in between
    combined = df.apply(lambda x: " ".join(x), axis=1)

    return combined

The data set is wrangled into X and y and split into training and test data:

In [5]:
from sklearn.model_selection import train_test_split

# Set X and y
X = df.drop('Label', axis=1)
y = df['Label']

# Split data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=31)
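
As a quick sanity check on the split, the expected sizes can be worked out by hand. This sketch assumes that scikit-learn rounds a fractional `test_size` up to the next whole row, which is consistent with the 398 test rows that appear in the classification report later on; the variable names are illustrative.

```python
import math

n_rows = 1989      # rows in the DataFrame (see df.shape above)
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_rows)   # assumed rounding: up
n_train = n_rows - n_test                # the remainder is used for training

print(n_train, n_test)
```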

To test the function we just created, all rows of X are passed to combine_text_columns and the first combined row is inspected. We then try out the tokenizer of CountVectorizer, the vectorizer that will later be used to create a matrix of token counts.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Show tokenized words for the first row
X_combined = combine_text_columns(X)
tokens = CountVectorizer().build_tokenizer()(X_combined.iloc[0])
df_tk = pd.DataFrame([[x, tokens.count(x)] for x in set(tokens)], columns=['Word', 'Count'])
df_tk.sort_values('Count', inplace=True, ascending=False)
print(X.iloc[0].name, '\n')
print(X_combined.iloc[0], '\n')
print(df_tk.head(15), '\n')

2008-08-08 00:00:00 

b"Georgia 'downs two Russian warplanes' as countries move to brink of war" b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)' b'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire' b"Afghan children raped with 'impunity,' U.N. official says - this is sick, a three year old was raped and they do nothing" b'150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets.' b"Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO's side" b"The 'enemy combatent' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it." b'Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO]' b'Did the U.S. Prep Georgia for War with Russia

The countries of Georgia and Russia seem to have dominated the headlines on August 8th, 2008.
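
The counting step can be approximated without scikit-learn. The sketch below assumes CountVectorizer's default behaviour when it builds the count matrix (lowercasing, then keeping tokens of two or more word characters); the sample text joins two of the headlines shown above, and `count_tokens` is a name introduced here for illustration.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def count_tokens(text):
    """Lowercase the text, split it into tokens, and count each token."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

sample = ("150 Russian tanks have entered South Ossetia whilst Georgia "
          "shoots down two Russian jets. "
          "Breaking: Georgia invades South Ossetia, Russia warned it would "
          "intervene on SO's side")

counts = count_tokens(sample)
print(counts["russian"], counts["georgia"], counts["russia"])
```

Single-character tokens, such as the trailing "s" in "SO's", are dropped by this pattern.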

## Model Building

We are now ready to build a machine learning pipeline for our binary classification problem. Note that instead of a simple bag-of-words model over single tokens, we use bi-grams to capture information carried by two-token phrases. How the hyperparameters for LogisticRegression were found is discussed in the next section.

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
import warnings

warnings.simplefilter(action='ignore')

# Create a FunctionTransformer to combine text columns in a row
combine_text_ft = FunctionTransformer(combine_text_columns, validate=False)

# Create pipeline
pl = Pipeline([
    ('cmb', combine_text_ft),
    ('vct', CountVectorizer(ngram_range=(2, 2))),
    ('clf', LogisticRegression(C=.01, solver='liblinear'))
])

# Fit the pipeline on train data
pl.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('cmb', FunctionTransformer(accept_sparse=False,
          func=<function combine_text_columns at 0x113ace6a8>,
          inv_kw_args=None, inverse_func=None, kw_args=None,
          pass_y='deprecated', validate=False)), ('vct', CountVectorizer(analyzer='word', binary=False, decode_error='st...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
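
To make the `ngram_range=(2, 2)` setting concrete, the sketch below builds bi-grams by pairing each token with its successor. It is a plain-Python approximation rather than scikit-learn's own code; `bigrams` is a hypothetical helper, and the tokens come from one of the headlines shown earlier.

```python
from collections import Counter

def bigrams(tokens):
    """Pair every token with the token that follows it."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Tokens from the headline "Russian forces sink Georgian ships"
tokens = ["russian", "forces", "sink", "georgian", "ships"]
pairs = bigrams(tokens)
print(pairs)
# ['russian forces', 'forces sink', 'sink georgian', 'georgian ships']

# Counting the pairs gives one row of the bi-gram count matrix
bigram_counts = Counter(pairs)
```

A text with n tokens yields n - 1 bi-grams, and most of them are rare, so a bi-gram vocabulary is typically much larger than a single-token one.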

## Model Optimization

To find good values for the hyperparameters C and solver in LogisticRegression, we run a grid search with cross validation:

In [15]:
import numpy as np
from sklearn.model_selection import GridSearchCV

# Grid search cross validation for hyperparameters C and solver in LogisticRegression
parameters = {'clf__C': np.logspace(-2, 2, 10), 'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

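# Note: the search is fit on the full data set (X, y), so the rows later used
# as test data also influence the chosen hyperparameters
estimator = GridSearchCV(pl, parameters)
estimator.fit(X, y)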

print("Best score: %0.3f" % estimator.best_score_)
print("Best parameters set:")
best_parameters = estimator.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))


Best score: 0.534
Best parameters set:
	clf__C: 0.01
	clf__solver: 'liblinear'
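
For a sense of the size of the search space, the grid can be enumerated by hand. The sketch below reproduces the values that `np.logspace(-2, 2, 10)` generates using only the standard library and counts the candidate combinations; the variable names are illustrative.

```python
from itertools import product

# Ten values spaced evenly on a log scale from 10**-2 to 10**2
c_values = [10 ** (-2 + 4 * i / 9) for i in range(10)]
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Every (C, solver) pair is one candidate setting
grid = list(product(c_values, solvers))
print(len(grid))  # 50
print([round(c, 3) for c in c_values])
```

Each candidate is refit once per cross-validation fold, so the total number of fits is 50 times the number of folds.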


## Model Evaluation

Let's score the test data with our newly created pipeline and also produce a classification report and a cross tabulation table. Note that mean accuracy alone isn't the best indicator of how well the model performed because it doesn't describe how each class performs. The f1-score of each class is a better metric.

In [19]:
from sklearn.metrics import classification_report

# Score the test data
print('Mean Accuracy:', pl.score(X_test, y_test), '\n')

# Print the classification report of test data
y_pred = pl.predict(X_test)
target_names = ['DJIA Decreased (0)', 'DJIA Increased (1)']
print(classification_report(y_test, y_pred, target_names=target_names))

# Print the cross tabulation table of test data
print(pd.crosstab(y_test, y_pred, rownames=["Actual"], colnames=["Predicted"]), '\n')

Mean Accuracy: 0.545226130653 

                    precision    recall  f1-score   support

DJIA Decreased (0)       0.56      0.25      0.34       191
DJIA Increased (1)       0.54      0.82      0.65       207

       avg / total       0.55      0.55      0.50       398

Predicted   0    1
Actual            
0          47  144
1          37  170 



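Note that our model performed much better at identifying days when the DJIA increased (recall 0.82, f1-score 0.65) than days when it decreased (recall 0.25, f1-score 0.34). It predicted an increase for 314 of the 398 test days.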
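
The figures in the classification report can be reproduced from the cross tabulation. The sketch below does so with plain arithmetic and also computes, as an added point of comparison, the accuracy of a baseline that always predicts an increase; the helper `prf` and the variable names are introduced here for illustration.

```python
# Counts from the cross tabulation above (rows: actual, columns: predicted)
tn, fp = 47, 144   # actual 0: predicted 0, predicted 1
fn, tp = 37, 170   # actual 1: predicted 0, predicted 1

def prf(correct, predicted_total, actual_total):
    """Return precision, recall, and f1-score for one class."""
    precision = correct / predicted_total
    recall = correct / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

decrease = prf(tn, tn + fn, tn + fp)
increase = prf(tp, tp + fp, tp + fn)
total = tp + tn + fp + fn
accuracy = (tp + tn) / total
always_up = (tp + fn) / total   # baseline: always predict an increase

print([round(v, 2) for v in decrease])          # [0.56, 0.25, 0.34]
print([round(v, 2) for v in increase])          # [0.54, 0.82, 0.65]
print(round(accuracy, 3), round(always_up, 3))  # 0.545 0.52
```

On this test split the model's accuracy is only a few points above the always-increase baseline.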

Finally, let's create a coefficient-word table and see which phrases correspond to movement in the Dow Jones:

In [18]:
# Print coefficient-word table
vct = pl.named_steps['vct']
clf = pl.named_steps['clf']
# Note: newer scikit-learn releases replace get_feature_names()
# with get_feature_names_out()
words = vct.get_feature_names()
coeffs = clf.coef_.tolist()[0]
coeff_df = pd.DataFrame({'Word': words,
                         'Coefficient': coeffs})
coeff_df = coeff_df.sort_values(['Coefficient', 'Word'], ascending=[False, True])
print(coeff_df.head(15), '\n')
print(coeff_df.tail(15), '\n')

        Coefficient        Word
144783     0.078849   have been
356740     0.077918     will be
314127     0.075884   that they
316182     0.070915   the first
25087      0.064349   and other
14363      0.063924   after the
328424     0.060601      to the
169728     0.059127      is now
120035     0.057518  first time
170293     0.056781       is to
348153     0.056246      war on
173491     0.056015      it has
325581     0.053955    to build
324548     0.053750     time in
157766     0.053338    in china 

        Coefficient          Word
641       -0.045234        10 000
200530    -0.045402     member of
242887    -0.047878    people are
169718    -0.048318         is no
204211    -0.048912   minister of
340660    -0.051119         up in
160019    -0.051735        in the
316357    -0.053105    the german
68415     -0.057831   children in
313526    -0.059449       that is
230187    -0.063391     on monday
49120     -0.064290     bin laden
321213    -0.066460      there is
323266    

Phrases such as "bin laden" and "threatens to" have strongly negative coefficients. It makes sense that terrorist threats would have a negative effect on the Dow Jones.
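
One way to read these coefficients: in a logistic regression, each additional occurrence of a bi-gram multiplies the predicted odds of an increase by e raised to its coefficient. The sketch below applies that to two coefficients from the table; it ignores the intercept and all other features, so it only illustrates scale.

```python
import math

# Coefficients copied from the table above
coefficients = {"have been": 0.078849, "bin laden": -0.064290}

# exp(coefficient) is the factor applied to the odds per occurrence
odds_multiplier = {phrase: math.exp(coef)
                   for phrase, coef in coefficients.items()}

for phrase, mult in odds_multiplier.items():
    print(f"{phrase!r}: odds x {mult:.3f}")
```

For these two phrases, a single occurrence shifts the odds by less than 10 percent in either direction.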

## Conclusion

Our model predicts movement in the Dow Jones from news headlines with a test accuracy of about 0.55, and it is considerably better at identifying increases (recall 0.82) than decreases (recall 0.25).

Ideas to explore for improving the model in future iterations include optimizing the ngram_range parameter in CountVectorizer and adding interaction terms to the pipeline.
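
As a starting point for the first idea, the grid from the optimization step could be extended with an entry for the vectorizer. This is only a sketch: the key `vct__ngram_range` follows from the step name 'vct' in the pipeline, while the candidate ranges and the reduced set of C values are arbitrary choices made for illustration.

```python
# Hypothetical extension of the earlier parameter grid
extended_parameters = {
    'vct__ngram_range': [(1, 1), (1, 2), (2, 2)],  # single tokens, both, bi-grams
    'clf__C': [0.01, 0.1, 1.0],                    # illustrative subset of C values
    'clf__solver': ['liblinear'],
}

# Number of candidate settings the search would evaluate
n_candidates = 1
for values in extended_parameters.values():
    n_candidates *= len(values)
print(n_candidates)  # 9
```

The dictionary would be passed to GridSearchCV in place of `parameters`.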