# Web Mining Project - Stock Market Prediction

## Question 1 - Data Pre-Processing and Exploration

Import relevant packages and define datasets directory.

In [3]:
import nltk
import string
import pandas as pd
from nltk.corpus import stopwords

datasets_directory = r"C:\Users\Ron Michaeli\Dropbox\4th Year 1st Semester\Web Mining\Project\Datasets"

Create a set of stop words and punctuation.

In [6]:
stop_list = set(stopwords.words("english") + list(string.punctuation))

Read the combined dataset into a DataFrame.

In [7]:
combined_news_djia_df = pd.read_csv(r"C:\Users\ravedan\PycharmProjects\Project-Stock-Market-Prediction\Data\Combined_News_DJIA.csv")

A function to clean the text in the Reddit headline columns (it is applied to each cell in these columns).

In [8]:
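def clean_text(cell_text):
    cell_text = str(cell_text)
    # Drop the leading "b" left over from bytes literals such as b"...".
    if cell_text.startswith("b\"") or cell_text.startswith("b\'"):
        cell_text = cell_text[1:]
    # Remove quote characters before tokenizing.
    cell_text = cell_text.replace("\"", "").replace("\'", "")
    tokens = nltk.word_tokenize(cell_text)
    # Keep lowercased tokens that are not stop words, punctuation, or single characters.
    kept_tokens = []
    for token in tokens:
        token = token.lower()
        if token not in stop_list and len(token) > 1:
            kept_tokens.append(token)
    return " ".join(kept_tokens)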
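
To make the effect of the cleaning step concrete, here is a small sketch that mirrors its steps using only the standard library. It is a simplified stand-in, not the function above: `simple_clean`, the regex tokenizer, and the tiny `DEMO_STOP_WORDS` set are assumptions made so that the example runs without NLTK data.

```python
import re
import string

# Hypothetical, deliberately tiny stop list; the notebook uses NLTK's English list.
DEMO_STOP_WORDS = {"the", "a", "of", "in", "to", "and"} | set(string.punctuation)

def simple_clean(cell_text):
    """Simplified stand-in for clean_text (regex tokenizer instead of NLTK)."""
    cell_text = str(cell_text)
    # Drop the leading b left over from a bytes literal such as b"...".
    if cell_text.startswith(('b"', "b'")):
        cell_text = cell_text[1:]
    cell_text = cell_text.replace('"', "").replace("'", "")
    # Split into alphanumeric runs; NLTK's tokenizer is more sophisticated.
    tokens = re.findall(r"[A-Za-z0-9]+", cell_text.lower())
    kept = [t for t in tokens if t not in DEMO_STOP_WORDS and len(t) > 1]
    return " ".join(kept)

example = simple_clean('b"The price of oil rises in Russia"')
print(example)
```

The made-up headline loses its `b"` prefix, its quotes, and its stop words, leaving only the lowercased content words.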

Clean the Reddit headlines columns in the DataFrame.

In [9]:
headlines_columns = combined_news_djia_df.columns[range(2, 27)]
combined_news_djia_df[headlines_columns] = combined_news_djia_df[headlines_columns].applymap(clean_text)

Class distribution:

In [6]:
samples_per_class = []
for c in [0, 1]:
    samples_per_class.append([c, combined_news_djia_df["Label"].value_counts()[c]])
pd.DataFrame(samples_per_class, columns=['Class', '# of Samples'])

| | Class | # of Samples |
|---|---|---|
| 0 | 0 | 924 |
| 1 | 1 | 1065 |


The classes are roughly balanced: 924 of the 1,989 days (about 46%) are labeled 0 and 1,065 (about 54%) are labeled 1.
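
The arithmetic behind the balance claim can be checked with a short sketch. It uses only the two counts from the table above; the variable names are illustrative. The majority-class share is a useful reference point, because a model that always predicts class 1 would already reach that accuracy on this data.

```python
counts = {0: 924, 1: 1065}  # class counts from the table above
total = sum(counts.values())

# Share of each class; float() keeps the division exact under Python 2 as well.
shares = {label: n / float(total) for label, n in counts.items()}
majority_baseline = max(shares.values())

for label, share in sorted(shares.items()):
    print("class %d: %.1f%%" % (label, 100 * share))
print("majority-class baseline accuracy: %.1f%%" % (100 * majority_baseline))
```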

Top frequent words of each class:

In [12]:
n = 10
for c in [0, 1]:
    print 'Class:', c
    word_frequency_per_class = pd.Series()
    class_df = combined_news_djia_df.loc[combined_news_djia_df["Label"] == c]
    for headline_column in combined_news_djia_df[headlines_columns]:
        word_frequency_per_headline_column = pd.Series(" ".join(class_df[headline_column]).split()).value_counts()[:n]
        word_frequency_per_class = word_frequency_per_class.append(word_frequency_per_headline_column)
    word_frequency_per_class = word_frequency_per_class.groupby(by=word_frequency_per_class.index).sum().sort_values(ascending=False)
    print(word_frequency_per_class.to_frame('Word Frequency'))

Class: 0


TypeError: to_frame() takes at most 2 arguments (3 given)

We can see that there is no apparent difference in the most frequent words between the two classes.

We would expect class 0 to contain more negative words (war, killed, etc.) that may cause a negative public mood and lead to a decrease in the DJIA.

In terms of frequent words, however, the two classes look essentially the same.

## Question 2 - Google Correlate & Google Trends

In the attached report.

## Question 3 - Keras & Non-Keras Model Building

Create X, y datasets.

In [8]:
X = combined_news_djia_df.drop(["Label", "Date"], axis=1)
y = combined_news_djia_df[["Label"]]

Split the dataset 80-20 for Pipeline fitting and cross-validation.

Why do we split the dataset?

a) Because we have enough data to afford splitting.

b) To detect overfitting: a model is only trusted if it performs well on data it was not fitted on.

c) To keep the held-out 20% for evaluating the chosen model later.

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
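
As a rough illustration of what a seeded 80-20 split does, the sketch below shuffles row indices with a fixed seed and holds out 20% of them, rounded up. It is not scikit-learn's implementation and will not reproduce the same row order; `split_80_20` is a hypothetical helper, and 1,989 is the number of rows in the class table above.

```python
import math
import random

def split_80_20(items, test_size=0.2, seed=0):
    """Shuffle with a fixed seed, then hold out a `test_size` share of the items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(math.ceil(test_size * len(shuffled)))
    # The first n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_80_20(range(1989))
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is what `random_state=0` is for in the cell above.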

Convert X_train and X_test to lists of strings (all 25 headlines of a day are joined into one string).

In [11]:
X_train = X_train.apply(lambda x: " ".join(x), axis=1).tolist()
X_test = X_test.apply(lambda x: " ".join(x), axis=1).tolist()

### Non-Keras Model Evaluation

Import relevant packages.

In [13]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron

In [None]:
classifiers = \
{
    'Naive Bayes': MultinomialNB(),
    'SGDClassifier': SGDClassifier(),
    'Perceptron': Perceptron()
}

Use a Pipeline together with GridSearchCV's built-in cross-validation to compare the classifiers, using a TF-IDF vectorizer as the feature extractor.

In [None]:
for classifier in classifiers:
    print classifier
    pipeline = Pipeline([('vect', TfidfVectorizer()), ('clf', classifiers[classifier])])
    parameters = {'vect__max_df': np.arange(0.1, 1, 0.1),
                  'clf__alpha': np.arange(0.01, 0.1, 0.01)}
    gs_clf = GridSearchCV(pipeline, parameters, n_jobs=1, cv=5)
    gs_clf = gs_clf.fit(X_train, y_train)
    print 'Best params:', gs_clf.best_params_
    print 'Mean cross-validated score:', gs_clf.best_score_
    print ''
    print ''

The SGD classifier achieves the highest mean CV score (~54%), so it is our chosen model.

Let's evaluate the SGD classifier on the test set, using the best parameters and AUC as the evaluation metric.

In [14]:
tf_idf_vectorizer = TfidfVectorizer(max_df=0.3)
X_train_transformed = tf_idf_vectorizer.fit_transform(X_train)
X_test_transformed = tf_idf_vectorizer.transform(X_test)
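
The `max_df=0.3` setting tells the vectorizer to ignore terms that occur in more than 30% of the documents. The sketch below shows that filtering rule on a toy corpus in plain Python. `filter_by_max_df` and the sample documents are made up for illustration, and the real vectorizer does considerably more (tokenization, IDF weighting, normalization).

```python
def filter_by_max_df(documents, max_df):
    """Return the vocabulary after dropping terms whose document frequency
    (share of documents containing the term) is strictly above max_df."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() so that a term is counted once per document.
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df / float(n_docs) <= max_df)

docs = ["china trade war", "china oil price", "china election", "oil spill"]
vocab = filter_by_max_df(docs, max_df=0.5)
print(vocab)
```

Here "china" appears in three of the four documents and is dropped, while "oil" appears in exactly half of them and is kept.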

In [15]:
type(X_train_transformed)

scipy.sparse.csr.csr_matrix

In [None]:
from sklearn import metrics

sgd_classifier = SGDClassifier(alpha=0.06)
sgd_classifier.fit(X_train_transformed, y_train)
prediction = sgd_classifier.predict(X_test_transformed)
print 'AUC score:', metrics.roc_auc_score(y_test, prediction)
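
One caveat about the cell above: AUC is normally computed from continuous scores (for example the output of the classifier's `decision_function`), whereas `predict` returns hard 0/1 labels, which reduces the ROC curve to a single operating point. The sketch below illustrates the difference with a plain-Python AUC. The `roc_auc` helper and the four scores are invented for the example and are not results from this notebook.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
scores = [-0.4, 0.2, 0.3, 0.9]                      # made-up decision scores
hard_labels = [1 if s > 0 else 0 for s in scores]   # what predict() would return

auc_from_scores = roc_auc(y_true, scores)
auc_from_labels = roc_auc(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

The scores rank every positive above every negative, but thresholding at zero throws that ranking away, so the AUC computed from hard labels comes out lower.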

### Keras Model Evaluation

Import relevant packages for Keras.

In [None]:
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
np.random.seed(7)

Create LSTM-RNN model and fit it on train data.

In [None]:
top_words = 10000

#create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train_transformed, y_train, validation_data=(X_test_transformed, y_test), epochs=3, batch_size=128)

In [None]:
scores = model.evaluate(X_test_transformed, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 51.51%
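
A note on the input format: an `Embedding` layer looks up one vector per integer word index, so it is usually fed padded sequences of word indices rather than TF-IDF weights. The sketch below shows one way such sequences could be built in plain Python. It is not what the cells above do; `build_vocabulary` and `encode_and_pad` are hypothetical helpers, and index 0 is reserved for padding by assumption.

```python
def build_vocabulary(texts, top_words):
    """Map the most frequent words to indices 1..top_words-1 (0 is padding)."""
    counts = {}
    for text in texts:
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked[: top_words - 1])}

def encode_and_pad(text, vocabulary, max_length):
    """Turn text into a fixed-length list of indices, left-padded with 0."""
    indices = [vocabulary[w] for w in text.split() if w in vocabulary]
    indices = indices[-max_length:]  # keep the last max_length words
    return [0] * (max_length - len(indices)) + indices

texts = ["oil price rises", "oil spill"]
vocab = build_vocabulary(texts, top_words=10000)
encoded = [encode_and_pad(t, vocab, max_length=4) for t in texts]
print(encoded)
```

Each row has the same length and contains only integers below `top_words`, which is the shape an embedding lookup expects.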

## Question 4 - Predict New Data Using The Models

Now we need to test our models on the latest actual data from Reddit and Yahoo! Finance.

### Reddit Data Collection

We'll use the PRAW package to collect the top 25 headlines in the past 14 days from Reddit.

* Dow Jones data will be based on the past 10 business days, which span 14 calendar days.
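
The 10-versus-14 relationship can be verified with a few lines of standard-library code. `count_weekdays` is a hypothetical helper and the start date is arbitrary. Note that the sketch counts Monday to Friday only; market holidays, which it ignores, can reduce the number of actual trading days.

```python
import datetime

def count_weekdays(start, days):
    """Count Monday-Friday dates among `days` consecutive dates from `start`."""
    dates = (start + datetime.timedelta(days=i) for i in range(days))
    return sum(1 for d in dates if d.weekday() < 5)

# Any window of 14 consecutive days covers exactly two full weeks.
weekdays_in_window = count_weekdays(datetime.date(2018, 1, 1), 14)
print(weekdays_in_window)
```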

In [None]:
import praw
import time
import datetime

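Reddit API credentials, used to authenticate via OAuth. They are read from environment variables (the variable names are placeholders) so that secrets are not stored in the notebook.

In [None]:
import os

reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     password=os.environ['REDDIT_PASSWORD'],
                     user_agent='alusha1',
                     username=os.environ['REDDIT_USERNAME'])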

Get all posts from r/worldnews within the past 14 days.

In [None]:
days = 14
now = int(time.time())
two_weeks_ago = now - (60 * 60 * 24 * days)
posts_from_past_two_weeks = list(reddit.subreddit('worldnews').submissions(two_weeks_ago, now))

Sort the posts by score in descending order (highest-scoring posts first).

In [None]:
posts_from_past_two_weeks.sort(key=lambda x: x.score, reverse=True)

Collect the top 25 posts for each day in the past 14 days.

In [None]:
num_of_posts = 25
top_25_posts_from_past_two_weeks = {}
for post in posts_from_past_two_weeks:
    formatted_date = datetime.datetime.fromtimestamp(post.created).strftime('%m/%d/%Y')
    if formatted_date not in top_25_posts_from_past_two_weeks:
        top_25_posts_from_past_two_weeks[formatted_date] = []
    if len(top_25_posts_from_past_two_weeks[formatted_date]) < num_of_posts:
        top_25_posts_from_past_two_weeks[formatted_date].append(post.title.encode('ascii', 'ignore'))
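
The sort-then-bucket logic of the last two cells can be sketched on a handful of made-up posts. Because the posts are visited from highest to lowest score, each day's list fills up with that day's best posts first. `top_posts_per_day` and the sample tuples are illustrative, and the sketch converts timestamps in UTC by assumption, whereas the cell above uses the machine's local time zone.

```python
import datetime

def top_posts_per_day(posts, limit):
    """posts: iterable of (unix_timestamp, score, title) tuples."""
    per_day = {}
    # Highest score first, so each day's list fills with its best posts.
    for created, score, title in sorted(posts, key=lambda p: p[1], reverse=True):
        day = datetime.datetime.fromtimestamp(
            created, datetime.timezone.utc).strftime('%m/%d/%Y')
        bucket = per_day.setdefault(day, [])
        if len(bucket) < limit:
            bucket.append(title)
    return per_day

sample = [
    (1514808000, 50, "a"),  # 2018-01-01 12:00 UTC
    (1514811600, 90, "b"),  # 2018-01-01 13:00 UTC
    (1514815200, 10, "c"),  # 2018-01-01 14:00 UTC
    (1514894400, 70, "d"),  # 2018-01-02 12:00 UTC
]
result = top_posts_per_day(sample, limit=2)
print(result)
```

With a limit of 2, the lowest-scoring post of January 1 ("c") is left out.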

Convert Reddit data to DataFrame to facilitate further processing.

In [None]:
latest_reddit_data = pd.DataFrame.from_dict(top_25_posts_from_past_two_weeks, orient='index').sort_index()

### Reddit Data Pre-Processing

Pre-process the Reddit data the same way as in Question 1.

In [None]:
headlines_columns = latest_reddit_data.columns[range(0, 25)]
latest_reddit_data[headlines_columns] = latest_reddit_data[headlines_columns].applymap(clean_text)

In [None]:
latest_reddit_data

### Yahoo! Finance Data Collection

Use BeautifulSoup and urllib2 to scrape Yahoo! Finance and get the DJIA table for the past 10 trading days (11 rows are kept so that each of the 10 days has a previous close to compare against).

In [None]:
from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI")
soup = BeautifulSoup(page, "html.parser")
latest_djia_table = pd.read_html(str(soup.find("table", class_="W(100%) M(0)")), header=0)[0]
latest_djia_table = latest_djia_table.head(11)

Label each day according to Kaggle's instructions:

* "1" when the DJIA Adj Close value rose or stayed the same

* "0" when the DJIA Adj Close value decreased

In [None]:
latest_djia_table["Label"] = latest_djia_table["Adj Close**"] >= latest_djia_table["Adj Close**"].shift(-1)
latest_djia_table["Label"] = latest_djia_table["Label"].astype(int)
# Copy the slice so that later column assignments do not trigger SettingWithCopyWarning.
latest_djia_data = latest_djia_table.head(10).copy()
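
The labeling rule compares each day's close with the previous trading day's close. Since the scraped table is ordered newest first, the "previous" day is the next row down, which is why 11 rows are needed to label 10 days. A plain-Python sketch of the same rule, with invented closing values and a hypothetical `label_days` helper:

```python
def label_days(adj_close_newest_first):
    """Return 1 when a day's close is >= the previous trading day's close,
    else 0. The oldest row has no predecessor and gets no label."""
    closes = adj_close_newest_first
    return [int(today >= previous) for today, previous in zip(closes, closes[1:])]

# Made-up closing values, newest first, as in the scraped table.
closes = [25100.0, 25000.0, 25000.0, 25300.0]
labels = label_days(closes)
print(labels)
```

Four closes yield three labels: a rise, an unchanged day (also labeled 1), and a fall.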

The following function reformats the Date column so that we can join latest_reddit_data and latest_djia_data on it.

In [None]:
def reformat_date(cell_text):
    return datetime.datetime.strptime(cell_text, '%b %d, %Y').strftime('%m/%d/%Y')
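
A quick usage example of the conversion, repeated here so that it runs on its own. The input string is a made-up date in the `'%b %d, %Y'` format that the function expects; parsing `%b` assumes English month abbreviations, which is the default locale behaviour.

```python
import datetime

def reformat_date(cell_text):
    # Parse e.g. "Jan 05, 2018" and emit the MM/DD/YYYY form used for the Reddit data.
    return datetime.datetime.strptime(cell_text, '%b %d, %Y').strftime('%m/%d/%Y')

converted = reformat_date('Jan 05, 2018')
print(converted)
```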

In [None]:
latest_djia_data["Date"] = latest_djia_data["Date"].apply(reformat_date)

In [None]:
latest_djia_data

#### Join Reddit and Yahoo! Finance tables on "Date" column

In [None]:
latest_combined_news_djia_df = pd.merge(latest_djia_data[["Date", "Label"]], latest_reddit_data, left_on="Date", right_index=True)
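
`pd.merge` performs an inner join by default, so only dates present in both tables survive. In practice this drops Reddit rows for weekends and other non-trading days. The dictionary-based sketch below illustrates that behaviour; `inner_join_on_date` and the sample data are made up for the example.

```python
def inner_join_on_date(djia_labels, reddit_headlines):
    """Keep only dates present in both mappings, like an inner join."""
    return {
        date: (djia_labels[date], reddit_headlines[date])
        for date in djia_labels
        if date in reddit_headlines
    }

djia_labels = {"01/05/2018": 1, "01/08/2018": 0}
reddit_headlines = {
    "01/05/2018": ["headline a"],
    "01/06/2018": ["headline b"],  # Saturday: no DJIA row, so it is dropped
    "01/08/2018": ["headline c"],
}
joined = inner_join_on_date(djia_labels, reddit_headlines)
print(sorted(joined))
```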

In [None]:
latest_combined_news_djia_df

### Predicting Latest Data Using Non-Keras Model

In [None]:
X_latest = latest_combined_news_djia_df.drop(["Label", "Date"], axis=1)
y_latest = latest_combined_news_djia_df[["Label"]]
input_data = X_latest.apply(lambda x: " ".join(x), axis=1).tolist()
input_data_transformed = tf_idf_vectorizer.transform(input_data)
pred = sgd_classifier.predict(input_data_transformed)
print 'AUC score:', metrics.roc_auc_score(y_latest, pred)

We can see that the AUC score is lower than before.

### Predicting Latest Data Using Keras Model

In [None]:
# Transform the new headlines with the vectorizer that was fitted on the training data.
new_X_test_transformed = tf_idf_vectorizer.transform(input_data)
predictions = model.predict(new_X_test_transformed)
rounded = [round(x[0]) for x in predictions]
print(rounded)

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

In [None]:
# Evaluate on the latest data rather than on the original test split.
scores = model.evaluate(new_X_test_transformed, y_latest, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 80.00%