## Predicting market movement on D+1 with current day news

In this project, we explore the possibility of predicting a market's movement from current-day news articles using Natural Language Processing and Machine Learning techniques.

##### This project was last updated on 16 Aug 2020. The author concluded that the model is not able to predict the market's movement on D+1 based on news released from a single source on D. For more details, please see the Conclusion section. This project will not be updated anymore. Note that neither the author (Keith Wee) nor the results of this project represent any political or religious view or opinion.

### Dataset information 
Using publicly available RSS feeds, we regularly run our web scraping script (available on our GitHub repository) to extract new articles that are periodically published by the provider. We decided to run the script three times a day: at 8 AM, 3 PM and 9 PM. Over the course of one month, we collected 692 articles from a single RSS feed that focuses on Singapore's news. *(16 Aug 2020) In total, we collected 1115 articles for use in this project.*

**All articles used in this project belong to their copyright owners. My use of these resources constitutes personal and non-commercial use.**

### Objective
To predict the impact of the current day's news articles on the next day's movement (upward or downward) of the stock market in Singapore.

### Approach
We will be using the **TfidfVectorizer** class by Scikit-Learn and the **Bag-of-words** approach. TfidfVectorizer is a form of Feature Extraction that transforms text data into feature vectors that can be used as input for estimators. The value of a word is proportional to its count in a document, adjusted inversely by the number of documents in the corpus (a collection of texts) that it appears in. This adjustment is known as Inverse Document Frequency (IDF). IDF down-weights words that frequently appear across multiple documents / texts (e.g. we, the, I), which are not useful in drawing differences between documents.
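
As a minimal sketch of the IDF adjustment described above, the snippet below fits a `TfidfVectorizer` on a hypothetical three-sentence corpus (not taken from our dataset). With the default settings, a word that appears in every document should receive a lower IDF weight than a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "market" and "after" appear in every document.
toy_corpus = [
    "market rally after election",
    "market slump after outbreak",
    "market steady after announcement",
]

toy_vect = TfidfVectorizer().fit(toy_corpus)

# Map each word to its learned IDF weight.
toy_idf = {word: toy_vect.idf_[col] for word, col in toy_vect.vocabulary_.items()}

for word in ["market", "rally"]:
    print(word, round(toy_idf[word], 3))
```

Here "market" carries little information because it appears in all three documents, while "rally" helps to tell one document apart from the others.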

To replicate a real-life situation as closely as possible, we decided to fit the vectorizer using a time-series forecasting approach. This means that we fit the vectorizer with data prior to a cut-off date, instead of using train_test_split (which is random and may include "future" data). Essentially, we assume the appearance of a particular word each day will affect the market's movement the next day.
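
The cut-off idea can be sketched as follows. The dates, token strings and cut-off below are hypothetical and only illustrate the mechanics: the vectorizer is fitted on rows up to the cut-off, so vocabulary that first appears afterwards never reaches the features.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical daily token strings indexed by date (not our real data).
toy_df = pd.DataFrame(
    {"tokens": ["election party rally", "bar reopen phase",
                "vaccine trial start", "transport fare rise"]},
    index=pd.to_datetime(["2020-06-29", "2020-06-30", "2020-07-01", "2020-07-02"]),
)

cutoff = pd.Timestamp("2020-06-30")  # assumed cut-off date, for illustration only
train = toy_df[toy_df.index <= cutoff]
test = toy_df[toy_df.index > cutoff]

# Fit on the past only, so no "future" vocabulary leaks into the features.
vect = TfidfVectorizer().fit(train["tokens"])
X_train = vect.transform(train["tokens"])
X_test = vect.transform(test["tokens"])

print(sorted(vect.vocabulary_))
print(X_train.shape, X_test.shape)
```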

### Assumptions
We make the following assumptions in this project:
1. The market is NOT semi-strong form efficient, which in turn means it is not strong-form efficient.
2. The appearance of a single word in an article has, to a certain extent, an impact (positive or negative) on the market.
3. The stock market's movement is purely a function of news articles from a single source.
4. All market players act the same way when consuming this information.

The first assumption is necessary because, if the market were semi-strong form efficient, the impact of an article would be reflected in the market on the day the article is released (unless it is released after market closure). Assumptions 3 and 4 are not realistic but are essential for this project.

*Updated on 16 August 2020 to amend an error in the assumptions made. However, there is no impact on the conclusions drawn previously.*

### Model
In this project, we decided to use the Logistic Regression model as we are looking at a binary scenario. In the future, we will be further exploring the use of other models (e.g. MultinomialNB, Regression models) for multi-class classification or even regression analysis. 

In an attempt to achieve the highest accuracy and AUROC on the testing data, we tuned the C regularisation parameter. For more information on regularisation in machine learning, see here: https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a . 
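
To illustrate what C controls, here is a small sketch on hypothetical synthetic data (not our news features). In scikit-learn's `LogisticRegression`, C is the inverse of the regularisation strength, so a smaller C applies a stronger penalty and shrinks the coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: 40 samples, 5 features, labels driven by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(int)

coef_norms = {}
for C in [0.01, 1, 5, 100]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # A smaller C means a stronger L2 penalty, which shrinks the coefficients.
    coef_norms[C] = np.linalg.norm(clf.coef_)

print(coef_norms)
```

A larger C lets the model fit the training data more closely, which is one reason to be careful when a high training accuracy is paired with a weaker test accuracy.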

#### Measures
The classifier will be evaluated with the following metrics: 

 - Precision score
 - Recall score
 - AUROC
 - Accuracy score for the testing set and training set
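
As a worked sketch of these metrics, the labels below are hypothetical: the per-day predictions are not shown in this notebook, so the counts (3 true positives, 1 false negative, 2 false positives, 5 true negatives) were simply chosen so that the scores match those reported in the 19th July conclusion.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical labels: 3 true positives, 1 false negative, 2 false positives, 5 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 8 / 11
# With hard 0/1 predictions, AUROC reduces to the mean of recall and specificity.
auroc = roc_auc_score(y_true, y_pred)        # (3/4 + 5/7) / 2

print(precision, recall, accuracy, auroc)
```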
 
### Conclusion

#### 19th July 2020
We managed to achieve an accuracy score of 0.7272, which is the highest we have seen while attempting different methods offline. The model has a relatively high recall of 0.75 and an AUROC of 0.73214. However, we see a low precision score of 0.6000. This Type I error may be costly to investors if they depend on this model to guide their investment decisions. For example, an investor may invest on a positive signal (1) only for the market to move down the next day (a false positive).

It is also worth noting that the perfect training accuracy is due to the model learning the vocabulary and coefficients from the training set itself. However, this may be a sign of overfitting on the training data. In the future, with more data, we hope to do a Train-Evaluate-Test split on our dataset, which will give us better insight into the model.
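
A Train-Evaluate-Test split could look like the sketch below. It uses hypothetical synthetic data in chronological order, and the 60/20/20 proportions and candidate C values are assumptions for illustration. The point is that C is chosen on the evaluation (validation) block, and the test block is scored only once at the end.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical, chronologically ordered synthetic data (row 0 is the oldest day).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# Chronological 60/20/20 split: no shuffling, so later rows never train earlier ones.
X_train, y_train = X[:60], y[:60]
X_val, y_val = X[60:80], y[60:80]
X_test, y_test = X[80:], y[80:]

# Choose C on the validation block only.
val_scores = {}
for C in [0.1, 1, 5, 10]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_scores[C] = accuracy_score(y_val, clf.predict(X_val))
best_C = max(val_scores, key=val_scores.get)

# Refit on train + validation, then score once on the untouched test block.
final_clf = LogisticRegression(C=best_C, max_iter=1000).fit(X[:80], y[:80])
test_accuracy = accuracy_score(y_test, final_clf.predict(X_test))
print(best_C, test_accuracy)
```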

As mentioned in our approach, we assume that the appearance of a single word today influences the market as a whole the next day to a certain extent. This brings us to the first limitation of the model: **it does not take into account the context in which a word is used**. For example, compare the sentences *"Bars in Singapore are heavily impacted due to COVID-19"* and *"Bars are allowed to resume operations next week."* Ignoring all other terms, the classifier will predict a downward market movement tomorrow simply because of the word "bar" *(see Negative Determinants)*, even though the contexts are very different. We can improve this model by using n-grams (e.g. bigrams or trigrams) built from each word's original sentence, which provide more context. The downside is that this sharply increases the number of features extracted by the Vectorizer, and it has to be done at the document level (e.g. generate n-gram tokens for each article and consolidate them for each date at the end). Thus, we suggest using the *min_df* parameter of the Vectorizer when fitting it with n-gram features. With this parameter, tokens that appear in fewer than $X$ documents are ignored (e.g. if min_df = 2, a token/word has to appear in at least 2 documents for it to be a feature). This significantly reduces the number of features passed into the classifier, and can also cut out misspelled words or words that are unique to a single article.
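
The effect of n-grams and *min_df* can be sketched with the two example sentences above. This is only an illustration on raw text: it skips the lemmatization and stopword removal used in our pipeline, and two documents are far too few for a realistic setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Bars in Singapore are heavily impacted due to COVID-19",
    "Bars are allowed to resume operations next week",
]

# Unigrams and bigrams, keeping every token.
full_vect = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# Same n-grams, but a token must appear in at least 2 documents to be kept.
pruned_vect = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(docs)

print(len(full_vect.vocabulary_), "features without min_df")
print(sorted(pruned_vect.vocabulary_), "with min_df=2")
```

The bigrams "bars in" and "bars are" separate the two contexts, but each appears in only one document, so `min_df=2` removes them here. In practice, *min_df* trades context for a smaller feature set and should be chosen with the size of the corpus in mind.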

Another limitation is that the classifier is only trained on words seen by the Vectorizer. A simple Google search tells us that the Oxford dictionary contains 171,476 words, and this does not include slang and acronyms, among others. In our Vectorizer, the vocabulary contains only 2104 words. This means that when faced with new words, the Vectorizer simply ignores them and uses only the words it has seen. With our limited dataset, it was surprising to see such high accuracy and AUROC. During our drafting, we adjusted the amount of training data (i.e. fitting data) seen by the Vectorizer and Classifier. We noticed that as the amount of training data increases (i.e. more vocabulary), the accuracy of the classifier on the testing set drops (to approximately 0.6). We will continue to gather more data (both training and testing) and to investigate whether news articles alone are sufficient to predict stock market movement.
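
The way unseen words are dropped can be shown with a short sketch. The training sentences below are hypothetical; once `fit()` has run, the vocabulary is fixed, and any word outside it contributes nothing to the feature vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training text; the vocabulary is fixed once fit() returns.
vect = TfidfVectorizer().fit(["bar reopen phase", "party candidate rally"])

seen = vect.transform(["bar rally"])
unseen = vect.transform(["cryptocurrency blockchain"])  # neither word was seen in fit()

print(seen.nnz)    # number of non-zero features for known words
print(unseen.nnz)  # unknown words contribute no features at all
```

An article made up entirely of new words therefore reaches the classifier as an all-zero vector, and the prediction falls back on the model's intercept alone.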

As we continue to work on this project, we will occasionally update the conclusion we have arrived at.

#### 16 August 2020
Since 19 July 2020, we have continued to harvest our sources at regular intervals for the purpose of this project. In total, we scraped 1116 articles from a single source, of which only 748 are used in this project (e.g. only Sunday's articles are used to predict Monday's market movements, ignoring all articles from Friday and Saturday).

As we continued to gather more data over the past 2 months for the project, we also continued to train our model. Using the same model (i.e. without implementing the considerations mentioned on 19 July 2020), we concluded that using the bag-of-words approach on articles from a single source on D to predict the market's movement on D+1 is **ineffective**. With more data collected, we trained the model with a 75/25 train-test split using the time-series forecasting approach.

We had expected these results from the beginning. The Area Under the Receiver Operating Characteristic Curve (AUROC) drops to 0.5, signalling that the model is **incapable** of distinguishing between the two classes.

In addition to the considerations mentioned in our conclusion on 19 July, we also studied the way we collected our data. First, most RSS feeds provide a timezone-aware timestamp for the time the news is released. However, in our web-scraping process, we removed this essential information and normalised the timestamp to midnight of the day the article was released. This gave rise to our assumption that the market lacks semi-strong form efficiency. However, with recent advances in technology, algorithmic trading based on real-time structured and unstructured data has become very common. Such technology analyses, prices, and takes advantage of any price discrepancies almost instantly. As such, we should not assume that the market is semi-strong form inefficient. Therefore, we should **only consider articles released after market closure**, as market players would not be able to price these articles until the next market-open day. Thus, a timezone-aware timestamp will be needed. Next, we should also consider combining articles from multiple sources that are relevant to a target market. In this project, only articles from a single source are used, which rests on yet another unrealistic assumption: that all market players consume information from a single source only.
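
Filtering for articles released after market closure could be sketched as below. The column names, the ISO 8601 timestamp format and the 17:00 closing time are assumptions for illustration; the actual trading hours of the exchange should be checked before use.

```python
import pandas as pd

# Hypothetical articles with timezone-aware ISO 8601 timestamps.
articles = pd.DataFrame({
    "title": ["morning brief", "evening report", "overseas wire"],
    "published": ["2020-07-13T09:15:00+08:00",
                  "2020-07-13T18:30:00+08:00",
                  "2020-07-13T10:00:00+00:00"],
})

# Parse to UTC first, then convert to the exchange's local time.
local = pd.to_datetime(articles["published"], utc=True).dt.tz_convert("Asia/Singapore")

# Assumed close of 17:00 local time, for illustration only.
CLOSE_MINUTES = 17 * 60
after_close = articles[(local.dt.hour * 60 + local.dt.minute) >= CLOSE_MINUTES]
print(after_close["title"].tolist())
```

The third article shows why the timezone matters: 10:00 UTC looks like a morning release, but it is 18:00 in Singapore and therefore falls after the assumed close.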

All in all, the project demonstrated the use of Natural Language Processing and Supervised Machine Learning techniques using Python. The results we arrived at were expected due to the unrealistic assumptions. However, this gave us the opportunity to reflect and think about how to improve the way we collect, clean and manage our data. This project also allowed us to think about how such techniques can be used in the areas of finance. With this, this project shall be deemed concluded and will not be further worked on.

In [1]:
import pandas as pd
import numpy as np
import math

import nltk
from nltk.corpus import wordnet, stopwords
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score

In [2]:
read = pd.read_csv('C:/XXX/news.csv', sep = ',') #masked for privacy
read.sort_values('pubDate', ascending = True, inplace = True)

### Text Processing
In this section, we will process the scraped articles by title, description, and publication date. First, we call the generate_dataFrame() function. This function consolidates a string of tokens for each day into a Pandas DataFrame by calling the other functions: generate_tokens() and process_text().

To start off, the process_text() function takes in a series, where each item in the series is tokenized and stripped of punctuation. If a token consists of punctuation only, it is discarded. Following that, we lemmatize each token to achieve standardisation. We first do Part-of-Speech (POS) tagging on the generated tokens, because the Lemmatizer needs each token's POS to return the correct lemma. Finally, we lemmatize the tokens in the list, which is then returned to the generate_tokens() function.

In generate_tokens(), we filter out stopwords, which are not meaningful as they appear in almost all documents. Then, we keep only alphabetic tokens, as numeric tokens do not provide enough context by themselves (e.g. 50 apples in a basket, 50 injuries in a car accident). These tokens are then consolidated by date and passed back to the DataFrame, together with the date.

In [3]:
def process_text(series):
    # Translation table that strips every ASCII punctuation character from a token.
    strip_punct = str.maketrans('', '', string.punctuation)
    lst = []
    for item in series:
        tokens = nltk.word_tokenize(item)
        # Strip punctuation, drop tokens that were punctuation only, then lowercase.
        tokens = [token.translate(strip_punct) for token in tokens]
        lst.extend(token.lower() for token in tokens if token != '')

    # dict() keeps one POS tag per distinct token (the tag of its last occurrence).
    pos_tag = dict(nltk.pos_tag(lst))
    lemmatizer = nltk.stem.WordNetLemmatizer()
    def wn_pos_type(tag):
        # Map a Penn Treebank tag to the WordNet POS expected by the lemmatizer.
        if tag in ['NN', 'NNS', 'NNP', 'NNPS']:
            return wordnet.NOUN
        elif tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
            return wordnet.VERB
        elif tag in ['RB', 'RBR', 'RBS']:
            return wordnet.ADV
        elif tag in ['JJ', 'JJR', 'JJS']:
            return wordnet.ADJ
        else:
            return wordnet.NOUN # 'n' is also the default pos of WordNetLemmatizer.lemmatize
    lst = [lemmatizer.lemmatize(tok, pos = wn_pos_type(pos_tag[tok])) for tok in lst]
    return lst

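def generate_tokens(df):
    title_series = df['title']
    desc_series = df['description']
    tokens = process_text(title_series) + process_text(desc_series)
    # Build the stopword set once instead of reloading the list for every token.
    stop_words = set(stopwords.words('english'))
    # Keep alphabetic, non-stopword tokens only.
    tokens = [tok for tok in tokens if tok not in stop_words and tok.isalpha()]
    return tokens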

def generate_dataFrame():
    tokens_list = []
    for date in list(read['pubDate'].unique()):
        dataFrame = read[(read['pubDate'] == date) & (read['source'] == 'XXX_XXX')]
        tokens = generate_tokens(dataFrame)
        tokens_list.append([date, " ".join(tokens)])
            
    df = pd.DataFrame(tokens_list, columns = ['date', 'tokens'])
    return df

In [4]:
df = generate_dataFrame()
df.set_index('date', inplace = True)
df.index = pd.to_datetime(df.index)
df.head(3)

                                                       tokens
date
2020-06-17  nursing home resident allow visitor day jun tw...
2020-06-18  paul tambyah first singaporean head internatio...
2020-06-19  playgrounds beach reopen start phase patient a...


### Market Movement
The market index we chose for this project is the FTSE Straits Times Index (STI). Historical data is extracted from Yahoo Finance. Using built-in pandas functions, we calculated the percentage change of the closing value and offset the date by -1. This is because our objective is to find out how news today affects the market tomorrow (e.g. how news on 2nd July 2020 affects the market movement on 3rd July 2020).

In [5]:
sti = pd.read_csv('C:/XXX/STI.csv').ffill() #masked for privacy
sti['Date'] = pd.to_datetime(sti['Date'])
sti.set_index('Date', inplace = True)
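# Daily percentage change of the closing value.
sti['Chng'] = sti['Close'].pct_change(periods = 1)
# Shift each change back one calendar day so that it is indexed by the previous day's date.
sti.index = sti.index - pd.DateOffset(1)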

Before continuing, we would like to explain how, and why, we merged the DataFrames in the next block. Note that we are essentially predicting Monday's market movement using Sunday's articles only, completely ignoring Friday's and Saturday's news. Realistically, this is not the case, as investors consume news over the weekend and price it in when the market opens on Monday. Having said that, since we offset the dates by -1 (the percentage change on 13th Jul becomes the target for 12th Jul), we merged the DataFrames to produce a complete DataFrame with the input (the tokenized articles) and the output (the pct_change).
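
The effect of the offset and the inner merge can be seen in the toy sketch below. The token strings and closing values are hypothetical, while the dates are real calendar days (16 Jul 2020 was a Thursday).

```python
import pandas as pd

# Hypothetical daily news, Thursday 16 Jul to Monday 20 Jul 2020.
news = pd.DataFrame(
    {"tokens": ["thu news", "fri news", "sat news", "sun news", "mon news"]},
    index=pd.date_range("2020-07-16", "2020-07-20", freq="D"),
)

# Hypothetical closing values on trading days only (no Saturday or Sunday rows).
prices = pd.DataFrame(
    {"Close": [100.0, 101.0, 99.0, 102.0]},
    index=pd.to_datetime(["2020-07-16", "2020-07-17", "2020-07-20", "2020-07-21"]),
)
prices["Chng"] = prices["Close"].pct_change()

# Shift each change back one calendar day so it lines up with the previous day's news.
prices.index = prices.index - pd.Timedelta(days=1)

merged = news.merge(prices, how="inner", left_index=True, right_index=True).sort_index()
print(merged[["tokens", "Chng"]])
```

Friday's and Saturday's news have no matching row after the shift, because no trading takes place on the following day, so the inner merge drops them. Sunday's news is paired with Monday's change.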

### Model Evaluations

#### Results from 19th July 2020

In [6]:
merged_df = df.merge(sti, how = 'inner', left_index = True, right_index = True).sort_index(ascending = True)
# On 19th Jul, we used approximately half of our dataset to train and half to test. This is because we carry out
# live daily scraping of news from various sources and do not have ready access to historical news sources.
training_jul = merged_df[merged_df.index <= pd.to_datetime('1 July 2020')]
testing_jul = merged_df[(merged_df.index > pd.to_datetime('1 July 2020')) & (merged_df.index <= pd.to_datetime('19 July 2020'))].dropna(how = 'all')

# We conduct a binary classification using existing data, where 1 represents a non-negative change and 0 represents a negative change.
X_train_jul = training_jul['tokens']
y_train_jul = np.where(training_jul['Chng'] >= 0, 1, 0)

X_test_jul = testing_jul['tokens']
y_test_jul = np.where(testing_jul['Chng'] >= 0, 1, 0)

vect_jul = TfidfVectorizer().fit(X_train_jul)
X_train_vectorized_jul = vect_jul.transform(X_train_jul)
clf_jul = LogisticRegression(C = 5).fit(X_train_vectorized_jul, y_train_jul) # C = 5 generates the best accuracy and AUROC for the test set after several experiments

coeff_jul = clf_jul.coef_[0] # coefficients of the single binary decision function
vocab_jul = vect_jul.vocabulary_ # maps each word to its column index

# Invert the vocabulary so that each column index maps back to its word.
index_to_word_jul = {n: word for word, n in vocab_jul.items()}

# Rank the words from the most positive to the most negative coefficient.
idx_jul = np.argsort(-coeff_jul)
word_list_jul = pd.DataFrame([[index_to_word_jul[i], coeff_jul[i]] for i in idx_jul], columns = ["word", "coef"])

print("Top 5 Positive Determinants in July \n", word_list_jul.head(5))
print("\nTop 5 Negative Determinants in July \n", word_list_jul.tail(5))

predictions_jul = clf_jul.predict(vect_jul.transform(X_test_jul))

y_test_jul = y_test_jul.reshape(-1, 1)
precision_jul = precision_score(y_test_jul, predictions_jul)
recall_jul = recall_score(y_test_jul, predictions_jul)
accuracy_jul = clf_jul.score(vect_jul.transform(X_test_jul), y_test_jul)
# Note: AUROC is computed here from the hard 0/1 predictions, not from predicted probabilities.
roc_jul = roc_auc_score(y_test_jul, predictions_jul)

print("\nJuly Precision: {}\nJuly Recall: {}\nJuly AUROC: {}\nJuly Accuracy (Test): {}\nJuly Accuracy (Train): {}".format(precision_jul, recall_jul, roc_jul, accuracy_jul, clf_jul.score(X_train_vectorized_jul, y_train_jul)))

Top 5 Positive Determinants in July 
      word      coef
0  monday  0.698438
1   party  0.641720
2     pap  0.555946
3     grc  0.436361
4     jul  0.436258

Top 5 Negative Determinants in July 
         word      coef
2099  mosque -0.272058
2100  friday -0.280093
2101     jun -0.346791
2102     bar -0.347927
2103  sunday -0.377431

July Precision: 0.6
July Recall: 0.75
July AUROC: 0.7321428571428571
July Accuracy (Test): 0.7272727272727273
July Accuracy (Train): 1.0


#### Results from 16th August 2020

In [7]:
# On 16 Aug, we used a 75/25 train-test split (based on the number of days) through the time-series forecasting approach.
training_aug = merged_df[merged_df.index <= pd.to_datetime('31 July 2020')]
# Every row after the cut-off date forms the test set.
testing_aug = merged_df[merged_df.index > pd.to_datetime('31 July 2020')]

X_train_aug = training_aug['tokens']
y_train_aug = np.where(training_aug['Chng'] >= 0, 1, 0)

X_test_aug = testing_aug['tokens']
y_test_aug = np.where(testing_aug['Chng'] >= 0, 1, 0)

vect_aug = TfidfVectorizer().fit(X_train_aug)
X_train_vectorized_aug = vect_aug.transform(X_train_aug)
clf_aug = LogisticRegression(C = 5).fit(X_train_vectorized_aug, y_train_aug) # C = 5 to remain consistent with the model trained in July 2020

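coeff_aug = clf_aug.coef_[0] # coefficients of the single binary decision function
vocab_aug = vect_aug.vocabulary_ # maps each word to its column index

# Invert the vocabulary so that each column index maps back to its word.
index_to_word_aug = {n: word for word, n in vocab_aug.items()}

# Rank the words from the most positive to the most negative coefficient.
idx_aug = np.argsort(-coeff_aug)
word_list_aug = pd.DataFrame([[index_to_word_aug[i], coeff_aug[i]] for i in idx_aug], columns = ["word", "coef"])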

print("Top 5 Positive Determinants in August: \n", word_list_aug.head(5))
print("\nTop 5 Negative Determinants in August: \n", word_list_aug.tail(5))

predictions_aug = clf_aug.predict(vect_aug.transform(X_test_aug))

y_test_aug = y_test_aug.reshape(-1, 1)
precision_aug = precision_score(y_test_aug, predictions_aug)
recall_aug = recall_score(y_test_aug, predictions_aug)
accuracy_aug = clf_aug.score(vect_aug.transform(X_test_aug), y_test_aug)
# Note: AUROC is computed here from the hard 0/1 predictions, not from predicted probabilities.
roc_aug = roc_auc_score(y_test_aug, predictions_aug)

print("\nAugust Precision: {}\nAugust Recall: {}\nAugust AUROC: {}\nAugust Accuracy (Test): {}\nAugust Accuracy (Train): {}".format(precision_aug, recall_aug, roc_aug, accuracy_aug, clf_aug.score(X_train_vectorized_aug, y_train_aug)))

Top 5 Positive Determinants in August: 
         word      coef
0  candidate  0.824228
1      party  0.778829
2        pap  0.619303
3         ng  0.544373
4     return  0.540972

Top 5 Negative Determinants in August: 
            word      coef
3711  transport -0.320686
3712     sunday -0.361915
3713     arrest -0.400221
3714        bar -0.400547
3715  wednesday -0.498875

August Precision: 0.0
August Recall: 0.0
August AUROC: 0.5
August Accuracy (Test): 0.4444444444444444
August Accuracy (Train): 1.0




The warning "UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples." was generated because the August predictions contain a single class (i.e. all predictions were 0). The classifier cannot distinguish between the two classes on the test set, so it predicts 0 for every case. Since the AUROC here is computed from these hard predictions, a constant prediction necessarily yields an AUROC of exactly 0.5.
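
The link between constant predictions and an AUROC of 0.5 can be checked with a short sketch. The labels and probability scores below are hypothetical; in our notebook, `roc_auc_score` receives the output of `predict`, whereas passing the positive-class column of `predict_proba` is a common alternative that we did not use.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# A constant hard prediction cannot rank the samples, so AUROC is exactly 0.5.
constant_pred = [0, 0, 0, 0]
auroc_constant = roc_auc_score(y_true, constant_pred)

# Hypothetical probability scores, as would come from predict_proba(...)[:, 1].
proba = [0.1, 0.4, 0.35, 0.8]
auroc_proba = roc_auc_score(y_true, proba)

print(auroc_constant, auroc_proba)
```

Scoring with probabilities would show whether the model ranks up-days above down-days at all, even when every probability falls on the same side of the 0.5 threshold.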

#### A final note

One disadvantage of using the bag-of-words approach is that the trained model can easily be **manipulated** to generate a positive or negative response by simply passing in the top determinants (shown below).

In [8]:
print("Prediction: {}. An example of a manipulated positive response".format(clf_jul.predict(vect_jul.transform(["monday party pap jul grc"]))))
print("Prediction: {}. An example of a manipulated negative response".format(clf_jul.predict(vect_jul.transform(["mosque friday jun bar sunday"]))))

Prediction: [1]. An example of a manipulated positive response
Prediction: [0]. An example of a manipulated negative response
