<h1> PART1   Introduction to the problem </h1>  
<strong>The goal of this analysis is to use news headlines to predict whether the DJIA closing value rises or falls.</strong>
<strong>The data comes from Kaggle. The author of this Kaggle problem provides three datasets:</strong>
<ol>
<li>RedditNews.csv: two columns. The first column is the "date" and the second is the "news headlines". Headlines are ranked from top to bottom by how hot they are, so there are 25 lines for each date.</li>
<li>DJIA_table.csv: downloaded directly from Yahoo Finance; see the Yahoo Finance web page for more information.</li>
<li>Combined_News_DJIA.csv: a combined dataset with 27 columns that the dataset author provides for convenience. The first column is "Date", the second is "Label", and the remaining ones are the news headlines, from "Top1" to "Top25".</li>
</ol>
#### Note that this problem is a binary classification task, so there are only two labels:
1. "1" when the DJIA Adj Close value rose or stayed the same;
2. "0" when the DJIA Adj Close value decreased.
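The labelling rule above can be sketched in a few lines. This is only an illustration with made-up prices, and it assumes that "rose or stayed the same" means a comparison with the previous trading day's Adj Close; the combined dataset already ships with the labels, so nothing like this is needed in the notebook itself.

```python
# Hypothetical adjusted closing prices for five consecutive trading days
adj_close = [100.0, 101.5, 101.5, 99.0, 100.2]

# Compare each day with the previous one; the first day has no predecessor,
# so it gets no label in this sketch
labels = [
    1 if today >= yesterday else 0
    for yesterday, today in zip(adj_close, adj_close[1:])
]
print(labels)  # [1, 1, 0, 1]
```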

#### About the training data set and test data set:
**The author requires the following:** for task evaluation, use the data from 2008-08-08 to 2014-12-31 as the training set and the following two years of data, from 2015-01-02 to 2016-07-01, as the test set. This is roughly an 80%/20% split.
#### About the evaluation metric:
The author of this Kaggle problem requires us to use the AUC metric.
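To give a feeling for what AUC measures, here is a minimal pure-Python sketch. It relies on the standard interpretation of AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name `pairwise_auc` and the toy scores are made up for illustration; later in this notebook the real score is computed with scikit-learn's `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities for label 1
print(pairwise_auc(y_true, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.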

<h1> PART2 Implementation</h1>

In [None]:
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import numpy as np
import seaborn as sns
import string
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline

In [None]:
# This is for setting the theme of the graph
sns.set(style="darkgrid")

## STEP1 Data Collection and Selection
**Kaggle provides the data as CSV files, so we do not need to collect it from other sources. There are three datasets: RedditNews.csv, DJIA_table.csv, and Combined_News_DJIA.csv. Combined_News_DJIA.csv merges RedditNews.csv and DJIA_table.csv, so we use it as the data source for this project. The following Python statement loads Combined_News_DJIA.csv into a pandas DataFrame, which is convenient for data analysis.**

In [None]:
combine_news = pd.read_csv('../input/Combined_News_DJIA.csv')

## STEP2 Data Pre Processing 
### The data selected in step one may contain missing, inconsistent, or erroneous values that have to be cleaned before it can be mined for useful information. Below we explore the data to find out whether any pre-processing is needed.

In [None]:
# First lets see how our data look
combine_news.head()

In [None]:
combine_news.info()

**As the output of the above command shows, there are 1989 rows in total. The columns "Date", "Label", and "Top1" through "Top22" each have 1989 non-null values, which satisfies the clean-data criteria. The columns "Top23", "Top24", and "Top25" contain some null values, but this does not affect the data mining, because no row has all of its headline columns null.**
### Conclusion: the dataset is already clean, so no cleaning step is needed. We can explore the data further with statistical plots.

In [None]:
sns.countplot(x='Label',data=combine_news)

**As the figure above shows, the count of Label 1 is larger than the count of Label 0, but the difference between the two labels is small. In this dataset, the closing value rose or stayed the same slightly more often than it fell.**

## STEP3 Data-Transformation

### The attributes in the DataFrame are news headlines, which are text data. The classification algorithms we use operate on numeric values, so we have to transform the text into numeric data. We do this with natural language processing techniques.

**First, we have to get rid of the punctuation marks in each headline.**

In [None]:
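def news_process(news):
    '''
    Take one headline and return it as a string with all punctuation marks removed.
    Missing values (anything that is not a str, such as NaN) become an empty string.
    '''
    if isinstance(news, str):
        # Presumably meant to drop the b prefix left over from byte-string literals such as b"...".
        # Caveat: strip('b') removes every leading and trailing 'b', including
        # the first letter of a headline that genuinely starts with 'b'.
        news = news.strip('b')
        # Keep only the characters that are not punctuation
        return ''.join(c for c in news if c not in string.punctuation)
    else:
        return ''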
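A note on `strip('b')`: it removes every leading and trailing `b` character, not just a byte-string prefix. The sketch below shows the difference on two made-up headlines. `strip_bytes_prefix` is a hypothetical helper, not part of this notebook; swapping it into `news_process` would change the vocabulary and therefore the numbers reported further down.

```python
def strip_bytes_prefix(text):
    """Hypothetical helper: drop only a leading b marker that is followed by a quote."""
    if text.startswith(("b'", 'b"')):
        return text[1:]
    return text

wrapped = 'b"Georgia downs two Russian warplanes"'  # made-up sample text
plain = 'bbc reports heavy fighting'                # made-up sample text

print(wrapped.strip('b'))          # "Georgia downs two Russian warplanes"
print(plain.strip('b'))            # c reports heavy fighting
print(strip_bytes_prefix(plain))   # bbc reports heavy fighting
```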

**Then we need to combine the 25 top headlines into one column called headlines. The following code does that.**

In [None]:
headlines = []
for row in range(0,len(combine_news.index)):
    headlines.append(' '.join(news_process(news) for news in combine_news.iloc[row,2:27]))
combine_news['headlines'] = headlines

**The following output shows part of the DataFrame after combining all the news columns into one headlines column.**

In [None]:
df = combine_news[['Date','Label','headlines']]
df.head()

**The next problem is that each headline consists of one or more sentences, but we do not care about the grammar or structure of the sentences; what matters is the key words. We therefore have to get rid of function words such as 'the', 'of', and 'as', along with many other common words that carry little information. The NLTK library provides a list of such words, called stop words, and the following code removes them from each headline. In addition, a word can take many forms; for example, "willing" and "wills" are forms of "will". For natural language processing we want to replace each form with its root word. To do this, we use the SnowballStemmer class from NLTK.**

In [None]:
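# Build the stemmer and the stop-word set once instead of on every call
stemmer = SnowballStemmer("english", ignore_stopwords=True)
stop_words = set(stopwords.words('english'))

def clean_news(news):
    '''Split a headline string into tokens, drop stop words, and stem the rest.'''
    return [stemmer.stem(word) for word in news.split() if word.lower() not in stop_words]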

**The above function also tokenizes the news. Tokenization is the process of converting a normal text string into a list of tokens (the words that we actually want).**

## Vectorization

### Currently, each headline can be turned into a list of tokens. We now need to convert each of those lists into a vector that scikit-learn's machine learning models can work with.
**We'll do that in three steps using the bag-of-words model:**

**1. Count how many times each word occurs in each headline (known as term frequency)**

**2. Weigh the counts, so that frequent tokens get lower weight (inverse document frequency)**

**3. Normalize the vectors to unit length, to abstract from the original text length (L2 norm)**

**Let's begin the first step:**

**We will first use scikit-learn's `CountVectorizer`, which converts a collection of text documents to a matrix of token counts.**  
**We can picture this as a two-dimensional matrix in which one dimension is the entire vocabulary (one entry per word) and the other is the documents (one entry per headline). scikit-learn stores it with one row per headline and one column per word.**
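To make the count matrix concrete, here is a minimal pure-Python sketch of the bag-of-words idea on three made-up headlines. It splits on whitespace only and does no stop-word removal or stemming, so it is a simplification of what `CountVectorizer` does with our analyzer. The layout follows scikit-learn: one row per document and one column per vocabulary word.

```python
from collections import Counter

docs = ["stocks rise as oil falls", "oil prices rise", "stocks fall"]  # made-up headlines
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct token, sorted so each word gets a stable column index
vocab = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary word
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocab])

print(vocab)  # ['as', 'fall', 'falls', 'oil', 'prices', 'rise', 'stocks']
for row in matrix:
    print(row)
```

Most entries are zero even in this tiny example, which is why the real matrix is stored in a sparse format.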

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

**Many arguments and parameters can be passed to `CountVectorizer`. In this case we only specify the `analyzer` to be our previously defined `clean_news` function, which extracts the important words from each headline.**

In [None]:
# Might take awhile...
bow_transformer = CountVectorizer(analyzer=clean_news).fit(df['headlines'])

# Print total number of vocab words
print(len(bow_transformer.vocabulary_))

### As the output above shows, the vocabulary contains 28315 words in total.

** Let's take one headline and get its bag-of-words counts as a vector, putting to use our new `bow_transformer`:**

In [None]:
headline4 = df['headlines'][3]
print(headline4)

####  Now lets see its vector representation

In [None]:
bow4 = bow_transformer.transform([headline4])
print(bow4)
print(bow4.max())

**As the output above shows, the 4th headline contains many unique words. Most of them appear only once, but the most frequent word appears 9 times. Let's see which word this is.**

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(bow_transformer.get_feature_names_out()[26693])

**The result tells us that the most frequent word in this headline is US, the United States of America. This makes sense, because US policy and actions do have some effect on the stock market.**

**Now we can call `.transform` on our bag-of-words (bow) transformer to transform the entire DataFrame of headlines. Let's check that the bag-of-words counts for the entire DataFrame form a large, sparse matrix:**

In [None]:
headlines_bow = bow_transformer.transform(df['headlines'])

In [None]:
print('Shape of Sparse Matrix: ', headlines_bow.shape)
print('Amount of Non-Zero occurrences: ', headlines_bow.nnz)

In [None]:
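# Percentage of matrix entries that are non-zero (strictly the density; a low value means a sparse matrix)
sparsity = (100.0 * headlines_bow.nnz / (headlines_bow.shape[0] * headlines_bow.shape[1]))
print('sparsity: {}'.format(round(sparsity)))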

**With the above code, we have transformed each headline into a vector of word counts, which is numeric data.**

After the counting, the term weighting and normalization can be done with TF-IDF.
### So what is TF-IDF?
TF-IDF stands for *term frequency-inverse document frequency*, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Typically, the tf-idf weight is composed of two terms. The first is the normalized Term Frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document. The second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

**TF: Term Frequency** measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (the total number of terms in the document) as a way of normalization: 

*TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).*

**IDF: Inverse Document Frequency** measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following: 

*IDF(t) = log_e(Total number of documents / Number of documents with term t in it).*

See below for a simple example.

**Example:**

Consider a document containing 100 words wherein the word cat appears 3 times. 

The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Using a base-10 logarithm for readability, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4, so the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. With the natural logarithm from the formula above, the idf would be about 9.21 and the weight about 0.28.
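The arithmetic of this example can be checked with a few lines of Python. The numbers are the ones from the example; the only choice made here is to show both the base-10 and the natural logarithm, since the base changes the scale of the weight but not the ranking of terms.

```python
import math

tf = 3 / 100                       # "cat" appears 3 times in a 100-word document
n_docs, docs_with_cat = 10_000_000, 1_000

idf_base10 = math.log10(n_docs / docs_with_cat)   # base-10 logarithm
idf_natural = math.log(n_docs / docs_with_cat)    # natural logarithm

print(round(tf * idf_base10, 3))   # about 0.12
print(round(tf * idf_natural, 3))  # about 0.276
```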
____

### Conclusion: 
#### If a word appears frequently in a document, it is important, so give the word a high score.
#### If a word appears in many documents, it is not a unique identifier, so give the word a low score.
Next, we apply this to our case.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(headlines_bow)
tfidf4 = tfidf_transformer.transform(bow4)
print(tfidf4)
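The values printed by `TfidfTransformer` will not match the textbook formula exactly. With its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn documents the idf as `ln((1 + n) / (1 + df)) + 1`, uses the raw count as the term frequency, and then scales each row to unit length. The following pure-Python sketch reproduces those three steps on a made-up count matrix; it is an illustration of the documented defaults, not the library's actual implementation.

```python
import math

# Made-up count matrix: 3 documents (rows) x 3 terms (columns)
counts = [[2, 1, 0],
          [1, 0, 1],
          [1, 0, 0]]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, as documented for smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Weight the raw counts, then L2-normalize every row
weighted = [[c * w for c, w in zip(row, idf)] for row in counts]
tfidf = []
for row in weighted:
    norm = math.sqrt(sum(v * v for v in row))
    tfidf.append([v / norm for v in row])

for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that occurs in every document (the first column here) gets the minimum idf of 1 rather than 0, so it is down-weighted but not discarded.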

** To transform the entire bag-of-words corpus into TF-IDF corpus at once:**

In [None]:
headlines_tfidf = tfidf_transformer.transform(headlines_bow)
print(headlines_tfidf.shape)

### With the above code, we have successfully transformed the text data into numeric data.

## STEP4 Data Mining and Data modeling

### Train a Model

**Quote from kaggle:**  For task evaluation, please use data from 2008-08-08 to 2014-12-31 as Training Set, and Test Set is then the following two years data (from 2015-01-02 to 2016-07-01). This is roughly a 80%/20% split.

**Split the data to train and test**

In [None]:
headline_train = df[df['Date']<'2015-01-01']['headlines']
label_train = df[df['Date']<'2015-01-01']['Label']
headline_test = df[df['Date']>'2014-12-31']['headlines']
label_test = df[df['Date']>'2014-12-31']['Label']
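The split above compares the `Date` column with string literals. This works as long as the dates are stored as `YYYY-MM-DD` strings (which is the assumption here, since the CSV is read without date parsing), because that format sorts alphabetically in chronological order. A small sketch with made-up dates:

```python
dates = ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"]  # made-up sample

train = [d for d in dates if d < "2015-01-01"]
test = [d for d in dates if d > "2014-12-31"]
print(train)  # ['2014-12-30', '2014-12-31']
print(test)   # ['2015-01-02', '2015-01-05']

# The same trick fails for other formats: here the later date compares as smaller
print("12/31/2014" < "01/02/2015")  # False
```

If the dates were in another format, converting the column with `pd.to_datetime` before comparing would be the safer route.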

### We use a pipeline to chain the vectorizer, the transformer, and the classifier, which makes model training and testing easier. To train the model, we use the Naive Bayes algorithm.

### scikit-learn offers several Naive Bayes variants; two that suit text data are multinomial Naive Bayes and Bernoulli Naive Bayes. We implement and evaluate both, then choose the better one.

**Multinomial Naive Bayes**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_news)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])

In [None]:
pipeline.fit(headline_train,label_train)

In [None]:
predictions = pipeline.predict(headline_test)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(label_test,predictions))

In [None]:
from sklearn import metrics 
metrics.accuracy_score(label_test, predictions)  # arguments are (y_true, y_pred)

In [None]:
y_pred_prob = pipeline.predict_proba(headline_test)[:,1]

In [None]:
metrics.roc_auc_score(label_test,y_pred_prob) 

**As the classification report above shows, the model assigns every test sample to label 1, and the AUC score is not good either. The problem may be caused by the news we use: in the above code, we only use today's news to predict today's closing value, which may not be informative. Below, we therefore use the previous day's news to predict today's closing value.**

In [None]:
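# Work on a real copy so that assigning to 'Label' does not touch df or raise SettingWithCopyWarning
df_one_day_before = df.copy()
# Pair each day's headlines with the next trading day's label
df_one_day_before['Label'] = df_one_day_before['Label'].shift(-1)
# The last row has no following day (its label is now NaN), so drop it
df_yesterday = df_one_day_before[:-1]
df_yesterday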
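What `shift(-1)` does to the labels can be seen on a tiny made-up example. The sketch below uses plain Python lists instead of pandas: moving every label up by one position pairs the headlines of day *i* with the label of day *i + 1*, and the last day is dropped because it has no following label.

```python
headlines = ["news d1", "news d2", "news d3", "news d4"]  # made-up placeholders
labels = [1, 0, 0, 1]                                     # made-up labels for d1..d4

# Equivalent of Series.shift(-1): each position receives the next day's label
shifted = labels[1:] + [None]

# Drop the last day, which has no next-day label
pairs = [(h, l) for h, l in zip(headlines, shifted) if l is not None]
print(pairs)  # [('news d1', 0), ('news d2', 0), ('news d3', 1)]
```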

In [None]:
headline_train = df_yesterday[df_yesterday['Date']<'2015-01-01']['headlines']
label_train = df_yesterday[df_yesterday['Date']<'2015-01-01']['Label']
headline_test = df_yesterday[df_yesterday['Date']>'2014-12-31']['headlines']
label_test = df_yesterday[df_yesterday['Date']>'2014-12-31']['Label']

In [None]:
pipeline.fit(headline_train,label_train)

In [None]:
predictions = pipeline.predict(headline_test)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(label_test,predictions))

**From the above, we can see that when we switch to using yesterday's news to predict the closing value, the performance of multinomial Naive Bayes is still bad. Next, we use Bernoulli Naive Bayes to see whether it performs better.**

In [None]:
# Rebuild the shifted frame (same steps as above): each day's headlines get the next day's label
df_one_day_before = df.copy()
df_one_day_before['Label'] = df_one_day_before['Label'].shift(-1)
df_yesterday = df_one_day_before[:-1]
headline_train = df_yesterday[df_yesterday['Date']<'2015-01-01']['headlines']
label_train = df_yesterday[df_yesterday['Date']<'2015-01-01']['Label']
headline_test = df_yesterday[df_yesterday['Date']>'2014-12-31']['headlines']
label_test = df_yesterday[df_yesterday['Date']>'2014-12-31']['Label']

In [None]:
from sklearn.naive_bayes import BernoulliNB
bernoulli_pipeline = Pipeline([
    # Note: scikit-learn ignores ngram_range when analyzer is a callable, so only single tokens are counted
    ('bow', CountVectorizer(ngram_range=(1,2),analyzer=clean_news)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', BernoulliNB(alpha=0.5,binarize=0.0)),  # binarize TF-IDF values (> 0 becomes 1), then fit Bernoulli Naive Bayes
])

In [None]:
bernoulli_pipeline.fit(headline_train,label_train)

**Some notes about the parameters: we choose alpha=0.5 because, after tuning alpha with different values, alpha=0.5 gave us the best result. We set binarize=0.0 because Bernoulli Naive Bayes requires binary data and we do not want to lose any information: with binarize=0.5, a word with a value of 0.3 would be treated as 0, and we do not want that to happen.**
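The effect of the `binarize` threshold can be illustrated without scikit-learn. In `BernoulliNB`, feature values strictly greater than the threshold become 1 and all others become 0. The function below is a stand-in written for this illustration, and the TF-IDF weights are made up.

```python
def binarize(values, threshold):
    """Map each value to 1 if it is strictly greater than the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

tfidf_row = [0.0, 0.3, 0.7, 0.0, 0.12]  # made-up TF-IDF weights for one headline

print(binarize(tfidf_row, 0.0))  # [0, 1, 1, 0, 1]
print(binarize(tfidf_row, 0.5))  # [0, 0, 1, 0, 0]
```

With a threshold of 0.0, every word that occurs in the headline is kept as "present"; with 0.5, the words weighted 0.3 and 0.12 would be treated as absent.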

In [None]:
predictions = bernoulli_pipeline.predict(headline_test)

In [None]:
print(classification_report(label_test,predictions))

In [None]:
metrics.accuracy_score(label_test,predictions)

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(label_test,predictions))

In [None]:
metrics.accuracy_score(predictions,label_test)

In [None]:
y_pred_prob = bernoulli_pipeline.predict_proba(headline_test)[:,1]

In [None]:
metrics.roc_auc_score(label_test,y_pred_prob) 

## PART4 Evaluation

## There is some good news in the above output: Bernoulli Naive Bayes performs better than multinomial Naive Bayes. The accuracy of multinomial Naive Bayes is only 0.50, while the accuracy of Bernoulli Naive Bayes is 0.54. The AUC (Area Under the Curve) is also better for Bernoulli Naive Bayes, at 0.544, compared with 0.51 for multinomial Naive Bayes. AUC is a widely used metric for measuring the performance of binary classifiers.

## But why does Bernoulli Naive Bayes perform better than multinomial Naive Bayes?
**Suppose today's news includes "North Korea declares war on South Korea". Multinomial Naive Bayes focuses on how many times each word appears, whereas Bernoulli Naive Bayes only cares whether a word appears or not. Do stock market investors care how many times words such as "North Korea" and "war" appear in the news? The answer is no; they only care whether North Korea has declared war. Even if "North Korea" appears only once in the news, this information still has a huge influence on the stock market. This is a plausible reason why Bernoulli Naive Bayes performs better than multinomial Naive Bayes in this case.**

## Another question is why Naive Bayes works in this case.

**Naive Bayes makes the strong assumption that the features are independent of one another given the class. In this case, the Top15 news can be treated as independent of the Top14 news: the Top14 news cannot talk to the Top15 news and ask whether it will happen. News usually happens without warning, so we can use Naive Bayes here.**