In this section, we will train a machine learning model on existing sentiment data and use it to predict the sentiment of news headlines.

# Bag of Words (BoW)
The bag-of-words model takes the words in each document and counts how often each word occurs, ignoring word order.

## Input

In [1]:
corpus = [
'I love dogs',
'I hate dogs and knitting',
'Knitting is my hobby and my passion']

corpus

['I love dogs',
 'I hate dogs and knitting',
 'Knitting is my hobby and my passion']

## Count of Words
First, we import `CountVectorizer` from sklearn and instantiate it. `CountVectorizer` splits each sentence into tokens and then counts the number of times each token occurs in that sentence.

In [2]:
# Import CountVectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer
# Instantiate CountVectorizer()
cv = CountVectorizer()

# Learn the vocabulary and build the document-term count matrix
word_count = cv.fit_transform(corpus)

## Extract Features
There are a few things to note when we use the default arguments of `CountVectorizer`:
1. Everything is converted to lowercase <br>
2. Words shorter than two characters are dropped (so there is no 'i') <br>
3. Punctuation is removed <br>
4. Each word appears only once in the feature list

In [3]:
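# Print all the words in the learned vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(cv.get_feature_names_out().tolist())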

['and', 'dogs', 'hate', 'hobby', 'is', 'knitting', 'love', 'my', 'passion']
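
As a rough illustration of the default behaviour described above, the same vocabulary and counts can be rebuilt with the standard library. This is only a sketch: the regular expression mirrors the documented default `token_pattern` of `CountVectorizer`, and the names `tokenize`, `vocabulary` and `count_matrix` are introduced here for illustration.

```python
import re
from collections import Counter

corpus = [
    'I love dogs',
    'I hate dogs and knitting',
    'Knitting is my hobby and my passion',
]

# Tokens are runs of two or more word characters, so 'I' and punctuation are dropped
token_pattern = re.compile(r"(?u)\b\w\w+\b")


def tokenize(document):
    """Lowercase the document and split it into tokens."""
    return token_pattern.findall(document.lower())


# One Counter per document: token -> number of occurrences in that document
counts_per_document = [Counter(tokenize(document)) for document in corpus]

# The vocabulary is the sorted set of distinct tokens across all documents
vocabulary = sorted(set().union(*counts_per_document))

# One row per document, one column per vocabulary word
count_matrix = [
    [counts[word] for word in vocabulary] for counts in counts_per_document
]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row of `count_matrix` corresponds to one sentence of the corpus, and a word that occurs twice in a sentence (such as 'my' in the third one) gets a count of 2.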


## Represent features and count in a dataframe
We create a dataframe that stores the features and their corresponding word counts.

In [4]:
# Import pandas 
import pandas as pd

# Create a dataframe that stores the number of times each word appears in each document
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df = pd.DataFrame(word_count.toarray(), columns=cv.get_feature_names_out())
df
df

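| | and | dogs | hate | hobby | is | knitting | love | my | passion |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 2 | 1 |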


Now we can feed this matrix into machine learning algorithms to perform the required task.

We will use the XGBoost algorithm on the vectors generated by BoW to predict the sentiment of news headlines.


# XGBoost Model

In this notebook, we will use an XGBoost model to predict the sentiment of news headlines. The steps involved are:
1. Read data
2. Determine target variables
3. Create predictor variables
4. Split data into train and test 
5. Apply Bag of Words on train and test dataset
6. Run XGBoost on the train dataset
7. Predict sentiment scores on the test dataset
8. Analyse the results

## Read data

In [5]:
# Import pandas
import pandas as pd

# Read news sentiment data
news_sentiment_data = pd.read_csv('news_headline_sentiments.csv')
news_sentiment_data.head()

| | Unnamed: 0 | news_headline | time_stamp | URL | source_id | sentiment_class | sentiment_scores |
|---|---|---|---|---|---|---|---|
| 0 | 1999840 | IMF chief: Beware of global deflation | 2014-01-27 10:17:47-08:00 | /video/data/2.0/video/business/2014/01/24/intv... | 1248 | 0 | 0.0 |
| 1 | 2012782 | Cars for Sale - ToDrive.com | 2014-01-27 10:17:47-08:00 | http://www.todrive.com/ | 1303 | 0 | 0.0 |
| 2 | 1995595 | Info Edge CFO Raghuvanshi resigns | 2014-01-27 10:17:47-08:00 | http://www.thehindu.com/business/Industry/info... | 1239 | -1 | -0.3182 |
| 3 | 2012786 | Weak company earnings drag US stocks mostly lower | 2014-01-27 10:17:47-08:00 | //www.suntimes.com/business/24993766-420/weak-... | 1303 | -1 | -0.7184 |
| 4 | 2012787 | Some workers set to reach $100K pay in Boeing ... | 2014-01-27 10:17:47-08:00 | //www.suntimes.com/business/24725356-420/some-... | 1303 | -1 | -0.0772 |


## Determine target variables
The target variable is the sentiment class of each news headline, which is what the model will predict.

In [6]:
# Store sentiment_class in y
y = news_sentiment_data.sentiment_class

## Create predictor variables
The predictor variable is the text of each news headline, which is used to predict the target variable (the sentiment class).

In [7]:
# Store news headlines in X
X = news_sentiment_data.news_headline

# Convert any value that is not already a string to a string
X = [x if isinstance(x, str) else str(x) for x in X]

## Split data into train and test  

We use the first 80% of the data for training and the remaining 20% for testing.

In [9]:
test_ratio = 0.2
train_ratio = 1 - test_ratio
num_train = int(train_ratio * len(X))

# Train dataset: the first 80% of the rows
X_train = X[:num_train]
y_train = y[:num_train]

# Test dataset: the remaining 20% of the rows
X_test = X[num_train:]
y_test = y[num_train:]
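
For comparison, scikit-learn's `train_test_split` can produce the same kind of ordered split when shuffling is turned off. The snippet below is a minimal sketch on made-up toy data (the `headlines` and `labels` lists are hypothetical, not taken from the dataset), and for these ten rows it should match plain slicing.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the headlines and their sentiment classes
headlines = [f'headline {i}' for i in range(10)]
labels = [0, 1, -1, 0, 1, -1, 0, 1, -1, 0]

# shuffle=False keeps the original row order, like slicing at the 80% mark
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(
    headlines, labels, test_size=0.2, shuffle=False
)

print(len(X_train_alt), len(X_test_alt))
print(X_test_alt)
```

Leaving `shuffle` at its default of `True` would instead draw a random split, which is a different design choice from the ordered split used in this notebook.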

## Apply Bag of Words on train and test dataset

We convert the training and testing datasets into bag-of-words vectors.

To do that, we first process the news headlines, which consists of:
1. converting text to lower case
2. removing stop words such as the, and, of, etc.

`CountVectorizer` performs both steps itself through its `lowercase` and `stop_words` arguments, so we pass the raw headlines to it directly.

In [11]:
# Import CountVectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with the arguments needed to process the data
count_vectorizer = CountVectorizer(stop_words='english', lowercase=True)

In [12]:
# Fit and transform the model on the train dataset
# fit: learn the vocabulary from the training headlines
# transform: convert the headlines into count vectors using that vocabulary
X_new_train = count_vectorizer.fit_transform(X_train)

In [13]:
# Transform the test dataset
X_new_test = count_vectorizer.transform(X_test)
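
The reason the test data is only transformed, not fitted, can be seen on a tiny example. The sketch below uses made-up headlines (hypothetical, not from the dataset): the vocabulary comes from the training text alone, so the test matrix has the same columns and words never seen in training are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy headlines, for illustration only
train_headlines = [
    'Stocks rise on strong earnings',
    'Weak earnings drag stocks lower',
]
test_headlines = ['Stocks surge on weak outlook']

vectorizer = CountVectorizer(stop_words='english', lowercase=True)

# The vocabulary is learned from the training headlines only
train_vectors = vectorizer.fit_transform(train_headlines)

# The test headline is mapped onto that same vocabulary
test_vectors = vectorizer.transform(test_headlines)

print(sorted(vectorizer.vocabulary_))
print(train_vectors.shape, test_vectors.shape)
print(test_vectors.toarray())
```

Here 'surge' and 'outlook' do not appear in the training headlines, so they contribute nothing to the test vector, while 'stocks' and 'weak' are counted.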

## Run XGBoost on the train dataset

To train our model, we pass the vectors to the XGBoost model. We use `XGBClassifier` and pass the following parameters:

1. <b>max_depth:</b> maximum depth of each tree, which limits how complex a tree can become
2. <b>n_estimators:</b> number of trees to create

In [17]:
# Import XGBClassifier from xgboost
from xgboost import XGBClassifier

# Instantiate XGBClassifier
xg = XGBClassifier(max_depth = 6, n_estimators = 100)

# Fit the model on train dataset
xg_model = xg.fit(X_new_train,y_train)
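
The sentiment classes in this dataset are -1, 0 and 1. Some newer XGBoost releases appear to expect class labels to be consecutive integers starting at 0 and may reject other values, so whether this applies depends on the installed version. If it does, one possible workaround is to encode the labels before fitting and decode the predictions afterwards. The sketch below shows only the encoding step with scikit-learn's `LabelEncoder`; the names `y_example` and `encoded_predictions` are hypothetical stand-ins.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of sentiment classes, for illustration only
y_example = [0, 0, -1, -1, 1]

label_encoder = LabelEncoder()

# Map the sorted classes -1, 0, 1 to 0, 1, 2
y_encoded = label_encoder.fit_transform(y_example)
print(label_encoder.classes_)
print(y_encoded)

# A classifier trained on y_encoded would predict values in 0..2;
# here the encoded labels stand in for such predictions
encoded_predictions = y_encoded

# Map the predictions back to the original -1, 0, 1 classes
decoded_predictions = label_encoder.inverse_transform(encoded_predictions)
print(decoded_predictions)
```

With this approach the model would be fitted on `y_encoded` instead of the raw labels, and the decoded predictions would be compared with the original `y_test`.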

## Predict
Now that the model is trained, we predict the sentiment class for the test dataset using the `predict` function.

In [18]:
# Predict sentiment class on the test dataset
# Reminder: X_new_test = count_vectorizer.transform(X_test)
prediction = xg_model.predict(X_new_test)

## Results Analysis
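We use the confusion matrix, the classification report and the prediction accuracy to examine the performance of the classification model. In the confusion matrix, rows are the true classes and columns are the predicted classes, in the order -1, 0, 1.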

In [20]:
from sklearn.metrics import confusion_matrix 
print(confusion_matrix(y_test, prediction))

[[ 56110  33451   5551]
 [  1797 178694   3528]
 [  4097  36928  79845]]


### Classification report

In [22]:
from sklearn.metrics import classification_report 
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

          -1       0.90      0.59      0.71     95112
           0       0.72      0.97      0.83    184019
           1       0.90      0.66      0.76    120870

   micro avg       0.79      0.79      0.79    400001
   macro avg       0.84      0.74      0.77    400001
weighted avg       0.82      0.79      0.78    400001



### Prediction accuracy

In [23]:
from sklearn.metrics import accuracy_score 
print(accuracy_score(y_test,prediction))

0.7866205334486663
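
The three outputs above are consistent with one another, and the per-class precision, recall and overall accuracy can be recomputed from the confusion matrix alone. The following is a small sketch with NumPy that copies the matrix printed earlier; the variable names are chosen here for illustration.

```python
import numpy as np

class_labels = [-1, 0, 1]

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([
    [56110, 33451, 5551],
    [1797, 178694, 3528],
    [4097, 36928, 79845],
])

# Correct predictions sit on the diagonal
correct = np.diag(cm)

# Recall: correct predictions divided by the number of true samples per class
recall = correct / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision = correct / cm.sum(axis=0)

# Accuracy: all correct predictions divided by all samples
accuracy = correct.sum() / cm.sum()

for label, p, r in zip(class_labels, precision, recall):
    print(f'class {label:>2}: precision={p:.2f} recall={r:.2f}')
print(f'accuracy={accuracy:.4f}')
```

Read this way, the low recall for classes -1 and 1 comes from the large middle column: many negative and positive headlines are predicted as neutral.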
