# 1. Import the libraries, load dataset, print shape of data, data description.

In [378]:
#importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer


from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
#from sklearn.model_selection import KFold, cross_val_score

from sklearn.metrics import confusion_matrix

In [272]:
import warnings
warnings.filterwarnings("ignore")

In [273]:
#importing dataset
dataset=pd.read_csv('Tweets.csv')

In [274]:
#printing first 5 rows of data
dataset.head(5)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [275]:
dataset.shape

(14640, 15)

In [276]:
#extracting unique values of ariline_sentiment to acquire classes of target variable
dataset.airline_sentiment.unique()

array(['neutral', 'positive', 'negative'], dtype=object)

Dataset is composed of 15 columns and 14640 rows. It means there are 14640 tweets to be used to train and test model for sentiment extraction.
There are 3 classes of target variable (airline_sentiment) - neutral, positive and negative. This information will be usefull once the training of the model is to be done. 
Following code tries to find portion of those classes in dataset.

In [277]:
#percentage of of positive sentiment
100*dataset[dataset['airline_sentiment']=='positive'].shape[0]/dataset.shape[0]

16.14071038251366

In [278]:
#percentage of of negative sentiment
100*dataset[dataset['airline_sentiment']=='negative'].shape[0]/dataset.shape[0]

62.69125683060109

In [279]:
#percentage of of negative sentiment
100*dataset[dataset['airline_sentiment']=='neutral'].shape[0]/dataset.shape[0]

21.168032786885245

Suprisingly, most of the tweets are negative ones - having share of more than 62%. It could be said target variable is not highly imbalanced, since there is no class that is represented in more than 90% of data. 



In [280]:
#checking if there are null values in dataset
dataset.isnull().any()

tweet_id                        False
airline_sentiment               False
airline_sentiment_confidence    False
negativereason                   True
negativereason_confidence        True
airline                         False
airline_sentiment_gold           True
name                            False
negativereason_gold              True
retweet_count                   False
text                            False
tweet_coord                      True
tweet_created                   False
tweet_location                   True
user_timezone                    True
dtype: bool

There are some columns with null values, however, columns of our interest (airline sentiment and text) have no null values. Null values in e.g. column negativeReason are result of non applicability in case the sentiment is positive or neutral. 

# 2. Understand of data-columns:

## a. Drop all other columns except “text” and “airline_sentiment”.

Instead of dropping columns, dataset is only reshaped by using only two columns of further interest. That approach is used becuase there are only two columns to be used out of 15 columns. The result is the same as 13 columns have been dropped. 

In [281]:
#assigning only two out of 15 columns of dataset to dataset
dataset=dataset[[ 'text','airline_sentiment']]

## b. Check the shape of the data.

In [282]:
dataset.shape

(14640, 2)

Now the dataset is composed on 14640 rows (cases) and 2 columns (text and target variable).

## c. Print the first 5 rows of data.

In [283]:
dataset.head(5)

Unnamed: 0,text,airline_sentiment
0,@VirginAmerica What @dhepburn said.,neutral
1,@VirginAmerica plus you've added commercials t...,positive
2,@VirginAmerica I didn't today... Must mean I n...,neutral
3,@VirginAmerica it's really aggressive to blast...,negative
4,@VirginAmerica and it's a really big bad thing...,negative


# 3. Text pre-processing: Data preparation.
#### NOTE:- Each text pre-processing steps should be mentioned in the notebook separately.

Task is to do text pre-processing to prepare raw text to the text which will be vecotrized. The approach used is to develop functions for each pre processing step. At the end, final fiunction is developed which combine all of the functions developed for each step in one consecutive order.

## a. Html tag removal.

In [284]:
#function which takes text with html tags and returns text without it.

#soup=BeautifulSoup(html, 'html.parser')

def html_remover(text_with_html):
    #applying get_text() function of BeautifulSoup class
    text_no_html = BeautifulSoup(text_with_html,"html.parser").get_text()
    return (text_no_html)

## b. Tokenization.

In [285]:
#function which tokenize text and returns list of tokens
def tokenization(text_to_tokenize):
    #instancing ToktokTokenizer object
    tokenizer=ToktokTokenizer()
    #applying tokenize() function of ToktokTokenizer object instance
    tokens=tokenizer.tokenize(text_to_tokenize)
    return(tokens)

## c. Remove the numbers.

In [286]:
#function which returns text without numbers
def removal_number(text_with_numbers):
    #pattern is created in a form "[0,1,2,3,4,5,6,7,8,9]"
    pattern=r'[0-9]'
    #pattern is substitued with null value; in other words, whenever there is a digit identified, it is subsituted with nothing
    text_without_numbers=re.sub(pattern,'',text_with_numbers)
    return(text_without_numbers)

## d. Removal of Special Characters and Punctuations.

In [287]:
#function which removes special characters by preserving only characters a-zA-Z and space
def removal_special_characters(text_with_special_char):
    #pattern is created in a form "not [a-zA-Z and space]"
    pattern=r'[^a-zA-Z\s]'
    #pattern is subsituted with nothing; 
    #in effect, whenever a character that does not belongs to small or capital letter 
    #or a space character is encountered, it is subsituted with nothing
    text_without_special_char=re.sub(pattern,'', text_with_special_char)
    return(text_without_special_char)

## e. Conversion to lowercase.

In [288]:
#function that returns text with lowercase only
def to_lower_case(text_with_upper_case):
    text_lower_case_only=text_with_upper_case.lower()
    return(text_lower_case_only)

## f. Lemmatize or stemming.

In [289]:
#instance of WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

#function to lemmatize word
def nltk_lemmatize_word(word):
    #applying lemmatize function of WordNetLemmatizer() object
    return lemmatizer.lemmatize(word)

## g. Join the words in the list to convert back to text string in the data frame. (So that each row contains the data in text format.)

In [290]:
#function which recieve list of words and joins them into text
def join_words(words_to_join):
    return (" ".join(words_to_join))

In [291]:
#function which combineds all of the steps above
def preprocess_text(raw_text):
    #removal of html
    no_html_text=html_remover(raw_text)
    
    #tokenization
    tokens=tokenization(no_html_text)
    #creating empty list to store preprocessed tokens
    preprocessed_tokens=[]
    #loop to run through all tokens of the raw_text (which are result of tokenization step)
    for i in tokens:
        #in this loop all tasks are performed on single token (word)
        
        #removal of special characters
        spec_char_removed=removal_special_characters(i)
        #removal of upper case
        lower_case=to_lower_case(spec_char_removed)
        #lemmatization of word
        lemmatized_word=nltk_lemmatize_word(lower_case)
        
        #storing all results performed on individual tokens in one list 
        preprocessed_tokens.append(lemmatized_word)
             
    
    #joining the words(preprocessed tokens) into one text and returning the result
    return(join_words(preprocessed_tokens))
        

In [292]:
#verfication of the preprocess_text()function on a single text entry
preproc_tok=preprocess_text(dataset['text'][50])

In [293]:
#printing the result of preprocess_text()function on a single text entry
preproc_tok

'virginamerica is flight  on it  s way  wa supposed to take off  minute ago website still show  on time  not  in flight   thanks '

In [294]:
#printing original text for comparison
dataset['text'][50]

'@VirginAmerica Is flight 769 on it\'s way? Was supposed to take off 30 minutes ago. Website still shows "On Time" not "In Flight". Thanks.'

preprocess_text function removed all special characters, numbers, capital letters, and lemmatized words (e.g. shows--> show, minutes-->minute), however, there still opportunity to further preprocess text (e.g. contractions have not been converted (e.g. it's) or stop words have not been removed), however, task was clear on required preprocessed steps.

Now it is time to apply preprocess_text() function (and inherently all text preprocessing steps) on dataset. It is done in a loop where each element of dataset['text'] dataframe column have been passed to the function and result is stored on same place of the dataframe. 

In [295]:
#seeting index to 0 
i=0
#looping through all the elements of dataset['text'] column
for text in dataset['text']:
    dataset['text'].iloc[i]=preprocess_text(text)
    i+=1
    

In [296]:
#verifying if the preprocess steps were performed correctly on a snigle entry
dataset['text'].iloc[50]

'virginamerica is flight  on it  s way  wa supposed to take off  minute ago website still show  on time  not  in flight   thanks '

 ## h. Print the first 5 rows of data after pre-processing.

In [166]:
dataset.head(5)

Unnamed: 0,text,airline_sentiment
0,virginamerica what dhepburn said,neutral
1,virginamerica plus you ve added commercial to...,positive
2,virginamerica i didn t today must mean i nee...,neutral
3,virginamerica it s really aggressive to blast...,negative
4,virginamerica and it s a really big bad thing...,negative


# 4. Vectorization:

It is worth noting that target variable is composed of three possible classes. There are to possible ways to represetn such classes in numerical form - to perform one hot encoding of target variable, or to perform label encoding. 

In case of one hot encoding, target would be composed of three columns (or two if one of the varibles would be droped(e.g. if sentiment is not negative or neutral, then it is positive and there is no reason to retain such variable ; it is completly defined by other two)). Having three(or two target variables) might lead to missclassification in case all of the targets are classified as 0 (e.g. all three classes resulted in 0 value - in effect it would mean there was no class identified for such case).

In case of label encoding, labels would be given to each sentiment class (e.g. negative - 0, neutral - 1, positive - 2). Problem with label encoding is it is hard to establish values of labels (e.g. it could be the other way around for given example of labels), however target would remain as one vairable (one column). 

Both approaches will be used, since both have some advantages and disadvantages. 

In addition to above, dataset needs to be splitted into train and test data before vectorization is applied. The reason for this is to prevent dataleaks between train and test data. Data leak might happen if vocabulary of vectorizer is generated base don whole dataset, instead of train data only. If such vocabulary is generated only on train data, then, once the transform function of vectorizer is applied on test data, only vocabulary of train data will be used, hence simulating real unseen data (some new words might be present in test data that are not part of train data vocabulary).
 

#### One hot encoding

In [297]:
#generating dummy_variable (one hot encoding) from target variable
dataset_dummy_target=pd.get_dummies(dataset, drop_first=False, prefix='', prefix_sep='', columns=['airline_sentiment'])

In [298]:
#verifying the result of previous step
dataset_dummy_target.head(5)

Unnamed: 0,text,negative,neutral,positive
0,virginamerica what dhepburn said,0,1,0
1,virginamerica plus you ve added commercial to...,0,0,1
2,virginamerica i didn t today must mean i nee...,0,1,0
3,virginamerica it s really aggressive to blast...,1,0,0
4,virginamerica and it s a really big bad thing...,1,0,0


#### Label encoding

In [324]:
#creating mapping dictionary
labels_dict={'negative': 0, 'neutral': 1, 'positive': 2}

In [305]:
#generating int_label list where labels are stored in correspondence to airline sentiment
int_labels=[]
for value in dataset['airline_sentiment']:
    #extracting the value of key in labels_dict dictionary
    int_labels.append(labels_dict[value])

In [307]:
#setting index for label_necoded_target_series
index=dataset.index

In [310]:
#generating pandas series to be concatenated to dataset dataframe
label_encoded_target_series=pd.Series(data=int_labels, index=index)

In [318]:
#generating new dataset by concatenation of dataset and series generated in step above
dataset_label_encoded_target=pd.concat([dataset, label_encoded_target_series], axis=1)

In [320]:
#verifying previous steps
dataset_label_encoded_target

Unnamed: 0,text,airline_sentiment,0
0,virginamerica what dhepburn said,neutral,1
1,virginamerica plus you ve added commercial to...,positive,2
2,virginamerica i didn t today must mean i nee...,neutral,1
3,virginamerica it s really aggressive to blast...,negative,0
4,virginamerica and it s a really big bad thing...,negative,0
...,...,...,...
14635,americanair thank you we got on a different fl...,positive,2
14636,americanair leaving over minute late flight n...,negative,0
14637,americanair please bring american airline to b...,neutral,1
14638,americanair you have my money you change my f...,negative,0


#### Splitting dataset into train and test sets

In [323]:
#splitting dataset where target variable is tranformed into dummy variables
x_train_d, x_test_d, y_train_d, y_test_d=train_test_split(dataset_dummy_target['text'], dataset_dummy_target[['negative','neutral', 'positive']], test_size=0.3, stratify=dataset_dummy_target[['negative','neutral', 'positive']], random_state=7)

#splitting dataset where target is label encoded
x_train_l, x_test_l, y_train_l, y_test_l=train_test_split(dataset_label_encoded_target['text'], dataset_label_encoded_target[0], test_size=0.3, stratify=dataset_label_encoded_target[0], random_state=7)

Shapes of trainning datasets:

In [325]:
x_train_d.shape

(10248,)

In [326]:
y_train_d.shape

(10248, 3)

In [327]:
x_train_l.shape

(10248,)

In [328]:
y_train_l.shape

(10248,)

First few rows of y_test_d and y_test_l:

In [329]:
y_test_d.head(5)

Unnamed: 0,negative,neutral,positive
9982,1,0,0
7868,0,1,0
10586,1,0,0
8957,0,1,0
8784,1,0,0


In [330]:
y_test_l.head(5)

8651     0
11637    0
7739     1
1245     1
9256     1
Name: 0, dtype: int64

#### Generating vectors

Both CountVectorizer and TfIdfVectorizer will be applied on two sets of trainning and test data - once for one-hot-encoded and once for label encoded target variables. Even though this action is applied only on independant variables and does not include target variable, it is safer to do this step on both becuase of train test split action that was performed beforehand. 

## a. Use CountVectorizer.

In [331]:
#for target dummy_variable 

#creating vectorizer - C in name means count, d at the end means dummy target variable
C_vectorizer_d=CountVectorizer()

#vectorizer is fitted with trainning data
C_vectorizer_d.fit(x_train_d)

#transform is applied on both, trainning and test data
x_train_features_countV_d=C_vectorizer_d.transform(x_train_d)
x_test_features_countV_d=C_vectorizer_d.transform(x_test_d)

x_train_features_countV_d=x_train_features_countV_d.toarray()
x_test_features_countV_d=x_test_features_countV_d.toarray()

In [414]:
#for target label encoded

#creating vectorizer - C in name means count, l at the end means label encoded target variable
C_vectorizer_l=CountVectorizer()

#vectorizer is fitted with trainning data
C_vectorizer_l.fit(x_train_l)

#transform is applied on both, trainning and test data
x_train_features_countV_l=C_vectorizer_l.transform(x_train_l)
x_test_features_countV_l=C_vectorizer_l.transform(x_test_l)

x_train_features_countV_l=x_train_features_countV_l.toarray()
x_test_features_countV_l=x_test_features_countV_l.toarray()

In [333]:
#printing the vocabulary of CountVectorizer (in this case only for dataset where target variable is "one hot" encoded)
C_vectorizer_d.vocabulary_

{'united': 9869,
 'flt': 3277,
 'cancelled': 1266,
 'flightled': 3240,
 'and': 368,
 'get': 3569,
 'email': 2721,
 'am': 317,
 'what': 10301,
 'happened': 3829,
 'to': 9537,
 'courtesy': 1920,
 'phn': 7246,
 'call': 1246,
 'had': 3789,
 'book': 1001,
 'diff': 2346,
 'airline': 223,
 'amp': 356,
 'city': 1533,
 'delayed': 2194,
 'because': 818,
 'of': 6824,
 'maintenance': 6037,
 'that': 9374,
 'fixed': 3191,
 'but': 1200,
 'can': 1262,
 'board': 979,
 'flight': 3217,
 'crew': 1969,
 'didn': 2338,
 'stay': 8958,
 'in': 5088,
 'boarding': 981,
 'area': 506,
 'fail': 3024,
 'my': 6479,
 'dad': 2059,
 'booked': 1002,
 'through': 9453,
 'orbitz': 6930,
 'due': 2602,
 'weather': 10239,
 'he': 3883,
 'make': 6042,
 'it': 5323,
 'the': 9377,
 'airport': 233,
 'you': 10565,
 'help': 3926,
 'him': 3984,
 'read': 7755,
 'bio': 913,
 'see': 8407,
 'who': 10337,
 'work': 10436,
 'with': 10391,
 'have': 3868,
 'never': 6568,
 'encountered': 2760,
 'this': 9427,
 'your': 10573,
 'before': 829,
 'disa

## b. Use TfidfVectorizer.

In [334]:
#for target dummy_variable

#creating vectorizer - T in name means TfIdf, d at the end means dummy target variable
T_vectorizer_d=TfidfVectorizer()

#vectorizer is fitted with trainning data
T_vectorizer_d.fit(x_train_d)

#transform is applied on both, trainning and test data
x_train_features_tfIdfV_d=T_vectorizer_d.transform(x_train_d)
x_test_features_tfIdfV_d=T_vectorizer_d.transform(x_test_d)

x_train_features_tfIdfV_d=x_train_features_tfIdfV_d.toarray()
x_test_features_tfIdfV_d=x_test_features_tfIdfV_d.toarray()


In [335]:
#for target label encoded

#creating vectorizer - T in name means TfIdf, l at the end means label encoded target variable
T_vectorizer_l=TfidfVectorizer()

#vectorizer is fitted with trainning data
T_vectorizer_l.fit(x_train_l)

#transform is applied on both, trainning and test data
x_train_features_tfIdfV_l=T_vectorizer_l.transform(x_train_l)
x_test_features_tfIdfV_l=T_vectorizer_l.transform(x_test_l)

x_train_features_tfIdfV_l=x_train_features_tfIdfV_l.toarray()
x_test_features_tfIdfV_l=x_test_features_tfIdfV_l.toarray()

In [336]:
#printing the vocabulary of TfIdfVectorizer (in this case only for dataset where target variable is "one hot" encoded)
T_vectorizer_d.vocabulary_

{'united': 9869,
 'flt': 3277,
 'cancelled': 1266,
 'flightled': 3240,
 'and': 368,
 'get': 3569,
 'email': 2721,
 'am': 317,
 'what': 10301,
 'happened': 3829,
 'to': 9537,
 'courtesy': 1920,
 'phn': 7246,
 'call': 1246,
 'had': 3789,
 'book': 1001,
 'diff': 2346,
 'airline': 223,
 'amp': 356,
 'city': 1533,
 'delayed': 2194,
 'because': 818,
 'of': 6824,
 'maintenance': 6037,
 'that': 9374,
 'fixed': 3191,
 'but': 1200,
 'can': 1262,
 'board': 979,
 'flight': 3217,
 'crew': 1969,
 'didn': 2338,
 'stay': 8958,
 'in': 5088,
 'boarding': 981,
 'area': 506,
 'fail': 3024,
 'my': 6479,
 'dad': 2059,
 'booked': 1002,
 'through': 9453,
 'orbitz': 6930,
 'due': 2602,
 'weather': 10239,
 'he': 3883,
 'make': 6042,
 'it': 5323,
 'the': 9377,
 'airport': 233,
 'you': 10565,
 'help': 3926,
 'him': 3984,
 'read': 7755,
 'bio': 913,
 'see': 8407,
 'who': 10337,
 'work': 10436,
 'with': 10391,
 'have': 3868,
 'never': 6568,
 'encountered': 2760,
 'this': 9427,
 'your': 10573,
 'before': 829,
 'disa

Vocabulary of both vectorizers produce same output, which is expected since the same training data is used in both scenarios.

# 5. Fit and evaluate the model using both types of vectorization.

Four models will be used, one for each set of trainning and test data generated in steps above, as described below:
- rf_model_CV_d - random forest model which will be fitted with train data generated by count vectorizer where target is one hot encoded;
- rf_model_CV_l - random forest model which will be fitted with train data generated by count vectorizer where target is label encoded;
- rf_model_TV_d - random forest model which will be fitted with train data generated by tfidf vectorizer where target is one hot encoded; and
- rf_model_TV_l - random forest model which will be fitted with train data generated by tfidf vectorizer where target is label encoded;

Once the models are fitted, accuracy score will be extracted for all four models on both, train and test datasets, and finally confussion matrix will be generated at the end. 

In [362]:
#countVectorizer
#one_hot_encoded target

#creating an insance of randomforest object that will be fitted (trainned) by data generated by countVectorizer 
rf_model_CV_d=RandomForestClassifier(random_state=7, n_estimators=10, n_jobs=4)
#fitting the model (trainning)
rf_model_CV_d=rf_model_CV_d.fit(x_train_features_countV_d, y_train_d)

In [363]:
#TfIdfVectorizer
#one_hot_encoded target

#creating an insance of randomforest object that will be fitted (trainned) by data generated by tfIdfVectorizer 
rf_model_TV_d=RandomForestClassifier(n_estimators=10, n_jobs=4, random_state=7)
#fitting the model (trainning)
rf_model_TV_d=rf_model_TV_d.fit(x_train_features_tfIdfV_d, y_train_d)

In [364]:
#countVectorizer
#label_encoded target

#creating an insance of randomforest object that will be fitted (trainned) by data generated by countVectorizer 
rf_model_CV_l=RandomForestClassifier(random_state=7, n_estimators=10, n_jobs=4)
#fitting the model (trainning)
rf_model_CV_l=rf_model_CV_l.fit(x_train_features_countV_l, y_train_l)

In [365]:
#TfIdfVectorizer
#label_encoded target

#creating an insance of randomforest object that will be fitted (trainned) by data generated by tfIdfVectorizer
rf_model_TV_l=RandomForestClassifier(n_estimators=10, n_jobs=4)
#fitting the model (trainning)
rf_model_TV_l=rf_model_TV_l.fit(x_train_features_tfIdfV_l, y_train_l)

Calculating accuracy score for each of the models

In [366]:
#calculating accuracy score on train and test data
#model trainned by data generated by countVectorizer
#target variable was one_hot_encoded

#score on train data
rf_model_CV_score_train_d=rf_model_CV_d.score(x_train_features_countV, y_train)
#score on test data
rf_model_CV_score_test_d=rf_model_CV_d.score(x_test_features_countV, y_test)

In [367]:
print("Accuracy score of rf_model_CV_d on train data is", rf_model_CV_score_train_d)
print("Accuracy score of rf_model_CV_d on test data is",rf_model_CV_score_test_d)

Accuracy score of rf_model_CV_d on train data is 0.9627244340359095
Accuracy score of rf_model_CV_d on test data is 0.6432149362477231


In [368]:
#calculating accuracy score on train and test data
#model trainned by data generated by TfIdfVectorizer
#target variable was one_hot_encoded

#score on train data
rf_model_TV_score_train_d=rf_model_TV_d.score(x_train_features_tfIdfV_d, y_train)
#score on test data
rf_model_TV_score_test_d=rf_model_TV_d.score(x_test_features_tfIdfV_d, y_test)

In [369]:
print("Accuracy score of rf_model_TV_d on train data is",rf_model_TV_score_train)
print("Accuracy score of rf_model_TV_d on test data is",rf_model_TV_score_test)

Accuracy score of rf_model_TV_d on train data is 0.9666276346604216
Accuracy score of rf_model_TV_d on test data is 0.639344262295082


In [370]:
#calculating accuracy score on train and test data
#model trainned by data generated by CountVectorizer
#target variable was label_encoded

rf_model_CV_score_train_l=rf_model_CV_l.score(x_train_features_countV_l, y_train_l)
#score on test data
rf_model_CV_score_test_l=rf_model_CV_l.score(x_test_features_countV_l, y_test_l)

In [371]:
print("Accuracy score of rf_model_CV_l on train data is", rf_model_CV_score_train_l)
print("Accuracy score of rf_model_CV_l on test data is",rf_model_CV_score_test_l)

Accuracy score of rf_model_CV_l on train data is 0.98467993754879
Accuracy score of rf_model_CV_l on test data is 0.73816029143898


In [339]:
#calculating accuracy score on train and test data
#model trainned by data generated by TfIdfVectorizer
#target variable was label_encoded

rf_model_TV_score_train_l=rf_model_TV_l.score(x_train_features_tfIdfV_l, y_train_l)
#score on test data
rf_model_TV_score_test_l=rf_model_TV_l.score(x_test_features_tfIdfV_l, y_test_l)
##############

In [372]:
print("Accuracy score of rf_model_TV_l on train data is",rf_model_TV_score_train_l)
print("Accuracy score of rf_model_TV_ on test data is",rf_model_TV_score_test_l)

Accuracy score of rf_model_TV_l on train data is 0.9758977361436377
Accuracy score of rf_model_TV_ on test data is 0.7317850637522769


All of the models are overfit which can be seen from the difference between accuracy achieved on train and test data. In addition, both models that were trained on datasets which have label encoded target varibale performed significantly better on test data. 

From the achieved accuracy, it is noticable that models trained on count vectorized data perfromed slightly better which was not expected, since there are lots of common words that have not been removed by preprocessing (e.g. no stop words removal was applied). 

In order to reduce overfit, radnom forest should be regulated with hyper parameters. 

#### Confusion matrix

To calculate confusion matrix, confusion_matrix function from sklearn.metric is used. Function must be fed with predicted and true values.
For predicted and true values where one hot encoded target varibale was use index of max value had to be given to confusion_matrix() function.
It is worth noting that confusion matrix will be calculated on test data only.

Predictions

In [375]:
#predicting classes where target variable was one hot encoded
rf_model_CV_d_pred_test=rf_model_CV_d.predict(x_test_features_countV_d)
rf_model_TV_d_pred_test=rf_model_TV_d.predict(x_test_features_tfIdfV_d)

In [376]:
#predicting classes where target variable was label encoded
rf_model_CV_l_pred_test=rf_model_CV_l.predict(x_test_features_countV_l)
rf_model_TV_l_pred_test=rf_model_TV_l.predict(x_test_features_tfIdfV_l)

In [379]:
#confusion matrices

#target variable - one hot encoded
#calculating confusion matrices
cm_CV_d=confusion_matrix(y_test_d.values.argmax(axis=1), rf_model_CV_d.predict(x_test_features_countV_d).argmax(axis=1))
cm_TV_d=confusion_matrix(y_test_d.values.argmax(axis=1), rf_model_TV_d.predict(x_test_features_tfIdfV_d).argmax(axis=1))

#target variable - label encoded
#calculating confusion matrices
cm_CV_l=confusion_matrix(y_test_l, rf_model_CV_l.predict(x_test_features_countV_l))
cm_TV_l=confusion_matrix(y_test_l, rf_model_TV_l.predict(x_test_features_tfIdfV_l))
#cm=confusion_matrix(y_test.values.argmax(axis=1), rf_model_TV.predict(x_test_features_tfIdfV).argmax(axis=1))


In [380]:
#confusion matrix for model trained with count vectorized data where target was one hot encoded variable
cm_CV_d

array([[2671,   66,   16],
       [ 645,  258,   27],
       [ 414,   54,  241]], dtype=int64)

In [381]:
#confusion matrix for model trained with tf-idf vectorized data where target was one hot encoded variable
cm_TV_d

array([[2686,   51,   16],
       [ 662,  235,   33],
       [ 458,   35,  216]], dtype=int64)

In [382]:
#confusion matrix for model trained with count vectorized data where target was label encoded variable
cm_CV_l

array([[2592,  121,   40],
       [ 511,  365,   54],
       [ 326,   98,  285]], dtype=int64)

In [383]:
#confusion matrix for model trained with tf-idf vectorized data where target was label encoded variable
cm_TV_l

array([[2601,  124,   28],
       [ 533,  337,   60],
       [ 341,  105,  263]], dtype=int64)

In [391]:
lst=np.array([0,0,0])
not(lst.any())

True

It seems like all the difference between two distinct approaches with regarts to processing target variable lies in false negative sentiment, therefor the reason for such difference is explored in more details with code below.

First, let's see are there any, and how many predictions are there where all three one hot encoded categories are predicted to be zero. This means there is no eligible class assigned to particular case.

In [404]:
#summ is variable which stores sum of such cases where all three targets were predicted 0 
#in case where one hot encoded target is used
summ=0

#setting index variable to 0
index=0
#index_list will be used to store indexes of cases described above
index_list=[]
#looping through predictions
for lst in rf_model_CV_d_pred_test:
    #checking if all three predicted values a re equal to 0 - not(any()) is used to negate result of any()
    if not(lst.any()):
        #if it is the case - summ is added by 1
        summ+=1
        #index of such case is recorded in index_list
        index_list.append(index)
    index+=1
print ("There are ",summ,"cases where all three one hot encoded target varibales were predicted to be 0")

There are  900 cases where all three one hot encoded target varibales were predicted to be 0


In [397]:
#printing one of such cases
rf_model_CV_d_pred_test[index_list[0]]

array([0, 0, 0], dtype=uint8)

In [405]:
#printing the result of argmax in such cases
rf_model_CV_d_pred_test[index_list[0]].argmax()

0

#### Conclusion for CM

Confusion matrices of models trained on data with one-hot encoded target variable shows higher amount of false negative. The reason for that should be looked in argmax function which returns index of max value in list of three values for each predicted set of values (there are three output variables). Once all three are incorrectly predicted as 0, argmax will return index of first one, which belongs to negative sentiment, hence boosting false negative.
In general, all confusion matrices shows highest values along diagonal, whixh was expected. 

#### Regularized model with lower number of features

In order to reduce overfit problem evident in all of the models, regularized random forest  model will be created. 

First the number of features are reduced in training data by setting hyper parameter max_features of CountVectorizer to avoid course of dimensionality. Code below explore number of features if such paramtere is not set.

In [415]:
#number of words(features) - obtained as len of list of keys of vocabulary
len(C_vectorizer_l.vocabulary_.keys())

10579

In [442]:
#for target label encoded

#creating vectorizer - C in name means count, l at the end means label encoded target variable
C_vectorizer_l_reg=CountVectorizer(max_features=500)

#vectorizer is fitted with trainning data
C_vectorizer_l_reg.fit(x_train_l)

#transform is applied on both, trainning and test data
x_train_features_countV_l_reg=C_vectorizer_l.transform(x_train_l)
x_test_features_countV_l_reg=C_vectorizer_l.transform(x_test_l)

x_train_features_countV_l_reg=x_train_features_countV_l_reg.toarray()
x_test_features_countV_l_reg=x_test_features_countV_l_reg.toarray()

In [443]:
#verifying the number of features once it has been reduced
len(C_vectorizer_l_reg.vocabulary_.keys())

500

In [496]:
#creating the model with number of estimators and max_features to be used while creating each individual estimators
rf_model_CV_l_reg=RandomForestClassifier(n_estimators = 500, random_state=7,max_features=100, max_depth=28)

In [497]:
#trainning the model
rf_model_CV_l_reg=rf_model_CV_l_reg.fit(x_train_features_countV_l_reg, y_train_l)

In [498]:
#accuracy of model on training data
rf_model_CV_score_train_l_reg=rf_model_CV_l_reg.score(x_train_features_countV_l_reg, y_train_l)

In [499]:
#accuracy of model on test data
rf_model_CV_score_test_l_reg=rf_model_CV_l_reg.score(x_test_features_countV_l_reg, y_test_l)

In [500]:
print("Accuracy score of rf_model_CV_l_reg on train data is", rf_model_CV_score_train_l_reg)
print("Accuracy score of rf_model_CV_l_reg on test data is",rf_model_CV_score_test_l_reg)

Accuracy score of rf_model_CV_l_reg on train data is 0.6834504293520687
Accuracy score of rf_model_CV_l_reg on test data is 0.6520947176684881


Code above explore possibilities to regularize radnom forest classifier to reduce overfit problem, but to increase accuracy of a model. The idea is to generate more rather poor estimators, but robust ensemble of such poor estimators. Each estimator is pruned with max_depth. Max_features in combination with number of estimators selected (100 and 500 respectively) gives a good chance for all features to be used in estimators. As a result overfit has been reduced, however such reduction was payed by lower accuracy. Probably some other ML or DNN technique should be tested as well. 

# 6. Summarize your understanding of the application of Various Pre-processing and Vectorization and performance of your model on this dataset.

The aim of all text preprocessing steps are to get more clear (valuable) set of features to be used for model.
Html tag removal is apllied to remove html tags that might be common to whole corpus, rendering it as redundant, carying no or little information. 
Tokenization tokenized text for lemmatization.
Removal of numbers has the same goal as removal of html tags, while conversion to lower case and removal of special characters and punctuations have goal of reducing number of possible features - e.g. "Word" and "word" would be two different features.

Tf-Idf Vectorizer is expected to give better results (because of it's ability to better extract more important features in vectors), however, it showed slightly lower perfomance (the model trained by such dataset showed lower perfomance). 

It has been noticed once the countVectorizer was limited with max_features, the accuracy score on both, test and train data increased. 

All of random forest models showed significant overfit problem, so all of them required some sort of regularization. Models trained with label encoded target variable showed significantly better performance on test data (~10% better accuracy score for both vectorization principles).

It was interesting to see how confusion matrices are wrongly calculated indicating problems of no class identified in some cases where one hot encoded target variables were used (this has been explained in details in part related to confusion matrices).

# 7.Overall notebook should have

## a. Well commented code
Code has been comented

## b. Structure and flow
This notebook is structured to fit all the tasks in sequntial manner. 