#### Sentiment Classification Model based on Natural Language Processing. A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

### Import Libraries

In [246]:
from bs4 import BeautifulSoup
import unicodedata
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
import spacy
import pandas as pd  

### Read the input data

In [247]:
train = pd.read_csv("Tweets.csv")

In [248]:
train.head(5)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [249]:
# Check the shape of the input data
train.shape

(14640, 15)

In [250]:
# Observation : There are 14640 rows and 15 columns in the input data

In [251]:
train.columns.values

array(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'], dtype=object)

In [252]:
# The columns in the input data are:
# tweet_id : a unique number generated for each tweet
# airline_sentiment : negative, positive or neutral 
# airline_sentiment_confidence : confidence level of the customer sentiment
# negativereason : reason for negative experience / feedback
# negativereason_confidence : confidence levelof the negative feedback
# airline : name of the airline
# airline_sentiment_gold : indicates the gold membership of the customer
# name : name of customer
# negativereason_gold : reason for negative feedback from a gold customer
# retweet_count : how many times the tweet was retweeted
# text : feedback from the customer
# tweet_coord : NA
# tweet_created : time when the tweet was generated
# tweet_location : location where the tweet was generated
# user_timezone : timezone of the customer tweeting

In [253]:
# drop all columns except for the text and the airline_sentiment.

train.drop(['tweet_id'], axis=1, inplace=True)
train.drop(['airline_sentiment_confidence'], axis=1, inplace=True)
train.drop(['negativereason'], axis=1, inplace=True)
train.drop(['negativereason_confidence'], axis=1, inplace=True)
train.drop(['airline'], axis=1, inplace=True)
train.drop(['airline_sentiment_gold'], axis=1, inplace=True)
train.drop(['name'], axis=1, inplace=True)
train.drop(['negativereason_gold'], axis=1, inplace=True)
train.drop(['retweet_count'], axis=1, inplace=True)
train.drop(['tweet_coord'], axis=1, inplace=True)
train.drop(['tweet_created'], axis=1, inplace=True)
train.drop(['tweet_location'], axis=1, inplace=True)
train.drop(['user_timezone'], axis=1, inplace=True)

In [254]:
train.columns.values

array(['airline_sentiment', 'text'], dtype=object)

In [255]:
train.shape

(14640, 2)

In [256]:
# OBSERVATION : There are now 14640 rows in the input data and 2 columns

In [10]:
train.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...


In [257]:
# The text data has special characters, mixed case characters and numbers/digits

In [258]:
# Get the number of reviews
num_reviews = train["text"].size

In [259]:
num_reviews

14640

### Define Text Pre-processing functions

#### These functions are used to pre-process the input data - remove HTML tags, convert all input text to #### lower case, remove accented characters, tokenize text and lemmatize input. 

In [260]:
# Function to strip HTML tags from the input text
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

# Function to convert text to lower case
def lowercase_text(text):
    lower_text = text.lower()
    return lower_text

# Function to remove accented charactersr from the input text
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

# Function to tokenize input text
def tokenize_input(text):
    tokenizer = ToktokTokenizer()
    tokens = tokenizer.tokenize(text)
    return tokens

# Function to remove special characters from the input text
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-z0-9\s]' if not remove_digits else r'[^a-zA-z\s]'
    text = re.sub(pattern, '', text)
    return text

# Stemming function
def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ''.join([ps.stem(word) for word in text.split()])
    return text

#Lemmatization function
nlp = spacy.load('en_core_web_sm', parse=True, tag=True, entity=True)
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

### Pre-process input text data

In [56]:
# Initialize an empty list to hold the clean text after pre-processing the input text
clean_text = []

In [261]:
# Loop over each review; create an index i that goes from 0 to the length of the review list 
# remove HTML tags, tokenize, remove special characters, remove accented characters, lower case 
# text and lemmatize the input text.

token_list = []
clean_text_temp1 = []
clean_text_temp2 = []
clean_text_temp3 = []
clean_text_temp4 = []
clean_text_temp5 = []

for i in range( 0, num_reviews ):
    # Call text pre-processing functions for each one, and add to the list of clean text list
    clean_text_temp1.append( strip_html_tags( train["text"][i]))
    token_list.append(tokenize_input(train["text"][i]))
    clean_text_temp2.append( remove_special_characters( clean_text_temp1[i], remove_digits=True))
    clean_text_temp3.append( remove_accented_chars( clean_text_temp2[i]))
    clean_text_temp4.append( lowercase_text( clean_text_temp3[i]))
    clean_text_temp5.append( lemmatize_text( clean_text_temp4[i]))

In [262]:
clean_text = clean_text_temp5

In [265]:
# Sample text prior to transformation

In [263]:
train["text"][0]

'@VirginAmerica What @dhepburn said.'

In [266]:
# Post transformation text

In [267]:
clean_text[0]

'virginamerica what dhepburn say'

In [270]:
# Add the pre-processed text list to the dataframe as a column 

In [271]:
train.shape

(14640, 2)

In [272]:
train.columns.values

array(['airline_sentiment', 'text'], dtype=object)

In [274]:
# Top 5 reviews in the input text

In [275]:
train.head()

Unnamed: 0,airline_sentiment,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...


In [276]:
train['clean_text'] = clean_text

In [277]:
train.head()

Unnamed: 0,airline_sentiment,text,clean_text
0,neutral,@VirginAmerica What @dhepburn said.,virginamerica what dhepburn say
1,positive,@VirginAmerica plus you've added commercials t...,virginamerica plus you have add commercial to ...
2,neutral,@VirginAmerica I didn't today... Must mean I n...,virginamerica i do not today must mean i need ...
3,negative,@VirginAmerica it's really aggressive to blast...,virginamerica its really aggressive to blast o...
4,negative,@VirginAmerica and it's a really big bad thing...,virginamerica and its a really big bad thing a...


In [278]:
# Replace the 'airline_sentiment' column with discreet integer values using the map below
# 0 = neutral
# 1 = positive sentiment
#-1 = negative sentiment

In [279]:
train.airline_sentiment.replace(['neutral', 'positive', 'negative'], [0, 1, -1], inplace=True)

In [280]:
train.head()

Unnamed: 0,airline_sentiment,text,clean_text
0,0,@VirginAmerica What @dhepburn said.,virginamerica what dhepburn say
1,1,@VirginAmerica plus you've added commercials t...,virginamerica plus you have add commercial to ...
2,0,@VirginAmerica I didn't today... Must mean I n...,virginamerica i do not today must mean i need ...
3,-1,@VirginAmerica it's really aggressive to blast...,virginamerica its really aggressive to blast o...
4,-1,@VirginAmerica and it's a really big bad thing...,virginamerica and its a really big bad thing a...


In [282]:
# Check the number of positive, negative and neutral sentiments

In [283]:
pd.value_counts(train['airline_sentiment'])

-1    9178
 0    3099
 1    2363
Name: airline_sentiment, dtype: int64

In [284]:
# There are 9178 negative sentiments, 3099 neutral sentiments and 2363 positive sentiments

### Split the input data into train and test data

In [285]:
from sklearn.model_selection import train_test_split

In [286]:
train.head()

Unnamed: 0,airline_sentiment,text,clean_text
0,0,@VirginAmerica What @dhepburn said.,virginamerica what dhepburn say
1,1,@VirginAmerica plus you've added commercials t...,virginamerica plus you have add commercial to ...
2,0,@VirginAmerica I didn't today... Must mean I n...,virginamerica i do not today must mean i need ...
3,-1,@VirginAmerica it's really aggressive to blast...,virginamerica its really aggressive to blast o...
4,-1,@VirginAmerica and it's a really big bad thing...,virginamerica and its a really big bad thing a...


In [287]:
X = train.loc[:, train.columns == 'clean_text']
y = train.loc[:, train.columns == 'airline_sentiment']

In [288]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [289]:
# Convert training and test data into lists

In [290]:
X_train_list  = X_train['clean_text'].tolist()
y_train_list = y_train['airline_sentiment'].tolist()
X_test_list   = X_test['clean_text'].tolist()
y_test_list  = y_test['airline_sentiment'].tolist()

### Count Vectorizer

#### CountVectorizer provides a simple way to both tokenize a collection of text documents and build a vocabulary of known word.

In [291]:
from sklearn.feature_extraction.text import CountVectorizer

In [294]:
# Initialize the CountVectorizer object with bag of words tool
count_vectorizer = CountVectorizer(max_features=2000)

In [295]:
train_data_features_cv = count_vectorizer.fit_transform(X_train_list)

In [296]:
train_data_features_cv = train_data_features_cv.toarray()

In [297]:
train_data_features_cv

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [298]:
print (train_data_features_cv.shape)

(10248, 2000)


In [299]:
# Take a look at the words in the vocabulary
vocab = count_vectorizer.get_feature_names()
print (vocab)



In [300]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features_cv, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print (count, tag)

194 aa
5 aadvantage
6 abc
5 ability
90 able
345 about
10 above
14 absolute
23 absolutely
10 absurd
7 ac
21 accept
21 acceptable
27 access
6 accident
14 accommodate
8 accommodation
7 accord
54 account
11 acct
5 accurate
7 across
11 act
10 action
16 actual
54 actually
5 ad
90 add
12 additional
38 address
11 admiral
9 advance
13 advantage
7 advice
13 advise
17 advisory
9 affect
10 afford
319 after
22 afternoon
279 again
6 against
9 age
5 agency
297 agent
93 ago
10 ah
14 ahead
6 ahold
96 air
6 airbus
23 aircraft
10 airfare
360 airline
30 airlines
16 airplane
257 airport
8 airway
56 airways
5 albany
13 alert
7 alist
487 all
8 alliance
58 allow
68 almost
12 alone
13 along
132 already
9 alright
91 also
11 alternate
11 although
82 always
77 am
65 amazing
13 america
51 american
2083 americanair
12 americanairline
5 americanairlines
8 among
9 amount
668 an
2582 and
14 angry
7 anniversary
23 announce
16 announcement
5 annoy
7 annoying
5 annricord
199 another
143 answer
6 anticipate
304 any
16 any

1482 of
203 off
95 offer
5 offering
22 office
6 official
13 officially
9 often
65 oh
13 ohare
77 ok
29 okay
9 okc
55 old
15 omg
2616 on
13 onboard
57 once
405 one
142 online
246 only
10 ontime
17 onto
76 open
15 operate
5 operation
5 operational
5 operator
10 opportunity
84 option
333 or
59 ord
24 order
5 organization
28 original
12 originally
27 orlando
5 orleans
17 oscar
171 other
11 otherwise
471 our
7 ourselves
488 out
9 outbound
7 outlet
12 outside
5 outsource
12 outstanding
278 over
14 overbooke
25 overhead
22 overnight
5 overweight
6 owe
46 own
6 pa
9 pack
5 package
24 page
5 pain
7 painful
9 pair
8 palm
6 paper
7 paperwork
11 parent
8 paris
7 park
5 parking
24 part
13 partner
7 partnership
13 party
88 pass
15 passbook
159 passenger
5 passport
6 password
57 past
5 pat
17 pathetic
12 patience
7 patient
7 pax
201 pay
8 payment
12 pbi
14 pdx
7 peanut
5 penalty
190 people
17 per
10 perfect
6 performance
13 perhaps
7 period
6 perk
78 person
11 personal
7 personnel
7 pet
5 ph
18 phila

In [301]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

In [302]:
# Initialize a Random Forest classifier with 100 trees
forest_cv = RandomForestClassifier(verbose=2,n_jobs=-1,n_estimators = 100) 

In [303]:
# Fit the forest to the training set, using the bag of words as features and the sentiment labels as the response variable

In [304]:
print ("Training the random forest...")

Training the random forest...


In [305]:
forest_cv = forest_cv.fit( train_data_features_cv, y_train_list )

# random forest performance through cross vaidation 
print (forest_cv)
print (np.mean(cross_val_score(forest_cv,train_data_features_cv,y_train_list,cv=10)))

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


building tree 1 of 100building tree 2 of 100

building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.7s


building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100
building tree 44 of 100
building tree 45 of 100
building tree 46 of 100
building tree 47 of 100
building tree 48 of 100
building tree 49 of 100
building tree 50 of 100
building tree 51 of 100
building tree 52 of 100
building tree 53 of 100
building tree 54 of 100
building tree 55 of 100
building tree 56 of 100
building tree 57 of 100
building tree 58 of 100
building tree 59 of 100
building tree 60 of 100
building tree 61 of 100
building tree 62 of 100
building tree 63 of 100
building tree 64 of 100
building tree 65 of 100
building tree 66 of 100
building tree 67 of 100
building tree 68 of 100
building tree 69 of 100
building tree 70 of 100
building tree 71 of 100
building tree 72 of 100
building tree 73 of 100
building tree 74

[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   10.5s finished


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=2,
                       warm_start=False)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   12.6s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   18.3s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    7.9s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]:

0.7662005525914635


[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished


In [306]:
# Get a bag of words for the test set, and convert to a numpy array
test_data_features_cv = count_vectorizer.transform(X_test_list)
test_data_features_cv = test_data_features_cv.toarray()

In [307]:
# Use the random forest to make sentiment label predictions
result_cv = forest_cv.predict(test_data_features_cv)
print (result_cv)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s


[-1 -1 -1 ... -1 -1 -1]


[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.1s finished


In [308]:
predicted_output_cv = pd.DataFrame()

In [309]:
predicted_output_cv = X_test

In [310]:
predicted_output_cv['Actual_sentiment'] = y_test['airline_sentiment']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [311]:
predicted_output_cv['Predicted_sentiment'] = result

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [313]:
predicted_output_cv.head(10)

Unnamed: 0,clean_text,Actual_sentiment,Predicted_sentiment
13983,americanair in car gng to dfw pull over hr ago...,-1,-1
14484,americanair after all the plane do not land in...,-1,-1
6403,southwestair can not believe how many pay cust...,-1,-1
9653,usairway i can legitimately say that i would h...,-1,-1
13268,americanair still no response from aa great jo...,-1,-1
2384,united we have developer fly down tmrw morn w ...,0,-1
9613,usairway hello anyone there,-1,0
11612,usairways husainhaqqani mr husain u shld prote...,-1,-1
9252,usairway not likely flightaware say plane be s...,-1,-1
13923,americanair they do not even give an option to...,-1,-1


In [314]:
from sklearn.metrics import accuracy_score

In [315]:
accuracy_score(predicted_output_cv['Actual_sentiment'], predicted_output_cv['Predicted_sentiment'])

0.7506830601092896

In [316]:
# The classifier has an accuracy of 75%

### Tf-Idf Vectorizer

#### TF-IDF are word frequency scores that try to highlight words that are more interesting.

In [317]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [318]:
vectorizer_tfidf = TfidfVectorizer()

In [319]:
train_data_features_tfidf = vectorizer_tfidf.fit_transform(X_train_list)

In [320]:
train_data_features_tfidf = train_data_features_tfidf.toarray()

In [321]:
print (train_data_features_tfidf.shape)

(10248, 10194)


In [323]:
# Take a look at the words in the vocabulary
vocab = vectorizer_tfidf.get_feature_names()
print (vocab)



In [332]:
print(vectorizer_tfidf.vocabulary_)



In [333]:
print(vectorizer.idf_)

[9.54178824 5.03643839 9.54178824 ... 9.54178824 8.84864106 9.54178824]


In [324]:
# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features_tfidf, axis=0)

In [325]:
# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print (count, tag)

0.3897525956686098 a_life_story
43.57743680940793 aa
0.35377406434805353 aaaand
0.525369120317686 aaba
0.34710978641058315 aadfw
1.785919967473942 aadvantage
0.7482946203928436 aafail
0.8135526676740285 aal
0.3651634520683453 aaron
0.6215309051252931 aarp
0.8102261519227079 aas
0.456650337547751 aaso
0.7089919820329638 ab
0.7182555918696697 abandon
0.39014761761817235 abandonment
0.387058459512223 abassinet
0.38726392850441027 abbreve
2.257311543856689 abc
0.683673813206007 abc_wtvd
0.30816081861349776 abcdef
0.3525454218731768 abcnetwork
0.3407829148866057 abcnews
0.3780056566458416 abcs
0.34340441752732626 abducted
0.542548347106413 abi
0.682630458041382 abigailedge
1.6790855426631852 ability
24.9179134257741 able
0.9042417429790957 aboard
0.46618540328407787 aboout
0.4008108619161566 abound
71.91791267250764 about
3.4226477790172796 above
1.5289058415374144 abq
0.4841170593258644 abroad
5.215204524708784 absolute
8.992836669669366 absolutely
0.3782224896396882 absorber
0.49680990623

1.0998422709364104 bestfriend
0.3736242386541001 bestinclass
0.40524412819344885 bestinclasssocial
0.4859654132826697 bestplanesever
0.5336030151322803 besty
4.751330399092508 bet
0.4607830903478064 bethonor
0.5336030151322803 betsy
11.736288365121206 better
0.36181322690529566 betterfrom
0.4487204515482196 betterother
0.5997087154563345 bettween
9.299982941853324 between
1.0894198932978787 beverage
0.3749728997759882 bevy
1.1319169190658658 beware
0.4600132155162787 bewhat
8.20997162566344 beyond
2.5143048617626804 bf
0.8529027623156731 bff
0.33093462058948475 bg
0.3257723283806739 bgkwm
0.833872925424589 bgm
1.260240131867961 bgr
1.7235504006202476 bhm
0.38477444905251557 bible
0.6765608005810577 bicycle
16.99158525864931 big
0.5356182138955671 biggie
1.3841119068473426 bike
0.44329300702869 bila
2.9083582416085623 bill
0.3857776773986735 billing
0.9788130602342415 billion
0.34400472386566955 billmelate
4.538905318217677 bin
0.8063334508310449 bind
0.364670948352789 bingo
0.774645407

0.3906371484125884 cling
0.5298783745209208 clinicpolly
0.5147392471811698 cll
0.3603873198244541 clockwork
0.37162611683821284 clog
11.014488677453796 close
1.0912063134262175 closed
1.1458518666971207 closer
0.4678968055003789 closetoverhead
0.8118405830218465 closing
0.38825629275860823 closure
7.184989577155079 clothe
0.4174183399193033 clothessuit
1.5601388076322018 clothing
1.5174552750863528 cloud
0.6655358352895437 clown
0.30237250721699593 cls
14.904828293589459 clt
0.47729193483338433 cltbna
0.4282171152642269 cltnyc
0.4076905173964148 cltsfo
9.914579075363799 club
2.8668768344225146 clue
0.33115567001022456 cluecould
0.34901172799252056 clueless
0.4341012084438232 clunky
0.7211693291568324 cluster
0.4431758904068042 clusterfuck
0.8892252051731269 clutch
0.37919519297733567 cmfatfeet
3.250067456177826 cmh
0.414671347880105 cmhiad
0.3881199597630588 cmhoak
2.2520451493033633 cmon
0.3815652909370455 cnbc
0.25133603003652444 cnceld
0.3865072713019101 cncle
0.2969222377950629 cng

0.3906515157111573 defibrillator
0.4864135295782491 definately
2.2750230500430946 define
7.003543538538419 definitely
1.340211412109674 definition
0.4407475115244607 definitive
0.33450167512554563 deflator
4.071381825143043 degree
0.3445692760699483 dehli
0.4808210150421182 dehydrate
4.766420958457282 deice
0.4096413375257638 deicing
0.36585413293375674 del
0.4039885659882956 delacy
124.61226174011884 delay
0.4577030229304188 delayconsequence
3.289349308130257 delayed
0.6980115584473928 delayedagainevery
0.3752365831520593 delayedbecause
0.2777591846034496 delayedcancelle
0.6075684227151537 delayedconnection
0.3523223140756293 delayedl
0.4611365046644352 delayedno
0.6258429196187568 delayednot
0.37466309545748644 delayedover
0.38485787621961665 delayedovernight
0.43530826533913386 delayedstill
0.4721906997599998 delayedthat
0.4127455519764837 delayforwhat
0.7511370460303934 delaykille
0.39985470554526686 delaypending
0.79457487292688 delays
0.38870277696894495 delayscancelled
0.3971963

16.070290307583555 ewr
0.36585413293375674 ewrbrudel
0.3960486034835594 ewrfll
0.36594311895684606 ewriad
0.3678147943484673 ewrmco
0.3053216379190752 ewrord
0.4080727708003992 ewrpdx
0.39628423671514335 ewrsti
0.3496009452498273 ex
3.5033008607025518 exact
5.187279428954188 exactly
2.915579112844639 example
0.3887032208150609 exasperate
0.3781596261590854 exceed
0.3666977584828672 excellence
6.5032594916850215 excellent
0.4537662323482878 excellentcustomerservice
4.813458884232082 except
1.604696737622972 exception
1.6941243291847874 exceptional
0.39183840855489876 exceptionally
0.6643328314663 exceptionalservice
1.6582306639042987 exchange
1.7427670305078062 excite
5.9201454892268615 excited
0.4299316925040887 excitement
0.7304411224326695 exciting
0.4078715775745941 exclusive
1.0706601344654423 exclusively
0.3904933492411231 excruciatingly
7.23021172588584 excuse
0.3612306755791036 excusegeez
3.4769080263250767 exec
0.8772076491974167 execplat
1.0107508068638094 execution
1.36540413

0.3565937978688903 gangway
0.38344711928956293 garage
2.7572827582325137 garbage
0.6625419963132885 garcia
0.6606218668598547 garciachicago
0.33535353965850456 gardening
1.5964222105028476 gary
0.3918267202113071 garywerk
0.3422293237664218 gas
0.40723733939222434 gassing
78.13248361447967 gate
0.5107525689663869 gateagain
0.7499048486441708 gates
0.43804147653530584 gatestill
0.457358851666266 gather
0.7262248024698046 gatwick
1.116614426556188 gay
1.0880606387192873 gb
0.3956395956239848 gc
0.4194953881963134 gcvwj
0.37278076510713165 gd
2.5112804097549795 gear
0.4264364726302741 geeeeezzzzz
0.7336073909221781 geek
0.3334412055126158 geeks
0.5161822304358511 geez
0.6656429168040314 geg
0.37130692836672724 gem
2.8058936720275556 general
0.33893253666073436 generally
0.7053021165020434 generate
1.7952498730040864 generic
0.7229765269609674 genius
1.1030480032454155 gent
0.7014783145512257 gentleman
0.3773427795063491 genuine
0.3776278151011642 genuinely
0.49707623117828525 geography
0.

0.4320838984059872 httptcofheslpmu
0.5407775041892033 httptcofinqfhue
0.5436994302004172 httptcofixywxyb
0.4706524403718623 httptcofjkvqmbmas
0.4472827632158413 httptcoflbgzzkd
0.38000190073622786 httptcoflwmgdahxu
0.35109111053310554 httptcofmrmn
0.43453834614459763 httptcofornpfky
1.1655376395659378 httptcofqxelbon
0.5371356626207406 httptcofraqdpkyga
0.40844039951047695 httptcofrdcayda
0.34952669032399797 httptcofrghglmkqf
0.3566405130753809 httptcofrhuxibii
0.4068372229801318 httptcofrrfxccwz
0.6818787733071984 httptcoftlzwtvo
0.4485247075598908 httptcoftpdawbd
0.5435706208089855 httptcoftvnwwqf
0.523752201688055 httptcofupfuayir
0.37040980466053364 httptcofvlxirhf
0.3943174198008957 httptcofvudmhpf
0.6981111625671671 httptcofvyzjldton
0.5659650342577083 httptcofwkjble
0.5319354873734484 httptcofxnvba
0.4209989019963363 httptcofyfrzbk
0.4111259385930395 httptcofyvaioul
0.4363215938589105 httptcogbeasz
0.40055044907734727 httptcogbhgwt
0.40473281029570346 httptcogbiwugfnm
0.45234729

0.5565245291657982 httptcoyqhkljabn
0.39587265841293706 httptcoysahvxzk
0.4632839291107755 httptcoysqbvqmgb
0.44041063097096916 httptcoywxrfngr
0.5371356626207406 httptcoyyzdpbvu
0.4253400228308921 httptcoyzhzrqmi
0.39363447114603906 httptcozaqltf
0.4161480512859185 httptcozcbjyolsn
0.45002452002029075 httptcozdkxnktou
0.4265850097402098 httptcozfqmpgxvs
0.4686497607014229 httptcozfroinpszi
0.3167648997490747 httptcozfwjgtxzt
0.6889585805951486 httptcozgqebfk
0.4912392630246473 httptcozhgqprsg
0.44555788560724346 httptcozhmfdiw
0.5605688170038106 httptcozhxokaqa
0.942460714832972 httptcozikuoxgnw
0.3826357276640137 httptcozjlztua
0.40107717013995897 httptcozjvilw
0.5529155244903692 httptcozkoeclgiu
0.363941990816245 httptcozmjkalzl
0.41715582027692605 httptcoznsujpbv
0.6063857628353173 httptcozoicegli
0.46928380607063935 httptcozouowgvq
0.4933372536635079 httptcozpzpoeon
0.603793842546914 httptcozqutusepw
0.3566405130753809 httptcozsdgzydnde
0.5055195704022798 httptcoztrdwvnl
0.3956317

0.3403992222787004 leakingwhat
0.34567943369322757 leanin
9.28638570164627 learn
0.31323762159529367 lease
16.384443828805644 least
0.4439718606593595 leastthebeverageswillbecold
0.36108044056105904 leather
0.4555146100393466 leatherseat
0.5327591309855828 leathery
53.59513866414277 leave
1.4121114692010557 leaving
0.4560466591871061 leavingtomm
0.37161033391587606 leeannhealey
0.3490264545204491 leeway
2.398107212926773 left
9.43412843033485 leg
0.7075716490014041 legacy
0.7304466287409276 legal
0.8715377786097027 legally
0.7940938410978108 legit
0.7564283640987586 legitimate
4.381006063782207 legroom
0.40609986547123217 leigh_emery
0.3955665246049837 leinenkugel
0.3942120861508222 leisure
0.426065646951601 lemme
0.7516069234693337 length
0.3601445480370563 lengthy
0.3665559112297202 lense
0.3606454197669265 leopolds_ic
0.35732099474846507 lesliewolfson
12.86258557129728 less
0.39769239644202486 lessand
0.3957277864020921 lesscustomer
3.3574807481452984 lesson
45.64189930610494 let
0.

0.5302516972382396 minsrescue
0.4157238439705851 minsthank
3.7330391025756153 mint
0.4218355609176633 mintyfresh
60.78757701587838 minute
0.42403329517416866 minutesand
0.42694769300231183 minutesi
0.5924402359932421 minutessaid
0.7451317827884101 miracle
0.3590853413768807 miranda
0.9090973776182534 miriam
0.39206113580555874 misconnecte
2.9238982027062574 miserable
0.4143522394332939 miserablemorne
0.36591407750860167 miserably
0.38139976461612524 misery
0.8005578181673765 misfortune
0.37116533197890067 misinformation
0.3426246650443285 misjudge
0.8137237780305965 mislead
0.3566472696881529 misleading
0.383329618972504 mismanagement
1.8181583629046767 misplace
0.4185093746042867 misplaced
0.4055841237974891 mispurchase
0.45778934569013074 misread
62.996978477985856 miss
1.844790456751339 missed
0.37267141232181333 missedconnection
0.43439740764766793 missedflight
0.3701415085251602 missedwork
1.1685023459146051 missing
0.4377748429723727 missingstollen
0.4707360501710282 missingtheos

1.4535262663758182 onhold
1.3223577012092909 onholdwith
0.3798307731863008 onion
36.189734470971665 online
0.3652369924112252 onlineapp
0.3376753921719466 onlinerep
0.42488603580756545 onlineuseless
53.45419254353121 only
0.37439720463932025 onlyblue
0.4891335054075571 onlyinamerica
0.40091962628470595 onlywaytofly
0.3732533822678643 onplane
0.40447504252273003 ontario
0.4162815709169065 ontarmac
3.165702595197068 ontime
5.696150009047 onto
1.2782268590096597 ooh
1.1840499609700714 oop
0.37557046421631163 oopsno
0.3769858670604293 op
0.594042897583742 opal
21.629190953695232 open
0.44386881098155756 opened
0.5775006294710943 openin
0.4890804000984218 opening
0.3192905291484062 openskie
5.0819007729457235 operate
0.3289080672270707 operating
1.6963020493429497 operation
2.046812123451536 operational
1.955003931840651 operator
1.472696476028661 opinion
3.6151356465605553 opportunity
0.3167556217496739 oppose
1.0545822180849156 opposite
0.3437900390948842 oprah
0.29727735303013986 ops
0.2

0.4864295654058721 prechk
0.9450765429652903 preciate
0.38953901868706037 preciation
1.0015464333132527 precious
0.44343388685380425 precioustime
1.2608991082906371 precipitation
0.3563149633611576 preclude
1.1726599472724566 predict
0.3060035646028694 predictive
0.3666744080984415 preemptive
0.39067637008849004 preexisting
1.6913738513513823 pref
4.623123991243611 prefer
0.36585413293375674 preferably
1.1057849047608972 preference
1.832527803489017 preferred
0.4232197563700671 prefference
0.9759284066160823 preflight
0.5420329830911772 pregame
0.3599069264668686 preggo
0.39343104025422926 pregnant
0.3819566896242363 pregnantwithtwin
0.3716604218587553 premgold
5.060856877431802 premier
0.7878918335289777 premiere
0.3264360913366839 premierk
3.1068691986101156 premium
0.3684306063123519 preoccupied
0.4082433371521587 prepaid
0.7299157988710181 preparation
2.541390372340496 prepare
0.722794383142372 prepared
0.40231982467029465 prepurchase
0.4080301648549552 prerecord
0.3521338895376528

0.3722053839549157 sansfo
1.0013917199189608 santa
0.34257242864077203 sao
0.35731404433725017 sarahzou
1.9518898081107747 sarcasm
0.4307929536182633 sass
1.7546252458359624 sat
0.39537037754051757 satellite
0.4894014260600276 satesq
0.47227017486222467 satisfaction
0.4274047571454868 satisfactorily
0.9029981390576338 satisfactory
0.7575941291796071 satisfied
0.3820256564031839 satisfy
0.3835558144451177 satmorrow
5.22647224944698 saturday
0.9620610419415081 sauce
0.42517091461964807 sauna
0.3400381405247034 sauvignon
1.4548518491519213 savannah
7.904369159977793 save
0.39496082692270607 saveface
1.3924017205197048 saver
0.401662949593788 savethoseseat
0.42602241485201897 saveus
0.5836980481504702 saving
0.34754154528191483 savr
0.32842118506070694 saw
0.3167648997490747 saxonandparole
71.03847064196698 say
0.40496639555566455 sayas
1.6473878698299158 sayin
0.6281646334703639 saying
0.5278381887966613 saysorrychris
1.1262561419276895 sb
0.36608135845160233 sba
0.8963168316720271 sc
1.2

0.7265176742963154 stair
1.2980469894735218 stall
0.3157487660873751 stamp
8.294125784354556 stand
3.601376437365296 standard
0.3574412397038477 standards
10.780524093675165 standby
0.5603451943653054 standbye
0.5204927442497791 standbylot
0.38479854577116285 standing
3.1385303476074475 star
2.5465526123088926 staralliance
0.34916612169765743 starboard
0.6921019549172058 starbucks
0.7879546687574894 stare
0.41351057552521664 stars
23.330868377204776 start
0.40656720223608966 startingbloc
0.36316592762821787 startling
0.35040219349175594 startup
0.8113092514778948 starve
0.389117843079683 starving
1.7579383868479141 stat
6.413261233789402 state
2.3566415061186614 statement
1.1497376646638282 states
0.4534736569186723 staticy
1.4517324016362694 station
0.42846396962718003 stationary
22.697929266647822 status
0.42750442455080423 statusmatchpaidoff
0.3819499156133587 statuswill
13.6462339243924 stay
0.33859342762660194 stbernard
0.3688851740058698 stclass
0.29791316866495704 stdby
4.605962

0.7731524604396323 turkish
10.013254291625005 turn
0.9894040254461273 turnaround
0.7204161480425284 turrible
0.39978907051655654 tus
0.3683993621147968 tusdfw
0.49138498859405905 tux
11.309941365856135 tv
0.35403010336802077 tvmovie
0.7012432225653751 tvs
0.5996465206525152 tvsmusic
0.424747024415298 tvu
0.5449512490629378 twa
0.3712171448939215 tweak
20.831832702794113 tweet
0.34324988234449666 tweeted
0.9113810697001822 tweeter
0.4695565781865826 tweeting
1.1397183287636916 twelve
0.41490994399560105 twenty
14.683732590161313 twice
0.3780126388344463 twicegot
0.3455550666308225 twin
0.3328369860743583 twist
4.810243663192039 twitt
13.758238419247286 twitter
1.129756116202508 twitterdm
0.5418803870163605 twitteremailweb
0.36192252586110596 twitterz
40.95677873394743 two
0.36713053137546287 twtr
2.1277720482581453 tx
0.4129953523806462 txfd
0.48376616643464687 txt
0.35209021016268793 txtemail
4.427942212534823 ty
3.7782237250603634 type
2.0028787584601755 typical
0.8299499734948149 typ

0.358312372843124 wayway
0.3830376182873514 wb
0.32961342989233167 wbroken
0.6242521940674963 wc
0.3788193621915261 wclass
0.3362276602405696 wcomm
0.39127843345266433 wcompanie
0.4438409507929754 wcomputer
0.408141113582465 wconnection
0.4303132911748061 wcustomer
0.3683993621147968 wcx
0.6578145220857659 wd
0.8278780965029772 wdelay
0.46388493820905313 wdw
136.43322199745472 we
2.134260392484279 weak
0.5039224440973069 weaktea
0.3490403516004076 weappreciateyou
1.9841672181105534 wear
0.32061399147946207 weary
46.77270949591765 weather
0.47083741282318664 weatherbut
0.4122488615455522 weathercall
0.3266681267375369 weatherflight
0.5214764502917276 weatherhere
0.4190224300376221 weatherless
1.0485516940590938 weatherrelate
0.594042897583742 weave
5.786128038703381 web
0.3965274875946083 webbernatural
0.3900307007538698 weblink
0.3341298402915118 webs
30.20823562506451 website
0.3776367779619442 wed
4.121939900093018 wedding
4.930335828272887 wednesday
0.4101034949818025 weds
0.3933354

In [326]:
# Initialize a Random Forest classifier with 100 trees
forest_tfidf = RandomForestClassifier(verbose=2,n_jobs=-1,n_estimators = 100) 

In [327]:
forest_tfidf = forest_tfidf.fit( train_data_features_tfidf, y_train_list )

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.


building tree 1 of 100
building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100building tree 6 of 100

building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100


[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   13.3s


building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100
building tree 44 of 100
building tree 45 of 100
building tree 46 of 100
building tree 47 of 100
building tree 48 of 100
building tree 49 of 100
building tree 50 of 100
building tree 51 of 100
building tree 52 of 100
building tree 53 of 100
building tree 54 of 100
building tree 55 of 100
building tree 56 of 100
building tree 57 of 100
building tree 58 of 100
building tree 59 of 100
building tree 60 of 100
building tree 61 of 100
building tree 62 of 100
building tree 63 of 100
building tree 64 of 100
building tree 65 of 100
building tree 66 of 100
building tree 67 of 100
building tree 68 of 100
building tree 69 of 100
building tree 70 of 100
building tree 71 of 100
building tree 72 of 100
building tree 73 of 100
building tree 74 of 100
building tree 75

[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   46.4s finished


In [328]:
# random forest performance through cross vaidation 
print (forest_tfidf)
print (np.mean(cross_val_score(forest_tfidf,train_data_features_tfidf,y_train_list,cv=10)))

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=2,
                       warm_start=False)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   20.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   49.1s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    9.5s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   29.2s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    7.6s
[Parallel(n_jobs=-1)]:

0.7557583841463414


In [329]:
# Get a bag of words for the test set, and convert to a numpy array
test_data_features_tfidf = vectorizer_tfidf.transform(X_test_list)
test_data_features_tfidf = test_data_features_tfidf.toarray()

In [330]:
# Use the random forest to make sentiment label predictions
result_tfidf = forest_tfidf.predict(test_data_features_tfidf)
print (result_tfidf)

[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  25 tasks      | elapsed:    0.0s


[-1 -1 -1 ... -1 -1 -1]


[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.2s finished


In [331]:
predicted_output_tfidf = pd.DataFrame()
predicted_output_tfidf = X_test
predicted_output_tfidf['Actual_sentiment'] = y_test['airline_sentiment']
predicted_output_tfidf['Predicted_sentiment'] = result_tfidf
predicted_output_tfidf.head(5)
from sklearn.metrics import accuracy_score
accuracy_score(predicted_output_tfidf['Actual_sentiment'], predicted_output_tfidf['Predicted_sentiment'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


0.7563752276867031

In [339]:
predicted_output_tfidf.head(10)

Unnamed: 0,clean_text,Actual_sentiment,Predicted_sentiment
13983,americanair in car gng to dfw pull over hr ago...,-1,-1
14484,americanair after all the plane do not land in...,-1,-1
6403,southwestair can not believe how many pay cust...,-1,-1
9653,usairway i can legitimately say that i would h...,-1,-1
13268,americanair still no response from aa great jo...,-1,-1
2384,united we have developer fly down tmrw morn w ...,0,-1
9613,usairway hello anyone there,-1,0
11612,usairways husainhaqqani mr husain u shld prote...,-1,-1
9252,usairway not likely flightaware say plane be s...,-1,-1
13923,americanair they do not even give an option to...,-1,-1


In [334]:
# The classifier offers an accuracy of 75.6%

### Summary

In [335]:
# The pre processing steps help to sanitize the input text by :
# removing HTML tags
# removing numbers
# removing special characters and punctuations
# converting all text to lower case. This helps to eliminate duplicate case sensitive features

In [336]:
# Lemmatization groups together the inflected forms of a word so that they can be analyzed as a 
# single item. This is identifued by the word's lemma or dictionary form. 

In [337]:
# We used count vectorization to count the number of times a word appears in the corpus. 
# Also, compared this against Tf-idf vectorization. The tf-idf value increases in proportion to the
# number of times a word appears in a document in the corpus. This helps to adjust for the fact 
# that some words appear more frequently in general. tf-idf is one of the most popular term
# weighing schemes. 

In [338]:
# The accuracy of the classifer using random forest classifier is estimated to be around 75%.