# Customer Sentiment Analysis using NLP

### Data Set: IMDB Reviews


### PART - 1

Roadmap we will follow:
1. Importing the Data Set
2. Processing the DataSet
3. Vectorizing the DataSet
4. Building a Classifier for the DataSet to predict the **Sentiment**

Note: **The same target/label list is used for both training and testing, because both files are structured the same way: the first 12.5k reviews are positive and the last 12.5k are negative.** This is how **full_train and full_test** are laid out!

In [1]:
# Importing Libraries
import numpy as np
import pandas as pd
import re,os
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
# Importing Data Set

'''We have 2 text files (train and test) containing the POSITIVE and NEGATIVE reviews, one review per line.
    First we read each file into a list and STRIP the surrounding whitespace from every review.
    Reading without an explicit encoding gave an encoding error, hence we use UTF8 encoding!'''

training_set = []
# 'with' closes the file once all lines have been read
with open('E:/UpGrad_Data Science/Offile ML Projects/NLP for IMDB Customer Review/movie_data/full_train.txt','r',
          encoding='utf8') as train_file:
    for line in train_file:
        training_set.append(line.strip())

testing_set = []
with open('E:/UpGrad_Data Science/Offile ML Projects/NLP for IMDB Customer Review/movie_data/full_test.txt','r',
          encoding='utf8') as test_file:
    for line in test_file:
        testing_set.append(line.strip())

In [3]:
# Let's print at least one review from each set!
print(training_set[8])

THE NIGHT LISTENER (2006) **1/2 Robin Williams, Toni Collette, Bobby Cannavale, Rory Culkin, Joe Morton, Sandra Oh, John Cullum, Lisa Emery, Becky Ann Baker. (Dir: Patrick Stettner) <br /><br />Hitchcockian suspenser gives Williams a stand-out low-key performance.<br /><br />What is it about celebrities and fans? What is the near paranoia one associates with the other and why is it almost the norm? <br /><br />In the latest derange fan scenario, based on true events no less, Williams stars as a talk-radio personality named Gabriel No one, who reads stories he's penned over the airwaves and has accumulated an interesting fan in the form of a young boy named Pete Logand (Culkin) who has submitted a manuscript about the travails of his troubled youth to No one's editor Ashe (Morton) who gives it to No one to read for himself. <br /><br />No one is naturally disturbed but ultimately intrigued about the nightmarish existence of Pete being abducted and sexually abused for years until he was 

In [4]:
print(testing_set[8])

This movie is amazing because the fact that the real people portray themselves and their real life experience and do such a good job it's like they're almost living the past over again. Jia Hongsheng plays himself an actor who quit everything except music and drugs struggling with depression and searching for the meaning of life while being angry at everyone especially the people who care for him most. There's moments in the movie that will make you wanna cry because the family especially the father did such a good job. However, this movie is not for everyone. Many people who suffer from depression will understand Hongsheng's problem and why he does the things he does for example keep himself shut in a dark room or go for walks or bike rides by himself. Others might see the movie as boring because it's just so real that its almost like a documentary. Overall this movie is great and Hongsheng deserved an Oscar for this movie so did his Dad.


Now we can see that the reviews contain punctuation such as **: ' . ,** and HTML line breaks (**br** tags), which have to be removed.   
The best way is to remove them using **Regular Expressions**, which we are going to do next!

In [5]:
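# Characters removed outright: punctuation, quotes, brackets and digits (raw strings avoid invalid-escape warnings)
to_be_replaced_with_no_space = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
# Double HTML line breaks, hyphens and slashes are replaced with a space
to_be_replaced_with_space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def process_review(reviews):
    # Lowercase once, strip punctuation and digits, then turn separators into spaces
    reviews = [to_be_replaced_with_no_space.sub('',line.lower()) for line in reviews]
    reviews = [to_be_replaced_with_space.sub(' ',line) for line in reviews]
    
    return reviews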
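
To see what the two patterns do, here is a small self-contained sketch that applies the same two substitutions to a single made-up review. The sample sentence and the shorter variable names are assumptions for illustration; the patterns themselves are the ones from the cell above.

```python
import re

# Same two patterns as in the cell above, under shorter names
strip_pattern = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
space_pattern = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

sample = "I loved it!<br /><br />A well-made film."   # made-up review

step_1 = strip_pattern.sub('', sample.lower())   # punctuation and digits removed
step_2 = space_pattern.sub(' ', step_1)          # double <br />, '-' and '/' become spaces

print(step_1)
print(step_2)
```

Characters that are not listed in either pattern pass through untouched, which is why the `**` of the star rating survives in the cleaned training review printed further down.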

In [6]:
# Now let's process our Reviews
cleaned_train = process_review(training_set)
cleaned_test = process_review(testing_set)

In [7]:
print(cleaned_train[8])

the night listener  **  robin williams toni collette bobby cannavale rory culkin joe morton sandra oh john cullum lisa emery becky ann baker dir patrick stettner  hitchcockian suspenser gives williams a stand out low key performance what is it about celebrities and fans what is the near paranoia one associates with the other and why is it almost the norm  in the latest derange fan scenario based on true events no less williams stars as a talk radio personality named gabriel no one who reads stories hes penned over the airwaves and has accumulated an interesting fan in the form of a young boy named pete logand culkin who has submitted a manuscript about the travails of his troubled youth to no ones editor ashe morton who gives it to no one to read for himself  no one is naturally disturbed but ultimately intrigued about the nightmarish existence of pete being abducted and sexually abused for years until he was finally rescued by a nurse named donna collette giving an excellent performan

In [8]:
print(cleaned_test[8])

this movie is amazing because the fact that the real people portray themselves and their real life experience and do such a good job its like theyre almost living the past over again jia hongsheng plays himself an actor who quit everything except music and drugs struggling with depression and searching for the meaning of life while being angry at everyone especially the people who care for him most theres moments in the movie that will make you wanna cry because the family especially the father did such a good job however this movie is not for everyone many people who suffer from depression will understand hongshengs problem and why he does the things he does for example keep himself shut in a dark room or go for walks or bike rides by himself others might see the movie as boring because its just so real that its almost like a documentary overall this movie is great and hongsheng deserved an oscar for this movie so did his dad


Ok, so now we have cleaned up our DATA for training and testing, but one thing still stands between this data and modelling: the **way** in which we will feed our data to the Model.    
The ML model cannot work with the raw text as it is, because it needs a NUMERIC matrix to process. So how do we get there? **Bag of Words**, built with a **Vectorizer**, is the solution.   
**Bag of Words splits each document into TOKENS using some sort of pattern. Each token is then assigned a weight which is proportional to the frequency with which it shows up in the document. The result is a matrix in which each row represents a document and each column represents a token.**

We will be using **Count Vectorizer** here, as **it counts the number of times a token shows up and uses that value as the weight**. With **binary = True**, as in the next cell, every non-zero count is recorded as 1, so the weight only says whether the token is present.
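
As a rough illustration of the idea, the sketch below builds a tiny Bag of Words matrix by hand for two made-up documents, once with raw counts and once with binary weights. It is a simplified stand-in and not how CountVectorizer is implemented; the token pattern is the one scikit-learn documents as the CountVectorizer default, while the documents and variable names are assumptions.

```python
import re
from collections import Counter

docs = ["good good movie", "bad movie"]           # made-up documents

# Tokens of two or more word characters (CountVectorizer's documented default pattern)
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokens = [token_pattern.findall(doc.lower()) for doc in docs]

# One column per distinct token, in alphabetical order
vocab = sorted({token for doc in tokens for token in doc})

counts = []
for doc in tokens:
    frequency = Counter(doc)
    counts.append([frequency[word] for word in vocab])   # one row per document

# Binary variant: any non-zero count becomes 1
binary = [[1 if count else 0 for count in row] for row in counts]

print(vocab)
print(counts)
print(binary)
```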

In [9]:
cv  = CountVectorizer(binary = True) # binary = True records only presence/absence: every non-zero count becomes 1
cv.fit(cleaned_train)
X = cv.transform(cleaned_train)
X_test = cv.transform(cleaned_test)

In [10]:
target = [1 if i<12500 else 0 for i in range(25000)]
X_train,X_val,y_train,y_val = train_test_split(X,target,train_size = 0.7,random_state = 4)
for c in [0.001,0.01,0.05,0.25,0.5,0.75,1]:
    lr = LogisticRegression(C = c,max_iter = 300)
    lr.fit(X_train,y_train)
    print('For C: ', c , 'Accuracy Score: ',round(accuracy_score(y_val,lr.predict(X_val)),4))

For C:  0.001 Accuracy Score:  0.8423
For C:  0.01 Accuracy Score:  0.872
For C:  0.05 Accuracy Score:  0.8828
For C:  0.25 Accuracy Score:  0.8813
For C:  0.5 Accuracy Score:  0.8789
For C:  0.75 Accuracy Score:  0.8773
For C:  1 Accuracy Score:  0.8745
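
The loop above prints one validation score per value of C, and choosing the value to carry forward is a matter of taking the maximum. A minimal sketch using the scores printed above (the dictionary is typed in by hand here; in the notebook one could collect the scores inside the loop instead):

```python
# Validation accuracies copied from the output above
scores = {0.001: 0.8423, 0.01: 0.872, 0.05: 0.8828, 0.25: 0.8813,
          0.5: 0.8789, 0.75: 0.8773, 1: 0.8745}

# Key with the highest validation accuracy
best_c = max(scores, key=scores.get)
print(best_c, scores[best_c])
```

In scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so the smaller values in this sweep regularize more strongly.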


In [11]:
# Now let's check for the test set
test_model = LogisticRegression(C = 0.05)
test_model.fit(X,target)
print('Final Accuracy: ',accuracy_score(target,test_model.predict(X_test)))

Final Accuracy:  0.88144


So we have an ML Model with **88% Accuracy** at classifying reviews as **Negative or Positive**.

### Finding Positive and Negative Words!

In [12]:
# get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
feature_to_coefficient = {word: coef for word,coef in zip(cv.get_feature_names_out(),test_model.coef_[0])}

print('Top 5 Positive Words are: ')
for best_positive_words in sorted(feature_to_coefficient.items(),key = lambda x: x[1],reverse = True)[:5]:
    print(best_positive_words)

print('------***********------')

print('Top 5 Negative Words are: ')
for best_negative_words in sorted(feature_to_coefficient.items(),key = lambda x: x[1])[:5]:
    print(best_negative_words)

Top 5 Positive Words are: 
('excellent', 0.9283544345896336)
('perfect', 0.7944277796364583)
('great', 0.6745552805685403)
('amazing', 0.6164834564379139)
('superb', 0.6055919684474416)
------***********------
Top 5 Negative Words are: 
('worst', -1.3679897707836337)
('waste', -1.1688808928148235)
('awful', -1.0273337525366806)
('poorly', -0.8748022406393018)
('boring', -0.8591221194172675)


What we did above was the basic starting point for NLP. What we didn't do was remove **Stop Words** such as **if, he, she, but, we**. We also didn't normalise the words, which we can do using **Stemming and Lemmatizing**; these are 2 ways of reducing inflected forms of a word (for example plurals) to a common base form. We also did not look beyond single words, which can make us lose the valuable information carried by neighbouring words; we will address that using **ngram range** in our vectorization!    

We will also be looking at the number of **Word Counts** as it can give more predictive power to our Model!  
Then finally we will be building **Machine Learning Models** such as **SVM** and **Ensembling Technique** to see the result!

## PART - 2

Let's Begin!!

Now, we will proceed on with removing **Stop Words**.

What are stop words?  
**Stop Words are very commonly occurring words like 'he, she, if, is, they' which are usually removed (but not always). Removing them can help the Model because it reduces the number of dimensions by being selective!**

In [13]:
from nltk.corpus import stopwords

In [14]:
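# Needs the NLTK stop word corpus (one-time download: nltk.download('stopwords'))
# A set makes each 'word not in ...' check fast
the_library_of_stop_words = set(stopwords.words('english'))
def remove_stop_words(text_input):
    removed_stop_words = []
    for review in text_input:
        # Keep only the words that are not stop words and join them back into one string
        removed_stop_words.append(
            ' '.join([word for word in review.split() 
                      if word not in the_library_of_stop_words])
        )
    return removed_stop_words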
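
A small sketch of the same filtering on made-up sentences, using a tiny hand-written stop list instead of the NLTK one (the list, the helper name and the sentences are assumptions for illustration). It also hints at why removal is not always a good idea: negations such as 'not' are commonly treated as stop words, and dropping them can make two opposite reviews look the same.

```python
toy_stop_words = {"the", "was", "is", "this", "not"}   # hypothetical stop list

def drop_stop_words(review):
    # Keep every word that is not in the toy stop list
    return ' '.join(word for word in review.split() if word not in toy_stop_words)

positive = drop_stop_words("this movie was good")
negative = drop_stop_words("this movie was not good")

print(positive)
print(negative)
```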

Let's remove the stop words from our data

In [15]:
cleaned_train_data_from_stop_words = remove_stop_words(cleaned_train)
cleaned_test_data_from_stop_words = remove_stop_words(cleaned_test)

In [16]:
cleaned_train_data_from_stop_words[2]

'brilliant acting lesley ann warren best dramatic hobo lady ever seen love scenes clothes warehouse second none corn face classic good anything blazing saddles take lawyers also superb accused turncoat selling boss dishonest lawyer pepto bolt shrugs indifferently im lawyer says three funny words jeffrey tambor favorite later larry sanders show fantastic mad millionaire wants crush ghetto character malevolent usual hospital scene scene homeless invade demolition site time classics look legs scene two big diggers fighting one bleeds movie gets better time see quite often'

In [17]:
# Now let's do the modelling again
cv = CountVectorizer(binary = True)
cv.fit(cleaned_train_data_from_stop_words)
X = cv.transform(cleaned_train_data_from_stop_words)
X_test = cv.transform(cleaned_test_data_from_stop_words)

In [18]:
X_train,X_val,y_train,y_val = train_test_split(X,target,test_size = 0.3)

In [19]:
for hyper_parameter in [0.01,0.05,0.25,0.5,0.75]:
    lr_model = LogisticRegression(C = hyper_parameter,max_iter = 300)
    lr_model.fit(X_train,y_train)

    print('Accuracy at C=%s: %s'%(hyper_parameter,round(accuracy_score(y_val,lr_model.predict(X_val)),4)))

Accuracy at C=0.01: 0.8755
Accuracy at C=0.05: 0.884
Accuracy at C=0.25: 0.8816
Accuracy at C=0.5: 0.8795
Accuracy at C=0.75: 0.8776


This is the validation Accuracy obtained after removing the **Stop Words**!

In [20]:
# Now let's check for the test set
test_model = LogisticRegression(C = 0.05)
test_model.fit(X,target)
print('Final Accuracy: ',accuracy_score(target,test_model.predict(X_test)))

Final Accuracy:  0.87972


In [21]:
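def show_me_top_words(model, vectorizer = None):
    # The vectorizer must be the one the model was fitted on; if none is passed, the global cv is used
    if vectorizer is None:
        vectorizer = cv
    # get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
    feature_to_coefficient = {word: coef for word,coef in zip(vectorizer.get_feature_names_out(),model.coef_[0])}

    print('Top 5 Positive Words are: ')
    for best_positive_words in sorted(feature_to_coefficient.items(),key = lambda x: x[1],reverse = True)[:5]:
        print(best_positive_words)

    print('------***********------')

    print('Top 5 Negative Words are: ')
    for best_negative_words in sorted(feature_to_coefficient.items(),key = lambda x: x[1])[:5]:
        print(best_negative_words)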

In [22]:
show_me_top_words(test_model)

Top 5 Positive Words are: 
('excellent', 0.9156937726215434)
('perfect', 0.7811217155798308)
('great', 0.6583322243450895)
('amazing', 0.6290837446397891)
('favorite', 0.6256923899072799)
------***********------
Top 5 Negative Words are: 
('worst', -1.3848468542026808)
('waste', -1.199588607791577)
('awful', -1.0409031829500135)
('poorly', -0.8863416928527236)
('disappointment', -0.8540634899012562)


## Stemming and Lemmatizing!

We generally feed Normalised Data to our Machine Learning Model in order to obtain fair and better results, but here we haven't done that yet!    
**Stemming**: Stemming is one way of Normalizing the Data, where a word is reduced to its root or stem by stripping its suffix. Ex: 'Asked, Asking, Ask' all share the same stem **'Ask'**  
**Lemmatization**: Lemmatization is a process in which the word is brought to its base form by using a set of conditions in a more calculated, dictionary-based way. It resolves words to their dictionary form. Ex: 'Asked' should be resolved to its dictionary form, which is 'Ask'
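
The difference between the two can be sketched with toy versions: a stemmer that only strips suffixes by rule, and a lemmatizer that looks words up in a dictionary. Both helpers below are hypothetical stand-ins written for illustration; they are neither the Porter algorithm nor WordNet, and the suffix list and lemma dictionary are assumptions.

```python
TOY_SUFFIXES = ("ions", "ion", "ing", "ed", "s")    # hypothetical, longest first

def toy_stem(word):
    # Rule-based: cut the first matching suffix, keeping at least 3 characters
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

TOY_LEMMAS = {"asked": "ask", "asking": "ask", "mice": "mouse"}   # hypothetical dictionary

def toy_lemmatize(word):
    # Dictionary-based: return the listed base form, or the word itself if unknown
    return TOY_LEMMAS.get(word, word)

words = ["connect", "connections", "connecting", "connection", "connected"]
stems = [toy_stem(word) for word in words]
print(stems)

# An irregular plural has no suffix to strip, but a dictionary can resolve it
print(toy_stem("mice"), toy_lemmatize("mice"))
```

One practical point when using NLTK: WordNetLemmatizer.lemmatize treats a word as a noun unless a part-of-speech tag is passed, so verb forms such as 'asked' may come back unchanged when only the word is given.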

**Now let's do both of the processes one by one**

### Stemming!

In [23]:
from nltk.stem.porter import PorterStemmer
# Porter Stemmer is one way of stripping out the Suffix
# CONNECT
# CONNECTIONS --> CONNECT
# CONNECTING --> CONNECT
# CONNECTION --> CONNECT
# CONNECTED --> CONNECT

def stemming_of_data(text_input):
    stemmer = PorterStemmer()
    return [' '.join(stemmer.stem(word) for word in text.split()) for text in text_input]

In [24]:
stemmed_train = stemming_of_data(cleaned_train_data_from_stop_words)
stemmed_test = stemming_of_data(cleaned_test_data_from_stop_words)

In [25]:
cv = CountVectorizer(binary=True)
cv.fit(stemmed_train)
X = cv.transform(stemmed_train)
X_test = cv.transform(stemmed_test)

In [26]:
X_train,X_val,y_train,y_val = train_test_split(X,target,test_size=0.3,random_state = 4)


for hyper_parameter in [0.1,0.01,0.05,0.25,0.5,0.75]:
    lr_model_stemmed = LogisticRegression(C = hyper_parameter,max_iter = 300)
    lr_model_stemmed.fit(X_train,y_train)
    print('Accuracy for C=%s :%s'%(hyper_parameter,round(accuracy_score(y_val,lr_model_stemmed.predict(X_val)),2)))

Accuracy for C=0.1 :0.88
Accuracy for C=0.01 :0.87
Accuracy for C=0.05 :0.88
Accuracy for C=0.25 :0.88
Accuracy for C=0.5 :0.87
Accuracy for C=0.75 :0.87


In [27]:
# Now let's check for the test set
lr_model_stemmed = LogisticRegression(C = 0.05)
lr_model_stemmed.fit(X,target)
print('Final Accuracy: ',round(accuracy_score(target,lr_model_stemmed.predict(X_test)),4))

Final Accuracy:  0.8766


We can see that STEMMING the data which is free of **Stop Words** gives us slightly lower accuracy, so we will repeat the same steps on the cleaned data that still contains the stop words.

In [28]:
stemmed_train = stemming_of_data(cleaned_train)
stemmed_test = stemming_of_data(cleaned_test)

cv = CountVectorizer(binary=True)
cv.fit(stemmed_train)
X = cv.transform(stemmed_train)
X_test = cv.transform(stemmed_test)

X_train,X_val,y_train,y_val = train_test_split(X,target,test_size=0.3,random_state = 4)


for hyper_parameter in [0.1,0.01,0.05,0.25,0.5,0.75]:
    lr_model_stemmed = LogisticRegression(C = hyper_parameter,max_iter = 300)
    lr_model_stemmed.fit(X_train,y_train)
    print('Accuracy for C=%s :%s'%(hyper_parameter,round(accuracy_score(y_val,lr_model_stemmed.predict(X_val)),4)))

Accuracy for C=0.1 :0.8809
Accuracy for C=0.01 :0.8733
Accuracy for C=0.05 :0.8829
Accuracy for C=0.25 :0.878
Accuracy for C=0.5 :0.8756
Accuracy for C=0.75 :0.8743


In [29]:
# Now let's check for the test set
lr_model_stemmed = LogisticRegression(C = 0.05)
lr_model_stemmed.fit(X,target)
print('Final Accuracy: ',round(accuracy_score(target,lr_model_stemmed.predict(X_test)),4))

Final Accuracy:  0.8771


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


### Lemmatization!

In [30]:
from nltk.stem import WordNetLemmatizer

def lemmatizing_of_data(text_input):
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in text.split()]) for text in text_input]

Performing Modelling on Text which is free of Stop Words

In [31]:
# Text with Stop Words removed
lemmatized_reviews_train_stop_words_excluded = lemmatizing_of_data(cleaned_train_data_from_stop_words)
lemmatized_reviews_test_stop_words_excluded = lemmatizing_of_data(cleaned_test_data_from_stop_words)

cv = CountVectorizer(binary=True)
cv.fit(lemmatized_reviews_train_stop_words_excluded)
X = cv.transform(lemmatized_reviews_train_stop_words_excluded)
X_test = cv.transform(lemmatized_reviews_test_stop_words_excluded)

X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.7
)

for c in [0.1,0.01, 0.05, 0.25, 0.5, 0.75]:
    
    lr = LogisticRegression(C=c,max_iter = 300)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.1: 0.8793333333333333
Accuracy for C=0.01: 0.8734666666666666
Accuracy for C=0.05: 0.88
Accuracy for C=0.25: 0.8768
Accuracy for C=0.5: 0.8726666666666667
Accuracy for C=0.75: 0.8710666666666667


Performing Modelling on the cleaned data (Stop Words kept)

In [32]:
lemmatized_reviews_train = lemmatizing_of_data(cleaned_train)
lemmatized_reviews_test = lemmatizing_of_data(cleaned_test)

cv = CountVectorizer(binary=True)
cv.fit(lemmatized_reviews_train)
X = cv.transform(lemmatized_reviews_train)
X_test = cv.transform(lemmatized_reviews_test)

X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.7
)

for c in [0.1,0.01, 0.05, 0.25, 0.5, 0.75]:
    
    lr = LogisticRegression(C=c,max_iter = 300)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.1: 0.8805333333333333
Accuracy for C=0.01: 0.8682666666666666
Accuracy for C=0.05: 0.8812
Accuracy for C=0.25: 0.8784
Accuracy for C=0.5: 0.8762666666666666
Accuracy for C=0.75: 0.8750666666666667


### N-Grams

**What is an N-Gram, and why are we going to use it?**

An N-Gram is a sequence of N neighbouring words. The **ngram_range** parameter of the **Vectorizer** lets it build features from **bigrams (2 words)** as well as single words, which helps in identifying high-power word pairs.
For example, we want to check what weight a pair of words like **well worth** carries when joined, and whether it increases the predictive power.
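
A minimal sketch of what a range of (1,2) means for a single review: every single word and every pair of neighbouring words becomes a feature. The helper below is hypothetical and written only for illustration; it mimics the idea, not CountVectorizer's implementation.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    # Collect every run of n neighbouring tokens for n_min <= n <= n_max
    features = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            features.append(' '.join(tokens[start:start + n]))
    return features

features = word_ngrams("well worth watching".split())
print(features)
```

The pair 'well worth' now gets its own column and therefore its own coefficient, separate from 'well' and 'worth'.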

Let's see it in practical application here!

In [35]:
# We build a binary Matrix that marks which single words and bigrams occur in each review
ngram_cv = CountVectorizer(binary=True,ngram_range=(1,2))
ngram_cv.fit(cleaned_train_data_from_stop_words)
X = ngram_cv.transform(cleaned_train_data_from_stop_words)
X_test = ngram_cv.transform(cleaned_test_data_from_stop_words)

In [36]:
X_train,X_val,y_train,y_val = train_test_split(X,target,test_size=0.3)

for c in [0.1,0.01,0.05,0.25,0.5,0.75]:
    ngram_lr = LogisticRegression(C = c,max_iter = 300)
    ngram_lr.fit(X_train,y_train)
    print('Accuracy for C=%s :%s'%(c,round(accuracy_score(y_val,ngram_lr.predict(X_val)),4)))

Accuracy for C=0.1 :0.8839
Accuracy for C=0.01 :0.8741
Accuracy for C=0.05 :0.882
Accuracy for C=0.25 :0.8837
Accuracy for C=0.5 :0.8845
Accuracy for C=0.75 :0.8836


In [37]:
# Now let's check for the test set
ngram_lr_test = LogisticRegression(C = 0.5)
ngram_lr_test.fit(X,target)
print('Final Accuracy: ',round(accuracy_score(target,ngram_lr_test.predict(X_test)),4))

Final Accuracy:  0.8883


So we see that the **Validation Acc is 88% and the Test Acc is also 88%**, which is a decent result, but we did it on the **data which is cleaned from Stop Words**. Now we will do it on our original cleaned data!

In [39]:
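# Caution: show_me_top_words reads its feature names from the global cv, which in the cells shown here
# is never fitted with ngram_range, while this model was fitted on the ngram_cv features.
# The words printed below are therefore most likely not the features these coefficients belong to.
show_me_top_words(ngram_lr_test)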

Top 5 Positive Words are: 
('propagandait', 0.9872406751776253)
('famousis', 0.42355665927160313)
('pimple', 0.3654362259262154)
('neuro', 0.36138242164183687)
('walid', 0.3483109777878208)
------***********------
Top 5 Negative Words are: 
('structuring', -0.861275299952086)
('negligee', -0.5464281663974457)
('asexual', -0.4519909900147761)
('hominid', -0.39828429048426606)
('sm', -0.39645153425531376)
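
The tokens listed above look odd for a sentiment model. One possible explanation, judging only from the cells shown here: show_me_top_words takes its feature names from the global cv, while ngram_lr_test was fitted on the ngram_cv features, and zip pairs two sequences of different length without complaint. A minimal sketch of that pitfall and of a length check that would catch it (pair_features and the toy values are hypothetical):

```python
def pair_features(feature_names, coefficients):
    # Refuse to pair names and weights that come from different feature spaces
    if len(feature_names) != len(coefficients):
        raise ValueError('got %d feature names but %d coefficients'
                         % (len(feature_names), len(coefficients)))
    return dict(zip(feature_names, coefficients))

unigram_names = ['bad', 'good', 'movie']        # vocabulary of one vectorizer
bigram_coefs = [0.1, -0.4, 0.3, 0.9, -0.7]      # weights of a model fitted on another

# zip stops at the shorter input, so the mismatch goes unnoticed
silent = dict(zip(unigram_names, bigram_coefs))
print(silent)

try:
    pair_features(unigram_names, bigram_coefs)
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(error)
```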


In [40]:
ngram_cv_original = CountVectorizer(binary=True,ngram_range=(1,2))
ngram_cv_original.fit(cleaned_train)
X = ngram_cv_original.transform(cleaned_train)
X_test = ngram_cv_original.transform(cleaned_test)

In [43]:
X_train,X_val,y_train,y_val = train_test_split(X,target,test_size = 0.3)

for c in [0.01,0.05,0.1,0.25,0.5,0.75]:
    ngram_original_lr = LogisticRegression(C = c,max_iter=300)
    ngram_original_lr.fit(X_train,y_train)
    print('Accuracy for C=%s :%s'%(c,round(accuracy_score(y_val,ngram_original_lr.predict(X_val)),4)))

Accuracy for C=0.01 :0.8805
Accuracy for C=0.05 :0.8867
Accuracy for C=0.1 :0.8877
Accuracy for C=0.25 :0.8891
Accuracy for C=0.5 :0.8899
Accuracy for C=0.75 :0.8888


In [44]:
# Test Performance for the same
ngram_test_original = LogisticRegression(C = 0.5) # we are not using max_iter, just to show the warning which is received
ngram_test_original.fit(X,target)
print('Accuracy is:',round(accuracy_score(target,ngram_test_original.predict(X_test)),4))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy is: 0.8987


Now what we get to see is that without using the **Stop Words cleaned data**, we get a validation Accuracy of about 89%, which is quite impressive, and our test accuracy is approx 90%.
That is roughly 2 percentage points above the test accuracy with **Stemming** (0.8771) and about 1 point above the validation accuracy with **Lemmatization** (0.8812).
Also, increasing the **ngrams** will not give us better accuracy or performance every time.

In [45]:
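# Note: ngram_original_lr is the last model from the loop above (C = 0.75, fitted on X_train), not ngram_test_original.
# show_me_top_words also reads its feature names from the global cv rather than ngram_cv_original,
# so the words printed below are most likely not the features these coefficients belong to.
show_me_top_words(ngram_original_lr)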

Top 5 Positive Words are: 
('minnieapolis', 0.8969108998668025)
('thoroughfare', 0.35846085881026135)
('saldana', 0.34478594426826975)
('scenesanyway', 0.33108570217008254)
('latham', 0.3292864357439761)
------***********------
Top 5 Negative Words are: 
('verbosity', -0.651051851761527)
('alchemize', -0.4027733731446192)
('interrogating', -0.3897407442638686)
('bch', -0.3591348441867934)
('talbot', -0.32489170326554556)


## Using SVM with the N-Gram features, and then we end this Exercise!

In [46]:
from sklearn.svm import LinearSVC

for c in [0.01,0.05,0.1,0.25,0.5,0.75]:
    svm_linear = LinearSVC(C=c)
    svm_linear.fit(X_train,y_train)
    print('Accuracy for C=%s :%s'%(c,round(accuracy_score(y_val,svm_linear.predict(X_val)),4)))

Accuracy for C=0.01 :0.8888
Accuracy for C=0.05 :0.8884
Accuracy for C=0.1 :0.8885
Accuracy for C=0.25 :0.8872
Accuracy for C=0.5 :0.8869
Accuracy for C=0.75 :0.8869




In [48]:
# Test Set Check
svm_test = LinearSVC(C = 0.25)
svm_test.fit(X,target)
print('Accuracy is:',round(accuracy_score(target,svm_test.predict(X_test)),4))

Accuracy is:  0.8941


In [49]:
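# Note: svm_linear is the last model from the loop above (C = 0.75, fitted on X_train), not svm_test.
# show_me_top_words also reads its feature names from the global cv rather than ngram_cv_original,
# so the words printed below are most likely not the features these coefficients belong to.
show_me_top_words(svm_linear)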

Top 5 Positive Words are: 
('minnieapolis', 0.20478487267224568)
('astronomical', 0.08464045307387719)
('thoroughfare', 0.08130683705471237)
('bathebo', 0.08061123497385746)
('kidsmy', 0.07991351408947818)
------***********------
Top 5 Negative Words are: 
('verbosity', -0.14250695785955148)
('talbot', -0.11069014475510253)
('alchemize', -0.10210967112327878)
('roman', -0.09983813841913194)
('interrogating', -0.09888506164926293)
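
To wrap up, the test accuracies reported in the cells above can be put side by side. The sketch below only re-types those printed numbers and compares each variant with the first unigram model; the variant labels are made up here, and the lemmatized runs are left out because only validation scores were printed for them.

```python
# Test accuracies copied from the "Final Accuracy" / "Accuracy is" outputs above
test_accuracy = {
    'unigrams, cleaned text': 0.88144,
    'unigrams, stop words removed': 0.87972,
    'stemmed, stop words removed': 0.8766,
    'stemmed, cleaned text': 0.8771,
    'uni+bigrams, stop words removed': 0.8883,
    'uni+bigrams, cleaned text': 0.8987,
    'uni+bigrams, cleaned text, LinearSVC': 0.8941,
}

baseline = test_accuracy['unigrams, cleaned text']

# Print the variants from best to worst with the gap to the baseline in percentage points
for name, accuracy in sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True):
    print('%-40s %.4f  (%+.2f points vs baseline)' % (name, accuracy, (accuracy - baseline) * 100))

best_variant = max(test_accuracy, key=test_accuracy.get)
print('Best on the test set:', best_variant)
```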


# Thank You for your Time!

Do let me know of any improvements!