Sentiment Analysis using TextBlob
* Using the IMDB Dataset from Kaggle.com
* Sentiment analysis is the process of determining the sentiment expressed in a text. The sentiment can be positive, negative, or neutral, and each text is classified into one of these categories.
* Sentiment analysis can reveal the mood and emotions behind a customer review, which helps in gathering insight and context.

In [3]:
import sys

!{sys.executable} -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import re

import pandas as pd
import spacy
from nltk.tokenize.toktok import ToktokTokenizer
from textblob import TextBlob

tokenizer = ToktokTokenizer()
nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.max_length = 2_000_000  # or, once the data is loaded, train['review'].str.len().max() + 1

In [3]:
TextBlob("this is a good app").sentiment

Sentiment(polarity=0.7, subjectivity=0.6000000000000001)

In [4]:
TextBlob("this is not a good app").sentiment

Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)

In [5]:
TextBlob("Everyone says this app is poorly written").sentiment

Sentiment(polarity=-0.4, subjectivity=0.6)

### Polarity and Subjectivity
* Polarity is a float in the range [-1.0, 1.0], where -1.0 is most negative, 0.0 is neutral, and 1.0 is most positive.
* Polarity tells you whether the sentiment of the text is positive or negative, and how strongly.
* Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. It indicates how much the statement expresses an opinion rather than a fact.
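
To make the polarity scale concrete, here is a minimal sketch of how a score could be turned into a coarse label. The helper `label_from_polarity` and its `neutral_band` cutoff are hypothetical names and values chosen for illustration; they are not part of TextBlob.

```python
def label_from_polarity(polarity: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label.

    neutral_band is an assumed cutoff chosen for illustration only.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"


# Polarity values taken from the TextBlob examples above, plus a neutral 0.0.
for score in (0.7, -0.35, -0.4, 0.0):
    print(score, label_from_polarity(score))
```

Widening `neutral_band` sends more borderline texts to "Neutral"; a suitable value depends on the data and should be checked against labelled examples.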

In [6]:
# Data Loading
train = pd.read_csv("data/IMDB Dataset_1.csv")
train

Unnamed: 0,review,label
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
...,...,...
49995,I thought this movie did a down right good job...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,0
49998,I'm going to have to disagree with the previou...,0


In [7]:
label_0 = train[train['label']==0].sample(n=5000)
label_1 = train[train['label']==1].sample(n=5000)

In [8]:
train = pd.concat([label_0,label_1])
from sklearn.utils import shuffle
train = shuffle(train)

In [9]:
train

Unnamed: 0,review,label
22902,Ghost Town starts as Kate Barrett (Catherine H...,0
18333,My tolerance for shlocky direction was overwhe...,0
12095,"I have seen many a horror flick in my time, al...",0
12413,"""Pixote: A Lei do Mais Fraco"" deals with what ...",1
5476,I had the chance to watch Blind Spot in Barcel...,1
...,...,...
18017,"I may be biased, I am the author of the novel ...",1
44899,this is one of the funniest shows i have ever ...,1
7262,I'm not a Steve Carell fan however I like this...,1
39076,"Jamie Foxx is my favorite comedian. However, I...",0


The data has two labels: 0 for a negative review and 1 for a positive review. The sample drawn above contains 5,000 reviews of each.

## Data Preprocessing

In [10]:
train.isnull().sum() # Check for null values

review    0
label     0
dtype: int64

In [11]:
import numpy as np
train.replace(r'^\s*$', np.nan, regex = True, inplace = True)
train.dropna(axis = 0, how = 'any', inplace = True)

In [12]:
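# Replace literal ("\\n") and actual ("\n") tab, newline, and carriage-return
# sequences with a space so that adjacent words do not merge.
train.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=[" ", " "], regex=True, inplace=True)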
print('escape seq removed')

escape seq removed


In [13]:
train

Unnamed: 0,review,label
22902,Ghost Town starts as Kate Barrett (Catherine H...,0
18333,My tolerance for shlocky direction was overwhe...,0
12095,"I have seen many a horror flick in my time, al...",0
12413,"""Pixote: A Lei do Mais Fraco"" deals with what ...",1
5476,I had the chance to watch Blind Spot in Barcel...,1
...,...,...
18017,"I may be biased, I am the author of the novel ...",1
44899,this is one of the funniest shows i have ever ...,1
7262,I'm not a Steve Carell fan however I like this...,1
39076,"Jamie Foxx is my favorite comedian. However, I...",0


In [14]:
train['review'] = train['review'].str.encode('ascii', 'ignore').str.decode('ascii')
print('non-ascii data removed')

non-ascii data removed
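
The encode/decode step drops every non-ASCII character outright, so accented letters disappear rather than being simplified. The sketch below contrasts that behaviour with an alternative that decomposes characters first. Both helper names and the sample string are hypothetical, and the second approach is a possible variation, not what the notebook does.

```python
import unicodedata


def drop_non_ascii(text: str) -> str:
    """Same effect as the encode/decode step above: non-ASCII characters vanish."""
    return text.encode("ascii", "ignore").decode("ascii")


def fold_to_ascii(text: str) -> str:
    """Decompose accented characters first so their base letters survive."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hypothetical text, not taken from the dataset ("A café in São Paulo").
sample = "A caf\u00e9 in S\u00e3o Paulo"
print(drop_non_ascii(sample))  # A caf in So Paulo
print(fold_to_ascii(sample))   # A cafe in Sao Paulo
```

Folding keeps words such as foreign film titles readable, at the cost of one extra normalization pass per review.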


In [15]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(review):
    return review.translate(PUNCT_TABLE)

train['review'] = train['review'].apply(remove_punctuation)

In [17]:
train

Unnamed: 0,review,label
22902,Ghost Town starts as Kate Barrett Catherine Hi...,0
18333,My tolerance for shlocky direction was overwhe...,0
12095,I have seen many a horror flick in my time all...,0
12413,Pixote A Lei do Mais Fraco deals with what is ...,1
5476,I had the chance to watch Blind Spot in Barcel...,1
...,...,...
18017,I may be biased I am the author of the novel T...,1
44899,this is one of the funniest shows i have ever ...,1
7262,Im not a Steve Carell fan however I like this ...,1
39076,Jamie Foxx is my favorite comedian However I f...,0


In [18]:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [19]:
stopword_list = nltk.corpus.stopwords.words('english')
# Keep the negations 'no' and 'not': dropping them would turn "not good" into "good".
stopword_list.remove('no')
stopword_list.remove('not')

In [20]:
def custom_remove_stopwords(review, is_lower_case = False):
    tokens = tokenizer.tokenize(review)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_review = ' '.join(filtered_tokens)
    return filtered_review

In [21]:
train['review'] = train['review'].apply(custom_remove_stopwords)
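
The reason for keeping 'no' and 'not' can be seen on a single sentence. This is a minimal sketch with a tiny, hypothetical stopword set and plain whitespace tokenization; the notebook itself uses NLTK's full English list and `ToktokTokenizer`.

```python
# A tiny, hypothetical stopword set; the notebook uses NLTK's full English list.
BASE_STOPWORDS = {"this", "is", "a", "no", "not"}
# Same idea as removing 'no' and 'not' from stopword_list above.
KEEP_NEGATIONS = BASE_STOPWORDS - {"no", "not"}


def filter_stopwords(text: str, stopwords: set[str]) -> str:
    """Drop stopwords using whitespace tokenization (a simplification)."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)


review = "This is not a good app"
print(filter_stopwords(review, BASE_STOPWORDS))  # good app
print(filter_stopwords(review, KEEP_NEGATIONS))  # not good app
```

With the negation removed, the text that reaches the sentiment scorer reads as praise, which is the opposite of what the reviewer wrote.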

In [22]:
train

Unnamed: 0,review,label
22902,Ghost Town starts Kate Barrett Catherine Hickl...,0
18333,tolerance shlocky direction overwhelmed choice...,0
12095,seen many horror flick time absurdly bad none ...,0
12413,Pixote Lei Mais Fraco deals perhaps greatest B...,1
5476,chance watch Blind Spot Barcelona enjoyed trem...,1
...,...,...
18017,may biased author novel Hungry Bachelors Club ...,1
44899,one funniest shows ever seen really refreshing...,1
7262,Im not Steve Carell fan however like movie Dan...,1
39076,Jamie Foxx favorite comedian However feel sold...,0


In [23]:
def remove_special_characters(review):
    # Raw string so that \s reaches the regex engine unchanged.
    review = re.sub(r'[^a-zA-Z0-9\s]', '', review)
    return review

In [24]:
train['review'] = train['review'].apply(remove_special_characters)

In [25]:
def remove_html(review):
    html_pattern = re.compile(r'<.*?>')
    return html_pattern.sub('', review)

In [26]:
train['review'] = train['review'].apply(remove_html)

In [27]:
def remove_URL(review):
    url = re.compile(r'https?://\S+|www\S+')
    return url.sub(r'', review)

In [28]:
train['review'] = train['review'].apply(remove_URL)

In [29]:
def remove_numbers(review):
    """Remove every digit character from the review."""
    return ''.join(ch for ch in review if not ch.isdigit())

In [30]:
train['review'] = train['review'].apply(remove_numbers)

In [31]:
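# Matches any word that contains at least one digit.
DIGIT_WORD = re.compile(r'\D*\d')

def cleanse(word):
    # Drop the whole word if it contains a digit.
    if DIGIT_WORD.match(word):
        return ''
    return word

def remove_alphanumeric(s):
    # filter(None, ...) discards the empty strings returned by cleanse.
    return " ".join(filter(None, (cleanse(w) for w in s.split())))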

In [32]:
train['review'] = train['review'].apply(remove_alphanumeric)
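
The cleaning helpers are order-sensitive. In the sequence used here, `remove_punctuation` runs first and strips `<`, `>`, `:`, `/`, and `.`, so by the time `remove_html` runs there are no tags left to match and a tag such as `<br />` has already been reduced to the stray token `br`. The sketch below shows one possible ordering in which each pattern can still match. The function `clean_review`, its patterns, and the sample text are assumptions made for illustration, not the notebook's own pipeline.

```python
import re
import string

HTML_TAG = re.compile(r"<.*?>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text: str) -> str:
    """Apply the cleaning steps in an order where each pattern can still match."""
    text = HTML_TAG.sub(" ", text)      # tags first, while '<' and '>' still exist
    text = URL.sub(" ", text)           # URLs next, while ':', '/' and '.' still exist
    text = text.translate(PUNCT_TABLE)  # then punctuation
    text = re.sub(r"\d+", "", text)     # then digits
    return " ".join(text.split())       # collapse leftover whitespace


# Hypothetical review text, loosely modelled on a row shown earlier.
raw = "A wonderful little production. <br /><br />See http://example.com for 10 more!"
print(clean_review(raw))  # A wonderful little production See for more
```

Replacing tags and URLs with a space rather than an empty string avoids gluing the neighbouring words together.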

In [33]:
train

Unnamed: 0,review,label
22902,Ghost Town starts Kate Barrett Catherine Hickl...,0
18333,tolerance shlocky direction overwhelmed choice...,0
12095,seen many horror flick time absurdly bad none ...,0
12413,Pixote Lei Mais Fraco deals perhaps greatest B...,1
5476,chance watch Blind Spot Barcelona enjoyed trem...,1
...,...,...
18017,may biased author novel Hungry Bachelors Club ...,1
44899,one funniest shows ever seen really refreshing...,1
7262,Im not Steve Carell fan however like movie Dan...,1
39076,Jamie Foxx favorite comedian However feel sold...,0


In [45]:
train['sentiment'] = train['review'].apply(lambda x: TextBlob(x).sentiment)

In [46]:
train

Unnamed: 0,review,label,sentiment
22902,Ghost Town starts Kate Barrett Catherine Hickl...,0,"(-0.02980990783410139, 0.44915834613415273)"
18333,tolerance shlocky direction overwhelmed choice...,0,"(0.17857142857142858, 0.38571428571428573)"
12095,seen many horror flick time absurdly bad none ...,0,"(-0.06874999999999998, 0.4895833333333333)"
12413,Pixote Lei Mais Fraco deals perhaps greatest B...,1,"(0.14166666666666666, 0.6166666666666666)"
5476,chance watch Blind Spot Barcelona enjoyed trem...,1,"(0.37302489177489173, 0.69629329004329)"
...,...,...,...
18017,may biased author novel Hungry Bachelors Club ...,1,"(0.2527777777777778, 0.5935185185185186)"
44899,one funniest shows ever seen really refreshing...,1,"(0.09930555555555555, 0.6267361111111112)"
7262,Im not Steve Carell fan however like movie Dan...,1,"(0.3375, 0.6545454545454545)"
39076,Jamie Foxx favorite comedian However feel sold...,0,"(0.2569444444444444, 0.5222222222222223)"


In [47]:
sentiment_series = train['sentiment'].tolist()

In [48]:
columns = ['polarity', 'subjectivity']
df1 = pd.DataFrame(sentiment_series, columns = columns, index = train.index)

In [49]:
df1

Unnamed: 0,polarity,subjectivity
22902,-0.029810,0.449158
18333,0.178571,0.385714
12095,-0.068750,0.489583
12413,0.141667,0.616667
5476,0.373025,0.696293
...,...,...
18017,0.252778,0.593519
44899,0.099306,0.626736
7262,0.337500,0.654545
39076,0.256944,0.522222


In [50]:
result = pd.concat([train, df1], axis = 1)

In [41]:
result.drop(['sentiment'], axis = 1, inplace = True)

In [42]:
result.loc[result['polarity'] >= 0.3, 'Sentiment'] = "Positive"
result.loc[result['polarity'] < 0.3, 'Sentiment'] = "Negative"

In [43]:
result

Unnamed: 0,review,label,polarity,subjectivity,Sentiment
22902,Ghost Town starts Kate Barrett Catherine Hickl...,0,-0.029810,0.449158,Negative
18333,tolerance shlocky direction overwhelmed choice...,0,0.178571,0.385714,Negative
12095,seen many horror flick time absurdly bad none ...,0,-0.068750,0.489583,Negative
12413,Pixote Lei Mais Fraco deals perhaps greatest B...,1,0.141667,0.616667,Negative
5476,chance watch Blind Spot Barcelona enjoyed trem...,1,0.373025,0.696293,Positive
...,...,...,...,...,...
18017,may biased author novel Hungry Bachelors Club ...,1,0.252778,0.593519,Negative
44899,one funniest shows ever seen really refreshing...,1,0.099306,0.626736,Negative
7262,Im not Steve Carell fan however like movie Dan...,1,0.337500,0.654545,Positive
39076,Jamie Foxx favorite comedian However feel sold...,0,0.256944,0.522222,Negative


In [44]:
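# Encode the predicted sentiment as 1/0 so it can be compared with the label column.
result.loc[result['Sentiment'] == 'Positive', 'Sentiment'] = 1
result.loc[result['Sentiment'] == 'Negative', 'Sentiment'] = 0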
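
Once the prediction is expressed as 0 or 1, it can be compared with the `label` column. The sketch below does this for the nine rows visible in the result table above, using the same `polarity >= 0.3` rule. The `accuracy` helper is a hypothetical name, and nine rows say little about the full 10,000-review sample.

```python
def accuracy(labels: list[int], predictions: list[int]) -> float:
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of empty input")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# (label, polarity) pairs for the nine rows visible in the result table above.
rows = [
    (0, -0.029810), (0, 0.178571), (0, -0.068750),
    (1, 0.141667), (1, 0.373025), (1, 0.252778),
    (1, 0.099306), (1, 0.337500), (0, 0.256944),
]
labels = [label for label, _ in rows]
# Same rule as the notebook: polarity >= 0.3 counts as positive.
predictions = [1 if polarity >= 0.3 else 0 for _, polarity in rows]

print(f"accuracy on the visible rows: {accuracy(labels, predictions):.2f}")
```

On these rows, every miss is a positive review whose polarity falls below 0.3, which suggests the cutoff is worth tuning against the labelled data.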