
# Sentiment Analysis

First, I import the required libraries.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.tokenize import word_tokenize,sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/meghjoshii/NSDC_DataScienceProjects_SentimentAnalysis/main/IMDB%20Dataset.csv")

In [3]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
from google.colab import sheets
sheet = sheets.InteractiveSheet(df=df)

Next, I look at a summary of the data.

In [4]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


## Count the positive and negative reviews

In [5]:
df.value_counts("sentiment")

Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1
negative,25000
positive,25000


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


There are no null values in either column, so the data is ready for analysis. First, I tokenize each review into words.

In [7]:
import nltk
nltk.download('punkt_tab')
df['review'] = df['review'].apply(word_tokenize)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [8]:
df['review'][1]

['A',
 'wonderful',
 'little',
 'production',
 '.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'The',
 'filming',
 'technique',
 'is',
 'very',
 'unassuming-',
 'very',
 'old-time-BBC',
 'fashion',
 'and',
 'gives',
 'a',
 'comforting',
 ',',
 'and',
 'sometimes',
 'discomforting',
 ',',
 'sense',
 'of',
 'realism',
 'to',
 'the',
 'entire',
 'piece',
 '.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'The',
 'actors',
 'are',
 'extremely',
 'well',
 'chosen-',
 'Michael',
 'Sheen',
 'not',
 'only',
 '``',
 'has',
 'got',
 'all',
 'the',
 'polari',
 "''",
 'but',
 'he',
 'has',
 'all',
 'the',
 'voices',
 'down',
 'pat',
 'too',
 '!',
 'You',
 'can',
 'truly',
 'see',
 'the',
 'seamless',
 'editing',
 'guided',
 'by',
 'the',
 'references',
 'to',
 'Williams',
 "'",
 'diary',
 'entries',
 ',',
 'not',
 'only',
 'is',
 'it',
 'well',
 'worth',
 'the',
 'watching',
 'but',
 'it',
 'is',
 'a',
 'terrificly',
 'written',
 'and',
 'performed',
 'piece',
 '.',
 'A',
 'masterful

In [9]:
# Keep only purely alphabetic tokens; this drops punctuation, numbers, and hyphenated tokens.
df['review'] = df['review'].apply(lambda x: [item for item in x if item.isalpha()])
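
To see what the `isalpha()` filter keeps and drops, here is a small sketch on a few tokens copied from the tokenized review printed earlier. It suggests the filter removes more than punctuation: any token containing a hyphen is discarded as well.

```python
# Tokens copied from the tokenized second review shown earlier.
tokens = ['very', 'unassuming-', 'very', 'old-time-BBC', 'fashion', ',', '``', 'polari', "''"]

# str.isalpha() is True only when every character is a letter,
# so punctuation and hyphenated tokens are both discarded.
kept = [tok for tok in tokens if tok.isalpha()]
dropped = [tok for tok in tokens if not tok.isalpha()]

print(kept)     # ['very', 'very', 'fashion', 'polari']
print(dropped)  # ['unassuming-', 'old-time-BBC', ',', '``', "''"]
```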

Next, I lowercase every token and print the second review (index 1) to check that the punctuation has been removed.

In [10]:
df['review'] = df['review'].apply(lambda x: [item.lower() for item in x])

In [11]:
print(df['review'][1])

['a', 'wonderful', 'little', 'production', 'br', 'br', 'the', 'filming', 'technique', 'is', 'very', 'very', 'fashion', 'and', 'gives', 'a', 'comforting', 'and', 'sometimes', 'discomforting', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece', 'br', 'br', 'the', 'actors', 'are', 'extremely', 'well', 'michael', 'sheen', 'not', 'only', 'has', 'got', 'all', 'the', 'polari', 'but', 'he', 'has', 'all', 'the', 'voices', 'down', 'pat', 'too', 'you', 'can', 'truly', 'see', 'the', 'seamless', 'editing', 'guided', 'by', 'the', 'references', 'to', 'williams', 'diary', 'entries', 'not', 'only', 'is', 'it', 'well', 'worth', 'the', 'watching', 'but', 'it', 'is', 'a', 'terrificly', 'written', 'and', 'performed', 'piece', 'a', 'masterful', 'production', 'about', 'one', 'of', 'the', 'great', 'master', 'of', 'comedy', 'and', 'his', 'life', 'br', 'br', 'the', 'realism', 'really', 'comes', 'home', 'with', 'the', 'little', 'things', 'the', 'fantasy', 'of', 'the', 'guard', 'which', 'rather', 'than', 'u
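
The printed tokens still include `br`, which is what remains of the `<br />` tags in the raw reviews once the angle brackets and slash are filtered out. One option is to strip those tags from the raw text before tokenizing. A minimal sketch, assuming a hypothetical helper `strip_html_breaks` (not part of this notebook) applied to the raw string:

```python
import re

# Hypothetical helper, not part of the original notebook.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def strip_html_breaks(text: str) -> str:
    """Replace <br>, <br/> and <br /> tags with a space and collapse whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return " ".join(without_tags.split())

raw = "A wonderful little production. <br /><br />The filming technique is very unassuming"
cleaned = strip_html_breaks(raw)
print(cleaned)
# A wonderful little production. The filming technique is very unassuming
```

On the data frame this would run before `word_tokenize`, for example with `df['review'].apply(strip_html_breaks)`.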

Then, I remove stop words

In [12]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
stop_words = set(stopwords.words('english'))

In [14]:
df['review'] = df['review'].apply(lambda x: [item for item in x if item not in stop_words])
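
NLTK's English stop-word list includes negations such as `no`, `nor`, and `not`, so a phrase like "not good" is reduced to "good". For sentiment analysis it may be worth keeping those words. A minimal sketch, using a tiny stand-in stop-word set (an assumption for illustration; in this notebook it would be `set(stopwords.words('english'))`):

```python
# Stand-in for set(stopwords.words('english')); only a handful of entries, for illustration.
stop_words = {"the", "is", "a", "and", "not", "no", "nor", "but"}

# Negation words to keep because they can flip the sentiment of a phrase.
negations = {"not", "no", "nor"}
stop_words_keep_negation = stop_words - negations

tokens = ["the", "plot", "is", "not", "good", "but", "the", "acting", "is", "great"]

default_filtered = [tok for tok in tokens if tok not in stop_words]
negation_kept = [tok for tok in tokens if tok not in stop_words_keep_negation]

print(default_filtered)  # ['plot', 'good', 'acting', 'great']
print(negation_kept)     # ['plot', 'not', 'good', 'acting', 'great']
```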

In [15]:
print(df['review'][1])

['wonderful', 'little', 'production', 'br', 'br', 'filming', 'technique', 'fashion', 'gives', 'comforting', 'sometimes', 'discomforting', 'sense', 'realism', 'entire', 'piece', 'br', 'br', 'actors', 'extremely', 'well', 'michael', 'sheen', 'got', 'polari', 'voices', 'pat', 'truly', 'see', 'seamless', 'editing', 'guided', 'references', 'williams', 'diary', 'entries', 'well', 'worth', 'watching', 'terrificly', 'written', 'performed', 'piece', 'masterful', 'production', 'one', 'great', 'master', 'comedy', 'life', 'br', 'br', 'realism', 'really', 'comes', 'home', 'little', 'things', 'fantasy', 'guard', 'rather', 'use', 'traditional', 'techniques', 'remains', 'solid', 'disappears', 'plays', 'knowledge', 'senses', 'particularly', 'scenes', 'concerning', 'orton', 'halliwell', 'sets', 'particularly', 'flat', 'halliwell', 'murals', 'decorating', 'every', 'surface', 'terribly', 'well', 'done']


Now, let's look at the first 5 rows of the data frame. The reviews have been broken down into usable tokens.

In [16]:
print(df.head())

                                              review sentiment
0  [one, reviewers, mentioned, watching, oz, epis...  positive
1  [wonderful, little, production, br, br, filmin...  positive
2  [thought, wonderful, way, spend, time, hot, su...  positive
3  [basically, family, little, boy, jake, thinks,...  negative
4  [petter, mattei, love, time, money, visually, ...  positive


In [17]:
# Stem each token using PorterStemmer

from nltk.stem import PorterStemmer
ps = PorterStemmer()

df['review'] = df['review'].apply(lambda x: [ps.stem(item) for item in x])


In [18]:
print(df['review'][1])

['wonder', 'littl', 'product', 'br', 'br', 'film', 'techniqu', 'fashion', 'give', 'comfort', 'sometim', 'discomfort', 'sens', 'realism', 'entir', 'piec', 'br', 'br', 'actor', 'extrem', 'well', 'michael', 'sheen', 'got', 'polari', 'voic', 'pat', 'truli', 'see', 'seamless', 'edit', 'guid', 'refer', 'william', 'diari', 'entri', 'well', 'worth', 'watch', 'terrificli', 'written', 'perform', 'piec', 'master', 'product', 'one', 'great', 'master', 'comedi', 'life', 'br', 'br', 'realism', 'realli', 'come', 'home', 'littl', 'thing', 'fantasi', 'guard', 'rather', 'use', 'tradit', 'techniqu', 'remain', 'solid', 'disappear', 'play', 'knowledg', 'sens', 'particularli', 'scene', 'concern', 'orton', 'halliwel', 'set', 'particularli', 'flat', 'halliwel', 'mural', 'decor', 'everi', 'surfac', 'terribl', 'well', 'done']


In [19]:
df['review'] = df['review'].apply(lambda x: " ".join(x))

In [20]:
print(df)

                                                  review sentiment
0      one review mention watch oz episod hook right ...  positive
1      wonder littl product br br film techniqu fashi...  positive
2      thought wonder way spend time hot summer weeke...  positive
3      basic famili littl boy jake think zombi closet...  negative
4      petter mattei love time money visual stun film...  positive
...                                                  ...       ...
49995  thought movi right good job creativ origin fir...  positive
49996  bad plot bad dialogu bad act idiot direct anno...  negative
49997  cathol taught parochi elementari school nun ta...  negative
49998  go disagre previou comment side maltin one sec...  negative
49999  one expect star trek movi high art fan expect ...  negative

[50000 rows x 2 columns]


Now I split the data into training and test sets: the first 40,000 rows for training and the remaining 10,000 for testing.

In [21]:
test_review = df[40000:]

In [22]:
train_review = df[:40000]
print(train_review)

                                                  review sentiment
0      one review mention watch oz episod hook right ...  positive
1      wonder littl product br br film techniqu fashi...  positive
2      thought wonder way spend time hot summer weeke...  positive
3      basic famili littl boy jake think zombi closet...  negative
4      petter mattei love time money visual stun film...  positive
...                                                  ...       ...
39995  marvel funni comedi great cast john ritter kat...  positive
39996  plot central charact move camera fact film fol...  positive
39997  show awesom love actor great stori line charac...  positive
39998  fact movi entitl success movi switzerland film...  negative
39999  confess sever br br version way compet version...  negative

[40000 rows x 2 columns]
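
The split here is positional: rows are assigned by their position in the file. If the file happens to be ordered in some way, a shuffled, stratified split is a common alternative. A sketch using scikit-learn's `train_test_split` on a toy frame (the toy data and the `random_state` value are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed reviews: 5 positive, 5 negative.
toy = pd.DataFrame({
    "review": [f"review text {i}" for i in range(10)],
    "sentiment": ["positive", "negative"] * 5,
})

# stratify keeps the positive/negative ratio the same in both parts;
# random_state makes the shuffle reproducible.
train_part, test_part = train_test_split(
    toy, test_size=0.2, stratify=toy["sentiment"], random_state=42
)

print(len(train_part), len(test_part))          # 8 2
print(sorted(test_part["sentiment"].tolist()))  # ['negative', 'positive']
```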


In [40]:

# Sentiment labels come from the 'sentiment' column, not 'review'.
test_sen = df['sentiment'][40000:]
train_sen = df['sentiment'][:40000]

In [24]:
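# min_df=0.1 keeps n-grams that occur in at least 10% of the reviews.
# max_df must be the float 1.0 (100% of reviews); the integer 1 would mean "at most one review".
cv = CountVectorizer(min_df=0.1, max_df=1.0, binary=False, ngram_range=(1, 3))

In [25]:
print(cv)

CountVectorizer(min_df=0.1, ngram_range=(1, 3))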


In [26]:
# Fit on the text column. Passing the whole DataFrame makes the vectorizer iterate over
# the column names, which produces a 2 x 2 matrix instead of one row per review.
cv_train_reviews = cv.fit_transform(train_review['review'])

In [27]:
# One row per training review.
print(cv_train_reviews.shape)

In [33]:
# Reuse the vocabulary learned from the training reviews.
cv_test_reviews = cv.transform(test_review['review'])

In [34]:
# One row per test review.
print(cv_test_reviews.shape)
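
A detail worth checking at this step is what gets passed to the vectorizer. Iterating over a pandas DataFrame yields its column names, not its rows, so fitting on a whole frame treats each column name as one document. A sketch on a toy frame (assumed data) contrasting the two calls:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toy = pd.DataFrame({
    "review": ["good movi", "bad plot", "great act"],
    "sentiment": ["positive", "negative", "positive"],
})

# Iterating a DataFrame gives the column labels.
print(list(toy))  # ['review', 'sentiment']

# Fitting on the frame therefore sees two "documents": 'review' and 'sentiment'.
on_frame = CountVectorizer().fit_transform(toy)
print(on_frame.shape)  # (2, 2)

# Fitting on the text column sees one document per review.
on_column = CountVectorizer().fit_transform(toy["review"])
print(on_column.shape)  # (3, 6)
```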


In [30]:
lb = LabelBinarizer()


In [42]:
lb_train_sen = lb.fit_transform(train_sen)
lb_test_sen = lb.transform(test_sen)
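
`LabelBinarizer` maps a two-class label column to a single 0/1 column, with classes ordered alphabetically, so `negative` becomes 0 and `positive` becomes 1. The result has shape `(n, 1)`; flattening it with `ravel()` gives the 1-D array that scikit-learn classifiers expect. A sketch on a few assumed labels:

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["positive", "negative", "negative", "positive"]

lb_demo = LabelBinarizer()
encoded = lb_demo.fit_transform(labels)

print(lb_demo.classes_.tolist())  # ['negative', 'positive']
print(encoded.shape)              # (4, 1)
print(encoded.ravel().tolist())   # [1, 0, 0, 1]
```

With more than two distinct values in the input, the output has one column per distinct value instead of a single column.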


In [44]:
mnb = MultinomialNB()

In [45]:
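# ravel() flattens the (n, 1) label array into the 1-D array that fit() expects.
mnb_bow = mnb.fit(cv_train_reviews, lb_train_sen.ravel())

Note: `train_sen` and `test_sen` must hold the `sentiment` labels. If they are sliced from the `review` column, `LabelBinarizer` creates one column per distinct review text, and `mnb.fit(cv_train_reviews, lb_train_sen)` fails with `ValueError: y should be a 1d array, got an array of shape (40000, 39731) instead.`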
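
`accuracy_score` is imported at the top of the notebook but never called. Once the model fits, the remaining step would be to predict on the test matrix and compare the predictions with the test labels. A self-contained sketch of that flow on a toy corpus (the data and variable names here are assumptions for illustration, not results from this notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer

# Toy stand-ins for the preprocessed review text and sentiment columns.
train_text = ["great film love act", "wonder stori great cast",
              "bad plot terribl act", "bore film wast time"]
train_label = ["positive", "positive", "negative", "negative"]
test_text = ["great stori", "terribl wast"]
test_label = ["positive", "negative"]

# Bag-of-words features: fit on training text only, then reuse the vocabulary.
cv_toy = CountVectorizer()
X_train = cv_toy.fit_transform(train_text)
X_test = cv_toy.transform(test_text)

# 0/1 labels flattened to 1-D, as the classifier expects.
lb_toy = LabelBinarizer()
y_train = lb_toy.fit_transform(train_label).ravel()
y_test = lb_toy.transform(test_label).ravel()

model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

toy_accuracy = accuracy_score(y_test, predictions)
print("accuracy:", toy_accuracy)
```

In the notebook itself, the same calls would use `cv_test_reviews` and the flattened `lb_test_sen` in place of the toy arrays.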