Import required libraries

In [79]:
import pandas as pd
import regex as re
import advertools as adv
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import EnglishStemmer
import nltk

In [13]:
df = pd.read_csv ('IMDB-Dataset.csv')
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


Explore dataframe

In [14]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


Check for emojis using an external library "advertools"

In [59]:
#Testing emojis
testEmojiDict = adv.extract_emoji("Hi ☺️ My name is Mo B. I love data science❤️💕😊😇😉😌😍🥰😗😙😚😚😋😛😝😜🤪🤨🧐😎")

count = 0
for emoji_count in testEmojiDict['emoji_counts']:
    if (emoji_count >= 1):
        count +=1
print("Total emojis:", count)
print(testEmojiDict['emoji_flat'])

Total emojis: 21
['☺', '❤', '💕', '😊', '😇', '😉', '😌', '😍', '🥰', '😗', '😙', '😚', '😚', '😋', '😛', '😝', '😜', '🤪', '🤨', '🧐', '😎']


In [15]:
emoji_dict = adv.extract_emoji(df.review)

In [61]:
count = 0

for emoji_count in emoji_dict['emoji_counts']:
    if (emoji >= 1):
        count +=1
print("Total emojis:", count)

print(emoji_dict['emoji_flat'])
print(emoji_dict['emoji_flat_text'])

Total emojis: 0
['®', '®', '®', '©', '®']
['registered', 'registered', 'registered', 'copyright', 'registered']


Seems like they are no emojis in the dataframe. Bear in mind data has not been altered in any way as of yet. Only Registered and copyright symbols are found which would not have any effect on the sentiment analyser whatsoever. Therefore they will be removed

Considerable number of reviews contained HTML tags, as well as brackets. Therefore they are removed.
Special characters are removed as well.

In [74]:
#Removing the html strips
def remove_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing the square brackets
def remove_between_square_brackets(text):
    return re.sub('\[[^]]*\]', '', text)

def remove_special_characters(text, remove_digits=True):
    pattern=r'[^a-zA-z0-9\s]'
    text=re.sub(pattern,'',text)
    return text

#Removing the noisy text
def clean_text(text):
    text = remove_html(text)
    text = remove_between_square_brackets(text)
    text = remove_special_characters(text)
    return text
#Apply function on review column
df['review'] = df['review'].apply(clean_text)

Tokenise and normalise words

In [81]:
nltk.download('punkt')
stemmer = EnglishStemmer()

def stem_words(tokenizedList):
    stemmedList = []
    for word in tokenizedList:
        stemmedList.append(stemmer.stem(word))
    return stemmedList

def smarter_tokenize_and_preprocess(text):
    tokenizedWords = nltk.word_tokenize(text)
    return stem_words(tokenizedWords)
df['review'] = df['review'].apply(smarter_tokenize_and_preprocess)

[nltk_data] Downloading package punkt to /Users/moham/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [84]:
df.review

0        [one, of, the, other, review, has, mention, th...
1        [a, wonder, littl, product, the, film, techniq...
2        [i, thought, this, was, a, wonder, way, to, sp...
3        [basic, there, a, famili, where, a, littl, boy...
4        [petter, mattei, love, in, the, time, of, mone...
                               ...                        
49995    [i, thought, this, movi, did, a, down, right, ...
49996    [bad, plot, bad, dialogu, bad, act, idiot, dir...
49997    [i, am, a, cathol, taught, in, parochi, elemen...
49998    [im, go, to, have, to, disagre, with, the, pre...
49999    [no, one, expect, the, star, trek, movi, to, b...
Name: review, Length: 50000, dtype: object

Possibly use tf-idf, can be used with logistic regression. More on that later when we pick a model