### Project 3: IMDb Review Dataset

> **Project by :** Kishan Kanaiyalal Patel  
**Student Id  :** 200527734

In [None]:
import numpy as np 
import pandas as pd
from os import path
from pandas import DataFrame
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
import re

import nltk
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer    # Lemmatization is similar to stemming, but it maps each word to its dictionary base form (lemma) instead of just stripping suffixes.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('wordnet')

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import style
import matplotlib.colors

import wordcloud  
from wordcloud import WordCloud, STOPWORDS 
from PIL import Image

In [None]:
#loading the dataset

df=pd.read_excel('IMDB_Dataset.xlsx')
df.head()


## Text Pre-processing


> Text pre-processing is a necessary stage in developing an NLP application. Human-written text often contains spelling mistakes, abbreviations, special symbols, emoticons, and so on. We can read such text without trouble, but a computer needs it cleaned up first. This notebook goes over several kinds of text pre-processing that are commonly needed when working with text data.

### Lowercasing

Lowercasing converts every word to lower case. Without it, the model treats two occurrences of the same word as different tokens if one of them (say, "Book") starts a sentence with a capital letter and the other ("book") appears later without one. Lowercasing is usually straightforward, and we can use the `.lower()` method.

In [None]:
def lowercasing(column):
    column = column.str.lower()
    return column

In [None]:
print(f"Before applying lower casing: {df['review'][0][:20]}")

df['cleaned_text'] = lowercasing(df['review'])

print(f"After applying lower casing : {df['cleaned_text'][0][:20]}")


### Removing HTML Tags

Text obtained by web scraping usually contains HTML tags such as header, body, and anchor tags. These tags carry no information about sentiment, so they should be removed. A regular expression can strip them.

Our dataset does not appear to contain HTML tags, but in case it does, the function below removes them.
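
The pattern used for tag removal relies on a non-greedy quantifier (`.*?`). The sketch below, using a made-up sample string, illustrates why this matters: a greedy `.*` would match from the first `<` to the last `>` and delete the text in between.

```python
import re

# Hypothetical sample string, not taken from the dataset.
sample = "<p>great <b>movie</b></p><br />loved it"

# Non-greedy: each match stops at the first '>', so only the tags are removed.
non_greedy = re.sub(r"<.*?>", "", sample)

# Greedy: one match runs from the first '<' to the last '>',
# swallowing the words between the tags as well.
greedy = re.sub(r"<.*>", "", sample)

print(repr(non_greedy))
print(repr(greedy))
```

Note that deleting tags outright can glue neighbouring words together (here "movie" and "loved"). Substituting a space instead of an empty string is one way to avoid that, at the cost of some extra whitespace.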

In [None]:
import re
def html_tag(text):
    # Non-greedy match: each '<...>' tag is matched on its own and removed.
    re_html = re.compile(r'<.*?>')
    return re_html.sub('', text)

In [None]:
text = '<h1> This is the first heading in HTML </h1>'
print(html_tag(text))

In [None]:
print(f"Before removing HTML tags: {df['cleaned_text'][1][:60]}")
df['cleaned_text'] = df['cleaned_text'].apply(html_tag)
print(f"After removing HTML tags : {df['cleaned_text'][1][:60]}")

### Removing URLs

Like HTML tags, URLs are of no use for judging the sentiment of a review, so we remove any that appear in our dataset.

In [None]:
text = 'My portfolio can be seen at: https://www.kishandigitallab.com/portfolio'
def url(text):
    # Raw string so that \S and \. reach the regex engine as written.
    re_url = re.compile(r'https?://\S+|www\.\S+')
    return re_url.sub('', text)

In [None]:
print(f'Text before removing URL: {text}')
print(f'Text after removing URL : {url(text)}')

In [None]:
#applying the function on the dataset.
df['cleaned_text'] = df['cleaned_text'].apply(url)

In [None]:
df.head()

### Removing Punctuations

As with lowercasing, punctuation is usually removed because we want tokens such as 'yeah' and 'yeah!' to be treated as the same word. How a word like "can't" comes out depends on the approach: deleting the apostrophe gives "cant", while replacing it with a space gives "can t". The function below deletes punctuation characters.
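
To make the "cant" versus "can t" distinction concrete, here is a minimal sketch of both options using `str.translate`. The sample sentence and the variable names are illustrative assumptions, not part of the original notebook.

```python
import string

# Hypothetical sample sentence.
text = "Yeah! I can't wait."

# Option 1: delete punctuation characters outright ("can't" -> "cant").
delete_table = str.maketrans("", "", string.punctuation)
deleted = text.translate(delete_table)

# Option 2: map each punctuation character to a space ("can't" -> "can t"),
# then collapse the repeated whitespace this leaves behind.
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(text.translate(space_table).split())

print(deleted)
print(spaced)
```

Which option is preferable depends on the downstream tokenizer and stop word list, so it is worth checking the result on a few real reviews.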

In [None]:
import string
exclude = string.punctuation

def punctuations(text):
    return text.translate(str.maketrans('', '', exclude))

In [None]:
text = 'Yeah!'
print(f'Text before punctuation: {text}')
no_punc = punctuations(text)
print(f'Text after punctuation : {no_punc}')

In [None]:
print(f"Text before removing punctuation: {df['cleaned_text'][0]}\n")
df['cleaned_text'] = df['cleaned_text'].apply(punctuations)
print(f"Text after removing punctuation : {df['cleaned_text'][0]}")

In [None]:
df.head()

In [None]:
# def textblob_func(text):
#     try:
#         return TextBlob(text).correct()
#     except:
#         return None

# df['cleaned_text'] = df['cleaned_text'].apply(textblob_func)

In [None]:
# df.cleaned_text.head()

In [None]:
# from textblob import TextBlob

# def translate(x):
#     blob =TextBlob(x)
#     return blob.correct()

In [None]:
# df['xxx']=df['cleaned_text'].apply(lambda x: str(TextBlob(x).correct()))

In [None]:
# df.xxx.head()

### Removing stop words


Stop words are words such as "the", "an", and "so" that occur frequently in text but give the model little useful information. Removing them lets us concentrate on the more informative words in the text.

In [None]:
from nltk.corpus import stopwords

# A set gives fast membership checks; the list returned by NLTK would be scanned linearly.
stopwords_english = set(stopwords.words('english'))

def remove_stopwords(text):
    # Named remove_stopwords so it does not shadow the imported nltk `stopwords` corpus.
    return ' '.join(word for word in text.split() if word not in stopwords_english)

In [None]:
#applying the function on the dataset.
df['cleaned_text'] = df['cleaned_text'].apply(remove_stopwords)

In [None]:
def token(text):
    # Split the text into word tokens (runs of letters, digits, and underscores).
    return re.findall(r'\w+', text)

# apply() replaces the row-by-row chained `.iloc` assignment, which may not write back to df.
df['tokened_text'] = df['cleaned_text'].str.lower().apply(token)

### Stemming

Stemming reduces words to their root form. For example, the stem of "walking", "walks", and "walked" is "walk".
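
As a rough illustration of the idea, the sketch below strips a few common suffixes. It is a deliberately naive stand-in, not the Porter algorithm used in this notebook; the function name `naive_stem` and its suffix list are assumptions made for the example.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["walking", "walks", "walked", "walk"]
stems = [naive_stem(w) for w in words]
print(stems)

# The naive rule breaks down quickly: the doubled consonant is left in place.
print(naive_stem("running"))
```

Real stemmers such as Porter apply many more rules (for instance, handling doubled consonants), which is why the notebook relies on NLTK rather than a hand-written rule like this one.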

In [None]:
stemming = PorterStemmer()

def stemmer(list_token):
    # Stem each token of an already-tokenized review and return the list of stems.
    return [stemming.stem(word) for word in list_token]

In [None]:
df['stemmed_text'] = df['tokened_text'].apply(stemmer)

In [None]:
df.head()

### Splitting the large dataset into a smaller one

Since processing all 25,000 reviews takes a long time, I have reduced the dataset to 6,000 points, with an equal number of positive and negative reviews.

In [None]:
df["sentiment"] = df["sentiment"].map({"negative": 0, "positive": 1})

In [None]:
# Balanced sampling
    # Full dataset: 25k points
    # 3k points -> class 0
    # 3k points -> class 1

negative_samples = df[df["sentiment"] == 0].sample(n=3000, random_state=60)
positive_samples = df[df["sentiment"] == 1].sample(n=3000, random_state=60)

# Merge the two balanced samples and shuffle the rows
reduced_df = pd.concat([negative_samples, positive_samples]).sample(frac=1, random_state=60)

In [None]:
reduced_df.shape


## TF-IDF Vectorization

TF-IDF scores a term as more relevant to a document the more often it appears in that document, and as less relevant the more documents in the corpus contain it. TF (term frequency) measures how often the term occurs in a document, while IDF (inverse document frequency) measures how rare the term is across the corpus. The final TF-IDF value is the product of the two.
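
The product can be worked through by hand on a tiny corpus. The sketch below uses the textbook form, TF as a relative count and IDF as `log(N / df)`; the three toy documents are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus.
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)            # share of the document taken by the term
    idf = math.log(n_docs / doc_freq[term])    # rarer across documents -> larger value
    return tf * idf

scores = {term: round(tf_idf(term, docs[0]), 3) for term in sorted(set(docs[0]))}
print(scores)
```

In the first document, "plot" scores highest even though "good" occurs twice, because "plot" appears in only one document while "good" and "movie" each appear in two.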

In [None]:
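from sklearn.feature_extraction.text import TfidfVectorizer

# The callable analyzer receives each token list and returns its stems.
tfidf = TfidfVectorizer(analyzer = stemmer)
X_tfidf = tfidf.fit_transform(reduced_df['tokened_text'])
print(X_tfidf.shape)
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out().
print(tfidf.get_feature_names_out())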

In [None]:
tfidf_df = pd.DataFrame(X_tfidf.toarray())
# get_feature_names_out() replaces the removed get_feature_names().
tfidf_df.columns = tfidf.get_feature_names_out()
tfidf_df

In [None]:
def count_punctuation(text):
    # Fraction of non-space characters that are punctuation, rounded to 3 places.
    non_space = len(text) - text.count(" ")
    if non_space == 0:
        return 0.0
    count = sum(1 for char in text if char in string.punctuation)
    return round(count / non_space, 3)

reduced_df['body_len'] = reduced_df['review'].apply(lambda x: len(x) - x.count(" "))
reduced_df['punct%'] = reduced_df['review'].apply(count_punctuation)

In [None]:
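# Reset the shuffled index so concat pairs rows by position rather than by mismatched labels.
X_features = pd.concat([reduced_df[['body_len', 'punct%']].reset_index(drop=True), pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()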