# Challenge: Sentiment Analysis for IMDb Movie Reviews

### About Dataset


> IMDB dataset having 50K movie reviews for natural language processing or Text analytics.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training and 25,000 for testing. So, predict the number of positive and negative reviews using either classification or deep learning algorithms.

#### [Link to the Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

### 1. Import Libraries and Dataset

**1.1 Import the libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from pprint import pprint
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
import seaborn as sns
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB
from gensim.models import Word2Vec


**1.2 Download Modules from nltk**

In [2]:
nltk.download(['stopwords', 'movie_reviews', 'wordnet'])
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/priyankadhamija/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/priyankadhamija/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/priyankadhamija/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/priyankadhamija/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**1.3 Read the dataset**

In [3]:
pd.set_option('display.max_colwidth', None)  # Display the full text of each column
df = pd.read_csv(r'Imdb Dataset.csv')
df.head(3)

FileNotFoundError: [Errno 2] No such file or directory: 'Imdb Dataset.csv'

### 2. Data Manipulation & EDA

In [None]:
print('Shape of the dataset:', df.shape)
print('Summary of the dataset:') 
df.describe() #There are a few duplicates in the reviews

**2.1 View and Drop duplicates**

In [None]:
#Drop the duplicates
df = df.drop_duplicates()

**2.1.1 Verify if there are any null values, No nulls**

In [None]:
df[df.isnull().any(axis = 1)]

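**2.1.2 Remove standalone numbers and single-character words**

In [None]:
# The pattern matches whole tokens that are either a run of digits or a single word character
df['review'] = df['review'].str.replace(r'\b(?:\d+|\w)\b\s*', '', regex=True)
df.describe()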
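
To see what the pattern in 2.1.2 matches, here is a small standalone check. The sample sentence is made up for illustration and does not come from the dataset. It suggests that the regex drops standalone numbers and also any single-character word such as "I" or "a".

```python
import re

# Same pattern as in the cell above
pattern = r'\b(?:\d+|\w)\b\s*'

sample = "I watched 2 movies in 1999 and a sequel"  # hypothetical review text
cleaned = re.sub(pattern, '', sample)
print(cleaned)
```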

**2.2 Check class balance. Class looks balanced**

In [None]:
df['sentiment'].value_counts()

**2.3 Plot sentiment class distribution**

**2.3.1 Plot Pie Chart for % Class Distribution**

In [None]:
sentiment_sort = df['sentiment'].value_counts().sort_values(ascending=False)
ax = sentiment_sort.plot.pie(autopct="%1.1f%%", colors=["wheat", "lavender"], labels=sentiment_sort.index, startangle=45)
ax.set_title("Distribution of Sentiments %")
ax.set_ylabel('')  # hide the default y-axis label on the pie's own axes
plt.show()

**2.3.2 Plot Bar Chart for Abs Class Distribution**

In [None]:
ax = sns.countplot(data=df, x='sentiment', palette=["wheat", "lavender"])
plt.title('Distribution of Sentiments')
plt.show()

**2.4 Remove html tags and convert to lower case**

In [None]:
# Write a function to remove html tags
def remove_html_tags(col):
    col = (col.str.replace(r'<[^<>]*>', '', regex=True))
    return col

In [None]:
df['review'] = df['review'].astype(str).str.lower()
df['review'] = remove_html_tags(df['review'])
df.head(3)
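
The tag-removal step can be tried on a single string. The sketch below uses the same pattern as `remove_html_tags`; the sample text and the `replacement` parameter are assumptions added for illustration. It shows one possible side effect: replacing a tag with an empty string can glue two words together when there is no whitespace around the tag, which may contribute to the merged words noted later in Insight 3. Substituting a space is one alternative.

```python
import re

TAG_PATTERN = r'<[^<>]*>'  # same pattern as remove_html_tags

def strip_tags(text, replacement=''):
    """Remove HTML tags from text, putting `replacement` where each tag was."""
    return re.sub(TAG_PATTERN, replacement, text)

sample = "a classic<br /><br />the cast is superb"  # hypothetical review text

glued = strip_tags(sample)                     # same behaviour as the notebook
spaced = ' '.join(strip_tags(sample, ' ').split())  # keep a word boundary, then collapse spaces

print(glued)
print(spaced)
```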

In [None]:
df.describe()

**Drop the duplicates again**

In [None]:
df = df.drop_duplicates()
#df.describe()

### 3. Use NLTK to analyze the Reviews

**3.1 Lemmatize the words**

In [None]:
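lemmatizer = WordNetLemmatizer()  # create once and reuse for every word

def lemmatize_words(text):
    # Lemmatize each whitespace-separated word as a verb (pos='v')
    words = text.split()
    words = [lemmatizer.lemmatize(word, pos='v') for word in words]
    return ' '.join(words)

df['review'] = df['review'].apply(lemmatize_words)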
df.head(3)

In [None]:
df.describe()

**3.2 Remove English stopwords from nltk**

**3.2.1 Load English stopwords from nltk**

In [None]:
stopwords = nltk.corpus.stopwords.words("english")

**3.2.2 Add other stopwords like movie**

In [None]:
stopwords.extend(['movies', 'movie', 'film', 'films'])

**3.2.3 Tokenize the words**

In [None]:
#Create an additional column with tokenized words
regexp = RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning
df['review_token']=df['review'].apply(regexp.tokenize)
df.head(3)

**3.2.4 Remove the Stopwords from token**

In [None]:
df['review_token'] = df['review_token'].apply(lambda x: [item for item in x if item not in stopwords])
df.head(3)
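
As a rough illustration of steps 3.2.3 and 3.2.4, the sketch below tokenizes one sentence and filters it against a stopword set using only the standard library. The stopword set and the sentence are made up; `re.findall` with `\w+` is used as a stand-in that should behave comparably to `RegexpTokenizer(r'\w+')`.

```python
import re

# A tiny hand-written stopword set, standing in for nltk's English list
toy_stopwords = {"the", "a", "is", "this", "movie", "film"}

def tokenize(text):
    """Split text into runs of word characters."""
    return re.findall(r'\w+', text)

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword collection."""
    return [token for token in tokens if token not in stopwords]

tokens = tokenize("this movie is a delight, the plot surprise me")
filtered = remove_stopwords(tokens, toy_stopwords)
print(filtered)
```

Using a `set` for the stopwords keeps each membership test cheap, which can matter when filtering tens of thousands of reviews.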

**3.3 Create a list of all words to check the frequency**

In [None]:
# Drop one-character tokens and join the rest back into a cleaned review string
df['review_cleaned'] = df['review_token'].apply(lambda x: ' '.join([item for item in x if len(item)>1]))

# Join all cleaned reviews into one long string
all_words = ' '.join(df['review_cleaned'])

# Tokenize all words together
tokenized_words = nltk.tokenize.word_tokenize(all_words)
fdist = FreqDist(tokenized_words)
fdist

**3.4 Check the distribution for word frequency**

In [None]:
df_word_frequency = pd.DataFrame(list(fdist.items()), columns = ["Word","Frequency"]).sort_values(by = 'Frequency', ascending = False).reset_index(drop = True)
df_word_frequency.head(15)

**Count total words**

In [None]:
df_word_frequency.shape

**3.4.1 Check how many words occur only once and the distribution of words**

In [None]:
print(df_word_frequency.describe())
print('Number of words that occur only once:', (df_word_frequency['Frequency'] == 1).sum())
# Word frequencies follow a power-law distribution: a few words are very common and most are rare

**Insight 1: About 40% of the distinct words across all reviews occur only once**
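
The share of words that occur only once can be computed directly from the counts. Here is a minimal sketch using `collections.Counter` on three made-up reviews, which stand in for `df['review_cleaned']`. The numbers it prints describe the toy data only, not the IMDb dataset.

```python
from collections import Counter

# Hypothetical cleaned reviews, standing in for df['review_cleaned']
toy_reviews = [
    "great act great story",
    "bad act weak story",
    "great cast",
]

# Count every word across all reviews
counts = Counter(word for review in toy_reviews for word in review.split())

# Words with a frequency of exactly one
once = [word for word, n in counts.items() if n == 1]
share_once = len(once) / len(counts)

print(f"{len(once)} of {len(counts)} distinct words occur once ({share_once:.0%})")
```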

**3.4.2 Plot the Frequency Distribution of Top Words**

In [None]:
df_freq = df_word_frequency[ df_word_frequency['Frequency'] > 1]
fig = plt.figure(figsize = (15, 5))
plt.bar(df_freq['Word'].head(40), df_freq['Frequency'].head(40), color = 'blue', width = 0.8)
plt.xticks(rotation=45)
plt.show()

**Insight 2: Words like movie and film have the largest number of occurrences.**
We removed those words because they add little value for predicting sentiment in a movie dataset. After removing them, words like one, make, and like have the most occurrences.

**3.4.3 Show all words like movie**

In [None]:
[col for col in df_word_frequency['Word'] if 'movie' in col]

**Insight 3: There are typing errors, such as a missing space between two words.
This might be one of the reasons why 40% of the words occur only once.**

**3.4.4 Look at the words with the fewest occurrences**

In [None]:
# The long tail of rare words is typical of the power-law distribution seen in natural language
df_word_frequency.tail(15)

In [None]:
df.head(1)

**3.5 Show WordCloud for Positive and Negative Reviews**

**3.5.1 Build the function for wordcloud**

In [None]:
def wordcloud_text(df, col_name):
    review_words = ' '.join([word for word in df[col_name]])
    return review_words

**3.5.2 WordCloud for Positive Reviews**

In [None]:
positive_words = wordcloud_text(df[df.sentiment == 'positive'], 'review_cleaned')
plt.figure(figsize=(10,10))
positive_wordcloud = WordCloud(max_words=400, width=3000, height=1500, background_color = 'white').generate(positive_words)
plt.imshow(positive_wordcloud, interpolation='bilinear')
plt.title('Frequent words in Positive Reviews', fontsize = 20)
plt.show()

**3.5.3 WordCloud for Negative Reviews**

In [None]:
negative_words = wordcloud_text(df[df.sentiment == 'negative'], 'review_cleaned')
plt.figure(figsize=(10,10))
negative_wordcloud = WordCloud(max_words=400, width=3000, height=1500, background_color = 'white').generate(negative_words)
plt.imshow(negative_wordcloud, interpolation='bilinear')
plt.title('Frequent words in Negative Reviews', fontsize = 20)
plt.show()

### 4. Split data into train and test (70/30)

In [None]:
training, test = train_test_split(df, test_size=0.30, random_state=10)
# Define independent variable (x) and dependent variable (y)
train_x, train_y = training.review.tolist(), training.sentiment.tolist()
# train_x, train_y = [training.review for x in training], [training.sentiment for y in training]

# Test Dataset
test_x, test_y = test.review.tolist(), test.sentiment.tolist()


print("Size of train dataset: ",training.shape )
print("Size of test dataset: ",test.shape )

Check the class balance for train and test

In [None]:
print(training['sentiment'].value_counts(normalize=True).round(3))
print(test['sentiment'].value_counts(normalize=True).round(3))

**Bag of Words**

In [None]:
# Using CountVectorizer in sklearn, create a bag of words
corpus = df['review_cleaned'].to_list()  # only used by the commented-out preview below
vectorizer = CountVectorizer()

#print(vectorizer.get_feature_names_out()) # the vocabulary, available after fitting
#print(vectorizer.fit_transform(corpus).toarray())


# Convert train data to feature vector matrix
train_x_vectors = vectorizer.fit_transform(train_x)
print('Shape of Train Data Vector', train_x_vectors.shape)

# Convert test data to feature vector matrix
test_x_vectors = vectorizer.transform(test_x)
print('Shape of Test Data Vector', test_x_vectors.shape)
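
To make the fit/transform split concrete, here is a simplified pure-Python sketch of a bag of words. It is not how `CountVectorizer` is implemented (for example, it splits on whitespace only), and the function names and documents are made up. It illustrates why the vocabulary is learned from the training data only: words that appear only in the test data have no column and are ignored.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each distinct training word to a column index (sorted for stability)."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(documents, vocabulary):
    """Turn each document into a list of word counts; unseen words are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

toy_train = ["good plot good cast", "bad plot"]  # hypothetical documents
toy_test = ["good ending"]                       # "ending" never appears in training

vocabulary = fit_vocabulary(toy_train)
train_matrix = transform(toy_train, vocabulary)
test_matrix = transform(toy_test, vocabulary)

print(vocabulary)
print(train_matrix)
print(test_matrix)
```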

### Word2Vec 
Word2Vec is a method for generating word embeddings. It uses a shallow neural network to learn a dense vector for each word from a large corpus of documents. Its advantage over one-hot encoding is that the vectors capture semantic relationships between words: words used in similar contexts end up close together. Word2Vec is a feature transformation, not a classifier. Once we have generated the embeddings, we can feed them to a classifier (an SVM in this notebook) to predict the sentiment. Another model that can be used to produce embeddings is BERT.
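
The difference between one-hot vectors and dense embeddings can be sketched with cosine similarity. The dense vectors below are made up by hand and were not learned by Word2Vec; they only illustrate the idea that related words can point in similar directions, while distinct one-hot vectors are always orthogonal.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors over a three-word vocabulary: good, great, awful
one_hot = {"good": [1, 0, 0], "great": [0, 1, 0], "awful": [0, 0, 1]}

# Hand-picked 2-d vectors for illustration only, NOT trained embeddings
dense = {"good": [0.9, 0.1], "great": [0.8, 0.3], "awful": [-0.7, 0.2]}

print(cosine(one_hot["good"], one_hot["great"]))  # distinct one-hot vectors share nothing
print(cosine(dense["good"], dense["great"]))      # similar direction
print(cosine(dense["good"], dense["awful"]))      # opposite direction
```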

### BERT
BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional model: instead of reading the text only left to right or only right to left, it uses the context on both sides of each word. It is a transformer-based language model introduced by Google in 2018 and pre-trained on a large text corpus.

### tf-idf
tf-idf measures how important a word is to a document within a corpus. It accounts for the power-law distribution of word frequencies in language by down-weighting words that appear in many documents.

Term frequency (tf) is the relative frequency of a term within a document. Inverse document frequency (idf) measures how much information the word provides: it is the log of the inverse of the fraction of documents that contain the term. The resulting weights tend to filter out common words.

tf-idf = tf * idf <br>
tf = (number of times the term appears in the document) / (number of terms in the document) <br>
idf = log((number of documents in the corpus) / (number of documents in the corpus containing the term))
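
The formulas above can be worked through on a toy corpus. This is a direct transcription of the textbook definitions given here, with made-up documents and function names. It is not what `TfidfVectorizer` computes exactly: scikit-learn smooths the idf term and normalizes each row by default, so its numbers will differ.

```python
import math

def tf(term, document):
    """Occurrences of the term divided by the number of terms in the document."""
    words = document.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    """log(number of documents / number of documents containing the term).

    Assumes the term appears in at least one document.
    """
    containing = sum(1 for document in corpus if term in document.split())
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

toy_corpus = ["great plot great cast", "weak plot", "dull plot dull cast"]  # hypothetical

print(tf_idf("great", toy_corpus[0], toy_corpus))  # term found in one document only
print(tf_idf("plot", toy_corpus[0], toy_corpus))   # term found in every document
```

In this toy corpus "plot" appears in every document, so its idf is log(1) = 0 and its weight vanishes, which is the filtering effect described above.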


**For this assignment, we're going to try out the Word2Vec, tf-idf, and Bag of Words techniques**

In [None]:
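# Write a function to convert words to embeddings using Word2Vec

# Word2Vec expects each document as a list of tokens, not a raw string
train_tokens = [document.split() for document in train_x]
model = Word2Vec(train_tokens, vector_size=500, window=5, min_count=1)

def w2v_embedding(col):
    return_list = []
    for document in col:
        embeddings = [model.wv[word] for word in document.split() if word in model.wv]
        # Average the word vectors; use a zero vector if no word is in the vocabulary
        if embeddings:
            average_embedding = np.mean(embeddings, axis=0)
        else:
            average_embedding = np.zeros(model.wv.vector_size)
        return_list.append(average_embedding)
    return return_list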

In [None]:
# Convert text to word embeddings using word2vec
train_x_w2v = w2v_embedding(train_x)
test_x_w2v = w2v_embedding(test_x)

In [None]:
print(len(train_x_w2v))
print(len(test_x_w2v))
#train_x_w2v

### 5. Train Different Models

#### 5.1 SVM Classifier using Word2Vec
A Support Vector Machine builds a hyperplane, or a set of hyperplanes, in a high-dimensional space. It chooses the hyperplane with the largest margin, i.e. the largest distance to the nearest training data points of any class.

SVM is a good model for high-dimensional datasets, but training becomes slow on very large datasets.
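
As a rough illustration of the hyperplane idea, the sketch below scores two points against a fixed linear boundary. The weights `w` and offset `b` are made up; a trained SVM would learn them from the data. The sign of `w.x + b` gives the predicted side, and dividing its absolute value by the length of `w` gives the distance to the hyperplane.

```python
import math

# Hypothetical hyperplane parameters for a 2-d problem
w = [2.0, -1.0]
b = -1.0

def decision_value(x):
    """Signed score w.x + b; its sign is the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(x):
    """Geometric distance |w.x + b| / ||w|| from point x to the hyperplane."""
    return abs(decision_value(x)) / math.sqrt(sum(wi * wi for wi in w))

points = {"a": [2.0, 1.0], "b": [0.0, 1.0]}  # made-up points
for name, x in points.items():
    side = "positive" if decision_value(x) >= 0 else "negative"
    print(name, side, round(distance_to_hyperplane(x), 3))
```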

In [None]:
# Train the SVM classifier
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_w2v, train_y)

# Evaluate the model on test set
pred_svm = clf_svm.predict(test_x_w2v)
accuracy = accuracy_score(test_y, pred_svm)

print(f"Test set accuracy: {accuracy:.3f}")
print(classification_report(test_y, pred_svm))
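
The evaluation cells in this section report accuracy together with per-class precision and recall. The sketch below computes those quantities by hand on made-up labels, to show what the numbers mean. The label lists and function names are illustrative only and are not results from this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, from true/false positive counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels, not output from any model above
toy_true = ["positive", "positive", "negative", "negative", "positive"]
toy_pred = ["positive", "negative", "negative", "positive", "positive"]

print(accuracy(toy_true, toy_pred))
print(precision_recall(toy_true, toy_pred, "positive"))
```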

#### 5.2 Logistic Regression using Bag of Words
Logistic Regression is a classic classification model. It takes a linear combination of the input features and applies a sigmoid function to estimate the probability of the given class. <br> <br>
There are three types of Logistic Regression, depending on the dependent variable (DV):
<br> 1) **Binomial**: Only 2 possible DV values (0/1, True/False, Pass/Fail)
<br> 2) **Multinomial**: 3 or more possible unordered DV values (cat/dog/cow)
<br> 3) **Ordinal**: 3 or more possible ordered DV values (low/medium/high)
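
For the binomial case, the sigmoid step can be sketched in a few lines. The coefficients and intercept below are made up; a fitted `LogisticRegression` would learn one coefficient per vocabulary word. The sketch assumes the features are word counts, as in the bag-of-words matrix.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two word-count features
weights = {"great": 1.2, "awful": -1.5}
intercept = 0.1

def predict_proba_positive(word_counts):
    """Probability of the positive class for a dict of word counts."""
    z = intercept + sum(
        weights[word] * count
        for word, count in word_counts.items()
        if word in weights
    )
    return sigmoid(z)

p = predict_proba_positive({"great": 2, "awful": 0})
label = "positive" if p >= 0.5 else "negative"
print(round(p, 3), label)
```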

In [None]:
# Train the Logistic Regression model
clf_lr = LogisticRegression(solver='liblinear')
clf_lr.fit(train_x_vectors, train_y)

# Evaluate the model on test set
pred_lr = clf_lr.predict(test_x_vectors)
accuracy = accuracy_score(test_y, pred_lr)

print(f"Test set accuracy: {accuracy:.3f}")
print(classification_report(test_y, pred_lr))

#### 5.3 Random Forest Classifier using Bag of Words
A Random Forest constructs multiple decision trees and can handle both classification and regression. At each split, a tree searches for the best feature among a random subset of features. This randomness, combined with aggregating the trees' predictions, helps reduce overfitting.
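
Two of the ideas above, picking a random subset of candidate features and combining the trees' outputs, can be sketched without building real trees. Everything below is a toy: the feature names, the tree votes, and the helper functions are made up. The sketch uses a simple majority vote; scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities instead.

```python
from collections import Counter
import random

def random_feature_subset(features, k, rng):
    """Pick k candidate features at random, as a tree would at each split."""
    return rng.sample(features, k)

def majority_vote(predictions):
    """Return the most common label among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(10)  # fixed seed so the sketch is repeatable
features = ["great", "awful", "plot", "cast", "boring", "love"]  # hypothetical vocabulary
print(random_feature_subset(features, 2, rng))

tree_votes = ["positive", "negative", "positive"]  # hypothetical outputs of three trees
print(majority_vote(tree_votes))
```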

In [None]:
# Train the Random Forest model
clf_rfc = RandomForestClassifier()
clf_rfc.fit(train_x_vectors, train_y)

# Evaluate the model on test set
pred_rfc = clf_rfc.predict(test_x_vectors)
accuracy = accuracy_score(test_y, pred_rfc)

print(f"Test set accuracy: {accuracy:.3f}")
print(classification_report(test_y, pred_rfc))

#### 5.4 Naive Bayes using Term Frequency-Inverse Document Frequency (tf-idf) features

In [None]:
# Fit the tf-idf vectorizer on the training data and transform it in one step
clf_tdfidf = TfidfVectorizer()
train_x_tf = clf_tdfidf.fit_transform(train_x)

# Apply the fitted tf-idf vectorizer to the test data
test_x_tf = clf_tdfidf.transform(test_x)

In [None]:
# Train the Naive Bayes Classifier
clf_nb = MultinomialNB()
clf_nb.fit(train_x_tf, train_y)

# Evaluate the model on test set
pred_nb = clf_nb.predict(test_x_tf)
accuracy = accuracy_score(test_y, pred_nb)

print(f"Test set accuracy: {accuracy:.3f}")
print(classification_report(test_y, pred_nb))
                                                                

Of all the models tested, model X has the highest accuracy. 

### Next Steps:
1) Fine-tune some of these models <br>
2) Run a post-hoc analysis of where each model did and didn't do well <br>
3) Apply those learnings to further tune the models and clean the data to improve accuracy