<a href="https://colab.research.google.com/github/padmasre/sentiment_analysis/blob/main/notebook/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis using Machine Learning

The goal of this project is to train a machine learning model that will predict the sentiment of a movie review as "Positive" or "Negative".

Dataset used to traing the model: [IMDB dataset](https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format?resource=download)

## Setup Kaggle Token

Setup kaggle token to download kaggle dataset needed for this project

In [None]:
# Creating this folder and file to store kaggle API key to download the dataset
!mkdir -p ~/.kaggle
!touch ~/.kaggle/kaggle.json

In [None]:
kaggle_username = input("Kaggle Username:")
kaggle_key = input("Kaggle Key:")
api_token = {"username":kaggle_username,"key":kaggle_key}

In [None]:
import json

with open('/root/.kaggle/kaggle.json', 'w+') as file:
    json.dump(api_token, file)

!chmod 600 ~/.kaggle/kaggle.json

## Import Required Libraries

In [2]:
import re
import os
import pandas as pd
import pickle
from string import punctuation
from textblob import Word
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

In [32]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Data Loading and Preprocessing

In [12]:
path = '/content/drive/MyDrive/Colab Notebooks/sentiment_analysis'

train = pd.read_csv(f"{path}/dataset/Train.csv")
test = pd.read_csv(f"{path}/dataset/Test.csv")
validate = pd.read_csv(f"{path}/dataset/Valid.csv")

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    40000 non-null  object
 1   label   40000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 625.1+ KB


In [None]:
train.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5000 non-null   object
 1   label   5000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 78.2+ KB


In [None]:
validate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5000 non-null   object
 1   label   5000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 78.2+ KB


In [None]:
def preprocess(df):
    #HTML Tags removal
    df['text'] = df['text'].apply(lambda words: re.sub('<.*?>','',words))
    #Word Tokenization
    df['text'] = df['text'].apply(word_tokenize)
    #Lower case conversion
    df['text'] = df['text'].apply(lambda words: [x.lower() for x in words])
    #Punctuation removal
    df['text'] = df['text'].apply(lambda words: [x for x in words if not x in punctuation])
    #Number removal
    df['text'] = df['text'].apply(lambda words: [x for x in words if not x.isdigit()])
    #Stopword removal
    df['text'] = df['text'].apply(lambda words: [x for x in words if x not in stopwords.words('english')])
    #Frequent word removal
    temp = df['text'].apply(lambda words: " ".join(words))
    freq = pd.Series(temp).value_counts()[:10]
    df['text'] = df['text'].apply(lambda words: [x for x in words if x not in freq.keys()])
    #Lemmatization
    df['text'] = df['text'].apply(lambda words: " ".join([Word(x).lemmatize() for x in words]))
    return df

In [None]:
train = preprocess(train)
test = preprocess(test)
validate = preprocess(validate)

In [None]:
train.head()

Unnamed: 0,text,label
0,grew b watching loving thunderbird mate school...,0
1,put movie dvd player sat coke chip expectation...,0
2,people know particular time past like feel nee...,0
3,even though great interest biblical movie bore...,0
4,im die hard dad army fan nothing ever change g...,1


In [30]:
r = train.iloc[10:11, [0]]
r

Unnamed: 0,text
10,I can't believe people are looking for a plot ...


In [33]:
r['text'].apply(word_tokenize)

10    [I, ca, n't, believe, people, are, looking, fo...
Name: text, dtype: object

In [None]:
X_train = train.text
Y_train = train.label
X_validate = validate.text
Y_validate = validate.label
X_test = test.text
Y_test = test.label

## Model Training

Training a logistic regression model using the training dataset.
Leveraging scikit-learn pipeline to build the training pipeline for the model.


Trainine Steps
*   CountVectorizer - Convert a collection of text documents to a matrix of token counts.
*   LogisticRegression - Logistic Regression model



In [None]:
#Creating a Pipeline
clf = Pipeline(steps =[
('preprocessing', CountVectorizer()),
('classifier', LogisticRegression(max_iter=2000))
])
#Fitting the model
clf.fit(X_train, Y_train)

In [None]:
#Calculation model Scores
clf.score(X_valid, Y_valid)

0.8918

In [None]:
clf.score(X_test,Y_test)

In [None]:
p = clf.predict(X_test)

In [None]:
p

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
pickle.dump(clf, open(f'{path}/model/model.pkl', 'wb'))

In [None]:
print(f'Number of reviews classified as Positive: {list(p).count(1)}')
print(f'Number of reviews classified as Negative: {list(p).count(0)}')

Number of reviews classified as Positive: 2536
Number of reviews classified as Negative: 2464


In [13]:
model = pickle.load(open(f'{path}/model/model.pkl', 'rb'))

In [41]:
predict = pd.DataFrame(['This is a good review'])

In [42]:
my_series =predict.squeeze()
my_series

'This is a good review'

In [47]:
predictions = model.predict(['This is a good review', 'This is a good review']).tolist()

In [48]:
for i in predictions:
  if i == 1:
    print("This is a positive review")
  elif i == 0:
    print("This is a negative review")
  else:
    print("None")

This is a positive review
This is a positive review


In [39]:
chart_data = pd.DataFrame(
    [[200,0],[0,300]],
    columns=['Positive', 'Negative'])

In [40]:
chart_data

Unnamed: 0,Positive,Negative
0,200,0
1,0,300


In [21]:
  np.random.randn(20, 3)

array([[-1.05781617,  0.39334404, -1.01565912],
       [-1.48715301,  0.78074176,  1.57628683],
       [-0.59688476,  2.32707919, -1.20237068],
       [ 1.04110768, -0.52042361, -1.26620086],
       [-0.41468758,  0.0069709 , -0.3660296 ],
       [ 0.735953  ,  0.64858551,  1.67973523],
       [-0.71054453, -2.95103247, -0.93894642],
       [ 0.34588429, -0.35022688, -0.63195249],
       [ 1.04478357,  0.18685342,  0.04481071],
       [-1.53501423, -1.05195037,  1.8711162 ],
       [-2.0055381 ,  0.21996   , -0.82075488],
       [ 2.01911217,  1.65550793,  0.59411169],
       [ 0.41589276, -0.23802585, -0.81580313],
       [-0.94365842, -0.62713262,  0.15471939],
       [ 0.96469725, -1.28131103, -0.55416044],
       [ 1.18055898, -0.15264767, -1.31039208],
       [-1.84776827,  0.53921411,  0.19580684],
       [-0.99734574, -0.61300443,  0.29417201],
       [ 0.08702336,  0.49681679,  1.36052278],
       [-1.51795493, -0.95655675,  0.83811669]])