## Sentiment Classification using Logistic Regression

##### An NLP task: classifying the sentiment of IMDB movie reviews as positive or negative

In [1]:
import pandas as pd

df = pd.read_csv("input/imdb_movrev.csv")

df.head(5)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [2]:
print(df['sentiment'].unique())


['positive' 'negative' ' Jim Abrahams']


In [3]:
df = df[df['sentiment'] != ' Jim Abrahams']

print(df['sentiment'].unique())

['positive' 'negative']


In [4]:
# Reassign instead of using inplace=True on a filtered frame, which can trigger SettingWithCopyWarning
df = df.dropna(subset=['sentiment'])
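The two cleanup steps above (dropping the stray `' Jim Abrahams'` label and dropping missing labels) can also be written as a single allow-list filter. The following is a minimal sketch on a made-up toy frame, not the actual dataset; `toy`, `valid_labels`, and `clean` are illustrative names.

```python
import pandas as pd

# Toy frame standing in for the raw CSV (hypothetical rows).
# The stray label mimics a misparsed row; None mimics a missing label.
toy = pd.DataFrame({
    "review": ["great film", "awful", "broken row", "no label"],
    "sentiment": ["positive", "negative", " Jim Abrahams", None],
})

valid_labels = ["positive", "negative"]  # assumed to be the full label set

# isin() is False for unexpected strings and for missing values,
# so one filter removes both kinds of bad rows.
clean = toy[toy["sentiment"].isin(valid_labels)].copy()

print(clean["sentiment"].unique())
```

An allow-list has the advantage that any other unexpected label is removed as well, without having to name it explicitly.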

#### Convert Sentiment to Binary Encoding

In [5]:
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
df.head(5)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


#### Shuffling the DataFrame

In [6]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
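An equivalent shuffle can be expressed with pandas alone, which avoids setting NumPy's global seed. This is a sketch on a toy frame; `toy` and `shuffled` are illustrative names, and the resulting row order will differ from the `np.random.permutation` order used above.

```python
import pandas as pd

toy = pd.DataFrame({"review": ["a", "b", "c", "d"], "sentiment": [1, 0, 1, 0]})

# frac=1 samples every row without replacement, i.e. a full shuffle.
# random_state makes the order reproducible; reset_index(drop=True)
# discards the old row labels so the index runs 0..n-1 again.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled)
```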

In [7]:
df.to_csv('input/movie_data.csv', index=False, encoding='utf-8')

In [8]:
df = pd.read_csv('input/movie_data.csv')
df.head(5)

| | review | sentiment |
|---|---|---|
| 0 | John Cassavetes is on the run from the law. He... | 1 |
| 1 | I must say I was surprised to find several pos... | 0 |
| 2 | A March 1947 New York Times article described ... | 1 |
| 3 | "Eaten Alive" goes down much easier than Rugge... | 0 |
| 4 | In this tale of a tightly wound Christian fami... | 1 |


In [9]:
df.shape

(49998, 2)

#### Text Preprocessing

In [10]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                       # Lowercase
    text = re.sub(r'[^\w\s]', '', text)       # Remove punctuation
    text = ' '.join([word for word in text.split() if word not in stop_words]) # Remove stopwords
    return text

df['review'] = df['review'].apply(clean_text)
df.head(5)





| | review | sentiment |
|---|---|---|
| 0 | john cassavetes run law bottom heap sees negro... | 1 |
| 1 | must say surprised find several positive comme... | 0 |
| 2 | march 1947 new york times article described cr... | 1 |
| 3 | eaten alive goes much easier ruggero deodatos ... | 0 |
| 4 | tale tightly wound christian family three four... | 1 |
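One thing `clean_text` does not handle is HTML markup: the raw reviews contain `<br />` tags, and removing punctuation alone turns each tag into a stray `br` token that can fuse with the preceding word. The sketch below shows one possible fix, applied to a made-up sample string; `TAG_RE` and `strip_html` are hypothetical helpers, not part of the notebook, and adding this step would likely change the scores reported later.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches simple tags such as <br />

def strip_html(text):
    """Replace HTML tags with a space so neighbouring words stay separate."""
    return TAG_RE.sub(" ", text)

# Made-up sample, not a real review from the dataset
sample = "Great pacing.<br /><br />The ending works."

# Punctuation removal alone leaves tag residue in the tokens
without_strip = re.sub(r"[^\w\s]", "", sample.lower()).split()

# Stripping tags first, then applying the same lowercase + punctuation steps
cleaned = re.sub(r"[^\w\s]", "", strip_html(sample).lower()).split()

print(without_strip)  # ['great', 'pacingbr', 'br', 'the', 'ending', 'works']
print(cleaned)        # ['great', 'pacing', 'the', 'ending', 'works']
```

If adopted, `strip_html` would run as the first step inside `clean_text`, before lowercasing.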


#### Feature Extraction with TF-IDF

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

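# Keep only the 5,000 terms with the highest frequency across the corpus
vectorizer = TfidfVectorizer(max_features=5000)
# fit_transform returns a sparse matrix; .toarray() converts it to a dense NumPy array
X = vectorizer.fit_transform(df['review']).toarray()
y = df['sentiment'].values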

#### Split Data for Training and Testing

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
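Note that the vectorizer above was fitted on the full dataset before the split, so the vocabulary and IDF weights have already seen the test reviews. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a pipeline, so that TF-IDF statistics come from the training portion only. The following is a sketch on a made-up eight-review corpus, not the notebook's actual procedure; `texts`, `labels`, and `pipe` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy reviews (1 = positive, 0 = negative)
texts = [
    "wonderful film with great acting",
    "great story and wonderful cast",
    "loved the great soundtrack",
    "a wonderful heartfelt story",
    "terrible film with boring acting",
    "boring story and awful cast",
    "hated the awful soundtrack",
    "a terrible pointless story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class balance in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier
pipe = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
pipe.fit(train_texts, train_labels)

preds = pipe.predict(test_texts)  # test text is only transformed, never fitted
print(preds)
```

On a dataset of this size the effect of the leakage on accuracy is probably small, but the pipeline form also makes cross-validation straightforward.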


#### Training the Logistic Regression Model

In [13]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)


#### Evaluating the Model

In [14]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.8861

Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.88      0.89      5019
           1       0.88      0.89      0.89      4981

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000


Confusion Matrix:
 [[4406  613]
 [ 526 4455]]
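The figures in the classification report follow directly from the confusion matrix. scikit-learn places true labels on the rows and predicted labels on the columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The short check below recomputes the reported metrics from the four counts printed above.

```python
# Counts taken from the confusion matrix above
tn, fp = 4406, 613   # true class 0: predicted 0, predicted 1
fn, tp = 526, 4455   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Class 1 (positive reviews)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * tp / (2 * tp + fp + fn)

# Class 0 (negative reviews)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * tn / (2 * tn + fp + fn)

print(f"accuracy: {accuracy:.4f}")
print(f"class 0: precision={precision_neg:.2f} recall={recall_neg:.2f} f1={f1_neg:.2f}")
print(f"class 1: precision={precision_pos:.2f} recall={recall_pos:.2f} f1={f1_pos:.2f}")
```

The errors are fairly balanced: 613 negative reviews were predicted positive and 526 positive reviews were predicted negative.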
