# Sentiment Analysis for Movie Reviews


The project involves building a sentiment analysis model using the IMDB movie review dataset. The goal is to classify movie reviews as either positive or negative.

Tools Needed: Python, Jupyter Notebook, pandas, scikit-learn, NLTK, and the IMDB movie review dataset

Step 1: Import libraries and load data
In this step, we will import the necessary libraries and load the IMDB movie review dataset.

In [1]:
import pandas as pd

data = pd.read_csv('imdb_dataset.csv')
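
Before moving on, it can help to confirm that the loaded table has the two columns the later steps rely on ('review' and 'sentiment') and that no reviews are missing, since a missing value would make the text-cleaning function fail. The following is a minimal sketch on a small stand-in DataFrame; the rows and the name data_demo are hypothetical and only illustrate the checks.

```python
import pandas as pd

# Hypothetical stand-in for the loaded CSV; the real data comes from pd.read_csv.
data_demo = pd.DataFrame({
    "review": ["Loved it", "Terrible plot", "A fine film", None],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# The later steps index data['review'] and data['sentiment'], so both must exist.
missing_cols = {"review", "sentiment"} - set(data_demo.columns)
print("Missing columns:", missing_cols)

# Count missing values per column; a missing review would break the cleaning step.
print(data_demo.isna().sum())

# Check how many examples each class has.
print(data_demo["sentiment"].value_counts())
```

The same three checks can be run on the real DataFrame by replacing data_demo with data.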

Step 2: Data cleaning and preprocessing
Before we start building the model, we need to preprocess the reviews: replace every non-letter character with a space, convert the text to lowercase, remove English stopwords, and stem the remaining words with the Porter stemmer. Run the following code to perform these preprocessing steps:

In [2]:
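import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
# Use a distinct name so the imported `stopwords` module is not overwritten,
# and a set so that membership tests are fast.
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    # Replace every non-letter character with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower()
    words = text.split()
    # Drop stopwords, then stem the remaining words
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

data['review'] = data['review'].apply(clean_text)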


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ivyajanga/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
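
One detail worth checking in the cleaning step: replacing non-letter characters with spaces turns an HTML tag such as `<br />` into the stray token "br". Whether the reviews in this CSV contain such tags is an assumption, so treat the following as an optional sketch. The helper names strip_html_tags and letters_only are hypothetical and not part of the notebook.

```python
import re

def strip_html_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def letters_only(text):
    """First steps of clean_text: keep letters, lowercase, split into words."""
    return re.sub("[^a-zA-Z]", " ", text).lower().split()

sample = "Great film.<br /><br />Loved it!"

# Without tag stripping, the tag name survives as a token.
print(letters_only(sample))
# -> ['great', 'film', 'br', 'br', 'loved', 'it']

# Stripping tags first removes it.
print(letters_only(strip_html_tags(sample)))
# -> ['great', 'film', 'loved', 'it']
```

If the reviews do contain tags, calling a function like strip_html_tags at the start of clean_text keeps "br" out of the vocabulary.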


Step 3: Split the dataset into training and testing sets
We will split the dataset into training and testing sets, with 80% of the data used for training and 20% used for testing. Run the following code to split the dataset:

In [3]:
from sklearn.model_selection import train_test_split

X = data['review']
y = data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
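
The split above shuffles the rows at random, so the share of positive and negative reviews in each part can differ slightly from the full dataset. If that matters, train_test_split accepts a stratify argument that keeps the class proportions the same in both parts. The sketch below uses a tiny hypothetical dataset (the reviews and labels lists are made up) to show the effect.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 reviews, half positive and half negative.
reviews = [f"review {i}" for i in range(10)]
labels = ["positive"] * 5 + ["negative"] * 5

# stratify=labels keeps the 50/50 class balance in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    reviews, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))  # 8 training reviews, 2 test reviews
print(sorted(y_te))          # one review of each class in the test part
```

On the real data this would correspond to adding stratify=y to the call in the cell above.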


Step 4: Vectorize the text
We need to convert the text into a numerical format that can be used by machine learning algorithms. We will use the "CountVectorizer" class from the "sklearn.feature_extraction.text" module to convert the text into a bag-of-words representation. Run the following code to vectorize the text:

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
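
To make the bag-of-words idea concrete, here is a minimal hand-written sketch using only the standard library. It is not how CountVectorizer is implemented; it only mirrors the idea: build a vocabulary from the training texts, then count how often each vocabulary word occurs in a document. The documents and the helper to_counts are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned training reviews (already lowercased and stemmed).
train_docs = ["good movi good act", "bad movi"]

# "Fit": the vocabulary is every distinct word seen in the training texts.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})
print(vocabulary)  # ['act', 'bad', 'good', 'movi']

def to_counts(doc):
    """'Transform': count occurrences of each vocabulary word in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

print([to_counts(doc) for doc in train_docs])  # [[1, 0, 2, 1], [0, 1, 0, 1]]

# A test document is mapped onto the training vocabulary only,
# so the unseen word "great" is ignored.
print(to_counts("great movi"))  # [0, 0, 0, 1]
```

This is also why the cell above calls fit_transform on the training set but only transform on the test set: the vocabulary must come from the training data alone.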


Step 5: Train the model
We will use the "LogisticRegression" class from the "sklearn.linear_model" module to train the sentiment analysis model. Run the following code to train the model:

In [5]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
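
The message above is a convergence warning: the solver hit its iteration limit (max_iter, which defaults to 100 in LogisticRegression) before it finished optimizing. The model is still usable, but as the warning itself suggests, one option is to raise max_iter. The sketch below shows this on a tiny hypothetical count matrix. The value 1000 is an arbitrary choice for illustration, not a tuned setting for the IMDB data.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical word-count features (rows = reviews, columns = words).
X_toy = [[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0], [1, 0, 0], [0, 1, 2]]
y_toy = ["positive", "positive", "negative", "negative", "positive", "negative"]

# Allow up to 1000 solver iterations instead of the default 100.
model_toy = LogisticRegression(max_iter=1000)
model_toy.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used.
print("iterations used:", model_toy.n_iter_[0], "of", model_toy.max_iter)
```

If n_iter_ stays below max_iter on the real data, the solver converged and the warning should no longer appear.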

Step 6: Evaluate the model
We will use the accuracy score to evaluate the performance of the model on the test set. Run the following code to evaluate the model:

In [6]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.8821
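
Accuracy is the fraction of test reviews whose predicted label matches the true label. A single number does not show which kind of mistake the model makes, so it can be useful to count each (true label, predicted label) pair as well. The following sketch does both by hand on hypothetical labels; the lists and the label strings "positive" and "negative" are assumptions for illustration.

```python
from collections import Counter

# Hypothetical true labels and predictions, for illustration only.
y_true_demo = ["positive", "positive", "negative", "negative", "positive"]
y_pred_demo = ["positive", "negative", "negative", "negative", "positive"]

# Accuracy = number of correct predictions / total number of predictions.
correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
accuracy_demo = correct / len(y_true_demo)
print("Accuracy:", accuracy_demo)  # 4 of 5 correct -> 0.8

# Count each (true label, predicted label) pair to see which errors occur.
confusion = Counter(zip(y_true_demo, y_pred_demo))
for (true_label, predicted_label), count in sorted(confusion.items()):
    print(f"true={true_label:8} predicted={predicted_label:8} count={count}")
```

In this toy example the only error is a positive review predicted as negative.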


Findings
The sentiment analysis model achieved an accuracy of 88.21% on the test set, which is a good result.



