<a href="https://colab.research.google.com/github/moon11boon/moon11boon/blob/main/Copy_of_PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses the IMDb dataset of 50,000 movie reviews. It has two main columns: 'review' (the review text) and 'sentiment' (a label, either 'positive' or 'negative').

The primary goal of this project is to build a machine learning model that predicts the sentiment (positive or negative) of a movie review. This is a binary text classification task, as there are only two possible outcomes: positive or negative sentiment.


Here, the essential libraries for data preprocessing and analysis are imported, such as pandas, numpy, and nltk. The IMDb dataset is also uploaded to Google Colab.

In [7]:
#importing libraries
import pandas as pd
import numpy as np
import string
import nltk
from nltk.tokenize import word_tokenize

In [8]:
from google.colab import files
uploaded = files.upload()

Saving imdb2dataset.csv to imdb2dataset.csv


NLTK's 'punkt' tokenizer data is downloaded; word_tokenize needs it during text preprocessing.

In [9]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:

nltk.download('punkt')

df = pd.read_csv('imdb2dataset.csv')

df.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [11]:
#defining function for text preprocessing
def preprocess_text(text):
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])
    tokens = word_tokenize(text)
    return ' '.join(tokens)

In [12]:
#applying text pre-processing to the 'review' column
df['review'] = df['review'].apply(preprocess_text)

#encoding sentiment labels as integers (positive -> 1, negative -> 0)
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

#displaying pre-processed data
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production br br the filmin...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically theres a family where a little boy j...,0
4,petter matteis love in the time of money is a ...,1
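The preprocessed output above still contains stray "br" tokens, because removing punctuation from an HTML line break such as `<br />` leaves the letters behind. One possible adjustment is to strip the tags before removing punctuation. The sketch below is an assumption, not part of the original pipeline: the name `preprocess_text_no_html`, the regular expression, and the sample string are hypothetical, and plain whitespace splitting stands in for `word_tokenize` so the example runs without NLTK data.

```python
import re
import string

# Hypothetical pattern: matches anything that looks like an HTML tag, e.g. <br />
TAG_PATTERN = re.compile(r"<[^>]+>")

def preprocess_text_no_html(text):
    # Replace tags with a space so neighbouring words do not get glued together
    text = TAG_PATTERN.sub(" ", text)
    text = text.lower()
    # Same punctuation filter as preprocess_text
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Whitespace split used here in place of word_tokenize (assumption)
    return " ".join(text.split())

# Made-up sample review, not taken from the dataset
sample = "Great film.<br /><br />Loved it!"
cleaned = preprocess_text_no_html(sample)
print(cleaned)
```

Applying a change like this would alter the TF-IDF vocabulary, so the scores reported later in this notebook would need to be recomputed.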


In [13]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [14]:
X = df['review']  #text
y = df['sentiment']  #labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#displaying shapes of training and testing set
print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

Training data shape: (40000,)
Testing data shape: (10000,)


In [15]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  #keeping only the 1000 most frequent terms as features
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
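To make the TF-IDF step more concrete, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as I understand them: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The toy corpus and the helper name `tfidf_row` are made up for illustration, and the sketch ignores `max_features` and scikit-learn's own tokenization rules.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), tokenized by whitespace
docs = ["good movie", "bad movie", "good good fun"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Vocabulary in sorted order, one column per term
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: number of documents containing each term
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    # L2-normalize so every document vector has unit length
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the second document, "bad" occurs in only one review while "movie" occurs in two, so "bad" receives the larger weight. This is the property that lets rarer, more distinctive words carry more signal for the classifier.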

In [16]:
#initializing and training the LogisticRegression model
model = LogisticRegression(max_iter=5500)
model.fit(X_train_tfidf, y_train)

Predictions were made on the testing data, and key performance metrics were calculated.
These metrics include accuracy, a classification report providing precision, recall, and F1-score, and a confusion matrix illustrating the model's performance in a tabular form.

In [17]:
#making predictions on the testing data

y_pred = model.predict(X_test_tfidf)


#accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")


#classification report

report = classification_report(y_test, y_pred)

print("Classification Report:\n", report)


#confusion matrix

confusion = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:\n", confusion)

Accuracy: 0.87
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.86      0.87      4961
           1       0.87      0.88      0.87      5039

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Confusion Matrix:
 [[4274  687]
 [ 614 4425]]
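The figures in the classification report can be recovered directly from the confusion matrix. As a worked check, the sketch below plugs in the four counts printed above, reading the matrix in scikit-learn's layout (rows are true labels, columns are predicted labels, with class 0 first). The variable names are chosen here for illustration only.

```python
# Counts copied from the confusion matrix above
tn, fp = 4274, 687   # true class 0: predicted 0, predicted 1
fn, tp = 614, 4425   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

# Share of all reviews classified correctly
accuracy = (tp + tn) / total

# Class 1 (positive): how many predicted positives are right, and how many real positives are found
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Class 0 (negative): same quantities with the roles of the classes swapped
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy:      {accuracy:.4f}")
print(f"precision (1): {precision_pos:.2f}  recall (1): {recall_pos:.2f}  f1 (1): {f1_pos:.2f}")
print(f"precision (0): {precision_neg:.2f}  recall (0): {recall_neg:.2f}")
```

The rounded values agree with the report: 687 negative reviews were misread as positive and 614 positive reviews as negative, so the errors are split fairly evenly between the two classes.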


In short, this project addresses binary text classification of movie reviews, predicting whether a review's sentiment is positive or negative. The notebook implements text preprocessing, TF-IDF feature extraction, logistic regression training, and model evaluation, and the resulting model reaches 87% accuracy on the 10,000-review test set.