In [13]:
import pandas as pd
import re
from bs4 import BeautifulSoup

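def clean_text(text):
    # Strip HTML; the separator keeps words on either side of a tag (e.g. <br />) from being fused.
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")
    # Keep only ASCII letters and whitespace; digits and punctuation are dropped.
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

df = pd.read_csv('IMDB Dataset.csv')
df['review'] = df['review'].apply(clean_text)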

Imports: It imports pandas for data handling, re for regular expressions (pattern matching), and BeautifulSoup to handle HTML content.

clean_text Function:

HTML Removal: Strips out tags such as <br> and <div> using BeautifulSoup, leaving only the text content.

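Regex Cleaning: re.sub(r'[^a-zA-Z\s]', '', text) keeps only ASCII letters (A-Z, a-z) and whitespace, removing punctuation, digits, and other symbols. One side effect is that numeric ratings such as "1/10" disappear entirely, and accented letters are dropped as well.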

Normalization: Converts everything to lowercase so "The" and "the" are treated as the same word.

Applying the Clean: It loads the CSV file and applies this cleaning function to every row in the review column.
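As a quick check of the regex and lowercasing steps described above, the sketch below applies them to a made-up sentence. The HTML step is left out so the snippet needs only the standard library, and `sample` is an illustrative string, not a row from the dataset.

```python
import re

# Illustrative input, not taken from the dataset.
sample = "Loved it!!! 10/10, would watch AGAIN."

# Same pattern as in clean_text: drop everything except ASCII letters and whitespace.
letters_only = re.sub(r'[^a-zA-Z\s]', '', sample)
normalized = letters_only.lower()

# Digits and punctuation are gone, so the "10/10" rating leaves no trace.
tokens = normalized.split()
print(tokens)
```

Splitting on whitespace afterwards also absorbs the double spaces left behind where punctuation and digits were removed.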

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

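# Fixing random_state makes the split (and therefore the reported accuracy) reproducible.
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)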

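tfidf = TfidfVectorizer(
    stop_words='english',  # drop common English function words ("the", "and", ...)
    ngram_range=(1, 2),    # use single words and two-word phrases as features
    min_df=2               # ignore terms that appear in fewer than 2 reviews
)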
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Accuracy: 0.3


train_test_split: Sets aside 20% of the data (test_size=0.2) so the model can be evaluated on reviews it never saw during training. A model that scores well on held-out reviews has learned patterns that generalize rather than memorizing the training set.
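The idea behind a hold-out split can be sketched in a few lines of plain Python. `holdout_split` is a hypothetical helper written for illustration; it is not how scikit-learn implements train_test_split, which also offers options such as stratification.

```python
import random

def holdout_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of items and cut it into (train, held_out) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Ten placeholder reviews stand in for the real dataset.
reviews = [f"review {i}" for i in range(10)]
train, held_out = holdout_split(reviews)
print(len(train), len(held_out))
```

Every review lands in exactly one of the two lists, which is what makes the held-out score an honest estimate.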

TfidfVectorizer:

TF (Term Frequency): How often a word appears in a review.

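IDF (Inverse Document Frequency): Down-weights words that appear in most reviews and up-weights words that appear in few, such as "extraordinary" or "dreadful." Very common words like "the" or "and" never reach this step here, because stop_words='english' removes them first.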

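stop_words, ngram_range, min_df: stop_words='english' drops common function words; ngram_range=(1, 2) uses both single words and two-word phrases as features (so "bad acting" can be a feature in its own right); min_df=2 discards any term that appears in fewer than two training reviews, which removes most typos and one-off words. No max_features cap is set, so the vocabulary size is determined by the data.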
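To make the TF and IDF pieces concrete, here is a minimal pure-Python sketch on three made-up sentences. It assumes the smoothed formula scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it omits the L2 normalization that TfidfVectorizer applies by default, so the numbers will not match the vectorizer's output exactly. The names `doc_freq`, `idf`, and `tfidf_weights` are illustrative.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the plot was dreadful",
    "the acting was fine",
    "the music was fine",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_weights(doc_tokens):
    # Raw term count times IDF, without length normalization.
    counts = Counter(doc_tokens)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf_weights(tokenized[0])
print(weights)
```

In this toy corpus "the" appears in every document and gets the minimum weight of 1.0, while "dreadful" appears in only one and is weighted higher.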

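LogisticRegression: Despite the name "regression," this is a classification algorithm. It estimates the probability that a review belongs to each class and predicts the more likely one. Because the sentiment column holds the strings "positive" and "negative," the model returns those strings as predictions.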
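The probability calculation can be sketched as a weighted sum passed through a sigmoid. The weights and bias below are hypothetical values chosen for illustration; the real model learns one weight per vocabulary term during fit, and `positive_probability` is not a scikit-learn function.

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights: positive values push toward "positive".
weights = {"extraordinary": 2.0, "dreadful": -2.5}
bias = 0.0

def positive_probability(features):
    # features maps each term to its TF-IDF value; unknown terms contribute nothing.
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)

p = positive_probability({"dreadful": 1.0})
label = "positive" if p >= 0.5 else "negative"
print(round(p, 3), label)
```

With the fitted model, model.predict_proba returns these per-class probabilities, with columns ordered as in model.classes_.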

In [15]:
def predict_sentiment(text):
    cleaned = clean_text(text)
    vectorized = tfidf.transform([cleaned])
    prediction = model.predict(vectorized)
    return prediction[0]

print(predict_sentiment("This movie was really bad! The acting was 1/10."))

negative


Preprocessing: Run clean_text on any new review so it goes through the same steps as the training data. Otherwise the input may contain HTML tags or digits that the training text never did, and its features would no longer match what the model saw.

Transformation: tfidf.transform converts the sentence into a vector over the same vocabulary, with the same IDF weights, that the vectorizer learned from the training set. Words that were not in that vocabulary are ignored.

Inference: model.predict decides whether the words in your sentence align more closely with the "positive" or the "negative" patterns it learned from the training reviews.
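The transformation step can be checked on a toy corpus. The sketch below fits a separate vectorizer with default settings on two invented sentences (the names `toy_train` and `toy_vectorizer` are illustrative and unrelated to the notebook's `tfidf` object) and then transforms a new sentence that contains one known and one unseen word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two invented training sentences.
toy_train = ["great acting and a great plot", "bad acting and a dull plot"]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_train)  # learns the vocabulary and IDF weights

# "great" is in the learned vocabulary; "soundtrack" is not.
new_vec = toy_vectorizer.transform(["great soundtrack"])

n_features = len(toy_vectorizer.vocabulary_)
print(new_vec.shape, new_vec.nnz)
```

The new vector has one column per training-vocabulary term, and only the column for "great" is non-zero, because transform silently drops words it did not see during fit.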