Day 1 – IMDb Sentiment Analysis

Steps to follow:

Load Dataset

Use datasets library or download from IMDb manually.

Inspect the data: what are the columns? How many samples?

Preprocess Text

Remove punctuation, lowercase text.

Optionally remove stopwords.

Check for empty or missing reviews.

Encode Labels

IMDb labels are usually 0 (negative) and 1 (positive).

Make sure they are numeric if needed.

Train/Test Split

Use train_test_split from scikit-learn.

Example: 80% train, 20% test.

Vectorize Text

Use CountVectorizer/TfidfVectorizer from scikit-learn.

Fit on training data, transform both train and test sets.

Train Model

Use MultinomialNB (Naive Bayes for text).

Fit the model on training vectors and labels.

Evaluate

Predict on test data.

Calculate accuracy and print a classification report.

Reflection

Compare performance.

Check which words are most informative using .feature_log_prob_.
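
The notebook code below relies on the train and test splits that ship with the dataset, so it never calls train_test_split. For data that arrives as a single table, the 80/20 split from the outline could be sketched as follows. The toy texts and labels are hypothetical stand-ins, and stratify and random_state are optional choices rather than requirements.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a single, unsplit table of reviews.
texts = [f"review number {i}" for i in range(10)]
labels = [0, 1] * 5  # balanced: five negative, five positive

# 80% train / 20% test. stratify keeps the class balance the same in both
# parts; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```

With a stratified split on balanced data, the test part receives the same share of each class as the training part.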

In [None]:
#install the required libraries if not done yet
!pip install -q datasets scikit-learn pandas
#import
from datasets import load_dataset
import pandas as pd
#Load IMDb dataset
imdb=load_dataset("imdb")
#check keys to know about data format
print(imdb)
print(imdb.keys())
#Convert train and test splits to pandas DataFrames
train_df=pd.DataFrame({
    "text": imdb['train']['text'],
    "label": imdb['train']['label']
})
test_df=pd.DataFrame({
    "text": imdb['test']['text'],
    "label": imdb['test']['label']
})

print("Train DataFrame head:")
print(train_df.head())

print("\nTest DataFrame head:")
print(test_df.head())

#check label distribution in train set
print("\nTrain label distribution:")
print(train_df['label'].value_counts())
#standardizing data: lowercase, then drop everything except word characters and whitespace
def standardized_text(text):
    return text.str.lower().str.replace(r'[^\w\s]','',regex = True)

train_df['text']=standardized_text(train_df['text'])
test_df['text']=standardized_text(test_df['text'])
#apply countvectorizer
from sklearn.feature_extraction.text import CountVectorizer

# you can squeeze a bit more performance by adding stop-word removal or limiting features
vectorizer=CountVectorizer(stop_words='english',max_features=5000,ngram_range=(1,2))
X_train_counts=vectorizer.fit_transform(train_df['text'])
X_test_counts=vectorizer.transform(test_df['text'])
print(X_train_counts.shape,X_test_counts.shape)
#Train a classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
clf=MultinomialNB()
clf.fit(X_train_counts,train_df['label'])
#prediction
label_pred=clf.predict(X_test_counts)
#Evaluation
print("Accuracy:",accuracy_score(test_df['label'],label_pred))
print("\nClassification Report:\n",classification_report(test_df['label'],label_pred,target_names=['Negative','Positive']))
print("\n Confusion matrix :\n",confusion_matrix(test_df['label'],label_pred))
#feature_log_prob_ holds log P(word | class); the largest values are the
#most frequent words in each class, so common words can show up in both lists
feature_names = vectorizer.get_feature_names_out()
log_probs = clf.feature_log_prob_
print("Top 10 words for each class:")
print("Most probable words in negative reviews:")
neg_words=" ".join([feature_names[i] for i in log_probs[0].argsort()[-10:][::-1]])
print(neg_words)
print("Most probable words in positive reviews:")
pos_words=" ".join([feature_names[i] for i in log_probs[1].argsort()[-10:][::-1]])
print(pos_words)


# save the trained model and the fitted vectorizer with joblib
import joblib
joblib.dump(clf, "imdb_nb_model.pkl")
joblib.dump(vectorizer, "imdb_vectorizer.pkl")




DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
dict_keys(['train', 'test', 'unsupervised'])
Train DataFrame head:
                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0

Test DataFrame head:
                                                text  label
0  I love sci-fi and am willing to put up with a ...      0
1  Worth the entertainment value of a rental, esp...      0
2  its a totally average film with a few semi-alr...      0
3  STAR RATING: *****

['imdb_vectorizer.pkl']

1️⃣ Load & Inspect Data
from datasets import load_dataset
imdb = load_dataset("imdb")


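Downloads the IMDb dataset (25 k training + 25 k test movie reviews, all labeled).

imdb['train'] and imdb['test'] are the two labeled splits used here; the third split, imdb['unsupervised'], holds 50 k unlabeled reviews and is not used.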

train_df = pd.DataFrame({"text": imdb['train']['text'],
                         "label": imdb['train']['label']})


Converts the Hugging Face dataset into regular pandas DataFrames for easier handling.

2️⃣ Text Cleaning
def standardized_text(text):
    return text.str.lower().str.replace(r'[^\w\s]','',regex=True)


Lowercases everything.

Removes punctuation and other non-word characters.
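
To see what the cleaning step does to a single review, here is a small sketch on a made-up sample. IMDb reviews often contain literal `<br />` tags, and because the regex deletes characters without inserting a space, neighbouring words can fuse. The helper strip_tags_then_standardize is a hypothetical variant, not part of the notebook, that replaces tags with a space first.

```python
import pandas as pd

def standardized_text(text):
    # Same transformation as in the notebook: lowercase, then delete every
    # character that is neither a word character nor whitespace.
    return text.str.lower().str.replace(r'[^\w\s]', '', regex=True)

def strip_tags_then_standardize(text):
    # Hypothetical variant: turn HTML-like tags into a space before cleaning,
    # so the words on either side of a tag stay separate.
    no_tags = text.str.replace(r'<[^>]+>', ' ', regex=True)
    return standardized_text(no_tags)

# Made-up sample review containing a line-break tag and an apostrophe.
sample = pd.Series(["Great film!<br />Don't miss it."])

cleaned = standardized_text(sample)
fixed = strip_tags_then_standardize(sample)

print(cleaned[0])  # great filmbr dont miss it
print(fixed[0])    # great film dont miss it
```

Whether the extra tag handling is worth it depends on the data; on IMDb it mainly prevents tokens such as "br" and fused words from entering the vocabulary.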

3️⃣ Feature Extraction
vectorizer = CountVectorizer(stop_words='english',
                             max_features=5000,
                             ngram_range=(1,2))


Turns each review into a bag-of-words matrix of counts.

Uses English stop-word removal.

Keeps the 5 000 most frequent single words and two-word phrases (bigrams).

4️⃣ Train the Model
clf = MultinomialNB()
clf.fit(X_train_counts, train_df['label'])


Fits a Multinomial Naive Bayes classifier on the training vectors.

5️⃣ Evaluate

accuracy_score, classification_report, and confusion_matrix show performance on the test set.

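The section with feature_log_prob_ lists the 10 words with the highest log-probability in each class. These are the most frequent words per class, so words that are common in both negative and positive reviews can appear in both lists.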

6️⃣ Save for Later
import joblib
joblib.dump(clf, "imdb_nb_model.pkl")
joblib.dump(vectorizer, "imdb_vectorizer.pkl")


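joblib.dump serializes a picklable Python object to disk; it is efficient for objects that hold large NumPy arrays, and compression is off unless you pass the compress argument: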

imdb_nb_model.pkl → your trained Naive Bayes model.

imdb_vectorizer.pkl → the fitted CountVectorizer.

Later you can load them back without retraining:

from joblib import load
clf = load("imdb_nb_model.pkl")
vectorizer = load("imdb_vectorizer.pkl")

# Predict a new review
new_text = ["The movie was absolutely fantastic!"]
X_new = vectorizer.transform(new_text)
print(clf.predict(X_new))   # → 1 for positive, 0 for negative


This way, you can deploy the classifier or run predictions in a new script or on a server without repeating the entire training pipeline.
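
Saving two files means the model and the vectorizer must always be loaded as a matching pair. A possible alternative, sketched here on made-up data, is to wrap both in a scikit-learn Pipeline and save a single file. The file name imdb_nb_pipeline.pkl is hypothetical, and the temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data standing in for the IMDb reviews.
texts = ["great fun movie", "fantastic great film",
         "awful boring movie", "terrible boring film"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# The pipeline vectorizes raw text and classifies it in one call.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)

# Save and reload a single object instead of two.
path = os.path.join(tempfile.mkdtemp(), "imdb_nb_pipeline.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

# Raw strings go straight in; no separate transform step is needed.
print(restored.predict(["a great film"]))
```

The trade-off is that the vectorizer is no longer a separate object on disk, although it is still reachable through the pipeline's named steps.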

In [None]:

#import
from datasets import load_dataset
import pandas as pd
#Load IMDb dataset
imdb=load_dataset("imdb")
#check keys to know about data format
print(imdb)
print(imdb.keys())
#Convert train and test splits to pandas DataFrames
train_df=pd.DataFrame({
    "text": imdb['train']['text'],
    "label": imdb['train']['label']
})
test_df=pd.DataFrame({
    "text": imdb['test']['text'],
    "label": imdb['test']['label']
})

print("Train DataFrame head:")
print(train_df.head())

print("\nTest DataFrame head:")
print(test_df.head())

#check label distribution in train set
print("\nTrain label distribution:")
print(train_df['label'].value_counts())
#standardizing data: lowercase, then drop everything except word characters and whitespace
def standardized_text(text):
    return text.str.lower().str.replace(r'[^\w\s]','',regex = True)

train_df['text']=standardized_text(train_df['text'])
test_df['text']=standardized_text(test_df['text'])
#apply TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# you can squeeze a bit more performance by adding stop-word removal or limiting features
vectorizer=TfidfVectorizer(stop_words='english',max_features=5000,ngram_range=(1,2))
X_train_tfidf=vectorizer.fit_transform(train_df['text'])
X_test_tfidf=vectorizer.transform(test_df['text'])
print(X_train_tfidf.shape,X_test_tfidf.shape)
#Train a classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
clf1=MultinomialNB()
clf1.fit(X_train_tfidf,train_df['label'])
#prediction
label_pred=clf1.predict(X_test_tfidf)
#Evaluation
print("Accuracy:",accuracy_score(test_df['label'],label_pred))
print("\nClassification Report:\n",classification_report(test_df['label'],label_pred,target_names=['Negative','Positive']))
print("\n Confusion matrix :\n",confusion_matrix(test_df['label'],label_pred))
#feature_log_prob_ holds log P(word | class); the largest values are the
#highest-weighted words in each class, so common words can show up in both lists
feature_names = vectorizer.get_feature_names_out()
log_probs = clf1.feature_log_prob_
print("Top 10 words for each class:")
print("Most probable words in negative reviews:")
neg_words=" ".join([feature_names[i] for i in log_probs[0].argsort()[-10:][::-1]])
print(neg_words)
print("Most probable words in positive reviews:")
pos_words=" ".join([feature_names[i] for i in log_probs[1].argsort()[-10:][::-1]])
print(pos_words)


# save the trained model and the fitted vectorizer with joblib
import joblib
joblib.dump(clf1, "imdb_nb_tfidf_model.pkl")
joblib.dump(vectorizer, "imdb_tfidf_vectorizer.pkl")




DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
dict_keys(['train', 'test', 'unsupervised'])
Train DataFrame head:
                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0

Test DataFrame head:
                                                text  label
0  I love sci-fi and am willing to put up with a ...      0
1  Worth the entertainment value of a rental, esp...      0
2  its a totally average film with a few semi-alr...      0
3  STAR RATING: *****

['imdb_tfidf_vectorizer.pkl']

1️⃣ Load & Inspect Data
from datasets import load_dataset
imdb = load_dataset("imdb")


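Downloads the IMDb dataset (25 k training + 25 k test movie reviews, all labeled).

imdb['train'] and imdb['test'] are the two labeled splits used here; the third split, imdb['unsupervised'], holds 50 k unlabeled reviews and is not used.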

train_df = pd.DataFrame({"text": imdb['train']['text'],
                         "label": imdb['train']['label']})


Converts the Hugging Face dataset into regular pandas DataFrames for easier handling.

2️⃣ Text Cleaning
def standardized_text(text):
    return text.str.lower().str.replace(r'[^\w\s]','',regex=True)


Lowercases everything.

Removes punctuation and other non-word characters.

3️⃣ Feature Extraction
vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=5000,
                             ngram_range=(1,2))


Turns each review into a bag-of-words matrix of TF-IDF weights: term counts scaled down for terms that occur in many reviews.

Uses English stop-word removal.

Keeps the 5 000 most frequent single words and two-word phrases (bigrams).

4️⃣ Train the Model
clf1 = MultinomialNB()
clf1.fit(X_train_tfidf, train_df['label'])


Fits a Multinomial Naive Bayes classifier on the training vectors.

5️⃣ Evaluate

accuracy_score, classification_report, and confusion_matrix show performance on the test set.

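The section with feature_log_prob_ lists the 10 words with the highest log-probability in each class. These are the highest-weighted words per class, so words that are common in both negative and positive reviews can appear in both lists.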
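
For the "compare performance" step, it can help to evaluate both representations with the same code path. The helper compare_vectorizers below is a hypothetical sketch, and the toy data only demonstrates the mechanics; it says nothing about how the two models score on IMDb.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def compare_vectorizers(train_texts, train_labels, test_texts, test_labels):
    """Fit MultinomialNB on counts and on TF-IDF; return test accuracy for each."""
    results = {}
    for name, vec in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
        X_train = vec.fit_transform(train_texts)  # fit on training text only
        X_test = vec.transform(test_texts)        # reuse the fitted vocabulary
        model = MultinomialNB().fit(X_train, train_labels)
        results[name] = accuracy_score(test_labels, model.predict(X_test))
    return results

# Made-up data standing in for train_df / test_df.
train_texts = ["great fun movie", "fantastic great film",
               "awful boring movie", "terrible boring film"]
train_labels = [1, 1, 0, 0]
test_texts = ["great film", "boring movie"]
test_labels = [1, 0]

results = compare_vectorizers(train_texts, train_labels, test_texts, test_labels)
print(results)
```

With the notebook's data, the same function could be called with the text and label columns of train_df and test_df.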

6️⃣ Save for Later
import joblib
joblib.dump(clf1, "imdb_nb_tfidf_model.pkl")
joblib.dump(vectorizer, "imdb_tfidf_vectorizer.pkl")


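joblib.dump serializes a picklable Python object to disk; it is efficient for objects that hold large NumPy arrays, and compression is off unless you pass the compress argument:

imdb_nb_tfidf_model.pkl → your trained Naive Bayes model.

imdb_tfidf_vectorizer.pkl → the fitted TfidfVectorizer.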

Later you can load them back without retraining:

from joblib import load
clf1 = load("imdb_nb_tfidf_model.pkl")
vectorizer = load("imdb_tfidf_vectorizer.pkl")

# Predict a new review
new_text = ["The movie was absolutely fantastic!"]
X_new = vectorizer.transform(new_text)
print(clf1.predict(X_new))   # → 1 for positive, 0 for negative


This way, you can deploy the classifier or run predictions in a new script or on a server without repeating the entire training pipeline.
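
When serving predictions, the class probabilities are often more useful than the bare label. The sketch below trains a tiny stand-in model on made-up data instead of loading the saved files, so that it runs on its own; predict_proba and classes_ are standard MultinomialNB members.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data standing in for the saved model and vectorizer.
texts = ["great fun movie", "fantastic great film",
         "awful boring movie", "terrible boring film"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
clf1 = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# predict_proba returns one row per input and one column per class,
# with columns ordered as in clf1.classes_.
X_new = vectorizer.transform(["a great film"])
proba = clf1.predict_proba(X_new)[0]

for cls, p in zip(clf1.classes_, proba):
    print(f"class {cls}: {p:.3f}")
```

Naive Bayes probabilities tend to be pushed toward 0 or 1 on long texts, so they are better read as a ranking signal than as calibrated confidence.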