In [None]:
##kaggle setup for downloading dataset
import os

os.environ['KAGGLE_USERNAME'] = 'your_username'
os.environ['KAGGLE_KEY'] = 'your_api_key'





In [None]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
  0% 0.00/25.7M [00:00<?, ?B/s]
100% 25.7M/25.7M [00:00<00:00, 1.03GB/s]


In [None]:
##unzipping the downloaded dataset
!unzip imdb-dataset-of-50k-movie-reviews.zip

Archive:  imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


In [None]:
#importing libraries
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [None]:

df = pd.read_csv("IMDB Dataset.csv")
print(df)




                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


In [None]:
#statistical measures of data
df.describe()

                                                   review sentiment
count                                               50000     50000
unique                                              49582         2
top     Loved today's show!!! It was a variety and not...  positive
freq                                                    5     25000


## Observation

The dataset contains **50,000 movie reviews** with two columns:

- **review**: textual movie reviews  
- **sentiment**: target label (*positive / negative*)

### Key Observations
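- The dataset is **perfectly balanced** with **25,000 positive** and **25,000 negative** reviews.  
- There are **49,582 unique reviews**, so 418 rows repeat a review text that already appears elsewhere in the dataset.  
- `describe()` reports **positive** as the top sentiment label with a frequency of **25,000**; because both classes have 25,000 rows, this is a tie rather than a majority.

**Conclusion:**  
The dataset is **balanced and suitable for supervised sentiment classification**, although the raw text contains HTML tags (such as `<br />`) and a small number of duplicate reviews.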
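The duplicate count follows from `count - unique` in the `describe()` output (50,000 - 49,582 = 418). A minimal sketch of how such repeats could be counted and dropped with pandas is shown below. The small `toy` DataFrame is a hypothetical stand-in for `df`, and whether to drop duplicates at all is a modeling choice that the notebook itself does not make.

```python
import pandas as pd

# Hypothetical stand-in for the IMDB DataFrame (same column names).
toy = pd.DataFrame({
    "review": ["Loved it", "Hated it", "Loved it", "Just fine"],
    "sentiment": ["positive", "negative", "positive", "positive"],
})

# Rows whose review text repeats an earlier row (the first occurrence is kept).
n_duplicates = toy.duplicated(subset="review").sum()
print("Duplicate rows:", n_duplicates)

# The same number, derived the way describe() implies: count - unique.
assert n_duplicates == len(toy) - toy["review"].nunique()

# Drop the repeats and renumber the index.
deduped = toy.drop_duplicates(subset="review").reset_index(drop=True)
print(deduped.shape)
```

Dropping duplicates before the train/test split would also prevent the same review text from landing in both parts.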



In [None]:
df['sentiment'].value_counts()


sentiment  count
positive   25000
negative   25000


The class distribution is perfectly balanced, with 25,000 samples per class, so no class-imbalance handling is needed for this dataset.

###  Text Cleaning and Preprocessing

Raw text data often contains noise such as HTML tags, punctuation, numbers, and inconsistent casing.  
To ensure effective feature extraction and improve model performance, the text is cleaned before vectorization.

The cleaning process includes:
- Converting all text to lowercase for consistency
- Removing HTML tags
- Removing special characters and numbers
- Normalizing extra whitespace

This step ensures that the model focuses only on meaningful textual information.


In [None]:
##Text cleaning


def clean_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove HTML tags
    text = re.sub(r"<.*?>", "", text)

    # Remove special characters and numbers
    text = re.sub(r"[^a-zA-Z\s]", "", text)

    # Remove extra whitespaces
    text = re.sub(r"\s+", " ", text)

    return text.strip()

# Apply cleaning function
df['clean_review'] = df['review'].apply(clean_text)
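
A quick check of the cleaning steps on a single string can make their effect easier to see. The sketch below repeats the notebook's `clean_text` and compares it with `clean_text_spaced`, a hypothetical variant (not part of the notebook) that replaces tags with a space. The sample string is invented for illustration; it shows that when a tag sits directly between two words, removing it without a space joins the neighbours.

```python
import re

def clean_text(text):
    # Same steps as the notebook's function.
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def clean_text_spaced(text):
    # Hypothetical variant: each tag becomes a space so neighbours stay separate.
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Great acting.<br /><br />Terrible plot, 3/10!"
print(clean_text(sample))         # great actingterrible plot
print(clean_text_spaced(sample))  # great acting terrible plot
```

Switching to the spaced variant would change the vocabulary slightly, so the reported scores would need to be regenerated.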




In [None]:
##inspecting changes in original and cleaned dataframe

df[["review", "clean_review"]].head()

                                              review                                       clean_review
0  One of the other reviewers has mentioned that ...  one of the other reviewers has mentioned that ...
1  A wonderful little production. <br /><br />The...  a wonderful little production the filming tech...
2  I thought this was a wonderful way to spend ti...  i thought this was a wonderful way to spend ti...
3  Basically there's a family where a little boy ...  basically theres a family where a little boy j...
4  Petter Mattei's "Love in the Time of Money" is...  petter matteis love in the time of money is a ...


In [None]:
##Train Test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['clean_review'],
    df['sentiment'],
    test_size=0.2,
    random_state=42
)
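
The split above is random but not stratified, which is why the test set later shows 4,961 negative and 5,039 positive reviews instead of exactly 5,000 each. As a sketch under the assumption that an exact class ratio is wanted, `train_test_split` accepts a `stratify` argument. The `toy` DataFrame below is hypothetical and only mimics the column names of `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data standing in for df.
toy = pd.DataFrame({
    "clean_review": [f"review {i}" for i in range(100)],
    "sentiment": ["positive", "negative"] * 50,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["clean_review"],
    toy["sentiment"],
    test_size=0.2,
    random_state=42,
    stratify=toy["sentiment"],  # keep the class ratio identical in both parts
)

print(y_te.value_counts())
```

With stratification, each part keeps the 50/50 ratio of the full data, which makes per-class metrics slightly easier to compare across runs.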


In [None]:
# Verifying shapes
print("Original shapes (features, labels):")
print(df['clean_review'].shape, df['sentiment'].shape)

print("\nFeature shapes (train, test):")
print(X_train.shape, X_test.shape)

print("\nLabel shapes (train, test):")
print(y_train.shape, y_test.shape)

Original shapes (features, labels):
(50000,) (50000,)

Feature shapes (train, test):
(40000,) (10000,)

Label shapes (train, test):
(40000,) (10000,)


###  Feature Engineering using TF-IDF

To convert textual reviews into numerical features, TF-IDF (term frequency-inverse document frequency) is used.

TF-IDF assigns higher importance to words that:
- Appear frequently in a document
- Appear less frequently across the entire dataset

This helps the model focus on sentiment-bearing words while reducing the impact of common words.

The text data is vectorized using:
- Unigrams and bigrams (ngram_range = (1, 2))
- A vocabulary limited to the 5,000 most frequent terms (max_features = 5000) to reduce overfitting
- Stopword removal for cleaner representations

This step transforms raw text into numerical vectors suitable for machine learning models.
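
To make the "rare across the dataset" part concrete, the sketch below computes the inverse document frequency by hand, using the smoothed formula that scikit-learn applies by default (`smooth_idf=True`): idf(t) = ln((1 + n) / (1 + df(t))) + 1. The three-document corpus is invented for illustration. `TfidfVectorizer` additionally multiplies by the term count and L2-normalizes each row, which is omitted here.

```python
import math

# Toy corpus, already tokenized (hypothetical example data).
docs = [
    ["great", "movie"],
    ["bad", "movie"],
    ["great", "acting", "movie"],
]

n_docs = len(docs)

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing the term.
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ["movie", "great", "bad"]:
    print(term, round(idf(term), 3))
```

A term that occurs in every document ("movie") gets the minimum weight of 1.0, while terms found in fewer documents are weighted progressively higher.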


In [None]:
#creating TF-IDF vectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1,2),
    stop_words='english'
)


In [None]:
# Vectorize the training and test sets (the vocabulary is fit on the training data only)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)


In [None]:
# Training the model

# Initialize model
model = MultinomialNB()

# Train model on vectorized training data
model.fit(X_train_vec, y_train)


In [None]:
# Predictions on training data
y_train_pred = model.predict(X_train_vec)

# Predictions on test data
y_test_pred = model.predict(X_test_vec)

# ---- Training performance ----
print("TRAINING PERFORMANCE")
print("Accuracy:", accuracy_score(y_train, y_train_pred))


# ---- Testing performance ----
print("\nTESTING PERFORMANCE")
print("Accuracy:", accuracy_score(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred))

# Confusion Matrix
print("\nConfusion Matrix (Test Data):")
print(confusion_matrix(y_test, y_test_pred))


TRAINING PERFORMANCE
Accuracy: 0.865325

TESTING PERFORMANCE
Accuracy: 0.8503
              precision    recall  f1-score   support

    negative       0.86      0.84      0.85      4961
    positive       0.84      0.86      0.85      5039

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000


Confusion Matrix (Test Data):
[[4150  811]
 [ 686 4353]]


# Model Evaluation Analysis

This section provides a detailed interpretation of the model's performance based on the training and testing results.

---

##  Overall Summary

The model shows **stable and consistent performance** across training and testing datasets, indicating **good generalization** and **no major overfitting or underfitting**.

---

##  Training Performance

- **Accuracy:** 0.8653  

The training accuracy is reasonably high, meaning the model has learned meaningful patterns from the training data without memorizing it excessively.

---

## Testing Performance

### Accuracy
- **Test Accuracy:** 0.8503  

The small difference between training and test accuracy (about 1.5 percentage points) suggests:
- No significant overfitting  
- Model generalizes well to unseen data  

---

###  Classification Report Breakdown

| Class     | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Negative | 0.86 | 0.84 | 0.85 | 4961 |
| Positive | 0.84 | 0.86 | 0.85 | 5039 |
| **Macro avg** | 0.85 | 0.85 | **0.85** | **10000** |

#### Interpretation:
- **Precision (~0.85):** About 85% of the predictions made for each class are correct.
- **Recall (~0.85):** The model recovers about 85% of the actual samples in each class.
- **F1-score (~0.85):** Balanced performance between precision and recall.
- **Balanced support:** The test classes are nearly equal in size (4961 vs. 5039), so class imbalance is not an issue.

---

## Confusion Matrix Analysis

|               | Predicted Negative | Predicted Positive |
|---------------|-------------------|-------------------|
| Actual Negative | 4150 | 811 |
| Actual Positive | 686 | 4353 |

### Interpretation:
- **True Negatives (4150):** Correctly identified negative samples  
- **True Positives (4353):** Correctly identified positive samples  
- **False Positives (811):** Some negative samples misclassified as positive  
- **False Negatives (686):** Some positives missed  

The error rates are balanced and acceptable for a baseline text classification model.
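
The figures in the classification report can be reproduced directly from the four confusion-matrix counts. The short calculation below is a worked check rather than part of the original notebook; it assumes the scikit-learn convention of rows as actual classes and columns as predicted classes, ordered (negative, positive).

```python
# Confusion matrix from the test run: rows = actual, columns = predicted,
# classes ordered as (negative, positive).
tn, fp = 4150, 811
fn, tp = 686, 4353

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Positive class: how many predicted positives are right, and how many actual positives are found.
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Negative class: the same two questions with the roles swapped.
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy           : {accuracy:.4f}")
print(f"negative  precision: {precision_neg:.2f}  recall: {recall_neg:.2f}")
print(f"positive  precision: {precision_pos:.2f}  recall: {recall_pos:.2f}")
```

The results match the report: accuracy 0.8503, negative 0.86 / 0.84, positive 0.84 / 0.86.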

---

##  Final Verdict

- The model is **well-trained and stable**
- No signs of overfitting or underfitting
- Suitable as a **baseline ML model**
- Can be further improved using:
  - TF-IDF tuning (ngrams, max_features)
  - Hyperparameter tuning
  - Advanced models (Logistic Regression, Linear SVM)
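
As a starting point for the last suggestion, the sketch below swaps Naive Bayes for Logistic Regression inside a scikit-learn pipeline. It is an illustration only: the four toy reviews are hypothetical, `max_iter=1000` is an arbitrary choice, and no claim is made about the accuracy this would reach on the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in the notebook these would be X_train and y_train.
texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "a dull and boring film",
    "terrible acting and a weak story",
]
labels = ["positive", "positive", "negative", "negative"]

# Vectorizer and classifier are fit together, so raw text goes in directly.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, labels)

prediction = pipeline.predict(["a wonderful story"])
print("Prediction:", prediction[0])
```

Keeping both steps in one pipeline also guarantees that new text is vectorized with the vocabulary learned during training.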




In [None]:
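# Prediction on a new, unseen review
new_text = ["This movie was absolutely amazing"]
# Apply the same cleaning used for the training data before vectorizing
new_vec = vectorizer.transform([clean_text(t) for t in new_text])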

prediction = model.predict(new_vec)
print("Prediction:", prediction[0])


Prediction: negative


## Model Prediction Analysis

### Input Text
"This movie was absolutely amazing"

### Model Output
**Predicted Label:** negative

### Interpretation of the Result

Although the sentence clearly expresses positive sentiment, the model classified it as negative.  
This does not indicate a bug or implementation error, but rather highlights a known limitation of classical machine learning models using TF-IDF features.

### Why This Happens

- The model relies on word frequency patterns, not true semantic understanding.
- If words like "amazing" appear frequently in negative contexts within the training data, the model may associate them incorrectly.
- TF-IDF does not capture word order, sentiment flow, or contextual meaning.
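
One way to test the second point, instead of assuming it, is to look at what the fitted model has learned for each token. The sketch below uses the `feature_log_prob_` and `classes_` attributes of `MultinomialNB` together with the vectorizer's `vocabulary_`. The toy corpus and the helper `token_lean` are hypothetical; on the real notebook objects the same lookup could be run for "movie", "absolutely", and "amazing".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the training reviews.
texts = [
    "amazing film great acting",
    "amazing story loved it",
    "terrible film boring plot",
    "awful acting boring story",
]
labels = ["positive", "positive", "negative", "negative"]

vec = TfidfVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(texts), labels)

classes = list(nb.classes_)
neg_idx, pos_idx = classes.index("negative"), classes.index("positive")

def token_lean(token):
    """Log-probability difference: > 0 leans positive, < 0 leans negative."""
    col = vec.vocabulary_[token]
    return nb.feature_log_prob_[pos_idx, col] - nb.feature_log_prob_[neg_idx, col]

for token in ["amazing", "boring", "film"]:
    print(token, round(token_lean(token), 3))
```

Because Naive Bayes sums these per-token values, a short input with only a few in-vocabulary tokens can be decided by a single token that leans the other way.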

### How to Improve Performance

- **Use Logistic Regression or Linear SVM**  
  → These linear models often separate TF-IDF features better than Naive Bayes.
- **Increase dataset size and diversity**  
  → Reduces bias and improves generalization.
- **Tune the n-gram and stop-word settings**  
  → The vectorizer already uses unigrams and bigrams, but `stop_words='english'` removes words such as "not" and "very", so phrases like "not good" or "very bad" are never seen as bigrams.
- **Move to semantic models**
  - Word Embeddings (Word2Vec, GloVe, FastText)
  - Transformer-based models (BERT, DistilBERT)

### Note on Using N-grams

Although using bigrams (n = 1 to 2) or trigrams (n = 1 to 3) helps capture limited contextual patterns, the improvement is often incremental rather than transformative.

This is because TF-IDF remains a frequency-based representation, lacking true semantic understanding, unlike embedding-based or transformer-based approaches.
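
One caveat when relying on bigrams for negation is that scikit-learn's built-in English stop-word list contains "not" and "very". With `stop_words='english'`, those tokens are removed before n-grams are formed. The sketch below compares the tokens produced with and without stop-word removal; the example phrase is invented for illustration.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Both words are part of scikit-learn's built-in English stop-word list.
print("not" in ENGLISH_STOP_WORDS, "very" in ENGLISH_STOP_WORDS)

# build_analyzer() returns the tokenization function without needing a fit.
with_stop = TfidfVectorizer(ngram_range=(1, 2), stop_words="english").build_analyzer()
without_stop = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

phrase = "this movie was not good"
print(with_stop(phrase))     # "not" is gone, so no "not good" bigram
print(without_stop(phrase))  # includes the "not good" bigram
```

Dropping `stop_words` or passing a custom list that keeps negation words is one option to experiment with. It changes the vocabulary, so the model would need to be retrained and re-evaluated.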

### Final Verdict

The model is functioning correctly, but its performance is inherently limited by the simplicity of TF-IDF + Naive Bayes.  
For production-grade sentiment analysis, more advanced models are required.
