# Homework 4: Movie Reviews for Text and Sentiment Analysis
---
**Summer 2025 - Instructor: Joyce Yang**

**Adapted from teaching materials by Prof. Chris Volinsky, Fall 2024.**

The data for this homework consists of 20k movie reviews that have already been labelled as positive or negative. We will apply our text toolbox to see if we can fit an effective supervised model.

The Movie Review data can be [downloaded from this link](https://drive.google.com/uc?export=download&id=1UA9CyRd8y7Wi4RKruXfItXadT3hY92bE).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import re
from sklearn.model_selection import train_test_split


In [None]:
from google.colab import files
uploaded = files.upload()

Saving movie_reviews_20k.csv to movie_reviews_20k.csv


In [None]:
# read in movie_reviews_20k.csv
df = pd.read_csv('movie_reviews_20k.csv')
df.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


Each row is simply the full text of a review and a rating of 0 (bad) or 1 (good), determined by a human labeller.


We need to remove HTML tags. You can run the following code to remove them.

In [None]:
# remove html tags
df['text'] = df['text'].apply(lambda x: re.sub('<[^<]+?>', '', x))
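
To see what this substitution does, here is a minimal sketch on a made-up snippet (the `sample` string is hypothetical, not taken from the dataset). It also shows a side effect worth knowing about: replacing a tag with the empty string can glue neighbouring words together, so one option (an assumption about what you may prefer, not part of the assignment) is to substitute a space and then collapse repeated whitespace.

```python
import re

# Same pattern as the cell above: "<", the shortest run of non-"<" characters, then ">"
TAG_PATTERN = re.compile(r'<[^<]+?>')

# Hypothetical review snippet, for illustration only
sample = "Great film!<br /><br />The <i>ending</i> surprised me."

# Behaviour of the homework code: tags are deleted outright
cleaned = TAG_PATTERN.sub('', sample)
print(cleaned)

# Alternative: replace each tag with a space, then collapse runs of whitespace
spaced = ' '.join(TAG_PATTERN.sub(' ', sample).split())
print(spaced)
```

With the empty-string replacement, "film!" and "The" end up adjacent with no space between them; the second variant keeps them separated.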

**1) Let's do all of the things we need to do to prepare text data: lemmatize, tokenize, and remove stopwords and punctuation.  Feel free to grab the exact code from the `Text Mining` notebook from Lecture (specifically the function `clean_text`) and run it.   Create a new field called "clean_review" and append it to your data frame, so that you retain the original text in one feature and have the cleaned text in another. This will allow us to go back and look at the original text of the review when we are evaluating the model.**

In [None]:
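import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the tokenizer models, stopword list and WordNet data used by clean_text
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')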

def clean_text(text):
    # Lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and punctuation
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return ' '.join(tokens)

df['clean_review'] = df['text'].apply(clean_text)

# Check the result
df.head()

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text,label,clean_review
0,I grew up (b. 1965) watching and loving the Th...,0,grew b watching loving thunderbird mate school...
1,"When I put this movie in my DVD player, and sa...",0,put movie dvd player sat coke chip expectation...
2,Why do people who do not know what a particula...,0,people know particular time past like feel nee...
3,Even though I have great interest in Biblical ...,0,even though great interest biblical movie bore...
4,Im a die hard Dads Army fan and nothing will e...,1,im die hard dad army fan nothing ever change g...


**2) Split data into training and test.  Run a TFIDF Vectorizer on the cleaned reviews - you need to `fit` the Vectorizer to the training data, and then `transform` both the training and the test sets using the vectorizer.  Review our `Lab 8` notebooks for syntax.**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Split data into train and test (80/20 split)
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Initialize TFIDF Vectorizer, fit on train, transform both
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train['clean_review'])
X_test = vectorizer.transform(test['clean_review'])
y_train = train['label']
y_test = test['label']
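
The reason for fitting on the training data only can be seen in a toy sketch (the two tiny corpora below are invented for illustration). The vocabulary and IDF weights are learned from the training documents; `transform` then reuses them, so words that appear only in the test set are simply dropped and the test matrix has the same number of columns as the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, for illustration only
train_docs = ["good movie great acting", "bad movie boring plot"]
test_docs = ["great plot terrible soundtrack"]

toy_vectorizer = TfidfVectorizer()
toy_train = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF from train only
toy_test = toy_vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

print(sorted(toy_vectorizer.vocabulary_))  # the 7 distinct training words
print(toy_train.shape, toy_test.shape)     # same number of columns in both
print(toy_test.nnz)                        # only "great" and "plot" are known words
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the model's coefficients would no longer line up with the test columns.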

**3) Fit your favorite classification model to the training set - you can use Logistic Regression (but be sure to regularize!), or something more complex like XGBoost or Random Forests, or any other classification model.  Apply your model to the test set and report the AUC.**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Fit Logistic Regression with an L2 penalty; C is the inverse regularization strength (smaller C = stronger regularization)
model = LogisticRegression(penalty='l2', C=1.0, max_iter=200)
model.fit(X_train, y_train)

# Predict probabilities on test set
y_pred_prob = model.predict_proba(X_test)[:, 1]

# Compute and report AUC
auc = roc_auc_score(y_test, y_pred_prob)
print(f"AUC on test set: {auc:.4f}")

AUC on test set: 0.9501
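
One way to read an AUC value: it is the probability that a randomly chosen positive review receives a higher predicted score than a randomly chosen negative one. The sketch below computes that pairwise quantity directly in plain Python on four made-up predictions. The helper `pairwise_auc` is a hypothetical name introduced here for illustration; on real data `roc_auc_score` is the tool to use, and this version is far too slow for 4,000 test reviews.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities, for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 3 of the 4 pairs are ordered correctly
```

Because AUC depends only on the ordering of the scores, it does not change if every probability is rescaled monotonically, which is why it is computed from `predict_proba` output rather than from hard 0/1 predictions.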


**4) Find the P(label=1) for all cases in the test set.  Identify the review in the test set which has the _lowest_ probability of being a good review BUT has label=1 (good).  This is an error!  Print the complete original review text for this error.  Do you think this is actually a good or bad review?  Do you think this error is due to an error in the labelling (y_test) or a problem with the model?  Explain.**

In [None]:
# Add predicted probabilities to a copy of the test dataframe (avoids pandas' SettingWithCopyWarning)
test = test.copy()
test['prob'] = y_pred_prob

# Filter for actual good reviews (label=1), sort by lowest probability
good_errors = test[test['label'] == 1].sort_values('prob')

# Get the review with the lowest probability
lowest_prob_review = good_errors.iloc[0]['text']
lowest_prob = good_errors.iloc[0]['prob']
print(f"Lowest probability of being good (but actually labeled good): {lowest_prob:.4f}")
print("Complete original review text:\n", lowest_prob_review)

Lowest probability of being good (but actually labeled good): 0.0182
Complete original review text:
 **SPOILERS AHEAD**It is really unfortunate that a movie so well produced turns out to besuch a disappointment. I thought this was full of (silly) clichés andthat it basically tried to hard. To the (American) guys out there: how many of you spend yourtime jumping on your girlfriend's bed and making monkeysounds? To the (married) girls: how many of you have suddenlygone from prudes to nymphos overnight--but not with yourhusband? To the French: would you really ask about someonebeing "à la fac" when you know they don't speak French? Wouldn'tyou use a more common word like "université"? I lived in France for a while and I sort of do know and understandEurope (and I love it), but my (German) roommate and I found thispretty insulting overall. It looked like a movie funded by theEuropean Parliament, and it tried too hard basically. It had allsorts of differences that it tried to tie together (

**5) To get an objective view of whether this review is _really_ positive or negative, we can use a pre-defined sentiment model built off of an existing lexicon.  One such model is called Vader ([documentation here](https://medium.com/@rslavanyageetha/vader-a-comprehensive-guide-to-sentiment-analysis-in-python-c4f1868b0d2e)).  Using your incorrectly labelled review from the last problem and the Vader code below, report what the negativity score is from Vader ('neg' in the output). Does this support your conclusion about the error above?**

In [None]:
import nltk
nltk.download('vader_lexicon')

# Import SentimentIntensityAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

text = lowest_prob_review  # From step 4
scores = analyzer.polarity_scores(text)

print(scores)
neg_score = scores['neg']
print(f"Negativity score ('neg') from VADER: {neg_score:.4f}")

{'neg': 0.103, 'neu': 0.789, 'pos': 0.108, 'compound': 0.5735}
Negativity score ('neg') from VADER: 0.1030


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
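
Besides `neg`, the output contains a `compound` score, which summarizes the whole text on a scale from -1 to 1. A common convention, suggested by VADER's authors, treats values of 0.05 or above as positive and -0.05 or below as negative. The helper below is a hypothetical illustration of that rule (the function name and the default threshold are assumptions, not part of `nltk`), useful when comparing VADER's overall verdict with the individual `neg`, `neu` and `pos` proportions.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label.

    The +/-0.05 cutoff is a convention, not a property of the data; adjust as needed.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# The compound score printed above for the review from step 4
print(label_from_compound(0.5735))
```

A lexicon-based score only counts sentiment-bearing words, so it can disagree with both the human label and the fitted model, especially on reviews that rely on sarcasm or rhetorical questions.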


**6) `eli5` is a Python library that tries to "explain" machine learning models.  It works for simple models like logistic regression as well as more complicated, black box models like XGBoost and Random Forests.   Look up the documentation for `eli5` and use it to show the words contributing most to positive and negative scores in your model.**

In [None]:
!pip install eli5
import eli5


# Show top words contributing to positive (class 1) and negative (class 0) predictions
eli5.show_weights(model, vec=vectorizer, top=20)  # Display top 20