# Sentiment Analysis and Preprocessing Text

In [2]:
import warnings
# Suppress all warnings
warnings.filterwarnings("ignore")

## Part 1: Using the TextBlob Sentiment Analyzer

### Import the movie review data as a data frame and ensure that the data is loaded properly. 

In [5]:
# Import necessary libraries
import pandas as pd

# Read the TSV file
df = pd.read_csv("labeledTrainData.tsv", sep="\t")

# Display the first few rows
display(df.head())

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...
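Beyond looking at the first rows, a few programmatic checks can help confirm the load worked. The following is a minimal sketch: it uses a made-up three-row TSV held in memory as a stand-in for labeledTrainData.tsv, and the names sample_tsv and sample_df are illustrative only. The same checks could be run on df.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for labeledTrainData.tsv (rows are made up)
sample_tsv = (
    "id\tsentiment\treview\n"
    "1_8\t1\tGreat fun.\n"
    "2_3\t0\tDull and slow.\n"
    "3_9\t1\tA classic.\n"
)
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Check 1: the expected columns are present, in order
columns_ok = list(sample_df.columns) == ["id", "sentiment", "review"]

# Check 2: no missing values anywhere in the frame
missing_values = int(sample_df.isna().sum().sum())

# Check 3: the label column contains only 0 and 1
labels_ok = set(sample_df["sentiment"].unique()) <= {0, 1}

print(sample_df.shape, columns_ok, missing_values, labels_ok)
```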


### How many positive and how many negative reviews are there?

In [7]:
from textblob import TextBlob
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

# Download VADER lexicon if not already downloaded
#nltk.download('vader_lexicon')

# Step 1: Count actual positive and negative reviews
positive_count = (df['sentiment'] == 1).sum()
negative_count = (df['sentiment'] == 0).sum()
print(f"Positive reviews: {positive_count}, Negative reviews: {negative_count}")


Positive reviews: 12500, Negative reviews: 12500


### Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

In [9]:
# Step 2: Use TextBlob to classify sentiment
# TextBlob provides polarity (negative to positive: -1 to 1) 
def textblob_sentiment(text):
    return 1 if TextBlob(text).sentiment.polarity >= 0 else 0

df['textblob_pred'] = df['review'].apply(textblob_sentiment)
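
The thresholding rule itself can be illustrated without TextBlob. The sketch below applies the same "greater than or equal to zero" rule to a handful of hypothetical polarity scores (they are not taken from the reviews). Note that a score of exactly 0.0 is labeled positive under this rule.

```python
def label_from_polarity(polarity):
    """Map a polarity score in [-1, 1] to 1 (positive) or 0 (negative)."""
    return 1 if polarity >= 0 else 0

# Hypothetical polarity scores, chosen to include the boundary value 0.0
hypothetical_scores = [-0.6, -0.05, 0.0, 0.12, 0.9]
labels = [label_from_polarity(score) for score in hypothetical_scores]

print(labels)  # [0, 0, 1, 1, 1]
```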

### Check the accuracy of this model. 

In [11]:
# Step 3: Check accuracy
textblob_accuracy = (df['textblob_pred'] == df['sentiment']).mean()
print(f"TextBlob Accuracy: {textblob_accuracy:.2%}")

TextBlob Accuracy: 68.52%
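
A single accuracy figure can hide how the errors are split between the two classes. As a sketch, the example below builds a confusion table and per-class accuracy with pandas. The labels and predictions are made up for illustration and are not the notebook's results; on the real data, the toy frame would be replaced by df.

```python
import pandas as pd

# Hypothetical labels and predictions standing in for
# df['sentiment'] and df['textblob_pred']
toy = pd.DataFrame({
    "sentiment":     [1, 1, 1, 0, 0, 0],
    "textblob_pred": [1, 1, 1, 1, 1, 0],
})

# Overall accuracy: share of rows where prediction equals label
match = toy["textblob_pred"] == toy["sentiment"]
accuracy = match.mean()

# Confusion table: rows are actual labels, columns are predicted labels
confusion = pd.crosstab(
    toy["sentiment"], toy["textblob_pred"],
    rownames=["actual"], colnames=["predicted"],
)

# Accuracy within each actual class
per_class_accuracy = match.groupby(toy["sentiment"]).mean()

print(f"Accuracy: {accuracy:.2%}")
print(confusion)
print(per_class_accuracy)
```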


In [12]:
# Step 4: Compare with random guessing (50% baseline)
print(f"Is TextBlob better than random guessing? {'Yes' if textblob_accuracy > 0.5 else 'No'}")

Is TextBlob better than random guessing? Yes


In [13]:
# Step 5 (Extra Credit): Use VADER sentiment analysis
sia = SentimentIntensityAnalyzer()

def vader_sentiment(text):
    return 1 if sia.polarity_scores(text)['compound'] >= 0 else 0

df['vader_pred'] = df['review'].apply(vader_sentiment)

# Check VADER accuracy
vader_accuracy = (df['vader_pred'] == df['sentiment']).mean()
print(f"VADER Accuracy: {vader_accuracy:.2%}")



VADER Accuracy: 69.36%


In [14]:
# Compare VADER with random guessing
print(f"Is VADER better than random guessing? {'Yes' if vader_accuracy > 0.5 else 'No'}")

Is VADER better than random guessing? Yes


## Part 2: Prepping Text for a Custom Model


1. Convert all text to lowercase letters.
2. Remove punctuation and special characters from the text.
3. Remove stop words.
4. Apply NLTK’s PorterStemmer.
5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.
6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [17]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer


# Download necessary resources
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('punkt_tab')

# Initialize NLTK tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenize words
    tokens = word_tokenize(text)
    # Remove stop words and apply stemming
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    # Join back into a single string
    return " ".join(tokens)

# Apply preprocessing to each review
df['processed_review'] = df['review'].apply(preprocess_text)

df['processed_review']

0        stuff go moment mj ive start listen music watc...
1        classic war world timothi hine entertain film ...
2        film start manag nichola bell give welcom inve...
3        must assum prais film greatest film opera ever...
4        superbl trashi wondrous unpretenti 80 exploit ...
                               ...                        
24995    seem like consider gone imdb review film went ...
24996    dont believ made film complet unnecessari firs...
24997    guy loser cant get girl need build pick strong...
24998    30 minut documentari buñuel made earli 1930 on...
24999    saw movi child broke heart stori unfinish end ...
Name: processed_review, Length: 25000, dtype: object
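
The output above contains tokens such as "dont" and "cant". One likely reason is the order of the steps: punctuation is deleted before stop words are filtered, so a contraction that a stop list stores with its apostrophe (NLTK's English list includes "don't", for example) no longer matches. The sketch below illustrates the effect with plain Python. The mini stop list and both function names are hypothetical, and whitespace splitting stands in for word_tokenize.

```python
import string

# Hypothetical mini stop list; NLTK's English list is much longer
stop_words = {"i", "the", "this", "don't"}

# Translation table that deletes every ASCII punctuation character
delete_punctuation = str.maketrans("", "", string.punctuation)

def strip_then_filter(text):
    """Same order as the cell above: delete punctuation, then drop stop words."""
    cleaned = text.lower().translate(delete_punctuation)
    return [word for word in cleaned.split() if word not in stop_words]

def filter_then_strip(text):
    """Alternative order: drop stop words first, then delete punctuation."""
    tokens = [word.strip(string.punctuation) for word in text.lower().split()]
    kept = [word for word in tokens if word and word not in stop_words]
    return [word.translate(delete_punctuation) for word in kept]

sentence = "I don't like this film."
print(strip_then_filter(sentence))  # ['dont', 'like', 'film']
print(filter_then_strip(sentence))  # ['like', 'film']
```

Whether the second order is preferable depends on the task: for sentiment analysis, negations such as "don't" may carry signal that is worth keeping.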

### Create Bag-of-Words matrix

CountVectorizer() in NLP
CountVectorizer() is a text preprocessing tool from sklearn.feature_extraction.text that converts a collection of text documents into a matrix of token (word) counts. It is commonly used in Natural Language Processing (NLP) to transform raw text into numerical features for machine learning models. It performs the following steps:

Tokenization – Splits text into individual words or n-grams.

Lowercasing – Converts words to lowercase (by default).

Stop Word Removal (optional) – Removes common words like "the", "is", etc.

Word Frequency Count – Counts the occurrence of each unique word in the document.

In [20]:
# Initialize the CountVectorizer
count = CountVectorizer()

# Create the bag-of-words feature matrix
bag_of_words = count.fit_transform(df['processed_review'])

# Show the feature matrix (note: toarray() builds the full dense matrix,
# which is memory-intensive for a corpus of this size)
print(bag_of_words.toarray())

# Show the feature names
count.get_feature_names_out()


[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


array(['00', '000', '0000000000001', ..., 'überannoy', 'überspi',
       'üvegtigri'], dtype=object)

### Create TF-IDF matrix

In sentiment analysis, a TF-IDF (Term Frequency-Inverse Document Frequency) matrix represents the importance of words in documents, 
aiding in classifying text sentiment by highlighting words specific to a document or corpus. 

TF-IDF transforms text data into numerical vectors, where each row represents a document and each column represents a word (or term). 
The values in the matrix represent the TF-IDF scores for each word in each document.


A classifier trained on this matrix learns to associate patterns of TF-IDF scores with positive or negative sentiment (the labels in this dataset are binary). 
When presented with new text, the classifier uses that text's TF-IDF scores to predict its sentiment.

Term Frequency (TF): Measures how often a word appears in a document.

Inverse Document Frequency (IDF): Measures how rare a word is across a collection of documents (corpus); words that appear in fewer documents receive a higher weight.

TF-IDF Score: Calculated by multiplying TF and IDF, resulting in a score that reflects the relevance of a word to a document within the corpus.

TfidfVectorizer() is a text feature extraction tool from sklearn.feature_extraction.text. 
It converts raw text data into numerical representations that machine learning models can use 
by applying the Term Frequency-Inverse Document Frequency (TF-IDF) transformation. It performs the following steps:

Tokenization – Splits text into individual words or n-grams.

Lowercasing – Converts words to lowercase to maintain uniformity.

Stop Word Removal (optional) – Removes common words like "the", "is", etc.

TF Calculation – Counts the frequency of each word in a document.

IDF Calculation – Gives higher weight to less common words across documents.

TF-IDF Weight Calculation – Combines TF and IDF to represent the importance of words.

Compared with raw word counts, TF-IDF weighting:

Reduces the impact of frequently occurring words (like "the", "is") while keeping important words.

Captures meaningful word importance for text classification, clustering, and other NLP tasks.

Can improve performance in text-based machine learning models by reducing noise.
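
The TF, IDF, and weighting steps described above can be reproduced by hand. The sketch below assumes scikit-learn's default TfidfVectorizer settings (smoothed IDF, computed as ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row) and checks the manual result against the library on three made-up, already-stemmed reviews. Other TF-IDF variants use slightly different formulas.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up, already-stemmed reviews
toy_reviews = ["good movi good", "bad movi", "good stori"]

# TF: raw term counts per document
counts = CountVectorizer().fit_transform(toy_reviews).toarray()

# IDF (smoothed, scikit-learn default): ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF: multiply TF by IDF, then scale each row to unit L2 length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library's result
sklearn_tfidf = TfidfVectorizer().fit_transform(toy_reviews).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

In this toy corpus, "bad" and "stori" each appear in one document and therefore receive a higher IDF than "good" and "movi", which appear in two.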

In [22]:
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['processed_review'])
display(X_tfidf)

<25000x92345 sparse matrix of type '<class 'numpy.float64'>'
	with 2438862 stored elements in Compressed Sparse Row format>
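
The summary above shows why the matrix is stored in sparse form. As a rough sketch, the arithmetic below uses only the figures reported in that output to estimate the density of the matrix, the average number of distinct terms per review, and the memory a dense float64 copy would need (8 bytes per value).

```python
# Figures taken from the sparse matrix summary above
n_rows, n_cols = 25000, 92345
stored_elements = 2438862

# Share of cells that hold a non-zero value
density = stored_elements / (n_rows * n_cols)

# Average number of distinct terms per review
avg_terms_per_review = stored_elements / n_rows

# Memory for a dense float64 array of the same shape, in gigabytes
dense_gb = n_rows * n_cols * 8 / 1e9

print(f"Density: {density:.4%}")
print(f"Average distinct terms per review: {avg_terms_per_review:.1f}")
print(f"Dense float64 size: {dense_gb:.1f} GB")
```

Roughly one cell in a thousand is non-zero, so calling toarray() on the full matrix allocates memory almost entirely for zeros.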

### Display dimensions

In [23]:
print(f"Bag-of-Words Matrix Shape: {bag_of_words.shape}")  # (num_reviews, num_unique_words)

Bag-of-Words Matrix Shape: (25000, 92345)


In [26]:
print(f"TF-IDF Matrix Shape: {X_tfidf.shape}")  # Should match BoW dimensions

TF-IDF Matrix Shape: (25000, 92345)
