<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Text Frequency

---



### About:  
This notebook explores CountVectorizer and TF-IDF and uses the results to predict customer star ratings from product reviews. The star rating will serve as a proxy for sentiment.

### Learning Objective:
- Apply CountVectorizer and TF-IDF to complete a sentiment analysis of customer reviews.

### Notebook Guide

- NLP Scenario
- Introduction
- Stop words
- N-gram Length
- TF-IDF
- Try It!
- Conclusions and Takeaways

### Imports

In [None]:
# imports
import numpy as np
import pandas as pd

# NLP imports
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# model building imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
# code to avoid truncation of the output below 
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


# NLP Scenario 
You are an analyst for a marketing company that just launched a new product suite of mobile devices. You have data from product reviews of one of these new products, the TechWave X1. For this exercise you will use the text from the reviews to predict customer sentiment.

#####  Product Reviews 
1. "I absolutely love the TechWave X1! It has made my daily tasks so much easier and more efficient. Highly recommend it!"
2. "I'm not very impressed with the TechWave X1. It lacks some essential features and is quite slow."
3. "The TechWave X1 is fantastic! It has exceeded my expectations and has become an essential part of my daily routine."
4. "I found the TechWave X1 to be quite average. It does the job, but there's nothing particularly special about it."
5. "The TechWave X1 is terrible. It's full of glitches and crashes frequently. I regret purchasing it."
6.  "The TechWave X1 is disappointing. It doesn't live up to the hype and is missing several key functionalities."

# Introduction 
Computers understand the world in numbers. In order to use text data we have to first represent it in a way that computers can interpret. 

Let's convert the text above to a data frame and start working with the text data. 


This is a toy example with a small data set for demo/walkthrough purposes, which is why we're manually converting it to a data frame. In a real task, and in later tasks, you would import the data from CSV or TXT files.

In [None]:
# data setup
reviews_with_ratings = [
    ("I absolutely love the TechWave X1! It has made my daily tasks so much easier and more efficient. Highly recommend it!", 5),
    ("I'm not very impressed with the TechWave X1. It lacks some essential features and is quite slow.", 2),
    ("The TechWave X1 is fantastic! It has exceeded my expectations and has become an essential part of my daily routine.", 5),
    ("I found the TechWave X1 to be quite average. It does the job, but there's nothing particularly special about it.", 3),
    ("The TechWave X1 is terrible. It's full of glitches and crashes frequently. I regret purchasing it.", 1),
    ("The TechWave X1 is disappointing. It doesn't live up to the hype and is missing several key functionalities.", 2)
]

# Create a DataFrame
df = pd.DataFrame(reviews_with_ratings, columns=['reviews', 'star_rating'])
df.head()

Below we will convert the text from the reviews into a matrix with one column for each word and one row for each review. Each cell holds the number of times that word appears in that review. Take a look at the output below.

In [None]:
# Initialize CountVectorizer for n-grams (single words in this case)
vectorizer = CountVectorizer(ngram_range=(1, 1))

# Fit and transform the reviews
X = vectorizer.fit_transform(df['reviews'])

# Create a new DataFrame with the n-grams
ngrams_df = pd.DataFrame(
    X.toarray(), columns=vectorizer.get_feature_names_out())
print("ngrams_df shape:", ngrams_df.shape)
print("ngrams_df columns:", ngrams_df.columns)

# Display the new DataFrame
ngrams_df.head()
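
To make the matrix above less of a black box, here is a minimal sketch of the same idea in plain Python. It is not how scikit-learn is implemented; the `tokenize` helper and the two shortened example sentences are assumptions made for illustration. The regular expression mirrors CountVectorizer's default behavior of lowercasing and keeping tokens of two or more characters.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (hypothetical helper, similar in spirit to CountVectorizer's default).
    return re.findall(r"\b\w\w+\b", text.lower())

# Shortened versions of two reviews, for illustration only.
docs = [
    "The TechWave X1 is fantastic!",
    "The TechWave X1 is terrible. It crashes frequently.",
]

# Vocabulary: every distinct token, sorted alphabetically.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary word.
counts = [Counter(tokenize(doc)) for doc in docs]
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, and most cells are zero, which is why scikit-learn stores the result as a sparse matrix.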

# Stop words
Notice that we have some words that are so common (for example: "an", "and", "the") that they are not useful for our analysis. These words are called stop words. We can remove them by using the stop_words parameter in the CountVectorizer class. Notice the difference in the output below. 

In [None]:
# Initialize CountVectorizer for n-grams (single words in this case)
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words='english')

# Fit and transform the reviews
X = vectorizer.fit_transform(df['reviews'])

# Create a new DataFrame with the n-grams
ngrams_df = pd.DataFrame(
    X.toarray(), columns=vectorizer.get_feature_names_out())
print("ngrams_df shape:", ngrams_df.shape)
print("ngrams_df columns:", ngrams_df.columns)

# Display the new DataFrame
ngrams_df.head()

We went from 58 n-grams/words to 34 n-grams/words by removing the stop words. This will allow for more meaningful analysis of the reviews.  
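
Conceptually, stop-word removal is just a filter applied to the token list before counting. The sketch below assumes a tiny hand-picked stop list (`STOP_WORDS`) and a hypothetical `tokenize` helper; scikit-learn's built-in `'english'` list is much longer, so the real output will drop more words than this example does.

```python
import re

# A tiny hand-picked stop list for illustration only;
# it is NOT scikit-learn's 'english' list.
STOP_WORDS = {"the", "is", "it", "and", "an", "to", "of"}

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only the tokens that are not in the stop list.
    return [tok for tok in tokens if tok not in stop_words]

review = "The TechWave X1 is terrible. It's full of glitches and crashes frequently."
tokens = tokenize(review)
kept = remove_stop_words(tokens)

print(len(tokens), "tokens before,", len(kept), "after")
print(kept)
```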

# N-gram Length
Another modification we may want to make is adjusting how words are grouped. By using single words, we are missing important context such as "TechWave X1" being the product name of interest. Let's edit our code to capture bigrams (pairs of adjacent words).

In [None]:
# Initialize CountVectorizer for n-grams (bigrams only in this case)
vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english')

# Fit and transform the reviews
X = vectorizer.fit_transform(df['reviews'])

# Create a new DataFrame with the n-grams
ngrams_df = pd.DataFrame(
    X.toarray(), columns=vectorizer.get_feature_names_out())
print("ngrams_df shape:", ngrams_df.shape)
print("ngrams_df columns:", ngrams_df.columns)

# Display the new DataFrame
ngrams_df.head()

Interesting! Let's edit this a bit more to capture both bigrams and single words. 

In [None]:
# Initialize CountVectorizer for n-grams (single words and bigrams in this case)
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')

# Fit and transform the reviews
X = vectorizer.fit_transform(df['reviews'])

# Create a new DataFrame with the n-grams
ngrams_df = pd.DataFrame(
    X.toarray(), columns=vectorizer.get_feature_names_out())
print("ngrams_df shape:", ngrams_df.shape)
print("ngrams_df columns:", ngrams_df.columns)

# Display the new DataFrame
ngrams_df.head()

Great job! You've successfully created n-grams from the reviews using CountVectorizer. You've also learned how to filter out stop words and create n-grams of different lengths. This will be useful when you start building machine learning models to analyze text data. 
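
The `ngram_range=(n_min, n_max)` setting can be pictured as a sliding window over the token list. The function below is a minimal sketch of that idea; the name `ngrams` and the three-token example are assumptions for illustration, not scikit-learn internals.

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Tokens left after lowercasing and removing stop words (illustrative).
tokens = ["techwave", "x1", "fantastic"]

unigrams = ngrams(tokens, 1, 1)
bigrams = ngrams(tokens, 2, 2)
both = ngrams(tokens, 1, 2)

print(unigrams)
print(bigrams)
print(both)
```

Note that scikit-learn removes stop words before it builds n-grams, so a bigram such as "x1 fantastic" can appear even though the original sentence had "is" between those two words.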

# TF-IDF

Now let's explore another popular technique for text analysis: TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used in information retrieval and text mining. Let's see how it works in practice.

TF-IDF combines two metrics:
- Term Frequency (TF): How often a word appears in a document
- Inverse Document Frequency (IDF): How rare that word is across all documents

Together, these help identify words that are frequent in a specific document but uncommon in the rest of the corpus.
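
For reference, one way to write the two metrics down is sketched here. The smoothed form of IDF is the default in scikit-learn's `TfidfVectorizer` (`smooth_idf=True`); other texts define IDF slightly differently. Here $n$ is the number of documents, $\text{df}(t)$ is the number of documents that contain term $t$, and $\text{tf}(t, d)$ is the count of $t$ in document $d$.

$$
\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1
$$

$$
\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)
$$

With the default `norm='l2'`, each document's row of TF-IDF values is then divided by its Euclidean length, so every row has length 1.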

We will first explore this with the social media posts we saw in the slides: 

- I love learning about text. 
- Text models are fun. 
- I love learning at GA. Learning is fun!

In [None]:
# Sample text for demo 
text = ["I love learning about text.", "Text models are fun.", "I love learning at GA. Learning is fun!"]

In [None]:
# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 1))

X = vectorizer.fit_transform(text)
tfidf_df = pd.DataFrame(
    X.toarray(), columns=vectorizer.get_feature_names_out())
print("tfidf_df shape:", tfidf_df.shape)
print("tfidf_df columns:", tfidf_df.columns)
tfidf_df 

In [None]:
# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')

# Fit and transform the reviews
X = vectorizer.fit_transform(df['reviews'])

# Create a new DataFrame with the TF-IDF values
tfidf_df = pd.DataFrame(
    X.toarray(), columns=vectorizer.get_feature_names_out())
print("tfidf_df shape:", tfidf_df.shape)
print("tfidf_df columns:", tfidf_df.columns)

# Display the new DataFrame
tfidf_df.head()

Notice that the TF-IDF values differ from the raw counts. Each value is a term's count in a review multiplied by its inverse document frequency, so terms that are rare in the corpus get more weight and terms that are common get less. By default, `TfidfVectorizer` also rescales each review's row to unit length. This helps to identify the terms that are most distinctive for each document.
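
To see where these numbers come from, the sketch below recomputes TF-IDF by hand for the three sample posts, assuming scikit-learn's default settings (smoothed IDF and L2 row normalization). The helper names `tokenize` and `tfidf_row` are made up for this example, and the result should be read as an approximation of the library's behavior rather than its implementation.

```python
import math
import re
from collections import Counter

docs = [
    "I love learning about text.",
    "Text models are fun.",
    "I love learning at GA. Learning is fun!",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = {t: sum(t in toks for toks in tokenized) for t in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]   # term count times IDF
    norm = math.sqrt(sum(v * v for v in raw))   # Euclidean length of the row
    return [v / norm for v in raw]

rows = [tfidf_row(toks) for toks in tokenized]
weights_doc0 = dict(zip(vocab, rows[0]))
print(round(weights_doc0["about"], 3), round(weights_doc0["text"], 3))
```

In the first post, "about" appears in only one document while "text" appears in two, so "about" ends up with the larger weight even though both words occur once.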

# Try It! 
Let's use CountVectorizer and/or TF-IDF to predict the star rating of the reviews for the TechWave X1.

We will first import the full data set from `reviews.csv` (`./data/reviews.csv`). Then we will convert the star rating into a binary outcome (0/1) to approximate sentiment analysis. Finally, we will build a logistic regression model using the words as features to predict sentiment/ratings. Here we use logistic regression, but other classification models, such as Naive Bayes, Random Forest, or SVM, may perform better. Feel free to explore if you would like extra practice.

In this exercise, you will:
1. Process review text using both CountVectorizer and TF-IDF
2. Build logistic regression models to predict sentiment
3. Compare the performance of both approaches 

#### Read in the data and convert the `rating` feature to a numeric type

In [None]:
df_x1 = pd.read_csv('./data/reviews.csv')
df_x1['rating'] = pd.to_numeric(df_x1['rating'])
df_x1.info()

#### Create outcome variable from rating 

You will convert the 5-star rating into a proxy for sentiment where star ratings:  
- 1-3: Negative (0)
- 4-5: Positive (1)

We will do this with a lambda function and `.apply()` and create a new sentiment column that will be used as the outcome in our model. 

In [None]:
# Convert 5-star ratings to binary sentiment
# 1-3: Negative (0), 4-5: Positive (1)

df_x1['sentiment'] = df_x1['rating'].apply(lambda x: 0 if x <= 3 else 1)
pd.crosstab(df_x1['rating'], df_x1['sentiment'])
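
The same threshold rule can be written as a small named function, which is handy for a quick sanity check of the mapping and of the class balance. The sketch below uses plain Python and the six star ratings from the toy reviews earlier in this notebook; the function name `rating_to_sentiment` is an assumption for illustration.

```python
from collections import Counter

def rating_to_sentiment(rating):
    """Map a 1-5 star rating to 0 (negative, 1-3 stars) or 1 (positive, 4-5 stars)."""
    return 1 if rating >= 4 else 0

# Star ratings from the six toy reviews used earlier in this notebook.
ratings = [5, 2, 5, 3, 1, 2]
sentiment = [rating_to_sentiment(r) for r in ratings]

# Count how many reviews fall into each class.
balance = Counter(sentiment)
print(sentiment)
print(balance)
```

Checking the class balance early is useful because a heavily lopsided outcome makes accuracy harder to interpret later on.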

In [None]:
df_x1.head()

#### Create a training and test data set 

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df_x1['review'], df_x1['sentiment'], test_size=0.2, random_state=1212)

#### Process our text with CountVectorizer

In [None]:
# Initialize a CountVectorizer, fit and transform the training data, and transform the testing data
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')

# Fit and transform the training data
X_train_transformed = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_transformed = vectorizer.transform(X_test)
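
Why `fit_transform` on the training data but only `transform` on the test data? The vocabulary (the set of columns) is learned from the training text alone, and the test text is then counted against those same columns. The plain-Python sketch below illustrates this under simplified assumptions; `fit_vocabulary`, `transform`, and the example reviews are hypothetical and not part of scikit-learn.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(train_docs):
    """Learn the column order from the training documents only."""
    return sorted({tok for doc in train_docs for tok in tokenize(doc)})

def transform(docs, vocab):
    """Count vocabulary words per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocab])
    return rows

train_docs = ["great phone", "slow phone"]   # hypothetical training reviews
test_docs = ["great camera but slow"]        # hypothetical test review

vocab = fit_vocabulary(train_docs)
test_matrix = transform(test_docs, vocab)

print(vocab)
print(test_matrix)
```

The words "camera" and "but" never appeared in training, so they have no column and are silently dropped. Fitting on the test set as well would leak information about unseen data into the features.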


#### Build a logistic model to predict sentiment using the processed review text

Look back at prior lessons for Logistic Regression or check out the [scikit-learn logistic regression docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).


##### Hint!
```python
# Initialize a Logistic Regression model
model = LogisticRegression()
```

In [None]:
# Initialize a Logistic Regression model, train the model, predict the ratings, evaluate the model using accuracy, and show the confusion matrix.
model = LogisticRegression()

# Train the model on the transformed training data
model.fit(X_train_transformed, y_train)

# Predict the ratings for the testing data
y_pred = model.predict(X_test_transformed)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display the confusion matrix
print(confusion_matrix(y_test, y_pred))

#### Repeat the process with TF-IDF

In [None]:
# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')

# Fit and transform the training data
X_train_transformed = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_transformed = vectorizer.transform(X_test)


#### Build a Logistic Regression model with TF-IDF processed text

In [None]:
# Initialize a Logistic Regression model
model = LogisticRegression()

# Train the model on the transformed training data
model.fit(X_train_transformed, y_train)

# Predict the ratings for the testing data
y_pred = model.predict(X_test_transformed)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display the confusion matrix
print(confusion_matrix(y_test, y_pred))

#### What did we observe? 

CountVectorizer achieved slightly higher accuracy than TF-IDF on our test data. 

Looking at the confusion matrices:
- CountVectorizer correctly identified 2 negative reviews and all positive reviews
- TF-IDF classified all reviews as positive

This suggests that the raw word counts captured by CountVectorizer were more informative for our task than the weighted values from TF-IDF. One plausible reason is that TF-IDF downweights words that appear in many reviews and rescales each review to unit length, so sentiment words that are common across reviews, or repeated within one review, contribute less than their raw counts would. Because the TF-IDF model labeled every review as positive, its accuracy simply equals the share of positive reviews in the test set.
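
A model that predicts the same class for every review can still post a high accuracy when the classes are imbalanced. The sketch below shows this with made-up labels (8 positive, 2 negative); these numbers are purely illustrative and are not the notebook's data. The helper functions are plain-Python stand-ins for `accuracy_score` and `confusion_matrix`.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion(y_true, y_pred):
    """2x2 matrix: rows are true labels (0, 1), columns are predicted labels (0, 1)."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# Hypothetical labels for illustration only (not the notebook's data).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_all_positive = [1] * len(y_true)   # a "model" that always predicts positive

baseline_acc = accuracy(y_true, y_all_positive)
baseline_cm = confusion(y_true, y_all_positive)

print("Baseline accuracy:", baseline_acc)
print(baseline_cm)
```

Comparing each model's accuracy against this kind of majority-class baseline, and reading the confusion matrix row for the negative class, gives a fairer picture than accuracy alone.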


# Conclusions and Takeaways
- Bag-of-words methods represent text numerically so that a computer can use it for analysis.
- These methods are effective, fast, and cost-effective to implement.
- These methods don't capture the context around words. If your task requires that, consider using embeddings instead.