# Sentiment Analysis

In this notebook, we'll explore a basic example of sentiment analysis: learning sentiment from transcript data. Our aim is to build a system that labels each transcript/text data point as neutral, positive, or negative.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
import pandas as pd
import string
import matplotlib.pyplot as plt
import numpy as np

## Data Loading and Preprocessing

In [3]:
# Load the dataframe.
df = pd.read_excel('train_sentiment.xlsx')

# Display the first 5 rows.
df.head()

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26


We'll use the `text` column as features and the `sentiment` column as labels, so we keep these two columns and drop the rest.

In [4]:
# Take an explicit copy so later column assignments don't trigger SettingWithCopyWarning.
df = df[['text', 'sentiment']].copy()
df

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD I will miss you here in San Diego!!!,negative
2,my boss is bullying me...,negative
3,what interview! leave me alone,negative
4,"Sons of ****, why couldn`t they put them on t...",negative
...,...,...
27476,wish we could come see u on Denver husband l...,negative
27477,I`ve wondered about rake to. The client has ...,negative
27478,Yay good for both of you. Enjoy the break - y...,positive
27479,But it was worth it ****.,positive


## Feature Extraction

We could run sentiment analysis directly on the raw text, but variations in letter case and the abundance of punctuation and numbers would make the task harder.

To keep this example simple, we'll remove punctuation and numbers and convert everything to lowercase, on the assumption that they do not contribute to the overall sentiment of the transcript.

In [None]:
"adbc".translate(str.maketrans('abc', 'xyz', 'd'))

'xyz'

In [5]:
def remove_punctuation(text: str):
  """Removes punctuation from a string."""
  return text.translate(str.maketrans('', '', string.punctuation))

In [6]:
def remove_numbers(text: str):
  """Removes numbers from a string without using the re package."""
  return ''.join([i for i in text if not i.isdigit()])

In [7]:
# Remove punctuation
df['text_cleaned'] = df['text'].astype(str).apply(remove_punctuation)

# remove numbers
df['text_cleaned'] = df['text_cleaned'].astype(str).apply(remove_numbers)

# Convert to lowercase
df['text_cleaned'] = df['text_cleaned'].str.lower()

# display first 5 rows
df.head()

Unnamed: 0,text,sentiment,text_cleaned
0,"I`d have responded, if I were going",neutral,id have responded if i were going
1,Sooo SAD I will miss you here in San Diego!!!,negative,sooo sad i will miss you here in san diego
2,my boss is bullying me...,negative,my boss is bullying me
3,what interview! leave me alone,negative,what interview leave me alone
4,"Sons of ****, why couldn`t they put them on t...",negative,sons of why couldnt they put them on the rel...
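As a compact alternative, the three cleaning steps above could be folded into a single helper. This is only a sketch: `clean_text` is a hypothetical name that does not appear elsewhere in the notebook, and it assumes removing ASCII digits is enough (the notebook's `remove_numbers` relies on `str.isdigit()`, which also matches non-ASCII digits).

```python
import string

# Characters to delete: ASCII punctuation plus ASCII digits.
# Assumption: ASCII digits are sufficient for this dataset.
_DELETE_TABLE = str.maketrans('', '', string.punctuation + string.digits)


def clean_text(text: str) -> str:
    """Hypothetical helper: strip punctuation and digits, then lowercase."""
    return text.translate(_DELETE_TABLE).lower()


print(clean_text("Sooo SAD I will miss you here in San Diego!!!"))
# sooo sad i will miss you here in san diego
```

With such a helper, the same cleaning could be applied to any dataframe in one call, for example `df['text'].astype(str).apply(clean_text)`.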


Keep in mind that ML models can't work with raw text directly. We need to represent each transcript as numbers (via vectorization) so that the model can learn from it.

---
From here on, we can assume that the more often a word appears in a transcript, the more important it is. So the frequency of each word in a transcript can serve as a feature that a model learns from.

This works, but it has a limitation: very common English words like "the", "a", and "is" get very high values even though they contribute little to the sentiment of the transcript. Hence, we'll use the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer to convert the texts into numbers.

---
**Term Frequency** - measures how often a term appears in a document. A higher frequency means the term is more relevant to that specific document.

**Inverse Document Frequency** - measures how rare a term is across the entire corpus. Terms that appear in many documents have a low IDF, while terms that appear in few documents have a high IDF. Multiplying the two down-weights words that are frequent everywhere.

In [8]:
# Tokenize and vectorize the text using TfidfVectorizer
# TfidfVectorizer includes tokenization and lowercasing by default, but we already did lowercasing
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text_cleaned'])

In [9]:
tfidf_matrix.shape  # n_samples, n_tokens

(27481, 27971)

Each row of the TF-IDF matrix is a vector (a list of numbers) representing one data point. Let's look at the highest-weighted tokens/words in one of the transcripts.

In [10]:
feature_names = tfidf_vectorizer.get_feature_names_out()

idx = 1
n = 10

# get top n terms in the given data point idx
# (densify only that one row instead of the whole sparse matrix)
row = tfidf_matrix[idx].toarray().ravel()
top_n_indices = np.argsort(row)[::-1][:n]
top_n_terms = [feature_names[i] for i in top_n_indices]
top_n_tfidf_values = row[top_n_indices]


# show text, terms, and TF-IDF values
print(f"Transcript: {df.loc[idx, 'text_cleaned']}")
pd.DataFrame({'Term': top_n_terms, 'TF-IDF Value': top_n_tfidf_values})

Transcript:  sooo sad i will miss you here in san diego


Unnamed: 0,Term,TF-IDF Value
0,diego,0.506458
1,san,0.479819
2,sooo,0.378548
3,sad,0.294575
4,miss,0.27977
5,here,0.278431
6,will,0.259006
7,in,0.177635
8,you,0.169608
9,glyders,0.0


## Clustering

In this part, we'll use a very common clustering algorithm (KMeans) to group our data points into a chosen number of clusters. We won't use the `sentiment` column as labels here. Instead, we'll check whether the groupings the model finds in our preprocessed transcript data follow any logic, and how well they align with the actual sentiment.

In [11]:
# Perform KMeans clustering
# Specify the number of clusters to 3 (for neutral, positive, and negative sentiment)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=100)
kmeans.fit(tfidf_matrix)

In [12]:
# Add the cluster labels to the dataframe
df['kmeans_cluster'] = kmeans.labels_

# Display the last few rows with the new cluster labels
df[['text', 'sentiment', 'kmeans_cluster']].tail()

Unnamed: 0,text,sentiment,kmeans_cluster
27476,wish we could come see u on Denver husband l...,negative,0
27477,I`ve wondered about rake to. The client has ...,negative,0
27478,Yay good for both of you. Enjoy the break - y...,positive,1
27479,But it was worth it ****.,positive,0
27480,All this flirting going on - The ATG smiles...,neutral,0


In [13]:
# Compare the original sentiment labels with the KMeans clusters
sentiment_cluster_comparison = pd.crosstab(df['sentiment'], df['kmeans_cluster'])

# Display the comparison
sentiment_cluster_comparison

sentiment,kmeans_cluster=0,kmeans_cluster=1,kmeans_cluster=2
negative,7027,744,10
neutral,9644,1424,50
positive,6469,1536,577
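One way to put a number on how well the clusters line up with the labels is cluster purity. The sketch below is an addition, not part of the original analysis: it copies the counts from the crosstab above and compares purity against a baseline that always guesses the most frequent sentiment.

```python
import numpy as np

# Counts copied from the crosstab above: rows are sentiments, columns are clusters.
table = np.array([
    [7027,  744,  10],   # negative
    [9644, 1424,  50],   # neutral
    [6469, 1536, 577],   # positive
])

total = table.sum()

# Purity: label each cluster with its most frequent sentiment and
# measure the fraction of data points that would be labeled correctly.
purity = table.max(axis=0).sum() / total

# Baseline: always guess the most frequent sentiment overall.
majority_share = table.sum(axis=1).max() / total

print(f"purity: {purity:.3f}, majority baseline: {majority_share:.3f}")
# purity: 0.428, majority baseline: 0.405
```

A purity only slightly above the majority baseline suggests the clusters carry little information about sentiment.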


There is no clear correspondence between the KMeans clusters and the actual sentiment labels: cluster 0 absorbs most of the data points in every sentiment class. Let's take a closer look at the highest-scoring tokens in each cluster.

In [14]:
# Get the feature names (words)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Initialize a dictionary to store top terms for each cluster
top_terms_by_cluster = {}

# Iterate through each cluster
for cluster_id in sorted(df['kmeans_cluster'].unique()):
    # Get the indices of the documents in the current cluster
    cluster_indices = df[df['kmeans_cluster'] == cluster_id].index

    # Get the TF-IDF matrix for the current cluster
    tfidf_matrix_cluster = tfidf_matrix[cluster_indices]

    # Sum the TF-IDF scores for each term across all documents in the cluster
    sum_tfidf_cluster = tfidf_matrix_cluster.sum(axis=0)

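    # Get the indices of N terms in the cluster, ranked by summed TF-IDF.
    # Note: the slice skips the 10 highest-ranked terms and keeps ranks 11-15.
    skip_n = 10
    top_n = 5  # You can adjust N as needed
    top_n_indices_cluster = np.argsort(sum_tfidf_cluster.A1)[::-1][skip_n:skip_n + top_n]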

    # Get the top N terms and their scores for the cluster
    top_n_terms_cluster = [feature_names[i] for i in top_n_indices_cluster]
    top_n_scores_cluster = sum_tfidf_cluster.A1[top_n_indices_cluster]

    # Store the top terms and scores in the dictionary
    top_terms_by_cluster[cluster_id] = list(zip(top_n_terms_cluster, top_n_scores_cluster))

# Display the top terms for each cluster
for cluster_id, top_terms in top_terms_by_cluster.items():
    print(f"Top terms for Cluster {cluster_id}:")
    for term, score in top_terms:
        print(f"- {term}: {score:.2f}")
    print("\n")

Top terms for Cluster 0:
- on: 364.80
- so: 355.18
- me: 355.15
- that: 341.95
- its: 321.10


Top terms for Cluster 1:
- your: 75.48
- that: 74.84
- know: 65.87
- love: 65.33
- of: 62.14


Top terms for Cluster 2:
- out: 19.31
- wars: 19.07
- star: 17.36
- and: 15.04
- love: 13.72




## Classification

Since clustering doesn't give us much insight into what texts with the same sentiment have in common, let's try a classification model instead. For this one, we'll use a decision tree as our ML model.

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Define features (X) and target (y)
X = tfidf_matrix
y = df['sentiment']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

# Initialize and train a Decision Tree Classifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Predict sentiment on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

Accuracy: 0.61726897889886
Classification Report:
              precision    recall  f1-score   support

    negative       0.57      0.58      0.58      1158
     neutral       0.60      0.61      0.61      1697
    positive       0.68      0.66      0.67      1268

    accuracy                           0.62      4123
   macro avg       0.62      0.62      0.62      4123
weighted avg       0.62      0.62      0.62      4123



In [16]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to tune
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Perform hyperparameter tuning
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding accuracy
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_}")

# Evaluate the model with the best hyperparameters on the test set
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
report_tuned = classification_report(y_test, y_pred_tuned)

print(f"\nTest set accuracy with best hyperparameters: {accuracy_tuned}")
print("Classification Report with best hyperparameters:")
print(report_tuned)

Best hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10}
Best cross-validation accuracy: 0.6027913348745612

Test set accuracy with best hyperparameters: 0.6150861023526558
Classification Report with best hyperparameters:
              precision    recall  f1-score   support

    negative       0.57      0.57      0.57      1158
     neutral       0.59      0.63      0.61      1697
    positive       0.69      0.64      0.67      1268

    accuracy                           0.62      4123
   macro avg       0.62      0.61      0.62      4123
weighted avg       0.62      0.62      0.62      4123



## Predicting on Unseen Data Points
Now, let's load our other (test) dataset and use the trained model to predict its sentiment labels.

In [18]:
# Load the dataframe.
df_test = pd.read_excel('test_sentiment.xlsx')

# limit to important columns
df_test = df_test[['text', 'sentiment']]

# Remove punctuation
df_test['text_cleaned'] = df_test['text'].astype(str).apply(remove_punctuation)

# remove numbers
df_test['text_cleaned'] = df_test['text_cleaned'].astype(str).apply(remove_numbers)

# Convert to lowercase
df_test['text_cleaned'] = df_test['text_cleaned'].str.lower()

In [19]:
tfidf_matrix_test = tfidf_vectorizer.transform(df_test['text_cleaned'])

# set testing set
X_test = tfidf_matrix_test
y_test = df_test['sentiment']


# train on entire train dataset
best_model.fit(X, y)

# predict on entire test dataset
y_pred = best_model.predict(X_test)

report_test = classification_report(y_test, y_pred)
print("Classification report on test dataset with best hyperparameters: ")
print(report_test)

Classification report on test dataset with best hyperparameters: 
              precision    recall  f1-score   support

    negative       0.58      0.57      0.58      1001
     neutral       0.59      0.63      0.61      1430
    positive       0.70      0.65      0.68      1103

    accuracy                           0.62      3534
   macro avg       0.62      0.62      0.62      3534
weighted avg       0.62      0.62      0.62      3534
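The vectorizing and prediction steps above have to be repeated in the same order for every new dataset. One way to keep them together is scikit-learn's `Pipeline`, sketched below on made-up training sentences. The step names `"tfidf"` and `"tree"`, the variable `sentiment_pipeline`, and the toy data are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy training data (made up for illustration, not from the dataset).
train_texts = [
    "i love this so much",
    "what a great happy day",
    "this is awful and sad",
    "i hate waiting in line",
    "going to the store now",
    "the meeting is at noon",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# The vectorizer is fitted on the training texts only; predict() reuses
# the fitted vocabulary and IDF weights for any new text.
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
sentiment_pipeline.fit(train_texts, train_labels)

new_texts = ["such a happy day", "so sad about this"]
print(sentiment_pipeline.predict(new_texts))
```

The pipeline does not include the notebook's punctuation and number removal, so raw text would still need to be cleaned first to match the preprocessing used here.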



## Future Directions

Now that you know how to do sentiment analysis using both clustering and classification models, try to apply what we did here on your own transcript/text dataset.