## **#03. Sentiment, Emotion, and Predictive Analysis with Text Data**
- Instructor: [Jaeung Sim](https://jaeungs.github.io/) (University of Connecticut)
- Course: OPIM 5671 Data Mining and Time Series Forecasting
- Last updated: September 17, 2025

**Objectives**
1. Predict sentiment and other emotion variables using pre-trained models.
1. Build a predictive model using sentiment and emotion variables.

**References**
* [Disneyland Reviews at Kaggle (Data Source)](https://www.kaggle.com/datasets/arushchillar/disneyland-reviews)
* [Python | Lemmatization with NLTK](https://www.geeksforgeeks.org/python-lemmatization-with-nltk/)
* [A friendly guide to NLP: Bag-of-Words with Python example](https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/)

#### **Part 1. Understanding the Data**

**Introduction to the Dataset**
* **Source:** Disneyland Reviews dataset on Kaggle (<https://www.kaggle.com/datasets/arushchillar/disneyland-reviews>)
* **About this file**
  * The dataset includes about 42,000 reviews of three Disneyland branches (Paris, California, and Hong Kong), posted by visitors on Tripadvisor. See the [Kaggle page](https://www.kaggle.com/datasets/arushchillar/disneyland-reviews) for more details.
  * Column Description
    1. `Review_ID`: unique id given to each review
    1. `Rating`: ranging from 1 (unsatisfied) to 5 (satisfied)
    1. `Year_Month`: when the reviewer visited the theme park
    1. `Reviewer_Location`: country of origin of visitor
    1. `Review_Text`: comments made by visitor
    1. `Disneyland_Branch`: location of Disneyland Park

**Download data with Python code**

In [None]:
# Libraries for data downloading and processing
import numpy as np
import pandas as pd
import kagglehub
import os

In [None]:
# Libraries for visualization
import matplotlib.pyplot as plt
import matplotlib.patheffects as path_effects
import seaborn as sns
%matplotlib inline

plt.style.use("fivethirtyeight")
pd.set_option('display.max_colwidth', 80)

In [None]:
# Download latest version
path = kagglehub.dataset_download("arushchillar/disneyland-reviews")

print("Path to dataset files:", path)

**Deal with DataFrame**

In [None]:
# List files in the downloaded dataset directory
files = os.listdir(path)
print("Files in dataset:", files)

# Load the CSV file (assuming there's only one CSV file)
csv_file = [f for f in files if f.endswith('.csv')][0]  # Get the first CSV file
csv_path = os.path.join(path, csv_file)

In [None]:
# Read into DataFrame with the default encoding (UTF-8)
df = pd.read_csv(csv_path)  # Expected to fail with a decoding error: the file is not UTF-8 encoded

In [None]:
# Attempt with ISO-8859-1 encoding
df = pd.read_csv(csv_path, encoding="ISO-8859-1")  # "latin-1" is an alias; "Windows-1252" is a closely related alternative
df.head()
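
Why the first attempt fails and the second succeeds can be shown on a single word. The sketch below is only an illustration and assumes the CSV contains bytes such as `0xE9`, which is a complete character in ISO-8859-1 but an incomplete sequence in UTF-8; the sample word is made up.

```python
# "café" encoded as ISO-8859-1: the last character becomes the single byte 0xE9
raw = "café".encode("latin-1")

# Decoding the same bytes as UTF-8 fails because 0xE9 starts a multi-byte sequence
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("UTF-8 failed:", err.reason)

# Decoding with the encoding that produced the bytes recovers the text
decoded = raw.decode("latin-1")
print(decoded)
```

ISO-8859-1 maps every possible byte to a character, so decoding never raises an error; that also means it cannot tell you whether it was the right choice, so a quick look at `df.head()` is still worthwhile.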

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Check data types
df.info()

In [None]:
# Create a bar plot with value counts
sns.countplot(x='Rating', data=df)

#### **Part 2. Processing Text Data**

In [None]:
# NLP libraries
import nltk
import string
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

Here are a few additions to the earlier notebooks:
* Dealing with contractions
* Extending the stop word set by adding contextual terms

**Considering contractions in English**

In [None]:
# A dictionary of main contractions in English
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are",
"you've": "you have"
}
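
A dictionary like this only helps if the words in the text match its keys exactly. The sketch below shows one possible way to apply such a mapping while tolerating curly apostrophes and trailing punctuation. The helper `expand_contractions` and the three-entry `sample_contractions` map are hypothetical and used only for illustration.

```python
import re

# Tiny illustrative subset of a contraction map (hypothetical)
sample_contractions = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def expand_contractions(text, mapping):
    """Replace contractions in text, tolerating curly apostrophes and nearby punctuation."""
    text = text.lower().replace("\u2019", "'")  # normalize the curly apostrophe
    # Try longer keys first so a longer contraction wins over a shorter prefix
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

expanded = expand_contractions("Don\u2019t worry, it's fun!", sample_contractions)
print(expanded)
```

Splitting on whitespace and looking up each word is simpler, but a word such as `can't,` (with a trailing comma) would then be missed; the regular expression avoids that at the cost of some readability.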

**Extending the stop word set by adding contextual terms**

In [None]:
# Define a basic stop word set
stop_words = set(stopwords.words('english'))

In [None]:
# Extend the stop word set
stop_words.update(['park', 'disney', 'disneyland']) # Context-specific stopwords
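
The effect of extending a stop word set can be seen on one sentence. The sketch below uses a small hand-written set instead of the NLTK list so that it runs without any downloads; `base_stop_words` and the sample sentence are made up for illustration.

```python
# Small hand-written stop word set (hypothetical; not the full NLTK list)
base_stop_words = {"the", "a", "was", "and", "to", "not"}

# Add context-specific terms, mirroring stop_words.update([...])
domain_terms = ["park", "disney", "disneyland"]
custom_stop_words = base_stop_words | set(domain_terms)

# Keep only the tokens that are not in the extended set
tokens = "the disneyland park was crowded and the ride was not fun".split()
kept = [t for t in tokens if t not in custom_stop_words]
print(kept)
```

Note that dropping `not` turns "not fun" into "fun". The NLTK English list also contains negations such as `not` and `no`, so when the cleaned text feeds a sentiment model it may be worth removing them from the stop word set first.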

**Define a text processing function**

In [None]:
# Bring lemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
# Define a text pre-processing function
def process_text(text):
    # Lowercasing
    text = text.lower()

    # Remove URLs first, while "://" is still intact
    text = re.sub(r'https?://\S+', '', text)

    # Expand contractions word by word
    words = [contractions.get(word, word) for word in text.split()]
    text = ' '.join(words)

    # Replace remaining apostrophes with spaces, then remove punctuation and special characters
    text = text.replace("'", ' ')
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords & perform lemmatization
    filtered_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return ' '.join(filtered_tokens)

In [None]:
# Apply the pre-processing function
df['Review_Clean'] = df['Review_Text'].apply(process_text)
df['Review_Clean']

In [None]:
from collections import Counter

# Join all cleaned reviews into one string, separated by spaces
review_words = ' '.join(df['Review_Clean'].values)

# Count each word (use a new name so the Counter class is not overwritten)
word_counts = Counter(review_words.split())
most_frequent = word_counts.most_common(30)

# Bar plot of frequent words
fig = plt.figure(1, figsize=(20, 10))
freq_df = pd.DataFrame(most_frequent, columns=("words", "count"))
sns.barplot(x='words', y='count', data=freq_df, palette='winter')
plt.xticks(rotation=45);
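
The counting step is easier to follow on a handful of words. A minimal sketch with three made-up cleaned reviews:

```python
from collections import Counter

# Three made-up cleaned reviews (illustrative only)
clean_reviews = ["ride great queue long", "great food great ride", "queue long"]

# Join the reviews, then split on whitespace to get one flat list of words
all_words = " ".join(clean_reviews).split()

# Counter maps each word to its frequency; most_common sorts by count, descending
word_counts = Counter(all_words)
top_two = word_counts.most_common(2)
print(top_two)
```

`most_common(n)` returns a list of `(word, count)` tuples, which is why it can be passed straight to `pd.DataFrame` with two column names.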

In [None]:
# Generate the word cloud
wordcloud = WordCloud(background_color="white",
                      max_words= 200,
                      contour_width = 8,
                      contour_color = "steelblue",
                      collocations=False).generate(review_words)

# Visualize the word cloud
fig = plt.figure(1, figsize = (10, 10))
plt.axis('off')
plt.imshow(wordcloud)
plt.show()

#### **Part 3. Sentiment Analysis**

##### **3.1. Predict text sentiment with the `TextBlob` library**
* **Type:** Rule-based (lexicon-driven) sentiment analysis.
* **Features:** Returns polarity, from -1 (negative) to 1 (positive), and subjectivity, from 0 (objective) to 1 (subjective).

In [None]:
# Import necessary libraries
from textblob import TextBlob

In [None]:
# Function to get sentiment scores using TextBlob
def get_textblob_sentiment(text):
    analysis = TextBlob(text)
    return analysis.sentiment.polarity  # Returns a score between -1 (negative) and 1 (positive)

In [None]:
# Apply sentiment analysis function to the processed text
df['TextBlob_Sentiment'] = df['Review_Clean'].astype(str).apply(get_textblob_sentiment)
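
To build intuition for what a lexicon-based polarity score is, the sketch below averages word-level polarity values from a made-up mini lexicon. It is not TextBlob's algorithm or lexicon (TextBlob's analyzer is more elaborate); `toy_lexicon` and `toy_polarity` are hypothetical names used only here.

```python
# Made-up mini lexicon with polarity values in [-1, 1] (hypothetical; not TextBlob's lexicon)
toy_lexicon = {"great": 0.75, "fun": 0.5, "dirty": -0.5, "terrible": -1.0}

def toy_polarity(text):
    """Average the polarity of words found in the lexicon; return 0.0 when none match."""
    scores = [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("great ride fun parade"))
print(toy_polarity("terrible queue"))
print(toy_polarity("we went on tuesday"))
```

A text with no lexicon words scores exactly 0, so a pile-up of scores at 0 in a histogram can reflect missing vocabulary as much as genuinely neutral reviews.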

In [None]:
# Display the updated DataFrame
df

**Explore the sentiment variable**

In [None]:
# Summary statistics
df['TextBlob_Sentiment'].describe()

In [None]:
import matplotlib.pyplot as plt

# Plot histogram for the 'TextBlob_Sentiment' column
plt.figure(figsize=(10, 5))
plt.hist(df['TextBlob_Sentiment'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Histogram of TextBlob Sentiment Scores')
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()

##### **3.2. Predict text sentiment with DistilBERT (Hugging Face)**
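* **Type:** Transformer-based deep learning model.
* **Features:** Returns a predicted class label with a confidence score. The checkpoint used below is fine-tuned on SST-2 and distinguishes only `POSITIVE` and `NEGATIVE` (there is no `neutral` class).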

In [None]:
# Import necessary libraries
from transformers import pipeline

In [None]:
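# Initialize DistilBERT sentiment analysis pipeline (name the checkpoint explicitly instead of relying on the default)
sentiment_pipeline = pipeline("sentiment-analysis",
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")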

In [None]:
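# Apply sentiment analysis to the processed text column (slow on CPU for tens of thousands of reviews)
# truncation=True cuts reviews that exceed the model's maximum input length instead of raising an error
df['DistilBERT_Sentiment'] = df['Review_Clean'].astype(str).apply(
    lambda text: sentiment_pipeline(text, truncation=True)[0]['label'])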
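
The pipeline returns a label and a score for each text. As a rough sketch of where those come from, a two-class classifier produces one raw score (logit) per class, and a softmax turns the logits into probabilities. The logits below are hypothetical and were not produced by the model.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one review, ordered as [NEGATIVE, POSITIVE]
labels = ["NEGATIVE", "POSITIVE"]
logits = [-1.2, 2.3]

# Pick the class with the highest probability and report it with its score
probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result)
```

Keeping the `score` alongside the `label` (rather than the label alone) gives a continuous variable that can be used as a model feature later.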

In [None]:
# Display the updated DataFrame
df

#### **Part 4. Emotion Features**

##### **4.1. Predict emotions with NRC Emotion Lexicon**
* **Type:** Lexicon-based emotion detection.
* **Features:** Labels text with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, disgust), plus overall `positive` and `negative` sentiment.


In [None]:
# Install the NRC Emotion Lexicon if not already available
!pip install nrclex

In [None]:
# Import necessary libraries
from nrclex import NRCLex

In [None]:
# Function to extract emotion scores using NRC Emotion Lexicon
def get_emotions(text):
    emotion = NRCLex(text)
    return emotion.raw_emotion_scores  # Returns a dictionary of emotion scores

In [None]:
# Apply the emotion analysis function to the processed text column
df['NRC_Emotions'] = df['Review_Clean'].astype(str).apply(get_emotions)
df['NRC_Emotions']
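
Lexicon-based emotion scoring boils down to looking up each word and counting the emotions attached to it. The sketch below reproduces that idea with a made-up mini lexicon; the entries are hypothetical and are not taken from the NRC Emotion Lexicon.

```python
from collections import Counter

# Made-up mini lexicon mapping words to emotions (hypothetical; not actual NRC entries)
toy_emotion_lexicon = {
    "magical": ["joy", "surprise"],
    "queue": ["anticipation"],
    "rude": ["anger", "disgust"],
    "happy": ["joy", "trust"],
}

def toy_emotion_counts(text):
    """Count how many matched words carry each emotion and return a plain dict."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(toy_emotion_lexicon.get(word, []))
    return dict(counts)

scores = toy_emotion_counts("magical parade happy kid rude staff")
print(scores)
```

Emotions that never occur in a text are simply absent from the dictionary, which is why the missing values need to be filled with 0 once the dictionaries are expanded into columns.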

In [None]:
# Convert emotion dictionaries into separate columns
emotion_df = df['NRC_Emotions'].apply(pd.Series).fillna(0)
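
The expansion from dictionaries to columns can be checked on a toy Series. The sketch below uses `pd.DataFrame` on a list of dictionaries, which is an alternative to `.apply(pd.Series)` that tends to be faster on large data; the three dictionaries are made up.

```python
import pandas as pd

# Hypothetical per-review emotion dictionaries with different keys
emotion_dicts = pd.Series([{"joy": 2, "trust": 1}, {"anger": 1}, {}])

# Build one column per emotion; emotions missing from a review become 0
wide = pd.DataFrame(emotion_dicts.tolist(), index=emotion_dicts.index).fillna(0).astype(int)
print(wide)
```

Passing `index=emotion_dicts.index` keeps the rows aligned with the original DataFrame, which matters for the `pd.concat(..., axis=1)` step.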

In [None]:
# Merge the emotion features into the original DataFrame
df = pd.concat([df, emotion_df], axis=1)
df

**Explore the updated DataFrame `df`**

In [None]:
df.describe()

#### **Part 5. Predictive Analysis with Features**

* Dependent variable: `Rating`
* Independent variables: `TextBlob_Sentiment`, `anger`, `fear`, `anticipation`, `trust`, `surprise`, `sadness`, `joy`, `disgust`

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
# Define dependent and independent variables
dependent_var = 'Rating'
independent_vars = ['TextBlob_Sentiment', 'anger', 'fear', 'anticipation', 'trust',
                    'surprise', 'sadness', 'joy', 'disgust']

In [None]:
# Drop missing values to ensure complete cases
df = df.dropna(subset=[dependent_var] + independent_vars)

In [None]:
# Split the data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(df[independent_vars], df[dependent_var],
                                                    test_size=0.3, random_state=123)

In [None]:
# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

**Explore coefficients** (you don't need to understand every line)

In [None]:
# Import necessary libraries
import scipy.stats as stats

# Get coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_

# Compute predictions and residuals
y_pred = model.predict(X_train)
residuals = y_train - y_pred

# Compute standard errors
n = X_train.shape[0]  # Number of observations
p = X_train.shape[1]  # Number of predictors
X_with_const = np.c_[np.ones(n), X_train]  # Add constant for intercept
var_residuals = np.sum(residuals**2) / (n - p - 1)
cov_matrix = np.linalg.inv(X_with_const.T @ X_with_const) * var_residuals
std_errors = np.sqrt(np.diag(cov_matrix))

# Compute t-statistics and p-values
t_stats = coefficients / std_errors[1:]  # Exclude intercept from std_errors
p_values = [2 * (1 - stats.t.cdf(np.abs(t), df=n - p - 1)) for t in t_stats]

# Create DataFrame for results
regression_results = pd.DataFrame({
    'Variable': ['Intercept'] + independent_vars,
    'Coefficient': [intercept] + list(coefficients),
    'Standard Error': std_errors,
    't-Statistic': [intercept / std_errors[0]] + list(t_stats),
    'p-Value': [2 * (1 - stats.t.cdf(np.abs(intercept / std_errors[0]), df=n - p - 1))] + p_values
})
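
For reference, the quantities computed in the cell above correspond to the standard ordinary least squares formulas, written here with $X$ as the training matrix including the constant column, $n$ observations, and $p$ predictors:

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y
\qquad
\hat{\sigma}^{2} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{n - p - 1}

\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^{2} \, (X^{\top} X)^{-1}
\qquad
\mathrm{SE}(\hat{\beta}_j) = \sqrt{\big[\widehat{\mathrm{Var}}(\hat{\beta})\big]_{jj}}

t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}
\qquad
p_j = 2 \left( 1 - F_{t,\, n-p-1}\big( |t_j| \big) \right)
```

Here $F_{t,\,n-p-1}$ is the cumulative distribution function of the $t$ distribution with $n - p - 1$ degrees of freedom, matching `stats.t.cdf(..., df=n - p - 1)` in the code.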

In [None]:
regression_results

**Explore predictive performance**

In [None]:
# Predict on both training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

In [None]:
# Compute metrics for training set
train_r2 = r2_score(y_train, y_train_pred)
train_adj_r2 = 1 - (1-train_r2) * (len(y_train)-1) / (len(y_train)-X_train.shape[1]-1)
train_mse = mean_squared_error(y_train, y_train_pred)
train_mae = mean_absolute_error(y_train, y_train_pred)

# Compute metrics for test set
test_r2 = r2_score(y_test, y_test_pred)
test_adj_r2 = 1 - (1-test_r2) * (len(y_test)-1) / (len(y_test)-X_test.shape[1]-1)
test_mse = mean_squared_error(y_test, y_test_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
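
The four metrics can also be computed by hand, which makes the adjusted R-squared formula above easier to read. A minimal sketch on made-up ratings and predictions; `regression_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def regression_metrics(y_true, y_pred, n_predictors):
    """Return R-squared, adjusted R-squared, MSE, and MAE for paired lists of numbers."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # total variation
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)        # penalize extra predictors
    mse = ss_res / n
    mae = sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / n
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "mae": mae}

# Made-up ratings and predictions (illustrative only)
y_true = [5, 4, 3, 2, 1]
y_pred = [4.5, 4.0, 3.5, 2.0, 1.0]
metrics = regression_metrics(y_true, y_pred, n_predictors=1)
print(metrics)
```

Adjusted R-squared is never larger than R-squared, and the gap widens as predictors are added relative to the number of observations.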

In [None]:
# Create a DataFrame to display results
metrics_df = pd.DataFrame({
    'Metric': ['R-squared', 'Adjusted R-squared', 'Mean Squared Error', 'Mean Absolute Error'],
    'Training Set': [train_r2, train_adj_r2, train_mse, train_mae],
    'Test Set': [test_r2, test_adj_r2, test_mse, test_mae]
})

metrics_df