# QM 701: Advanced Data Analytics and Applications
# Homework 2


-------

## Objective
The goal of this homework is to help you become more familiar with the concept of sentiment analysis and its application pipelines. In this homework, you will delve into the fundamental principles of sentiment analysis, explore various techniques, and gain hands-on experience by implementing sentiment analysis on real-world text data.

## Tasks
This homework includes the following 6 questions.
 - **Q1**: Labels. Map the initial emotions to the simpler categories. (10 points)
 - **Q2**: Text Pre-processing. How would you pre-process the text to clean the data for sentiment analysis? (20 points)
 - **Q3**: Vectorization: Convert the text to data using BoW or TF-IDF using the sklearn library (CountVectorizer or TfidfVectorizer). You will also need to split the data into training/test in this step. (20 points)
 - **Q4**: Naive Bayes Model: Use Naive-Bayes (MultinomialNB) within sklearn to train your model and test your results (20 points)
 - **Q5**: Multi-Categorical Logistic Regression: Use LogisticRegression (Multinomial Logistic Regression) within sklearn to train your model and test your results (15 points)
 - **Q6**: Apply the TextBlob sentiment analysis library to the same data set. Compare the results with the Naive Bayes model. (15 points + 5 Bonus Points)

**Files required**:
*  english_stopwords
*  text_classification dataset

**Source for Corpus**:
*  https://data.world/crowdflower/sentiment-analysis-in-text
*  Author: [@CrowdFlower](https://data.world/crowdflower)
*  Direct download link via Box: https://duke.box.com/s/phdcgpx2bgbrrg0ydiw7l05837q82c2y

**Homework topics covered**:
1.  Text Vectorization (Converting text to numbers)
2.  Text Pre-processing
3.  Text Classification
4.  Applied Sentiment Analysis
5.  Python for NLP and Machine Learning

**Colab Note**:
Don't forget you can click the > arrows next to topics to expand/hide sections of code within the notebook.

## File and Data setup

*  Loads stopwords: *stopwords*
*  Loads a dataframe: *emotion_raw_csv*

In [122]:
# General dataframe imports
import pandas as pd
import numpy as np

# sklearn imports
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn import naive_bayes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import text
import sklearn.feature_extraction

# Stopwords
import nltk

# Sentiment Analysis
from textblob import TextBlob

# Regular Expressions
import re

# Preprocessing
import gensim

In [123]:
# Download Stopwords
# Downloads stopwords to /root/nltk_data/corpora/stopwords/english
from nltk.corpus import stopwords
nltk.download('stopwords')

# Import stopwords to a list:
sfile = open(stopwords._root.path + '/english','r')
stopwords = sfile.read().splitlines()
sfile.close()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [124]:
# Download the working dataset
# saves a file to your directory as /content/text_emotion.csv
!wget https://duke.box.com/shared/static/ll2bqmkdxsnj8wmm6zmqwlmct2hbf7oe -O text_emotion.csv

--2024-06-27 04:17:31--  https://duke.box.com/shared/static/ll2bqmkdxsnj8wmm6zmqwlmct2hbf7oe
Resolving duke.box.com (duke.box.com)... 74.112.186.144
Connecting to duke.box.com (duke.box.com)|74.112.186.144|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/ll2bqmkdxsnj8wmm6zmqwlmct2hbf7oe [following]
--2024-06-27 04:17:31--  https://duke.box.com/public/static/ll2bqmkdxsnj8wmm6zmqwlmct2hbf7oe
Reusing existing connection to duke.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://duke.app.box.com/public/static/ll2bqmkdxsnj8wmm6zmqwlmct2hbf7oe [following]
--2024-06-27 04:17:31--  https://duke.app.box.com/public/static/ll2bqmkdxsnj8wmm6zmqwlmct2hbf7oe
Resolving duke.app.box.com (duke.app.box.com)... 74.112.186.144
Connecting to duke.app.box.com (duke.app.box.com)|74.112.186.144|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://public.boxcloud.com/d/1/b1!WyV1G

In [125]:
import os
file_path = os.path.join(os.getcwd(), 'text_emotion.csv')

# Use pandas to read the text_emotion_file into memory
emotion_raw_csv = pd.read_csv(file_path)

# Print out the shape of emotion_raw_csv
print(emotion_raw_csv.shape)

(40000, 4)


In [126]:
# Peek at the dataframe
emotion_raw_csv.head()

| | tweet_id | sentiment | author | content |
|---|---|---|---|---|
| 0 | 1956967341 | empty | xoshayzers | @tiffanylue i know i was listenin to bad habi... |
| 1 | 1956967666 | sadness | wannamama | Layin n bed with a headache ughhhh...waitin o... |
| 2 | 1956967696 | sadness | coolfunky | Funeral ceremony...gloomy friday... |
| 3 | 1956967789 | enthusiasm | czareaquino | wants to hang out with friends SOON! |
| 4 | 1956968416 | neutral | xkilljoyx | @dannycastillo We want to trade with someone w... |


## Q1: **Labels**
Since there are multiple emotions in the initial file, you need to map the initial emotions to the simpler categories.


In [127]:
# We have 13 emotions in this text file:
emotions_list = ['empty','sadness','enthusiasm','neutral','worry',
          'surprise','love','fun','hate','happiness','boredom',
          'relief','anger']

### a) Decide your model forms

Please indicate whether you want to select a model with:
*  multi-categorical labels between positive, negative, and neutral
or
*  binary labels between positive and negative.


Enter your choice, along with a short explanation:


**Answer:** I would choose multi-categorical labels, since they are more comprehensive and can accommodate the many different types of emotions in the data.

### b) Map the emotions into the 'y' labels. (Note: be sure to review the code below and make the changes if needed)

In [128]:
# Based on your choice of the model form,
# re-assign neutral (0) (if applicable), positive (1),
# and negative (-1) to the 13 emotions by changing the following dict values

emotions_dict = {
    'empty': -1,
    'sadness': -1,
    'enthusiasm': 1,
    'neutral': 0,
    'worry': -1,
    'surprise': 0,
    'love': 1,
    'fun': 1,
    'hate': -1,
    'happiness': 1,
    'boredom': -1,
    'relief': 1,
    'anger': -1
    }

### c) Briefly discuss or comment, as appropriate, on your decision on the model form and the emotion mappings:

**Answer:** I chose multi-categorical labels for deeper insights and a more accurate sentiment representation in which neutral plays a role, because several of the emotions (for example, 'neutral' and 'surprise') are ambiguous rather than clearly positive or negative.
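
As a quick sanity check on a mapping like this, the sketch below applies a dictionary with `Series.map` to a small made-up sample, confirms that no label was left unmapped (unmapped values become `NaN`), and then looks at the class balance. The names `toy_map` and `toy` and the sample rows are illustrative assumptions, not rows taken from the dataset.

```python
import pandas as pd

# Same mapping idea as emotions_dict, shortened for the illustration
toy_map = {'sadness': -1, 'worry': -1, 'neutral': 0, 'love': 1, 'fun': 1}

# Hypothetical sample of sentiment labels (not real rows from the corpus)
toy = pd.DataFrame({'sentiment': ['sadness', 'love', 'neutral', 'fun', 'worry', 'love']})
toy['y'] = toy['sentiment'].map(toy_map)

# Series.map returns NaN for any label missing from the dict
n_unmapped = int(toy['y'].isna().sum())

# Share of each class after mapping; useful for spotting imbalance
class_share = toy['y'].value_counts(normalize=True).sort_index()
print(n_unmapped)
print(class_share)
```

Running the same two checks on the full `y` column would show whether any of the 13 emotions was missed and how balanced the three classes are.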



### d) Create the 'y' column in the dataframe.
We provide the code for creating the 'y' column. You can just run it without any edits needed.

In [129]:
# Mapping the 'sentiment' column values using the emotions_dict
# This will create a new column 'y' in the DataFrame with the mapped values
emotion_raw_csv['y'] = emotion_raw_csv['sentiment'].map(emotions_dict)

# Displaying the first 10 rows of the updated DataFrame
print(emotion_raw_csv.head(10))
# You can then form a 'y' variable as
y = emotion_raw_csv['y']

     tweet_id   sentiment         author  \
0  1956967341       empty     xoshayzers   
1  1956967666     sadness      wannamama   
2  1956967696     sadness      coolfunky   
3  1956967789  enthusiasm    czareaquino   
4  1956968416     neutral      xkilljoyx   
5  1956968477       worry  xxxPEACHESxxx   
6  1956968487     sadness       ShansBee   
7  1956968636       worry       mcsleazy   
8  1956969035     sadness    nic0lepaula   
9  1956969172     sadness     Ingenue_Em   

                                             content  y  
0  @tiffanylue i know  i was listenin to bad habi... -1  
1  Layin n bed with a headache  ughhhh...waitin o... -1  
2                Funeral ceremony...gloomy friday... -1  
3               wants to hang out with friends SOON!  1  
4  @dannycastillo We want to trade with someone w...  0  
5  Re-pinging @ghostridah14: why didn't you go to... -1  
6  I should be sleep, but im not! thinking about ... -1  
7               Hmmm. http://www.djhero.com/ is dow

## Q2: **Text Pre-Processing**


### a) Describe and identify as many preprocessing steps for the corpus `emotion_raw_csv['content']` as you can.

**Answer:** We can perform the following preprocessing:<br>
Lowercasing, punctuation removal, number removal, and whitespace removal.<br>
Tokenization, stop-word removal, stemming and lemmatization, negation handling (replacing "isn't" with "is_not"), removal of special and non-ASCII characters, rare word/token removal, synonym consolidation, and dependency parsing.
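
A few of these steps can be sketched as a single cleaning function. This is a minimal illustration under stated assumptions, not the notebook's method: `clean_tweet` and the small `CONTRACTIONS` table are hypothetical names, and stripping @mentions and URLs is an added assumption that suits tweets. Punctuation is replaced with spaces rather than deleted, so words joined by "..." do not get merged into one token.

```python
import re
import string

# Hypothetical, deliberately tiny negation table (an assumption for illustration)
CONTRACTIONS = {"isn't": "is_not", "wasn't": "was_not", "don't": "do_not", "can't": "can_not"}

def clean_tweet(text):
    """Lowercase, drop URLs/mentions, mark negations, strip punctuation and digits."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove @mentions
    for contraction, replacement in CONTRACTIONS.items():
        text = text.replace(contraction, replacement)
    # Replace punctuation with spaces, keeping "_" so "is_not" survives
    punct = string.punctuation.replace("_", "")
    text = text.translate(str.maketrans(punct, " " * len(punct)))
    text = re.sub(r"\d+", " ", text)            # remove digits
    return " ".join(text.split())               # collapse repeated whitespace

cleaned = clean_tweet("@bob This isn't GOOD...see http://x.co/1 at 5pm!!")
print(cleaned)
```

The order matters: negations are rewritten before punctuation is stripped, because removing the apostrophe first would turn "isn't" into "isnt" and the lookup would no longer match.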

### b) Select at least three text preprocessing steps and apply them on the corpus `emotion_raw_csv['content']`.

* Please store the processed text as `emotion_raw_csv['processed_content']`.
* Ensure that each entry in `emotion_raw_csv['processed_content']` is a string, not a list of tokens.

Hint: See `PreClass2_Preprocess_SocialMedia.ipynb` posted on Canvas

In [130]:
# Lowercasing, Removing punctuation, Removing stop words, Stemmer and Lemmatizer:
import string
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk

from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
# Lowercasing
emotion_raw_csv['processed_content'] = emotion_raw_csv['content'].str.lower()

# Removing punctuation
emotion_raw_csv['processed_content'] = emotion_raw_csv['processed_content'].apply(
    lambda text: text.translate(str.maketrans('', '', string.punctuation))
)

# Removing stop words
emotion_raw_csv['processed_content'] = emotion_raw_csv['processed_content'].apply(
    lambda text: ' '.join([word for word in word_tokenize(text) if word not in stopwords])
)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming; the isalpha() filter also drops tokens that contain digits or other non-letter characters
emotion_raw_csv['processed_content'] = emotion_raw_csv['processed_content'].apply(
    lambda text: ' '.join([stemmer.stem(word) for word in word_tokenize(text.lower()) if word.isalpha()])
)

# Lemmatization (applied to the already-stemmed tokens)
emotion_raw_csv['processed_content'] = emotion_raw_csv['processed_content'].apply(
    lambda text: ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(text)])
)

# Removing any remaining digits (a safeguard; the isalpha() filter above already drops tokens with digits)
emotion_raw_csv['processed_content'] = emotion_raw_csv['processed_content'].apply(
    lambda text: re.sub(r'\d+', '', text)
)
# Display the DataFrame to check results
print(emotion_raw_csv[['content', 'processed_content']])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


                                                 content  \
0      @tiffanylue i know  i was listenin to bad habi...   
1      Layin n bed with a headache  ughhhh...waitin o...   
2                    Funeral ceremony...gloomy friday...   
3                   wants to hang out with friends SOON!   
4      @dannycastillo We want to trade with someone w...   
...                                                  ...   
39995                                   @JohnLloydTaylor   
39996                     Happy Mothers Day  All my love   
39997  Happy Mother's Day to all the mommies out ther...   
39998  @niariley WASSUP BEAUTIFUL!!! FOLLOW ME!!  PEE...   
39999  @mopedronin bullet train from tokyo    the gf ...   

                                       processed_content  
0      tiffanylu know listenin bad habit earlier star...  
1                  layin n bed headach ughhhhwaitin call  
2                            funer ceremonygloomi friday  
3                                  want han

In [131]:
# Run this code to compare the original content with your processed text
# Displaying the first 5 rows of the updated DataFrame
emotion_raw_csv.head(5)

| | tweet_id | sentiment | author | content | y | processed_content |
|---|---|---|---|---|---|---|
| 0 | 1956967341 | empty | xoshayzers | @tiffanylue i know i was listenin to bad habi... | -1 | tiffanylu know listenin bad habit earlier star... |
| 1 | 1956967666 | sadness | wannamama | Layin n bed with a headache ughhhh...waitin o... | -1 | layin n bed headach ughhhhwaitin call |
| 2 | 1956967696 | sadness | coolfunky | Funeral ceremony...gloomy friday... | -1 | funer ceremonygloomi friday |
| 3 | 1956967789 | enthusiasm | czareaquino | wants to hang out with friends SOON! | 1 | want hang friend soon |
| 4 | 1956968416 | neutral | xkilljoyx | @dannycastillo We want to trade with someone w... | 0 | dannycastillo want trade someon houston ticket... |


### c) Next, complete the code below to split the data into a training set and a test set with the split rate 80% : 20%.

In [132]:
# Complete the last line of code to split the data into training and testing sets.

# Define 'X' variable
X = emotion_raw_csv['processed_content']
# Recall that 'y' variable is already defined as y = emotion_raw_csv['y']

# Split the data into training and testing set:
X_train, X_test, y_train, y_test =  train_test_split(X, y,  test_size=0.20, random_state=42)
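
If the three classes turn out to be imbalanced, one optional variation is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below uses the `stratify` argument of `train_test_split` on made-up labels; `toy_X`, `toy_y`, and the 50/25/25 class mix are assumptions for illustration only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 negative, 25 neutral, 25 positive
toy_y = [-1] * 50 + [0] * 25 + [1] * 25
toy_X = [f"tweet {i}" for i in range(len(toy_y))]

# stratify=toy_y keeps the class proportions the same in both subsets
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

test_counts = Counter(yte)
print(test_counts)
```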


## Q3: **Vectorization**

### a) Pick one vectorizer from the following:
*   Count Vectorizer (BoW)
*   TF-IDF
*   BoW with Stopword Removal
*   BoW with n-gram and Stopword Removal

In [133]:
# Example Vectorizers using sklearn library
# Pick one of these, and uncomment it

# Count Vectorizer (BoW)
#vectorizer = text.CountVectorizer()

# TF-IDF Vectorizer
#vectorizer = text.TfidfVectorizer()

# BOW with stopword removal
#vectorizer = text.CountVectorizer(stop_words=stopwords)

# 2-grams with stopword removal
vectorizer = text.CountVectorizer(ngram_range=(1,2),stop_words=stopwords)


### b) Fit the vectorizer to the training set and transform the test set.

We provide the code for demonstrating how to fit / transform the vectorizer. You can just run it without any edits needed.

In [134]:
# You fit and transform your training text and just transform the test text
# fit_transform builds the dictionary and transforms the text
# transform function uses the existing dictionary to transform text
# You must use the same dictionary or else the model shape will not match
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

### c) As we increase the `n` in the n-gram vectorizer, how does the size of the feature names change?

**Answer:** With n=2 (bigram mode), the phrase "the quick brown fox" contributes the combinations "the quick", "quick brown", and "brown fox".<br>
In trigram mode (n=3), it also contributes "the quick brown" and "quick brown fox".<br>
As n increases, the number of features grows, and the feature matrix becomes higher-dimensional and sparser, which can lead to overfitting. Removing stop words reduces the number of potential n-grams.<br> Larger n also increases computational load and memory usage. The added model complexity can improve performance when the longer phrases carry useful signal.

### d) Describe how 2-gram differs from BoW and TF-IDF.

**Answer:** A 2-gram (bigram) is a pair of adjacent words, so it captures some local context and word order.<br>
Bag of Words (BoW) counts how often each word occurs in a document and ignores word order. By default, it uses 1-grams.<br>
TF-IDF (Term Frequency-Inverse Document Frequency) weights each term by its frequency in the document (TF) and its inverse document frequency across the corpus (IDF). Because it considers the whole corpus rather than a single document, TF-IDF highlights the words that are truly distinctive for a document.

### e) What is the # of feature names stored in the vectorizer? (To answer this question, you will first need to write code and run it)

In [135]:
num_features = len(vectorizer.get_feature_names_out())

print(f"Number of features (bigrams) in the vectorizer: {num_features}")

Number of features (bigrams) in the vectorizer: 187730


**Answer:** With `ngram_range=(1,2)` (unigrams plus bigrams), the vectorizer stores 187730 features. (If I change it to 3-gram, the number will be 354763.)

## Q4: **Naive-Bayes Model**
Use Naive-Bayes (MultinomialNB) within sklearn to fit your model and test your results.


### a) Build your Naive-Bayes model on the training set, and generate predictions on the test set.
We provide the code for Naive-Bayes model initialization, training and prediction. You can just run it without any edits needed.

In [136]:
## Fit the Naive-Bayes Model and Form a prediction
# Your variables for X and y may be different.
# You will need to ensure you are passing the vectorized X in and the Y labels

# Initialize a multinomial Naive Bayes model
model = naive_bayes.MultinomialNB()

# Fit/train the model using the training data
model.fit(X_train_vectors, y_train)

# Use the model to make prediction using the testing data
y_pred = model.predict(X_test_vectors)
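
As an optional alternative to handling the vectorizer and the model separately, the two steps can be chained in a scikit-learn pipeline, which ensures the same vocabulary is used for fitting and predicting. The sketch below is a minimal illustration on four made-up texts; `toy_texts`, `toy_labels`, and `nb_pipeline` are hypothetical names, not part of the homework code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training texts with positive (1) and negative (-1) labels
toy_texts = ["love this", "great fun", "hate this", "so sad"]
toy_labels = [1, 1, -1, -1]

# The pipeline fits the vectorizer and the classifier in one call
nb_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
nb_pipeline.fit(toy_texts, toy_labels)

# At prediction time the raw text is vectorized with the fitted vocabulary
toy_pred = nb_pipeline.predict(["love fun"])
print(toy_pred)
```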

### b)  Evaluate your Naive-Bayes model performance with Precision, Recall, and F1 Score.
We provide the code for model performance evaluation with Precision, Recall, and F1 Score.

You can just run it without any edits needed.

In [137]:
# Get our performance metrics, precision, recall, F1

precision_class = precision_score(y_test, y_pred, average=None, zero_division=0.0)
recall_class = recall_score(y_test, y_pred, average=None, zero_division=0.0)
f1_class = f1_score(y_test, y_pred, average=None, zero_division=0.0)

print("{:=^50s}".format("Naive-Bayes Performance by Class (entries are ordered as negative, neutral and positive)"))

# Output results by class
print("Precision:", precision_class)
print("Recall:", recall_class)
print("F1 Score:", f1_class)

Naive-Bayes Performance by Class (entries are ordered as negative, neutral and positive)
Precision: [0.53455571 0.49425287 0.59092599]
Recall: [0.83114035 0.07944573 0.60121075]
F1 Score: [0.65064378 0.13688818 0.59602401]


Next, compute model performance evaluation with weighted Precision, Recall, and F1 Score.

Hint: use the code above, but change `average=None` to `average='weighted'`

In [138]:
# Get our performance metrics, precision, recall, F1

precision_class = precision_score(y_test, y_pred, average='weighted', zero_division=0.0)
recall_class = recall_score(y_test, y_pred, average='weighted', zero_division=0.0)
f1_class = f1_score(y_test, y_pred, average='weighted', zero_division=0.0)

print("{:=^50s}".format("Naive-Bayes Performance by Class (entries are ordered as negative, neutral and positive). weighted"))

# Output results by class
print("Precision:", precision_class)
print("Recall:", recall_class)
print("F1 Score:", f1_class)

Naive-Bayes Performance by Class (entries are ordered as negative, neutral and positive). weighted
Precision: 0.5422720886340335
Recall: 0.55175
F1 Score: 0.4935636620421808


### c) Explain the meaning of precision and recall, and interpret the model performance with the above results.

**Answer:** Precision = True Positives / (True Positives + False Positives). It is the share of observations predicted to be in a class that actually belong to that class.<br>
Recall = True Positives / (True Positives + False Negatives). It is the share of observations actually in a class that the model correctly identifies.<br>
F1 Score = 2 * (Precision * Recall) / (Precision + Recall). It is the harmonic mean of Precision and Recall, so it accounts for both false positives and false negatives, and it is particularly useful when the class distribution is uneven.<br>
Weighted results: Precision is 0.542 (54.2%). This is low, indicating that a considerable share of the predictions for each class are false positives.<br>
Recall is 0.552. Because weighted recall in a multi-class problem equals overall accuracy, the model labels about 55.2% of the test tweets correctly and misses about 44.8%.<br>
The weighted F1 Score of 0.494 (49.4%) is relatively low. It is pulled down by the neutral class, whose recall is only 0.079 (F1 0.137): the model rarely predicts neutral, while the negative class has high recall (0.831) but modest precision (0.535).<br>
With `average=None`, the three numbers are the per-class scores, ordered as negative, neutral, and positive.
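
The definitions can be checked by hand from a confusion matrix. The sketch below uses made-up labels (`toy_true` and `toy_pred` are assumptions, not the homework results), computes per-class precision and recall from the matrix, and compares them with scikit-learn. It also shows that support-weighted recall is the same number as accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels for three classes
toy_true = [-1, -1, -1, 0, 0, 1, 1, 1]
toy_pred = [-1, -1, 1, -1, 0, 1, 1, 0]
labels = [-1, 0, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred, labels=labels)

tp = np.diag(cm)                        # true positives per class
manual_precision = tp / cm.sum(axis=0)  # TP / everything predicted as the class
manual_recall = tp / cm.sum(axis=1)     # TP / everything actually in the class

# Weighted recall uses each class's support (actual count) as its weight
support = cm.sum(axis=1)
weighted_recall = np.average(manual_recall, weights=support)

print(manual_precision, manual_recall, weighted_recall)
```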

## Q5: **Multi-Class Logistic Regression**
Next, let us train a logistic regression model to perform the same classification task. We will then evaluate the model using the same testing data.

### a) Build your multinomial logistic regression model on the training set, and generate predictions on the test set.
Hint: You may check Q4 a) for model initialization, training, and prediction. Also, do not worry about enabling multinomial LR: the LogisticRegression algorithm from sklearn chooses multinomial LR automatically once it recognizes that there are more than two classes.

In [139]:

# Building the multinomial logistic regression model
log_reg_model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=200)
log_reg_model.fit(X_train_vectors, y_train)

# Predicting the labels for the test set
y_pred = log_reg_model.predict(X_test_vectors)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### b)  Evaluate your LR model performance with Precision, Recall, and F1 Score.

Hint: Again, you may check some of the code provided in Q4.

In [140]:
# Get our performance metrics, precision, recall, F1

precision_class = precision_score(y_test, y_pred, average=None, zero_division=0.0)
recall_class = recall_score(y_test, y_pred, average=None, zero_division=0.0)
f1_class = f1_score(y_test, y_pred, average=None, zero_division=0.0)

print("{:=^50s}".format("Logistic Regression Performance by Class (entries are ordered as negative, neutral and positive)"))

# Output results by class
print("Precision:", precision_class)
print("Recall:", recall_class)
print("F1 Score:", f1_class)

Logistic Regression Performance by Class (entries are ordered as negative, neutral and positive)
Precision: [0.61684149 0.43657817 0.61878453]
Recall: [0.66322055 0.41016166 0.59326523]
F1 Score: [0.63919082 0.42295785 0.60575623]


### c)  Evaluate your Logistic Regression model performance with weighted Precision, Recall, and F1 Score.

In [141]:
precision_class = precision_score(y_test, y_pred, average='weighted', zero_division=0.0)
recall_class = recall_score(y_test, y_pred, average='weighted', zero_division=0.0)
f1_class = f1_score(y_test, y_pred, average='weighted', zero_division=0.0)

print("{:=^50s}".format("Logistic Regression Performance by Class (entries are ordered as negative, neutral and positive). Weighted"))

# Output results by class
print("Precision:", precision_class)
print("Recall:", recall_class)
print("F1 Score:", f1_class)

Logistic Regression Performance by Class (entries are ordered as negative, neutral and positive). Weighted
Precision: 0.5687
Recall: 0.571625
F1 Score: 0.5696268193676524


### d) Briefly discuss whether you prefer multinomial Naive Bayes or Logistic Regression for this classification task, and why.

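**Answer:** I will choose LR.<br>On the weighted metrics, LR has higher recall than NB (0.572 vs. 0.552) and a clearly higher F1 score (0.570 vs. 0.494), which indicates a better balance between precision and recall. The biggest gain is on the neutral class, where recall rises from 0.079 to 0.410; the trade-off is lower recall on the negative class (0.663 vs. 0.831).<br>
Logistic Regression is more robust to correlated features, since it does not assume that the features are conditionally independent given the class, unlike Naive Bayes.<br>
LR tends to work better with larger datasets, but it is also more computationally intensive.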

## Q6: **TextBlob**
In this question, we will apply TextBlob (a text-processing library built on top of NLTK) to the dataset for sentiment analysis. Recall that TextBlob returns a sentiment score between -1.0 (most negative) and +1.0 (most positive).


### a) Run the code below that computes and stores the textblob scores for each (processed) tweet in the testing set.

In [142]:
y_pred_textblob_score = X_test.apply(lambda x: TextBlob(str(x)).sentiment[0])

### b) We are going to use TextBlob to classify the tweets as follows: we first select the positive cutoff to be +0.33 and the negative cutoff to be -0.33. Then, we classify all tweets with scores > 0.33 as +1 (positive), all tweets with scores < -0.33 as -1 (negative), and the remaining tweets with -0.33 <= scores <= 0.33 as 0 (neutral).

Based on the selected cutoffs, classify tweets as -1, 0, 1 using the Textblob sentiment scores.

In [143]:
# Your code for the question above.

# Define cutoffs
positive_cutoff = 0.33
negative_cutoff = -0.33

# Hint: you may want to use the code below but you need to define the right cutoffs
y_pred_textblob = y_pred_textblob_score.apply(lambda x: 1 if x > positive_cutoff else (-1 if x < negative_cutoff else 0))
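
The same cutoff rule can also be written in vectorized form with `numpy.select`, which makes the boundary behavior easy to check: scores exactly equal to a cutoff fall in the neutral band. This is a sketch on made-up scores; `classify_scores` and `toy_scores` are hypothetical names, not part of the homework code.

```python
import numpy as np

def classify_scores(scores, positive_cutoff=0.33, negative_cutoff=-0.33):
    """Map polarity scores to 1 (positive), -1 (negative), or 0 (neutral)."""
    scores = np.asarray(scores, dtype=float)
    return np.select(
        [scores > positive_cutoff, scores < negative_cutoff],  # conditions, checked in order
        [1, -1],                                               # label for each condition
        default=0,                                             # everything else is neutral
    )

# Made-up polarity scores, including values exactly on the cutoffs
toy_scores = [0.8, 0.33, 0.0, -0.33, -0.5]
toy_classes = classify_scores(toy_scores)
print(toy_classes)
```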


### c) Evaluate your Textblob model performance with weighted Precision, Recall, and F1 Score.
Hint: You may follow the code provided for Q4 b).

In [144]:
# Your code for the question above.

# Calculate precision, recall, and F1 score with 'weighted' average
precision_class = precision_score(y_test, y_pred_textblob, average='weighted', zero_division=0)
recall_class = recall_score(y_test, y_pred_textblob, average='weighted', zero_division=0)
f1_class = f1_score(y_test, y_pred_textblob, average='weighted', zero_division=0)

print("{:=^50s}".format("TextBlob Performance by Class (entries are ordered as negative, neutral and positive)"))
# Print the performance metrics
print("Precision:", precision_class)
print("Recall:", recall_class)
print("F1 Score:", f1_class)

TextBlob Performance by Class (entries are ordered as negative, neutral and positive)
Precision: 0.5565483975567815
Recall: 0.3835
F1 Score: 0.34671952329699124


### d) Would a different set of cutoffs improve Textblob's performance? If so, how would you find such a set without using the testing data set? (The 5 bonus points are reserved for coding up an implementation to optimize the cutoffs)

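Please enter your answer to the question above here:

**Answer:** Yes. The cutoffs can be tuned on the training set alone: compute TextBlob scores for the training tweets, search a grid of positive/negative cutoff pairs, score each pair by weighted F1 with 5-fold cross-validation, and keep the best pair. The code below does this and selects cutoffs of about ±0.05, with a cross-validated weighted F1 of about 0.42, compared with 0.35 on the test set for the ±0.33 cutoffs (the two numbers come from different data, so the comparison is only indicative).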

In [145]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
from textblob import TextBlob

# Define a function to calculate sentiment scores using TextBlob
def get_sentiment(text):
    return TextBlob(text).sentiment.polarity

# Define a function to classify sentiments based on cutoffs
def classify_with_cutoffs(sentiments, positive_cutoff, negative_cutoff):
    return np.array([1 if score > positive_cutoff else -1 if score < negative_cutoff else 0 for score in sentiments])

# Prepare sentiment scores for training data
X_train_sentiments = np.array([get_sentiment(text) for text in X_train])

# Setting up K-Fold cross-validation
kf = KFold(n_splits=5)  # 5-fold cross-validation
best_score = 0
best_cutoffs = (0, 0)
cutoffs = np.linspace(-0.5, 0.5, 21)  # Generates 21 points from -0.5 to 0.5

for pos_cutoff in cutoffs[cutoffs > 0]:
    for neg_cutoff in cutoffs[cutoffs < 0]:
        scores = []
        for train_index, val_index in kf.split(X_train_sentiments):
            # Get training and validation subsets for this fold
            # Use .iloc to access elements by position since the index might not be a simple range
            X_train_fold, y_train_fold = X_train_sentiments[train_index], y_train.iloc[train_index]
            X_val_fold, y_val_fold = X_train_sentiments[val_index], y_train.iloc[val_index]

            # Classify sentiments in the validation fold
            y_val_pred = classify_with_cutoffs(X_val_fold, pos_cutoff, neg_cutoff)
            # Calculate F1 score for this fold and cutoffs
            score = f1_score(y_val_fold, y_val_pred, average='weighted', zero_division=0)
            scores.append(score)

        # Average score across all folds
        avg_score = np.mean(scores)
        if avg_score > best_score:
            best_score = avg_score
            best_cutoffs = (pos_cutoff, neg_cutoff)

print("Best Cutoffs:", best_cutoffs, "with F1 Score:", best_score)
#Best Cutoffs: (0.050000000000000044, -0.04999999999999999) with F1 Score: 0.4235336093844955

Best Cutoffs: (0.050000000000000044, -0.04999999999999999) with F1 Score: 0.4235336093844955




---
## End of Assignment
