## Original Model

I am going to start by importing codecarbon to keep track of my machine learning model's carbon emissions.

In [None]:
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(allow_multiple_runs=True)

tracker.start()

[codecarbon INFO @ 21:10:29] [setup] RAM Tracking...
[codecarbon INFO @ 21:10:29] [setup] GPU Tracking...
[codecarbon INFO @ 21:10:29] No GPU found.
[codecarbon INFO @ 21:10:29] [setup] CPU Tracking...
 Mac OS and ARM processor detected: Please enable PowerMetrics sudo to measure CPU

[codecarbon INFO @ 21:10:30] CPU Model on constant consumption mode: Apple M2
[codecarbon INFO @ 21:10:30] >>> Tracker's metadata:
[codecarbon INFO @ 21:10:30]   Platform system: macOS-14.6-arm64-arm-64bit
[codecarbon INFO @ 21:10:30]   Python version: 3.11.8
[codecarbon INFO @ 21:10:30]   CodeCarbon version: 2.8.0
[codecarbon INFO @ 21:10:30]   Available RAM : 16.000 GB
[codecarbon INFO @ 21:10:30]   CPU count: 8
[codecarbon INFO @ 21:10:30]   CPU model: Apple M2
[codecarbon INFO @ 21:10:30]   GPU count: None
[codecarbon INFO @ 21:10:30]   GPU model: None
[codecarbon INFO @ 21:10:32] Saving emissions data to file /Users/kristinlloyd/Downloads/ClimateSentiment-Politics-Project/Sentiment/emissions.csv


The code below is adapted from https://github.com/MichaelOmosebi/Sentiments-Analysis-Climate-Change. On the first pass through this code, I make only minor changes and swap in my own train.csv and test.csv. On the second pass, I make my own changes to the model and experiment with the hyperparameters.

In [6]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from imblearn.over_sampling import SMOTE
from collections import Counter

My train_df dataset consists of synthetic tweets generated by ChatGPT. Initially, I attempted to use a dataset from GitHub containing real tweets about climate change, but the accuracy scores were only around 50%. I believe the ChatGPT-generated tweets improved the model's accuracy because the dataset was cleaner and I prompted ChatGPT to capture tones such as sarcasm.

Below, we load and process the training and test datasets. The code removes URLs, mentions, hashtags, punctuation, digits, and stopwords. It also lemmatizes words and tokenizes the text. 

In [7]:
train_df = pd.read_csv("../Data/raw-data/train.csv")
test_df = pd.read_csv("../Data/raw-data/test.csv")

In [None]:
def preprocess(text):

    tokenizer = TreebankWordTokenizer() 
    lemmatizer = WordNetLemmatizer()
    stopwords_list = stopwords.words('english')
    point_noise = string.punctuation + '0123456789'
    
    cleanText = re.sub(r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+', "", text)
    cleanText = re.sub(r'@[a-zA-Z0-9\_\w]+', '', cleanText)
    cleanText = re.sub(r'#[a-zA-Z0-9]+', '', cleanText)
    cleanText = re.sub(r'RT', '', cleanText)
    cleanText = cleanText.lower()
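    # Note: this pattern is five character classes in a row, not the literal "http";
    # it only matches five consecutive letters drawn from h, t, p, and s.
    cleanText = re.sub(r'([https][http][htt][th][ht])', "", cleanText)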
    cleanText = ''.join([word for word in cleanText if word not in point_noise])
    cleanText = "".join(word for word in cleanText if ord(word)<128)
    cleanText = tokenizer.tokenize(cleanText)
    cleanText = [lemmatizer.lemmatize(word) for word in cleanText if word not in stopwords_list]
    cleanText = [word for word in cleanText if len(word) >= 2]
    cleanText = ' '.join(cleanText)
    return cleanText

[codecarbon INFO @ 21:10:47] Energy consumed for RAM : 0.000025 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 21:10:47] Energy consumed for all CPUs : 0.000177 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 21:10:47] 0.000202 kWh of electricity used since the beginning.


In [9]:
train_df['message'] = train_df['message'].apply(preprocess)
test_df['message'] = test_df['message'].apply(preprocess)
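
To see what the regex steps in preprocess do without needing the NLTK corpora, here is a minimal sketch that applies only the URL, mention, hashtag, and RT removals to a made-up tweet. The function name strip_twitter_noise and the sample text are hypothetical, and the URL pattern is a simplified stand-in for the one used above.

```python
import re

def strip_twitter_noise(text):
    """Apply a simplified version of the regex cleanup from preprocess()."""
    text = re.sub(r'https?://\S+', '', text)   # drop URLs (simplified pattern, an assumption)
    text = re.sub(r'@\w+', '', text)           # drop @mentions
    text = re.sub(r'#[a-zA-Z0-9]+', '', text)  # drop hashtags, including the tag word itself
    text = re.sub(r'RT', '', text)             # drop the retweet marker
    text = text.lower()
    return ' '.join(text.split())              # collapse leftover whitespace

sample = "RT @user: Solar jobs are booming! #climate https://example.com/x"
cleaned = strip_twitter_noise(sample)
print(cleaned)
```

Punctuation survives here because preprocess removes it in a later step. Note that the RT pattern is not anchored to word boundaries, so it also removes an uppercase "RT" inside a longer word (for example, "SMART" becomes "SMA" before lowercasing).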

TfidfVectorizer converts text data into numerical features. min_df = 2 means that it will ignore terms that appear in fewer than two messages. ngram_range = (1,20) means that it will consider sequences of 1 to 20 consecutive words, which lets the vectorizer capture phrases.

In [10]:
vector = TfidfVectorizer(ngram_range=(1,20), min_df=2)
train_features = vector.fit_transform(train_df['message'])
test_features = vector.transform(test_df['message'])
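
As a rough illustration of what ngram_range controls, the pure-Python sketch below lists the word n-grams of one short, made-up message. The helper word_ngrams is hypothetical and is not how scikit-learn implements it; it only shows which phrases become candidate features.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return every run of min_n..max_n consecutive tokens, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        # When n exceeds the message length this range is empty, so nothing is added.
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = "renewable energy creates new job".split()
unigrams_and_bigrams = word_ngrams(tokens, 1, 2)
print(unigrams_and_bigrams)
print(len(word_ngrams(tokens, 1, 3)), len(word_ngrams(tokens, 1, 20)))
```

A five-word message cannot produce n-grams longer than five words, so most of the (1,20) range goes unused on tweet-length text. With min_df = 2, a long n-gram is kept only if it recurs word for word in at least two messages.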

The code below splits the data into training and validation sets. In this code, we are doing an 80/20 split.

In [11]:
X_train, X_val, y_train, y_val = train_test_split(
    train_features, 
    train_df['sentiment'],
    test_size=0.2,
    shuffle=True,
    random_state=42
)

SMOTE is used in case there is an imbalance between opportunistic and risky sentiment. It balances the training set by generating synthetic examples of the minority class.

In [12]:
print("Applying SMOTE...")
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

print('Before SMOTE:', Counter(y_train))
print('After SMOTE:', Counter(y_train_sm))

Applying SMOTE...
Before SMOTE: Counter({1: 163, -1: 109})
After SMOTE: Counter({-1: 163, 1: 163})
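
The core idea behind SMOTE can be sketched in a few lines: a synthetic sample is a random point on the line segment between a minority-class sample and one of its minority-class neighbors. This is a simplified illustration with made-up 2-D points and a hypothetical helper, synthetic_sample; the real SMOTE in imblearn also handles the nearest-neighbor search and sparse matrices.

```python
import random

def synthetic_sample(point, neighbor, rng):
    """Return a random point on the line segment between point and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(42)
minority_point = [0.2, 0.8]      # made-up feature vector
minority_neighbor = [0.6, 0.4]   # made-up nearby vector from the same class
new_point = synthetic_sample(minority_point, minority_neighbor, rng)

# Number of synthetic samples needed to balance the counts printed above.
needed = 163 - 109
print(new_point, needed)
```

Going by the counts in the output above, 54 synthetic risky (-1) samples were added to match the 163 opportunistic ones.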


The models used are LogisticRegression, RandomForestClassifier, NaiveBayes, LinearSVM, and KNNClassifier. LogisticRegression estimates the probability that a message belongs to each category from a weighted combination of its features, RandomForest combines the votes of many decision trees, NaiveBayes estimates class probabilities from word frequencies while assuming the words are independent of each other, LinearSVM finds the boundary that best separates the categories, and KNN assigns the category most common among the nearest neighbors.

In [13]:
names = ['LogisticRegression', 'ForestClassifier', 'NaiveBayes', 'LinearSVM', 'KNNClassifier']
classifiers = [
    LogisticRegression(C=10),
    RandomForestClassifier(criterion='entropy'),
    MultinomialNB(alpha=1),
    LinearSVC(C=10, class_weight=None),
    KNeighborsClassifier(n_neighbors=10)
]

results = []
models = {}

for name, clf in zip(names, classifiers):
    print(f'Training {name}...')
    
    clf.fit(X_train_sm, y_train_sm)
    
    val_pred = clf.predict(X_val)
    test_pred = clf.predict(test_features)
    
    val_accuracy = accuracy_score(y_val, val_pred)
    val_f1 = f1_score(y_val, val_pred, average='macro')
    
    models[name] = clf
    results.append([name, val_accuracy, val_f1])
    
    test_df[f'{name}_predictions'] = test_pred
    
    print(f'{name} - Validation Accuracy: {val_accuracy:.4f}, F1: {val_f1:.4f}')

Training LogisticRegression...
LogisticRegression - Validation Accuracy: 0.9130, F1: 0.9126
Training ForestClassifier...
ForestClassifier - Validation Accuracy: 0.8696, F1: 0.8640
Training NaiveBayes...
NaiveBayes - Validation Accuracy: 0.9130, F1: 0.9126
Training LinearSVM...
LinearSVM - Validation Accuracy: 0.9420, F1: 0.9419
Training KNNClassifier...
KNNClassifier - Validation Accuracy: 0.4783, F1: 0.3463


In [14]:
results_df = pd.DataFrame(results, columns=['Classifier', 'Validation Accuracy', 'Validation F1'])
results_df.set_index('Classifier', inplace=True)

print("\nModel Performance on Validation Set:")
print(results_df.sort_values('Validation F1', ascending=False))


Model Performance on Validation Set:
                    Validation Accuracy  Validation F1
Classifier                                            
LinearSVM                      0.942029       0.941919
LogisticRegression             0.913043       0.912584
NaiveBayes                     0.913043       0.912584
ForestClassifier               0.869565       0.863965
KNNClassifier                  0.478261       0.346316


LinearSVM performed the best with 94.2% accuracy, followed by LogisticRegression and NaiveBayes at 91.3%, ForestClassifier at 87.0%, and the worst performer, KNNClassifier, at a low 47.8%. These scores correspond to 65, 63, 60, and 33 correct predictions out of 69 validation messages, so the gap between first and second place is only two messages.

In [None]:
output_file = "model_predictions.csv"
test_df.to_csv(output_file, index=False)
print(f"\nPredictions saved to {output_file}")


Predictions saved to model_predictions.csv


[codecarbon INFO @ 21:11:02] Energy consumed for RAM : 0.000050 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 21:11:02] Energy consumed for all CPUs : 0.000354 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 21:11:02] 0.000404 kWh of electricity used since the beginning.


In [None]:
if 'sentiment' in test_df.columns:
    print("\nModel Performance on Dataset:")
    for name in names:
        predictions = test_df[f'{name}_predictions']
        accuracy = accuracy_score(test_df['sentiment'], predictions)
        f1 = f1_score(test_df['sentiment'], predictions, average='macro')
        print(f"\n{name}:")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"F1 Score: {f1:.4f}")


Model Performance on Dataset:

LogisticRegression:
Accuracy: 0.8455
F1 Score: 0.8430

ForestClassifier:
Accuracy: 0.6701
F1 Score: 0.6697

NaiveBayes:
Accuracy: 0.8455
F1 Score: 0.8405

LinearSVM:
Accuracy: 0.8309
F1 Score: 0.8278

KNNClassifier:
Accuracy: 0.6221
F1 Score: 0.3937


[codecarbon INFO @ 21:11:17] Energy consumed for RAM : 0.000075 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 21:11:17] Energy consumed for all CPUs : 0.000531 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 21:11:17] 0.000606 kWh of electricity used since the beginning.
[codecarbon INFO @ 21:11:32] Energy consumed for RAM : 0.000100 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 21:11:32] Energy consumed for all CPUs : 0.000709 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 21:11:32] 0.000809 kWh of electricity used since the beginning.
[codecarbon INFO @ 21:11:47] Energy consumed for RAM : 0.000125 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 21:11:47] Energy consumed for all CPUs : 0.000886 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 21:11:47] 0.001011 kWh of electricity used since the beginning.


On the test dataset, LogisticRegression and NaiveBayes performed the best with 84.55% accuracy, followed by LinearSVM at 83.09%, ForestClassifier at 67.01%, and the worst performer, KNNClassifier, at 62.21%. KNNClassifier's macro F1 of 0.39 is far below its accuracy, which suggests it predicts one class for almost every message.

In [17]:
df = pd.read_csv("model_predictions.csv")

selected_columns = ['LogisticRegression_predictions', 'NaiveBayes_predictions', 'LinearSVM_predictions']
df['model_label'] = df[selected_columns].mode(axis=1).iloc[:,0]

for col in df.columns:
    if '_predictions' in col:
        df = df.drop(col, axis=1)

df.to_csv("../Data/processed-data/model_predictions.csv", index=False)

print("\nDistribution of final model labels:")
print(df['model_label'].value_counts())


Distribution of final model labels:
model_label
-1    244
 1    235
Name: count, dtype: int64
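
The mode(axis=1) call above is a row-wise majority vote across the three selected models. A minimal pure-Python sketch of the same idea, using made-up predictions and a hypothetical helper named majority_label:

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label among the votes."""
    return Counter(votes).most_common(1)[0][0]

# Each row holds hypothetical predictions from three models for one tweet.
rows = [(1, 1, -1), (-1, -1, -1), (-1, 1, -1)]
labels = [majority_label(row) for row in rows]
print(labels)
```

With three voters and two possible labels, a tie cannot occur, so the first column of the mode result is always the true majority.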


## My Additions 

In [18]:
train_df = pd.read_csv("../Data/raw-data/train.csv")
test_df = pd.read_csv("../Data/raw-data/test.csv")

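I made a few minor changes to the text preprocessing: the stopword list is now a set, non-ASCII characters (including emojis) are removed, the possessive 's is stripped, and n't is expanded to " not".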

In [19]:
def preprocess(text):
    tokenizer = TreebankWordTokenizer() 
    lemmatizer = WordNetLemmatizer()
    stopwords_list = set(stopwords.words('english'))  # Convert to set for faster lookup
    point_noise = string.punctuation + '0123456789'
    
    cleanText = re.sub(r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+', "", text)
    cleanText = re.sub(r'@[a-zA-Z0-9\_\w]+', '', cleanText)
    cleanText = re.sub(r'#[a-zA-Z0-9]+', '', cleanText)
    cleanText = re.sub(r'RT', '', cleanText)
    cleanText = cleanText.lower()
    
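    cleanText = re.sub(r'[^\x00-\x7F]+', '', cleanText)  # Remove non-ASCII characters (including emojis)
    cleanText = re.sub(r"'s\b", "", cleanText)  # Remove possessive 's
    cleanText = re.sub(r"n't\b", " not", cleanText)  # Expand negations ("isn't" -> "is not")
    # Note: this pattern is five character classes in a row, not the literal "http";
    # it only matches five consecutive letters drawn from h, t, p, and s.
    cleanText = re.sub(r'([https][http][htt][th][ht])', "", cleanText)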
    
    cleanText = ''.join([word for word in cleanText if word not in point_noise])
    cleanText = "".join(word for word in cleanText if ord(word)<128)
    cleanText = tokenizer.tokenize(cleanText)
    cleanText = [lemmatizer.lemmatize(word) for word in cleanText if word not in stopwords_list]
    cleanText = [word for word in cleanText if len(word) >= 2]
    cleanText = ' '.join(cleanText)

    return cleanText

In [None]:
print("Preprocessing training data...")
train_df['message'] = train_df['message'].apply(preprocess)
print("Preprocessing test data...")
test_df['message'] = test_df['message'].apply(preprocess)

Preprocessing training data...
Preprocessing test data...


[codecarbon INFO @ 21:12:02] Energy consumed for RAM : 0.000150 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 21:12:02] Energy consumed for all CPUs : 0.001063 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 21:12:02] 0.001213 kWh of electricity used since the beginning.


Below, I changed the ngram_range from (1,20) to (1,3). I did this for more efficient processing and to reduce noise and overfitting. (1,3) is more practical, especially because my dataset is a lot smaller. I also added max_df = 0.95, which removes terms that appear in more than 95% of the messages. The words "climate" and "change" will likely disappear if they cross that threshold. This is beneficial because everyone in the dataset is talking about climate change, so these terms do not help determine sentiment. I also added max_features = 5000, which keeps only the 5,000 most frequent terms. Even though the datasets are small, it is important to set a cap. Finally, I added token_pattern = r'\b\w+\b', which treats every run of word characters as a token, including single-character tokens that the default pattern skips.

In [21]:
# Improved TF-IDF vectorization
vector = TfidfVectorizer(
    ngram_range=(1,3),
    min_df=2,
    max_df=0.95,
    max_features=5000,
    token_pattern=r'\b\w+\b'
)
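
To make the min_df and max_df thresholds concrete, here is a small pure-Python sketch of document-frequency filtering on four made-up messages. The helper filter_by_document_frequency is hypothetical and only mimics the thresholding rule, not the rest of TfidfVectorizer.

```python
def filter_by_document_frequency(docs, min_df=2, max_df=0.95):
    """Keep terms found in at least min_df documents and in at most max_df of them."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(term for term, df in doc_freq.items()
                  if df >= min_df and df <= max_df * n_docs)

docs = [
    "climate change threatens farm",
    "climate change creates job",
    "climate policy creates growth",
    "climate risk threatens coast",
]
kept = filter_by_document_frequency(docs)
print(kept)
```

In this toy corpus "climate" appears in every message and is dropped, while "change" appears in only half of them and is kept. A word is removed by max_df only when it actually exceeds the 95% threshold.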

In [22]:
print("Extracting features...")
train_features = vector.fit_transform(train_df['message'])
test_features = vector.transform(test_df['message'])

X_train, X_val, y_train, y_val = train_test_split(
    train_features, 
    train_df['sentiment'],
    test_size=0.2,
    shuffle=True,
    random_state=42
)

print("Applying SMOTE...")
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

print('Before SMOTE:', Counter(y_train))
print('After SMOTE:', Counter(y_train_sm))

Extracting features...
Applying SMOTE...
Before SMOTE: Counter({1: 163, -1: 109})
After SMOTE: Counter({-1: 163, 1: 163})


For LogisticRegression and LinearSVC, I first changed C from 10 to 1 to prevent overfitting. That helped LinearSVC, but after experimenting with LogisticRegression I changed its C back to 10. I also added class_weight = 'balanced' so the models pay equal attention to opportunistic and risky tweets even though the dataset has more opportunistic tweets, and I increased the maximum number of training iterations (max_iter).

For the Random Forest model, I set the number of trees to 200 (n_estimators = 200, up from scikit-learn's default of 100) to make its predictions more stable, and added min_samples_split = 5 to prevent the model from overfitting.

For Naive Bayes, I changed alpha from 1 to 0.5. A smaller alpha applies less smoothing, so the model follows the observed word counts more closely.

For K-Nearest Neighbors, I reduced the number of neighbors from 10 to 5. This way, the model only looks at the most similar messages. I also switched to the cosine metric because it generally works well with text data, and set weights = 'distance' so that closer neighbors count more.

In [32]:
names = ['LogisticRegression', 'ForestClassifier', 'NaiveBayes', 'LinearSVM', 'KNNClassifier']
classifiers = [
    LogisticRegression(C=10, class_weight='balanced', max_iter=1000),
    RandomForestClassifier(
        n_estimators=200, 
        criterion='entropy',
        max_depth=None,
        min_samples_split=5,
        class_weight='balanced',
        random_state=42
    ),
    MultinomialNB(alpha=0.5),
    LinearSVC(C=1.0, class_weight='balanced', max_iter=2000),
    KNeighborsClassifier(
        n_neighbors=5,
        weights='distance',
        metric='cosine'
    )
]
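
For reference, scikit-learn documents the 'balanced' class weights as n_samples / (n_classes * class_count). The sketch below applies that formula to the class counts printed before and after SMOTE; balanced_class_weights is a hypothetical helper written for illustration.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class counts taken from the "Before SMOTE" and "After SMOTE" output.
before_smote = balanced_class_weights({1: 163, -1: 109})
after_smote = balanced_class_weights({-1: 163, 1: 163})
print(before_smote)
print(after_smote)
```

Because the classifiers are fit on the SMOTE-resampled data, where both classes have 163 samples, both weights come out to 1.0. Under that assumption, class_weight = 'balanced' probably adds little on top of SMOTE here.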

Below, I reorganized the training and evaluation code into a function called evaluate_model. My goal was to make the code cleaner and easier to reuse or modify later. I also added cross-validation, which gives a more reliable estimate of how well the model works than a single split. Cross-validation splits the training data into 5 parts and uses each part as the held-out set once. This tells us whether the model got lucky on one split or performs similarly across all of them.

In [33]:
# Training and evaluation function

from sklearn.model_selection import cross_val_score

def evaluate_model(name, clf, X_train, y_train, X_val, y_val, X_test, test_df):
    print(f'\nTraining {name}...')
    
    # Added Cross-validation
    cv_scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    
    clf.fit(X_train, y_train)
    
    val_pred = clf.predict(X_val)
    test_pred = clf.predict(X_test)
    
    val_accuracy = accuracy_score(y_val, val_pred)
    val_f1 = f1_score(y_val, val_pred, average='macro')
    
    test_df[f'{name}_predictions'] = test_pred
    
    print(f'Validation Accuracy: {val_accuracy:.4f}, F1: {val_f1:.4f}')
    
    return val_accuracy, val_f1, test_pred
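
The 5-fold splitting behind cv=5 can be sketched as follows. This is a simplified, unstratified version with a hypothetical helper, k_fold_indices; for classifiers, cross_val_score with an integer cv uses stratified folds, which additionally keep the class proportions similar in every fold.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 326 is the size of the SMOTE-resampled training set shown above.
fold_sizes = [len(test) for _, test in k_fold_indices(326, k=5)]
print(fold_sizes)
```

Each sample lands in exactly one held-out fold, so every sample is used for evaluation once and for training four times.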

In [34]:
# Train and evaluate all models
results = []
models = {}

for name, clf in zip(names, classifiers):
    val_accuracy, val_f1, test_pred = evaluate_model(
        name, clf, X_train_sm, y_train_sm, 
        X_val, y_val, test_features, test_df
    )
    models[name] = clf
    results.append([name, val_accuracy, val_f1])

results_df = pd.DataFrame(results, columns=['Classifier', 'Validation Accuracy', 'Validation F1'])
results_df.set_index('Classifier', inplace=True)

print("\nModel Performance on Validation Set:")
print(results_df.sort_values('Validation F1', ascending=False))


Training LogisticRegression...
CV Score: 0.9694 (+/- 0.0433)
Validation Accuracy: 0.9275, F1: 0.9273

Training ForestClassifier...
CV Score: 0.9203 (+/- 0.1054)
Validation Accuracy: 0.8696, F1: 0.8640

Training NaiveBayes...
CV Score: 0.9755 (+/- 0.0460)
Validation Accuracy: 0.9275, F1: 0.9270

Training LinearSVM...
CV Score: 0.9724 (+/- 0.0408)
Validation Accuracy: 0.9275, F1: 0.9273

Training KNNClassifier...
CV Score: 0.9632 (+/- 0.0499)
Validation Accuracy: 0.9130, F1: 0.9126

Model Performance on Validation Set:
                    Validation Accuracy  Validation F1
Classifier                                            
LogisticRegression             0.927536       0.927292
LinearSVM                      0.927536       0.927292
NaiveBayes                     0.927536       0.926984
KNNClassifier                  0.913043       0.912584
ForestClassifier               0.869565       0.863965


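LogisticRegression, NaiveBayes, and LinearSVM all perform consistently well, each reaching 92.8% validation accuracy. ForestClassifier is 92% accurate in cross-validation but drops to 87% on the validation set. KNN also does a good job at 91.3%. The cross-validation scores are higher than the validation scores for every model, possibly because cross-validation runs on the SMOTE-resampled data, so synthetic samples can end up in the held-out folds.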

Below, I decided to add in precision and recall scores, confusion matrices, and a classification report for each model. I also organized these results into a DataFrame to easily compare how different models perform across all metrics at once.

In [26]:
for name, clf in zip(names, classifiers):
    print(f'Training {name}...')
    
    clf.fit(X_train_sm, y_train_sm) 
    test_pred = clf.predict(test_features)
    test_df[f'{name}_predictions'] = test_pred

def calculate_test_metrics(y_true, y_pred):
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred, average='macro'),
        'precision': precision_score(y_true, y_pred, average='macro'),
        'recall': recall_score(y_true, y_pred, average='macro')
    }

print("\nModel Performance on Test Set:")
test_results = []

for name in names:
    if 'sentiment' in test_df.columns:
        predictions = test_df[f'{name}_predictions']
        metrics = calculate_test_metrics(test_df['sentiment'], predictions)
        
        test_results.append([
            name,
            metrics['accuracy'],
            metrics['f1'],
            metrics['precision'],
            metrics['recall']
        ])
        
        print(f"\n{name}:")
        print(f"Accuracy: {metrics['accuracy']:.4f}")
        print(f"F1 Score: {metrics['f1']:.4f}")
        print(f"Precision: {metrics['precision']:.4f}")
        print(f"Recall: {metrics['recall']:.4f}")
        
        print("\nConfusion Matrix:")
        print(confusion_matrix(test_df['sentiment'], predictions))
        
        print("\nDetailed Classification Report:")
        print(classification_report(test_df['sentiment'], predictions))

test_results_df = pd.DataFrame(
    test_results, 
    columns=['Classifier', 'Test Accuracy', 'Test F1', 'Test Precision', 'Test Recall']
)
test_results_df.set_index('Classifier', inplace=True)

print("\nOverall Test Set Performance Summary:")
print(test_results_df.sort_values('Test F1', ascending=False))

Training LogisticRegression...
Training ForestClassifier...
Training NaiveBayes...
Training LinearSVM...
Training KNNClassifier...

Model Performance on Test Set:

LogisticRegression:
Accuracy: 0.8434
F1 Score: 0.8411
Precision: 0.8432
Recall: 0.8642

Confusion Matrix:
[[231  66]
 [  9 173]]

Detailed Classification Report:
              precision    recall  f1-score   support

          -1       0.96      0.78      0.86       297
           1       0.72      0.95      0.82       182

    accuracy                           0.84       479
   macro avg       0.84      0.86      0.84       479
weighted avg       0.87      0.84      0.85       479


ForestClassifier:
Accuracy: 0.7077
F1 Score: 0.7076
Precision: 0.7612
Recall: 0.7558

Confusion Matrix:
[[165 132]
 [  8 174]]

Detailed Classification Report:
              precision    recall  f1-score   support

          -1       0.95      0.56      0.70       297
           1       0.57      0.96      0.71       182

    accuracy          

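Compared with the first run, KNN and ForestClassifier saw a significant increase in test accuracy (ForestClassifier went from 67.0% to 70.8%). NaiveBayes and LinearSVM went up slightly. LogisticRegression went down very slightly, from 84.55% to 84.34%.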
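
The LogisticRegression figures in the test-set output above can be re-derived by hand from its confusion matrix. The sketch below does that in pure Python, assuming the usual scikit-learn layout (rows are true labels, columns are predicted labels, in the order -1, 1); macro_metrics is a hypothetical helper.

```python
# Confusion matrix printed above for LogisticRegression.
# Rows are true labels (-1, 1); columns are predicted labels (-1, 1).
matrix = [[231, 66],
          [9, 173]]

def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))
    precisions, recalls, f1s = [], [], []
    for i in range(n_classes):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n_classes))  # column sum
        actual = sum(cm[i])                                   # row sum
        precision = tp / predicted
        recall = tp / actual
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(2 * precision * recall / (precision + recall))
    return (correct / total,
            sum(precisions) / n_classes,
            sum(recalls) / n_classes,
            sum(f1s) / n_classes)

accuracy, precision, recall, f1 = macro_metrics(matrix)
print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
```

The results match the reported accuracy (0.8434), precision (0.8432), recall (0.8642), and F1 (0.8411). The matrix also shows where the errors concentrate: 66 risky (-1) tweets were labeled opportunistic, against only 9 mistakes in the other direction.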

In [35]:
selected_columns = ['LogisticRegression_predictions', 'NaiveBayes_predictions', 'LinearSVM_predictions']

test_df['model_label'] = test_df[selected_columns].mode(axis=1).iloc[:,0]

for col in test_df.columns:
    if '_predictions' in col:
        test_df = test_df.drop(col, axis=1)

test_df.to_csv("../Data/processed-data/model_predictions.csv", index=False)

president_df = test_df[test_df['state'] == 'US']
senators_df = test_df[test_df['state'] != 'US']

president_df.to_csv("../Data/processed-data/President_sentiment.csv", index=False)
senators_df.to_csv("../Data/processed-data/Senators_sentiment.csv", index=False)

print("\nDistribution of final model labels:")
print(test_df['model_label'].value_counts())


Distribution of final model labels:
model_label
-1    244
 1    235
Name: count, dtype: int64


In [36]:
emissions = tracker.stop()
print(f"Total CO2 emissions: {emissions:.4f} kg")

[codecarbon INFO @ 21:25:32] Energy consumed for RAM : 0.001501 kWh. RAM Power : 6.0 W
[codecarbon INFO @ 21:25:32] Energy consumed for all CPUs : 0.010631 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 21:25:32] 0.012132 kWh of electricity used since the beginning.


Total CO2 emissions: 0.0045 kg
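
The energy figures in the codecarbon log can be checked with simple arithmetic, since the tracker reports constant power (42.5 W for CPU, 6.0 W for RAM) and logs every 15 seconds. The sketch below reproduces one logging step and backs out the carbon intensity implied by the final totals; energy_kwh is a hypothetical helper, and the intensity is only approximate because the inputs are rounded.

```python
def energy_kwh(power_watts, seconds):
    """Convert constant power over a duration into kilowatt-hours."""
    return power_watts * seconds / 3600 / 1000

# Power values and the 15-second interval are read from the codecarbon log lines.
cpu_step = energy_kwh(42.5, 15)
ram_step = energy_kwh(6.0, 15)
print(f"{cpu_step:.6f} {ram_step:.6f}")

# Carbon intensity implied by the final totals (rounded inputs, so approximate).
implied_intensity = 0.0045 / 0.012132  # kg CO2 per kWh
print(f"{implied_intensity:.2f}")
```

The per-step values match the first log entry (0.000177 kWh for CPU and 0.000025 kWh for RAM). Because the CPU is tracked in constant consumption mode, the reported energy grows with elapsed time rather than with the actual load of each cell.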
