Question (a)

In [25]:
import pandas as pd

train_data = pd.read_csv("data/train.csv")
test_data = pd.read_csv("data/test.csv")

# (1)
training_dp = len(train_data)
test_dp = len(test_data)

print(f"Number of training data points: {training_dp}")
print(f"Number of test data points: {test_dp}")

# (2)
disaster_percentage = train_data['target'].value_counts(normalize=True) * 100

print(f"\nPercentage of training tweets that are real disasters: {disaster_percentage[1]:.2f}%")
print(f"Percentage of training tweets that are not real disasters: {disaster_percentage[0]:.2f}%")




Number of training data points: 7613
Number of test data points: 3263

Percentage of training tweets that are real disasters: 42.97%
Percentage of training tweets that are not real disasters: 57.03%


According to the description on Kaggle, there are 7503 *unique values* in the Text column in the training dataset. There are 4342 with target value of 0 (not real disasters) and 3271 with target value of 1 (real disasters); there are 3243 *unique values* in the Text column in the test dataset.


Question (b) Split the training data.

In [26]:
from sklearn.model_selection import train_test_split


# 70% train, 30% dev
train_set, dev_set = train_test_split(train_data, test_size=0.3, random_state=42)

print("Training Set:", len(train_set))
print("Development Set:", len(dev_set))


Training Set: 5329
Development Set: 2284


Question (c) Preprocess the data.

In [27]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


def preprocess(text):
    text = text.lower()  # Convert to lowercase: case does not affect the meaning of the word; this helps reduce the size of the vocabulary and avoid unnecessary duplication

    # Remove @ mentions and URLs: so that we can reduce unnecessary noise in the data
    text = re.sub(r'@\w+|http\S+', '', text)

    # Remove punctuation, except for ! and ?: punctuation does not carry much meaning, but for classifying disaster, strong punctuation like ! and ? might be useful 
    text = re.sub(r'[^\w\s!?]', '', text)

    # Lemmatize: reduces the words to their base form; root carries meaning of the words
    lemmatized_words = [lemmatizer.lemmatize(word) for word in text.split()]

    # Remove stop words: they usually do not carry much meaning
    filtered_text = ' '.join([word for word in lemmatized_words if word not in stop_words])
    
    # TODO: Remove numbers: numerical lexical chunks do not carry much meaning

    return filtered_text


train_set['text'] = train_set['text'].apply(preprocess)
dev_set['text'] = dev_set['text'].apply(preprocess)

print("\nSample Preprocessed Training Data:")
print(train_set['text'].head())


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yuewenyyy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/yuewenyyy/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yuewenyyy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



Sample Preprocessed Training Data:
1186    ash 2015 australiaûªs collapse trent bridge am...
4071    great michigan technique camp b1g thanks goblu...
5461    cnn tennessee movie theater shooting suspect k...
5787                 still rioting couple hour left class
7445    crack path wiped morning beach run surface wou...
Name: text, dtype: object


(d) Bag of words model.

In [57]:

# using the CountVectorizer class in sklearn.
from sklearn.feature_extraction.text import CountVectorizer


# let's say the words we are interested in are the ones that appear in at least 5 tweets
#TODO: come back to see if tuning M will change the results
M = 5 


vectorizer = CountVectorizer(binary=True, min_df=M) #binary = True, min_df=M

# fit CountVectorizer only once on the training set
vectorizer.fit(train_set['text'])

# use the same instance to process training and development sets
X_train = vectorizer.transform(train_set['text'])
X_dev = vectorizer.transform(dev_set['text'])


num_features = len(vectorizer.get_feature_names_out())


print(f"Total number of features in the Bag of Words model: {num_features}")


print("\nSample feature names:", vectorizer.get_feature_names_out()[20:30])


print("\nSample feature vectors (binary Bag of Words):")
print(X_train[:3].toarray())


Total number of features in the Bag of Words model: 509

Sample feature names: ['apocalypse' 'area' 'armageddon' 'army' 'around' 'arson' 'atomic'
 'attack' 'attacked' 'away']

Sample feature vectors (binary Bag of Words):
[[0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


How did I decide on M?

M = 1: 11785
M = 2: 4598
M = 3: 3076
M = 4: 2359
M = 5: 1925
M = 6: 1630
M = 7: 1442
M = 8: 1281

The decrement in number slowed down a lot when M is above 5. Tha probably indicates that when M=5 we discarded words that only appear in few number of tweets, which might cause extra noises, and we still retain the data that are informative. 

(e) Logistic regression. 

(i) Logistic regression model without regularization terms.

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import numpy as np

# Fit a Logistic Regression model *without* regularization 
no_reg = LogisticRegression(penalty=None, solver='lbfgs', max_iter=1000)

no_reg.fit(X_train, train_set['target'])

# on train set
train_preds_no_reg = no_reg.predict(X_train)
f1_train_no_reg = f1_score(train_set['target'], train_preds_no_reg)

# on dev set
dev_preds_no_reg = no_reg.predict(X_dev)
f1_dev_no_reg = f1_score(dev_set['target'], dev_preds_no_reg)


print(f"F1 Score for Logistic Regression without regularization (Training Set): {f1_train_no_reg:.4f}")
print(f"F1 Score for Logistic Regression without regularization (Development Set): {f1_dev_no_reg:.4f}")


F1 Score for Logistic Regression without regularization (Training Set): 0.9558
F1 Score for Logistic Regression without regularization (Development Set): 0.6836


There's a high discrepancy between the training and development set performance, which is an indicator of overfitting. 

ii. Train a logistic regression model with L1 regularization.

In [30]:
# L1 regularization (penalty='l1')
l1_reg = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000)

# fit on the training data
l1_reg.fit(X_train, train_set['target'])

# predicts
train_preds_l1 = l1_reg.predict(X_train)
dev_preds_l1 = l1_reg.predict(X_dev)

# eval
f1_train_l1 = f1_score(train_set['target'], train_preds_l1)
f1_dev_l1 = f1_score(dev_set['target'], dev_preds_l1)

print(f"F1 Score for Logistic Regression with L1 regularization (Training Set): {f1_train_l1:.4f}")
print(f"F1 Score for Logistic Regression with L1 regularization (Development Set): {f1_dev_l1:.4f}")


F1 Score for Logistic Regression with L1 regularization (Training Set): 0.8387
F1 Score for Logistic Regression with L1 regularization (Development Set): 0.7389


In [31]:
# # Also tried to adapt the code on sklearn's documentation
# import numpy as np
# from sklearn import linear_model
# from sklearn.svm import l1_min_c
# from sklearn.preprocessing import StandardScaler
# import matplotlib.pyplot as plt

# scaler = StandardScaler(with_mean=False)
# X_train_scaled = scaler.fit_transform(X_train)


# y_train = train_set['target']


# cs = l1_min_c(X_train_scaled, y_train, loss="log") * np.logspace(0, 10, 16)

# clf = linear_model.LogisticRegression(
#     penalty="l1",
#     solver="liblinear",
#     tol=1e-6,
#     max_iter=1000,
#     warm_start=True,
#     intercept_scaling=10000.0,
# )

# coefs_ = []

# for c in cs:
#     clf.set_params(C=c)
#     clf.fit(X_train_scaled, y_train)
#     coefs_.append(clf.coef_.ravel().copy())

# coefs_ = np.array(coefs_)

# plt.figure(figsize=(10, 6))
# plt.plot(np.log10(cs), coefs_, marker="o")
# plt.xlabel("log(C)")
# plt.ylabel("Coefficients")
# plt.title("L1 Regularization Path for Logistic Regression")
# plt.axis("tight")
# plt.show()


iii. Train a logistic regression model with L2 regularization. 

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Fit a Logistic Regression model with L2 regularization
l2_reg = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)


l2_reg.fit(X_train, train_set['target'])

# train set
train_preds_l2 = l2_reg.predict(X_train)
f1_train_l2 = f1_score(train_set['target'], train_preds_l2)

# dev set
dev_preds_l2 = l2_reg.predict(X_dev)
f1_dev_l2 = f1_score(dev_set['target'], dev_preds_l2)

print(f"F1 Score for Logistic Regression with L2 regularization (Training Set): {f1_train_l2:.4f}")
print(f"F1 Score for Logistic Regression with L2 regularization (Development Set): {f1_dev_l2:.4f}")


F1 Score for Logistic Regression with L2 regularization (Training Set): 0.8622
F1 Score for Logistic Regression with L2 regularization (Development Set): 0.7464


iv. Model comparison

Overfitting was observed in the model without regularization, as shown by the large gap between F1 scores on the training and development set (0.9558 vs. 0.6836). Compared to the model without regularization, in both L1 and L2 the F1 score on the training set decreased, while the development set score improved. This suggests regularization helped reduce overfitting. The L2-regularized model performed the best. It has the best F1 scores, especially on the development set, with an F1 score of 0.7464, which is the highest among the three. This indicates that it reduces overfitting problem and it has the best generalization among the three classifiers. 

v. Inspect the weight vector of the classifier with L1 regularization 

In [33]:
# access the weight vector 
weights_l1 = l1_reg.coef_[0]

In [34]:
# investigate the important words

words = vectorizer.get_feature_names_out()

# sort by weight
feature_weights = list(zip(words, weights_l1))

important_words = sorted(feature_weights, key=lambda x: abs(x[1]), reverse=True)

print("\nThe top important words for deciding if a tweet is about a real disaster include the following ones:")
for word in important_words[:15]:
    print(f"{word}")


The top important words for deciding if a tweet is about a real disaster include the following ones:
('mh370', np.float64(3.4391886970023005))
('spill', np.float64(3.341914631185037))
('derailment', np.float64(3.321095055419975))
('airport', np.float64(3.279606218815375))
('hiroshima', np.float64(3.277345781406181))
('migrant', np.float64(3.1763880556007984))
('typhoon', np.float64(3.159369334750422))
('wildfire', np.float64(3.0648154558641068))
('earthquake', np.float64(2.8549198981316484))
('crew', np.float64(2.6314179188404925))
('debris', np.float64(2.595266384043409))
('outbreak', np.float64(2.5472103687207555))
('drought', np.float64(2.5330847828009952))
('evacuated', np.float64(2.514332995998605))
('sinkhole', np.float64(2.510076190145047))


(f) Bernoulli Naive Bayes.

In [35]:
import numpy as np
from sklearn.metrics import f1_score

y_train = train_set['target'].values
y_dev = dev_set['target'].values


In [36]:


X_train_processed = X_train.toarray()
y_train = train_set['target'].values
X_dev_processed = X_dev.toarray()
y_dev = dev_set['target'].values


n = X_train_processed.shape[0] 
d = X_train_processed.shape[1] 
K = 2  #binary classes

# laplace smoothing parameter
alpha = 1

In [103]:
# [ADAPTED FROM THE CLASS EXAMPLE]

# shapes of the parameters
psis = np.zeros([K, d])  
phis = np.zeros([K])     

# compute parameters 
for k in range(K):
    X_k = X_train_processed[y_train == k]
    psis[k] = (np.sum(X_k, axis=0) + alpha) / (X_k.shape[0] + 2 * alpha)
    phis[k] = X_k.shape[0] / float(n)

print("Prior probabilities:")
print(phis)

ValueError: could not broadcast input array from shape (1925,) into shape (1255,)

In [38]:
# compute predictions using Bayes’ rule. [ADAPTED FROM THE CLASS EXAMPLE]

def nb_predictions(x, psis, phis):
    # adjust shapes
    n, d = x.shape
    x = np.reshape(x, (1, n, d))
    psis = np.reshape(psis, (K, 1, d))
    
    # clip probabilities to avoid log(0)
    psis = psis.clip(1e-14, 1 - 1e-14)
    
    # compute log-probabilities
    logpy = np.log(phis).reshape([K, 1])
    logpxy = x * np.log(psis) + (1 - x) * np.log(1 - psis)
    logpyx = logpxy.sum(axis=2) + logpy
    
    return logpyx.argmax(axis=0).flatten(), logpyx.reshape([K, n])

# training set
train_preds, _ = nb_predictions(X_train_processed, psis, phis)
train_accuracy = (train_preds == y_train).mean()
print(f"Training accuracy: {train_accuracy:.4f}")

# dev set
dev_preds, _ = nb_predictions(X_dev_processed, psis, phis)

# Calculate F1 Score for development set
from sklearn.metrics import f1_score

f1_dev = f1_score(y_dev, dev_preds)
print(f"F1 Score for Bernoulli Naive Bayes on the development set is: {f1_dev:.4f}")

Training accuracy: 0.8411
F1 Score for Bernoulli Naive Bayes on the development set is: 0.7396


(g) Model comparison

Logistic Regression with L2 Regularization has the highest F1 score on dev set which is 0.746. That is slightly better than the F1 score the Bernoulli Naive Bayes got, which is 0.7396. 

Generative models like Naive Bayes is less prone to overfitting (didn't need techniques like regularization). However, we know that there's a strong assumption on the conditional independence, which is often not true in real-world cases for text classfication. This might influence the performance, especially when there are comlex relationships between features (language task could be one example). 

Discriminative Model like Logistic Regression doesn't assume the conditional independence between features, which may allow it to capture more nuances. However, since it's more prone to overfitting, and hence it is necessary sometimes to use regularization such as L1 and L2 - this may increase the computational complexity a lot. Usually it also requires a larger dataset to be effective since it cannot leverage the priors.


Naive Bayes assumes that words are uncorrelated and then uses Baye's theorum for determining the posterior probability, whereas Logistic Regression without necessarily assuming the distribution of the features.

Words are correlated in reality, but Naive Bayes still gives reasonable accuracy in terms of this binary classification task. It could miss some nuanced relationships between words, which might make it not the most optimal model for this type of task. 

(h) N-gram model.

In [100]:
from sklearn.feature_extraction.text import CountVectorizer

# construct feature representations 

M = 2 #has to appear in at least M tweets
# M = 5
# 2 gram, set ngram_range=(2,2) 
vectorizer_2gram = CountVectorizer(ngram_range=(2, 2), binary=True, min_df=M)


X_train_2gram = vectorizer_2gram.fit_transform(train_set['text'])


X_dev_2gram = vectorizer_2gram.transform(dev_set['text'])


num_2grams = len(vectorizer_2gram.get_feature_names_out())

# the total number of 2-grams
print(f"Total number of 2-grams in the vocabulary (M={M}): {num_2grams}")


sample_2grams = vectorizer_2gram.get_feature_names_out()[:10]
print("\nSample 10 2-grams from the vocabulary:")
for ngram in sample_2grams:
    print(ngram)

Total number of 2-grams in the vocabulary (M=2): 3542

Sample 10 2-grams from the vocabulary:
0104 utc
010401 utc20150805
10 year
10 yr
101 cook
1030 pm
10401 utc
109 sn
10km maximum
10th death


To decide on M, similar to Bag of Words model, when M is around 4 and 5, the decrement of number valid features (i.e. appeared in M tweets) decreased and stabelized. To keep the features still informative and not noisy, I chose M = 4.

(M=1): 33231

(M=2): 3542

(M=3): 1542

(M=4): 954

(M=5): 654

(M=6): 483

In [101]:
# Logistic Regression with L2 regularization

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

l2_reg_2gram = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)

# Fit the model on the 2-gram training data
l2_reg_2gram.fit(X_train_2gram, train_set['target'])

# Make predictions on the training and development sets
train_preds_2gram_l2 = l2_reg_2gram.predict(X_train_2gram)
dev_preds_2gram_l2 = l2_reg_2gram.predict(X_dev_2gram)

# Calculate F1 scores for training and development sets
f1_train_2gram_l2 = f1_score(train_set['target'], train_preds_2gram_l2)
f1_dev_2gram_l2 = f1_score(dev_set['target'], dev_preds_2gram_l2)

print(f"F1 Score for Logistic Regression with L2 regularization (Training Set): {f1_train_2gram_l2:.4f}")
print(f"F1 Score for Logistic Regression with L2 regularization (Dev Set): {f1_dev_2gram_l2:.4f}")

F1 Score for Logistic Regression with L2 regularization (Training Set): 0.7250
F1 Score for Logistic Regression with L2 regularization (Dev Set): 0.5645



These results dropped significantly compared with bag of words.

F1 Score for Logistic Regression with L2 regularization (Training Set): 0.5556

F1 Score for Logistic Regression with L2 regularization (Dev Set): 0.4776

Above are the data when M=5. When I changed M to 2, the performance went up to 0.7250 and 0.5645 (dev set).

The dimentionality of the features increased a lot and also the sparsity of the model. Since the tweets are relatively short chunks, not like normal writing in long paragraphs, 2-grams are not observed as often in other tweets.

Next I am gonna try Bernoulli classifier.

In [None]:
import numpy as np


X_train_2gram_processed = X_train_2gram.toarray()
X_train_2gram_processed = X_dev_2gram.toarray()
y_train = train_set['target'].values
y_dev = dev_set['target'].values

# checking dimensions (TDOO: this throws an error previously)
print("Shape of X_train_2gram_dense:", X_train_2gram_processed.shape)
print("Length of y_train:", len(y_train))

# ADAPTED FROM THE CLASS EXAMPLE
def nb_predictions(x, psis, phis):

    n, d = x.shape  
    psis = psis.clip(1e-14, 1 - 1e-14)  # TODO:solve numerical issues log(0)
    
    logpy = np.log(phis).reshape(-1, 1) 
    logpxy = x @ np.log(psis.T) + (1 - x) @ np.log(1 - psis.T)  
    logpyx = logpy.T + logpxy  


    return logpyx.argmax(axis=1), logpyx


X_train_2gram_processed = X_dev_2gram.toarray()
dev_preds_2gram_nb, _ = nb_predictions(X_train_2gram_processed, psis, phis)


from sklearn.metrics import f1_score
f1_dev_2gram_nb = f1_score(y_dev, dev_preds_2gram_nb)
print(f"F1 score of Bernoulli Naive Bayes on 2-gram (dev set):", f1_dev_2gram_nb)



The F1 score of Bernoulli Naive Bayes on 2-gram (dev set): 0.4884437596302003, which is very low compared with previous methods. I tried mixed N-grams, (mixing uni gram and 2 gram) and increased M to 10, but the F1 score was only improved to 0.6950617283950618.

(i) Determine performance with the *test set*. I choose Bag of Words with L2 regularization of Logistic Regression.

In [114]:
from sklearn.feature_extraction.text import CountVectorizer


train_text = train_data['text'].tolist()
train_labels = train_data['target'].values
vectorizer = CountVectorizer(binary=True, min_df=5)

X_new = vectorizer.fit_transform(train_text)

# check
print(X_new.shape)

(7613, 2795)


In [116]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

model_l2_reg = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)

model_l2_reg.fit(X_new, train_labels)

train_preds_combined = model_l2_reg.predict(X_new)
f1_train_new= f1_score(train_labels, train_preds_combined)
print(f"F1 Score for Logistic Regression with L2 regularization (new): {f1_train_new:.4f}")

F1 Score for Logistic Regression with L2 regularization (new): 0.8758


In [117]:
X_test = vectorizer.transform(test_data['text'])

test_preds = model_l2_reg.predict(X_test)

submission = pd.DataFrame({
    'id': test_data['id'],
    'target': test_preds
})

submission.to_csv('submission.csv', index=False)

Use of GenAI: debugging, search for methods, and conservative use of autocompletions.