PA3
Author: Zhuangzhuang Gong, Richard Hua

## 1. Reflection on the annotation task

**Challenges**  
One key challenge was understanding tweets that rely on cultural references or on context outside the tweet itself.  
In these cases we had to make judgement calls,
which made us more aware of the subjectivity involved in annotation. Another challenge was dealing
with mixed-sentiment tweets: some tweets express both positive and negative emotions, which makes it hard to assign a single definitive label.

**Ideas for improvement**  
To improve the annotation process, the guidelines could be refined and expanded with clearer instructions, especially for
mixed-sentiment tweets. This would reduce subjectivity in the annotation.
Another improvement would be to measure inter-annotator agreement by having multiple annotators label the same tweets and comparing their results. This would not only improve the quality of the annotation but also give insight into how subjective the sentiment labels are.
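
As a rough illustration of the multi-annotator idea, the sketch below aggregates several labels per tweet by majority vote and reports how strongly the annotators agreed on each item. The function name `aggregate_annotations` and the toy labels are hypothetical and not part of our assignment data.

```python
from collections import Counter

def aggregate_annotations(annotations):
    """Return (majority_label, agreement) per tweet.

    `annotations` holds one list of labels per tweet, one label per annotator.
    The majority label is None when no label has a strict majority.
    """
    results = []
    for labels in annotations:
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        majority = top_label if top_count > len(labels) / 2 else None
        results.append((majority, agreement))
    return results

# Hypothetical labels from three annotators for three tweets
toy_annotations = [
    ["positive", "positive", "neutral"],
    ["negative", "neutral", "positive"],
    ["neutral", "neutral", "neutral"],
]
aggregated = aggregate_annotations(toy_annotations)
for majority, agreement in aggregated:
    print(majority, f"{agreement:.2f}")
```

Tweets with no majority label (the second one here) are natural candidates for a discussion round or for a clearer guideline on mixed sentiment.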

## 2. Exploring the crowdsourced data


**Read and observe the data**  
As we can see below, the sentiment annotations are not in the expected form. There are many inconsistent label formats, such as 'neutral' versus 'Neutral', and typos such as 'Nutral'. Therefore, we need to clean the data.

In [19]:
import pandas as pd

crowd_train_data = pd.read_csv('crowdsourced_train.csv',sep='\t')
gold_train_data = pd.read_csv('gold_train.csv', sep='\t')
test_data = pd.read_csv('test.csv', sep='\t')
label_counts = crowd_train_data['sentiment'].value_counts()
print(crowd_train_data)

      sentiment                                               text
0      Positive  There's so much misconception on Islam rn so s...
1      Positive  @Mr_Rondeau You should try Iron Maiden at abou...
2      Negative  Going to #FantasticFour tomorrow. Half expecti...
3       Neutral  @cfelan hey hey, just checkng to see if you or...
4      Positive  does anyone just get drunk and watch twilight ...
...         ...                                                ...
10671  Positive  Glad to hear there may be a bigger more public...
10672  Positive  Great stand by the Wolves on 3rd and long. Cur...
10673   Neutral  Ayyye I just purchased my Ed Sheeran tickets f...
10674  Negative  The anti-semitism, the misogyny, and the suppo...
10675  Positive  And yet, I have yet to see the whole series of...

[10676 rows x 2 columns]


**Check the agreement before cleaning the data**  
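Before cleaning, we checked the inter-annotator agreement by comparing the crowdsourced sentiment labels with the gold labels across all 10,676 tweets. The overall accuracy is only 35.63% and Cohen's Kappa is only 0.19, which indicates a low level of agreement.
At this stage, much of the disagreement likely reflects inconsistent label formatting rather than differing judgements, because both metrics compare the label strings exactly (so 'Neutral' and 'neutral' count as different labels). The remainder may stem from subjective interpretation of the sentiment labels.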

In [20]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import cohen_kappa_score

accuracy = accuracy_score(crowd_train_data['sentiment'], gold_train_data['sentiment'])
kappa = cohen_kappa_score(crowd_train_data['sentiment'], gold_train_data['sentiment'])
print(f"Cohen's Kappa: {kappa:.2f}")
print(f"Accuracy: {accuracy:.2%}")

Cohen's Kappa: 0.19
Accuracy: 35.63%
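
To make clear what Cohen's Kappa measures, here is a minimal from-scratch sketch: observed agreement corrected by the agreement expected by chance from each annotator's label frequencies. The function `cohen_kappa` and the toy labels are our own illustration; in the notebook we rely on `sklearn.metrics.cohen_kappa_score`.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return float("nan")  # undefined when both use a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical toy labels
annotator_a = ["pos", "pos", "neu", "neg", "neu", "neu", "neg", "pos"]
annotator_b = ["pos", "neu", "neu", "neg", "neu", "pos", "neg", "pos"]
kappa_toy = cohen_kappa(annotator_a, annotator_b)
print(f"{kappa_toy:.3f}")  # 0.619

# Exact string comparison: a capitalised label never matches a lowercase one
kappa_case = cohen_kappa(["Neutral", "positive"], ["neutral", "positive"])
print(f"{kappa_case:.3f}")
```

The second call shows why label formatting matters: the two annotators mean the same thing for the first item, but the strings differ, so it counts as a disagreement.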


**Clean and re-classify labels**  
We fixed the inconsistent capitalisation and surrounding whitespace by converting all labels to lowercase and stripping leading and trailing spaces.  
Furthermore, we found several spelling variations and formatting issues in the label column (e.g., 'nuetral', 'positve', 'nedative', etc.). These were normalised using a mapping dictionary, and all labels were successfully consolidated into the three standard sentiment categories: positive, neutral, and negative.

In [21]:
# strip surrounding whitespace and convert labels to lowercase
crowd_train_data['sentiment'] = crowd_train_data['sentiment'].str.strip().str.lower()
label_counts = crowd_train_data['sentiment'].value_counts()
print("before cleaning")
print(label_counts)

before cleaning
sentiment
neutral           5046
positive          3213
negative          2375
netural             20
nuetral              3
postive              2
postitive            1
neutal               1
npositive            1
neutra l             1
positie              1
negayive             1
nutral               1
neugral              1
negtaive             1
neutrla              1
neutrall             1
neural               1
netutral             1
_x0008_neutral       1
nedative             1
neutral?             1
positve              1
Name: count, dtype: int64


In [22]:
# create a dictionary
label_map = {
    # positive 
    'positve': 'positive',
    'postive': 'positive',
    'postitive': 'positive',
    'positie': 'positive',
    'npositive': 'positive',

    # neutral 
    'neutral?': 'neutral',
    'nuetral': 'neutral',
    '_x0008_neutral': 'neutral',
    'netural': 'neutral',
    'netutral': 'neutral',
    'neural': 'neutral',
    'neutrall': 'neutral',
    'neugral': 'neutral',
    'neutrla': 'neutral',
    'nutral': 'neutral',
    'neutra l': 'neutral',
    'neutal': 'neutral',

    # negative 
    'nedative': 'negative',
    'negtaive': 'negative',
    'negayive': 'negative',
}

label_map.update({
    'positive': 'positive',
    'neutral': 'neutral',
    'negative': 'negative'
})

# apply the dictionary
crowd_train_data['sentiment'] = crowd_train_data['sentiment'].map(label_map)

# check the result
print("After cleaning")
print(crowd_train_data['sentiment'].value_counts())


After cleaning
sentiment
neutral     5079
positive    3219
negative    2378
Name: count, dtype: int64
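
A hand-written dictionary works well for a couple of dozen typos. As an alternative sketch (not what we ran), the standard-library `difflib.get_close_matches` can map misspelled labels to the nearest valid category. The helper name `normalise_label` and the `cutoff` of 0.75 are assumptions chosen for illustration; the results should still be inspected manually.

```python
import re
from difflib import get_close_matches

VALID_LABELS = ["positive", "neutral", "negative"]

def normalise_label(raw, cutoff=0.75):
    """Map a noisy label to the closest valid label, or None if nothing is close."""
    # Lowercase and keep only letters (drops spaces, '?', digits, underscores)
    cleaned = re.sub(r"[^a-z]", "", raw.lower())
    matches = get_close_matches(cleaned, VALID_LABELS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for noisy in ["postive", "nedative", "Neutral?", "neutra l", "xyz"]:
    print(repr(noisy), "->", normalise_label(noisy))
```

Labels that come back as `None` would need manual review instead of being silently assigned to a category.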


**Agreement checking after cleaning**  
After cleaning the labels, we again evaluated the agreement between the crowdsourced labels and the gold-standard labels. The overall accuracy was 65.49% and Cohen's Kappa was 0.45, indicating 'average' agreement according to the standard in Richard Johansson's slides.  
This suggests reasonable alignment, although some inconsistency remains, likely due to subjective interpretation of the sentiment of tweets.

In [23]:
accuracy = accuracy_score(crowd_train_data['sentiment'], gold_train_data['sentiment'])
kappa = cohen_kappa_score(crowd_train_data['sentiment'], gold_train_data['sentiment'])
print(f"Cohen's Kappa: {kappa:.2f}")
print(f"Accuracy: {accuracy:.2%}")

Cohen's Kappa: 0.45
Accuracy: 65.49%


**Analysis of the annotation distribution**  
We compared the label distributions of the crowdsourced and gold data. Both show the same overall structure: 'neutral' is the most frequent label, followed by 'positive' and then 'negative'. However, the share of 'negative' labels differs noticeably (22.27% of the crowd labels versus 15.55% of the gold labels), which indicates that the two groups interpret negative sentiment differently. The crowd annotators appear more inclined to label a tweet as negative, which may contribute to the Cohen's Kappa of 0.45 and the overall accuracy of 65.49%.

In [24]:
# Get label counts from each dataset
crowd_counts = crowd_train_data['sentiment'].value_counts()
gold_counts = gold_train_data['sentiment'].value_counts()

# Access values 
crowd_counts_dict = {
    'neutral': crowd_counts.get('neutral', 0),
    'positive': crowd_counts.get('positive', 0),
    'negative': crowd_counts.get('negative', 0)
}

gold_counts_dict = {
    'neutral': gold_counts.get('neutral', 0),
    'positive': gold_counts.get('positive', 0),
    'negative': gold_counts.get('negative', 0)
}

# Create Series
crowd_series = pd.Series(crowd_counts_dict, name='Crowd')
gold_series = pd.Series(gold_counts_dict, name='Gold')

# Combine and analyze
comparison_df = pd.concat([crowd_series, gold_series], axis=1)

# Add percentage comparison
total = comparison_df.sum()
comparison_df['Crowd (%)'] = (comparison_df['Crowd'] / total['Crowd'] * 100).round(2)
comparison_df['Gold (%)'] = (comparison_df['Gold'] / total['Gold'] * 100).round(2)

print(comparison_df)

          Crowd  Gold  Crowd (%)  Gold (%)
neutral    5079  5364      47.57     50.24
positive   3219  3652      30.15     34.21
negative   2378  1660      22.27     15.55
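
The marginal distributions above show that the two groups differ, but not on which tweets. A cross-tabulation of crowd label against gold label would show where the disagreements fall. The sketch below uses a small hypothetical sample rather than our real data.

```python
import pandas as pd

# Hypothetical labels for eight tweets (not taken from the assignment data)
crowd = pd.Series(
    ["negative", "neutral", "positive", "negative",
     "neutral", "positive", "negative", "neutral"],
    name="crowd",
)
gold = pd.Series(
    ["neutral", "neutral", "positive", "negative",
     "neutral", "neutral", "neutral", "positive"],
    name="gold",
)

# Rows: crowd label, columns: gold label, cells: number of tweets
confusion = pd.crosstab(crowd, gold)
print(confusion)
```

On the real data the same call would presumably be `pd.crosstab(crowd_train_data['sentiment'], gold_train_data['sentiment'])`, assuming both frames list the tweets in the same order. A large count in the crowd 'negative' and gold 'neutral' cell would support the interpretation given above.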


## 3. Implementation of a classifier

Clean tweet

In [25]:
import re
import emoji

def clean_tweet(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Replace URLs
    text = re.sub(r'http\S+|www\.\S+', ' ', text)
    # 3. Replace user mentions
    text = re.sub(r'@\w+', ' ', text)
    # 4. Remove hashtags but keep the tag text
    text = re.sub(r'#', '', text)
    # 5. Remove RT (retweet marker)
    text = re.sub(r'\brt\b', ' ', text)
    # 6. Translate emojis to their text names (":red_heart:" etc.);
    #    step 7 then turns the colons and underscores into spaces
    text = emoji.demojize(text)
    # 7. Remove non-alphanumeric (keep spaces)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # 8. Collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

Pipeline: clean_tweet + TfidfVectorizer + LinearSVC

In [26]:
# the actual classification algorithm
from sklearn.svm import LinearSVC
# for converting training and test datasets into matrices
# TfidfVectorizer does this specifically for documents
from sklearn.feature_extraction.text import TfidfVectorizer
# for splitting the dataset into training and test sets 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
X = crowd_train_data['text']
Y = crowd_train_data['sentiment']

X_train, X_eval, Y_train, Y_eval = train_test_split(X, Y, test_size=0.2, random_state=0)
def train_document_classifier(X, Y):
    pipeline = make_pipeline( TfidfVectorizer(preprocessor=clean_tweet,), LinearSVC(dual='auto') )
    pipeline.fit(X, Y)
    return pipeline
clf = train_document_classifier(X_train, Y_train)
acc = accuracy_score(Y_eval, clf.predict(X_eval))
print("acc:",acc)

acc: 0.548689138576779


To maximise the model's performance on unseen data, we used hyperparameter search and cross-validation to select the parameter settings.
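
The search itself is not reproduced in the cells of this notebook. As a minimal sketch of what such a search can look like, the example below runs scikit-learn's `GridSearchCV` over a TF-IDF + LinearSVC pipeline. The toy corpus and the parameter grid are hypothetical and only meant to show the mechanics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: four documents per class
texts = [
    "i love this great movie", "what a wonderful happy day",
    "great game love the team", "happy with this wonderful show",
    "i hate this awful movie", "what a terrible sad day",
    "awful game hate the team", "sad about this terrible show",
    "the movie starts at nine", "the game is on tomorrow",
    "tickets for the show are on sale", "the team plays on friday",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())

# make_pipeline names each step after its lowercased class name
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "linearsvc__C": [0.1, 1.0, 10.0],
}

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.2f}")
```

In the notebook, `texts` and `labels` would be replaced by the training split, and `TfidfVectorizer()` by the vectorizer configured with `preprocessor=clean_tweet`.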

We also tried logistic regression, here trained and evaluated on the gold labels.

In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
X = gold_train_data['text']
Y = gold_train_data['sentiment']

X_train, X_eval, Y_train, Y_eval = train_test_split(X, Y, test_size=0.2, random_state=0)
def train_document_classifier(X, Y):
    pipeline = make_pipeline(TfidfVectorizer(preprocessor=clean_tweet,), LogisticRegression(
    penalty='l2',            # L2 regularization
    C=1.0,                   # Regularization strength (default)
    solver='liblinear',      # Solver suitable for smaller datasets
    max_iter=100,            # Iterations for convergence
    class_weight='balanced', # Handle imbalanced classes
    fit_intercept=True,      # Include intercept
))
    pipeline.fit(X, Y)
    return pipeline
clf = train_document_classifier(X_train, Y_train)
acc = accuracy_score(Y_eval, clf.predict(X_eval))
print("acc:",acc)

acc: 0.6441947565543071


In [44]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report

# Models to compare
models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=200, class_weight='balanced', solver='liblinear'),
    "LinearSVC": LinearSVC(class_weight='balanced', max_iter=1000),
    "RidgeClassifier": RidgeClassifier(),
}

# TF-IDF settings
tfidf = TfidfVectorizer(
    preprocessor=clean_tweet,         # Apply tweet cleaning function
    ngram_range=(1, 2),               # Unigrams + bigrams
    stop_words='english',             # Use English stop words
    max_features=10000,               # Limit to top 10,000 features
    sublinear_tf=True,                # Use log scaling for term frequency
    min_df=3                          # Ignore terms that appear in fewer than 3 documents
)

# Cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("🔍 Benchmarking models...\n")
for name, model in models.items():
    pipeline = make_pipeline(tfidf, model)
    scores = cross_val_score(pipeline, X, Y, cv=cv, scoring='accuracy')
    print(f"{name:<25}: Mean Accuracy = {scores.mean():.4f} ± {scores.std():.4f}")

# Final evaluation on hold-out set
X_train, X_eval, Y_train, Y_eval = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=42)
final_model = make_pipeline(tfidf, RidgeClassifier())
final_model.fit(X_train, Y_train)
preds = final_model.predict(X_eval)

print("\n📊 Classification Report (RidgeClassifier on hold-out set):")
print(classification_report(Y_eval, preds))





🔍 Benchmarking models...

MultinomialNB            : Mean Accuracy = 0.5506 ± 0.0058
LogisticRegression       : Mean Accuracy = 0.5658 ± 0.0127
LinearSVC                : Mean Accuracy = 0.5303 ± 0.0138
RidgeClassifier          : Mean Accuracy = 0.5554 ± 0.0109

📊 Classification Report (RidgeClassifier on hold-out set):
              precision    recall  f1-score   support

    negative       0.47      0.34      0.40       476
     neutral       0.58      0.68      0.62      1016
    positive       0.54      0.52      0.53       644

    accuracy                           0.55      2136
   macro avg       0.53      0.51      0.52      2136
weighted avg       0.54      0.55      0.54      2136
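
The last two rows of the report can be checked by hand. The macro average weights every class equally, while the weighted average weights each class by its support. The sketch below recomputes both F1 averages from the rounded per-class values printed above, so the results only approximate the unrounded numbers scikit-learn uses internally.

```python
# Per-class F1 and support copied from the classification report above
report = {
    "negative": {"f1": 0.40, "support": 476},
    "neutral": {"f1": 0.62, "support": 1016},
    "positive": {"f1": 0.53, "support": 644},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: plain mean over classes
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: mean over classes, weighted by class support
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(f"macro F1:    {macro_f1:.2f}")     # 0.52
print(f"weighted F1: {weighted_f1:.2f}")  # 0.54
```

The macro average is lower because the weakest class, 'negative', counts as much as the others even though it has the fewest examples.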



In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Use the gold-standard training data loaded earlier
# (columns 'text' and 'sentiment')
X = gold_train_data['text']
Y = gold_train_data['sentiment']

# Encode the three string labels as integers
label_encoder = LabelEncoder()
Y_encoded = label_encoder.fit_transform(Y)

# Split into train and validation sets
X_train, X_eval, Y_train, Y_eval = train_test_split(X, Y_encoded, test_size=0.2, random_state=0)

# TF-IDF Vectorization
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_eval_tfidf = tfidf.transform(X_eval).toarray()

# MLP Neural Network Model
mlp = MLPClassifier(
    hidden_layer_sizes=(512, 256),  # Two hidden layers with 512 and 256 neurons
    activation='relu',              # ReLU activation function
    solver='adam',                  # Adam optimizer
    max_iter=100,                   # Number of iterations
    random_state=0,                 # Random seed for reproducibility
    verbose=True                    # Output progress during training
)

# Train the model
mlp.fit(X_train_tfidf, Y_train)

# Evaluate the model
Y_pred = mlp.predict(X_eval_tfidf)

# Accuracy score
acc = accuracy_score(Y_eval, Y_pred)
print(f"MLPClassifier Accuracy: {acc:.4f}")


Iteration 1, loss = 0.95884597
Iteration 2, loss = 0.66361254
Iteration 3, loss = 0.38862284
Iteration 4, loss = 0.18520055
Iteration 5, loss = 0.06855323
Iteration 6, loss = 0.02281874
Iteration 7, loss = 0.01015891
Iteration 8, loss = 0.00597680
Iteration 9, loss = 0.00387828
Iteration 10, loss = 0.00323096
Iteration 11, loss = 0.00301844
Iteration 12, loss = 0.00220541
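
A training loss that falls close to zero within a dozen iterations may indicate that the network is memorising the training set. One common guard is early stopping on a held-out validation fraction, which `MLPClassifier` supports through `early_stopping=True`. The sketch below uses synthetic data from `make_classification` and a smaller network than ours. It is an illustration under those assumptions, not a result from our experiments.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic three-class data standing in for the TF-IDF features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=40, n_informative=10,
    n_classes=3, random_state=0,
)
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

mlp_es = MLPClassifier(
    hidden_layer_sizes=(64,),
    early_stopping=True,       # hold out part of the training data
    validation_fraction=0.1,   # 10% of the training set used for validation
    n_iter_no_change=5,        # stop after 5 epochs without improvement
    max_iter=100,
    random_state=0,
)
mlp_es.fit(X_tr, y_tr)

print("epochs run:", mlp_es.n_iter_)
print(f"best validation accuracy: {mlp_es.best_validation_score_:.3f}")
print(f"hold-out accuracy: {mlp_es.score(X_ev, y_ev):.3f}")
```

With early stopping enabled, training is monitored through the validation score rather than the training loss, so the run ends once the model stops improving on data it has not been fitted to.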
