PA3
Author: Zhuangzhuang Gong, Richard Hua

## 1. Reflection on the annotation task

**Challenges**  
One key challenge was understanding tweets whose meaning depends on cultural references or context outside the tweet itself.  
In these cases we had to make judgement calls,
which made us more aware of the subjectivity involved in annotation. Another challenge was dealing
with mixed-sentiment tweets: some tweets express both positive and negative emotions, which makes it hard to assign a single definitive label.

**Ideas for improvement**  
To improve the annotation process, we would refine and expand the annotation guidelines with clearer instructions, especially for
mixed-sentiment tweets. This would reduce the subjectivity of the annotation.
Another improvement is to measure inter-annotator agreement by having multiple annotators label the same tweets and comparing their results. This not only improves annotation quality but also gives insight into how subjective the sentiment labels are.
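
As a minimal sketch of how agreement among several annotators could be measured, the snippet below computes Cohen's Kappa for every pair of annotators using only the standard library. The annotator names and labels are hypothetical toy data, and `cohen_kappa` is an illustrative helper, not the function used elsewhere in this notebook.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's Kappa for two aligned, non-empty label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: share of items with identical labels
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of the same six tweets by three annotators
annotations = {
    "ann_1": ["positive", "neutral", "negative", "neutral", "positive", "negative"],
    "ann_2": ["positive", "neutral", "neutral", "neutral", "positive", "negative"],
    "ann_3": ["neutral", "neutral", "negative", "positive", "positive", "negative"],
}

# Kappa for every annotator pair
pairwise = {
    (x, y): cohen_kappa(annotations[x], annotations[y])
    for x, y in combinations(annotations, 2)
}
for pair, kappa in pairwise.items():
    print(pair, round(kappa, 2))
```

Pairs with a low Kappa point to annotators (or tweet types) for which the guidelines may need to be clarified.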

## 2. Exploring the crowdsourced data


**Read and observe the data**  
As we can see, the sentiment annotations are not in the expected format. There are inconsistent label formats, such as 'neutral' and 'Neutral', and typos such as 'Nutral'. Therefore, we need to clean the data.

In [19]:
import pandas as pd

crowd_train_data = pd.read_csv('crowdsourced_train.csv',sep='\t')
gold_train_data = pd.read_csv('gold_train.csv', sep='\t')
test_data = pd.read_csv('test.csv', sep='\t')
label_counts = crowd_train_data['sentiment'].value_counts()
print(crowd_train_data)

      sentiment                                               text
0      Positive  There's so much misconception on Islam rn so s...
1      Positive  @Mr_Rondeau You should try Iron Maiden at abou...
2      Negative  Going to #FantasticFour tomorrow. Half expecti...
3       Neutral  @cfelan hey hey, just checkng to see if you or...
4      Positive  does anyone just get drunk and watch twilight ...
...         ...                                                ...
10671  Positive  Glad to hear there may be a bigger more public...
10672  Positive  Great stand by the Wolves on 3rd and long. Cur...
10673   Neutral  Ayyye I just purchased my Ed Sheeran tickets f...
10674  Negative  The anti-semitism, the misogyny, and the suppo...
10675  Positive  And yet, I have yet to see the whole series of...

[10676 rows x 2 columns]


**Check the agreement before cleaning the data**  
We first checked the inter-annotator agreement by comparing the crowdsourced sentiment labels with the gold labels across all 10,676 tweets. The overall accuracy is only 35.63% and Cohen's Kappa is only 0.19, which indicates a low level of agreement.
This suggests either inconsistent label formats or subjective interpretation of the sentiment labels.

In [20]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import cohen_kappa_score

accuracy = accuracy_score(crowd_train_data['sentiment'], gold_train_data['sentiment'])
kappa = cohen_kappa_score(crowd_train_data['sentiment'], gold_train_data['sentiment'])
print(f"Cohen's Kappa: {kappa:.2f}")
print(f"Accuracy: {accuracy:.2%}")

Cohen's Kappa: 0.19
Accuracy: 35.63%


**Clean and re-classify labels**  
We fixed the inconsistent label formats by converting all labels to lowercase and stripping surrounding whitespace.  
We also found several spelling variations and formatting issues in the label column (e.g., 'nuetral', 'positve', 'nedative'). These were normalised using a mapping dictionary, and all labels were consolidated into the three standard sentiment categories: positive, neutral, and negative.

In [21]:
# strip surrounding whitespace and convert labels to lowercase
crowd_train_data['sentiment'] = crowd_train_data['sentiment'].str.strip().str.lower()
label_counts = crowd_train_data['sentiment'].value_counts()
print("before cleaning")
print(label_counts)

before cleaning
sentiment
neutral           5046
positive          3213
negative          2375
netural             20
nuetral              3
postive              2
postitive            1
neutal               1
npositive            1
neutra l             1
positie              1
negayive             1
nutral               1
neugral              1
negtaive             1
neutrla              1
neutrall             1
neural               1
netutral             1
_x0008_neutral       1
nedative             1
neutral?             1
positve              1
Name: count, dtype: int64


In [22]:
# create a dictionary
label_map = {
    # positive 
    'positve': 'positive',
    'postive': 'positive',
    'postitive': 'positive',
    'positie': 'positive',
    'npositive': 'positive',

    # neutral 
    'neutral?': 'neutral',
    'nuetral': 'neutral',
    '_x0008_neutral': 'neutral',
    'netural': 'neutral',
    'netutral': 'neutral',
    'neural': 'neutral',
    'neutrall': 'neutral',
    'neugral': 'neutral',
    'neutrla': 'neutral',
    'nutral': 'neutral',
    'neutra l': 'neutral',
    'neutal': 'neutral',

    # negative 
    'nedative': 'negative',
    'negtaive': 'negative',
    'negayive': 'negative',
}

label_map.update({
    'positive': 'positive',
    'neutral': 'neutral',
    'negative': 'negative'
})

# apply the dictionary
crowd_train_data['sentiment'] = crowd_train_data['sentiment'].map(label_map)

# check the result
print("After cleaning")
print(crowd_train_data['sentiment'].value_counts())


After cleaning
sentiment
neutral     5079
positive    3219
negative    2378
Name: count, dtype: int64
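
As an alternative to a hand-written mapping dictionary, misspelled labels could be matched to the closest canonical label automatically. The sketch below uses `difflib.get_close_matches` from the standard library; `normalise_label` and the `cutoff` of 0.6 are illustrative assumptions, not what was used to produce the counts above.

```python
from difflib import get_close_matches

CANONICAL = ["positive", "neutral", "negative"]


def normalise_label(raw, cutoff=0.6):
    """Map a raw label to its closest canonical label, or None if nothing is close."""
    cleaned = raw.strip().lower()
    matches = get_close_matches(cleaned, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A few raw labels, including one that matches nothing
for raw in ["Positive", "postive", "nutral", "nedative", "xyz"]:
    print(f"{raw!r:>12} -> {normalise_label(raw)}")
```

Fuzzy matching can silently assign an ambiguous label to the wrong class, so the resulting mapping should still be reviewed by hand. The explicit dictionary has the advantage that every decision is visible.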


**Agreement checking after cleaning**  
After cleaning the labels, we evaluated the agreement between the crowdsourced labels and the gold-standard labels again. The overall accuracy was 65.49% and Cohen's Kappa was 0.45, indicating 'average' agreement according to the standard in Richard Johansson's slides.  
This suggests that there is reasonable alignment, although some inconsistency remains. It is likely due to subjective interpretation of the sentiment of the tweets.

In [23]:
accuracy = accuracy_score(crowd_train_data['sentiment'], gold_train_data['sentiment'])
kappa = cohen_kappa_score(crowd_train_data['sentiment'], gold_train_data['sentiment'])
print(f"Cohen's Kappa: {kappa:.2f}")
print(f"Accuracy: {accuracy:.2%}")

Cohen's Kappa: 0.45
Accuracy: 65.49%


**Analysis of the annotation distribution**  
We compared the label distributions of the crowdsourced and gold data. Both show a similar overall structure, with the majority of labels being 'neutral', followed by 'positive' and then 'negative'. However, there is a noticeable difference for 'negative': the crowd labelled 22.27% of tweets as negative, compared with 15.55% in the gold data. This suggests that the crowd annotators are more sensitive to negative sentiment, or interpret negative tweets more subjectively, which may contribute to the Cohen's Kappa of 0.45 and the overall accuracy of 65.49%.

In [24]:
# Get label counts from each dataset
crowd_counts = crowd_train_data['sentiment'].value_counts()
gold_counts = gold_train_data['sentiment'].value_counts()

# Access values 
crowd_counts_dict = {
    'neutral': crowd_counts.get('neutral', 0),
    'positive': crowd_counts.get('positive', 0),
    'negative': crowd_counts.get('negative', 0)
}

gold_counts_dict = {
    'neutral': gold_counts.get('neutral', 0),
    'positive': gold_counts.get('positive', 0),
    'negative': gold_counts.get('negative', 0)
}

# Create Series
crowd_series = pd.Series(crowd_counts_dict, name='Crowd')
gold_series = pd.Series(gold_counts_dict, name='Gold')

# Combine and analyze
comparison_df = pd.concat([crowd_series, gold_series], axis=1)

# Add percentage comparison
total = comparison_df.sum()
comparison_df['Crowd (%)'] = (comparison_df['Crowd'] / total['Crowd'] * 100).round(2)
comparison_df['Gold (%)'] = (comparison_df['Gold'] / total['Gold'] * 100).round(2)

print(comparison_df)

          Crowd  Gold  Crowd (%)  Gold (%)
neutral    5079  5364      47.57     50.24
positive   3219  3652      30.15     34.21
negative   2378  1660      22.27     15.55
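
The marginal distributions above show how often each label is used, but not which gold labels the crowd disagrees on. A confusion table of (gold, crowd) pairs would show that. The sketch below builds one with the standard library on hypothetical toy labels; `confusion_counts` is an illustrative helper.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]


def confusion_counts(gold, crowd):
    """Count (gold, crowd) label pairs for two aligned label sequences."""
    if len(gold) != len(crowd):
        raise ValueError("label sequences must be aligned")
    return Counter(zip(gold, crowd))


# Hypothetical toy labels, not taken from the datasets
gold = ["neutral", "neutral", "positive", "negative", "neutral", "positive"]
crowd = ["negative", "neutral", "positive", "negative", "negative", "neutral"]

pairs = confusion_counts(gold, crowd)

# Rows are gold labels, columns are crowd labels
print(f"{'gold/crowd':<12}" + "".join(f"{label:>10}" for label in LABELS))
for g in LABELS:
    row = "".join(f"{pairs[(g, c)]:>10}" for c in LABELS)
    print(f"{g:<12}{row}")
```

On the real data, `pd.crosstab(gold_train_data['sentiment'], crowd_train_data['sentiment'])` should give the same kind of table, assuming the two frames are row-aligned as the agreement computation above already assumes.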


## 3. Implementation of a classifier

Clean tweet

In [25]:
import re
import emoji

def clean_tweet(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Replace URLs
    text = re.sub(r'http\S+|www\.\S+', ' ', text)
    # 3. Replace user mentions
    text = re.sub(r'@\w+', ' ', text)
    # 4. Remove the '#' symbol but keep the tag text
    text = re.sub(r'#', '', text)
    # 5. Remove RT (retweet marker)
    text = re.sub(r'\brt\b', ' ', text)
    # 6. Translate emojis to their text names, e.g. ":thumbs_up:"
    #    (step 7 then replaces the ':' and '_' delimiters with spaces)
    text = emoji.demojize(text)
    # 7. Remove non-alphanumeric (keep spaces)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # 8. Collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
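
To illustrate what the regex steps do, here is a small sketch that applies them to made-up sample tweets. It omits the emoji step so that it runs without the `emoji` package; `clean_tweet_regex_only` is a hypothetical name used only for this illustration.

```python
import re


def clean_tweet_regex_only(text):
    """Same steps as clean_tweet above, minus the emoji translation."""
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', ' ', text)   # drop URLs
    text = re.sub(r'@\w+', ' ', text)               # drop user mentions
    text = re.sub(r'#', '', text)                   # keep hashtag text, drop '#'
    text = re.sub(r'\brt\b', ' ', text)             # drop the retweet marker
    text = re.sub(r'[^a-z0-9\s]', ' ', text)        # drop punctuation and symbols
    return re.sub(r'\s+', ' ', text).strip()        # collapse whitespace


# Made-up sample tweet
sample = "RT @user: Loving the new #FantasticFour trailer!! https://t.co/abc123"
print(clean_tweet_regex_only(sample))
# -> loving the new fantasticfour trailer
```

Note that punctuation such as "!!" is removed entirely, which discards a signal that can matter for sentiment.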

Pipeline of Clean+TfidfVectorizer+linearSVC

In [26]:
# the actual classification algorithm
from sklearn.svm import LinearSVC
# for converting training and test datasets into matrices
# TfidfVectorizer does this specifically for documents
from sklearn.feature_extraction.text import TfidfVectorizer
# for splitting the dataset into training and test sets 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
X = crowd_train_data['text']
Y = crowd_train_data['sentiment']

X_train, X_eval, Y_train, Y_eval = train_test_split(X, Y, test_size=0.2, random_state=0)
def train_document_classifier(X, Y):
    pipeline = make_pipeline( TfidfVectorizer(preprocessor=clean_tweet,), LinearSVC(dual='auto') )
    pipeline.fit(X, Y)
    return pipeline
clf = train_document_classifier(X_train, Y_train)
acc = accuracy_score(Y_eval, clf.predict(X_eval))
print("acc:",acc)

acc: 0.548689138576779


We used hyperparameter search and cross-validation to select parameters that generalise better to unseen data, in order to maximise the model's performance.

We also tried logistic regression, this time trained on the gold labels.

In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
X = gold_train_data['text']
Y = gold_train_data['sentiment']

X_train, X_eval, Y_train, Y_eval = train_test_split(X, Y, test_size=0.2, random_state=0)
def train_document_classifier(X, Y):
    pipeline = make_pipeline(TfidfVectorizer(preprocessor=clean_tweet,), LogisticRegression(
    penalty='l2',            # L2 regularization
    C=1.0,                   # Regularization strength (default)
    solver='liblinear',      # Solver suitable for smaller datasets
    max_iter=100,            # Iterations for convergence
    class_weight='balanced', # Handle imbalanced classes
    fit_intercept=True,      # Include intercept
))
    pipeline.fit(X, Y)
    return pipeline
clf = train_document_classifier(X_train, Y_train)
acc = accuracy_score(Y_eval, clf.predict(X_eval))
print("acc:",acc)

acc: 0.6441947565543071


In [48]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report

# Models to compare
models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=200, class_weight='balanced', solver='liblinear'),
    "LinearSVC": LinearSVC(class_weight='balanced', max_iter=1000),
    "RidgeClassifier": RidgeClassifier(),
}

# TF-IDF settings
tfidf = TfidfVectorizer(
    preprocessor=clean_tweet,         # Apply tweet cleaning function
    ngram_range=(1, 2),               # Unigrams + bigrams
    stop_words='english',             # Use English stop words
    max_features=10000,               # Limit to top 10,000 features
    sublinear_tf=True,                # Use log scaling for term frequency
    min_df=3                          # Ignore terms that appear in fewer than 3 documents
)

# Cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("🔍 Benchmarking models...\n")
for name, model in models.items():
    pipeline = make_pipeline(tfidf, model)
    scores = cross_val_score(pipeline, X, Y, cv=cv, scoring='accuracy')
    print(f"{name:<25}: Mean Accuracy = {scores.mean():.4f} ± {scores.std():.4f}")

# Final evaluation on hold-out set
X_train, X_eval, Y_train, Y_eval = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=42)
final_model = make_pipeline(tfidf, RidgeClassifier())
final_model.fit(X_train, Y_train)
preds = final_model.predict(X_eval)

print("\n📊 Classification Report (RidgeClassifier on hold-out set):")
print(classification_report(Y_eval, preds))





🔍 Benchmarking models...

MultinomialNB            : Mean Accuracy = 0.6132 ± 0.0062
LogisticRegression       : Mean Accuracy = 0.6318 ± 0.0062
LinearSVC                : Mean Accuracy = 0.6035 ± 0.0104
RidgeClassifier          : Mean Accuracy = 0.6272 ± 0.0112

📊 Classification Report (RidgeClassifier on hold-out set):
              precision    recall  f1-score   support

    negative       0.60      0.32      0.42       332
     neutral       0.63      0.76      0.69      1073
    positive       0.64      0.58      0.61       731

    accuracy                           0.63      2136
   macro avg       0.62      0.55      0.57      2136
weighted avg       0.63      0.63      0.62      2136
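
To put the hold-out accuracy in context, it helps to compare it with a majority-class baseline and to check how the averaged scores are derived. The sketch below recomputes these from the per-class support and F1 values printed in the report above; it is plain arithmetic, not an additional experiment.

```python
# Per-class support and F1 copied from the classification report above
support = {"negative": 332, "neutral": 1073, "positive": 731}
f1 = {"negative": 0.42, "neutral": 0.69, "positive": 0.61}

total = sum(support.values())

# Accuracy of always predicting the most frequent class
majority_label = max(support, key=support.get)
baseline_acc = support[majority_label] / total

# Macro average: unweighted mean over classes
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes weighted by support
weighted_f1 = sum(f1[label] * support[label] for label in f1) / total

print(f"majority class: {majority_label}, baseline accuracy: {baseline_acc:.2f}")
print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
```

Always predicting 'neutral' would already reach about 0.50 accuracy on this split, so the classifier's 0.63 is a gain of roughly 13 percentage points. The gap between macro F1 and weighted F1 reflects the weak recall on the smaller 'negative' class.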



In [47]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Use the gold-labelled training data (columns 'text' and 'sentiment')
X = gold_train_data['text']
Y = gold_train_data['sentiment']

# Encode the labels (since you have 3 classes)
label_encoder = LabelEncoder()
Y_encoded = label_encoder.fit_transform(Y)

# Split into train and validation sets
X_train, X_eval, Y_train, Y_eval = train_test_split(X, Y_encoded, test_size=0.2, random_state=0)

# TF-IDF Vectorization
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_eval_tfidf = tfidf.transform(X_eval).toarray()

# MLP Neural Network Model
mlp = MLPClassifier(
    hidden_layer_sizes=(512, 256),  # Two hidden layers with 512 and 256 neurons
    activation='relu',              # ReLU activation function
    solver='adam',                  # Adam optimizer
    max_iter=100,                   # Number of iterations
    random_state=0,                 # Random seed for reproducibility
    verbose=True                    # Output progress during training
)

# Train the model
mlp.fit(X_train_tfidf, Y_train)

# Evaluate the model
Y_pred = mlp.predict(X_eval_tfidf)

# Accuracy score
acc = accuracy_score(Y_eval, Y_pred)
print(f"MLPClassifier Accuracy: {acc:.4f}")


Iteration 1, loss = 0.95884597
Iteration 2, loss = 0.66361254
Iteration 3, loss = 0.38862284
Iteration 4, loss = 0.18520055
Iteration 5, loss = 0.06855323
Iteration 6, loss = 0.02281874
Iteration 7, loss = 0.01015891
Iteration 8, loss = 0.00597680
Iteration 9, loss = 0.00387828
Iteration 10, loss = 0.00323096
Iteration 11, loss = 0.00301844
Iteration 12, loss = 0.00220541
Iteration 13, loss = 0.00274713
Iteration 14, loss = 0.00211303
Iteration 15, loss = 0.00208056
Iteration 16, loss = 0.00183821
Iteration 17, loss = 0.00168452
Iteration 18, loss = 0.00182496
Iteration 19, loss = 0.00206072
Iteration 20, loss = 0.00130422
Iteration 21, loss = 0.00212225
Iteration 22, loss = 0.00159790
Iteration 23, loss = 0.00154578
Iteration 24, loss = 0.00150607
Iteration 25, loss = 0.00240101
Iteration 26, loss = 0.00207064
Iteration 27, loss = 0.00164088
Iteration 28, loss = 0.00197591
Iteration 29, loss = 0.00188707
Iteration 30, loss = 0.00135071
Iteration 31, loss = 0.00140986
Training loss did

In [54]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Define a pipeline with TfidfVectorizer and MLPClassifier
pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=clean_tweet),
    MLPClassifier(max_iter=100, random_state=0, verbose=True)
)

# Define the hyperparameter search space
param_dist = {
    'mlpclassifier__hidden_layer_sizes': [(50,), (100,), (150,), (50, 50), (100, 50)],
    'mlpclassifier__activation': ['relu', 'tanh', 'logistic'],
    'mlpclassifier__solver': ['adam', 'sgd', 'lbfgs'],
    'mlpclassifier__alpha': np.logspace(-5, 3, 9),  # Exponentially spaced values for alpha (L2 regularization)
    'mlpclassifier__learning_rate_init': [0.001, 0.01, 0.1, 0.5],
}

# Set up RandomizedSearchCV with 3-fold cross-validation
random_search = RandomizedSearchCV(
    pipeline, param_distributions=param_dist, n_iter=50, cv=3, n_jobs=-1, verbose=2, random_state=0, scoring='accuracy'
)
 
# Split the dataset into training and evaluation sets
X_train, X_eval, Y_train, Y_eval = train_test_split(X, Y, test_size=0.2, random_state=0)

# Fit the RandomizedSearchCV object
random_search.fit(X_train, Y_train)

# Print the best hyperparameters found
print("Best hyperparameters found:", random_search.best_params_)

# Evaluate the best model on the evaluation set
best_model = random_search.best_estimator_
Y_pred = best_model.predict(X_eval)
acc = accuracy_score(Y_eval, Y_pred)
print("Evaluation accuracy with best model: {:.4f}".format(acc))


Fitting 3 folds for each of 50 candidates, totalling 150 fits


  tmp = X - X.max(axis=1)[:, np.newaxis]
  tmp = X - X.max(axis=1)[:, np.newaxis]
  tmp = X - X.max(axis=1)[:, np.newaxis]
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

Iteration 1, loss = 1.03187873
Iteration 2, loss = 1.00150204
Iteration 3, loss = 0.99970938
Iteration 4, loss = 0.99863038
Iteration 5, loss = 0.99715840
Iteration 6, loss = 0.99576138
Iteration 7, loss = 0.99414437
Iteration 8, loss = 0.99244218
Iteration 9, loss = 0.99067205
Iteration 10, loss = 0.98866992
Iteration 11, loss = 0.98677097
Iteration 12, loss = 0.98467793
Iteration 13, loss = 0.98235667
Iteration 14, loss = 0.98001638
Iteration 15, loss = 0.97738927
Iteration 16, loss = 0.97475543
Iteration 17, loss = 0.97199800
Iteration 18, loss = 0.96892930
Iteration 19, loss = 0.96581807
Iteration 20, loss = 0.96250372
Iteration 21, loss = 0.95896755
Iteration 22, loss = 0.95583468
Iteration 23, loss = 0.95152028
Iteration 24, loss = 0.94742007
Iteration 25, loss = 0.94306210
Iteration 26, loss = 0.93877794
Iteration 27, loss = 0.93418454
Iteration 28, loss = 0.92923885
Iteration 29, loss = 0.92428720
Iteration 30, loss = 0.91947063
Iteration 31, loss = 0.91403816
Iteration 32, los



Iteration 1, loss = 1.03343804
Iteration 2, loss = 1.00153643
Iteration 3, loss = 1.00025507
Iteration 4, loss = 0.99909833
Iteration 5, loss = 0.99806253
Iteration 6, loss = 0.99675503
Iteration 7, loss = 0.99557218
Iteration 8, loss = 0.99401990
Iteration 9, loss = 0.99254956
Iteration 10, loss = 0.99060208
Iteration 11, loss = 0.98864816
Iteration 12, loss = 0.98662688
Iteration 13, loss = 0.98449493
Iteration 14, loss = 0.98249385
Iteration 15, loss = 0.97997143
Iteration 16, loss = 0.97712490
Iteration 17, loss = 0.97462638
Iteration 18, loss = 0.97163067
Iteration 19, loss = 0.96838777
Iteration 20, loss = 0.96497817
Iteration 21, loss = 0.96158381
Iteration 22, loss = 0.95799890
Iteration 23, loss = 0.95423863
Iteration 24, loss = 0.95032632
Iteration 25, loss = 0.94602453
Iteration 26, loss = 0.94150766
Iteration 27, loss = 0.93745187
Iteration 28, loss = 0.93284859
Iteration 29, loss = 0.92764980
Iteration 30, loss = 0.92218826
Iteration 31, loss = 0.91692882
Iteration 32, los



Iteration 1, loss = nan
Iteration 2, loss = nan
Iteration 3, loss = nan
Iteration 4, loss = nan
Iteration 5, loss = nan
Iteration 6, loss = nan
Iteration 7, loss = nan
Iteration 8, loss = nan
Iteration 9, loss = nan
Iteration 10, loss = nan
Iteration 11, loss = nan
Iteration 12, loss = nan
Iteration 13, loss = nan
Iteration 14, loss = nan
Iteration 15, loss = nan
Iteration 16, loss = nan
Iteration 17, loss = nan
Iteration 18, loss = nan
Iteration 19, loss = nan
Iteration 20, loss = nan
Iteration 21, loss = nan
Iteration 22, loss = nan
Iteration 23, loss = nan
Iteration 24, loss = nan
Iteration 25, loss = nan
Iteration 26, loss = nan
Iteration 27, loss = nan
Iteration 28, loss = nan
Iteration 29, loss = nan
Iteration 30, loss = nan
Iteration 31, loss = nan
Iteration 32, loss = nan
Iteration 33, loss = nan
Iteration 34, loss = nan
Iteration 35, loss = nan
Iteration 36, loss = nan
Iteration 37, loss = nan
Iteration 38, loss = nan
Iteration 39, loss = nan
Iteration 40, loss = nan
Iteration



[CV] END mlpclassifier__activation=tanh, mlpclassifier__alpha=0.0001, mlpclassifier__hidden_layer_sizes=(50, 50), mlpclassifier__learning_rate_init=0.01, mlpclassifier__solver=lbfgs; total time=  12.8s
Iteration 1, loss = 105.15470095
Iteration 2, loss = 14.01171930
Iteration 3, loss = 2.72639247
Iteration 4, loss = 1.29846753
Iteration 5, loss = 1.08667599
Iteration 6, loss = 1.03291004
Iteration 7, loss = 1.02473901
Iteration 8, loss = 1.02556946
Iteration 9, loss = 1.04337725
Iteration 10, loss = 1.04569356
Iteration 11, loss = 1.03460690
Iteration 12, loss = 1.03291078
Iteration 13, loss = 1.03854902
Iteration 14, loss = 1.02657348
Iteration 15, loss = 1.02637190
Iteration 16, loss = 1.08884425
Iteration 17, loss = 1.04077846
Iteration 18, loss = 1.02588682
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
[CV] END mlpclassifier__activation=logistic, mlpclassifier__alpha=1.0, mlpclassifier__hidden_layer_sizes=(100, 50), mlpclassifier__learnin



Iteration 1, loss = 1.08755536
Iteration 2, loss = 1.00580331
Iteration 3, loss = 1.00137144
Iteration 4, loss = 1.00050756
Iteration 5, loss = 0.99944495
Iteration 6, loss = 0.99854208
Iteration 7, loss = 0.99756028
Iteration 8, loss = 0.99637276
Iteration 9, loss = 0.99520591
Iteration 10, loss = 0.99374251
Iteration 11, loss = 0.99226553
Iteration 12, loss = 0.99082876
Iteration 13, loss = 0.98920649
Iteration 14, loss = 0.98746905
Iteration 15, loss = 0.98575891
Iteration 16, loss = 0.98394912
Iteration 17, loss = 0.98178039
Iteration 18, loss = 0.97948122
Iteration 19, loss = 0.97739390
Iteration 20, loss = 0.97495949
Iteration 21, loss = 0.97203773
Iteration 22, loss = 0.96967165
Iteration 23, loss = 0.96655043
Iteration 24, loss = 0.96349153
Iteration 25, loss = 0.96018084
Iteration 26, loss = 0.95680952
Iteration 27, loss = 0.95322052
Iteration 28, loss = 0.94941610
Iteration 29, loss = 0.94568971
Iteration 30, loss = 0.94146460
Iteration 31, loss = 0.93717791
Iteration 32, los




Iteration 44, loss = 1.00779123
Iteration 45, loss = 1.00637239
Iteration 46, loss = 1.00728227
Iteration 47, loss = 1.00909187
Iteration 48, loss = 1.00583962
Iteration 49, loss = 1.00624225
Iteration 50, loss = 1.00523353
Iteration 51, loss = 1.00756898
Iteration 52, loss = 1.00799968
Iteration 53, loss = 1.00631587
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
[CV] END mlpclassifier__activation=logistic, mlpclassifier__alpha=10.0, mlpclassifier__hidden_layer_sizes=(50, 50), mlpclassifier__learning_rate_init=0.5, mlpclassifier__solver=sgd; total time=  27.9s
Iteration 1, loss = 1.49141160
Iteration 2, loss = 1.06745851
Iteration 3, loss = 1.08648895
Iteration 4, loss = 1.15920405
Iteration 5, loss = 1.23630790
Iteration 6, loss = 1.25596990
Iteration 7, loss = 1.26054826
Iteration 8, loss = 1.27477331
Iteration 9, loss = 1.24310118
Iteration 10, loss = 1.19452975
Iteration 11, loss = 1.22483573
Iteration 12, loss = 1.22901810
Iteration 13,



Iteration 1, loss = 104.43537480
Iteration 2, loss = 13.85031087
Iteration 3, loss = 3.21856874
Iteration 4, loss = 1.30701410
Iteration 5, loss = 1.05515484
Iteration 6, loss = 1.03466108
Iteration 7, loss = 1.02222294
Iteration 8, loss = 1.02050546
Iteration 9, loss = 1.03010842
Iteration 10, loss = 1.01605241
Iteration 11, loss = 1.02000908
Iteration 12, loss = 1.01784408
Iteration 13, loss = 1.01473794
Iteration 14, loss = 1.01927262
Iteration 15, loss = 1.01546301
Iteration 16, loss = 1.01297706
Iteration 17, loss = 1.01326323
Iteration 18, loss = 1.01938941
Iteration 19, loss = 1.01991854
Iteration 20, loss = 1.01161342
Iteration 21, loss = 1.01772051
Iteration 22, loss = 1.02711169
Iteration 23, loss = 1.01658524
Iteration 24, loss = 1.02104460
Iteration 25, loss = 1.01589977
Iteration 26, loss = 1.06161259
Iteration 27, loss = 1.04991953
Iteration 28, loss = 1.01596111
Iteration 29, loss = 1.01684201
Iteration 30, loss = 1.01866079
Iteration 31, loss = 1.03622982
Training loss 

3 fits failed out of a total of 150.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/richardhua/dev/applied-ml/venv/lib/python3.13/site-packa

Iteration 1, loss = 1.03189580
Iteration 2, loss = 1.00122300
Iteration 3, loss = 1.00003775
Iteration 4, loss = 0.99855953
Iteration 5, loss = 0.99718828
Iteration 6, loss = 0.99535666
Iteration 7, loss = 0.99352166
Iteration 8, loss = 0.99157300
Iteration 9, loss = 0.98921129
Iteration 10, loss = 0.98671516
Iteration 11, loss = 0.98395900
Iteration 12, loss = 0.98102961
Iteration 13, loss = 0.97769900
Iteration 14, loss = 0.97419781
Iteration 15, loss = 0.97052468
Iteration 16, loss = 0.96634637
Iteration 17, loss = 0.96165996
Iteration 18, loss = 0.95700362
Iteration 19, loss = 0.95190090
Iteration 20, loss = 0.94682235
Iteration 21, loss = 0.94085935
Iteration 22, loss = 0.93467344
Iteration 23, loss = 0.92827880
Iteration 24, loss = 0.92176443
Iteration 25, loss = 0.91484039
Iteration 26, loss = 0.90763823
Iteration 27, loss = 0.90042082
Iteration 28, loss = 0.89292594
Iteration 29, loss = 0.88591063
Iteration 30, loss = 0.87817981
Iteration 31, loss = 0.87036455
Iteration 32, los

