### Student Information
Name: 簡楷恒

Student ID: 113062582

GitHub ID: 73538884

Kaggle name: KaiHengChien

Kaggle private scoreboard snapshot: ![Kaggle private scoreboard](img/pic0.png)

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home exercises** in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) regarding Emotion Recognition on Twitter by this link: https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get `(0.6N + 1 - x) / (0.6N) * 10 + 20` points, where N is the total number of participants and x is your rank. (E.g., if there are 100 participants and you rank 3rd, your score will be `(0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67`% out of 30%.)   
    Submit your last submission **BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)__. 
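
As a quick check of the scoring formula for the competition part, here is a minimal sketch that evaluates it for the worked example given above. The function name `top_bracket_score` is hypothetical, and the sketch only covers ranks that fall inside the bracket the formula applies to.

```python
def top_bracket_score(n_participants: int, rank: int) -> float:
    """Score (out of 30) from the formula (0.6N + 1 - x) / (0.6N) * 10 + 20."""
    cutoff = 0.6 * n_participants
    return (cutoff + 1 - rank) / cutoff * 10 + 20


# Worked example from the instructions: 100 participants, rank 3.
print(round(top_bracket_score(100, 3), 2))  # 29.67
```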

---

### Part 1

https://github.com/k910723/DM2024-Lab2-Master/blob/main/DM2024-Lab2-Master.ipynb

---

### Part 2

See the "Student Information" section above.

---

### Part 3

Load the train/test identification and the emotion labels for the dataset.

In [None]:
import pandas as pd

# Load the datasets
data_identification = pd.read_csv('Data/data_identification.csv')
data_identification_train = data_identification[data_identification['identification'] == 'train']
emotion = pd.read_csv('Data/emotion.csv')

Load the tweets from a JSON Lines file (one JSON object per line), then merge the identification and emotion data into them on `tweet_id`.

In [None]:
import json

# Load the tweets dataset
tweet_id = []
text = []

with open('Data/tweets_DM.json', 'r') as file:
    # Iterate over the file lazily instead of reading every line into memory first
    for line in file:
        data_dic = json.loads(line)
        tweet_id.append(data_dic['_source']['tweet']['tweet_id'])
        text.append(data_dic['_source']['tweet']['text'])

data = pd.DataFrame(
    {
     'tweet_id': tweet_id,
     'text': text,
    }
)

data = pd.merge(data, data_identification, on='tweet_id', how='left')
data = pd.merge(data, emotion, on='tweet_id',how='left')

print(len(data))

1867535
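
The parse-and-merge logic can be illustrated on a tiny sample. The sketch below uses only the standard library, so it runs without the project data; the two sample tweets and the lookup tables are hypothetical stand-ins for `tweets_DM.json` and the two CSV files. It shows the key property of the left join: tweets with no entry in the emotion table (the test tweets) keep a missing label.

```python
import io
import json

# Hypothetical two-line sample mimicking the nested layout of tweets_DM.json.
raw = io.StringIO(
    '{"_source": {"tweet": {"tweet_id": "0x1", "text": "so happy today"}}}\n'
    '{"_source": {"tweet": {"tweet_id": "0x2", "text": "what a day"}}}\n'
)

# Hypothetical lookup tables standing in for the two CSV files.
identification = {"0x1": "train", "0x2": "test"}
emotion = {"0x1": "joy"}  # only training tweets have a label

rows = []
for line in raw:  # one JSON object per line
    tweet = json.loads(line)["_source"]["tweet"]
    tweet_id = tweet["tweet_id"]
    rows.append({
        "tweet_id": tweet_id,
        "text": tweet["text"],
        # .get() returns None when the key is absent, like a left join
        "identification": identification.get(tweet_id),
        "emotion": emotion.get(tweet_id),
    })

print(rows[1]["emotion"])  # None: the test tweet has no label
```

In the notebook itself, `pd.merge(..., how='left')` plays the same role and fills the missing labels with `NaN` rather than `None`.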


Split the dataset into training and testing sets based on the "identification" column.

Then drop the "identification" and "tweet_id" columns from both sets.

In [None]:
train_data = data[data['identification'] == 'train'].copy()
test_data = data[data['identification'] == 'test'].copy()

train_data = train_data.drop(columns=['identification', 'tweet_id'])
test_data = test_data.drop(columns=['identification', 'tweet_id'])

print(len(train_data))
print(len(test_data))

1455563
411972


Transforms the training and testing data into a `DatasetDict` format.

In [None]:
import datasets

# Transform the data into DatasetDict format
dataset = datasets.DatasetDict({
    'train': datasets.Dataset.from_pandas(train_data, preserve_index=False),
    'test': datasets.Dataset.from_pandas(test_data, preserve_index=False),
})

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'emotion'],
        num_rows: 1455563
    })
    test: Dataset({
        features: ['text', 'emotion'],
        num_rows: 411972
    })
})

Set up the environment for fine-tuning a BERT model with the Hugging Face Transformers library: select the GPU, load the tokenizer, and fit a one-hot encoder on the emotion labels.

In [None]:
from transformers import BertTokenizer, Trainer, TrainingArguments
import os
import torch
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# select the GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Use the one hot encoder to encode the emotion labels
encoder = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
encoder = encoder.fit(np.reshape(dataset['train']['emotion'], (-1, 1)))
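
To make the encoding concrete, here is a minimal standard-library sketch of what a one-hot encoder that ignores unknown values produces. It is not the scikit-learn implementation; the helper names `fit_categories` and `one_hot` and the sample labels are hypothetical. Categories are sorted, and any label that was not seen during fitting (or is missing, as for test tweets) becomes an all-zero row.

```python
def fit_categories(labels):
    """Return the sorted list of distinct labels, skipping missing values."""
    return sorted({label for label in labels if label is not None})


def one_hot(labels, categories):
    """Encode each label as a 0/1 row; unseen or missing labels become all zeros."""
    return [[1.0 if label == cat else 0.0 for cat in categories] for label in labels]


categories = fit_categories(["joy", "anger", "joy"])  # hypothetical labels
encoded = one_hot(["anger", "joy", None], categories)

print(categories)  # ['anger', 'joy']
print(encoded)     # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```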




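Tokenize the text and add the resulting encodings (`input_ids`, `token_type_ids`, `attention_mask`) and the one-hot `label` to the data.

In [None]:
# Tokenizer function
def tokenize_function(data):
    # The tokenizer returns token IDs and masks, not embeddings
    encodings = tokenizer(data['text'])
    data.update(encodings)
    # Unlabeled (test) rows are encoded as all-zero vectors because handle_unknown='ignore'
    data['label'] = encoder.transform(np.reshape(data['emotion'], (-1, 1)))
    return data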

# Tokenize the dataset
# Store the tokenized dataset to disk to save time
if not os.path.exists('tokenized_dataset'):
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    tokenized_dataset.save_to_disk('tokenized_dataset')
else:
    tokenized_dataset = datasets.load_from_disk('tokenized_dataset')

tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'emotion', 'input_ids', 'token_type_ids', 'attention_mask', 'label'],
        num_rows: 1455563
    })
    test: Dataset({
        features: ['text', 'emotion', 'input_ids', 'token_type_ids', 'attention_mask', 'label'],
        num_rows: 411972
    })
})

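Set up the classification model, the evaluation metrics, and the `Trainer` with the Hugging Face Transformers library. Because the labels are one-hot float vectors, the model is expected to treat the task as multi-label classification and train with a binary cross-entropy loss, which is why `compute_metrics` takes the arg-max of both the predictions and the labels.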

In [None]:
from transformers import DataCollatorWithPadding, AutoModelForSequenceClassification
import evaluate

# data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# use the BERT model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(train_data['emotion'].unique()))
# training data is split into training and validation sets
split_dataset = tokenized_dataset['train'].train_test_split(test_size=0.2)

# function to evaluate the performance
# evaluate.load() is called outside the function to avoid reloading the metrics on every evaluation
accuracy_metric = evaluate.load("accuracy")
# the averaging mode is passed to compute() below, so it is not needed here
f1_metric = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    labels = np.argmax(labels, axis=1)

    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1_score = f1_metric.compute(predictions=predictions, references=labels, average="macro")

    return {**accuracy, **f1_score}

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,  
    per_device_train_batch_size=32,  
    per_device_eval_batch_size=32,
    eval_strategy='steps',
    eval_steps=100000,
    save_total_limit=5,  # Only last 5 models are saved. Older ones are deleted.
    save_steps=100000,
    load_best_model_at_end=True,
    fp16=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset['train'],
    eval_dataset=split_dataset['test'],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
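
For reference, the two metrics reported by `compute_metrics` can be reproduced by hand. The sketch below is a standard-library illustration with hypothetical logits and labels, not the `evaluate` implementation: it takes the arg-max of each row, then computes accuracy and the unweighted mean of the per-class F1 scores (macro F1) over the classes that occur in the predictions or references.

```python
def argmax(row):
    """Index of the largest value in a list (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_macro_f1(logits, one_hot_labels):
    preds = [argmax(r) for r in logits]
    refs = [argmax(r) for r in one_hot_labels]
    accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)

    f1_scores = []
    for c in range(len(logits[0])):
        tp = sum(p == c and r == c for p, r in zip(preds, refs))
        fp = sum(p == c and r != c for p, r in zip(preds, refs))
        fn = sum(p != c and r == c for p, r in zip(preds, refs))
        denom = 2 * tp + fp + fn
        if denom:  # skip classes absent from both predictions and references
            f1_scores.append(2 * tp / denom)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Hypothetical batch: 4 samples, 3 classes.
logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.0], [1.2, 0.4, 0.1], [-0.5, 0.2, 0.9]]
labels = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
acc, macro_f1 = accuracy_and_macro_f1(logits, labels)
print(acc, round(macro_f1, 4))  # 0.75 0.7778
```

Macro averaging weights every emotion equally, so rare classes pull the score down more than they would under accuracy, which is consistent with the F1 in the evaluation log being lower than the accuracy.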


In [8]:
# Train the model
trainer.train()

  0%|          | 0/109170 [00:00<?, ?it/s]

{'loss': 0.3037, 'grad_norm': 0.8294836282730103, 'learning_rate': 4.977099935879821e-05, 'epoch': 0.01}
{'loss': 0.2598, 'grad_norm': 0.7039676904678345, 'learning_rate': 4.954199871759641e-05, 'epoch': 0.03}
{'loss': 0.2452, 'grad_norm': 0.7792961001396179, 'learning_rate': 4.9312998076394615e-05, 'epoch': 0.04}
{'loss': 0.2368, 'grad_norm': 1.6333123445510864, 'learning_rate': 4.908399743519282e-05, 'epoch': 0.05}
{'loss': 0.232, 'grad_norm': 0.9491744041442871, 'learning_rate': 4.885499679399103e-05, 'epoch': 0.07}
{'loss': 0.2269, 'grad_norm': 0.6999340057373047, 'learning_rate': 4.8625996152789234e-05, 'epoch': 0.08}
{'loss': 0.2263, 'grad_norm': 1.2519547939300537, 'learning_rate': 4.839699551158743e-05, 'epoch': 0.1}
{'loss': 0.2205, 'grad_norm': 0.557869553565979, 'learning_rate': 4.816799487038564e-05, 'epoch': 0.11}
{'loss': 0.2192, 'grad_norm': 1.549872636795044, 'learning_rate': 4.7938994229183846e-05, 'epoch': 0.12}
{'loss': 0.2214, 'grad_norm': 0.7243902087211609, 'learn

  0%|          | 0/9098 [00:00<?, ?it/s]

{'eval_loss': 0.1928524374961853, 'eval_accuracy': 0.6690529107253884, 'eval_f1': 0.5946160881877214, 'eval_runtime': 130.4487, 'eval_samples_per_second': 2231.629, 'eval_steps_per_second': 69.744, 'epoch': 2.75}
{'loss': 0.1429, 'grad_norm': 1.399204969406128, 'learning_rate': 3.987359164605661e-06, 'epoch': 2.76}
{'loss': 0.1434, 'grad_norm': 1.2433348894119263, 'learning_rate': 3.7583585234038656e-06, 'epoch': 2.78}
{'loss': 0.1464, 'grad_norm': 1.63530433177948, 'learning_rate': 3.5293578822020702e-06, 'epoch': 2.79}
{'loss': 0.1419, 'grad_norm': 2.370330333709717, 'learning_rate': 3.300357241000275e-06, 'epoch': 2.8}
{'loss': 0.141, 'grad_norm': 1.1799120903015137, 'learning_rate': 3.0718146010808835e-06, 'epoch': 2.82}
{'loss': 0.1441, 'grad_norm': 1.385223388671875, 'learning_rate': 2.8428139598790877e-06, 'epoch': 2.83}
{'loss': 0.1419, 'grad_norm': 1.6509250402450562, 'learning_rate': 2.6138133186772927e-06, 'epoch': 2.84}
{'loss': 0.1465, 'grad_norm': 1.9401695728302002, 'lea

TrainOutput(global_step=109170, training_loss=0.17593678468049223, metrics={'train_runtime': 6568.3072, 'train_samples_per_second': 531.849, 'train_steps_per_second': 16.621, 'total_flos': 9.190203055743763e+16, 'train_loss': 0.17593678468049223, 'epoch': 3.0})
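
The step counts in the training log can be checked with a little arithmetic. This is a sketch under stated assumptions: the 20% validation split is rounded up, training runs on a single GPU without gradient accumulation, and the `TrainingArguments` defaults apply (learning rate 5e-5 with linear decay and no warmup, logging every 500 steps).

```python
import math

n_labeled = 1_455_563  # rows in the training set (printed above)
batch_size = 32        # per_device_train_batch_size, single GPU
epochs = 3

# Assumption: the 20% validation split is rounded up, the rest is used for training.
n_eval = math.ceil(0.2 * n_labeled)
n_train = n_labeled - n_eval

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(n_eval / batch_size)

# Assumption: default learning rate 5e-5 with linear decay to zero and no warmup.
lr_at_step_500 = 5e-5 * (total_steps - 500) / total_steps

print(total_steps, eval_batches)             # 109170 9098
print(round(100_000 / steps_per_epoch, 2))   # 2.75
print(lr_at_step_500)
```

The totals match the two progress bars, and the single evaluation at step 100,000 lands at epoch 2.75. With `eval_steps=100000` and `save_steps=100000`, only one checkpoint is evaluated during the run, so `load_best_model_at_end=True` has a single candidate to choose from.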

In [None]:
# Make predictions on the test dataset
predictions = trainer.predict(tokenized_dataset['test'])

Generate the CSV file with the predicted emotions for each tweet.

In [None]:
# Get the predicted class index for each tweet
predicted_labels = np.argmax(predictions.predictions, axis=1)

# Inspect the raw logits
print(predictions.predictions)

# Map the class indices back to the original emotion labels
decoded_labels = encoder.categories_[0][predicted_labels]

# Create a DataFrame with the tweet_id and predicted emotion
result_df = pd.DataFrame({
    'id': data[data['identification'] == 'test']['tweet_id'].values,
    'emotion': decoded_labels
})

# Save the result as a CSV file
result_df.to_csv('predicted_emotions.csv', index=False)

[[-8.375       0.09777832 -8.59375    ... -6.8867188  -7.1757812
  -1.0996094 ]
 [-9.859375   -1.8525391  -9.9140625  ... -8.4609375  -8.71875
   1.7412109 ]
 [-4.7109375  -1.0791016  -3.2480469  ... -2.2167969  -3.9765625
  -1.71875   ]
 ...
 [-3.8925781  -1.8144531  -3.1289062  ...  0.31689453 -4.0429688
  -3.5605469 ]
 [-5.2382812  -4.7148438  -5.296875   ... -4.796875   -5.2382812
  -1.0800781 ]
 [-1.1923828  -4.3476562  -3.6894531  ...  0.87890625 -4.1953125
  -5.8203125 ]]
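
Decoding the logits amounts to taking the arg-max of each row and looking up the label stored at that index. A minimal standard-library sketch, with hypothetical label names and rounded, made-up logits (the real category order comes from the fitted encoder):

```python
# Hypothetical labels, sorted the same way encoder.categories_[0] would be.
categories = ["anger", "joy", "sadness"]

# Hypothetical logits: one row per tweet, one column per category.
logits = [[-8.4, 0.1, -1.1], [-4.7, -1.1, -3.2], [0.9, -4.2, -5.8]]

# Index of the largest logit in each row
predicted_indices = [max(range(len(row)), key=row.__getitem__) for row in logits]

# Look up the label for each predicted index
decoded = [categories[i] for i in predicted_indices]
print(decoded)  # ['joy', 'joy', 'anger']
```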
