### Student Information
Name: 彭星樺

Student ID: 113065507

GitHub ID: ktpss97094

Kaggle name: ktpss97094

Kaggle private scoreboard snapshot:

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home** exercises in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook.


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) on Emotion Recognition on Twitter through this link: https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework. Your score is determined by your place in the Private Leaderboard ranking:
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (60-x)/6 + 20 points, where x is your ranking in the leaderboard (i.e., if you rank 3rd, your score will be (60-3)/6 + 20 = 29.5% out of 30%).
    Submit your last submission **BEFORE the deadline (Nov. 26th 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained.


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th 11:59 pm, Tuesday)__.

## Config

In [6]:
!pip install transformers[torch] datasets accelerate -U



In [7]:
### Begin Assignment Here
import os
from enum import Enum
import copy
import torch
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer
import string
from nltk.stem.porter import PorterStemmer
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, pipeline
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import re

class EXEC_ENV_ENUM(Enum):
    COLAB = 1
    KAGGLE = 2
    LOCAL = 3
EXEC_ENV = EXEC_ENV_ENUM.COLAB
# EXEC_ENV = EXEC_ENV_ENUM.KAGGLE
# EXEC_ENV = EXEC_ENV_ENUM.LOCAL
input_path = ""
if EXEC_ENV == EXEC_ENV_ENUM.COLAB:
    from google.colab import drive
    drive.mount("/content/gdrive")
    input_path = "/content/gdrive/MyDrive/DMLab2/DM2024-Lab2-Homework"
elif EXEC_ENV == EXEC_ENV_ENUM.KAGGLE:
    input_path = "/kaggle/input/dm-2024-isa-5810-lab-2-homework"
    os.chdir("/kaggle/working")
else:
    input_path = "."

SEED = 42
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "sadness", "surprise", "trust", "joy"]
NUM_EMOTIONS = len(EMOTIONS)
ID_TO_EMOTIONS = {str(index): emotion for index, emotion in enumerate(EMOTIONS)}
EMOTIONS_TO_ID = {emotion: index for index, emotion in enumerate(EMOTIONS)}

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [8]:
# Build these once; recreating them for every tweet is slow on a large corpus
STOP_WORDS = set(stopwords.words('english'))
TOKTOK = ToktokTokenizer()
STEMMER = PorterStemmer()


def remove_stopwords(text: str) -> str:
    '''
    E.g.,
        text: 'This is a dog'
        preprocessed_text: 'dog'
    '''
    tokens = [token.strip() for token in TOKTOK.tokenize(text)]
    filtered_tokens = [token for token in tokens if token.lower() not in STOP_WORDS]
    preprocessed_text = ' '.join(filtered_tokens)

    return preprocessed_text


def preprocessing_function(text: str) -> str:
    preprocessed_text = remove_stopwords(text)

    # Convert to lowercase
    preprocessed_text = preprocessed_text.lower()

    # Remove the unused <LH> placeholder
    preprocessed_text = preprocessed_text.replace('<LH>', ' ').replace('<lh>', ' ')

    # Remove @mentions
    preprocessed_text = re.sub(r"@\w+\s*", "", preprocessed_text)

    # Remove punctuation
    preprocessed_text = "".join([char for char in preprocessed_text if char not in string.punctuation])

    # Stemming
    tokens = TOKTOK.tokenize(preprocessed_text)
    stemming_tokens = [STEMMER.stem(token) for token in tokens]
    preprocessed_text = ' '.join(stemming_tokens)

    return preprocessed_text

def tokenize(batch, tokenizer):
    return tokenizer(batch["text"], padding=True, truncation=True)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

def draw_plot(log_history):
    training_loss = []
    validation_loss = []
    for log in log_history:
        if "loss" in log:
            training_loss.append(log["loss"])
        if "eval_loss" in log:
            validation_loss.append(log["eval_loss"])

    plt.figure(figsize=(10, 6))
    plt.plot(training_loss, label="Training Loss")
    plt.plot(validation_loss, label="Validation Loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.title("Training and Validation Loss")
    plt.legend()
    plt.grid(True)
    plt.show()
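
To make the cleaning steps in `preprocessing_function` concrete, here is a minimal standard-library sketch of the lowercase, `<LH>`, mention, and punctuation steps only. It is not the function above: stopword removal and Porter stemming are left out because they need NLTK data, `clean_surface` is a hypothetical name, and the final whitespace collapse is a stand-in for the tokenize-and-join step.

```python
import re
import string


def clean_surface(text: str) -> str:
    """Hypothetical helper: the regex/punctuation steps only, without NLTK."""
    text = text.lower()
    # The <LH> placeholder becomes <lh> after lowercasing
    text = text.replace("<lh>", " ")
    # Drop @mentions together with any whitespace that follows them
    text = re.sub(r"@\w+\s*", "", text)
    # Strip ASCII punctuation; this also removes the '#' of hashtags
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Collapse runs of whitespace (stand-in for tokenize + ' '.join)
    return " ".join(text.split())


print(clean_surface("@user I LOVE this!! <LH> #happy"))  # i love this happy
```

Emoji pass through unchanged because they are not in `string.punctuation`, which matches the emoji that remain in the preprocessed text column.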

## Preprocessing

In [9]:
preprocessing_train_path = os.path.join(input_path, "preprocessing_train.parquet")
preprocessing_test_path = os.path.join(input_path, "preprocessing_test.parquet")

model_ckpt = "distilbert-base-uncased"
distilbert_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

if os.path.exists(preprocessing_train_path) and os.path.exists(preprocessing_test_path):
    train = pd.read_parquet(preprocessing_train_path)
    test = pd.read_parquet(preprocessing_test_path)
else:
    df = pd.read_json(os.path.join(input_path, "tweets_DM.json"), lines=True)

    # Extract the fields nested under _source
    df["hashtags"] = df["_source"].apply(lambda x: x["tweet"]["hashtags"])
    df["tweet_id"] = df["_source"].apply(lambda x: x["tweet"]["tweet_id"])
    df["text"] = df["_source"].apply(lambda x: x["tweet"]["text"])

    # Move tweet_id to the first column
    cols = list(df.columns)
    cols.insert(0, cols.pop(cols.index('tweet_id')))
    df = df[cols]

    # Drop unused columns
    df = df.drop(columns=["_index", "_crawldate", "_type", "_source"], errors="ignore")

    # Attach the emotion labels
    df = df.merge(pd.read_csv(os.path.join(input_path, "emotion.csv")), on="tweet_id", how="left")

    # Separate the train set and the test set
    # (.copy() so the column assignments below do not write to a slice of df)
    df = df.merge(pd.read_csv(os.path.join(input_path, "data_identification.csv")), on="tweet_id", how="left")
    train = df[df["identification"] == "train"].copy()
    test = df[df["identification"] == "test"].copy()
    train = train.drop(columns=["identification"], errors="ignore")
    test = test.drop(columns=["identification"], errors="ignore")

    # Map emotion labels to integer ids
    train['label'] = train['emotion'].map(EMOTIONS_TO_ID)
    test['label'] = test['emotion'].map(EMOTIONS_TO_ID)

    # Preprocess the text
    train.loc[:, "text"] = train.loc[:, "text"].apply(preprocessing_function)
    test.loc[:, "text"] = test.loc[:, "text"].apply(preprocessing_function)

    # Encode the text with the DistilBERT tokenizer
    train_encoded = train.apply(tokenize, args=(distilbert_tokenizer,), axis=1)
    train['input_ids'] = train_encoded.apply(lambda x: x["input_ids"])
    train['attention_mask'] = train_encoded.apply(lambda x: x["attention_mask"])
    test_encoded = test.apply(tokenize, args=(distilbert_tokenizer,), axis=1)
    test['input_ids'] = test_encoded.apply(lambda x: x["input_ids"])
    test['attention_mask'] = test_encoded.apply(lambda x: x["attention_mask"])

    # Save
    train.to_parquet(preprocessing_train_path)
    test.to_parquet(preprocessing_test_path)

display(train)
display(test)

Unnamed: 0,tweet_id,_score,hashtags,text,emotion,label,input_ids,attention_mask
0,0x376b20,391,[Snapchat],peopl post add snapchat must dehydr cuz man,anticipation,1,"[101, 21877, 7361, 2140, 2695, 5587, 10245, 75...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,0x2d5350,433,"[freepress, TrumpLegacy, CNN]",see trump danger freepress around world trumpl...,sadness,4,"[101, 2156, 8398, 5473, 2489, 20110, 2105, 208...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
3,0x1cd5b0,376,[],issa stalk tasha 😂😂😂,fear,3,"[101, 26354, 2050, 23899, 25448, 100, 102]","[1, 1, 1, 1, 1, 1, 1]"
5,0x1d755c,120,"[authentic, LaughOutLoud]",thx best time tonight stori heartbreakingli au...,joy,7,"[101, 16215, 2595, 2190, 2051, 3892, 2358, 100...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
6,0x2c91a8,1021,[],still wait suppli liscu,anticipation,1,"[101, 2145, 3524, 10514, 9397, 3669, 5622, 288...","[1, 1, 1, 1, 1, 1, 1, 1, 1]"
...,...,...,...,...,...,...,...,...
1867526,0x321566,94,"[NoWonder, Happy]",happi nowond name show happi 👏👏👏👏👏,joy,7,"[101, 5292, 9397, 2072, 2085, 15422, 2171, 226...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
1867527,0x38959e,627,[],everi circumt like thank almighti jesu christ,joy,7,"[101, 2412, 2072, 25022, 11890, 2819, 2102, 20...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1867528,0x2cbca6,274,[blessyou],current two girl walk around librari hand red ...,joy,7,"[101, 2783, 2048, 2611, 3328, 2105, 5622, 1002...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
1867533,0x24faed,840,[],ah corpor life date use rel anachron last job ...,joy,7,"[101, 6289, 13058, 2953, 2166, 3058, 2224, 212...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


Unnamed: 0,tweet_id,_score,hashtags,text,emotion,label,input_ids,attention_mask
2,0x28b412,232,[bibleverse],confid obedi write know even ask philemon 1 21...,,,"[101, 9530, 8873, 2094, 15578, 4305, 4339, 211...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,0x2de201,989,[],trust faith friend someon trust put faith anyo...,,,"[101, 3404, 4752, 2767, 2070, 2239, 3404, 2404...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
9,0x218443,66,"[materialism, money, possessions]",enough satisfi goal realli money materi money ...,,,"[101, 2438, 2938, 2483, 8873, 3125, 2613, 3669...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
30,0x2939d5,104,"[GodsPlan, GodsWork]",god woke chase day godsplan godswork,,,"[101, 2643, 8271, 5252, 2154, 5932, 24759, 231...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"
33,0x26289a,310,[],tough time turn symbol hope,,,"[101, 7823, 2051, 2735, 6454, 3246, 102]","[1, 1, 1, 1, 1, 1, 1]"
...,...,...,...,...,...,...,...,...
1867525,0x2913b4,602,[],messag ye heard begin love one anoth john 3 11...,,,"[101, 6752, 8490, 6300, 2657, 4088, 2293, 2028...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1867529,0x2a980e,598,[],lad hath five barley loav two small fish among...,,,"[101, 14804, 6045, 2232, 2274, 21569, 8840, 11...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1867530,0x316b80,827,"[mixedfeeling, butimTHATperson]",buy last 2 ticket remain show sell mixedfeel b...,,,"[101, 4965, 2197, 1016, 7281, 3961, 2265, 5271...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1867531,0x29d0cb,368,[],swear hard work gone pay one day😈💰💸,,,"[101, 8415, 2524, 2147, 2908, 3477, 2028, 100,...","[1, 1, 1, 1, 1, 1, 1, 1, 1]"
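
Because each row is tokenized on its own, `padding=True` has nothing to pad against, and the stored `input_ids` keep different lengths, as the tables above show. The sequences therefore have to be padded later, one batch at a time. The sketch below illustrates that idea in plain Python. `pad_batch` is a hypothetical helper, not the transformers collator, and the pad id of 0 is an assumption based on the `[PAD]` id in the BERT uncased vocabulary.

```python
from typing import Dict, List


def pad_batch(batch_ids: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in the batch (hypothetical helper)."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 26354, 2050, 23899, 25448, 100, 102],  # a 7-token row from the train table
    [101, 7823, 2051, 102],                      # made-up shorter row for illustration
])
print(batch["input_ids"][1])       # [101, 7823, 2051, 102, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 0, 0, 0]
```

Padding per batch rather than to one global length keeps short tweets from being padded up to the longest tweet in the whole dataset.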


In [16]:
print(type(train["label"][0]))

<class 'numpy.int64'>


## Fine-Tune

In [12]:
from datasets import Dataset

model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=NUM_EMOTIONS, id2label=ID_TO_EMOTIONS, label2id=EMOTIONS_TO_ID)
         .to(device))

# Keep four columns for training
train = train[['text', 'label', 'input_ids', 'attention_mask']]
# train = train[['input_ids', 'attention_mask', 'label']]

# Hold out a validation set from the train set
train, validation = train_test_split(train, test_size=0.05, random_state=SEED)

# Trainer indexes its datasets by integer position. Passing the DataFrames
# directly raised `KeyError: 737286`, because df[i] is a column lookup,
# so wrap them in datasets.Dataset first.
train_dataset = Dataset.from_pandas(train, preserve_index=False)
validation_dataset = Dataset.from_pandas(validation, preserve_index=False)

batch_size = 64
# Log the training loss about once per epoch (computed after the split)
logging_steps = len(train_dataset) // batch_size
model_name = f"{model_ckpt}-finetuned-emotion"
training_args = TrainingArguments(output_dir=os.path.join(input_path, model_name),
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  log_level="error")

trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=train_dataset,
                  eval_dataset=validation_dataset,
                  tokenizer=distilbert_tokenizer)
trainer.train()

draw_plot(trainer.state.log_history)

preds_output = trainer.predict(validation_dataset)
preds_output.metrics
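
The `f1` value returned by `compute_metrics` uses `average="weighted"`: F1 is computed per class, then averaged with each class weighted by its number of true samples. Below is a small standard-library sketch of that calculation on made-up labels. `weighted_f1` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Support-weighted mean of per-class F1 (hypothetical helper)."""
    support = Counter(labels)  # number of true samples per class
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        # F1 is defined as 0 when the class has no true positives
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(labels)


toy_labels = [0, 0, 0, 1, 1, 7]  # made-up emotion ids
toy_preds = [0, 0, 1, 1, 1, 0]
print(round(weighted_f1(toy_labels, toy_preds), 3))  # 0.6
```

In this toy case the class that is never predicted correctly (id 7) contributes an F1 of 0, but only with weight 1/6, so rare classes move the weighted score less than frequent ones.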

In [None]:
# sample = pd.read_csv(os.path.join(input_path, "sample_submission.csv"))
# submission = pd.DataFrame({"id": sample["id"], "emotion": ???})
# submission.to_csv('submission.csv', index=False)
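
The commented-out cell above leaves the predicted emotions unfilled. One way to produce them, sketched here with toy numbers, is to take the argmax of each row of logits and map the index back through `EMOTIONS`. The logits, the ids `tweet_a` and `tweet_b`, and the `logits_to_emotions` helper are all assumptions for illustration. Only the `id` and `emotion` column names come from the commented code.

```python
import csv
import io
from typing import List, Sequence

EMOTIONS = ["anger", "anticipation", "disgust", "fear", "sadness", "surprise", "trust", "joy"]


def logits_to_emotions(logits: Sequence[Sequence[float]]) -> List[str]:
    """Map each row of class scores to the emotion with the highest score."""
    return [EMOTIONS[max(range(len(row)), key=row.__getitem__)] for row in logits]


toy_ids = ["tweet_a", "tweet_b"]  # placeholder ids, not real tweets
toy_logits = [
    [0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 2.5, 0.4],  # highest score at index 6
    [0.2, 1.9, 0.1, 0.0, 0.3, 0.1, 0.5, 0.6],  # highest score at index 1
]
emotions = logits_to_emotions(toy_logits)

# Write the two-column submission layout to an in-memory buffer
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "emotion"])
writer.writerows(zip(toy_ids, emotions))
print(buffer.getvalue())
```

With a trained model, the rows of logits would presumably be the `predictions` field of a `trainer.predict` call on the test set, in the same order as the ids.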

# Report