### Student Information
Name: Yeh Po-Yu

Student ID: 11020140

GitHub ID: oldper

Kaggle name: oldpercent

Kaggle private scoreboard snapshot: ![](./score_bar.png)

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home exercises** in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) regarding Emotion Recognition on Twitter by this link: https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants, and x is your rank. (ie. If there are 100 participants and you rank 3rd your score will be (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67% out of 30%.)   
    Submit your last submission **BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the **img** folder of this repository and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)__. 
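
The Kaggle scoring rule above can be checked with a few lines of Python. This is only a sketch: the function name `kaggle_points` is made up for illustration, and it assumes the formula applies to every rank up to 0.6N while all lower ranks receive the flat 20 points.

```python
def kaggle_points(n_participants, rank):
    """Points (out of 30) for the Kaggle part, following the stated formula."""
    cutoff = 0.6 * n_participants  # assumed boundary between the two tiers
    if rank > cutoff:
        # Lower tier: flat 20 points.
        return 20.0
    # Upper tier: (0.6N + 1 - x) / (0.6N) * 10 + 20
    return (cutoff + 1 - rank) / cutoff * 10 + 20


# Worked example from the instructions: 100 participants, rank 3.
print(round(kaggle_points(100, 3), 2))  # 29.67
```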

### Homework
1. Sorry, I could not move all of the code into this notebook. The notebook with the take-home exercises is [here](https://github.com/oldper/DM2024-Lab2-Master/blob/main/DM2024-Lab2-Master.ipynb).
2. For the second part, see the Kaggle scoreboard snapshot above.
3. Most of the code for the third part is missing because I failed to save the Colab notebook. Only the preprocessing and transformer parts are left, so I describe the rest in text.

#### Preprocessing
In this part, I use NLTK as my preprocessing tool. I tokenize each tweet, drop punctuation and other non-alphabetic tokens, convert everything to lowercase, remove stopwords (e.g. is, the), and lemmatize the remaining words (apples -> apple).\
The code is shown below.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
from concurrent.futures import ProcessPoolExecutor
import os
import gc

nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("wordnet")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
    # Tokenize
    tokens = word_tokenize(text)
    # Remove punctuation and lowercase
    tokens = [word.lower() for word in tokens if word.isalpha()]
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(tokens)

def preprocess_tweets_parallel(texts, preprocess_fn, num_workers=os.cpu_count()):
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        results = list(tqdm(executor.map(preprocess_fn, texts),
                            total=len(texts),
                            desc="Cleaning Tweets"))
    return results

# train_df is assumed to be loaded earlier, with 'text' and 'emotion' columns
cleaned_tweets = preprocess_tweets_parallel(train_df['text'], preprocess_text)
train_df['cleaned_text'] = cleaned_tweets
del cleaned_tweets
# Map each emotion name to an integer id (and back) for model training
emotions = train_df["emotion"].unique()
label_to_id = {label: idx for idx, label in enumerate(emotions)}
id_to_label = {idx: label for label, idx in label_to_id.items()}
train_df["label"] = train_df["emotion"].map(label_to_id)
gc.collect()
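
The `label_to_id` / `id_to_label` pair above works as a round trip: the model is trained on integer ids, and predictions are mapped back to emotion names before submission. A minimal sketch of that decoding step follows. The emotion names and predicted ids here are made up for illustration and are not taken from the real dataset.

```python
# Hypothetical emotion names, used only to illustrate the round trip.
emotions = ["joy", "anger", "sadness"]
label_to_id = {label: idx for idx, label in enumerate(emotions)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

# Made-up integer predictions, standing in for real model output.
predicted_ids = [2, 0, 0, 1]

# Decode the ids back into emotion names.
predicted_emotions = [id_to_label[i] for i in predicted_ids]
print(predicted_emotions)  # ['sadness', 'joy', 'joy', 'anger']
```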

#### Transformer
I originally wanted to use a pretrained language model for the classification, but the strongest computational resource I have is Colab Pro, so I did not train it. One epoch would have taken about 20 hours, even without any further tuning, and my wallet would be empty. :(

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

dataset = Dataset.from_pandas(train_df)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(df):
    return tokenizer(df["cleaned_text"], padding="max_length", truncation=True)

dataset = dataset.map(tokenize_function, batched=True, num_proc=os.cpu_count())

train_test = dataset.train_test_split(test_size=0.2)
train_dataset = train_test["train"]
val_dataset = train_test["test"]
# One output per emotion class (8 in this dataset)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(label_to_id))

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()
results = trainer.evaluate()
print(results)

### My approach with limited computational resources
I tried two feature creation techniques, TF-IDF and Word2Vec, and four machine learning models: XGBoost, LightGBM, CatBoost, and random forest. XGBoost usually performs best, but on this dataset LightGBM performed best. Word2Vec features also performed better than TF-IDF.\
To sum up, the feature creation techniques I tried:
1. TF-IDF
2. Word2Vec (performs best)
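
The original code for this step was lost, so the following is only a minimal pure-Python sketch of what the two feature types compute, not the implementation used for the competition. It assumes plain `log(N / df)` for the IDF term (libraries typically use a smoothed variant) and assumes that Word2Vec output is turned into one fixed-length vector per tweet by averaging word vectors. The names `tfidf_features`, `mean_word_vector`, and `toy_embeddings` are hypothetical.

```python
import math
from collections import Counter


def tfidf_features(docs):
    """Return (vocabulary, rows); each row holds one TF-IDF weight per term."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Plain (unsmoothed) inverse document frequency.
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid dividing by zero for empty tweets
        rows.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, rows


def mean_word_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


docs = ["happy day", "sad day"]
vocab, rows = tfidf_features(docs)

# Made-up 2-dimensional vectors standing in for a trained Word2Vec model.
toy_embeddings = {"happy": [1.0, 0.0], "day": [0.0, 1.0]}
doc_vector = mean_word_vector("happy day".split(), toy_embeddings, dim=2)
```

In this toy example, "day" appears in every document, so its TF-IDF weight is zero, while the averaged vector gives every tweet the same dimensionality regardless of its length. Either kind of row can then be fed to a tree-based classifier.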

Machine learning models I tried:
1. Random forest
2. XGBoost
3. LightGBM (performs best)
4. CatBoost