### Student Information
Name: 吳征彥

Student ID: 113062636

GitHub ID: https://github.com/leoson-wu

Kaggle name: Leoson Wu

Kaggle private scoreboard snapshot:  
![Leaderboard](leaderboard.jpg)

----

1. First: __This part is worth 30% of your grade.__ Do the **take home exercises** in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) regarding Emotion Recognition on Twitter by this link: https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants, and x is your rank. (i.e., if there are 100 participants and you rank 3rd, your score will be (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67% out of 30%.)   
    Submit your last submission **BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the **img** folder of this repository and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (You can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)__. 
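
The ranking formula in item 2 can be checked with a few lines of Python. This is only a sketch: the function name `kaggle_section_score` is hypothetical, and it assumes one reading of the rule, namely that the formula applies while the rank x is at most 0.6N and that every lower rank receives the flat 20 points.

```python
def kaggle_section_score(rank: int, participants: int) -> float:
    """Points (out of 30) for the Kaggle section, under the assumed reading."""
    cutoff = 0.6 * participants  # assumed boundary between the formula and the flat score
    if rank > cutoff:
        return 20.0
    return (cutoff + 1 - rank) / cutoff * 10 + 20


# Worked example from the text: 100 participants, rank 3
print(round(kaggle_section_score(3, 100), 2))
```

With 100 participants and rank 3 this reproduces the 29.67 quoted in the example above.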

----

### Kaggle notebook

```python
!pip install transformers datasets torch
```
---

```python
# import package
import numpy as np # linear algebra
import pandas as pd # data processing
import torch
from datasets import Dataset
import os
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
```
---

```python
# Load the tweets_df pickle file
tweets_df = pd.read_pickle("/kaggle/input/df-pickel/raw.pkl")
# Load the CSV files into DataFrames
data_path = '/kaggle/input/dm-2024-isa-5810-lab-2-homework'
submission_df = pd.read_csv(os.path.join(data_path, 'sampleSubmission.csv'))
data_identification_df = pd.read_csv(os.path.join(data_path, 'data_identification.csv'))
emotion_df = pd.read_csv(os.path.join(data_path, 'emotion.csv'))
```
---

```python
# Look into the distribution of _score
emotion_scores_distribution = tweets_df.groupby('emotion')['_score'].describe()
# Display the result
print(emotion_scores_distribution)
```
---

```python
# Preprocessing steps
import re
import pandas as pd

# Convert text to lowercase
tweets_df['text'] = tweets_df['text'].str.lower()

# Remove URLs from text
tweets_df['text'] = tweets_df['text'].str.replace(r'http\S+|www\S+', '', regex=True)

# Function to extract hashtags and clean text
def extract_hashtags(text):
    # Extract hashtags (without the '#' symbol)
    hashtags = re.findall(r'#(\w+)', text)
    # Remove '#' from hashtags in the text
    updated_text = re.sub(r'#(\w+)', r'\1', text)
    return updated_text, hashtags

# Apply the function and update the 'text' and 'hashtags' columns
tweets_df[['text', 'hashtags']] = tweets_df.apply(
    lambda row: pd.Series(extract_hashtags(row['text'])),
    axis=1
)

# Ensure 'hashtags' column contains empty lists instead of NaN
tweets_df['hashtags'] = tweets_df['hashtags'].apply(lambda x: x if isinstance(x, list) else [])

# Clean text further: remove the <lh> placeholder and every run of ! . , ? characters
tweets_df['text'] = tweets_df['text'].apply(lambda x: re.sub(r'<lh>', '', x).strip()) 
tweets_df['text'] = tweets_df['text'].apply(lambda x: re.sub(r'[!.,?]+', '', x)) 
```
---
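To see what the cleaning steps above do to a single tweet, the sketch below applies the same regular expressions, in the same order, to a plain string. The helper name `clean_tweet` and the sample tweet are made up for illustration; only the regexes are copied from the cell above.

```python
import re


def clean_tweet(text: str) -> tuple[str, list[str]]:
    """Apply the notebook's cleaning steps to one string (illustrative helper)."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    hashtags = re.findall(r'#(\w+)', text)       # collect hashtag words
    text = re.sub(r'#(\w+)', r'\1', text)        # keep the word, drop the '#'
    text = re.sub(r'<lh>', '', text).strip()     # drop the <lh> placeholder
    text = re.sub(r'[!.,?]+', '', text)          # drop ! . , ? characters
    return text, hashtags


# Hypothetical tweet, not taken from the dataset
cleaned, tags = clean_tweet("So HAPPY today!!! <LH> #blessed http://t.co/xyz")
print(repr(cleaned))
print(tags)
```

Running it shows one side effect worth knowing about: removing the placeholder leaves a double space where it used to be, because only the ends of the string are stripped.

---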

```python
# Training data selection (sampling)
pd.set_option('display.max_colwidth', None)
# .copy() so that later column assignments do not act on a slice of tweets_df
tweets_test = tweets_df[tweets_df['emotion'] == 'test'].copy()
tweets_train = tweets_df[tweets_df['emotion'] != 'test']
# Keep only the top 25% of rows by _score; the mask is built from tweets_train itself
tweets_train = tweets_train[tweets_train['_score'] > 769]
tweets_train = tweets_train.sample(frac=0.08, random_state=42)
```
---

```python
# Examine the sampled dataset
# Check that the sample still contains enough training data for each emotion
sample_emotion_distribution = tweets_train.groupby('emotion')['_score'].describe()
print(sample_emotion_distribution)
```
---

```python
# Create a mapping from emotion labels to integers
emotions = tweets_train['emotion'].unique()
label_map = {label: idx for idx, label in enumerate(emotions)}
# Apply mapping to the emotion column
tweets_train['label'] = tweets_train['emotion'].map(label_map)
tweets_train = tweets_train[['tweet_id', 'text', 'label']]
```
---
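Because `label_map` is built from `unique()`, the integer assigned to each emotion depends on the order in which the emotions first appear in the sample. The sketch below keeps an explicit inverse mapping for decoding predicted class ids; the three emotion names and the `id2label` name are illustrative assumptions, not values taken from the dataset.

```python
# Illustrative label list; in the notebook this comes from tweets_train['emotion'].unique()
emotions = ['joy', 'anger', 'sadness']

# Forward mapping used to create the training labels
label_map = {label: idx for idx, label in enumerate(emotions)}

# Inverse mapping used to turn predicted class ids back into emotion names
id2label = {idx: label for label, idx in label_map.items()}

predicted_ids = [2, 0, 1, 0]  # made-up model outputs
decoded = [id2label[i] for i in predicted_ids]
print(decoded)
```

Storing the inverse mapping next to the model makes decoding independent of whether the `emotions` array is still in memory.

---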

```python
# Use robertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)
    
# Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(tweets_train)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
```
---

```python
# prepare model
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=len(label_map))
```
---

```python
# Set the training arguments
# Note: recent transformers releases rename evaluation_strategy to eval_strategy
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="no",  # No evaluation during training: no labeled validation split was held out.
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)
```
---

```python
# start to train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer
)

trainer.train()
```
---

```python
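# Define a function to predict emotions from a list of input texts
def predict_emotions(texts):
    model.eval()  # disable dropout for inference
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = inputs.to(model.device)  # the Trainer may have moved the model to the GPU
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1).cpu().tolist()
    return [emotions[pred] for pred in predictions]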
```
---

```python
# Use the GPU when one is available
import torch
from tqdm import tqdm
# Check if CUDA (GPU) is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Ensure the model is on the correct device and in evaluation mode (dropout off)
model = model.to(device)
model.eval()

# batch the input data to avoid memory issue
batch_size = 16
predictions = []

# do the classification
for i in tqdm(range(0, len(tweets_test), batch_size)):
    batch_texts = tweets_test['text'][i:i + batch_size].tolist()
    
    # Tokenize and move to the correct device
    inputs = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt").to(device)
    
    # Get the predictions
    with torch.no_grad():  # No need to track gradients for inference
        batch_predictions = model(**inputs).logits.argmax(dim=-1).cpu().tolist()  # Move the result back to CPU if necessary
    
    predictions.extend(batch_predictions)
```
---
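The inference loop above steps through the test set in slices of `batch_size`; the last slice is simply shorter when the number of rows is not a multiple of the batch size. A minimal sketch of that slicing pattern on a plain list, with the `iter_batches` helper name chosen here for illustration:

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Five placeholder texts split into batches of two
texts = [f"tweet {n}" for n in range(5)]
batches = list(iter_batches(texts, batch_size=2))
print([len(b) for b in batches])
```

Every item lands in exactly one batch and the original order is preserved, which is what allows the predictions to be assigned back to `tweets_test` row by row.

---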

```python
# Store the predicted class ids on the test rows
tweets_test['emotion'] = predictions
```
---

```python
# Join the predicted labels onto the sample submission
# Drop the placeholder 'emotion' column first so the merge does not create emotion_x / emotion_y
submission_df = submission_df.drop(columns='emotion')
submission_df = submission_df.merge(tweets_test[['tweet_id', 'emotion']], left_on='id', 
    right_on='tweet_id', how='left')
# Map the predicted class ids back to emotion names
submission_df['emotion'] = submission_df['emotion'].apply(lambda x: emotions[x])
submission_df = submission_df[['id', 'emotion']]
submission_df.head()
```
---

```python
# save to the submission file
submission_df.to_csv('/kaggle/working/submission.csv', index=False, header=True)
```
---