### Student Information 
Name: 林靖淵<br>
Student ID: 113356040<br>
GitHub ID: https://github.com/jing-yuan-nccu<br>
Kaggle name: jingyaun_nccu<br>
Kaggle private scoreboard snapshot: <br>
***
### Instructions
1. First: This part is worth 30% of your grade. Do the take home exercises in the DM2024-Lab2-master Repo. You may need to copy some cells from the Lab notebook to this notebook.
2. Second: This part is worth 30% of your grade. Participate in the in-class Kaggle Competition regarding Emotion Recognition on Twitter by this link: https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework. The scoring will be given according to your place in the Private Leaderboard ranking:
Bottom 40%: Get 20% of the 30% available for this section.
Top 41% - 100%: Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants, and x is your rank. (e.g., if there are 100 participants and you rank 3rd, your score will be (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67% out of 30%.)
Submit your last submission BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday). Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the img folder of this repository and rerun the cell Student Information.
3. Third: This part is worth 30% of your grade. A report of your work developing the model for the competition (You can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model.
You can also mention different things you tried and insights you gained.
4. Fourth: This part is worth 10% of your grade. It's hard for us to follow if your code is messy :'(, so please tidy up your notebook.
Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.
Make sure to commit and save your changes to your repository BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)
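
As a quick check of the ranking formula in item 2, the arithmetic can be reproduced in a few lines. This is only a sketch: the function name `leaderboard_points` is hypothetical, and it assumes that "Bottom 40%" means a rank greater than 0.6N.

```python
def leaderboard_points(rank: int, n_participants: int) -> float:
    """Points (out of 30) for a given Private Leaderboard rank."""
    cutoff = 0.6 * n_participants
    if rank > cutoff:
        # Assumption: ranks beyond 0.6N fall in the bottom 40% and get the flat 20 points
        return 20.0
    return (cutoff + 1 - rank) / cutoff * 10 + 20

# The worked example from the instructions: 100 participants, rank 3
example = leaderboard_points(rank=3, n_participants=100)
print(round(example, 2))
```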

## Third

In the Kaggle competition, I tried several approaches. For text representation, I mainly used TF-IDF; I also tried Word2Vec, but it took too long to run. For model selection, I experimented with three machine learning models (Random Forest, Decision Tree, and XGBoost) and also tried a deep learning model.

**Import module**

In [None]:
import json
import re
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

***Data loading & processing***

In [None]:
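# The file is in JSON Lines format: one JSON object per line
data = []
with open('dm-2024-isa-5810-lab-2-homework/tweets_DM.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))
# No explicit f.close() is needed: the with block closes the file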

In [None]:
emotion = pd.read_csv('dm-2024-isa-5810-lab-2-homework/emotion.csv')
data_identification = pd.read_csv('dm-2024-isa-5810-lab-2-homework/data_identification.csv')

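**1. Data cleaning**

In [None]:
df = pd.DataFrame(data)
# The raw text is nested under _source -> tweet -> text, so pull it out first
df["text"] = df["_source"].apply(lambda s: s["tweet"]["text"])

def clean_text(text):
    # Keep only letters and whitespace (drops punctuation, digits, and symbols)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = text.lower().strip()  # Convert to lowercase and remove leading/trailing spaces
    return text

df["clean_text"] = df["text"].apply(clean_text)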
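
The cleaning step above trims only leading and trailing whitespace, so removing digits or symbols can leave double spaces inside a tweet. A minimal sketch of a variant that also collapses internal whitespace follows; the name `clean_text_collapsed` and the sample tweet are made up for illustration.

```python
import re

def clean_text_collapsed(text: str) -> str:
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace into one space
    return text.lower().strip()

sample = "I'm SO happy!!!  #joy 2024 <LH>"
cleaned = clean_text_collapsed(sample)
print(cleaned)
```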

**2. Tokenization**

In [None]:
def tokenize_text(text):
    return word_tokenize(text)

df["tokens"] = df["clean_text"].apply(tokenize_text)

**3. Lemmatization**

In [None]:
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

df["lemmatized_tokens"] = df["tokens"].apply(lemmatize_tokens)

In [None]:
_source = df['_source'].apply(lambda x: x['tweet'])
df = pd.DataFrame({
    'tweet_id': _source.apply(lambda x: x['tweet_id']),
    'hashtags': _source.apply(lambda x: x['hashtags']),
    'text': _source.apply(lambda x: x['text']),
})
df = df.merge(data_identification, on='tweet_id', how='left')

train_data = df[df['identification'] == 'train']
test_data = df[df['identification'] == 'test']

In [None]:
train_data = train_data.merge(emotion, on='tweet_id', how='left')
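
The joins above can be illustrated on a toy example. The sketch below uses made-up tweet IDs and texts; it shows the same left merge on `tweet_id`, the train/test split by `identification`, and a check that every training row found a label.

```python
import pandas as pd

tweets = pd.DataFrame({"tweet_id": ["a1", "a2", "a3"],
                       "text": ["so happy", "so sad", "who knows"]})
ident = pd.DataFrame({"tweet_id": ["a1", "a2", "a3"],
                      "identification": ["train", "train", "test"]})
labels = pd.DataFrame({"tweet_id": ["a1", "a2"], "emotion": ["joy", "sadness"]})

merged = tweets.merge(ident, on="tweet_id", how="left")
train = merged[merged["identification"] == "train"].merge(labels, on="tweet_id", how="left")
test = merged[merged["identification"] == "test"]

# Every training row should have found a label; a NaN here would signal a failed join
missing_labels = int(train["emotion"].isna().sum())
print(len(train), len(test), missing_labels)
```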

In [None]:
train_data.drop_duplicates(subset=['text'], keep=False, inplace=True)
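
Note that `keep=False` removes every copy of a repeated text rather than keeping one of them. The toy frame below (made-up rows) contrasts this with `keep="first"`; dropping all copies can be reasonable when the duplicates carry conflicting labels, as they do here.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["good day", "good day", "bad day"],
                    "emotion": ["joy", "anger", "sadness"]})

all_copies_removed = toy.drop_duplicates(subset=["text"], keep=False)
first_copy_kept = toy.drop_duplicates(subset=["text"], keep="first")
print(len(all_copies_removed), len(first_copy_kept))
```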

In [None]:
train_data.head()

***Prepare data for training model***<br>
Because the dataset is large, I sampled a fraction of it while searching for parameters. Even with only 20% of the data, training still took about an hour. Once I found a suitable parameter combination, I trained on the entire dataset. The `frac` argument in the next cell controls the sampled share (0.2 during tuning, 1 for the final run). This approach is not entirely rigorous, but it serves as a useful reference.

In [None]:
train_data_sample = train_data.sample(frac=1, random_state=42)

In [None]:
# Build features and labels from the sampled frame so that `frac` above takes effect
y_train_data = train_data_sample['emotion']
X_train_data = train_data_sample.drop(['tweet_id', 'emotion', 'identification', 'hashtags'], axis=1)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_train_data, y_train_data, test_size=0.2, random_state=42, stratify=y_train_data)
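
The `stratify` argument keeps the class proportions the same in both parts of the split, which matters when some emotions are rare. A small sketch on made-up, imbalanced labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"text": [f"tweet {i}" for i in range(100)],
                    "emotion": ["joy"] * 80 + ["anger"] * 20})

_, toy_val = train_test_split(toy, test_size=0.2, random_state=42, stratify=toy["emotion"])
val_counts = toy_val["emotion"].value_counts().to_dict()
print(val_counts)  # the 80/20 class ratio is preserved in the 20-row validation part
```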

**Feature engineering : TFIDF**

In [None]:
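tfidf = TfidfVectorizer(max_features=500)  # keep the 500 most frequent terms
X = tfidf.fit_transform(X_train['text']).toarray()  # fit the vocabulary on the training split only
X_test = tfidf.transform(X_test['text']).toarray()  # reuse that vocabulary; dense to match X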
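
To make the TF-IDF step concrete, here is a minimal sketch on a made-up corpus. It fits the vocabulary on the training texts only and then reuses it for unseen text, so both matrices have the same number of columns; each non-empty row is L2-normalised by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["i am so happy today", "this is so sad", "happy happy joy"]
toy_val = ["sad and unseen words"]

vec = TfidfVectorizer(max_features=5)      # cap the vocabulary at 5 terms
X_toy = vec.fit_transform(toy_train)       # learn vocabulary and IDF weights
X_val_toy = vec.transform(toy_val)         # words outside the vocabulary are ignored

row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
print(X_toy.shape, X_val_toy.shape)
```

Both results are sparse matrices; calling `.toarray()` on the full dataset trades memory for compatibility with models that need dense input.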

In [None]:
le = LabelEncoder()
y = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [None]:
label_mapping = dict(zip(le.classes_, range(len(le.classes_))))
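
`LabelEncoder` assigns integer codes in sorted label order, and `inverse_transform` maps them back. A short sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["joy", "anger", "joy", "sadness"]
toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_labels)

# Same idea as label_mapping above: label -> integer code
mapping = {label: int(code) for code, label in enumerate(toy_le.classes_)}
decoded = toy_le.inverse_transform(encoded)
print(mapping)
```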

***Training model & Accuracy***<br>
For the machine learning models, I tried XGBoost, Random Forest, and Decision Tree. XGBoost performed very poorly: its predictions covered only three emotion categories. Because all models received the same input, I concluded that this was not a data issue and decided not to use XGBoost. Between Decision Tree and Random Forest, Decision Tree trained faster but Random Forest produced better results, so I chose to fine-tune Random Forest.

In [None]:
from xgboost import XGBClassifier

# Initialize the XGBoost classifier
clf = XGBClassifier()
# Fit the classifier to your data
clf.fit(X, y)
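
One way to confirm the symptom described above (a model that only ever predicts a few categories) is to count how often each class index appears in the predictions. The helper name `predicted_class_counts` and the predictions below are made up, and the 8-class setting is an assumption for illustration.

```python
import numpy as np

def predicted_class_counts(y_pred, n_classes):
    """Count how often each class index appears in the predictions."""
    return np.bincount(np.asarray(y_pred, dtype=int), minlength=n_classes)

# Made-up predictions for an 8-class problem in which only three classes appear
fake_pred = [0, 3, 3, 5, 3, 0, 5, 3]
counts = predicted_class_counts(fake_pred, n_classes=8)
never_predicted = np.flatnonzero(counts == 0).tolist()
print(counts.tolist(), never_predicted)
```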

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X, y)

In [None]:
# Set parameters
rf_params = {
    'n_estimators': 100,
    'max_depth': 20,
    'min_samples_split': 5,
    'min_samples_leaf': 3
}

# Train the RandomForestClassifier with the parameters above
clf = RandomForestClassifier(**rf_params)
clf.fit(X, y)

***Try different combination of parameters***<br>
I tuned five Random Forest parameters with three candidate values each, giving 3^5 = 243 combinations. The combination of max_depth=20 and max_features="sqrt" performed best, although accuracy stayed around 0.44 for every combination. The main goal was to find parameter values that constrain tree growth and so reduce overfitting.
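
The size of such a search can be checked with `ParameterGrid`. The candidate values below are hypothetical, since the notebook does not record the exact grid that was searched; only the five parameter names come from the report.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical candidate values, three per parameter
full_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 3, 5],
    "max_features": ["sqrt", "log2", None],
}
n_combinations = len(ParameterGrid(full_grid))
n_fits = n_combinations * 3  # with 3-fold cross-validation
print(n_combinations, n_fits)
```

Each combination is fitted once per fold, which is why sampling the data during the search saves so much time.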

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from tqdm import tqdm

# Parameter grid
param_grid = {
    'n_estimators': [50],
    'max_depth': [20],
    'min_samples_split': [2],
    'min_samples_leaf': [1],
    'max_features': ['sqrt']
}

# Wrap the grid search process with tqdm
def tqdm_grid_search(cv, estimator, param_grid, X, y, scoring='accuracy'):
    param_list = list(ParameterGrid(param_grid))  # Generate all parameter combinations
    results = []
    for params in tqdm(param_list, desc="Grid Search Progress"):
        clf = estimator.set_params(**params)
        scores = []
        for train_idx, test_idx in cv.split(X, y):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]
            clf.fit(X_train, y_train)
            scores.append(clf.score(X_test, y_test))
        results.append((params, scores))
    return results

# Example usage
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Define cross-validation strategy
cv = StratifiedKFold(n_splits=3)

# Call tqdm_grid_search
results = tqdm_grid_search(
    cv=cv,
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    X=np.array(X),  # Convert to numpy array if needed
    y=np.array(y),  # Convert to numpy array if needed
)

# Print the mean cross-validation accuracy for every parameter combination
print("Parameters and Scores:")
for params, scores in results:
    print(f"Params: {params}, Mean Accuracy: {np.mean(scores):.4f}")


In [None]:
y_pred = clf.predict(X_test)
y_pred_train = clf.predict(X)

In [None]:
from sklearn.metrics import accuracy_score
print("Validation accuracy:", accuracy_score(y_test, y_pred))
print("Training accuracy:", accuracy_score(y, y_pred_train))

In [None]:
from sklearn.metrics import classification_report
# Generate classification report
report = classification_report(y_test, y_pred, target_names=le.classes_, digits=4)

In [None]:
print(report)
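
An accuracy figure is easier to interpret next to a majority-class baseline, and macro-averaged F1 shows how rare classes are handled. The sketch below uses made-up labels and scikit-learn's `DummyClassifier`; it is not part of the original experiments.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

y_toy = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

base_acc = accuracy_score(y_toy, y_base)
base_macro_f1 = f1_score(y_toy, y_base, average="macro", zero_division=0)
print(base_acc, round(base_macro_f1, 3))
```

Here the baseline reaches 60% accuracy while its macro F1 is far lower, because two of the three classes are never predicted.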

### Deep learning 
I also tried using deep learning methods to train the model. Although the results were better than Random Forest, the training time was significantly longer. Therefore, I decided to first focus on fine-tuning the Random Forest model before adjusting the deep learning approach.

In [None]:
import keras
from sklearn.preprocessing import LabelEncoder

def label_encode(le, labels):
    enc = le.transform(labels)
    return keras.utils.to_categorical(enc)

def label_decode(le, one_hot_label):
    dec = np.argmax(one_hot_label, axis=1)
    return le.inverse_transform(dec)

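label_encoder = LabelEncoder()
label_encoder.fit(y_train)

y = label_encode(label_encoder, y_train)
# y_test already holds integer codes from the earlier `le.transform` call, and both
# encoders were fit on y_train, so it only needs one-hot encoding here
y_test = keras.utils.to_categorical(y_test, num_classes=len(label_encoder.classes_))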
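
The one-hot encoding and the `argmax` decoding used by the two helpers above can be sketched without Keras. The NumPy version below is a stand-in with hypothetical function names; it produces the same kind of 0/1 matrix and recovers the integer codes.

```python
import numpy as np

def one_hot(codes, n_classes):
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(n_classes, dtype="float32")[np.asarray(codes)]

def from_one_hot(rows):
    # The position of the largest value is the class index
    return np.argmax(rows, axis=1)

codes = np.array([2, 0, 1, 2])
encoded_rows = one_hot(codes, n_classes=3)
recovered = from_one_hot(encoded_rows)
print(encoded_rows.shape, recovered.tolist())
```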

In [None]:
# I/O check
input_shape = X.shape[1]
print('input_shape: ', input_shape)

output_shape = len(le.classes_)
print('output_shape: ', output_shape)

In [None]:
print(X.shape)
print(y.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers import ReLU, Softmax

# input layer
model_input = Input(shape=(input_shape, ))  # 500
x = model_input

# 1st hidden layer
X_W1 = Dense(units=64)(x)  # 64
H1 = ReLU()(X_W1)

# 2nd hidden layer
H1_W2 = Dense(units=64)(H1)  # 64
H2 = ReLU()(H1_W2)

# output layer
H2_W3 = Dense(units=output_shape)(H2)  # one unit per emotion class
H3 = Softmax()(H2_W3)

model_output = H3

# create model
model = Model(inputs=[model_input], outputs=[model_output])

# loss function & optimizer
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# show model construction
model.summary()
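
The parameter count reported by `model.summary()` can be cross-checked by hand: each dense layer has one weight per input-output pair plus one bias per output. The sketch below assumes 8 output classes purely for illustration, and `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 500 TF-IDF features -> 64 -> 64 -> 8 outputs (8 classes is an assumption)
total_params = dense_param_count([500, 64, 64, 8])
print(total_params)
```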

In [None]:
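import os
from keras.callbacks import CSVLogger

os.makedirs('logs', exist_ok=True)  # make sure the log folder exists before CSVLogger writes to it
csv_logger = CSVLogger('logs/training_log.csv')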

# training setting
epochs = 4
batch_size = 32

# training!
history = model.fit(X, y, 
                    epochs=epochs, 
                    batch_size=batch_size, 
                    callbacks=[csv_logger],
                    validation_data = (X_test, y_test))
print('training finish')

In [None]:
# Let's take a look at the training log
training_log = pd.read_csv("logs/training_log.csv")
training_log

**Draw accuracy and loss plots**

In [None]:
# Plot training vs. validation accuracy and loss for each epoch
df = pd.DataFrame(training_log)
plt.figure(figsize=(14, 6))
# Subplot 1: Accuracy
plt.subplot(1, 2, 1)
plt.plot(df['epoch'], df['accuracy'], label='Accuracy', marker='o')
plt.plot(df['epoch'], df['val_accuracy'], label='Validation Accuracy', marker='o')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid()

# Subplot 2: Loss
plt.subplot(1, 2, 2)
plt.plot(df['epoch'], df['loss'], label='Loss', marker='o')
plt.plot(df['epoch'], df['val_loss'], label='Validation Loss', marker='o')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid()

# Show the plots
plt.tight_layout()
plt.show()
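
Besides reading the curves by eye, the log can be queried for the epoch with the lowest validation loss and for the gap between training and validation accuracy. The numbers below are made up; only the column names match the log used above.

```python
import pandas as pd

# Made-up log with the same columns as the training log
toy_log = pd.DataFrame({
    "epoch": [0, 1, 2, 3],
    "accuracy": [0.40, 0.46, 0.50, 0.54],
    "loss": [1.60, 1.45, 1.33, 1.22],
    "val_accuracy": [0.42, 0.45, 0.46, 0.45],
    "val_loss": [1.55, 1.47, 1.46, 1.49],
})

best_row = toy_log.loc[toy_log["val_loss"].idxmin()]
best_epoch = int(best_row["epoch"])
gap = toy_log["accuracy"] - toy_log["val_accuracy"]  # a widening gap hints at overfitting
print(best_epoch, round(float(gap.iloc[-1]), 2))
```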

### Use OpenAI embedding
I originally planned to use OpenAI's API for embeddings, but it was too slow: processing 300K records would take over 3 hours. I therefore decided this method was not suitable for my situation.

In [None]:
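# X_test now holds TF-IDF features, so redo the identical split (same random_state) to recover the raw texts
X_train_raw, X_test_raw, _, _ = train_test_split(X_train_data, y_train_data, test_size=0.2, random_state=42, stratify=y_train_data)
X_train_texts = X_train_raw['text'].tolist()  # Convert training text column to a list
X_test_texts = X_test_raw['text'].tolist()    # Convert testing text column to a list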

In [None]:
import getpass
import os

import numpy as np
from openai import OpenAI
from sklearn.decomposition import TruncatedSVD
from tqdm import tqdm  # For showing progress bars

# Read the OpenAI API key interactively so it is never stored in the notebook
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
client = OpenAI(
  api_key=os.environ['OPENAI_API_KEY'],  # this is also the default, it can be omitted
)

# Function to generate embeddings from OpenAI
def generate_embeddings(texts, model="text-embedding-ada-002"):
    embeddings = []
    for text in tqdm(texts, desc="Generating embeddings"):
        response = client.embeddings.create(input=text, model=model)
        embedding = response.data[0].embedding
        embeddings.append(embedding)
    return np.array(embeddings)

# Generate embeddings for train and test data
train_embeddings = generate_embeddings(X_train_texts)
test_embeddings = generate_embeddings(X_test_texts)

# Optional (disabled): reduce dimensions to 500 using Truncated SVD
# svd = TruncatedSVD(n_components=500, random_state=42)
# train_embeddings_500 = svd.fit_transform(train_embeddings)
# test_embeddings_500 = svd.transform(test_embeddings)
#
# Train embeddings are now reduced to 500 dimensions
# print("Shape of Train Embeddings:", train_embeddings_500.shape)
# print("Shape of Test Embeddings:", test_embeddings_500.shape)
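
Much of the time in the loop above goes to issuing one request per tweet. A possible mitigation is to send texts in chunks. The sketch below shows only the batching logic, with a stand-in embedder so it runs offline; `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the batch size is arbitrary.

```python
from typing import Callable, List, Sequence

import numpy as np

def embed_in_batches(texts: Sequence[str],
                     embed_batch: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 256) -> np.ndarray:
    """Call embed_batch once per chunk of texts and stack the results."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        vectors.extend(embed_batch(chunk))
    return np.array(vectors)

# Stand-in embedder: one 2-dimensional vector per text, and a record of chunk sizes
calls = []
def fake_embed_batch(chunk):
    calls.append(len(chunk))
    return [[float(len(t)), 1.0] for t in chunk]

vecs = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(vecs.shape, calls)
```

With the real client, `embed_batch` could wrap a single `client.embeddings.create(input=chunk, model=model)` call and collect the `embedding` of each item in `response.data`, assuming the endpoint accepts a list of inputs per request. I did not measure how much time this would save.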


### Generate Answer

In [None]:
X_test_data = test_data.drop(['tweet_id', 'identification', 'hashtags'], axis=1)

In [None]:
X_test_data = tfidf.transform(X_test_data['text']).toarray()

In [None]:
# deep learning
y_test_pred = model.predict(X_test_data, batch_size=128)
y_pred_labels = label_decode(label_encoder, y_test_pred)

In [None]:
y_pred_labels

In [None]:
# machine learning (if run, this replaces the deep learning predictions above)
y_test_pred = clf.predict(X_test_data)

In [None]:
y_pred_labels = le.inverse_transform(y_test_pred)
y_pred_labels

In [None]:
submission = pd.DataFrame({
    'id': test_data['tweet_id'],
    'emotion': y_pred_labels
})

In [None]:
submission.to_csv('kaggle/submission.csv', index=False)  # write only the id and emotion columns
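
Before uploading, it can help to check the header of the file. The sketch below uses made-up IDs and an in-memory buffer; it shows that leaving the index in adds an unnamed first column, which `index=False` avoids.

```python
import io

import pandas as pd

toy_submission = pd.DataFrame({"id": ["0x1a", "0x2b"], "emotion": ["joy", "anger"]},
                              index=[17, 42])  # non-default index, as after filtering rows

with_index = io.StringIO()
toy_submission.to_csv(with_index)                 # default: index is written too
without_index = io.StringIO()
toy_submission.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with), repr(header_without))
```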