<a href="https://colab.research.google.com/github/haykalaul/bdc_satriadata2025-1/blob/main/BDC_SatriaData2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Develop a machine learning model that classifies the emotion of a video into one of eight categories (proud, trust, joy, surprise, neutral, sadness, fear, anger) using the provided training data ("/content/datatrain.csv"). Predict the emotions for the test data ("/content/datatest.csv") and save the results to a CSV file with the columns 'id' and 'emotion'. Model performance will be evaluated using the macro-averaged F1 score.
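
For reference, the macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so rare emotions count as much as common ones. A minimal sketch using scikit-learn's `f1_score`; the label lists below are made-up toy values, not competition data:

```python
from sklearn.metrics import f1_score

# Toy labels for illustration only (hypothetical, not taken from datatrain.csv)
y_true = ["Joy", "Joy", "Fear", "Anger", "Anger", "Anger"]
y_pred = ["Joy", "Fear", "Fear", "Anger", "Anger", "Joy"]

# Per-class F1 scores, returned in the order given by `labels`
labels = ["Anger", "Fear", "Joy"]
per_class = f1_score(y_true, y_pred, labels=labels, average=None)

# Macro average: the plain mean of the per-class scores
macro = f1_score(y_true, y_pred, labels=labels, average="macro")
print(float(macro))
```

Because every class has equal weight, a model that only ever predicts the majority emotion scores poorly on this metric even when its accuracy looks acceptable.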


## Load data

### Subtask:
Load the training and testing datasets from the provided CSV files into pandas DataFrames.


**Reasoning**:
Import the pandas library and load the training and testing datasets into dataframes.



In [None]:
import pandas as pd

df_train = pd.read_csv('/content/datatrain.csv')
df_test = pd.read_csv('/content/datatest.csv')

display(df_train.head())
display(df_test.head())

Unnamed: 0,id,video,emotion
0,1,https://www.instagram.com/reel/DNKcHgdA-d1/?ig...,Surprise
1,2,https://www.instagram.com/reel/DNHwrh2gnBm/?ig...,Surprise
2,3,https://www.instagram.com/reel/DM7QsjnRCoa/?ig...,Surprise
3,4,https://www.instagram.com/reel/DNBBEt6Paxj/?ig...,Surprise
4,5,https://www.instagram.com/reel/DMz13fQzZsN/?ig...,Proud


Unnamed: 0,id,video
0,1,https://www.instagram.com/reel/DM2DYuURXAS/?ig...
1,2,https://www.instagram.com/reel/DMpUrCKxQj5/?ig...
2,3,https://www.instagram.com/reel/DKoahGuRUGB/?ig...
3,4,https://www.instagram.com/reel/DKZVA8DJq89/?ig...
4,5,https://www.instagram.com/reel/DLtizEnyaOn/?ig...


## Data preprocessing

### Subtask:
Encode the categorical 'emotion' column in the training data numerically. Also engineer a 'video_frequency' feature (how often each video URL appears in its dataset), and consider extracting features from the video URLs themselves, though this is unlikely to be useful for simple models.


**Reasoning**:
Apply the emotion mapping, perform one-hot encoding on the emotion column, and calculate video frequencies for both training and test datasets. Then create the feature and target sets for training and feature set for testing.



In [None]:
emotion_mapping = {
    'Proud': 'Proud', 'Bangga': 'Proud', 'proud': 'Proud',
    'Trust': 'Trust', 'percaya': 'Trust', 'Percaya': 'Trust', 'Loyalty': 'Trust', 'Faith': 'Trust',
    'Joy': 'Joy', 'Senang': 'Joy', 'senang': 'Joy', 'Love': 'Joy',
    'Surprise': 'Surprise', 'Terkejut': 'Surprise', 'Trkejut': 'Surprise', 'Terkjut': 'Surprise', 'terkejut': 'Surprise', 'Trekejut': 'Surprise', 'Kaget': 'Surprise', 'kaget': 'Surprise',
    'Neutral': 'Neutral', 'Netral': 'Neutral',
    'Sadness': 'Sadness', 'Sedih': 'Sadness', 'sedih': 'Sadness', 'Sad': 'Sadness',
    'Fear': 'Fear', 'Takut': 'Fear', 'takut': 'Fear',
    'Anger': 'Anger', 'Marah': 'Anger', 'marah': 'Anger', 'Marh': 'Anger'
}

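# Standardise label variants; any label missing from the mapping becomes NaN
df_train['emotion'] = df_train['emotion'].map(emotion_mapping)

# One-hot encode the target: one boolean column per emotion
y_train = pd.get_dummies(df_train['emotion'], prefix='emotion')

# Number of times each video URL appears within the training set
video_counts_train = df_train['video'].value_counts()
df_train['video_frequency'] = df_train['video'].map(video_counts_train)

# Test counts are computed from the test set itself, so every non-missing URL has a count;
# fillna(0) is only a safeguard and does not relate test videos to the training set
video_counts_test = df_test['video'].value_counts()
df_test['video_frequency'] = df_test['video'].map(video_counts_test).fillna(0).astype(int)

X_train = df_train[['video_frequency']]
X_test = df_test[['video_frequency']]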

display(df_train.head())
display(df_test.head())
display(y_train.head())
display(X_train.head())
display(X_test.head())

Unnamed: 0,id,video,emotion,video_frequency
0,1,https://www.instagram.com/reel/DNKcHgdA-d1/?ig...,Surprise,1
1,2,https://www.instagram.com/reel/DNHwrh2gnBm/?ig...,Surprise,1
2,3,https://www.instagram.com/reel/DM7QsjnRCoa/?ig...,Surprise,1
3,4,https://www.instagram.com/reel/DNBBEt6Paxj/?ig...,Surprise,1
4,5,https://www.instagram.com/reel/DMz13fQzZsN/?ig...,Proud,1


Unnamed: 0,id,video,video_frequency
0,1,https://www.instagram.com/reel/DM2DYuURXAS/?ig...,1
1,2,https://www.instagram.com/reel/DMpUrCKxQj5/?ig...,1
2,3,https://www.instagram.com/reel/DKoahGuRUGB/?ig...,1
3,4,https://www.instagram.com/reel/DKZVA8DJq89/?ig...,1
4,5,https://www.instagram.com/reel/DLtizEnyaOn/?ig...,1


Unnamed: 0,emotion_Anger,emotion_Fear,emotion_Joy,emotion_Neutral,emotion_Proud,emotion_Sadness,emotion_Surprise,emotion_Trust
0,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,True,False
4,False,False,False,False,True,False,False,False


Unnamed: 0,video_frequency
0,1
1,1
2,1
3,1
4,1


Unnamed: 0,video_frequency
0,1
1,1
2,1
3,1
4,1
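
Because `Series.map` returns `NaN` for any label missing from the dictionary, it can be worth checking that no rows silently lost their target. A minimal sketch on a toy Series; the values and the reduced, lower-cased mapping below are illustrative assumptions, not the notebook's data:

```python
import pandas as pd

# Reduced, hypothetical mapping keyed by lower-cased labels
mapping = {
    "surprise": "Surprise", "terkejut": "Surprise",
    "marah": "Anger", "anger": "Anger",
}

raw = pd.Series(["Surprise", " terkejut", "MARAH", "Angry", None])

# Normalise whitespace and case before the lookup so fewer variants need listing
cleaned = raw.str.strip().str.lower().map(mapping)

# Rows that failed to map are NaN; show the original values for inspection
unmapped = raw[cleaned.isna()]
print(cleaned.tolist())
print("unmapped:", unmapped.tolist())
```

Normalising case and whitespace first would shorten the mapping, but misspellings such as 'Trkejut' would still need explicit entries.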


## Model selection and training

### Subtask:
Choose a suitable machine learning model for multi-class classification (each video receives exactly one of the eight emotions). Train the model using the preprocessed training data.


**Reasoning**:
Train a Logistic Regression model for each emotion class and make predictions on the training data.



In [None]:
from sklearn.linear_model import LogisticRegression

models = {}
y_train_pred = pd.DataFrame()

for emotion in y_train.columns:
  model = LogisticRegression(max_iter=1000)
  model.fit(X_train, y_train[emotion])
  models[emotion] = model
  y_train_pred[emotion] = model.predict(X_train)

display(y_train_pred.head())

Unnamed: 0,emotion_Anger,emotion_Fear,emotion_Joy,emotion_Neutral,emotion_Proud,emotion_Sadness,emotion_Surprise,emotion_Trust
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
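
The all-`False` predictions above are what one would expect if `video_frequency` carries little or no signal. The displayed rows all have a value of 1; assuming (the output shown does not confirm this) that the column is constant or nearly so, each binary model can only learn the class prior, and any emotion covering fewer than half the rows is never predicted. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: a constant feature and a target that is True for 25% of rows
X = np.ones((8, 1))
y = np.array([True, True, False, False, False, False, False, False])

model = LogisticRegression(max_iter=1000).fit(X, y)

pred = model.predict(X)
proba_true = model.predict_proba(X)[:, 1]

# Every row gets the same probability, close to the class prior, so none crosses 0.5
print(pred.any(), proba_true.round(2))
```

Checking `X_train['video_frequency'].nunique()` on the real data would show whether this explanation applies.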


## Prediction

### Subtask:
Use the trained model to predict the emotions for the test data.


**Reasoning**:
Iterate through the trained models, generate predictions for the test data, store them in a DataFrame with original ids, and convert boolean predictions to integers.



In [None]:
predicted_emotions = pd.DataFrame()

for emotion, model in models.items():
  predicted_emotions[emotion] = model.predict(X_test)

predicted_emotions['id'] = df_test['id']

for col in predicted_emotions.columns:
  if predicted_emotions[col].dtype == 'bool':
    predicted_emotions[col] = predicted_emotions[col].astype(int)

display(predicted_emotions.head())

Unnamed: 0,emotion_Anger,emotion_Fear,emotion_Joy,emotion_Neutral,emotion_Proud,emotion_Sadness,emotion_Surprise,emotion_Trust,id
0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,2
2,0,0,0,0,0,0,0,0,3
3,0,0,0,0,0,0,0,0,4
4,0,0,0,0,0,0,0,0,5


## Format and save results

### Subtask:
Format the predictions into a CSV file with 'id' and 'emotion' columns, as specified, and save it.


**Reasoning**:
Initialize an empty list to store predictions, iterate through the predicted emotions, find the predicted emotion for each row, and handle cases with no predicted emotion by using the most frequent emotion from the training data. Then, create a dataframe from the list, save it to a CSV file, and display the head of the dataframe.



In [None]:
# Columns holding the per-emotion binary predictions
emotion_cols = [col for col in predicted_emotions.columns if col.startswith('emotion_')]

# Fallback label: the most frequent emotion in the training data (computed once)
most_frequent_emotion = y_train.sum().idxmax().replace('emotion_', '')

predictions_list = []

for index, row in predicted_emotions.iterrows():
    # Take the first emotion flagged as 1; use the fallback when none is flagged
    predicted_emotion = next(
        (col.replace('emotion_', '') for col in emotion_cols if row[col] == 1),
        most_frequent_emotion,
    )
    predictions_list.append({'id': row['id'], 'emotion': predicted_emotion})

predictions_df = pd.DataFrame(predictions_list)

predictions_df.to_csv('predictions.csv', index=False)

display(predictions_df.head())

Unnamed: 0,id,emotion
0,1,Surprise
1,2,Surprise
2,3,Surprise
3,4,Surprise
4,5,Surprise
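
One alternative to taking the first positive column and falling back to the majority class is to compare the positive-class probabilities of the per-emotion models and pick the largest. This is a sketch of that swapped-in technique, not what the notebook does; the two features and the three-label setup are synthetic assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic, well-separated two-feature data with three hypothetical labels
X = pd.DataFrame({
    "f1": [0.0, 0.2, 0.1, 5.0, 5.2, 5.1, 0.0, 0.2, 0.1],
    "f2": [0.0, 0.1, 0.2, 0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
})
labels = pd.Series(["Neutral"] * 3 + ["Joy"] * 3 + ["Fear"] * 3)
y = pd.get_dummies(labels, prefix="emotion")

# One binary model per emotion; keep the probability of the positive class
proba = pd.DataFrame({
    col: LogisticRegression(max_iter=1000).fit(X, y[col]).predict_proba(X)[:, 1]
    for col in y.columns
})

# Pick the emotion whose model is most confident, so every row gets exactly one label
predicted = proba.idxmax(axis=1).str.replace("emotion_", "", regex=False)
print(predicted.tolist())
```

With an uninformative feature the probabilities are identical across rows, so this would still assign one emotion to every video; it removes the arbitrary column-order tie-break but does not add signal.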


## Summary:

### Data Analysis Key Findings

*   The training data contained various spellings and cases for emotion labels, which were standardized into eight consistent categories: 'Proud', 'Trust', 'Joy', 'Surprise', 'Neutral', 'Sadness', 'Fear', and 'Anger'.
*   The 'video\_frequency' feature, representing the count of each video in the dataset, was engineered and used as the sole feature for model training.
*   Separate Logistic Regression models were trained for each of the eight emotion categories using the 'video\_frequency' feature.
*   Predictions for the test data were generated using these trained models. When none of the eight binary models predicted a positive for a video, as in all five displayed test rows, the most frequent emotion from the training data was assigned as the prediction.
*   The final output was saved to a CSV file with 'id' and 'emotion' columns, as required.

### Insights or Next Steps

*   The current model uses only a single engineered feature ('video\_frequency'). Incorporating more relevant features, potentially extracted from video content or metadata, could significantly improve model performance.
*   This is a multi-class problem (one emotion per video), so a single multi-class classifier may be a better fit than eight independent binary classifiers: it always returns exactly one label and avoids the case where no binary model predicts a positive.
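
As a sketch of the second point, a single multi-class `LogisticRegression` can be scored directly with macro F1 under cross-validation. The features here are generated with `make_classification` purely as a placeholder, since the notebook does not yet have content-based features; the class count and sample sizes are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder features standing in for real video-derived features
X, y = make_classification(
    n_samples=240, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# One model over all classes: it always outputs exactly one label per sample
model = LogisticRegression(max_iter=1000)

# Stratified folds keep class proportions similar; scoring uses macro F1
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(scores.mean())
```

A held-out estimate like this would also give the notebook a way to compare feature sets before submitting predictions.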
