#### Feature Engineering and Model Training
 
 1. Feature Engineering
 
 - **Numeric Features**: 'Age', 'Daily_Usage_Time (minutes)', 'Posts_Per_Day', etc.
     - Already cleaned and converted to appropriate types; standardized with **StandardScaler** before modeling.
 - **Categorical Features**: 'Gender', 'Platform'
     - Encode using **One-Hot Encoding** (`OneHotEncoder`) inside the preprocessing pipeline.
 - **Target Variable**: 'Dominant_Emotion'
    - Encode target labels using **LabelEncoder**.

In [48]:
# import basic libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [49]:
# Load the dataset
df_train = pd.read_csv("../Dataset/clean_train_data.csv")
df_test = pd.read_csv("../Dataset/clean_test_data.csv")



In [50]:
# Train dataset
df_train.head()

Unnamed: 0,Age,Gender,Platform,Daily_Usage_Time (minutes),Posts_Per_Day,Likes_Received_Per_Day,Comments_Received_Per_Day,Messages_Sent_Per_Day,Dominant_Emotion
0,25,Female,Instagram,120.0,3.0,45.0,10.0,12.0,Happiness
1,30,Male,Twitter,90.0,5.0,20.0,25.0,30.0,Anger
2,22,Non-binary,Facebook,60.0,2.0,15.0,5.0,20.0,Neutral
3,28,Female,Instagram,200.0,8.0,100.0,30.0,50.0,Anxiety
4,33,Male,LinkedIn,45.0,1.0,5.0,2.0,10.0,Boredom


In [51]:
# Test dataset
df_test.head()

Unnamed: 0,Age,Gender,Platform,Daily_Usage_Time (minutes),Posts_Per_Day,Likes_Received_Per_Day,Comments_Received_Per_Day,Messages_Sent_Per_Day,Dominant_Emotion
0,27,Female,Snapchat,120,4,40,18,22,Neutral
1,21,Non-binary,Snapchat,60,1,18,7,12,Neutral
2,28,Non-binary,Snapchat,115,3,38,18,27,Anxiety
3,27,Male,Telegram,105,3,48,20,28,Anxiety
4,21,Non-binary,Facebook,55,3,17,7,12,Neutral


In [52]:
from sklearn.preprocessing import LabelEncoder

# Initialize encoder
le = LabelEncoder()

# Fit on train
df_train['Dominant_Emotion'] = le.fit_transform(df_train['Dominant_Emotion'])

# Apply same mapping to test
df_test['Dominant_Emotion'] = le.transform(df_test['Dominant_Emotion'])
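
`LabelEncoder` assigns integer codes in sorted order of the class names, so the numeric class labels that appear in later reports can be mapped back to emotion names. The following is a minimal sketch on a small hypothetical list of labels (the names `sample_labels`, `le_demo`, and `label_map` are illustrative and not part of the notebook). The real dataset has six classes, so its actual mapping will differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels; the notebook fits on df_train['Dominant_Emotion'] instead
sample_labels = ["Happiness", "Anger", "Neutral", "Anxiety", "Boredom", "Anger"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(sample_labels)

# classes_ holds the original labels in sorted order; the position is the integer code
label_map = {int(code): name for code, name in enumerate(le_demo.classes_)}
print(label_map)

# inverse_transform turns integer codes (for example, predictions) back into labels
decoded = le_demo.inverse_transform(codes)
print(list(decoded))
```

In the notebook itself, the same lookup would presumably be `le.classes_` on the fitted encoder.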




### 2. Train-Test Split
- Split the dataset into **features (X)** and **target (y)**.
- Use `train_test_split` to create **training and validation sets**.

In [53]:
# --- Step 1: Split features and target ---
X = df_train.drop('Dominant_Emotion', axis=1)
y = df_train['Dominant_Emotion']

In [54]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
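
To illustrate what `stratify=y` does in the split above, here is a small sketch on hypothetical imbalanced data (the names `X_demo`, `y_demo`, `X_tr`, and `X_val` are made up for this example). With stratification, the training and validation parts should keep the same class ratio as the full target.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 samples of class 0 and 20 of class 1
X_demo = pd.DataFrame({"feature": range(100)})
y_demo = pd.Series([0] * 80 + [1] * 20)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class proportions in each part; with stratify both should match the 80/20 ratio
print(y_tr.value_counts(normalize=True).sort_index().to_dict())
print(y_val.value_counts(normalize=True).sort_index().to_dict())
```

Without `stratify`, a small validation set could end up with too few samples of a rare class, which makes per-class metrics unstable.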

In [55]:
# Identify categorical and numerical columns 
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()


#### **Model Training**

In [56]:
# Important libraries
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [57]:
# --- Preprocessing: scale numeric columns, one-hot encode categorical columns ---
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

#### Performance metrics function


In [58]:
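def get_performance_metrics(y_true, y_pred):
    """Return accuracy, confusion matrix, and per-class report for a set of predictions."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        # output_dict=True returns a nested dict instead of a formatted string
        "classification_report": classification_report(y_true, y_pred, output_dict=True)
    }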
    
    return metrics
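
A quick way to see what this helper returns is to call it on a few hand-written labels. The sketch below repeats the function so that it runs on its own; the label lists `y_true_demo` and `y_pred_demo` are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def get_performance_metrics(y_true, y_pred):
    """Same helper as above, repeated so this example is self-contained."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "classification_report": classification_report(y_true, y_pred, output_dict=True),
    }

# Hypothetical labels for three classes; four of six predictions are correct
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

demo_metrics = get_performance_metrics(y_true_demo, y_pred_demo)
print(demo_metrics["accuracy"])          # fraction of correct predictions
print(demo_metrics["confusion_matrix"])  # rows are true classes, columns are predictions
# Keys of the report dict are strings, even when the labels are integers
print(demo_metrics["classification_report"]["1"]["recall"])
```

Note that the report is keyed by the string form of each label, which is why the printing loop later iterates over `report.items()` rather than indexing by integer.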

In [59]:
# --- Define models ---
models = {
    "Naive Bayes": GaussianNB(),

    "Random Forest": RandomForestClassifier(n_estimators=100,     # Number of trees
                                            max_depth=None,       # No depth limit
                                            random_state=42),

    "XGBoost": XGBClassifier(objective='multi:softmax',    # Multi-class objective, as in the final pipeline
                            n_estimators=100,             # Number of trees
                            learning_rate=0.1,            # Shrinkage applied to each tree's contribution
                            max_depth=3,                  # Maximum depth of a tree
                            random_state=42)
}

In [60]:
results = {}
for name, model in models.items():
    # Create pipeline
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Train
    pipeline.fit(X_train, y_train)
    
    # Predict
    y_pred = pipeline.predict(X_test)
    
    # Get metrics
    results[name] = get_performance_metrics(y_test, y_pred)

In [61]:
# --- Loop through results dictionary ---
for model_name, metrics in results.items():
    print(f"\n{'='*30}\nModel: {model_name}\n{'='*30}")
    print(f"Accuracy: {metrics['accuracy']}\n")
    
    
    # Double quotes outside so the single-quoted key parses on Python versions before 3.12
    print(f"Confusion_matrix: \n{metrics['confusion_matrix']}")
        
    
    print("\nClassification Report:")
    report = metrics['classification_report']
    # Print precision, recall, f1-score for each class
    for cls, cls_metrics in report.items():
        if cls not in ['accuracy', 'macro avg', 'weighted avg']:
            print(f"Class {cls}: Precision={cls_metrics['precision']:.2f}, "
                  f"Recall={cls_metrics['recall']:.2f}, F1-Score={cls_metrics['f1-score']:.2f}")



Model: Naive Bayes
Accuracy: 0.425

Confusion_matrix: 
[[22  0  3  1  0  0]
 [11  0  8 14  0  1]
 [ 5  0 23  0  0  0]
 [ 5  0  0 33  0  2]
 [ 9  0 21  9  0  1]
 [13  0  9  3  0  7]]

Classification Report:
Class 0: Precision=0.34, Recall=0.85, F1-Score=0.48
Class 1: Precision=0.00, Recall=0.00, F1-Score=0.00
Class 2: Precision=0.36, Recall=0.82, F1-Score=0.50
Class 3: Precision=0.55, Recall=0.82, F1-Score=0.66
Class 4: Precision=0.00, Recall=0.00, F1-Score=0.00
Class 5: Precision=0.64, Recall=0.22, F1-Score=0.33

Model: Random Forest
Accuracy: 0.97

Confusion_matrix: 
[[25  0  0  0  0  1]
 [ 0 32  1  1  0  0]
 [ 0  0 28  0  0  0]
 [ 0  0  0 39  1  0]
 [ 0  0  0  0 40  0]
 [ 0  2  0  0  0 30]]

Classification Report:
Class 0: Precision=1.00, Recall=0.96, F1-Score=0.98
Class 1: Precision=0.94, Recall=0.94, F1-Score=0.94
Class 2: Precision=0.97, Recall=1.00, F1-Score=0.98
Class 3: Precision=0.97, Recall=0.97, F1-Score=0.97
Class 4: Precision=0.98, Recall=1.00, F1-Score=0.99
Class 5: Precision=0.97, Recall=0.94, F1-Score=0.95

#### Evaluate on Final Test Data

In [62]:
# Features and target
X_full = df_train.drop('Dominant_Emotion', axis=1)
y_full = df_train['Dominant_Emotion']

# Final pipeline
final_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(
        objective='multi:softmax',
        num_class=len(y_full.unique()),
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    ))
])

# Train on full data
final_pipeline.fit(X_full, y_full)


In [63]:
# Separate test features and (already encoded) test labels
X_new = df_test.drop('Dominant_Emotion', axis=1, errors='ignore')
y_new = df_test['Dominant_Emotion']

In [64]:
y_new_pred = final_pipeline.predict(X_new)


#### Evaluate the Performance

In [65]:
y_true_test = df_test['Dominant_Emotion']

test_metrics = get_performance_metrics(y_true_test, y_new_pred)

print("Final Test Accuracy:", test_metrics["accuracy"])
print("\nConfusion Matrix:\n", test_metrics["confusion_matrix"])
print("\nClassification Report:")
print(pd.DataFrame(test_metrics["classification_report"]).T)


Final Test Accuracy: 0.9514563106796117

Confusion Matrix:
 [[ 9  0  0  0  0  0]
 [ 0 21  0  1  0  0]
 [ 0  0 15  0  0  1]
 [ 0  0  0 14  0  0]
 [ 0  0  1  0 27  0]
 [ 0  0  1  1  0 12]]

Classification Report:
              precision    recall  f1-score     support
0              1.000000  1.000000  1.000000    9.000000
1              1.000000  0.954545  0.976744   22.000000
2              0.882353  0.937500  0.909091   16.000000
3              0.875000  1.000000  0.933333   14.000000
4              1.000000  0.964286  0.981818   28.000000
5              0.923077  0.857143  0.888889   14.000000
accuracy       0.951456  0.951456  0.951456    0.951456
macro avg      0.946738  0.952246  0.948313  103.000000
weighted avg   0.954279  0.951456  0.951804  103.000000
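
The reported numbers can be cross-checked directly from the confusion matrix: accuracy is the diagonal sum divided by the total, recall is each diagonal entry divided by its row sum, and precision is each diagonal entry divided by its column sum. A small sketch, with the matrix copied from the output above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the final test output above
cm = np.array([
    [9, 0, 0, 0, 0, 0],
    [0, 21, 0, 1, 0, 0],
    [0, 0, 15, 0, 0, 1],
    [0, 0, 0, 14, 0, 0],
    [0, 0, 1, 0, 27, 0],
    [0, 0, 1, 1, 0, 12],
])

correct = int(np.trace(cm))   # diagonal entries are correctly classified samples
total = int(cm.sum())
accuracy = correct / total

recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row sums)
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column sums)

print(correct, total, round(accuracy, 4))
print(np.round(precision, 3))
print(np.round(recall, 3))
```

The results agree with the classification report: 98 of 103 samples are correct, and, for example, class 2 has precision 15/17 ≈ 0.882 and recall 15/16 = 0.9375.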


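### Final Model Conclusion

The final XGBoost pipeline classified 98 of the 103 held-out test samples correctly, a **test accuracy of 95.15%**.  
The confusion matrix shows only five misclassifications, spread across true classes 1, 2, 4, and 5, with no single dominant confusion between two classes.  
Per-class F1-scores range from 0.89 (class 5) to 1.00 (class 0), and the macro-averaged F1 (0.95) is close to the weighted average, so performance is reasonably balanced across emotion classes.  

Overall, the model performs well at predicting dominant emotional states from social media usage data on this test set. Because the test set is small (103 samples), these figures should be confirmed on more data before the model is treated as **robust and reliable**.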
