
# Task
Explain the steps for building, training, evaluating, and deploying both ML and DL models on the multimodal dataset located at "/content/Complete Data With Emoji.csv". This explanation should include details on feature engineering and selection, hyperparameter tuning, model evaluation using confusion matrices, and the steps involved in model deployment.

## Feature engineering and selection

### Subtask:
Based on the multimodal nature of your data (text and emojis), engineer and select relevant features for both ML and DL models. This may involve techniques like TF-IDF for text, and one-hot encoding or embedding for emojis.


**Reasoning**:
Apply TF-IDF vectorization to the 'Tweet_Text_With_Emoji' column and perform one-hot encoding on the 'Emoji' column, then concatenate the results to create the feature set.



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Apply TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limit features for manageability
tfidf_features = tfidf_vectorizer.fit_transform(df['Tweet_Text_With_Emoji']).toarray()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_features, columns=tfidf_feature_names)

# One-hot encode emojis
emoji_one_hot = pd.get_dummies(df['Emoji'], prefix='Emoji')

# Concatenate features
X_multimodal = pd.concat([tfidf_df, emoji_one_hot], axis=1)

display(X_multimodal.head())

KeyError: 'Emoji'

**Reasoning**:
The previous code failed because the column 'Emoji' does not exist in the dataframe `df`. However, the `emoji_data_df` dataframe contains the 'Emoji' column that was created in a previous step. I will use this dataframe for one-hot encoding the emojis and then concatenate it with the TF-IDF features from `tfidf_df`.
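One thing worth checking before the concatenation: `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses two small made-up frames (`left` and `right` are illustrative names, not from this notebook) to show how mismatched indexes produce extra NaN rows and how `reset_index(drop=True)` avoids that. Whether `tfidf_df` and `emoji_data_df` already share an index depends on earlier cells that are not shown here.

```python
import pandas as pd

# Toy frames standing in for tfidf_df and the one-hot emoji frame (hypothetical data)
left = pd.DataFrame({"tfidf_a": [0.1, 0.2, 0.3]})  # default index 0, 1, 2
right = pd.DataFrame({"Emoji_x": [True, False, True]}, index=[5, 6, 7])

# Concatenating as-is aligns on index labels, so the result covers both indexes
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes makes the concatenation purely positional
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)

print(misaligned.shape)  # rows from both indexes, with NaN gaps
print(aligned.shape)     # one row per original sample
```

If the two frames in the notebook come from the same rows in the same order, resetting both indexes before `pd.concat` is a cheap safeguard.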



In [None]:
# One-hot encode emojis using the correct dataframe
emoji_one_hot = pd.get_dummies(emoji_data_df['Emoji'], prefix='Emoji')

# Concatenate features
X_multimodal = pd.concat([tfidf_df, emoji_one_hot], axis=1)

display(X_multimodal.head())

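| | `__` | `___` | `_mohyudin` | `_ul_` | `aa` | `aab` | `aaba` | `aabaad` | `aadami` | `aadat` | ... | `zor` | `zulfi` | `zulm` | `Emoji_NO_EMOJI` | `Emoji_👿` | `Emoji_😃` | `Emoji_😊` | `Emoji_😑` | `Emoji_🙌` | `Emoji_🤬` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | True | False | False | False | False | False | False |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | True | False | False | False | False | False | False |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | True | False | False | False | False | False | False |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | True | False | False | False | False | False | False |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | False | False | False | True | False | False | False |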


## ML model building and training

### Subtask:
Choose appropriate ML models for your classification task (e.g., Logistic Regression, SVM, RandomForest, Naive Bayes). Train each model using the engineered features. Perform hyperparameter tuning for each ML model to optimize performance.


**Reasoning**:
The task requires training several ML models and tuning their hyperparameters. This involves splitting the data, defining parameter grids, and using GridSearchCV for hyperparameter tuning. The first step is to import the necessary libraries and split the data into training, validation, and test sets using two successive 80/20 splits.



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Split data into training+validation and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X_multimodal, y, test_size=0.2, random_state=42)

# Split training+validation data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.2, random_state=42)

# Define hyperparameter grids
lr_param_grid = {'C': [0.1, 1, 10, 100]}
svm_param_grid = {'C': [0.1, 1, 10, 100]}
rf_param_grid = {'n_estimators': [100, 200, 500], 'max_depth': [None, 10, 20, 30]}

# Initialize a dictionary to store trained models
trained_models = {}
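The two successive 80/20 splits above leave 0.8 × 0.8 = 64% of the rows for training, 0.8 × 0.2 = 16% for validation, and 20% for testing. The sketch below checks that arithmetic on made-up data (`X_demo` and `y_demo` are hypothetical). It also passes `stratify`, which is an addition not present in the notebook's split and is only one possible way to keep class proportions equal across the three sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, binary labels with a 70/30 class ratio
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0] * 700 + [1] * 300)

# First split: 80% train+validation, 20% test
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Second split: 80% of the remainder for training, 20% for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.2, random_state=42, stratify=y_tv)

print(len(X_tr), len(X_va), len(X_te))  # 640 160 200
```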

**Reasoning**:
Now that the data is split and parameter grids are defined, the next step is to train and tune each of the specified ML models (Logistic Regression, SVM, and Random Forest) using GridSearchCV, and train the GaussianNB model directly.



In [None]:
# Train and tune Logistic Regression
lr = LogisticRegression(max_iter=1000) # Increase max_iter for convergence
lr_grid_search = GridSearchCV(lr, lr_param_grid, cv=5, scoring='accuracy')
lr_grid_search.fit(X_train, y_train)
trained_models['Logistic Regression'] = lr_grid_search.best_estimator_

# Train and tune SVM
svm = SVC(kernel='linear')  # SVC defaults to an RBF kernel; set linear to match the 'Linear SVM' label
svm_grid_search = GridSearchCV(svm, svm_param_grid, cv=5, scoring='accuracy')
svm_grid_search.fit(X_train, y_train)
trained_models['Linear SVM'] = svm_grid_search.best_estimator_

# Train and tune Random Forest
rf = RandomForestClassifier(random_state=42)  # Fix the seed so results are reproducible
rf_grid_search = GridSearchCV(rf, rf_param_grid, cv=5, scoring='accuracy')
rf_grid_search.fit(X_train, y_train)
trained_models['Random Forest'] = rf_grid_search.best_estimator_

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
trained_models['Gaussian Naive Bayes'] = gnb
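After fitting, each `GridSearchCV` object exposes the winning hyperparameters and the mean cross-validated score, and the held-out validation split can serve as a first check before touching the test set. The following is a rough sketch on synthetic data, not the notebook's results: `X_demo`, `y_demo`, and `search` are hypothetical names, and `make_classification` stands in for the real features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for X_train / X_val (hypothetical data, binary labels)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Same grid and settings as the Logistic Regression search above
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.1, 1, 10, 100]}, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

# Inspect what the search selected
print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))

# Score the refit best estimator on the validation split
val_pred = search.best_estimator_.predict(X_va)
val_acc = accuracy_score(y_va, val_pred)
val_cm = confusion_matrix(y_va, val_pred)  # rows: true class, columns: predicted class
print("validation accuracy:", round(val_acc, 3))
print(val_cm)
```

The same pattern could be looped over the entries of `trained_models` with `X_val` and `y_val` to compare the four models on equal footing.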