### Feature Engineering and Preprocessing for the Classification Task

This notebook covers the feature engineering and preprocessing steps for my classification task. The target is a binary conflict level (low or high) derived from the conflicts_over_social_media column. The Country column was grouped into continents, and ordinal encoding was applied to columns with an inherent ranking, such as Academic Level. One-hot encoding (via get_dummies) was used for the remaining categorical features. After feature engineering and preprocessing, the dataset was split into training and testing sets and saved as CSV files to the Data folder.

In [3]:
## All libraries used

import os
import warnings
from datetime import datetime

import country_converter as coco
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

sns.set()  # apply seaborn's default plot theme

# Suppress all warnings to keep the notebook output readable
warnings.simplefilter(action='ignore')

In [6]:
## Load the data

df = pd.read_csv(r'/Users/sot/SDS-CP029-social-sphere/submissions/team-members/Patrick-Edosoma/Data/Raw/Students Social Media Addiction.csv')

In [7]:
df.head()

Unnamed: 0,Student_ID,Age,Gender,Academic_Level,Country,Avg_Daily_Usage_Hours,Most_Used_Platform,Affects_Academic_Performance,Sleep_Hours_Per_Night,Mental_Health_Score,Relationship_Status,Conflicts_Over_Social_Media,Addicted_Score
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,4,9
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,1,4
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7


In [8]:
# column names normalization

df.columns = df.columns.str.lower().str.replace(' ', '_')

In [9]:
# checking for missing values

df.isna().sum()

student_id                      0
age                             0
gender                          0
academic_level                  0
country                         0
avg_daily_usage_hours           0
most_used_platform              0
affects_academic_performance    0
sleep_hours_per_night           0
mental_health_score             0
relationship_status             0
conflicts_over_social_media     0
addicted_score                  0
dtype: int64

In [10]:
# Group the conflict counts into low conflict (0 to 3) and high conflict (4 and 5)

conflict_mapping = {
    0: 'Low Conflict',
    1: 'Low Conflict',
    2: 'Low Conflict',
    3: 'Low Conflict',
    4: 'High Conflict',
    5: 'High Conflict'
}

In [11]:

## Add the target column, which has two classes (low and high conflict)

df['conflict_level_in_relationship_over_social_media'] = df['conflicts_over_social_media'].map(conflict_mapping)

In [12]:
df.head()

Unnamed: 0,student_id,age,gender,academic_level,country,avg_daily_usage_hours,most_used_platform,affects_academic_performance,sleep_hours_per_night,mental_health_score,relationship_status,conflicts_over_social_media,addicted_score,conflict_level_in_relationship_over_social_media
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,3,8,Low Conflict
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,0,3,Low Conflict
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,4,9,High Conflict
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,1,4,Low Conflict
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,2,7,Low Conflict


In [13]:

# Drop the original column that the target variable was derived from

df.drop(columns='conflicts_over_social_media', inplace=True)

In [14]:
# The target variable is a string; convert it to a binary class (0 = low conflict, 1 = high conflict)

df['conflict_level_in_relationship_over_social_media'] = df['conflict_level_in_relationship_over_social_media'].map({'Low Conflict':0, 'High Conflict':1})
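
The two mapping steps above (counts 0 to 3 become low conflict, 4 and 5 become high conflict, then the labels become 0 and 1) can also be written as a single threshold comparison. This is a minimal sketch on a toy frame, not the notebook's method; the helper name `binarize_conflicts` and its `threshold` parameter are illustrative.

```python
import pandas as pd

def binarize_conflicts(counts: pd.Series, threshold: int = 4) -> pd.Series:
    """Return 1 (high conflict) where counts >= threshold, else 0 (low conflict)."""
    return (counts >= threshold).astype(int)

# Toy data standing in for the conflicts_over_social_media column
toy = pd.DataFrame({"conflicts_over_social_media": [0, 1, 2, 3, 4, 5]})
toy["conflict_level"] = binarize_conflicts(toy["conflicts_over_social_media"])
print(toy["conflict_level"].tolist())  # [0, 0, 0, 0, 1, 1]
```

One difference to keep in mind: a dictionary passed to `.map()` produces NaN for any value that is not a key, whereas the threshold version assigns every numeric value to one of the two classes.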

In [15]:
# The country column contains the abbreviation 'UAE'. I replace it with 'United Arab Emirates'.
# This matters because 'UAE' is not recognized as a country name by the country_converter library,
# so leaving it unchanged would convert the UAE rows to an unknown value.

df['country'] = df['country'].replace({'UAE': 'United Arab Emirates'})

In [16]:
# converting the country column to continent
df['continent'] = df['country'].apply(lambda x: coco.convert(names=x, to='continent'))
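
Calling the converter once per row repeats the same lookup for every duplicate country. A possible alternative, sketched below, converts each distinct name once and maps the result back, then lists any names that failed to resolve. To keep the sketch runnable without the package, `to_continent` is a hypothetical stand-in for `coco.convert(names=x, to='continent')`, and the `"not found"` sentinel is an assumption about how failures are reported.

```python
import pandas as pd

# Hypothetical stand-in lookup, used only so the sketch runs without country_converter.
CONTINENT_LOOKUP = {"Bangladesh": "Asia", "India": "Asia", "UK": "Europe", "Canada": "America"}

def to_continent(name: str) -> str:
    """Return the continent for a country name, or 'not found' if it is unknown."""
    return CONTINENT_LOOKUP.get(name, "not found")

countries = pd.Series(["India", "UK", "India", "Canada", "Atlantis"], name="country")

# Convert each distinct country once, then broadcast the result to every row.
unique_map = {name: to_continent(name) for name in countries.unique()}
continent = countries.map(unique_map)

# Surface any names the converter could not resolve before they reach the model.
unresolved = sorted(countries[continent == "not found"].unique())
print(continent.tolist())
print(unresolved)
```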

In [17]:
df.head()

Unnamed: 0,student_id,age,gender,academic_level,country,avg_daily_usage_hours,most_used_platform,affects_academic_performance,sleep_hours_per_night,mental_health_score,relationship_status,addicted_score,conflict_level_in_relationship_over_social_media,continent
0,1,19,Female,Undergraduate,Bangladesh,5.2,Instagram,Yes,6.5,6,In Relationship,8,0,Asia
1,2,22,Male,Graduate,India,2.1,Twitter,No,7.5,8,Single,3,0,Asia
2,3,20,Female,Undergraduate,USA,6.0,TikTok,Yes,5.0,5,Complicated,9,1,America
3,4,18,Male,High School,UK,3.0,YouTube,No,7.0,7,Single,4,0,Europe
4,5,21,Male,Graduate,Canada,4.5,Facebook,Yes,6.0,6,In Relationship,7,0,America


In [18]:
# Drop the country column; the coarser continent column replaces it as the geographic feature
df.drop('country',axis =1,inplace=True)

In [19]:
df['continent'].value_counts()

continent
Europe     278
Asia       276
America    123
Oceania     22
Africa       6
Name: count, dtype: int64

In [15]:
# Ordinal-encode the categorical columns that have a natural order or only two values

academic_level = ['High School', 'Undergraduate', 'Graduate']
gender = ['Female', 'Male']
affects_academic_performance = ['No', 'Yes']

categorical_cols = ['academic_level', 'gender', 'affects_academic_performance']

# The category lists must be given in the same order as categorical_cols
encoder = OrdinalEncoder(categories=[academic_level, gender, affects_academic_performance])

df[categorical_cols] = encoder.fit_transform(df[categorical_cols])
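
Because the integer codes depend on the order of each category list, it can help to read the mapping back from the fitted encoder. The sketch below does this on a two-column toy frame; the `lookup` dictionary is an illustrative name, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cols = ["academic_level", "gender"]
toy = pd.DataFrame({
    "academic_level": ["Undergraduate", "High School", "Graduate"],
    "gender": ["Female", "Male", "Male"],
})

# One category list per column, in the same order as the columns passed to fit.
encoder = OrdinalEncoder(categories=[
    ["High School", "Undergraduate", "Graduate"],
    ["Female", "Male"],
])
encoded = encoder.fit_transform(toy[cols])

# Build a readable {column: {label: code}} lookup from the fitted encoder.
lookup = {
    col: {label: code for code, label in enumerate(cats)}
    for col, cats in zip(cols, encoder.categories_)
}
print(lookup["academic_level"])
print(encoded.tolist())
```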

In [16]:
df.head()

Unnamed: 0,student_id,age,gender,academic_level,avg_daily_usage_hours,most_used_platform,affects_academic_performance,sleep_hours_per_night,mental_health_score,relationship_status,addicted_score,conflict_level_in_relationship_over_social_media,continent
0,1,19,0.0,1.0,5.2,Instagram,1.0,6.5,6,In Relationship,8,0,Asia
1,2,22,1.0,2.0,2.1,Twitter,0.0,7.5,8,Single,3,0,Asia
2,3,20,0.0,1.0,6.0,TikTok,1.0,5.0,5,Complicated,9,1,America
3,4,18,1.0,0.0,3.0,YouTube,0.0,7.0,7,Single,4,0,Europe
4,5,21,1.0,2.0,4.5,Facebook,1.0,6.0,6,In Relationship,7,0,America


In [17]:
# I will convert the rest of the categorical columns to dummy variables

df = pd.get_dummies(df, drop_first=True, dtype=int)
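
If get_dummies is instead applied after the split, the train and test frames can end up with different dummy columns. One way to handle that, sketched here on toy data rather than the notebook's dataset, is to encode each part and then reindex the test frame onto the training columns.

```python
import pandas as pd

train = pd.DataFrame({"most_used_platform": ["Instagram", "TikTok", "YouTube"]})
test = pd.DataFrame({"most_used_platform": ["TikTok", "Snapchat"]})

# The training frame defines the final set of dummy columns.
train_enc = pd.get_dummies(train, drop_first=True, dtype=int)

# Encode the test frame without drop_first, then force it onto the training columns:
# categories unseen in training are dropped and missing columns are filled with 0.
test_enc = pd.get_dummies(test, dtype=int).reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
print(test_enc.values.tolist())
```

The test frame skips `drop_first=True` on purpose: its first category may differ from the one dropped in training, and the reindex already removes any column the training frame does not have.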


In [18]:
df.head()

Unnamed: 0,student_id,age,gender,academic_level,avg_daily_usage_hours,affects_academic_performance,sleep_hours_per_night,mental_health_score,addicted_score,conflict_level_in_relationship_over_social_media,...,most_used_platform_VKontakte,most_used_platform_WeChat,most_used_platform_WhatsApp,most_used_platform_YouTube,relationship_status_In Relationship,relationship_status_Single,continent_America,continent_Asia,continent_Europe,continent_Oceania
0,1,19,0.0,1.0,5.2,1.0,6.5,6,8,0,...,0,0,0,0,1,0,0,1,0,0
1,2,22,1.0,2.0,2.1,0.0,7.5,8,3,0,...,0,0,0,0,0,1,0,1,0,0
2,3,20,0.0,1.0,6.0,1.0,5.0,5,9,1,...,0,0,0,0,0,0,1,0,0,0
3,4,18,1.0,0.0,3.0,0.0,7.0,7,4,0,...,0,0,0,1,0,1,0,0,1,0
4,5,21,1.0,2.0,4.5,1.0,6.0,6,7,0,...,0,0,0,0,1,0,1,0,0,0


In [19]:
# Check the pairwise correlations between all columns

df.corr()

Unnamed: 0,student_id,age,gender,academic_level,avg_daily_usage_hours,affects_academic_performance,sleep_hours_per_night,mental_health_score,addicted_score,conflict_level_in_relationship_over_social_media,...,most_used_platform_VKontakte,most_used_platform_WeChat,most_used_platform_WhatsApp,most_used_platform_YouTube,relationship_status_In Relationship,relationship_status_Single,continent_America,continent_Asia,continent_Europe,continent_Oceania
student_id,1.0,0.222306,-0.001087,0.194221,0.267524,0.05378,0.173793,-0.055037,0.041637,0.053481,...,-0.052373,0.172763,0.141071,-0.173932,0.007454,0.125034,0.109344,-0.231217,0.179396,-0.01892
age,0.222306,1.0,0.49471,0.824932,-0.113682,-0.13714,0.125265,0.160278,-0.166396,-0.150371,...,0.126151,0.092137,0.077751,-0.039426,0.145176,-0.10038,0.010351,-0.06452,0.061523,-0.020483
gender,-0.001087,0.49471,1.0,0.58223,-0.073582,-0.024736,0.046946,0.046534,-0.049692,-0.124026,...,0.131777,0.069015,0.171081,0.120122,0.038675,-0.032628,0.011865,0.00695,0.001145,-0.048693
academic_level,0.194221,0.824932,0.58223,1.0,-0.12556,-0.091373,0.201456,0.175512,-0.16772,-0.236893,...,0.134163,0.098247,0.190029,-0.131913,0.17746,-0.087098,0.03306,0.001725,-0.00261,-0.033127
avg_daily_usage_hours,0.267524,-0.113682,-0.073582,-0.12556,1.0,0.661474,-0.790582,-0.801058,0.832,0.669387,...,-0.070034,0.004844,0.356934,-0.080069,0.008008,0.00637,0.336116,0.088889,-0.316473,-0.063055
affects_academic_performance,0.05378,-0.13714,-0.024736,-0.091373,0.661474,1.0,-0.625373,-0.808921,0.866049,0.451395,...,-0.17643,-0.033602,0.214812,-0.085738,-0.178718,0.156066,0.288289,0.125253,-0.306626,-0.104452
sleep_hours_per_night,0.173793,0.125265,0.046946,0.201456,-0.790582,-0.625373,1.0,0.707439,-0.764858,-0.630251,...,0.102961,0.064299,-0.255403,-0.05528,-0.028743,0.106815,-0.266792,-0.150272,0.336204,0.087541
mental_health_score,-0.055037,0.160278,0.046534,0.175512,-0.801058,-0.808921,0.707439,1.0,-0.945051,-0.695644,...,0.09212,0.032007,-0.179939,0.040523,0.053309,-0.028757,-0.233251,-0.141161,0.278366,0.110866
addicted_score,0.041637,-0.166396,-0.049692,-0.16772,0.832,0.866049,-0.764858,-0.945051,1.0,0.705489,...,-0.119215,-0.034416,0.186327,-0.025478,-0.049566,0.014795,0.297544,0.119873,-0.310097,-0.106016
conflict_level_in_relationship_over_social_media,0.053481,-0.150371,-0.124026,-0.236893,0.669387,0.451395,-0.630251,-0.695644,0.705489,1.0,...,-0.07964,-0.089233,0.126707,-0.045517,0.01643,-0.038225,0.109904,0.144391,-0.193476,-0.090203
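
A full correlation matrix is hard to scan by eye. The sketch below lists only the strongly correlated pairs, such as mental_health_score and addicted_score at about -0.945 in the output above. The helper name `strong_pairs` and the 0.8 cutoff are illustrative choices, and the data is a toy frame.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return column pairs whose absolute correlation is at least `threshold`."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [5, 4, 3, 1, 2],    # strongly negatively correlated with a and b
    "d": [1, -1, 1, -1, 1],  # weakly related to the others
})
result = strong_pairs(toy)
print(result)
```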


In [20]:
## Split the data into train and test sets and save them to the processed-data folder

# Split the data (80% train, 20% test)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Define your output directory
output_dir = "/Users/sot/SDS-CP029-social-sphere/submissions/team-members/Patrick-Edosma/Data/classification_processed_data"

# Make sure the directory exists
os.makedirs(output_dir, exist_ok=True)

# Save train and test CSVs
train_df.to_csv(os.path.join(output_dir, "train.csv"), index=False)
test_df.to_csv(os.path.join(output_dir, "test.csv"), index=False)
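
The split above samples rows at random. With a binary target it may be worth preserving the class ratio in both parts, which `train_test_split` supports through its `stratify` argument. The sketch below uses toy data and also leaves out `student_id`, on the assumption that it is only a row identifier; the notebook itself keeps that column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET = "conflict_level_in_relationship_over_social_media"

# Toy data: 20 rows, 25% of them in the positive (high conflict) class.
toy = pd.DataFrame({
    "student_id": range(1, 21),
    "avg_daily_usage_hours": [2.0 + 0.25 * i for i in range(20)],
    TARGET: [0] * 15 + [1] * 5,
})

# Assumption: student_id is only a row identifier, so it is left out of the features.
X = toy.drop(columns=["student_id", TARGET])
y = toy[TARGET]

# stratify=y keeps the share of each class the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())
```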





## 💡 Insights and Suggestions


As a final observation, I noticed that avg_daily_usage_hours, affects_academic_performance, sleep_hours_per_night, and mental_health_score are strongly correlated with one another. In the correlation matrix above, for example, avg_daily_usage_hours and mental_health_score correlate at about -0.80, and affects_academic_performance and mental_health_score at about -0.81. This multicollinearity means the features carry largely redundant information, which can add noise to the model and contribute to overfitting. One way to address this and build a more efficient model is to drop some of these features.

Alternatively, I will apply Principal Component Analysis (PCA) to reduce the redundancy by transforming these features into two principal components. For that reason, I won't drop them in my workflow.
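
The PCA step is not part of this notebook, so the following is only a sketch of how it might look. It assumes the features are standardized first (PCA is sensitive to scale) and that the transform is fitted on training rows only. The data is synthetic, and the names `CORRELATED` and `reducer` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

CORRELATED = ["avg_daily_usage_hours", "affects_academic_performance",
              "sleep_hours_per_night", "mental_health_score"]

# Synthetic stand-in data: four columns driven by one latent factor plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=100)
data = pd.DataFrame({
    "avg_daily_usage_hours": 5 + latent + 0.3 * rng.normal(size=100),
    "affects_academic_performance": (latent > 0).astype(float),
    "sleep_hours_per_night": 7 - latent + 0.3 * rng.normal(size=100),
    "mental_health_score": 6 - latent + 0.3 * rng.normal(size=100),
})
train, test = data.iloc[:80], data.iloc[80:]

# Scale first, then keep two principal components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
train_pcs = reducer.fit_transform(train[CORRELATED])  # fit on training rows only
test_pcs = reducer.transform(test[CORRELATED])        # reuse the fitted transform

explained = reducer.named_steps["pca"].explained_variance_ratio_
print(train_pcs.shape, test_pcs.shape, explained.round(3))
```

Checking `explained_variance_ratio_` shows how much of the original variance the two components retain, which helps judge whether two components are enough.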

It's also important to note that I preprocessed the categorical columns before the train-test split: I handled the mappings manually, with explicitly listed categories, and used get_dummies for encoding. As a general rule, do not call fit_transform on the full dataset before splitting. Always split the data first, then call .fit() on the training set only and .transform() on both the training and testing sets to avoid data leakage.
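
A minimal sketch of that split-first pattern for the encoders follows, using toy data. It swaps get_dummies for scikit-learn's `OneHotEncoder` inside a `ColumnTransformer`, so the categories are learned from the training rows and reused on the test rows. This is one possible arrangement, not the workflow used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({
    "academic_level": ["High School", "Undergraduate", "Graduate", "Undergraduate",
                       "Graduate", "High School", "Undergraduate", "Graduate"],
    "most_used_platform": ["Instagram", "TikTok", "YouTube", "Instagram",
                           "TikTok", "YouTube", "Instagram", "TikTok"],
    "age": [18, 20, 23, 19, 24, 17, 21, 22],
})

# Split first, so nothing about the test rows influences the fitted encoders.
train, test = train_test_split(toy, test_size=0.25, random_state=42)

preprocess = ColumnTransformer(
    transformers=[
        ("ordinal",
         OrdinalEncoder(categories=[["High School", "Undergraduate", "Graduate"]]),
         ["academic_level"]),
        # handle_unknown="ignore" encodes categories unseen in training as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["most_used_platform"]),
    ],
    remainder="passthrough",  # numeric columns such as age pass through unchanged
    sparse_threshold=0,       # always return a dense array
)

train_enc = preprocess.fit_transform(train)  # learn the categories from training rows
test_enc = preprocess.transform(test)        # apply the same mapping to test rows
print(train_enc.shape, test_enc.shape)
```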