   ### Q3. What are the most significant predictors of a user's Stress_Level, and can a model accurately predict whether a user falls into the high or low stress categories based on lifestyle factors?

# Machine Learning Prediction
The goal is to build a model that predicts the binary stress level (low or high) from the available parameters, including 'Daily_Screen_Time(hrs)', 'Sleep_Quality(1-10)', 'Exercise_Frequency(week)', 'Happiness_Index(1-10)', and 'Wellbeing_Score'. The target is 'Stress_Level_Binary(0=low_stress(1-5),1=high_stress(6-10))'.

## For binary classification, the two models below will be compared

# Conceptual representation of model philosophies
1. Logistic Regression (linear model): simplicity and interpretability
2. Random Forest: complexity and predictive power

### The Process (Same for Both!):
1. Set up and Load data
2. Split into training and testing
3. Train the model
4. Make predictions
5. Findings
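
The five steps above can be sketched end to end as follows. This is only an illustration on synthetic data from `make_classification`, not the survey data; the names `X_demo`, `y_demo`, and `demo_scores` are placeholders introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Set up and load data (synthetic stand-in, NOT the survey data)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)

# 2. Split into training and testing, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The same steps are applied to both model families
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

demo_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)              # 3. Train the model
    preds = model.predict(X_te)        # 4. Make predictions
    demo_scores[name] = accuracy_score(y_te, preds)  # 5. Findings
    print(f"{name}: test accuracy {demo_scores[name]:.2%}")
```

Running both models through one loop keeps the split and the metric identical, so any difference in score comes from the model rather than from the setup.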

 ### 1.1 Set up

In [49]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# For Regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# For Classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Make graphs look nice
plt.style.use('seaborn-v0_8-whitegrid')

 ### 1.2 Load Data

In [50]:
df = pd.read_csv('socialmedia_clean.csv')

In [51]:
df.head(3)

Unnamed: 0,User_ID,Age,Gender,Daily_Screen_Time(hrs),Sleep_Quality(1-10),Stress_Level(1-10),Days_Without_Social_Media,Exercise_Frequency(week),Social_Media_Platform,Happiness_Index(1-10)
0,U001,44,Male,3.1,7.0,6.0,2.0,5.0,Facebook,10.0
1,U002,30,Other,5.1,7.0,8.0,5.0,3.0,LinkedIn,10.0
2,U003,23,Other,7.4,6.0,7.0,1.0,3.0,YouTube,6.0


In [52]:
df_ml = df.copy()
df_ml.head(3)

Unnamed: 0,User_ID,Age,Gender,Daily_Screen_Time(hrs),Sleep_Quality(1-10),Stress_Level(1-10),Days_Without_Social_Media,Exercise_Frequency(week),Social_Media_Platform,Happiness_Index(1-10)
0,U001,44,Male,3.1,7.0,6.0,2.0,5.0,Facebook,10.0
1,U002,30,Other,5.1,7.0,8.0,5.0,3.0,LinkedIn,10.0
2,U003,23,Other,7.4,6.0,7.0,1.0,3.0,YouTube,6.0


In [53]:
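# Composite score: higher means better wellbeing.
# Stress is reverse-scored (11 - stress) so that low stress raises the score.
# Note: this makes Wellbeing_Score a function of Stress_Level(1-10), the column
# the binary target is later derived from.
df_ml['Wellbeing_Score'] = (
    df_ml['Happiness_Index(1-10)'] +
    df_ml['Sleep_Quality(1-10)'] +
    df_ml['Exercise_Frequency(week)'] +
    (11 - df_ml['Stress_Level(1-10)'])
)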

In [54]:
df_ml.columns

Index(['User_ID', 'Age', 'Gender', 'Daily_Screen_Time(hrs)',
       'Sleep_Quality(1-10)', 'Stress_Level(1-10)',
       'Days_Without_Social_Media', 'Exercise_Frequency(week)',
       'Social_Media_Platform', 'Happiness_Index(1-10)', 'Wellbeing_Score'],
      dtype='object')

##### 1.3.2 Stress_Level_Binary

In [55]:
# Boolean mask: True for high stress (6-10), False for low stress (1-5)
Stress_Level_Binary = df_ml['Stress_Level(1-10)'] >= 6

In [56]:
df_ml['Stress_Level_Binary(0=low_stress(1-5),1=high_stress(6-10))'] = np.where(Stress_Level_Binary, 1, 0)
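
To make the threshold concrete, here is a small sketch on four made-up stress values (the boundary cases 5 and 6 plus the two extremes). The names `stress_demo` and `binary_demo` are illustrative only and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stress scores covering both sides of the cut-off
stress_demo = pd.Series([1.0, 5.0, 6.0, 10.0], name="Stress_Level(1-10)")

# Scores of 6 and above map to 1 (high stress), everything else to 0 (low stress)
binary_demo = np.where(stress_demo >= 6, 1, 0)

print(binary_demo.tolist())
```

Casting the boolean mask directly, `(stress_demo >= 6).astype(int)`, gives the same labels.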

In [57]:
df_ml.head(3)

Unnamed: 0,User_ID,Age,Gender,Daily_Screen_Time(hrs),Sleep_Quality(1-10),Stress_Level(1-10),Days_Without_Social_Media,Exercise_Frequency(week),Social_Media_Platform,Happiness_Index(1-10),Wellbeing_Score,"Stress_Level_Binary(0=low_stress(1-5),1=high_stress(6-10))"
0,U001,44,Male,3.1,7.0,6.0,2.0,5.0,Facebook,10.0,27.0,1
1,U002,30,Other,5.1,7.0,8.0,5.0,3.0,LinkedIn,10.0,23.0,1
2,U003,23,Other,7.4,6.0,7.0,1.0,3.0,YouTube,6.0,19.0,1


In [58]:
# We no longer need Stress_Level(1-10)
df_ml = df_ml.drop(columns=['Stress_Level(1-10)'])

In [59]:
df_ml.columns

Index(['User_ID', 'Age', 'Gender', 'Daily_Screen_Time(hrs)',
       'Sleep_Quality(1-10)', 'Days_Without_Social_Media',
       'Exercise_Frequency(week)', 'Social_Media_Platform',
       'Happiness_Index(1-10)', 'Wellbeing_Score',
       'Stress_Level_Binary(0=low_stress(1-5),1=high_stress(6-10))'],
      dtype='object')

##### 1.4 Feature Importance
##### We looked at the relationships earlier during ETL; we also need to look at feature importance to determine the most important features.

In [60]:
# Checks what to Keep vs Drop
# Drop target + ID column

df_clean = df_ml.drop(columns=[
    "User_ID",
    "Stress_Level_Binary(0=low_stress(1-5),1=high_stress(6-10))"
], errors="ignore")

# One-hot encode categorical features automatically
X = pd.get_dummies(df_clean)


# Target
y = df_ml["Stress_Level_Binary(0=low_stress(1-5),1=high_stress(6-10))"]

# Train model
rf = RandomForestRegressor(random_state=42)
rf.fit(X, y)

# Feature importance
importances = pd.Series(rf.feature_importances_, index=X.columns)

# Sort smallest → largest
importances = importances.sort_values()

print(importances)

Gender_Other                         0.005111
Social_Media_Platform_Instagram      0.005264
Social_Media_Platform_TikTok         0.007236
Social_Media_Platform_YouTube        0.007947
Social_Media_Platform_Facebook       0.008725
Gender_Female                        0.010529
Social_Media_Platform_LinkedIn       0.011149
Gender_Male                          0.014277
Social_Media_Platform_X (Twitter)    0.016072
Days_Without_Social_Media            0.042394
Sleep_Quality(1-10)                  0.053120
Age                                  0.066023
Exercise_Frequency(week)             0.077739
Happiness_Index(1-10)                0.095944
Daily_Screen_Time(hrs)               0.163391
Wellbeing_Score                      0.415080
dtype: float64


#### What This Tells Us About Stress:
###### Importances show how much the model relies on each feature, not the direction of the effect
###### Wellbeing Score dominates (41.5%) - but it is computed from Stress_Level(1-10) itself, so part of this importance reflects the target rather than lifestyle
###### Screen time matters (16.3%) - consistent with digital overload being linked to stress
###### Happiness index matters (9.6%) - happiness is linked to stress
###### Exercise matters (7.8%) - activity is linked to stress
###### Social media choice matters (5.6%) - summed over the six platform columns
###### Gender matters (3.0%) - summed over the three gender columns
###### Everything else contributes too - Age (6.6%), sleep quality (5.3%) and days without social media (4.2%) are each above 1%
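
The 5.6% and 3.0% figures are sums over the one-hot columns of each categorical feature. A minimal sketch of that arithmetic, using the importances printed above (the helper `base_feature` and the name `importances_demo` are introduced here for illustration):

```python
import pandas as pd

# Importances copied from the printed output above
importances_demo = pd.Series({
    "Gender_Other": 0.005111,
    "Social_Media_Platform_Instagram": 0.005264,
    "Social_Media_Platform_TikTok": 0.007236,
    "Social_Media_Platform_YouTube": 0.007947,
    "Social_Media_Platform_Facebook": 0.008725,
    "Gender_Female": 0.010529,
    "Social_Media_Platform_LinkedIn": 0.011149,
    "Gender_Male": 0.014277,
    "Social_Media_Platform_X (Twitter)": 0.016072,
    "Days_Without_Social_Media": 0.042394,
    "Sleep_Quality(1-10)": 0.053120,
    "Age": 0.066023,
    "Exercise_Frequency(week)": 0.077739,
    "Happiness_Index(1-10)": 0.095944,
    "Daily_Screen_Time(hrs)": 0.163391,
    "Wellbeing_Score": 0.415080,
})

def base_feature(column):
    """Map a one-hot column back to the categorical feature it came from."""
    for prefix in ("Gender", "Social_Media_Platform"):
        if column.startswith(prefix + "_"):
            return prefix
    return column

# Group by the original feature name and add up the dummy-column importances
grouped = importances_demo.groupby(base_feature).sum().sort_values(ascending=False)
print(grouped.round(3))
```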

In [67]:
# Prepare data for classification
# We'll use all the features to predict Stress_Level_Binary
# 1. Make a copy to avoid modifying the original
df_prepared = df_ml.copy()

# 2. Define target
target_col = "Stress_Level_Binary(0=low_stress(1-5),1=high_stress(6-10))"
y = df_prepared[target_col]

# 3. Drop target and User_ID from features
X = df_prepared.drop(columns=[target_col, 'User_ID'])

# 4. Encode ALL categorical variables (Gender, Social_Media_Platform)
X_encoded = pd.get_dummies(X, drop_first=True)

# 5. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42, stratify=y
)
    
print(f"Training on {len(X_train)} people")
print(f"Testing on {len(X_test)} people")
print()
print("Distribution in training set:")
print(y_train.value_counts())
print()
print("Distribution in testing set:")
print(y_test.value_counts())

Training on 400 people
Testing on 100 people

Distribution in training set:
Stress_Level_Binary(0=low_stress(1-5),1=high_stress(6-10))
1    305
0     95
Name: count, dtype: int64

Distribution in testing set:
Stress_Level_Binary(0=low_stress(1-5),1=high_stress(6-10))
1    76
0    24
Name: count, dtype: int64
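
Because about 76% of users are in the high-stress class, a model that always predicts "high stress" already scores well on accuracy. The sketch below rebuilds label arrays from the class counts printed above and fits scikit-learn's `DummyClassifier` as a majority-class baseline; the `*_demo` arrays are placeholders with no real features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Label arrays rebuilt from the printed class counts (features are dummy zeros)
y_train_demo = np.array([1] * 305 + [0] * 95)
y_test_demo = np.array([1] * 76 + [0] * 24)
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent training class (here: 1 = high stress)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_accuracy = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(f"Share of high stress in training set: {y_train_demo.mean():.2%}")
print(f"Majority-class baseline accuracy:     {baseline_accuracy:.2%}")
```

Any trained model should be judged against this baseline rather than against 50%.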


In [80]:
# Train the classification model
class_model = LogisticRegression(random_state=42, max_iter=1000)
class_model.fit(X_train, y_train)

# Make predictions
y_pred_train = class_model.predict(X_train)
y_pred_test = class_model.predict(X_test) 

train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)

print(f"Training Accuracy: {train_accuracy:.2%}")
print(f"Testing Accuracy:  {test_accuracy:.2%}")

Training Accuracy: 99.75%
Testing Accuracy:  99.00%
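
One possible reason for scores this high is that `Wellbeing_Score` is built from `Stress_Level(1-10)`, so together with the other three components it lets a linear model recover the stress score. The sketch below illustrates the effect on synthetic data in which every column is random and independent of stress; the column names and the helper `holdout_accuracy` are assumptions made for this example, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in: all columns drawn independently, so the lifestyle
# columns on their own carry no information about stress.
demo = pd.DataFrame({
    "happiness": rng.integers(1, 11, n),
    "sleep": rng.integers(1, 11, n),
    "exercise": rng.integers(0, 8, n),
    "stress": rng.integers(1, 11, n),
})
# Same construction as Wellbeing_Score: stress enters reverse-scored
demo["wellbeing"] = (
    demo["happiness"] + demo["sleep"] + demo["exercise"] + (11 - demo["stress"])
)
y_leak = (demo["stress"] >= 6).astype(int)

def holdout_accuracy(columns):
    """Fit logistic regression on the given columns and score a 20% hold-out."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[columns], y_leak, test_size=0.2, random_state=42, stratify=y_leak
    )
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

acc_without = holdout_accuracy(["happiness", "sleep", "exercise"])
acc_with = holdout_accuracy(["happiness", "sleep", "exercise", "wellbeing"])

print(f"Without wellbeing column: {acc_without:.2%}")
print(f"With wellbeing column:    {acc_with:.2%}")
```

If the same gap appears on the real data when `Wellbeing_Score` is dropped from `X`, the model without it gives the more honest estimate of how well lifestyle factors predict stress.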


#### The Metrics
#### Precision, Recall, F1-Score

In [82]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Use the test-set predictions from the logistic regression cell above
precision = precision_score(y_test, y_pred_test)
recall = recall_score(y_test, y_pred_test)
f1 = f1_score(y_test, y_pred_test)

print(f"Precision: {precision:.2%}")  # Of predicted high stress, how many actually high?
print(f"Recall:    {recall:.2%}")     # Of actual high stress, how many did we catch?
print(f"F1-Score:  {f1:.2%}")         # Balance of precision and recall

Precision: 98.70%
Recall:    100.00%
F1-Score:  99.35%
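
These three numbers can be reproduced by hand from confusion-matrix counts. The counts below are inferred, not printed by the notebook: the test set has 76 high-stress and 24 low-stress users, 99% accuracy means one error, and 100% recall means that error is a false positive.

```python
# Inferred confusion-matrix counts for the 100 test users (positive = high stress)
tp_demo, fp_demo, fn_demo, tn_demo = 76, 1, 0, 23

# Precision: of the users predicted high stress, the share that really are
precision_demo = tp_demo / (tp_demo + fp_demo)
# Recall: of the users who really are high stress, the share that was caught
recall_demo = tp_demo / (tp_demo + fn_demo)
# F1: harmonic mean of precision and recall
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)
# Accuracy: all correct predictions over all predictions
accuracy_demo = (tp_demo + tn_demo) / (tp_demo + fp_demo + fn_demo + tn_demo)

print(f"Precision: {precision_demo:.2%}")
print(f"Recall:    {recall_demo:.2%}")
print(f"F1-Score:  {f1_demo:.2%}")
print(f"Accuracy:  {accuracy_demo:.2%}")
```

The single mistake is a low-stress user flagged as high stress, which is why precision is slightly below 100% while recall is not.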


###### 2.2 Random Forest