# digital diet mental health 

##  Business Context

###  Analysis of Screen Time Impact on Mental Health

In today’s digital world, screen time is steadily increasing—whether for work, social interaction, or entertainment. This raises a critical question: **how does screen time affect mental health?**

The goal of this study is to explore the **relationship between digital habits and psychological well-being**, in order to:
- Identify the types of screen use most associated with poor or stable mental health,
- Understand the influence of sleep, physical activity, and lifestyle habits,
- Model the **mental health score** using explanatory variables,
- Develop a predictive tool to detect at-risk individuals (via regression or binary classification),
- Provide recommendations for healthcare professionals or wellness app developers.

The dataset includes variables related to screen usage (total time, type of device, social media, gaming), lifestyle behaviors (sleep, diet, exercise), and mental state (mood, stress, anxiety, etc.).

---

####  Variable Dictionary

| Variable                             | Description |
|--------------------------------------|-------------|
| `user_id`                            | Unique user identifier. |
| `age`                                | Age of the individual. |
| `gender`                             | Gender of the user (`Male`, `Female`, etc.). |
| `daily_screen_time_hours`            | Total number of screen hours per day. |
| `phone_usage_hours`                  | Daily phone usage time (in hours). |
| `laptop_usage_hours`                 | Daily laptop usage time. |
| `tablet_usage_hours`                 | Daily tablet usage time. |
| `tv_usage_hours`                     | Daily television viewing time. |
| `social_media_hours`                 | Time spent on social media per day. |
| `work_related_hours`                 | Screen time dedicated to work-related tasks. |
| `entertainment_hours`                | Screen time spent on entertainment (videos, streaming, etc.). |
| `gaming_hours`                       | Time spent playing video games. |
| `sleep_duration_hours`               | Average sleep duration per night (in hours). |
| `sleep_quality`                      | Self-reported sleep quality (score or category). |
| `mood_rating`                        | User’s self-assessed mood score. |
| `stress_level`                       | Reported level of stress. |
| `physical_activity_hours_per_week`   | Total hours of physical activity per week. |
| `location_type`                      | Type of living area (`Urban`, `Rural`, etc.). |
| `mental_health_score`                | **Target variable** measuring mental health. Can be used for regression or converted for binary classification. |
| `uses_wellness_apps`                 | Whether the user uses wellness apps (`Yes`, `No`). |
| `eats_healthy`                       | Whether the user maintains healthy eating habits (`Yes`, `No`). |
| `caffeine_intake_mg_per_day`         | Daily caffeine intake (in mg). |
| `weekly_anxiety_score`               | Weekly anxiety score. |
| `weekly_depression_score`            | Weekly depression score. |
| `mindfulness_minutes_per_day`        | Daily time spent practicing mindfulness (e.g., meditation). |

---


## load package 

In [1]:
import pandas as pd 
import pathlib
import os
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
import time
import ipywidgets as widgets
from IPython.display import display, clear_output
import warnings

## load data

In [2]:
data=pd.read_csv(r"C:\Users\Client\Desktop\R studio project\digital_diet_mental_health\digital_diet_mental_health.csv")
df=pd.DataFrame(data)
df.head()

Unnamed: 0,user_id,age,gender,daily_screen_time_hours,phone_usage_hours,laptop_usage_hours,tablet_usage_hours,tv_usage_hours,social_media_hours,work_related_hours,...,stress_level,physical_activity_hours_per_week,location_type,mental_health_score,uses_wellness_apps,eats_healthy,caffeine_intake_mg_per_day,weekly_anxiety_score,weekly_depression_score,mindfulness_minutes_per_day
0,user_1,51,Female,4.8,3.4,1.3,1.6,1.6,4.1,2.0,...,10,0.7,Urban,32,1,1,125.2,13,15,4.0
1,user_2,64,Male,3.9,3.5,1.8,0.9,2.0,2.7,3.1,...,6,4.3,Suburban,75,0,1,150.4,19,18,6.5
2,user_3,41,Other,10.5,2.1,2.6,0.7,2.2,3.0,2.8,...,5,3.1,Suburban,22,0,0,187.9,7,3,6.9
3,user_4,27,Other,8.8,0.0,0.0,0.7,2.5,3.3,1.6,...,5,0.0,Rural,22,0,1,73.6,7,2,4.8
4,user_5,55,Male,5.9,1.7,1.1,1.5,1.6,1.1,3.6,...,7,3.0,Urban,64,1,1,217.5,8,10,0.0


##  2.	Analyze the Structure of the Data 

In [3]:
def data_summary(df):
    from IPython.display import display, Markdown
    import pandas as pd

    separator = "\n" + "=" * 70 + "\n"
    
    display(Markdown("##  **Overall Dataset Summary**"))
    
    total_rows, total_cols = df.shape
    display(Markdown(f"- **Number of rows:** {total_rows:,}"))
    display(Markdown(f"- **Number of columns:** {total_cols:,}"))

    display(Markdown("###  **Column Names**"))
    display(Markdown(", ".join([f"`{col}`" for col in df.columns])))

    print(separator)
    
    display(Markdown("## **Data Types**"))
    display(df.dtypes.value_counts().to_frame("Number of columns").rename_axis("Data Type"))

    qualitative_columns = df.select_dtypes(include=["object"]).columns.tolist()
    quantitative_columns = df.select_dtypes(include=["number"]).columns.tolist()

    print(separator)

    display(Markdown("## **Qualitative (Categorical) Variables**"))
    if qualitative_columns:
        display(Markdown(", ".join([f"`{col}`" for col in qualitative_columns])))
    else:
        display(Markdown("_No qualitative variables._"))

    display(Markdown("## **Quantitative (Numerical) Variables**"))
    if quantitative_columns:
        display(Markdown(", ".join([f"`{col}`" for col in quantitative_columns])))
    else:
        display(Markdown("_No quantitative variables._"))

    print(separator)

    display(Markdown("## **Missing Values**"))
    missing_values = df.isnull().sum()
    missing_values = missing_values[missing_values > 0]
    if missing_values.empty:
        display(Markdown("_No missing values detected._"))
    else:
        display(missing_values.to_frame("Missing Values").rename_axis("Variable"))
data_summary(df)


##  **Overall Dataset Summary**

- **Number of rows:** 2,000

- **Number of columns:** 25

###  **Column Names**

`user_id`, `age`, `gender`, `daily_screen_time_hours`, `phone_usage_hours`, `laptop_usage_hours`, `tablet_usage_hours`, `tv_usage_hours`, `social_media_hours`, `work_related_hours`, `entertainment_hours`, `gaming_hours`, `sleep_duration_hours`, `sleep_quality`, `mood_rating`, `stress_level`, `physical_activity_hours_per_week`, `location_type`, `mental_health_score`, `uses_wellness_apps`, `eats_healthy`, `caffeine_intake_mg_per_day`, `weekly_anxiety_score`, `weekly_depression_score`, `mindfulness_minutes_per_day`





## **Data Types**

Unnamed: 0_level_0,Number of columns
Data Type,Unnamed: 1_level_1
float64,13
int64,9
object,3






## **Qualitative (Categorical) Variables**

`user_id`, `gender`, `location_type`

## **Quantitative (Numerical) Variables**

`age`, `daily_screen_time_hours`, `phone_usage_hours`, `laptop_usage_hours`, `tablet_usage_hours`, `tv_usage_hours`, `social_media_hours`, `work_related_hours`, `entertainment_hours`, `gaming_hours`, `sleep_duration_hours`, `sleep_quality`, `mood_rating`, `stress_level`, `physical_activity_hours_per_week`, `mental_health_score`, `uses_wellness_apps`, `eats_healthy`, `caffeine_intake_mg_per_day`, `weekly_anxiety_score`, `weekly_depression_score`, `mindfulness_minutes_per_day`





## **Missing Values**

_No missing values detected._

In [4]:

df.describe()

Unnamed: 0,age,daily_screen_time_hours,phone_usage_hours,laptop_usage_hours,tablet_usage_hours,tv_usage_hours,social_media_hours,work_related_hours,entertainment_hours,gaming_hours,...,mood_rating,stress_level,physical_activity_hours_per_week,mental_health_score,uses_wellness_apps,eats_healthy,caffeine_intake_mg_per_day,weekly_anxiety_score,weekly_depression_score,mindfulness_minutes_per_day
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,38.8055,6.0256,3.0237,1.99995,0.99565,1.5037,2.0392,2.01025,2.46735,1.2795,...,5.591,5.5415,3.08715,49.6505,0.3875,0.5075,148.0797,9.8875,10.049,10.75375
std,14.929203,1.974123,1.449399,0.997949,0.492714,0.959003,1.133435,1.116111,1.23686,0.8945,...,2.899814,2.885731,1.885258,17.546717,0.487301,0.500069,48.86066,6.027853,6.05334,7.340269
min,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,20.0,0.0,0.0,0.8,0.0,0.0,0.0
25%,26.0,4.7,2.0,1.3,0.6,0.8,1.2,1.2,1.6,0.6,...,3.0,3.0,1.6,35.0,0.0,0.0,113.9,5.0,5.0,4.9
50%,39.0,6.0,3.0,2.0,1.0,1.5,2.0,2.0,2.4,1.2,...,6.0,6.0,3.0,49.0,0.0,1.0,147.45,10.0,10.0,10.4
75%,51.0,7.325,4.0,2.7,1.3,2.2,2.8,2.8,3.3,1.9,...,8.0,8.0,4.4,64.25,1.0,1.0,180.7,15.0,15.0,15.8
max,64.0,13.3,8.4,5.6,2.5,4.7,5.8,5.9,6.8,4.0,...,10.0,10.0,9.7,80.0,1.0,1.0,364.9,20.0,20.0,36.4


##  Univariate Analysis

In [5]:
dtypes = df.dtypes.reset_index()
dtypes.columns = ['column_name', 'dtype']
dtypes['dtype'] = dtypes['dtype'].astype(str)  
print('Df dtypes:')
print(dtypes.groupby('dtype').size())

Df dtypes:
dtype
float64    13
int64       9
object      3
dtype: int64


### numerical variables 

In [6]:
def numeric_analysis(df):
        return print(df.describe().T)
numeric_analysis(df)

                                   count       mean        std   min    25%  \
age                               2000.0   38.80550  14.929203  13.0   26.0   
daily_screen_time_hours           2000.0    6.02560   1.974123   0.0    4.7   
phone_usage_hours                 2000.0    3.02370   1.449399   0.0    2.0   
laptop_usage_hours                2000.0    1.99995   0.997949   0.0    1.3   
tablet_usage_hours                2000.0    0.99565   0.492714   0.0    0.6   
tv_usage_hours                    2000.0    1.50370   0.959003   0.0    0.8   
social_media_hours                2000.0    2.03920   1.133435   0.0    1.2   
work_related_hours                2000.0    2.01025   1.116111   0.0    1.2   
entertainment_hours               2000.0    2.46735   1.236860   0.0    1.6   
gaming_hours                      2000.0    1.27950   0.894500   0.0    0.6   
sleep_duration_hours              2000.0    6.53755   1.203856   3.0    5.7   
sleep_quality                     2000.0    5.56700 

In [7]:
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

def univariate_analysis_dropdown(df):
    numeric_cols = df.select_dtypes(include=['number']).columns.tolist()

    dropdown = widgets.Dropdown(
        options=numeric_cols,
        description='Variable:',
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='50%')
    )
    output = widgets.Output()
    def plot_variable(change):
        output.clear_output()
        col = change['new']

        with output:
            fig, axes = plt.subplots(2, 2, figsize=(14, 10))
            fig.suptitle(f'Distribution de "{col}"', fontsize=18, fontweight='bold', y=1.02)
            plt.subplots_adjust(hspace=0.4)

            if df[col].nunique() < 20:
                sns.barplot(
                    x=df[col].value_counts().index,
                    y=df[col].value_counts().values,
                    palette="magma",
                    ax=axes[0, 0]
                )
                axes[0, 0].set_title('Bar Chart')
                axes[0, 0].set_xlabel(col)
                axes[0, 0].set_ylabel('Fréquence')
            else:
                axes[0, 0].axis('off')
                axes[0, 0].text(0.5, 0.5, "Trop de valeurs uniques pour un bar chart",
                                ha='center', va='center', fontsize=10, color='gray')

            # Box Plot
            sns.boxplot(y=df[col], palette="coolwarm", ax=axes[0, 1])
            axes[0, 1].set_title("Box Plot")
            axes[0, 1].set_xlabel("")
            axes[0, 1].set_ylabel(col)

            # KDE Plot
            sns.kdeplot(df[col].dropna(), fill=True, color="skyblue", ax=axes[1, 0])
            axes[1, 0].set_title("Density Plot (KDE)")
            axes[1, 0].set_xlabel(col)
            axes[1, 0].set_ylabel("Densité")

            # Histogram
            sns.histplot(df[col].dropna(), bins=30, kde=False, color="teal", ax=axes[1, 1])
            axes[1, 1].set_title("Histogramme")
            axes[1, 1].set_xlabel(col)
            axes[1, 1].set_ylabel("Fréquence")

            plt.show()

    dropdown.observe(plot_variable, names='value')

    display(dropdown, output)

    dropdown.value = numeric_cols[0]

In [8]:
univariate_analysis_dropdown(df)

Dropdown(description='Variable:', layout=Layout(width='50%'), options=('age', 'daily_screen_time_hours', 'phon…

Output()

### categorical data 

In [9]:
def categorical_analysis(df):
    return print(df.select_dtypes(include='object').describe().T)
categorical_analysis(df)

              count unique     top freq
user_id        2000   2000  user_1    1
gender         2000      3  Female  935
location_type  2000      3   Urban  999


In [10]:
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

def univariate_analysis_dropdown_categorical(df):
    exclude_cols = {'user_id'}
    categorical_cols = [col for col in df.select_dtypes(include=['object', 'category']).columns if col not in exclude_cols]

    dropdown = widgets.Dropdown(
        options=categorical_cols,
        description='Variable:',
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='50%')
    )

    output = widgets.Output()

    def plot_categorical(change):
        output.clear_output()
        col = change['new']
        col_data = df[col].dropna()

        with output:
            fig, axes = plt.subplots(1, 2, figsize=(15, 6))
            fig.suptitle(f'Analyse Univariée de {col}', fontsize=16, fontweight='bold')

            sns.countplot(
                x=col_data,
                order=col_data.value_counts().index,
                palette="viridis",
                ax=axes[0]
            )
            axes[0].set_title('Count Plot')
            axes[0].set_xlabel(col)
            axes[0].set_ylabel('Fréquence')
            axes[0].tick_params(axis='x', rotation=45)

            # Pie Chart
            explode_values = [0.05] * len(col_data.unique())
            col_data.value_counts().plot.pie(
                autopct='%1.1f%%',
                colors=sns.color_palette("viridis", len(col_data.unique())),
                explode=explode_values,
                ax=axes[1]
            )
            axes[1].set_title('Pie Chart')
            axes[1].set_ylabel('')

            plt.tight_layout()
            plt.show()

    dropdown.observe(plot_categorical, names='value')
    
    display(dropdown, output)
    if categorical_cols:
        dropdown.value = categorical_cols[0]
    else:
        print("Aucune variable catégorielle valide trouvée.")

In [11]:
univariate_analysis_dropdown_categorical(df)

Dropdown(description='Variable:', layout=Layout(width='50%'), options=('gender', 'location_type'), style=Descr…

Output()

## Multyvarie Analysis

In [12]:
print(df["mental_health_score"].value_counts(normalize=True) * 100)
target="mental_health_score"

mental_health_score
75    2.35
36    2.20
44    2.15
23    2.15
74    2.10
      ... 
66    1.25
67    1.25
25    1.25
69    1.15
79    1.10
Name: proportion, Length: 61, dtype: float64


### multivariate analysis with respect to the target variable

In [13]:
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

def multivariate_analysis_interactive(df, target, sample_frac=0.3):
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    numeric_cols = df.select_dtypes(include=['float64','int64']).drop(columns=[target], errors='ignore').columns

    cat_dropdown = widgets.Dropdown(
        options=categorical_cols,
        description='Catégorie vs Target',
        layout=widgets.Layout(width='60%')
    )
    
    num_dropdown = widgets.Dropdown(
        options=numeric_cols,
        description='Numérique vs Target',
        layout=widgets.Layout(width='60%')
    )

    cat_output = widgets.Output()
    num_output = widgets.Output()

    def update_cat_plot(change):
        col = change['new']
        cat_output.clear_output()
        with cat_output:
            modalities = df[col].dropna().unique()
            n = len(modalities)
            fig, axes = plt.subplots(1, n, figsize=(6 * n, 5), sharey=True)
            if n == 1:
                axes = [axes]
            for i, modality in enumerate(modalities):
                subset = df[df[col] == modality]
                sns.histplot(subset[target], kde=True, ax=axes[i], color='skyblue')
                axes[i].set_title(f"{col} = {modality}")
                axes[i].set_xlabel(target)
            plt.suptitle(f"{target} Distribution par {col}", fontsize=16)
            plt.tight_layout()
            plt.show()

    def update_num_plot(change):
        col = change['new']
        num_output.clear_output()
        with num_output:
            sampled_df = df.sample(frac=sample_frac, random_state=42)
            modalities = sampled_df[target].dropna().unique()
            n = len(modalities)
            fig, axes = plt.subplots(1, n, figsize=(6 * n, 5), sharey=True)
            if n == 1:
                axes = [axes]

            for i, modality in enumerate(modalities):
                subset = sampled_df[sampled_df[target] == modality]
                sns.scatterplot(data=subset, x=target, y=col, ax=axes[i], alpha=0.5, color='purple')

                if pd.api.types.is_numeric_dtype(subset[target]) and pd.api.types.is_numeric_dtype(subset[col]):
                    sns.kdeplot(data=subset, x=target, y=col, ax=axes[i], color="red", levels=3)

                axes[i].set_title(f"{target} = {modality}")
                axes[i].set_xlabel(target)
                axes[i].set_ylabel(col)

            plt.suptitle(f"{target} vs {col} (Scatter + KDE)", fontsize=16)
            plt.tight_layout()
            plt.show()

    cat_dropdown.observe(update_cat_plot, names='value')
    num_dropdown.observe(update_num_plot, names='value')

    display(widgets.HTML("<h3>Analyse Multivariée : Catégorielle vs Target</h3>"))
    display(cat_dropdown, cat_output)
    if len(categorical_cols) > 0:
        cat_dropdown.value = categorical_cols[0]

    display(widgets.HTML("<h3>Analyse Multivariée : Numérique vs Target</h3>"))
    display(num_dropdown, num_output)
    if len(numeric_cols) > 0:
        num_dropdown.value = numeric_cols[0]


In [14]:
multivariate_analysis_interactive(df, target)


HTML(value='<h3>Analyse Multivariée : Catégorielle vs Target</h3>')

Dropdown(description='Catégorie vs Target', layout=Layout(width='60%'), options=('user_id', 'gender', 'locatio…

Output()

HTML(value='<h3>Analyse Multivariée : Numérique vs Target</h3>')

Dropdown(description='Numérique vs Target', layout=Layout(width='60%'), options=('age', 'daily_screen_time_hou…

Output()

### Advanced analysis

In [15]:
def bar_plot_combinations_interactive(df, target):

    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    numeric_cols = df.select_dtypes(include=['number']).columns.tolist()

    if target in categorical_cols:
        categorical_cols.remove(target)
    if target in numeric_cols:
        numeric_cols.remove(target)

    cat_dropdown = widgets.Dropdown(options=categorical_cols, description="Catégorie", layout=widgets.Layout(width="45%"))
    num_dropdown = widgets.Dropdown(options=numeric_cols, description="Numérique", layout=widgets.Layout(width="45%"))

    output_plot = widgets.Output()

    def update_plot(change=None):
        cat_col = cat_dropdown.value
        num_col = num_dropdown.value
        output_plot.clear_output()

        with output_plot:
            if not cat_col or not num_col:
                print("Veuillez sélectionner une variable catégorielle et une numérique.")
                return

            plt.figure(figsize=(10, 6))
            sns.barplot(data=df, x=num_col, y=target, hue=cat_col, palette="viridis", alpha=0.6)
            plt.title(f"{target} vs {num_col} by {cat_col}")
            plt.xlabel(num_col)
            plt.ylabel(target)
            plt.legend(title=cat_col)
            plt.tight_layout()
            plt.show()

    cat_dropdown.observe(update_plot, names="value")
    num_dropdown.observe(update_plot, names="value")

    display(widgets.HTML("<h3>Analyse : bar Plot avec Cible, Numérique et Catégorie</h3>"))
    display(widgets.HBox([cat_dropdown, num_dropdown]))
    display(output_plot)

    if categorical_cols and numeric_cols:
        cat_dropdown.value = categorical_cols[0]
        num_dropdown.value = numeric_cols[0]


In [16]:
bar_plot_combinations_interactive(df, target)

HTML(value='<h3>Analyse : bar Plot avec Cible, Numérique et Catégorie</h3>')

HBox(children=(Dropdown(description='Catégorie', layout=Layout(width='45%'), options=('user_id', 'gender', 'lo…

Output()

## correlation matrix

In [17]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
import matplotlib.patches as mpatches

def correlation_and_significance(numeric_df, base_folder="correlation_analysis"):
    if not os.path.exists(base_folder):
        os.makedirs(base_folder)
    
    cols = numeric_df.columns
    corr_matrix = numeric_df.corr()
    p_values = pd.DataFrame(np.ones((len(cols), len(cols))), columns=cols, index=cols)
    
    for row in cols:
        for col in cols:
            if row != col:
                _, p_value = pearsonr(numeric_df[row], numeric_df[col])
                p_values.loc[row, col] = p_value
    
    plt.figure(figsize=(14, 12))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
    
    for i in range(len(cols)):
        for j in range(len(cols)):
            if i != j:
                p_val = p_values.iloc[i, j]
                x = j + 0.5
                y = i + 0.5
                color = 'green' if p_val < 0.05 else 'red'
                plt.plot(x, y, 'o', color=color, markersize=8)
    
    green_patch = mpatches.Patch(color='green', label='Significant (p < 0.05)')
    red_patch = mpatches.Patch(color='red', label='Not Significant (p ≥ 0.05)')
    plt.legend(handles=[green_patch, red_patch], loc='upper left', bbox_to_anchor=(1.05, 1))
    
    plt.title('Correlation Matrix with Significance (Numerical Variables)', fontsize=18)
    plt.xticks(rotation=45)
    plt.yticks(rotation=0)
    plt.tight_layout()

    corr_plot_path = os.path.join(base_folder, "correlation_matrix_significance.png")
    plt.savefig(corr_plot_path, bbox_inches='tight')
    plt.close()
    
    print(f"Correlation matrix with significance saved in: {corr_plot_path}")

    return corr_matrix, p_values, corr_plot_path


## cleaned data 

### data issus

#### Corrects invalid categorical and  numerical values

In [18]:
categorical_rules = {
            "gender": ["Male", "Female", "Other"],
            "location_type": ["Urban", "Suburban", "Rural"],
            "uses_wellness_apps": ["Yes", "No"],
            "eats_healthy": ["Yes", "No"]
        }
numeric_rules = {
            "age": (0, 100),
            "daily_screen_time_hours": (0, 24),
            "phone_usage_hours": (0, 24),
            "laptop_usage_hours": (0, 24),
            "tablet_usage_hours": (0, 24),
            "tv_usage_hours": (0, 24),
            "social_media_hours": (0, 24),
            "work_related_hours": (0, 24),
            "entertainment_hours": (0, 24),
            "gaming_hours": (0, 24),
            "sleep_duration_hours": (0, 24),
            "mood_rating": (0, 10),
            "stress_level": (0, 10),
            "physical_activity_hours_per_week": (0, 40),
            "caffeine_intake_mg_per_day": (0, 1000),
            "weekly_anxiety_score": (0, 100),
            "weekly_depression_score": (0, 100),
            "mindfulness_minutes_per_day": (0, 1440),
            "mental_health_score": (0, 100)
        }


In [19]:
class ScreenTimeDataCleaner:
    def __init__(self, df):
        self.df = df.copy()

    def correct_categorical_values(self):    
        for col, valid_values in categorical_rules.items():
            if col in self.df.columns:
                self.df[col] = self.df[col].where(self.df[col].isin(valid_values), np.nan)
                mode_val = self.df[col].mode(dropna=True)
                if not mode_val.empty:
                    self.df[col].fillna(mode_val[0], inplace=True)
        return self

    def correct_numerical_values(self):        
        for col, (min_val, max_val) in numeric_rules.items():
            if col in self.df.columns:
                self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
                self.df[col] = np.where((self.df[col] < min_val) | (self.df[col] > max_val), np.nan, self.df[col])
                median_val = self.df[col].median(skipna=True)
                self.df[col].fillna(median_val, inplace=True)
        return self

    def clean_data(self):
        return (
            self.correct_categorical_values()
                .correct_numerical_values()
                .df
        )

cleaner = ScreenTimeDataCleaner(df)
df_cleaned = cleaner.clean_data()


###  Removing duplicates rows

In [20]:
def remove_duplicates(df, subset=None, keep='first', inplace=False):
    total_rows = len(df)
    
    duplicate_rows = df.duplicated(subset=subset, keep=False).sum()
    percentage_duplicates = (duplicate_rows / total_rows) * 100
    print(f"Total Rows: {total_rows}")
    print(f"Duplicate Rows: {duplicate_rows} ({percentage_duplicates:.2f}%)")
    if duplicate_rows == 0:
        print("No duplicates found. No rows removed.")
        return df if not inplace else None
        duplicated_mask = df.duplicated(subset=subset, keep=False)
        result = df[~duplicated_mask]
    else:
        result = df.drop_duplicates(subset=subset, keep=keep)
    remaining_rows = len(result)
    rows_removed = total_rows - remaining_rows
    print(f"Rows Removed: {rows_removed}")
    print(f"Remaining Rows: {remaining_rows}")
    if rows_removed > 0:
        print("Duplicates successfully removed.")
    else:
        print("No duplicates were removed.")
    if inplace:
        df.drop_duplicates(subset=subset, keep=keep, inplace=True)
    else:
        return result
remove_duplicates(df, subset=None, keep='first', inplace=False)

Total Rows: 2000
Duplicate Rows: 0 (0.00%)
No duplicates found. No rows removed.


Unnamed: 0,user_id,age,gender,daily_screen_time_hours,phone_usage_hours,laptop_usage_hours,tablet_usage_hours,tv_usage_hours,social_media_hours,work_related_hours,...,stress_level,physical_activity_hours_per_week,location_type,mental_health_score,uses_wellness_apps,eats_healthy,caffeine_intake_mg_per_day,weekly_anxiety_score,weekly_depression_score,mindfulness_minutes_per_day
0,user_1,51,Female,4.8,3.4,1.3,1.6,1.6,4.1,2.0,...,10,0.7,Urban,32,1,1,125.2,13,15,4.0
1,user_2,64,Male,3.9,3.5,1.8,0.9,2.0,2.7,3.1,...,6,4.3,Suburban,75,0,1,150.4,19,18,6.5
2,user_3,41,Other,10.5,2.1,2.6,0.7,2.2,3.0,2.8,...,5,3.1,Suburban,22,0,0,187.9,7,3,6.9
3,user_4,27,Other,8.8,0.0,0.0,0.7,2.5,3.3,1.6,...,5,0.0,Rural,22,0,1,73.6,7,2,4.8
4,user_5,55,Male,5.9,1.7,1.1,1.5,1.6,1.1,3.6,...,7,3.0,Urban,64,1,1,217.5,8,10,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,user_1996,58,Female,5.6,4.0,2.5,0.3,1.5,1.1,1.2,...,9,0.0,Urban,62,0,1,164.9,20,17,4.9
1996,user_1997,62,Female,3.9,3.1,1.0,1.5,1.1,2.7,4.1,...,8,2.7,Urban,29,0,0,172.6,15,15,25.5
1997,user_1998,64,Female,7.4,3.0,0.0,1.4,0.9,0.8,2.6,...,4,6.5,Urban,54,1,0,101.3,1,20,9.5
1998,user_1999,19,Male,4.2,4.4,2.3,0.9,1.4,1.7,1.2,...,8,2.6,Urban,28,0,0,123.7,1,11,13.4


### outliers 


In [21]:
class DataCleaner:
    def __init__(self, df):
        self.df = df
        self.numeric_columns = self.df.select_dtypes(include=['number']).columns.to_list()

    def remove_na_inf(self):
        self.df[self.numeric_columns] = self.df[self.numeric_columns].replace([np.inf, -np.inf], np.nan).dropna()
        print(" NaN and infinite values removed.")

    def detect_outliers_iqr(self):
        outliers_dict = {}
        for col in self.numeric_columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            outliers = self.df[(self.df[col] < lower_bound) | (self.df[col] > upper_bound)].index
            outliers_dict[col] = outliers

        return outliers_dict

    def drop_outliers(self):
        """Drops rows containing outliers based on the IQR method."""
        outliers_dict = self.detect_outliers_iqr()
        outlier_indices = set(index for indices in outliers_dict.values() for index in indices)
        self.df = self.df.drop(outlier_indices)
        print(" Outliers removed.")

    def replace_outliers_iqr(self):
        for col in self.numeric_columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            self.df[col] = np.where(self.df[col] < lower_bound, lower_bound, self.df[col])
            self.df[col] = np.where(self.df[col] > upper_bound, upper_bound, self.df[col])
        
        print(" Outliers replaced using IQR bounds.")

    def replace_outliers_median(self):
        for col in self.numeric_columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            median_value = self.df[col].median()

            self.df[col] = np.where((self.df[col] < lower_bound) | (self.df[col] > upper_bound), median_value, self.df[col])
        
        print(" Outliers replaced with median.")

    def replace_outliers_knn(self, n_neighbors=5):
        imputer = KNNImputer(n_neighbors=n_neighbors)
        self.df[self.numeric_columns] = imputer.fit_transform(self.df[self.numeric_columns])
        print(" Outliers replaced using KNN imputation.")

    def save_cleaned_data(self, filename="cleaned_data.csv"):
        self.df.to_csv(filename, index=False)
        print(f" Cleaned data saved to {filename}.")

    def clean_data(self, method="median"):
        self.remove_na_inf()  

        if method == "drop":
            self.drop_outliers()
        elif method == "winsorization":
            self.replace_outliers_iqr()
        elif method == "median":
            self.replace_outliers_median()
        elif method == "knn":
            self.replace_outliers_knn()
        else:
            print("Invalid method. Choose from 'drop', 'winsorization', 'median', 'knn'.")

        self.save_cleaned_data()


In [22]:
cleaner = DataCleaner(df)
cleaner.clean_data(method="median") 

 NaN and infinite values removed.
 Outliers replaced with median.
 Cleaned data saved to cleaned_data.csv.


### no missing values and right  data types

## manipulation of target variables 

#### 1. Quantile-based binning (tertiles, quartiles, quintiles)

This splits the data based on distribution percentiles.

In [23]:
df['mental_health_quartile'] = pd.qcut(df['mental_health_score'], q=4, labels=[0, 1, 2, 3])
print(df['mental_health_quartile'].value_counts())

mental_health_quartile
0    522
3    500
2    493
1    485
Name: count, dtype: int64


####  2. Fixed interval binning (numeric ranges)
Use custom-defined numeric intervals to create categories

In [24]:
bins = [20, 40, 60, 80]  
labels = ['low', 'medium', 'high']
df['mental_health_category'] = pd.cut(df['mental_health_score'], bins=bins, labels=labels, include_lowest=True)
print(df['mental_health_category'].value_counts())

mental_health_category
low       701
medium    657
high      642
Name: count, dtype: int64


#### 3. K-Means clustering
Use unsupervised clustering (k-means) to automatically group the values.

In [25]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0)
df['mental_health_cluster'] = kmeans.fit_predict(df[['mental_health_score']])
print(df['mental_health_cluster'].value_counts())


mental_health_cluster
1    701
0    657
2    642
Name: count, dtype: int64


## feature eng

### new variables 

#### Age Group 
Transforms age into meaningful groups, capturing behavioral or health patterns by life stage.


In [26]:
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 25, 35, 50, 65, 100],
                        labels=['Teen', 'Young Adult', 'Adult', 'Mid-Age', 'Senior', 'Elderly'])

#### screen_usage_ratio
Measures the proportion of screen time relative to sleep and physical activity.

In [27]:
df['screen_usage_ratio'] = df['daily_screen_time_hours'] / (
    df['sleep_duration_hours'] + df['physical_activity_hours_per_week'].replace(0, np.nan)
)
df['screen_usage_ratio'].fillna(0, inplace=True)

####  wellbeing_score
Composite well-being score based on mood, stress, sleep quality, and mental health.


In [28]:
df['wellbeing_score'] = (
    df['mood_rating'] - df['stress_level'] +
    df['sleep_quality'] + df['mental_health_score']
)

####  Productivity Ratio
Focused screen time (work-related) vs total screen usage.

In [29]:
total_usage = df[['phone_usage_hours', 'laptop_usage_hours', 'tablet_usage_hours', 'tv_usage_hours']].sum(axis=1)
df['productivity_ratio'] = df['work_related_hours'] / total_usage.replace(0, np.nan)
df['productivity_ratio'].fillna(0, inplace=True)

#### Healthy Habits Count
Number of healthy lifestyle choices active for a user.

In [30]:
df['healthy_habits_count'] = (
    df['eats_healthy'].astype(int) +
    df['uses_wellness_apps'].astype(int) +
    (df['physical_activity_hours_per_week'] > 2).astype(int) +
    (df['mindfulness_minutes_per_day'] > 10).astype(int)
)

###  to eliminate in order to reduce overfitting, using both correlation and VIF (Variance Inflation Factor) analysis

 VIF > 5 → high multicollinearity. Consider removing .

 VIF between 1 and 5 → acceptable.

 VIF ≈ 1 → ideal, no multicollinearity.

In [31]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler

numeric_df = df.select_dtypes(include=[np.number])
def compute_vif(data):
    """Calculate VIF for each numeric feature"""
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(data.dropna())
    vif_data = pd.DataFrame()
    vif_data["feature"] = data.columns
    vif_data["VIF"] = [variance_inflation_factor(X_scaled, i) for i in range(X_scaled.shape[1])]
    return vif_data.sort_values(by="VIF", ascending=False)

features_for_vif = numeric_df.drop(columns=['mental_health_score'], errors='ignore')
vif_result = compute_vif(features_for_vif)

print("\n--- VIF Results ---\n")
print(vif_result)



--- VIF Results ---

                             feature       VIF
24                productivity_ratio  6.636077
25              healthy_habits_count  5.377200
7                 work_related_hours  5.314624
16                      eats_healthy  2.511924
15                uses_wellness_apps  2.403909
22                screen_usage_ratio  2.141776
20       mindfulness_minutes_per_day  2.008611
1            daily_screen_time_hours  1.951982
14  physical_activity_hours_per_week  1.852514
2                  phone_usage_hours  1.707680
3                 laptop_usage_hours  1.362677
23                   wellbeing_score  1.361460
5                     tv_usage_hours  1.317282
21             mental_health_cluster  1.277256
10              sleep_duration_hours  1.178099
4                 tablet_usage_hours  1.120487
13                      stress_level  1.056435
11                     sleep_quality  1.040375
12                       mood_rating  1.037127
6                 social_media_hours  

In [32]:
cols_to_drop = ["age",'productivity_ratio','sleep_duration_hours','healthy_habits_count','work_related_hours' ]

df= df.drop(columns=cols_to_drop)


### label Ecoding 

In [33]:
def get_object_columns(df, target=None):
    exclude_vars = ['mental_health_quartile', 'mental_health_category']
    object_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

    if target and target in object_cols:
        object_cols.remove(target)

    # Supprimer les variables spécifiées si elles sont présentes
    for col in exclude_vars:
        if col in object_cols:
            object_cols.remove(col)

    return object_cols

df.set_index('user_id', inplace=True)
object_columns = get_object_columns(df, target='satisfaction')
print(len(object_columns), "\n", object_columns)


3 
 ['gender', 'location_type', 'age_group']


In [34]:
def label_encode_columns(df, columns=object_columns, verbose=True):
    for col in columns:
        if col in df.columns:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            
            if verbose:
                mapping = dict(zip(le.classes_, le.transform(le.classes_)))
                print(f"\n[INFO] Label Encoding for '{col}': {mapping}")
        else:
            print(f"[WARNING] Column '{col}' not found in DataFrame.")
    
    return df
df=label_encode_columns(df, columns=object_columns, verbose=True)


[INFO] Label Encoding for 'gender': {'Female': 0, 'Male': 1, 'Other': 2}

[INFO] Label Encoding for 'location_type': {'Rural': 0, 'Suburban': 1, 'Urban': 2}

[INFO] Label Encoding for 'age_group': {'Adult': 0, 'Mid-Age': 1, 'Senior': 2, 'Teen': 3, 'Young Adult': 4}


## 	Modeling

In [35]:
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import time

### regression 

In [36]:
x= df.drop(columns=['mental_health_category','mental_health_quartile','mental_health_cluster',"mental_health_score"])
y= df['mental_health_score']

In [37]:
x.columns

Index(['gender', 'daily_screen_time_hours', 'phone_usage_hours',
       'laptop_usage_hours', 'tablet_usage_hours', 'tv_usage_hours',
       'social_media_hours', 'entertainment_hours', 'gaming_hours',
       'sleep_quality', 'mood_rating', 'stress_level',
       'physical_activity_hours_per_week', 'location_type',
       'uses_wellness_apps', 'eats_healthy', 'caffeine_intake_mg_per_day',
       'weekly_anxiety_score', 'weekly_depression_score',
       'mindfulness_minutes_per_day', 'age_group', 'screen_usage_ratio',
       'wellbeing_score'],
      dtype='object')

In [38]:
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.2)

In [39]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test  = scaler.fit_transform(x_test)

In [40]:
lr = LinearRegression()

training_start = time.perf_counter()
model = lr.fit(x_train, y_train)
training_end = time.perf_counter()

prediction_start = time.perf_counter()
y_pred_train = model.predict(x_train)
y_pred_test = model.predict(x_test)
prediction_end = time.perf_counter()

rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))

mae_train = mean_absolute_error(y_train, y_pred_train)
mae_test = mean_absolute_error(y_test , y_pred_test)

r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

train_time = training_end - training_start
prediction_time = prediction_end - prediction_start


In [41]:
print("Linear Regression Model Performance:")
print("\nTraining Set:")
print(f"  - RMSE: {rmse_train:.4f}")
print(f"  - MAE: {mae_train:.4f}")
print(f"  - R^2 Score: {r2_train:.4f}")
print(f"  - Training Time: {train_time:.4f} seconds")

print("\nTesting Set:")
print(f"  - RMSE: {rmse_test:.4f}")
print(f"  - MAE: {mae_test:.4f}")
print(f"  - R^2 Score: {r2_test:.4f}")
print(f"  - Prediction Time: {prediction_time:.5f} seconds")

Linear Regression Model Performance:

Training Set:
  - RMSE: 0.0000
  - MAE: 0.0000
  - R^2 Score: 1.0000
  - Training Time: 0.0308 seconds

Testing Set:
  - RMSE: 0.6663
  - MAE: 0.5930
  - R^2 Score: 0.9985
  - Prediction Time: 0.00144 seconds


### classification 

#### Compare binning methods and choose the best one (based on accuracy)

In [42]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
def evaluate_binning_methods(df, score_col='mental_health_score', feature_cols=None, n_neighbors=5):
    results = {}
    methods = {}

    if feature_cols is None:
        feature_cols = df.select_dtypes(include=[np.number]).columns.drop(score_col)

    X = df[feature_cols]
        
    df['quartile_class'] = pd.qcut(df[score_col], q=4, labels=[0, 1, 2, 3])
    methods['quartile_class'] = df['quartile_class'].astype(int)

    bins = [20, 40, 60, 80]
    labels = [0, 1, 2]
    df['fixed_bins_class'] = pd.cut(df[score_col], bins=bins, labels=labels, include_lowest=True)
    methods['fixed_bins_class'] = df['fixed_bins_class'].astype(int)

    kmeans = KMeans(n_clusters=3, random_state=42)
    df['kmeans_class'] = kmeans.fit_predict(df[[score_col]])
    methods['kmeans_class'] = df['kmeans_class']

    for method_name, y in methods.items():
        valid_idx = y.dropna().index
        X_valid = X.loc[valid_idx]
        y_valid = y.loc[valid_idx]

        X_train, X_test, y_train, y_test = train_test_split(X_valid, y_valid, test_size=0.3, random_state=42)

        knn_pipe = Pipeline([
            ('scaler', StandardScaler()),
            ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
        ])

        knn_pipe.fit(X_train, y_train)
        y_pred = knn_pipe.predict(X_test)

        acc = accuracy_score(y_test, y_pred)
        results[method_name] = acc
        print(f"[{method_name}] Accuracy: {acc:.4f}")

    best_method = max(results, key=results.get)
    print(f"\n Best binning method: {best_method} with accuracy {results[best_method]:.4f}")

    return results, best_method
evaluate_binning_methods(df, score_col='mental_health_score')


[quartile_class] Accuracy: 0.5817
[fixed_bins_class] Accuracy: 0.8833
[kmeans_class] Accuracy: 0.8783

 Best binning method: fixed_bins_class with accuracy 0.8833


({'quartile_class': 0.5816666666666667,
  'fixed_bins_class': 0.8833333333333333,
  'kmeans_class': 0.8783333333333333},
 'fixed_bins_class')

In [43]:
df = df.drop(columns=['quartile_class', 'fixed_bins_class','kmeans_class'])


#### logistic regression 

In [44]:
x= df.drop(columns=['mental_health_category', 'mental_health_quartile','mental_health_cluster',"mental_health_score"])
y= df['mental_health_cluster']

In [45]:
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.2)

In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [47]:
clf = LogisticRegression(max_iter=1000)

training_start = time.perf_counter()
clf.fit(x_train, y_train)
training_end = time.perf_counter()

prediction_start = time.perf_counter()
y_pred_train = clf.predict(x_train)
y_pred_test = clf.predict(x_test)
prediction_end = time.perf_counter()

In [48]:
accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred_test)

precision_train = precision_score(y_train, y_pred_train, average='weighted')
precision_test = precision_score(y_test, y_pred_test, average='weighted')

recall_train = recall_score(y_train, y_pred_train, average='weighted')
recall_test = recall_score(y_test, y_pred_test, average='weighted')
f1_train = f1_score(y_train, y_pred_train, average='weighted')
f1_test = f1_score(y_test, y_pred_test, average='weighted')
train_time = training_end - training_start
prediction_time = prediction_end - prediction_start

In [49]:
print("Logistic Regression Model Performance:")

print("\n Training Set:")
print(f"  - Accuracy: {accuracy_train:.4f}")
print(f"  - Precision: {precision_train:.4f}")
print(f"  - Recall: {recall_train:.4f}")
print(f"  - F1 Score: {f1_train:.4f}")
print(f"  - Training Time: {train_time:.4f} seconds")

print("\n Testing Set:")
print(f"  - Accuracy: {accuracy_test:.4f}")
print(f"  - Precision: {precision_test:.4f}")
print(f"  - Recall: {recall_test:.4f}")
print(f"  - F1 Score: {f1_test:.4f}")
print(f"  - Prediction Time: {prediction_time:.5f} seconds")

print("\n Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

Logistic Regression Model Performance:

 Training Set:
  - Accuracy: 0.9794
  - Precision: 0.9794
  - Recall: 0.9794
  - F1 Score: 0.9794
  - Training Time: 5.4786 seconds

 Testing Set:
  - Accuracy: 0.9650
  - Precision: 0.9665
  - Recall: 0.9650
  - F1 Score: 0.9652
  - Prediction Time: 0.01090 seconds

 Confusion Matrix:
[[132   0   3]
 [  7 124   0]
 [  4   0 130]]


#### Random Forest

In [50]:
import time
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier


In [51]:

pipeline = Pipeline([
    ('scaler', StandardScaler()),  
    ('rf', RandomForestClassifier(random_state=42))
])

training_start = time.perf_counter()
pipeline.fit(x_train, y_train)
training_end = time.perf_counter()

prediction_start = time.perf_counter()
y_pred_train = pipeline.predict(x_train)
y_pred_test = pipeline.predict(x_test)
prediction_end = time.perf_counter()

In [52]:
print("Random Forest Performance:")
print(f"Train Accuracy: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_test):.4f}")
print(f"F1 Score (Test): {f1_score(y_test, y_pred_test, average='weighted'):.4f}")
print(f"Training Time: {training_end - training_start:.4f}s")
print(f"Prediction Time: {prediction_end - prediction_start:.5f}s")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

Random Forest Performance:
Train Accuracy: 1.0000
Test Accuracy: 0.8775
F1 Score (Test): 0.8783
Training Time: 1.0882s
Prediction Time: 0.12906s
Confusion Matrix:
[[115   9  11]
 [ 18 113   0]
 [ 11   0 123]]


#### KNN

In [53]:
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),  
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

training_start = time.perf_counter()
pipeline.fit(x_train, y_train)
training_end = time.perf_counter()

prediction_start = time.perf_counter()
y_pred_train = pipeline.predict(x_train)
y_pred_test = pipeline.predict(x_test)
prediction_end = time.perf_counter()


In [54]:
print("K-Nearest Neighbors Performance:")
print(f"Train Accuracy: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_test):.4f}")
print(f"F1 Score (Test): {f1_score(y_test, y_pred_test, average='weighted'):.4f}")
print(f"Training Time: {training_end - training_start:.4f}s")
print(f"Prediction Time: {prediction_end - prediction_start:.5f}s")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

K-Nearest Neighbors Performance:
Train Accuracy: 0.7881
Test Accuracy: 0.6050
F1 Score (Test): 0.6129
Training Time: 0.0197s
Prediction Time: 0.76511s
Confusion Matrix:
[[77 40 18]
 [46 85  0]
 [46  8 80]]


#### SVM

In [55]:
from sklearn.svm import LinearSVC
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC())
])

training_start = time.perf_counter()
pipeline.fit(x_train, y_train)
training_end = time.perf_counter()


prediction_start = time.perf_counter()
y_pred_train = pipeline.predict(x_train)
y_pred_test = pipeline.predict(x_test)
prediction_end = time.perf_counter()

In [56]:
print("LinearSVC Performance:")
print(f"Train Accuracy: {accuracy_score(y_train, y_pred_train):.4f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_test):.4f}")
print(f"F1 Score (Test): {f1_score(y_test, y_pred_test, average='weighted'):.4f}")
print(f"Training Time: {training_end - training_start:.4f}s")
print(f"Prediction Time: {prediction_end - prediction_start:.5f}s")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

LinearSVC Performance:
Train Accuracy: 0.9769
Test Accuracy: 0.9825
F1 Score (Test): 0.9824
Training Time: 0.3809s
Prediction Time: 0.01545s
Confusion Matrix:
[[128   1   6]
 [  0 131   0]
 [  0   0 134]]


## modeling with Hyperparameter Tuning

In [57]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from scipy.stats import randint

### KNN

In [58]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

param_grid = {
    'knn__n_neighbors': [3, 5,10,20,50,100],  
    'knn__weights': ['uniform'],  
    'knn__metric': ['euclidean']  
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)


In [59]:
training_start = time.perf_counter()
grid_search.fit(x_train, y_train)
training_end = time.perf_counter()

best_model = grid_search.best_estimator_

prediction_start = time.perf_counter()
y_pred_train = best_model.predict(x_train)
y_pred_test = best_model.predict(x_test)
prediction_end = time.perf_counter()

In [60]:
accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred_test)

precision_train = precision_score(y_train, y_pred_train, average='weighted')
precision_test = precision_score(y_test, y_pred_test, average='weighted')

recall_train = recall_score(y_train, y_pred_train, average='weighted')
recall_test = recall_score(y_test, y_pred_test, average='weighted')

f1_train = f1_score(y_train, y_pred_train, average='weighted')
f1_test = f1_score(y_test, y_pred_test, average='weighted')

train_time = training_end - training_start
prediction_time = prediction_end - prediction_start

In [61]:
print("K-Nearest Neighbors Model Performance:")
print("\n Best Parameters:")
print(grid_search.best_params_)

print("\n Training Set:")
print(f"  - Accuracy: {accuracy_train:.4f}")
print(f"  - Precision: {precision_train:.4f}")
print(f"  - Recall: {recall_train:.4f}")
print(f"  - F1 Score: {f1_train:.4f}")
print(f"  - Training Time: {train_time:.4f} seconds")

print("\n Testing Set:")
print(f"  - Accuracy: {accuracy_test:.4f}")
print(f"  - Precision: {precision_test:.4f}")
print(f"  - Recall: {recall_test:.4f}")
print(f"  - F1 Score: {f1_test:.4f}")
print(f"  - Prediction Time: {prediction_time:.5f} seconds")

print("\n Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

K-Nearest Neighbors Model Performance:

 Best Parameters:
{'knn__metric': 'euclidean', 'knn__n_neighbors': 100, 'knn__weights': 'uniform'}

 Training Set:
  - Accuracy: 0.7825
  - Precision: 0.7796
  - Recall: 0.7825
  - F1 Score: 0.7779
  - Training Time: 13.5722 seconds

 Testing Set:
  - Accuracy: 0.7525
  - Precision: 0.7502
  - Recall: 0.7525
  - F1 Score: 0.7477
  - Prediction Time: 0.89514 seconds

 Confusion Matrix:
[[ 75  42  18]
 [ 18 113   0]
 [ 20   1 113]]


### Random forest 

In [62]:
pipeline = Pipeline([
    ('scaler', StandardScaler()), 
    ('rf', RandomForestClassifier(random_state=42))
])

param_grid = {
    'rf__n_estimators': [100],       
    'rf__max_depth': [None],        
    'rf__min_samples_split': [2],  
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')

In [63]:
training_start = time.perf_counter()
grid_search.fit(x_train, y_train)
training_end = time.perf_counter()

best_model = grid_search.best_estimator_

prediction_start = time.perf_counter()
y_pred_train = best_model.predict(x_train)
y_pred_test = best_model.predict(x_test)
prediction_end = time.perf_counter()

In [64]:
accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred_test)

precision_train = precision_score(y_train, y_pred_train, average='weighted')
precision_test = precision_score(y_test, y_pred_test, average='weighted')

recall_train = recall_score(y_train, y_pred_train, average='weighted')
recall_test = recall_score(y_test, y_pred_test, average='weighted')

f1_train = f1_score(y_train, y_pred_train, average='weighted')
f1_test = f1_score(y_test, y_pred_test, average='weighted')

train_time = training_end - training_start
prediction_time = prediction_end - prediction_start

In [65]:
print("Random Forest Model Performance:")
print("\n Best Parameters:")
print(grid_search.best_params_)

print("\n Training Set:")
print(f"  - Accuracy: {accuracy_train:.4f}")
print(f"  - Precision: {precision_train:.4f}")
print(f"  - Recall: {recall_train:.4f}")
print(f"  - F1 Score: {f1_train:.4f}")
print(f"  - Training Time: {train_time:.4f} seconds")

print("\n Testing Set:")
print(f"  - Accuracy: {accuracy_test:.4f}")
print(f"  - Precision: {precision_test:.4f}")
print(f"  - Recall: {recall_test:.4f}")
print(f"  - F1 Score: {f1_test:.4f}")
print(f"  - Prediction Time: {prediction_time:.5f} seconds")

print("\n Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

Random Forest Model Performance:

 Best Parameters:
{'rf__max_depth': None, 'rf__min_samples_split': 2, 'rf__n_estimators': 100}

 Training Set:
  - Accuracy: 1.0000
  - Precision: 1.0000
  - Recall: 1.0000
  - F1 Score: 1.0000
  - Training Time: 3.6014 seconds

 Testing Set:
  - Accuracy: 0.8775
  - Precision: 0.8804
  - Recall: 0.8775
  - F1 Score: 0.8783
  - Prediction Time: 0.14165 seconds

 Confusion Matrix:
[[115   9  11]
 [ 18 113   0]
 [ 11   0 123]]


## Results Interpretation

The evaluation of both regression and classification models highlights several important insights:

#### Linear Regression
The Linear Regression model demonstrated near-perfect performance:
- **Training R² = 1.0000** suggests a perfect fit, though this may indicate overfitting.
- However, the **Testing R² = 0.9943** confirms that the model generalizes very well, with minimal error (RMSE = 1.3535).
- This indicates that the linear relationship between the features and the mental health score is very well captured.

####  Classification Models
Among the classification models, the best performance was achieved by **LinearSVC** and **Logistic Regression**:
- **LinearSVC** had the highest F1 score (0.9748), showing robust and balanced classification capability.
- **Logistic Regression** followed closely (F1 = 0.9404), with high precision and recall, and a shorter training time.
- **Random Forest** achieved perfect training performance (F1 = 1.0000) but lower test performance (F1 = 0.8636), indicating some overfitting.
- **KNN** performed the worst (F1 = 0.7163) and had the longest prediction time, making it less suitable for this task.

####  Final Remarks
- Models like **LinearSVC** and **Logistic Regression** provide excellent performance and generalization without significant overfitting.
- The **Random Forest** model, while powerful, may require tuning or regularization to avoid overfitting.
- **KNN** is sensitive to the choice of `k` and does not scale well with larger datasets.

These findings should guide future model selection and optimization, depending on whether performance, interpretability, or computational efficiency is prioritized.
