# Digital Habits vs Mental Health - Exploratory Data Analysis

## Overview
This notebook explores the relationship between digital habits and mental health outcomes. We analyze screen time, social media usage, sleep patterns, and their correlation with stress levels and mood scores.

## Dataset Description

The dataset contains the following variables:
- **screen_time_hours**: Total daily screen time across all digital devices (in hours)
- **social_media_platforms_used**: Number of different social media platforms used daily
- **hours_on_TikTok**: Daily time spent specifically on TikTok (in hours)
- **sleep_hours**: Average number of hours the person sleeps per day
- **stress_level**: Perceived stress level on a scale of 1–10
- **mood_score**: Mood rating on a scale of 1–10, where higher is better
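
The variable list above can double as a lightweight schema check before any analysis runs. The sketch below is one possible way to do this; the name `validate_schema` and the 0 to 24 hour bounds are assumptions added for illustration, while the 1 to 10 scales come from the descriptions above.

```python
import pandas as pd

# Expected columns and plausible value ranges.
# The 0-24 hour bounds are an assumption; the 1-10 scales come from the variable list.
EXPECTED_RANGES = {
    "screen_time_hours": (0, 24),
    "social_media_platforms_used": (0, None),
    "hours_on_TikTok": (0, 24),
    "sleep_hours": (0, 24),
    "stress_level": (1, 10),
    "mood_score": (1, 10),
}

def validate_schema(frame):
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    for column, (low, high) in EXPECTED_RANGES.items():
        if column not in frame.columns:
            problems.append(f"missing column: {column}")
            continue
        values = frame[column]
        if low is not None and (values < low).any():
            problems.append(f"{column}: values below {low}")
        if high is not None and (values > high).any():
            problems.append(f"{column}: values above {high}")
    return problems

# Tiny hand-made frame (not real data) with one deliberate error: stress_level = 11
demo = pd.DataFrame({
    "screen_time_hours": [5.5, 7.0],
    "social_media_platforms_used": [3, 4],
    "hours_on_TikTok": [1.0, 2.5],
    "sleep_hours": [7.0, 6.5],
    "stress_level": [4, 11],
    "mood_score": [8, 6],
})
issues = validate_schema(demo)
print(issues)
```

Returning a list of problems instead of raising on the first one makes it easier to see every issue in a single pass.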

## Research Applications
- Predicting mood or stress level from digital usage behavior
- Correlation analysis and data visualization practice
- Feature selection and engineering projects
- Designing early-warning systems for digital burnout
- Training ML models to detect behavior patterns that lead to poor well-being

## Environment Setup (if not already done)

In [None]:
# Optional: Automated Package Installation

import sys
import subprocess
import os
from pathlib import Path

def install_requirements():
    """Install packages from requirements.txt if it exists"""
    # Look for requirements.txt in parent directories
    current_path = Path.cwd()
    requirements_paths = [
        current_path / "requirements.txt",
        current_path.parent / "requirements.txt", 
        current_path.parent.parent / "requirements.txt"
    ]
    
    requirements_file = None
    for path in requirements_paths:
        if path.exists():
            requirements_file = path
            break
    
    if requirements_file:
        print(f"📦 Found requirements.txt at: {requirements_file}")
        try:
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-r', str(requirements_file)])
            print("✅ Successfully installed all requirements!")
        except subprocess.CalledProcessError as e:
            print(f"❌ Error installing requirements: {e}")
    else:
        print("⚠️  requirements.txt not found. Installing core packages individually...")
        core_packages = [
            'pandas', 'numpy', 'matplotlib', 'seaborn', 
            'scipy', 'scikit-learn'
        ]
        for package in core_packages:
            try:
                subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
                print(f"✅ Installed {package}")
            except subprocess.CalledProcessError:
                print(f"❌ Failed to install {package}")

install_requirements()

print("🔧 Package installation completed.")
print("📊 Core packages: pandas, numpy, matplotlib, seaborn, scipy, scikit-learn")

## Import libraries

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from IPython.display import Image, display

# Configure display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set style for plots
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("✅ Display settings configured!")
print("✅ Ready for analysis!")

## 1. Data Loading and Initial Exploration

In [None]:
# Load the dataset - Update this path according to your data location
# For this example, we'll create a sample dataset if the file doesn't exist

try:
    # Try to load from a common path
    df = pd.read_csv("../data/digital_habits_vs_mental_health.csv")
    print("✅ Dataset loaded successfully from ../data/")
except FileNotFoundError:
    try:
        # Alternative path
        df = pd.read_csv("digital_habits_vs_mental_health.csv")
        print("✅ Dataset loaded successfully from current directory")
    except FileNotFoundError:
        # Create sample data for demonstration
        print("⚠️  Dataset not found. Creating sample data for demonstration...")
        np.random.seed(42)
        n_samples = 1000
        
        # Generate sample data (each column is drawn independently, so no real correlations are expected)
        df = pd.DataFrame({
            'screen_time_hours': np.random.normal(6.0, 2.0, n_samples).clip(1, 12),
            'social_media_platforms_used': np.random.randint(1, 6, n_samples),
            'hours_on_TikTok': np.random.normal(2.4, 1.1, n_samples).clip(0.2, 7.2),
            'sleep_hours': np.random.normal(7.0, 1.5, n_samples).clip(3, 10),
            'stress_level': np.random.randint(1, 11, n_samples),
            'mood_score': np.random.randint(2, 11, n_samples)
        })
        print(f"✅ Sample dataset created with {n_samples} rows")

# Display basic information about the dataset
print(f"\n📊 Dataset Shape: {df.shape}")
print(f"📊 Columns: {list(df.columns)}")
print("\n" + "="*50)
print("FIRST 5 ROWS:")
print("="*50)
df.head()
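
The fallback branch above draws each column independently, so any correlation found in that sample is sampling noise. If a demonstration dataset with built-in structure is wanted, one option is to derive sleep, stress, and mood from screen time plus noise. The sketch below is only an illustration: the function name `make_correlated_sample` and every coefficient in it are arbitrary assumptions, not estimates from the real dataset.

```python
import numpy as np
import pandas as pd

def make_correlated_sample(n_samples=1000, seed=42):
    """Build a synthetic frame whose columns depend on each other (illustrative only)."""
    rng = np.random.default_rng(seed)
    screen = rng.normal(6.0, 2.0, n_samples).clip(1, 12)
    # TikTok time grows with screen time and can never exceed it
    tiktok = (0.4 * screen + rng.normal(0.0, 0.8, n_samples)).clip(0.2, 7.2)
    tiktok = np.minimum(tiktok, screen)
    # More screen time -> slightly less sleep (assumed direction)
    sleep = (8.5 - 0.25 * screen + rng.normal(0.0, 1.0, n_samples)).clip(3, 10)
    # Stress rises with screen time and falls with sleep (assumed direction)
    stress = 1.0 + 0.6 * screen - 0.5 * (sleep - 7.0) + rng.normal(0.0, 1.5, n_samples)
    # Mood falls with stress and rises with sleep (assumed direction)
    mood = 10.0 - 0.6 * stress + 0.4 * (sleep - 7.0) + rng.normal(0.0, 1.0, n_samples)
    return pd.DataFrame({
        "screen_time_hours": screen,
        "social_media_platforms_used": rng.integers(1, 6, n_samples),
        "hours_on_TikTok": tiktok,
        "sleep_hours": sleep,
        # Round to whole numbers and keep both scores on the 1-10 scale
        "stress_level": np.clip(np.round(stress), 1, 10).astype(int),
        "mood_score": np.clip(np.round(mood), 1, 10).astype(int),
    })

sample_df = make_correlated_sample()
print(sample_df.corr().round(2))
```

Because the relationships are planted by construction, results on such a sample only confirm that the analysis code works; they say nothing about real behavior.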

In [None]:
# Display dataset information
print("="*50)
print("DATASET INFORMATION:")
print("="*50)
df.info()

print("\n" + "="*50)
print("MISSING VALUES CHECK:")
print("="*50)
missing_values = df.isnull().sum()
print(missing_values)

if missing_values.sum() == 0:
    print("✅ No missing values found!")
else:
    print("⚠️  Missing values detected!")

print("\n" + "="*50)
print("STATISTICAL SUMMARY:")
print("="*50)
df.describe()

## 2. Correlation Analysis

In [None]:
# Calculate correlation matrix
corr_matrix = df.corr()

# Create correlation heatmap
plt.figure(figsize=(12, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, 
            annot=True, 
            cmap="RdBu_r", 
            center=0,
            mask=mask,
            fmt='.3f',
            square=True,
            linewidths=0.5,
            cbar_kws={"shrink": .8})
plt.title("Correlation Matrix - Digital Habits vs Mental Health", fontsize=16, pad=20)
plt.tight_layout()
plt.show()

print("="*60)
print("KEY CORRELATIONS WITH MENTAL HEALTH INDICATORS:")
print("="*60)

# Focus on correlations with mood_score and stress_level
mental_health_corr = corr_matrix[['mood_score', 'stress_level']].sort_values('mood_score', ascending=False)
print("\n🎯 CORRELATIONS WITH MOOD SCORE:")
print("-"*40)
for var, corr_val in mental_health_corr['mood_score'].items():
    if var != 'mood_score':
        print(f"{var:25s}: {corr_val:+.3f}")

print("\n🎯 CORRELATIONS WITH STRESS LEVEL:")
print("-"*40)
for var, corr_val in mental_health_corr['stress_level'].items():
    if var != 'stress_level':
        print(f"{var:25s}: {corr_val:+.3f}")

# Highlight specific correlations
print("\n" + "="*60)
print("NOTABLE RELATIONSHIPS:")
print("="*60)
tiktok_screen_corr = corr_matrix.loc['hours_on_TikTok', 'screen_time_hours']
stress_mood_corr = corr_matrix.loc['stress_level', 'mood_score']
sleep_mood_corr = corr_matrix.loc['sleep_hours', 'mood_score']

print(f"📱 TikTok vs Total Screen Time: {tiktok_screen_corr:+.3f}")
print(f"😰 Stress Level vs Mood Score: {stress_mood_corr:+.3f}")
print(f"😴 Sleep Hours vs Mood Score: {sleep_mood_corr:+.3f}")
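
A correlation coefficient on its own does not show whether the value is distinguishable from zero, and `stress_level` and `mood_score` are ordinal ratings for which a rank-based measure may be the safer choice. The sketch below reports Pearson and Spearman coefficients with their p-values using `scipy.stats`. It runs on synthetic data generated inside the block, and the helper name `correlation_table` is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic demo data: mood depends on sleep by construction, noise does not
rng = np.random.default_rng(0)
n = 200
sleep = rng.normal(7.0, 1.5, n)
demo = pd.DataFrame({
    "sleep": sleep,
    "noise": rng.normal(0.0, 1.0, n),
    "mood": 0.8 * sleep + rng.normal(0.0, 1.0, n),
})

def correlation_table(frame, target):
    """Pearson and Spearman correlation of every column with `target`, with p-values."""
    rows = []
    for column in frame.columns:
        if column == target:
            continue
        r, p_r = stats.pearsonr(frame[column], frame[target])
        rho, p_rho = stats.spearmanr(frame[column], frame[target])
        rows.append({
            "variable": column,
            "pearson_r": r,
            "pearson_p": p_r,
            "spearman_rho": rho,
            "spearman_p": p_rho,
        })
    return pd.DataFrame(rows).set_index("variable")

table = correlation_table(demo, "mood")
print(table.round(4))
```

On the notebook's data the same helper could be called as `correlation_table(df, 'mood_score')`. When many pairs are tested at once, some small p-values are expected by chance alone.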

## 3. Distribution Analysis

In [None]:
# Define function for detailed numerical summary
def analyze_distribution(dataframe, column, plot=True):
    """
    Displays detailed statistical summary and optional histogram for a numerical column.
    """
    # Define quantiles for detailed analysis
    quantiles = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    
    print(f"\n📊 DISTRIBUTION ANALYSIS: {column.upper()}")
    print("-" * 50)
    
    # Statistical summary
    summary = dataframe[column].describe(quantiles)
    for stat, value in summary.items():
        print(f"{stat:10s}: {value:8.3f}")
    
    # Additional statistics
    skewness = stats.skew(dataframe[column])
    kurtosis = stats.kurtosis(dataframe[column])
    print(f"{'Skewness':10s}: {skewness:8.3f}")
    print(f"{'Kurtosis':10s}: {kurtosis:8.3f}")
    
    if plot:
        plt.figure(figsize=(10, 4))
        
        # Histogram
        plt.subplot(1, 2, 1)
        plt.hist(dataframe[column], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
        plt.axvline(dataframe[column].mean(), color='red', linestyle='--', 
                   label=f'Mean: {dataframe[column].mean():.2f}')
        plt.axvline(dataframe[column].median(), color='green', linestyle='--', 
                   label=f'Median: {dataframe[column].median():.2f}')
        plt.xlabel(column)
        plt.ylabel('Frequency')
        plt.title(f'Distribution of {column}')
        plt.legend()
        plt.grid(True, alpha=0.3)
        
        # Box plot
        plt.subplot(1, 2, 2)
        box = plt.boxplot(dataframe[column], patch_artist=True)
        box['boxes'][0].set_facecolor('lightblue')
        plt.ylabel(column)
        plt.title(f'Box Plot of {column}')
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    
    print("=" * 50)

# Analyze all numerical columns
print("🔍 COMPREHENSIVE DISTRIBUTION ANALYSIS")
print("=" * 60)

for column in df.columns:
    analyze_distribution(df, column, plot=True)

## 4. Relationship Analysis

In [None]:
# Create comprehensive pair plot
def create_pair_plot(dataframe):
    """
    Generates pair plot for numerical features showing relationships and distributions.
    """
    # sns.pairplot creates its own figure, so no separate plt.figure() call is needed.
    # Off-diagonal panels are plain scatter plots (no regression lines are fitted).
    g = sns.pairplot(dataframe, 
                     diag_kind='hist',
                     plot_kws={'alpha': 0.6, 's': 20},
                     diag_kws={'bins': 30, 'alpha': 0.7})
    
    g.fig.suptitle('Pairwise Relationships - Digital Habits vs Mental Health', 
                   fontsize=16, y=1.02)
    
    # Customize the plot
    for ax in g.axes.flatten():
        if ax:
            ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

print("🔗 PAIRWISE RELATIONSHIP ANALYSIS")
print("=" * 50)
print("This plot shows:")
print("• Diagonal: Distribution of each variable")
print("• Off-diagonal: Scatter plots between variable pairs")
print("• Look for linear/non-linear relationships and clusters")
print()

create_pair_plot(df)

## 5. Outlier Detection and Analysis

In [None]:
# Outlier detection functions
def detect_outliers_iqr(dataframe, column):
    """Detect outliers using IQR method"""
    Q1 = dataframe[column].quantile(0.25)
    Q3 = dataframe[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = dataframe[(dataframe[column] < lower_bound) | (dataframe[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

def detect_outliers_zscore(dataframe, column, threshold=3):
    """Detect outliers using Z-score method"""
    # Convert to numpy array to ensure compatibility
    data = np.array(dataframe[column].values, dtype=np.float64)
    
    # Calculate z-scores manually to avoid type issues
    mean_val = np.mean(data)
    std_val = np.std(data, ddof=0)
    z_scores = np.abs((data - mean_val) / std_val)
    
    outliers = dataframe[z_scores > threshold]
    return outliers, z_scores

# Comprehensive outlier analysis
print("🔍 OUTLIER DETECTION ANALYSIS")
print("=" * 60)

# Create visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Outlier Detection Analysis', fontsize=16)

outlier_summary = {}

for i, column in enumerate(df.columns):
    row = i // 3
    col = i % 3
    
    # IQR method
    outliers_iqr, lower_bound, upper_bound = detect_outliers_iqr(df, column)
    
    # Z-score method
    outliers_zscore, z_scores = detect_outliers_zscore(df, column)
    
    # Store results
    outlier_summary[column] = {
        'iqr_outliers': len(outliers_iqr),
        'zscore_outliers': len(outliers_zscore),
        'total_points': len(df),
        'iqr_percentage': (len(outliers_iqr) / len(df)) * 100,
        'zscore_percentage': (len(outliers_zscore) / len(df)) * 100
    }
    
    # Create box plot
    axes[row, col].boxplot(df[column])
    axes[row, col].set_title(f'{column}\nIQR: {len(outliers_iqr)} | Z-score: {len(outliers_zscore)}')
    axes[row, col].grid(True, alpha=0.3)
    axes[row, col].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Print summary
print("\n📊 OUTLIER SUMMARY:")
print("-" * 60)
print(f"{'Variable':<25} {'IQR Method':<15} {'Z-Score Method':<15} {'IQR %':<10} {'Z-Score %':<10}")
print("-" * 60)

for var, stats_dict in outlier_summary.items():
    print(f"{var:<25} {stats_dict['iqr_outliers']:<15} {stats_dict['zscore_outliers']:<15} "
          f"{stats_dict['iqr_percentage']:<10.2f} {stats_dict['zscore_percentage']:<10.2f}")

# Identify variables with significant outliers
print("\n⚠️  VARIABLES WITH SIGNIFICANT OUTLIERS (>5% by IQR):")
print("-" * 50)
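# Use 'stats_dict' as the loop name so it cannot be confused with the scipy 'stats' module
significant_outliers = [var for var, stats_dict in outlier_summary.items() 
                       if stats_dict['iqr_percentage'] > 5]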

if significant_outliers:
    for var in significant_outliers:
        print(f"• {var}: {outlier_summary[var]['iqr_percentage']:.2f}% outliers")
else:
    print("✅ No variables have significant outlier issues")
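
The two detection rules above can disagree, because an extreme value inflates the very mean and standard deviation that the z-score is built on. The small worked example below uses nine hand-picked values (not real data) and the same thresholds as the functions above to show the effect.

```python
import pandas as pd

# Nine hand-picked values with one obvious extreme point (30)
values = pd.Series([4, 5, 5, 6, 6, 7, 7, 8, 30], name="screen_time_hours")

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score rule: flag points more than 3 population standard deviations from the mean
z_scores = (values - values.mean()).abs() / values.std(ddof=0)
z_outliers = values[z_scores > 3]

print(f"IQR bounds: [{lower}, {upper}] -> flagged: {iqr_outliers.tolist()}")
print(f"Largest z-score: {z_scores.max():.2f} -> flagged: {z_outliers.tolist()}")
```

Here the IQR rule flags 30, while the z-score rule flags nothing: the largest z-score stays just below 3 because the extreme point has pulled the mean and standard deviation toward itself. With only nine points and `ddof=0`, no z-score can exceed √8 ≈ 2.83, so the threshold of 3 is unreachable. The effect weakens as the sample grows but does not vanish.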

## 6. Feature Engineering

In [None]:
# Create a copy for feature engineering
df_engineered = df.copy()

print("🔧 FEATURE ENGINEERING")
print("=" * 50)

# 1. Sleep deficit (how much less than 8 hours)
df_engineered['sleep_deficit'] = np.maximum(0, 8 - df_engineered['sleep_hours'])
print("✅ Created 'sleep_deficit':          Hours below 8-hour sleep recommendation")

# 2. Digital wellness ratio (sleep vs screen time; +1 in the denominator avoids division by zero)
df_engineered['digital_wellness_ratio'] = df_engineered['sleep_hours'] / (df_engineered['screen_time_hours'] + 1)
print("✅ Created 'digital_wellness_ratio': Sleep hours / (Screen time hours + 1)")

# 3. TikTok dominance (TikTok relative to total screen time; the +1 means this is not an exact proportion)
df_engineered['tiktok_dominance'] = df_engineered['hours_on_TikTok'] / (df_engineered['screen_time_hours'] + 1)
print("✅ Created 'tiktok_dominance':       TikTok hours / (Total screen time + 1)")

# 4. Stress-screen compound (interaction between stress and screen time)
df_engineered['stress_screen_compound'] = df_engineered['stress_level'] * df_engineered['screen_time_hours']
print("✅ Created 'stress_screen_compound': Stress level x Screen time")

# 5. Social intensity (TikTok hours × screen time for social media focus)
df_engineered['social_intensity'] = df_engineered['hours_on_TikTok'] * df_engineered['screen_time_hours']
print("✅ Created 'social_intensity':       TikTok hours x Screen time")

# 6. Categorical features
# Sleep quality categories
sleep_bins = [0, 6, 8, float('inf')]
sleep_labels = ['Poor', 'Average', 'Good']
df_engineered['sleep_quality'] = pd.cut(df_engineered['sleep_hours'], 
                                       bins=sleep_bins, 
                                       labels=sleep_labels, 
                                       right=False)

# Screen time categories
screen_bins = [0, 4, 8, float('inf')]
screen_labels = ['Low', 'Moderate', 'High']
df_engineered['screen_time_category'] = pd.cut(df_engineered['screen_time_hours'], 
                                              bins=screen_bins, 
                                              labels=screen_labels, 
                                              right=False)

# Stress level categories
stress_bins = [0, 3, 7, float('inf')]
stress_labels = ['Low', 'Moderate', 'High']
df_engineered['stress_category'] = pd.cut(df_engineered['stress_level'], 
                                         bins=stress_bins, 
                                         labels=stress_labels, 
                                         right=False)

print("\n✅ Created categorical features: sleep_quality, screen_time_category, stress_category")

print(f"\n📊 Dataset expanded from {df.shape[1]} to {df_engineered.shape[1]} features")
print("\nNew features preview:")
new_features = ['sleep_deficit', 'digital_wellness_ratio', 'tiktok_dominance', 
               'stress_screen_compound', 'social_intensity']
df_engineered[new_features].head()
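
The categorical features above pass `right=False` to `pd.cut`, which makes each bin closed on the left and open on the right. The short example below uses a few hand-picked boundary values to show where exactly 6.0 and 8.0 hours end up with the sleep bins defined above.

```python
import pandas as pd

# Hand-picked values sitting on and around the bin edges
sleep_hours = pd.Series([5.9, 6.0, 7.9, 8.0, 9.5])

sleep_quality = pd.cut(
    sleep_hours,
    bins=[0, 6, 8, float("inf")],
    labels=["Poor", "Average", "Good"],
    right=False,  # intervals are [0, 6), [6, 8), [8, inf)
)

print(pd.DataFrame({"sleep_hours": sleep_hours, "sleep_quality": sleep_quality}))
```

With these settings, exactly 6.0 hours is labelled 'Average' and exactly 8.0 hours is labelled 'Good'. The same left-closed rule puts a stress level of exactly 7 in the 'High' stress category.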

## 7. Random Forest Modeling Analysis

In [None]:
# Random Forest analysis to understand key relationships and feature importance
print("🌲 RANDOM FOREST ANALYSIS")
print("="*50)

# Ensure all imports are available
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

target = 'mood_score'
print(f"🎯 Predicting: {target}")

# Include more features for Random Forest to showcase its capabilities
feature_cols = ['screen_time_hours', 'sleep_hours', 'stress_level', 'hours_on_TikTok']

# Add the numeric engineered features from Section 6, if they exist
engineered_cols = ['sleep_deficit', 'digital_wellness_ratio', 'tiktok_dominance',
                   'stress_screen_compound', 'social_intensity']
feature_cols.extend([col for col in engineered_cols if col in df_engineered.columns])

X = df_engineered[feature_cols]
y = df_engineered[target]

print(f"📊 Using features: {feature_cols}")

# Implement 70/15/15 train/validation/test split
# First split: 70% train, 30% temp (which will be split into 15% val, 15% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Second split: Split the 30% temp into 15% validation and 15% test (50/50 of the 30%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f"\n📊 Data Split (70/15/15):")
print(f"• Training set: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X):.1%})")
print(f"• Validation set: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X):.1%})")
print(f"• Test set: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X):.1%})")

# Create and fit the Random Forest model
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)

print(f"\n🌳 Training Random Forest...")
rf_model.fit(X_train, y_train)

# Make predictions on all sets
train_pred = rf_model.predict(X_train)
val_pred = rf_model.predict(X_val)
test_pred = rf_model.predict(X_test)

# Calculate metrics for all sets
train_r2 = r2_score(y_train, train_pred)
val_r2 = r2_score(y_val, val_pred)
test_r2 = r2_score(y_test, test_pred)

train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, val_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))

train_mae = mean_absolute_error(y_train, train_pred)
val_mae = mean_absolute_error(y_val, val_pred)
test_mae = mean_absolute_error(y_test, test_pred)

print(f"\n" + "="*50)
print("RANDOM FOREST RESULTS:")
print("="*50)
print(f"📊 TRAINING SET:")
print(f"   • R² Score: {train_r2:.4f}")
print(f"   • RMSE: {train_rmse:.4f}")
print(f"   • MAE: {train_mae:.4f}")

print(f"\n📊 VALIDATION SET:")
print(f"   • R² Score: {val_r2:.4f}")
print(f"   • RMSE: {val_rmse:.4f}")
print(f"   • MAE: {val_mae:.4f}")

print(f"\n📊 TEST SET:")
print(f"   • R² Score: {test_r2:.4f}")
print(f"   • RMSE: {test_rmse:.4f}")
print(f"   • MAE: {test_mae:.4f}")

# Check for overfitting
train_val_diff = train_r2 - val_r2
val_test_diff = val_r2 - test_r2

print(f"\n🔍 MODEL ASSESSMENT:")
if train_val_diff > 0.1:
    print(f"   ⚠️  Potential overfitting detected (Train-Val R² diff: {train_val_diff:.4f})")
elif train_val_diff < 0.05:
    print(f"   ✅ Good generalization (Train-Val R² diff: {train_val_diff:.4f})")
else:
    print(f"   📊 Moderate fit (Train-Val R² diff: {train_val_diff:.4f})")

print(f"   📈 Validation-Test consistency: {abs(val_test_diff):.4f}")

print(f"\n🌲 Random Forest explains {test_r2:.1%} of variance in mood score (test set)")

# Cross-validation for additional robustness check
cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='r2')
print(f"🔄 5-Fold CV R² Score: {cv_scores.mean():.4f} (±{cv_scores.std()*2:.4f})")

# Feature importance analysis
feature_importance = rf_model.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

print(f"\n🏆 TOP FEATURE IMPORTANCE:")
for idx, row in feature_importance_df.head().iterrows():
    print(f"   {row['feature']}: {row['importance']:.4f}")
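
Two cross-checks can make the results above easier to interpret. An R² value means more next to a trivial baseline, and impurity-based `feature_importances_` are computed on the training data and can favor high-cardinality features. The sketch below compares a forest against scikit-learn's `DummyRegressor` and ranks features with `permutation_importance` on held-out data. It uses synthetic data in which only one feature carries signal; the feature names are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends only on the first feature
rng = np.random.default_rng(42)
n = 600
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
names = ["informative", "noise_a", "noise_b"]  # hypothetical feature names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline that always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

baseline_r2 = r2_score(y_test, baseline.predict(X_test))
forest_r2 = r2_score(y_test, forest.predict(X_test))
print(f"Baseline R²: {baseline_r2:.3f} | Random Forest R²: {forest_r2:.3f}")

# Permutation importance: drop in test-set score when one feature is shuffled
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=42)
ranking = sorted(zip(names, result.importances_mean), key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name:12s}: {importance:.4f}")
```

On the notebook's data the same calls could be applied to `rf_model`, `X_val`, and `y_val`. If the forest does not clearly beat the baseline, its feature ranking should not be read as evidence of anything.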

In [None]:
# Random Forest Visualizations and Model Assessment
print("📊 RANDOM FOREST VISUALIZATIONS")
print("="*50)

# Create a comprehensive visualization dashboard
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 1. Feature Importance Plot
ax1 = axes[0, 0]
feature_importance_df_sorted = feature_importance_df.sort_values('importance', ascending=True)
colors = plt.get_cmap('viridis')(np.linspace(0, 1, len(feature_importance_df_sorted)))
bars = ax1.barh(range(len(feature_importance_df_sorted)), feature_importance_df_sorted['importance'], color=colors)
ax1.set_yticks(range(len(feature_importance_df_sorted)))
ax1.set_yticklabels(feature_importance_df_sorted['feature'])
ax1.set_xlabel('Feature Importance')
ax1.set_title('Random Forest\nFeature Importance', fontsize=14, fontweight='bold')
ax1.grid(axis='x', alpha=0.3)

# Add value labels on bars
for i, bar in enumerate(bars):
    width = bar.get_width()
    ax1.text(width + 0.001, bar.get_y() + bar.get_height()/2, 
             f'{width:.3f}', ha='left', va='center', fontsize=10)

# 2. Actual vs Predicted - Test Set
ax2 = axes[0, 1]
scatter = ax2.scatter(y_test, test_pred, alpha=0.6, c=y_test, cmap='viridis', s=20)
ax2.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, alpha=0.8)
ax2.set_xlabel('Actual Mood Score')
ax2.set_ylabel('Predicted Mood Score')
ax2.set_title(f'Actual vs Predicted (Test Set)\nR² = {test_r2:.3f}', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.colorbar(scatter, ax=ax2, label='Actual Mood Score')

# 3. Residuals Plot
ax3 = axes[0, 2]
residuals_test = y_test - test_pred
ax3.scatter(test_pred, residuals_test, alpha=0.6, c='lightcoral', s=20)
ax3.axhline(y=0, color='black', linestyle='--', alpha=0.8)
ax3.set_xlabel('Predicted Mood Score')
ax3.set_ylabel('Residuals')
ax3.set_title('Residuals Plot (Test Set)', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3)

# 4. Model Performance Comparison (Train/Val/Test)
ax4 = axes[1, 0]
metrics = ['R²', 'RMSE', 'MAE']
train_metrics = [train_r2, train_rmse, train_mae]
val_metrics = [val_r2, val_rmse, val_mae]
test_metrics = [test_r2, test_rmse, test_mae]

x = np.arange(len(metrics))
width = 0.25

ax4.bar(x - width, train_metrics, width, label='Train', alpha=0.8, color='skyblue')
ax4.bar(x, val_metrics, width, label='Validation', alpha=0.8, color='orange')
ax4.bar(x + width, test_metrics, width, label='Test', alpha=0.8, color='lightgreen')

ax4.set_xlabel('Metrics')
ax4.set_ylabel('Score')
ax4.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
ax4.set_xticks(x)
ax4.set_xticklabels(metrics)
ax4.legend()
ax4.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, (train_val, val_val, test_val) in enumerate(zip(train_metrics, val_metrics, test_metrics)):
    ax4.text(i - width, train_val + max(train_metrics) * 0.01, f'{train_val:.3f}', 
             ha='center', va='bottom', fontsize=9)
    ax4.text(i, val_val + max(val_metrics) * 0.01, f'{val_val:.3f}', 
             ha='center', va='bottom', fontsize=9)
    ax4.text(i + width, test_val + max(test_metrics) * 0.01, f'{test_val:.3f}', 
             ha='center', va='bottom', fontsize=9)

# 5. Prediction Distribution
ax5 = axes[1, 1]
ax5.hist(y_test, bins=30, alpha=0.7, label='Actual', color='lightblue', density=True)
ax5.hist(test_pred, bins=30, alpha=0.7, label='Predicted', color='lightcoral', density=True)
ax5.set_xlabel('Mood Score')
ax5.set_ylabel('Density')
ax5.set_title('Distribution: Actual vs Predicted', fontsize=14, fontweight='bold')
ax5.legend()
ax5.grid(True, alpha=0.3)

# 6. Cross-Validation Scores
ax6 = axes[1, 2]
cv_mean = cv_scores.mean()
cv_std = cv_scores.std()
ax6.bar(range(len(cv_scores)), cv_scores, alpha=0.7, color='lightgreen')
ax6.axhline(y=cv_mean, color='red', linestyle='--', alpha=0.8, 
            label=f'Mean: {cv_mean:.3f}')
ax6.axhline(y=cv_mean + cv_std, color='orange', linestyle=':', alpha=0.6)
ax6.axhline(y=cv_mean - cv_std, color='orange', linestyle=':', alpha=0.6)
ax6.set_xlabel('CV Fold')
ax6.set_ylabel('R² Score')
ax6.set_title(f'5-Fold Cross-Validation\nMean: {cv_mean:.3f} ± {cv_std:.3f}', 
              fontsize=14, fontweight='bold')
ax6.set_xticks(range(len(cv_scores)))
ax6.set_xticklabels([f'Fold {i+1}' for i in range(len(cv_scores))])
ax6.legend()
ax6.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, score in enumerate(cv_scores):
    ax6.text(i, score + 0.005, f'{score:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.suptitle('Random Forest Model Analysis Dashboard', fontsize=16, fontweight='bold', y=1.02)
plt.show()

# Print detailed model assessment
print(f"\n" + "="*60)
print("🔍 DETAILED MODEL ASSESSMENT")
print("="*60)

print(f"\n📊 DATA SPLITS:")
print(f"   • Training: {len(X_train):,} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"   • Validation: {len(X_val):,} samples ({len(X_val)/len(X)*100:.1f}%)")
print(f"   • Test: {len(X_test):,} samples ({len(X_test)/len(X)*100:.1f}%)")

print(f"\n🎯 PERFORMANCE SUMMARY:")
print(f"   • Best R² Score: {max(train_r2, val_r2, test_r2):.4f} ({'Training' if train_r2 == max(train_r2, val_r2, test_r2) else 'Validation' if val_r2 == max(train_r2, val_r2, test_r2) else 'Test'})")
print(f"   • Generalization Gap: {train_r2 - test_r2:.4f}")
print(f"   • CV Consistency: {cv_scores.std():.4f} (lower is better)")

print(f"\n🏆 TOP 3 MOST IMPORTANT FEATURES:")
for i, (_, row) in enumerate(feature_importance_df.head(3).iterrows()):
    print(f"   {i+1}. {row['feature']}: {row['importance']:.4f} ({row['importance']/feature_importance_df['importance'].sum()*100:.1f}%)")

print(f"\n🎉 MODEL CONCLUSION:")
if test_r2 >= 0.7:
    print(f"   ✅ Excellent model performance (R² = {test_r2:.3f})")
elif test_r2 >= 0.5:
    print(f"   ✅ Good model performance (R² = {test_r2:.3f})")
elif test_r2 >= 0.3:
    print(f"   ⚠️  Moderate model performance (R² = {test_r2:.3f})")
else:
    print(f"   ❌ Poor model performance (R² = {test_r2:.3f})")

if abs(train_r2 - test_r2) < 0.05:
    print(f"   ✅ Well-generalized model (low overfitting)")
elif abs(train_r2 - test_r2) < 0.1:
    print(f"   ⚠️  Moderate overfitting detected")
else:
    print(f"   ❌ High overfitting detected")

print(f"\n🌲 Random Forest explains {test_r2:.1%} of the variance in mood scores (test set).")
top_feature = feature_importance_df.iloc[0]['feature']
print(f"💡 Highest-ranked feature by impurity-based importance: {top_feature}")

## 8. Key Findings and Conclusions

In [None]:
# Summary of key findings
print("🎯 KEY FINDINGS FROM EDA")
print("=" * 60)

# Calculate key statistics for summary
corr = df.corr()  # compute the correlation matrix once and reuse it
sleep_mood_corr = corr.loc['sleep_hours', 'mood_score']
screen_mood_corr = corr.loc['screen_time_hours', 'mood_score']
stress_mood_corr = corr.loc['stress_level', 'mood_score']
tiktok_screen_corr = corr.loc['hours_on_TikTok', 'screen_time_hours']

avg_sleep = df['sleep_hours'].mean()
avg_screen = df['screen_time_hours'].mean()
avg_stress = df['stress_level'].mean()
avg_mood = df['mood_score'].mean()

print("\n📊 DATASET OVERVIEW:")
print(f"• Sample size: {len(df):,} observations")
print(f"• Average sleep: {avg_sleep:.1f} hours")
print(f"• Average screen time: {avg_screen:.1f} hours")
print(f"• Average stress level: {avg_stress:.1f}/10")
print(f"• Average mood score: {avg_mood:.1f}/10")

def describe_corr(r):
    """Label a correlation by sign and rough strength (the cut-offs are rules of thumb)."""
    strength = 'strong' if abs(r) >= 0.5 else 'moderate' if abs(r) >= 0.3 else 'weak'
    direction = 'positive' if r > 0 else 'negative'
    return f"{strength} {direction}"

print("\n🔗 KEY RELATIONSHIPS:")
print(f"• Sleep ↔ Mood: {sleep_mood_corr:+.3f} ({describe_corr(sleep_mood_corr)})")
print(f"• Screen Time ↔ Mood: {screen_mood_corr:+.3f} ({describe_corr(screen_mood_corr)})")
print(f"• Stress ↔ Mood: {stress_mood_corr:+.3f} ({describe_corr(stress_mood_corr)})")
print(f"• TikTok ↔ Total Screen: {tiktok_screen_corr:+.3f} ({describe_corr(tiktok_screen_corr)})")

print("\n💡 HYPOTHESES TO CHECK AGAINST THE VALUES ABOVE:")
print("1. Longer sleep is associated with better mood")
print("2. More screen time is associated with lower mood")
print("3. TikTok hours rise with total screen time")
print("4. Higher stress is associated with lower mood")
print("5. Time spent on screens matters more than the number of platforms used")

print("\n🚨 POTENTIAL CONCERNS:")
high_screen_users = (df['screen_time_hours'] > 8).sum()
poor_sleepers = (df['sleep_hours'] < 6).sum()
high_stress_users = (df['stress_level'] > 7).sum()

print(f"• {high_screen_users:,} users ({high_screen_users/len(df)*100:.1f}%) have >8h daily screen time")
print(f"• {poor_sleepers:,} users ({poor_sleepers/len(df)*100:.1f}%) get <6h sleep")
print(f"• {high_stress_users:,} users ({high_stress_users/len(df)*100:.1f}%) report high stress (>7/10)")

print("\n🎯 RECOMMENDATIONS FOR IMPROVING MENTAL HEALTH:")
print("1. Prioritize sleep hygiene - aim for 7-9 hours per night")
print("2. Implement screen time limits, especially for social media")
print("3. Consider setting boundaries on TikTok usage")
print("4. Monitor and manage stress levels through healthy coping mechanisms")
print("5. Focus on the quality of digital engagement rather than the quantity")

print("\n📈 NEXT STEPS FOR ANALYSIS:")
print("• Try additional model families (e.g. gradient boosting such as XGBoost) and tune the Random Forest")
print("• Create intervention recommendation system")
print("• Segment users into risk categories")
print("• Build early warning system for mental health decline")
print("• Validate findings with additional behavioral data")

## 9. Data Export