__OVERSTIMULATION BEHAVIOR ANALYSIS__

__Project Definition & Preliminary Analysis__

_Project Definition_

__Title__: Overstimulation Behavior and Lifestyle Analysis

__Objective__: Using "overstimulation_dataset.csv", this project aims to analyze how various lifestyle factors influence overstimulation behaviors. The goal is to identify patterns and correlations that can inform strategies for managing overstimulation.

__Key Research Questions__:
1. Which lifestyle factors are most strongly associated with overstimulation behaviors?
2. Are there identifiable patterns or trends in overstimulation incidents across different demographics?
3. Is overstimulation a risk factor for depression? 
4. Can we develop a predictive model to anticipate overstimulation episodes based on lifestyle data?
5. Can recommendations be made for daily habits and lifestyle hygiene that can minimize the risk of overstimulation and improve an individual's mental health?
6. Is there a correlation between excessive stress and overstimulation and symptoms of depression?

__Dataset Overview__:

Source: Overstimulation Behavior and Lifestyle Dataset from Kaggle (https://www.kaggle.com/datasets/miadul/overstimulation-behavior-and-lifestyle-dataset/data)

File: "overstimulation_dataset.csv"

Features:
1. Demographics
- Age: Age of the individual (18-60)
2. Lifestyle & Daily Routine
- Sleep_Hours: Hours of sleep per day (3-10)
- Screen_Time: Screen time per day (1-12)
- Work_Hours: Hours worked per day (4-15)
- Exercise_Hours: Hours of physical activity per day (0-3)
- Caffeine_Intake: Number of cups of caffeinated drinks per day (0-5)
- Tech_Usage_Hours: Total hours spent using technology per day (1-10)
3. Environmental Exposure
- Noise_Exposure: Frequency of exposure to high noise (0-5)
- Social_Interaction: Number of daily social interactions (0-10)
4. Mental Health & Psychological Traits
- Stress_Level: Self-reported stress level (1=low stress, 10=high stress)
- Anxiety_Score: Anxiety score (1-10)
- Depression_Score: Depression score (1-10)
- Overthinking_Score: Tendency to overthink (1-10)
- Irritability_Score: Irritability (1-10)
- Sensory_Sensitivity: Sensitivity to sensory input (0 = low sensitivity, 4 = high sensitivity)
- Headache_Frequency: Headaches per week
5. Habits & Coping Mechanisms
- Multitasking_Habit: Whether the person tends to multitask (1 = Yes, 0 = No)
- Meditation_Habit: Whether the person practices meditation/mindfulness (1 = Yes, 0 = No)
- Sleep_Quality: Quality of sleep (1-4)
6. Target Variable
- Overstimulated: 1 = Yes, 0 = No
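
The documented ranges above can be turned into a quick sanity check. The following is a minimal sketch, not part of the original analysis: the helper `out_of_range_counts` and the toy frame are hypothetical, and `Headache_Frequency` is left out because no range is stated for it.

```python
import pandas as pd

# Expected (min, max) per column, copied from the feature list above.
# Headache_Frequency is omitted because no range is stated for it.
EXPECTED_RANGES = {
    "Age": (18, 60),
    "Sleep_Hours": (3, 10),
    "Screen_Time": (1, 12),
    "Work_Hours": (4, 15),
    "Exercise_Hours": (0, 3),
    "Caffeine_Intake": (0, 5),
    "Tech_Usage_Hours": (1, 10),
    "Noise_Exposure": (0, 5),
    "Social_Interaction": (0, 10),
    "Stress_Level": (1, 10),
    "Anxiety_Score": (1, 10),
    "Depression_Score": (1, 10),
    "Overthinking_Score": (1, 10),
    "Irritability_Score": (1, 10),
    "Sensory_Sensitivity": (0, 4),
    "Multitasking_Habit": (0, 1),
    "Meditation_Habit": (0, 1),
    "Sleep_Quality": (1, 4),
    "Overstimulated": (0, 1),
}


def out_of_range_counts(df, ranges=EXPECTED_RANGES):
    """Return {column: number of values outside [min, max]} for columns present in df."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            # between() is inclusive on both ends; missing values count as outside
            outside = ~df[col].between(low, high)
            counts[col] = int(outside.sum())
    return counts


# Hypothetical toy frame, not real data: one Age value (17) violates the documented range.
toy = pd.DataFrame({
    "Age": [18, 35, 17],
    "Sleep_Hours": [3.0, 7.5, 10.0],
    "Sleep_Quality": [1, 4, 2],
})
report = out_of_range_counts(toy)
print(report)
```

On the real data the same call would be `out_of_range_counts(data)`, run right after loading the CSV and before any encoding step.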

__Expected Outcomes__:

* Identification of key factors contributing to overstimulation
* Clustering of individuals based on overstimulation response patterns
* Preliminary predictive model to identify potential overstimulation

_Preliminary Analysis_

1. Data Exploration

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


In [None]:
data = pd.read_csv('overstimulation_dataset.csv')

In [None]:
data.head(5)

In [None]:
data.shape

In [None]:
data.isnull().sum()

In [None]:
data.info()

2. Descriptive Statistics & Visualization

In [None]:
data.describe()

In [None]:
data.columns

In [None]:
data['Overstimulated'].value_counts()

In [None]:
# sort_index() orders the counts as 0, 1 so they line up with the labels below
overstim_counts = data["Overstimulated"].value_counts().sort_index()
labels = ['Not Overstimulated (0)', 'Overstimulated (1)']
plt.figure(figsize=(6, 6))
plt.pie(overstim_counts, labels=labels, autopct='%1.1f%%', colors=sns.color_palette("viridis"), startangle=90)
plt.title("Proportion of Overstimulation Among Users")
plt.axis('equal')
plt.show()

In [None]:
binary_cols = ['Meditation_Habit', 'Multitasking_Habit']
for col in binary_cols:
    plt.figure(figsize=(6,4))
    count = data[col].value_counts()
    plt.pie(count, labels=count.index, autopct='%1.1f%%', colors=sns.color_palette('viridis', len(count)).as_hex())
    plt.title(f'{col} distribution')
    plt.show()

In [None]:
categorical_cols = ['Sensory_Sensitivity', 'Sleep_Quality', 'Noise_Exposure', 'Headache_Frequency']
for col in categorical_cols:
    plt.figure(figsize=(6,4))
    # discrete=True centers one bar on each integer level; palette has no effect without hue
    sns.histplot(x=col, data=data, discrete=True)
    plt.title(f'{col} distribution')
    plt.tight_layout()
    plt.show()

In [None]:
# check the unique values of the categorical variables
categorical_cols = ['Meditation_Habit', 'Multitasking_Habit', 'Sensory_Sensitivity', 'Sleep_Quality', 'Noise_Exposure', 'Headache_Frequency']
for col in categorical_cols:
    print(f"{col}: {data[col].unique()}")

In [None]:
# make sure all listed categorical columns are of int type
data['Multitasking_Habit'] = data['Multitasking_Habit'].astype(int)     # binary
data['Meditation_Habit'] = data['Meditation_Habit'].astype(int)         # binary
data['Sensory_Sensitivity'] = data['Sensory_Sensitivity'].astype(int)
data['Sleep_Quality'] = data['Sleep_Quality'].astype(int)
data['Noise_Exposure'] = data['Noise_Exposure'].astype(int)
data['Headache_Frequency'] = data['Headache_Frequency'].astype(int)

In [None]:
data.hist(figsize=(10,10), color='skyblue', bins=30, xlabelsize=8, ylabelsize=8, edgecolor='black')
plt.suptitle("Histogram of numerical columns in dataset")
plt.show()

In [None]:
# Sleep hours vs Age
plt.figure(figsize=(10,6))
sns.lineplot(x='Age', y='Sleep_Hours', data=data, marker='o', color='blue')
plt.title('Sleep hours vs Age')
plt.show()

In [None]:
# Sleep hours vs Stress level
plt.figure(figsize=(10,6))
sns.lineplot(x='Stress_Level', y='Sleep_Hours', data=data, marker='o', color='blue')
plt.title('Sleep hours vs Stress level')
plt.show()

In [None]:
# Screen time vs Stress level
plt.figure(figsize=(10,6))
sns.lineplot(x='Stress_Level', y='Screen_Time', data=data, marker='o', color='red')
plt.title('Screen time vs Stress level')
plt.show()

In [None]:
# Screen time vs stress level, colored by overstimulation
plt.figure(figsize=(10,6))
sns.scatterplot(x='Screen_Time', y='Stress_Level', data=data, marker='o', hue='Overstimulated')
plt.title('Screen time vs stress level by overstimulation')
plt.show()

In [None]:
sns.pairplot(data[['Age', 'Sleep_Hours', 'Screen_Time', 'Stress_Level', 'Overstimulated']], hue='Overstimulated', palette='husl')
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(data.corr(), annot=True, cmap='viridis', fmt='.2f', linewidths=0.5)
plt.title('Correlation heatmap')
plt.show()

In [None]:
# One-hot encoding for the non-binary columns
data = pd.get_dummies(data, columns=['Sensory_Sensitivity', 'Sleep_Quality', 
                                 'Noise_Exposure', 'Headache_Frequency'], drop_first=True)
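
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical toy column (not the real dataset). It assumes only the documented 0-4 levels of `Sensory_Sensitivity`: with `drop_first=True` the first level gets no indicator column and serves as the baseline.

```python
import pandas as pd

# Hypothetical toy column with the five documented Sensory_Sensitivity levels (0-4).
toy = pd.DataFrame({"Sensory_Sensitivity": [0, 1, 2, 3, 4, 2]})

full = pd.get_dummies(toy, columns=["Sensory_Sensitivity"])
reduced = pd.get_dummies(toy, columns=["Sensory_Sensitivity"], drop_first=True)

print(list(full.columns))     # one indicator column per level
print(list(reduced.columns))  # level 0 is dropped and acts as the baseline
```

Dropping one level avoids perfectly collinear indicator columns, which matters mostly for linear models such as logistic regression; tree-based models are generally less sensitive to it.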

In [None]:
# find all columns of bool type
bool_cols = data.select_dtypes(include='bool').columns

# convert them to int
data[bool_cols] = data[bool_cols].astype(int)
data.head()

In [None]:
data.describe()

In [None]:
# Age groups used for binning:
# 18-25 young adults
# 26-35 early adults
# 36-45 mid adults
# 46-60 older adults

In [None]:
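# Bin Age into groups, then one-hot encode the groups
labels = ['Young_Adults', 'Early_Adults', 'Mid_Adults', 'Older_Adults']
bins = [18, 25, 35, 45, 60]

# include_lowest=True keeps age 18 in the first bin (pd.cut intervals are right-closed by default)
data['Age'] = pd.cut(data['Age'], bins=bins, labels=labels, include_lowest=True)
# One-hot encode the age groups for the full dataset (the first group becomes the baseline)
data = pd.get_dummies(data, columns=['Age'], drop_first=True)
print(data.head())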
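
A note on the bin edges: `pd.cut` builds right-closed intervals by default, so the lowest edge (18) belongs to no bin unless `include_lowest=True` is passed. The sketch below illustrates this with hypothetical ages placed at the bin edges; it is not taken from the dataset.

```python
import pandas as pd

labels = ["Young_Adults", "Early_Adults", "Mid_Adults", "Older_Adults"]
bins = [18, 25, 35, 45, 60]
ages = pd.Series([18, 25, 26, 45, 60])  # hypothetical ages at the bin edges

# Default: intervals are (18, 25], (25, 35], ... so age 18 is left unassigned (NaN)
default_groups = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 18 is kept
inclusive_groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(default_groups.tolist())
print(inclusive_groups.tolist())
```

Since the documented age range starts at 18, the `include_lowest=True` variant is the one that matches the age groups listed above.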

In [None]:
bool_cols = data.select_dtypes(include='bool').columns

# convert them to int
data[bool_cols] = data[bool_cols].astype(int)
data.head()

3. Initial Insights & Hypotheses

* Do overstimulation levels correlate with specific environmental triggers (e.g., high noise levels)?
* Can we detect clusters of individuals based on overstimulation score?
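
One possible first look at the noise hypothesis is to compare the overstimulation rate across noise levels. The sketch below runs on synthetic data with an assumed relationship between the two columns, so its numbers say nothing about the real dataset. On the real data it would need the original `Noise_Exposure` column, before one-hot encoding.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in, NOT the Kaggle data.
n = 500
noise = rng.integers(0, 6, size=n)  # Noise_Exposure levels 0-5
# Assumed relationship, for illustration only: probability rises with noise
prob = 0.2 + 0.1 * noise
overstim = (rng.random(n) < prob).astype(int)
toy = pd.DataFrame({"Noise_Exposure": noise, "Overstimulated": overstim})

# Share of overstimulated individuals at each noise level
rate_by_noise = toy.groupby("Noise_Exposure")["Overstimulated"].mean()

# Pearson correlation with a binary variable (equivalent to the point-biserial correlation)
corr = toy["Noise_Exposure"].corr(toy["Overstimulated"])

print(rate_by_noise.round(2))
print(f"Correlation between noise exposure and overstimulation: {corr:.2f}")
```

A rate that climbs steadily with the noise level would support the hypothesis; a flat profile would suggest looking at other triggers first.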

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
X = data.drop(columns=['Overstimulated'])  # Features
y = data['Overstimulated']  # Target variable
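
A possible next step, given the imports above, is a stratified train/test split followed by scaling and a baseline classifier. The sketch below is only an illustration on synthetic data: the three feature columns, the labeling rule, and the names `X_toy` and `y_toy` are assumptions, and the resulting accuracy has no bearing on the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-in, NOT the Kaggle data.
toy = pd.DataFrame({
    "Sleep_Hours": rng.uniform(3, 10, n),
    "Screen_Time": rng.uniform(1, 12, n),
    "Stress_Level": rng.integers(1, 11, n),
})
# Assumed labeling rule, for illustration only
toy["Overstimulated"] = (
    (toy["Stress_Level"] + toy["Screen_Time"] - toy["Sleep_Hours"]) > 5
).astype(int)

X_toy = toy.drop(columns=["Overstimulated"])
y_toy = toy["Overstimulated"]

# Stratify so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Fit the scaler on the training split only, to avoid leaking test statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Toy test accuracy: {accuracy:.2f}")
```

With the real data, `X` and `y` from the cell above would take the place of `X_toy` and `y_toy`.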