# Hypothesis and Statistical Testing
---

Import libraries:

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import pingouin as pg

---

## Import cleaned data parquet file

Load the cleaned data produced by the first notebook.

In [2]:
df = pd.read_parquet("../data/mental_health_social_media_dataset_cleaned.parquet")

---
## Hypotheses to test

- Sleep vs Age Group
  - H₀: Mean sleep hours are equal across age groups
  - H₁: At least one group differs
- Stress vs Platform
  - H₀: Mean stress is equal across platforms
  - H₁: At least one platform differs
- Platform vs Mental State
  - H₀: No association between platform and mental state
  - H₁: They are associated
- Sleep vs Mental State
  - H₀: Sleep is equal across mental states
  - H₁: At least one mental_state differs
- Screen Time vs Stress
  - H₀: No linear correlation
  - H₁: Linear correlation exists
- Negative Interaction Ratio vs Stress
  - H₀: No linear correlation
  - H₁: Linear correlation exists

---
## Tests to use

To explore the hypotheses, I used a combination of ANOVA, chi-square, and correlation tests.

**ANOVA**
ANOVA compares the mean of a numerical variable across several groups, e.g. whether stress or sleep differs by platform or mental state. Some variables weren’t perfectly normal, but with a sample this large the F-test is reasonably robust to moderate departures from normality, so I treated ANOVA as adequate for this analysis.
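
Because the robustness argument leans on sample size, a rank-based cross-check can be reassuring. The sketch below runs a one-way ANOVA and a Kruskal-Wallis test side by side on synthetic, right-skewed data. The group names, sizes, and distributions are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import scipy.stats as stats

# Synthetic, right-skewed groups (illustrative only, not the notebook's data)
rng = np.random.default_rng(0)
groups = {
    "A": rng.gamma(shape=2.0, scale=1.0, size=500),
    "B": rng.gamma(shape=2.0, scale=1.0, size=500) + 0.5,
    "C": rng.gamma(shape=2.0, scale=1.0, size=500) + 1.0,
}

# Parametric test: compares group means
f_stat, p_anova = stats.f_oneway(*groups.values())

# Rank-based test: makes no normality assumption
h_stat, p_kruskal = stats.kruskal(*groups.values())

print(f"ANOVA F={f_stat:.2f}, p={p_anova:.3g}")
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_kruskal:.3g}")
```

If both tests point the same way, the ANOVA conclusion is unlikely to be an artefact of non-normality.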

**Chi-square**
For relationships between two categorical variables, like platform and mental state, the chi-square test is more appropriate because it checks whether the distribution of categories differs across groups.
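
To make the mechanics concrete, here is a minimal sketch of how the chi-square statistic is built from expected counts under independence. The 2x3 table is made up for illustration and has nothing to do with the dataset's counts.

```python
import numpy as np
import scipy.stats as stats

# Toy 2x3 contingency table (made-up counts, not from the dataset)
observed = np.array([
    [30, 50, 20],
    [20, 30, 50],
])

# Expected count under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Cross-check against SciPy (no continuity correction is applied when dof > 1)
chi2, p, dof, expected = stats.chi2_contingency(observed)
print("manual chi2:", chi2_manual, "scipy chi2:", chi2, "dof:", dof)
```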

**Correlation tests**
For examining how two numerical variables move together, such as screen time and stress, or negative interactions and stress, I used Pearson and Spearman correlation. Pearson captures linear relationships, while Spearman checks monotonic trends, making them a good pair when the data may not follow perfect linear patterns.
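
The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. The sketch below uses a synthetic exponential curve (not the dataset): Spearman sees a perfect monotonic trend, while Pearson is pulled down by the curvature.

```python
import numpy as np
import scipy.stats as stats

# Synthetic monotonic but non-linear relationship (illustrative only)
x = np.linspace(0, 5, 200)
y = np.exp(x)

# Pearson measures linear association on the raw values
r, _ = stats.pearsonr(x, y)

# Spearman measures association between the ranks
rho, _ = stats.spearmanr(x, y)

print("Pearson r:", r)
print("Spearman rho:", rho)
```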

These tests cover all the key variable types in the dataset and provide a clear, appropriate way to evaluate the hypotheses.

---
## Sleep vs Age Group

Hypotheses:

- H₀: Mean sleep hours are equal across age groups
- H₁: At least one group differs

Code:

In [3]:
# Select the relevant columns
data = df[["sleep_hours", "age_group"]]

# Perform one-way ANOVA
anova = pg.anova(data=data, dv="sleep_hours", between="age_group", detailed=True)

# Display the ANOVA results
anova

| Source | SS | DF | MS | F | p-unc | np2 |
| --- | --- | --- | --- | --- | --- | --- |
| age_group | 1215.360066 | 5 | 243.072013 | 5898.930092 | 0.0 | 0.855199 |
| Within | 205.783356 | 4994 | 0.041206 | | | |


Results:

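A one-way ANOVA revealed extremely large differences in mean sleep hours across age groups.

The p-value prints as 0.0 because it is smaller than floating-point precision can represent, so p < 0.001 and the differences in mean sleep between age groups are statistically significant.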

The effect size was exceptionally high, indicating that age group explained approximately 85.5% of the total variance.

Such a strong effect is far greater than would be expected in real-world behavioural data and likely reflects the synthetic structure of the dataset.

We reject H₀: Mean sleep hours are equal across age groups and accept H₁: At least one group differs.
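
As a quick check on where the F statistic and the effect size come from, the sketch below recomputes them from the sums of squares in the ANOVA table above. In a one-way design, partial eta squared (the `np2` column) is the same quantity as eta squared.

```python
# Values copied from the ANOVA table above
ss_between, df_between = 1215.360066, 5
ss_within, df_within = 205.783356, 4994

# Mean squares: sum of squares divided by degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F statistic: ratio of between-group to within-group mean squares
f_stat = ms_between / ms_within

# Eta squared: share of total variance attributable to the grouping factor
eta_sq = ss_between / (ss_between + ss_within)

print(f"F = {f_stat:.2f}, eta squared = {eta_sq:.4f}")
```

Small differences from the table in the last decimals are expected, since the inputs here are the rounded values that were displayed.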

---
## Stress vs Platform

Hypotheses:
- H₀: Mean stress is equal across platforms
- H₁: At least one platform differs

Code:

In [4]:
# Select the relevant columns
data = df[["stress_level", "platform"]]

# Perform one-way ANOVA
anova = pg.anova(data=data, dv="stress_level", between="platform", detailed=True)

# Display the ANOVA results
anova

| Source | SS | DF | MS | F | p-unc | np2 |
| --- | --- | --- | --- | --- | --- | --- |
| platform | 1076.422753 | 6 | 179.403792 | 196.194538 | 3.4678859999999997e-225 | 0.190784 |
| Within | 4565.688447 | 4993 | 0.914418 | | | |


Results:

A one-way ANOVA showed a statistically significant effect of platform on the stress level.

With p < 0.001, mean stress level differs significantly across platforms.

The effect size was large, indicating that platform accounts for about 19% of the total variance.

This represents a substantial difference between platforms, consistent with observations in the EDA.

We can reject H₀: Mean stress is equal across platforms and accept H₁: At least one platform differs.
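
The ANOVA only says that at least one platform differs, not which ones. One possible follow-up, sketched below on synthetic data, is pairwise Welch t-tests with a Bonferroni correction. The platform labels `A`, `B`, `C` and their stress values are invented for illustration; this is not a step the notebook ran.

```python
from itertools import combinations

import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in for the notebook's data (illustrative only)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "platform": np.repeat(["A", "B", "C"], 200),
    "stress_level": np.concatenate([
        rng.normal(5.0, 1.0, 200),
        rng.normal(5.0, 1.0, 200),
        rng.normal(6.0, 1.0, 200),
    ]),
})

pairs = list(combinations(sorted(toy["platform"].unique()), 2))
results = []
for a, b in pairs:
    x = toy.loc[toy["platform"] == a, "stress_level"]
    y = toy.loc[toy["platform"] == b, "stress_level"]
    # Welch's t-test: does not assume equal variances
    t, p = stats.ttest_ind(x, y, equal_var=False)
    # Bonferroni: multiply by the number of comparisons, cap at 1
    results.append({"pair": f"{a} vs {b}", "t": t, "p_bonf": min(p * len(pairs), 1.0)})

posthoc = pd.DataFrame(results)
print(posthoc)
```

On the real data, the same loop could be pointed at `df` with the `platform` and `stress_level` columns used above.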

---
## Platform vs Mental State

Hypotheses:
- H₀: No association between platform and mental state
- H₁: They are associated

Code:

In [5]:
# Build the platform x mental_state contingency table
ct = pd.crosstab(df["platform"], df["mental_state"])

# Chi-square test
chi2, p, dof, expected = stats.chi2_contingency(ct)
print("Chi-square:", chi2)
print("p-value:", p)
print("dof:", dof)

Chi-square: 466.0026390307361
p-value: 3.765329048527234e-92
dof: 12


Results:

There is overwhelming statistical evidence that platform and mental_state are not independent.

With p < 0.001, the distribution of mental_state differs significantly across platforms. The p-value alone does not measure how strong the association is.

We can reject H₀: No association between platform and mental state and accept H₁: They are associated.
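
To gauge the strength of the association rather than just its significance, Cramér's V can be computed from the reported chi-square value. The sketch below assumes n = 5000 rows, 7 platforms, and 3 mental states; these are inferred from the degrees of freedom in the ANOVA tables and are not printed by the cell above.

```python
import math

# Reported above
chi2 = 466.0026390307361
dof = 12

# Assumptions, not printed in the cell above:
n = 5000     # inferred from the ANOVA tables' total degrees of freedom (4999 + 1)
n_rows = 7   # platforms (ANOVA DF of 6, plus 1)
n_cols = 3   # mental states (ANOVA DF of 2, plus 1)
assert (n_rows - 1) * (n_cols - 1) == dof  # sanity check against the reported dof

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v = math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))
print(f"Cramér's V = {cramers_v:.3f}")
```

Under these assumptions V comes out near 0.22, which is usually read as a moderate association rather than a strong one.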

---
## Sleep vs Mental State

Hypotheses:
- H₀: Sleep is equal across mental states
- H₁: At least one mental_state differs

Code:

In [6]:
# Select the relevant columns
data = df[["sleep_hours", "mental_state"]]

# Perform one-way ANOVA
anova = pg.anova(data=data, dv="sleep_hours", between="mental_state", detailed=True)

# Display the ANOVA results
anova

| Source | SS | DF | MS | F | p-unc | np2 |
| --- | --- | --- | --- | --- | --- | --- |
| mental_state | 353.511891 | 2 | 176.755946 | 827.298029 | 4.416012e-311 | 0.248752 |
| Within | 1067.631531 | 4997 | 0.213654 | | | |


Results:

A one-way ANOVA showed a significant effect of mental state on sleep duration.

With p < 0.001, data this extreme would be very unlikely if all groups had equal mean sleep hours.

The effect size was large (np2 ≈ 0.249), indicating that the mental-state category explains roughly 25% of the total variance in sleep hours.

This suggests substantial differences in sleep patterns across mental-state groups.

We can reject H₀: Sleep is equal across mental states and accept H₁: At least one mental_state differs.

---
## Screen Time vs Stress

Hypotheses:
- H₀: No linear correlation
- H₁: Linear correlation exists

Code:

In [7]:
# Select the relevant columns
data = df[["daily_screen_time_min", "stress_level"]]

# Pearson
r, p = stats.pearsonr(data["daily_screen_time_min"], data["stress_level"])
print("Pearson r:", r, "p-value:", p)

# Spearman (non-parametric)
rho, p_s = stats.spearmanr(data["daily_screen_time_min"], data["stress_level"])
print("Spearman rho:", rho, "p-value:", p_s)

Pearson r: 0.835955242510056 p-value: 0.0
Spearman rho: 0.8122534740068954 p-value: 0.0


Results:

Pearson correlation analysis showed a strong positive association between daily screen time and stress level (r ≈ 0.84).

As daily screen time increases, stress level increases in a strongly linear way.

Spearman’s rank correlation produced a similarly high value (ρ ≈ 0.81). Both p-values print as 0.0 because they are smaller than floating-point precision can represent, so p < 0.001.

Individuals with higher screen time tend to rank higher on stress as well, so the association does not depend on the relationship being perfectly linear.

These results suggest that higher screen time is strongly associated with higher stress levels.

The magnitude of the correlation is exceptionally large for behavioural data and likely reflects the structured relationships inherent in the synthetic dataset rather than real-world psychological effects.

We can reject H₀: No linear correlation and accept H₁: Linear correlation exists.
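
Since the p-value underflows to 0.0, a confidence interval says more about the estimate than the p-value does. The sketch below uses the Fisher z-transform on the reported Pearson r. The sample size n = 5000 is an assumption inferred from the ANOVA tables (total degrees of freedom 4999 + 1), not something printed by the cell above.

```python
import math

# Reported above
r = 0.835955242510056
# Assumption: n = 5000, inferred from the ANOVA tables (total DF 4999 + 1)
n = 5000

# Fisher z-transform makes the sampling distribution approximately normal
z = math.atanh(r)
se = 1.0 / math.sqrt(n - 3)

# 95% interval on the z scale, mapped back to the r scale
z_crit = 1.959963984540054  # 97.5th percentile of the standard normal
ci_low = math.tanh(z - z_crit * se)
ci_high = math.tanh(z + z_crit * se)

print(f"r = {r:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"r squared = {r**2:.3f}")
```

The squared correlation is the share of variance in one variable that a straight-line fit on the other would account for.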

---
## Negative Interaction Ratio vs Stress

Hypotheses:
- H₀: No linear correlation
- H₁: Linear correlation exists

Code:

In [8]:
# Select the relevant columns
data = df[["interaction_negative_ratio", "stress_level"]].dropna()

# Pearson
r, p = stats.pearsonr(data["interaction_negative_ratio"], data["stress_level"])
print("Pearson r:", r, "p-value:", p)

# Spearman (non-parametric)
rho, p_s = stats.spearmanr(data["interaction_negative_ratio"], data["stress_level"])
print("Spearman rho:", rho, "p-value:", p_s)

Pearson r: 0.4434964276173199 p-value: 5.123507147633844e-240
Spearman rho: 0.29359723808182703 p-value: 5.645097179149306e-100


Results:

Pearson correlation analysis revealed a moderate positive relationship between negative interaction ratio and stress level (r ≈ 0.44, p < 0.001).

Spearman’s rank correlation also indicated a significant but weaker monotonic association (ρ ≈ 0.29, p < 0.001).

These results suggest that individuals who experience a higher proportion of negative social interactions tend to report higher stress levels.

The effect size is moderate and consistent with realistic behavioural patterns, although the strength of the relationship is influenced by the structured nature of the synthetic dataset.

We can reject H₀: No linear correlation and accept H₁: Linear correlation exists.

---
## Conclusions

Overall, the statistical tests backed up what I saw in the visual EDA.

Sleep patterns differed a lot across both age groups and mental-state categories, and platform was also linked to clear differences in stress.

There was a statistically significant association between platform and mental_state, showing that some platforms tend to have more stressed or at-risk users than others.

I also found that higher screen time was strongly related to higher stress, and a higher ratio of negative interactions was moderately linked to stress as well.

These patterns are very consistent, though stronger than what you'd expect in real data, which reflects the synthetic nature of the dataset.

Hypothesis results:

- Sleep vs Age Group
  - ~~H₀: Mean sleep hours are equal across age groups~~
  - H₁: At least one group differs
- Stress vs Platform
  - ~~H₀: Mean stress is equal across platforms~~
  - H₁: At least one platform differs
- Platform vs Mental State
  - ~~H₀: No association between platform and mental state~~
  - H₁: They are associated
- Sleep vs Mental State
  - ~~H₀: Sleep is equal across mental states~~
  - H₁: At least one mental_state differs
- Screen Time vs Stress
  - ~~H₀: No linear correlation~~
  - H₁: Linear correlation exists
- Negative Interaction Ratio vs Stress
  - ~~H₀: No linear correlation~~
  - H₁: Linear correlation exists
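
Six tests were run on the same dataset, so it is worth checking that the conclusions survive a multiple-comparison correction. The sketch below applies a Bonferroni adjustment to the p-values reported in the sections above (Pearson p-values for the two correlation sections). The 0.05 significance level is an assumption, since the notebook does not state one.

```python
# p-values as reported in the sections above (0.0 means below float precision)
p_values = {
    "Sleep vs Age Group": 0.0,
    "Stress vs Platform": 3.4678859999999997e-225,
    "Platform vs Mental State": 3.765329048527234e-92,
    "Sleep vs Mental State": 4.416012e-311,
    "Screen Time vs Stress (Pearson)": 0.0,
    "Negative Interaction Ratio vs Stress (Pearson)": 5.123507147633844e-240,
}

alpha = 0.05  # assumed significance level
m = len(p_values)

# Bonferroni: multiply each p-value by the number of tests, cap at 1
adjusted = {name: min(p * m, 1.0) for name, p in p_values.items()}

for name, p_adj in adjusted.items():
    print(f"{name}: adjusted p = {p_adj:.3g}, reject H0 = {p_adj < alpha}")
```

With p-values this small, the correction does not change any of the decisions listed above.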