In [11]:
%%capture
# Export this Notebook to PDF
!jupyter nbconvert --to pdf "Hypothesis Testing.ipynb" \
    --TagRemovePreprocessor.enabled=True  \
    --TagRemovePreprocessor.remove_cell_tags remove_cell \
    --TagRemovePreprocessor.remove_all_outputs_tags remove_output \
    --TagRemovePreprocessor.remove_input_tags remove_input;

In [1]:
# Make Jupyter reload library before every execution

%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/all.csv', parse_dates=True)

# Hypothesis Testing

The observed correlations from the data analysis and visualizations suggest several hypotheses that could be explored through further analysis:

1. **Impact of Sleep Disturbances on Quality:** Given the strong negative correlation between sleep disturbances and quality, we can hypothesize that increased sleep disturbances are likely to negatively impact the quality of sleep.

1. **Age in Relation to Sleep Duration:** The negative correlation between age and calculated night sleep duration leads to the hypothesis that sleep duration may decrease with age.

1. **Relationship Between Sleep Onset Time and Quality:** The moderate negative correlation observed between sleep onset time and quality suggests that a longer time to fall asleep might be associated with poorer sleep quality.

1. **Influence of Exercise on Sleep Duration and Quality:** The slight positive correlation between exercise days per week and sleep duration hints at a potential hypothesis that increased physical activity could contribute to longer and possibly better quality sleep.

1. **Nap Duration's Effect on Nighttime Sleep Duration and Quality:** Although the correlation is weak, we could investigate whether the duration of naps has any effect on the duration of nighttime sleep.

## Hypothesis 1 - Increased sleep disturbances negatively impact the quality of sleep.

**Null Hypothesis ($H_0$)**: The level of Sleep Disturbances has no impact on Sleep Quality.

**Alternative Hypothesis ($H_1$)**: The level of Sleep Disturbances has a negative impact on Sleep Quality.

### Spearman Correlation Test

In [3]:
sleep_disturbances_mapping = {
    "Never": 0,
    "Rarely": 1,
    "Sometimes": 2,
    "Frequently": 3,
    "Often": 4,
}

df["Sleep Disturbances Ordinal"] = df["Sleep Disturbances"].map(
    sleep_disturbances_mapping
)
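
`Series.map` returns `NaN` for any label that is missing from the dictionary, so a typo or an unexpected category would silently fall out of the test. A minimal sketch of a coverage check, using a hypothetical list of responses in place of the `Sleep Disturbances` column (`find_unmapped` is an illustrative helper, not part of the notebook):

```python
# Ordinal codes copied from the mapping above.
sleep_disturbances_mapping = {
    "Never": 0,
    "Rarely": 1,
    "Sometimes": 2,
    "Frequently": 3,
    "Often": 4,
}


def find_unmapped(labels, mapping):
    """Return the distinct labels that have no entry in `mapping`."""
    return sorted(set(labels) - set(mapping))


# Hypothetical responses; "sometimes" (lowercase) is a deliberate mismatch.
responses = ["Never", "Rarely", "sometimes", "Often", "Rarely"]
unmapped = find_unmapped(responses, sleep_disturbances_mapping)
print(unmapped)  # ['sometimes']
```

On the real data, the same check could be run as `find_unmapped(df["Sleep Disturbances"].dropna().unique(), sleep_disturbances_mapping)`.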


We run a one-tailed *Spearman correlation test*, setting `alternative='less'` to test specifically for a negative correlation.

In [4]:
import scipy.stats as stats

correlation, p_value = stats.spearmanr(
    df['Sleep Quality'].to_numpy(),
    df['Sleep Disturbances Ordinal'].to_numpy().astype(float),
    alternative='less',  # one-tailed: H1 is a negative correlation
)

print(f'Correlation: {correlation:.3f}')
print(f'P-value: {p_value}')

Correlation: -0.453
P-value: 4.198014093688787e-07


With the given results of a Spearman correlation coefficient ($\rho$) of approximately -0.453 and a p-value of approximately 4.2e-07, we can draw the following conclusions about the relationship between sleep disturbances and sleep quality:

- **Strength and Direction of Correlation:** The Spearman correlation coefficient of -0.453 indicates a moderate negative correlation between sleep disturbances and sleep quality. This means that as sleep disturbances increase (become more frequent), sleep quality tends to decrease (gets worse).

- **Statistical Significance:** The p-value is the probability of observing a correlation at least this strongly negative if there were no monotonic relationship in the population. A p-value of 4.2e-07 is far below the common alpha level of 0.05, so the observed negative correlation is very unlikely to be a product of random sampling variation; it is statistically significant.

**Conclusion:**
Based on the Spearman correlation test, we reject the null hypothesis of no correlation between sleep disturbances and sleep quality. The data support the alternative hypothesis: more frequent sleep disturbances are associated with worse sleep quality. Because this is a correlational result, it does not by itself establish that disturbances cause the lower quality, but it aligns with what might be expected intuitively: individuals who experience more disturbances during sleep tend to report lower overall sleep quality.
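
Spearman's $\rho$ is the Pearson correlation of the rank-transformed data, with tied values sharing their average rank. The sketch below computes it in plain Python on hypothetical toy values (not the survey data) to make the arithmetic visible. The helper names `average_ranks`, `pearson`, and `spearman` are illustrative; `stats.spearmanr` remains the function used in this notebook.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: disturbance codes (0-4) and quality ratings (2-5).
disturbance = [0, 1, 1, 2, 3, 4]
quality = [5, 4, 4, 3, 3, 2]
rho = spearman(disturbance, quality)
print(f"rho = {rho:.3f}")
```

Both variables in this analysis are ordinal with many ties, which is why the average-rank treatment matters here.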

### Chi-squared Test

First, construct a contingency table between `Sleep Disturbances` and `Sleep Quality`:

In [5]:
contingency_table = pd.crosstab(df['Sleep Disturbances'], df['Sleep Quality'])
contingency_table

| Sleep Disturbances \ Sleep Quality | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| Frequently | 5 | 2 | 0 | 0 |
| Never | 0 | 5 | 6 | 4 |
| Often | 2 | 1 | 0 | 0 |
| Rarely | 2 | 17 | 24 | 5 |
| Sometimes | 5 | 16 | 14 | 0 |


Next, run the Chi-squared test of independence with the `scipy` library:

In [6]:
from scipy.stats import chi2_contingency

# Perform the Chi-squared test
chi2_stat, p_val, dof, ex = chi2_contingency(contingency_table)

# Print the results
print(f"Chi2 Stat: {chi2_stat}")
print(f"P Value: {p_val}")

Chi2 Stat: 46.03087922530431
P Value: 6.853850711377629e-06


The results of the Chi-squared test can be interpreted as follows:


**Chi-Squared Statistic (Chi2 Stat):**
The value is approximately 46.03. This statistic measures the difference between the observed and expected frequencies in the contingency table. A higher value typically indicates a stronger departure from the null hypothesis.

**P-Value:**
The p-value is about 6.85e-06, far below the common alpha level of 0.05. A table that departs this much from independence would be very unlikely if sleep disturbances and sleep quality were unrelated.


**Conclusion:**
Given the very low p-value, we reject the null hypothesis of independence: there is a statistically significant association between sleep disturbances and sleep quality in this dataset. Note that the Chi-squared test treats both variables as unordered categories, so it shows that reported sleep quality is related to the level of sleep disturbances but says nothing about the direction of that relationship.
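
Under independence, the expected count of each cell is (row total × column total) / N, and the statistic sums (observed - expected)² / expected over all cells. The sketch below recomputes both from the contingency table above in plain Python. It also counts the cells whose expected frequency falls below 5, a common rule of thumb for when the Chi-squared approximation becomes unreliable. The helper names and the threshold of 5 are assumptions made for this illustration.

```python
# Observed counts copied from the contingency table above.
# Rows: Frequently, Never, Often, Rarely, Sometimes; columns: quality 2, 3, 4, 5.
observed = [
    [5, 2, 0, 0],
    [0, 5, 6, 4],
    [2, 1, 0, 0],
    [2, 17, 24, 5],
    [5, 16, 14, 0],
]


def expected_counts(table):
    """Expected cell counts if rows and columns were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi2_statistic(table):
    """Pearson Chi-squared statistic (no continuity correction)."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_counts(observed)
chi2 = chi2_statistic(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
n_cells = len(observed) * len(observed[0])
n_small = sum(e < 5 for row in expected for e in row)

print(f"chi2 = {chi2:.2f}, dof = {dof}")
print(f"cells with expected count < 5: {n_small} of {n_cells}")
```

For this table, 13 of the 20 cells have an expected count below 5, so the p-value should be read with some caution; merging sparse categories is one common way to address this.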

## Hypothesis 3 - The longer it takes to fall asleep, the worse Sleep Quality becomes

### Spearman Correlation Test

**Null Hypothesis ($H_0$)**: An increase in Sleep Onset Time has no impact on Sleep Quality.

**Alternative Hypothesis ($H_1$)**: An increase in Sleep Onset Time leads to a decline in Sleep Quality.


In [7]:
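# Approximate minutes per bin, listed in ascending order.
# Spearman's test uses only the rank order of these values.
onset_mapping = {
    "<15 Minutes": 7.5,
    "15-30 Minutes": 20,
    "30-60 Minutes": 45,
    ">60 Minutes": 60,
}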

df["Sleep Onset Time Ordinal"] = df["Sleep Onset Time"].map(onset_mapping)
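
Because Spearman's test uses only the rank order of the values, any encoding that sorts the categories the same way yields the same coefficient. The specific numbers (for example 60 for the open-ended `>60 Minutes` bin) therefore do not affect the result. A small sketch that checks the ordering of the mapping above against a hypothetical 0-3 coding (`simple_codes` and `category_order` are illustrative names):

```python
# Values copied from the mapping used in this notebook.
onset_mapping = {
    "<15 Minutes": 7.5,
    "15-30 Minutes": 20,
    "30-60 Minutes": 45,
    ">60 Minutes": 60,
}

# Hypothetical alternative: plain ordinal codes.
simple_codes = {
    "<15 Minutes": 0,
    "15-30 Minutes": 1,
    "30-60 Minutes": 2,
    ">60 Minutes": 3,
}


def category_order(mapping):
    """Categories sorted by their numeric code, lowest first."""
    return sorted(mapping, key=mapping.get)


same_order = category_order(onset_mapping) == category_order(simple_codes)
print(same_order)  # True
```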


In [8]:
import scipy.stats as stats

# Perform Spearman correlation test
correlation, p_value = stats.spearmanr(
    df['Sleep Quality'].to_numpy(),
    df['Sleep Onset Time Ordinal'].to_numpy().astype(float),
    alternative='less',  # one-tailed: H1 is a negative correlation
)

print(f'Correlation: {correlation}')
print(f'P-value: {p_value}')

Correlation: -0.37764117513828727
P-value: 2.7990152273389178e-05


-  **Correlation Coefficient:** The negative correlation coefficient indicates an inverse relationship between sleep onset time and sleep quality. This suggests that longer times to fall asleep (indicating difficulty initiating sleep) are associated with lower sleep quality ratings.

-  **Statistical Significance:** The p-value is the probability of observing a correlation at least this strongly negative if sleep onset time and sleep quality were unrelated. A p-value of 2.80e-05 is well below the conventional alpha level of 0.05, so the observed correlation is very unlikely to be a product of random sampling variation.

**Conclusion:**

Based on the Spearman correlation test, we reject the null hypothesis ($H_0$) that sleep onset time has no impact on sleep quality. The data support the alternative hypothesis ($H_1$): there is a statistically significant negative relationship between sleep onset time and sleep quality. In practical terms, interventions aimed at reducing sleep onset time might improve overall sleep quality, although a correlation alone cannot confirm a causal effect.
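
One way to make the one-sided p-value concrete is a permutation test: shuffling one variable breaks any real association, so the share of shuffles whose correlation is at least as negative as the observed one estimates the p-value. The sketch below does this on hypothetical, already rank-transformed data (not the survey responses). The helper names, toy values, seed, and number of shuffles are all illustrative assumptions, and the estimate will not match `stats.spearmanr` exactly.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def permutation_p_less(x_ranks, y_ranks, n_shuffles=5000, seed=0):
    """Estimate the one-sided p-value P(rho <= observed) by shuffling y."""
    rng = random.Random(seed)
    observed = pearson(x_ranks, y_ranks)
    shuffled = list(y_ranks)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        # Small tolerance so float round-off does not hide exact ties.
        if pearson(x_ranks, shuffled) <= observed + 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_shuffles + 1)


# Hypothetical ranks (ties share their average rank), not the survey data.
onset_ranks = [1.5, 1.5, 3.0, 4.0, 5.5, 5.5, 7.0, 8.0]
quality_ranks = [8.0, 6.5, 6.5, 4.5, 4.5, 3.0, 1.5, 1.5]

rho, p_value = permutation_p_less(onset_ranks, quality_ranks)
print(f"rho = {rho:.3f}, one-sided permutation p-value = {p_value:.4f}")
```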

### Chi-squared Test

First, construct a contingency table between `Sleep Onset Time` and `Sleep Quality`:

In [9]:
contingency_table = pd.crosstab(df['Sleep Onset Time'], df['Sleep Quality'])
contingency_table

| Sleep Onset Time \ Sleep Quality | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 15-30 Minutes | 5 | 22 | 24 | 4 |
| 30-60 Minutes | 5 | 12 | 4 | 0 |
| <15 Minutes | 2 | 5 | 14 | 5 |
| >60 Minutes | 2 | 2 | 2 | 0 |


Next, run the Chi-squared test of independence with the `scipy` library:

In [10]:
from scipy.stats import chi2_contingency

# Perform the Chi-squared test
chi2_stat, p_val, dof, ex = chi2_contingency(contingency_table)

# Print the results
print(f"Chi2 Stat: {chi2_stat}")
print(f"P Value: {p_val}")

Chi2 Stat: 19.297119225405574
P Value: 0.022781787964876524


The results from the Chi-squared test related to Hypothesis 3, which concerns the relationship between sleep onset time and sleep quality, can be interpreted as follows:


**Chi-Squared Statistic (Chi2 Stat):**

The value is approximately 19.30. This statistic measures how much the observed frequencies in the data deviate from the frequencies that would be expected if there were no association between sleep onset time and sleep quality. A higher value typically indicates a stronger departure from the null hypothesis.

**P-Value:**

The p-value is approximately 0.0228, which is lower than the common alpha level of 0.05. This indicates a statistically significant association between sleep onset time and sleep quality. The evidence against independence is weaker than it would be for a much smaller p-value, but the p-value alone does not measure how strong the association is.


**Conclusion:**

Given the p-value of 0.0228, which is less than the alpha level of 0.05, we reject the null hypothesis of independence. This suggests a statistically significant association between sleep onset time and sleep quality in this dataset. The evidence is noticeably weaker than for sleep disturbances in Hypothesis 1 (p ≈ 6.85e-06), and the Chi-squared statistic by itself does not quantify the strength of the relationship.
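
A common way to put a number on the strength of an association in a contingency table is Cramér's V, $V = \sqrt{\chi^2 / (N \cdot (\min(r, c) - 1))}$, which ranges from 0 (no association) to 1. Cramér's V is not part of the original analysis; the sketch below is only an illustration that reuses the statistics and table shapes reported in this notebook, with `cramers_v` as an assumed helper name.

```python
from math import sqrt


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V effect size for an n_rows x n_cols contingency table."""
    return sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Chi-squared statistics printed above; each table holds N = 108 responses
# (the sum of all cell counts).
v_disturbances = cramers_v(46.03087922530431, n=108, n_rows=5, n_cols=4)
v_onset = cramers_v(19.297119225405574, n=108, n_rows=4, n_cols=4)

print(f"Sleep Disturbances vs. Sleep Quality: V = {v_disturbances:.3f}")
print(f"Sleep Onset Time vs. Sleep Quality:   V = {v_onset:.3f}")
```

This gives V ≈ 0.38 for sleep disturbances and V ≈ 0.24 for sleep onset time, which is consistent with the weaker evidence found for Hypothesis 3.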

Comparing the observed frequencies against the expected frequencies gives insight into the nature of the association. For instance, certain sleep onset times may be more frequently associated with particular levels of sleep quality, indicating patterns that could be of interest for further study or interventions.
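
One way to make that comparison is through Pearson residuals, (observed - expected) / √expected. Cells with large positive values are over-represented relative to independence, and cells with large negative values are under-represented. A sketch on the onset-time table above, in plain Python; `pearson_residuals` is an illustrative helper name.

```python
from math import sqrt

# Observed counts copied from the Sleep Onset Time contingency table above.
row_labels = ["15-30 Minutes", "30-60 Minutes", "<15 Minutes", ">60 Minutes"]
col_labels = [2, 3, 4, 5]  # Sleep Quality ratings
observed = [
    [5, 22, 24, 4],
    [5, 12, 4, 0],
    [2, 5, 14, 5],
    [2, 2, 2, 0],
]


def pearson_residuals(table):
    """(observed - expected) / sqrt(expected) for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for r_total, row in zip(row_totals, table):
        expected_row = [r_total * c_total / n for c_total in col_totals]
        residuals.append([(o - e) / sqrt(e) for o, e in zip(row, expected_row)])
    return residuals


residuals = pearson_residuals(observed)
for label, row in zip(row_labels, residuals):
    print(f"{label:>14}: " + "  ".join(f"{value:+.2f}" for value in row))

# The squared residuals add up to the Chi-squared statistic.
chi2 = sum(value ** 2 for row in residuals for value in row)

# Cell that deviates most from independence, as (|residual|, row, column).
largest = max(
    (abs(value), r_label, c_label)
    for r_label, row in zip(row_labels, residuals)
    for c_label, value in zip(col_labels, row)
)
print(f"chi2 = {chi2:.2f}; largest deviation: {largest[1]}, quality {largest[2]}")
```

Here the largest residual belongs to the `<15 Minutes` row at quality 5: respondents who fall asleep quickly report the top quality rating more often than independence would predict.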