In [13]:
%%capture
# Export this Notebook to PDF
!jupyter nbconvert --to pdf "Hypothesis Testing.ipynb" \
    --TagRemovePreprocessor.enabled=True  \
    --TagRemovePreprocessor.remove_cell_tags remove_cell \
    --TagRemovePreprocessor.remove_all_outputs_tags remove_output \
    --TagRemovePreprocessor.remove_input_tags remove_input;

In [1]:
# Make Jupyter reload library before every execution

%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/all.csv', parse_dates=True)

# Hypothesis Testing

The observed correlations from the data analysis and visualizations suggest several hypotheses that could be explored through further analysis:

1. **Impact of Sleep Disturbances on Quality:** Given the strong negative correlation between sleep disturbances and quality, we can hypothesize that increased sleep disturbances are likely to negatively impact the quality of sleep.

1. **Age in Relation to Sleep Duration:** The negative correlation between age and calculated night sleep duration leads to the hypothesis that sleep duration may decrease with age.

1. **Relationship Between Sleep Onset Time and Quality:** The moderate negative correlation observed between sleep onset time and quality suggests that a longer time to fall asleep might be associated with poorer sleep quality.

1. **Influence of Exercise on Sleep Duration and Quality:** The slight positive correlation between exercise days per week and sleep duration hints at a potential hypothesis that increased physical activity could contribute to longer and possibly better quality sleep.

1. **Nap Duration's Effect on Nighttime Sleep Duration:** Although the correlation is weak, we could investigate whether the duration of naps has any effect on the duration of nighttime sleep.

For each hypothesis, we aim to apply at least two hypothesis testing methods provided by the SciPy library.

## Hypothesis 1 - Increased sleep disturbances negatively impact the quality of sleep.

**Null Hypothesis ($H_0$)**: The level of Sleep Disturbances has no impact on Sleep Quality.

**Alternative Hypothesis ($H_1$)**: The level of Sleep Disturbances has a negative impact on Sleep Quality.

### Spearman Correlations Test

In [3]:
sleep_disturbances_mapping = {
    "Never": 0,
    "Rarely": 1,
    "Sometimes": 2,
    "Frequently": 3,
    "Often": 4,
}

df["Sleep Disturbances Ordinal"] = df["Sleep Disturbances"].map(
    sleep_disturbances_mapping
)


We will run a one-tailed negative *Spearman correlation test* by setting `alternative='less'` to measure the correlation.

In [4]:
import scipy.stats as stats

correlation, p_value = stats.spearmanr(df['Sleep Quality'].to_numpy(), df['Sleep Disturbances Ordinal'].to_numpy().astype(float), alternative='less')

print(f'Correlation: {correlation:.3f}')
print(f'P-value: {p_value}')

Correlation: -0.453
P-value: 4.198014093688786e-07


With the given results of a Spearman correlation coefficient ($\rho$) of approximately -0.453 and a p-value of approximately 4.2e-07, we can draw the following conclusions about the relationship between sleep disturbances and sleep quality:

- **Strength and Direction of Correlation:** The Spearman correlation coefficient of -0.453 indicates a moderate negative correlation between sleep disturbances and sleep quality. This means that as sleep disturbances increase (become more frequent), sleep quality tends to decrease (gets worse).

- **Statistical Significance:** The p-value is the probability of observing a correlation at least this strongly negative if there were no actual relationship in the population. A p-value of 4.2e-07 is extremely small, far below the common alpha level of 0.05 used to determine statistical significance. This means that the negative correlation observed is highly unlikely to be due to random variation in the sample; it is statistically significant.

**Conclusion:**
Based on the Spearman correlation test, we can confidently reject the null hypothesis that there is no correlation between sleep disturbances and sleep quality. The data supports the alternative hypothesis that sleep disturbances do affect sleep quality, with more disturbances associated with worse sleep quality. This result aligns with what might be expected intuitively: that individuals who experience more disturbances during sleep tend to report lower overall sleep quality.
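
To make the effect of `alternative='less'` concrete, the sketch below compares the default two-sided test with the one-tailed version. The arrays are small hypothetical values, not the survey responses; the point is only that, when the observed coefficient is negative, the one-tailed p-value is half the two-sided one.

```python
import numpy as np
from scipy import stats

# Hypothetical ordinal data (NOT the survey responses): higher disturbance
# codes are paired with lower quality scores.
disturbances = np.array([0, 1, 2, 3, 4, 1, 2, 3, 0, 4], dtype=float)
quality = np.array([5, 4, 4, 3, 2, 5, 3, 2, 4, 3], dtype=float)

# Default test: two-sided alternative (correlation differs from zero).
rho_two, p_two = stats.spearmanr(quality, disturbances)

# One-tailed test: alternative is that the correlation is negative.
rho_less, p_less = stats.spearmanr(quality, disturbances, alternative="less")

print(f"rho: {rho_less:.3f}")
print(f"two-sided p: {p_two:.5f}")
print(f"one-sided p: {p_less:.5f}")
```

The coefficient itself is the same in both calls; only the tail area used for the p-value changes.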

### Chi-squared Test

First, construct a contingency table between `Sleep Disturbances` and `Sleep Quality`:

In [5]:
contingency_table = pd.crosstab(df['Sleep Disturbances'], df['Sleep Quality'])
contingency_table

| Sleep Disturbances (rows) / Sleep Quality (columns) | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| Frequently | 5 | 2 | 0 | 0 |
| Never | 0 | 5 | 6 | 4 |
| Often | 2 | 1 | 0 | 0 |
| Rarely | 2 | 17 | 24 | 5 |
| Sometimes | 5 | 16 | 14 | 0 |


After that, the Chi-squared test can be run with the `scipy` library:

In [6]:
from scipy.stats import chi2_contingency

# Perform the Chi-squared test
chi2_stat, p_val, dof, ex = chi2_contingency(contingency_table)

# Print the results
print(f"Chi2 Stat: {chi2_stat}")
print(f"P Value: {p_val}")
print(f"Degrees of Freedom: {dof}")

Chi2 Stat: 46.03087922530431
P Value: 6.853850711377629e-06
Degrees of Freedom: 12


The results from the Chi-squared test on our data can be interpreted as follows:


**Chi-Squared Statistic (Chi2 Stat):**
The value is approximately 46.03, which is notably higher than the critical value of 21.026 for a chi-square distribution with 12 degrees of freedom at a significance level of 0.05. This considerable difference provides strong evidence to reject the null hypothesis.

**P-Value:**
The p-value is about 6.85e-06, which is significantly less than the common alpha level of 0.05. This low p-value indicates that the observed association between sleep disturbances and sleep quality is highly unlikely to have occurred by random chance.


**Conclusion:**
Given the very low p-value and the high Chi-squared statistic, we can reject the null hypothesis of independence. This means there is a statistically significant association between sleep disturbances and sleep quality in our dataset. In other words, the level of sleep disturbances appears to be related to the reported sleep quality of the individuals in our study.
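
The Chi-squared approximation is usually considered more reliable when the expected cell counts are not too small; a common rule of thumb asks for an expected count of at least 5 in most cells. The sketch below re-enters the counts from the contingency table above and counts how many expected frequencies fall under that threshold. The threshold of 5 is a convention assumed here, not something defined elsewhere in this notebook.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the contingency table above
# (rows: Frequently, Never, Often, Rarely, Sometimes; columns: quality 2-5).
observed = np.array([
    [5, 2, 0, 0],
    [0, 5, 6, 4],
    [2, 1, 0, 0],
    [2, 17, 24, 5],
    [5, 16, 14, 0],
])

# The fourth return value holds the expected frequencies under independence.
chi2_stat, p_val, dof, expected = chi2_contingency(observed)

# Count the cells whose expected frequency is below the rule-of-thumb value.
n_small = int((expected < 5).sum())
print(f"Chi2 Stat: {chi2_stat:.2f}, dof: {dof}")
print(f"Cells with expected count < 5: {n_small} of {expected.size}")
```

If a large share of cells falls below the threshold, the p-value should be read with some caution; merging sparse categories or leaning on the rank-based test above are possible alternatives.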

## Hypothesis 2 - Older people have shorter Night sleep

**Null Hypothesis ($H_0$)**: Age group has no impact on Sleep Duration.

**Alternative Hypothesis ($H_1$)**: Age group has a negative correlation with Sleep Duration.

### Pearson Correlations Test

Since the distribution of Calculated Night Sleep Duration is fairly normal, we'll use the Pearson test.

In [7]:
age_mapping = {
    "25-34": 30,
    "16-24": 20,
    "35-44": 40,
    "45-54": 50,
    "55+": 60,
}

df["Age Group Ordinal"] = df["Age Group"].map(age_mapping)

We will run a one-tailed negative *Pearson correlation test* by setting `alternative='less'` to measure the correlation.

In [8]:
import scipy.stats as stats

da = df.dropna(subset=['Age Group Ordinal','Calculated Night Sleep Duration'], axis='index')

correlation, p_value = stats.pearsonr(da['Calculated Night Sleep Duration'].to_numpy(), da['Age Group Ordinal'].to_numpy().astype(float), alternative='less')

print(f'Correlation: {correlation:.3f}')
print(f'P-value: {p_value}')

Correlation: -0.232
P-value: 0.008805297803731816


- **Correlation Coefficient:** The value is approximately -0.232, indicating a weak negative correlation between age group and sleep duration. This means that as age increases, there is a slight tendency for sleep duration to decrease.

- **P-value:** The p-value is approximately 0.0088, which is less than the conventional significance level of 0.05. The one-tailed test suggests that we can reject the null hypothesis in favor of the alternative hypothesis, which is that age group has a negative correlation with sleep duration.


**Conclusion:**
The test provides evidence to support the claim that older age groups tend to have shorter sleep duration. The correlation is not strong but is statistically significant, suggesting a slight trend where sleep duration decreases with increasing age. This result is consistent with the alternative hypothesis and indicates that age may be a factor influencing sleep duration. However, the correlation is weak, so while there may be a trend, it is not a definitive predictor of sleep duration. Other factors not considered in this test may also play a role in determining sleep duration across age groups.
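
The Pearson test relies on the data being roughly normal. One way this could be checked is a Shapiro-Wilk test, sketched below on synthetic durations drawn from a normal distribution. The synthetic sample and its parameters are assumptions for illustration only; in the notebook, the real column (for example `da['Calculated Night Sleep Duration']`) would be passed instead.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for night sleep duration in hours (hypothetical values,
# NOT the survey data).
rng = np.random.default_rng(0)
durations = rng.normal(loc=7.0, scale=1.0, size=100)

# Shapiro-Wilk tests the null hypothesis that the sample is normally distributed.
statistic, p_value = stats.shapiro(durations)
print(f"Shapiro-Wilk W: {statistic:.3f}")
print(f"P-value: {p_value:.3f}")
# A p-value above 0.05 means normality is not rejected at the 5% level.
```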

### Kendall's Tau test

In [9]:
import scipy.stats as stats

correlation, p_value = stats.kendalltau(da['Calculated Night Sleep Duration'].to_numpy(), da['Age Group Ordinal'].to_numpy().astype(float), alternative='less')

print(f'Correlation: {correlation:.3f}')
print(f'P-value: {p_value}')

Correlation: -0.204
P-value: 0.006194262319330456


- **Correlation Coefficient:**
The value of Kendall’s Tau is -0.204, which indicates a weak negative correlation between age group and sleep duration. This means that as the ordinal age group increases (which likely corresponds to increasing actual age), there is a slight tendency for sleep duration to decrease.
- **Statistical Significance:**
The p-value is approximately 0.00619, which is less than the conventional alpha level of 0.05 used for statistical significance. This indicates that the observed correlation is statistically significant and is not likely to have occurred by chance.


**Conclusion:**
Based on the Kendall’s Tau test, we can reject the null hypothesis that there is no association between age group and sleep duration. The data suggests that there is a statistically significant, albeit weak, negative relationship between the two. As people fall into higher ordinal age groups, they may experience a slight decrease in sleep duration. However, given the weak strength of the correlation, age alone is not a strong predictor of sleep duration, and other factors may also play a significant role.
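
For intuition about what Kendall's Tau measures, the sketch below counts concordant and discordant pairs by hand on a tiny hypothetical sample without ties and compares the result with `stats.kendalltau`. The sample values are made up for illustration. With ties, as in the survey data, SciPy's default tau-b variant also adjusts the denominator, which this simplified count does not do.

```python
from itertools import combinations

from scipy import stats

# Small hypothetical sample without ties (NOT the survey data).
x = [1, 2, 3, 4, 5]
y = [5, 3, 4, 1, 2]

concordant = discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    # A pair is concordant when both variables move in the same direction.
    sign = (x2 - x1) * (y2 - y1)
    if sign > 0:
        concordant += 1
    elif sign < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs
tau_scipy, _ = stats.kendalltau(x, y)

print(f"Concordant: {concordant}, discordant: {discordant}")
print(f"Manual tau: {tau_manual:.3f}, SciPy tau: {tau_scipy:.3f}")
```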

## Hypothesis 3 - The longer it takes to fall asleep, the worse Sleep Quality becomes

### Spearman Correlation Test

**Null Hypothesis ($H_0$)**: An increase in Sleep Onset Time has no impact on Sleep Quality.

**Alternative Hypothesis ($H_1$)**: An increase in Sleep Onset Time leads to a decline in Sleep Quality.


In [10]:
onset_mapping = {
    "<15 Minutes": 7.5,
    "30-60 Minutes": 45,
    "15-30 Minutes": 20,
    ">60 Minutes": 60,
}

df["Sleep Onset Time Ordinal"] = df["Sleep Onset Time"].map(onset_mapping)


In [11]:
import scipy.stats as stats

# Perform Spearman correlation test
correlation, p_value = stats.spearmanr(df['Sleep Quality'].to_numpy(), df['Sleep Onset Time Ordinal'].to_numpy().astype(float), alternative='less')

print(f'Correlation: {correlation}')
print(f'P-value: {p_value}')

Correlation: -0.37764117513828727
P-value: 2.7990152273389178e-05


- **Correlation Coefficient:** The negative correlation coefficient of approximately -0.378 indicates a moderate inverse relationship between sleep onset time and sleep quality. This suggests that longer times to fall asleep (indicating difficulty initiating sleep) are associated with lower sleep quality ratings.

- **Statistical Significance:** The p-value is the probability of observing a correlation at least this strongly negative if the null hypothesis were true. A p-value of 2.80e-05 is very small and well below the conventional alpha level of 0.05, which is commonly used to assess statistical significance. This indicates that the observed correlation is highly unlikely to have arisen from random variation alone.

**Conclusion:**

Based on the Spearman correlation test, we can reject the null hypothesis ($H_0$) that sleep onset time has no impact on sleep quality. Instead, we accept the alternative hypothesis ($H_1$) that there is a statistically significant negative relationship between sleep onset time and sleep quality. In practical terms, this result suggests that interventions aimed at reducing sleep onset time might be beneficial for improving overall sleep quality.
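
Because Spearman's test works on ranks, the exact numbers chosen for the onset-time categories do not affect the coefficient as long as their order is preserved. The sketch below illustrates this with hypothetical responses (not the survey data): coding the categories by approximate midpoints in minutes or by simple ranks 0 to 3 gives the same result. The `ranks` dictionary is an alternative coding introduced here for comparison only.

```python
import numpy as np
from scipy import stats

# Hypothetical responses (NOT the survey data).
onset = ["<15 Minutes", "15-30 Minutes", "30-60 Minutes", ">60 Minutes",
         "15-30 Minutes", "<15 Minutes", "30-60 Minutes", "15-30 Minutes"]
quality = np.array([5, 4, 3, 2, 3, 4, 2, 4], dtype=float)

# Two order-preserving codings of the same categories.
midpoints = {"<15 Minutes": 7.5, "15-30 Minutes": 20,
             "30-60 Minutes": 45, ">60 Minutes": 60}
ranks = {"<15 Minutes": 0, "15-30 Minutes": 1,
         "30-60 Minutes": 2, ">60 Minutes": 3}

onset_mid = np.array([midpoints[v] for v in onset], dtype=float)
onset_rank = np.array([ranks[v] for v in onset], dtype=float)

rho_mid, _ = stats.spearmanr(quality, onset_mid)
rho_rank, _ = stats.spearmanr(quality, onset_rank)
print(f"rho (midpoints): {rho_mid:.6f}")
print(f"rho (ranks):     {rho_rank:.6f}")
```

This invariance would not hold for the Pearson test, where the spacing between the codes matters.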

### Chi-squared Test

First, construct a contingency table between `Sleep Onset Time` and `Sleep Quality`:

In [12]:
contingency_table = pd.crosstab(df['Sleep Onset Time'], df['Sleep Quality'])
contingency_table

| Sleep Onset Time (rows) / Sleep Quality (columns) | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 15-30 Minutes | 5 | 22 | 24 | 4 |
| 30-60 Minutes | 5 | 12 | 4 | 0 |
| <15 Minutes | 2 | 5 | 14 | 5 |
| >60 Minutes | 2 | 2 | 2 | 0 |


After that, the Chi-squared test can be run with the `scipy` library:

In [13]:
from scipy.stats import chi2_contingency

# Perform the Chi-squared test
chi2_stat, p_val, dof, ex = chi2_contingency(contingency_table)

# Print the results
print(f"Chi2 Stat: {chi2_stat}")
print(f"Degrees of Freedom: {dof}")
print(f"P Value: {p_val}")

Chi2 Stat: 19.297119225405574
Degrees of Freedom: 9
P Value: 0.022781787964876524


The results from the Chi-squared test related to Hypothesis 3, which concerns the relationship between sleep onset time and sleep quality, can be interpreted as follows:


1. **Chi-Squared Statistic (Chi2 Stat):**
The calculated Chi-squared statistic of approximately 19.30 is greater than the critical value of approximately 16.92 for a chi-square distribution with 9 degrees of freedom at the 0.05 significance level. We can therefore conclude that there is a statistically significant association between "Sleep Onset Time" and "Sleep Quality". This means that the two variables are unlikely to be independent.

1. **P-Value:**
The p-value is approximately 0.0228, which is lower than the common alpha level of 0.05. This indicates that there is a statistically significant association between sleep onset time and sleep quality, although the evidence against the null hypothesis is weaker than a much smaller p-value would provide.


**Conclusion:**
Given the p-value of 0.0228, which is less than the alpha level of 0.05, we can reject the null hypothesis. This suggests that there is a statistically significant association between sleep onset time and sleep quality in our dataset. The evidence for the association is present but not overwhelming.
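
The Chi-squared statistic and its p-value describe the evidence for an association, not its strength. One commonly used effect-size measure for contingency tables is Cramér's V, defined as $\sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ for a table with $r$ rows, $c$ columns and $n$ observations. The sketch below computes it from the counts in the table above; Cramér's V is an addition for illustration and is not part of the original analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the Sleep Onset Time contingency table above
# (rows: 15-30, 30-60, <15, >60 minutes; columns: quality 2-5).
observed = np.array([
    [5, 22, 24, 4],
    [5, 12, 4, 0],
    [2, 5, 14, 5],
    [2, 2, 2, 0],
])

chi2_stat, p_val, dof, _ = chi2_contingency(observed)

n = observed.sum()                 # total number of observations
k = min(observed.shape) - 1        # min(rows, columns) - 1
cramers_v = np.sqrt(chi2_stat / (n * k))

print(f"Chi2 Stat: {chi2_stat:.3f}")
print(f"Cramér's V: {cramers_v:.3f}")
```

Values of V range from 0 (no association) to 1 (perfect association), so the result can be compared across tables of different sizes.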

## Hypothesis 4a - Weekly exercise frequency improves sleep quality

### Spearman Correlation Test

**Null Hypothesis ($H_0$)**: An increase in Weekly Exercise Frequency has no impact on Sleep Quality.

**Alternative Hypothesis ($H_1$)**: An increase in Weekly Exercise Frequency improves Sleep Quality.


In [14]:
exercise_mapping = {"0 Days": 0, "1-2 Days": 1, "3-4 Days": 2, "5+ Days": 3}
df["Exercise Days/Week Ordinal"] = df["Exercise Days/Week"].map(exercise_mapping)

Since we are testing for a positive correlation, `alternative='greater'` is used.

In [15]:
import scipy.stats as stats

# Perform Spearman correlation test
correlation, p_value = stats.spearmanr(df['Exercise Days/Week Ordinal'], df['Sleep Quality'], alternative='greater')

print(f'Correlation: {correlation}')
print(f'P-value: {p_value}')

Correlation: -0.0577165101075317
P-value: 0.7235175030385403



1. **Correlation Coefficient:**
The negative correlation coefficient of -0.0577 is very close to zero, indicating a very weak inverse relationship between exercise frequency and sleep quality. This suggests that as exercise frequency increases, there is a very slight tendency for sleep quality to decrease, but the effect is so small that it might not be meaningful in practical terms.

1. **Statistical Significance:**
The p-value of 0.724 is much higher than the common alpha level of 0.05. The data therefore provide no evidence of a positive correlation; the observed coefficient is in fact slightly negative, the opposite direction of the one-tailed alternative, which is why the p-value is so large. The result is not statistically significant.

**Conclusion:**
Based on this analysis, there is no evidence to support a significant relationship between exercise frequency and sleep quality in the dataset. The correlation is very weak and not statistically significant. This means that, within the data collected, changes in exercise frequency do not appear to have a notable impact on sleep quality. These findings suggest that other factors not captured in this analysis might play a more substantial role in determining sleep quality.

### Chi-squared Test

First, construct a contingency table between `Exercise Days/Week` and `Sleep Quality`:

In [16]:
contingency_table = pd.crosstab(df['Exercise Days/Week'], df['Sleep Quality'])
contingency_table

| Exercise Days/Week (rows) / Sleep Quality (columns) | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 0 Days | 4 | 6 | 8 | 5 |
| 1-2 Days | 4 | 20 | 18 | 1 |
| 3-4 Days | 5 | 7 | 13 | 2 |
| 5+ Days | 1 | 8 | 5 | 1 |


After that, the Chi-squared test can be run with the `scipy` library:

In [17]:
from scipy.stats import chi2_contingency

# Perform the Chi-squared test
chi2_stat, p_val, dof, ex = chi2_contingency(contingency_table)

# Print the results
print(f"Chi2 Stat: {chi2_stat}")
print(f"P Value: {p_val}")
print(f"Degrees of Freedom: {dof}")

Chi2 Stat: 13.219870509579255
P Value: 0.1529071255382214
Degrees of Freedom: 9


1. **Chi-Squared Statistic (Chi2 Stat):**
With a critical value of approximately 16.92 for a chi-square distribution with 9 degrees of freedom at a significance level ($\alpha$) of 0.05, it's important to note that the calculated chi-square statistic (13.2199) falls below this critical value. Consequently, this indicates insufficient evidence to reject the null hypothesis.

1. **P-Value:**
Given a p-value of 0.1529, which exceeds the common significance level of 0.05 (or 5%), we are unable to reject the null hypothesis ($H_0$). In essence, this implies that there isn't enough compelling evidence to support a significant association or difference between the variables under examination (Weekly Exercise Frequency and Sleep Quality). In simpler terms, the observed data does not significantly differ from what would be expected assuming no association or difference.

**Conclusion:**
In summary, the results of the chi-squared test indicate that we lack the requisite evidence to reject the null hypothesis. Therefore, at a 5% significance level, it is not possible to conclude that Weekly Exercise Frequency has a significant impact on Sleep Quality.
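
The critical values quoted in the Chi-squared sections can be reproduced from the chi-square distribution itself. The sketch below uses `scipy.stats.chi2` to compute the 0.05-level critical values for 12 and 9 degrees of freedom, and recovers the p-value for the statistic reported above as an upper-tail probability. The variable names are illustrative.

```python
from scipy.stats import chi2

alpha = 0.05

# Critical values: the point with upper-tail probability alpha.
crit_12 = chi2.ppf(1 - alpha, df=12)
crit_9 = chi2.ppf(1 - alpha, df=9)
print(f"Critical value (dof=12): {crit_12:.3f}")
print(f"Critical value (dof=9):  {crit_9:.3f}")

# The p-value is the upper-tail probability of the observed statistic,
# here the exercise vs. sleep quality statistic reported above.
p_exercise = chi2.sf(13.219870509579255, df=9)
print(f"P Value: {p_exercise:.4f}")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same decision rule, so they always agree.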

## Hypothesis 4b - Weekly exercise frequency improves Sleep Duration

### Spearman Correlation Test

**Null Hypothesis ($H_0$)**: An increase in Weekly Exercise Frequency has no impact on Sleep Duration.

**Alternative Hypothesis ($H_1$)**: An increase in Weekly Exercise Frequency increases Sleep Duration.


Since we are testing for a positive correlation, `alternative='greater'` is used.

In [18]:
import scipy.stats as stats

# Perform Spearman correlation test
da = df.dropna(subset=['Exercise Days/Week Ordinal','Calculated Night Sleep Duration'], axis='index')
correlation, p_value = stats.spearmanr(da['Exercise Days/Week Ordinal'], da['Calculated Night Sleep Duration'], alternative='greater')

print(f'Correlation: {correlation}')
print(f'P-value: {p_value}')

Correlation: 0.18054562562558002
P-value: 0.03265916110394638


1. **Correlation Coefficient:**
The Spearman correlation coefficient, denoted as $\rho$, is approximately 0.1805. This value represents a positive, albeit relatively weak, monotonic relationship between "Weekly Exercise Frequency" and "Night Sleep Duration." In practical terms, it implies that as weekly exercise frequency increases, there is a tendency for night sleep duration to increase slightly, but the relationship is weak.

1. **Statistical Significance:**
The p-value associated with the correlation coefficient is approximately 0.0327. This p-value is less than the conventional significance level of 0.05. Consequently, it indicates that the observed correlation between weekly exercise frequency and night sleep duration is statistically significant. This suggests that the observed relationship is unlikely to be a random occurrence.

**Conclusion:**
In summary, the data reveals a statistically significant, yet relatively weak, positive monotonic relationship between weekly exercise frequency and night sleep duration. This means that as weekly exercise frequency increases, there is evidence to suggest a slight increase in night sleep duration, but it's important to note that other factors may also play a role in determining sleep duration.

### ANOVA Test

In [19]:
import scipy.stats as stats

# Create a DataFrame
da = df.dropna(subset=['Exercise Days/Week','Calculated Night Sleep Duration'], axis='index')

# Perform one-way ANOVA
groups = da["Exercise Days/Week"].unique()
anova_results = []

for group in groups:
    group_data = da[da["Exercise Days/Week"] == group]["Calculated Night Sleep Duration"]
    anova_results.append(group_data)


# Perform the ANOVA
f_statistic, p_value = stats.f_oneway(*anova_results)
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

F-statistic: 2.9423229179890527
P-value: 0.03667055489715219



1. **F-statistic:**
The F-statistic of approximately 2.9423 indicates that there exists some variation between the group means, specifically in the context of sleep duration across exercise frequency groups.

1. **P-value:**
With a p-value of approximately 0.0367, which falls below the commonly selected significance level of 0.05, we conclude that the differences among the group means are statistically significant.

**Conclusion:**
We can infer that there are statistically significant differences in sleep duration among the exercise frequency groups.
In practical terms, this suggests that exercise frequency does indeed have a statistically significant impact on sleep duration. However, it's worth noting that the ANOVA itself doesn't elucidate the magnitude or direction of these differences. Further post-hoc tests or in-depth analysis may be necessary to explore the specific nature and extent of these variations.
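
One possible form of the post-hoc analysis mentioned above is a set of pairwise comparisons with a Bonferroni correction. The sketch below runs Welch's t-tests between every pair of groups and multiplies each p-value by the number of comparisons. The group data are synthetic and hypothetical (means, spreads and sizes are assumptions, not the survey data), and Bonferroni is only one of several correction methods.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Synthetic sleep durations (hours) per exercise group; hypothetical values,
# NOT the survey data.
rng = np.random.default_rng(42)
groups = {
    "0 Days": rng.normal(6.8, 1.0, 25),
    "1-2 Days": rng.normal(7.0, 1.0, 40),
    "3-4 Days": rng.normal(7.2, 1.0, 25),
    "5+ Days": rng.normal(7.5, 1.0, 15),
}

pairs = list(combinations(groups, 2))
results = []
for a, b in pairs:
    # Welch's t-test does not assume equal variances between the two groups.
    _, p = stats.ttest_ind(groups[a], groups[b], equal_var=False)
    # Bonferroni adjustment: scale by the number of comparisons, cap at 1.
    p_adj = min(p * len(pairs), 1.0)
    results.append((a, b, p_adj))
    print(f"{a} vs {b}: adjusted p = {p_adj:.3f}")
```

With the notebook's data, `groups` would instead be built from `da` by selecting `Calculated Night Sleep Duration` for each value of `Exercise Days/Week`.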

## Hypothesis 5 - Increased Nap Time decreases Night Sleep duration

### Spearman Correlation Test

**Null Hypothesis ($H_0$)**: An increase in Nap Duration has no impact on Night Sleep Duration.

**Alternative Hypothesis ($H_1$)**: An increase in Nap Duration shortens Night Sleep Duration.


In [20]:
nap_duration_mapping = {
    "No Nap": 0,
    "<30 Minutes": 15,
    "60-90 Minutes": 75,
    "30-60 Minutes": 45,
    ">90 Minutes": 90,
}

df["Nap Duration Ordinal"] = df["Nap Duration"].map(nap_duration_mapping)

In [21]:
import scipy.stats as stats

# Perform Spearman correlation test
da = df.dropna(subset=['Nap Duration Ordinal','Calculated Night Sleep Duration'], axis='index')

correlation, p_value = stats.spearmanr(da['Nap Duration Ordinal'], da['Calculated Night Sleep Duration'], alternative='less')

print(f'Correlation: {correlation}')
print(f'P-value: {p_value}')

Correlation: -0.12075730424094128
P-value: 0.10989663093511584


* **Correlation Coefficient:**
The correlation coefficient is approximately -0.121, indicating a very weak negative relationship between nap duration and night sleep duration. This suggests that as nap duration increases, there is a slight tendency for night sleep duration to decrease, but the relationship is not strong.
* **P-value:**
The p-value is approximately 0.110, which is above the conventional threshold of 0.05 for statistical significance. This means that the results do not provide enough evidence to reject the null hypothesis at the 5% significance level.

**Conclusion:**
Based on the results of the Spearman correlation test, there is not enough evidence to conclude that an increase in nap duration has a significant impact on night sleep duration. While there is a weak negative correlation, it is not statistically significant, and therefore, the null hypothesis cannot be rejected. It is important to note that this does not necessarily mean there is no relationship at all, but rather that the test did not detect a strong enough relationship in the sample data to assert that one exists in the population.

### Kendall's Tau

In [22]:
import scipy.stats as stats

correlation, p_value = stats.kendalltau(
    da["Nap Duration Ordinal"].to_numpy().astype(float),
    da["Calculated Night Sleep Duration"].to_numpy(),
    alternative="less",
)

print(f"Correlation: {correlation:.3f}")
print(f"P-value: {p_value}")

Correlation: -0.096
P-value: 0.11454464315718871


* **Correlation Coefficient:** The value is -0.096, which indicates a very weak negative correlation between the two variables. This suggests that as nap duration increases, there is a slight tendency for night sleep duration to decrease, but this tendency is not strong.

* **P-value:** The p-value is approximately 0.115, which is above the conventional alpha level of 0.05 used for determining statistical significance. This means that the negative correlation observed is not statistically significant.

**Conclusion:**
In conclusion, the analysis does not support the alternative hypothesis that increased nap duration has a significant impact on reducing night sleep duration. However, given the weak correlation, it is possible that a larger sample size or further studies could reveal more about the nature of this relationship.

# Conclusion

At a significance level of 0.05, our analysis reveals the following noteworthy associations between various factors and sleep quality or sleep duration:

* **Sleep Disturbance:** We find that an increase in sleep disturbances is linked to a notable deterioration in sleep quality. This suggests that individuals experiencing more sleep disruptions tend to report lower sleep quality, indicating the detrimental impact of such disturbances on overall restorative sleep.
* **Age Group and Sleep Duration:** Our statistical analysis delineates a weak yet statistically significant negative correlation between age group and sleep duration. This suggests a tendency for sleep duration to decrease modestly as age increases. Although the correlation is not robust, the trend indicates that older individuals may experience a slight reduction in the amount of sleep they obtain, warranting attention to sleep health in advancing age groups.

* **Sleep Onset Time:** Our analysis indicates that the time an individual takes to fall asleep is significantly associated with sleep quality. Specifically, a quicker onset of sleep is associated with better sleep quality, with a moderate correlation. This implies that individuals who are able to fall asleep more rapidly tend to report a better overall sleep experience.

* **Exercise Weekly Frequency:** Interestingly, our findings do not provide sufficient statistical evidence to support the notion that increasing exercise frequency significantly enhances sleep quality. However, a weak but statistically significant positive correlation exists between higher exercise frequency and longer sleep duration. This suggests that individuals who exercise more frequently tend to sleep slightly longer, although any improvement in sleep quality itself remains inconclusive.

* **Nap Duration:** Our examination does not reveal a clear-cut connection between the length of naps and the overall duration of nighttime sleep. The data does not present a compelling case for the hypothesis that longer naps correlate with either an increase or decrease in the total amount of sleep obtained at night. As such, it remains uncertain whether napping habits have a significant impact on the length of nocturnal sleep, warranting further investigation.



