# Team# 02 Project_ File# 04_ Big Data 
# 02423029_ Suleman Khan
# 02323027_ Muhammad Farooq
# 02421070_ Adnan Mumtaz 

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('all_1.csv', parse_dates=True)

# Hypothesis Testing

The observed correlations from the data analysis and visualizations suggest several hypotheses that could be explored through further analysis:

1. **Impact of Sleep Disturbances on Quality:** Given the strong negative correlation between sleep disturbances and quality, we can hypothesize that increased sleep disturbances are likely to negatively impact the quality of sleep.

1. **Age in Relation to Sleep Duration:** The negative correlation between age and calculated night sleep duration leads to the hypothesis that sleep duration may decrease with age.

1. **Relationship Between Sleep Onset Time and Quality:** The moderate negative correlation observed between sleep onset time and quality suggests that a longer time to fall asleep might be associated with poorer sleep quality.

1. **Influence of Exercise on Sleep Duration and Quality:** The slight positive correlation between exercise days per week and sleep duration hints at a potential hypothesis that increased physical activity could contribute to longer and possibly better quality sleep.

1. **Nap Duration's Effect on Nighttime Sleep Duration:** Although the correlation is weak, we could investigate whether the duration of naps has any effect on the duration of nighttime sleep.

For each hypothesis, we aim to apply at least two hypothesis tests provided by the SciPy library.

## Hypothesis 1 - Increased sleep disturbances negatively impact the quality of sleep.

**Null Hypothesis ($H_0$)**: The level of Sleep Disturbances has no impact on Sleep Quality.

**Alternative Hypothesis ($H_1$)**: The level of Sleep Disturbances has a negative impact on Sleep Quality.

### Spearman Correlations Test

In [4]:
sleep_disturbances_mapping = {
    "Never": 0,
    "Rarely": 1,
    "Sometimes": 2,
    "Frequently": 3,
    "Often": 4,
}

df["Sleep Disturbances Ordinal"] = df["Sleep Disturbances"].map(
    sleep_disturbances_mapping
)
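
One caveat with `Series.map` and a dictionary: any label that is not a key in the mapping becomes `NaN` without raising an error. The following is a minimal sketch of a sanity check on a small hypothetical series; the label `"Occasionally"` is invented for illustration and is not taken from the dataset.

```python
import pandas as pd

# Same mapping as in the cell above.
sleep_disturbances_mapping = {
    "Never": 0, "Rarely": 1, "Sometimes": 2, "Frequently": 3, "Often": 4,
}

# Hypothetical responses; "Occasionally" is deliberately absent from the mapping.
responses = pd.Series(["Never", "Often", "Occasionally", "Rarely", None])

encoded = responses.map(sleep_disturbances_mapping)

# Labels that occur in the data but have no entry in the mapping.
unmapped_labels = sorted(set(responses.dropna()) - set(sleep_disturbances_mapping))

# Missing values after encoding: original gaps plus unmapped labels.
n_missing_after = int(encoded.isna().sum())

print("Unmapped labels:", unmapped_labels)
print("Missing after mapping:", n_missing_after)
```

Running the same check on `df["Sleep Disturbances"]` would show whether the mapping covers every label in the survey.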


We will run a one-tailed negative *Spearman correlation test* by setting `alternative='less'`, which tests whether the correlation is less than zero.

In [5]:
import scipy.stats as stats

correlation, p_value = stats.spearmanr(df['Sleep Quality'].to_numpy(), df['Sleep Disturbances Ordinal'].to_numpy().astype(float), alternative='less')

print(f'Correlation: {correlation:.3f}')
print(f'P-value: {p_value}')

Correlation: 0.177
P-value: 0.9999516219309827


**With a Spearman correlation coefficient ($\rho$) of approximately 0.177 and a p-value of approximately 0.99995, we can draw the following conclusions about the relationship between sleep disturbances and sleep quality:**

- **Strength and Direction of Correlation:** The Spearman correlation coefficient of 0.177 indicates a weak positive correlation between sleep disturbances and sleep quality. As sleep disturbances become more frequent, there is a slight tendency for reported sleep quality to increase, which is the opposite of the hypothesized direction.

- **Statistical Significance:** The p-value is the probability of observing a correlation at least as far in the direction of the alternative (here, at least as negative) if there were no actual relationship in the population. Because the test is one-sided with `alternative='less'` and the observed correlation is positive, the p-value of 0.99995 is far above the common alpha level of 0.05. A p-value this close to 1 does not show that the correlation is random noise; it shows that the data point away from the alternative. The opposite one-sided test (`alternative='greater'`) would have a p-value of about $1 - 0.99995 \approx 0.00005$.

**Conclusion:**
Based on the one-sided Spearman correlation test, we cannot reject the null hypothesis in favor of a negative correlation between sleep disturbances and sleep quality. The data do not support the alternative hypothesis that more frequent sleep disturbances are associated with lower sleep quality; if anything, the weak association in this dataset runs in the positive direction.
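
How the `alternative` argument changes the p-value can be illustrated with a small sketch. The data below are synthetic (generated with a fixed seed, not the survey data) and are built to have a mild positive association, so the `'less'` test returns a large p-value while the `'greater'` test returns a small one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data with a mild positive association (not the survey data).
x = rng.integers(0, 5, size=300).astype(float)
y = x + rng.normal(scale=4.0, size=300)

# Same coefficient, three different alternatives.
rho, p_less = stats.spearmanr(x, y, alternative="less")
_, p_greater = stats.spearmanr(x, y, alternative="greater")
_, p_two_sided = stats.spearmanr(x, y)  # default: alternative="two-sided"

print(f"rho = {rho:.3f}")
print(f"p (less)      = {p_less:.4f}")
print(f"p (greater)   = {p_greater:.4f}")
print(f"p (two-sided) = {p_two_sided:.4f}")
```

The two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them, so a one-sided p-value near 1 signals an effect in the unexpected direction rather than the absence of any effect.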

### Chi-squared Test

First, construct a contingency table between `Sleep Disturbances` and `Sleep Quality`.

In [6]:
contingency_table = pd.crosstab(df['Sleep Disturbances'], df['Sleep Quality'])
contingency_table

| Sleep Disturbances \ Sleep Quality | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 |
|---|---|---|---|---|---|---|---|
| Frequently | 2 | 2 | 0 | 25 | 26 | 13 | 11 |
| Never | 5 | 6 | 6 | 20 | 10 | 20 | 12 |
| Often | 1 | 0 | 1 | 19 | 11 | 24 | 18 |
| Rarely | 17 | 27 | 7 | 19 | 19 | 22 | 14 |
| Sometimes | 16 | 14 | 2 | 22 | 25 | 30 | 16 |


After that, the Chi-squared test can be conducted with the `scipy` library.

In [7]:
from scipy.stats import chi2_contingency

# Perform the Chi-squared test
chi2_stat, p_val, dof, ex = chi2_contingency(contingency_table)

# Print the results
print(f"Chi2 Stat: {chi2_stat}")
print(f"P Value: {p_val}")
print(f"Degrees of Freedom: {dof}")

Chi2 Stat: 86.14103292472487
P Value: 6.21340151287131e-09
Degrees of Freedom: 24


**The results from the Chi-squared test of independence between `Sleep Disturbances` and `Sleep Quality` can be interpreted as follows:**

**Chi-Squared Statistic (Chi2 Stat):**
The value is approximately 86.14, well above the critical value of about 36.42 for a chi-square distribution with 24 degrees of freedom at a significance level of 0.05. This is strong evidence against the null hypothesis of independence.

**P-Value:**
The p-value is about 6.21e-09, far below the common alpha level of 0.05. An association this strong would be highly unlikely if the two variables were independent.

**Degrees of Freedom:**
The degrees of freedom for this test is 24, that is, $(5 - 1) \times (7 - 1)$ for the 5 disturbance levels and 7 quality ratings.

**Conclusion:**
Given the very low p-value and the high Chi-squared statistic, we can reject the null hypothesis of independence: there is a statistically significant association between sleep disturbances and sleep quality in our dataset. The Chi-squared test treats both variables as unordered categories, so it does not indicate the direction of the association and does not by itself confirm that more disturbances mean lower quality.
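
The comparison against the critical value can be reproduced directly from the counts in the contingency table. The sketch below re-enters those counts by hand and also computes Cramér's V, an effect-size measure that is not part of the original analysis and is added here only as one common way to gauge the strength of an association.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Counts copied from the contingency table above
# (rows: Frequently, Never, Often, Rarely, Sometimes; columns: quality 3.0 to 9.0).
observed = np.array([
    [2, 2, 0, 25, 26, 13, 11],
    [5, 6, 6, 20, 10, 20, 12],
    [1, 0, 1, 19, 11, 24, 18],
    [17, 27, 7, 19, 19, 22, 14],
    [16, 14, 2, 22, 25, 30, 16],
])

chi2_stat, p_val, dof, expected = chi2_contingency(observed)

# Critical value: 95th percentile of the chi-squared distribution with `dof` degrees of freedom.
critical_value = chi2.ppf(0.95, dof)

# Cramér's V (assumption: added for illustration): 0 = no association, 1 = perfect association.
n = observed.sum()
cramers_v = np.sqrt(chi2_stat / (n * (min(observed.shape) - 1)))

print(f"chi2 = {chi2_stat:.2f}, critical value = {critical_value:.2f}, dof = {dof}")
print(f"Cramér's V = {cramers_v:.3f}")
```

A very small p-value with a modest Cramér's V would indicate an association that is statistically detectable but not necessarily large.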


## Hypothesis 2 - Older people have shorter Night sleep

**Null Hypothesis ($H_0$)**: Age group has no impact on Sleep Duration.

**Alternative Hypothesis ($H_1$)**: Age group has a negative correlation with Sleep Duration.

### Pearson Correlations Test

Assuming the distribution of Calculated Night Sleep Duration, the variable tested here, is approximately normal, we'll use the Pearson test.

In [8]:
age_mapping = {
    "25-34": 30,
    "16-24": 20,
    "35-44": 40,
    "45-54": 50,
    "55+": 60,
}

df["Age Group Ordinal"] = df["Age Group"].map(age_mapping)

We will run a one-tailed negative *Pearson correlation test* by setting `alternative='less'`, which tests whether the correlation is less than zero.

In [9]:
import scipy.stats as stats

da = df.dropna(subset=['Age Group Ordinal','Calculated Night Sleep Duration'], axis='index')

correlation, p_value = stats.pearsonr(da['Calculated Night Sleep Duration'].to_numpy(), da['Age Group Ordinal'].to_numpy().astype(float), alternative='less')

print(f'Correlation: {correlation:.3f}')
print(f'P-value: {p_value}')

Correlation: -0.044
P-value: 0.17660073379867178


**The results from the Pearson correlation test on our data can be interpreted as follows:**

- **Correlation Coefficient:** The value is approximately -0.044, indicating a very weak negative correlation between age group and calculated night sleep duration. This means that as age increases, there is a slight tendency for sleep duration to decrease, but the relationship is very weak.

- **P-value:** The p-value is approximately 0.1766, which is greater than the conventional significance level of 0.05. This suggests that we cannot reject the null hypothesis, and there is not enough evidence to support the claim that age group has a significant negative correlation with sleep duration.

**Conclusion:**
The test does not provide sufficient evidence to support the claim that older age groups tend to have shorter sleep duration. The correlation is very weak and not statistically significant, indicating that age may not be a significant factor influencing sleep duration in this dataset. Other factors not considered in this test may play a more substantial role in determining sleep duration across age groups.

### Kendall's Tau test

In [10]:
import scipy.stats as stats

correlation, p_value = stats.kendalltau(da['Calculated Night Sleep Duration'].to_numpy(), da['Age Group Ordinal'].to_numpy().astype(float), alternative='less')

print(f'Correlation: {correlation:.3f}')
print(f'P-value: {p_value}')

Correlation: -0.032
P-value: 0.20047771303612327


**The results from the Kendall's Tau correlation test on our data can be interpreted as follows:**

- **Correlation Coefficient:**
  The value of Kendall’s Tau is -0.032, which indicates a very weak negative correlation between age group and calculated night sleep duration. This means that as the ordinal age group increases (which likely corresponds to increasing actual age), there is a slight tendency for sleep duration to decrease, but the relationship is very weak.

- **Statistical Significance:**
  The p-value is approximately 0.2005, which is greater than the conventional alpha level of 0.05 used for statistical significance. The observed correlation is therefore not statistically significant: a correlation of this size would not be unusual if there were no association between the two variables.

**Conclusion:**
Based on the Kendall’s Tau test, we cannot reject the null hypothesis that there is no association between age group and sleep duration. The data suggests that there is no statistically significant relationship between the two. As people fall into higher ordinal age groups, there is no meaningful decrease in sleep duration. Given the very weak strength of the correlation and the lack of statistical significance, age alone is not a predictor of sleep duration, and other factors may also play a significant role.
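
One reason to pair Pearson with a rank-based test such as Kendall's Tau is that the numeric codes assigned to the age groups are a modelling choice. The sketch below uses synthetic data (not the survey data) and a hypothetical alternative code of 75 for the open-ended `55+` group to show that Kendall's Tau depends only on the ordering of the codes, while Pearson's coefficient also depends on their spacing.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins: an age-group index 0..4 and a sleep duration in hours.
group_index = rng.integers(0, 5, size=200)
sleep_hours = rng.normal(loc=7.0, scale=1.0, size=200)

# Evenly spaced codes, as in age_mapping.
midpoints = np.array([20, 30, 40, 50, 60])[group_index]
# Hypothetical alternative: a larger code for the open-ended "55+" group.
uneven = np.array([20, 30, 40, 50, 75])[group_index]

tau_mid, _ = stats.kendalltau(sleep_hours, midpoints)
tau_uneven, _ = stats.kendalltau(sleep_hours, uneven)
r_mid, _ = stats.pearsonr(sleep_hours, midpoints)
r_uneven, _ = stats.pearsonr(sleep_hours, uneven)

print(f"Kendall tau: {tau_mid:.4f} vs {tau_uneven:.4f}")
print(f"Pearson r:   {r_mid:.4f} vs {r_uneven:.4f}")
```

Because both encodings preserve the order of the groups, the two Kendall values coincide, which makes the rank-based result insensitive to how `55+` is encoded.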

## Hypothesis 3 - The longer it takes to fall asleep, the worse Sleep Quality becomes

### Spearman Correlation Test

**Null Hypothesis ($H_0$)**: An increase in Sleep Onset Time has no impact on Sleep Quality.

**Alternative Hypothesis ($H_1$)**: An increase in Sleep Onset Time leads to a decline in Sleep Quality.


In [11]:
onset_mapping = {
    "<15 Minutes": 7.5,
    "30-60 Minutes": 45,
    "15-30 Minutes": 20,
    ">60 Minutes": 60,
}

df["Sleep Onset Time Ordinal"] = df["Sleep Onset Time"].map(onset_mapping)


In [12]:
import scipy.stats as stats

# Perform Spearman correlation test
correlation, p_value = stats.spearmanr(df['Sleep Quality'].to_numpy(), df['Sleep Onset Time Ordinal'].to_numpy().astype(float), alternative='less')

print(f'Correlation: {correlation}')
print(f'P-value: {p_value}')

Correlation: 0.034082749220749824
P-value: 0.7723304853756148


**The results from the Spearman correlation test on our data can be interpreted as follows:**

- **Correlation Coefficient:** The value of the Spearman correlation coefficient is approximately 0.034, indicating a very weak positive correlation between sleep onset time and sleep quality. This suggests that longer times to fall asleep are slightly associated with higher sleep quality ratings, but the relationship is very weak and counterintuitive.

- **Statistical Significance:** The p-value is approximately 0.7723, which is much greater than the conventional alpha level of 0.05 used for statistical significance. Since the test is one-sided (`alternative='less'`) and the observed coefficient is slightly positive, the data provide no evidence of a negative correlation.

**Conclusion:**

Based on the Spearman correlation test, we cannot reject the null hypothesis ($H_0$) that sleep onset time has no impact on sleep quality. The data does not support the alternative hypothesis ($H_1$) that there is a statistically significant relationship between sleep onset time and sleep quality. In practical terms, this result suggests that sleep onset time may not be a significant factor influencing overall sleep quality in this dataset.
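
Unlike the cells for Hypothesis 2, the Spearman call for this hypothesis passes the columns to `spearmanr` without a preceding `dropna`. The reported coefficient is a finite number, so these columns apparently contain no missing values, but the behavior with missing values is worth knowing. A minimal sketch on synthetic arrays (not the survey data):

```python
import numpy as np
from scipy import stats

# Synthetic columns with one missing value each (not the survey data).
quality = np.array([7.0, 6.0, np.nan, 8.0, 5.0, 9.0, 6.0, 7.0])
onset = np.array([20.0, 45.0, 7.5, np.nan, 60.0, 7.5, 20.0, 45.0])

# Default policy: a NaN anywhere in the input propagates to the result.
rho_propagate, p_propagate = stats.spearmanr(quality, onset, nan_policy="propagate")

# Keep only the rows where both values are present, then test.
mask = ~np.isnan(quality) & ~np.isnan(onset)
rho_clean, p_clean = stats.spearmanr(quality[mask], onset[mask], alternative="less")

print(f"with NaN rows:    rho = {rho_propagate}")
print(f"without NaN rows: rho = {rho_clean:.3f} (n = {mask.sum()})")
```

Dropping incomplete rows explicitly, as the Hypothesis 2 cells do with `df.dropna(subset=...)`, keeps the number of observations used by each test visible.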

### Chi-squared Test

First, construct a contingency table between `Sleep Onset Time` and `Sleep Quality`.

In [13]:
contingency_table = pd.crosstab(df['Sleep Onset Time'], df['Sleep Quality'])
contingency_table

| Sleep Onset Time \ Sleep Quality | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 |
|---|---|---|---|---|---|---|---|
| 15-30 Minutes | 22 | 27 | 8 | 46 | 50 | 46 | 37 |
| 30-60 Minutes | 12 | 4 | 1 | 18 | 24 | 29 | 15 |
| <15 Minutes | 5 | 16 | 6 | 33 | 10 | 27 | 18 |
| >60 Minutes | 2 | 2 | 1 | 8 | 7 | 7 | 1 |


After that, the Chi-squared test can be conducted with the `scipy` library.

In [14]:
from scipy.stats import chi2_contingency

# Perform the Chi-squared test
chi2_stat, p_val, dof, ex = chi2_contingency(contingency_table)

# Print the results
print(f"Chi2 Stat: {chi2_stat}")
print(f"Degrees of Freedom: {dof}")
print(f"P Value: {p_val}")

Chi2 Stat: 31.234526501533338
Degrees of Freedom: 18
P Value: 0.027039609724060935


**The results from the Chi-squared test related to Hypothesis 3, which concerns the relationship between sleep onset time and sleep quality, can be interpreted as follows:**

1. **Chi-Squared Statistic (Chi2 Stat):**
   The calculated Chi-square statistic is approximately 31.23. This is greater than the critical value of about 28.87 for a chi-square distribution with 18 degrees of freedom at the 0.05 significance level, so the hypothesis that "Sleep Onset Time" and "Sleep Quality" are independent is rejected at that level.

2. **P-Value:**
   The p-value is approximately 0.0270, which is lower than the common alpha level of 0.05. This indicates a statistically significant association between sleep onset time and sleep quality. The p-value measures the strength of the evidence against independence, not the strength of the association itself, and here the evidence is moderate: the result would not be significant at the 0.01 level.

**Conclusion:**
Given the p-value of 0.0270, which is less than the alpha level of 0.05, we can reject the null hypothesis of independence. There is a statistically significant association between sleep onset time and sleep quality in our dataset. Because the Chi-squared test ignores the ordering of the categories, this result does not contradict the Spearman test: the two variables appear to be related, but not through a consistent decrease in quality as onset time grows.
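
The Chi-squared approximation is less reliable when many cells have small expected counts; a common rule of thumb asks that no more than about 20% of cells have an expected count below 5. The sketch below re-enters the counts from the contingency table by hand and checks the expected frequencies that `chi2_contingency` returns.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the contingency table above
# (rows: 15-30, 30-60, <15, >60 minutes; columns: quality 3.0 to 9.0).
observed = np.array([
    [22, 27, 8, 46, 50, 46, 37],
    [12, 4, 1, 18, 24, 29, 15],
    [5, 16, 6, 33, 10, 27, 18],
    [2, 2, 1, 8, 7, 7, 1],
])

chi2_stat, p_val, dof, expected = chi2_contingency(observed)

# Count cells whose expected frequency under independence is below 5.
n_low = int((expected < 5).sum())
share_low = n_low / expected.size

print(f"dof = {dof}")
print(f"cells with expected count < 5: {n_low} of {expected.size} ({share_low:.0%})")
```

If the share of sparse cells is high, merging neighbouring categories (for example the two longest onset-time groups) before testing is one possible remedy; that choice is a suggestion and not part of the original analysis.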

## Hypothesis 4a - Weekly exercise frequency improves sleep quality

### Spearman Correlation Test

**Null Hypothesis ($H_0$)**: An increase in weekly exercise frequency has no impact on Sleep Quality.

**Alternative Hypothesis ($H_1$)**: An increase in weekly exercise frequency improves Sleep Quality.


In [15]:
exercise_mapping = {"0 Days": 0, "1-2 Days": 1, "3-4 Days": 2, "5+ Days": 3}
df["Exercise Days/Week Ordinal"] = df["Exercise Days/Week"].map(exercise_mapping)

Since we are testing for a positive correlation, `alternative='greater'` is used.

In [16]:
import scipy.stats as stats

# Perform Spearman correlation test
correlation, p_value = stats.spearmanr(df['Exercise Days/Week Ordinal'], df['Sleep Quality'], alternative='greater')

print(f'Correlation: {correlation}')
print(f'P-value: {p_value}')

Correlation: 0.04418610205195544
P-value: 0.1665127520643026


1. **Correlation Coefficient:**
   The correlation coefficient of approximately 0.044 indicates a very weak positive relationship between exercise frequency and sleep quality. This suggests that as exercise frequency increases, there is a very slight tendency for sleep quality to increase, but the effect is so small that it might not be meaningful in practical terms.

2. **Statistical Significance:**
   The p-value of approximately 0.1665 is higher than the common alpha level of 0.05. A correlation at least this large would occur about 17% of the time if there were no association, so the result is not statistically significant.

**Conclusion:**
Based on this analysis, there is no evidence to support a significant relationship between exercise frequency and sleep quality in the dataset. The correlation is very weak and not statistically significant. This means that, within the data collected, changes in exercise frequency do not appear to have a notable impact on sleep quality. These findings suggest that other factors not captured in this analysis might play a more substantial role in determining sleep quality.

### Chi-squared Test

First, construct a contingency table between `Exercise Days/Week` and `Sleep Quality`.

In [17]:
contingency_table = pd.crosstab(df['Exercise Days/Week'], df['Sleep Quality'])
contingency_table

| Exercise Days/Week \ Sleep Quality | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 |
|---|---|---|---|---|---|---|---|
| 0 Days | 6 | 11 | 5 | 31 | 6 | 0 | 32 |
| 1-2 Days | 20 | 20 | 8 | 42 | 45 | 71 | 0 |
| 3-4 Days | 7 | 13 | 2 | 32 | 39 | 38 | 39 |
| 5+ Days | 8 | 5 | 1 | 0 | 1 | 0 | 0 |


After that, the Chi-squared test can be conducted with the `scipy` library.

In [18]:
from scipy.stats import chi2_contingency

# Perform the Chi-squared test
chi2_stat, p_val, dof, ex = chi2_contingency(contingency_table)

# Print the results
print(f"Chi2 Stat: {chi2_stat}")
print(f"P Value: {p_val}")
print(f"Degrees of Freedom: {dof}")

Chi2 Stat: 177.97745172391032
P Value: 2.411246515176397e-28
Degrees of Freedom: 18


1. **Chi-Squared Statistic (Chi2 Stat):**
   The calculated Chi-square statistic is approximately 177.98, far above the critical value of about 28.87 for a chi-square distribution with 18 degrees of freedom at a significance level ($\alpha$) of 0.05. This is strong evidence against the null hypothesis of independence.

2. **P-Value:**
   The p-value is approximately 2.41e-28, far below the common alpha level of 0.05. An association this strong between exercise frequency and sleep quality would be highly unlikely if the two variables were independent.

**Conclusion:**
Given the very low p-value and the high Chi-squared statistic, we can reject the null hypothesis of independence. There is a statistically significant association between exercise frequency and sleep quality in our dataset. The test shows association, not causation or direction, and the Spearman test above found no significant increasing trend, so this result does not show that more exercise improves sleep quality. The `5+ Days` row contains only 15 respondents, so the Chi-squared approximation should be read with some caution for this table.
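
To see which cells drive a large Chi-squared statistic, one can inspect the Pearson residuals, $(O - E) / \sqrt{E}$, for each cell. This step is not part of the original analysis; the sketch below re-enters the counts from the contingency table by hand and is offered only as an illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the contingency table above
# (rows: 0, 1-2, 3-4, 5+ days; columns: quality 3.0 to 9.0).
observed = np.array([
    [6, 11, 5, 31, 6, 0, 32],
    [20, 20, 8, 42, 45, 71, 0],
    [7, 13, 2, 32, 39, 38, 39],
    [8, 5, 1, 0, 1, 0, 0],
])

chi2_stat, p_val, dof, expected = chi2_contingency(observed)

# Pearson residuals: positive where a cell holds more observations than independence predicts.
residuals = (observed - expected) / np.sqrt(expected)

# The squared residuals add up to the chi-squared statistic.
total = float((residuals ** 2).sum())

# Position (row index, column index) of the cell contributing most to the statistic.
row, col = np.unravel_index(np.argmax(residuals ** 2), residuals.shape)

print(np.round(residuals, 2))
print(f"sum of squared residuals = {total:.2f}, chi2 = {chi2_stat:.2f}")
print(f"largest contribution at row {row}, column {col}")
```

Cells with large positive or negative residuals show where the observed pattern departs from independence, which helps explain how a table can be strongly associated without a monotonic trend.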

## Hypothesis 4b - Weekly exercise frequency improves Sleep Duration

### Spearman Correlation Test

**Null Hypothesis ($H_0$)**: An increase in weekly exercise frequency has no impact on Sleep Duration.

**Alternative Hypothesis ($H_1$)**: An increase in weekly exercise frequency increases Sleep Duration.


Since we are testing for a positive correlation, `alternative='greater'` is used.

In [19]:
import scipy.stats as stats

# Perform Spearman correlation test
da = df.dropna(subset=['Exercise Days/Week Ordinal','Calculated Night Sleep Duration'], axis='index')
correlation, p_value = stats.spearmanr(da['Exercise Days/Week Ordinal'], da['Calculated Night Sleep Duration'], alternative='greater')

print(f'Correlation: {correlation}')
print(f'P-value: {p_value}')

Correlation: -0.0007873104918378794
P-value: 0.5068774759866075


1. **Correlation Coefficient:**
   The Spearman correlation coefficient is approximately -0.0008, which is effectively zero. There is no detectable monotonic relationship between "Weekly Exercise Frequency" and "Calculated Night Sleep Duration"; the sign is slightly negative, the opposite of the hypothesized direction, but the magnitude is negligible.

2. **Statistical Significance:**
   The p-value associated with the correlation coefficient is approximately 0.5069, much higher than the conventional significance level of 0.05. The observed correlation between weekly exercise frequency and night sleep duration is therefore not statistically significant.

**Conclusion:**
In summary, the data do not reveal a statistically significant relationship between weekly exercise frequency and night sleep duration. The correlation is essentially zero, indicating that changes in weekly exercise frequency are not associated with night sleep duration in this dataset. Other factors not captured in this analysis may play a more substantial role in determining sleep duration.

### ANOVA Test

In [20]:
import scipy.stats as stats

# Keep only rows where both columns are present
da = df.dropna(subset=['Exercise Days/Week','Calculated Night Sleep Duration'], axis='index')

# Collect the sleep durations of each exercise-frequency group
groups = da["Exercise Days/Week"].unique()
anova_results = []

for group in groups:
    group_data = da[da["Exercise Days/Week"] == group]["Calculated Night Sleep Duration"]
    anova_results.append(group_data)


# Perform the ANOVA
f_statistic, p_value = stats.f_oneway(*anova_results)
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

F-statistic: 0.06330108942658033
P-value: 0.9791771457421672


1. **F-statistic:**
   The F-statistic of approximately 0.0633 is far below 1, meaning the variation between the mean sleep durations of the exercise frequency groups is much smaller than the variation within the groups.

2. **P-value:**
   The p-value of approximately 0.9792 is much higher than the commonly selected significance level of 0.05, so the one-way ANOVA is not statistically significant.

**Conclusion:**
We cannot infer that there are statistically significant differences in mean sleep duration among the exercise frequency groups. In practical terms, exercise frequency is not associated with sleep duration in this dataset: the differences in mean sleep duration across groups are no larger than would be expected from within-group variation alone. Further analysis may be necessary to explore other factors that could influence sleep duration.
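
The F-statistic reported by `f_oneway` is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand and compares it with SciPy on synthetic groups (fixed seed, four hypothetical exercise groups drawn from the same distribution; not the survey data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic sleep durations (hours) for four hypothetical exercise groups.
groups = [rng.normal(loc=7.0, scale=1.0, size=n) for n in (40, 60, 50, 15)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group and within-group sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares; the p-value is the upper tail of the F distribution.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

An F-statistic well below 1, as in the result above, means the group means differ by less than the within-group scatter would typically produce.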

## Hypothesis 5 - Increased Nap Time decreases Night Sleep duration

### Spearman Correlation Test

**Null Hypothesis ($H_0$)**: An increase in nap time has no impact on night sleep duration.

**Alternative Hypothesis ($H_1$)**: An increase in nap time shortens night sleep duration.


In [21]:
nap_duration_mapping = {
    "No Nap": 0,
    "<30 Minutes": 15,
    "60-90 Minutes": 75,
    "30-60 Minutes": 45,
    ">90 Minutes": 90,
}

df["Nap Duration Ordinal"] = df["Nap Duration"].map(nap_duration_mapping)

In [22]:
import scipy.stats as stats

# Perform Spearman correlation test
da = df.dropna(subset=['Nap Duration Ordinal','Calculated Night Sleep Duration'], axis='index')

correlation, p_value = stats.spearmanr(da['Nap Duration Ordinal'], da['Calculated Night Sleep Duration'], alternative='less')

print(f'Correlation: {correlation}')
print(f'P-value: {p_value}')

Correlation: 0.04228201956735934
P-value: 0.8228509437934408


* **Correlation Coefficient:**
  The correlation coefficient is approximately 0.042, indicating a very weak positive relationship between nap duration and calculated night sleep duration. This suggests that as nap duration increases, there is a slight tendency for night sleep duration to increase, but the relationship is not strong.

* **P-value:**
  The p-value is approximately 0.823, which is well above the conventional threshold of 0.05 for statistical significance. This means that the results do not provide enough evidence to reject the null hypothesis at the 5% significance level.

**Conclusion:**
Based on the results of the Spearman correlation test, there is not enough evidence to conclude that an increase in nap duration shortens calculated night sleep duration. The observed correlation is very weak and positive, the opposite of the hypothesized direction, and the null hypothesis cannot be rejected. This does not necessarily mean there is no relationship at all, but rather that the test did not detect a negative relationship in this dataset.

### Kendall's Tau

In [23]:
import scipy.stats as stats

correlation, p_value = stats.kendalltau(
    da["Nap Duration Ordinal"].to_numpy().astype(float),
    da["Calculated Night Sleep Duration"].to_numpy(),
    alternative="less",
)

print(f"Correlation: {correlation:.3f}")
print(f"P-value: {p_value}")

Correlation: 0.033
P-value: 0.8287279351719994


* **Correlation Coefficient:** The value is 0.033, which indicates a very weak positive correlation between nap duration and calculated night sleep duration. This suggests that as nap duration increases, there is a slight tendency for night sleep duration to increase, but this tendency is not strong.

* **P-value:** The p-value is approximately 0.829, which is well above the conventional alpha level of 0.05 used for determining statistical significance. This means that the positive correlation observed is not statistically significant.

**Conclusion:**
In conclusion, the analysis does not support the alternative hypothesis that increased nap duration shortens night sleep duration. The correlation is weak, slightly positive, and not statistically significant. A larger sample or further studies could reveal more about the nature of this relationship.

# Conclusion

At a significance level of 0.05, our analysis yields the following results:

* **Sleep Disturbance:** The Chi-squared test finds a statistically significant association between sleep disturbances and sleep quality (p ≈ 6.21e-09). However, the one-sided Spearman test does not support the hypothesized negative direction: the correlation is weak and positive ($\rho \approx 0.177$). The two variables are related in this dataset, but the data do not show that more frequent disturbances go with lower sleep quality.

* **Age Group and Sleep Duration:** Neither the Pearson test (p ≈ 0.177) nor the Kendall's Tau test (p ≈ 0.200) reveals a significant negative correlation between age group and sleep duration. The very weak correlation suggests that age is not a strong predictor of sleep duration, and other factors may play a more substantial role.

* **Sleep Onset Time:** The Spearman test finds no significant negative correlation between sleep onset time and sleep quality ($\rho \approx 0.034$, p ≈ 0.772), while the Chi-squared test finds a significant association (p ≈ 0.027). The variables appear to be related, but not through a consistent decline in quality as onset time increases.

* **Exercise Weekly Frequency:** The Spearman test does not provide sufficient evidence that increasing exercise frequency enhances sleep quality ($\rho \approx 0.044$, p ≈ 0.167), although the Chi-squared test shows that the two variables are not independent (p ≈ 2.41e-28). For night sleep duration, neither the Spearman test (p ≈ 0.507) nor the one-way ANOVA (p ≈ 0.979) finds a significant relationship with exercise frequency.

* **Nap Duration:** Neither the Spearman test (p ≈ 0.823) nor the Kendall's Tau test (p ≈ 0.829) reveals a significant negative relationship between the length of naps and the duration of nighttime sleep. The weak positive correlation and high p-values give no evidence that longer naps shorten nocturnal sleep in this dataset.