## Assignment week 05: Sleeping habits

Welcome to **week five** of the course Programming 1. You will learn to analyse data with pandas and numpy and to visualize it with bokeh. Concretely, you will preprocess the Sleep Study data into an appropriate format in order to conduct statistical and visual analysis. Learning outcomes:


## About the data

The data is collected from a survey-based study of the sleeping habits of individuals within the US. 

Below is a description of each of the variables contained within the dataset.

- Enough = Do you think that you get enough sleep?
- Hours = On average, how many hours of sleep do you get on a weeknight?
- PhoneReach = Do you sleep with your phone within arm's reach?
- PhoneTime = Do you use your phone within 30 minutes of falling asleep?
- Tired = On a scale from 1 to 5, how tired are you throughout the day? (1 being not tired, 5 being very tired)
- Breakfast = Do you typically eat breakfast?

The two research questions you should answer in this assignment are:
1. Is there a difference in Hours of sleep caused by having breakfast (yes, no)?
2. Is there a difference in Hours of sleep caused by having breakfast and by tiredness (score)?


The assignment consists of 6 parts:

- [part 1: load the data](#0)
- [part 2: data inspection](#1)
- [part 3: check assumptions](#2)
   - [check normality 3.1](#ex-31)
   - [check equal variance 3.2](#ex-32)
- [part 4: prepare the data](#3)
- [part 5: answer the research question](#4)
- [part 6: enhanced plotting](#5)

Parts 1 to 5 are mandatory; part 6 is optional (bonus).
To pass the assignment you need a score of 60%.


**NOTE: If your project data is suitable, you can use that data instead of the given data.**

## ANOVA

Analysis of variance (ANOVA) compares the variance between groups with the variance within groups. It basically determines whether the differences between groups are larger than the differences within a group (the noise).
A graph picturing this is as follows: https://link.springer.com/article/10.1007/s00424-019-02300-4/figures/2


In ANOVA, the dependent variable must be measured at a continuous (interval or ratio) level, for instance glucose level. The independent variables must be categorical (nominal or ordinal), for instance trial category, time of day (AM versus PM) or time of trial (different categories). Like the t-test, ANOVA is a parametric test and comes with assumptions: it assumes that the data are normally distributed, that the variance among the groups is approximately equal (homogeneity of variance), and that the observations are independent of each other.

A one-way ANOVA has just one independent variable. A two-way ANOVA (also called factorial ANOVA) uses two independent variables. For research question 1 we can use a one-way ANOVA; for research question 2 we can use a two-way ANOVA. But first we need to check the assumptions.
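
To make the between-versus-within comparison concrete, here is a minimal sketch that computes the one-way ANOVA F statistic by hand and compares it with `scipy.stats.f_oneway`. The three groups are made-up toy values, not taken from `sleep.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: three groups of measurements (not from sleep.csv)
groups = [
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([5.0, 6.0, 6.0, 5.0]),
    np.array([8.0, 9.0, 7.0, 8.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n - k)  # upper tail of the F distribution

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

A large F means the group means differ by more than the within-group noise would suggest.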


---

<a name='0'></a>
## Part 1: Load the data (10 pt)

Load the `sleep.csv` data. Get yourself familiar with the data. Answer the following questions.

1. What is the percentage of missing data?
2. Considering the research question, what is the dependent variable and what are the independent variables? Are they of the correct datatype?

In [249]:
import pandas as pd
df = pd.read_csv("sleep.csv")
df.head()

Unnamed: 0,Enough,Hours,PhoneReach,PhoneTime,Tired,Breakfast
0,Yes,8.0,Yes,Yes,3,Yes
1,No,6.0,Yes,Yes,3,No
2,Yes,6.0,Yes,Yes,2,Yes
3,No,7.0,Yes,Yes,4,No
4,No,7.0,Yes,Yes,2,Yes


In [250]:
#code printing percentage missing data
missing = df.isnull().sum() / len(df) *100
print(missing)

Enough        0.000000
Hours         1.923077
PhoneReach    0.000000
PhoneTime     0.000000
Tired         0.000000
Breakfast     0.000000
dtype: float64
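
The output above gives the missing percentage per column. If a single overall figure is wanted, it can be derived the same way. The sketch below uses a placeholder frame that only mimics the shape reported in this notebook (104 rows, 6 columns, 2 missing values in `Hours`); the name `stand_in` and the cell contents are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Placeholder frame with the same shape and missing-value pattern as the data:
# 104 rows, 6 columns, 2 missing values in Hours (cell contents are dummies)
cols = ["Enough", "Hours", "PhoneReach", "PhoneTime", "Tired", "Breakfast"]
stand_in = pd.DataFrame(1.0, index=range(104), columns=cols)
stand_in.loc[[0, 1], "Hours"] = np.nan

# Percentage missing per column (mean of a boolean mask is the fraction of True)
per_column_pct = stand_in.isna().mean() * 100
# Percentage missing over all cells of the table
overall_pct = stand_in.isna().to_numpy().mean() * 100

print(per_column_pct)
print(overall_pct)
```

With the real data, the same two expressions can be applied to `df` directly.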


In [251]:
#code printing answer dependent and independent variables
print("The dependent variable is Hours, independent variables are Breakfast and Tired")

The dependent variable is Hours, independent variables are Breakfast and Tired


In [252]:
#code printing answer about datatypes
df.info()
print("\n Breakfast and Tired should be categorical variables")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Enough      104 non-null    object 
 1   Hours       102 non-null    float64
 2   PhoneReach  104 non-null    object 
 3   PhoneTime   104 non-null    object 
 4   Tired       104 non-null    int64  
 5   Breakfast   104 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 5.0+ KB

 Breakfast and Tired should be categorical variables
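
One possible way to give the two factors a categorical datatype is sketched below. It runs on a few made-up rows with the same column names as `sleep.csv` (the frame `sample` is hypothetical), so the notebook's own `df` is left untouched for the later cells.

```python
import pandas as pd

# Hypothetical stand-in rows with the same column names as sleep.csv
sample = pd.DataFrame({
    "Hours": [8.0, 6.0, None, 7.0],
    "Tired": [3, 3, 2, 4],
    "Breakfast": ["Yes", "No", "Yes", "No"],
})

# Breakfast is a nominal factor: a plain category is enough
sample["Breakfast"] = sample["Breakfast"].astype("category")

# Tired is an ordinal score: keep the order 1 < 2 < 3 < 4 < 5
tired_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
sample["Tired"] = sample["Tired"].astype(tired_dtype)

print(sample.dtypes)
```

Note that a categorical `Tired` column no longer takes part in numeric operations such as `corr()`, so the conversion is best applied to a copy when the numeric scores are still needed.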


---

<a name='1'></a>
## Part 2: Inspect the data (30 pt)

Inspect the data practically. Get an idea of how well the variable categories are balanced. Are the values of a variable equally divided? What is the mean value of the dependent variable? Are there correlations among the variables?


<ul>
<li>Create some meaningful overviews such as variable value counts</li>
<li>Create a scatter plot plotting the relation between being tired and hours of sleep with different colors for Breakfast</li>
    <li>Print some basic statistics about the target (mean, standard deviation)</li>
    <li>Create a heatmap to check for correlations among variables. </li>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
    <ul><li>the gitbook has a bokeh heatmap example</li></ul>
</details>
</ul>

In [253]:
#code your answer to the value counts and distribution plots here
print("Value count of the Tired variable")
print(df["Tired"].value_counts())
print("\n")

print("Value count of the Hours variable")
print(df["Hours"].value_counts())
print("\n")

print("Value count of the Breakfast variable")
print(df["Breakfast"].value_counts())
print("\n")

print("Value count of the Breakfast&Tired variable")
print(df[["Breakfast","Tired"]].value_counts())

Value count of the Tired variable
3    40
2    27
4    23
5    10
1     4
Name: Tired, dtype: int64


Value count of the Hours variable
7.0     35
6.0     24
8.0     16
5.0     12
9.0      8
4.0      4
2.0      2
10.0     1
Name: Hours, dtype: int64


Value count of the Breakfast variable
Yes    63
No     41
Name: Breakfast, dtype: int64


Value count of the Breakfast&Tired variable
Breakfast  Tired
Yes        3        25
           2        20
No         3        15
Yes        4        12
No         4        11
           2         7
           5         7
Yes        1         3
           5         3
No         1         1
dtype: int64


In [254]:
#code for the scatter plot here
#Create a scatter plot plotting the relation between being tired and hours of sleep with different colors for Breakfast
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
output_notebook()

def make_plot(sleep_df):
    # Use the function argument rather than the global df
    bf_no = sleep_df[sleep_df['Breakfast'] == "No"]
    bf_yes = sleep_df[sleep_df['Breakfast'] == "Yes"]

    p = figure(title="Relationship between tiredness and hours of sleep", background_fill_color="#fafafa")
    p.circle(y=bf_no["Hours"], x=bf_no["Tired"], legend_label="No breakfast", color="red", alpha=0.5)
    p.circle(y=bf_yes["Hours"], x=bf_yes["Tired"], legend_label="Yes breakfast", color="blue", alpha=0.5)
    p.xaxis.axis_label = 'Tiredness'
    p.yaxis.axis_label = "Sleep in hours"
    p.grid.grid_line_color="white"
    return p

plot = make_plot(df)
show(plot)

In [255]:
#code your answer to the target statistics here
print("Target statistics, only including Hours and Breakfast")
print(df.groupby("Breakfast").describe())
print("\n")

print("Target statistics, including Hours, Tired, and Breakfast")
print(df.groupby(["Breakfast","Tired"]).describe())

Target statistics, only including Hours and Breakfast
          Hours                                               Tired            \
          count      mean       std  min  25%  50%  75%   max count      mean   
Breakfast                                                                       
No         41.0  6.268293  1.549587  2.0  6.0  6.0  7.0  10.0  41.0  3.390244   
Yes        61.0  6.918033  1.268793  4.0  6.0  7.0  8.0   9.0  63.0  2.873016   

                                              
                std  min  25%  50%  75%  max  
Breakfast                                     
No         1.045898  1.0  3.0  3.0  4.0  5.0  
Yes        0.941722  1.0  2.0  3.0  3.0  5.0  


Target statistics, including Hours, Tired, and Breakfast
                Hours                                              
                count      mean       std  min  25%  50%  75%   max
Breakfast Tired                                                    
No        1       1.0  7.000000       NaN 

In [256]:
#code your answer for the heatmap here and briefly state your finding
import numpy as np
# Convert Yes/No answers to 1/0, so correlation is possible
yes_no_map = {"Yes":1, "No":0}  # named so it does not shadow the built-in dict
df_heatmap = df.copy()
df_heatmap['Breakfast'] = df_heatmap['Breakfast'].astype(str).map(yes_no_map)
df_heatmap['Enough'] = df_heatmap['Enough'].astype(str).map(yes_no_map)
df_heatmap['PhoneReach'] = df_heatmap['PhoneReach'].astype(str).map(yes_no_map)
df_heatmap['PhoneTime'] = df_heatmap['PhoneTime'].astype(str).map(yes_no_map)

# Create correlation matrix
c = df_heatmap.corr().abs()
y_range = (list(reversed(c.columns)))
x_range = (list(c.index))
c

Unnamed: 0,Enough,Hours,PhoneReach,PhoneTime,Tired,Breakfast
Enough,1.0,0.38074,0.084214,0.003945,0.417006,0.132029
Hours,0.38074,1.0,0.054957,0.151378,0.191913,0.225818
PhoneReach,0.084214,0.054957,1.0,0.150451,0.073232,0.239392
PhoneTime,0.003945,0.151378,0.150451,1.0,0.035423,0.005761
Tired,0.417006,0.191913,0.073232,0.035423,1.0,0.251096
Breakfast,0.132029,0.225818,0.239392,0.005761,0.251096,1.0


In [257]:
#reshape
dfc = pd.DataFrame(c.stack(), columns=['r']).reset_index()
dfc.head()
#transfer to ColumnDataSource object
from bokeh.models import ColumnDataSource
source = ColumnDataSource(dfc)

In [258]:
#plot a heatmap
from bokeh.models import (BasicTicker, ColorBar, ColumnDataSource,
                          LinearColorMapper, PrintfTickFormatter,)
from bokeh.transform import transform
from bokeh.palettes import Plasma256

#create colormapper 
mapper = LinearColorMapper(palette=Plasma256, low=dfc.r.min(), high=dfc.r.max())

#create plot
p = figure(title="correlation heatmap", plot_width=500, plot_height=450,
           x_range=x_range, y_range=y_range, x_axis_location="above", toolbar_location=None)

#use mapper to fill the rectangles in the plot
p.rect(x="level_0", y="level_1", width=1, height=1, source=source,
       line_color=None, fill_color=transform('r', mapper))

#create and add colorbar to the right
color_bar = ColorBar(color_mapper=mapper, location=(0, 0),
                     ticker=BasicTicker(desired_num_ticks=len(x_range)), 
                     formatter=PrintfTickFormatter(format="%.1f"))
p.add_layout(color_bar, 'right')

#draw axis
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "10px"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = 1.0

#show
show(p)

**My interpretation of the heatmap:** Only Enough&Tired (0.42) and Enough&Hours (0.38) seem to be somewhat correlated. The other variable pairs seem to have weak or no correlation (at most 0.25).
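
Reading the strongest pairs off a heatmap by eye is error-prone, so a ranked list can help. The sketch below re-enters the absolute correlations printed in the matrix above and sorts the unique pairs; the variable names `corr`, `upper` and `ranked` are only illustrative.

```python
import numpy as np
import pandas as pd

cols = ["Enough", "Hours", "PhoneReach", "PhoneTime", "Tired", "Breakfast"]
# Absolute correlations copied from the matrix printed above
values = [
    [1.0, 0.38074, 0.084214, 0.003945, 0.417006, 0.132029],
    [0.38074, 1.0, 0.054957, 0.151378, 0.191913, 0.225818],
    [0.084214, 0.054957, 1.0, 0.150451, 0.073232, 0.239392],
    [0.003945, 0.151378, 0.150451, 1.0, 0.035423, 0.005761],
    [0.417006, 0.191913, 0.073232, 0.035423, 1.0, 0.251096],
    [0.132029, 0.225818, 0.239392, 0.005761, 0.251096, 1.0],
]
corr = pd.DataFrame(values, index=cols, columns=cols)

# Keep only the upper triangle (k=1 also drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
ranked = upper.stack().dropna().sort_values(ascending=False)

print(ranked.head(4))
```

In the notebook itself, the same two lines could be applied to `c` instead of the re-entered matrix.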

---

<a name='2'></a>
## Part 3: Check Assumptions

Before we answer the research question with ANOVA we need to check the following assumptions:

1. ANOVA assumes that the dependent variable is normally distributed
2. ANOVA also assumes homogeneity of variance
3. ANOVA also assumes that the observations are independent of each other. Most of the time we need domain knowledge and experiment setup descriptions to estimate this assumption

We are going to do this graphically and statistically. 

<a name='ex-31'></a>
### Check normality (10 pt)

<ul><li>
Plot the distribution of the dependent variable. Add a vertical line at the position of the average. Add a vertical line for the robust estimate. Add the normal distribution line to the plot. Comment on the normality of the data. Do you want the full points? Plot with bokeh!</li>

<li>Use a Shapiro-Wilk Test or an Anderson-Darling test to check statistically</li></ul>


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
    <ul><li>check the code of lesson 1 DS1 bayesian statistics</li>
        <li>heart_failure case of gitbook uses bokeh histograms</li>
</ul>
</details>

In [259]:
### Frequency plot ###
from bokeh.models import Span
from bokeh.models import Range1d
# your code to plot here
median = df["Hours"].median()
mean = df["Hours"].mean()
# Create array with all hour values
hours = df["Hours"].sort_values().dropna().unique()
hours_count = df["Hours"].value_counts().sort_index() #Frequencies per hours

# Create a plot with frequency distribution and mean/median
p = figure(title="Distribution of hours slept", background_fill_color="#fafafa")
p.vbar(x=hours, top=hours_count, width=0.9)
# A vertical line needs two points, so x is repeated for both y values
p.line(x=[mean, mean], y=[0, 40], line_width=2, color="red", legend_label="mean")
p.line(x=[median, median], y=[0, 40], line_width=2, color="green", legend_label="median")
p.xaxis.axis_label = 'Sleep in hours'
p.yaxis.axis_label = "Frequency"
p.grid.grid_line_color="white"
p.xgrid.grid_line_color = None
p.y_range=Range1d(0,38)
show(p)
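
The frequency plot does not yet include the normal distribution line. A minimal sketch of how that curve could be computed is given below. It rebuilds the `Hours` values from the value counts printed in Part 2 and scales the normal density to the frequency axis, assuming bars that are one hour wide; `counts`, `x` and `expected_freq` are illustrative names.

```python
import numpy as np
from scipy import stats

# Hours rebuilt from the value counts printed in Part 2 (hours -> frequency)
counts = {2.0: 2, 4.0: 4, 5.0: 12, 6.0: 24, 7.0: 35, 8.0: 16, 9.0: 8, 10.0: 1}
hours = np.repeat(list(counts.keys()), list(counts.values()))

mean = hours.mean()
median = np.median(hours)    # robust estimate of the centre
std = hours.std(ddof=1)      # sample standard deviation

# Normal curve on the frequency scale: n * bin_width * pdf(x), with bin_width = 1 hour
x = np.linspace(hours.min() - 1, hours.max() + 1, 200)
expected_freq = hours.size * 1.0 * stats.norm.pdf(x, loc=mean, scale=std)

print(mean, median, std)
```

Passing `x` and `expected_freq` to `p.line(...)` on the figure above, before `show(p)`, would draw the curve over the bars.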




In [260]:
### Density plot ###
import hvplot.pandas
df["Hours"].hvplot.kde()

In [261]:
# Shapiro-Wilk test for normality
from scipy import stats

shapiro_test = stats.shapiro(df["Hours"].dropna())
shapiro_test

ShapiroResult(statistic=0.93398118019104, pvalue=7.15833084541373e-05)

In [262]:
# briefly summarize your findings
print("Data does not seem to be normally distributed according to the Shapiro-Wilk test (p < 0.001). Visually, however, it seems roughly normally distributed")

Data does not seem to be normally distributed according to the Shapiro-Wilk test (p < 0.001). Visually, however, it seems roughly normally distributed


<a name='ex-32'></a>
### Check homogeneity of variance (20 pt)

<ul><li>
Use boxplots to check the homogeneity of variance. Do you want the full points? Plot with bokeh!</li>

<li>Use Levene’s and Bartlett’s Tests of Equality (Homogeneity) of Variance to test equal variance statistically</li></ul>

In [263]:
# your code to plot here
boxplot_tired = df.hvplot.box(y="Hours", by="Tired")
boxplot_tired

In [264]:
boxplot_tired_bf = df.hvplot.box(y="Hours", by=["Tired","Breakfast"])
boxplot_tired_bf

In [272]:
# your code for the statistical test here

# Levene's test
# Tired holds integer scores at this point (see df.info() in Part 1),
# so the groups are selected with integers; string labels would select empty groups
from scipy.stats import levene
levene(df['Hours'][df['Tired'] == 1].dropna(),
               df['Hours'][df['Tired'] == 2].dropna(),
               df['Hours'][df['Tired'] == 3].dropna(),
               df['Hours'][df['Tired'] == 4].dropna(),
               df['Hours'][df['Tired'] == 5].dropna(), center="median")

In [266]:
# briefly summarize your findings
print("The boxplots and Levene's test suggest there is no equal variance between conditions")

The boxplots and Levene's test suggest there is no equal variance between conditions
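
The exercise also asks for Bartlett's test. A minimal sketch of running both tests on groups built with `groupby` is shown below. The rows are made up for illustration (the frame `sample` is hypothetical); with the real data, `df[["Hours", "Tired"]]` would take its place.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in rows; with the real data use df[["Hours", "Tired"]] instead
sample = pd.DataFrame({
    "Hours": [7.0, 8.0, 6.0, 7.0, 5.0, 9.0, 6.0, 4.0, 8.0],
    "Tired": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# Drop missing Hours first, then collect one array of Hours per Tired score
clean = sample.dropna(subset=["Hours"])
groups = [g["Hours"].to_numpy() for _, g in clean.groupby("Tired")]

# Levene's test (median-centred) is less sensitive to non-normal data
levene_result = stats.levene(*groups, center="median")
# Bartlett's test assumes normally distributed groups
bartlett_result = stats.bartlett(*groups)

print(levene_result)
print(bartlett_result)
```

For both tests, a p-value below the chosen significance level points to unequal variances between the groups.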


---

<a name='3'></a>
## Part 4: Prepare your data (10 pt)

Create a dataframe with equal sample size. Make three categories for tiredness: 1-2 = no, 3 = maybe, 4-5 = yes

In [267]:
# Create categories for Tiredness
dict_tired = {1:"no", 2:"no", 3:"maybe", 4:"yes", 5:"yes"}
df.Tired = df.Tired.dropna().map(dict_tired).astype("category")
df.Tired


0      maybe
1      maybe
2         no
3        yes
4         no
       ...  
99        no
100    maybe
101    maybe
102       no
103    maybe
Name: Tired, Length: 104, dtype: category
Categories (3, object): ['maybe', 'no', 'yes']

In [268]:
# Check for equal sample size
print(df["Tired"].value_counts())

maybe    40
yes      33
no       31
Name: Tired, dtype: int64
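
The counts above show that the three groups are not yet of equal size. One way to balance them, sketched here on made-up rows, is to downsample every group to the size of the smallest one. The frame `sample`, the column `TiredGroup` and the fixed `random_state` are assumptions for illustration; `pd.cut` is used as an alternative to the dictionary mapping.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data would come from sleep.csv
sample = pd.DataFrame({
    "Hours": [7.0, 6.0, 8.0, 5.0, 7.0, 6.0, 9.0, 4.0, 7.0],
    "Tired": [1, 2, 2, 3, 3, 3, 3, 4, 5],
})

# Bin the 1-5 score: (0, 2] -> no, (2, 3] -> maybe, (3, 5] -> yes
sample["TiredGroup"] = pd.cut(sample["Tired"], bins=[0, 2, 3, 5],
                              labels=["no", "maybe", "yes"])

# Size of the smallest group decides how many rows are kept per group
n_min = int(sample["TiredGroup"].value_counts().min())

# Draw n_min random rows from every group (fixed seed for reproducibility)
balanced = sample.groupby("TiredGroup", observed=True).sample(n=n_min, random_state=0)

print(balanced["TiredGroup"].value_counts())
```

Downsampling discards data, so with small groups it is worth checking how much the results change with a different seed.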


---

<a name='4'></a>
## Part 5: Answer the research questions (20 pt)

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
    <ul><li>use one-way ANOVA for research question 1</li>
    <li>Use two-way ANOVA for research question 2</li>
    <li>https://reneshbedre.github.io/blog/anova.html</li>
</ul>
</details>

In [269]:
#Your solution here
import scipy.stats as stats
stats.f_oneway(df['Hours'][df['Tired'] == 'no'].dropna(),
               df['Hours'][df['Tired'] == 'yes'].dropna(),
               df['Hours'][df['Tired'] == 'maybe'].dropna())

F_onewayResult(statistic=0.7087661985071012, pvalue=0.4947316010199153)

**Answer**
Research Question 1: There is no significant effect of Tiredness on Hours slept (F = 0.71, p = 0.49)
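
Research question 1 is phrased in terms of Breakfast, so a one-way ANOVA with Breakfast as the factor may be what is needed there. The sketch below shows the shape of that test on made-up rows (the frame `sample` is hypothetical, so the numbers say nothing about the real data). With only two groups, the one-way ANOVA is equivalent to an equal-variance t-test, which the last lines illustrate.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in rows; with the real data use df[["Hours", "Breakfast"]] instead
sample = pd.DataFrame({
    "Hours": [8.0, 6.0, 6.0, 7.0, 7.0, 5.0, 9.0, None],
    "Breakfast": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
})

# Remove rows without a value for the dependent variable
clean = sample.dropna(subset=["Hours"])
yes = clean.loc[clean["Breakfast"] == "Yes", "Hours"]
no = clean.loc[clean["Breakfast"] == "No", "Hours"]

# One-way ANOVA with Breakfast as the single factor
f_stat, p_anova = stats.f_oneway(yes, no)

# With two groups, F equals the squared t statistic of the pooled-variance t-test
t_stat, p_ttest = stats.ttest_ind(yes, no, equal_var=True)

print(f_stat, p_anova)
print(t_stat ** 2, p_ttest)
```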

---

<a name='5'></a>
## Part 6: Enhanced plotting (20 pt)

Create a panel with 1) your dataframe with equal sample size, 2) a picture of a sleeping beauty, 3) the scatter plot of tired / hours of sleep with different colors for Breakfast from part 2, and 4) the boxplots, with the p-value of the ANOVA outcome in the title

In [270]:
# Calculate group sizes from total sample size, and square root(N) to get standard deviation from standard error
n_group_ca = int(n_canada / 5)
n_group_nl = int(n_holland / 5)
sqrt_n_ca = int(np.sqrt(n_group_ca))
sqrt_n_nl = int(np.sqrt(n_group_nl))

# Create all groups, combining the Netherlands and Canada
# np.random was used because of the large sample size, the distribution will be roughly normal
age12_17_pre = np.append(np.random.normal(41.7, (1.89)*sqrt_n_nl, n_group_nl), 
                         np.random.normal(55.5, (1.28)*sqrt_n_ca, n_group_ca))
age18_34_pre = np.append(np.random.normal(59.9, (1.52)*sqrt_n_nl, n_group_nl), 
                         np.random.normal(64.3, (0.82)*sqrt_n_ca, n_group_ca))
age35_49_pre = np.append(np.random.normal(55.6, (1.41)*sqrt_n_nl, n_group_nl), 
                         np.random.normal(58.4, (0.77)*sqrt_n_ca, n_group_ca))      
age50_64_pre = np.append(np.random.normal(55.8, (1.55)*sqrt_n_nl, n_group_nl), 
                         np.random.normal(54.0, (0.76)*sqrt_n_ca, n_group_ca))
age65_over_pre = np.append(np.random.normal(55.1, (1.48)*sqrt_n_nl, n_group_nl), 
                           np.random.normal(37.3, (0.64)*sqrt_n_ca, n_group_ca))
age12_17_lockdown = np.append(np.random.normal(42.4, (2.04)*sqrt_n_nl, n_group_nl), 
                              np.random.normal(42.9, (1.40)*sqrt_n_ca, n_group_ca)) 
age18_34_lockdown = np.append(np.random.normal(62.2, (1.59)*sqrt_n_nl, n_group_nl), 
                              np.random.normal(59.9, (1.08)*sqrt_n_ca, n_group_ca))
age35_49_lockdown = np.append(np.random.normal(60.5, (1.51)*sqrt_n_nl, n_group_nl), 
                              np.random.normal(57.2, (1.02)*sqrt_n_ca, n_group_ca))
age50_64_lockdown = np.append(np.random.normal(59.9, (1.61)*sqrt_n_nl, n_group_nl), 
                              np.random.normal(55.1, (0.92)*sqrt_n_ca, n_group_ca))
age65_over_lockdown = np.append(np.random.normal(56.0, (1.63)*sqrt_n_nl, n_group_nl),
                                np.random.normal(40.3, (0.56)*sqrt_n_ca, n_group_ca))

pre_total = [age12_17_pre, age18_34_pre, age50_64_pre, age65_over_pre]
lockdown_total = [age12_17_lockdown, age18_34_lockdown, age50_64_lockdown, age65_over_lockdown]
total  = np.append(pre_total, lockdown_total)#your solution here