# Overview: How does sleep correlate with assessment scores when exercise is used as a covariate?
Notebook by Lindsey M. Williams (lmw192@mit.edu)

### [Part 1:](#part_1) Overall Correlations
1) Create a new dataframe (df_ancovaData_pt1) containing all columns of interest from various dataframes.

2) Export dataframe as csv (ancovaData_pt1.csv).

3) The CSV file can then be opened in RStudio to run ANCOVA tests that assess correlations between various measures of sleep and **OVERALL** assessment scores (the sum of the scores of all assessments), with exercise as a covariate.
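The adjustment idea behind using exercise as a covariate can be sketched in Python as a linear model with two predictors. This is only an illustration on synthetic, noise-free numbers (the values and variable names below are made up, not study data), and it is not the ANCOVA run in RStudio: it fits `score ~ sleep + exercise` by ordinary least squares, so the sleep coefficient is the association with score after accounting for exercise.

```python
import numpy as np

# Synthetic, noise-free data (hypothetical values, NOT from the study).
sleep_hours = np.array([6.0, 7.0, 8.0, 6.5, 7.5, 5.5])
exercise_days = np.array([1.0, 3.0, 2.0, 5.0, 0.0, 4.0])
overall_score = 1.0 + 2.0 * sleep_hours + 3.0 * exercise_days

# Design matrix: intercept, predictor of interest (sleep), covariate (exercise).
X = np.column_stack([np.ones_like(sleep_hours), sleep_hours, exercise_days])

# Ordinary least squares; coefs = [intercept, sleep effect, exercise effect].
coefs, *_ = np.linalg.lstsq(X, overall_score, rcond=None)
intercept, sleep_effect, exercise_effect = coefs
```

Because the synthetic scores were built without noise, the fit recovers the generating coefficients; with real data the estimates would come with uncertainty that the R output reports.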

### [Part 2:](#part_2) "Relevant Sleep" Correlations
Unlike Part 1, Part 2 of this analysis assesses how **sleep during the week leading up to an exam** (the "relevant sleep") correlates with the corresponding **WEEKLY** assessment scores.

1) Create a new dataframe (df_ancovaData_pt2) containing all columns of interest from various dataframes, where all averages for sleep and exercise measures only contain data collected in the week leading up to the assessment.

2) Export dataframe as csv (ancovaData_pt2.csv)

3) The CSV file can then be opened in RStudio to run ANCOVA tests that assess correlations between various measures of sleep and **WEEKLY** assessment scores, with exercise as a covariate.

### [Part 3:](#part_3) Group (1 & 2) Differences (T-Tests)

### [Part 4:](#part_4) Gender Differences (T-Tests)

## Setup

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import usefulMindBodyAnalysisTools as myTools # Custom functions from old FitBit analysis (by Lindsey)

import utils

sns.set_context('notebook', font_scale=1.5)

df_roster_original = utils.load_roster()
df_sleep = utils.load_sleep_data()
df_mult = utils.load_mult_measures_data()
df_hr = utils.load_heart_rate_data()

# Make copies of dataframes to clean up for easier use
df_roster_firstClean = df_roster_original.copy()

## Data Cleaning

### Cleaning df_roster

In [2]:
# Clean up column header = 'How_many_days_do_you_exercise_1'
df_roster_firstClean.How_many_days_do_you_exercise_1.replace(to_replace=['4-Feb', '7-May', '10-Jul'],
                                                  value=['2-4', '5-7', '7-10'],
                                                  inplace=True)

# Clean up column header = 'How_many_hrs_do__you_exercise_per_day_1'
df_roster_firstClean.How_many_hrs_do__you_exercise_per_day_1.replace(to_replace=['2-Jan'],
                                                  value=['1-2'],
                                                  inplace=True)

# Clean up column header = 'sleep_1' (How many hours do you sleep per night?)
df_roster_firstClean.sleep_1.replace(to_replace=['7-Jun', '6-May', '6-8 hours','8-Jun', '8-Jul', "usually 5-6 but 10 this week since I've been sick",'About 6', '9-Aug', '9-Jun','6-7 but usually closer to 6- trying to fix that though', '6-7 hours', '9-Jul'],
                                                  value=['6-7', '5-6', '6-8','6-8', '7-8', '5-6','6', '8-9', '6-9','6-7', '6-7', '7-9'],
                                                  inplace=True)

# Clean up column header = 'How_many_days_do_you_exercise_per_week_'
df_roster_firstClean.How_many_days_do_you_exercise_per_week_.replace(to_replace=['Twice: the PE classes'],
                                                  value=['2'],
                                                  inplace=True)

# Clean up column header = 'How_many_hours_do_you_exercise_on_a_typical_day_',
df_roster_firstClean.How_many_hours_do_you_exercise_on_a_typical_day_.replace(to_replace=['Usually 30 minutes, but I recently injured my ankle', '2-Jan', '2.5 hour frisbee practice twice a week. other days not so much'],
                                                  value=['<1', '1-2', '>2'],
                                                  inplace=True)

# Clean up column header = 'What_kind_of_exercise_1',
df_roster_firstClean.What_kind_of_exercise_1.replace(to_replace=['Cardiovascular (brisk walking, biking, running, swimming, etc..), Strength (weight lifting, resistance training, etc..)', 
                                                            'Cardiovascular (brisk walking, biking, running, swimming, etc..), Flexibility/Balance (pilates, tai chi, stabiity ball etc..)', 
                                                            'Cardiovascular (brisk walking, biking, running, swimming, etc..), Strength (weight lifting, resistance training, etc..), Flexibility/Balance (pilates, tai chi, stabiity ball etc..)', 
                                                            'Cardiovascular (brisk walking, biking, running, swimming, etc..)',
                                                            'none',
                                                            'None',
                                                            'Strength (weight lifting, resistance training, etc..)',
                                                            'Not really an exercise person',
                                                            'Flexibility/Balance (pilates, tai chi, stabiity ball etc..)',
                                                            'I do not exercise regularly',
                                                            'Taekwondo',
                                                            'Strength (weight lifting, resistance training, etc..), Flexibility/Balance (pilates, tai chi, stabiity ball etc..)',
                                                            'Walking to and from class typically'],
                                                  value=[['cardio strength'],
                                                        ['cardio flexibility'],
                                                        ['cardio strength flexibility'],
                                                        ['cardio'],
                                                        ['none'],
                                                        ['none'],
                                                        ['strength'],
                                                        ['none'],
                                                        ['flexibility'],
                                                        ['none'],
                                                        ['flexibility'],
                                                        ['strength flexibility'],
                                                        ['walking']],
                                                  inplace=True)
df_roster_firstClean.head()

subjectID,year,subjectID,group,Completed_PE_,Age,Gender,How_many_days_do_you_exercise_1,How_many_hrs_do__you_exercise_per_day_1,What_kind_of_exercise_1,stress_last_week_1,...,Quiz_3,Midterm_1,Quiz_4,Quiz_5,Quiz_6,Midterm_2,Quiz_7,Quiz_8,Midterm_3,overall_score
MBL001,1,MBL001,1,-,18.0,Female,2-4,1-2,cardio strength,low,...,5.5,57.5,8.0,10.0,9.0,85.5,9.5,8.0,79.0,281.5
MBL002,1,MBL002,1,-,18.0,Female,2-4,<1,cardio flexibility,moderate,...,6.5,61.0,7.0,8.5,6.5,91.0,10.0,8.5,68.0,273.0
MBL003,1,MBL003,1,-,18.0,Female,5-7,>2,cardio strength flexibility,moderate,...,8.5,55.5,6.0,8.5,6.5,80.0,5.0,5.0,48.5,229.5
MBL004,1,MBL004,2,yes,18.0,Female,2-4,1-2,cardio,low,...,7.5,89.0,6.5,9.0,9.5,70.0,8.0,10.0,71.5,285.5
MBL005,1,MBL005,2,yes,19.0,Female,2-4,1-2,cardio strength,moderate,...,7.5,60.5,9.0,8.5,10.0,86.0,5.5,0.0,85.0,281.5
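Many of the replacements above (`'4-Feb'` to `'2-4'`, `'10-Jul'` to `'7-10'`, `'2-Jan'` to `'1-2'`) follow one pattern: a numeric range that a spreadsheet appears to have auto-converted to a day-month date. The sketch below shows one way to undo that pattern generically. The helper name `undo_spreadsheet_date` is hypothetical, and the assumption that every `D-Mon` value was originally `month-day` is inferred from the mappings in the cell above, not confirmed by the data source.

```python
import re

# Month abbreviation -> month number, e.g. 'Feb' -> 2.
_MONTHS = {name: number for number, name in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}

def undo_spreadsheet_date(value):
    """Turn a 'D-Mon' string such as '4-Feb' back into the range '2-4'.

    Anything that does not match the pattern is returned unchanged.
    """
    match = re.fullmatch(r'(\d{1,2})-([A-Za-z]{3})', str(value).strip())
    if match is None:
        return value
    day, month_name = match.groups()
    month_number = _MONTHS.get(month_name.title())
    if month_number is None:
        return value
    return '{}-{}'.format(month_number, day)

cleaned = [undo_spreadsheet_date(v) for v in ['4-Feb', '10-Jul', '2-Jan', '6-8 hours']]
```

Such a helper could be applied to a column with `Series.map`, leaving free-text answers for the manual replacements shown above.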


In [3]:
df_roster = df_roster_firstClean.copy()
# Clean up column header = 'How_many_days_do_you_exercise_1'
df_roster.How_many_days_do_you_exercise_1.replace(to_replace=['2-4', '5-7', "I don't regularly exercise", '1', '7-10', 'I am injured so currently 0, but typically 5-6'],
                                                  value=[3, 6, 0, 1, 8, 5],
                                                  inplace=True)

# Clean up column header = 'How_many_hrs_do__you_exercise_per_day_1'
df_roster.How_many_hrs_do__you_exercise_per_day_1.replace(to_replace=['1-2', '<1', '>2'],
                                                  value=[1.5, 1, 2],
                                                  inplace=True)

# Clean up column header = 'sleep_1' (How many hours do you sleep per night?)
df_roster.sleep_1.replace(to_replace=['6-7', '5-6', '6-8', '6', '6.5', '7', '8', '7-8', '8-9', '6:47', '8.5', '7.5', '6-9', '5', '5.5', '9', '7-9'],
                                                  value=[6.5, 5.5, 7, 6, 6.5, 7, 8, 7.5, 8.5, np.nan, 8.5, 7.5, 7.5, 5, 5.5, 9, 8],
                                                  inplace=True)

# Clean up column header = 'How_many_days_do_you_exercise_per_week_'
df_roster.How_many_days_do_you_exercise_per_week_.replace(to_replace=['2-4', '5-7', "I don't regularly exercise", '2', '1'],
                                                  value=[3, 6, 0, 2, 1],
                                                  inplace=True)

# Clean up column header = 'How_many_hours_do_you_exercise_on_a_typical_day_',
df_roster.How_many_hours_do_you_exercise_on_a_typical_day_.replace(to_replace=['<1', '1-2', "I don't regularly exercise", '>2'],
                                                  value=[.5, 1.5, 0, 2],
                                                  inplace=True)


# Clean up column header = 'caffeine_',
df_roster.caffeine_.replace(to_replace=['<1', '1 a day', '2 or more a day', '0', '1 every other day ', 'never'],
                                                  value=[.5, 1, 2, 0, .5, 0],
                                                  inplace=True)

# Convert all assessment columns (quizzes and midterms) from objects to floats
assessments = [c for c, _ in utils.assessment_dates]
df_roster.loc[:, assessments] = df_roster.loc[:, assessments].astype(float)

df_roster.head()

subjectID,year,subjectID,group,Completed_PE_,Age,Gender,How_many_days_do_you_exercise_1,How_many_hrs_do__you_exercise_per_day_1,What_kind_of_exercise_1,stress_last_week_1,...,Quiz_3,Midterm_1,Quiz_4,Quiz_5,Quiz_6,Midterm_2,Quiz_7,Quiz_8,Midterm_3,overall_score
MBL001,1,MBL001,1,-,18.0,Female,3,1.5,cardio strength,low,...,5.5,57.5,8.0,10.0,9.0,85.5,9.5,8.0,79.0,281.5
MBL002,1,MBL002,1,-,18.0,Female,3,1.0,cardio flexibility,moderate,...,6.5,61.0,7.0,8.5,6.5,91.0,10.0,8.5,68.0,273.0
MBL003,1,MBL003,1,-,18.0,Female,6,2.0,cardio strength flexibility,moderate,...,8.5,55.5,6.0,8.5,6.5,80.0,5.0,5.0,48.5,229.5
MBL004,1,MBL004,2,yes,18.0,Female,3,1.5,cardio,low,...,7.5,89.0,6.5,9.0,9.5,70.0,8.0,10.0,71.5,285.5
MBL005,1,MBL005,2,yes,19.0,Female,3,1.5,cardio strength,moderate,...,7.5,60.5,9.0,8.5,10.0,86.0,5.5,0.0,85.0,281.5
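The category-to-number recoding above can also be written with an explicit dictionary and `Series.map`, which makes unmapped answers visible as NaN instead of leaving stray strings in the column. This is a minimal sketch on a toy dataframe; the mapping values are copied from the cell above, while the `'unexpected answer'` row is invented to show what happens to a value missing from the dictionary.

```python
import pandas as pd

# Mapping copied from the recoding above; each range is reduced to one number.
days_map = {'2-4': 3, '5-7': 6, "I don't regularly exercise": 0, '1': 1, '7-10': 8}

col = 'How_many_days_do_you_exercise_1'
toy = pd.DataFrame({col: ['2-4', '5-7', '1',
                          "I don't regularly exercise",
                          'unexpected answer']})  # last value is hypothetical

# map() returns NaN for any answer that is not a key of the dictionary.
toy[col] = toy[col].map(days_map)

# Count answers that still need a manual rule.
unmapped = int(toy[col].isna().sum())
```

Assigning the mapped column back, rather than calling `replace(..., inplace=True)` on an attribute, also avoids modifying a temporary copy in newer pandas versions.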


##### Display the unique values in the column 'What_kind_of_exercise_1' and how many times each occurs

In [5]:
myTools.disp_uniqueData_inCol(['What_kind_of_exercise_1'], df_roster, these_unique_on=False, times_appear_on=False, uniqueVal_freq_chart=True)

There are  9 unique values


                 Unique Values  Frequency
0              cardio strength         38
1           cardio flexibility         19
2  cardio strength flexibility          8
3                       cardio          6
4                         none          6
5                     strength          5
6                  flexibility          2
7         strength flexibility          2
8                      walking          1


### Cleaning df_hr
(Removes dates and times at which the FitBit was not being worn)

In [None]:
'''Create a new df called 'df_hr' (which will be the "cleaned up" version of the dataframe).
Only include values for which the confidence rating (of the recorded heart rate) is greater
than 0 and where 'bpm' is more than 30.

In Summary: This will remove values for which subjects
were not wearing the fitbit
'''
df_hr = df_hr[(df_hr['confidence'] > 0) & (df_hr['bpm'] > 30)]
df_hr.head()
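The filter above can be checked on a few toy rows. The values below are invented for illustration; only the column names `confidence` and `bpm` and the two thresholds come from the cell above.

```python
import pandas as pd

# Toy heart-rate records (hypothetical values).
toy_hr = pd.DataFrame({
    'subjectID': ['A', 'A', 'A', 'B'],
    'bpm': [72, 0, 25, 65],
    'confidence': [2, 0, 1, 3],
})

# Keep rows only when BOTH conditions hold; parentheses are required
# because & binds more tightly than the comparisons.
worn = toy_hr[(toy_hr['confidence'] > 0) & (toy_hr['bpm'] > 30)].copy()

n_dropped = len(toy_hr) - len(worn)
```

Two of the four toy rows are dropped: one with zero confidence and one with an implausibly low heart rate.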

<a id='part_1'></a>
# Part 1

## Creating the needed columns for the new dataframe

### Assessing Avg. Number of VERY_ACTIVE and MODERATELY_ACTIVE periods/day
(Will be used to create pd.Series objects that become columns in df_ancovaData)

In [None]:
activityLevelCounts = df_mult.groupby(['subjectID']).activityLevel.value_counts()

# How many days of usage were there per subject according to hr data
num_days_of_data_by_hrData = df_hr.groupby('subjectID').date.nunique(dropna=True)
# num_days_of_data_by_hrData.sort_values(ascending=True).head()

# For each subject, how many times during data collection period were classified as "VERY_ACTIVE"?
veryActiveCounts = activityLevelCounts.loc[:, 'VERY_ACTIVE']
# For each subject, how many times during data collection period were classified as "MODERATELY_ACTIVE"?
modActiveCounts = activityLevelCounts.loc[:, 'MODERATELY_ACTIVE']

# Create pd.series of average number of VERY_ACTIVE and MODERATELY_ACTIVE counts
# Find the average number of 'veryActiveCounts' per day (divided by days of available data)
veryActiveCounts_divByHr = veryActiveCounts.divide(num_days_of_data_by_hrData)
# Find the average number of 'modActiveCounts' per day (divided by days of available data)
modActiveCounts_divByHr = modActiveCounts.divide(num_days_of_data_by_hrData)
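A small worked example of the per-day activity rate, using toy data with the same column names. One difference from the cell above is deliberate and should be read as a suggestion, not as the notebook's method: `unstack(fill_value=0)` gives subjects who never reached a level a count of 0 instead of dropping them from the result.

```python
import pandas as pd

# Toy activity records: subject A has two VERY_ACTIVE periods, B has none.
toy_mult = pd.DataFrame({
    'subjectID': ['A', 'A', 'A', 'B', 'B'],
    'activityLevel': ['VERY_ACTIVE', 'VERY_ACTIVE', 'MODERATELY_ACTIVE',
                      'SEDENTARY', 'SEDENTARY'],
})
# Toy heart-rate records used only to count days of wear per subject.
toy_hr = pd.DataFrame({
    'subjectID': ['A', 'A', 'B'],
    'date': ['2016-09-08', '2016-09-09', '2016-09-08'],
})

# One row per subject, one column per activity level, zeros where absent.
level_counts = (toy_mult.groupby('subjectID')['activityLevel']
                .value_counts()
                .unstack(fill_value=0))

days_worn = toy_hr.groupby('subjectID')['date'].nunique()

# Division aligns on subjectID: A -> 2 periods / 2 days, B -> 0 / 1 day.
very_active_per_day = level_counts['VERY_ACTIVE'].div(days_worn)
```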

### Creating columns: Mean Sleep, Sleep Quality, Steps/Day, and Active Minutes/Day

In [None]:
# Returns a pd.series of each subjectID and their mean hours of sleep per night
series_meanSleep = df_sleep.groupby('subjectID')['sleepDuration'].mean() / 60

# Returns a pd.series of each subjectID and their average sleep quality score
sleep_Quality = df_sleep.groupby('subjectID').sleepQualityScoreA.mean()

# Returns a pd.series of each subjectID and the average number of steps they took per day
avgStepsPerDay = df_mult.groupby(['subjectID', 'date']).steps.sum().groupby('subjectID').mean()

# Displays average active minutes per day per subject
avgActiveMinsPerDay = df_mult.groupby(['subjectID', 'date']).activeMinutes.sum().groupby('subjectID').mean()
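The two-stage aggregation used for steps and active minutes (sum within each day, then average across days per subject) can be traced on toy numbers. The data below are invented; the column names match the cell above.

```python
import pandas as pd

# Toy step records: several rows per subject per day (hypothetical values).
toy_mult = pd.DataFrame({
    'subjectID': ['A', 'A', 'A', 'B'],
    'date': ['2016-09-08', '2016-09-08', '2016-09-09', '2016-09-08'],
    'steps': [100, 200, 500, 50],
})

# Stage 1: total steps per subject per day.
daily_steps = toy_mult.groupby(['subjectID', 'date'])['steps'].sum()

# Stage 2: average of the daily totals for each subject.
avg_steps_per_day = daily_steps.groupby('subjectID').mean()
```

Subject A has daily totals of 300 and 500, so the average is 400. Taking the mean directly over rows would give a different and misleading number (about 267).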

## Constructing & Exporting the NEW Dataframe (df_ancovaData)

In [None]:
# Create new dataframe to use for ANCOVA
df_ancovaData_pt1 = pd.DataFrame()

In [None]:
# Add columns to new dataframe
df_ancovaData_pt1['group'] = df_roster['group']
df_ancovaData_pt1['gender'] = df_roster['Gender']
df_ancovaData_pt1['type_of_exercise'] = df_roster['What_kind_of_exercise_1']
df_ancovaData_pt1['overall_score'] = df_roster['overall_score']
df_ancovaData_pt1.loc[:,'meanSleep_fb'] = series_meanSleep # fb = FitBit-recorded data
df_ancovaData_pt1.loc[:,'sleepQuality_fb'] = sleep_Quality # fb = FitBit-recorded data
df_ancovaData_pt1['exercise_hrsDay_sr'] = df_roster['How_many_hrs_do__you_exercise_per_day_1'] # sr = self-reported data
df_ancovaData_pt1['exercise_daysWeek_sr'] = df_roster['How_many_days_do_you_exercise_1'] # sr = self-reported data
df_ancovaData_pt1.loc[:,'veryActive_fb'] = veryActiveCounts_divByHr # fb = FitBit-recorded data
df_ancovaData_pt1.loc[:,'modActive_fb'] = modActiveCounts_divByHr # fb = FitBit-recorded data
df_ancovaData_pt1.loc[:,'stepsPerDay_fb'] = avgStepsPerDay # fb = FitBit-recorded data
df_ancovaData_pt1.loc[:,'activeMinsPerDay_fb'] = avgActiveMinsPerDay # fb = FitBit-recorded data
df_ancovaData_pt1.head()
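The column-by-column assignments above rely on pandas aligning each Series to the dataframe's index (the roster's subject IDs). A compact alternative, sketched here on toy data, builds the frame in one call and states the target index explicitly. All values and the `toy_` names are hypothetical.

```python
import pandas as pd

# Toy roster indexed by subject ID (hypothetical values).
toy_roster = pd.DataFrame(
    {'group': [1, 2], 'Gender': ['Female', 'Male']},
    index=pd.Index(['MBL001', 'MBL002'], name='subjectID'),
)
# Toy FitBit summary: MBL002 has no sleep data, MBL003 is not on the roster.
toy_mean_sleep = pd.Series({'MBL001': 7.2, 'MBL003': 6.1})

# Each Series is reindexed to the roster's subject IDs.
toy_ancova = pd.DataFrame(
    {
        'group': toy_roster['group'],
        'gender': toy_roster['Gender'],
        'meanSleep_fb': toy_mean_sleep,
    },
    index=toy_roster.index,
)
```

Subjects on the roster without FitBit data get NaN, and subjects absent from the roster are dropped, which matches the alignment behavior of the assignments above.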

In [6]:
df_roster.head()

subjectID,year,subjectID,group,Completed_PE_,Age,Gender,How_many_days_do_you_exercise_1,How_many_hrs_do__you_exercise_per_day_1,What_kind_of_exercise_1,stress_last_week_1,...,Quiz_3,Midterm_1,Quiz_4,Quiz_5,Quiz_6,Midterm_2,Quiz_7,Quiz_8,Midterm_3,overall_score
MBL001,1,MBL001,1,-,18.0,Female,3,1.5,cardio strength,low,...,5.5,57.5,8.0,10.0,9.0,85.5,9.5,8.0,79.0,281.5
MBL002,1,MBL002,1,-,18.0,Female,3,1.0,cardio flexibility,moderate,...,6.5,61.0,7.0,8.5,6.5,91.0,10.0,8.5,68.0,273.0
MBL003,1,MBL003,1,-,18.0,Female,6,2.0,cardio strength flexibility,moderate,...,8.5,55.5,6.0,8.5,6.5,80.0,5.0,5.0,48.5,229.5
MBL004,1,MBL004,2,yes,18.0,Female,3,1.5,cardio,low,...,7.5,89.0,6.5,9.0,9.5,70.0,8.0,10.0,71.5,285.5
MBL005,1,MBL005,2,yes,19.0,Female,3,1.5,cardio strength,moderate,...,7.5,60.5,9.0,8.5,10.0,86.0,5.5,0.0,85.0,281.5


In [None]:
df_ancovaData_pt1.loc[:,'modPlusVeryActive_fb'] = veryActiveCounts_divByHr + modActiveCounts_divByHr # fb = FitBit-recorded data

In [None]:
# Export dataframe to csv file
df_ancovaData_pt1.to_csv(path_or_buf = 'ancovaData_pt1.csv')

<a id='part_2'></a>
# Part 2

### Define relevant dates (the period leading up to each assessment)

In [None]:
relevant_dates = [('Quiz_1', '2016-09-07', '2016-09-15'),
                  ('Quiz_2', '2016-09-16', '2016-09-22'),
                  ('Quiz_3', '2016-09-23', '2016-09-29'),
                  ('Midterm_1', '2016-09-07', '2016-10-05'),
                  ('Quiz_4', '2016-10-06', '2016-10-13'),
                  ('Quiz_5', '2016-10-14', '2016-10-21'),
                  ('Quiz_6', '2016-10-22', '2016-10-28'),
                  ('Midterm_2', '2016-10-06', '2016-11-02'),
                  ('Quiz_7', '2016-11-03', '2016-11-10'),
                  ('Quiz_8', '2016-11-16', '2016-11-24'),
                  ('Midterm_3', '2016-11-03', '2016-11-30')]
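Since the windows differ in length (quizzes cover roughly a week, midterms cover the whole stretch since the previous midterm), it can help to compute each window's length before averaging over it. The sketch below does this for three of the entries above; the helper `window_days` is hypothetical, and counting both endpoints is an assumption that matches the inclusive label-based slicing used later.

```python
from datetime import date

# Three windows copied from the list above.
windows = [('Quiz_1', '2016-09-07', '2016-09-15'),
           ('Quiz_2', '2016-09-16', '2016-09-22'),
           ('Midterm_1', '2016-09-07', '2016-10-05')]

def window_days(start, end):
    """Number of calendar days in the window, counting both endpoints."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    if last < first:
        raise ValueError('window ends before it starts: {} > {}'.format(start, end))
    return (last - first).days + 1

lengths = {name: window_days(start, end) for name, start, end in windows}
```

Running the same check over the full list would also surface any gaps or reversed dates between consecutive windows.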

In [None]:
# Do not need anymore
# Returns a pd.series of each subjectID and their mean hours of "relevant sleep"/night during exam period of interest

# df_sleep_Quiz_1 = df_sleep.loc['2016-09-07': '2016-09-15', :]
# series_meanSleep_pt2_Quiz_1 = df_sleep_Quiz_1.groupby('subjectID')['sleepDuration'].mean() / 60

In [None]:
# NEED TO REMOVE
# def relevant_meanSleep():
#     for assessment, start, end in assessment_content:
#         new_df_name = df_sleep_+ str(assessment) # Assign new df name
#         new_df_name = df_sleep.loc[start : end, :]
        
#         new_series = series_meanSleep_pt2_ + str(assessment) # Assign new series name
#         new_series = new_df_name.groupby('subjectID')['sleepDuration'].mean() / 60
        
#         # Add to new df
        
# relevant_meanSleep()

### Create new DataFrame & add new columns

In [None]:
# Create new dataframe to use for ANCOVA pt2
df_ancovaData_pt2 = pd.DataFrame()

In [None]:
# Create new columns in df_ancovaData_pt2 for each exam score
df_ancovaData_pt2['overall_score'] = df_roster['overall_score'] # Add overall score

for assessment, start, end in relevant_dates: # Add scores for all other exams
    df_ancovaData_pt2[str(assessment)] = df_roster[assessment]

In [None]:
# Create new columns in df_ancovaData_pt2 for meanSleep within relevant dates for each exam
def _relevant_meanSleep(df, start, end):
    subset = df.loc[start : end, :] # Only include the relevant dates in the DataFrame
    return subset.groupby('subjectID')['sleepDuration'].mean() / 60

for assessment, start, end in relevant_dates:
    output = _relevant_meanSleep(df_sleep, start, end)
    colname = "meanSleep_{}".format(assessment)
    df_ancovaData_pt2.loc[:, colname] = output
    
df_ancovaData_pt2.loc[:, 'meanSleep_Quiz_1' :].head(20)

# Another potential method
# pd.concat([_relevant_meanSleep(df_sleep, start, end) for _, start, end in relevant_dates], axis=1)
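The date-window slicing above can be tried on toy data. The sketch assumes, as the cell above implies, that the sleep dataframe is indexed by date and that `sleepDuration` is in minutes; the helper name `windowed_subject_mean` and all values are hypothetical. It sorts the index first because label slicing on an unsorted DatetimeIndex can fail.

```python
import pandas as pd

def windowed_subject_mean(df, start, end, column):
    """Mean of `column` per subject for rows dated within [start, end]."""
    # .loc slicing with date strings includes both endpoints.
    subset = df.sort_index().loc[start:end]
    return subset.groupby('subjectID')[column].mean()

# Toy sleep records indexed by date (hypothetical values, minutes of sleep).
toy_sleep = pd.DataFrame(
    {'subjectID': ['A', 'A', 'B', 'A'],
     'sleepDuration': [420, 480, 360, 600]},
    index=pd.to_datetime(['2016-09-07', '2016-09-10',
                          '2016-09-15', '2016-09-16']),
)

# The 2016-09-16 night falls outside the Quiz_1 window and is excluded.
mean_hours = windowed_subject_mean(
    toy_sleep, '2016-09-07', '2016-09-15', 'sleepDuration') / 60
```

Passing the column name as a parameter lets one helper serve sleep duration, sleep quality, and similar per-window averages.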

In [None]:
# #TEMP
df_roster.loc[df_roster['subjectID'] == 'MBL055', 'Quiz_1' :]

In [None]:
#TEMP
df_hr_temp = df_hr.set_index('subjectID')

df_hr_temp.loc['MBL008', :]

In [None]:
# Create new columns in df_ancovaData_pt2 for sleepQuality within relevant dates for each exam
def _relevant_sleepQuality(df, start, end):
    subset = df.loc[start : end, :] # Only include the relevant dates in the DataFrame
    return subset.groupby('subjectID').sleepQualityScoreA.mean()

for assessment, start, end in relevant_dates:
    output = _relevant_sleepQuality(df_sleep, start, end)
    colname = "sleepQuality_{}".format(assessment)
    df_ancovaData_pt2.loc[:, colname] = output
    
df_ancovaData_pt2.head()

In [None]:
# Reindex df_mult so that 'dateTime' is used as the index
df_mult_reindex = df_mult.set_index('dateTime')
df_mult_reindex.head()

# Question: Why does setting 'date' instead of 'dateTime' as the index cause problems later?

In [None]:
# Create new columns in df_ancovaData_pt2 for avgStepsPerDay within relevant dates for each exam
def _relevant_avgStepsPerDay(df, start, end):
    subset = df.loc[start : end, :] # Only include the relevant dates in the DataFrame
    return subset.groupby(['subjectID', 'date']).steps.sum().groupby('subjectID').mean()

for assessment, start, end in relevant_dates:
    output = _relevant_avgStepsPerDay(df_mult_reindex, start, end)
    colname = "avgStepsPerDay_{}".format(assessment)
    df_ancovaData_pt2.loc[:, colname] = output
    
df_ancovaData_pt2.head()

In [None]:
# Create new columns in df_ancovaData_pt2 for avgActiveMinsPerDay within relevant dates for each exam
def _relevant_avgActiveMinsPerDay(df, start, end):
    subset = df.loc[start : end, :] # Only include the relevant dates in the DataFrame
    return subset.groupby(['subjectID', 'date']).activeMinutes.sum().groupby('subjectID').mean()

for assessment, start, end in relevant_dates:
    output = _relevant_avgActiveMinsPerDay(df_mult_reindex, start, end)
    colname = "avgActiveMinsPerDay_{}".format(assessment)
    df_ancovaData_pt2.loc[:, colname] = output
    
df_ancovaData_pt2.head()

In [None]:
# Reindex df_hr using 'dateTime'
df_hr_reindex = df_hr.set_index('dateTime')

#### Create new columns for VERY_ACTIVE counts

In [None]:
# Create new columns in df_ancovaData_pt2 for veryActiveCounts within relevant dates for each exam
def _relevant_veryActiveCounts(df1, df2, start, end):
    
    subset1 = df1.loc[start : end, :] # Only include the relevant dates in the DataFrame
    activityLevelCounts_subset = subset1.groupby(['subjectID']).activityLevel.value_counts()
    # print(activityLevelCounts_subset)
    
    subset2 = df2.loc[start : end, :] # Only include the relevant dates in the DataFrame
    num_days_of_data_by_hrData_subset = subset2.groupby('subjectID').date.nunique(dropna=True) # Days of FitBit Usage
    # print (num_days_of_data_by_hrData_subset)
    
    # For each subject, how many times during data collection period were classified as "VERY_ACTIVE"?
    veryActiveCounts_subset = activityLevelCounts_subset.loc[:, 'VERY_ACTIVE']
    # print (veryActiveCounts_subset)
    
    # Find the average number of 'veryActiveCounts' per day (divided by days of available data)
    veryActiveCounts_divByHr_subset = veryActiveCounts_subset.divide(num_days_of_data_by_hrData_subset)
    #If VERY_ACTIVE counts is 0, this line makes it say 0, instead of NaN
    veryActiveCounts_divByHr_subset.loc[veryActiveCounts_divByHr_subset.isnull()] = 0
    return veryActiveCounts_divByHr_subset
    
for assessment, start, end in relevant_dates:
    output = _relevant_veryActiveCounts(df_mult_reindex, df_hr_reindex, start, end)
    colname = "veryActiveCounts_{}".format(assessment)
    df_ancovaData_pt2.loc[:, colname] = output
    
df_ancovaData_pt2.head()

#### Create new columns for MODERATELY_ACTIVE counts

In [None]:
# Create new columns in df_ancovaData_pt2 for modActiveCounts within relevant dates for each exam
def _relevant_modActiveCounts(df1, df2, start, end):
    
    subset1 = df1.loc[start : end, :] # Only include the relevant dates in the DataFrame
    activityLevelCounts_subset = subset1.groupby(['subjectID']).activityLevel.value_counts()
    # print(activityLevelCounts_subset)
    
    subset2 = df2.loc[start : end, :] # Only include the relevant dates in the DataFrame
    num_days_of_data_by_hrData_subset = subset2.groupby('subjectID').date.nunique(dropna=True) # Days of FitBit Usage
    # print (num_days_of_data_by_hrData_subset)
    
    # For each subject, how many times during data collection period were classified as "MODERATELY_ACTIVE"?
    modActiveCounts_subset = activityLevelCounts_subset.loc[:, 'MODERATELY_ACTIVE']
    # print (modActiveCounts_subset)
    
    # Find the average number of 'modActiveCounts' per day (divided by days of available data)
    modActiveCounts_divByHr_subset = modActiveCounts_subset.divide(num_days_of_data_by_hrData_subset)
    #If MODERATELY_ACTIVE counts is 0, this line makes it say 0, instead of NaN
    modActiveCounts_divByHr_subset.loc[modActiveCounts_divByHr_subset.isnull()] = 0
    return modActiveCounts_divByHr_subset
    
for assessment, start, end in relevant_dates:
    output = _relevant_modActiveCounts(df_mult_reindex, df_hr_reindex, start, end)
    colname = "modActiveCounts_{}".format(assessment)
    df_ancovaData_pt2.loc[:, colname] = output
    
# df_ancovaData_pt2.loc[:, 'modActiveCounts_Quiz_1' :].head(10)
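The VERY_ACTIVE and MODERATELY_ACTIVE functions above differ only in the activity level they select, so they could share one helper that takes the level as a parameter. The following is a sketch on toy data, not a drop-in replacement: the function name `relevant_level_counts` is hypothetical, and it assumes both dataframes are indexed by a datetime and carry the `subjectID`, `activityLevel`, and `date` columns used above.

```python
import pandas as pd

def relevant_level_counts(df_activity, df_wear, start, end, level):
    """Average count per worn day of `level` periods within [start, end]."""
    activity = df_activity.sort_index().loc[start:end]
    wear = df_wear.sort_index().loc[start:end]
    counts = activity[activity['activityLevel'] == level].groupby('subjectID').size()
    days_worn = wear.groupby('subjectID')['date'].nunique()
    # Subjects who wore the device but never reached `level` get 0, not NaN.
    counts = counts.reindex(days_worn.index, fill_value=0)
    return counts / days_worn

# Toy data (hypothetical values); the 2016-09-20 rows lie outside the window.
times = pd.to_datetime(['2016-09-08 10:00', '2016-09-08 11:00',
                        '2016-09-09 10:00', '2016-09-09 12:00',
                        '2016-09-20 10:00'])
toy_activity = pd.DataFrame(
    {'subjectID': ['A', 'A', 'A', 'B', 'A'],
     'activityLevel': ['VERY_ACTIVE', 'VERY_ACTIVE', 'MODERATELY_ACTIVE',
                       'SEDENTARY', 'VERY_ACTIVE']},
    index=times)
toy_wear = pd.DataFrame(
    {'subjectID': ['A', 'A', 'A', 'B', 'A'],
     'date': ['2016-09-08', '2016-09-08', '2016-09-09',
              '2016-09-09', '2016-09-20']},
    index=times)

very = relevant_level_counts(toy_activity, toy_wear,
                             '2016-09-07', '2016-09-15', 'VERY_ACTIVE')
moderate = relevant_level_counts(toy_activity, toy_wear,
                                 '2016-09-07', '2016-09-15', 'MODERATELY_ACTIVE')
```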

In [None]:
df_ancovaData_pt2.head()

In [None]:
# Export dataframe to csv file
df_ancovaData_pt2.to_csv(path_or_buf = 'ancovaData_pt2.csv')

<a id='part_3'></a>
# Part 3: Group 1 & 2 Differences (T-Tests)

### Define new function to perform T-Test

In [None]:
def print_ind_ttest(df, column, group_a, group_b, y_col):
    """Print results of two-sample ind. t-test between two groups on y_col."""
    from scipy.stats import ttest_ind

    series_a = df.loc[df[column]==group_a, y_col]
    series_b = df.loc[df[column]==group_b, y_col]
    # y_col makes sure the rows that are returned only contain the data from the column you want (not the whole row)
    t_test = ttest_ind(series_a, series_b, nan_policy='omit')

    asterisk = " *" if t_test.pvalue < 0.05 else "" # an asterisk indicates a significant difference (p < 0.05)
    out = "{}{}\n".format(y_col.upper(), asterisk)
    out += ("p = {:.3f}, T = {:.2f}".format(t_test.pvalue, t_test.statistic))
    out += ("\n\t{}\t{}\nN\t{}\t{}\nMean\t{:.2f}\t{:.2f}"
            "".format(group_a, group_b, len(series_a), len(series_b),
                      series_a.mean(), series_b.mean()))
    print(out)
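For reference, the statistic that `ttest_ind` reports with its default equal-variance setting can be computed by hand. The sketch below uses only the standard library and toy numbers; the helper name `pooled_t_statistic` is hypothetical, and dropping NaN values before computing is meant to mirror `nan_policy='omit'`.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic with pooled variance, ignoring NaN values."""
    a = [x for x in sample_a if not math.isnan(x)]
    b = [x for x in sample_b if not math.isnan(x)]
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err

# Toy samples: means 2 and 3, both with sample variance 1; the NaN is dropped.
t_stat = pooled_t_statistic([1.0, 2.0, 3.0, float('nan')], [2.0, 3.0, 4.0])
```

A negative statistic means the first group's mean is lower than the second's, which is how the sign of `T` in the printed output above should be read.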

### Class Grades (T-Test Between Groups 1 and 2)

In [None]:
print("T-tests comparing assessment scores by group\n")
for assessment, _, _ in relevant_dates: # The underscore is used as a throwaway variable (which was 'dates' here)
    print_ind_ttest(df_roster, "group", 1, 2, assessment)
    print()

### Various Measures of Exercise and Sleep (T-Test Between Groups 1 and 2)

In [None]:
VariablesForTTest = ('meanSleep_fb',
                     'sleepQuality_fb',
                     'exercise_hrsDay_sr',
                     'exercise_daysWeek_sr',
                     'veryActive_fb',
                     'modActive_fb',
                     'stepsPerDay_fb',
                     'activeMinsPerDay_fb',
                     'modPlusVeryActive_fb'
                    )

print("T-tests comparing various exercise and sleep variables by group\n")
for variableOfInterest in VariablesForTTest: # Run the t-test for each variable of interest
    print_ind_ttest(df_ancovaData_pt1, "group", 1, 2, variableOfInterest)
    print()

<a id='part_4'></a>
# Part 4: Gender Differences (T-Tests)

### Class Grades (T-Test Between Genders)

In [None]:
print("T-tests comparing assessment scores by gender\n")
for assessment, _, _ in relevant_dates: # The underscore is used as a throwaway variable (which was 'dates' here)
    print_ind_ttest(df_roster, "Gender", "Female", "Male", assessment)
    print()

### Various Measures of Exercise and Sleep (T-Test Between Genders)

In [None]:
VariablesForTTest = ('meanSleep_fb',
                     'sleepQuality_fb',
                     'exercise_hrsDay_sr',
                     'exercise_daysWeek_sr',
                     'veryActive_fb',
                     'modActive_fb',
                     'stepsPerDay_fb',
                     'activeMinsPerDay_fb',
                     'modPlusVeryActive_fb'
                    )

print("T-tests comparing various exercise and sleep variables by gender\n")
for variableOfInterest in VariablesForTTest: # Run the t-test for each variable of interest
    print_ind_ttest(df_ancovaData_pt1, "gender", "Female", "Male", variableOfInterest)
    print()

# Extra Things: Still in Progress