# The Setting -- The Return to Hogwarts

Being a wizard can be dangerous. Being a wizard in training can be even more dangerous. The Hogwarts school nurse is a very busy person and records their activity in a log. You are asked to review a sample of the logs from the nurse's office of Hogwarts and to **clean** these logs for analysis later on.  Please perform these parts **in order** as earlier steps may impact later steps.

# Part 1 -- Create New Test Cases

(10 points) Review the original data and the data quality tasks below.  Design 10 test records (rows of patient visits) to insert into the original data.  You can manually develop these by modifying the original file or use Python in some capacity to generate them. If you modify the original file, please submit your modified file with your submission.

In plain english, explain how your test records will help test the data quality tasks. What makes them good test cases?

***The test records below are designed to verify the data quality tasks below, and contain the same variables as the nurse log data.***

* ***The first test simply uses NaN (Numpy) for the time_spent variable, 
which should be caught in the first of the data quality tasks (part 3) where all missing values are removed from the time_spent column.***

* ***Tests 2 and 3 deal with the threshold cases for the height(cm) variable (part 4), which removes values greater than 250. Test 2 is 250, which is the maximum allowable numeric value; test 3 is 251, which is the smallest numeric value that should be caught in the task of part 4.***

* ***Tests 4, 5, 6, and 7 are designed to test the supplies_used data quality check, where missing values, negative values, and values greater than 100 are replaced with the mean of the supplies_used column - test 4 uses Nan and should be replaced with the mean; test 5 uses -1 and should be replaced with the mean; test 6 uses 100, which is the maximum allowable numeric value and should NOT be replaced with the mean; and test 7 uses 101, which is the smallest positive value that should be replaced with the mean.***

* ***The time_spent data quality task is checked by filling all test case columns using hours instead of the minutes unit, which will need proper conversion of units and conversion of numeric time value to minutes. Test 8 changes the hours to minute, which is the proper unit by has a typo (minute instead of minutes).***

* ***All test cases are filled with the same valid date to test the bad dates quality check. Test 9 uses the same date in a written format (instead of numerical format) and test 10 uses the same date with a typo.***

In [31]:
import pandas as pd
import numpy as np

# nurse log file from google drive
url = r'https://raw.githubusercontent.com/masseygeo/LearningDataScience/main/datasets/Pandas_02.csv'

# read nurse log csv file into pandas dataframe
df_original = pd.read_csv(url)

# display first five rows of nurse log dataframe
df_original.head()

Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,height(cm),weight(kg),charge,supplies_used
0,15685.0,Harry,Potter,8219,06-05-1994,20 minutes,174.0,57,25.72,5.0
1,7619.0,Ron,Weasley,7512,01-15-1994,10 minutes,180.0,60,7.16,2.0
2,14593.0,Hermione,Granger,5896,01-25-1994,20 minutes,164.0,53,8.85,1.0
3,15685.0,Harry,Potter,1552,1994-02-15,5 minutes,174.0,58,25.72,3.0
4,15685.0,Harry,Potter,1202,05-19-1994,20 minutes,174.0,55,25.72,3.0


In [32]:
# create dataframe for test data; all values equal to 1; use columns from original nuse log df; 
# index named test1 to test 10
df_test = pd.DataFrame(1, columns = df_original.columns, index=['test'+str(x+1) for x in range(10)])

# fill name columns
df_test['first_name'] = 'Rubeus'
df_test['last_name'] = 'Hagrid'


# create new test cases for data cleaning

# test 1 - missing data in time_spent
df_test.loc['test1', 'time_spent'] = np.nan

# tests 2-3 - endmember range values in height(cm)
df_test.loc['test2', 'height(cm)'] = 250
df_test.loc['test3', 'height(cm)'] = 251

# tests 4-7 - supplies_used quality check
df_test.loc['test4', 'supplies_used'] = np.nan
df_test.loc['test5', 'supplies_used'] = -1
df_test.loc['test6', 'supplies_used'] = 100
df_test.loc['test7', 'supplies_used'] = 101

# test 8 - time_spent column separation and units conversion
df_test.loc['test2':, 'time_spent'] = '1 hours'
df_test.loc['test8', 'time_spent'] = '1 minute'

# tests 9-10 - poorly formatted date
df_test['date'] = '03-30-1950'
df_test.loc['test9','date'] = 'March 30, 1950'
df_test.loc['test10', 'date'] = '03-30-195'


# display dataframe of test cases
df_test.head(10)

Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,height(cm),weight(kg),charge,supplies_used
test1,1,Rubeus,Hagrid,1,03-30-1950,,1,1,1,1.0
test2,1,Rubeus,Hagrid,1,03-30-1950,1 hours,250,1,1,1.0
test3,1,Rubeus,Hagrid,1,03-30-1950,1 hours,251,1,1,1.0
test4,1,Rubeus,Hagrid,1,03-30-1950,1 hours,1,1,1,
test5,1,Rubeus,Hagrid,1,03-30-1950,1 hours,1,1,1,-1.0
test6,1,Rubeus,Hagrid,1,03-30-1950,1 hours,1,1,1,100.0
test7,1,Rubeus,Hagrid,1,03-30-1950,1 hours,1,1,1,101.0
test8,1,Rubeus,Hagrid,1,03-30-1950,1 minute,1,1,1,1.0
test9,1,Rubeus,Hagrid,1,"March 30, 1950",1 hours,1,1,1,1.0
test10,1,Rubeus,Hagrid,1,03-30-195,1 hours,1,1,1,1.0


# Part 2 -- Create

(5 points) Load the original data and your new test cases into the same Python structure (of your choice).



In [33]:
# concatenate original nurse log dataframe with dataframe of test cases, both have same columns; 
# make copy of data instead of view
df = pd.concat([df_original, df_test], copy=True)

# display concatenated dataframe of nurse log and test cases
df.head(len(df))

Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,height(cm),weight(kg),charge,supplies_used
0,15685.0,Harry,Potter,8219,06-05-1994,20 minutes,174.0,57,25.72,5.0
1,7619.0,Ron,Weasley,7512,01-15-1994,10 minutes,180.0,60,7.16,2.0
2,14593.0,Hermione,Granger,5896,01-25-1994,20 minutes,164.0,53,8.85,1.0
3,15685.0,Harry,Potter,1552,1994-02-15,5 minutes,174.0,58,25.72,3.0
4,15685.0,Harry,Potter,1202,05-19-1994,20 minutes,174.0,55,25.72,3.0
5,8954.0,Dobby,,1205,03-12-1994,10 minutes,106.0,25,17.42,0.0
6,7619.0,Ron,Weasley,6895,04-05-1994,.5 hours,,60,22.07,5.0
7,15689.0,Harry,Potter,6854,10-11-1994,20 minute,57.0,174,18.09,5.0
8,7619.0,Ron,Weasley,1265,07-08-1994,15 minutes,180.0,61,15.24,1.0
9,,Serverus,Snape,5454,09-12-1994,10 minutes,185.0,82,0.0,0.0


# Part 3 -- Missing Data for Time Spent

(5 points) Drop all rows with missing values for the time_spent column.  Print your data to verify your changes.


In [34]:
# summarize data
# number of null values in time_spent (sum true bools); should be 2 records - row 18 and test1
print('Missing data in column *time_spent*: ', df['time_spent'].isna().sum())

# boolean mask using values that are NOT null with notna method
df = df[df['time_spent'].notna()]

# summarize modified data
# number of null values in time_spent of modified df (sum of true bools); should be 0 records
print('Missing data in *time_spent* of modified df: ', df['time_spent'].isna().sum())

# dispaly modified nurse log dataframe
df.head(len(df))

Missing data in column *time_spent*:  2
Missing data in *time_spent* of modified df:  0


Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,height(cm),weight(kg),charge,supplies_used
0,15685.0,Harry,Potter,8219,06-05-1994,20 minutes,174.0,57,25.72,5.0
1,7619.0,Ron,Weasley,7512,01-15-1994,10 minutes,180.0,60,7.16,2.0
2,14593.0,Hermione,Granger,5896,01-25-1994,20 minutes,164.0,53,8.85,1.0
3,15685.0,Harry,Potter,1552,1994-02-15,5 minutes,174.0,58,25.72,3.0
4,15685.0,Harry,Potter,1202,05-19-1994,20 minutes,174.0,55,25.72,3.0
5,8954.0,Dobby,,1205,03-12-1994,10 minutes,106.0,25,17.42,0.0
6,7619.0,Ron,Weasley,6895,04-05-1994,.5 hours,,60,22.07,5.0
7,15689.0,Harry,Potter,6854,10-11-1994,20 minute,57.0,174,18.09,5.0
8,7619.0,Ron,Weasley,1265,07-08-1994,15 minutes,180.0,61,15.24,1.0
9,,Serverus,Snape,5454,09-12-1994,10 minutes,185.0,82,0.0,0.0


# Part 4 -- Range checking for Height

(10 points) Drop all rows with values larger than 250 for height (slightly larger than the world record for tallest person). Print your data to verify your changes.


In [35]:
# summarize df
# mask df using values greater than 250 (exclusive) in height(cm) then get length; 
# should be 2 records - row 20 and test 3
print('Values greater than 250 in column *height(cm)*: ', len(df[df['height(cm)'] > 250]))


# modify df by dropping records less than or equal to 250; filter df with values greater than 250; 
#get index to use in drop method
df = df.drop(df[df['height(cm)'] > 250].index)


# summarize modified df
# filter df using values greater than 250 (exclusive) in height(cm) then get length; should be 0 records
print('Values greater than 250 in column *height(cm)* in modified df: ', len(df[df['height(cm)'] > 250]))


# dispaly modified nurse log dataframe
df.head(len(df))

Values greater than 250 in column *height(cm)*:  2
Values greater than 250 in column *height(cm)* in modified df:  0


Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,height(cm),weight(kg),charge,supplies_used
0,15685.0,Harry,Potter,8219,06-05-1994,20 minutes,174.0,57,25.72,5.0
1,7619.0,Ron,Weasley,7512,01-15-1994,10 minutes,180.0,60,7.16,2.0
2,14593.0,Hermione,Granger,5896,01-25-1994,20 minutes,164.0,53,8.85,1.0
3,15685.0,Harry,Potter,1552,1994-02-15,5 minutes,174.0,58,25.72,3.0
4,15685.0,Harry,Potter,1202,05-19-1994,20 minutes,174.0,55,25.72,3.0
5,8954.0,Dobby,,1205,03-12-1994,10 minutes,106.0,25,17.42,0.0
6,7619.0,Ron,Weasley,6895,04-05-1994,.5 hours,,60,22.07,5.0
7,15689.0,Harry,Potter,6854,10-11-1994,20 minute,57.0,174,18.09,5.0
8,7619.0,Ron,Weasley,1265,07-08-1994,15 minutes,180.0,61,15.24,1.0
9,,Serverus,Snape,5454,09-12-1994,10 minutes,185.0,82,0.0,0.0


# Part 5 -- Missing Data for Supplies Used

(10 points)  For the supplies used column, replace (1) all missing values, (2) all negative values, and (3) all values over 100 (exclusive). Replace with the **mean** value for supplies used; restrict your mean calculation using non-missing, non-negative values from the *supplies_used* column that are less than or equal to 100. Print your data to verify your changes.


In [36]:
# summarize df
# sum number of missing values in supplies_used; should be 2 missing values - row 13 and test4
print('Missing values in *supplies used*: ', df['supplies_used'].isna().sum())

# query df for all values less than 0 (negative) then find length of df; count of negative records 
# should be 1 - test5
print('Negative values in *supplies used*: ', len(df.query('supplies_used < 0')))

# query df for values over 100 (exclusive) in supplies_used then find length of df; count of records 
#should be 1 - test7
print('Values greater than 100 (exclusive) in *supplies used*: ', 
      len(df.query('supplies_used > 100')))


# modify df
# new df without missing, negatives, or over 100 in supplies_used; then find mean of new df in 
# all numeric columns
df_mean = df.query('0 <= supplies_used <= 100').mean(numeric_only=True)

# replace missing values in supplies_used column with mean
df.loc[df['supplies_used'].isna().index, 'supplies_used'] = df_mean['supplies_used']

# replace negative values in supplies_used column with mean
df.loc[df.query('supplies_used < 0').index, 'supplies_used'] = df_mean['supplies_used']

# replace values over 100 in supplies_used column with mean
df.loc[df.query('supplies_used > 100').index, 'supplies_used'] = df_mean['supplies_used']


# summarize modified df
# sum number of missing values in supplies_used; should be 0
print('\nMissing values in *supplies used* in modified df: ', df['supplies_used'].isna().sum())

# query df for all values less than 0 (negative) then find length of df; count of negative records should be 0
print('Negative values in *supplies used* in modified df: ', 
      len(df.query('supplies_used < 0').index))

# query df for values over 100 (exclusive) in supplies_used then find length of df; count of records should be 0
print('Values greater than 100 in *supplies used* in modified df: ', 
      len(df.query('supplies_used > 100').index))

# print note about replaced values
print('  ***Missing, negative, or values greater than 100 replaced by mean of *supplies_used* = ', 
      df_mean['supplies_used'])


# dispaly modified nurse log dataframe
df.head(len(df))

Missing values in *supplies used*:  2
Negative values in *supplies used*:  1
Values greater than 100 (exclusive) in *supplies used*:  1

Missing values in *supplies used* in modified df:  0
Negative values in *supplies used* in modified df:  0
Values greater than 100 in *supplies used* in modified df:  0
  ***Missing, negative, or values greater than 100 replaced by mean of *supplies_used* =  7.875


Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,height(cm),weight(kg),charge,supplies_used
0,15685.0,Harry,Potter,8219,06-05-1994,20 minutes,174.0,57,25.72,7.875
1,7619.0,Ron,Weasley,7512,01-15-1994,10 minutes,180.0,60,7.16,7.875
2,14593.0,Hermione,Granger,5896,01-25-1994,20 minutes,164.0,53,8.85,7.875
3,15685.0,Harry,Potter,1552,1994-02-15,5 minutes,174.0,58,25.72,7.875
4,15685.0,Harry,Potter,1202,05-19-1994,20 minutes,174.0,55,25.72,7.875
5,8954.0,Dobby,,1205,03-12-1994,10 minutes,106.0,25,17.42,7.875
6,7619.0,Ron,Weasley,6895,04-05-1994,.5 hours,,60,22.07,7.875
7,15689.0,Harry,Potter,6854,10-11-1994,20 minute,57.0,174,18.09,7.875
8,7619.0,Ron,Weasley,1265,07-08-1994,15 minutes,180.0,61,15.24,7.875
9,,Serverus,Snape,5454,09-12-1994,10 minutes,185.0,82,0.0,7.875


# Part 6 -- Normalize Time Spent

(15 points) Add two columns to your data structure by splitting the *time_spent* column. Split this into two columns named *time_spent* and *time_spent_unit*, where the number value is stored into *time_spent* and the description of the unit (minutes, hours, etc) is stored in *time_spent_unit*.  

Furthermore, convert any values with hours as units into minutes as units and correct any typos of minute to minutes. Print your data to verify your changes.



In [37]:
# modify df
# insert new column called time_spent_unit right of existing time_spent column with values of nan
df.insert(6, 'time_spent_unit', value=np.nan)

# split time_spent column in two columns of new dataframe; split on space in string
split = df['time_spent'].str.split(expand=True)

# place numeric time measurements into time_spent column of df, casting data type to float
df['time_spent'] = split[0].astype('float64')

# place units into time_spent_unit column of df
df['time_spent_unit'] = split[1]


# summarize df
# summarize total number of fixes to do - should be 10 at row 6, 7, test2, test4-10
print('Total number of typos or wrong units in column *time_spent_unit*: ', 
      len(df.query('time_spent_unit != "minutes"')))

# summarize number of units to convert - should be 8 at row 6, test2, test4-7, test9-10
print('Number of rows in "hours" units in column *time_spent_unit*: ', 
      len(df.query('time_spent_unit == "hours"')))


# modify df fixing typos and converting units
# correct minute typos replacing minute with minutes; modify in place
df['time_spent_unit'].replace('minute', 'minutes', inplace=True)

# find rows with hours units and time_spent columun only; multiply by 60 to convert hours to minutes
df.loc[df.query('time_spent_unit == "hours"').index, 'time_spent'] *= 60

# change hours to minutes in time_spent_unit column; modify in place
df['time_spent_unit'].replace('hours', 'minutes', inplace=True)


# summarize modified df
# summarize total number of typos or conversions to do; should be 0
print('\nTotal number of typos or wrong units in column *time_spent_unit* in modified df: ', 
      len(df.query('time_spent_unit != "minutes"')))

# summarize number of units to convert; should be 0
print('Number of hours units in column *time_spent_unit*in modified df: ', 
      len(df.query('time_spent_unit == "hours"')))


# display modified df
df.head(len(df))

Total number of typos or wrong units in column *time_spent_unit*:  10
Number of rows in "hours" units in column *time_spent_unit*:  8

Total number of typos or wrong units in column *time_spent_unit* in modified df:  0
Number of hours units in column *time_spent_unit*in modified df:  0


Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,time_spent_unit,height(cm),weight(kg),charge,supplies_used
0,15685.0,Harry,Potter,8219,06-05-1994,20.0,minutes,174.0,57,25.72,7.875
1,7619.0,Ron,Weasley,7512,01-15-1994,10.0,minutes,180.0,60,7.16,7.875
2,14593.0,Hermione,Granger,5896,01-25-1994,20.0,minutes,164.0,53,8.85,7.875
3,15685.0,Harry,Potter,1552,1994-02-15,5.0,minutes,174.0,58,25.72,7.875
4,15685.0,Harry,Potter,1202,05-19-1994,20.0,minutes,174.0,55,25.72,7.875
5,8954.0,Dobby,,1205,03-12-1994,10.0,minutes,106.0,25,17.42,7.875
6,7619.0,Ron,Weasley,6895,04-05-1994,30.0,minutes,,60,22.07,7.875
7,15689.0,Harry,Potter,6854,10-11-1994,20.0,minutes,57.0,174,18.09,7.875
8,7619.0,Ron,Weasley,1265,07-08-1994,15.0,minutes,180.0,61,15.24,7.875
9,,Serverus,Snape,5454,09-12-1994,10.0,minutes,185.0,82,0.0,7.875


# Part 7 -- Replace Bad Dates

(5 points) Replace any bad dates (missing, impossible dates, poorly formated, etc) with a date representing January 1st, 1994. Print your data to verify your changes.


In [38]:
# convert dates to datetime dtype using month-day-year format; errors coerced to NaT values
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%Y', errors='coerce')


# summarize number of bad dates - should be 6 at rows 3,10,13,19, test9
print('Total number of bad dates (incorrectly formatted and/or impossible values) in *date*: ', 
      df['date'].isna().sum())


# replace all bad dates with January 1, 1994
df.loc[df['date'].isna(),'date'] = '1994-01-01'


# summarize number of bad dates in modified df - should be 0
print('\nTotal number of bad dates (incorrectly formatted and/or impossible values) in *date* in modified df: ', 
      df['date'].isna().sum())


# display modified df
df.head(len(df))

Total number of bad dates (incorrectly formatted and/or impossible values) in *date*:  6

Total number of bad dates (incorrectly formatted and/or impossible values) in *date* in modified df:  0


Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,time_spent_unit,height(cm),weight(kg),charge,supplies_used
0,15685.0,Harry,Potter,8219,1994-06-05,20.0,minutes,174.0,57,25.72,7.875
1,7619.0,Ron,Weasley,7512,1994-01-15,10.0,minutes,180.0,60,7.16,7.875
2,14593.0,Hermione,Granger,5896,1994-01-25,20.0,minutes,164.0,53,8.85,7.875
3,15685.0,Harry,Potter,1552,1994-01-01,5.0,minutes,174.0,58,25.72,7.875
4,15685.0,Harry,Potter,1202,1994-05-19,20.0,minutes,174.0,55,25.72,7.875
5,8954.0,Dobby,,1205,1994-03-12,10.0,minutes,106.0,25,17.42,7.875
6,7619.0,Ron,Weasley,6895,1994-04-05,30.0,minutes,,60,22.07,7.875
7,15689.0,Harry,Potter,6854,1994-10-11,20.0,minutes,57.0,174,18.09,7.875
8,7619.0,Ron,Weasley,1265,1994-07-08,15.0,minutes,180.0,61,15.24,7.875
9,,Serverus,Snape,5454,1994-09-12,10.0,minutes,185.0,82,0.0,7.875


# Part 8 -- Consistency of IDs

(10 points) Replace any inconsistent medical record numbers with the most commonly occurring medical record number for each first/last name combination (ignoring case). For example, 015689 was inconsistent with Potter's other IDs and would become 015685.

In [39]:
# group by first_name and last_name; ignoring case by changing both variables to lowercase; including NA values
group_mrn = df.groupby([df['first_name'].str.lower(), df['last_name'].str.lower()], dropna=False)


# print unique values of group object to see unique values & first/last name combinations
# harry potter and ron weasley both have multiple values (NaN,potter is likely harry potter, but specifications require each unique combination)
print('First_name-last_name combinations (case insensitive) with multiple unique values in *medical_record_number*: \n')
unique_mrn = group_mrn['medical_record_number'].apply(lambda x: x.unique())
print(unique_mrn[unique_mrn.map(len)>1])


# get most common values for name combinations with more than one unique value - harry potter and ron weasley
# harry potter
hp_mode = df.loc[group_mrn.groups['harry','potter'],'medical_record_number'].mode().values[0]
print('\nMost commonly occurring medical_record_number for harry potter: ', hp_mode)
# ron weasley
rw_mode = df.loc[group_mrn.groups['ron','weasley'],'medical_record_number'].mode().values[0]
print('\nMost commonly occurring medical_record_number for ron weasley: ', rw_mode)


# replace all medical_record_number values with most common value for harry potter and ron weasley
df.loc[group_mrn.groups['harry','potter'], 'medical_record_number'] = hp_mode
df.loc[group_mrn.groups['ron','weasley'], 'medical_record_number'] = rw_mode


# check unique values again to verify
print('\nFirst_name-last_name combinations (case insensitive) with multiple unique values in *medical_record_number* in modified df: \n')
unique_mrn_mod = group_mrn['medical_record_number'].apply(lambda x: x.unique())
print(unique_mrn_mod[unique_mrn_mod.map(len)>1])


# display modified df
df.head(len(df))

First_name-last_name combinations (case insensitive) with multiple unique values in *medical_record_number*: 

first_name  last_name
harry       potter       [15685.0, 15689.0]
ron         weasley       [7619.0, 15685.0]
Name: medical_record_number, dtype: object

Most commonly occurring medical_record_number for harry potter:  15685.0

Most commonly occurring medical_record_number for ron weasley:  7619.0

First_name-last_name combinations (case insensitive) with multiple unique values in *medical_record_number* in modified df: 

Series([], Name: medical_record_number, dtype: object)


Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,time_spent_unit,height(cm),weight(kg),charge,supplies_used
0,15685.0,Harry,Potter,8219,1994-06-05,20.0,minutes,174.0,57,25.72,7.875
1,7619.0,Ron,Weasley,7512,1994-01-15,10.0,minutes,180.0,60,7.16,7.875
2,14593.0,Hermione,Granger,5896,1994-01-25,20.0,minutes,164.0,53,8.85,7.875
3,15685.0,Harry,Potter,1552,1994-01-01,5.0,minutes,174.0,58,25.72,7.875
4,15685.0,Harry,Potter,1202,1994-05-19,20.0,minutes,174.0,55,25.72,7.875
5,8954.0,Dobby,,1205,1994-03-12,10.0,minutes,106.0,25,17.42,7.875
6,7619.0,Ron,Weasley,6895,1994-04-05,30.0,minutes,,60,22.07,7.875
7,15685.0,Harry,Potter,6854,1994-10-11,20.0,minutes,57.0,174,18.09,7.875
8,7619.0,Ron,Weasley,1265,1994-07-08,15.0,minutes,180.0,61,15.24,7.875
9,,Serverus,Snape,5454,1994-09-12,10.0,minutes,185.0,82,0.0,7.875


# Part 9 -- Calculate Aggregates, Part 1

(5 points) Use your cleaned data to calculate the mean time spent (in minutes) for all records. Print this value.



In [40]:
# mean time spent for all records
mean_time_spent = df['time_spent'].mean()
print('Average time spent for all records (in minutes): ', mean_time_spent)

Average time spent for all records (in minutes):  25.642857142857142


# Part 10 -- Calculate Aggregates, Part 2

(5 points) Use your cleaned data to find the month in 1994 with the largest amount of time spent logged by the nurse. Print this value.  

Leave as a comment any lingering data quality concerns you might have in reporting aggregate monthly values back to Hogwarts administration.

In [41]:
# filter df for only records in 1994; group filtered data by month
year_month_group = df[df['date'].dt.year==1994].groupby(df['date'].dt.month)


# find sum of time spent for each month; sort from largest to smallest; reset index starting with 0
a = year_month_group['time_spent'].apply(lambda x: x.sum()).sort_values(ascending=False).reset_index()


# print month number with the most time spent in 1994
print('The month number in 1994 with the most time_spent was: ', a.loc[0,'date'], '\n')


# report lingering data quality concerns...
print('Data quality issues remain in the nurse log dataset sample, including...')
print('  * Missing values still exist in medical_record_number, first_name, last_name, and height(cm).')
print('  * There are still reconciliations to be made in first_name, last_name, and medical_record_numbers.')
print('  * There may be issues with unreasonably large values in weight(kg).')


# display final df
df.head(len(df))

The month number in 1994 with the most time_spent was:  1 

Data quality issues remain in the nurse log dataset sample, including...
  * Missing values still exist in medical_record_number, first_name, last_name, and height(cm).
  * There are still reconciliations to be made in first_name, last_name, and medical_record_numbers.
  * There may be issues with unreasonably large values in weight(kg).


Unnamed: 0,medical_record_number,first_name,last_name,visit_id,date,time_spent,time_spent_unit,height(cm),weight(kg),charge,supplies_used
0,15685.0,Harry,Potter,8219,1994-06-05,20.0,minutes,174.0,57,25.72,7.875
1,7619.0,Ron,Weasley,7512,1994-01-15,10.0,minutes,180.0,60,7.16,7.875
2,14593.0,Hermione,Granger,5896,1994-01-25,20.0,minutes,164.0,53,8.85,7.875
3,15685.0,Harry,Potter,1552,1994-01-01,5.0,minutes,174.0,58,25.72,7.875
4,15685.0,Harry,Potter,1202,1994-05-19,20.0,minutes,174.0,55,25.72,7.875
5,8954.0,Dobby,,1205,1994-03-12,10.0,minutes,106.0,25,17.42,7.875
6,7619.0,Ron,Weasley,6895,1994-04-05,30.0,minutes,,60,22.07,7.875
7,15685.0,Harry,Potter,6854,1994-10-11,20.0,minutes,57.0,174,18.09,7.875
8,7619.0,Ron,Weasley,1265,1994-07-08,15.0,minutes,180.0,61,15.24,7.875
9,,Serverus,Snape,5454,1994-09-12,10.0,minutes,185.0,82,0.0,7.875
