# Exercises - Tidy Data

<h2 id="exercises">Exercises</h2>
<p>Create a directory named <code>advanced_topics</code> and, within that directory, do your work for this exercise in a Jupyter notebook or Python script named <code>tidy_data</code>.</p>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
file_loc = './untidy_data/'

<h3>1</h3>
<h3>Attendance Data</h3>

<h4>1.1</h4>
<p>Load the <code>attendance.csv</code> file and calculate an attendance percentage for each student.</p> 

In [3]:
df_abs = pd.read_csv(file_loc + 'attendance.csv').rename(columns={'Unnamed: 0': 'student'})
df_abs

Unnamed: 0,student,2018-01-01,2018-01-02,2018-01-03,2018-01-04,2018-01-05,2018-01-06,2018-01-07,2018-01-08
0,Sally,P,T,T,H,P,A,T,T
1,Jane,A,P,T,T,T,T,A,T
2,Billy,A,T,A,A,H,T,P,T
3,John,P,T,H,P,P,T,P,P


<h4>1.2</h4>
<p>One half day is worth 50% of a full day, and 10 tardies are equal to one absence.</p>

In [4]:
abs_score = {
    'A': 0,
    'T': .9,
    'H': .5,
    'P': 1
}
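
The weights in `abs_score` follow from the two rules above: a half day counts as 0.5 of a day, and if 10 tardies equal one absence, each tardy costs a tenth of a day and leaves 0.9. The sketch below spells out that arithmetic and checks it on Billy's row. The names `TARDIES_PER_ABSENCE`, `HALF_DAY_FRACTION`, `weights`, and `billy_codes` are illustrative assumptions, not part of the exercise; the codes are copied from the table above.

```python
# Illustrative constants; the names are assumptions, not from the exercise.
TARDIES_PER_ABSENCE = 10
HALF_DAY_FRACTION = 0.5

weights = {
    'A': 0.0,                          # absent: no credit
    'T': 1 - 1 / TARDIES_PER_ABSENCE,  # tardy: loses a tenth of a day
    'H': HALF_DAY_FRACTION,            # half day
    'P': 1.0,                          # present: full credit
}

# Billy's row from attendance.csv as shown in the output above.
billy_codes = ['A', 'T', 'A', 'A', 'H', 'T', 'P', 'T']
billy_pct = sum(weights[code] for code in billy_codes) / len(billy_codes)
print(round(billy_pct, 4))
```

Rounded to four places, this should agree with the 0.5250 expected for Billy in exercise 1.3.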

In [5]:
dfm_abs = df_abs.melt(id_vars='student', var_name='date', value_name='abs_code')
dfm_abs.head(8)

Unnamed: 0,student,date,abs_code
0,Sally,2018-01-01,P
1,Jane,2018-01-01,A
2,Billy,2018-01-01,A
3,John,2018-01-01,P
4,Sally,2018-01-02,T
5,Jane,2018-01-02,P
6,Billy,2018-01-02,T
7,John,2018-01-02,T


In [6]:
# dfm_abs['abs_pct'] = dfm_abs.abs_code.apply(lambda x: abs_score[x])
dfm_abs['abs_pct'] = dfm_abs.abs_code.map(abs_score)
dfm_abs.head(8)

Unnamed: 0,student,date,abs_code,abs_pct
0,Sally,2018-01-01,P,1.0
1,Jane,2018-01-01,A,0.0
2,Billy,2018-01-01,A,0.0
3,John,2018-01-01,P,1.0
4,Sally,2018-01-02,T,0.9
5,Jane,2018-01-02,P,1.0
6,Billy,2018-01-02,T,0.9
7,John,2018-01-02,T,0.9


<h4>1.3</h4>
<p>You should end up with this:</p>
<pre><code>name
Billy    0.5250
Jane     0.6875
John     0.9125
Sally    0.7625
Name: grade, dtype: float64</code></pre>

In [7]:
dfs_abs = dfm_abs.groupby(by='student').abs_pct.agg(['sum','mean'])
dfs_abs

student,sum,mean
Billy,4.2,0.525
Jane,5.5,0.6875
John,7.3,0.9125
Sally,6.1,0.7625
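
As a cross-check on the melt-and-group approach, the same percentages can probably be computed straight from the wide table by mapping each date column to its weight and averaging across the row. This is only a sketch: the frame is rebuilt inline from the output shown earlier so the block runs on its own, and `rows`, `wide`, and `pct` are illustrative names.

```python
import pandas as pd

# Attendance codes copied from the attendance.csv output shown earlier.
dates = pd.date_range('2018-01-01', periods=8).strftime('%Y-%m-%d')
rows = {
    'Sally': list('PTTHPATT'),
    'Jane':  list('APTTTTAT'),
    'Billy': list('ATAAHTPT'),
    'John':  list('PTHPPTPP'),
}
wide = pd.DataFrame.from_dict(rows, orient='index', columns=dates)
wide.index.name = 'student'

abs_score = {'A': 0, 'T': .9, 'H': .5, 'P': 1}

# Map every date column to its weight, then average across the dates.
pct = wide.apply(lambda col: col.map(abs_score)).mean(axis=1).sort_index()
print(pct.round(4))
```

The long (melted) form is still the more flexible shape, since it also supports per-code summaries like the one in the next cell.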


In [8]:
dfgc_abs = dfm_abs.rename(columns={'date': 'code'}).groupby(by=['student', 'abs_code']).code.count()
dfgs_abs = dfm_abs.groupby(by=['student', 'abs_code']).abs_pct.sum()
dfg_abs = pd.DataFrame([dfgc_abs, dfgs_abs]).T
dfg_abs

student,abs_code,code,abs_pct
Billy,A,3.0,0.0
Billy,H,1.0,0.5
Billy,P,1.0,1.0
Billy,T,3.0,2.7
Jane,A,2.0,0.0
Jane,P,1.0,1.0
Jane,T,5.0,4.5
John,H,1.0,0.5
John,P,5.0,5.0
John,T,2.0,1.8
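
The count and sum per student and code can also be produced in one pass with pandas named aggregation, which avoids building two Series and transposing them. A minimal sketch, using only Billy's codes (copied from the data above) so that it is self-contained; `long_df` and `summary` are illustrative names.

```python
import pandas as pd

abs_score = {'A': 0, 'T': .9, 'H': .5, 'P': 1}

# Long-form rows for one student, copied from the attendance data above.
long_df = pd.DataFrame({
    'student': ['Billy'] * 8,
    'abs_code': list('ATAAHTPT'),
})
long_df['abs_pct'] = long_df['abs_code'].map(abs_score)

# Named aggregation: output column = (input column, aggregation function).
summary = long_df.groupby(['student', 'abs_code']).agg(
    code=('abs_pct', 'count'),
    abs_pct=('abs_pct', 'sum'),
)
print(summary)
```

One difference from the transposed version: the counts stay integers here instead of being upcast to floats.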


<h3>2</h3><h3>Coffee Levels</h3>

<h4>2.1</h4>
<p>Read the <code>coffee_levels.csv</code> file.</p>

In [9]:
df_cof = pd.read_csv(file_loc + 'coffee_levels.csv')
df_cof.sample(5)

Unnamed: 0,hour,coffee_carafe,coffee_amount
16,14,y,0.058361
12,10,y,0.023163
8,16,x,0.183891
26,14,z,0.864464
27,15,z,0.436364


<h4>2.2</h4>
<p>Transform the data so that each carafe is in its own column.</p>

In [10]:
dfp_cof = df_cof.pivot_table(values='coffee_amount', index='hour', columns='coffee_carafe')
dfp_cof

hour,x,y,z
8,0.816164,0.189297,0.999264
9,0.451018,0.521502,0.91599
10,0.843279,0.023163,0.144928
11,0.335533,0.235529,0.311495
12,0.898291,0.017009,0.771947
13,0.310711,0.997464,0.39852
14,0.507288,0.058361,0.864464
15,0.215043,0.144644,0.436364
16,0.183891,0.544676,0.280621
17,0.39156,0.594126,0.436677


<h4>2.3</h4>
<p>Is this the best shape for the data?</p>

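This is not the best shape for the data. Every carafe is measured in the same unit, so the carafe is one variable with three values rather than three separate variables, and pivoting it into columns works against tidy format. The data was already tidy (one row per hour and carafe) when first loaded.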
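
To make this concrete, the sketch below takes two hours of the pivoted table (values copied from the output above) and uses `melt` to return to one row per hour and carafe. The names `wide` and `tidy` are illustrative.

```python
import pandas as pd

# Two hours of the pivoted table, with values copied from its output above.
wide = pd.DataFrame(
    {
        'x': [0.816164, 0.451018],
        'y': [0.189297, 0.521502],
        'z': [0.999264, 0.915990],
    },
    index=pd.Index([8, 9], name='hour'),
)

# melt restores one row per (hour, carafe) observation.
tidy = wide.reset_index().melt(
    id_vars='hour', var_name='coffee_carafe', value_name='coffee_amount'
)
print(tidy)

# In the long form, per-carafe summaries are a single groupby.
print(tidy.groupby('coffee_carafe')['coffee_amount'].mean())
```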

<h3>3</h3><h3>Cake Recipes</h3>

<h4>3.1</h4>
<p>Read the <code>cake_recipes.csv</code> data. This data set contains cake tastiness scores for combinations of different recipes, oven rack positions, and oven temperatures.</p>

In [11]:
df_cake = pd.read_csv(file_loc + 'cake_recipes.csv')
df_cake

Unnamed: 0,recipe:position,225,250,275,300
0,a:bottom,61.738655,53.912627,74.41473,98.786784
1,a:top,51.709751,52.009735,68.576858,50.22847
2,b:bottom,57.09532,61.904369,61.19698,99.248541
3,b:top,82.455004,95.224151,98.594881,58.169349
4,c:bottom,96.470207,52.001358,92.893227,65.473084
5,c:top,71.306308,82.795477,92.098049,53.960273
6,d:bottom,52.799753,58.670419,51.747686,56.18311
7,d:top,96.873178,76.101363,59.57162,50.971626


<h4>3.2</h4>
<p>Tidy the data as necessary.</p>

In [12]:
dft_cake = df_cake.melt(id_vars='recipe:position', var_name='temp', value_name='rating')
dft_cake['recipe'] = dft_cake['recipe:position'].str.extract(r'^(.*):')
dft_cake['position'] = dft_cake['recipe:position'].str.extract(r':(.*)$')
dft_cake.drop(columns='recipe:position', inplace=True)
dft_cake = dft_cake[['recipe', 'temp', 'position', 'rating']].sort_values(by=['recipe', 'temp', 'position'])

dft_cake

Unnamed: 0,recipe,temp,position,rating
0,a,225,bottom,61.738655
1,a,225,top,51.709751
8,a,250,bottom,53.912627
9,a,250,top,52.009735
16,a,275,bottom,74.41473
17,a,275,top,68.576858
24,a,300,bottom,98.786784
25,a,300,top,50.22847
2,b,225,bottom,57.09532
3,b,225,top,82.455004
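
An alternative to the two `str.extract` calls is a single `str.split` with `expand=True`, which returns both pieces at once. A minimal sketch on two rows and two temperature columns copied from the table above; `sample` and `tidy` are illustrative names, and the cast of `temp` to an integer is an optional extra, not something the exercise requires.

```python
import pandas as pd

# Two rows and two temperature columns copied from the cake data above.
sample = pd.DataFrame({
    'recipe:position': ['a:bottom', 'a:top'],
    '225': [61.738655, 51.709751],
    '250': [53.912627, 52.009735],
})

tidy = sample.melt(id_vars='recipe:position', var_name='temp', value_name='rating')

# Split on the colon; expand=True yields one column per piece.
tidy[['recipe', 'position']] = tidy['recipe:position'].str.split(':', expand=True)
tidy = tidy.drop(columns='recipe:position')

# Melted column labels are strings; cast them if numeric sorting matters.
tidy['temp'] = tidy['temp'].astype(int)
print(tidy)
```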


<h4>3.3</h4>
<p>Which recipe, on average, is the best?</p>

*recipe b*

In [13]:
# Mean rating per recipe, flagging the recipe with the highest mean
dfs_recipe = dft_cake.groupby(by='recipe').rating.mean().reset_index()
dfs_recipe['is_best'] = dfs_recipe.rating == dfs_recipe.rating.max()

display(dfs_recipe[dfs_recipe['is_best']])
display(dfs_recipe[~dfs_recipe['is_best']])

Unnamed: 0,recipe,rating,is_best
1,b,76.736074,True


Unnamed: 0,recipe,rating,is_best
0,a,63.922201,False
2,c,75.874748,False
3,d,62.864844,False


<h4>3.4</h4>
<p>Which oven temperature, on average, produces the best results?</p>

*275 degrees*

In [14]:
# Mean rating per oven temperature, flagging the temperature with the highest mean
dfs_temp = dft_cake.groupby(by='temp').rating.mean().reset_index()
dfs_temp['is_best'] = dfs_temp.rating == dfs_temp.rating.max()

display(dfs_temp[dfs_temp['is_best']])
display(dfs_temp[~dfs_temp['is_best']])

Unnamed: 0,temp,rating,is_best
2,275,74.886754,True


Unnamed: 0,temp,rating,is_best
0,225,71.306022,False
1,250,66.577437,False
3,300,66.627655,False


<h4>3.5</h4>
<p>Which combination of recipe, rack position, and temperature gives the best result?</p>

*recipe b, bottom rack, 300 degrees*

In [15]:
# Flag the single highest-rated (recipe, temp, position) row
dfs_all = dft_cake.copy()
dfs_all['is_best'] = dfs_all.rating == dfs_all.rating.max()

display(dfs_all[dfs_all['is_best']])
display(dfs_all[~dfs_all['is_best']])

Unnamed: 0,recipe,temp,position,rating,is_best
26,b,300,bottom,99.248541,True


Unnamed: 0,recipe,temp,position,rating,is_best
0,a,225,bottom,61.738655,False
1,a,225,top,51.709751,False
8,a,250,bottom,53.912627,False
9,a,250,top,52.009735,False
16,a,275,bottom,74.41473,False
17,a,275,top,68.576858,False
24,a,300,bottom,98.786784,False
25,a,300,top,50.22847,False
2,b,225,bottom,57.09532,False
3,b,225,top,82.455004,False
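
The three "which is best" questions can also be answered without a flag column by using `idxmax`, which returns the index label of the largest value. The sketch below rebuilds the cake table inline from the output shown earlier so that it runs on its own; `tidy`, `best_recipe`, `best_temp`, and `best_combo` are illustrative names.

```python
import pandas as pd

# Full cake table, copied from the cake_recipes.csv output shown earlier.
df_cake = pd.DataFrame({
    'recipe:position': ['a:bottom', 'a:top', 'b:bottom', 'b:top',
                        'c:bottom', 'c:top', 'd:bottom', 'd:top'],
    '225': [61.738655, 51.709751, 57.095320, 82.455004,
            96.470207, 71.306308, 52.799753, 96.873178],
    '250': [53.912627, 52.009735, 61.904369, 95.224151,
            52.001358, 82.795477, 58.670419, 76.101363],
    '275': [74.414730, 68.576858, 61.196980, 98.594881,
            92.893227, 92.098049, 51.747686, 59.571620],
    '300': [98.786784, 50.228470, 99.248541, 58.169349,
            65.473084, 53.960273, 56.183110, 50.971626],
})

tidy = df_cake.melt(id_vars='recipe:position', var_name='temp', value_name='rating')
tidy[['recipe', 'position']] = tidy['recipe:position'].str.split(':', expand=True)

# idxmax returns the index label of the largest value.
best_recipe = tidy.groupby('recipe')['rating'].mean().idxmax()
best_temp = tidy.groupby('temp')['rating'].mean().idxmax()
best_combo = tidy.loc[tidy['rating'].idxmax(), ['recipe', 'position', 'temp']]

print(best_recipe, best_temp, best_combo.tolist())
```

Note that `idxmax` reports only the first maximum, so the flag-column approach above is the safer choice if ties are possible.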
