In [36]:
import numpy as np
import pandas as pd

<h2>Tidy data</h2>

Tidy data frames follow the rules:
<ol>
    <li> Each variable is a column.</li>
    <li> Each observation is a row.</li>
    <li> Each type of observation has its own separate data frame.</li>
</ol>
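
As a minimal sketch of these rules, the example below reshapes a small wide table into tidy form. The data frame <code>wide</code> and its column names are made up for illustration and are not part of the sleep data set; <code>melt()</code> is one way, among others, to do the reshaping.

```python
import pandas as pd

# Hypothetical wide (untidy) table: one column per test, one row per participant.
wide = pd.DataFrame({
    "participant": [1, 2],
    "test_a": [72.5, 90.0],
    "test_b": [80.0, 85.0],
})

# Melt so that each row is a single observation: (participant, test, score).
tidy = wide.melt(id_vars="participant", var_name="test", value_name="score")
print(tidy)
```

In the tidy version, <code>test</code> and <code>score</code> are variables with their own columns, and each row holds exactly one measurement.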

<h2>The data set</h2>

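Let us load our original data set and add an <code>insomnia</code> column, which is <code>True</code> for participants whose <code>sci</code> score is 16 or below: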

In [37]:
df = pd.read_csv('../data/gfmt_sleep.csv', na_values='*')
df['insomnia'] = df['sci'] <= 16

df.head()

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess,insomnia
0,8,f,39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2,True
1,16,m,42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7,True
2,18,f,31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3,True
3,22,f,35,100,75,87.5,89.5,,71.0,80.0,88.0,80.0,13,8,20,True
4,27,f,74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12,True


<h2>Split-apply-combine</h2>
Suppose we want to compute the median <b>percent correct</b> for subjects <b>with insomnia</b> and the median <b>percent correct</b> for subjects <b>without insomnia</b>. What do we need to do?
<ol>
    <li> <b>Split</b> the data set according to the <code>'insomnia'</code> field. This means that we need to split it up so that we have two data sets: one for those with insomnia and one for those without.</li>
    <li> <b>Apply</b> a median function to the <code>'percent correct'</code> values in each of these split data sets.</li>
    <li> <b>Combine</b> the results of these medians on the split data set into a new, summary data set that contains the two classes (insomniac and not insomniac) and medians for each.</li>
</ol>
This is called the <b>split-apply-combine</b> strategy. As a general technique, it was put forward by Hadley Wickham in a paper called <a href="https://www.jstatsoft.org/article/view/v040i01">The Split-Apply-Combine Strategy for Data Analysis</a>.
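
To make the three steps concrete, here is a minimal sketch that carries them out by hand with boolean indexing. The data frame <code>toy</code> holds made-up values that stand in for the real data set, and the variable names are chosen only for this illustration.

```python
import pandas as pd

# Made-up toy data standing in for the real data set.
toy = pd.DataFrame({
    "insomnia": [True, False, True, False, False],
    "percent correct": [70.0, 85.0, 80.0, 90.0, 75.0],
})

# Split: one data frame per value of 'insomnia'.
with_insomnia = toy.loc[toy["insomnia"]]
without_insomnia = toy.loc[~toy["insomnia"]]

# Apply: compute the median in each piece.
median_with = with_insomnia["percent correct"].median()
median_without = without_insomnia["percent correct"].median()

# Combine: assemble the results into a summary data frame.
summary = pd.DataFrame({
    "insomnia": [False, True],
    "median percent correct": [median_without, median_with],
})
print(summary)
```

Doing this by hand quickly becomes tedious as the number of groups grows, which is the motivation for a dedicated grouping tool.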

In pandas, split-apply-combine operations are carried out with the <code>groupby()</code> method. Calling it performs the splitting step; we can then apply the functions we need to the resulting <code>DataFrameGroupBy</code> object.

In [38]:
grouped = df.groupby('insomnia')

grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x13448d0c0>

As we can see, there is nothing informative to display here. That is because a <code>DataFrameGroupBy</code> object is an intermediate state that exists only so that we can <b>apply</b> a function to it.

In [39]:
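# This is what we wanted to do after all.
# numeric_only=True skips the non-numeric 'gender' column; recent pandas
# versions raise an error for it instead of silently dropping it.
df_median = grouped.median(numeric_only=True)

# Take a peek!
df_median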

insomnia,participant number,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
False,54.0,36.0,90.0,80.0,85.0,74.5,55.5,71.5,59.0,75.0,59.25,26.0,4.0,6.0
True,46.0,39.0,90.0,75.0,75.0,76.5,72.0,71.0,68.5,77.0,65.0,14.0,9.0,7.0


Note that the output gives the median of every numeric column, split into two groups by their <code>insomnia</code> value. However, <code>insomnia</code> is now the name of the row index rather than a column. To turn it back into an ordinary column, we use the <code>reset_index()</code> method.

In [40]:
df_median.reset_index()

Unnamed: 0,insomnia,participant number,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
0,False,54.0,36.0,90.0,80.0,85.0,74.5,55.5,71.5,59.0,75.0,59.25,26.0,4.0,6.0
1,True,46.0,39.0,90.0,75.0,75.0,76.5,72.0,71.0,68.5,77.0,65.0,14.0,9.0,7.0
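
As an alternative sketch, <code>groupby()</code> also accepts <code>as_index=False</code>, which keeps the grouping key as an ordinary column from the start. The data frame <code>toy</code> below is made up for illustration.

```python
import pandas as pd

# Made-up toy data standing in for the real data set.
toy = pd.DataFrame({
    "insomnia": [True, False, True, False],
    "percent correct": [70.0, 85.0, 80.0, 90.0],
})

# as_index=False keeps 'insomnia' as a column instead of making it the row index.
summary = toy.groupby("insomnia", as_index=False).median()
print(summary)
```

Either approach gives a data frame with a default integer index; which one to use is mostly a matter of taste.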


Often we want to group by a combination of characteristics. For example, to look at male insomniacs, female insomniacs, male non-insomniacs, and female non-insomniacs, we pass a list containing <code>'gender'</code> and <code>'insomnia'</code> to <code>df.groupby()</code>:

In [41]:
grouped_male_female = df.groupby(['gender', 'insomnia'])

df_median_male_female = grouped_male_female.median()

df_median_male_female.reset_index()

Unnamed: 0,gender,insomnia,participant number,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
0,f,False,58.0,36.0,85.0,80.0,85.0,74.0,55.0,70.5,60.0,74.0,58.75,26.0,4.0,7.0
1,f,True,46.0,39.0,80.0,75.0,72.5,76.5,73.75,71.0,68.5,77.0,70.5,14.0,9.0,7.0
2,m,False,41.0,38.5,90.0,80.0,82.5,76.0,57.75,74.25,54.75,76.25,59.25,29.0,3.0,6.0
3,m,True,55.5,37.0,95.0,82.5,83.75,83.75,55.5,75.75,73.25,81.25,62.5,14.0,9.0,8.0


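This process is called <b>aggregation</b>: splitting the data set into groups and then computing a summary statistic for each group. We can also compute a value for every row within its group, which is a <b>transformation</b> rather than an aggregation. Ranking is one example: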

In [42]:
# Create a column called 'rank grouped by insomnia' whose values are the rank
# of the 'percent correct' value within each insomnia group
df['rank grouped by insomnia'] = grouped['percent correct'].rank(method='first')
df.head()

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess,insomnia,rank grouped by insomnia
0,8,f,39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2,True,11.0
1,16,m,42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7,True,21.0
2,18,f,31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3,True,23.0
3,22,f,35,100,75,87.5,89.5,,71.0,80.0,88.0,80.0,13,8,20,True,19.0
4,27,f,74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12,True,3.0


To see what exactly <code>.rank()</code> does and to demonstrate nice sorting properties of <code>DataFrames</code>, let us sort our data frame by <code>insomnia</code> and then by <code>percent correct</code>:

In [48]:
df_sorted = df.sort_values(by=['insomnia', 'percent correct'])

pd.set_option('display.max_rows', 102)
df_sorted[['insomnia', 'percent correct', 'rank grouped by insomnia']]

Unnamed: 0,insomnia,percent correct,rank grouped by insomnia
81,False,40.0,1.0
94,False,55.0,2.0
39,False,57.5,3.0
76,False,60.0,4.0
96,False,60.0,5.0
86,False,62.5,6.0
101,False,62.5,7.0
41,False,65.0,8.0
28,False,67.5,9.0
50,False,67.5,10.0
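
Notice that tied values, such as the two rows with 60.0, receive consecutive ranks. That is the effect of <code>method='first'</code>. The sketch below contrasts it with the default <code>method='average'</code> on a made-up data frame <code>toy</code>; the column names <code>'rank first'</code> and <code>'rank average'</code> are chosen only for this illustration.

```python
import pandas as pd

# Made-up toy data with a tie (60.0 appears twice) in the False group.
toy = pd.DataFrame({
    "insomnia": [False, False, False, True, True],
    "percent correct": [60.0, 40.0, 60.0, 75.0, 70.0],
})

grouped_toy = toy.groupby("insomnia")["percent correct"]

# method='first': ties are broken by order of appearance, so ranks are unique.
toy["rank first"] = grouped_toy.rank(method="first")

# Default method='average': tied values share the mean of their ranks.
toy["rank average"] = grouped_toy.rank()
print(toy)
```

With <code>method='first'</code> the two tied rows get ranks 2 and 3; with the default they both get 2.5.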


<h2>Aggregating and transforming with custom functions</h2>

Let's say we want to compute a function that is not built into pandas, for example the <b>coefficient of variation</b> (CoV), which is the standard deviation divided by the mean. We have to write our own function for it:

In [49]:
def coeff_of_var(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)

Now we can apply it as an aggregating function on our grouped data.

In [53]:
grouped.agg(coeff_of_var)


insomnia,participant number,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
False,0.586191,0.384262,0.166784,0.184061,0.138785,0.195978,0.350286,0.204312,0.298216,0.187304,0.262509,0.175245,0.577869,0.571566
True,0.536117,0.313853,0.218834,0.32576,0.171856,0.156219,0.22544,0.222827,0.211512,0.160061,0.197484,0.381907,0.299741,0.681514
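
The same custom function can also be passed to <code>transform()</code>. As a minimal sketch on a made-up data frame <code>toy</code>, the example below contrasts the two: <code>agg()</code> returns one value per group, while <code>transform()</code> returns one value per row of the original data. The column name <code>'group CoV'</code> is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd


def coeff_of_var(data):
    """Compute coefficient of variation from an array of data."""
    return np.std(data) / np.mean(data)


# Made-up toy data standing in for the real data set.
toy = pd.DataFrame({
    "insomnia": [True, True, False, False],
    "percent correct": [60.0, 80.0, 90.0, 90.0],
})
grouped_toy = toy.groupby("insomnia")["percent correct"]

# agg: one value per group, indexed by the group key.
cov_per_group = grouped_toy.agg(coeff_of_var)
print(cov_per_group)

# transform: the group's value is broadcast back to every row of that group,
# so the result lines up with the original data frame.
toy["group CoV"] = grouped_toy.transform(coeff_of_var)
print(toy)
```

Because the transformed result has the same length as the original data frame, it can be stored directly as a new column.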


<h2>Looping over a GroupBy object</h2>

While the <code>GroupBy</code> methods we have seen so far (like <code>transform()</code> and <code>agg()</code>) are useful and lead to concise code, sometimes we want to loop over the groups of a <code>GroupBy</code> object. To find out how this works, let us inspect what iterating over one yields, which is my usual way of exploring a new Python object:

In [57]:
for name, group in grouped_male_female:
    print(name, type(group), type(grouped_male_female))

('f', False) <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
('f', True) <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
('m', False) <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
('m', True) <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.groupby.generic.DataFrameGroupBy'>


We see that each group is just a <code>DataFrame</code>, so we can apply the full force of our <code>DataFrame</code> apparatus to it:

In [59]:
for name, group in grouped_male_female:
    print(f'{name} : {group["percent correct"].median()}')

('f', False) : 85.0
('f', True) : 72.5
('m', False) : 82.5
('m', True) : 83.75
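
Looping is handy when we want to collect several quantities per group in a shape of our own choosing. The sketch below, on a made-up data frame <code>toy</code>, gathers the group size and median into a list of records and builds a summary data frame from it; the names <code>rows</code>, <code>summary</code>, and <code>'n'</code> are assumptions made for this illustration.

```python
import pandas as pd

# Made-up toy data standing in for the real data set.
toy = pd.DataFrame({
    "gender": ["f", "f", "m", "m", "f"],
    "insomnia": [True, False, True, False, True],
    "percent correct": [70.0, 85.0, 80.0, 90.0, 75.0],
})

rows = []
# With two grouping keys, each group name is a (gender, insomnia) tuple.
for (gender, insomnia), group in toy.groupby(["gender", "insomnia"]):
    rows.append({
        "gender": gender,
        "insomnia": insomnia,
        "n": len(group),
        "median percent correct": group["percent correct"].median(),
    })

summary = pd.DataFrame(rows)
print(summary)
```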
