In [1]:
import numpy as np
import pandas as pd

Data comes in all sizes, shapes, and formats, and we will spend much of our time <b>wrangling</b> it into a format we can use. Wrangling is a topic for another day; for now, all data sets will already be tidy.

<h2>The data set</h2>

The data file is a <b>CSV file</b> (comma-separated values): a plain-text file that is easily read into data structures in many programming languages. When storing data, we should prefer a format that is open, has a well-defined specification, and is readable in many contexts.
<ul>
    <li><b>Good formats: </b> JSON, CSV</li>
    <li><b>Bad formats: </b> Excel, <code>.mat</code></li>
</ul>

In [4]:
!head ../data/gfmt_sleep.csv #taking a peek!

﻿participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
8,f,39,65,80,72.5,91,90,93,83.5,93,90,9,13,2
16,m,42,90,90,90,75.5,55.5,70.5,50,75,50,4,11,7
18,f,31,90,95,92.5,89.5,90,86,81,89,88,10,9,3
22,f,35,100,75,87.5,89.5,*,71,80,88,80,13,8,20
27,f,74,60,65,62.5,68.5,49,61,49,65,49,13,9,12
28,f,61,80,20,50,71,63,31,72.5,64.5,70.5,15,14,2
30,m,32,90,75,82.5,67,56.5,66,65,66,64,16,9,3
33,m,62,45,90,67.5,54,37,65,81.5,62,61,14,9,9
34,f,33,80,100,90,70.5,76.5,64.5,*,68,76.5,14,12,10


We could write our own parser for this file, but because it is in CSV format we can use pre-built tools instead. We will use <b>Pandas</b>, a <b>powerful</b> tool for handling data.
<h2>Pandas</h2>
The primary object in Pandas is the <code>DataFrame</code>. We will use Pandas to read the data and store it in a <code>DataFrame</code> instance.

In [5]:
df = pd.read_csv('../data/gfmt_sleep.csv', na_values='*')
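
In the call above, <code>na_values='*'</code> tells <code>read_csv</code> to treat the <code>*</code> entries seen in the raw file as missing values. A minimal sketch of the effect, using a made-up two-row CSV held in memory (the column names and values here are hypothetical, not taken from the data set):

```python
import io
import pandas as pd

# Hypothetical CSV text that uses '*' to mark a missing measurement
csv_text = "participant,confidence\n1,90\n2,*\n"

# Without na_values, the '*' keeps the column from being parsed as numbers
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values='*', the '*' becomes NaN and the column is numeric
clean = pd.read_csv(io.StringIO(csv_text), na_values="*")

print(pd.api.types.is_numeric_dtype(raw["confidence"]))
print(pd.api.types.is_numeric_dtype(clean["confidence"]))
print(clean["confidence"].isna().sum())
```

This is why the <code>*</code> entries show up as empty cells in the <code>DataFrame</code> output.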

In [9]:
# Look at the contents of the first 5 rows:
df.head()

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
0,8,f,39,65,80,72.5,91.0,90.0,93.0,83.5,93.0,90.0,9,13,2
1,16,m,42,90,90,90.0,75.5,55.5,70.5,50.0,75.0,50.0,4,11,7
2,18,f,31,90,95,92.5,89.5,90.0,86.0,81.0,89.0,88.0,10,9,3
3,22,f,35,100,75,87.5,89.5,,71.0,80.0,88.0,80.0,13,8,20
4,27,f,74,60,65,62.5,68.5,49.0,61.0,49.0,65.0,49.0,13,9,12


A <code>DataFrame</code> is indexed by column, with the column names (here, <code>str</code> values) as the keys.

In [13]:
df['percent correct'].head()

0    72.5
1    90.0
2    92.5
3    87.5
4    62.5
Name: percent correct, dtype: float64
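
As a small illustration of column indexing, the sketch below uses a made-up two-row <code>DataFrame</code> (the name <code>toy</code> and its values are hypothetical). Indexing with a single column name gives a <code>Series</code>, while indexing with a list of names gives a smaller <code>DataFrame</code>:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"age": [39, 42], "percent correct": [72.5, 90.0]})

# A single key selects one column as a Series
one_column = toy["percent correct"]

# A list of keys selects several columns as a DataFrame
two_columns = toy[["age", "percent correct"]]

print(type(one_column).__name__)
print(type(two_columns).__name__)
```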

To refer to a particular element of the column, we can index it again with the row label, which here is an integer.

In [14]:
df['percent correct'][4]

62.5

However, this is <b>not</b> the preferred way to do this. It is better to use the <code>loc</code> indexer, which selects the row and the column in a single step:

In [15]:
df.loc[4, 'percent correct']

62.5
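
One reason to build the habit: <code>loc</code> selects by <i>label</i>, not by position. A minimal sketch with a made-up <code>DataFrame</code> whose index holds strings (the labels <code>p8</code>, <code>p16</code>, <code>p18</code> are hypothetical) shows the difference between label-based <code>loc</code> and position-based <code>iloc</code>:

```python
import pandas as pd

# Toy data (invented), indexed by string labels instead of 0, 1, 2, ...
toy = pd.DataFrame(
    {"percent correct": [72.5, 90.0, 92.5]},
    index=["p8", "p16", "p18"],
)

# Label-based selection: row labeled 'p16', column 'percent correct'
by_label = toy.loc["p16", "percent correct"]

# Position-based selection: second row, first column
by_position = toy.iloc[1, 0]

# An integer is not a valid label in this index, so loc rejects it
try:
    toy.loc[1, "percent correct"]
    integer_label_found = True
except KeyError:
    integer_label_found = False

print(by_label, by_position, integer_label_found)
```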

<b>Row indices need not be integers</b>, so we should not count on them being integers. In practice, we will almost always use <b>boolean indexing</b>.

In [16]:
# Get the value(s) in the 'percent correct' column where 'participant number' is 42
df.loc[df['participant number'] == 42, 'percent correct']

54    85.0
Name: percent correct, dtype: float64

In [19]:
# Get the entire row for the participant with number 42
df.loc[df['participant number'] == 42, :]

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
54,42,m,29,100,70,85.0,75.0,,64.5,43.0,74.0,43.0,32,1,6


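We can combine conditions in boolean indexing with the bitwise operators: <code>&</code> for AND, <code>|</code> for OR, and <code>~</code> for NOT. Each condition must be wrapped in parentheses, because these operators bind more tightly than comparisons such as <code>==</code>.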

In [21]:
# Get entries for women under 21
df.loc[(df['age'] < 21) & (df['gender'] == 'f'), :]

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
27,3,f,16,70,80,75.0,70.0,57.0,54.0,53.0,57.0,54.5,23,1,3
29,5,f,18,90,100,95.0,76.5,83.0,80.0,,80.0,83.0,21,7,5
66,58,f,16,85,85,85.0,55.0,30.0,50.0,40.0,52.5,35.0,29,2,11
79,72,f,18,80,75,77.5,67.5,51.5,66.0,57.0,67.0,53.0,29,4,6
88,85,f,18,85,85,85.0,93.0,92.0,91.0,89.0,91.5,91.0,25,4,21
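
The <code>|</code> and <code>~</code> operators work the same way as <code>&</code>. A short sketch on a made-up four-row <code>DataFrame</code> (the values are invented for illustration), which also shows that a mask can be stored in a variable to avoid the parentheses:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"age": [16, 42, 18, 35], "gender": ["f", "m", "m", "f"]})

# Store each condition as a Boolean Series
young = toy["age"] < 21
women = toy["gender"] == "f"

# OR: rows where at least one condition holds
either = toy.loc[young | women, :]

# NOT: rows where the condition does not hold
not_young = toy.loc[~young, :]

# AND written inline: each comparison needs its own parentheses
both = toy.loc[(toy["age"] < 21) & (toy["gender"] == "f"), :]

print(list(either.index), list(not_young.index), list(both.index))
```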


In [24]:
# Create three Boolean arrays. Each entry is True if the comparison on the
# right-hand side holds for the corresponding row, and False otherwise.

under30 = df['age'] < 30
women = df['gender'] == 'f'
good_performers = df['percent correct'] > 85

In [25]:
# Get the entries for women under 30 who performed well on the face matching test
df.loc[under30 & women & good_performers, :]

Unnamed: 0,participant number,gender,age,correct hit percentage,correct reject percentage,percent correct,confidence when correct hit,confidence incorrect hit,confidence correct reject,confidence incorrect reject,confidence when correct,confidence when incorrect,sci,psqi,ess
22,93,f,28,100,75,87.5,89.5,,67.0,60.0,80.0,60.0,16,7,4
29,5,f,18,90,100,95.0,76.5,83.0,80.0,,80.0,83.0,21,7,5
30,6,f,28,95,80,87.5,100.0,85.0,94.0,61.0,99.0,65.0,19,7,12
33,10,f,25,100,100,100.0,90.0,,85.0,,90.0,,17,10,11
56,44,f,21,85,90,87.5,66.0,29.0,70.0,29.0,67.0,29.0,26,7,18
58,48,f,23,90,85,87.5,67.0,47.0,69.0,40.0,67.0,40.0,18,6,8
60,51,f,24,85,95,90.0,97.0,41.0,74.0,73.0,83.0,55.5,29,1,7
75,67,f,25,100,100,100.0,61.5,,58.5,,60.5,,28,8,9


Without NumPy and pandas, a selection like this would require an explicit loop. The libraries let us express it as simple vectorized operations instead. For comparison, the loop-based equivalent is:

In [27]:
# Initialize the list of Boolean indices
indices = [False] * len(df)

# Iterate over the rows of the DataFrame to check if each row should be included.
# enumerate gives the row position, which is safe even if the index labels are not 0, 1, 2, ...
for pos, (_, r) in enumerate(df.iterrows()):
    if r['age'] < 30 and r['gender'] == 'f' and r['percent correct'] > 85:
        indices[pos] = True

# Make our selection with Boolean indexing
df.loc[indices, :]
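
To convince ourselves that the loop and the vectorized expression select the same rows, we can compare the two masks directly. The sketch below does this on a made-up four-row <code>DataFrame</code> (the name <code>toy</code> and its values are hypothetical) rather than on the real data set:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [28, 42, 18, 25],
    "gender": ["f", "m", "f", "f"],
    "percent correct": [87.5, 90.0, 95.0, 80.0],
})

# Vectorized mask: one expression, no explicit loop
vectorized = (
    (toy["age"] < 30)
    & (toy["gender"] == "f")
    & (toy["percent correct"] > 85)
)

# Loop-based mask: the same test applied row by row
looped = [
    bool(r["age"] < 30 and r["gender"] == "f" and r["percent correct"] > 85)
    for _, r in toy.iterrows()
]

# True when both approaches agree on every row
same = vectorized.tolist() == looped
print(same)
```

The vectorized form is shorter and is usually faster on large data, since the row-by-row work is done inside the library rather than in a Python loop.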