# Data exploration

We use the dataset found at https://github.com/mkcor/data-wrangling/blob/master/data/tidy_who.csv
(see the notebook at the root of that repo for the generation of this dataset).

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('../datasets/tidy_who.csv')

## Loading data

Loading the CSV file just works. `read_csv()` accepts many convenient arguments, such as `skiprows`, `nrows`, and `na_values`. Alternatively, we could have read the file straight from its URL: `df = pd.read_csv('https://raw.githubusercontent.com/mkcor/data-wrangling/master/data/tidy_who.csv')`.
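
To make the arguments mentioned above concrete, here is a minimal sketch on a hypothetical in-memory CSV (the text and values are made up for illustration; only the `read_csv()` parameters come from pandas).

```python
import io
import pandas as pd

# Hypothetical miniature CSV, loosely shaped like tidy_who.csv (values made up).
raw = io.StringIO(
    "exported for illustration\n"
    "country,year,cases\n"
    "Afghanistan,1980,missing\n"
    "Afghanistan,1981,12\n"
    "Brazil,1980,34\n"
    "Brazil,1981,56\n"
)

toy = pd.read_csv(
    raw,
    skiprows=1,             # skip the free-text first line
    nrows=3,                # read only the first three data rows
    na_values=["missing"],  # treat this marker as NaN
)
print(toy)
```

Reading only a few rows with `nrows` is a cheap way to inspect a large file before loading it in full.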

In [4]:
df.head()

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
0,Afghanistan,EMR,1980,,sp,m,14
1,Afghanistan,EMR,1981,,sp,m,14
2,Afghanistan,EMR,1982,,sp,m,14
3,Afghanistan,EMR,1983,,sp,m,14
4,Afghanistan,EMR,1984,,sp,m,14


In [5]:
df.shape

(429744, 7)

In [6]:
df.sample(10)

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
425982,Libya,EMR,1992,,rel,f,65
45457,Uganda,AFR,2005,725.0,sp,m,5564
276072,Viet Nam,WPR,2004,,ep,f,14
140011,Curaçao,AMR,2011,0.0,sn,m,4554
198311,Suriname,AMR,2005,,sn,f,4554
302310,Guyana,AMR,2004,,ep,f,4554
429727,Zimbabwe,AFR,1999,,rel,f,65
347736,Ethiopia,AFR,1998,,rel,m,3544
176585,Algeria,AFR,1991,,sn,f,2534
293802,Ecuador,AMR,1998,,ep,f,3544


In [7]:
df.describe()

Unnamed: 0,year,cases,age_range
count,429744.0,81381.0,429744.0
mean,1997.57154,667.482496,2542.714286
std,10.407887,4490.566875,1990.957917
min,1980.0,0.0,14.0
25%,1989.0,3.0,65.0
50%,1998.0,28.0,2534.0
75%,2007.0,200.0,4554.0
max,2015.0,250051.0,5564.0


In [8]:
df['g_whoregion'].unique()

array(['EMR', 'EUR', 'AFR', 'WPR', 'AMR', 'SEA'], dtype=object)

In [9]:
df['country'].nunique()

219
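
The `count` row of `describe()` shows far fewer `cases` values (81381) than rows (429744), so most case counts are missing. A quick way to quantify this, sketched here on a hypothetical toy frame with made-up values, is to combine `isna()` with `sum()`; `value_counts()` similarly tallies the rows per category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values made up), mimicking a few columns of `df`.
toy = pd.DataFrame({
    "country": ["Greece", "Greece", "Italy", "Italy", "Brazil"],
    "g_whoregion": ["EUR", "EUR", "EUR", "EUR", "AMR"],
    "cases": [1.0, np.nan, np.nan, 249.0, np.nan],
})

n_missing = toy["cases"].isna().sum()                # number of NaN entries
rows_per_region = toy["g_whoregion"].value_counts()  # number of rows per region

print(n_missing)
print(rows_per_region)
```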

## Selecting data

We are already familiar with column selection. The `[ ]` syntax is the most basic way of indexing.

In [10]:
df['country'].head(3)

0    Afghanistan
1    Afghanistan
2    Afghanistan
Name: country, dtype: object

Columns can also be accessed as attributes, as long as their name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.

In [11]:
df.country[1000:1003]

1000    Brazil
1001    Brazil
1002    Brazil
Name: country, dtype: object
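
The limits of attribute access are easiest to see on an example. The sketch below uses a hypothetical toy frame whose column names were chosen on purpose to trigger the two pitfalls: a name with a space, and a name that collides with the `DataFrame.count()` method.

```python
import pandas as pd

# Hypothetical toy frame; the column names are chosen to show the pitfalls.
toy = pd.DataFrame({
    "country": ["Greece", "Italy"],
    "age range": [14, 1524],  # contains a space: not a valid identifier
    "count": [1, 249],        # clashes with the DataFrame.count() method
})

print(toy.country)           # attribute access works here
print(toy["age range"])      # only the bracket syntax can reach this column
print(callable(toy.count))   # toy.count is the method, not the column
print(toy["count"])          # the bracket syntax reaches the column
```

When in doubt, the `[ ]` syntax always works.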

We can select elements of a DataFrame either by label (with the `.loc` attribute) or by position (with the `.iloc` attribute). Row and column indices take the usual order (first and second place, respectively).

In [12]:
df.loc[0, 'country']

'Afghanistan'

In [13]:
df.loc[df.shape[0] - 1, 'country']

'Zimbabwe'

In [14]:
df.iloc[0, 0]

'Afghanistan'

In [15]:
df.iloc[df.shape[0] - 1, 0]

'Zimbabwe'
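
In `df`, the row labels happen to coincide with the row positions because the index is the default one. The difference between `.loc` and `.iloc` only becomes visible when the labels are something else, as in this sketch on a hypothetical toy frame (labels and values made up).

```python
import pandas as pd

# Hypothetical toy frame whose row labels are not 0, 1, 2, ...
toy = pd.DataFrame(
    {"country": ["Greece", "Italy", "Brazil"], "year": [2000, 2001, 2002]},
    index=[2768, 10442, 56486],
)

by_label = toy.loc[10442, "country"]  # row labelled 10442, column 'country'
by_position = toy.iloc[1, 0]          # second row, first column

print(by_label, by_position)
```

Here `toy.loc[1, 'country']` would raise a `KeyError`, since no row carries the label `1`.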

Slicing works too.

In [16]:
df.loc[:5, 'country']

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
5    Afghanistan
Name: country, dtype: object

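Label indexing is often more natural than positional indexing (think of a function call, where keyword arguments are easier to work with than positional arguments). Note that `.loc` slices include the end label, whereas `.iloc` slices exclude the end position, which is why the first selection below returns six rows and the second returns five.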

In [17]:
df.loc[:5, 'country':'type']

Unnamed: 0,country,g_whoregion,year,cases,type
0,Afghanistan,EMR,1980,,sp
1,Afghanistan,EMR,1981,,sp
2,Afghanistan,EMR,1982,,sp
3,Afghanistan,EMR,1983,,sp
4,Afghanistan,EMR,1984,,sp
5,Afghanistan,EMR,1985,,sp


In [18]:
df.iloc[:5, :5]

Unnamed: 0,country,g_whoregion,year,cases,type
0,Afghanistan,EMR,1980,,sp
1,Afghanistan,EMR,1981,,sp
2,Afghanistan,EMR,1982,,sp
3,Afghanistan,EMR,1983,,sp
4,Afghanistan,EMR,1984,,sp
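
The two outputs above differ by one row because label slices and positional slices treat their end point differently. A minimal sketch on a hypothetical toy frame (values made up) shows the same effect.

```python
import pandas as pd

# Hypothetical toy frame with a default index (labels 0 to 7).
toy = pd.DataFrame({
    "country": list("abcdefgh"),
    "year": list(range(1980, 1988)),
})

label_slice = toy.loc[:5, "country"]  # labels 0 to 5, end label included
position_slice = toy.iloc[:5, 0]      # positions 0 to 4, end position excluded

print(len(label_slice), len(position_slice))
```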


Often we want to select data based on certain conditions.

In [19]:
cond = df.year < 1981

In [20]:
df[cond].shape

(11872, 7)

In [21]:
df[cond & (df.country == 'Argentina') & (df.type == 'rel') & (df.sex == 'm')]

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
322596,Argentina,AMR,1980,,rel,m,14
330270,Argentina,AMR,1980,,rel,m,1524
337944,Argentina,AMR,1980,,rel,m,2534
345618,Argentina,AMR,1980,,rel,m,3544
353292,Argentina,AMR,1980,,rel,m,4554
360966,Argentina,AMR,1980,,rel,m,5564
368640,Argentina,AMR,1980,,rel,m,65


In [22]:
gr_and_it = df.country.isin(['Greece', 'Italy'])

In [23]:
df[gr_and_it].tail()

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
425497,Italy,EUR,2011,,rel,f,65
425498,Italy,EUR,2012,,rel,f,65
425499,Italy,EUR,2013,249.0,rel,f,65
425500,Italy,EUR,2014,,rel,f,65
425501,Italy,EUR,2015,280.0,rel,f,65
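
Boolean conditions are combined element-wise with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses, because these operators bind more tightly than `==` or `<`. The sketch below uses a hypothetical toy frame with made-up values.

```python
import pandas as pd

# Hypothetical toy frame (values made up).
toy = pd.DataFrame({
    "country": ["Greece", "Italy", "Brazil", "Italy"],
    "year": [1980, 1980, 1981, 2015],
})

early = toy.year < 1981
gr_or_it = toy.country.isin(["Greece", "Italy"])

both = toy[early & gr_or_it]        # early AND in the list
either = toy[early | gr_or_it]      # early OR in the list
neither = toy[~(early | gr_or_it)]  # NOT (early OR in the list)

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```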


Subsets can also be selected with callables, i.e., functions that take the DataFrame (or Series) as their argument and return a valid indexer.
The following function performs a selection by label (along `country` and `g_whoregion`).

In [22]:
lambda x: ['country', 'g_whoregion']

<function __main__.<lambda>>

So it can serve as a column indexer.

In [23]:
df.loc[:3, lambda x: ['country', 'g_whoregion']]

Unnamed: 0,country,g_whoregion
0,Afghanistan,EMR
1,Afghanistan,EMR
2,Afghanistan,EMR
3,Afghanistan,EMR


The following function filters for data where the number of cases is greater than 100,000.

In [24]:
lambda x: x.cases > 100000

<function __main__.<lambda>>

So it can serve as a row indexer.

In [25]:
great = df.loc[lambda x: x.cases > 100000, :]
great

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
133665,India,SEA,2007,250051.0,sn,m,3544
187383,India,SEA,2007,148811.0,sn,f,3544
241101,India,SEA,2007,105825.0,ep,m,3544
294819,India,SEA,2007,101015.0,ep,f,3544
333196,India,SEA,2014,180319.0,rel,m,1524
333197,India,SEA,2015,186771.0,rel,m,1524
340870,India,SEA,2014,190483.0,rel,m,2534
340871,India,SEA,2015,197298.0,rel,m,2534
348544,India,SEA,2014,199850.0,rel,m,3544
348545,India,SEA,2015,207000.0,rel,m,3544


In [26]:
df.cases.loc[lambda x: x > 100000]

133665    250051.0
187383    148811.0
241101    105825.0
294819    101015.0
333196    180319.0
333197    186771.0
340870    190483.0
340871    197298.0
348544    199850.0
348545    207000.0
354519    100297.0
354520    102352.0
354521    103685.0
356218    188106.0
356219    194837.0
362193    112558.0
362194    108902.0
362195    105403.0
363892    148933.0
363893    154262.0
369867    124476.0
369868    123436.0
369869    125699.0
371566    102929.0
371567    106612.0
386914    149671.0
386915    155026.0
394588    127605.0
394589    132171.0
Name: cases, dtype: float64
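
Callable indexers are most useful in method chains, where the intermediate DataFrame has no name to refer to. The following is a sketch on a hypothetical toy frame; the column `cases_k` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame (values made up).
toy = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "cases": [250000.0, 90000.0, 150000.0, 10.0],
})

big = (
    toy
    .assign(cases_k=lambda x: x.cases / 1000)  # derived column, in thousands
    # The lambda receives the frame returned by assign(), which already
    # contains cases_k even though it was never bound to a variable.
    .loc[lambda x: x.cases_k > 100, ["country", "cases_k"]]
)
print(big)
```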

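We may want to select or mask data while preserving the original shape. `where()` keeps the values where the condition is true and replaces the others with `NaN`; `mask()` does the opposite.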

In [27]:
great.where(great.country == 'India')

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
133665,India,SEA,2007.0,250051.0,sn,m,3544.0
187383,India,SEA,2007.0,148811.0,sn,f,3544.0
241101,India,SEA,2007.0,105825.0,ep,m,3544.0
294819,India,SEA,2007.0,101015.0,ep,f,3544.0
333196,India,SEA,2014.0,180319.0,rel,m,1524.0
333197,India,SEA,2015.0,186771.0,rel,m,1524.0
340870,India,SEA,2014.0,190483.0,rel,m,2534.0
340871,India,SEA,2015.0,197298.0,rel,m,2534.0
348544,India,SEA,2014.0,199850.0,rel,m,3544.0
348545,India,SEA,2015.0,207000.0,rel,m,3544.0


In [28]:
great.mask(great.country == 'India')

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
133665,,,,,,,
187383,,,,,,,
241101,,,,,,,
294819,,,,,,,
333196,,,,,,,
333197,,,,,,,
340870,,,,,,,
340871,,,,,,,
348544,,,,,,,
348545,,,,,,,
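
The behaviour of `where()` and `mask()` is easier to follow when the condition is true for some rows only. This sketch uses a hypothetical toy frame with made-up values; the `other` argument supplies a replacement value instead of `NaN`.

```python
import pandas as pd

# Hypothetical toy frame (values made up).
toy = pd.DataFrame({
    "country": ["India", "Other", "India"],
    "cases": [250000, 150000, 100000],
})
is_india = toy.country == "India"

kept = toy.where(is_india)    # row 1 is replaced with NaN
hidden = toy.mask(is_india)   # rows 0 and 2 are replaced with NaN
filled = toy.cases.where(is_india, other=0)  # replace with 0 instead of NaN

print(kept)
print(hidden)
print(filled)
```

Both results have the same shape as `toy`. Integer columns that receive `NaN` are converted to floats, which is why `year` is displayed as `2007.0` in the output above.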


### Hands-on exercises

1. Select the rows of `df` where country is Greece, age is at most 24, and year is 2000. Name it `df1`.
2. Write the `df1` DataFrame to a CSV file located in the `data/` subdirectory. (Hint: The method is named `to_csv`.)
3. Read this CSV file into a DataFrame named `df2`. What do you notice about the index? (Feel free to fire up a Terminal and look at the CSV file.) 

In [29]:
df1 = df[(df.country == 'Greece') & (df.year == 2000) & (df.age_range.isin([14, 1524]))]

In [30]:
df1.to_csv('../data/df1.csv')

In [31]:
df2 = pd.read_csv('../data/df1.csv')

## Indexing

In [32]:
df1.index

Int64Index([  2768,  10442,  56486,  64160, 110204, 117878, 163922, 171596,
            217640, 225314, 271358, 279032, 325076, 332750, 378794, 386468],
           dtype='int64')

In [33]:
df2.index

RangeIndex(start=0, stop=16, step=1)
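
The index changed because `to_csv()` writes the row labels as a first column without a header, and `read_csv()` then treats that column as ordinary data. The round trip can be sketched in memory on a hypothetical toy frame (values made up), without touching the file system.

```python
import io
import pandas as pd

# Hypothetical toy frame with non-default row labels (values made up).
toy = pd.DataFrame(
    {"country": ["Greece", "Greece"], "cases": [1.0, 10.0]},
    index=[2768, 10442],
)

csv_text = toy.to_csv()           # returns a string when no path is given
print(csv_text.splitlines()[0])   # the header line starts with an empty field

back = pd.read_csv(io.StringIO(csv_text))
print(back.columns.tolist())      # the old labels are now a regular column
print(back.index)
```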

We could specify that the first (unnamed) column should be used as the index (row labels).

In [34]:
pd.read_csv('../data/df1.csv', index_col=0)

Unnamed: 0,country,g_whoregion,year,cases,type,sex,age_range
2768,Greece,EUR,2000,1.0,sp,m,14
10442,Greece,EUR,2000,10.0,sp,m,1524
56486,Greece,EUR,2000,0.0,sp,f,14
64160,Greece,EUR,2000,2.0,sp,f,1524
110204,Greece,EUR,2000,,sn,m,14
117878,Greece,EUR,2000,,sn,m,1524
163922,Greece,EUR,2000,,sn,f,14
171596,Greece,EUR,2000,,sn,f,1524
217640,Greece,EUR,2000,,ep,m,14
225314,Greece,EUR,2000,,ep,m,1524


Remember we learnt `set_index()` in the previous section? We also have `reset_index()` at our disposal.

In [35]:
df1.reset_index()

Unnamed: 0,index,country,g_whoregion,year,cases,type,sex,age_range
0,2768,Greece,EUR,2000,1.0,sp,m,14
1,10442,Greece,EUR,2000,10.0,sp,m,1524
2,56486,Greece,EUR,2000,0.0,sp,f,14
3,64160,Greece,EUR,2000,2.0,sp,f,1524
4,110204,Greece,EUR,2000,,sn,m,14
5,117878,Greece,EUR,2000,,sn,m,1524
6,163922,Greece,EUR,2000,,sn,f,14
7,171596,Greece,EUR,2000,,sn,f,1524
8,217640,Greece,EUR,2000,,ep,m,14
9,225314,Greece,EUR,2000,,ep,m,1524


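And we are back to a default index. The original index is stored in its own column, named `index`. Note that `reset_index()` returns a new DataFrame; `df1` itself is unchanged unless we reassign the result.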

In [36]:
df1.reset_index().index

RangeIndex(start=0, stop=16, step=1)
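
If the original labels are not needed, `reset_index()` can discard them with `drop=True`. The sketch below, on a hypothetical toy frame with made-up values, compares both variants and checks that the source frame is left untouched.

```python
import pandas as pd

# Hypothetical toy frame with non-default row labels (values made up).
toy = pd.DataFrame(
    {"country": ["Greece", "Greece"], "cases": [1.0, 10.0]},
    index=[2768, 10442],
)

kept = toy.reset_index()               # old labels move to an 'index' column
dropped = toy.reset_index(drop=True)   # old labels are discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
print(toy.index.tolist())              # toy itself is unchanged
```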

### Hands-on exercises

4. Write the `df1` DataFrame with a default index to another CSV file.
5. Read this other CSV file into a DataFrame, setting its index to be the original index.
