# Pandas DataFrames

## Questions
How can I do statistical analysis of tabular data?

## Objectives
Select individual values from a Pandas dataframe.

Select entire rows or entire columns from a dataframe.

Select a subset of both rows and columns from a dataframe in a single operation.

Select a subset of a dataframe by a single Boolean criterion.

### Note about Pandas DataFrames / Series
A DataFrame is a collection of Series: the DataFrame is how Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays also apply to Pandas Series and DataFrames.

What makes Pandas so attractive is its powerful interface for accessing individual records of a table, its proper handling of missing values, and its relational-database operations between DataFrames.
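
As a minimal sketch of these points, the example below uses a small made-up table (the column names and values are hypothetical and are not taken from the Gapminder data). It checks that a column is a Series, applies a NumPy function to it, and shows that a Pandas reduction skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; names and values are invented for illustration.
toy = pd.DataFrame(
    {"gdp_1952": [1.0, 4.0, np.nan], "gdp_1957": [2.0, 8.0, 16.0]},
    index=["A", "B", "C"],
)

# Each column of a DataFrame is a Series.
column = toy["gdp_1957"]
is_series = isinstance(column, pd.Series)

# NumPy functions can be applied to a Series directly.
log_values = np.log2(column)

# Pandas reductions such as mean() skip missing values (NaN) by default.
mean_1952 = toy["gdp_1952"].mean()

print(is_series)
print(log_values)
print(mean_1952)
```

Here `mean_1952` is computed from the two non-missing entries only; passing `skipna=False` to `mean()` would return NaN instead.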

## Selecting values
To access the value at position `[i, j]` of a DataFrame, we have two options, depending on whether `i` and `j` are meant as integer positions or as labels.

### Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

### Use `DataFrame.loc[..., ...]` to select values by their (entry) label

### Use `:` on its own to mean all columns or all rows

In [2]:
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

In [3]:
data.head()

| country | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | 1601.056136 | 1942.284244 | 2312.888958 | 2760.196931 | 3313.422188 | 3533.00391 | 3630.880722 | 3738.932735 | 2497.437901 | 3193.054604 | 4604.211737 | 5937.029526 |
| Austria | 6137.076492 | 8842.59803 | 10750.72111 | 12834.6024 | 16661.6256 | 19749.4223 | 21597.08362 | 23687.82607 | 27042.01868 | 29095.92066 | 32417.60769 | 36126.4927 |
| Belgium | 8343.105127 | 9714.960623 | 10991.20676 | 13149.04119 | 16672.14356 | 19117.97448 | 20979.84589 | 22525.56308 | 25575.57069 | 27561.19663 | 30485.88375 | 33692.60508 |
| Bosnia and Herzegovina | 973.533195 | 1353.989176 | 1709.683679 | 2172.352423 | 2860.16975 | 3528.481305 | 4126.613157 | 4314.114757 | 2546.781445 | 4766.355904 | 6018.975239 | 7446.298803 |
| Bulgaria | 2444.286648 | 3008.670727 | 4254.337839 | 5577.0028 | 6597.494398 | 7612.240438 | 8224.191647 | 8239.854824 | 6302.623438 | 5970.38876 | 7696.777725 | 10680.79282 |


In [4]:
data.iloc[0,0]

1601.056136

In [6]:
data.loc['Albania', 'gdpPercap_1967']

2760.196931

In [8]:
data.loc['Albania',:]

gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64

In [9]:
data.loc[:,'gdpPercap_1967']

country
Albania                    2760.196931
Austria                   12834.602400
Belgium                   13149.041190
Bosnia and Herzegovina     2172.352423
Bulgaria                   5577.002800
Croatia                    6960.297861
Czech Republic            11399.444890
Denmark                   15937.211230
Finland                   10921.636260
France                    12999.917660
Germany                   14745.625610
Greece                     8513.097016
Hungary                    9326.644670
Iceland                   13319.895680
Ireland                    7655.568963
Italy                     10022.401310
Montenegro                 5907.850937
Netherlands               15363.251360
Norway                    16361.876470
Poland                     6557.152776
Portugal                   6361.517993
Romania                    6470.866545
Serbia                     7991.707066
Slovak Republic            8412.902397
Slovenia                   9405.489397
Spain            

### Select multiple columns or rows using `DataFrame.loc` and a named slice.

In [11]:
data.loc['Italy':'Romania', 'gdpPercap_1967':'gdpPercap_1982']

| country | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 |
|---|---|---|---|---|
| Italy | 10022.40131 | 12269.27378 | 14255.98475 | 16537.4835 |
| Montenegro | 5907.850937 | 7778.414017 | 9595.929905 | 11222.58762 |
| Netherlands | 15363.25136 | 18794.74567 | 21209.0592 | 21399.46046 |
| Norway | 16361.87647 | 18965.05551 | 23311.34939 | 26298.63531 |
| Poland | 6557.152776 | 8006.506993 | 9508.141454 | 8451.531004 |
| Portugal | 6361.517993 | 9022.247417 | 10172.48572 | 11753.84291 |
| Romania | 6470.866545 | 8011.414402 | 9356.39724 | 9605.314053 |
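
Note that the result above includes both endpoints of each slice (`'Romania'` and `'gdpPercap_1982'` are present). Label slices with `loc` are inclusive at both ends, whereas positional slices with `iloc` follow the usual Python rule and exclude the end position. A minimal sketch on a made-up table (the labels and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy table; labels and values are invented for illustration.
toy = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40], "z": [100, 200, 300, 400]},
    index=["a", "b", "c", "d"],
)

# Label slice: both 'b' and 'c' (and both 'x' and 'y') are included.
by_label = toy.loc["b":"c", "x":"y"]

# Positional slice: the end positions (3 and 2) are excluded.
by_position = toy.iloc[1:3, 0:2]

# The two selections pick out the same rows and columns.
same = by_label.equals(by_position)
print(by_label)
print(same)
```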


### Result of slicing can be used in further operations.

In [13]:
subset = data.loc['Italy':'Romania', 'gdpPercap_1967':'gdpPercap_1982'].min()

### Use comparisons to select data based on value.

In [15]:
print('subset of data', subset > 8000)

subset of data gdpPercap_1967    False
gdpPercap_1972    False
gdpPercap_1977     True
gdpPercap_1982     True
dtype: bool


### Select values or NaN using a Boolean mask

In [16]:
subset[subset > 8000]

gdpPercap_1977    9356.397240
gdpPercap_1982    8451.531004
dtype: float64
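
In the cell above `subset` is a Series, so the mask simply drops the entries where it is `False`. When a Boolean mask with the same shape is applied to a whole DataFrame, the shape is kept and the non-matching entries become NaN. A minimal sketch of the difference, using a made-up table (hypothetical column names and values):

```python
import pandas as pd

# Hypothetical toy table; names and values are invented for illustration.
toy = pd.DataFrame(
    {"gdp_1977": [9000.0, 7000.0], "gdp_1982": [8500.0, 12000.0]},
    index=["A", "B"],
)

# A DataFrame of True/False values with the same shape as toy.
mask = toy > 8000

# Masking a DataFrame keeps its shape; entries where the mask is False become NaN.
masked = toy[mask]

# Masking a single column (a Series) drops the non-matching entries instead.
series_filtered = toy["gdp_1977"][toy["gdp_1977"] > 8000]

print(masked)
print(series_filtered)
```

Because reductions such as `mean()` skip NaN by default, statistics computed on `masked` only use the values that passed the comparison.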

### Group By: split-apply-combine
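
As a minimal sketch of the split-apply-combine idea, the example below assumes a made-up table with a hypothetical `region` column (the Gapminder file used above has no such column). `groupby` splits the rows by group, `mean()` is applied to each group, and the per-group results are combined into a single Series indexed by group.

```python
import pandas as pd

# Hypothetical toy table; the 'region' grouping and all values are invented.
toy = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "gdp": [10.0, 30.0, 5.0, 15.0],
    },
    index=["A", "B", "C", "D"],
)

# Split rows by region, apply mean() to each group's gdp, combine the results.
mean_by_region = toy.groupby("region")["gdp"].mean()
print(mean_by_region)
```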

# Activities