# Pandas Tutorial - Part 2

Welcome to the second part of your pandas tutorial.

In this tutorial you will get to know:

- Dataframes and series
- Basic pandas functions for dataframes and series
- How to store datatypes (int, float, string, etc) in pandas

- Using the `loc` function to display, filter, and edit segments of a dataframe
- How to deal with nan-values

- Group-by functions
- Plotting in pandas

- Some numpy basics

Since this is a jupyter notebook, you can edit and execute code directly in the notebook. If you need an intro to jupyter notebooks, have a look at this <a href="https://www.youtube.com/watch?v=HW29067qVWk">video</a>.


In [1]:
# Libraries we need
import pandas as pd
import numpy as np

The `pandas` and `numpy` libraries are core tools for all kinds of data handling and analysis in Python. `pandas` allows easy and quick handling of data in so-called DataFrames ([pandas tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html)). 

## Series - Basic functions

A **series** is a one-dimensional array that can hold different types of data such as integers, floats, strings, or objects. Every entry in the series has an index (label).

The following is a simple example of a series. 

In [2]:
S = pd.Series([6, 8, 42, 26000])
print(S)

0        6
1        8
2       42
3    26000
dtype: int64


*Tasks*
1. Enter a float and a string value in the series and see what happens to the dtype property.
2. Try accessing the index using `S.index` (S being the name of the series in this case) as a command. With `S.values`, you can see the entries in the series as a NumPy array.
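As a minimal sketch of what these two tasks tend to show (the names `S_int`, `S_float`, and `S_mixed` are made up for this illustration), a float entry upcasts the series to `float64`, while a string entry makes pandas fall back to the generic `object` dtype:

```python
import pandas as pd

S_int = pd.Series([6, 8, 42, 26000])
S_float = pd.Series([6, 8.5, 42, 26000])       # one float entry
S_mixed = pd.Series([6, 8.5, 'tram', 26000])   # one string entry

print(S_int.dtype)    # an integer dtype (int64 on most platforms)
print(S_float.dtype)  # float64
print(S_mixed.dtype)  # object

# The index holds the labels, the values hold the entries
print(list(S_int.index))  # [0, 1, 2, 3]
print(S_int.values)       # the entries as a NumPy array
```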

But the index does not have to be numerical. You can also use strings as indices, as shown in the example below.

In [3]:
mode = ['bicycle', 'tram', 'train', 'car']
quantities = [320, 13, 59, 176]
S = pd.Series(quantities, index=mode)
print(S)

bicycle    320
tram        13
train       59
car        176
dtype: int64


*Tasks*
1. Try accessing a specific index using `S['tram']` for example.
2. Create another series with the same modes as in series S but different quantities. Then observe what happens when you add these two series, for example using `S + S2`.
3. To the new series S2, add new modes such as 'boat' and then add `S + S2` and observe what happens.
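The sketch below illustrates what the last two tasks are likely to reveal: pandas adds series label by label, and a label that exists in only one series yields `NaN`. The quantities in `S2` are invented for this illustration, and `Series.add` with `fill_value=0` is shown as one possible way to treat missing labels as zero.

```python
import pandas as pd

mode = ['bicycle', 'tram', 'train', 'car']
S = pd.Series([320, 13, 59, 176], index=mode)

# Hypothetical second count that also contains a 'boat' entry
S2 = pd.Series([100, 7, 41, 24, 5], index=mode + ['boat'])

total = S + S2                           # 'boat' is missing in S, so the sum is NaN
total_filled = S.add(S2, fill_value=0)   # treat the missing entry as 0 instead

print(total)
print(total_filled)
```

Note that the result is matched on the index labels, not on the position of the entries.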

You do not have to use lists to create a series. You can also pass a dictionary as input as shown below.

In [4]:
transport = {'bicycle' : 320,
             'tram' : 13,
             'train': 59,
             'car': 176}
transport_series = pd.Series(transport)
print(transport_series)

bicycle    320
tram        13
train       59
car        176
dtype: int64


### Apply function
You can use `apply` to run a function on every entry of your series, for example a basic math operation. To find out more about the parameters you can pass to `apply`, check out the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html">pandas documentation page</a>. This page is a useful reference for pandas, as it also lists all the other functions available for series, including the parameters each function accepts.

In [5]:
S.apply(np.log)

bicycle    5.768321
tram       2.564949
train      4.077537
car        5.170484
dtype: float64

Here you can observe a key difference between `S.apply(np.log)` and `S = S.apply(np.log)`. In the first case you see the output of the function, but it is not stored. The second suppresses the output but stores the result in the variable. You then need to run `print(S)` or simply `S` to see the output. *Try this in the cell below.*
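A small sketch of this difference, storing the result under new names (`S_log` and `S_double` are chosen only for this example) so that the original series stays untouched:

```python
import numpy as np
import pandas as pd

S = pd.Series([320, 13, 59, 176], index=['bicycle', 'tram', 'train', 'car'])

S.apply(np.log)                       # computed, but not stored anywhere
S_log = S.apply(np.log)               # stored in a new variable
S_double = S.apply(lambda x: x * 2)   # apply also accepts your own function

print(S)         # still the original quantities
print(S_log)
print(S_double)
```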

## Dataframes - Basic functions

In contrast to series, dataframes have **two dimensions** and therefore look like the tables you would use in Excel. This makes pandas a very powerful Python library, as you can perform complex operations on large amounts of data fairly quickly. Each column can be interpreted as a series, and you can apply similar functions to a column as to a series. Each column can have its own data type (integer, float, string, object).
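To illustrate that each column behaves like a series with its own data type, here is a minimal sketch. The dataframe `df` and its `share` values are invented for this example.

```python
import pandas as pd

# Hypothetical example data: three columns with three different data types
df = pd.DataFrame({'mode': ['bicycle', 'tram', 'train'],
                   'quantity': [320, 13, 59],
                   'share': [0.82, 0.03, 0.15]})

print(df.dtypes)          # one dtype per column

col = df['quantity']      # selecting one column returns a Series
print(type(col))
print(col.sum())          # series functions work on the column
```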

### Creating a dataframe / Loading data

You have several options to create dataframes. These include:

- From several lists
- From series
- From dictionaries
- From a csv file

### Dataframe from lists
In the example below, a dataframe is created using lists. The lists contain Eurostat data with the rail kilometers in Dutch provinces per year.

In [32]:
Regions = ['Utrecht','Noord_Holland','Zuid_Holland','Zeeland']
Years = [2016, 2017, 2018, 2019]

Rail_2016 = [194, 365, 456, 97]
Rail_2017 = [194, 365, 454, 97]
Rail_2018 = [196, 366, 486, 127]
Rail_2019 = [200, 344, 446, 103]

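# Each zipped tuple holds the values of one region (one row); the years become the column headers
Rail_NL = pd.DataFrame(list(zip(Rail_2016, Rail_2017, Rail_2018, Rail_2019)), columns=Years, index=Regions)
Rail_NL  # append [2016] to display a single column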

Unnamed: 0,2016,2017,2018,2019
Utrecht,194,194,196,200
Noord_Holland,365,365,366,344
Zuid_Holland,456,454,486,446
Zeeland,97,97,127,103


Without column headers and indices, such a dataframe would not be very informative.
*Task:* Using the two lists `Years` and `Regions` defined above, assign **years as column headers and regions as indices**. You can do this while creating the dataframe by adding `columns = ...` and `index = ...`, or afterwards by setting `Rail_NL.columns = ...` and `Rail_NL.index = ...`.
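The second option, assigning the labels after the dataframe has been created, could look like the sketch below. The names `rows` and `Rail_plain` are introduced only for this illustration.

```python
import pandas as pd

Regions = ['Utrecht', 'Noord_Holland', 'Zuid_Holland', 'Zeeland']
Years = [2016, 2017, 2018, 2019]

# One inner list per region, one value per year
rows = [[194, 194, 196, 200],
        [365, 365, 366, 344],
        [456, 454, 486, 446],
        [97, 97, 127, 103]]

Rail_plain = pd.DataFrame(rows)   # rows and columns are labelled 0..3 by default
Rail_plain.columns = Years        # assign the column headers afterwards
Rail_plain.index = Regions        # assign the row labels afterwards
print(Rail_plain)
```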

##### Selecting data using loc 

You can select a column by simply typing `Rail_NL[2016]`. However, this is a rather limited way to select data. The most versatile method is `loc`, as shown in the examples below.

In [38]:
Rail_NL.loc['Utrecht']

2016    194
2017    194
2018    196
2019    200
Name: Utrecht, dtype: int64

In [35]:
Rail_NL.loc[Rail_NL[2016]<200]

Unnamed: 0,2016,2017,2018,2019
Utrecht,194,194,196,200
Zeeland,97,97,127,103


In [36]:
Rail_NL.loc[(Rail_NL[2016]<200)&(Rail_NL[2019]>=150)]

Unnamed: 0,2016,2017,2018,2019
Utrecht,194,194,196,200


Since selecting data is a core element of using pandas, there are many other ways to select and filter data, for example `iloc`, `pd.IndexSlice`, and boolean filtering. For some inspiration, have a look at <a href="https://www.kdnuggets.com/2019/06/select-rows-columns-pandas.html">this tutorial</a>.
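As a rough sketch of a few of these options, the example below rebuilds the rail dataframe and selects by label (`loc`) and by position (`iloc`). The names `cell`, `subset`, `first_row`, `corner`, and `edited` are chosen for this illustration, and the value written into `edited` is a placeholder rather than real data.

```python
import pandas as pd

Regions = ['Utrecht', 'Noord_Holland', 'Zuid_Holland', 'Zeeland']
Years = [2016, 2017, 2018, 2019]
Rail_NL = pd.DataFrame([[194, 194, 196, 200],
                        [365, 365, 366, 344],
                        [456, 454, 486, 446],
                        [97, 97, 127, 103]],
                       index=Regions, columns=Years)

cell = Rail_NL.loc['Utrecht', 2018]                         # one cell by row and column label
subset = Rail_NL.loc[['Utrecht', 'Zeeland'], [2016, 2019]]  # several rows and columns by label
first_row = Rail_NL.iloc[0]                                 # first row by position
corner = Rail_NL.iloc[:2, -1]                               # first two rows, last column

# loc can also be used to edit values; work on a copy to keep Rail_NL intact
edited = Rail_NL.copy()
edited.loc['Zeeland', 2019] = 0   # placeholder value, not real data

print(cell)
print(subset)
```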

### Dataframe from series

Now we want to create a dataframe from series, where each row represents a region.

In [30]:
Utrecht = pd.Series([194, 194 , 196, 200],index=Years)
Noord_Holland = pd.Series([365, 365, 366, 344],index=Years)
Zuid_Holland = pd.Series([456, 454, 486, 446],index=Years)
Zeeland = pd.Series([97, 97, 127, 103],index=Years)

Rail_NL_series = pd.concat([Utrecht,Noord_Holland,Zuid_Holland,Zeeland],axis=1)
Rail_NL_series

Unnamed: 0,0,1,2,3
2016,194,365,456,97
2017,194,365,454,97
2018,196,366,486,127
2019,200,344,446,103


This is not what we aimed for. When concatenating data, the default is `axis = 0`, which points downwards, so each series is attached below the previous one.

*Task:* Since we want each series to become its own column, set `axis = 1` in the example above.

Instead of renaming the columns as mentioned above, you can also give each series a name before concatenating by adding the parameter `name='Utrecht'`. This is safer, as you can be sure that the correct header is assigned to each series.
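A minimal sketch of this approach, limited to two regions to keep it short (`Rail_named` is a name chosen for this example):

```python
import pandas as pd

Years = [2016, 2017, 2018, 2019]

# The name of each series becomes its column header after concatenation
Utrecht = pd.Series([194, 194, 196, 200], index=Years, name='Utrecht')
Zeeland = pd.Series([97, 97, 127, 103], index=Years, name='Zeeland')

Rail_named = pd.concat([Utrecht, Zeeland], axis=1)
print(Rail_named)
```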

##### Transposing dataframe

Be aware though! Contrary to the above, the data now show the regions in the columns and the years in the rows. You can simply change the orientation of a dataframe using the command `Rail_NL_series.T` or by adding `.T` to the concatenation command.
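The effect of `.T` can be sketched on a small dataframe with two regions (the names `Rail_by_year` and `Rail_by_region` are made up for this illustration):

```python
import pandas as pd

Years = [2016, 2017, 2018, 2019]
Rail_by_year = pd.DataFrame({'Utrecht': [194, 194, 196, 200],
                             'Zeeland': [97, 97, 127, 103]},
                            index=Years)

# .T swaps rows and columns: regions become rows, years become columns
Rail_by_region = Rail_by_year.T

print(Rail_by_year.shape)    # (4, 2)
print(Rail_by_region.shape)  # (2, 4)
print(Rail_by_region)
```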

### Dataframe from dictionaries

An option that is very similar to creating a dataframe from series is to create a dataframe from a dictionary. Have a look at what is similar and what is different in the example below.

In [42]:
rail_km = {'Utrecht': [194, 194 , 196, 200],
         'Noord_Holland': [365, 365, 366, 344],
         'Zuid_Holland': [456, 454, 486, 446],
         'Zeeland': [97, 97, 127, 103]
        }

Rail_NL_dict = pd.DataFrame(rail_km, index= Years)
Rail_NL_dict

Unnamed: 0,Utrecht,Noord_Holland,Zuid_Holland,Zeeland
2016,194,365,456,97
2017,194,365,454,97
2018,196,366,486,127
2019,200,344,446,103


### Dataframe from CSVs
Alternatively, you can also load data in bulk using a CSV file. In the example below, we load all data on rail kilometers for the Netherlands from 2008 to 2019.

In [46]:
rail_NL_csv_original = pd.read_csv('estat_NL_rail_km.csv') 
rail_NL_csv_original

Unnamed: 0,DATAFLOW,freq,tra_infr,unit,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
0,ESTAT:TGS00113(1.0),A,RL,KM,NL11,2008,163,
1,ESTAT:TGS00113(1.0),A,RL,KM,NL11,2009,163,
2,ESTAT:TGS00113(1.0),A,RL,KM,NL11,2010,163,
3,ESTAT:TGS00113(1.0),A,RL,KM,NL11,2011,164,
4,ESTAT:TGS00113(1.0),A,RL,KM,NL11,2012,164,
...,...,...,...,...,...,...,...,...
139,ESTAT:TGS00113(1.0),A,RL,KM,NL42,2015,251,
140,ESTAT:TGS00113(1.0),A,RL,KM,NL42,2016,253,
141,ESTAT:TGS00113(1.0),A,RL,KM,NL42,2017,253,
142,ESTAT:TGS00113(1.0),A,RL,KM,NL42,2018,265,


In the dataframe we created, regions are not entered with their name but with their <a href="https://de.wikipedia.org/wiki/NUTS:NL">NUTS code</a>. To improve readability, we will first use the dictionary below to rename the entries in the `geo` column above.

*Task:* Try to rename the NUTS codes in the above dataframe with the following command.

`rail_NL_csv_original = rail_NL_csv_original.replace({"geo": NUTS})`

In [47]:
NUTS = {'NL11': 'Groningen',
        'NL12': 'Friesland',
        'NL13': 'Drenthe',
        'NL21': 'Overijssel',
        'NL22': 'Gelderland',
        'NL23': 'Flevoland',
        'NL31': 'Utrecht',
        'NL32': 'Noord-Holland',
        'NL33': 'Zuid-Holland',
        'NL34': 'Zeeland',
        'NL41': 'Noord-Brabant',
        'NL42': 'Limburg'}
rail_NL_csv_original = rail_NL_csv_original.replace({"geo": NUTS})
rail_NL_csv_original

Unnamed: 0,DATAFLOW,freq,tra_infr,unit,geo,TIME_PERIOD,OBS_VALUE,OBS_FLAG
0,ESTAT:TGS00113(1.0),A,RL,KM,Groningen,2008,163,
1,ESTAT:TGS00113(1.0),A,RL,KM,Groningen,2009,163,
2,ESTAT:TGS00113(1.0),A,RL,KM,Groningen,2010,163,
3,ESTAT:TGS00113(1.0),A,RL,KM,Groningen,2011,164,
4,ESTAT:TGS00113(1.0),A,RL,KM,Groningen,2012,164,
...,...,...,...,...,...,...,...,...
139,ESTAT:TGS00113(1.0),A,RL,KM,Limburg,2015,251,
140,ESTAT:TGS00113(1.0),A,RL,KM,Limburg,2016,253,
141,ESTAT:TGS00113(1.0),A,RL,KM,Limburg,2017,253,
142,ESTAT:TGS00113(1.0),A,RL,KM,Limburg,2018,265,


##### Pivot Function
But we still have data that we do not need and we would like to see the years as the column names. We could now use the `.drop` function to drop the columns which we do not need. But this would not solve the problem with our column names. For this we need to **pivot** our data.
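To show the idea on a small scale, the sketch below uses a toy long-format table whose column names (`region`, `year`, `km`, `unit`) are invented for this illustration and differ from those in the CSV file. It first drops an unneeded column and then pivots the table so that the years become column headers.

```python
import pandas as pd

# Toy long-format data: one row per region and year (column names are hypothetical)
long_df = pd.DataFrame({'region': ['Utrecht', 'Utrecht', 'Zeeland', 'Zeeland'],
                        'year': [2016, 2017, 2016, 2017],
                        'km': [194, 194, 97, 97],
                        'unit': ['KM', 'KM', 'KM', 'KM']})

# drop removes columns, but the years remain cell values
trimmed = long_df.drop(columns=['unit'])

# pivot_table turns the distinct years into column headers
wide = pd.pivot_table(long_df, values='km', index='region', columns='year')
print(wide)
```

Because `pivot_table` aggregates with the mean by default, duplicate region and year combinations would be averaged rather than raising an error.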

*Task:* In the code below, enter the correct column headers of `rail_NL_csv_original` to pivot the data.

In [49]:
rail_NL_csv = pd.pivot_table(rail_NL_csv_original, values = '...' , index='...', columns = '...')
rail_NL_csv

geo \ TIME_PERIOD,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
Drenthe,105,105,105,105,105,105,105,105,105,105,109,101
Flevoland,40,40,40,40,40,40,67,67,67,67,67,67
Friesland,151,145,151,151,151,151,150,150,149,149,149,150
Gelderland,542,542,542,542,542,542,546,546,544,544,556,542
Groningen,163,163,163,164,164,164,164,164,163,164,200,171
Limburg,245,251,251,251,251,251,251,251,253,253,265,241
Noord-Brabant,325,325,357,356,356,356,352,352,352,352,383,361
Noord-Holland,357,357,371,371,371,371,371,371,365,365,366,344
Overijssel,302,302,302,302,302,302,314,314,312,312,318,315
Utrecht,195,195,195,196,196,196,195,195,194,194,196,200


Source data: https://ec.europa.eu/eurostat/databrowser/view/tgs00113/default/table?lang=en

NUTS codes: https://de.wikipedia.org/wiki/NUTS:NL

# Still to add

Multiindex

Group by

Plotting

In [9]:
%matplotlib inline