# Basics of pandas

In this Jupyter notebook we cover:
- How to read in different data types
- What are DataFrames and how to interact with them
- How to query DataFrames and retrieve only a subset of the data
- How to apply filters on the values

### First start with reading in a data set
Here we will use the student_debt data set, which can be found in the teamlinq lesson as well as in the data dashboard.

## <span class="section">1.</span> Data frames


The most important data type of the Pandas library is **`pd.DataFrame`**.
It is a _composite_ data type, whose values are called **data frames**.

A data frame is a two-dimensional arrangement of data values. It can be helpful to think of it as a table with rows and columns. 
The data itself is typically of a primitive data type (`int`, `float`, `bool`, `str`,
but also some variants of these as offered by the NumPy library;
details are not relevant at this point).

Here is an example in which we create a data frame from scratch; we name it `df`.
It has

* four rows indexed from 0 through 3, and
* three columns labeled `'A'`, `'B'`, and `'C'`.

In [57]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['one', 'two', 'three', 'four'], 'C': [False, True, False, True]})
df

Unnamed: 0,A,B,C
0,1,one,False
1,2,two,True
2,3,three,False
3,4,four,True


##  <span class="section">1.1.</span> Getting data from a `DataFrame`

You can **get a column** from a data frame by indexing, just like a sequence.
The result is a `Series` object.
This is useful to extract a particular feature from the data set.

In [58]:
df['B']

0      one
1      two
2    three
3     four
Name: B, dtype: object

To **get a row** from a data frame,
you use the `DataFrame` attribute `loc`.
It is called an _indexer attribute_,
because it supports indexing
with _square brackets_, just like indexing of lists.
The result is also a `Series` object:

In [59]:
df.loc[2]

A        3
B    three
C    False
Name: 2, dtype: object

You can **get a value** at a given location in a data frame
in several ways:

* **`df[column_label][row_index]`** : first get the column,
    then get the value from the resulting `Series` object
* **`df.loc[row_index, column_label]`** : get the value directly,
    using `loc` with the row index and column label

In [60]:
df['B'][2]

'three'

In [61]:
df.loc[2, 'B']

'three'

_Slicing_ can be used to **get larger parts from a data frame**,
with a syntax similar to the slicing of lists.

> NOTE: When slicing `DataFrame` and `Series` objects
> with the syntax `.loc[start:stop]`,
> the **`stop` value is included**.  
> This contrasts with _list_ slicing, where `stop` is _not_ included.

To select a slice of rows,
use `.loc[start:stop]`.
The result is a (new!) data frame:

In [62]:
df.loc[1:2]

Unnamed: 0,A,B,C
1,2,two,True
2,3,three,False
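
The difference between the two kinds of slicing is easy to see side by side. The following is a minimal sketch; the names `toy` and `values` are made up for this illustration and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical toy data, similar to the `df` used above
toy = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['one', 'two', 'three', 'four']})
values = [1, 2, 3, 4]

list_slice = values[1:2]   # list slicing: stop is excluded, so one element
loc_slice = toy.loc[1:2]   # .loc slicing: stop is included, so rows 1 and 2

print(list_slice)
print(loc_slice)
```

The list slice holds a single element, while the `.loc` slice holds two rows.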


When `start` or `stop` is omitted in the slice `start:stop`,
the first or last item is implied. Therefore, `.loc[:]` extracts all rows:

In [63]:
df.loc[:]

Unnamed: 0,A,B,C
0,1,one,False
1,2,two,True
2,3,three,False
3,4,four,True


The `DataFrame` method call **`df.head(n)`** returns the first `n` rows of the data frame `df`,
regardless of how the rows are indexed.
The default value for `n` is 5 (i.e. omitting the value in `df.head()` is equivalent to `df.head(5)`).

In [64]:
df.head(2)

Unnamed: 0,A,B,C
0,1,one,False
1,2,two,True
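
That `head` works by position rather than by index label becomes visible when the index is not `0, 1, 2, ...`. A small sketch, using a made-up data frame `toy` with an arbitrary index:

```python
import pandas as pd

# Hypothetical data frame whose index labels are not in counting order
toy = pd.DataFrame({'A': [10, 20, 30]}, index=[7, 3, 5])

# head(2) takes the first two rows as stored, whatever their labels are
first_two = toy.head(2)
print(first_two)
```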


You can also apply slicing to the columns:

In [65]:
df.loc[:, 'A':'B']

Unnamed: 0,A,B
0,1,one
1,2,two
2,3,three
3,4,four


Note that the first argument, `:`, selects all the rows.

Using `.loc` we can also extract a particular slice of rows and columns in one go:

In [66]:
df.loc[1:2, 'A':'B']

Unnamed: 0,A,B
1,2,two
2,3,three


To **get _non-adjacent_ rows**,
use `.loc` with a _list of row indices_
(note the **double square brackets**).
The result is a (new!) data frame:

In [67]:
df.loc[[1, 3]]

Unnamed: 0,A,B,C
1,2,two,True
3,4,four,True


You can also **get _non-adjacent_ columns**.
One way is by providing a _list of column labels_ directly to the data frame
(note the **double square brackets**).
The result is a (new!) data frame:

In [68]:
df[['A', 'C']]

Unnamed: 0,A,C
0,1,False
1,2,True
2,3,False
3,4,True


An alternative way of getting non-adjacent columns,
is to use `.loc` with a _full slice_ for the rows using just '`:`':

In [69]:
df.loc[:, ['A', 'C']]

Unnamed: 0,A,C
0,1,False
1,2,True
2,3,False
3,4,True


Note that `df['B']` returns column '`B`' as a `Series` object.
Sometimes it is more convenient to get it as a `DataFrame` with one column.
This is achieved by using a list with only one element `'B'` as column label
(note the **double square brackets**):

In [70]:
df[['B']]

Unnamed: 0,B
0,one
1,two
2,three
3,four


## <span class="section">1.2.</span> Sorting a `DataFrame`

To sort a data frame `df` by the values in column `'c'` use **`df.sort_values(by='c')`**.
Note that this does not modify the data frame,
but rather returns a new data frame.  
A series is sorted by the same function,
but it does not need the `by` argument.
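
Both points can be checked on a small example. This is only a sketch; the data frame `toy` is invented for the illustration.

```python
import pandas as pd

# Hypothetical data frame with a single column 'c'
toy = pd.DataFrame({'c': [3, 1, 2]})

sorted_toy = toy.sort_values(by='c')   # returns a new, sorted data frame
sorted_col = toy['c'].sort_values()    # a Series needs no `by` argument

print(toy['c'].tolist())          # the original order is unchanged
print(sorted_toy['c'].tolist())
print(sorted_col.tolist())
```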

As an example, we first load the country data and then sort it by _area_:

In [104]:
file_countries = './data/country.csv'
country_data = pd.read_csv(file_countries)
country_data.head()

Unnamed: 0,name,alpha_3,tld,continent,capital,area,population
0,Andorra,AND,.ad,EU,Andorra la Vella,468.0,84000
1,United Arab Emirates,ARE,.ae,AS,Abu Dhabi,82880.0,4975593
2,Afghanistan,AFG,.af,AS,Kabul,647500.0,29121286
3,Antigua and Barbuda,ATG,.ag,NM,St. John's,443.0,86754
4,Anguilla,AIA,.ai,NM,The Valley,102.0,13254


In [96]:
country_data.sort_values(by='area')

Unnamed: 0,name,alpha_3,tld,continent,capital,area,population
232,United States Minor Outlying Islands,UMI,.um,OC,,0.00,0
236,Vatican,VAT,.va,EU,Vatican City,0.44,921
138,Monaco,MCO,.mc,EU,Monaco,1.95,32965
82,Gibraltar,GIB,.gi,EU,Gibraltar,6.50,27884
220,Tokelau,TKL,.tk,OC,,10.00,1466
...,...,...,...,...,...,...,...
47,China,CHN,.cn,AS,Beijing,9596960.00,1330044000
233,United States,USA,.us,NM,Washington,9629091.00,310232863
37,Canada,CAN,.ca,NM,Ottawa,9984670.00,33679000
8,Antarctica,ATA,.aq,AN,,14000000.00,0


To sort in **descending order**, supply the argument **`ascending=False`**:

In [97]:
country_data.sort_values(by='area', ascending=False)

Unnamed: 0,name,alpha_3,tld,continent,capital,area,population
191,Russia,RUS,.ru,EU,Moscow,17100000.00,140702000
8,Antarctica,ATA,.aq,AN,,14000000.00,0
37,Canada,CAN,.ca,NM,Ottawa,9984670.00,33679000
233,United States,USA,.us,NM,Washington,9629091.00,310232863
47,China,CHN,.cn,AS,Beijing,9596960.00,1330044000
...,...,...,...,...,...,...,...
220,Tokelau,TKL,.tk,OC,,10.00,1466
82,Gibraltar,GIB,.gi,EU,Gibraltar,6.50,27884
138,Monaco,MCO,.mc,EU,Monaco,1.95,32965
236,Vatican,VAT,.va,EU,Vatican City,0.44,921


> A data frame can be **sorted in place**,
> that is, the data frame is modified,
> rather than returning a new data frame.
> This is achieved by supplying the argument **`inplace=True`**,
> but is not recommended (rather, assign the result to a new variable).

To **sort on multiple columns**, provide a _list of column labels_:

In [98]:
country_data.sort_values(by=['continent', 'name'])

Unnamed: 0,name,alpha_3,tld,continent,capital,area,population
61,Algeria,DZA,.dz,AF,Algiers,2381740.0,34586184
7,Angola,AGO,.ao,AF,Luanda,1246700.0,13068161
24,Benin,BEN,.bj,AF,Porto-Novo,112620.0,9056010
34,Botswana,BWA,.bw,AF,Gaborone,600370.0,2029307
20,Burkina Faso,BFA,.bf,AF,Ouagadougou,274200.0,16241811
...,...,...,...,...,...,...,...
186,Paraguay,PRY,.py,SM,Asuncion,406750.0,6375830
174,Peru,PER,.pe,SM,Lima,1285220.0,29907003
208,Suriname,SUR,.sr,SM,Paramaribo,163270.0,492829
234,Uruguay,URY,.uy,SM,Montevideo,176220.0,3477000


As you can see, the index (left-most column) labels are sorted along as well.
If you want to have a new index for the result, in sorted order,
then you can use the function **`reset_index()`**.
The old index is preserved as well in the column labeled `'index'`.

Let's sort the country data by _population_,
reset the index, and assign the result to variable `country_data_sorted_by_pop`:

In [99]:
country_data_sorted_by_pop = country_data.sort_values(by='population', ascending=False).reset_index()
country_data_sorted_by_pop

Unnamed: 0,index,name,alpha_3,tld,continent,capital,area,population
0,47,China,CHN,.cn,AS,Beijing,9596960.0,1330044000
1,104,India,IND,.in,AS,New Delhi,3287590.0,1173108018
2,233,United States,USA,.us,NM,Washington,9629091.0,310232863
3,100,Indonesia,IDN,.id,AS,Jakarta,1919440.0,242968342
4,30,Brazil,BRA,.br,SM,Brasilia,8511965.0,201103330
...,...,...,...,...,...,...,...,...
246,89,South Georgia and the South Sandwich Islands,SGS,.gs,AN,Grytviken,3903.0,30
247,232,United States Minor Outlying Islands,UMI,.um,OC,,0.0,0
248,33,Bouvet Island,BVT,.bv,AN,,49.0,0
249,8,Antarctica,ATA,.aq,AN,,14000000.0,0
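
If the old index is not needed, `reset_index` also accepts `drop=True`, which discards it instead of keeping it as a column. A minimal sketch on a made-up data frame `toy`:

```python
import pandas as pd

# Hypothetical data, not taken from the country data set
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'population': [5, 50, 20]})

by_pop = toy.sort_values(by='population', ascending=False)

with_old_index = by_pop.reset_index()              # old index kept in column 'index'
without_old_index = by_pop.reset_index(drop=True)  # old index discarded

print(with_old_index)
print(without_old_index)
```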


You can also calculate the sum of the values in each column by appending `.sum()` to the selected data frame.

In [100]:
country_data[['area', 'population']].sum()

area          1.499102e+08
population    6.862922e+09
dtype: float64

### Exercise <span class="exercise">7.a</span>

Sort the data on _area_ in descending order,
and show the _name_ and _area_ for the first 10 rows.

##  <span class="section">2.</span> `Index` and `Series` objects

To understand Pandas data frames better,
you should know about two more data types:

* **`pd.Index`**:
    this is a special type of object that holds the index;
    by default, it is just a range of integers starting from 0.
* **`pd.Series`**:
    this type represents a single named sequence of data values of the same type,
    with an index.  
    It is used for a single column in a data frame,
    but also for a row extracted from a data frame,
    where the column labels then serve as the `Series` index,
    and the row index serves as the `Series` name.
    
Thus, a data frame is a sequence of `Series` objects,
with a shared `Index` object.
The names of the `Series` objects are the column labels.
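
These relationships can be inspected directly through the `name` and `index` attributes of a `Series`. A short sketch on a made-up data frame `toy`:

```python
import pandas as pd

# Hypothetical two-row data frame
toy = pd.DataFrame({'A': [1, 2], 'B': ['one', 'two']})

col = toy['A']     # a column: its name is the column label
row = toy.loc[1]   # a row: its name is the row index

print(col.name, col.index.equals(toy.index))  # the column shares the frame's index
print(row.name, list(row.index))              # the row is indexed by column labels
```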

In [71]:
names = ['Peter', 'Anna', 'Tom', 'John', 'Simone']
years = [ 1998, 2002, 1946, 1973, 1962 ]

This time, instead of creating a data frame, we create a `Series` object out of the information as follows. The `Series` object itself will contain the years of birth, it will be indexed by the names, and the object will be called `'Years of birth'`. We assign the object to variable `se_years`:

In [72]:
se_years = pd.Series( years, index=names, name='Years of birth' )
se_years

Peter     1998
Anna      2002
Tom       1946
John      1973
Simone    1962
Name: Years of birth, dtype: int64
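
Because the series is indexed by name, values can be looked up by label instead of by position. The sketch below rebuilds a shorter version of the series under the hypothetical name `se` so that it runs on its own.

```python
import pandas as pd

names = ['Peter', 'Anna', 'Tom']
years = [1998, 2002, 1946]
se = pd.Series(years, index=names, name='Years of birth')

anna = se['Anna']                    # look up a single value by its label
subset = se.loc[['Peter', 'Tom']]    # a list of labels gives a new Series

print(anna)
print(subset)
```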

##  <span class="section">3.</span> `Functions`


Functions are useful when you want to perform a certain operation multiple times or on multiple pieces of data. A function normally has the following shape:
```
def function_name(*input_parameters*):
    perform actions

    return *output_values*
```

To calculate the distance between points you can for example use the function:

In [73]:
import math
def distance(x1 : int, y1 : int, x2 : int, y2 : int) -> float:
    """Calculate the distance between two 2-dimensional points (x1, y1) and (x2, y2).
    """
    
    dx = x2 - x1
    dy = y2 - y1
    dsquared = dx**2 + dy**2
    result = math.sqrt(dsquared)
    
    return result

In [74]:
distance(1, 3, 4, 6)

4.242640687119285

##  <span class="section">4.</span> `Error Handling`

In case of errors, it is often useful to inspect the variable types. We can inspect the type of `DataFrame` and `Series` variables using the **`type()`** function:

In [75]:
type(df) # type of the df variable

pandas.core.frame.DataFrame

In [76]:
type(df['C']) # type of the 'C' column in the df DataFrame

pandas.core.series.Series

In [77]:
type(se_years) # type of the se_years variable

pandas.core.series.Series

Note that a column has the `Series` type, which is not the same as the type of the values inside the column. You can get the type of the values inside a column using **`df.dtypes`**:

In [78]:
df.dtypes['C']

dtype('bool')

Finding the type of an object can be very useful, since it often explains why a certain error shows up when running code. For example, subtracting an integer from a string raises a `TypeError`:


In [79]:
x = str(4)
y = 5

x-y

TypeError: unsupported operand type(s) for -: 'str' and 'int'
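
One possible way to resolve such an error is to inspect the types and then convert the value explicitly. This is a sketch of that approach, assuming the string really holds a whole number:

```python
x = str(4)   # x is the string '4', not the number 4
y = 5

print(type(x), type(y))   # inspect the types to find the mismatch

# Convert the string to an int before subtracting
result = int(x) - y
print(result)
```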

##  <span class="section">5.</span> `Hackathon`


Now that we have a basic understanding of `DataFrame` and `Series` objects, we can continue with processing the data relevant to the hackathon.


In [80]:
import pandas as pd 

path = './data/student_debt.csv' #this is the file path to my csv

df = pd.read_csv(path) #read the csv into a pandas DataFrame
df #leave an object at the end of a code cell and it will get printed during execution

Unnamed: 0.1,Unnamed: 0,Period,Characteristic,No.people,Sum,Average,Median
0,2,2011,Total,754.2,9.5,12.6,7.4
1,3,2011,Man,391.0,5.1,13.1,7.6
2,4,2011,Woman,363.1,4.4,12.0,7.2
3,5,2011,up to 20 years old,41.1,0.1,2.5,1.4
4,6,2011,between 20 and 25 years old,284.8,2.3,8.1,4.8
...,...,...,...,...,...,...,...
67,69,2019*,up to 20 years old,104.5,0.4,4.1,2.5
68,70,2019*,between 20 and 25 years old,479.0,5.2,10.9,7.2
69,71,2019*,between 25 and 45 years old,822.2,13.6,16.5,10.6
70,72,2019*,between 45 and 65 years old,7.7,0.1,12.4,6.2


In [81]:
type(df["Period"][0])

str
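
The values are strings because entries such as `2018*` cannot be read as numbers. If numeric years are needed, one option is to strip the trailing `*` and convert the rest. The sketch below uses a made-up series `periods` that mimics the column; whether dropping the `*` marker is appropriate depends on what it means in the data set.

```python
import pandas as pd

# Hypothetical sample mimicking the 'Period' column, including the '*' suffix
periods = pd.Series(['2011', '2018*', '2019*'], name='Period')

# Remove a trailing '*' from each string, then convert to integers
years = periods.str.rstrip('*').astype(int)
print(years)
```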

In [82]:
#To print out only the first few rows of a dataframe use .head()
df.head(5) #print out first 5 rows of our dataframe
#you can achieve similar behaviour with .tail() which returns the last few elements in our dataframe

Unnamed: 0.1,Unnamed: 0,Period,Characteristic,No.people,Sum,Average,Median
0,2,2011,Total,754.2,9.5,12.6,7.4
1,3,2011,Man,391.0,5.1,13.1,7.6
2,4,2011,Woman,363.1,4.4,12.0,7.2
3,5,2011,up to 20 years old,41.1,0.1,2.5,1.4
4,6,2011,between 20 and 25 years old,284.8,2.3,8.1,4.8


You can also easily check the dimensionality (number of rows and columns) of the data set by using the `.shape` attribute.

In [83]:
df.shape # first number is always number of rows, second number is number of columns

(72, 7)

We can also select just a few columns from our dataframe and form a new dataframe. 
To do so, specify a list with the column names you want to include, and use the following syntax: `df[list_of_cols]`

In [84]:
columns_to_include = ['Period', 'Characteristic', 'No.people'] #the names must match the column names in df exactly

df_nr_people = df[columns_to_include]
df_nr_people.head(5) #only the three selected columns are there in our new dataframe

Unnamed: 0,Period,Characteristic,No.people
0,2011,Total,754.2
1,2011,Man,391.0
2,2011,Woman,363.1
3,2011,up to 20 years old,41.1
4,2011,between 20 and 25 years old,284.8


In [85]:
#You can also retrieve all column names in a list-like format by using
df.columns #all column names of our df dataframe

Index(['Unnamed: 0', 'Period', 'Characteristic', 'No.people', 'Sum', 'Average',
       'Median'],
      dtype='object')
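
A column named `Unnamed: 0` typically appears when a CSV file was saved together with its index and is then read back as ordinary data. Passing `index_col=0` to `pd.read_csv` uses that first column as the index instead. The sketch below demonstrates this on a small in-memory CSV text (`csv_text` is made up for the illustration, not the real file):

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column is a saved index without a header
csv_text = ",Period,No.people\n2,2011,754.2\n3,2011,391.0\n"

plain = pd.read_csv(io.StringIO(csv_text))                 # first column becomes 'Unnamed: 0'
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column becomes the index

print(list(plain.columns))
print(list(indexed.columns))
```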

You can also retrieve just one column of a dataframe by passing the column name in square brackets after the dataframe. This returns a Pandas `Series`, which can be used much like a list of values.

In [86]:
df['No.people'].head(5) #list of only one column. NOTE: .head() and .tail() still work on Series

0    754.2
1    391.0
2    363.1
3     41.1
4    284.8
Name: No.people, dtype: float64

### Filtering values 

In Pandas we can also filter our data by selecting rows that fulfill a certain condition.

Let's say I want to only look at the rows that contain data about students up to 20 years old. To retrieve this data I can use the `Characteristic` column in the dataframe.

In [87]:
## to filter the data we need to specify a condition first
upto_20_yo = df['Characteristic'] == 'up to 20 years old' 
# here we specify that we are interested in rows where the 'Characteristic' column equals 'up to 20 years old' 

# we can then use this condition to filter our data by simply passing it in square brackets
df_upto_20 = df[upto_20_yo]

# we can also select certain columns as previously from this new dataframe
# Let's say we are only interested in the Period, the No.people and the Average student debt.
# We can create a small data frame only containing this information
df_upto_20[['Period', 'No.people', 'Average']]

Unnamed: 0,Period,No.people,Average
3,2011,41.1,2.5
11,2012,40.6,2.2
19,2013,40.5,2.2
27,2014,43.8,2.3
35,2015,48.6,2.4
43,2016,77.4,2.4
51,2017,100.3,3.4
59,2018*,111.3,3.9
67,2019*,104.5,4.1


### Multiple conditions

We sometimes need multiple conditions to perform an analysis. 
In Pandas, we can pass multiple filters into our dataframe.

Let's look at another example:
We want to see how many times we have an average student debt higher than 12 (the values appear to be in units of 1,000 EUR, so 12 corresponds to 12,000 EUR), and we are only interested in the Men and Women categories.

We can create multiple filters to achieve this.

In [88]:
high_average = df['Average'] > 12.0 

characteristics_of_interest = ['Man', 'Woman']
#To make a filter based on whether a value is contained in a list, use the .isin() method of a Series
men_women = df['Characteristic'].isin(characteristics_of_interest) 

#Link these two conditionals together using & symbol
df[high_average & men_women]

Unnamed: 0.1,Unnamed: 0,Period,Characteristic,No.people,Sum,Average,Median
1,3,2011,Man,391.0,5.1,13.1,7.6
9,11,2012,Man,449.5,5.5,12.2,6.8
17,19,2013,Man,474.1,5.8,12.3,6.8
25,27,2014,Man,501.3,6.3,12.5,7.0
33,35,2015,Man,529.7,6.7,12.7,7.1
34,36,2015,Woman,495.7,6.0,12.1,7.0
41,43,2016,Man,583.7,7.4,12.6,6.8
49,51,2017,Man,632.1,8.1,12.8,7.3
50,52,2017,Woman,605.5,7.3,12.1,7.1
57,59,2018*,Man,687.1,9.1,13.2,7.8


Using filters like this we can already answer a few questions from the data. Above, we can see that there are more entries for `Man` than for `Woman`. Also, if you look at the statistics (`Sum, Average, Median`), can you spot any pattern?

 What do you think these suggest?

# Practice exercises

Below are some practice exercises you should be able to do. 
We will not specify exact steps, but rather questions that you can answer with the data, as we are curious whether you can find the answer.
For the exercises, use the `renewable_electricity.csv` file, available for download in teamlinq or in the data dashboard.



In [105]:
df = pd.read_csv('./data/renewable_electricity.csv')
df


Unnamed: 0.1,Unnamed: 0,ID,Source,Periods,Gross production normalized,Gross production,Net production,Installations in year,Capacity (megawatt)
0,0,0,Total,1990,809,807,725,0,0
1,1,1,Total,1991,930,935,851,0,0
2,2,2,Total,1992,982,994,898,0,0
3,3,3,Total,1993,1128,1107,957,0,0
4,4,4,Total,1994,1252,1257,1093,0,0
...,...,...,...,...,...,...,...,...,...
460,460,460,biogas-other,2016,326,227,219,0,43
461,461,461,biogas-other,2017,312,189,183,0,43
462,462,462,biogas-other,2018,279,175,170,0,41
463,463,463,biogas-other,2019,316,201,195,0,42


### Question 1:
How many different `Source` values are there in the dataset?


In [90]:
### insert your code for Question 1 below:


### Question 2:
How many rows do we have in the data with information about renewable electricity in 1990 or 1991?

In [91]:
### insert your code for Question 2 below:



### Question 3:
How many times (i.e. in how many years) did the Dutch government **not** install a single offshore wind plant (hint: the source is called `wind-offshore`)?

In [92]:
### insert your code for Question 3 below:


### Question 4:
Create a function that returns the mean value of `Gross production normalized` over all available time periods for a given value of `Source`.

In [93]:
def calc_mean(df, source):
    """
    Write code below that calculates the mean of the 'Gross production normalized' values for a given "Source".
    """

    return mean

In [94]:
calc_mean(df, source = "biogas-other")

NameError: name 'mean' is not defined

### Question 5:

For question 5 we will combine two datasets from the Hackathon. In the water track there are two datasets that contain data for each municipality. One of the datasets contains the names of all the municipalities in Noord-Brabant, while the second dataset contains data for all the municipalities in the Netherlands.

The task for this question is to select only the municipalities that are located in Noord-Brabant from the `municipality_data` dataframe and assign the result to a new dataframe object.

In [106]:
municipality_data = pd.read_csv("./data/municipality_data.csv")
municipality_NB = pd.read_csv("./data/brabant_cities.csv")