# Basics of pandas

In this Jupyter notebook we cover:
- How to read in different data types
- What DataFrames are and how to interact with them
- How to query a DataFrame and retrieve only a subset of it
- How to apply filters on the values

### First, read in a data set
Here we will use the student_debt data set. It can be found in the teamlinq lesson, as well as in the data dashboard.

In [12]:
import pandas as pd 

path = './student_debt.csv' #this is the file path to my csv

df = pd.read_csv(path) #read the csv into a pandas DataFrame
df #leave an object at the end of a code cell and it will get printed during execution

Unnamed: 0.1,Unnamed: 0,Period,Characteristic,No.people,Sum,Average,Median
0,2,2011,Total,754.2,9.5,12.6,7.4
1,3,2011,Man,391.0,5.1,13.1,7.6
2,4,2011,Woman,363.1,4.4,12.0,7.2
3,5,2011,up to 20 years old,41.1,0.1,2.5,1.4
4,6,2011,between 20 and 25 years old,284.8,2.3,8.1,4.8
...,...,...,...,...,...,...,...
67,69,2019*,up to 20 years old,104.5,0.4,4.1,2.5
68,70,2019*,between 20 and 25 years old,479.0,5.2,10.9,7.2
69,71,2019*,between 25 and 45 years old,822.2,13.6,16.5,10.6
70,72,2019*,between 45 and 65 years old,7.7,0.1,12.4,6.2
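
A column called `Unnamed: 0`, like the one in the output above, typically appears when a CSV file was saved together with its row index. The sketch below assumes that is what happened here and shows how the `index_col` parameter of `read_csv` handles it. The inline CSV text is a made-up stand-in, not the real student_debt file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file that was saved together with its row index
csv_text = """,Period,Characteristic,No.people
2,2011,Total,754.2
3,2011,Man,391.0
4,2011,Woman,363.1
"""

# Default behaviour: the nameless first column is read in as 'Unnamed: 0'
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Whether to do this depends on whether the stored index carries any meaning for your analysis.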


In [11]:
#To print out only the first few rows of a dataframe use .head()
df.head(5) #print out first 5 rows of our dataframe
#you can achieve similar behaviour with .tail() which returns the last few elements in our dataframe

Unnamed: 0.1,Unnamed: 0,Period,Characteristic,No.people,Sum,Average,Median
0,2,2011,Total,754.2,9.5,12.6,7.4
1,3,2011,Man,391.0,5.1,13.1,7.6
2,4,2011,Woman,363.1,4.4,12.0,7.2
3,5,2011,up to 20 years old,41.1,0.1,2.5,1.4
4,6,2011,between 20 and 25 years old,284.8,2.3,8.1,4.8
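
To see `.head()` and `.tail()` side by side, here is a small sketch on a made-up DataFrame (the name `toy` and its values are for illustration only).

```python
import pandas as pd

# Small made-up DataFrame, used only for illustration
toy = pd.DataFrame({
    'Period': [2011, 2012, 2013, 2014, 2015, 2016],
    'Average': [12.6, 12.2, 12.3, 12.5, 12.7, 12.6],
})

first_rows = toy.head(2)   # the first 2 rows
last_rows = toy.tail(2)    # the last 2 rows
default_head = toy.head()  # without an argument, up to 5 rows are returned

print(last_rows)
```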


You can also easily check the dimensionality (number of rows and columns) of the data set by using the `.shape` attribute.

In [14]:
df.shape # first number is always number of rows, second number is number of columns

(72, 7)
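
Because `.shape` is a plain tuple, it can be unpacked into two variables. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

# Made-up DataFrame with 3 rows and 2 columns
toy = pd.DataFrame({'Period': [2011, 2012, 2013],
                    'Average': [12.6, 12.2, 12.3]})

n_rows, n_cols = toy.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)

# len() on a DataFrame counts the rows only
print(len(toy))
```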

We can also select just a few columns from our dataframe and form a new dataframe. 
To do so, specify a list with the column names you want to include, and use the following syntax: `df[list_of_cols]`

In [12]:
columns_to_include = ['Period', 'Characteristic', 'No.people'] #the names must match the column names in the dataframe exactly

df_nr_people = df[columns_to_include]
df_nr_people.head(5) #only the three selected columns are there in our new dataframe

Unnamed: 0,Period,Characteristic,No.people
0,2011,Total,754.2
1,2011,Man,391.0
2,2011,Woman,363.1
3,2011,up to 20 years old,41.1
4,2011,between 20 and 25 years old,284.8
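
Column names are case-sensitive, and asking for a name that does not exist raises a `KeyError`. The sketch below, on a made-up DataFrame, shows one way to check a list of names before selecting.

```python
import pandas as pd

# Made-up DataFrame, used only for illustration
toy = pd.DataFrame({
    'Period': [2011, 2012],
    'Characteristic': ['Total', 'Man'],
    'No.people': [754.2, 449.5],
})

wanted = ['Period', 'no.people']  # lowercase 'n': this name does not match

# Collect the requested names that are not actual columns
missing = [col for col in wanted if col not in toy.columns]
print(missing)

# Selecting with a name that does not exist raises a KeyError
selection_failed = False
try:
    toy[wanted]
except KeyError as err:
    selection_failed = True
    print('KeyError:', err)
```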


In [13]:
#You can also retrieve all column names in a list-like format by using
df.columns #all column names of our df dataframe

Index(['Unnamed: 0', 'Period', 'Characteristic', 'No.people', 'Sum', 'Average',
       'Median'],
      dtype='object')
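
The `Index` returned by `.columns` can be turned into a regular Python list, which is handy for building a selection. As a sketch on a made-up DataFrame, this keeps every column whose name does not start with `Unnamed`:

```python
import pandas as pd

# Made-up DataFrame with a leftover 'Unnamed: 0' column
toy = pd.DataFrame({
    'Unnamed: 0': [2, 3],
    'Period': [2011, 2011],
    'Average': [12.6, 13.1],
})

all_columns = toy.columns.tolist()  # plain Python list of column names

# Keep every column whose name does not start with 'Unnamed'
keep = [col for col in all_columns if not col.startswith('Unnamed')]
toy_clean = toy[keep]

print(toy_clean.columns.tolist())
```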

You can also retrieve just one column of a dataframe by passing the column name in square brackets after the dataframe. This returns a pandas Series: a one-dimensional, labelled sequence of values that you can use much like a list.

In [16]:
df['No.people'].head(5) #a single column as a Series. NOTE: .head() and .tail() still work on a Series

0    754.2
1    391.0
2    363.1
3     41.1
4    284.8
Name: No.people, dtype: float64
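
The difference between single and double square brackets is easy to miss. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

# Made-up DataFrame, used only for illustration
toy = pd.DataFrame({
    'Characteristic': ['Total', 'Man', 'Woman'],
    'No.people': [754.2, 391.0, 363.1],
})

as_series = toy['No.people']   # single brackets: a Series
as_frame = toy[['No.people']]  # double brackets: a DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)

values = as_series.tolist()  # convert the Series to a plain Python list
largest = as_series.max()    # a Series also offers summary methods
```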

### Filtering values 

In Pandas we can also filter our data by selecting rows that fulfill a certain condition.

Let's say I want to only look at the rows that contain data about students up to 20 years old. To retrieve this data I can use the `Characteristic` column in the dataframe.

In [19]:
## to filter the data we need to specify a condition first
upto_20_yo = df['Characteristic'] == 'up to 20 years old' 
# here we specify that we are interested in rows where the 'Characteristic' column equals 'up to 20 years old' 

# we can then use this condition to filter our data by simply passing it in square brackets
df_upto_20 = df[upto_20_yo]

# we can also select certain columns as previously from this new dataframe
# Let's say we are only interested in the Period (the year), the No.people and the Average student debt. 
# We can create a small data frame only containing this information
df_upto_20[['Period', 'No.people', 'Average']]

Unnamed: 0,Period,No.people,Average
3,2011,41.1,2.5
11,2012,40.6,2.2
19,2013,40.5,2.2
27,2014,43.8,2.3
35,2015,48.6,2.4
43,2016,77.4,2.4
51,2017,100.3,3.4
59,2018*,111.3,3.9
67,2019*,104.5,4.1
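
It helps to look at what the condition itself contains: a Series with one `True` or `False` per row. The sketch below, on a made-up DataFrame, prints that mask and then uses `.loc` to filter rows and pick columns in a single step, as an alternative to the two-step approach above.

```python
import pandas as pd

# Made-up DataFrame, used only for illustration
toy = pd.DataFrame({
    'Period': [2011, 2011, 2012, 2012],
    'Characteristic': ['Total', 'up to 20 years old',
                       'Total', 'up to 20 years old'],
    'Average': [12.6, 2.5, 12.2, 2.2],
})

# The condition is a boolean Series: one True/False per row
mask = toy['Characteristic'] == 'up to 20 years old'
print(mask.tolist())

# .loc takes the row condition and the column list together
subset = toy.loc[mask, ['Period', 'Average']]
print(subset)
```

Note that the filtered rows keep their original index labels, just as in the output above.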


### Multiple conditions

We sometimes need multiple conditions to perform an analysis. 
In Pandas, we can pass multiple filters into our dataframe.

Let's look at another example:
We want to see how many times the average student debt is higher than 12 (the values appear to be in thousands of euros, so 12 means 12,000 EUR), and we are only interested in the `Man` and `Woman` categories.

We can create multiple filters to achieve this.

In [7]:
high_average = df['Average'] > 12.0 

characteristics_of_interest = ['Man', 'Woman']
#To make a filter based on whether a value is contained in a list, use the .isin() function of Pandas
men_women = df['Characteristic'].isin(characteristics_of_interest) 

#Combine the two conditions with the & operator (element-wise AND): a row is kept only if both are True
df[high_average & men_women]

Unnamed: 0.1,Unnamed: 0,Period,Characteristic,No.people,Sum,Average,Median
1,3,2011,Man,391.0,5.1,13.1,7.6
9,11,2012,Man,449.5,5.5,12.2,6.8
17,19,2013,Man,474.1,5.8,12.3,6.8
25,27,2014,Man,501.3,6.3,12.5,7.0
33,35,2015,Man,529.7,6.7,12.7,7.1
34,36,2015,Woman,495.7,6.0,12.1,7.0
41,43,2016,Man,583.7,7.4,12.6,6.8
49,51,2017,Man,632.1,8.1,12.8,7.3
50,52,2017,Woman,605.5,7.3,12.1,7.1
57,59,2018*,Man,687.1,9.1,13.2,7.8


Using filters like this we can already answer a few questions from the data. Above, we can see that there are more entries for `Man` than for `Woman`. What do you think this suggests?
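
Instead of counting rows by eye, you can let pandas do the counting. The sketch below uses a small made-up DataFrame standing in for a filtered result; `value_counts()` returns the number of rows per distinct value.

```python
import pandas as pd

# Made-up stand-in for a filtered result, used only for illustration
filtered = pd.DataFrame({
    'Period': [2011, 2012, 2015, 2015],
    'Characteristic': ['Man', 'Man', 'Man', 'Woman'],
    'Average': [13.1, 12.2, 12.7, 12.1],
})

# Number of rows per distinct value in the column
counts = filtered['Characteristic'].value_counts()
print(counts)

# Total number of rows that passed the filter
n_rows = len(filtered)
```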

# Practice exercises

Below are some practice exercises you should be able to do. 
We will not specify exact steps. Instead, we give questions that you can answer with the data, as we are curious whether you can find the answer.
For the exercises, use the `renewable_electricity.csv` file, available to download in teamlinq or in the data dashboard.


### Question 1
How many rows in the data contain information about renewable electricity in 1990 or 1991?

### Question 2
How many times (i.e. in how many years) did the Dutch government **not** install a single offshore wind plant (hint: the source is called `wind-offshore`)?

In [8]:
df = pd.read_csv('renewable_electricity.csv')
df

Unnamed: 0.1,Unnamed: 0,ID,Source,Periods,Gross production normalized,Gross production,Net production,Installations in year,Capacity (megawatt)
0,0,0,Total,1990,809,807,725,0,0
1,1,1,Total,1991,930,935,851,0,0
2,2,2,Total,1992,982,994,898,0,0
3,3,3,Total,1993,1128,1107,957,0,0
4,4,4,Total,1994,1252,1257,1093,0,0
...,...,...,...,...,...,...,...,...,...
460,460,460,biogas-other,2016,326,227,219,0,43
461,461,461,biogas-other,2017,312,189,183,0,43
462,462,462,biogas-other,2018,279,175,170,0,41
463,463,463,biogas-other,2019,316,201,195,0,42
