# Advanced Data Analysis - week 1, lecture 2, examples

In the advanced data analysis course, we assume basic knowledge of Python, as could be acquired by attending the *Introduction to Programming* bridging course.

This notebook includes the examples and exercises presented in **Week 1** lecture no. 2. There is an additional notebook with the examples and exercises suggested for autonomous study during the week.

In **week 1**, we will focus mainly on introducing the dataflow/functional programming paradigm often used for data analysis. 


## Data model

We consider data that is organized according to a tabular model - we call it a DataFrame.

A DataFrame consists of a two-dimensional table with labeled columns, i.e., each column has a label. The values in each row can have different types - e.g. strings, integers, floats, etc.

The following example shows the table shown in lecture 1, with 4 labeled columns, *Name*, *Age*, *Educational level*, *Company*.

![table](images/lec1-fulltable.png)

## Data processing / analysis

A DataFrame can be transformed into another DataFrame by applying a function.

Data processing consists of applying a sequence of operators/functions to a DataFrame. Each operation transforms one DataFrame into another DataFrame.

Check the following example. What computation is being performed?

![transform](images/lec1-transform.png)

Interestingly, we can classify the transformations performed on a single DataFrame into two main types:
* **One-to-one** transformations. These transformations apply a function to each row independently, generating a new row. More generally, the function can produce zero, one or many rows as a result. A **filter** is a special *one-to-one* transformation that either returns the row itself or no row.
* **Many-to-one** transformations. These transformations apply a function to multiple rows and transform them into a single one.


### Map - one-to-one (one-to-*) transformations

The first type of transformation, a map transformation, applies a function to each row individually, generating a new row, no row, or multiple rows.

Is this performed in the previous example?

Yes. The first transformation checks each row, and if the value of *Company* is *Good*, it generates exactly the same row. Otherwise, it generates no row.

**NOTE:** the transformation in this specific example is often called a **filter**, but most map functions are not filters.

**NOTE:** An interesting aspect, that is explored by frameworks that support parallel execution (e.g. using GPUs or distributed processing), is that map functions can be applied in parallel (at the same time) to different data items.
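
The map and filter ideas can be sketched in plain Python, without Pandas. This is only an illustration: the rows are the first four of the lecture table written as tuples, and the helper names `is_good_company` and `name_and_age` are made up for the example.

```python
# First rows of the lecture table as (Name, Age, Educational level, Company) tuples
rows = [
    ("Andrew", 55, 1.0, "Good"),
    ("Bernhard", 43, 2.0, "Good"),
    ("Carolina", 37, 5.0, "Bad"),
    ("Dennis", 82, 3.0, "Good"),
]

def is_good_company(row):
    """Filter predicate: keep the row only if Company is Good."""
    return row[3] == "Good"

def name_and_age(row):
    """Map function: build a new, smaller row from a single input row."""
    return (row[0], row[1])

# Both functions look at one row at a time, so rows can be processed independently
good_rows = [row for row in rows if is_good_company(row)]
projected = [name_and_age(row) for row in good_rows]

print(good_rows)
print(projected)
```

Because neither function needs to see any other row, the work could be split across several workers without changing the result.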

### Reduce - many-to-one transformations

The second type of transformation, a reduce transformation, applies a function to multiple rows and generates a new row. These functions are also often called aggregations.

Is this performed in the previous example?


Yes. The second transformation gets all rows and applies a function that returns the row with the lowest age.

Often, frameworks iterate over the rows to be reduced and apply the function to two elements at each time. In the previous example, the reduce would proceed as follows: 
1. ```youngest((Andrew,55,1,Good),(Bernhard,43,2,Good)) -> (Bernhard,43,2,Good)``` - the first step applies the function to the first two elements;
2. ```youngest((Bernhard,43,2,Good),(Dennis,82,3,Good)) -> (Bernhard,43,2,Good)``` - the nth step applies the function to the result of the previous step and the (n+1)th element.
3. ...

**NOTE:** When there is support for parallel execution, this can be done more efficiently by comparing multiple pairs at the same time - if enough parallelism exists, this leads to executing the function in log(n) steps.
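
The pairwise steps listed above can be reproduced with `functools.reduce` from the Python standard library. This is a minimal sequential sketch, not how a parallel framework would schedule the work; the `youngest` function is written here only for illustration.

```python
from functools import reduce

# The three rows used in the steps above: (Name, Age, Educational level, Company)
rows = [
    ("Andrew", 55, 1, "Good"),
    ("Bernhard", 43, 2, "Good"),
    ("Dennis", 82, 3, "Good"),
]

def youngest(a, b):
    """Return whichever of the two rows has the lower age (position 1)."""
    return a if a[1] <= b[1] else b

# reduce applies youngest to the first two rows, then to the running
# result and the next row, until a single row is left
result = reduce(youngest, rows)
print(result)
```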

### ReduceByKey / GroupBy - multiple many-to-one transformations at once

Often, we do not want to apply the reduce transformation to all elements of a table. Instead, we want to apply the function to different groups of rows at the same time. 

In the previous example, we might want to know the youngest person for Good and for Bad company sub-groups. 

This can be seen as the general **Reduce** case, where the rows are grouped and the reduce transformation is applied to each sub-group at the same time.
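
A plain-Python sketch of the same idea, assuming the rows are tuples as in the lecture table: a dictionary keeps one running result per key (here the *Company* value), and the reduce function is applied inside each group. The names `youngest` and `youngest_by_company` are illustrative.

```python
# First five rows of the lecture table: (Name, Age, Educational level, Company)
rows = [
    ("Andrew", 55, 1.0, "Good"),
    ("Bernhard", 43, 2.0, "Good"),
    ("Carolina", 37, 5.0, "Bad"),
    ("Dennis", 82, 3.0, "Good"),
    ("Eve", 23, 3.2, "Bad"),
]

def youngest(a, b):
    """Return whichever of the two rows has the lower age."""
    return a if a[1] <= b[1] else b

# One running "youngest so far" per Company value
youngest_by_company = {}
for row in rows:
    key = row[3]
    if key in youngest_by_company:
        youngest_by_company[key] = youngest(youngest_by_company[key], row)
    else:
        youngest_by_company[key] = row

print(youngest_by_company)
```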


### Exercises

What transformations would you use to:

* Compute the average age of good and bad companies ? 
--> A: one-to-one (group by good/bad), many-to-one (average of each group)

* Know which group has lower average Education level: good or bad companies ?
--> A: one-to-one (group by good/bad), many-to-one (average of each group), many-to-one (min)

## Programming (with Pandas)

We now show how to program in Python adopting the given programming paradigm. 

We will use the popular [**Pandas**](https://pandas.pydata.org/) library for our examples, although the underlying paradigm can be found in different libraries and frameworks.

For using Pandas, we start by importing *pandas*.

In [2]:
# imports pandas
import pandas as pd


### Data model : DataFrame

In *Pandas*, a table is represented as a [**DataFrame**](https://pandas.pydata.org/docs/reference/frame.html). (follow the link for DataFrame documentation)

There are multiple ways to create an initial DataFrame. For example, you can create one from a Python dictionary, as follows:

In [3]:
population = pd.DataFrame( { "country": ["PT", "ES", "DE"] , \
                            "population": [10276617, 46937060, 83019213]})

print( population)


  country  population
0      PT    10276617
1      ES    46937060
2      DE    83019213


Pandas will maintain an additional column, the index, with an increasing integer. This column - the first column when printing the DataFrame - is created automatically.

#### Loading DataFrame from CSV files

More often, we will want to load the data from files. To create a DataFrame from a CSV file, you can use the ```read_csv``` function.

Note: If the following code fails, the most likely reason is that you do not have the *data* directory with the data files.

In [4]:
import os

# Let's create a PATH in an OS-independent way
# File lec1-example.csv is in directory data
fileName = os.path.join( "data", "lec1-example.csv")

# Read a CSV file into a DataFrame
df = pd.read_csv(fileName)

print( df)


        Name  Age  Educational level Company
0     Andrew   55                1.0    Good
1   Bernhard   43                2.0    Good
2   Carolina   37                5.0     Bad
3     Dennis   82                3.0    Good
4        Eve   23                3.2     Bad
5       Fred   46                5.0    Good
6    Gwyneth   38                4.2     Bad
7     Hayden   50                4.0     Bad
8      Irene   29                4.5     Bad
9      James   42                4.1    Good
10     Kevin   35                4.5     Bad
11       Lea   38                2.5    Good
12    Marcus   31                4.8     Bad
13     Nigel   71                2.3    Good


#### Saving DataFrame into CSV files

You can save a DataFrame into a CSV file using ```to_csv``` function.

In [5]:
import os

# Let's create a PATH in an OS-independent way
# File lec1-saved.csv will be in directory data
fileName = os.path.join( "data", "lec1-saved.csv")

# Save the DataFrame into a CSV file
df.to_csv( fileName)


Please check the file created. Is it the same as the original lec1-example.csv?

No, it has an additional column with the row number. You can also save the DataFrame without this number by using the ```index=False``` option.

In [6]:
import os

# Let's create a PATH in an OS-independent way
# File lec1-saved-noindex.csv will be in directory data
fileName = os.path.join( "data", "lec1-saved-noindex.csv")

# Save the DataFrame into a CSV file, without the index column
df.to_csv( fileName, index=False)


### Data processing with Pandas

We now show the transformations necessary to perform the exercises proposed above.


#### Selecting rows based on conditions

It is possible to select the rows for which a column has a given value as follows:

In [7]:
# Select the persons that are good company.
good = df[df["Company"]=="Good"]

print(good)

        Name  Age  Educational level Company
0     Andrew   55                1.0    Good
1   Bernhard   43                2.0    Good
3     Dennis   82                3.0    Good
5       Fred   46                5.0    Good
9      James   42                4.1    Good
11       Lea   38                2.5    Good
13     Nigel   71                2.3    Good


In [8]:
# Select the persons that are good company and have educational level of at least 3.
goodEd3plus = df[(df["Company"]=="Good") & (df["Educational level"]>=3.0)]

print(goodEd3plus)

     Name  Age  Educational level Company
3  Dennis   82                3.0    Good
5    Fred   46                5.0    Good
9   James   42                4.1    Good
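
The expression inside the brackets is itself a Series of booleans, one per row, and ```df[...]``` keeps the rows where it is True. A minimal sketch on the first four rows of the table, rebuilt inline so that it runs on its own (the variable name `mask` is only illustrative):

```python
import pandas as pd

# First four rows of the lecture table
df = pd.DataFrame({
    "Name": ["Andrew", "Bernhard", "Carolina", "Dennis"],
    "Age": [55, 43, 37, 82],
    "Educational level": [1.0, 2.0, 5.0, 3.0],
    "Company": ["Good", "Good", "Bad", "Good"],
})

# Each comparison yields a boolean Series; & combines them element by element.
# The parentheses are needed because & binds more tightly than == and >=.
mask = (df["Company"] == "Good") & (df["Educational level"] >= 3.0)
print(mask)

# Indexing with the boolean Series keeps only the rows marked True
selected = df[mask]
print(selected)
```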


#### Selecting a subset of the columns

Often, we do not need all data that is in a table. We can get rid of the data we do not need by selecting the columns we want using the following syntax ```dataframe[[col1,col2,...]]```. 

In the following example we create a new DataFrame containing only the Name and Age columns.

In [9]:
# Keep only the Name and Age columns.
person_age = df[["Name","Age"]]

print(person_age)

        Name  Age
0     Andrew   55
1   Bernhard   43
2   Carolina   37
3     Dennis   82
4        Eve   23
5       Fred   46
6    Gwyneth   38
7     Hayden   50
8      Irene   29
9      James   42
10     Kevin   35
11       Lea   38
12    Marcus   31
13     Nigel   71


#### Applying reduce/aggregation functions

Pandas allows computing the reduction/aggregation of the values of one or multiple columns.

You must select the columns for which you want to perform the computation, and then call the reduce/aggregation function.

The following example first computes the minimum age (```min``` function), and then the minimum of several columns (*Name*, *Age* and *Educational level*) at the same time. Each minimum is computed per column, so the values may come from different rows. Pandas has multiple useful aggregation functions, including maximum (```max```), minimum (```min```), mean (```mean```), median (```median```), standard deviation (```std```), etc. - check the [**DataFrame** documentation](https://pandas.pydata.org/docs/reference/frame.html) for the list of available functions.

In [10]:
minAge = good["Age"].min()
print( "Minimum age is ")
print( minAge)

mins = good[["Name", "Age","Educational level"]].min()
print( "Minimum information for several columns now")
print( mins)


Minimum age is 
38
Minimum information for several columns now
Name                 Andrew
Age                      38
Educational level       1.0
dtype: object


It is also possible to apply different aggregation functions to different columns using the [```dataframe.agg()``` function.](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)


In [11]:
stats = good.agg({'Age' : 'min', 'Educational level' : 'max'})
print( "Minimum and max in the same operation")
print( stats)

Minimum and max in the same operation
Age                  38.0
Educational level     5.0
dtype: float64


Wait, this was not what we wanted in the first place - we want the information about the youngest person that is a good company.

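Function ```nsmallest(n, columns)``` allows computing that: it returns the ```n``` rows with the smallest values in the given columns. The following call asks for the three youngest persons; the first row of the result is the youngest.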

In [12]:
youngestGood = good.nsmallest(3,["Age"])
#print( good.nsmallest(1,["Age"]))
print(youngestGood)


        Name  Age  Educational level Company
11       Lea   38                2.5    Good
9      James   42                4.1    Good
1   Bernhard   43                2.0    Good
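
To obtain only the youngest person, two options can be sketched: ```nsmallest``` with ```n=1```, or ```idxmin``` combined with ```loc```. The *good* table is rebuilt inline here so that the snippet runs on its own; the variable names `youngest_df` and `youngest_row` are illustrative.

```python
import pandas as pd

# The "good company" rows from the example, with their original index labels
good = pd.DataFrame({
    "Name": ["Andrew", "Bernhard", "Dennis", "Fred", "James", "Lea", "Nigel"],
    "Age": [55, 43, 82, 46, 42, 38, 71],
    "Educational level": [1.0, 2.0, 3.0, 5.0, 4.1, 2.5, 2.3],
    "Company": ["Good"] * 7,
}, index=[0, 1, 3, 5, 9, 11, 13])

# Option 1: nsmallest with n=1 returns a one-row DataFrame
youngest_df = good.nsmallest(1, "Age")

# Option 2: idxmin returns the index label of the minimum age,
# and loc fetches the row with that label as a Series
youngest_row = good.loc[good["Age"].idxmin()]

print(youngest_df)
print(youngest_row["Name"])
```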


#### Series

In Pandas, a Series is a sequence of values of any type, with an associated index - each value has an associated index. 

A Series can be created from a DataFrame by using the syntax ```dataframe[col]```.


In [13]:
#Creates a series with the values of column Name
good["Name"]

0       Andrew
1     Bernhard
3       Dennis
5         Fred
9        James
11         Lea
13       Nigel
Name: Name, dtype: object

Some DataFrame functions also generate a Series as a result. For example, the result of an aggregation over multiple columns is a Series.


In [14]:
type(good[["Age","Educational level"]].min())

pandas.core.series.Series

A Series can be converted into a DataFrame using the ```to_frame``` function.

In [15]:
good["Name"].to_frame()

        Name
0     Andrew
1   Bernhard
3     Dennis
5       Fred
9      James
11       Lea
13     Nigel


#### Applying reduce/aggregation functions per group

```groupby([cols])``` allows grouping the rows of a DataFrame before applying an aggregation function to each of the groups.

The following example computes the lowest age for each value of Company.


In [29]:
youngest = df[["Age","Company"]].groupby(["Company"]).min()
print( youngest)

youngestAny = youngest.idxmin()
print( "The group with the youngest person is " + youngestAny["Age"])

         Age
Company     
Bad       23
Good      38
The group with the youngest person is Bad


### Code for exercises above

In [17]:
# Compute the average age of good and bad companies ?

# TODO: complete the code

print(df[["Company", "Age"]].groupby("Company").mean())
#Alternative: 
# df[["Company", "Age"]].\
    #groupby("Company").\
    #mean()


               Age
Company           
Bad      34.714286
Good     53.857143


In [33]:
# Know which group has lower average Education level: good or bad companies ?

# TODO: complete the code
print(df[["Educational level", "Company"]].groupby("Company").mean())
print(df[["Educational level", "Company"]].groupby("Company").mean().min())

         Educational level
Company                   
Bad               4.314286
Good              2.842857
Educational level    2.842857
dtype: float64
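
The ```min()``` call above returns the lowest average, but not the name of the group it belongs to. One possible way to get the group label is ```idxmin```, sketched here on the first five rows of the table only (so the averages differ from the ones printed above); the variable names are illustrative.

```python
import pandas as pd

# First five rows of the lecture table, keeping only the columns needed
df = pd.DataFrame({
    "Educational level": [1.0, 2.0, 5.0, 3.0, 3.2],
    "Company": ["Good", "Good", "Bad", "Good", "Bad"],
})

# Average educational level per group: a Series indexed by Company
means = df.groupby("Company")["Educational level"].mean()

# idxmin returns the index label (the group name) of the lowest average
lowest_group = means.idxmin()

print(means)
print(lowest_group)
```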


## Functions over multiple DataFrames

Often, data will be in multiple tables/Dataframes. To process data, it is necessary to execute operations over these tables. 

We now introduce some of the operations available on Pandas for combining multiple tables.


### Appending tables

Sometimes, we have data over which we want to perform a computation that is in two different Dataframes - e.g. because we read data from different files.

Consider we have the following tables:

| country | population |
|---------|------------|
| PT | 10276617 |
| ES | 46937060 |
| DE | 83019213 |

and 

| capital | population | country |
|---------|------------|---------|
| Brasilia | 211049519 | BR |
| Mexico City | 127575529 | MX |
| Montevideu | 3461731 | UY |

```pd.concat([dataframe,dataframe2])``` creates a new table that combines the rows of the first dataframe followed by the rows of dataframe2, matching columns by name. If a column does not exist in one of the tables, its value in the rows coming from that table will be **NaN**.

| country | population | capital |
|---------|------------|---------|
| PT | 10276617 | NaN |
| ES | 46937060 | NaN |
| DE | 83019213 | NaN |
| BR | 211049519 | Brasilia |
| MX | 127575529 | Mexico City |
| UY | 3461731 | Montevideu |

| country | population | language |
|---------|------------|---------|
| PT | 10276617 | Portuguese |
| ES | 46937060 | Spanish |
| ES | 46937060 | Catalan |
| DE | 83019213 | German |
| BR | 211049519 | Portuguese |
| MX | 127575529 | Spanish |
| UY | 3461731 | NaN |
| AR | NaN | Spanish |
| IT | NaN | Italian |

The following code shows the example running.


In [19]:
population1 = pd.DataFrame( { "country": ["PT", "ES", "DE"] , \
                            "population": [10276617, 46937060, 83019213]})

print( population1)



  country  population
0      PT    10276617
1      ES    46937060
2      DE    83019213


In [20]:
population2 = pd.DataFrame( {"capital" : ["Brasilia", "Mexico City", "Montevideu"],\
                            "population": [211049519, 127575529, 3461731], \
                            "country": ["BR", "MX", "UY"]})

print( population2)



       capital  population country
0     Brasilia   211049519      BR
1  Mexico City   127575529      MX
2   Montevideu     3461731      UY


In [21]:
population = pd.concat([population1,population2]) 
print( population)

  country  population      capital
0      PT    10276617          NaN
1      ES    46937060          NaN
2      DE    83019213          NaN
0      BR   211049519     Brasilia
1      MX   127575529  Mexico City
2      UY     3461731   Montevideu
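
Note that the index values 0, 1, 2 are repeated in the result, because each table keeps its own row labels. If unique labels are preferred, ```pd.concat``` accepts ```ignore_index=True```. A small sketch, rebuilding the two tables so that it runs on its own (the name `population_reindexed` is illustrative and is not used in the rest of the notebook):

```python
import pandas as pd

population1 = pd.DataFrame({"country": ["PT", "ES", "DE"],
                            "population": [10276617, 46937060, 83019213]})
population2 = pd.DataFrame({"capital": ["Brasilia", "Mexico City", "Montevideu"],
                            "population": [211049519, 127575529, 3461731],
                            "country": ["BR", "MX", "UY"]})

# ignore_index=True discards the original row labels and numbers the rows 0..n-1
population_reindexed = pd.concat([population1, population2], ignore_index=True)
print(population_reindexed)
```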


### Joining tables

More interestingly, we might want to combine the columns from two or more tables into a new table.

Consider that we have the following two tables. The first table has a list of countries and their population.

| country | population |
|---------|------------|
| PT | 10276617 |
| ES | 46937060 |
| DE | 83019213 |

The second table has the language spoken in each country.

| country | language |
|---------|----------|
| PT | Portuguese |
| ES | Spanish |
| MX | Spanish |
| AR | Spanish |
| DE | German |
| IT | Italian |
| BR | Portuguese |


| country | language |
|---------|----------|
| DE | German |
| PT | Portuguese |
| ES | Spanish |



If we want to compute the number of persons that speak each language, it would be interesting to have a single table with the country, population and language columns. To this end, we need to combine both of the previous tables (this can also be seen as extending the first table with the values of the second table).

What we want to achieve is the following table, with columns country, population and language: 

| country | population | language |
|---------|------------|----------|
| PT | 10276617 | Portuguese |
| ES | 46937060 | Spanish |
| DE | 83019213 | German |


The ```dataframe.join(dataframe2,on=column,how="left"|"right"|"inner"|"outer")``` function allows combining two tables. By default, the two tables are combined using the index, i.e., a row with index **i** in dataframe is combined with the row with index **i** in dataframe2.

The ```dataframe.merge(dataframe2,left_on=column,right_on=column,how="left"|"right"|"inner"|"outer")``` function performs the same kind of combination as join, but allows specifying the columns used for combining in both dataframes.

In our example, we want to combine the row with country value **N** of the first table with the row with country value **N** in the second table. To this end, we can use ```on="country"``` to specify that we want to use the value of the column *country* in the first table. For dataframe2, the index will be used -- this requires us to change the index of dataframe2, using the ```dataframe2.set_index(col)``` function.

This example is coded in the following cells.

In [22]:
language = pd.DataFrame( { "country0": ["PT", "ES", "MX", "AR", "DE", "BR"] , \
                            "language": ["Portuguese", "Spanish", "Spanish", "Spanish", "German", "Portuguese"]})

print( language)



  country0    language
0       PT  Portuguese
1       ES     Spanish
2       MX     Spanish
3       AR     Spanish
4       DE      German
5       BR  Portuguese


In [23]:
countries1 = population1.join(language.set_index("country0"),on="country")

print(countries1)

  country  population    language
0      PT    10276617  Portuguese
1      ES    46937060     Spanish
2      DE    83019213      German


In [24]:
countries1merge = population1.merge(language,left_on="country",right_on="country0")

print(countries1merge)

  country  population country0    language
0      PT    10276617       PT  Portuguese
1      ES    46937060       ES     Spanish
2      DE    83019213       DE      German


Based on the population Dataframe computed before, compute the number of persons that speak each language.

In [36]:
## TODO Complete
langStats = countries1[["language", "population"]]

print(langStats)

     language  population
0  Portuguese    10276617
1     Spanish    46937060
2      German    83019213


### Join type : left (default)

The way join works varies depending on the join type.

In a left join, each row of *dataframe* is combined with all matching rows of *dataframe2*. If no row in the second dataframe matches the joining column of the first, then the value for the columns coming from *dataframe2* will be **NaN**.

To better illustrate this, we start by extending our language table to include one other language for Spain: Catalan.

| country | language |
|---------|----------|
| PT | Portuguese |
| ES | Spanish |
| ES | Catalan |
| MX | Spanish |
| AR | Spanish |
| DE | German |
| IT | Italian |
| BR | Portuguese |


In [37]:
languageExt = pd.DataFrame( { "country": ["PT", "ES", "ES", "MX", "AR", "DE", "IT", "BR"] , \
                            "language": ["Portuguese", "Spanish", "Catalan", "Spanish", "Spanish", "German", "Italian", "Portuguese"]})

print( languageExt)


  country    language
0      PT  Portuguese
1      ES     Spanish
2      ES     Catalan
3      MX     Spanish
4      AR     Spanish
5      DE      German
6      IT     Italian
7      BR  Portuguese


In [38]:
countries = population[["country","population"]].join(languageExt.set_index("country"),on="country")

print(countries)

  country  population    language
0      PT    10276617  Portuguese
1      ES    46937060     Spanish
1      ES    46937060     Catalan
2      DE    83019213      German
0      BR   211049519  Portuguese
1      MX   127575529     Spanish
2      UY     3461731         NaN


In [39]:
# merge with just "on" will use the column with the specified name in both Dataframes 
countries = population[["country","population"]].merge(languageExt,on="country")

print(countries)

  country  population    language
0      PT    10276617  Portuguese
1      ES    46937060     Spanish
2      ES    46937060     Catalan
3      DE    83019213      German
4      BR   211049519  Portuguese
5      MX   127575529     Spanish


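In the ```join``` result, the line for Uruguay (UY) has **NaN** in the language column. The ```merge``` result has no line for UY because ```merge``` performs an inner join by default, while ```join``` performs a left join by default.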
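
To obtain the left-join rows with ```merge``` as well, the join type can be passed explicitly with ```how="left"```. A minimal sketch, rebuilding the two tables inline (without the *capital* column) so that it runs on its own; the name `left_merged` is illustrative.

```python
import pandas as pd

population = pd.DataFrame({
    "country": ["PT", "ES", "DE", "BR", "MX", "UY"],
    "population": [10276617, 46937060, 83019213, 211049519, 127575529, 3461731],
})
languageExt = pd.DataFrame({
    "country": ["PT", "ES", "ES", "MX", "AR", "DE", "IT", "BR"],
    "language": ["Portuguese", "Spanish", "Catalan", "Spanish", "Spanish",
                 "German", "Italian", "Portuguese"],
})

# how="left" keeps every row of the left table, even without a match on the right
left_merged = population.merge(languageExt, on="country", how="left")
print(left_merged)
```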


### Join type : right

In a right join, each row of *dataframe2* is combined with all matching rows of the first *dataframe*. If no row in the first dataframe matches the joining column of the second, then the value for the columns coming from the first dataframe will be **NaN**.



In [40]:
countries = population.join(languageExt.set_index("country"),on="country",how="right")

print(countries)


    country   population      capital    language
0.0      PT   10276617.0          NaN  Portuguese
1.0      ES   46937060.0          NaN     Spanish
1.0      ES   46937060.0          NaN     Catalan
1.0      MX  127575529.0  Mexico City     Spanish
NaN      AR          NaN          NaN     Spanish
2.0      DE   83019213.0          NaN      German
NaN      IT          NaN          NaN     Italian
0.0      BR  211049519.0     Brasilia  Portuguese


In [41]:
# Note: without how="right", merge performs an inner join, so AR and IT do not appear below
countriesMerge = population.merge(languageExt,on="country")

print(countriesMerge)


  country  population      capital    language
0      PT    10276617          NaN  Portuguese
1      ES    46937060          NaN     Spanish
2      ES    46937060          NaN     Catalan
3      DE    83019213          NaN      German
4      BR   211049519     Brasilia  Portuguese
5      MX   127575529  Mexico City     Spanish
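
The ```merge``` call above does not pass ```how```, so it performs an inner join. A sketch of the right join with ```merge```, using ```how="right"``` and rebuilding the two tables inline (without the *capital* column) so that it runs on its own; the name `right_merged` is illustrative.

```python
import pandas as pd

population = pd.DataFrame({
    "country": ["PT", "ES", "DE", "BR", "MX", "UY"],
    "population": [10276617, 46937060, 83019213, 211049519, 127575529, 3461731],
})
languageExt = pd.DataFrame({
    "country": ["PT", "ES", "ES", "MX", "AR", "DE", "IT", "BR"],
    "language": ["Portuguese", "Spanish", "Catalan", "Spanish", "Spanish",
                 "German", "Italian", "Portuguese"],
})

# how="right" keeps every row of languageExt, even without a matching population row
right_merged = population.merge(languageExt, on="country", how="right")
print(right_merged)

# Countries that appear only in the language table have no population value
missing = right_merged[right_merged["population"].isna()]["country"].tolist()
print(missing)
```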


### Join type : inner

In an inner join, each row of *dataframe* is combined with all matching rows of the second *dataframe*. If no row in the second dataframe matches the joining column of the first, then the row will not be part of the result.



In [42]:
countries = population.join(languageExt.set_index("country"),on="country",how="inner")

print(countries)


  country  population      capital    language
0      PT    10276617          NaN  Portuguese
1      ES    46937060          NaN     Spanish
1      ES    46937060          NaN     Catalan
2      DE    83019213          NaN      German
0      BR   211049519     Brasilia  Portuguese
1      MX   127575529  Mexico City     Spanish


In [43]:
countries = population.merge(languageExt,on="country",how="inner")

print(countries)


  country  population      capital    language
0      PT    10276617          NaN  Portuguese
1      ES    46937060          NaN     Spanish
2      ES    46937060          NaN     Catalan
3      DE    83019213          NaN      German
4      BR   211049519     Brasilia  Portuguese
5      MX   127575529  Mexico City     Spanish


### Join type : outer

In an outer join, each row of *dataframe* is combined with all matching rows of the second *dataframe*. A row that has no match in the other dataframe still appears in the final result, with **NaN** in the columns coming from the other dataframe.



In [44]:
countries = population.join(languageExt.set_index("country"),on="country",how="outer")

print(countries)

    country   population      capital    language
0.0      PT   10276617.0          NaN  Portuguese
1.0      ES   46937060.0          NaN     Spanish
1.0      ES   46937060.0          NaN     Catalan
2.0      DE   83019213.0          NaN      German
0.0      BR  211049519.0     Brasilia  Portuguese
1.0      MX  127575529.0  Mexico City     Spanish
2.0      UY    3461731.0   Montevideu         NaN
NaN      AR          NaN          NaN     Spanish
NaN      IT          NaN          NaN     Italian


In [45]:
countries = population.merge(languageExt,on="country",how="outer")

print(countries)

  country   population      capital    language
0      PT   10276617.0          NaN  Portuguese
1      ES   46937060.0          NaN     Spanish
2      ES   46937060.0          NaN     Catalan
3      DE   83019213.0          NaN      German
4      BR  211049519.0     Brasilia  Portuguese
5      MX  127575529.0  Mexico City     Spanish
6      UY    3461731.0   Montevideu         NaN
7      AR          NaN          NaN     Spanish
8      IT          NaN          NaN     Italian
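
When inspecting an outer join, it can help to see which table each row came from. ```merge``` accepts ```indicator=True```, which adds a ```_merge``` column with the values ```left_only```, ```right_only``` or ```both```. A small sketch, with the tables rebuilt inline (without the *capital* column); the name `outer` is illustrative.

```python
import pandas as pd

population = pd.DataFrame({
    "country": ["PT", "ES", "DE", "BR", "MX", "UY"],
    "population": [10276617, 46937060, 83019213, 211049519, 127575529, 3461731],
})
languageExt = pd.DataFrame({
    "country": ["PT", "ES", "ES", "MX", "AR", "DE", "IT", "BR"],
    "language": ["Portuguese", "Spanish", "Catalan", "Spanish", "Spanish",
                 "German", "Italian", "Portuguese"],
})

# indicator=True adds a _merge column describing the origin of each row
outer = population.merge(languageExt, on="country", how="outer", indicator=True)
print(outer[["country", "language", "_merge"]])
```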


## Exercises


### Exercise 1

Based on the given data, compute the population that speaks Spanish.

In [51]:
countries = population[["country","population"]].join(languageExt.set_index("country"),on="country")
spanish_speaking = countries[countries["language"] == "Spanish"]
print(spanish_speaking["population"].sum())

174512589


### Exercise 2

For each language for which there is some known population, compute the population that speaks such language.

In [65]:
countries = population[["country","population"]].join(languageExt.set_index("country"),on="country")
reduced = countries[["language", "population"]].dropna().reset_index(drop=True)
print(reduced)

     language  population
0  Portuguese    10276617
1     Spanish    46937060
2     Catalan    46937060
3      German    83019213
4  Portuguese   211049519
5     Spanish   127575529
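
The table above still has one row per country and language. One possible way to finish the exercise is to group by *language* and sum the populations, sketched below with the table rebuilt inline. As in Exercise 1, this assumes that the whole population of a country speaks every language listed for it; the name `speakers` is illustrative.

```python
import pandas as pd

# The (language, population) pairs printed above
reduced = pd.DataFrame({
    "language": ["Portuguese", "Spanish", "Catalan", "German", "Portuguese", "Spanish"],
    "population": [10276617, 46937060, 46937060, 83019213, 211049519, 127575529],
})

# Many-to-one per group: sum the population of all rows sharing a language
speakers = reduced.groupby("language")["population"].sum()
print(speakers)
```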
