# Advanced Data Analysis - week 1, lecture 2, examples

In the advanced data analysis course, we assume basic knowledge of Python, as could be acquired by attending the *Introduction to Programming* bridging course.

This notebook includes the examples and exercises on combining Dataframes presented in **Week 1** lecture no. 2. 


In [3]:
import pandas as pd

## Functions over multiple DataFrames

Often, data is spread across multiple tables (DataFrames). To process it, we need operations that work over several tables.

We now introduce some of the operations that pandas provides for combining multiple tables.


### Appending tables

Sometimes the data we want to process is split across two DataFrames, e.g. because it was read from different files.

Consider the following tables:

| country | population |
|---------|------------|
| PT | 10276617 |
| ES | 46937060 |
| DE | 83019213 |

and 

| capital | population | country |
|---------|------------|---------|
| Brasilia | 211049519 | BR |
| Mexico City | 127575529 | MX |
| Montevideu | 3461731 | UY |

```pd.concat([dataframe,dataframe2])``` creates a new table containing the rows of dataframe followed by the rows of dataframe2, aligning the values by column name. If a column does not exist in one of the tables, the rows coming from that table have **NaN** in that column.

| country | population | capital |
|---------|------------|---------|
| PT | 10276617 | NaN |
| ES | 46937060 | NaN |
| DE | 83019213 | NaN |
| BR | 211049519 | Brasilia |
| MX | 127575529 | Mexico City |
| UY | 3461731 | Montevideu |

The following code runs this example.


In [4]:
population1 = pd.DataFrame( { "country": ["PT", "ES", "DE"] , \
                            "population": [10276617, 46937060, 83019213]})

print( population1)



  country  population
0      PT    10276617
1      ES    46937060
2      DE    83019213


In [5]:
population2 = pd.DataFrame( {"capital" : ["Brasilia", "Mexico City", "Montevideu"],\
                            "population": [211049519, 127575529, 3461731], \
                            "country": ["BR", "MX", "UY"]})

print( population2)



       capital  population country
0     Brasilia   211049519      BR
1  Mexico City   127575529      MX
2   Montevideu     3461731      UY


In [6]:
population = pd.concat([population1,population2]) 
print( population)

  country  population      capital
0      PT    10276617          NaN
1      ES    46937060          NaN
2      DE    83019213          NaN
0      BR   211049519     Brasilia
1      MX   127575529  Mexico City
2      UY     3461731   Montevideu
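
Note that the concatenated table keeps the row labels of the original tables, so the labels 0, 1 and 2 each appear twice. As a minimal sketch, assuming a fresh numbering is wanted, the ```ignore_index=True``` parameter of ```pd.concat``` renumbers the rows; the name ```population_reindexed``` is only illustrative.

```python
import pandas as pd

# Same data as the two population tables above.
population1 = pd.DataFrame({"country": ["PT", "ES", "DE"],
                            "population": [10276617, 46937060, 83019213]})
population2 = pd.DataFrame({"capital": ["Brasilia", "Mexico City", "Montevideu"],
                            "population": [211049519, 127575529, 3461731],
                            "country": ["BR", "MX", "UY"]})

# ignore_index=True discards the original row labels and numbers the rows 0..n-1.
population_reindexed = pd.concat([population1, population2], ignore_index=True)

print(population_reindexed)
```

The rest of this notebook keeps the original labels, so the outputs below still show repeated index values.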


### Joining tables

More interestingly, we might want to combine the columns of two or more tables into a new table.

Consider that we have the following two tables. The first table has a list of countries and their population.

| country | population |
|---------|------------|
| PT | 10276617 |
| ES | 46937060 |
| DE | 83019213 |

The second table has the language spoken in each country.

| country | language |
|---------|----------|
| PT | Portuguese |
| ES | Spanish |
| MX | Spanish |
| AR | Spanish |
| DE | German |
| IT | Italian |
| BR | Portuguese |


If we want to compute the number of people that speak each language, it is useful to have a single table with the country, population and language columns. To this end, we need to combine the two previous tables (this can also be seen as extending the first table with the values of the second table).

What we want to achieve is the following table, with columns country, population and language: 

| country | population | language |
|---------|------------|----------|
| PT | 10276617 | Portuguese |
| ES | 46937060 | Spanish |
| DE | 83019213 | German |


The ```dataframe.join(dataframe2,on=column,how="left"|"right"|"inner"|"outer")``` function combines two tables. By default, the two tables are combined using the index, i.e., a row with index **i** in dataframe is combined with the row with index **i** in dataframe2.
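
As a small sketch of the default, index-based behaviour: the two tables below are hypothetical and use the country code as row label. Rows are matched by label, not by position.

```python
import pandas as pd

# Hypothetical tables whose index holds the country code.
left = pd.DataFrame({"population": [10276617, 46937060]}, index=["PT", "ES"])
right = pd.DataFrame({"language": ["Spanish", "Portuguese"]}, index=["ES", "PT"])

# Without on=..., join matches rows by index label, so the different
# row order of the second table does not matter.
joined = left.join(right)

print(joined)
```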

The ```dataframe.merge(dataframe2,left_on=column,right_on=column,how="left"|"right"|"inner"|"outer")``` function also combines two tables, but lets us specify the column used for matching in each dataframe. Unlike join, merge uses ```how="inner"``` by default.

In our example, we want to combine the row with country value **N** of the first table with the row with country value **N** in the second table. To this end, we can use ```on="country"``` to specify that we want to use the value of the column *country* in the first table. For dataframe2, the index will be used, which requires us to change the index of dataframe2 using the ```dataframe2.set_index(col)``` function.

This example is coded in the following cells.

In [7]:
language = pd.DataFrame( { "country0": ["PT", "ES", "MX", "AR", "DE", "BR"] , \
                            "language": ["Portuguese", "Spanish", "Spanish", "Spanish", "German", "Portuguese"]})

print( language)



  country0    language
0       PT  Portuguese
1       ES     Spanish
2       MX     Spanish
3       AR     Spanish
4       DE      German
5       BR  Portuguese


In [8]:
countries1 = population1.join(language.set_index("country0"),on="country")

print(countries1)

  country  population    language
0      PT    10276617  Portuguese
1      ES    46937060     Spanish
2      DE    83019213      German


In [9]:
countries1merge = population1.merge(language,left_on="country",right_on="country0")

print(countries1merge)

  country  population country0    language
0      PT    10276617       PT  Portuguese
1      ES    46937060       ES     Spanish
2      DE    83019213       DE      German
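
The merge result above keeps both key columns (*country* and *country0*). One possible way to avoid the redundant column, sketched here under the assumption that renaming the key is acceptable, is to rename it before merging so that a single ```on``` column can be used. The name ```countries1_single_key``` is only illustrative.

```python
import pandas as pd

# Same data as population1 and language above.
population1 = pd.DataFrame({"country": ["PT", "ES", "DE"],
                            "population": [10276617, 46937060, 83019213]})
language = pd.DataFrame({"country0": ["PT", "ES", "MX", "AR", "DE", "BR"],
                         "language": ["Portuguese", "Spanish", "Spanish",
                                      "Spanish", "German", "Portuguese"]})

# Renaming the key column first lets merge use one shared key, so the
# result has a single "country" column instead of "country" and "country0".
countries1_single_key = population1.merge(
    language.rename(columns={"country0": "country"}), on="country")

print(countries1_single_key)
```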


Based on the ```countries1``` DataFrame computed above, compute the number of people that speak each language.

In [10]:
# Sum the population of the countries that share each language
langStats = countries1.groupby("language")["population"].sum()

print(langStats)

language
German        83019213
Portuguese    10276617
Spanish       46937060
Name: population, dtype: int64


### Join type : left (default)

The way join works varies depending on the value of the ```how``` parameter.

In a left join, each row of *dataframe* is combined with every matching row of *dataframe2*. If no row in the second dataframe matches the joining column of the first, the columns coming from *dataframe2* are filled with **NaN**.

To illustrate this better, we start by extending our language table to include another language for Spain: Catalan.

| country | language |
|---------|----------|
| PT | Portuguese |
| ES | Spanish |
| ES | Catalan |
| MX | Spanish |
| AR | Spanish |
| DE | German |
| IT | Italian |
| BR | Portuguese |


In [11]:
languageExt = pd.DataFrame( { "country": ["PT", "ES", "ES", "MX", "AR", "DE", "IT", "BR"] , \
                            "language": ["Portuguese", "Spanish", "Catalan", "Spanish", "Spanish", "German", "Italian", "Portuguese"]})

print( languageExt)


  country    language
0      PT  Portuguese
1      ES     Spanish
2      ES     Catalan
3      MX     Spanish
4      AR     Spanish
5      DE      German
6      IT     Italian
7      BR  Portuguese


In [12]:
countries = population[["country","population"]].join(languageExt.set_index("country"),on="country")

print(countries)

  country  population    language
0      PT    10276617  Portuguese
1      ES    46937060     Spanish
1      ES    46937060     Catalan
2      DE    83019213      German
0      BR   211049519  Portuguese
1      MX   127575529     Spanish
2      UY     3461731         NaN


In [13]:
# merge with just "on" uses the column with that name in both DataFrames;
# unlike join, merge defaults to how="inner"
countries = population[["country","population"]].merge(languageExt,on="country")

print(countries)

  country  population    language
0      PT    10276617  Portuguese
1      ES    46937060     Spanish
2      ES    46937060     Catalan
3      DE    83019213      German
4      BR   211049519  Portuguese
5      MX   127575529     Spanish


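In the join result, the row for Uruguay (UY) has **NaN** in the language column, since no row of the language table matches it. The merge result has no row for UY.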
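
Since merge performs an inner join unless told otherwise, a left join with merge needs an explicit ```how="left"```. The sketch below rebuilds the country and population columns directly (an assumption made only to keep the example self-contained); the name ```countries_left``` is illustrative.

```python
import pandas as pd

# Country and population columns of the concatenated table, rebuilt directly.
population = pd.DataFrame({"country": ["PT", "ES", "DE", "BR", "MX", "UY"],
                           "population": [10276617, 46937060, 83019213,
                                          211049519, 127575529, 3461731]})
languageExt = pd.DataFrame({"country": ["PT", "ES", "ES", "MX", "AR", "DE", "IT", "BR"],
                            "language": ["Portuguese", "Spanish", "Catalan", "Spanish",
                                         "Spanish", "German", "Italian", "Portuguese"]})

# how="left" keeps every row of the first table, even without a match.
countries_left = population.merge(languageExt, on="country", how="left")

print(countries_left)
```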


### Join type : right

In a right join, each row of *dataframe2* is combined with every matching row of the first *dataframe*. If no row in the first dataframe matches the joining column of the second, the columns coming from the first dataframe are filled with **NaN**.



In [14]:
countries = population.join(languageExt.set_index("country"),on="country",how="right")

print(countries)


    country   population      capital    language
0.0      PT   10276617.0          NaN  Portuguese
1.0      ES   46937060.0          NaN     Spanish
1.0      ES   46937060.0          NaN     Catalan
1.0      MX  127575529.0  Mexico City     Spanish
NaN      AR          NaN          NaN     Spanish
2.0      DE   83019213.0          NaN      German
NaN      IT          NaN          NaN     Italian
0.0      BR  211049519.0     Brasilia  Portuguese


In [15]:
# Note: without how="right", merge performs an inner join, so AR and IT do not appear
countriesMerge = population.merge(languageExt,on="country")

print(countriesMerge)


  country  population      capital    language
0      PT    10276617          NaN  Portuguese
1      ES    46937060          NaN     Spanish
2      ES    46937060          NaN     Catalan
3      DE    83019213          NaN      German
4      BR   211049519     Brasilia  Portuguese
5      MX   127575529  Mexico City     Spanish
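
To obtain a right join with merge, ```how="right"``` has to be passed explicitly. A minimal sketch follows, assuming only the country and population columns are needed (the table is rebuilt directly to keep the example self-contained); ```countries_right``` is an illustrative name.

```python
import pandas as pd

# Country and population columns of the concatenated table, rebuilt directly.
population = pd.DataFrame({"country": ["PT", "ES", "DE", "BR", "MX", "UY"],
                           "population": [10276617, 46937060, 83019213,
                                          211049519, 127575529, 3461731]})
languageExt = pd.DataFrame({"country": ["PT", "ES", "ES", "MX", "AR", "DE", "IT", "BR"],
                            "language": ["Portuguese", "Spanish", "Catalan", "Spanish",
                                         "Spanish", "German", "Italian", "Portuguese"]})

# how="right" keeps every row of languageExt; countries that are missing
# from the population table (AR, IT) get NaN in the population column.
countries_right = population.merge(languageExt, on="country", how="right")

print(countries_right)
```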


### Join type : inner

In an inner join, each row of *dataframe* is combined with every matching row of the second *dataframe*. If no row in the second dataframe matches the joining column of the first, the row is not part of the result. Likewise, rows of the second dataframe with no match in the first are dropped.



In [16]:
countries = population.join(languageExt.set_index("country"),on="country",how="inner")

print(countries)


  country  population      capital    language
0      PT    10276617          NaN  Portuguese
1      ES    46937060          NaN     Spanish
1      ES    46937060          NaN     Catalan
2      DE    83019213          NaN      German
0      BR   211049519     Brasilia  Portuguese
1      MX   127575529  Mexico City     Spanish


In [17]:
countries = population.merge(languageExt,on="country",how="inner")

print(countries)


  country  population      capital    language
0      PT    10276617          NaN  Portuguese
1      ES    46937060          NaN     Spanish
2      ES    46937060          NaN     Catalan
3      DE    83019213          NaN      German
4      BR   211049519     Brasilia  Portuguese
5      MX   127575529  Mexico City     Spanish


### Join type : outer

In an outer join, each row of *dataframe* is combined with every matching row of the second *dataframe*. Rows that have no match in the other dataframe still appear in the result, with **NaN** in the columns coming from the other dataframe.



In [18]:
countries = population.join(languageExt.set_index("country"),on="country",how="outer")

print(countries)

    country   population      capital    language
0.0      PT   10276617.0          NaN  Portuguese
1.0      ES   46937060.0          NaN     Spanish
1.0      ES   46937060.0          NaN     Catalan
2.0      DE   83019213.0          NaN      German
0.0      BR  211049519.0     Brasilia  Portuguese
1.0      MX  127575529.0  Mexico City     Spanish
2.0      UY    3461731.0   Montevideu         NaN
NaN      AR          NaN          NaN     Spanish
NaN      IT          NaN          NaN     Italian


In [19]:
countries = population.merge(languageExt,on="country",how="outer")

print(countries)

  country   population      capital    language
0      PT   10276617.0          NaN  Portuguese
1      ES   46937060.0          NaN     Spanish
2      ES   46937060.0          NaN     Catalan
3      DE   83019213.0          NaN      German
4      BR  211049519.0     Brasilia  Portuguese
5      MX  127575529.0  Mexico City     Spanish
6      UY    3461731.0   Montevideu         NaN
7      AR          NaN          NaN     Spanish
8      IT          NaN          NaN     Italian
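
When inspecting an outer join, it can help to see which table each row came from. As a sketch, the ```indicator=True``` parameter of merge adds a ```_merge``` column with the values ```left_only```, ```right_only``` or ```both```. The population table is rebuilt directly here (an assumption for self-containment) and ```countries_flagged``` is an illustrative name.

```python
import pandas as pd

# Country and population columns of the concatenated table, rebuilt directly.
population = pd.DataFrame({"country": ["PT", "ES", "DE", "BR", "MX", "UY"],
                           "population": [10276617, 46937060, 83019213,
                                          211049519, 127575529, 3461731]})
languageExt = pd.DataFrame({"country": ["PT", "ES", "ES", "MX", "AR", "DE", "IT", "BR"],
                            "language": ["Portuguese", "Spanish", "Catalan", "Spanish",
                                         "Spanish", "German", "Italian", "Portuguese"]})

# indicator=True records the origin of each row in a "_merge" column.
countries_flagged = population.merge(languageExt, on="country",
                                     how="outer", indicator=True)

print(countries_flagged[["country", "language", "_merge"]])
```

The order of the rows may differ between pandas versions, so the sketch does not rely on it.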


## Exercises


### Exercise 1

Based on the given data, compute the population that speaks Spanish.

In [20]:
print(countries.loc[countries["language"] == "Spanish", "population"].sum())

174512589.0


### Exercise 2

For each language with some known population, compute the population that speaks that language.

In [23]:
print(countries.loc[countries["population"] > 0].groupby("language")["population"].sum())

language
Catalan        46937060.0
German         83019213.0
Portuguese    221326136.0
Spanish       174512589.0
Name: population, dtype: float64
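
An alternative sketch for Exercise 2, assuming that "known population" means a non-missing value: ```dropna(subset=["population"])``` removes the rows without population before grouping. The tables are rebuilt directly to keep the example self-contained, and ```known``` and ```langStatsKnown``` are illustrative names.

```python
import pandas as pd

# Country and population columns of the concatenated table, rebuilt directly.
population = pd.DataFrame({"country": ["PT", "ES", "DE", "BR", "MX", "UY"],
                           "population": [10276617, 46937060, 83019213,
                                          211049519, 127575529, 3461731]})
languageExt = pd.DataFrame({"country": ["PT", "ES", "ES", "MX", "AR", "DE", "IT", "BR"],
                            "language": ["Portuguese", "Spanish", "Catalan", "Spanish",
                                         "Spanish", "German", "Italian", "Portuguese"]})

countries = population.merge(languageExt, on="country", how="outer")

# Drop the rows with unknown population (AR, IT), then sum per language.
known = countries.dropna(subset=["population"])
langStatsKnown = known.groupby("language")["population"].sum()

print(langStatsKnown)
```

Filtering first matters because Italian appears only in a row with unknown population; without the filter it would still show up as a group.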
