[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mosleh-exeter/BEM1025/blob/main/Lecture/04-Lecture04-Data-Assembly.ipynb)

# Session 4 Data Assembly

Content:
- Data aggregation (grouping)
- Concatenation
- Merging (joining) dataframes

In [82]:
# we import the library pandas and give it the "pd" nickname
import pandas as pd

In [83]:
# we use the pandas.read_csv() function to read the file "gapminder.tsv" stored in a remote location

# the remote location is: https://raw.githubusercontent.com/mosleh-exeter/BEM1025/main/data/

# with the argument sep='\t' we indicate that the columns are separated by tabs rather than commas

df = pd.read_csv('https://raw.githubusercontent.com/mosleh-exeter/BEM1025/main/data/gapminder.tsv', sep='\t')

In [84]:
# we show the first 5 rows
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


# Grouped and aggregated calculations

There are several initial questions that we can ask ourselves:
1. For each year in our data, what is the average life expectancy? What is the average life expectancy, population, and GDP?
2. What if we stratify the data by continent and perform the same calculations?
3. How many countries are listed in each continent?

These questions can be answered using the "groupby" operation in pandas.

**"A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups."**

Read more about groupby operation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
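
The split, apply, and combine steps can be made explicit on a small example. The sketch below uses a hypothetical four-row dataframe (not the gapminder file) and performs the three steps by hand before comparing them with the one-line groupby form.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'lifeExp': [30.0, 50.0, 40.0, 60.0],
})

# Split: iterate over one sub-dataframe per year
# Apply: compute the mean of lifeExp in each sub-dataframe
# Combine: collect the per-year means into a single Series
manual = pd.Series({year: group['lifeExp'].mean() for year, group in toy.groupby('year')})

# groupby performs the same three steps in one expression
grouped = toy.groupby('year')['lifeExp'].mean()

print(manual)
print(grouped)
```

Both results hold one mean per year, which is why the one-line form is usually preferred.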

## Group by a single column
### What is the average life expectancy for each year?

In [56]:
# the following command groups the data by the column "year", then extracts the column lifeExp and computes the mean
df.groupby('year')['lifeExp'].mean().reset_index()

Unnamed: 0,year,lifeExp
0,1952,49.05762
1,1957,51.507401
2,1962,53.609249
3,1967,55.67829
4,1972,57.647386
5,1977,59.570157
6,1982,61.533197
7,1987,63.212613
8,1992,64.160338
9,1997,65.014676


**Note:** We often call reset_index after groupby. A groupby operation uses the grouping columns as the index of the resulting dataframe. "reset_index" moves them back into ordinary columns and restores the default index, which starts from zero and is often more convenient for downstream analysis. If the DataFrame has a MultiIndex, this method can remove one or more levels.

Read more about reset_index here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
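
To see what reset_index changes, the following sketch compares the index before and after on a hypothetical toy dataframe. It also shows `as_index=False`, a groupby argument that should give the same flat layout in one step.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'year': [1952, 1952, 1957, 1957],
    'lifeExp': [30.0, 50.0, 40.0, 60.0],
})

# After groupby, the group labels (the years) become the index
by_year = toy.groupby('year')['lifeExp'].mean()
print(by_year.index.tolist())   # [1952, 1957]

# reset_index turns the years back into a column and restores a 0-based index
flat = by_year.reset_index()
print(flat.columns.tolist())    # ['year', 'lifeExp']
print(flat.index.tolist())      # [0, 1]

# Alternative: ask groupby not to use the group labels as the index
flat_alt = toy.groupby('year', as_index=False)['lifeExp'].mean()
print(flat_alt)
```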

### The following figure provides a visual representation of the operation we have just performed
Here, the groupby operation splits the dataset by year, computes the average value of lifeExp for each slice, and finally combines the averages into a single result.

<img src="https://www.dropbox.com/s/w5zq0kfm9rkx6q0/Generic-Groupby-mean-1.png?dl=1">

## Group by multiple columns
### What is the average life expectancy and income (GDP per capita) for each combination of year AND continent?

In [85]:
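# we group by year and continent, then compute the mean of lifeExp and gdpPercap; head() shows the first 5 rows
df.groupby(['year','continent'])[['lifeExp','gdpPercap']].mean().reset_index().head()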

Unnamed: 0,year,continent,lifeExp,gdpPercap
0,1952,Africa,39.1355,1252.572466
1,1952,Americas,53.27984,4079.062552
2,1952,Asia,46.314394,5195.484004
3,1952,Europe,64.4085,5661.057435
4,1952,Oceania,69.255,10298.08565
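
When different columns need different summaries (for example, a mean for life expectancy but a sum for population), one option is the `agg` method with named aggregations. The sketch below is an illustration on hypothetical toy data. The output column names `mean_lifeExp` and `total_pop` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'year': [1952, 1952, 1952, 1957],
    'continent': ['Asia', 'Asia', 'Europe', 'Asia'],
    'lifeExp': [30.0, 50.0, 60.0, 40.0],
    'pop': [100, 300, 200, 500],
})

# Each keyword names an output column and maps it to (input column, function)
summary = (
    toy.groupby(['year', 'continent'])
       .agg(mean_lifeExp=('lifeExp', 'mean'), total_pop=('pop', 'sum'))
       .reset_index()
)
print(summary)
```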


## Finding unique observations in each group
### How many unique countries have observations in each continent?

In [86]:
df.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


We use "nunique" to count the number of distinct elements in each group.

It returns a Series with the number of distinct values per group and ignores NaN values by default. Read more:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html
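
The difference between counting rows and counting distinct values matters here, because each country appears once per year in the data. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: Afghanistan appears twice, as it would for two different years
toy = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Asia', 'Europe'],
    'country': ['Afghanistan', 'Afghanistan', 'Japan', 'Albania'],
})

# count() counts rows per group, so repeated countries are counted repeatedly
rows_per_continent = toy.groupby('continent')['country'].count()

# nunique() counts each distinct country once per group
countries_per_continent = toy.groupby('continent')['country'].nunique()

print(rows_per_continent)       # Asia 3, Europe 1
print(countries_per_continent)  # Asia 2, Europe 1
```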

In [87]:
#  we group by continent then extract the country column and count unique occurrences
df.groupby('continent')['country'].nunique().reset_index()

Unnamed: 0,continent,country
0,Africa,52
1,Americas,25
2,Asia,33
3,Europe,30
4,Oceania,2


# Concatenation

### Concatenation is often used to stack two or more dataframes with similar structure/columns
### Check the documentation for the concat command

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

### The following figure provides a visual representation of the operation we want to perform

<img src="https://github.com/mosleh-exeter/BEM1025/raw/main/images/08_concat_row1.svg">



## Let's create separate dataframes for three countries and then concatenate them

In [88]:
# we keep only the rows where the country column equals "Afghanistan"
df_Afghanistan=df[df['country']=="Afghanistan"]
df_Afghanistan.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106


In [89]:
df_Albania=df[df['country']=="Albania"]
df_Albania.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
12,Albania,Europe,1952,55.23,1282697,1601.056136
13,Albania,Europe,1957,59.28,1476505,1942.284244
14,Albania,Europe,1962,64.82,1728137,2312.888958
15,Albania,Europe,1967,66.22,1984060,2760.196931
16,Albania,Europe,1972,67.69,2263554,3313.422188


In [90]:
df_Turkey=df[df['country']=="Turkey"]
df_Turkey.head()

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1572,Turkey,Europe,1952,43.585,22235677,1969.10098
1573,Turkey,Europe,1957,48.079,25670939,2218.754257
1574,Turkey,Europe,1962,52.098,29788695,2322.869908
1575,Turkey,Europe,1967,54.336,33411317,2826.356387
1576,Turkey,Europe,1972,57.005,37492953,3450.69638


## We want to combine the data of all the countries into a single dataframe

"concat" concatenates pandas objects along a particular axis.

We often use axis=0 to concatenate along rows. With axis=1, concat works along columns and creates a new dataframe by putting the given dataframes side by side.

Read more: https://pandas.pydata.org/docs/reference/api/pandas.concat.html
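
The effect of the axis argument is easiest to see on two very small dataframes. The sketch below uses hypothetical toy data. It also shows `ignore_index=True`, which renumbers the rows when the original index values are not needed.

```python
import pandas as pd

# Hypothetical toy dataframes with the same columns
a = pd.DataFrame({'country': ['Afghanistan', 'Albania'], 'year': [1952, 1952]})
b = pd.DataFrame({'country': ['Turkey'], 'year': [1952]})

# axis=0 stacks rows; each dataframe keeps its own index values
stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())      # [0, 1, 0]

# ignore_index=True replaces the index with 0, 1, 2, ...
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2]

# axis=1 places the dataframes side by side, aligning rows by index;
# rows missing from one dataframe are filled with NaN
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side.shape)          # (2, 4)
```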


In [91]:
# here axis=0 specifies that we want to concatenate the dataframes along rows, i.e., stack the rows of all dataframes on top of each other
# alternatively, axis=1 would place the dataframes side by side as additional columns
All_three_cities = pd.concat([df_Afghanistan,df_Albania,df_Turkey], axis=0)

**Note:** Here we use the groupby operation together with head to show the first 3 observations for each country

In [92]:
All_three_cities.groupby('country').head(3)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
12,Albania,Europe,1952,55.23,1282697,1601.056136
13,Albania,Europe,1957,59.28,1476505,1942.284244
14,Albania,Europe,1962,64.82,1728137,2312.888958
1572,Turkey,Europe,1952,43.585,22235677,1969.10098
1573,Turkey,Europe,1957,48.079,25670939,2218.754257
1574,Turkey,Europe,1962,52.098,29788695,2322.869908


# Merge
### The merge operation creates a new dataframe by matching values of columns from two given dataframes


    pd.merge(
        left=...,
        right=...,
        on=...,
        how=...,
    )
    
**left:** First dataframe to merge with.

**right:** Second dataframe to merge with.

**on:** Column name or a list of column names to join on. These must be found in both DataFrames. 

**how:** Type of merge to be performed.

"right": use all keys from the right dataframe and only the keys from the left for which there is a match

"left": use all keys from the left dataframe and only the keys from the right for which there is a match

"outer": use the union of keys from both dataframes

"inner": use the intersection of keys from both dataframes, i.e., only rows whose keys are matched in both dataframes. This is the default.
    
Read more: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
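
The four values of `how` can be compared on two hypothetical toy dataframes whose keys only partly overlap. This is a minimal sketch. The column names `key`, `left_val`, and `right_val` are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: keys 'b' and 'c' appear in both dataframes
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Merge once per join type and record which keys survive
keys_by_how = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(left=left, right=right, on='key', how=how)
    keys_by_how[how] = merged['key'].tolist()
    print(how, keys_by_how[how])
```

The left and right merges keep three keys each, the inner merge keeps only the two shared keys, and the outer merge keeps all four. Values with no match on the other side are filled with NaN.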

### The following figure provides a visual representation of join operation
    
<img src="https://pandas.pydata.org/pandas-docs/stable/_images/08_merge_left.svg">



    
    

### What is the population density of each continent in each year? (join on a single column)

To answer this question and have all required data in one dataframe:

First, we need the land size of each continent (given below).

Next, we find the population of each continent in each year by grouping and summing.

Finally, we merge the population dataframe with the land size dataframe and compute the population density by dividing population by land size for each continent.

In [93]:
df_continent_land_size=pd.DataFrame({
                        'continent':['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'],
                        'land_size':[31033131,22134900,29648481,38791112,8486460]})


In [94]:
# we group by continent and year, then sum the population of all countries in each group
df_continent_pop=df.groupby(['continent','year'])[['pop']].sum().reset_index()
df_continent_pop.head()

Unnamed: 0,continent,year,pop
0,Africa,1952,237640501
1,Africa,1957,264837738
2,Africa,1962,296516865
3,Africa,1967,335289489
4,Africa,1972,379879541


In [95]:
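# we merge the population and land size dataframes on the shared "continent" column
df_continent_landsize=pd.merge(df_continent_pop,df_continent_land_size,on='continent')
# population density = population / land size
df_continent_landsize['pop_density']=df_continent_landsize['pop']/df_continent_landsize['land_size']
df_continent_landsize.head()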

Unnamed: 0,continent,year,pop,land_size,pop_density
0,Africa,1952,237640501,29648481,8.015267
1,Africa,1957,264837738,29648481,8.93259
2,Africa,1962,296516865,29648481,10.001081
3,Africa,1967,335289489,29648481,11.308825
4,Africa,1972,379879541,29648481,12.812783
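
In this merge, many rows of the population dataframe (one per year) match a single row of the land size dataframe. The sketch below reproduces that pattern on hypothetical toy numbers and uses the `validate` argument of `pd.merge`, which raises an error if the right dataframe unexpectedly contains duplicate keys.

```python
import pandas as pd

# Hypothetical toy data: several years per continent on the left,
# exactly one land size per continent on the right
pop = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Asia'],
    'year': [1952, 1957, 1952],
    'pop': [200, 300, 1000],
})
land = pd.DataFrame({'continent': ['Africa', 'Asia'], 'land_size': [100, 500]})

# validate='many_to_one' checks that each continent appears only once in `land`
density = pd.merge(pop, land, on='continent', how='left', validate='many_to_one')

# The land size is repeated for every year of the same continent
density['pop_density'] = density['pop'] / density['land_size']
print(density)
```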


### What is the total food supply per day for each country in each year? (join on multiple columns)

To address this question:
    
First, we need data on the food supply per capita (loaded below).

Next, we merge the food supply dataframe with the gapminder dataframe on both country and year.

Finally, we multiply the population by the food supply per capita to get the total food supply.

In [98]:
df_food_supply=pd.read_csv('https://raw.githubusercontent.com/mosleh-exeter/BEM1025/main/data/global_food.csv')
df_food_supply.head()

Unnamed: 0,country,year,food_supply_per_capita
0,Afghanistan,1961,3054.9053
1,Afghanistan,1962,2973.2468
2,Afghanistan,1963,2751.7795
3,Afghanistan,1964,3013.4424
4,Afghanistan,1965,3017.76


In [99]:
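# we merge on both country and year; the default inner join keeps only the country-year pairs present in both dataframes
df_country_food=pd.merge(df,df_food_supply,on=['country','year'])
# total food supply = population * food supply per capita
df_country_food['total_food_perday']=df_country_food['pop']*df_country_food['food_supply_per_capita']
df_country_food.head()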

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,food_supply_per_capita,total_food_perday
0,Afghanistan,Asia,1962,31.997,10267083,853.10071,2973.2468,30526570000.0
1,Afghanistan,Asia,1967,34.02,11537966,836.197138,3033.9111,35005160000.0
2,Afghanistan,Asia,1972,36.088,13079460,739.981106,2742.1306,35865590000.0
3,Afghanistan,Asia,1977,38.438,14880372,786.11336,2557.0142,38049320000.0
4,Afghanistan,Asia,1982,39.854,12881816,978.011439,3105.8108,40008480000.0
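
Note that the merged output starts in 1962: the gapminder years 1952 and 1957 have no counterpart in the food supply data, so the default inner join drops them. One way to inspect such unmatched rows is the `indicator` argument of `pd.merge`, sketched here on hypothetical toy data that mimics the mismatch in years.

```python
import pandas as pd

# Hypothetical toy data: the two dataframes share only the year 1962
gap = pd.DataFrame({
    'country': ['Afghanistan'] * 3,
    'year': [1952, 1957, 1962],
    'pop': [8, 9, 10],
})
food = pd.DataFrame({
    'country': ['Afghanistan'] * 3,
    'year': [1961, 1962, 1963],
    'food_supply_per_capita': [3000.0, 2900.0, 2700.0],
})

# Default (inner) merge: only the matching country-year pair survives
inner = pd.merge(gap, food, on=['country', 'year'])
print(inner['year'].tolist())      # [1962]

# A left merge with indicator=True keeps every gapminder row and adds a
# '_merge' column saying whether a match was found
check = pd.merge(gap, food, on=['country', 'year'], how='left', indicator=True)
print(check[['year', '_merge']])
```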
