[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mosleh-exeter/BEM1025/blob/main/Tutorial/04-Tutorial04-Data-Assembly-practice-solution.ipynb)

# Tutorial 04. Data Assembly in Pandas

Content:

1. Concatenating data
2. Advanced merging of data sets
3. Grouping

In [20]:
import pandas as pd

We will work with an air quality dataset for this exercise.

Reading NO2 measurement data for the measurement stations

In [21]:
## The air_quality_no2_long.csv data set provides NO2 values for the measurement stations FR04014, BETR801 and London Westminster, located in Paris, Antwerp and London respectively.
air_quality_no2 = pd.read_csv('https://www.dropbox.com/s/70230oct6p0ovnv/air_quality_no2_long.csv?dl=1', parse_dates=True)

In [22]:
air_quality_no2.head()

Unnamed: 0,city,country,date.utc,location,parameter,value,unit
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³
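
A note on `parse_dates`: passing `parse_dates=True` only asks pandas to try parsing the index, so with the default integer index the `date.utc` column is left as text. The sketch below is a minimal illustration, assuming a small in-memory sample (three rows copied from the output above) instead of the downloaded file; the names `csv_text`, `raw` and `sample` are made up for this example.

```python
import io

import pandas as pd

# Three rows copied from the head() output above, held in memory
# so that the example does not depend on the download link.
csv_text = (
    "city,country,date.utc,location,parameter,value,unit\n"
    "Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³\n"
    "Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³\n"
    "Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³\n"
)

# parse_dates=True targets the index only; date.utc is not converted.
raw = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# Naming the column explicitly converts it to timestamps.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])

print(raw["date.utc"].dtype)
print(sample["date.utc"].dtype)

# Datetime accessors such as .dt.hour are only available after parsing.
print(sample["date.utc"].dt.hour.tolist())
```

Parsing the column is optional for the tasks below, which only use `location`, `parameter`, `city` and `value`.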


Reading PM2.5 measurement data for the measurement stations

In [23]:
## The air_quality_pm25_long.csv data set provides PM2.5 values for the measurement stations FR04014, BETR801 and London Westminster, located in Paris, Antwerp and London respectively.

air_quality_pm25 = pd.read_csv('https://www.dropbox.com/s/d0ef5l5rm95fkdx/air_quality_pm25_long.csv?dl=1', parse_dates=True)

In [24]:
air_quality_pm25.head()

Unnamed: 0,city,country,date.utc,location,parameter,value,unit
0,Antwerpen,BE,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,µg/m³
1,Antwerpen,BE,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,µg/m³
2,Antwerpen,BE,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,µg/m³
3,Antwerpen,BE,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0,µg/m³
4,Antwerpen,BE,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5,µg/m³


Reading station coordinate data

In [34]:
## The coordinates of the air quality measurement stations are stored in the data file air_quality_stations.csv

stations_coord = pd.read_csv("https://www.dropbox.com/s/1wd3n5m1chg1b1k/air_quality_stations.csv?dl=1")
# Keep one row per station so that each measurement matches at most one set of coordinates when merging
stations_coord = stations_coord.drop_duplicates(subset='location')

In [35]:
stations_coord.head()

Unnamed: 0,location,coordinates.latitude,coordinates.longitude
0,BELAL01,51.23619,4.38522
1,BELHB23,51.1703,4.341
2,BELLD01,51.10998,5.00486
3,BELLD02,51.12038,5.02155
4,BELR833,51.32766,4.36226
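
The `drop_duplicates` call matters for the merges later on: if a station appears more than once in the lookup table, every matching measurement is repeated once per duplicate. The sketch below illustrates this with a small made-up lookup table (the duplicated `BETR801` row is an assumption for illustration, not a statement about the real file) and uses the `validate` argument of `pd.merge` as an optional safety check.

```python
import pandas as pd

measurements = pd.DataFrame({
    "location": ["BETR801", "BETR801", "FR04014"],
    "value": [18.0, 6.5, 20.0],
})

# Hypothetical lookup table in which BETR801 is listed twice.
stations = pd.DataFrame({
    "location": ["BETR801", "BETR801", "FR04014"],
    "coordinates.latitude": [51.20966, 51.20966, 48.83724],
    "coordinates.longitude": [4.43182, 4.43182, 2.39390],
})

# Each BETR801 measurement matches two station rows, so rows are repeated.
inflated = pd.merge(measurements, stations, on="location", how="left")

# After removing duplicate stations, the row count is preserved.
deduplicated = stations.drop_duplicates(subset="location")
clean = pd.merge(measurements, deduplicated, on="location",
                 how="left", validate="many_to_one")

print(len(measurements), len(inflated), len(clean))

# validate="many_to_one" raises if the right-hand keys are not unique.
try:
    pd.merge(measurements, stations, on="location", validate="many_to_one")
except pd.errors.MergeError as err:
    print("duplicate keys detected:", err)
```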


Reading descriptions of the air quality parameters

In [36]:
## The air quality parameter metadata are stored in the data file air_quality_parameters.csv
air_quality_parameters = pd.read_csv("https://www.dropbox.com/s/qnp2myzjbukpbgj/air_quality_parameters.csv?dl=1")
# Rename 'id' to 'parameter' so the key column has the same name as in the measurement dataframes
air_quality_parameters = air_quality_parameters.rename(columns={'id': 'parameter'})

In [37]:
air_quality_parameters

Unnamed: 0,parameter,description,name
0,bc,Black Carbon,BC
1,co,Carbon Monoxide,CO
2,no2,Nitrogen Dioxide,NO2
3,o3,Ozone,O3
4,pm10,Particulate matter less than 10 micrometers in...,PM10
5,pm25,Particulate matter less than 2.5 micrometers i...,PM2.5
6,so2,Sulfur Dioxide,SO2
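
Renaming `id` to `parameter` is one way to line up the key columns before a merge. As a hedged alternative, `pd.merge` also accepts `left_on` and `right_on` when the key columns have different names. The sketch below compares the two on made-up toy frames (`measurements` and `metadata` are hypothetical names, and the `o3` value is invented for illustration).

```python
import pandas as pd

# Toy measurements; the o3 row is invented for this example.
measurements = pd.DataFrame({
    "parameter": ["no2", "o3"],
    "value": [20.0, 41.0],
})

# Metadata with the original column name 'id', as in the CSV file.
metadata = pd.DataFrame({
    "id": ["no2", "o3"],
    "description": ["Nitrogen Dioxide", "Ozone"],
    "name": ["NO2", "O3"],
})

# Option 1: rename the key column, then merge on the shared name.
via_rename = pd.merge(
    measurements, metadata.rename(columns={"id": "parameter"}), on="parameter"
)

# Option 2: keep the names and tell merge which column to use on each side.
via_keys = pd.merge(measurements, metadata, left_on="parameter", right_on="id")

print(via_rename.columns.tolist())
print(via_keys.columns.tolist())
```

With `left_on`/`right_on`, both key columns are kept in the result, which is why renaming first gives a tidier dataframe here.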


----

**Q1.** Combine the NO2 measurements "air_quality_no2" and the PM2.5 measurements "air_quality_pm25" so that the outcome is a dataframe with the same structure as the input dataframes and contains all rows from both. Call the outcome dataframe "air_quality".

In [38]:
# ignore_index=True discards the existing row labels, so the result is indexed 0, ..., n-1.
air_quality = pd.concat([air_quality_pm25, air_quality_no2], ignore_index=True)

# The number of rows in the result equals the sum of the rows of the two inputs.
print(air_quality.shape, air_quality_pm25.shape, air_quality_no2.shape)

(3178, 7) (1110, 7) (2068, 7)
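
To see what `ignore_index=True` changes, the following sketch concatenates two small made-up frames (`pm25` and `no2` are toy stand-ins, not the real data) in three ways: keeping the original labels, renumbering them, and tagging each block with `keys`.

```python
import pandas as pd

# Toy stand-ins for the two measurement tables.
pm25 = pd.DataFrame({
    "location": ["BETR801", "BETR801"],
    "parameter": ["pm25", "pm25"],
    "value": [18.0, 6.5],
})
no2 = pd.DataFrame({
    "location": ["FR04014", "FR04014", "FR04014"],
    "parameter": ["no2", "no2", "no2"],
    "value": [20.0, 21.8, 26.5],
})

# Default: the original row labels are kept, so labels repeat.
kept = pd.concat([pm25, no2])

# ignore_index=True: rows are relabelled 0, ..., n-1.
renumbered = pd.concat([pm25, no2], ignore_index=True)

# keys=...: adds an outer index level recording where each row came from.
labelled = pd.concat([pm25, no2], keys=["PM25", "NO2"])

print(kept.index.tolist())
print(renumbered.index.tolist())
print(labelled.loc["NO2"])
```

Because the `parameter` column already records the source of each row in this dataset, renumbering with `ignore_index=True` loses no information.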


In [39]:
air_quality

Unnamed: 0,city,country,date.utc,location,parameter,value,unit
0,Antwerpen,BE,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,µg/m³
1,Antwerpen,BE,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,µg/m³
2,Antwerpen,BE,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,µg/m³
3,Antwerpen,BE,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0,µg/m³
4,Antwerpen,BE,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5,µg/m³
...,...,...,...,...,...,...,...
3173,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³
3174,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³
3175,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³
3176,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³


**Q2.** Add the station coordinates "stations_coord" to the corresponding rows in the measurements dataframe "air_quality". The outcome should be a dataframe that contains both the measurement data and the station coordinates. Compare the outcomes when the `how` parameter of the merge operation is 'left' and when it is 'right'.


In [40]:
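# Without a how argument, pd.merge performs an inner join: only locations present in both dataframes are kept.
air_quality_with_station_cooord = pd.merge(air_quality, stations_coord, on='location')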
print(air_quality_with_station_cooord.shape)
air_quality_with_station_cooord

(3178, 9)


Unnamed: 0,city,country,date.utc,location,parameter,value,unit,coordinates.latitude,coordinates.longitude
0,Antwerpen,BE,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,µg/m³,51.20966,4.43182
1,Antwerpen,BE,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,µg/m³,51.20966,4.43182
2,Antwerpen,BE,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,µg/m³,51.20966,4.43182
3,Antwerpen,BE,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0,µg/m³,51.20966,4.43182
4,Antwerpen,BE,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5,µg/m³,51.20966,4.43182
...,...,...,...,...,...,...,...,...,...
3173,Paris,FR,2019-05-07 05:00:00+00:00,FR04014,no2,72.4,µg/m³,48.83724,2.39390
3174,Paris,FR,2019-05-07 04:00:00+00:00,FR04014,no2,61.9,µg/m³,48.83724,2.39390
3175,Paris,FR,2019-05-07 03:00:00+00:00,FR04014,no2,50.4,µg/m³,48.83724,2.39390
3176,Paris,FR,2019-05-07 02:00:00+00:00,FR04014,no2,27.7,µg/m³,48.83724,2.39390


In [16]:
air_quality_with_station_cooord = pd.merge(air_quality, stations_coord, on='location', how='left')
print(air_quality_with_station_cooord.shape)
air_quality_with_station_cooord

(4182, 9)


Unnamed: 0,city,country,date.utc,location,parameter,value,unit,coordinates.latitude,coordinates.longitude
0,Antwerpen,BE,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,µg/m³,51.20966,4.43182
1,Antwerpen,BE,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,µg/m³,51.20966,4.43182
2,Antwerpen,BE,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,µg/m³,51.20966,4.43182
3,Antwerpen,BE,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0,µg/m³,51.20966,4.43182
4,Antwerpen,BE,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5,µg/m³,51.20966,4.43182
...,...,...,...,...,...,...,...,...,...
4177,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,51.49467,-0.13193
4178,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,51.49467,-0.13193
4179,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,51.49467,-0.13193
4180,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,51.49467,-0.13193


In [17]:
air_quality_with_station_cooord = pd.merge(air_quality, stations_coord, on='location', how='right')
print(air_quality_with_station_cooord.shape)
air_quality_with_station_cooord

(4244, 9)


Unnamed: 0,city,country,date.utc,location,parameter,value,unit,coordinates.latitude,coordinates.longitude
0,,,,BELAL01,,,,51.23619,4.38522
1,,,,BELHB23,,,,51.17030,4.34100
2,,,,BELLD01,,,,51.10998,5.00486
3,,,,BELLD02,,,,51.12038,5.02155
4,,,,BELR833,,,,51.32766,4.36226
...,...,...,...,...,...,...,...,...,...
4239,,,,Southend-on-Sea,,,,51.54420,0.67841
4240,,,,Southwark A2 Old Kent Road,,,,51.48050,-0.05955
4241,,,,Thurrock,,,,51.47707,0.31797
4242,,,,Tower Hamlets Roadside,,,,51.52253,-0.04216
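
The difference between the join types is easier to see on a few rows. The sketch below is a minimal illustration on made-up frames: `"Unknown Station"` is a hypothetical location with no coordinates, and `BELAL01` has coordinates but no measurements. The `indicator=True` argument adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

measurements = pd.DataFrame({
    "location": ["BETR801", "FR04014", "Unknown Station"],  # last one is hypothetical
    "value": [18.0, 20.0, 5.0],
})
stations = pd.DataFrame({
    "location": ["BETR801", "FR04014", "BELAL01"],
    "coordinates.latitude": [51.20966, 48.83724, 51.23619],
})

# Run the same merge with each join type and keep the results for comparison.
results = {
    how: pd.merge(measurements, stations, on="location", how=how, indicator=True)
    for how in ("inner", "left", "right")
}

for how, merged in results.items():
    print(how, merged.shape)
    print(merged[["location", "_merge"]])
```

A left join keeps every measurement and fills missing coordinates with NaN, while a right join keeps every station and fills missing measurement columns with NaN, which matches the empty cells in the output above.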


**Q3.** Add each parameter's full description and name, provided by the parameters metadata dataframe "air_quality_parameters", to the measurements dataframe from the previous task. The outcome should be a dataframe containing the measurement data, the station locations, and the parameter descriptions.


In [18]:
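# Note: this merges the metadata onto air_quality, so the output below has no coordinate columns.
# To include the coordinates the task asks for, merge onto the how='left' result from Q2 instead.
air_quality_with_station_cooord_params = pd.merge(air_quality, air_quality_parameters, on='parameter')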
air_quality_with_station_cooord_params

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,description,name
0,Antwerpen,BE,2019-06-18 06:00:00+00:00,BETR801,pm25,18.0,µg/m³,Particulate matter less than 2.5 micrometers i...,PM2.5
1,Antwerpen,BE,2019-06-17 08:00:00+00:00,BETR801,pm25,6.5,µg/m³,Particulate matter less than 2.5 micrometers i...,PM2.5
2,Antwerpen,BE,2019-06-17 07:00:00+00:00,BETR801,pm25,18.5,µg/m³,Particulate matter less than 2.5 micrometers i...,PM2.5
3,Antwerpen,BE,2019-06-17 06:00:00+00:00,BETR801,pm25,16.0,µg/m³,Particulate matter less than 2.5 micrometers i...,PM2.5
4,Antwerpen,BE,2019-06-17 05:00:00+00:00,BETR801,pm25,7.5,µg/m³,Particulate matter less than 2.5 micrometers i...,PM2.5
...,...,...,...,...,...,...,...,...,...
3173,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,Nitrogen Dioxide,NO2
3174,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,Nitrogen Dioxide,NO2
3175,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,Nitrogen Dioxide,NO2
3176,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,Nitrogen Dioxide,NO2
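
One way to obtain measurements, coordinates and parameter descriptions in a single dataframe is to chain two left merges. The sketch below shows the idea on small made-up frames; all three `*_small` names are hypothetical, and the `o3` measurement is invented for illustration.

```python
import pandas as pd

air_quality_small = pd.DataFrame({
    "location": ["BETR801", "FR04014", "FR04014"],
    "parameter": ["no2", "no2", "o3"],  # the o3 row is invented
    "value": [25.0, 20.0, 41.0],
})
stations_small = pd.DataFrame({
    "location": ["BETR801", "FR04014"],
    "coordinates.latitude": [51.20966, 48.83724],
    "coordinates.longitude": [4.43182, 2.39390],
})
parameters_small = pd.DataFrame({
    "parameter": ["no2", "o3"],
    "description": ["Nitrogen Dioxide", "Ozone"],
    "name": ["NO2", "O3"],
})

# First attach the coordinates by location, then the metadata by parameter.
# how="left" keeps every measurement row even if a lookup has no match.
combined = (
    air_quality_small
    .merge(stations_small, on="location", how="left")
    .merge(parameters_small, on="parameter", how="left")
)

print(combined.shape)
print(combined.columns.tolist())
```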


**Q4.** Find the average value of all measurements for pm25 and no2 in each city.

In [19]:
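# Select 'value' before averaging: the other columns are non-numeric, and pandas 2.0+ raises a TypeError when asked for their mean.
air_quality_with_station_cooord_params.groupby(['city', 'parameter'])['value'].mean().reset_index()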

Unnamed: 0,city,parameter,value
0,Antwerpen,no2,25.778947
1,Antwerpen,pm25,21.50495
2,London,no2,24.77709
3,London,pm25,8.993062
4,Paris,no2,27.740538
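
The grouping step can be sketched on a small made-up table (the values in `toy` are invented for illustration). It shows the mean per city and parameter, a named aggregation that also counts the measurements, and a pivot that puts the parameters side by side.

```python
import pandas as pd

# Invented values, with the same column names as the tutorial data.
toy = pd.DataFrame({
    "city": ["Paris", "Paris", "London", "London", "London"],
    "parameter": ["no2", "no2", "no2", "pm25", "pm25"],
    "value": [20.0, 30.0, 26.0, 8.0, 10.0],
    "unit": ["µg/m³"] * 5,
})

# Mean of 'value' for every (city, parameter) combination.
city_means = toy.groupby(["city", "parameter"])["value"].mean().reset_index()

# Named aggregation: several statistics in one pass.
summary = (
    toy.groupby(["city", "parameter"])
    .agg(mean_value=("value", "mean"), n_measurements=("value", "size"))
    .reset_index()
)

# One row per city, one column per parameter; missing combinations become NaN.
wide = city_means.pivot(index="city", columns="parameter", values="value")

print(city_means)
print(summary)
print(wide)
```

Combinations with no rows, such as Paris and pm25 in this toy table, do not appear in the grouped result and show up as NaN after pivoting.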
