# Data-Pipelines 📊

## Context and objectives

The aim of this project is to build a data pipeline that enriches an existing public dataset. To that end, several public datasets are combined with a table scraped from a website.


There is a growing tendency to purchase electric cars. Not only are they becoming more popular, but governments are also encouraging the shift by offering purchase subsidies.

The goal of this project is to democratize data about the evolution of electric cars and their environmental and health impact, if any.


The scraped dataset chosen was "World-most-polluted-countries" from iqair.com. The scraped table provides historical data ranking the most polluted countries and regions by annual average PM2.5 concentration (μg/m³).

To enrich this dataset, four other indicators are considered and added: 1) charging points around the world, 2) historical car sales, 3) EV stock share, and 4) number of deaths by risk factor.

The following hypotheses were formulated to guide the analysis:

- The more charging stations, the higher the sales.
- The higher the stock share, the higher the sales.
- The higher the sales, the fewer the deaths.
- The higher the sales, the lower the pollution.
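
One way to make these hypotheses testable is to compute a correlation coefficient between the two indicators involved once they sit in a single table. The sketch below uses a small made-up frame (the values and the column names `chargers` and `sales` are illustrative assumptions, not project data) and pandas' `Series.corr`, which computes Pearson's r by default.

```python
import pandas as pd

# Hypothetical toy data: one row per (region, year) observation.
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "chargers": [100, 150, 220, 40, 60, 90],
    "sales": [1000, 1600, 2500, 300, 500, 800],
})

# Pearson correlation between the two indicators across all observations.
r_overall = toy["chargers"].corr(toy["sales"])

# The same coefficient computed separately for each region.
r_by_region = (
    toy.groupby("region")[["chargers", "sales"]]
    .apply(lambda g: g["chargers"].corr(g["sales"]))
)

print(r_overall)
print(r_by_region)
```

A positive coefficient would be consistent with the first hypothesis, but it does not establish causation, and with only a few years per region the estimate is fragile.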





## Libraries

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import re
import numpy as np
import requests
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')
from IPython.display import Image

In [4]:
url = "https://www.iqair.com/world-most-polluted-countries"

In [3]:
html = requests.get(url)
html.content


In [5]:
soup = BeautifulSoup(html.content, "html.parser")

In [7]:
results2 = soup.find_all("span", attrs={"class": "text-normal"})

In [38]:
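from io import StringIO

# Fetch the ranking page and read the first table inside the "inner-table" div.
url = "https://www.iqair.com/world-most-polluted-countries"
html = requests.get(url)
soup = BeautifulSoup(html.content, "html.parser")
results = soup.find_all("div", attrs={"class": "inner-table"})
pd.set_option('display.max_rows', None)
# Recent pandas versions expect a file-like object rather than a raw HTML string.
df = pd.read_html(StringIO(results[0].prettify()))[0]
df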

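|   | Rank | Country/Region | 2022 | 2021 | 2020 | 2019 | 2018 | Population |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Chad | 89.7 | 75.9 | - | - | - | 17179740 |
| 1 | 2 | Iraq | 80.1 | 49.7 | - | 39.6 | - | 43533592 |
| 2 | 3 | Pakistan | 70.9 | 66.8 | 59 | 65.8 | 74.3 | 231402117 |
| 3 | 4 | Bahrain | 66.6 | 49.8 | 39.7 | 46.8 | 59.8 | 1463265 |
| 4 | 5 | Bangladesh | 65.8 | 76.9 | 77.1 | 83.3 | 97.1 | 169356251 |
| 5 | 6 | Burkina Faso | 63.0 | - | - | - | - | 22100683 |
| 6 | 7 | Kuwait | 55.8 | 29.7 | 34 | 38.3 | 56 | 4250114 |
| 7 | 8 | India | 53.3 | 58.1 | 51.9 | 58.1 | 72.5 | 1407563842 |
| 8 | 9 | Egypt | 46.5 | 29.1 | - | 18 | - | 109262178 |
| 9 | 10 | Tajikistan | 46.0 | 59.4 | 30.9 | - | - | 9750064 |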


In [100]:
pollution_2

|   | region | population | year | pollution |
| --- | --- | --- | --- | --- |
| 0 | Chad | 17179740 | 2018 | - |
| 1 | Iraq | 43533592 | 2018 | - |
| 2 | Pakistan | 231402117 | 2018 | 74.3 |
| 3 | Bahrain | 1463265 | 2018 | 59.8 |
| 4 | Bangladesh | 169356251 | 2018 | 97.1 |
| 5 | Burkina Faso | 22100683 | 2018 | - |
| 6 | Kuwait | 4250114 | 2018 | 56 |
| 7 | India | 1407563842 | 2018 | 72.5 |
| 8 | Egypt | 109262178 | 2018 | - |
| 9 | Tajikistan | 9750064 | 2018 | - |


In [112]:
# Reshape from wide (one column per year) to long (one row per region and year).
pollution_2 = pd.melt(df, id_vars=['Country/Region', 'Population'], value_vars=['2018', '2019', '2020', '2021', '2022'], var_name='Year', value_name='Pollution')

pd.set_option('display.max_rows', None)
pollution_2.rename(columns={"Country/Region": "region", "Population": "population", "Pollution": "pollution", "Year": "year"}, inplace=True)
# Drop rows without a measurement: '-' placeholders first, then any remaining NaN.
pollution_2.drop(pollution_2[pollution_2['pollution'] == '-'].index, inplace=True)
pollution_2.dropna(inplace=True)  # without inplace=True the result was discarded

pollution_2['year'] = pollution_2['year'].astype(int)
pollution_2['pollution'] = pollution_2['pollution'].astype(float)
pollution_2






|   | region | population | year | pollution |
| --- | --- | --- | --- | --- |
| 2 | Pakistan | 231402117 | 2018 | 74.3 |
| 3 | Bahrain | 1463265 | 2018 | 59.8 |
| 4 | Bangladesh | 169356251 | 2018 | 97.1 |
| 6 | Kuwait | 4250114 | 2018 | 56.0 |
| 7 | India | 1407563842 | 2018 | 72.5 |
| 10 | United Arab Emirates | 9365145 | 2018 | 49.9 |
| 15 | Nepal | 30034989 | 2018 | 54.1 |
| 16 | Uganda | 45853778 | 2018 | 40.8 |
| 17 | Nigeria | 213401323 | 2018 | 44.8 |
| 18 | Bosnia Herzegovina | 3270943 | 2018 | 40.0 |
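
The same cleaning can also be written with `pd.to_numeric(..., errors="coerce")`, which turns non-numeric placeholders such as `-` into `NaN` so that a single `dropna` removes them. The sketch below runs on a three-row stand-in for the scraped table (values copied from the outputs in this section); it is an alternative formulation under that assumption, not the code the notebook ran.

```python
import pandas as pd

# Small stand-in for the scraped table; "-" marks a missing measurement.
wide = pd.DataFrame({
    "Country/Region": ["Chad", "Iraq", "Pakistan"],
    "Population": [17179740, 43533592, 231402117],
    "2021": ["75.9", "49.7", "66.8"],
    "2022": [89.7, 80.1, 70.9],
    "2018": ["-", "-", "74.3"],
})

# Wide to long: one row per (region, year).
tidy = wide.melt(
    id_vars=["Country/Region", "Population"],
    var_name="year",
    value_name="pollution",
).rename(columns={"Country/Region": "region", "Population": "population"})

# Non-numeric entries become NaN, then rows without a measurement are dropped.
tidy["pollution"] = pd.to_numeric(tidy["pollution"], errors="coerce")
tidy = tidy.dropna(subset=["pollution"]).astype({"year": int})

print(tidy)
```

Coercing first means any other unexpected placeholder (an empty string, `n/a`) is handled the same way, at the cost of silently dropping values that fail to parse.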


In [91]:
df_unpivot.dtypes

Country/Region    object
Population         int64
Year              object
Pollution         object
dtype: object

In [64]:
chargers = pd.read_csv("data/IEA-EV-dataEV charging pointsEVHistorical.csv")
sales = pd.read_csv("data/IEA-EV-dataEV salesCarsHistorical.csv")
stock = pd.read_csv("data/IEA-EV-dataEV stock shareCarsHistorical.csv")
deaths = pd.read_csv("data/number-of-deaths-by-risk-factor.csv")





In [85]:
chargers_2 = chargers.iloc[:, [0, 5, 7]]  # keep region, year and value
chargers_2

|   | region | year | value |
| --- | --- | --- | --- |
| 0 | Australia | 2017 | 40.0 |
| 1 | Australia | 2017 | 440.0 |
| 2 | Australia | 2018 | 61.0 |
| 3 | Australia | 2018 | 670.0 |
| 4 | Australia | 2019 | 250.0 |
| 5 | Australia | 2019 | 1700.0 |
| 6 | Australia | 2020 | 350.0 |
| 7 | Australia | 2020 | 2000.0 |
| 8 | Australia | 2021 | 320.0 |
| 9 | Australia | 2021 | 2000.0 |
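
In this output each region and year appears twice, presumably because the three selected columns leave out a field that distinguishes the rows (for example a charger type; this is an assumption, the source does not say). Before joining with the other tables, such rows can be collapsed to one row per `region` and `year`. A minimal sketch on the first four rows shown here, assuming that summing the values is the desired aggregation (the names `chargers_sample` and `chargers_total` are illustrative):

```python
import pandas as pd

# First four rows of the chargers output, typed in by hand.
chargers_sample = pd.DataFrame({
    "region": ["Australia"] * 4,
    "year": [2017, 2017, 2018, 2018],
    "value": [40.0, 440.0, 61.0, 670.0],
})

# One row per (region, year); the summed column gets a descriptive name.
chargers_total = (
    chargers_sample
    .groupby(["region", "year"], as_index=False)["value"]
    .sum()
    .rename(columns={"value": "chargers"})
)

print(chargers_total)
```

The sales output shows the same pattern of repeated years, so the same step would likely apply there.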


In [83]:

sales_2 = sales.iloc[:, [0, 5, 7]]  # keep region, year and value
sales_2


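|   | region | year | value |
| --- | --- | --- | --- |
| 0 | Australia | 2011 | 49 |
| 1 | Australia | 2012 | 170 |
| 2 | Australia | 2012 | 80 |
| 3 | Australia | 2013 | 100 |
| 4 | Australia | 2013 | 190 |
| 5 | Australia | 2014 | 370 |
| 6 | Australia | 2014 | 950 |
| 7 | Australia | 2015 | 1000 |
| 8 | Australia | 2015 | 760 |
| 9 | Australia | 2016 | 670 |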


In [81]:
stock_2 = stock.iloc[:, [0, 5, 7]]  # keep region, year and value
stock_2

|   | region | year | value |
| --- | --- | --- | --- |
| 0 | Australia | 2011 | 0.00046 |
| 1 | Australia | 2012 | 0.0028 |
| 2 | Australia | 2013 | 0.0054 |
| 3 | Australia | 2014 | 0.017 |
| 4 | Australia | 2015 | 0.032 |
| 5 | Australia | 2016 | 0.043 |
| 6 | Australia | 2017 | 0.061 |
| 7 | Australia | 2018 | 0.089 |
| 8 | Australia | 2019 | 0.16 |
| 9 | Australia | 2020 | 0.22 |


In [86]:
deaths_2 = deaths.iloc[:, [0, 2, 3]].copy()  # copy so the in-place rename acts on an independent frame


In [87]:
deaths_2.rename(columns = {"Entity": "region", "Deaths - Cause: All causes - Risk: Outdoor air pollution - OWID - Sex: Both - Age: All Ages (Number)":"number of deaths by air pollution", "Year":"year"}, inplace=True)
deaths_2

|   | region | year | number of deaths by air pollution |
| --- | --- | --- | --- |
| 0 | Afghanistan | 1990 | 3169 |
| 1 | Afghanistan | 1991 | 3222 |
| 2 | Afghanistan | 1992 | 3395 |
| 3 | Afghanistan | 1993 | 3623 |
| 4 | Afghanistan | 1994 | 3788 |
| 5 | Afghanistan | 1995 | 3869 |
| 6 | Afghanistan | 1996 | 3943 |
| 7 | Afghanistan | 1997 | 4024 |
| 8 | Afghanistan | 1998 | 4040 |
| 9 | Afghanistan | 1999 | 4042 |
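
With every table reduced to `region`, `year`, and one value column, the enrichment described in the objectives comes down to joining on those two keys. The sketch below shows the idea on two tiny hand-made frames: the pollution values are taken from the scraped table, while the `sales` numbers and the names `pollution_long`, `sales_by_year`, and `enriched` are invented for illustration. The join strategy itself is an assumption, since the source does not show this step.

```python
import pandas as pd

# Pollution values copied from the scraped table.
pollution_long = pd.DataFrame({
    "region": ["India", "India", "Nepal"],
    "year": [2018, 2019, 2018],
    "pollution": [72.5, 58.1, 54.1],
})

# Hypothetical sales figures, one row per (region, year).
sales_by_year = pd.DataFrame({
    "region": ["India", "India"],
    "year": [2018, 2019],
    "sales": [1000, 1500],
})

# Left join keeps every pollution row; validate raises if a key is duplicated.
enriched = pollution_long.merge(
    sales_by_year,
    on=["region", "year"],
    how="left",
    validate="one_to_one",
)

print(enriched)
```

A left join leaves `NaN` where the other dataset has no matching region and year, which makes coverage gaps visible. Region names may also differ between sources, so a name-harmonization step before the merge is probably needed.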
