# Ironhack - Data Analytics Bootcamp

## Project 6 - Final Project

Free choice.

### Main Objective

The objective of this project is the application of one or more methodologies and tools learned during the course.

### Deliverables

- Presentation of your project - 7 minutes;
- Link to GitHub and/or Tableau project;
- Summary of your main insights.

## The Project:

### ROAD TRAFFIC DEATHS IN THE WORLD IN 2016

In this project I chose to analyze a serious problem present in most countries of the world. I looked at the [Global Health Observatory](https://apps.who.int/gho/data/node.main) data repository, available on the World Health Organization website, and chose to analyze road traffic deaths in the world in 2016, the most recent year available. For that, I downloaded the following dataframes:

[Road traffic deaths (Data by country)](https://apps.who.int/gho/data/node.main.A997?lang=en);

[Reported distribution of road traffic deaths by type of road user (Data by country)](https://apps.who.int/gho/data/node.main.A998?lang=en);

[Registered vehicles (Data by country)](https://apps.who.int/gho/data/node.main.A995?lang=en);

I also found it necessary to obtain some information on the socio-economic level of each country analyzed. For this, I chose the [Gross Domestic Product per Capita - GDP (nominal)](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita), which is a good measure of a country's wealth and social development.

Of the three tables available on the [Wikipedia](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita) page, I chose the United Nations one and obtained it through web scraping, because its data is from 2017, the closest year to 2016, which is the year of the other dataframes used in this work.

Then, after cleaning and merging the four dataframes, the analysis was made through Tableau Software.


#### Methodology:

1. Downloading the three dataframes from the World Health Organization website;
2. Importing the Pandas library and the `re` (regex) module;
3. Importing the first dataframe containing the "road traffic deaths by country";
4. Cleaning it, dropping useless data, reorganizing and renaming some columns;
5. Importing the second dataframe containing the "registered vehicles by country";
6. Eliminating the spaces in the column "Number of registered vehicles" and changing its type to float;
7. Importing the third dataframe containing the "reported distribution of road traffic deaths by type of road user by country";
8. Eliminating the non-numeric characters from the items in the column ' Drivers/passengers of 4-wheeled vehicles';
9. Dropping the duplicate row for the country 'Eswatini';
10. Web scraping the United Nations table with the countries of the world sorted by their gross domestic product per capita from Wikipedia;
11. Dropping the column 'Rank', renaming the remaining columns and the countries whose names were different from those in the previous dataframes;
12. Merging the three WHO dataframes into a new one;
13. Renaming the countries whose names were different so that they would be recognizable by Tableau maps;
14. Merging the GDP dataframe with the three WHO dataframes into a new one;
15. Dropping null data and the columns 'Data Source','Year_x', and 'Year_y';
16. Exporting the final dataframe as a ".csv" file;
17. Importing the file into Tableau;
18. Creating 8 dashboards and a story to analyze the dataframe created.

#### Problems faced:

- The dataframes had data obtained in different years, from 2012 to 2017, so there were some minor inconsistencies;
- Country names had to be standardized before merging the dataframes;
- Tableau is not as intuitive as it looks.
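
One way to spot country names that need standardizing is to compare the sets of names from two sources before merging. The sketch below is only an illustration: `who_names` and `gdp_names` are hypothetical stand-ins for the 'Country' columns, not the real data.

```python
# Hypothetical stand-ins for the 'Country' columns of two sources.
who_names = ["Brazil", "Viet Nam", "Czechia", "United States of America"]
gdp_names = ["Brazil", "Vietnam", "Czech Republic", "United States"]

# Names that appear in only one source are the ones that need standardizing.
only_in_who = sorted(set(who_names) - set(gdp_names))
only_in_gdp = sorted(set(gdp_names) - set(who_names))

print(only_in_who)
print(only_in_gdp)
```

Each name printed for one source should be matched by hand with its counterpart in the other before the merge.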

#### Technologies used:

- Python;
- Pandas;
- Tableau.

#### Tableau views:

a) The name of the project: "ROAD TRAFFIC DEATHS IN THE WORLD IN 2016 - Exploratory Data Analysis":<br/>
<img src='story1.jpg' width='800px' />
<br/>
b) Top 10 countries (total number of deaths):<br/>
<img src='story2.jpg' width='800px' />
<br/>
c) Top 10 countries (death rate per 100,000 population):<br/>
<img src='story3.jpg' width='800px' />
<br/>
d) Top 10 countries (lowest total number of deaths):<br/>
<img src='story4.jpg' width='800px' />
<br/>
e) Top 10 countries (lowest death rate per 100,000 population):<br/>
<img src='story5.jpg' width='800px' />
<br/>
f) Top 10 countries (distribution of road traffic deaths by type of road user):<br/>
<img src='story6.jpg' width='800px' />
<br/>
g) Top 20 in number of registered vehicles / Total road traffic deaths:<br/>
<img src='story7.jpg' width='800px' />
<br/>
h) Conclusion:<br/>
<img src='story8.jpg' width='800px' />    
___________________________________________________________________________________________________________________________

## Tableau link for the project:

https://public.tableau.com/profile/marcus.brand.o#!/vizhome/Roadtrafficdeathsbycountry-2016/Dash-F?publish=yes
___________________________________________________________________________________________________________________________

## Collaborator:

Marcus Brandão
___________________________________________________________________________________________________________________________

## The code:

In [None]:
# Importing modules:

import pandas as pd
import re

#### 1 - First dataframe: Road traffic deaths by country (World Health Organization):

In [None]:
# Importing the dataframe with the road traffic deaths by country:

df1 = pd.read_csv("./RS_196,RS_198.csv", sep=",")

In [None]:
# Getting the information about the dataframe 'df1':

df1.info()

In [None]:
# Visualizing the first 5 rows of the dataframe 'df1':

df1.head(5)

In [None]:
# Dropping the first row, because it is a subheading:

df1 = df1.drop(0)
df1

In [None]:
# Renaming the first column to "Country":

df1 = df1.rename(columns = {'Unnamed: 0': 'Country'})
df1

In [None]:
# Eliminating the spaces in the column "Estimated number of road traffic deaths":

df1["Estimated number of road traffic deaths"] = df1["Estimated number of road traffic deaths"].str.replace(" ", "")
df1

In [None]:
'''
Eliminating the square brackets, keeping only the first value of each row (the one before the brackets) of the column
"Estimated number of road traffic deaths" and changing its type to integer:
'''

df1["Estimated number of road traffic deaths"] = df1["Estimated number of road traffic deaths"].apply(lambda x: int(re.findall(r'\d*', x)[0]))
df1.info()
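
The two cleaning steps applied to "Estimated number of road traffic deaths" can also be sketched as one standalone function. This is a minimal sketch that assumes the raw values look like `1 234 [1 000-1 500]`; the function name `parse_estimate` and the sample string are hypothetical.

```python
import re

def parse_estimate(raw: str) -> int:
    """Return the leading number of a string such as '1 234 [1 000-1 500]'."""
    compact = raw.replace(" ", "")       # e.g. '1234[1000-1500]'
    match = re.match(r"\d+", compact)    # digits before the opening bracket
    if match is None:
        raise ValueError(f"no leading number in {raw!r}")
    return int(match.group())

print(parse_estimate("1 234 [1 000-1 500]"))
```

Raising an error on unexpected input makes malformed rows visible instead of failing inside the lambda with a less informative message.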

In [None]:
# Visualizing the final result of the "df1" dataframe:

df1.head(5)

#### 2 - Second dataframe: Registered vehicles by country (World Health Organization):

In [None]:
# Importing the dataframe with the registered vehicles by country:

df2 = pd.read_csv("./RS_194.csv", sep=",")

In [None]:
# Getting the information about the dataframe 'df2':

df2.info()

In [None]:
# Visualizing the first 5 rows of the dataframe 'df2':

df2.head(5)

In [None]:
# Eliminating the spaces in the column "Number of registered vehicles" and changing its type to float:

df2["Number of registered vehicles"] = df2["Number of registered vehicles"].str.replace(" ", "").astype(float)
df2

#### 3 - Third dataframe: Reported distribution of road traffic deaths by type of road user by country (World Health Organization):

In [None]:
# Importing the dataframe with the reported distribution of road traffic deaths by type of road user by country:

df3 = pd.read_csv("./RS_246.csv", sep=",", skiprows=1)

In [None]:
# Getting the information about the dataframe 'df3':

df3.info()

In [None]:
# Visualizing the first 5 rows of the dataframe 'df3':

df3.head(5)

In [None]:
# Eliminating the non-numeric characters from the items in the column ' Drivers/passengers of 4-wheeled vehicles':

df3[" Drivers/passengers of 4-wheeled vehicles"] = df3[" Drivers/passengers of 4-wheeled vehicles"].apply(lambda x : x if x != x else re.sub('[a-zA-Z]', '', x))
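
The `x != x` test in the lambda above works because NaN is the only value that is not equal to itself, so missing entries are returned unchanged. A minimal sketch of the same idea follows; the helper name `strip_letters` and the sample value `"45.3a"` are made up for illustration.

```python
import re

def strip_letters(value):
    # NaN is not equal to itself, so this branch is taken only for
    # missing entries, which are passed through untouched.
    if value != value:
        return value
    # Remove ASCII letters from the string.
    return re.sub(r"[a-zA-Z]", "", value)

print(strip_letters("45.3a"))
print(strip_letters(float("nan")))
```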

In [None]:
# Checking the column 'Country' values:

df3['Country'].values

In [None]:
# Dropping the duplicate row for the country 'Eswatini'

eswatini_to_drop = df3.loc[(df3['Country'] == 'Eswatini') & (df3['Year'] == 2013)].index
df3.drop(eswatini_to_drop, inplace=True)

In [None]:
# Checking the column 'Country' again:

df3['Country'].values

In [None]:
df3.head(5)

#### 4 - Fourth dataframe: United Nations table with the countries of the world sorted by their gross domestic product per capita from Wikipedia:

In [None]:
# Scraping a table with the countries of the world sorted by their gross domestic product per capita from Wikipedia:

link = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'

df4 = pd.read_html(link,header=0)[4]
df4

In [None]:
# Dropping the column 'Rank':

df4 = df4.drop('Rank', axis=1)

In [None]:
# Renaming the remaining columns:

df4.columns = ['Country', 'GDP(US$)']

In [None]:
# Renaming the countries whose names are different than in the previous dataframes:

def df4_clean_country(arg):
    arg = re.sub(r'\s\(.*', '', arg)
    if arg == 'Bahamas, The':
        arg = 'Bahamas'
    elif arg == 'Cape Verde':
        arg = 'Cabo Verde'        
    elif arg == 'Congo, Republic of the':
        arg = 'Congo'
    elif arg == 'Congo, Democratic Republic of the':
        arg = 'Democratic Republic of the Congo'
    elif arg == 'Czech Republic':
        arg = 'Czechia'  
    elif arg == 'East Timor':
        arg = 'Timor-Leste'               
    elif arg == 'Gambia, The':
        arg = 'Gambia'         
    elif arg == 'Korea, South':
        arg = 'South Korea'
    elif arg == 'Micronesia, Federated States of':
        arg = 'Micronesia'
    elif arg == "São Tomé and Príncipe":
        arg = 'Sao Tome and Principe'
    elif arg == 'United States':
        arg = 'United States of America'
    elif arg == 'Vietnam':
        arg = 'Viet Nam'
    return arg

df4["Country"] = df4["Country"].apply(df4_clean_country)
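
The same renaming could be written with a dictionary lookup instead of an `if`/`elif` chain. This is a sketch of that alternative, not the code used in the project: `RENAMES` and `clean_country` are hypothetical names, and the mapping copies only a few of the branches above.

```python
import re

# A few entries copied from df4_clean_country; extend as needed.
RENAMES = {
    "Bahamas, The": "Bahamas",
    "Cape Verde": "Cabo Verde",
    "Czech Republic": "Czechia",
    "Vietnam": "Viet Nam",
}

def clean_country(name: str) -> str:
    # Drop a trailing parenthetical first, then apply the lookup.
    name = re.sub(r"\s\(.*", "", name)
    # Names without an entry are returned unchanged.
    return RENAMES.get(name, name)

print(clean_country("Vietnam"))
```

Keeping the mapping in one dictionary makes it easier to review and extend than a long chain of branches.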

In [None]:
# Checking the column 'Country' again:

df4["Country"].values.tolist()

#### List with all the analyzed countries:

In [None]:
# Building a sorted list of the unique country names found across the four dataframes:

countries = sorted(set(df1['Country'].values.tolist() + df2['Country'].values.tolist() + df3['Country'].values.tolist() + df4["Country"].values.tolist()))

In [None]:
len(countries)

#### Merging the three dataframes into a new one:

In [None]:
# Creating a final dataframe 'df5' as a result of merging the 'df1' and 'df2' dataframes:

df5 = df1.merge(df2, how='outer', on='Country')
df5.head(5)

In [None]:
# Merging the final dataframe with dataframe 'df3':

df5 = df5.merge(df3, how='outer', on='Country')
df5.head(5)

In [None]:
# Cleaning the column 'Country' before merging with the dataframe 'df4' to avoid duplicates:

#df5["Country"] = df5["Country"].apply(lambda x : x if x != x else re.sub('\s\(.*', '', x))

In [None]:
# Cleaning and renaming the countries whose names are different than in the previous dataframes, to avoid duplicates:

def df5_clean_country(arg):
    arg = re.sub(r'\s\(.*', '', arg)
    if arg == "Lao People's Democratic Republic":
        arg = 'Laos'
    elif arg == 'Republic of Korea':
        arg = 'South Korea'    
    elif arg == 'Republic of North Macedonia':
        arg = 'North Macedonia'         
    elif arg == 'Republic of Moldova':
        arg = 'Moldova'  
    elif arg == 'occupied Palestinian territory, including east Jerusalem':
        arg = 'Palestine'    
    elif arg == 'Russian Federation':
        arg = 'Russia'
    elif arg == 'Syrian Arab Republic':
        arg = 'Syria'     
    elif arg == 'United Kingdom of Great Britain and Northern Ireland':
        arg = 'United Kingdom' 
    elif arg == 'United Republic of Tanzania':
        arg = 'Tanzania'     
    return arg

df5["Country"] = df5["Country"].apply(df5_clean_country)

In [None]:
# Merging the final dataframe with dataframe 'df4':

df5 = df5.merge(df4, how='left', on='Country')
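
To see which countries found no GDP match in a left merge, pandas can add a `_merge` column through the `indicator=True` argument. The sketch below uses toy frames with made-up values in place of `df5` and `df4`.

```python
import pandas as pd

# Toy frames standing in for df5 and df4 (all values are made up).
left = pd.DataFrame({"Country": ["Brazil", "Laos", "Country X"],
                     "Deaths": [1, 2, 3]})
right = pd.DataFrame({"Country": ["Brazil", "Laos"],
                      "GDP(US$)": [10.0, 20.0]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
merged = left.merge(right, how="left", on="Country", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()

print(unmatched)
```

Any name listed in `unmatched` is a candidate for another entry in the renaming functions.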

In [None]:
# Final dataframe:

df5

In [None]:
# Checking for missing data in the column 'GDP(US$)':

df6 = df5[df5['GDP(US$)'].isna()]
df6['Country'].values.tolist()

In [None]:
# Checking for null data in the dataframe 'df5':

mask = df5.isnull().any(axis=1)

df5[mask]

In [None]:
# Dropping rows with fewer than 2 non-null values:

df5 = df5.dropna(axis=0, thresh=2)
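
With `thresh=2`, `dropna` keeps a row only if it has at least two non-null values, so rows with some missing columns survive as long as two fields are filled. A small sketch on a toy frame (made-up values) shows the effect:

```python
import pandas as pd

# Toy frame: row 0 has two filled values, rows 1 and 2 have only one each.
toy = pd.DataFrame({
    "Country": ["A", "B", None],
    "Deaths": [10.0, None, None],
    "GDP(US$)": [None, None, 5.0],
})

# Keep only rows with at least 2 non-null values.
kept = toy.dropna(axis=0, thresh=2)

print(kept["Country"].tolist())
```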

In [None]:
# Checking the final dataframe:

df5

In [None]:
# Dropping useless columns:

df5 = df5.drop(['Data Source','Year_x', 'Year_y'], axis=1)

In [None]:
# Checking columns' names, turning the 'df5' columns into a list:

df5.columns.values.tolist()

In [None]:
# Reorganizing and renaming the columns of the 'df5' dataframe:

df5.columns = ['Country',
 'Estimated number of road traffic deaths',
 'Estimated road traffic death rate (per 100,000 population)',
 'Number of registered vehicles',
 'Drivers/passengers of 4-wheeled vehicles (%)',
 'Drivers/passengers of motorized 2- or 3-wheelers (%)',
 'Cyclists (%)',
 'Pedestrians (%)',
 'Other/unspecified road users (%)',
 'GDP(US$)']

In [None]:
# Checking the 'Country' values of the final dataframe:

df5['Country'].values.tolist()

#### Exporting the final file:

In [None]:
# Exporting the final file as a '.csv' document:

df5.to_csv("who_road_traffic_deaths.csv", index=False)