## Analysing two CSV files on total population and diabetes prevalence rate for all countries

Loading all required libraries

In [None]:
library(tidyverse)
library(magrittr) # better handling of pipes
library(purrr) # to work with lists and map functions
library(glue) # to paste strings
library(stringr) # to handle strings
library(rvest) # to make scraping easier
library(polite) # polite version of rvest
library(htmltab)
library(dplyr)
library(tidyr)

Reading the file with the population of all countries. Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates. The data come from the following sources:
(1) United Nations Population Division. World Population Prospects: 2017 Revision
(2) Census reports and other statistical publications from national statistical offices
(3) Eurostat: Demographic Statistics
(4) United Nations Statistical Division. Population and Vital Statistics Report (various years)
(5) U.S. Census Bureau: International Database
(6) Secretariat of the Pacific Community: Statistics and Demography Programme

In [91]:
population_df <- read_csv(file = "API_SP.POP.TOTL_DS2_en_csv_v2_10134466.csv")

Parsed with column specification:
cols(
  .default = col_double(),
  `Country Name` = col_character(),
  `Country Code` = col_character(),
  `Indicator Name` = col_character(),
  `Indicator Code` = col_character()
)
See spec(...) for full column specifications.


In [19]:
population_df %>% glimpse()

Observations: 264
Variables: 62
$ `Country Name`   <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Ando...
$ `Country Code`   <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARB", "ARE", "...
$ `Indicator Name` <chr> "Population, total", "Population, total", "Populat...
$ `Indicator Code` <chr> "SP.POP.TOTL", "SP.POP.TOTL", "SP.POP.TOTL", "SP.P...
$ `1960`           <dbl> 54211, 8996351, 5643182, 1608800, 13411, 92490932,...
$ `1961`           <dbl> 55438, 9166764, 5753024, 1659800, 14375, 95044497,...
$ `1962`           <dbl> 56225, 9345868, 5866061, 1711319, 15370, 97682294,...
$ `1963`           <dbl> 56695, 9533954, 5980417, 1762621, 16412, 100411076...
$ `1964`           <dbl> 57032, 9731361, 6093321, 1814135, 17469, 103239902...
$ `1965`           <dbl> 57360, 9938414, 6203299, 1864791, 18549, 106174988...
$ `1966`           <dbl> 57715, 10152331, 6309770, 1914573, 19647, 10923059...
$ `1967`           <dbl> 58055, 10372630, 6414995, 1965598, 20758, 11240693...
$ `1968`           <

Reading the file with the diabetes prevalence rate of all countries. Diabetes prevalence refers to the percentage of people ages 20-79 who have type 1 or type 2 diabetes. The data come from the International Diabetes Federation, Diabetes Atlas.

In [92]:
diabetes_df <- read_csv(file = "API_SH.STA.DIAB.ZS_DS2_en_csv_v2_10136460.csv")

Parsed with column specification:
cols(
  .default = col_character(),
  `2017` = col_double()
)
See spec(...) for full column specifications.


In [94]:
#Selecting only the data for 2017 since the other year columns don't have any values
diabetes_df <- diabetes_df %>% select('Country Name', 'Country Code', 'Indicator Name', 'Indicator Code', '2017')

In [95]:
diabetes_df %>% glimpse()

Observations: 264
Variables: 5
$ `Country Name`   <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Ando...
$ `Country Code`   <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARB", "ARE", "...
$ `Indicator Name` <chr> "Diabetes prevalence (% of population ages 20 to 7...
$ `Indicator Code` <chr> "SH.STA.DIAB.ZS", "SH.STA.DIAB.ZS", "SH.STA.DIAB.Z...
$ `2017`           <dbl> 11.62000, 9.59000, 3.94000, 10.08000, 7.97000, 12....
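
The claim that only the 2017 column holds values can be checked rather than assumed. The following is a minimal sketch on a hypothetical toy table (`toy_df` and its values are made up to mimic the wide, one-column-per-year layout); it lists the columns in which every entry is missing.

```r
library(tibble)

# Hypothetical toy data: one column per year, only 2017 populated
toy_df <- tibble(
  `Country Code` = c("ABW", "AFG", "AGO"),
  `2016` = c(NA_real_, NA_real_, NA_real_),
  `2017` = c(11.62, 9.59, 3.94)
)

# TRUE for every column whose values are all NA
is_empty <- vapply(toy_df, function(col) all(is.na(col)), logical(1))

# Names of the columns that carry no data at all
empty_cols <- names(toy_df)[is_empty]
empty_cols
```

Running the same check on the full diabetes table before the `select()` call would show which year columns can safely be dropped.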


In [60]:
#Selecting only the required columns for joining
population_2017_df <- population_df %>% select('Country Code', '2017')

In [72]:
population_2017_df %>%  head()

Country Code,2017
ABW,105264
AFG,35530081
AGO,29784193
ALB,2873457
AND,76965
ARB,414491886


In [67]:
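#Preliminary join: both tables still name the value column 2017 here, so the result gets 2017.x and 2017.y
#The columns are renamed and the join is repeated below
diabetes_population <- diabetes_df %>% inner_join(population_2017_df, by = "Country Code")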

In [86]:
#Renaming the 2017 column in the two tables to Population_2017 and Diabetes_prevalance_2017, respectively
population_2017_df <- population_2017_df %>% rename(Population_2017 = `2017`)
diabetes_df <- diabetes_df %>% rename(Diabetes_prevalance_2017 = `2017`)

In [87]:
population_2017_df %>% head()
diabetes_df %>% head()

Country Code,Population_2017
ABW,105264
AFG,35530081
AGO,29784193
ALB,2873457
AND,76965
ARB,414491886


Country Name,Country Code,Indicator Name,Indicator Code,Diabetes_prevalance_2017
Aruba,ABW,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,11.62
Afghanistan,AFG,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,9.59
Angola,AGO,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,3.94
Albania,ALB,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,10.08
Andorra,AND,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,7.97
Arab World,ARB,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,12.07702


In [88]:
#Joining the population data and diabetes data
diabetes_population<-diabetes_df %>% inner_join(population_2017_df,by="Country Code") 

In [89]:
diabetes_population %>% head()

Country Name,Country Code,Indicator Name,Indicator Code,Diabetes_prevalance_2017,Population_2017
Aruba,ABW,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,11.62,105264
Afghanistan,AFG,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,9.59,35530081
Angola,AGO,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,3.94,29784193
Albania,ALB,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,10.08,2873457
Andorra,AND,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,7.97,76965
Arab World,ARB,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,12.07702,414491886
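
An `inner_join()` keeps only the codes present in both tables and drops the rest without warning. A quick way to see what would be lost is `anti_join()`. The sketch below uses hypothetical toy tables; the codes "XXX" and "YYY" are invented to stand for rows without a partner.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables; "XXX" and "YYY" are made-up codes with no match
toy_diabetes <- tibble(
  `Country Code` = c("ABW", "AFG", "XXX"),
  Diabetes_prevalance_2017 = c(11.62, 9.59, 5.00)
)
toy_population <- tibble(
  `Country Code` = c("ABW", "AFG", "YYY"),
  Population_2017 = c(105264, 35530081, 1000)
)

# Rows of each table that the inner join would drop
only_in_diabetes <- anti_join(toy_diabetes, toy_population, by = "Country Code")
only_in_population <- anti_join(toy_population, toy_diabetes, by = "Country Code")

# The join itself keeps only the shared codes
joined <- inner_join(toy_diabetes, toy_population, by = "Country Code")

only_in_diabetes$`Country Code`
only_in_population$`Country Code`
nrow(joined)
```

If both `anti_join()` results are empty on the real tables, the inner join has not discarded anything.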


In [90]:
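#Converting the diabetes prevalence percentage to an estimated number of people
#Note: the rate covers ages 20 to 79 only, so multiplying by total population overstates the number of affected adults
diabetes_population %>% mutate(New_diabetes_population = (Diabetes_prevalance_2017 / 100) * Population_2017)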

Country Name,Country Code,Indicator Name,Indicator Code,Diabetes_prevalance_2017,Population_2017,New_diabetes_population
Aruba,ABW,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,11.62000,105264,12231.68
Afghanistan,AFG,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,9.59000,35530081,3407334.77
Angola,AGO,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,3.94000,29784193,1173497.20
Albania,ALB,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,10.08000,2873457,289644.47
Andorra,AND,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,7.97000,76965,6134.11
Arab World,ARB,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,12.07702,414491886,50058269.71
United Arab Emirates,ARE,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,17.26000,9400145,1622465.03
Argentina,ARG,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,5.50000,44271041,2434907.25
Armenia,ARM,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,7.11000,2930450,208354.99
American Samoa,ASM,Diabetes prevalence (% of population ages 20 to 79),SH.STA.DIAB.ZS,,55641,
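
As a worked example of the conversion, the sketch below repeats the calculation on two rows copied from the output above (Aruba and American Samoa). The table name `toy_joined` is hypothetical. It shows that the estimate is simply rate / 100 * population, and that a missing rate yields a missing estimate.

```r
library(dplyr)
library(tibble)

# Two rows taken from the joined output above; toy_joined is a hypothetical name
toy_joined <- tibble(
  `Country Code` = c("ABW", "ASM"),
  Diabetes_prevalance_2017 = c(11.62, NA),
  Population_2017 = c(105264, 55641)
)

# Same formula as in the notebook, rounded to two decimals for display
toy_estimate <- toy_joined %>%
  mutate(New_diabetes_population = round((Diabetes_prevalance_2017 / 100) * Population_2017, 2))

# Aruba: 11.62 / 100 * 105264 = 12231.6768, shown as 12231.68
toy_estimate$New_diabetes_population
```

Rows with a missing rate can be removed with `filter(!is.na(Diabetes_prevalance_2017))` if only complete estimates are needed. Note also that rows such as Arab World (ARB) are regional aggregates rather than countries, so summing the estimates over all rows would count some people more than once.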
