<h1>Analysis of Global COVID-19 Pandemic Data</h1>



![title](img/COVID-19_Map.png)

## Overview:

There are 10 tasks in this final project. All tasks will be graded by your peers who are also completing this assignment within the same session.

For each task, you need to submit a screenshot of the code and its output for review.

If you need to refresh your memories about specific coding details, you may refer to previous hands-on labs for code examples.


In [1]:
require("httr")
require("rvest")

library(httr)
library(rvest)

Loading required package: httr
Loading required package: rvest
Loading required package: xml2
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2


Note: if you cannot import the libraries above, please use `install.packages()` to install them first.
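As a minimal sketch of that check, the snippet below reports which of the two packages are not yet installed. The helper name `missing_packages` is hypothetical (it is not part of the lab), and the actual `install.packages()` call is left commented out so that nothing is installed by accident.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("httr", "rvest"))
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```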


## TASK 1: Get a `COVID-19 pandemic` Wiki page using HTTP request


First, let's write a function that uses an HTTP request to get a public COVID-19 Wiki page.

Before you write the function, you can open this public page in a web browser using the URL https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country.

The goal of task 1 is to get the HTML page using an HTTP request (`httr` library).


In [2]:
get_wiki_covid19_page <- function() {
    
  # Our target COVID-19 wiki page URL is: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country
  # It has two parts, separated by a question mark `?`:
    # 1) base URL: `https://en.wikipedia.org/w/index.php`
    # 2) URL parameter: `title=Template:COVID-19_testing_by_country`
    
  # Wiki page base
  wiki_base_url <- "https://en.wikipedia.org/w/index.php"
  # You will need to create a List which has an element called `title` to specify which page you want to get from Wiki
  # in our case, it will be `Template:COVID-19_testing_by_country`
  query_param<-list(title = "Template:COVID-19_testing_by_country")
  # - Use the `GET` function in the httr library with a `url` argument and a `query` argument to get an HTTP response
  response<-GET(wiki_base_url, query=query_param)  
  #response<-GET(wiki_base_url)   
  # Use the `return` function to return the response
  return(response)
}
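To see how the base URL and the `title` parameter combine into the final address, here is a small base-R sketch that builds the same URL by hand. It only illustrates the URL structure and is not how `httr` works internally; the variable names `page_title`, `encoded_title`, and `full_url` are made up for this example.

```r
# Compose the request URL manually (illustration only; GET() does this for you)
wiki_base_url <- "https://en.wikipedia.org/w/index.php"
page_title <- "Template:COVID-19_testing_by_country"

# reserved = TRUE percent-encodes reserved characters such as ":"
encoded_title <- utils::URLencode(page_title, reserved = TRUE)
full_url <- paste0(wiki_base_url, "?title=", encoded_title)
print(full_url)
```

The colon in the title is percent-encoded, which is why the URL printed with the response differs slightly from the one you type into a browser.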

In [3]:
Sys.Date()

Call the `get_wiki_covid19_page` function to get an HTTP response containing the target HTML page


In [4]:
# Call the get_wiki_covid19_page function and print the response
get_wiki_covid19_page()

Response [https://en.wikipedia.org/w/index.php?title=Template%3ACOVID-19_testing_by_country]
  Date: 2021-06-29 00:36
  Status: 200
  Content-Type: text/html; charset=UTF-8
  Size: 406 kB
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Template:COVID-19 testing by country - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames...
"CS1 German-language sources (de)","CS1 Azerbaijani-language sources (az)","C...
"CS1 uses Japanese-language script (ja)","CS1 Japanese-language sources (ja)"...
"COVID-19 pandemic templates"],"wgPageContentLanguage":"en","wgPageContentMod...
"wgGEAskQuestionEnabled":!1,"wgGELinkRecommendationsFrontendEnabled":!1,"wgWi...
...

## TASK 2: Extract COVID-19 testing data table from the wiki HTML page


On the COVID-19 testing wiki page, you should see a data table (a `<table>` node) containing COVID-19 testing data by country:

<a href="https://cognitiveclass.ai/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkRP0101ENCoursera23911160-2021-01-01">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/M5_Final/images/covid-19-by-country.png" width="400" align="center">
</a>

Note that the numbers on your page may differ from those above, because the pandemic was still ongoing when this notebook was created.

The goal of task 2 is to extract above data table and convert it into a data frame


Now use the `read_html` function in rvest library to get the root html node from response


In [5]:
# Get the root html node from the http response in task 1
root_node<-read_html(get_wiki_covid19_page())

Get the first table in the HTML root node using `html_node` function


In [6]:
# Get the table node from the root html node
table_node<-html_node(root_node, "table")

Read the table node as a data frame using `html_table` function


In [7]:
# Read the table node and convert it into a data frame, and print the data frame for review
data_frame<-html_table(table_node)
data_frame

Country or region,Date[a],Tested,Units[b],Confirmed(cases),"Confirmed /tested,%","Tested /population,%","Confirmed /population,%",Ref.
Afghanistan,17 Dec 2020,154767,samples,49621,32.1,0.40,0.13,[1]
Albania,18 Feb 2021,428654,samples,96838,22.6,15.0,3.4,[2]
Algeria,2 Nov 2020,230553,samples,58574,25.4,0.53,0.13,[3][4]
Andorra,21 Jun 2021,195713,samples,13864,7.1,252,17.9,[5]
Angola,12 Mar 2021,399228,samples,20981,5.3,1.3,0.067,[6]
Antigua and Barbuda,6 Mar 2021,15268,samples,832,5.4,15.9,0.86,[7]
Argentina,26 Jun 2021,16367985,samples,4393142,26.8,36.1,9.7,[8]
Armenia,28 Jun 2021,1177814,samples,224851,19.1,39.9,7.6,[9]
Australia,27 Jun 2021,20303632,samples,30499,0.15,80.9,0.12,[10]
Austria,27 Jun 2021,53713546,samples,646243,1.2,603,7.3,[11]


In [9]:
class(data_frame)

## TASK 3: Pre-process and export the extracted data frame

The goal of task 3 is to pre-process the extracted data frame from the previous step, and export it as a csv file


Let's get a summary of the data frame


In [10]:
# Print the summary of the data frame
summary(data_frame)

 Country or region    Date[a]             Tested            Units[b]        
 Length:173         Length:173         Length:173         Length:173        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Confirmed(cases)   Confirmed /tested,% Tested /population,%
 Length:173         Length:173          Length:173          
 Class :character   Class :character    Class :character    
 Mode  :character   Mode  :character    Mode  :character    
 Confirmed /population,%     Ref.          
 Length:173              Length:173        
 Class :character        Class :character  
 Mode  :character        Mode  :character  

In [11]:
str(data_frame)

'data.frame':	173 obs. of  9 variables:
 $ Country or region      : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ Date[a]                : chr  "17 Dec 2020" "18 Feb 2021" "2 Nov 2020" "21 Jun 2021" ...
 $ Tested                 : chr  "154,767" "428,654" "230,553" "195,713" ...
 $ Units[b]               : chr  "samples" "samples" "samples" "samples" ...
 $ Confirmed(cases)       : chr  "49,621" "96,838" "58,574" "13,864" ...
 $ Confirmed /tested,%    : chr  "32.1" "22.6" "25.4" "7.1" ...
 $ Tested /population,%   : chr  "0.40" "15.0" "0.53" "252" ...
 $ Confirmed /population,%: chr  "0.13" "3.4" "0.13" "17.9" ...
 $ Ref.                   : chr  "[1]" "[2]" "[3][4]" "[5]" ...


In [12]:
dim(data_frame)

As you can see from the summary, the column names are a little difficult to understand and some column data types are not correct. For example, the `Tested` column shows as `character`.

As such, the data frame read from the HTML table needs some pre-processing, such as removing irrelevant columns, renaming columns, and converting columns into proper data types.


We have prepared a pre-processing function for you to convert the data frame, but you can also try to write one yourself


In [13]:
preprocess_covid_data_frame <- function(data_frame) {
    
    shape <- dim(data_frame)

    # Remove the World row
    data_frame<-data_frame[!(data_frame$`Country or region`=="World"),]
    # Remove the last row
    data_frame <- data_frame[1:172, ]
    
    # We don't need the Units and Ref columns, so remove them
    data_frame["Ref."] <- NULL
    data_frame["Units[b]"] <- NULL
    
    # Renaming the columns
    names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")
    
    # Convert column data types
    data_frame$country <- as.factor(data_frame$country)
    data_frame$date <- as.factor(data_frame$date)
    data_frame$tested <- as.numeric(gsub(",","",data_frame$tested))
    data_frame$confirmed <- as.numeric(gsub(",","",data_frame$confirmed))
    data_frame$'confirmed.tested.ratio' <- as.numeric(gsub(",","",data_frame$`confirmed.tested.ratio`))
    data_frame$'tested.population.ratio' <- as.numeric(gsub(",","",data_frame$`tested.population.ratio`))
    data_frame$'confirmed.population.ratio' <- as.numeric(gsub(",","",data_frame$`confirmed.population.ratio`))
    
    return(data_frame)
}
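The key conversion step in the function above is stripping thousands separators before calling `as.numeric()`; without it, a value such as `"154,767"` becomes `NA` with a warning. The sketch below shows that step on a few values copied from the table. The second half, which strips footnote markers such as `[c]` from country names, is an optional extra that the provided function does not perform.

```r
# Values as they appear in the scraped table (character, with commas)
raw_tested <- c("154,767", "428,654", "230,553")

# Remove the commas, then convert to numeric
tested <- as.numeric(gsub(",", "", raw_tested))
print(tested)

# Optional (not done by preprocess_covid_data_frame): drop trailing
# footnote markers like "[c]" from country names
raw_country <- c("China[c]", "Cyprus[d]", "Chile")
clean_country <- gsub("\\[[a-z]+\\]$", "", raw_country)
print(clean_country)
```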

Call the `preprocess_covid_data_frame` function


In [14]:
# call `preprocess_covid_data_frame` function and assign it to a new data frame
new_data_frame = preprocess_covid_data_frame(data_frame)

Get the summary of the processed data frame again


In [15]:
# Print the summary of the processed data frame again
summary(new_data_frame)

                country             date        tested         
 Afghanistan        :  1   27 Jun 2021:30   Min.   :     3880  
 Albania            :  1   28 Jun 2021:27   1st Qu.:   302594  
 Algeria            :  1   25 Jun 2021:13   Median :  1593082  
 Andorra            :  1   26 Jun 2021:13   Mean   : 14740536  
 Angola             :  1   20 Jun 2021: 6   3rd Qu.:  7025002  
 Antigua and Barbuda:  1   21 Jun 2021: 6   Max.   :482526393  
 (Other)            :166   (Other)    :77                      
   confirmed        confirmed.tested.ratio tested.population.ratio
 Min.   :       0   Min.   :  0.000        Min.   :  0.0065       
 1st Qu.:   18591   1st Qu.:  3.575        1st Qu.:  5.9250       
 Median :  109617   Median :  7.350        Median : 26.4500       
 Mean   : 1201693   Mean   : 10.645        Mean   : 72.2826       
 3rd Qu.:  459981   3rd Qu.: 13.500        3rd Qu.: 74.3000       
 Max.   :41587146   Max.   :212.000        Max.   :782.0000       
                   

In [16]:
str(new_data_frame)

'data.frame':	172 obs. of  7 variables:
 $ country                   : Factor w/ 172 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ date                      : Factor w/ 55 levels "1 Mar 2021","10 Feb 2021",..: 14 17 23 25 6 53 34 39 36 36 ...
 $ tested                    : num  154767 428654 230553 195713 399228 ...
 $ confirmed                 : num  49621 96838 58574 13864 20981 ...
 $ confirmed.tested.ratio    : num  32.1 22.6 25.4 7.1 5.3 5.4 26.8 19.1 0.15 1.2 ...
 $ tested.population.ratio   : num  0.4 15 0.53 252 1.3 15.9 36.1 39.9 80.9 603 ...
 $ confirmed.population.ratio: num  0.13 3.4 0.13 17.9 0.067 0.86 9.7 7.6 0.12 7.3 ...


After pre-processing, you can see that the columns and column names are simplified, and the column types are converted into the correct types.


The data frame has the following columns:

*   **country** - The name of the country
*   **date** - Reported date
*   **tested** - Total tested cases by the reported date
*   **confirmed** - Total confirmed cases by the reported date
*   **confirmed.tested.ratio** - The ratio of confirmed cases to the tested cases
*   **tested.population.ratio** - The ratio of tested cases to the population of the country
*   **confirmed.population.ratio** - The ratio of confirmed cases to the population of the country


OK, we can call the `write.csv()` function to save the data frame into a csv file.


In [17]:
# Export the data frame to a csv file
write.csv(new_data_frame, file="covid.csv", row.names=FALSE)

Note that in IBM Watson Studio, there is no traditional "hard disk" associated with an R workspace.

Even if you call the `write.csv()` function to save the data frame as a csv file, it won't be shown in the IBM Cloud Object Storage asset UI automatically.

However, you can still check whether `covid.csv` exists using the following code snippet:


In [18]:
# Get the working directory
wd <- getwd()
# Build the path of the exported file
file_path <- file.path(wd, "covid.csv")
# Check whether the file exists
file.exists(file_path)

**Optional Step**: If you have difficulty finishing the web scraping tasks above, you can still continue with the next tasks by downloading the provided csv file:


In [19]:
## Download a sample csv file
# covid_csv_file <- download.file("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/dataset/covid.csv", destfile="covid.csv")
covid_data_frame_csv <- read.csv("covid.csv", header=TRUE, sep=",")

In [20]:
head(covid_data_frame_csv)

country,date,tested,confirmed,confirmed.tested.ratio,tested.population.ratio,confirmed.population.ratio
Afghanistan,17 Dec 2020,154767,49621,32.1,0.4,0.13
Albania,18 Feb 2021,428654,96838,22.6,15.0,3.4
Algeria,2 Nov 2020,230553,58574,25.4,0.53,0.13
Andorra,21 Jun 2021,195713,13864,7.1,252.0,17.9
Angola,12 Mar 2021,399228,20981,5.3,1.3,0.067
Antigua and Barbuda,6 Mar 2021,15268,832,5.4,15.9,0.86


## TASK 4: Get a subset of the extracted data frame

The goal of task 4 is to get the 5th to 10th rows from the data frame with only `country` and `confirmed` columns selected


In [21]:
# Read covid_data_frame_csv from the csv file

# Get the 5th to 10th rows, with two "country" "confirmed" columns
covid_data_frame_csv[5:10,c("country","confirmed")]

Unnamed: 0,country,confirmed
5,Angola,20981
6,Antigua and Barbuda,832
7,Argentina,4393142
8,Armenia,224851
9,Australia,30499
10,Austria,646243


## TASK 5: Calculate worldwide COVID testing positive ratio

The goal of task 5 is to get the total confirmed and tested cases worldwide, and to compute the overall positive ratio using `confirmed cases / tested cases`


In [22]:
# Get the total confirmed cases worldwide
print(c("Total confirmed cases worldwide:", sum(covid_data_frame_csv$confirmed)))
# Get the total tested cases worldwide
print(c("Total tested cases worldwide:", sum(covid_data_frame_csv$tested)))
# Get the positive ratio (confirmed / tested)
print(c("the positive ratio:", mean(covid_data_frame_csv$confirmed/covid_data_frame_csv$tested)))

[1] "Total confirmed cases worldwide:" "206691263"                       
[1] "Total tested cases worldwide:" "2535372206"                   
[1] "the positive ratio:" "0.105874976026517"  
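Note that the cell above takes the mean of the per-country ratios, which weights every country equally regardless of how many tests it ran. A pooled ratio (total confirmed divided by total tested) can give a noticeably different figure. The sketch below computes it from the two totals printed above, then contrasts the two approaches on two rows taken from the table; the variable names are illustrative only.

```r
# Totals copied from the printed output above
total_confirmed <- 206691263
total_tested <- 2535372206

# Pooled positive ratio: roughly 0.0815, versus 0.1059 for the mean of ratios
pooled_ratio <- total_confirmed / total_tested
print(round(pooled_ratio, 4))

# Two rows from the table: Afghanistan and Australia
confirmed <- c(49621, 30499)
tested <- c(154767, 20303632)

mean_of_ratios <- mean(confirmed / tested)       # each country counts equally
pooled_sample <- sum(confirmed) / sum(tested)    # each test counts equally
print(c(mean_of_ratios = mean_of_ratios, pooled = pooled_sample))
```

Which figure is appropriate depends on the question: the pooled ratio describes all tests taken together, while the mean of ratios describes a typical country.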


## TASK 6: Get a country list which reported their testing data

The goal of task 6 is to get a sorted list of the countries that have reported their COVID-19 testing data


In [23]:
# Get the `country` column
print(covid_data_frame_csv$country)
# Check its class (Factor in R < 4.0; character in R >= 4.0, where read.csv no longer converts strings to factors by default)
print(class(covid_data_frame_csv$country))
# Convert the country column into character so that you can easily sort it
covid_data_frame_csv$country = as.character(covid_data_frame_csv$country)
print(class(covid_data_frame_csv$country))

  [1] Afghanistan            Albania                Algeria               
  [4] Andorra                Angola                 Antigua and Barbuda   
  [7] Argentina              Armenia                Australia             
 [10] Austria                Azerbaijan             Bahamas               
 [13] Bahrain                Bangladesh             Barbados              
 [16] Belarus                Belgium                Belize                
 [19] Benin                  Bhutan                 Bolivia               
 [22] Bosnia and Herzegovina Botswana               Brazil                
 [25] Brunei                 Bulgaria               Burkina Faso          
 [28] Burundi                Cambodia               Cameroon              
 [31] Canada                 Chad                   Chile                 
 [34] China[c]               Colombia               Costa Rica            
 [37] Croatia                Cuba                   Cyprus[d]             
 [40] Czechia            

In [24]:
# Sort the countries AtoZ
print(sort(covid_data_frame_csv$country))

# Sort the countries ZtoA
print(sort(covid_data_frame_csv$country, decreasing=TRUE))

  [1] "Afghanistan"            "Albania"                "Algeria"               
  [4] "Andorra"                "Angola"                 "Antigua and Barbuda"   
  [7] "Argentina"              "Armenia"                "Australia"             
 [10] "Austria"                "Azerbaijan"             "Bahamas"               
 [13] "Bahrain"                "Bangladesh"             "Barbados"              
 [16] "Belarus"                "Belgium"                "Belize"                
 [19] "Benin"                  "Bhutan"                 "Bolivia"               
 [22] "Bosnia and Herzegovina" "Botswana"               "Brazil"                
 [25] "Brunei"                 "Bulgaria"               "Burkina Faso"          
 [28] "Burundi"                "Cambodia"               "Cameroon"              
 [31] "Canada"                 "Chad"                   "Chile"                 
 [34] "China[c]"               "Colombia"               "Costa Rica"            
 [37] "Croatia"             

In [25]:
# Print the sorted ZtoA list
print(list(sort(covid_data_frame_csv$country, decreasing=TRUE)))

[[1]]
  [1] "Zimbabwe"               "Zambia"                 "Vietnam"               
  [4] "Venezuela"              "Uzbekistan"             "Uruguay"               
  [7] "United States"          "United Kingdom"         "United Arab Emirates"  
 [10] "Ukraine"                "Uganda"                 "Turkey"                
 [13] "Tunisia"                "Trinidad and Tobago"    "Togo"                  
 [16] "Thailand"               "Tanzania"               "Taiwan[m]"             
 [19] "Switzerland[l]"         "Sweden"                 "Sudan"                 
 [22] "Sri Lanka"              "Spain"                  "South Sudan"           
 [25] "South Korea"            "South Africa"           "Slovenia"              
 [28] "Slovakia"               "Singapore"              "Serbia"                
 [31] "Senegal"                "Saudi Arabia"           "San Marino"            
 [34] "Saint Vincent"          "Saint Lucia"            "Saint Kitts and Nevis" 
 [37] "Rwanda"        

## TASK 7: Identify countries names with a specific pattern

The goal of task 7 is to use a regular expression to find any countries whose names start with `United`


In [26]:
# Use a regular expression `United.+` to find matches
country_matches = grep("United.+", covid_data_frame_csv$country, value=TRUE)
# Print the matched country names
print(country_matches)

[1] "United Arab Emirates" "United Kingdom"       "United States"       
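The pattern `United.+` matches `United` anywhere in a name, not only at the start. For this data set the result is the same, but anchoring the pattern with `^` makes the "starts with" requirement explicit. The sketch below uses a small vector in which the third entry is a made-up name, included only to show the difference.

```r
# "Republic of United Example" is a hypothetical name used for illustration
countries <- c("United States", "United Kingdom", "Republic of United Example", "Peru")

# Unanchored: matches "United" anywhere in the string
unanchored <- grep("United.+", countries, value = TRUE)
print(unanchored)

# Anchored with ^: matches only names that start with "United"
anchored <- grep("^United", countries, value = TRUE)
print(anchored)
```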


## TASK 8: Pick two countries you are interested in, and then review their testing data

The goal of task 8 is to compare the COVID-19 test data between two countries. You will need to select two rows from the data frame, and select the `country`, `confirmed`, and `confirmed.population.ratio` columns


In [27]:
# Select the row (should be only one) for the first selected country
subset_1 = subset(covid_data_frame_csv, country == "Peru")

# Select the row (should be only one) for the second selected country
subset_2 = subset(covid_data_frame_csv, country == "Colombia")

subset(covid_data_frame_csv, country == "Colombia" | country == "Peru")

Unnamed: 0,country,date,tested,confirmed,confirmed.tested.ratio,tested.population.ratio,confirmed.population.ratio
35,Colombia,27 Jun 2021,19619249,41587146,212.0,40.7,8.6
129,Peru,27 Jun 2021,14042806,2046057,14.6,42.8,6.2
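The output above contains every column. To keep only the three columns named in the task, `subset()` also accepts a `select` argument. Here is a minimal sketch on a small data frame built from values shown in this notebook; `covid_sample` is a stand-in for `covid_data_frame_csv`.

```r
# Small stand-in for covid_data_frame_csv, using values shown above
covid_sample <- data.frame(
  country = c("Colombia", "Peru", "Angola"),
  tested = c(19619249, 14042806, 399228),
  confirmed = c(41587146, 2046057, 20981),
  confirmed.population.ratio = c(8.6, 6.2, 0.067),
  stringsAsFactors = FALSE
)

# Filter rows by country and keep only the requested columns
two_countries <- subset(
  covid_sample,
  country %in% c("Colombia", "Peru"),
  select = c(country, confirmed, confirmed.population.ratio)
)
print(two_countries)
```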


## TASK 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population

The goal of task 9 is to find out which of the two countries you selected has the larger ratio of confirmed cases to population, which may indicate that the country has a higher COVID-19 infection risk


In [28]:
if(subset_1$'confirmed.population.ratio'>subset_2$'confirmed.population.ratio'){
    print(paste(subset_1$country, "has a larger ratio of confirmed cases to population."))
}else{
    print(paste(subset_2$country, "has a larger ratio of confirmed cases to population."))
}


[1] "Colombia has a larger ratio of confirmed cases to population."


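## TASK 10: Find countries with a confirmed-to-population ratio less than a threshold

The goal of task 10 is to find out which countries have a confirmed-to-population ratio of less than 1%, which may indicate that the risk in those countries is relatively low. Note that `confirmed.population.ratio` is expressed in percent (its source column is `Confirmed /population,%`), so a 1% cutoff corresponds to `threshold=1`; the default `threshold=0.01` used in the code below selects countries under 0.01%.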


In [33]:
# Get a subset of any countries with `confirmed.population.ratio` less than the threshold
isLess<-function(my_country, my_ratio, my_date, threshold=0.01){
    if(my_ratio<threshold){
        print(paste("Country:", my_country, "/ ratio:", my_ratio, "/ reported date:", my_date))
    }
}

countriesLess <- function(data){ 
    for (row in 1:nrow(data)) {
        v_country <- data[row, "country"]
        v_ratio  <- data[row, "confirmed.population.ratio"]
        v_date  <- data[row, "date"]
        isLess(v_country,v_ratio, v_date)
    }
}

In [34]:
countriesLess(covid_data_frame_csv)

[1] "Country: Burundi / ratio: 0.0074 / reported date: 5 Jan 2021"
[1] "Country: China[c] / ratio: 0.0061 / reported date: 31 Jul 2020"
[1] "Country: Laos / ratio: 0.00063 / reported date: 1 Mar 2021"
[1] "Country: North Korea / ratio: 0 / reported date: 25 Nov 2020"
[1] "Country: Tanzania / ratio: 0.00085 / reported date: 18 Nov 2020"
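The same filter can be written without an explicit loop by using logical indexing. The sketch below is one possible alternative, not the lab's reference solution: `countries_below` and `covid_sample` are hypothetical names, and the threshold is given in percent to match the units of the column.

```r
# Small stand-in for covid_data_frame_csv; ratios are in percent
covid_sample <- data.frame(
  country = c("Afghanistan", "Burundi", "Laos", "Andorra"),
  confirmed.population.ratio = c(0.13, 0.0074, 0.00063, 17.9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose ratio (in percent) is below the threshold
countries_below <- function(data, threshold_percent) {
  data[data$confirmed.population.ratio < threshold_percent, , drop = FALSE]
}

below_001 <- countries_below(covid_sample, 0.01)  # under 0.01%
below_1 <- countries_below(covid_sample, 1)       # under 1%
print(below_001$country)
print(below_1$country)
```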
