# Cost of Living (CoL) Web Scraper

## Required Packages

In [1]:
library(rvest)
library(tidyverse)
library(stringr)

Loading required package: xml2
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


## Reading the webpage 

The target is the table matched by the `data_wide_table` selector, which is the 3rd table returned by `html_nodes` below.
![Numbeo Page](./images/numbeo_swazi.png)

In [2]:
CoL_page <- read_html("https://www.numbeo.com/cost-of-living/country_result.jsp?country=Swaziland") #get html from page
CoL_elements <- html_nodes(x = CoL_page, css = "table") #get all the tables on the page
CoL_text = html_text(CoL_elements[3], trim = TRUE) #text of the 3rd table returned
CoL_table = as_tibble(html_table(CoL_elements[3])[[1]]) #extract the table from the list

In [3]:
colnames(CoL_table) = c("item", 'avgPrice', 'range') #rename the columns
CoL_table = CoL_table %>% filter(avgPrice != "[ Edit ]") #remove categories that corrupt table
CoL_table %>% head(5)

item,avgPrice,range
"Meal, Inexpensive Restaurant",70.00 R,50.00-70.00
"Meal for 2 People, Mid-range Restaurant, Three-course",300.00 R,300.00-400.00
McMeal at McDonalds (or Equivalent Combo Meal),50.00 R,50.00-59.00
Domestic Beer (0.5 liter draught),13.00 R,11.00-15.00
Imported Beer (0.33 liter bottle),17.50 R,12.00-20.00


## "Borrowing" Countries
Using the Inspect Element tool in Google Chrome, the list of countries is "borrowed" from the Numbeo home page.

![Borrowing Countries](./images/strealCountries.png)

In [4]:
countries = c("Afghanistan","Aland Islands","Albania","Algeria","Andorra","Angola","Antigua And Barbuda","Argentina","Armenia","Aruba","Australia","Austria","Azerbaijan","Bahamas","Bahrain","Bangladesh","Barbados","Belarus","Belgium","Belize","Bermuda","Bhutan","Bolivia","Bosnia And Herzegovina","Botswana","Brazil","British Virgin Islands","Brunei","Bulgaria","Burkina Faso","Burundi","Cambodia","Cameroon","Canada","Cape Verde","Cayman Islands","Chad","Chile","China","Colombia","Congo","Costa Rica","Croatia","Cuba","Curacao","Cyprus","Czech Republic","Denmark","Djibouti","Dominica","Dominican Republic","Ecuador","Egypt","El Salvador","Estonia","Ethiopia","Faroe Islands","Fiji","Finland","France","French Polynesia","Gabon","Gambia","Georgia","Germany","Ghana","Gibraltar","Greece","Greenland","Grenada","Guam","Guatemala","Guernsey","Guyana","Honduras","Hong Kong","Hungary","Iceland","India","Indonesia","Iran","Iraq","Ireland","Isle Of Man","Israel","Italy","Ivory Coast","Jamaica","Japan","Jersey","Jordan","Kazakhstan","Kenya","Kosovo (Disputed Territory)","Kuwait","Kyrgyzstan","Laos","Latvia","Lebanon","Lesotho","Liberia","Libya","Lithuania","Luxembourg","Macao","Macedonia","Madagascar","Malawi","Malaysia","Maldives","Mali","Malta","Mauritania","Mauritius","Mexico","Micronesia","Moldova","Monaco","Mongolia","Montenegro","Morocco","Mozambique","Myanmar","Namibia","Nepal","Netherlands","New Caledonia","New Zealand","Nicaragua","Nigeria","Northern Mariana Islands","Norway","Oman","Pakistan","Palestinian Territory","Panama","Papua New Guinea","Paraguay","Peru","Philippines","Poland","Portugal","Puerto Rico","Qatar","Reunion","Romania","Russia","Rwanda","Saint Kitts And Nevis","Saint Lucia","Saint Vincent And The Grenadines","Samoa","Saudi Arabia","Senegal","Serbia","Seychelles","Singapore","Sint Maarten","Slovakia","Slovenia","Somalia","South Africa","South Korea","South Sudan","Spain","Sri Lanka","Sudan","Suriname","Swaziland","Sweden","Switzerland","Syria","Taiwan","Tajikistan","Tanzania","Thailand","Timor-Leste","Togo","Tonga","Trinidad And Tobago","Tunisia","Turkey","Turkmenistan","Turks And Caicos Islands","Uganda","Ukraine","United Arab Emirates","United Kingdom","United States","Uruguay","Us Virgin Islands","Uzbekistan","Vanuatu","Venezuela","Vietnam","Yemen","Zambia","Zimbabwe")

Note that in the URL, spaces in country names are replaced with `+`.
![Spaces URL address space](./images/spaceURL.png)
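
The substitution can be checked on a single name before crawling. This is a minimal sketch in base R; `to_query_name` is a hypothetical helper name that does not appear in the rest of the notebook.

```r
# Hypothetical helper: replace spaces with "+" so the name can be used
# as the `country` query parameter in the Numbeo URL.
to_query_name <- function(country) {
    gsub(" ", "+", country, fixed = TRUE)
}

base_url <- "https://www.numbeo.com/cost-of-living/country_result.jsp?country="
example_url <- paste0(base_url, to_query_name("Antigua And Barbuda"))
print(example_url)
```

Names without spaces pass through unchanged, so the helper can be applied to the whole `countries` vector.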

## The Crawl Function
This function takes a country, builds the URL for it, and returns the cost-of-living table for that country together with the currency the prices are quoted in.

In [15]:
getCoL = function(country){
    now = Sys.time()
    writeLines(str_c(now, "Scraping ", country, sep="\t"))
    
    URL = str_c("https://www.numbeo.com/cost-of-living/country_result.jsp?country=", country, sep="")
    #get html from page
    CoL_page = read_html(URL) 
    
    #get all the tables on the pages
    CoL_elements = html_nodes(x = CoL_page, css = "table") 
    
    #3rd table returned
    CoL_text = html_text(CoL_elements[3], trim = TRUE) 
    
    #removing table from list
    CoL_table = as_tibble(html_table(CoL_elements[3])[[1]])
    
    #rename the columns
    colnames(CoL_table) = c("item", 'avgPrice', 'range') 
    
    #remove categories that corrupt table
    CoL_table = CoL_table %>% filter(avgPrice != "[ Edit ]") 
    
    #split the numbers out of the avgPrice column to identify currency used
    currency_pass_1 = CoL_table %>% select(avgPrice) %>% unlist() %>% str_split("[0-9]") 
    currency = str_split(currency_pass_1[[1]], " ") %>% tail(1) %>% unlist()
    
    writeLines(paste0("Time:\t\t ", Sys.time() -now))
    return(c(CoL_table, cur=currency))
    }
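
The currency step in the function splits on digits and then on spaces. An alternative worth considering is a single regular expression that strips the leading number. The sketch below assumes prices look like `70.00 R`, as in the table shown earlier; `extract_currency` is a hypothetical helper, not part of the function above.

```r
# Hypothetical helper: drop the leading number (digits, dots, commas) and any
# whitespace after it, leaving only the trailing currency label.
extract_currency <- function(price) {
    sub("^[0-9.,]+[[:space:]]*", "", price)
}

prices <- c("70.00 R", "300.00 R", "17.50 R")
print(unique(extract_currency(prices)))
```

If the page uses a different layout for some countries (for example, the symbol before the number), the pattern would need adjusting.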

## Perform the crawl

In [21]:
now = Sys.time()
countries_plus = lapply(countries[1:10], function(x) gsub(" ", "+", x, fixed=TRUE))
country_tibbles = lapply(countries_plus, function(x) getCoL(x))
Sys.time() -now

2017-08-20 12:30:00	Scraping 	Afghanistan
Time:		 1.12367796897888
2017-08-20 12:30:01	Scraping 	Aland+Islands
Time:		 1.10028696060181
2017-08-20 12:30:03	Scraping 	Albania
Time:		 1.10572409629822
2017-08-20 12:30:04	Scraping 	Algeria
Time:		 1.07789421081543
2017-08-20 12:30:05	Scraping 	Andorra
Time:		 1.09718012809753
2017-08-20 12:30:06	Scraping 	Angola
Time:		 1.08916997909546
2017-08-20 12:30:07	Scraping 	Antigua+And+Barbuda
Time:		 1.097904920578
2017-08-20 12:30:08	Scraping 	Argentina
Time:		 1.12205195426941
2017-08-20 12:30:09	Scraping 	Armenia
Time:		 1.0961799621582
2017-08-20 12:30:10	Scraping 	Aruba
Time:		 1.08515501022339


Time difference of 11.00251 secs
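
When the crawl is extended beyond the first 10 countries, a single failed request would stop `lapply` and discard everything scraped so far. One possible guard is sketched below. `safe_scrape`, `fake_scraper`, and the `pause` argument are all hypothetical names introduced for illustration; `fake_scraper` stands in for `getCoL` so the example runs without network access.

```r
# Hypothetical wrapper: call `scrape_fun` on one country and return NULL
# (with a warning) instead of stopping if the request or parsing fails.
safe_scrape <- function(country, scrape_fun, pause = 1) {
    result <- tryCatch(
        scrape_fun(country),
        error = function(e) {
            warning("Failed for ", country, ": ", conditionMessage(e), call. = FALSE)
            NULL
        }
    )
    Sys.sleep(pause)  # assumed delay between requests, in seconds
    result
}

# Stand-in for getCoL so the sketch runs offline.
fake_scraper <- function(country) {
    if (country == "Nowhere") stop("page not found")
    list(country = country)
}

results <- suppressWarnings(
    lapply(c("Albania", "Nowhere"), safe_scrape, scrape_fun = fake_scraper, pause = 0)
)
print(vapply(results, is.null, logical(1)))
```

In the real crawl, `scrape_fun = getCoL` would be passed instead, and the `NULL` entries identify the countries that need a retry.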

### The next thing to address - exchange rates
The website returns costs in the local currency, so another crawler is needed to get the exchange rates from Google.

Encountering some trouble with the character encoding...

In [149]:
for( c in country_tibbles[7:10]){
    
    ref_currency = "USD"
    from_currency = iconv(str_replace(c$cur, "\\s", ""), to = "UTF-8")
    
    print(enc2utf8(from_currency))
    
    URL =  str_c("https://www.google.co.uk/search?q=", enc2utf8(from_currency), "+to+", ref_currency, sep="")
    print(URL)
    #get html from page
    CoL_page = read_html( URL, encoding = "utf8") 

    #get all the div elements on the page
    CoL_elements = html_nodes(x = CoL_page, css = "div")
    text = CoL_elements %>% html_text()
    exchange_rate = as.numeric(strsplit(strsplit(strsplit(unlist(strsplit(text, "'")[22]), "US dollars")[[1]], "'")[[1]], '=')[[1]][2])
    print(exchange_rate)
}

[1] "EC$"
[1] "https://www.google.co.uk/search?q=EC$+to+USD"


ERROR: Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : input conversion failed due to input error, bytes 0xBB 0x3C 0x2F 0x61 [6003]
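
One thing worth ruling out is the unescaped `$` in the query string. The sketch below percent-encodes the currency label with `utils::URLencode` before building the URL. Whether this resolves the error above is not established here: the message refers to input conversion of the response, which may be a separate problem with the encoding of the returned page.

```r
# Percent-encode the currency label so reserved characters such as "$"
# are safe inside a query string.
from_currency <- "EC$"
ref_currency <- "USD"
query <- paste0(utils::URLencode(from_currency, reserved = TRUE), "+to+", ref_currency)
search_url <- paste0("https://www.google.co.uk/search?q=", query)
print(search_url)
```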
