# Cost of Living (CoL) Web Scraper

## Required Packages

In [1]:
library(rvest)
library(tidyverse)
library(stringr)

Loading required package: xml2
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


## Reading the webpage 

Read the webpage and extract the price table. The target is the table matched by the `data_wide_table` selector, which is the 3rd table returned by `html_nodes` below.
![Numbeo Page](./images/numbeo_swazi.png)

In [2]:
CoL_page <- read_html("https://www.numbeo.com/cost-of-living/country_result.jsp?country=Swaziland") #get html from page
CoL_elements <- html_nodes(x = CoL_page, css = "table") #get all the tables on the pages
CoL_text = html_text(CoL_elements[3], trim = TRUE) #3rd table returned
CoL_table = as_tibble(html_table(CoL_elements[3])[[1]])#removing table from list

In [3]:
colnames(CoL_table) = c("item", 'avgPrice', 'range') #rename the columns
CoL_table = CoL_table %>% filter(avgPrice != "[ Edit ]") #remove categories that corrupt table
CoL_table %>% head(5)

item,avgPrice,range
"Meal, Inexpensive Restaurant",70.00 R,50.00-70.00
"Meal for 2 People, Mid-range Restaurant, Three-course",300.00 R,300.00-400.00
McMeal at McDonalds (or Equivalent Combo Meal),50.00 R,50.00-59.00
Domestic Beer (0.5 liter draught),13.00 R,11.00-15.00
Imported Beer (0.33 liter bottle),17.50 R,12.00-20.00
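
The positional index `CoL_elements[3]` will break if the page gains or loses a table. A possibly sturdier option is to select the table by class. The sketch below assumes the price table carries the class `data_wide_table` (the selector mentioned above). It runs against a small inline HTML snippet, invented for illustration, rather than the live page.

```r
library(rvest)

# Inline HTML standing in for the live page (structure and values are made up)
demo_html <- '<html><body>
  <table class="nav"><tr><td>menu</td></tr></table>
  <table class="data_wide_table">
    <tr><th>Item</th><th>Price</th><th>Range</th></tr>
    <tr><td>Cappuccino (regular)</td><td>1.00 X</td><td>0.50-1.50</td></tr>
  </table>
</body></html>'

demo_page <- read_html(demo_html)

# Select the table by class instead of by its position among all tables
demo_node <- html_node(demo_page, css = "table.data_wide_table")
demo_table <- html_table(demo_node)
```

If the class name on the real page differs, only the `css` argument needs to change.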


## "Borrowing" Countries
Using the Inspect Element tool in Google Chrome, "borrowing" the country list from the Numbeo home page

![Borrowing Countries](./images/strealCountries.png)

In [4]:
countries = c("Afghanistan","Aland Islands","Albania","Algeria","Andorra","Angola","Antigua And Barbuda","Argentina","Armenia","Aruba","Australia","Austria","Azerbaijan","Bahamas","Bahrain","Bangladesh","Barbados","Belarus","Belgium","Belize","Bermuda","Bhutan","Bolivia","Bosnia And Herzegovina","Botswana","Brazil","British Virgin Islands","Brunei","Bulgaria","Burkina Faso","Burundi","Cambodia","Cameroon","Canada","Cape Verde","Cayman Islands","Chad","Chile","China","Colombia","Congo","Costa Rica","Croatia","Cuba","Curacao","Cyprus","Czech Republic","Denmark","Djibouti","Dominica","Dominican Republic","Ecuador","Egypt","El Salvador","Estonia","Ethiopia","Faroe Islands","Fiji","Finland","France","French Polynesia","Gabon","Gambia","Georgia","Germany","Ghana","Gibraltar","Greece","Greenland","Grenada","Guam","Guatemala","Guernsey","Guyana","Honduras","Hong Kong","Hungary","Iceland","India","Indonesia","Iran","Iraq","Ireland","Isle Of Man","Israel","Italy","Ivory Coast","Jamaica","Japan","Jersey","Jordan","Kazakhstan","Kenya","Kosovo (Disputed Territory)","Kuwait","Kyrgyzstan","Laos","Latvia","Lebanon","Lesotho","Liberia","Libya","Lithuania","Luxembourg","Macao","Macedonia","Madagascar","Malawi","Malaysia","Maldives","Mali","Malta","Mauritania","Mauritius","Mexico","Micronesia","Moldova","Monaco","Mongolia","Montenegro","Morocco","Mozambique","Myanmar","Namibia","Nepal","Netherlands","New Caledonia","New Zealand","Nicaragua","Nigeria","Northern Mariana Islands","Norway","Oman","Pakistan","Palestinian Territory","Panama","Papua New Guinea","Paraguay","Peru","Philippines","Poland","Portugal","Puerto Rico","Qatar","Reunion","Romania","Russia","Rwanda","Saint Kitts And Nevis","Saint Lucia","Saint Vincent And The Grenadines","Samoa","Saudi Arabia","Senegal","Serbia","Seychelles","Singapore","Sint Maarten","Slovakia","Slovenia","Somalia","South Africa","South Korea","South Sudan","Spain","Sri Lanka","Sudan","Suriname","Swaziland","Sweden","Switzerland","Syria","Taiwan","Tajikistan","Tanzania","Thailand","Timor-Leste","Togo","Tonga","Trinidad And Tobago","Tunisia","Turkey","Turkmenistan","Turks And Caicos Islands","Uganda","Ukraine","United Arab Emirates","United Kingdom","United States","Uruguay","Us Virgin Islands","Uzbekistan","Vanuatu","Venezuela","Vietnam","Yemen","Zambia","Zimbabwe")

Noticing that in the URL, spaces in country names are replaced with `+`
![Spaces URL address space](./images/spaceURL.png)
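
A minimal sketch of that substitution, using base R only. The helper name `to_query_name` is hypothetical. Since `gsub` is vectorised, it can be applied to the whole country vector in one call.

```r
# Hypothetical helper: turn a country name into the form used in the URL
to_query_name <- function(country) {
  gsub(" ", "+", country, fixed = TRUE)
}

base_url <- "https://www.numbeo.com/cost-of-living/country_result.jsp?country="

# Works on a whole character vector at once, no lapply needed
example_urls <- paste0(base_url, to_query_name(c("Swaziland", "Antigua And Barbuda")))
```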

## The Crawl Function
This function takes a country name, builds the URL for that country, and returns the price table together with the currency used

In [5]:
getCoL = function(country){
    now = Sys.time()
    writeLines(str_c(now, "Scraping ", country, sep="\t"))
    
    URL = str_c("https://www.numbeo.com/cost-of-living/country_result.jsp?country=", country, sep="")
    #get html from page
    CoL_page = read_html(URL) 
    
    #get all the tables on the pages
    CoL_elements = html_nodes(x = CoL_page, css = "table") 
    
    #3rd table returned
    CoL_text = html_text(CoL_elements[3], trim = TRUE) 
    
    #removing table from list
    CoL_table = as_tibble(html_table(CoL_elements[3])[[1]])
    
    #rename the columns
    colnames(CoL_table) = c("item", 'avgPrice', 'range') 
    
    #remove categories that corrupt table
    CoL_table = CoL_table %>% filter(avgPrice != "[ Edit ]") 
    
    #split the numbers out of the avgPrice column to identify currency used
    currency_pass_1 = CoL_table %>% select(avgPrice) %>% unlist() %>% str_split("[0-9]") 
    currency = str_split(currency_pass_1[[1]], " ") %>% tail(1) %>% unlist()
    
    writeLines(paste0("Time:\t\t ", Sys.time() -now))
    return(c(CoL_table, cur=currency))
    }
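
The two `str_split` passes above pull the currency label out of the first price. An alternative sketch, assuming every price looks like a number followed by a currency label (as in `"3,000.00 AMD"`), separates the two parts with regular expressions. The helper name `split_price` is hypothetical, and labels that themselves contain dots or digits would need a different pattern.

```r
library(stringr)

# Hypothetical helper: split "3,000.00 AMD" into a numeric value and a currency label
split_price <- function(price) {
  # Numeric part: digits with optional thousands separators and decimals
  number <- str_extract(price, "[0-9][0-9,]*\\.?[0-9]*")
  # Currency part: drop digits, separators and any whitespace (including non-breaking spaces)
  currency <- str_remove_all(price, "[0-9.,\\s\\u00a0]")
  data.frame(
    value = as.numeric(str_remove_all(number, ",")),
    currency = currency,
    stringsAsFactors = FALSE
  )
}

parsed <- split_price(c("70.00 R", "3,000.00 AMD"))
```

Keeping the value numeric makes the later exchange-rate step a plain multiplication.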

## Perform the crawl

In [6]:
now = Sys.time()
countries_plus = lapply(countries[1:10], function(x) gsub(" ", "+", x, fixed=TRUE))
country_tibbles = lapply(countries_plus, function(x) getCoL(x))
Sys.time() -now

2017-08-21 07:24:36	Scraping 	Afghanistan
Time:		 1.24503087997437
2017-08-21 07:24:38	Scraping 	Aland+Islands
Time:		 1.24078488349915
2017-08-21 07:24:39	Scraping 	Albania
Time:		 1.20728588104248
2017-08-21 07:24:40	Scraping 	Algeria
Time:		 1.43342995643616
2017-08-21 07:24:42	Scraping 	Andorra
Time:		 1.84745717048645
2017-08-21 07:24:43	Scraping 	Angola
Time:		 1.56486487388611
2017-08-21 07:24:45	Scraping 	Antigua+And+Barbuda
Time:		 1.19656109809875
2017-08-21 07:24:46	Scraping 	Argentina
Time:		 1.20054483413696
2017-08-21 07:24:47	Scraping 	Armenia
Time:		 1.19802188873291
2017-08-21 07:24:49	Scraping 	Aruba
Time:		 1.25996398925781


Time difference of 13.40223 secs
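
With over 200 countries, one failed request would stop the whole `lapply`. A minimal sketch of a wrapper that catches the error and pauses between requests is shown here. `scrape_safely` and `fake_scraper` are hypothetical names. `fake_scraper` stands in for `getCoL` so the example runs offline.

```r
# Hypothetical wrapper: run a scraping function, return NULL instead of stopping on error
scrape_safely <- function(scrape_fun, country, pause = 1) {
  result <- tryCatch(
    scrape_fun(country),
    error = function(e) {
      message("Failed for ", country, ": ", conditionMessage(e))
      NULL
    }
  )
  Sys.sleep(pause)  # wait between requests to go easy on the server
  result
}

# Stand-in for getCoL() that fails for one input
fake_scraper <- function(country) {
  if (country == "Nowhere") stop("page not found")
  paste("table for", country)
}

results <- lapply(c("Albania", "Nowhere"), scrape_safely,
                  scrape_fun = fake_scraper, pause = 0)
```

Failed countries show up as `NULL` entries in `results` and can be retried or dropped afterwards.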

In [11]:
as.data.frame(country_tibbles)

item,avgPrice,range,cur,item.1,avgPrice.1,range.1,cur.1,item.2,avgPrice.2,⋯,range.7,cur.7,item.8,avgPrice.8,range.8,cur.8,item.9,avgPrice.9,range.9,cur.9
"Meal, Inexpensive Restaurant",3.67 $,2.25-5.00,$,"Meal, Inexpensive Restaurant",10.40 €,10.00-14.00,€,"Meal, Inexpensive Restaurant",681.62 Lek,⋯,137.43-207.96,ARS,"Meal, Inexpensive Restaurant","3,000.00 AMD","2,000.00-3,000.00",AMD,"Meal, Inexpensive Restaurant",11.00 $,8.50-15.00,$
"Meal for 2 People, Mid-range Restaurant, Three-course",13.50 $,8.33-30.00,$,"Meal for 2 People, Mid-range Restaurant, Three-course",70.00 €,52.00-80.00,€,"Meal for 2 People, Mid-range Restaurant, Three-course","2,200.00 Lek",⋯,400.00-800.00,ARS,"Meal for 2 People, Mid-range Restaurant, Three-course","10,000.00 AMD","8,000.00-13,000.00",AMD,"Meal for 2 People, Mid-range Restaurant, Three-course",74.03 $,50.00-80.00,$
McMeal at McDonalds (or Equivalent Combo Meal),5.83 $,3.00-8.00,$,McMeal at McDonalds (or Equivalent Combo Meal),7.95 €,7.90-8.00,€,McMeal at McDonalds (or Equivalent Combo Meal),400.00 Lek,⋯,119.00-160.00,ARS,McMeal at McDonalds (or Equivalent Combo Meal),"2,000.00 AMD","1,500.00-2,500.00",AMD,McMeal at McDonalds (or Equivalent Combo Meal),6.85 $,6.00-7.42,$
Domestic Non-Alcoholic Beer (0.5 liter draught),4.50 $,2.00-7.00,$,Domestic Beer (0.5 liter draught),6.00 €,5.00-6.00,€,Domestic Beer (0.5 liter draught),120.00 Lek,⋯,25.00-70.00,ARS,Domestic Beer (0.5 liter draught),500.00 AMD,400.00-650.00,AMD,Domestic Beer (0.5 liter draught),3.50 $,2.50-7.00,$
Imported Non-Alcoholic Beer (0.33 liter bottle),3.00 $,2.00-5.00,$,Imported Beer (0.33 liter bottle),5.00 €,3.00-7.00,€,Imported Beer (0.33 liter bottle),170.00 Lek,⋯,45.00-86.65,ARS,Imported Beer (0.33 liter bottle),700.00 AMD,"500.00-1,000.00",AMD,Imported Beer (0.33 liter bottle),4.00 $,2.59-7.00,$
Cappuccino (regular),1.83 $,1.00-3.00,$,Cappuccino (regular),3.00 €,2.00-4.00,€,Cappuccino (regular),135.92 Lek,⋯,33.00-60.00,ARS,Cappuccino (regular),867.98 AMD,"500.00-1,200.00",AMD,Cappuccino (regular),2.96 $,2.28-3.50,$
Coke/Pepsi (0.33 liter bottle),0.37 $,0.30-0.60,$,Coke/Pepsi (0.33 liter bottle),1.40 €,1.00-2.50,€,Coke/Pepsi (0.33 liter bottle),114.40 Lek,⋯,17.33-40.00,ARS,Coke/Pepsi (0.33 liter bottle),291.67 AMD,200.00-500.00,AMD,Coke/Pepsi (0.33 liter bottle),1.90 $,1.40-3.00,$
Water (0.33 liter bottle),0.26 $,0.18-0.50,$,Water (0.33 liter bottle),1.50 €,0.50-2.50,€,Water (0.33 liter bottle),55.08 Lek,⋯,15.00-30.00,ARS,Water (0.33 liter bottle),155.25 AMD,119.45-200.00,AMD,Water (0.33 liter bottle),1.50 $,1.00-2.00,$
"Milk (regular), (1 liter)",1.04 $,0.75-1.26,$,"Milk (regular), (1 liter)",1.15 €,1.00-1.30,€,"Milk (regular), (1 liter)",120.08 Lek,⋯,17.00-25.00,ARS,"Milk (regular), (1 liter)",414.12 AMD,390.00-500.00,AMD,"Milk (regular), (1 liter)",1.27 $,0.79-1.67,$
Loaf of Fresh White Bread (500g),0.51 $,0.20-1.00,$,Loaf of Fresh White Bread (500g),2.70 €,1.40-4.00,€,Loaf of Fresh White Bread (500g),63.02 Lek,⋯,15.00-50.00,ARS,Loaf of Fresh White Bread (500g),226.11 AMD,200.00-250.00,AMD,Loaf of Fresh White Bread (500g),1.83 $,1.09-3.86,$
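
`as.data.frame` places the countries side by side, which gives one block of columns per country. A long layout with a `country` column may be easier to filter and join. The sketch below assumes each element is a data frame with the same columns. The output of `getCoL` is a plain list, so it would need converting first. The toy values are invented for illustration.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for two scraped tables (values are made up)
toy_tables <- list(
  Albania = tibble(item = c("Cappuccino (regular)", "Water (0.33 liter bottle)"),
                   avgPrice = c("1.00 X", "2.00 X")),
  Algeria = tibble(item = c("Cappuccino (regular)", "Water (0.33 liter bottle)"),
                   avgPrice = c("3.00 Y", "4.00 Y"))
)

# Stack the tables; the list names become the `country` column
long_table <- bind_rows(toy_tables, .id = "country")
```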


### The next thing to address - exchange rates
The website returns costs in the local currency, so another crawler is needed to get the exchange rates from Google

Encountering some trouble with the character encoding...

In [7]:
for (country_data in country_tibbles[7:10]) {

    ref_currency = "USD"
    #strip whitespace from the currency label and convert it to UTF-8
    #iconv() has no `sep` argument; passing one failed with: unused argument (sep = "")
    currency = iconv(str_replace(country_data$cur, "\\s", ""), to = "UTF-8")
    from_currency = paste0("currency+in+", currency)

    print(enc2utf8(from_currency))

    URL = str_c("https://www.google.co.uk/search?q=", enc2utf8(from_currency), "+to+", ref_currency, sep="")
    print(URL)
    #get html from page
    CoL_page = read_html(URL, encoding = "utf8")

    #get the text of every div on the page
    CoL_elements = html_nodes(x = CoL_page, css = "div")
    text = CoL_elements %>% html_text()
    #fragile: depends on the exact layout of the Google results page
    exchange_rate = as.numeric(strsplit(strsplit(strsplit(unlist(strsplit(text, "'")[22]), "US dollars")[[1]], "'")[[1]], '=')[[1]][2])
    print(exchange_rate)
}
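
Part of the encoding trouble may come from non-ASCII currency symbols such as the euro sign going into the query string unescaped. A minimal sketch, assuming percent-encoding the symbol is acceptable for the search URL, uses `utils::URLencode`. Whether Google then returns a parseable rate is not checked here.

```r
# The euro sign, written as an escape so the source file stays ASCII
euro <- enc2utf8("\u20ac")

# Percent-encode the UTF-8 bytes of the symbol
encoded <- utils::URLencode(euro, reserved = TRUE)

# Query fragment in the same shape as the loop above
query <- paste0("currency+in+", encoded, "+to+USD")
```

Plain ASCII labels such as `AMD` pass through `URLencode` unchanged, so the same call could be applied to every currency.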
