In [1]:
library(dplyr)
library(rvest)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: xml2


# Assignment 2 – Programming Structure and Function

### Q1

Suppose we have a matrix of 1s and 0s. We want to create a vector as follows: for each row of the matrix, the corresponding element of the vector will be either 1 or 0, depending on whether the majority of the first c elements in the row is 1 or 0. Here c will be a parameter which we want to control. Create a function to perform this task. 

In [2]:
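q1 = function(mtx, c) {
    # share of 1s among the first c elements of each row
    means = mtx %>% apply(1, function(x) mean(x[1:c]))
    # round half up, so an exact tie counts as a majority of 1s
    return((means + 0.5) %>% floor)
}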

In [3]:
df = data.frame(hey=c(1, 0, 1, 0, 0), bye=c(1,1,1,1,0), whee=c(1, 0, 1, 1, 0), la=c(0,1,0,0,0), ah=c(0,0,0,0,0))

In [4]:
for (i in 1:5) {
    print(q1(df, i))
}

[1] 1 0 1 0 0
[1] 1 1 1 1 0
[1] 1 0 1 1 0
[1] 1 1 1 1 0
[1] 1 0 1 0 0
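
The same majority rule can also be written without `apply`. The following is only a sketch: `majority_first_k` and `example` are hypothetical names, and it assumes the input can be coerced to a numeric matrix of 0s and 1s. Like `q1`, it counts an exact tie as 1.

```r
# Hypothetical helper: majority vote over the first k columns of a 0/1 matrix.
majority_first_k = function(mtx, k) {
    m = as.matrix(mtx)
    stopifnot(k >= 1, k <= ncol(m))
    # drop = FALSE keeps a matrix even when k == 1, so rowMeans still works
    as.integer(rowMeans(m[, seq_len(k), drop = FALSE]) >= 0.5)
}

example = matrix(c(1, 1, 1, 0, 0,
                   0, 1, 0, 1, 0), nrow = 2, byrow = TRUE)

majority_first_k(example, 3)  # 1 0
majority_first_k(example, 2)  # 1 1 (the second row is a tie, which counts as 1)
```

`rowMeans` works on the whole sub-matrix at once, which is usually faster than a row-by-row `apply` on large inputs.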


### Q2

Create a script to crawl all property data from SRX.com.sg. Show the code (using a for loop or a while loop) that crawls the complete data, but store only two pages of data in a data frame and save it as “srx.csv”.

In [5]:
host_url = 'https://www.srx.com.sg'

In [6]:
# this function will fetch all listing nodes and return them as a list
fetch_property_nodes = function(page) {
    url = paste(host_url, "/search/sale/residential?page=", page, sep = "")
    srx_html = read_html(url)
    listing = html_nodes(srx_html, ".listingDetailTitle")
    if(length(listing) == 0)
        return(c('Done'))
    return(listing)
}

In [7]:
# this is just a tryCatch wrapper on top of fetch_property_nodes
trycatch_fetch_property_notes = function(n) {
    tryCatch({
        fetch_property_nodes(n)
    }, warning = function(w) {
        print(paste("Warning in crawling page ", n, " of SRX", sep = ""))
        # treat a warning like an error so the caller always gets nodes or 'Done'
        c('Done')
    }, error = function(e) {
        print(paste("Error in crawling page ", n, " of SRX", sep = ""))
        c('Done')
    })
}
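
The wrapper relies on the fact that `tryCatch` returns whatever the matching handler returns. A minimal offline sketch of that behaviour follows. `safe_fetch` and `flaky` are hypothetical names, and `flaky` stands in for a real page request.

```r
# Hypothetical wrapper: run fetch_fn(page) and fall back to the "Done" sentinel on error.
safe_fetch = function(fetch_fn, page) {
    tryCatch(
        fetch_fn(page),
        error = function(e) {
            # the handler argument is the condition object, not the page number
            message("Error on page ", page, ": ", conditionMessage(e))
            "Done"
        }
    )
}

# Stand-in for a network call: only pages 1 and 2 exist.
flaky = function(page) {
    if (page > 2) stop("no such page")
    paste("page", page)
}

safe_fetch(flaky, 1)  # "page 1"
safe_fetch(flaky, 3)  # "Done", after the error message is printed
```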

In [8]:
# this function takes in the node's 1) details 2) facilities and 3) agent name and spits out a single-row df for entry
get_df_from_node_list = function(listing_info_node, facilities_info_node, agent_name) {
    labels = listing_info_node %>% html_children %>% `[`(c(T, F)) %>% html_text %>% takeout_last_char
    values = listing_info_node %>% html_children %>% `[`(c(F, T)) %>% html_text %>% remove_tabs
    facilities = facilities_info_node %>% html_children %>% html_text %>% remove_tabs %>% paste(collapse=' | ')
    
    data_list = list()
    data_list[labels] = values
    data_list['Facilities'] = facilities
    data_list['Agent'] = agent_name
    
    data_list %>% as.data.frame
}
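
The `c(T, F)` and `c(F, T)` indices work because R recycles a logical index, so they pick the odd and even positions respectively. Here is a sketch of the same label/value pairing on a plain character vector. The `cells` content is made up for illustration and is not scraped data.

```r
# Made-up stand-in for the text of alternating label/value nodes.
cells = c("Property Name:", "Queens Peak", "Asking:", "$1,324,000")

labels = cells[c(TRUE, FALSE)]   # positions 1, 3, ... (recycled logical index)
values = cells[c(FALSE, TRUE)]   # positions 2, 4, ...

row = list()
row[sub(":$", "", labels)] = values   # named list: one entry per label

# as.data.frame makes the names syntactic, so "Property Name" becomes "Property.Name"
row_df = as.data.frame(row, stringsAsFactors = FALSE)
names(row_df)  # "Property.Name" "Asking"
```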

In [9]:
# simple function to remove tabs and line breaks
remove_tabs = function(x) {
    gsub("[\t\r\n]", "", x)
}

In [10]:
# simple function to drop the last character (the trailing colon) from each label
takeout_last_char = function(x) {
    x %>% substr(1, nchar(x) - 1)
}
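
Both helpers are deliberately blunt: one deletes tabs and line breaks outright, the other drops the last character whatever it is. A slightly more defensive variant could look like the sketch below. `strip_trailing_colon` and `clean_whitespace` are hypothetical names, and `clean_whitespace` replaces tabs and line breaks with a space instead of deleting them.

```r
# Hypothetical alternatives to the two helpers above.

# Remove a colon only if it is actually at the end of the label.
strip_trailing_colon = function(x) sub(":\\s*$", "", x)

# Turn each run of tabs or line breaks into one space, then trim both ends.
clean_whitespace = function(x) trimws(gsub("[\t\r\n]+", " ", x))

strip_trailing_colon(c("Property Name:", "Asking"))
# "Property Name" "Asking"

clean_whitespace("\n\t1,249 sqft\t(Built-up)\r\n")
# "1,249 sqft (Built-up)"
```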

In [11]:
# the real script:
# 1. requests each results page
# 2. for every listing on the page, requests the listing's own page
# 3. feeds the listing data into get_df_from_node_list to get a single-row df
# 4. appends that row to the main df
# 5. stops after two pages, or earlier if a page has no listings
#
# this could be refactored further for SLAP (single level of abstraction principle; not important for this mod)

n = 1L
prop_df = data.frame()

while(n<3){
    property = trycatch_fetch_property_notes(n)
    
    # the 'Done' sentinel is a character vector; a successful fetch returns nodes
    if (is.character(property)) {
        break
    }
    
    for (i in seq_along(property)) {
        listing_url = paste(host_url, html_attr(property[i], 'href'), sep="")
        listing_html = read_html(listing_url)
        listing_data = html_nodes(listing_html, '.listing-about-main')
        listing_info = html_nodes(listing_data[1], 'p')
        facilities_info = listing_data[2]
        agent_name = listing_html %>% html_node('.featuredAgentName') %>% html_text
        row_df = get_df_from_node_list(listing_info, facilities_info, agent_name)
        prop_df = suppressWarnings(bind_rows(prop_df, row_df))
    }
    
    n = n + 1
}
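
The loop above is capped at two pages by `while(n<3)`. The sketch below shows how the same loop could run until the sentinel appears, with the cap as an optional argument. `fake_fetch` and `crawl_pages` are hypothetical names, and `fake_fetch` returns made-up single-row data frames instead of querying SRX. Rows are collected in a list and bound once at the end; `bind_rows` fills columns missing from a row with `NA`, which is why `Furnish` is empty for some listings.

```r
library(dplyr)

# Hypothetical stand-in for the real fetcher: pages 1-3 exist, later ones do not.
fake_fetch = function(page) {
    if (page > 3) return("Done")
    if (page == 2) {
        data.frame(Property.Name = "Listing 2", Furnish = "Partially Furnished",
                   stringsAsFactors = FALSE)
    } else {
        data.frame(Property.Name = paste("Listing", page), stringsAsFactors = FALSE)
    }
}

# Crawl until the sentinel is returned or max_pages is reached.
crawl_pages = function(fetch_fn, max_pages = Inf) {
    rows = list()
    page = 1L
    while (page <= max_pages) {
        result = fetch_fn(page)
        if (is.character(result)) break    # sentinel: no more pages
        rows[[length(rows) + 1L]] = result
        page = page + 1L
    }
    bind_rows(rows)                        # missing columns become NA
}

all_pages = crawl_pages(fake_fetch)        # 3 rows
two_pages = crawl_pages(fake_fetch, 2)     # 2 rows
```

Binding once at the end avoids copying the growing data frame on every iteration.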

In [12]:
prop_df %>% head

Property.Name,Property.Type,Asking,PSF,Built.Year,Model,Developer,Address,District,Bedrooms,Bathrooms,Floor,Area,Tenure,No..of.Units,Facilities,Agent,Furnish
Laverne's Loft,Apartment,"$1,300,000","$1,041 psf (Built-up)",2013.0,Penthouse,Asimont Holdings Pte Ltd,66 Lorong L Telok Kurau (425509),D15 - East Coast / Marine Parade,2,2.0,5,"1,249 sqft (Built-up)",FREEHOLD,44,Renovated | City View | Jacuzzi | Air Conditioning | Cooker Hob/hood | Water Heater | Balcony,Steven Lau,
Queens Peak,Condominium,"$1,324,000","$1,708 psf (Built-up)",2020.0,Condominium,HY Realty (Dundee) Pte Ltd,1 Dundee Road (149456),D3 - Alexandra / Commonwealth,2,2.0,22,775 sqft (Built-up),LEASEHOLD/99 years,736,Original Condition | Park/greenery View | Air Conditioning | Intercom | Water Heater | Cooker Hob/hood | Balcony | Walk-in-wardrobe,"Henry Lim (MBA, EXPERT-SRX)",Partially Furnished
Marina One Residences,Apartment,"$2,793,950","$2,497 psf (Built-up)",,Apartment,MS Residential1 Pte Ltd/MS Residential 2 Pte Ltd/Ms Commercial Pte Ltd,21 Marina Way (018978),D1 - Boat Quay / Raffles Place / Marina,2,,28,"1,119 sqft (Built-up)",LEASEHOLD/99 years,1042,Renovated | Park/greenery View | Original Condition | Intercom | Cooker Hob/hood | Water Heater | Air Conditioning | Balcony | Walk-in-wardrobe,"Henry Lim (MBA, EXPERT-SRX)",
Gem Residences,Condominium,"$1,642,000","$1,734 psf (Built-up)",2020.0,Condominium,GEM Homes Pte Ltd,1 Lorong 5 Toa Payoh (319458),D12 - Balestier / Toa Payoh,3,1.0,9,947 sqft (Built-up),LEASEHOLD/99 years,578,City View | Original Condition | Air Conditioning | Intercom | Cooker Hob/hood | Water Heater | Balcony | Walk-in-wardrobe,"Henry Lim (MBA, EXPERT-SRX)",Partially Furnished
Duo Residences,Apartment,"$4,215,100","$2,759 psf (Built-up)",2017.0,Apartment,Ophir-Rochor Residential Pte. Ltd.,1 Fraser Street (189350),D7 - Beach Road / Bugis / Rochor,3,3.0,40,"1,528 sqft (Built-up)",LEASEHOLD/99 years,660,Patio Gym | Gymnasium | 50M Lap Pool With Spa Seats | Reflective Pool | Teppanyaki Terraces | Jacuzzi | Sun Deck | Children's Pool | Kid's Play Area | Sky Pool | Aqua Gym Equipment | Outdoor Living Room | Function Room | Original Condition | Park/greenery View | Air Conditioning | Water Heater | Intercom | Cooker Hob/hood | Balcony | Walk-in-wardrobe,Tina Lee Kim Lian,Partially Furnished
Gem Residences,Condominium,"$1,640,000","$1,732 psf (Built-up)",,Condominium,GEM Homes Pte Ltd,3 Lorong 5 Toa Payoh (319459),D12 - Balestier / Toa Payoh,3,2.0,26,947 sqft (Built-up),LEASEHOLD/99 years,578,Park/greenery View | Renovated | City View | Air Conditioning | Water Heater | Intercom | Cooker Hob/hood | Balcony,Kane Seow K H,


In [13]:
prop_df %>% write.csv(file='srx.csv')
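
By default `write.csv` also writes the row names as an unnamed first column. If that column is not wanted, `row.names = FALSE` drops it. A small round-trip sketch, using a made-up `listings` data frame and a temporary file in place of `srx.csv`:

```r
# Made-up data standing in for prop_df.
listings = data.frame(Property.Name = c("A", "B"), Bedrooms = c(2, 3),
                      stringsAsFactors = FALSE)

out_file = tempfile(fileext = ".csv")
write.csv(listings, file = out_file, row.names = FALSE)

# Reading the file back gives the same columns, with no extra index column.
round_trip = read.csv(out_file, stringsAsFactors = FALSE)
names(round_trip)  # "Property.Name" "Bedrooms"
```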