# Web Scraping

## 1. Introduction

## 2. Requirements

To access a website with `Chrome Webdriver`, verify that:

1. `Chrome` (the *fully installed version of the Chrome browser*) is installed on the system. Identify the **installed version** of `Chrome` first.
2. `Chrome Webdriver` is available, and its **version** matches the **installed version** of `Chrome`. Further explanation can be found at https://chromedriver.chromium.org/downloads/version-selection. Download the matching version of `Chrome Webdriver`; the file name will be `chromedriver.exe`.
3. `Chrome Webdriver` can be found when using `Selenium` from Python. I chose to pass the path of the `chromedriver.exe` executable when instantiating `webdriver.Chrome`.

 <center><img src="Scraped_Data\pic\1_chromedriver_version.png"/></center>

 *<center> Matching `Chrome` version with `Chrome Webdriver` version</center>*
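
The version check can also be done in code. The sketch below is only an illustration: it assumes the rule described on the linked version-selection page (the major, minor, and build numbers should agree, while the final patch number may differ). The helper names `build_prefix` and `versions_match` and the sample version strings are hypothetical, not taken from this project.

```python
import re


def build_prefix(version_output):
    """Return (major, minor, build) from text such as 'ChromeDriver 96.0.4664.45'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    # Keep the first three components; the patch number is ignored.
    return tuple(int(part) for part in match.groups()[:3])


def versions_match(chrome_output, driver_output):
    """True when Chrome and ChromeDriver share major, minor, and build numbers."""
    return build_prefix(chrome_output) == build_prefix(driver_output)


# Hypothetical version strings; the real values come from your own installation.
same = versions_match("Google Chrome 96.0.4664.45", "ChromeDriver 96.0.4664.110 (abc)")
different = versions_match("Google Chrome 96.0.4664.45", "ChromeDriver 95.0.4638.69 (def)")
print(same, different)
```

The version strings themselves can be read from the browser's "About Chrome" page and from running `chromedriver.exe --version` in a terminal.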

## 2. Web Scraping Flow

In [6]:
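from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

URL = 'https://www.rumah.com/properti-dijual?freetext=DKI+Jakarta&listing_type=sale&property_type=B&property_type_code[]=BUNG&market=residential&newProject=all&search=true'
# Selenium 4 takes the driver path through a Service object;
# Selenium 3 accepted it directly: webdriver.Chrome('D:\\chromedriver.exe')
browser = webdriver.Chrome(service=Service('D:\\chromedriver.exe'))
browser.get(URL)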


### 2.1. Accessing `main entry list`

All house listings on the main page are stored in the `listings-container` element, as shown in the snapshots below. We are going to retrieve all listings on a single page by accessing this element through its `XPath`.


 <center><img src="Scraped_Data\pic\2_Main_content.png"/></center>

 <center><img src="Scraped_Data\pic\3_Listing_card.png"/></center>

 *<center> Webpage structure of house listing </center>*
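
To see what such an `XPath` selects without opening a browser, here is a minimal sketch on a made-up page fragment. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so the path is written relative to the root (`.//`) and without the `[*]` predicate. The fragment and its contents are hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page fragment (well-formed so the XML parser accepts it).
SAMPLE_PAGE = """
<html><body>
  <div id="listings-container">
    <div><a data-listing-id="1">first</a></div>
    <div><a data-listing-id="2">second</a></div>
  </div>
  <div id="footer"><div>not a listing</div></div>
</body></html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Find the element whose id is 'listings-container', then take its direct <div> children.
cards = root.findall(".//*[@id='listings-container']/div")

# Each card holds one <a> tag carrying the listing id.
ids = [card.find("a").get("data-listing-id") for card in cards]
print(ids)
```

Only the two `<div>` elements directly under the container are returned; the `<div>` inside the footer is ignored because its parent has a different `id`.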

In [126]:
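from selenium.webdriver.common.by import By

main_listing_xpath = '//*[@id="listings-container"]/div[*]'
# find_elements_by_xpath was removed in Selenium 4.3; use find_elements(By.XPATH, ...)
main_listings = browser.find_elements(By.XPATH, main_listing_xpath)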

main_listings[:5]

[<selenium.webdriver.remote.webelement.WebElement (session="227c197822a755d27c2026e279c1c7e9", element="2edfb874-bb33-41ed-883e-e39caebdba92")>,
 <selenium.webdriver.remote.webelement.WebElement (session="227c197822a755d27c2026e279c1c7e9", element="b9ed1a1e-340b-4be5-8865-d0eeec1aa4fe")>,
 <selenium.webdriver.remote.webelement.WebElement (session="227c197822a755d27c2026e279c1c7e9", element="93423941-71de-4cfa-9718-001db60c1f55")>,
 <selenium.webdriver.remote.webelement.WebElement (session="227c197822a755d27c2026e279c1c7e9", element="7b03caab-f063-403d-a5ff-0f9741b0f92d")>,
 <selenium.webdriver.remote.webelement.WebElement (session="227c197822a755d27c2026e279c1c7e9", element="d559eeca-e305-4b4e-b3c9-afc591803144")>]

Each house listing sits in a `Listing Card`, a child of each `main_listings` element above. We will access it to retrieve two pieces of information about the listing:

1. Listing id
2. Navigation link to its detailed listing information

Let us try this on a single house listing with the code below:

In [127]:
# example of using 1 listing
listing_card_html = main_listings[0].get_attribute('outerHTML')
soup = BeautifulSoup(listing_card_html, "html.parser")

# listing card properties
listing_id = soup.select('a[data-listing-id]')[0]['data-listing-id'] # use 'a[...]' to look for attribute
nav_link = soup.select('.nav-link')[0]['href'] # use '.<class value>' to look for class value
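
To make the two selectors concrete without a browser session, the sketch below reproduces their meaning with the standard library's `html.parser`. This is a swapped-in technique for illustration, not the BeautifulSoup approach used in this notebook. The class `CardParser`, the sample card, and its values are hypothetical.

```python
from html.parser import HTMLParser


class CardParser(HTMLParser):
    """Collects the first listing id and the first nav-link href in a card."""

    def __init__(self):
        super().__init__()
        self.listing_id = None
        self.nav_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Counterpart of 'a[data-listing-id]': an <a> tag carrying that attribute.
        if tag == "a" and "data-listing-id" in attrs and self.listing_id is None:
            self.listing_id = attrs["data-listing-id"]
        # Counterpart of '.nav-link': any tag whose class list contains 'nav-link'.
        classes = (attrs.get("class") or "").split()
        if "nav-link" in classes and self.nav_link is None:
            self.nav_link = attrs.get("href")


# Hypothetical listing card, loosely modelled on the structure described above.
SAMPLE_CARD = """
<div class="listing-card">
  <a data-listing-id="123" href="#"></a>
  <a class="nav-link header" href="https://example.com/listing/123">Title</a>
</div>
"""

parser = CardParser()
parser.feed(SAMPLE_CARD)
print(parser.listing_id, parser.nav_link)
```

The attribute selector matches on the presence of `data-listing-id` regardless of its value, while the class selector matches any element whose `class` attribute includes `nav-link` among its space-separated names.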


Now we can collect every `listing_id` and `nav_link` from the main entry list, which contains more than one listing. Because each listing is already isolated as its own `WebElement` in the list, parsing them one at a time avoids retrieving overlapping records.

In [132]:
listing_ids = []
nav_links = []

for listing in main_listings:
    soup = BeautifulSoup(listing.get_attribute('outerHTML'), 'html.parser')
    listing_ids.append(
        soup.select('a[data-listing-id]')[0]['data-listing-id']
    )
    nav_links.append(
        soup.select('.nav-link')[0]['href']
    )

# check for {ids: nav_links}
dict(zip(listing_ids, nav_links))

{'20571009': 'https://www.rumah.com/listing-properti/dijual-griya-seroja-pesanggrahan-oleh-pt-hawra-karya-20571009#1225',
 '20586913': 'https://www.rumah.com/listing-properti/dijual-asya-oleh-pt-asya-mandira-land-20586913#1232',
 '18911127': 'https://www.rumah.com/listing-properti/dijual-jual-rumah-di-jagakarsa-jakarta-selatan-siap-huni-oleh-abdul-gofar-18911127',
 '20499239': 'https://www.rumah.com/listing-properti/dijual-dijual-rumah-lama-di-mendawai-kebayoran-baru-oleh-indrabrata-20499239',
 '19988140': 'https://www.rumah.com/listing-properti/dijual-cluster-malibu-vilage-oleh-richard-allan-19988140',
 '19678751': 'https://www.rumah.com/listing-properti/dijual-rawamangun-jakarta-timur-oleh-neni-supriati-19678751',
 '20519302': 'https://www.rumah.com/listing-properti/dijual-cluster-2lt-cipinang-kebembem-modern-mewah-oleh-rudi-hartono-20519302',
 '20596316': 'https://www.rumah.com/listing-properti/dijual-cluster-pisangan-baru-oleh-hermansyah-20596316',
 '19780767': 'https://www.rumah.c