# Web Scraping

## 1. Introduction

## 2. Requirements

To access a website with `Chrome Webdriver`, verify the following:

1. `Chrome` (the *fully installed version of the Chrome browser*) is installed on the system. Identify the **installed version** of *Chrome* first.
2. `Chrome Webdriver` is available, and its **version** matches the **installed version** of `Chrome`. Further explanation can be found at https://chromedriver.chromium.org/downloads/version-selection. Download the matching version of `Chrome Webdriver`; the file name will be `chromedriver.exe`.
3. `Chrome Webdriver` can be found when `Selenium` is used from Python. Here, the path to the `chromedriver.exe` executable is passed directly when instantiating `webdriver.Chrome`.

 <center><img src="Scraped_Data\pic\1_chromedriver_version.png"/></center>

 *<center> Matching `Chrome` version with `Chrome Webdriver` version</center>*
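
The version check can also be done in code. The sketch below assumes the rule described on the version-selection page linked above, namely that the major, minor, and build numbers of the two versions should agree. The helper names `build_prefix` and `versions_match` and the version strings are hypothetical and used only for illustration.

```python
def build_prefix(version: str) -> tuple:
    """Return the (major, minor, build) components of a dotted version string."""
    parts = version.strip().split(".")
    return tuple(int(part) for part in parts[:3])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when browser and driver agree on major, minor, and build numbers."""
    return build_prefix(chrome_version) == build_prefix(driver_version)


# Hypothetical version strings, for illustration only.
print(versions_match("73.0.3683.86", "73.0.3683.20"))  # same build: compatible
print(versions_match("73.0.3683.86", "74.0.3729.6"))   # different major: not compatible
```

Only the first three components are compared, so a difference in the final patch number does not count as a mismatch.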

## 2. Web Scraping Flow

In [7]:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import numpy as np

In [8]:
URL = 'https://www.rumah123.com/jual/residensial/?place[]=dki-jakarta&place[]=bogor&place[]=depok&place[]=bekasi,bekasi&placeId[]=8de06376-49a3-4369-a01b-00085aefe766&placeId[]=dfd71096-eda2-4776-b3ca-542d8c5fb12b&placeId[]=a4a34395-ebe5-4930-9456-df327a9f484a&placeId[]=66899e8e-4896-467b-8e54-ab7c533bd616#qid~a0314d88-70ff-4b6c-b4bd-45f4c9f41d04'
browser = webdriver.Chrome('D:\\chromedriver.exe')
browser.implicitly_wait(10)
browser.get(URL)
time.sleep(np.random.uniform(5, 280))  # random pause of 5 to 280 seconds after loading the page


### 2.1. Accessing `main entry list`

All house listings on the main page are stored in `listing-container`, as shown in the snapshot below. We retrieve every listing on a single page by locating the listing elements by their class name.


 <center><img src="Scraped_Data\pic\2_Main_content.png"/></center>

 <center><img src="Scraped_Data\pic\3_Listing_card.png"/></center>

 *<center> Webpage structure of house listing </center>*

In [20]:
main_listing_class = 'ui-organism-intersection__element.intersection-card-container'
main_listings = browser.find_elements_by_class_name(main_listing_class)
print(len(main_listings))
main_listings[:5]

20


[<selenium.webdriver.remote.webelement.WebElement (session="9f61ba79abf6b5b6ac0cdf060bfac380", element="f71f77c3-6ec3-4ee3-b422-31a9de5d6a1f")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9f61ba79abf6b5b6ac0cdf060bfac380", element="3756d57c-56d7-4100-b13b-a08c9cb0de6e")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9f61ba79abf6b5b6ac0cdf060bfac380", element="a928baa0-51f3-4bc4-969b-2f5059b0d8e3")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9f61ba79abf6b5b6ac0cdf060bfac380", element="dd258a7a-c5a2-49ef-b883-f4875deb86f5")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9f61ba79abf6b5b6ac0cdf060bfac380", element="4c51c7b0-bff4-43ef-bb7a-b1009205cb94")>]

Each house listing sits in a `Listing Card`, a child of a `main_listing` element above. We access it to retrieve the navigation link to the detailed listing information.
Let us try this on a single listing with the code below:

In [23]:
# example of using 1 listing
listing_card_html = main_listings[2].get_attribute('outerHTML')
soup_listing = BeautifulSoup(listing_card_html, "html.parser")

# listing card properties
nav_link = 'https://www.rumah123.com' + soup_listing.select('[title]')[1]['href']
nav_link

'https://www.rumah123.com/properti/jakarta-timur/hos11319648/'

Using the lines above, we can collect the `listing_id` and `nav_link` of every listing in the main entry list, which contains more than one listing. Note that the listings have already been isolated as a list of `WebElement` objects, so we will not retrieve overlapping records.

In [24]:
nav_links = []

for listing in main_listings:
    soup = BeautifulSoup(listing.get_attribute('innerHTML'), 'html.parser')
    nav_links.append(
        'https://www.rumah123.com' + soup.select('[title]')[1]['href']
    )

nav_links

['https://www.rumah123.com/properti/jakarta-barat/hos11320425/',
 'https://www.rumah123.com/properti/jakarta-utara/hos11317048/',
 'https://www.rumah123.com/properti/jakarta-timur/hos11319648/',
 'https://www.rumah123.com/properti/jakarta-timur/hos11319428/',
 'https://www.rumah123.com/properti/jakarta-barat/hos11319359/',
 'https://www.rumah123.com/properti/jakarta-barat/hos10869556/',
 'https://www.rumah123.com/properti/jakarta-barat/hos8145565/',
 'https://www.rumah123.com/properti/jakarta-timur/hos11318124/',
 'https://www.rumah123.com/properti/jakarta-selatan/hos11318784/',
 'https://www.rumah123.com/properti/jakarta-selatan/hos11317620/',
 'https://www.rumah123.com/properti/jakarta-pusat/hos11068296/',
 'https://www.rumah123.com/properti/jakarta-barat/hos11317251/',
 'https://www.rumah123.com/properti/jakarta-barat/hos11317213/',
 'https://www.rumah123.com/properti/jakarta-barat/hos11318051/',
 'https://www.rumah123.com/properti/jakarta-timur/hos10843509/',
 'https://www.rumah123

Our scraping framework will loop over all of the stored navigation links and access each house listing in turn.
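
A minimal sketch of such a loop is shown here. It assumes that the last path segment of a navigation link (for example `hos11319648`) can serve as the listing ID, which is an inference from the URLs above rather than something the site documents. The names `listing_id_from_link`, `scrape_all`, and `scrape_one` are hypothetical; `scrape_one` stands in for whatever function opens a link in the browser and extracts its fields.

```python
import random
import time
from urllib.parse import urlparse


def listing_id_from_link(nav_link: str) -> str:
    """Return the last non-empty path segment of a listing URL (assumed to be its ID)."""
    path = urlparse(nav_link).path
    return path.rstrip("/").split("/")[-1]


def scrape_all(nav_links, scrape_one, min_delay=5.0, max_delay=280.0, sleep=time.sleep):
    """Call scrape_one(link) for every link, pausing a random time after each one."""
    records = {}
    for link in nav_links:
        records[listing_id_from_link(link)] = scrape_one(link)
        # Random pause between requests, mirroring the delay used earlier.
        sleep(random.uniform(min_delay, max_delay))
    return records


# Stand-in scraper and zero delay so the example runs without a browser.
links = ["https://www.rumah123.com/properti/jakarta-timur/hos11319648/"]
result = scrape_all(links, scrape_one=lambda link: {"nav_link": link},
                    min_delay=0, max_delay=0)
print(result)
```

Passing the scraping function and the sleep function as arguments keeps the loop independent of the browser, so it can be tried out before any page is actually requested.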

### 2.2. Accessing specific `House Listing`

With the navigation link of each listing in hand, we now begin to scrape the information of each listing by repeatedly accessing each link with our `Chrome Webdriver`.

We have to identify each particular object that we want to scrape.

#### **Listing Header**

 <center><img src="Scraped_Data\pic\4_Header.png"/></center>

 *<center> Listing Header </center>*

Information:

- Title
- Subtitle
- Property Type

In [13]:
URL = 'https://www.rumah.com/listing-properti/dijual-griya-seroja-pesanggrahan-oleh-pt-hawra-karya-20571009'
browser = webdriver.Chrome('D:\\chromedriver.exe')
browser.implicitly_wait(180)
time.sleep(120)
browser.get(URL)
time.sleep(120)


In [106]:
header_element = browser.find_element_by_class_name('listing-detail-header-bar.container.clearfix')
soup_header = BeautifulSoup(header_element.get_attribute('innerHTML'), 'html.parser')
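# default to None so the dictionary below can always be built
title = subtitle = property_type = None
# scraping
try:
    title = soup_header.find('h1').text.strip()
    subtitle = soup_header.find('h2').text.strip()
    property_type = ', '.join(
        prop.text.strip()
        for prop in soup_header.select('.listing-property-type')[0].find_all('span')
    )
except (AttributeError, IndexError):
    # a missing tag leaves the remaining fields as None
    pass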
listing_header = {
    'title': title,
    'subtitle': subtitle,
    'property_type': property_type
}
listing_header

{'title': 'Jelambar, DKI Jakarta, Jakarta Barat',
 'subtitle': 'Rumah Jakarta Selatan, Nempel MRT Lebak Bulus dan Pondok Indah',
 'property_type': 'Properti Baru: 2022, Rumah'}

#### **Listing Overview**

 <center><img src="Scraped_Data\pic\6_Listing_Overview.png"/></center>

 *<center> Listing overview </center>*

In [104]:
URL = 'https://www.rumah.com/listing-properti/dijual-jelambar-dki-jakarta-jakarta-barat-oleh-rudy-yang-20177418'
browser.implicitly_wait(180)
time.sleep(120)
browser.get(URL)
time.sleep(120)

In [155]:
# raises NoSuchElementException if the element cannot be found
overview_element = browser.find_element_by_class_name('price-overview-widget.clearfix')
overview_soup = BeautifulSoup(overview_element.get_attribute('outerHTML'), 'html.parser')
# Scraping
currency = overview_soup.select('[itemprop="priceCurrency"]')[0]['content']
price = overview_soup.select('[itemprop="price"]')[0]['content']
numberofrooms = overview_soup.select('[itemprop="numberOfRooms"]')[0].text
baths = overview_soup.select('.property-info-element.baths')[0].text.strip()
floorsize_value = overview_soup.select('[itemprop="floorSize"]')[0].select('[itemprop="value"]')[0]['content']
floorsize_unit = 'm2'
price_per_area_value = overview_soup.select('.property-info-element.psf')[0].select('.price-value')[0].text
price_per_area_unit = 'jt/m2'
streetaddress = overview_soup.select('[itemprop="streetAddress"]')[0].text

dict(
    currency = currency,
    price = price,
    numberofrooms = numberofrooms,
    baths = baths,
    floorsize_value = floorsize_value,
    floorsize_unit = floorsize_unit,
    price_per_area_value = price_per_area_value,
    price_per_area_unit = price_per_area_unit,
    streetaddress = streetaddress
)


{'currency': 'IDR',
 'price': '2850000000',
 'numberofrooms': '5',
 'baths': '4',
 'floorsize_value': '138',
 'floorsize_unit': 'm2',
 'price_per_area_value': '20,652174',
 'price_per_area_unit': 'jt/m2',
 'streetaddress': 'Jelambar, Jelambar, Jakarta Barat, DKI Jakarta'}
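
All values in the overview are scraped as strings. The sketch below shows one way to convert them to numbers. It assumes that `price_per_area_value` uses a comma as the decimal separator and that any dots are thousands separators; the helper names `parse_decimal_comma` and `to_numeric_overview` are hypothetical.

```python
def parse_decimal_comma(text: str) -> float:
    """Convert a number written with a decimal comma, e.g. '20,652174', to float.

    Dots are treated as thousands separators and removed (an assumption).
    """
    return float(text.strip().replace(".", "").replace(",", "."))


def to_numeric_overview(overview: dict) -> dict:
    """Return a copy of the overview with its numeric fields converted."""
    converted = dict(overview)
    converted["price"] = int(overview["price"])
    converted["numberofrooms"] = int(overview["numberofrooms"])
    converted["baths"] = int(overview["baths"])
    converted["floorsize_value"] = float(overview["floorsize_value"])
    converted["price_per_area_value"] = parse_decimal_comma(
        overview["price_per_area_value"]
    )
    return converted


# Values copied from the scraped output above.
overview = {
    "currency": "IDR",
    "price": "2850000000",
    "numberofrooms": "5",
    "baths": "4",
    "floorsize_value": "138",
    "price_per_area_value": "20,652174",
}
numeric = to_numeric_overview(overview)

# Sanity check: price / floor size, in millions, should be close to the scraped value.
print(numeric["price"] / numeric["floorsize_value"] / 1e6)
print(numeric["price_per_area_value"])
```

The two printed numbers should agree up to rounding, which suggests that `jt/m2` is read correctly as millions of rupiah per square metre.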

#### **Listing Detail**

`Listing Detail` is stored in a table format in class `row.table-row`. In the page inspector this element is marked with `flex`, and the set of details varies from listing to listing (i.e. some listings report every feature the website supports, while others report only some of them).

 <center><img src="Scraped_Data\pic\5_Listing Details.png"/></center>

 *<center> Listing details </center>*

In [105]:
detail_element = browser.find_element_by_class_name('row.table-row')
detail_soup = BeautifulSoup(detail_element.get_attribute('outerHTML'), 'html.parser')
listing_details = {}
for detail in detail_soup.select('.property-attr'):
    listing_details[detail.find('h4').text] = detail.find('td', attrs={'itemprop': 'value'}).text
listing_details

{'Tipe Properti': 'Rumah Dijual',
 'Luas bangunan': '138 m²',
 'Pengembang': 'N/A',
 'Luas tanah': '138 m²',
 'per m²': 'Rp 20.652.174 per m²',
 'Interior': 'N/A',
 'Lantai': 'N/A',
 'Sertifikat': 'SHM - Sertifikat Hak Milik',
 'Tahun Dibuat': 'N/A',
 'ID Listing': '20177418',
 'Listrik': '3500 watt',
 'Terdaftar pada': '3 bulan yang lalu'}
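
Because each listing may report a different set of details, the dictionaries collected from several listings will not always share the same keys. The sketch below shows one possible way to align them before storing the data. The function name `normalize_details` is hypothetical, the second record is invented for illustration, and treating `'N/A'` as a missing value is an assumption.

```python
def normalize_details(details_list, missing_marker="N/A"):
    """Give every record the same keys; absent or 'N/A' values become None."""
    # Collect every key seen in any record, keeping first-seen order.
    all_keys = []
    for details in details_list:
        for key in details:
            if key not in all_keys:
                all_keys.append(key)

    normalized = []
    for details in details_list:
        row = {}
        for key in all_keys:
            value = details.get(key)
            row[key] = None if value == missing_marker else value
        normalized.append(row)
    return normalized


records = [
    # Subset of the scraped output above.
    {"Tipe Properti": "Rumah Dijual", "Pengembang": "N/A", "Listrik": "3500 watt"},
    # Hypothetical listing that reports fewer details.
    {"Tipe Properti": "Rumah Dijual", "Lantai": "2"},
]
for row in normalize_details(records):
    print(row)
```

Every normalized record then has the same columns, which makes it straightforward to write the results to a table or CSV file later.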