# Web Scraping

## 1. Introduction

## 2. Requirements

In order to access website using `Chrome Webdriver`, one must verify that:

1. `Chrome` (denotes the *fully installed version of Chrome Browser*) must be installed in the system. Identify the **installed version** of *Chrome* first.
2. `Chrome Webdriver` must be provided, which its **version** shall match the **installed version** of `Chrome`. Further explanation on this can be found in https://chromedriver.chromium.org/downloads/version-selection. Download the matched version of `Chrome Webdriver`, the file name will be `chromedriver.exe`.
3. Make sure that `Chrome Webdriver` can be detected while utilizing `Selenium` from python. I choose to add the `chromedriver.exe` executable file path to ChromeDriver when instantiating `webdriver.Chrome`.

 <center><img src="Scraped_Data\pic\1_chromedriver_version.png"/></center>

 *<center> Matching `Chrome` version with `Chrome Webdriver` version</center>*

## 3. Web Scraping Flow

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from numpy import random
import time

In [2]:
URL = 'https://www.rumah123.com/jual/residensial/?place[]=dki-jakarta&place[]=bogor&place[]=depok&place[]=bekasi,bekasi&placeId[]=8de06376-49a3-4369-a01b-00085aefe766&placeId[]=dfd71096-eda2-4776-b3ca-542d8c5fb12b&placeId[]=a4a34395-ebe5-4930-9456-df327a9f484a&placeId[]=66899e8e-4896-467b-8e54-ab7c533bd616#qid~a0314d88-70ff-4b6c-b4bd-45f4c9f41d04'
browser = webdriver.Chrome('D:\\chromedriver.exe')
browser.implicitly_wait(10)
browser.get(URL)
time.sleep(float(random.uniform(5, 180, 1)))


### 2.1. Accessing `main entry list`

All of the house listings in main page is stored in `listing-container` as below snapshot. We are going to retrieve all of the listings in a single page by accessing this `XPath`.


 <center><img src="Scraped_Data\pic\2_Main_content.png"/></center>

 <center><img src="Scraped_Data\pic\3_Listing_card.png"/></center>

 *<center> Webpage structure of house listing </center>*

In [3]:
main_listing_class = 'ui-organism-intersection__element.intersection-card-container'
main_listings = browser.find_elements_by_class_name(main_listing_class)
print(len(main_listings))
main_listings[:5]

20


[<selenium.webdriver.remote.webelement.WebElement (session="5a67e7b79fb99de414f0dbe9ce05b762", element="7e5a1895-163d-4036-a40e-a0127ff97dee")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5a67e7b79fb99de414f0dbe9ce05b762", element="3e9e4682-7c93-4ded-8af0-e261feb77134")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5a67e7b79fb99de414f0dbe9ce05b762", element="757df6dd-c6c9-462a-a241-3d4ffbae3ac3")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5a67e7b79fb99de414f0dbe9ce05b762", element="44630d51-b024-4695-96a1-70d6206fd0cf")>,
 <selenium.webdriver.remote.webelement.WebElement (session="5a67e7b79fb99de414f0dbe9ce05b762", element="1b52b4bd-9ba6-46ed-84c9-0d848941c41c")>]

House listing is located in `Listing Card`, the children of `main_listing` element above. We will access this to retrieve Navigation link to its detailed listing information.
We are going to try to a house list as below codes:

In [4]:
# example of using 1 listing
listing_card_html = main_listings[2].get_attribute('outerHTML')
soup_listing = BeautifulSoup(listing_card_html, "html.parser")

# listing card properties
nav_link = 'https://www.rumah123.com' + soup_listing.select('[title]')[1]['href']
nav_link

'https://www.rumah123.com/properti/jakarta-timur/hos10008254/'

Using above lines, we can collect all of the `listing_id` and `nav_link` from the main entry list, which contain more than one listing. Note that each listing has been isolated as a list of `WebElement` object so that we won't retrieve overlapped records.

In [5]:
nav_links = []

for listing in main_listings:
    soup = BeautifulSoup(listing.get_attribute('innerHTML'))
    listing_class = 'ui-organisms-card-r123-featured__middle-section__title'
    nav_links.append(
        'https://www.rumah123.com' + soup.select(('.' + listing_class))[0]['href']
    )

nav_links

['https://www.rumah123.com/properti/jakarta-selatan/aps2723013/',
 'https://www.rumah123.com/properti/jakarta-timur/hos10116318/',
 'https://www.rumah123.com/properti/jakarta-timur/hos10008254/',
 'https://www.rumah123.com/properti/jakarta-selatan/hos10914618/',
 'https://www.rumah123.com/properti/jakarta-pusat/hos11325069/',
 'https://www.rumah123.com/properti/jakarta-utara/hos9727104/',
 'https://www.rumah123.com/properti/depok/hos10199621/',
 'https://www.rumah123.com/properti/jakarta-timur/hos11324434/',
 'https://www.rumah123.com/properti/jakarta-timur/hos9945268/',
 'https://www.rumah123.com/properti/jakarta-pusat/hos8317620/',
 'https://www.rumah123.com/properti/jakarta-selatan/hos11310013/',
 'https://www.rumah123.com/properti/jakarta-barat/hos10771399/',
 'https://www.rumah123.com/properti/jakarta-utara/hos11286910/',
 'https://www.rumah123.com/properti/jakarta-barat/hos11295397/',
 'https://www.rumah123.com/properti/jakarta-selatan/hos10934079/',
 'https://www.rumah123.com/pr

Our scraping framework will loop-accessing all of the stored navigation link of house listing.

### 2.2. Accessing specific `House Listing`

Having information of each listing navigation link, we now begin to scrap information of each listing by repetitively accessing each of the link using our `Chrome Webdriver`.

We have to identify each particular object that we want to scrap.

#### **Listing Header**

 <center><img src="Scraped_Data\pic\4_Header.png"/></center>

 *<center> Listing Header </center>*

Information:

- Price
- Title
- Address

In [10]:
URL = 'https://www.rumah123.com/properti/jakarta-timur/hos10008254/'
browser.get(URL)
time.sleep(float(random.uniform(5, 180, 1)))


In [80]:
header_class_name = 'ui-container.ui-property-page__main-container'
header_element = browser.find_element_by_class_name(header_class_name)
soup_header = BeautifulSoup(header_element.get_attribute('innerHTML'), 'html.parser')
# scraping
try: 
    currency, price, price_unit = \
        soup_header.select('.r123-listing-summary__price')[0].text.split()
    title = soup_header.select('.r123-listing-summary__header-container-title')[0].text.strip()
    address = soup_header.select('.r123-listing-summary__header-container-address')[0].text.strip()
except AttributeError:
    pass
# compile header
site_url = {'url': URL}
header = dict(
    title = title,
    price_currency = currency,
    price_value = price,
    price_unit = price_unit,
    address = address
)

header

{'title': 'Rumah Murah di Jalan Cipinang Baru Raya Jakarta Timur',
 'price_currency': 'Rp',
 'price_value': '2',
 'price_unit': 'Miliar',
 'address': 'Rawamangun, Jakarta Timur'}

#### **Property Specifications**

Detailed list of property specification will popped-up after we click *`menampilkan lebih banyak`*, as shown in below snapshot. We will retrieve the element after we *clicking* using the `WebDriver`.

 <center><img src="Scraped_Data\pic\5_Listing_Popup_Button.png"/></center>

 *<center> Listing specification </center>*

In [63]:
# commanding clicking
details_popup_class_name = 'relative.ui-content-half__selector'
browser.find_element_by_class_name(details_popup_class_name).click()

ElementClickInterceptedException: Message: element click intercepted: Element <div role="button" tabindex="0" class="relative ui-content-half__selector">...</div> is not clickable at point (403, 449). Other element would receive the click: <div class="ui-listing-specification__table--row">...</div>
  (Session info: chrome=106.0.5249.91)


In [60]:
details_element = browser.find_element_by_class_name('ui-listing-specification__table')
details_soup = BeautifulSoup(details_element.get_attribute('innerHTML'), 'html.parser')
# compile available details
details = {}
for spec in details_soup.select('.ui-listing-specification__table--row'):
    label, value = [_.text for _ in spec.find_all('p')]
    details.update({label: value})

details

{'K. Tidur': '3',
 'K. Mandi': '3',
 'L. Tanah': '56 m²',
 'L. Bangunan': '87 m²',
 'Carport': '1',
 'Tipe Properti': 'Rumah',
 'Sertifikat': 'SHM - Sertifikat Hak Milik',
 'Daya Listrik': '2200 mAh',
 'Jumlah Lantai': '2',
 'Tahun dibangun': '2022',
 'Kondisi Properti': 'Baru',
 'Kondisi Perabotan': 'Unfurnished',
 'ID Iklan': 'hos10008254'}

#### **Provided Facilities**

 <center><img src="Scraped_Data\pic\6_Facilities.png"/></center>

 *<center> Provided facilities </center>*

In [71]:
facilities_element = browser.find_element_by_class_name('ui-facilities-portal__item-wrapper')
facilities_soup = BeautifulSoup(facilities_element.get_attribute('innerHTML'), 'html.parser')

facilities = {
    'facilities': ', '.join({_.text for _ in facilities_soup.select('.ui-facilities-portal__item')})
}

facilities

{'facilities': 'Keamanan, CCTV'}

### 2.3. Combining Scraped Data from a Single Listing Page

From section `2.2A ccessing specific House Listing`, we have compiled several collections of scraped data as below:

1. Records of **Listing URL**, stored in `url` variable 
2. Records of **Listing Header**, stored in `header` variable
3. Records of **Property Details**, stored in `details` variable
4. Records of **Provided Facilities**, stored in `facilities` variable

Thus, a single observations of house listing will be presented as a `dictionary` of records, as explained in code below. We specifically use the dict `update` method to merge the records and ignoring the duplicated records (if any).

Collection of observations will be stored as a **list of dictionaries**, which then can be processed using `pandas.DataFrame` constructor. More of this will be explained in the next chapter.

## 3. Completing Web Scraping Framework

We now begin to complete our framework by creating a program to iterate over listing pages.

In [113]:
class WebScraper():
    """ """
    def __init__(self, start_url):
        self._URL = start_url
        self._nav_links = []
        self._observations = []
        self._chrome_options = webdriver.ChromeOptions()
        self._prefs = {"profile.default_content_setting_values.notifications": 2}
        self._chrome_options.add_experimental_option('prefs', self._prefs)
        self._browser = webdriver.Chrome('D:\\chromedriver.exe')

        
    def _main_page_accessor(self, url):
        """ Collects navigation links from a single entry page"""
        # Starts page access
        self._browser.get(url)
        # time.sleep(float(random.uniform(5, 120, 1)))
        # Post access
        main_listing_class_ = 'ui-organism-intersection__element.intersection-card-container'
        main_listings_ = self._browser.find_elements_by_class_name(main_listing_class_)
        links_ = []
        for listing_ in main_listings_:
            soup_ = BeautifulSoup(listing_.get_attribute('innerHTML'), 'html.parser')
            listing_class_ = 'ui-organisms-card-r123-featured__middle-section__title'
            links_.append(
                'https://www.rumah123.com' + soup_.select(('.' + listing_class_))[0]['href']
            )
        self._nav_links = links_.copy()

    def _listing_scraper(self, url):
        scrap_data_ = {}
        # Starts page access
        self._browser.get(url)
        #time.sleep(float(random.uniform(5, 120, 1)))
        # Post access:
        # 1. Scraping Header
        header_class_name_ = 'ui-container.ui-property-page__main-container'
        header_element_ = self._browser.find_element_by_class_name(header_class_name_)
        soup_header_ = BeautifulSoup(header_element_.get_attribute('innerHTML'), 'html.parser')
        try: 
            scrap_data_['currency'], scrap_data_['price'], scrap_data_['price_unit_scale'] = \
                soup_header_.select('.r123-listing-summary__price')[0].text.split()
            scrap_data_['title'] = soup_header.select('.r123-listing-summary__header-container-title')[0].text.strip()
            scrap_data_['address'] = soup_header.select('.r123-listing-summary__header-container-address')[0].text.strip()
        except AttributeError:
            pass
        # 2. Scraping Property Specification
        details_popup_class_name_ = 'relative.ui-content-half__selector'
        self._browser.find_element_by_class_name(details_popup_class_name_).click()
        details_element_ = self._browser.find_element_by_class_name('ui-listing-specification__table')
        details_soup_ = BeautifulSoup(details_element_.get_attribute('innerHTML'), 'html.parser')
        for spec in details_soup_.select('.ui-listing-specification__table--row'):
            label_, value_ = [_.text.lower() for _ in spec.find_all('p')]
            scrap_data_.update({label_: value_})
        # 3. Scraping Provided Facilities
        facilities_element_ = self._browser.find_element_by_class_name('ui-facilities-portal__item-wrapper')
        facilities_soup_ = BeautifulSoup(facilities_element_.get_attribute('innerHTML'), 'html.parser')        
        scrap_data_['facilities'] = \
            ', '.join({_.text for _ in facilities_soup_.select('.ui-facilities-portal__item')})
        self._observations.append(scrap_data_)
 
    def scrap(self, start_page, end_page):
        start_page_ = start_page
        end_page_ = end_page
        for i in range(start_page_, end_page_ + 1):
            entry_page_url_ = self._URL + f'&page={i}'    
            self._main_page_accessor(entry_page_url_)
            for link_ in self._nav_links:
                self._listing_scraper(link_)
        return self._observations


    # Simple Testing
    # def test_1(self):
    #     self._main_page_accessor(self._URL)
    #     return self._nav_links

    def test_2(self):
        self._listing_scraper(self._URL)
        return self._observations

In [109]:
list_example = []
dict_1 = dict(a = 'test')
dict_2 = dict(b = 'test 2')
list_example.append(dict_1)
list_example.append(dict_2)

list_example


[{'a': 'test'}, {'b': 'test 2'}]

In [114]:
TEST_URL = 'https://www.rumah123.com/properti/jakarta-selatan/hos11028271/'
WebScraper(TEST_URL).test_2()

[{'currency': 'Rp',
  'price': '10,5',
  'price_unit_scale': 'Miliar',
  'title': 'Rumah Murah di Jalan Cipinang Baru Raya Jakarta Timur',
  'address': 'Rawamangun, Jakarta Timur',
  'K. Tidur': '5',
  'K. Mandi': '5',
  'L. Tanah': '230 m²',
  'L. Bangunan': '550 m²',
  'Carport': '2',
  'Tipe Properti': 'Rumah',
  'Sertifikat': 'SHM - Sertifikat Hak Milik',
  'Daya Listrik': '6600 mAh',
  'KT. Pembantu': '2',
  'KM. Pembantu': '1',
  'Garasi': '4',
  'Jumlah Lantai': '3',
  'Tahun dibangun': '2022',
  'Kondisi Properti': 'Baru',
  'Kondisi Perabotan': 'Furnished',
  'Hadap': 'Utara',
  'ID Iklan': 'hos11028271',
  'facilities': 'Kolam Renang, Keamanan,  Jogging Track,  Ac,  Jalur Telepon,  Taman,  CCTV'}]