# Web scraping basics with Selenium

Unofficial Selenium documentation: https://selenium-python.readthedocs.io/

Versions used in this notebook:
* Python 3.10.2
* Selenium 4.1.3
* ChromeDriver 99.0.4844.51

In [None]:
# !pip install selenium

## Using WebDriver

1. Download the driver.
    * [ChromeDriver](https://sites.google.com/chromium.org/driver/) for Google Chrome
    * [GeckoDriver](https://github.com/mozilla/geckodriver/releases) for Mozilla Firefox
1. Put it somewhere Selenium can find it.
    * Option 1: In the same folder as the script
    * Option 2: In a folder that is already included in the system path (for Windows)
    * Option 3: In any folder, after adding that folder to the system path (for Windows)
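
Whether the driver can be found can be checked before launching a browser. The following is a minimal sketch that uses only the standard library. The helper name `find_driver` is hypothetical, and the sketch assumes the executables are named `chromedriver` and `geckodriver`:

```python
import os
import shutil


def find_driver(name, extra_dirs=()):
    """Return the full path of a driver executable, or None if it is not found.

    The system path is searched first, then any extra directories
    (for example, the folder that contains the script).
    """
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        found = shutil.which(name, path=directory)
        if found:
            return found
    return None


for driver_name in ("chromedriver", "geckodriver"):
    location = find_driver(driver_name, extra_dirs=(os.getcwd(),))
    print(driver_name, "->", location or "not found")
```

If a driver is reported as not found, one of the options above has to be applied before `webdriver.Chrome()` or `webdriver.Firefox()` is called.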

### Chrome

In [34]:
from selenium import webdriver

driver = webdriver.Chrome()

### Firefox

In [None]:
from selenium import webdriver

driver = webdriver.Firefox()

## Some options

It is possible to define the browser position/size, user agent, proxy, etc. Note that the option object is browser-specific.

In [35]:
from selenium import webdriver

# Creates an options object.
options = webdriver.ChromeOptions()

# Changes the user agent.
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
options.add_argument("--user-agent=%s" % user_agent)

# Proxy?
# proxy = "<IP>:<PORT>"
# options.add_argument("--proxy-server=%s" % proxy)

# Removes certain fields that can be used to detect WebDriver.
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

# Sets English as the only accepted language.
options.add_experimental_option("prefs", {"intl.accept_languages": "en;q=0.9"})

# Opens the browser in the incognito mode.
options.add_argument("--incognito")

# Makes the browser headless.
# options.add_argument("--headless")

# Opens the browser window with the provided options.
driver = webdriver.Chrome(options=options)

# Sets window size.
driver.set_window_size(1400, 900)

# Sets window position.
driver.set_window_position(500, 0)

# Sets timeout threshold.
driver.set_page_load_timeout(30)

## Page navigation

### Current tab

In [36]:
driver.get("https://atomurl.net/myip/")

### New tab

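While Firefox allows sending key combinations such as Ctrl + T, Chrome requires executing JavaScript. Selenium 4 also offers `driver.switch_to.new_window("tab")`, which opens a new tab and switches to it in either browser.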

In [37]:
new_tab_url = "https://www.hepsiemlak.com/ankara-satilik"
driver.execute_script(f"window.open('{new_tab_url}','_blank')")
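
Interpolating a URL directly into a JavaScript string works for simple URLs, but a quote character inside the URL would break the script. A possible safeguard, sketched below, is to let `json.dumps` produce the quoted string literal. The helper name `build_open_tab_script` is hypothetical:

```python
import json


def build_open_tab_script(url):
    """Return a JavaScript snippet that opens `url` in a new tab.

    json.dumps produces a quoted, escaped string literal, so quotes or
    backslashes in the URL cannot break out of the script.
    """
    return f"window.open({json.dumps(url)}, '_blank');"


print(build_open_tab_script("https://www.hepsiemlak.com/ankara-satilik"))
# window.open("https://www.hepsiemlak.com/ankara-satilik", '_blank');
```

The returned string would be passed to `driver.execute_script(...)`. An alternative is to pass the URL as a script argument, as in `driver.execute_script("window.open(arguments[0], '_blank')", new_tab_url)`, which avoids building the string at all.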

### Tab switching

We can locate a tab using indices and switch to that tab.

In [38]:
# The first tab has an index of 0.
first_tab = driver.window_handles[0]

driver.switch_to.window(first_tab)

### Closing a specific tab

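When a tab is closed, the browser itself shows another open tab, but WebDriver does not follow it: the driver still points at the closed tab. We need to explicitly switch focus to one of the remaining tabs. Otherwise, later commands fail because their target window no longer exists. To make things simpler, we can write a function.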

In [39]:
# -1 refers to the last tab. If -1 is also used to close a tab, then after the close,
# -1 refers to the tab that came just before the closed one.
def close_tab(index_to_close, index_to_switch_after=-1):
    driver.switch_to.window(driver.window_handles[index_to_close])
    driver.close()
    driver.switch_to.window(driver.window_handles[index_to_switch_after])


# close_tab(-1, -1) # Close the last tab, switch to the last remaining tab
# close_tab(0, 0) # Close the first tab, switch to the first remaining tab
close_tab(0, -1)  # Close the first tab, switch to the last remaining tab
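
The index behavior can be tried out without a browser. The sketch below is a variant that takes the driver as a parameter and returns the handle that ends up focused. `FakeDriver` and `FakeSwitchTo` are hypothetical stand-ins that only model the tab list, and the sketch assumes, as the function above does, that `window_handles` lists tabs in the order they were opened:

```python
def close_tab_by_index(driver, index_to_close, index_to_switch_after=-1):
    """Close one tab and focus another; indices are resolved again after the close."""
    driver.switch_to.window(driver.window_handles[index_to_close])
    driver.close()
    # The closed tab is gone now, so the indices of the remaining tabs have shifted.
    target = driver.window_handles[index_to_switch_after]
    driver.switch_to.window(target)
    return target


class FakeSwitchTo:
    def __init__(self, driver):
        self._driver = driver

    def window(self, handle):
        if handle not in self._driver.window_handles:
            raise ValueError(f"no such window: {handle}")
        self._driver.current = handle


class FakeDriver:
    """Stand-in for a WebDriver; it only models the list of tabs."""

    def __init__(self, handles):
        self.window_handles = list(handles)
        self.current = self.window_handles[0]
        self.switch_to = FakeSwitchTo(self)

    def close(self):
        self.window_handles.remove(self.current)
        self.current = None


fake = FakeDriver(["tab-A", "tab-B", "tab-C"])
print(close_tab_by_index(fake, -1, -1))  # closes tab-C, focuses tab-B
print(close_tab_by_index(fake, 0, -1))   # closes tab-A, focuses tab-B
```

Passing the driver explicitly also makes the function usable with more than one browser session.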

## Locating elements

All possible ways: https://selenium-python.readthedocs.io/locating-elements.html

Some common ones:
* find_element_by_id
* find_element_by_class_name
* find_elements_by_class_name
* find_element_by_tag_name
* find_elements_by_tag_name
* find_element_by_xpath
* find_elements_by_xpath

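The "find_element" versions return only the first match, while the "find_elements" versions return a list of all matches. In Selenium 4, these helpers are deprecated in favor of `find_element(By.ID, ...)`, `find_elements(By.XPATH, ...)`, and so on, with `By` imported from `selenium.webdriver.common.by`. They still work in the version used here (4.1.3) but were removed in Selenium 4.3.0.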

Xpath cheat sheet: https://devhints.io/xpath

### Retrieving the number of results

We first obtain the text inside the element and then process it:

In [41]:
result_text = driver.find_element_by_class_name("applied-filters__count").text

result_int = int(result_text.split("için ")[1].split(" ilan")[0].replace(".", ""))

result_int

24781
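
The chained `split` calls raise an `IndexError` if the text ever changes shape. A regular expression is one possible alternative that fails with a clearer message. This is only a sketch: the helper name `parse_result_count` and the sample text are hypothetical, and it assumes the count still sits between "için" and "ilan" with dots as thousands separators:

```python
import re


def parse_result_count(text):
    """Extract the number between "için" and "ilan", e.g. "24.781" -> 24781."""
    match = re.search(r"için\s+(\d[\d.]*)\s+ilan", text)
    if match is None:
        raise ValueError(f"Could not find a result count in: {text!r}")
    return int(match.group(1).replace(".", ""))


sample_text = "Ankara için 24.781 ilan bulundu"  # hypothetical example text
print(parse_result_count(sample_text))  # 24781
```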

### Retrieving an element using ID and retrieving the entire HTML content in it

This time we get everything including the text and HTML tags inside the element:

In [43]:
results_tab = driver.find_element_by_id("realty-owner-all-results")

# results_tab.text # This would simply retrieve the text without any tags
results_tab.get_attribute("innerHTML")

'<a href="/ankara-satilik" title="Ankara Satılık Ev" data-v-05396ec5="">\n        Tüm Sonuçlar\n      </a>'

### Retrieving all links using their tag ("a") and counting them

In [44]:
links = driver.find_elements_by_tag_name("a")

len(links)

401

### Retrieving all links that come right inside `<li>` and have an actual target instead of "#"

We will simply print the first 10 targets:

In [46]:
meaningful_list_links = driver.find_elements_by_xpath('//li/a[@href!="#"]')

for link in meaningful_list_links[:10]:
    print(link.get_attribute("href"))

https://www.hepsiemlak.com/ankara-satilik
https://www.hepsiemlak.com/ankara-kiralik
https://www.hepsiemlak.com/ankara-gunluk-kiralik
https://www.hepsiemlak.com/proje/ankara-projeleri
https://www.hepsiemlak.com/istanbul-satilik
https://www.hepsiemlak.com/ankara-satilik
https://www.hepsiemlak.com/izmir-satilik
https://www.hepsiemlak.com/adana-satilik
https://www.hepsiemlak.com/adiyaman-satilik
https://www.hepsiemlak.com/afyonkarahisar-satilik


## Interaction

Let us apply a filter to these results.

### Clicking on an element

We will first choose a district. To do so, we will click on the district selector:

In [47]:
district_selector = driver.find_element_by_xpath(
    '//section[@class="filter-item-wrap loc locationCountySec"]'
)

district_selector.click()

### Typing in

We can now start typing in the input field that shows up. To do so, we will locate the input field inside the selector and send our district of choice:

In [48]:
# Notice that we search inside the previously obtained element instead of the whole page
district_search = district_selector.find_element_by_xpath(
    './/input[@class="he-select-base__search"]'
)

district = "Çankaya"

district_search.send_keys(district)

Now, we should have a visible district element that we can click on. This element is inside a div that follows the input field and has a class of "ps":

In [49]:
district_search.find_element_by_xpath(
    './following-sibling::div[@class="ps"]//a[@class="js-county-filter__list-link"]'
).click()

We can now apply this filter by clicking on the search button:

In [50]:
# There are two buttons in this section, but we need only the first one.
driver.find_element_by_xpath('//section[@class="filter-button-wrapper"]/a').click()

## Data collection

Now that we know how to navigate inside the page and between tabs, we can start collecting information. Let us first find how many results we now have:

In [51]:
result_text = driver.find_element_by_class_name("applied-filters__count").text
result_int = int(result_text.split("için ")[1].split(" ilan")[0].replace(".", ""))
result_int

7066

We can retrieve each real estate ad on the page (except the external ads that lead to other domains) and count them. This part can be tricky and messy on some websites, because the items we are trying to retrieve can appear in several different forms. It requires careful HTML inspection and some experimentation, and there may be alternative approaches to the one used here.

In [15]:
ad_cards = driver.find_elements_by_xpath(
    '//div[@class="listView"]/div[contains(@class, "listing-project") or @data-v-7e43f948][@class!="w100p realty-owner-tab"]'
)

len(ad_cards)

24

Let us look at the pagination section. This part is crucial when the content is divided into pages. We see that page links are actually given in an unordered list (`<ul data-v-4ff47588="" class="he-pagination__links">`). While we cannot see the links of all pages, we can see that the last list item always corresponds to the last page. Therefore, we can obtain the last list item's text here and find the total number of pages. Instead of retrieving all elements and obtaining the last one, we can also retrieve the link for only the last list item. Then, we can get the link text and obtain an integer from it:

In [52]:
# We use "starts-with" here because the active page has an extra class name in it.
last_page = int(
    driver.find_element_by_xpath(
        '//li[starts-with(@class, "he-pagination__item")][last()]/a'
    ).text
)

last_page

293

In [53]:
result_int - len(ad_cards) < last_page * len(ad_cards)

False

This is peculiar. With 24 ads per page, 293 pages can hold at most 7,032 ads, yet the site reports 7,066, so the reported total exceeds what the pagination can show by more than a full page. After some manual examination of different pages, each page (except the last page) indeed has 24 relevant ads, so the difference may stem from a caching issue or a similar inconsistency on the website's side.
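
The mismatch can be quantified with a little arithmetic. The numbers below are the ones obtained in the cells above:

```python
import math

result_int = 7066   # reported number of ads
ads_per_page = 24   # ads found on one results page
last_page = 293     # last page number shown in the pagination

# Pages needed to show every reported ad
pages_needed = math.ceil(result_int / ads_per_page)
# Ads that the available pages can hold at most
max_ads_reachable = last_page * ads_per_page
# Reported ads that cannot appear on any page
unreachable_ads = result_int - max_ads_reachable

print(pages_needed)       # 295
print(max_ads_reachable)  # 7032
print(unreachable_ads)    # 34
```

In other words, two more pages would be needed to show everything that the counter reports.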

At this point, we have different options:
* We can first visit all pages to retrieve the listing links and then visit each ad later.
* We can incrementally scrape all pages, directly starting scraping the ads one by one.

They have their advantages and disadvantages. We will choose the latter here.

We also need to decide how to navigate between pages. Instead of finding the relevant page link and clicking on it, we can use a shortcut. Notice that the page number is passed as an HTTP GET parameter in the URL: when we go to the second page, `?page=2` is appended to the page URL. Therefore, we can generate the URL for each page and visit it directly. This also saves us from juggling multiple tabs. If it were not possible to visit each page by URL, we would need to keep the results page open, open a new tab each time we want to visit a specific ad, collect the information, close that tab, and repeat. Now, once we visit a results page, we can extract all links and visit them one by one before we move on to the next page's URL.

In [54]:
current_url = driver.current_url

page_links = [f"{current_url}?page={index}" for index in range(1, last_page+1)]

# Printing only the first five links:
print(page_links[:5])

['https://www.hepsiemlak.com/cankaya-satilik?page=1', 'https://www.hepsiemlak.com/cankaya-satilik?page=2', 'https://www.hepsiemlak.com/cankaya-satilik?page=3', 'https://www.hepsiemlak.com/cankaya-satilik?page=4', 'https://www.hepsiemlak.com/cankaya-satilik?page=5']
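
Appending `?page=N` works as long as the current URL has no query string of its own. If it might have one (for example after applying more filters), the URL can be rebuilt with `urllib.parse` instead. This is a sketch under that assumption. The helper name `with_page` and the `sort=price` parameter are hypothetical:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous "page" value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_page("https://www.hepsiemlak.com/cankaya-satilik", 2))
# https://www.hepsiemlak.com/cankaya-satilik?page=2
print(with_page("https://www.hepsiemlak.com/cankaya-satilik?page=3&sort=price", 4))
# https://www.hepsiemlak.com/cankaya-satilik?sort=price&page=4
```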


To keep things short, we will loop over only the first two pages and only the first five ads on each page. For each page, we will retrieve the ads, extract their links, and accumulate them in a page-specific list. Then, we will visit each ad's page. For now, we will simply collect the sale price from them. It is possible to collect this without ever visiting a specific ad's page, but it would be too easy.

Once we are done visiting all ads on a page, we will continue with the following results page. First, let us write a function that deals with scraping the ad information for a given results page URL:

In [55]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random
import time

def scrape_results(results_page, driver):
    # Visits the results page
    driver.get(results_page)

    # These will be filled in
    ad_links = []
    sale_prices = []

    # Retrieves the ads after waiting for them to be available. Otherwise, WebDriver may fail to locate them.
    # More information on this can be found here: https://selenium-python.readthedocs.io/waits.html
    # Note that we could also use time.sleep to wait a few seconds before trying to retrieve any elements.
    wait = WebDriverWait(driver, 10)  # Waits up to 10 seconds until the necessary elements are present
    ad_cards = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="listView"]/div[contains(@class, "listing-project") or @data-v-7e43f948][@class!="w100p realty-owner-tab"]')))
    
    print(f"Scraping {results_page}. There are {len(ad_cards)} ads on this page.")

    # For the first five ads, we need to find the link
    for ad_card in ad_cards[:5]:
        # Different types of cards have their links at different places.
        # When we are not sure whether an element exists, we can use "find_elements" and then count the retrieved elements. 0 elements means there is no match.
        
        links_alt1 = ad_card.find_elements(By.XPATH, './/a[contains(@href, "/proje/")]')
        links_alt2 = ad_card.find_elements(By.XPATH, './/div[@class="list-view-content"]/div[@class="links"]/a')
        
        if len(links_alt1) > 0:
            link = links_alt1[0].get_attribute("href")
        elif len(links_alt2) > 0:
            link = links_alt2[0].get_attribute("href")
        else:
            continue
            
        ad_links.append(link)
    
    # For each individual ad
    for ad_link in ad_links:
        # Visits the ad page
        driver.get(ad_link)
        
        # Waits a fixed amount of time for the page to load
        time.sleep(1)
        
        # If a price exists and its value is valid, it is appended to the price list
        prices = driver.find_elements(By.XPATH, '//p[contains(@class, "price")]')
        if len(prices) > 0:
            # Some formatting to remove unnecessary text
            price = prices[0].text.strip().split()[0].replace(".", "")
            if price.isdigit():
                price = int(price)
                sale_prices.append(price)
        
        sleep_duration = random.uniform(3, 5)
        time.sleep(sleep_duration)
                
    return sale_prices

You may want to test such functions before moving on.

In [20]:
# scrape_results('https://www.hepsiemlak.com/cankaya-satilik?page=1', driver)
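
Parts that do not need a browser are the easiest to test. As an illustration, the price formatting can be pulled out into a small helper and checked with a few assertions. The helper name `parse_price` and the sample strings are hypothetical, and real price text on the website may differ:

```python
def parse_price(text):
    """Return the leading price as an int, or None if it is not a plain number."""
    tokens = text.strip().split()
    if not tokens:
        # Empty text: there is no first token to look at.
        return None
    digits = tokens[0].replace(".", "")
    return int(digits) if digits.isdecimal() else None


# Hypothetical sample strings
assert parse_price("1.295.000 TL") == 1295000
assert parse_price("  955.000 TL\n") == 955000
assert parse_price("Fiyat sorunuz") is None
assert parse_price("") is None
print("All price checks passed.")
```

Unlike the inline version in `scrape_results`, this helper also copes with an empty string instead of raising an `IndexError`.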

Now, we can loop over the results pages and run this function for each of them. Notice that we put the program to sleep between visits to ad pages. This is done so that the server does not get overwhelmed and does not mistake us for a malicious client. Randomizing the sleep duration helps leave a more human-like impression. While we do not wait very long here, it is advisable to wait at least 10 seconds between requests just to be sure.
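
The randomized pause can be wrapped in a small helper so that the limits are easy to change in one place. This is a minimal sketch. The name `polite_sleep`, its default limits, and the injectable `sleep` parameter (used here only to skip the actual waiting in a dry run) are assumptions:

```python
import random
import time


def polite_sleep(min_seconds=10.0, max_seconds=15.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    duration = random.uniform(min_seconds, max_seconds)
    sleep(duration)
    return duration


# Dry run that skips the actual waiting, to show the chosen duration.
waited = polite_sleep(10, 15, sleep=lambda seconds: None)
print(f"Would have waited {waited:.1f} seconds")
```

Calling `polite_sleep(3, 5)` would reproduce the pause used in `scrape_results`.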

Note that every time we visit a new page, something can go wrong. An element may not exist, the data may change in the middle of the session, the website may decide to show an alert or pop-up-like message that was never shown before, etc. This would make your program throw an exception. While it is possible to write more elaborate measures, we can simply put it inside a try block, count the number of consecutively failed pages, and stop when the number reaches 3 (we would never reach it here since we only scrape the first two pages).

We can combine all of the scraped prices in a single list that can be used later.

In [56]:
consecutive_failures = 0
all_sale_prices = []

# For the first two results pages
for page_link in page_links[:2]:
    try:
        # Scrapes the current page
        page_sale_prices = scrape_results(page_link, driver)
        
        # Adds this page's prices to the overall list
        all_sale_prices.extend(page_sale_prices)
        
        # Resets the failure counter
        consecutive_failures = 0
    except Exception:
        # Increments the failure counter
        consecutive_failures += 1
    
    if consecutive_failures > 2:
        break
        
all_sale_prices

Scraping https://www.hepsiemlak.com/cankaya-satilik?page=1. There are 24 ads on this page.
Scraping https://www.hepsiemlak.com/cankaya-satilik?page=2. There are 24 ads on this page.


[955000,
 1295000,
 1350000,
 1100000,
 5750000,
 965000,
 2200000,
 875000,
 969000,
 955000]
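
The same failure-counting idea can be packaged as a reusable function that also records which pages failed and why. This is only a sketch: `scrape_pages` is a hypothetical name, and `fake_scrape` is a stand-in for `scrape_results` that fails on purpose so that the stopping rule can be observed without a browser:

```python
def scrape_pages(page_links, scrape_page, max_consecutive_failures=3):
    """Run scrape_page on each link; stop after too many failures in a row."""
    results = []
    failed_links = []
    consecutive_failures = 0
    for page_link in page_links:
        try:
            results.extend(scrape_page(page_link))
            consecutive_failures = 0
        except Exception as error:
            consecutive_failures += 1
            failed_links.append((page_link, repr(error)))
            if consecutive_failures >= max_consecutive_failures:
                break
    return results, failed_links


def fake_scrape(page_link):
    # Hypothetical stand-in for scrape_results: every page except page 1 fails.
    if page_link.endswith("page=1"):
        return [955000, 1295000]
    raise RuntimeError("element not found")


links = [f"https://example.com/list?page={n}" for n in range(1, 7)]
prices, failures = scrape_pages(links, fake_scrape)
print(prices)         # [955000, 1295000]
print(len(failures))  # 3
```

Keeping the failed links makes it possible to retry only those pages later instead of starting over.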

## Scrolling

We can simulate scrolling using JavaScript:

In [57]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

However, simply scrolling to the end of the page may not be enough. Some pages implement infinite scrolling instead of pagination. In that case, we need to scroll, wait for the new content, and repeat. Note that on a page that never runs out of content this loop never ends, and scrolling too quickly may get us blocked by the website, so adjust the waiting time and add a stopping condition as needed:

In [58]:
def scroll_continuously(wait_between=1):
    # Taken from: https://stackoverflow.com/a/27760083/4825304
    
    # Retrieves the current scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Waits for the new content to load
        time.sleep(wait_between)

        # Retrieves the new scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        
        # If the scroll height did not change, it stops
        if new_height == last_height:
            break
        
        last_height = new_height

In [59]:
driver.get("https://infinite-scroll.com/demo/masonry/page4.html")
time.sleep(2)

scroll_continuously(wait_between=1)
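
One way to guard against scrolling forever is to cap the number of scrolls. The sketch below is a variant of the function above that takes the driver as a parameter and returns how many scrolls it performed. The `max_scrolls` and `sleep` parameters are additions for illustration, and `EndlessPage` is a hypothetical stand-in for a driver whose page keeps growing:

```python
import time


def scroll_with_limit(driver, wait_between=1.0, max_scrolls=20, sleep=time.sleep):
    """Scroll to the bottom repeatedly.

    Stops when the page height stops growing or after max_scrolls rounds,
    and returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        sleep(wait_between)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return scrolls


class EndlessPage:
    """Stand-in driver whose page grows on every scroll, so it never ends."""

    def __init__(self):
        self.height = 1000

    def execute_script(self, script):
        if script.startswith("return"):
            return self.height
        self.height += 500  # each scroll loads more content


print(scroll_with_limit(EndlessPage(), max_scrolls=5, sleep=lambda seconds: None))  # 5
```

With a real driver, the call would look like `scroll_with_limit(driver, wait_between=1, max_scrolls=20)`.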