# Dynamic scraper
#### Goal
As the title suggests, this Jupyter notebook contains a data scraper. The goal is to crawl https://www.portaldrazeb.cz and collect current data on auctions and auctioneers, plus the list of auction attributes (categories, regions and districts) that we will later use to filter the auctions once the data is processed.
#### Problem
The website renders its content dynamically, so the data we need cannot be extracted from the "static" source code alone: it differs from the "dynamic" source that the browser ends up with once the page has finished loading. The website also offers no API we could use (one exists, but it is not available to us and does not cover what we need).
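
To make the difference concrete, here is a small sketch that counts `article` elements in two HTML strings using only the standard library. Both strings are invented stand-ins (one for a static response holding an empty container, one for the rendered page); they are not copies of the real site, and `TagCounter` and `count_tags` are hypothetical helper names.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given start tag occurs in the fed HTML."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    counter = TagCounter(tag)
    counter.feed(html)
    return counter.count


# Invented examples: the static source has only an empty container,
# the rendered source has the items filled in by the page's scripts.
static_html = '<html><body><div id="app"></div></body></html>'
rendered_html = (
    '<html><body><div id="app">'
    '<article>A</article><article>B</article>'
    '</div></body></html>'
)

static_count = count_tags(static_html, 'article')
rendered_count = count_tags(rendered_html, 'article')
```

A scraper that only sees the first string finds nothing to extract, which is why the notebook reads the page source from a running browser instead.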

#### Solution
We need a tool that can handle dynamic content. Our solution is to install the selenium package and set up a Google Chrome webdriver. We open the webpage in a real browser, read its rendered source code and navigate between pages. With this package (and the webdriver, which is also included in the GitHub repository) we can download all the data we need. The individual methods are described in more detail in the class docstring and in the comments.
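
The methods in this notebook pause with a fixed `time.sleep(5)` after every navigation. As a rough sketch of an alternative, a small polling helper can return as soon as a condition holds and give up after a timeout. The name `wait_until` and its parameters are hypothetical and not part of the notebook; the `page_ready` function only simulates a page that becomes ready on the third check.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for "has the page rendered yet?": becomes true on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(page_ready, timeout=2.0, poll=0.01)
```

With a real driver the condition could be, for example, `lambda: driver.find_elements('css selector', 'article')`, which is truthy once at least one `article` element exists. Selenium also provides its own `WebDriverWait` for the same purpose.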

In [1]:
#!pip install selenium

In [8]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import time
from tqdm import tqdm

In [9]:
url_auctions='https://www.portaldrazeb.cz/drazby/pripravovane' 
url_auctioneers = 'https://www.portaldrazeb.cz/drazebnici'

In [17]:
class DataDownloader:
    '''
    This class crawls through dynamic content of https://www.portaldrazeb.cz and collects following things:
    
            1) soup object for every auctioneer
            2) link to every auction + auction category (since the category is not within the auction page itself)
            3) list of all possible values from drop-down menu (auction categories, regions and districts)
    '''
    def __init__(self):
        # the lists that hold the data are created inside the individual methods
        print('Downloader successfully initialized!')
        print(' ')
        print(DataDownloader.__doc__)
        
    def get_soups_of_auctioneers(self,link):
        '''
        Crawls through all pages of auctioneers and stores a soup object for every auctioneer currently listed.
        '''
        self.auctioneers_soups = []
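        # initiating a webdriver (passing the driver path positionally works on Selenium 3;
        # recent Selenium 4 releases expect the path via a Service object instead)
        driver = webdriver.Chrome('./chromedriver') 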
        
        # opening the link in Chrome
        driver.get(link)  
        time.sleep(5) 
        
        # creating a soup object from the page source code
        soup = BeautifulSoup(driver.page_source, "html.parser")  
        
        # getting number of pages
        last_page = int(soup.find('div',{'class':'el-pagination'}).findAll('li',{'class':'number'})[-1].text) 
        
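        # locating the element into which we write the page number
        # ('css selector' is the string behind By.CSS_SELECTOR; this form works on Selenium 3 and 4,
        # whereas find_element_by_css_selector is gone from later Selenium 4 releases)
        page_number = driver.find_element('css selector', 'input[type="number"]')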
        
        # looping through all pages and saving the soups
        for page in tqdm(range(1,last_page+1)):
            # getting to a page (delete content, send number of the page, press Enter)
            page_number.send_keys(Keys.BACK_SPACE)
            page_number.send_keys(Keys.BACK_SPACE)
            page_number.send_keys(str(page)) 
            page_number.send_keys(Keys.RETURN)
            time.sleep(5)  

            # save soups of particular auctioneers
            html = driver.page_source   
            soup = BeautifulSoup(html, "html.parser") # soup of current page (same explicit parser as above)
            for i in soup.findAll('article'):
                self.auctioneers_soups.append(i) # extract all auctioneers
                
        # close the window and check the soups
        driver.close()
        if len(self.auctioneers_soups)>50:
            print(f'Soup objects of auctioneers successfully downloaded! There are {len(self.auctioneers_soups)} of them right now.')
                
    def get_auction_links_and_categories(self,link):
        '''
        Crawls through all pages of auctions and collects the link and category of every auction from their source code.
        '''
        self.auction_links_and_categories = []
        
        # initiating the webdriver and opening the link
        driver = webdriver.Chrome('./chromedriver')
        driver.get(link)  # use the argument rather than the global url_auctions
        time.sleep(5) 
        
        # getting number of pages from a soup
        html = driver.page_source
        soup = BeautifulSoup(html, "html.parser")
        last_page = int(soup.find('div',{'class':'el-pagination'}).findAll('li',{'class':'number'})[-1].text) # get number of pages
        
        # getting source codes from all pages and saving as soups
        # locating the element into which we write the page number ('css selector' is the string behind By.CSS_SELECTOR)
        page_number = driver.find_element('css selector', 'input[type="number"]')
        auctions_pages_soups = []
        for page in tqdm(range(1,last_page+1)):
            # get to a page
            page_number.send_keys(Keys.BACK_SPACE)
            page_number.send_keys(Keys.BACK_SPACE)
            page_number.send_keys(str(page))
            page_number.send_keys(Keys.RETURN)
            time.sleep(5)
            # save soup object of the page
            html = driver.page_source 
            auctions_pages_soups.append(BeautifulSoup(html, "html.parser"))

        for soup in auctions_pages_soups:
            for i in soup.findAll('article'):
                # extracting link
                auction = []
                auction.append(i.find('a')['href'])

                # extracting categories from the current article (i), not from the first article on the page
                categ = i.find('tbody').findAll('tr')[1].find('span').text.lstrip('/').split('/')
                auction.append(categ)
                self.auction_links_and_categories.append(auction) # saving the data
        
        # closing the window and checking whether something downloaded
        driver.close()
        if len(self.auction_links_and_categories)>200:
            print(f'Auction links and categories successfully downloaded! There are {len(self.auction_links_and_categories)} auctions right now.')

    def get_items_from_dropdown_menu(self,link):
        '''
        Downloads all auction categories, regions and districts.
        '''
        self.categories = []
        self.regions_and_districts = []
        
        # initiating the webdriver and opening the link
        driver = webdriver.Chrome('./chromedriver')
        driver.get(link)  # use the argument rather than the global url_auctions
        time.sleep(5) 
        
        # saving source code, extracting auction categories
        html = driver.page_source 
        soup = BeautifulSoup(html, "html.parser")
        for categ in soup.findAll('ul',{'class':'el-scrollbar__view el-select-dropdown__list'})[0].findAll('span'):
            self.categories.append(categ.text)
            
        # extracting regions and districts
        for region in soup.findAll('ul',{'class':'el-scrollbar__view el-select-dropdown__list'})[1].findAll('ul',{'class':'el-select-group__wrap'}):
            aux = []
            for district in region.findAll('span'):
                aux.append(district.text)
            self.regions_and_districts.append([region.find('li').text,aux])
        
        # closing the window and checking whether everything downloaded
        driver.close()
        if len(self.categories) > 5 and len(self.regions_and_districts) == 14:
            print('Auction categories, regions and districts successfully downloaded!')
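
Each entry of `auction_links_and_categories` is a two-item list: the auction link and the category path split into its parts. A minimal sketch of how one such record is built, with the string handling taken from the class above; the helper name `build_auction_record` and the sample link and category text are hypothetical, and the real values on the site may look different.

```python
def build_auction_record(href, raw_category):
    """Return [link, [category, subcategory, ...]] for one auction."""
    # lstrip('/') removes every leading slash, split('/') cuts at the remaining ones
    categories = raw_category.lstrip('/').split('/')
    return [href, categories]


# Hypothetical inputs, shaped like the values read from one <article> element
record = build_auction_record('/drazba/12345', '/Nemovitosti/Byty')
```

Note that the parts are not stripped of whitespace, so any padding around the names in the page source would be kept as-is.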

In [18]:
down = DataDownloader()

Downloader successfully initialized!
 

    This class crawls through dynamic content of https://www.portaldrazeb.cz and collects following things:
    
            1) soup object for every auctioneer
            2) link to every auction + auction category (since the category is not within the auction page itself)
            3) list of all possible values from drop-down menu (auction categories, regions and districts)
    


In [6]:
down.get_soups_of_auctioneers('https://www.portaldrazeb.cz/drazebnici') # takes approx. 1 minute

100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:43<00:00,  5.39s/it]

Soup objects of auctioneers successfully downloaded! There are 157 of them right now.





In [19]:
down.get_auction_links_and_categories('https://www.portaldrazeb.cz/drazby/pripravovane') # takes approx. 5 minutes

100%|██████████████████████████████████████████████████████████████████████████████████| 54/54 [05:13<00:00,  5.80s/it]


Auction links and categories successfully downloaded! There are 1080 auctions right now.


In [13]:
down.get_items_from_dropdown_menu('https://www.portaldrazeb.cz/drazby/pripravovane')

Auction categories, regions and districts successfully downloaded!
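
After this call, `down.regions_and_districts` holds a list of `[region, [districts]]` pairs. As a sketch of how that shape could be used for filtering later, the pairs can be flattened into a lookup from district to region. The sample values below are placeholders, not scraped output.

```python
# Placeholder data in the same shape as down.regions_and_districts
regions_and_districts = [
    ['Region A', ['District 1', 'District 2']],
    ['Region B', ['District 3']],
]

# Map every district to the region it belongs to
district_to_region = {
    district: region
    for region, districts in regions_and_districts
    for district in districts
}
```

If two regions ever listed a district under the same name, the later one would overwrite the earlier one in this dictionary, so the nested list remains the safer source of truth.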
