### Web Scraping - Selenium
*November 24, 2021*

*Nicolás Tibatá*

In [None]:
import random 
import json
import pandas as pd
from time import sleep
from selenium import webdriver

#### Usaquén

`Getting the Driver`

You need to download the driver that matches your web browser.<br/><br/>

Here you can download it and install it on your computer: https://selenium-python.readthedocs.io/installation.html#drivers <br/><br/>

And here you can get some documentation: https://selenium-python.readthedocs.io/


In [None]:
# Once the driver is downloaded, we create it with the path where it was saved
driver_usaquen = webdriver.Chrome('/Users/Nicolas/Downloads/chromedriver')

# Then we open the page that we want to scrape
driver_usaquen.get('https://www.olx.com.co/usaquen_g4300120/apartamentos-casas-venta_c367')

If I use the main link for the whole city, I get less data than I want. To fix this, it is necessary to `go into the link of each locality of the city`. That way I can get more information.
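
The per-locality links used in this notebook share one pattern, so they can be built from a slug and a numeric id. This is only a minimal sketch, assuming the pattern seen in this notebook's URLs keeps holding. `build_url` and `LOCALITIES` are hypothetical names, and the ids are the ones that appear in the links of this notebook.

```python
# Hypothetical helper: builds the listing URL for one locality of the city.
BASE = "https://www.olx.com.co"
CATEGORY = "apartamentos-casas-venta_c367"

# Slug and id pairs copied from the URLs used in this notebook.
LOCALITIES = {
    "usaquen": 4300120,
    "suba": 4300117,
    "chapinero": 4300107,
}

def build_url(slug, locality_id):
    """Return the listing URL for a locality, following the observed pattern."""
    return f"{BASE}/{slug}_g{locality_id}/{CATEGORY}"

urls = [build_url(slug, locality_id) for slug, locality_id in LOCALITIES.items()]
```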

------

`Getting the 'Load More' Button`

In [None]:
boton = driver_usaquen.find_element_by_xpath('//button[@data-aut-id="btnLoadMore"]')

In this case we also need to `use the button to get more data`, BUT it is important to click it at random intervals, otherwise you may get blocked.

In [None]:
for i in range(50): # That's the maximum number of clicks that OLX allows.
    try:
        boton.click()
        # Sleep a random amount of time so the clicks don't follow a fixed rhythm
        sleep(random.uniform(5.0,20.0))
        boton = driver_usaquen.find_element_by_xpath('//button[@data-aut-id="btnLoadMore"]')
    except Exception:
        break # If there is no button left, end the loop
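
The same click-and-wait pattern can be wrapped in a function so it is not repeated for every locality. This is a rough sketch, not the notebook's actual code: `click_until_gone` is a hypothetical helper, and the demo uses a fake button so it runs without a browser.

```python
import random
from time import sleep

def click_until_gone(find_button, max_clicks=50, min_wait=5.0, max_wait=20.0, pause=sleep):
    """Click the button returned by find_button() until it disappears.

    find_button is any callable that returns an object with .click() and
    raises an exception when the button is no longer available.
    Returns the number of clicks performed.
    """
    clicks = 0
    for _ in range(max_clicks):
        try:
            find_button().click()
        except Exception:
            break  # no button left (or the click failed): stop
        clicks += 1
        pause(random.uniform(min_wait, max_wait))  # random wait between clicks
    return clicks

# Demo with fake buttons that run out after 3 clicks (no browser needed).
class FakeButton:
    def click(self):
        pass

remaining = [FakeButton(), FakeButton(), FakeButton()]
waits = []  # the waits are recorded here instead of actually sleeping
clicks = click_until_gone(remaining.pop, pause=waits.append)
```

With a real driver, `find_button` could be something like `lambda: driver_usaquen.find_element_by_xpath('//button[@data-aut-id="btnLoadMore"]')`, leaving `pause` at its default so the script really sleeps.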

-----

I want to get the link of each property and then use 'BeautifulSoup' to get the data of each link `(see part 2 of this repository)`

`Getting the links of the properties`

In [None]:
# We find every element that has an 'href' attribute, using XPath
links_usaquen = driver_usaquen.find_elements_by_xpath('//*[@href]')

In [None]:
data_links_usaquen = []
for link in links_usaquen:
    data = link.get_attribute('href') #This will give me the link
    data_links_usaquen.append(data) # Then I append it into a new list
    
data_links_usaquen

In [None]:
driver_usaquen.quit() # The sooner we close the browser, the better

-----

BUT this list has a problem: 
- The links are not only from the OLX website, but also from the 'properati' website. <br/><br/>

So I filter the list by a substring that all OLX property links share

In [None]:
# 'iid' is the common substring
base_links = []
for data in data_links_usaquen:
    if data.find("iid") != -1: # find() returns -1 when 'iid' is absent, so != -1 means the link contains it
        base_links.append(data)
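
Because `//*[@href]` matches every element with an `href`, the same property link may show up more than once. The sketch below filters by the `iid` marker and drops duplicates while keeping the original order. `keep_listing_links` is a hypothetical name and the sample links are made up for illustration.

```python
def keep_listing_links(links, marker="iid"):
    """Return the links that contain marker, without duplicates, in order."""
    filtered = [link for link in links if link and marker in link]
    return list(dict.fromkeys(filtered))  # dict keys keep first-seen order

# Made-up sample: two OLX-style links (one repeated), one external link, one None
sample = [
    "https://www.olx.com.co/item/apartamento-iid-111",
    "https://www.properati.com.co/detalle/abc",
    "https://www.olx.com.co/item/apartamento-iid-111",
    None,
    "https://www.olx.com.co/item/casa-iid-222",
]
result = keep_listing_links(sample)
```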

#### Suba

In [None]:
driver_suba = webdriver.Chrome('/Users/Nicolas/Downloads/chromedriver')
driver_suba.get('https://www.olx.com.co/suba_g4300117/apartamentos-casas-venta_c367')
# Note that the driver's name is different
# Note that the link is different

----

In [None]:
boton = driver_suba.find_element_by_xpath('//button[@data-aut-id="btnLoadMore"]')

In [None]:
for i in range(50): 
    try:
        boton.click()
        sleep(random.uniform(5.0,20.0))
        boton = driver_suba.find_element_by_xpath('//button[@data-aut-id="btnLoadMore"]')
    except: 
        break

-----

In [None]:
links_suba = driver_suba.find_elements_by_xpath('//*[@href]')

In [None]:
data_links_suba = []
for link in links_suba:
    data = link.get_attribute('href')
    data_links_suba.append(data)
    
data_links_suba

In [None]:
driver_suba.quit() # The sooner we close the browser, the better

In [None]:
for data in data_links_suba:
    if data.find("iid") != -1:
        base_links.append(data)
    else:
        continue

<span style="font-size:larger;">And so on and so forth</span>

<span style="font-size:larger;">But it's easier if we do it with a loop </span>

In [None]:
driver = webdriver.Chrome('/Users/Nicolas/Downloads/chromedriver')
paginas_web = ['https://www.olx.com.co/usaquen_g4300120/apartamentos-casas-venta_c367',
              'https://www.olx.com.co/suba_g4300117/apartamentos-casas-venta_c367',
              'https://www.olx.com.co/chapinero_g4300107/apartamentos-casas-venta_c367',
              'https://www.olx.com.co/kennedy_g4300111/apartamentos-casas-venta_c367',
              'https://www.olx.com.co/engativa_g4300109/apartamentos-casas-venta_c367',
              'https://www.olx.com.co/fontibon_g4300110/apartamentos-casas-venta_c367',
              'https://www.olx.com.co/bosa_g4300105/apartamentos-casas-venta_c367'] #And so on...
links_data = []
for pagina in paginas_web:
    driver.get(pagina) # Open one locality at a time, not the whole list
    sleep(random.uniform(10.0,20.0))
    for i in range(50):
        try:
            # Looking for the button inside the try also covers pages without it
            button = driver.find_element_by_xpath('//button[@data-aut-id="btnLoadMore"]')
            button.click()
            sleep(random.uniform(5.0,20.0))
        except Exception:
            break # No button left on this page
    sleep(5)
    links = driver.find_elements_by_xpath('//*[@href]')
    sleep(5)
    for link in links:
        data = link.get_attribute('href')
        links_data.append(data)
    sleep(random.uniform(5.0,10.0))

# Close the browser only once, after the last page
driver.quit()

# Keep only the OLX property links, as before
base_links = [link for link in links_data if link and "iid" in link]

----

In [None]:
# Finally, we save the list of links (it is used in part 2 of this repository)
with open('base_links.txt', 'w') as f:
    f.write(json.dumps(base_links))
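
Since the file holds a JSON array, it can be read back with `json.load` when the links are needed again. This is a small round-trip sketch. `save_links` and `load_links` are hypothetical helpers, and the example links are made up.

```python
import json
import os
import tempfile

def save_links(links, path):
    """Write the list of links to path as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(links, f)

def load_links(path):
    """Read back a list of links written by save_links."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Round trip in a temporary folder, using made-up links
example = [
    "https://www.olx.com.co/item/a-iid-1",
    "https://www.olx.com.co/item/b-iid-2",
]
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "base_links.txt")
    save_links(example, path)
    restored = load_links(path)
```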

<span style="font-size:larger;">And if you get blocked... there's nothing that changing your VPN can't fix ;) </span>