# Web Scraping

## Contents <a id=ov>
1. [Selenium](#selenium)
2. [Requests](#r)
3. [Beautiful Soup](#bs)



There are two ways to scrape data from the web: you can control a browser (via Selenium), or you can send static requests to a server.
The first approach offers more possibilities, because the browser executes JavaScript and lets you interact with the page, while the second is faster.
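
To make the contrast concrete, here is a minimal sketch of what the static route works with: a plain HTML string, with no browser and no JavaScript involved. It uses the standard library's `html.parser` instead of the Requests and Beautiful Soup packages named in the contents, purely so that it runs without extra installs. `LinkCollector` and `sample_html` are illustrative names, and the string only stands in for the body of a server response.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a piece of static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical HTML, standing in for the body of a server response
sample_html = '<p><a href="https://example.org/a">A</a> and <a href="/b">B</a></p>'

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Anything a page adds later via JavaScript would not appear in such a string, which is the main reason to reach for a controlled browser.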

## Selenium <a id=classes>
[Back to Content Overview](#ov)

First, download the webdriver that matches your browser. For Chrome this is [ChromeDriver](https://chromedriver.chromium.org/).
Then install the selenium package and start the webdriver application. (Make sure your browser is updated to the same version as ChromeDriver.)

In [None]:
#https://chromedriver.chromium.org/

In [None]:
import sys
!{sys.executable} -m pip install --upgrade selenium 

In [None]:
from selenium import webdriver

Initialize the driver object and browser window.

In [None]:
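chrome_path=r'chromedriver.exe'  # path to the downloaded driver; not passed here, so the driver must be discoverable (e.g. on PATH)
driver = webdriver.Chrome()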
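
The browser window stays open until the driver is shut down with `driver.quit()`. One possible way to make sure this happens even when the code in between raises an error is a small context manager. `managed` below is a hypothetical helper written for this notebook, not part of Selenium; it only assumes that the object passed in has a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed(driver):
    """Yield the driver and always call quit() afterwards, even on errors."""
    try:
        yield driver
    finally:
        # quit() ends the whole browser session
        driver.quit()
```

It could be used as `with managed(webdriver.Chrome()) as driver: driver.get(url)`. In an interactive session like this one, calling `driver.quit()` in the last cell achieves the same.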

### Navigation
Now one can access any page on the web:

In [None]:
url='https://selenium-python.readthedocs.io/'
driver.get(url)

One can move backward and forward in the browser’s history:

In [None]:
driver.back()

In [None]:
driver.forward()

### Identification of web elements



In [None]:
from selenium.webdriver.common.by import By

Find the first element:

In [None]:
element=driver.find_element(By.TAG_NAME,'a')
print(element)
print(element.text)

Find all elements (returns a list):

In [None]:
elements=driver.find_elements(By.TAG_NAME, 'p')
#print(elements)
elements[3].text

In [None]:
driver.find_element(By.ID, 'selenium-with-python')

There are many ways to identify individual elements on a website.

In [None]:
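# Locator strategies (templates only; replace the second argument with a real value):
# driver.find_element(By.ID, "id")
# driver.find_element(By.NAME, "name")
# driver.find_element(By.XPATH, "xpath")
# driver.find_element(By.LINK_TEXT, "link text")
# driver.find_element(By.PARTIAL_LINK_TEXT, "partial link text")
# driver.find_element(By.TAG_NAME, "tag name")
# driver.find_element(By.CLASS_NAME, "class name")
# driver.find_element(By.CSS_SELECTOR, "css selector")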

Get element via HTML tag.

In [None]:
element=driver.find_element(By.TAG_NAME,'a')
print(element)
print(element.text,element.get_attribute('href'))

Get element via XPath (a path expression that addresses the web element within the HTML document):

In [None]:
element=driver.find_element(By.XPATH, '//*[@id="selenium-with-python"]/div[1]/p[2]')
print(element)
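
To see how such an XPath is read step by step, the following sketch evaluates the same expression on a tiny, made-up document. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so it illustrates the path syntax and is no replacement for the browser's XPath engine. The `sample` markup is hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified page structure (not the real page source)
sample = """<html><body>
  <div id="selenium-with-python">
    <div><p>first paragraph</p><p>second paragraph</p></div>
    <div><p>another block</p></div>
  </div>
</body></html>"""

root = ET.fromstring(sample)

# .//*[@id='...']  -> any element, at any depth, with the given id
# /div[1]          -> its first <div> child (XPath counts from 1)
# /p[2]            -> the second <p> child of that <div>
element = root.find(".//*[@id='selenium-with-python']/div[1]/p[2]")
print(element.text)
```

Because positional steps such as `div[1]` depend on the exact page layout, an XPath like this can stop matching as soon as the page structure changes.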

Get element via Class Name:

In [None]:
element=driver.find_element(By.CLASS_NAME, 'document')
print(element)
print(element.text)

Get element via ID:

In [None]:
element=driver.find_element(By.ID,'selenium-with-python')
print(element)
print(element.text)

### Get attributes of the elements

Get the Text:

In [None]:
print(element.text)

Get the link:

In [None]:
element=driver.find_element(By.TAG_NAME, 'a')
print(element.get_attribute('href'))

Other Attributes:

In [None]:
print(element.get_attribute('title'))
print(element.get_attribute('innerHTML'))
print(element.get_attribute('outerHTML'))

<span style="color:blue"><b>Task:</b></span> Save all links on the page in one list:

In [None]:
all_links=[element.get_attribute('href') for element in driver.find_elements(By.TAG_NAME,'a')]

In [None]:
all_links
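
A list built this way can contain `None` (for `<a>` tags without an `href` attribute) and repeated URLs. A possible clean-up step, sketched here with a hypothetical helper `clean_links` that works on any list of strings:

```python
def clean_links(links):
    """Drop missing or empty hrefs and duplicates while keeping the original order."""
    seen = set()
    cleaned = []
    for href in links:
        if href and href not in seen:
            seen.add(href)
            cleaned.append(href)
    return cleaned


# Made-up input that mimics the problems described above
example = ["https://example.org/a", None, "https://example.org/b", "https://example.org/a", ""]
print(clean_links(example))
```

Applied to the notebook's data, this would be `clean_links(all_links)`.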




### Interact with elements
There are multiple ways to interact with web elements:


One can click on an element.

In [None]:
element=driver.find_element(By.XPATH,'//*[@id="selenium-with-python"]/div[2]/ul/li[1]/ul/li[3]/a')
print(element.text)

In [None]:
element.click()

One can send keys to an element.

In [None]:
search_bar=driver.find_element(By.XPATH,'//*[@id="searchbox"]/div/form/input[1]')

In [None]:
search_bar.send_keys('Tag name')

There are special keys:

In [None]:
from selenium.webdriver.common.keys import Keys

In [None]:
#BACKSPACE
search_bar.send_keys(Keys.BACKSPACE)

In [None]:
#ENTER
search_bar.send_keys(Keys.ENTER)

Clear the input form.

In [None]:
search_bar.clear()

## Chair website:

In [None]:
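from selenium.webdriver.chrome.service import Service
chrome_path=r'chromedriver.exe'
# Recent Selenium 4 releases expect the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(chrome_path))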

In [None]:
driver.get('https://lwus.statistik.tu-dortmund.de/en/chair/team/jentsch/')

<span style="color:blue"><b>Task:</b></span> Find the element "Team" and follow its link!


In [76]:
# Locate the "Team" entry in the breadcrumb navigation and open its link
element=driver.find_element(By.XPATH,'//*[@id="breadcrumb"]/ol/li[4]/a')
driver.get(element.get_attribute('href'))

<span style="color:blue"><b>Task:</b></span> Save the names (keys) and links (values) of all employees in a dictionary. (Hint: Manipulate the last number in the XPath.)




In [86]:
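from selenium.common.exceptions import NoSuchElementException

# Example XPaths of two team entries; only the index of the <a> element changes:
# '//*[@id="c160807"]/div/div/div/a[1]/div[2]'
# '//*[@id="c160807"]/div/div/div/a[3]'
i=1
link_dict={}
while True:
    try:
        element=driver.find_element(By.XPATH,f'//*[@id="c160807"]/div/div/div/a[{i}]')
        link_dict[element.text]=element.get_attribute('href')
        print(element.text,element.get_attribute('href'))
        i+=1
    except NoSuchElementException:
        # No <a> element with this index: all entries have been collected
        break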
    



Prof. Dr. Carsten Jentsch https://lwus.statistik.tu-dortmund.de/en/chair/team/jentsch/
Bettina Hilsmann (Office) https://lwus.statistik.tu-dortmund.de/en/chair/team/office/
M.Sc. Niklas Benner https://lwus.statistik.tu-dortmund.de/en/chair/team/benner/
M.Sc. Daniel Dzikowski https://lwus.statistik.tu-dortmund.de/en/chair/team/dzikowski/
M.Sc. Maxime Faymonville https://lwus.statistik.tu-dortmund.de/en/chair/team/faymonville/
M.Sc. Jonathan Flossdorf https://lwus.statistik.tu-dortmund.de/en/chair/team/flossdorf/
M.Sc. Kai-Robin Lange https://lwus.statistik.tu-dortmund.de/en/chair/team/lange/
Dr. Jan Prüser https://lwus.statistik.tu-dortmund.de/en/chair/team/prueser/
Dr. Jonas Rieger https://lwus.statistik.tu-dortmund.de/en/chair/team/rieger/
Dr. Thorsten Ziebach https://lwus.statistik.tu-dortmund.de/en/chair/team/ziebach/
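
An alternative to counting up the index is to fetch all matching elements at once and build the dictionary from them. The sketch below assumes that the same XPath without the trailing index matches every team entry, which has not been verified here. `links_by_text` is a hypothetical helper that only relies on the `.text` attribute and the `get_attribute` method used above.

```python
def links_by_text(elements):
    """Map each element's visible text to its href, skipping entries without text or href."""
    result = {}
    for element in elements:
        text = element.text.strip()
        href = element.get_attribute("href")
        if text and href:
            result[text] = href
    return result
```

With the driver from above, the call would be `links_by_text(driver.find_elements(By.XPATH, '//*[@id="c160807"]/div/div/div/a'))`. Since `find_elements` returns an empty list when nothing matches, no exception handling is needed.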


<span style="color:blue"><b>Task:</b></span> Visit the page of every employee and save their email address and telephone number. (Use time.sleep(1) in every iteration.)

In [106]:
import time
for name in link_dict:
    driver.get(link_dict[name])
    time.sleep(1)  # pause between page loads

    # Select the first <p> element whose text contains 'E-Mail:'
    p_elements=driver.find_elements(By.TAG_NAME,'p')
    element=[element for element in p_elements if 'E-Mail:' in element.text][0]

    # Get the email address from the mailto link
    email=element.find_element(By.TAG_NAME,'a').get_attribute('href').replace('mailto:','')
    # Get the phone number: everything from the first '+' to the end of the text.
    # If the text contains no '+', find() returns -1 and the slice yields only the last character.
    text=element.text
    phone=text[text.find('+'):]
    print(name,email,phone)
    

Prof. Dr. Carsten Jentsch jentsch@statistik.tu-dortmund.de +49 231 755 3869
Bettina Hilsmann (Office) hilsmann@statistik.tu-dortmund.de +49 231 755 4354
Fax: +49 231 755 5284
M.Sc. Niklas Benner niklas.benner@tu-dortmund.de +49 231 755 7925
M.Sc. Daniel Dzikowski daniel.dzikowski@tu-dortmund.de e
M.Sc. Maxime Faymonville faymonville@statistik.tu-dortmund.de +49 231 755 5203
M.Sc. Jonathan Flossdorf flossdorf@statistik.tu-dortmund.de +49 231 755 5544
M.Sc. Kai-Robin Lange kalange@statistik.tu-dortmund.de +49 231 755 5477
Dr. Jan Prüser prueser@statistik.tu-dortmund.de +49 231 755 5528
Dr. Jonas Rieger rieger@statistik.tu-dortmund.de +49 231 755 5216
Dr. Thorsten Ziebach thorsten.ziebach@tu-dortmund.de +49 231 755 3122
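
The output shows two weaknesses of slicing from the first `+`: an entry without a phone number yields a single stray character (`e`), and an entry with an additional fax line carries that line along. A more defensive extraction could use a regular expression. This is a minimal sketch; the pattern assumes numbers of the form `+49 231 755 ...` on a single line, and the sample strings are made up for illustration.

```python
import re

# Matches an international number such as "+49 231 755 0000" within one line
PHONE_PATTERN = re.compile(r"\+\d[\d ]*\d")


def extract_phone(text):
    """Return the first phone-like number in text, or None if there is none."""
    match = PHONE_PATTERN.search(text)
    return match.group(0) if match else None


# Hypothetical contact texts, loosely modelled on the cases seen above
with_fax = "E-Mail: office@example.org\nPhone: +49 231 755 0000\nFax: +49 231 755 1111"
without_phone = "E-Mail: someone@example.org"

print(extract_phone(with_fax))
print(extract_phone(without_phone))
```

In the loop above, `phone=extract_phone(element.text)` would then replace the slicing step.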


## Google Translator:


<span style="color:blue"><b>Task:</b></span> Navigate to the Google Translate website.


In [None]:
driver.get('https://translate.google.com/?hl=en')

<span style="color:blue"><b>Task:</b></span> Save the input element as input_form and the output element as output_form.


In [None]:
input_form=driver.find_element(By.XPATH,'//*[@id="yDmH0d"]/c-wiz/div/div[2]/c-wiz/div[2]/c-wiz/div[1]/div[2]/div[2]/c-wiz[1]/span/span/div/textarea')

input_form.send_keys('Was ist der Sinn des Lebens?')


In [None]:
output_form=driver.find_element(By.XPATH,'//*[@id="ow290"]/div[1]/span[1]/span/span')
print(output_form.text)

In [None]:
input_form=driver.find_element(By.TAG_NAME,'textarea')
output_form=driver.find_element(By.ID,'tw-target-text')

<span style="color:blue"><b>Task:</b></span> Type a text in the input_form and query the translation.

In [None]:
import time

def translate(text,input_form=input_form):
    # Clear the previous input and type the new text
    input_form.clear()
    time.sleep(1)
    input_form.send_keys(text)

    # Give the page time to produce the translation, then locate the output element again
    time.sleep(2)
    output_form=driver.find_element(By.XPATH,'//*[@id="yDmH0d"]/c-wiz/div/div[2]/c-wiz/div[2]/c-wiz/div[1]/div[2]/div[2]/c-wiz[2]/div/div[6]/div/div[1]/span[1]/span/span')
    time.sleep(1)
    return output_form.text
translate("Einigkeit und Recht und Freiheit für das deutsche Vaterland")
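
Fixed pauses such as `time.sleep(2)` are either too short on a slow connection or needlessly long on a fast one. The usual alternative is to poll until a condition holds. Selenium ships `WebDriverWait` for this purpose; the sketch below is a hand-rolled, generic version that only illustrates the idea. `wait_until` is a hypothetical helper, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

For example, `wait_until(lambda: driver.find_elements(By.XPATH, xpath))` returns as soon as the list of matches is non-empty, where `xpath` stands for the locator of the output element.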
    
    