# Web Scraping

## Contents <a id=ov>
1. [Selenium](#selenium)
2. [Requests](#r)
3. [Beautiful Soup](#bs)



There are two ways to scrape data from the web: you can control a browser (via Selenium) or send static requests to a server.
The first approach offers more possibilities, such as interacting with pages that render their content with JavaScript, while the second is faster.

## Selenium <a id=selenium>
[Back to Content Overview](#ov)

First, download the webdriver version that matches your browser. For Chrome this is [ChromeDriver](https://chromedriver.chromium.org/).
Then install the selenium package and start the webdriver application. (Make sure that your browser is updated to the same version as the chromedriver.)

In [None]:
# Download page for the Chrome webdriver: https://chromedriver.chromium.org/

In [None]:
import sys
!{sys.executable} -m pip install --upgrade selenium 

In [None]:
from selenium import webdriver

Initialize the driver object and browser window.

In [None]:
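from selenium.webdriver.chrome.service import Service
chrome_path=r'chromedriver.exe'
# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(chrome_path))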

### Navigation
Now one can access any page on the web:

In [None]:
url='https://selenium-python.readthedocs.io/'
driver.get(url)

One can move backward and forward in the browser's history:

In [None]:
driver.back()

In [None]:
driver.forward()

### Identification of web elements



In [None]:
from selenium.webdriver.common.by import By

Find the first element:

In [None]:
element=driver.find_element(By.TAG_NAME,'a')
print(element)

Find all elements (returns a list):

In [None]:
elements=driver.find_elements(By.TAG_NAME, 'a')
print(elements)
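
The two calls also differ when nothing matches: `find_element` raises a `NoSuchElementException`, whereas `find_elements` returns an empty list. The following is a minimal sketch of a helper built on that behaviour. The name `find_or_none` is hypothetical and not part of Selenium, and the `FakeDriver` class is only a stand-in so that the snippet runs without a browser.

```python
def find_or_none(driver, by, value):
    """Return the first matching element, or None if nothing matches.

    Hypothetical helper (not part of Selenium). It relies on
    find_elements returning an empty list when there is no match.
    """
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None


class FakeDriver:
    """Stand-in for a real webdriver so the sketch runs without a browser."""

    def __init__(self, known):
        self.known = known  # maps (by, value) -> list of elements

    def find_elements(self, by, value):
        return self.known.get((by, value), [])


fake = FakeDriver({("tag name", "a"): ["first link", "second link"]})
print(find_or_none(fake, "tag name", "a"))
print(find_or_none(fake, "id", "missing"))
```

With a real driver the call would look like `find_or_none(driver, By.ID, 'selenium-with-python')`.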

There are many ways to identify individual elements on a website.

In [None]:
# The second argument is a placeholder; use a value that exists on the page.
driver.find_element(By.ID, "id")
driver.find_element(By.NAME, "name")
driver.find_element(By.XPATH, "xpath")
driver.find_element(By.LINK_TEXT, "link text")
driver.find_element(By.PARTIAL_LINK_TEXT, "partial link text")
driver.find_element(By.TAG_NAME, "tag name")
driver.find_element(By.CLASS_NAME, "class name")
driver.find_element(By.CSS_SELECTOR, "css selector")

Get element via HTML tag.

In [None]:
element=driver.find_element(By.TAG_NAME, 'a')
print(element)
print(element.text,element.get_attribute('href'))

Get element via XPath (a path expression that addresses the web element within the HTML document):

In [None]:
element=driver.find_element(By.XPATH, '//*[@id="selenium-with-python"]/div[1]/p[2]')
print(element)
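
To see how such an XPath is read, the sketch below evaluates the same expression on a tiny, made-up page fragment (not the real markup of the documentation page). It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires a relative path (hence the leading dot). Selenium itself evaluates the full XPath inside the browser.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; the real page is far larger.
html = """
<body>
  <div id="selenium-with-python">
    <div>
      <p>first paragraph</p>
      <p>second paragraph</p>
    </div>
    <div>
      <p>other block</p>
    </div>
  </div>
</body>
"""

root = ET.fromstring(html)

# Read left to right: any element with id="selenium-with-python",
# then its 1st <div> child, then that div's 2nd <p> child.
match = root.find(".//*[@id='selenium-with-python']/div[1]/p[2]")
print(match.text)
```

Note that XPath positions start at 1, not 0.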

Get element via Class Name:

In [None]:
element=driver.find_element(By.CLASS_NAME, 'document')
print(element)
print(element.text)

Get element via ID:

In [None]:
element=driver.find_element(By.ID,'selenium-with-python')
print(element)
print(element.text)

### Get attributes of the elements

Get the Text:

In [None]:
print(element.text)

Get the link:

In [None]:
element=driver.find_element(By.TAG_NAME, 'a')
print(element.get_attribute('href'))

Other Attributes:

In [None]:
print(element.get_attribute('title'))
print(element.get_attribute('innerHTML'))
print(element.get_attribute('outerHTML'))

<span style="color:blue"><b>Task:</b></span> Save all links on the page in one list:




### Interact with elements
There are multiple ways to interact with web elements:


One can click on an element.

In [None]:
element=driver.find_element(By.XPATH,'//*[@id="selenium-with-python"]/div[2]/ul/li[1]/ul/li[3]/a')
print(element.text)

In [None]:
element.click()

One can send keys to an element.

In [None]:
search_bar=driver.find_element(By.XPATH,'//*[@id="searchbox"]/div/form/input[1]')

In [None]:
search_bar.send_keys('Test!')

There are special keys:

In [None]:
from selenium.webdriver.common.keys import Keys

In [None]:
#BACKSPACE
search_bar.send_keys(Keys.BACKSPACE)

In [None]:
#ENTER
search_bar.send_keys(Keys.ENTER)

Clear the input form.

In [None]:
search_bar.clear()

## Chair website:

In [None]:
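from selenium.webdriver.chrome.service import Service
chrome_path=r'chromedriver.exe'
# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(chrome_path))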

In [None]:
driver.get('https://lwus.statistik.tu-dortmund.de/en/chair/team/jentsch/')

<span style="color:blue"><b>Task:</b></span> Find the element "Team" and follow its link!


<span style="color:blue"><b>Task:</b></span> Save the names (keys) and links (values) of all employees in a dictionary.
(Hint: Manipulate the last number in the XPath.)
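
The idea behind the hint can be sketched with plain string formatting. The XPath template below is hypothetical; the real one has to be copied from the browser's developer tools, with the last index replaced by `{}`. The helper name `numbered_xpaths` is likewise only an illustration.

```python
# Hypothetical XPath; copy the real one from the browser's developer tools.
xpath_template = '//*[@id="team"]/ul/li[{}]/a'


def numbered_xpaths(template, count):
    """Build XPaths for positions 1..count (XPath indices start at 1)."""
    return [template.format(i) for i in range(1, count + 1)]


for xpath in numbered_xpaths(xpath_template, 3):
    print(xpath)
```

Each generated string can be passed to `driver.find_element(By.XPATH, xpath)`. Once the index runs past the last entry, `find_element` raises a `NoSuchElementException`.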




<span style="color:blue"><b>Task:</b></span> Visit the page of every employee and save their email address and telephone number. (Use time.sleep(1) in every iteration.)
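
The pause between page loads keeps the scraper from flooding the server with requests. A minimal sketch of such a loop follows. The helper `visit_each` is hypothetical and only records the page title; the extraction of the contact details is left to the task. The `FakeDriver` class and the example URLs are stand-ins so the snippet runs without a browser.

```python
import time


def visit_each(driver, urls, delay=1.0):
    """Open every URL in turn and record the page title.

    Hypothetical helper: pauses `delay` seconds after each page load
    so the server is not flooded with requests.
    """
    titles = {}
    for url in urls:
        driver.get(url)
        titles[url] = driver.title
        time.sleep(delay)
    return titles


class FakeDriver:
    """Stand-in for a real webdriver so the sketch runs without a browser."""

    title = ""

    def get(self, url):
        self.title = "Page at " + url


urls = ["https://example.org/a", "https://example.org/b"]
result = visit_each(FakeDriver(), urls, delay=0)
print(result)
```

With a real driver, call `visit_each(driver, links)` and keep the default delay of one second.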

## Google Translator:


<span style="color:blue"><b>Task:</b></span> Navigate to the Google Translate website.


<span style="color:blue"><b>Task:</b></span> Save the input element as input_form and the output element as return_element.


<span style="color:blue"><b>Task:</b></span> Type a text in the input_form and query the translation.