# Web Scraping

## Contents <a id=ov>
1. [Selenium](#selenium)
2. [Requests](#r)
3. [Beautiful Soup](#bs)



There are two ways to scrape data from the web: you can control a browser (via Selenium) or send static requests to a server.
Controlling a browser offers more possibilities, for example on pages that build their content with JavaScript, while static requests are faster.

## Selenium <a id=classes>
[Back to Content Overview](#ov)

First you need to download a WebDriver version that matches your browser. For Chrome it is [ChromeDriver](https://chromedriver.chromium.org/).
Then you need to install the selenium package and start the webdriver application.


In [15]:
import sys
!{sys.executable} -m pip install selenium


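from selenium import webdriver
from selenium.webdriver.chrome.service import Service
chrome_path = r'chromedriver.exe'  # path to the downloaded driver binary
# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(chrome_path))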




Now you can access any page on the web:

In [16]:
url='https://lwus.statistik.tu-dortmund.de/lehrstuhl/arbeitsgruppe'
driver.get(url)

There are many ways to identify individual elements on a website.

#### Via tag name

In [17]:
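from selenium.webdriver.common.by import By
# find_element returns the first match; the find_element_by_* helpers were removed in Selenium 4.3
element = driver.find_element(By.TAG_NAME, 'a')
print(element.text, element.get_attribute('href'))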

Zum Inhalt https://lwus.statistik.tu-dortmund.de/lehrstuhl/arbeitsgruppe#content


In [None]:
from selenium.webdriver.common.by import By
elements = driver.find_elements(By.TAG_NAME, 'a')  # find_elements returns a list of all matches

<span style="color:blue"><b>Task:</b></span> Save all links on the page in one list:
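
One possible approach to this task is sketched below. The helper name `collect_links` is hypothetical and not part of Selenium. It only assumes that each element offers `get_attribute`, as Selenium's web elements do, so it can be tried without a running browser.

```python
def collect_links(elements):
    """Return the href of every element that has one, in page order."""
    links = []
    for el in elements:
        href = el.get_attribute('href')
        if href:  # skip anchors without an href attribute
            links.append(href)
    return links

# With a live session (assumption: `driver` and `By` are set up as above):
# all_links = collect_links(driver.find_elements(By.TAG_NAME, 'a'))
```

Filtering out empty values matters because some `<a>` tags serve as named anchors and carry no `href`.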

#### Via ID

In [14]:
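from selenium.webdriver.common.by import By
element = driver.find_element(By.ID, 'c162600')  # raises NoSuchElementException if the ID is not on the page
print(element.text)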

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="c162600"]"}
  (Session info: chrome=102.0.5005.61)


#### Via class name

In [5]:
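from selenium.webdriver.common.by import By
# By.CLASS_NAME accepts a single class name; this space-separated list raises the error below
element = driver.find_element(By.CLASS_NAME, 'tile tile-link tile--thirds')
print(element.text, element.get_attribute('href'))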

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".tile tile-link tile--thirds"}
  (Session info: chrome=102.0.5005.61)
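
The selector in the error message, `.tile tile-link tile--thirds`, shows why the lookup fails: only the first name is treated as a class, and the other two are read as tag names. A minimal sketch of one way around this is to chain the classes into a single compound CSS selector. The helper name `css_for_classes` is made up for illustration.

```python
def css_for_classes(class_attr):
    """Turn a space-separated class attribute into a compound CSS selector."""
    return ''.join('.' + name for name in class_attr.split())

selector = css_for_classes('tile tile-link tile--thirds')
print(selector)  # .tile.tile-link.tile--thirds

# With a live session (assumption: `driver` and `By` are set up as above):
# element = driver.find_element(By.CSS_SELECTOR, selector)
```

Whether an element with all three classes exists on the page still has to be checked in the browser's developer tools.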


#### Via xpath

In [None]:
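from selenium.webdriver.common.by import By
element = driver.find_element(By.XPATH, '//a[@href]')  # example expression only: first <a> with an href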

<span style="color:blue"><b>Task:</b></span> Save links to personal pages of all employees of the chair:

Benjamin Jeffrey


In [None]:
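# via class: chain the classes in one CSS selector, since By.CLASS_NAME takes a single class
from selenium.webdriver.common.by import By
element = driver.find_element(By.CSS_SELECTOR, '.tile.tile-link.tile--thirds')
print(element.text, element.get_attribute('href'))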

In [7]:
# via tag name
from selenium.webdriver.common.by import By
element = driver.find_element(By.TAG_NAME, 'a')
print(element.text, element.get_attribute('href'))

Zum Inhalt https://lwus.statistik.tu-dortmund.de/lehrstuhl/arbeitsgruppe#content


In [21]:
# You can query some attributes of the element:
print(element.text)
print(element.get_attribute('href'))
print(element.get_attribute('title'))
print(element.get_attribute('innerHTML'))
print(element.get_attribute('outerHTML'))

Zum Inhalt
https://lwus.statistik.tu-dortmund.de/lehrstuhl/arbeitsgruppe#content


    Zum Inhalt

<a href="#content" class="sr-only sr-only-focusable onfocus-top-left">
    Zum Inhalt
</a>
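
The output above shows a detail worth noting: `get_attribute('href')` returns an absolute URL, while the raw HTML only contains the relative `#content`. When you work with raw HTML instead, for example from a static request, you have to resolve relative links yourself. A minimal sketch with the standard library, using the page URL from above:

```python
from urllib.parse import urljoin

page_url = 'https://lwus.statistik.tu-dortmund.de/lehrstuhl/arbeitsgruppe'
raw_href = '#content'  # value as it appears in the outerHTML

# urljoin resolves a relative reference against the page it was found on
absolute = urljoin(page_url, raw_href)
print(absolute)
```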


You can navigate to the next page either by loading the link target (href) or by clicking on the element.

In [19]:
driver.get(element.get_attribute('href'))

In [22]:
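# The skip link is visually hidden (class sr-only), which is the likely reason the click fails below
element.click()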

ElementNotInteractableException: Message: element not interactable
  (Session info: chrome=94.0.4606.71)


<span style="color:blue"><b>Task:</b></span> Scrape the name, the entry, the address, the mail and research interests of all rgs students.

### Type text in entry fields

In [29]:
url='https://www.google.de/'
driver.get(url)

In [32]:
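from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# Absolute XPaths like this one break as soon as the page layout changes
element = driver.find_element(By.XPATH, '/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input')
element.send_keys('Niklas Benner')
element.send_keys(Keys.ENTER)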

## User agent
Set a custom user agent to hide your actual browser, operating system, etc.

In [40]:
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
# UserAgent is a class; instantiate it and read .random to get a user-agent string
ua = UserAgent()
print(ua.random)

In [None]:
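from selenium.webdriver.chrome.service import Service
chrome_options = Options()
# Pass an actual user-agent string, not the UserAgent class itself
chrome_options.add_argument(f'user-agent={UserAgent().random}')
driver = webdriver.Chrome(service=Service(chrome_path), options=chrome_options)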

## Requests <a id=r>
[Back to Content Overview](#ov)

Get the source code of a website with this simple request:

In [36]:
import requests
r=requests.get(url)
html_doc = r.text
print(html_doc)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="Rbp6v5pd50XbX7dEusTeOA==">(function(){window.google={kEI:'Gj5lYaCIL4SP9u8Pj4iFmA4',kEXPI:'0,1302530,56879,6059,206,4804,926,1390,383,246,5,1354,5250,1122516,1197783,618,328866,51223,16115,28684,17572,4858,1362,9290,3027,4747,12835,4020,978,13228,3847,4192,6431,7431,11613,2777,921,5079,1593,1279,2212,530,149,1103,840,1983,213,4101,3514,606,2023,1777,520,14670,3227,419,2427,6,12354,5096,598,15722,908,2,941,15756,1,2,346,230,4385,1797,277,149,13975,4,1252,276,2304,1238,5225,576,4684,2014,18375,2658,7355,32,5664,7964,2306,637,1494,16786,5821,2536,4092,2,3138,8,906,3,3541,1,5096,2,1,3,9608,1814,283,912,5992,15447,8,1273,1715,2,8496,105,20,1218,1,35,1,4146,1244,1,686,1094,1,2816,1678,126,618,2350,3502,10463,1160,

In [None]:
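import time
# Retry until the request succeeds, pausing one second between attempts
while True:
    try:
        html_doc = requests.get(url).text
        break
    # requests raises its own ConnectionError, which the built-in one does not catch
    except requests.exceptions.ConnectionError:
        time.sleep(1)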
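
The loop above retries forever if the server stays unreachable. As a sketch of one alternative, the hypothetical helper `fetch_with_retry` below gives up after a fixed number of attempts and doubles the pause each time. It takes any zero-argument callable, so it is not tied to requests. The default `retry_on=(OSError,)` is an assumption that fits requests, whose exceptions derive from `OSError`.

```python
import time

def fetch_with_retry(fetch, attempts=5, delay=1.0, retry_on=(OSError,)):
    """Call fetch() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # give up and surface the last error
            # exponential backoff: delay, 2*delay, 4*delay, ...
            time.sleep(delay * 2 ** (attempt - 1))

# With requests (assumption: `url` is defined as above):
# html_doc = fetch_with_retry(lambda: requests.get(url, timeout=10).text)
```

Passing a `timeout` to `requests.get` is advisable here, since a request without one can hang instead of failing.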

## User agent

In [39]:
from fake_useragent import UserAgent
# Instantiate UserAgent and take .random; str(UserAgent) would send the class repr instead
headers = {"User-Agent": UserAgent().random}
html_doc = requests.get(url, headers=headers).text

print(html_doc)


<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="bCklCf73Tm2st1O02oobhA==">(function(){window.google={kEI:'PT5lYfGWKe-N9u8P4NelqA4',kEXPI:'0,1302536,56873,6059,206,2415,2389,2316,383,246,5,1354,5250,1122516,1197754,320,327,328866,51224,16111,28687,17572,4859,1361,284,9006,3027,17582,4020,978,13228,3847,4192,6431,14761,4283,2777,919,5081,1593,1279,2212,239,291,149,1103,840,1983,4314,3514,606,2023,2297,6343,8327,2269,1,957,419,2426,7,5599,6755,5096,598,15722,908,2,941,6038,10,349,9359,1,2,346,230,6182,278,148,12314,1661,4,1252,276,2304,1236,5803,4684,2014,13611,4764,2658,872,6485,30,5615,5797,2216,2305,639,1493,16786,2521,3297,2539,4092,2,3138,6,908,3,3541,1,14710,1816,281,38,874,5992,1161,14263,23,1281,1715,2,3037,5459,105,30,1208,1,35,1,4146,1253,1,677,10

## Beautiful Soup <a id=bs>
[Back to Content Overview](#ov)


[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a widely used library for parsing HTML code and navigating the resulting tree.
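
To see what parsing gives you before working with a full page, here is a small sketch on a hand-written HTML snippet. The snippet and its URLs are invented for illustration. It reads the page title and collects the text and target of every link that has an `href`.

```python
from bs4 import BeautifulSoup

# Invented sample document, small enough to check the result by eye
sample_html = """
<html><head><title>Demo</title></head>
<body>
  <a href="/team">Team</a>
  <a href="https://example.org/contact">Contact</a>
  <a>No target</a>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
title = sample_soup.title.string
# href=True keeps only the <a> tags that actually carry an href attribute
links = [(a.get_text(strip=True), a['href']) for a in sample_soup.find_all('a', href=True)]
print(title)
print(links)
```

The same calls work on the `soup` object built from a downloaded page.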

In [41]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="de">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <title>
   Google
  </title>
  <script nonce="bCklCf73Tm2st1O02oobhA==">
   (function(){window.google={kEI:'PT5lYfGWKe-N9u8P4NelqA4',kEXPI:'0,1302536,56873,6059,206,2415,2389,2316,383,246,5,1354,5250,1122516,1197754,320,327,328866,51224,16111,28687,17572,4859,1361,284,9006,3027,17582,4020,978,13228,3847,4192,6431,14761,4283,2777,919,5081,1593,1279,2212,239,291,149,1103,840,1983,4314,3514,606,2023,2297,6343,8327,2269,1,957,419,2426,7,5599,6755,5096,598,15722,908,2,941,6038,10,349,9359,1,2,346,230,6182,278,148,12314,1661,4,1252,276,2304,1236,5803,4684,2014,13611,4764,2658,872,6485,30,5615,5797,2216,2305,639,1493,16786,2521,3297,2539,4092,2,3138,6,908,3,3541,1,14710,1816,281,38,874,5992,1161,14263,23,1281,1715,2,3037,5459,105,30,12