# Web scraping

## What is web scraping?
- Machine reading of unstructured data from websites

## What's not web scraping?
- Downloading data via an API
- Downloading structured data (JSON, CSV, ...)
- Crawling: traversing and indexing an entire website by following its internal hyperlinks

## Examples of web scraping

### How Mirka Spáčilová rates films
> When you die and your whole life passes before your eyes,
> Mirka Spáčilová will come and give it 60%

Does Mirka Spáčilová really rate all films 60%? Is there any way to verify it?

Michal Bláha downloaded a total of 1333 of Mirka Spáčilová's film reviews from iDnes and built a table of the films and their ratings. Every third film got sixty percent :-)
<div>
<img src="static/spacilova.png"/>
</div>

https://www.michalblaha.cz/2017/10/filmova-kriticka-mirka-spacilova-v-cislech/

### Price analysis on Czech e-shops
Do you think we can trust the 80% discounts that most Czech e-shops offer during Black Friday events?

In 2017, a month before Black Friday, Apify started monitoring the prices of all products in the largest Czech e-shops on a daily basis. During Black Friday itself, they also checked the prices of Black Friday products 4 times a day.

What did they find out? The average advertised discount was around 30%, while the real one was around 20%. All it takes is to raise the "original price" before the discount. There were also cases where the goods became more expensive, but carried a bigger advertised discount.

The project has grown in the following years, so today we can install the browser extension and see for ourselves.

<div>
    <img src="static/hlidacshopu.png"/>
</div>
<br />
https://blog.apify.com/black-friday-po-%C4%8Desku-kouzla-se-slevami-c7c0d2e7eeaa

https://medium.com/@jakubbalada/black-friday-2019-s-hl%C3%ADda%C4%8Dem-shop%C5%AF-9a3ddd352a8c

## Ethics of web scraping
- Before you start web scraping, check whether the site offers structured data for download or provides an API.
- **Examples:**
    - https://data.gov.cz/datov%C3%A9-sady?poskytovatel=%C4%8Cesk%C3%BD%20statistick%C3%BD%20%C3%BA%C5%99ad
    - https://www.ncdc.noaa.gov/data-access
    - https://www.mapakriminality.cz/data/
    - http://opendata.praha.eu/dataset/meteostanice-chmi-api
- Find out what rights you have to the data, and do not publish the obtained data illegally.
- Access the site at a reasonable rate: you are trying to get the data, not to bring the site down :-)
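
One way to keep the request rate reasonable is to pause between downloads. The following is only a minimal sketch: the helper name `throttled`, the example URLs, and the interval are made up for illustration, and a real project may need something more elaborate.

```python
import time


def throttled(items, min_interval, sleep=time.sleep, clock=time.monotonic):
    """Yield items so that consecutive items are at least min_interval seconds apart."""
    last = None
    for item in items:
        if last is not None:
            # How much of the minimum interval is still left to wait
            remaining = min_interval - (clock() - last)
            if remaining > 0:
                sleep(remaining)
        last = clock()
        yield item


# Hypothetical usage: the loop body would download `url`; here we only print it.
for url in throttled(["https://example.com/a", "https://example.com/b"], min_interval=0.1):
    print(url)
```

The `sleep` and `clock` parameters exist only so that the pause can be replaced in tests; in normal use the defaults are enough.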

## What a website is made of
- **HTML** (HyperText Markup Language): structured page content (text and images)
- **CSS** (Cascading Style Sheets): page styling and layout
- **JavaScript**: interactivity of page content and layout

### HTML 
<div>
    <img src="static/html.png"/>
</div>

- HTML consists of tags, e.g. ``<img>``
- Most HTML tags are paired, e.g. ``<h2>`` and ``</h2>``
- Tags can have attributes that further specify what the tag displays and how
- The `class` attribute is usually used to style a page, and when web scraping it often lets us tell different parts of a page apart
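
To make tags and attributes more concrete, here is a minimal sketch that lists them using `html.parser` from the Python standard library. The sample HTML and the `TagCollector` class are made up for illustration and are not part of the course materials.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect (tag, attributes) pairs for every opening tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))


# Hypothetical sample document: one paired tag and one unpaired tag
sample_html = '<h2 class="headline">News</h2><img src="cat.png" alt="A cat">'

collector = TagCollector()
collector.feed(sample_html)
for tag, attrs in collector.tags:
    print(tag, attrs)
```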

### CSS

- Describes how to display HTML elements
- A rule contains 2 parts, an element selector and a declaration block:
```
p.error {
  color: red;
}
```


## Installation

In [None]:
#%pip install lxml beautifulsoup4 selenium

## Pandas: read_html()

If we just need to download tables from the web, we can use the [Pandas](https://pandas.pydata.org/pandas-docs/version/0.25/reference/api/pandas.read_html.html) library:

In [1]:
import pandas as pd

We convert the tables from the Wikipedia page [List of countries of the world by alcohol consumption](https://cs.wikipedia.org/wiki/Seznam_st%C3%A1t%C5%AF_sv%C4%9Bta_podle_spot%C5%99eby_alkoholu) to dataframes using the `read_html()` function:

In [2]:
tables = pd.read_html('https://cs.wikipedia.org/wiki/Seznam_st%C3%A1t%C5%AF_sv%C4%9Bta_podle_spot%C5%99eby_alkoholu')

The result is a list of dataframes:

In [3]:
type(tables)

list

In [4]:
len(tables)

2

In [5]:
tables[0].head()

Unnamed: 0,stát,evidováno,neevidováno,celkem,pivo,víno,destiláty,ostatní
0,Česko,14.97,1.48,16.45,8.51,2.33,3.59,0.39
1,Maďarsko,12.27,4.0,16.27,4.42,4.94,3.02,0.14
2,Rusko,11.03,4.73,15.76,3.65,0.1,6.88,0.34
3,Ukrajina,8.1,7.5,15.6,2.69,0.58,5.21,0.02
4,Estonsko,13.77,1.8,15.57,5.53,1.09,9.19,0.43


According to the WHO, Czechia comes first! At least in average alcohol consumption in 2003-2005.

In [6]:
tables[1].head()

Unnamed: 0,Pořadí,Stát,Spotřeba v litrech,Rok
0,1,Francie,12.6,2011
1,2,Rakousko,12.2,2009
2,3,Estonsko,12.0,2011
3,4,Německo,11.7,2009
4,5,Irsko,11.6,2011


According to the OECD, it looks a little different 🧐

**Warning:** The world is not ideal (which is why so much gets drunk in it) and `read_html()` is not always able to extract tables. The function may not "see" the table at all, or it may fail to cope with its structure.

### Exercises

Get a table of current economic data from the website of the [Czech Statistical Office](https://www.czso.cz/csu/czso/aktualniinformace), `https://www.czso.cz/csu/czso/aktualniinformace`.

**Hint**: If you see garbled characters in the dataframe, try specifying `encoding='utf-8'`.

If the data we are looking for is not in the form of a table on the web, we must move to "more drastic" tools :-)

## BeautifulSoup

The [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library is used to extract data from HTML and XML files. It works with various parsers that analyze HTML files, and lets you select the HTML elements you need and work with them.

Footnote: If you have a bigger project and need to go through dozens of e-shops, for example, look at the [Scrapy](http://docs.scrapy.org/en/latest/intro/overview.html) library, which is better suited to such tasks.

In [7]:
from bs4 import BeautifulSoup

First we load a sample HTML document from a file:

In [8]:
with open("static/html_doc.html", "r") as f:
    html_doc = f.read()

We will create an object of type `BeautifulSoup`. The first argument is HTML data, the second is used to specify the parser.

In [9]:
soup = BeautifulSoup(html_doc, 'html.parser')

Using the `prettify()` method, we can print nicely formatted HTML:

In [10]:
#print(soup.prettify())

We can navigate a `BeautifulSoup` object using tag names. The first `title` element:

In [11]:
soup.title

<title>PyData Prague | pydata.cz</title>

The first `h1` element:

In [12]:
soup.h1

<h1><a href="https://pydata.cz/">pydata.cz</a></h1>

The name of the `h1` element:

In [13]:
soup.h1.name

'h1'

The name of the parent element (`parent` and `name` attributes):

In [14]:
soup.h1.parent.name

'div'

Data inside the `h1` element (the [`string`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string) attribute):

In [15]:
soup.h1.string

'pydata.cz'

The first `h2` element:

In [16]:
soup.h2

<h2 id="code-of-conduct">Code of Conduct</h2>

The `id` attribute of the `h2` element:

In [17]:
soup.h2['id']

'code-of-conduct'

We can also find all tags of a given type, e.g. `find_all('a')` will find all links:

In [18]:
links = soup.find_all('a')
links

[<a href="https://pydata.cz/">pydata.cz</a>,
 <a class="pydata" href="https://pydata.org/">PyData</a>,
 <a href="https://numfocus.org/">NumFOCUS</a>,
 <a class="pydata" href="https://pydata.org/code-of-conduct/">pydata.org/code-of-conduct/</a>]

We can browse the search results:

In [19]:
links[0]['href']

'https://pydata.cz/'

In [20]:
links[0].text

'pydata.cz'

We can search not only by tag name, but also by attributes:

In [21]:
soup.find(id="pydata-prague")

<h1 id="pydata-prague">PyData Prague</h1>

If we search by class, we cannot use ``soup.find(class="pydata")``, because ``class`` is a keyword in Python. We have to use ``class_``:

In [22]:
soup.find(class_="pydata")

<a class="pydata" href="https://pydata.org/">PyData</a>

We can also pass ``class`` as a key in the ``attrs`` parameter:

In [23]:
soup.find(attrs={'class':'pydata'})

<a class="pydata" href="https://pydata.org/">PyData</a>

### Exercises

Select the paragraph with the id ``description``.

Select all links with the class ``pydata``.
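
The search methods from this section can also be tried on a small document of our own. This is just a sketch: the HTML below and the names `sample_html` and `sample_soup` are invented for illustration and do not come from the course files.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document
sample_html = """
<div id="menu">
  <a class="internal" href="/about">About</a>
  <a class="external" href="https://example.com/">Example</a>
  <a class="internal" href="/contact">Contact</a>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match (or None), find_all() returns a list of matches
first_internal = sample_soup.find("a", class_="internal")
internal_hrefs = [a["href"] for a in sample_soup.find_all("a", class_="internal")]
missing = sample_soup.find(id="does-not-exist")

print(first_internal.text)
print(internal_hrefs)
print(missing)
```

Note that `find()` returns `None` when nothing matches, so it is worth checking the result before accessing attributes such as `.text`.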

## Requests

The [requests](https://requests.readthedocs.io/en/master/) library is for making HTTP requests. In our case, we will use it to retrieve the text of a web page.

HTTP requests can also be made using the Python standard library, but `requests` has a much friendlier interface.

In [24]:
import requests

In [25]:
r = requests.get('https://pydata.cz/')

We can check the return status:

In [26]:
r.status_code

200

[Status codes](https://cs.wikipedia.org/wiki/Stavov%C3%A9_k%C3%B3dy_HTTP) are divided into 5 groups:
- 1xx - informational response
- 2xx - success
- 3xx - redirection
- 4xx - client error
- 5xx - server error
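
The grouping above follows from the first digit of the code, which can be sketched in a few lines. The helper `status_group` is hypothetical; `HTTPStatus` comes from the standard library and is used here only to print the official phrase for each code.

```python
from http import HTTPStatus


def status_group(code):
    """Return a short description of the group a status code belongs to."""
    groups = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the first digit of a three-digit code
    return groups.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_group(code))
```

When working with `requests`, calling `r.raise_for_status()` raises an exception for 4xx and 5xx responses, which is often simpler than checking the code by hand.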

The `text` attribute contains the source code of the page:

In [27]:
#print(r.text)

## BeautifulSoup, part two: CSS selectors
With the BeautifulSoup library we can [search using CSS selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) with the `select` (find all matching elements) and `select_one` (find the first matching element) methods.

We will open an HTML document with a cinema program. It was downloaded from https://dokina.tiscali.cz/program-kin, but since there might be nothing on the program today, we prefer to use the downloaded copy. Of course, if there is something on the site, you can fetch it with `requests`.

In [28]:
with open("static/kina.html", "r") as f:
    html_doc = f.read()
soup = BeautifulSoup(html_doc)

We will try to find all tags with the class `title` and list some of them (for the sake of clarity, we do not list all of them):

In [29]:
titles = soup.select('.title') 
titles[:6]

[<h4 class="title mb-0">3Bobule</h4>,
 <h4 class="title mb-0">Tiché místo: Část II</h4>,
 <h4 class="title mb-0">V síti</h4>,
 <h4 class="title mb-0">Princezna zakletá v čase</h4>,
 <h4 class="title mb-0">Bloodshot</h4>,
 <h3 class="title mb-0">
 <a data-ga-action="cinema-detail" data-ga-category="program-kin" href="https://dokina.tiscali.cz/evropsky-dum-7435" title="Profil kina">Evropský dům</a>
 <a class="favourite-toggler" data-ga-action="toggle-favorite-cinema" data-ga-category="program-kin" data-id="7435" href="https://dokina.tiscali.cz/program-kin?movie_id=&amp;place_id=&amp;cinema_type=&amp;town_id=&amp;start=25-03-2020#" title="Přidat do oblíbených"></a>
 </h3>]

We will find all tags of the form `<h3 class="title mb-0">` and look at the first one:

In [30]:
mb0_titles = soup.select('h3.title.mb-0')
mb0_titles[0]

<h3 class="title mb-0">
<a data-ga-action="cinema-detail" data-ga-category="program-kin" href="https://dokina.tiscali.cz/evropsky-dum-7435" title="Profil kina">Evropský dům</a>
<a class="favourite-toggler" data-ga-action="toggle-favorite-cinema" data-ga-category="program-kin" data-id="7435" href="https://dokina.tiscali.cz/program-kin?movie_id=&amp;place_id=&amp;cinema_type=&amp;town_id=&amp;start=25-03-2020#" title="Přidat do oblíbených"></a>
</h3>

We will find all the tags with the class `title` inside tags with the class `movie-item`:

In [31]:
soup.select('.movie-item .title ') 

[<h4 class="title mb-0">3Bobule</h4>,
 <h4 class="title mb-0">Tiché místo: Část II</h4>,
 <h4 class="title mb-0">V síti</h4>,
 <h4 class="title mb-0">Princezna zakletá v čase</h4>,
 <h4 class="title mb-0">Bloodshot</h4>,
 <h4 class="title mb-0">A dýchejte klidně</h4>,
 <h4 class="title mb-0">Sviňa</h4>,
 <h4 class="title mb-0">Králíček Jojo</h4>,
 <h4 class="title mb-0">La belle saison</h4>,
 <h4 class="title mb-0">Výjimeční</h4>,
 <h4 class="title mb-0">La Llorona</h4>,
 <h4 class="title mb-0">Morgiana</h4>,
 <h4 class="title mb-0">Noční ostraha</h4>,
 <h4 class="title mb-0">Vita and Virginia</h4>,
 <h4 class="title mb-0">Volej mámu</h4>,
 <h4 class="title mb-0">Čas beznaděje</h4>,
 <h4 class="title mb-0">Šarlatán</h4>,
 <h4 class="title mb-0">Žaluji!</h4>]
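
The three selectors used above can be compared side by side on a small document. This is only a sketch under stated assumptions: the markup below is a made-up, heavily simplified imitation of the cinema program page, and the names `sample_html` and `sample_soup` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup loosely modelled on the cinema program page
sample_html = """
<h3 class="title mb-0"><a href="/cinema-1">Cinema One</a></h3>
<div class="movie-item"><h4 class="title mb-0">First Film</h4></div>
<div class="movie-item"><h4 class="title mb-0">Second Film</h4></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Any element with the class "title"
all_titles = [tag.get_text(strip=True) for tag in sample_soup.select(".title")]
# Only <h3> elements that have both classes "title" and "mb-0"
cinema_titles = [tag.get_text(strip=True) for tag in sample_soup.select("h3.title.mb-0")]
# Elements with the class "title" nested anywhere inside a "movie-item" element
movie_titles = [tag.get_text(strip=True) for tag in sample_soup.select(".movie-item .title")]
# select_one() returns just the first match
first_link = sample_soup.select_one("h3.title a")["href"]

print(all_titles, cinema_titles, movie_titles, first_link)
```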

## Tools in the browser
Before we go through the page using Python, it's a good idea to look at its structure directly in the browser.

### The source code of the page
<br />

<div>
<img src="static/FF_pagesource.png" />
<img src="static/chrome_pagesource.png" /> 
</div>

### Explore Element / Inspect
<br />

<div>
<img src="static/FF_inspect.png" />
<br />
<img src="static/Inspect.png" /> 
</div>
<br />
<strong>Attention:</strong> The source code of the selected element may not match the code you download using Python. It may have been modified by JavaScript while in the browser.

### Developer Tools
<br />

<div>
<img src="static/FF_dev.png" />
<br />
<img src="static/chrome_dev.png" /> 
</div>


**Note:** If you do not see anything in the Network tab, try refreshing the page.

## A real-life example

**Extracting data on the 20 best films according to ČSFD**

We want to create a table with the following columns:
- Movie title
- Average rating
- Country of origin
- Year of release
- Length of the film
- Director

*There is no universal guide to scraping. Usually we proceed by trial and error to see what works. Then the site gets updated and we can start again...*

In [32]:
url = 'https://www.csfd.cz/zebricky/nejlepsi-filmy/'
# the page rejects GET requests without a User-Agent header:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

### Task:
Look at the source code of the page. Which elements will we look for on the page? And where can we find information about the director?

Let's see what the elements with the class `film` look like:

In [33]:
soup.select('.film')[0]

<td class="film" id="chart-2294"><a href="/film/2294-vykoupeni-z-veznice-shawshank/">Vykoupení z věznice Shawshank</a> <span class="film-year" dir="ltr">(1994)</span></td>

To get more information about the movies, we need to get to the movie page. To do this we will need links:

In [34]:
film_links = soup.select('.film a')
film_links[0]

<a href="/film/2294-vykoupeni-z-veznice-shawshank/">Vykoupení z věznice Shawshank</a>

In [35]:
links = ['https://www.csfd.cz' + film['href'] for film in film_links]
links[0]

'https://www.csfd.cz/film/2294-vykoupeni-z-veznice-shawshank/'
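
As an alternative to concatenating strings, the standard library function `urljoin` can resolve a relative link against the address of the page it was found on. This is a sketch of a swapped-in technique, not what the notebook itself does; the variable names are made up.

```python
from urllib.parse import urljoin

page_url = 'https://www.csfd.cz/zebricky/nejlepsi-filmy/'
relative_href = '/film/2294-vykoupeni-z-veznice-shawshank/'

# A href starting with "/" is resolved against the root of the site
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
```

The advantage is that `urljoin` also copes with links that are already absolute or that are relative to the current directory.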

We'll save the movie titles straight away:

In [36]:
titles = [film.text for film in film_links]
titles[0]

'Vykoupení z věznice Shawshank'

The average rating can be found in the elements with the class `average`:

In [37]:
soup.select('.average')[0]

<td class="average">95,3%</td>

In [38]:
ratings = [aver.text for aver in soup.select('.average')]
ratings[0]

'95,3%'

### Movie Info
First we will try to find information about the first film:

In [39]:
r = requests.get(links[0], headers=headers)
film_soup = BeautifulSoup(r.text, 'html.parser')

In [40]:
film_soup.select_one('.origin').text

'USA, 1994, 142 min'

In [41]:
film_soup.find('span', {'itemprop': 'director'}) 

<span data-truncate="60" itemprop="director">
<a href="/tvurce/2869-frank-darabont/">Frank Darabont</a>
</span>

In [42]:
film_soup.find('span', {'itemprop': 'director'}).a.text

'Frank Darabont'

Hurrah! It works. We can do it with all the movies.
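
Before looping over all the movies, it may be worth guarding against pages where an element is missing, because `select_one` returns `None` when nothing matches. The helper `text_or_none` below is a hypothetical sketch, and the page fragment is made up.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, selector):
    """Return the stripped text of the first element matching selector, or None."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


# Hypothetical page fragment with an origin line but no director
sample_soup = BeautifulSoup('<p class="origin">USA, 1994, 142 min</p>', 'html.parser')

print(text_or_none(sample_soup, '.origin'))
print(text_or_none(sample_soup, 'span[itemprop="director"] a'))
```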

In [43]:
from time import sleep
import re 

Footnote: `re` is a standard library module for working with strings using regular expressions. In the following code, we will use it to split a string on two different separators.

We need to handle cases such as `166 min (Director's cut: 175 min, Alternative: 152 min)`, so we split the text on the characters "," and "(".
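
The split can be tried on its own first. In this sketch the first sample is the origin string shown earlier, while the second one is a made-up combination of that string with the director's cut case; the function name `split_origin` is hypothetical.

```python
import re


def split_origin(txt):
    """Split an origin string on ', ' or ' (' into country, year, length and the rest."""
    origin, year, length, *rest = re.split(r', | \(', txt)
    return origin, year, length, rest


samples = [
    'USA, 1994, 142 min',
    "USA, 1994, 166 min (Director's cut: 175 min, Alternative: 152 min)",
]

for txt in samples:
    print(split_origin(txt))
```

The starred name `*rest` collects whatever remains after the first three parts, so the same unpacking works whether or not the parenthesised note is present.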

In [44]:
countries = []
years = []
lengths = []
directors = []
for link in links:
    r = requests.get(link, headers=headers)
    film_soup = BeautifulSoup(r.text, 'html.parser')
    txt = film_soup.select_one('.origin').text
    # split on ", " or " (" and ignore everything after the length
    origin, year, length, *rest = re.split(r', | \(', txt)
    director = film_soup.find('span', {'itemprop': 'director'}).a.text
    countries.append(origin)
    years.append(year)
    lengths.append(length)
    directors.append(director)
    sleep(1)  # we behave considerately towards the server :-)

In [45]:
films_df = pd.DataFrame(zip(titles, countries, years, directors, lengths, ratings),
                        columns=['Název', 'Původ', 'Rok', 'Režie', 'Délka', 'ČSFD'])
films_df.head()

Unnamed: 0,Název,Původ,Rok,Režie,Délka,ČSFD
0,Vykoupení z věznice Shawshank,USA,1994,Frank Darabont,142 min,"95,3%"
1,Forrest Gump,USA,1994,Robert Zemeckis,142 min,"94,5%"
2,Zelená míle,USA,1999,Frank Darabont,188 min,"92,8%"
3,Přelet nad kukaččím hnízdem,USA,1975,Miloš Forman,133 min,"92,5%"
4,Sedm,USA,1995,David Fincher,127 min,"92,4%"
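
All the scraped values are still strings. If we wanted to sort or average them, we would need to convert them to numbers first. A minimal sketch, assuming the formats seen in the table above (a decimal comma followed by `%`, and a number followed by `min`); the helper names are hypothetical:

```python
def parse_rating(text):
    """Convert a rating such as '95,3%' to a float."""
    return float(text.rstrip('%').replace(',', '.'))


def parse_minutes(text):
    """Convert a length such as '142 min' to an int."""
    return int(text.split()[0])


print(parse_rating('95,3%'), parse_minutes('142 min'))
```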


## How to JavaScript

If a website uses JavaScript to generate content, we have two options:
1. Understand what the JavaScript does and adapt accordingly
2. Act like a browser

### Example
Race results of Big's Backyard Ultra, [`https://my.raceresult.com/139372/#0_2C3B48`](https://my.raceresult.com/139372/#0_2C3B48)

<div>
  <img src="static/bigdog.png" >
</div>

In [46]:
r = requests.get('https://my.raceresult.com/139372/#0_2C3B48')

We will try to find the word Gavin in the text of the page.

In [47]:
r.text.find("Gavin")

-1

Nothing. What went wrong? We'll print the source code of the page... and the table is nowhere to be found!

In [48]:
#print(r.text)

### Option 1: Developer tools - Network

If we need data from only one website, we can use the developer tools to find out what the JavaScript on the page does. We are interested in where the site gets its data from.

Open the developer tools and click on the Network tab. If it is empty, refresh the page. Typically, we are interested in requests of the `XMLHttpRequest` (XHR) type, which allow a site to retrieve data from a URL.

<div>
  <img src="static/bigdog2.png">
<br />
</div>

In our case, we see two requests of the XHR type. We take a closer look to find out which of them contains the data we are looking for, and copy its address.

<div>
  <img src="static/bigdog4.png">
<br />
</div>

<div>
  <img src="static/bigdog5.png">
<br />
</div>

In [49]:
r = requests.get('https://my2.raceresult.com/RRPublish/data/list.php?eventid=139372&key=3a52cf488ad3dfe5c994f6b203e7c2e8&listname=Result+Lists%7CLap+Details&page=results&contest=0&r=all&l=0')

In [50]:
#r.text

In [51]:
#r.json()['data']

In [52]:
r.json()['data']['#1_1///Gavin Woody///40Laps']

[['1', '1', '7:35:18', '55:19', '04:40'],
 ['1', '2', '8:35:23', '55:24', '04:35'],
 ['1', '3', '9:34:25', '54:26', '05:33'],
 ['1', '4', '10:35:09', '55:10', '04:49'],
 ['1', '5', '11:33:55', '53:56', '06:03'],
 ['1', '6', '12:34:23', '54:24', '05:35'],
 ['1', '7', '13:33:35', '53:36', '06:23'],
 ['1', '8', '14:33:57', '53:58', '06:01'],
 ['1', '9', '15:34:07', '54:08', '05:51'],
 ['1', '10', '16:34:39', '54:40', '05:19'],
 ['1', '11', '17:35:17', '55:18', '04:41'],
 ['1', '12', '18:34:22', '54:23', '05:36'],
 ['1', '13', '19:25:29', '45:30', '14:29'],
 ['1', '14', '20:28:55', '48:56', '11:03'],
 ['1', '15', '21:30:22', '50:23', '09:36'],
 ['1', '16', '22:27:05', '47:06', '12:53'],
 ['1', '17', '23:28:05', '48:06', '11:53'],
 ['1', '18', '24:28:35', '48:36', '11:23'],
 ['1', '19', '25:27:17', '47:18', '12:41'],
 ['1', '20', '26:28:12', '48:13', '11:46'],
 ['1', '21', '27:28:15', '48:16', '11:43'],
 ['1', '22', '28:26:35', '46:36', '13:23'],
 ['1', '23', '29:26:29', '46:30', '13:29'],
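
Once we have the JSON, turning it into something easier to work with is ordinary Python. The sketch below uses three rows copied from the output above instead of a live request. The meaning of the columns is an assumption based on the table header shown on the page (lap number, measurement, lap split, rest time); the first column and the structure of the key are guesses as well.

```python
# Three rows copied from the output above, so the sketch runs without network access
sample_data = {
    '#1_1///Gavin Woody///40Laps': [
        ['1', '1', '7:35:18', '55:19', '04:40'],
        ['1', '2', '8:35:23', '55:24', '04:35'],
        ['1', '3', '9:34:25', '54:26', '05:33'],
    ]
}


def to_seconds(mm_ss):
    """Convert a 'MM:SS' string to seconds."""
    minutes, seconds = mm_ss.split(':')
    return int(minutes) * 60 + int(seconds)


laps = []
for key, rows in sample_data.items():
    # Assumption: the key has the form '<prefix>///<runner>///<summary>'
    _, runner, summary = key.split('///')
    for _, lap, measurement, lap_split, rest_time in rows:
        laps.append({'runner': runner, 'lap': int(lap), 'lap_split_s': to_seconds(lap_split)})

average_split = sum(lap['lap_split_s'] for lap in laps) / len(laps)
print(laps[0], average_split)
```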


### Option 2: Headless browsers

**Selenium** is a library designed for automated testing of web applications. It lets you launch and control a browser, so you can do virtually anything you would do on the web by hand, for example fill out forms automatically or download data.

In [53]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options  
#from selenium.webdriver.chrome.options import Options  
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

In [54]:
options = Options() 
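# run the browser without a visible window; newer Selenium 4 releases no longer support `options.headless = True`
options.add_argument('--headless')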

If you don't have chromedriver or geckodriver (for Firefox), you can download them here:
- [https://sites.google.com/a/chromium.org/chromedriver/home](https://sites.google.com/a/chromium.org/chromedriver/home)
- [https://github.com/mozilla/geckodriver/releases](https://github.com/mozilla/geckodriver/releases)

In [55]:
driver = webdriver.Firefox(options=options)
#driver = webdriver.Chrome(options=options)
driver.get('https://my3.raceresult.com/139372/#0_2C3B48')

<div>
  <img src="static/bigdog3.png">
<br />
</div>

We retrieve the table with the class `MainTable`:

In [56]:
try:
    # wait up to 5 s for the table to load
    table = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'MainTable')))
except TimeoutException:
    print("Time out!")

In [57]:
table

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="a900b51a-9168-024d-ae19-5291d48f166d", element="8744ffeb-6ae3-0b43-983b-24ebdf28152c")>

In [58]:
txt = table.text

In [59]:
print(txt[:1000])

  #
Measurement
Lap Split
Rest Time
  1
Gavin Woody
40Laps
1 7:35:18 55:19 04:40  
2 8:35:23 55:24 04:35  
3 9:34:25 54:26 05:33  
4 10:35:09 55:10 04:49  
5 11:33:55 53:56 06:03  
6 12:34:23 54:24 05:35  
7 13:33:35 53:36 06:23  
8 14:33:57 53:58 06:01  
9 15:34:07 54:08 05:51  
10 16:34:39 54:40 05:19  
11 17:35:17 55:18 04:41  
12 18:34:22 54:23 05:36  
13 19:25:29 45:30 14:29  
14 20:28:55 48:56 11:03  
15 21:30:22 50:23 09:36  
16 22:27:05 47:06 12:53  
17 23:28:05 48:06 11:53  
18 24:28:35 48:36 11:23  
19 25:27:17 47:18 12:41  
20 26:28:12 48:13 11:46  
21 27:28:15 48:16 11:43  
22 28:26:35 46:36 13:23  
23 29:26:29 46:30 13:29  
24 30:27:35 47:36 12:23  
25 31:34:26 54:27 05:32  
26 32:32:42 52:43 07:16  
27 33:32:27 52:28 07:31  
28 34:34:26 54:27 05:32  
29 35:33:34 53:35 06:24  
30 36:33:37 53:38 06:21  
31 37:33:08 53:09 06:50  
32 38:34:02 54:03 05:56  
33 39:33:41 53:42 06:17  
34 40:33:59 54:00 05:59  
35 41:33:41 53:42 06:17  
36 42:36:38 56:39 03:20  
37 43:34:18 54:19
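
The text we get from Selenium is a single string, so we still have to split it into rows ourselves. A minimal sketch, using a few lines copied from the output above (including the truncated last line) rather than a running browser; the rule for recognising a lap line is an assumption:

```python
# A few lines copied from the output above, including the truncated last line
sample_txt = """Gavin Woody
40Laps
1 7:35:18 55:19 04:40
2 8:35:23 55:24 04:35
37 43:34:18 54:19"""

rows = []
for line in sample_txt.splitlines():
    parts = line.split()
    # Keep only complete lap lines: a lap number followed by three time values
    if len(parts) == 4 and parts[0].isdigit():
        rows.append(parts)

print(rows)
```

Incomplete lines, such as the last one cut off by the `txt[:1000]` slice, are simply skipped.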

In [60]:
driver.close()
driver.quit()


## In conclusion
We learned how to get data even when there is no API and there are no CSV files. But we must not forget to comply with laws and ethics. Always treat web scraping as a last resort: the slightest change on the page is enough for your procedure to stop working.