# Webdata for Dummies (oDCM)


--- 

## Learning Objectives

Students will be able to: 
* Define what web scraping is and the issues surrounding it
* Develop strategies for identifying relevant structures in semi-structured data using browser console tools
* Utilize Python-based libraries to make requests and parse web data
* Navigate and access structured web data like HTML, XML, and JSON
* Retrieve data from platforms' application programming interfaces (APIs)


--- 

## Acknowledgements
This course draws on online resources built by Brian Keegan, Colt Steele, David Amos, Hannah Cushman Garland, Kimberly Fessel, and Thomas Laetsch. 


--- 

## Contact
For technical issues, be as specific as possible (e.g., include screenshots, your notebook, and error messages) so that we can help you better.

**WhatsApp**  
+31 13 466 8938

**Email**  
odcm@uvt.nl

---

## 1. Why Learn Web Scraping?

Say that you want to store or analyze data from a website. You could copy and paste the data from each page by hand, but that approach has clear limitations. What if the data on the page gets updated? What if there are simply too many pages to process manually? Web scraping overcomes these issues by programmatically grabbing data from the web. Before we can extract elements from a website, we need to understand how a page is built up.

--- 

## 2. HTML basics

### 2.1 Structure 
A web page consists of specifically formatted text files which serve various functions:

- `.html` (HyperText Markup Language) files give structure to a page (e.g., menu navigation, text, tables)
- `.css` (Cascading Style Sheet) files determine how the page looks (e.g., colors and fonts)
- `.js` (JavaScript) files add interactivity (e.g., page animations)

Most HTML elements are represented by a pair of tags -- an opening tag and a closing tag. A table, for example, starts with `<table>` and ends with `</table>`. The first tag tells the browser: "Hey! I got a table here! Render it as a table." The closing tag (note the forward slash!) tells the browser: "Hey! I'm all done with that table, thanks." Inside the table are nested more HTML tags representing rows (`<tr>`) and cells (`<td>`). 

```html
<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>
```

This is what the rendered HTML table looks like:

<html>
    <table id="example-table" class="striped-table" style="width: 95%">
        <tr> <!-- Header -->
            <td>Column A</td>
            <td>Column B</td>
        </tr>
        <tr> <!-- Row 1 --->
            <td>Row 1, Column A</td>
            <td>Row 1, Column B</td>
        </tr>
        <tr> <!-- Row 2 --->
            <td>Row 2, Column A</td>
            <td>Row 2, Column B</td>
        </tr>
    </table>
</html>


HTML elements can have any number of attributes, such as IDs, which uniquely identify elements --

```html
<table id="example-table">
```

-- classes, which identify a type of element (contrary to ids, a class may be used more than once) --

```html
<table class="striped-table">
```

-- and styles, which define how specific elements appear (e.g. the width of the table) --

```html
<table style="width:95%;">
```

As you may have already noticed, we use spaces (or tabs) to indent nested elements, which provides structure and improves readability. For example, the `<table>` tag is placed farther to the right than the `<html>` tag, which indicates that the table is nested within the HTML block.
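
To see this nesting from Python's point of view, here is a minimal sketch that walks through a shortened version of the example table and records how deep each opening tag sits. It uses the standard library's `html.parser` module rather than a scraping library; the class name `TagPrinter` and the shortened HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Record every opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # list of (depth, tag name, attributes)

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# shortened, hypothetical version of the example table
html = (
    '<html><table id="example-table" class="striped-table">'
    '<tr><td>Column A</td><td>Column B</td></tr>'
    '</table></html>'
)

printer = TagPrinter()
printer.feed(html)

# indent each tag according to how deeply it is nested
for depth, tag, attributes in printer.tags:
    print('    ' * depth + f'<{tag}> {attributes}')
```

Each extra level of indentation in the output corresponds to one level of nesting in the HTML.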

This may be a lot to take in if you're completely new to HTML, but don't worry: the goal of this section is not to teach you how to write HTML from scratch, but rather to show you what HTML is and why it is relevant for web scraping.

#### Exercise 1
Double click on the rendered table above to edit the HTML structure, finish the exercises below, and run the cell again (`Shift + Enter`) to see the result of your changes.

1. Add another row in the table above so that it becomes a 2 (columns) x 4 (rows) table. That is, 3 regular rows and 1 table header row.
2. Fill the cells with the corresponding text labels (e.g., Row 3, Column A). 
3. Change the table width to `50%` so that the table becomes narrower.

--- 
### 2.2 Inspecting HTML in your browser

You can look at the HTML that makes up any web page by _inspecting the source_ in a web browser. We assume you're using Chrome; if you're using Safari or Firefox, the workflow is slightly different.

#### Inspect element

You can inspect specific elements on the page by right-clicking on the page and selecting "Inspect" or "Inspect Element" from the context menu that pops up. Hover over elements in the "Elements" tab to highlight them on the page. This can be helpful when you're trying to figure out how to uniquely identify the element you want to scrape.

In this exercise we look at the HTML structure of a fictitious [online bookstore](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html). Each of the 1000 books has its own page which shows the title, stock level, star rating, product description, and a table with other product information. Note that the prices and ratings are randomly generated and therefore the figures on your screen may deviate from the ones below.

<img src="images/inspect.png" width=80%/>

In the screenshot above, I selected the book title ("A Light in the Attic") with a right-click and chose "Inspect". The same text is highlighted in blue in the HTML code below it. The `<h1>` and `</h1>` tags surrounding the title indicate that this text is a heading on the web page. Move your pointer down to the line below (`<p class="price_color">£51.77</p>`) and you'll see that the top pane now highlights the price (rather than the title) of the book. This way you can easily investigate any web page.

As we discussed earlier, tags can be nested within other tags. This also becomes clear from the screenshot below in which the small gray triangles (▶) indicate that there is code hidden within these blocks. Click on them to expand the code and see what's inside and click again to collapse them. 

<img src="images/html_structure.png" width=80%/>

#### Exercise 2
1. Use the inspect tool to find the HTML element that constitutes the table header "**Number of reviews**" at the bottom of the [page](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html).
2. Look up how many elements on the page are associated with the class `sub-header` (within the Inspector screen use `Ctrl+F` on a PC or `⌘+F` on Mac to search)
3. You can make local (only on your computer) changes to the web page by double clicking in the inspector and swapping the code for something else (yes, you can overwrite what's already written there!). Change the price of the book to £39.95 and assign it a 5 star-rating. What happens once you refresh the page?

<img src="images/exercise_inspector.png" width=40% align="left"  style="border: 1px solid black"/>




### 2.3 Requesting HTML
Rather than using the Inspector to look up the source code, we can retrieve it with Python's `requests` library. As this library is not loaded by default, we first need to import it. The full source code contains over 9,000 characters, so we only print the part that holds the product description.

In [8]:
import requests

# make a get request to the "A Light in the Attic" webpage
url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
book_request = requests.get(url)

# return the source code from the request object
book_source_code = book_request.text

# print out part of the source code
print(book_source_code[5710:6860])

        <div id="product_description" class="sub-header">
            <h2>Product Description</h2>
        </div>
        <p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to r

Now it's your turn! Use slicing in the code below so that it prints only the title of the book, *A Light in the Attic*, without any of the surrounding code.

In [None]:
print(book_source_code[... : ...])
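
Slice positions such as `5710` and `6860` have to be found by trial and error. As a rough sketch of an alternative, `str.find()` can locate a piece of text so that the slice boundaries follow from the content itself. The short `snippet` string below is a made-up stand-in for `book_source_code`, so the example runs without a network connection.

```python
# hypothetical stand-in for book_source_code
snippet = (
    '<div id="product_description" class="sub-header">'
    '<h2>Product Description</h2></div>'
    '<p>It\'s hard to imagine a world without A Light in the Attic.</p>'
)

# str.find() returns the index of the first match (or -1 if there is none)
start = snippet.find('<p>') + len('<p>')
end = snippet.find('</p>')

# slice out everything between the opening and closing <p> tags
print(snippet[start:end])
```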

As you probably noticed, extracting the right elements from the page this way is quite a hassle. Fortunately, there is a much better way using XPath and CSS selectors, which we'll discuss in week 3. For now, we'll give you a quick preview of what's coming up next.



### 2.4 Introduction to BeautifulSoup 
BeautifulSoup is an open-source Python library that lets you navigate through and extract data from HTML files. It does NOT gather information from the web; for that we use `requests`, as we did above. So first you send a request to a page to retrieve the HTML, and then you pass it to BeautifulSoup to extract the information.

In the code snippet below, we import the package and turn `book_source_code` (the HTML code from the "A Light in the Attic" web page we used earlier) into a BeautifulSoup object. Once parsed, we can easily navigate the code by tag name. Since we know that the title is surrounded by `<h1>` tags (see the Chrome Inspector screenshot above), we use `soup.find('h1')` to look up the title of the book and print it. Do you see how much easier that is?!

In [33]:
from bs4 import BeautifulSoup

# parse the HTML; naming the parser explicitly avoids a BeautifulSoup warning
soup = BeautifulSoup(book_source_code, 'html.parser')
print(soup.find('h1'))

<h1>A Light in the Attic</h1>


The `.find()` method always returns only the first matching element that it finds. For example, the web page has two `<h2>` elements, which contain the "Product Description" and "Product Information" subheaders. Only the first one will be returned by `.find()`:

In [34]:
print(soup.find('h2'))

<h2>Product Description</h2>


To capture all matching `<h2>` elements, use the `.find_all()` method like this:

In [35]:
print(soup.find_all('h2'))

[<h2>Product Description</h2>, <h2>Product Information</h2>]


Note that it now returns a list of elements (`[element1, element2]`), so to access individual elements you need to apply indexing: 

In [68]:
# obtain first h2 element 
print(soup.find_all('h2')[0])

# obtain second h2 element
print(soup.find_all('h2')[1])

<h2>Product Description</h2>
<h2>Product Information</h2>


Both subheaders are still surrounded by `<h2>` and `</h2>` tags. To get rid of them, append `.get_text()` to your code: 

In [94]:
# sub header without h2 tags
print(soup.find_all('h2')[0].get_text())

Product Description
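
Tag names are not the only handle you can search on. The `id` and `class` attributes from section 2.1 work as well. The sketch below runs on a small hand-written HTML string (not the live bookstore page), and the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

# small hand-written example, loosely based on the table from section 2.1
html = """
<table id="example-table" class="striped-table" style="width: 95%">
    <tr><td class="cell">Row 1, Column A</td><td class="cell">Row 1, Column B</td></tr>
</table>
"""

example_soup = BeautifulSoup(html, 'html.parser')

# look up a single element by its (unique) id
table = example_soup.find(id='example-table')

# attributes can be read like dictionary keys
print(table['style'])

# look up all elements that share a class (note the trailing underscore,
# because `class` is a reserved word in Python)
cells = example_soup.find_all(class_='cell')
print([cell.get_text() for cell in cells])
```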


#### Exercise 4

1. Collect the HTML table that contains product information (UPC, type, price, tax, etc.) of the "A Light in the Attic" book using BeautifulSoup. The output should look like this (the stray "Â" is an encoding artifact of the pound sign, £):

```html
<table class="table table-striped">
    <tr>
        <th>UPC</th><td>a897fe39b1053632</td>
    </tr>
    <tr>
        <th>Product Type</th><td>Books</td>
    </tr>
    <tr>
        <th>Price (excl. tax)</th><td>Â£51.77</td>
    </tr>
    <tr>
        <th>Price (incl. tax)</th><td>Â£51.77</td>
    </tr>
    <tr>
        <th>Tax</th><td>Â£0.00</td>
    </tr>
    <tr>
        <th>Availability</th>
        <td>In stock (22 available)</td>
    </tr>
    <tr>
        <th>Number of reviews</th>
        <td>0</td>
    </tr>
</table>
```


2. Obtain the tax amount from the table. Tip: you can chain `.find()` and `.find_all()` calls; for example, `.find('body').find_all('h1')` returns all `<h1>` elements in the `body`. Your code should return the following output (i.e., no tax):

```html
Â£0.00 
```

--- 

## 3. Application Programming Interface

* Note: consider covering this part first, and web scraping only afterwards(!)

### Contents 
* What happens when you type a URL in the URL bar?   
    * Your computer is called the CLIENT
    * The computer on the other end is called the SERVER
* Describe the request/response cycle
    1. DNS lookup (like a phonebook for the internet)
        * Turn domain names (google.com) into an IP-address (172.217.9.142)
    2. Computer makes a REQUEST to a server
        * The URL you're looking for
        * The type of data you want (HTML, JSON, XML)
    3. Server processes the REQUEST
    4. Server issues a RESPONSE
* Explain what a request or response header is, and give examples
    * Headers contain metadata that provide additional information about the request or response
        * User-agent (information about the software used to make the request)
            
* Explain the different categories of response codes
    * 2xx = Success
    * 3xx = Redirect
    * 4xx = Client error (your fault!) - e.g. 404
    * 5xx = Server error
* Compare GET and POST requests
    * GET = useful for retrieving data (e.g., visit Reddit.com)
        * You're not sending any information that gets stored on the server
    * POST = useful for writing data (e.g., sign up on Facebook)
        * You're sending data to the server
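
The difference between the two request types can be made visible without contacting any server. The sketch below uses `requests.Request(...).prepare()` to build (but not send) a GET and a POST request; the URL `https://www.example.com/signup` and the form fields are placeholders, not a real endpoint.

```python
import requests

# placeholder URL and data, for illustration only
url = 'https://www.example.com/signup'
form_data = {'username': 'jane', 'newsletter': 'yes'}

# build the requests without sending them
get_request = requests.Request('GET', url).prepare()
post_request = requests.Request('POST', url, data=form_data).prepare()

# a GET request carries no body: it only asks for a resource
print(get_request.method, get_request.url, get_request.body)

# a POST request carries the submitted data in its body
print(post_request.method, post_request.url, post_request.body)
```

Sending these for real would be `requests.get(url)` and `requests.post(url, data=form_data)`.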

* Application Programming Interface (API)
    * Version of a website intended for computers to talk with one another
    * Allows you to get data from another application without needing to understand how the application works (without the entire interface)
    * Examples of companies with APIs: GitHub, Spotify, Google




* We use `requests` both for web scraping/crawling and for grabbing data from APIs

In [151]:
# requests.get() sends a GET request (the URL must include the scheme, e.g., https://)
url = "https://www.google.com"
response = requests.get(url)

print(f"your request to {url} came back w/ status code {response.status_code}")

your request to https://www.google.com came back w/ status code 200
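
The first digit of the status code tells you which category the response falls into. A small helper (the function name `describe_status` is made up for this sketch) could translate a code into the categories listed earlier:

```python
def describe_status(status_code):
    """Map an HTTP status code to its broad category."""
    categories = {
        2: 'Success',
        3: 'Redirect',
        4: 'Client error',
        5: 'Server error',
    }
    # integer division by 100 keeps only the first digit (e.g., 404 -> 4)
    return categories.get(status_code // 100, 'Other')


for code in [200, 301, 404, 503]:
    print(code, describe_status(code))
```

With a real response you would call `describe_status(response.status_code)`.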


In [153]:
response.text

'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="nl"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="zoA3IuPkDyW4o5gfD1dTnQ==">(function(){window.google={kEI:\'foPHX7PSCoTjkgXO6bmQAQ\',kEXPI:\'0,18167,183995,1157247,730,224,5105,206,3204,10,1144,82,364,896,603,817,383,246,5,304,1050,222,426,3451,315,3,66,768,1,216,282,981,3859,7,1887,108,128,743,1114303,1197776,506,7,302673,26305,13677,4855,32691,16115,28684,9188,8384,4859,1361,9290,3024,2849,1895,12841,4998,7931,5297,2054,920,873,4192,6430,14527,4518,2777,919,2277,9,2795,1593,1279,2212,530,149,1103,840,517,1466,56,4258,108,204,1137,2,2063,606,2025,1775,520,1704,243,2229,93,328,1284,2943,2246,3600,3227,2846,6,4774,825,6755,4455,641,7876,3748,1181,108,2855,552,908,2,941,2614,2397,1027,10,349,6084,3275,3,576,970,865,4625,148,5990,6324,1661,4,498,950,80,2304

* An API is easier to work with than requesting and parsing a web page
* https://icanhazdadjoke.com

In [154]:
import requests

response = requests.get(
    "http://www.example.com", 
    headers = {
        "header1": "value1", 
        "header2": "value2"
    }
)

In [157]:
url = "https://icanhazdadjoke.com"

response = requests.get(url, headers={"Accept": "text/plain"})
response.text

# note: not every site returns plain text when asked for it via the Accept header

'Mahatma Gandhi, as you know, walked barefoot most of the time, which produced an impressive set of calluses on his feet. \r\nHe also ate very little, which made him rather frail and with his odd diet, he suffered from bad breath. \r\nThis made him a super calloused fragile mystic hexed by halitosis.'

In [160]:
response = requests.get(url, headers={"Accept": "application/json"})
data = response.json() # similar to a Python dictionary
data

{'id': 'dprjbhyAAAd',
 'joke': 'A Sandwich walks into a bar, the bartender says “Sorry, we don’t serve food here”',
 'status': 200}
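
Because `response.json()` returns a regular Python dictionary, individual fields can be picked out by key. The sketch below parses a hard-coded, shortened copy of a response like the one above with the standard `json` module, so it runs without a network connection.

```python
import json

# hard-coded, shortened copy of the API response shown above
raw = '{"id": "dprjbhyAAAd", "joke": "A Sandwich walks into a bar ...", "status": 200}'

data = json.loads(raw)  # a dictionary, just like the result of response.json()

print(data['joke'])        # access a field by its key
print(data.get('author'))  # .get() returns None if the key is missing
```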

* query string = way to pass data to the server as part of a GET request
* `http://www.example.com/?key1=value1&key2=value2`
* https://www.google.com/search?
    * q=tilburg+university (query)
    * oq=tilburg+university+ (original query)
    * aqs=chrome..69i57j46j46i175i199j0l4j46i175i199.4848j0j4
    * sourceid=chrome&ie=UTF-8
    
* https://icanhazdadjoke.com/search?term=cat
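
A query string does not have to be typed by hand. As a sketch, the standard library's `urllib.parse` can build one from a dictionary and take an existing URL apart again. The Google URL below is a shortened version of the example above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# build a query string from a dictionary
base_url = 'https://icanhazdadjoke.com/search'
query = urlencode({'term': 'cat'})
full_url = f'{base_url}?{query}'
print(full_url)

# take an existing URL apart again
google_url = 'https://www.google.com/search?q=tilburg+university&sourceid=chrome&ie=UTF-8'
parts = urlsplit(google_url)
parameters = parse_qs(parts.query)
print(parameters)
```

With `requests`, the same dictionary can be passed through the `params` argument, e.g. `requests.get(base_url, params={'term': 'cat'}, headers={'Accept': 'application/json'})`.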

* https://github.com/kimfetti/Conferences/tree/master/PyCon_2020
* https://www.youtube.com/watch?v=RUQWPJ1T6Zc&t=190s
* https://github.com/hancush/web-scraping-with-python/blob/master/session/web-scraping-with-python.ipynb#HTML-basics
* https://www.udemy.com/course/the-modern-python3-bootcamp/learn/lecture/7991196#overview
* https://campus.datacamp.com/courses/web-scraping-with-python/introduction-to-html?ex=1
* https://realpython.com/python-web-scraping-practical-introduction/
* https://github.com/CU-ITSS/Web-Data-Scraping-S2019