# Webdata for Dummies (oDCM)


--- 

## Learning Objectives

Students will be able to: 
* Define what web scraping is and the issues surrounding it
* Develop strategies for identifying relevant structures in semi-structured data using browser console tools
* Utilize Python-based libraries to make requests and parse web data
* Navigate and access structured web data like HTML, XML, and JSON
* Retrieve data from platforms' application programming interfaces (APIs)


--- 

## Acknowledgements
This course draws on online resources built by Brian Keegan, Colt Steele, David Amos, Hannah Cushman Garland, Kimberly Fessel, and Thomas Laetsch. 


--- 

## Contact
When reporting technical issues, be as specific as possible (e.g., include screenshots, your notebook, and error messages) so that we can help you better.

**WhatsApp**  
+31 13 466 8938

**Email**  
odcm@uvt.nl

---

## Why Learn Web Scraping?

Say that you want to store or analyze data from a website. You could manually copy and paste the data from each page, but that has several limitations. What if the data on the page gets updated? Or what if there are simply so many pages that you cannot do it all by hand? Web scraping helps you overcome these issues by programmatically grabbing data from the web. Before we can extract elements from a website, we need to understand how a page is built up.

--- 

## HTML basics

A web page consists of specifically formatted text files which serve various functions:

- `.html` (HyperText Markup Language) files give structure to a page (e.g., menu navigation, body text, footer)
- `.css` (Cascading Style Sheet) files determine how the page looks (e.g., colors and fonts)
- `.js` (JavaScript) files add interactivity (e.g., animations)

Most HTML elements are represented by a pair of tags -- an opening tag and a closing tag. A table, for example, starts with `<table>` and ends with `</table>`. The first tag tells the browser: "Hey! I got a table here! Render it as a table." The closing tag (note the forward slash!) tells the browser: "Hey! I'm all done with that table, thanks." Inside the table are nested more HTML tags representing rows (`<tr>`) and cells (`<td>`).


```html
<table id="example-table" class="striped-table" style="width: 95%">
    <tr> <!-- Header -->
        <td>Column A</td>
        <td>Column B</td>
        <td>Column C</td>
    </tr>
    <tr> <!-- Row 1 -->
        <td>Row 1, Column A</td>
        <td>Row 1, Column B</td>
        <td>Row 1, Column C</td>
    </tr>
    <tr> <!-- Row 2 -->
        <td>Row 2, Column A</td>
        <td>Row 2, Column B</td>
        <td>Row 2, Column C</td>
    </tr>
</table>
```
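To see how the opening and closing tags delimit rows and cells, here is a minimal sketch that reads a shortened copy of the table with Python's built-in `html.parser` module. The class name `TableReader` and the variable names are made up for this illustration; later sections use more convenient tools.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell texts per <tr>
        self.in_cell = False  # True between <td> and </td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])       # opening <tr>: start a new row
        elif tag == "td":
            self.in_cell = True        # opening <td>: start a new cell
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False       # closing </td>: the cell is done

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data  # text between <td> and </td>


# Shortened version of the example table (two rows, two columns)
table_html = """
<table id="example-table">
    <tr> <!-- Header -->
        <td>Column A</td>
        <td>Column B</td>
    </tr>
    <tr> <!-- Row 1 -->
        <td>Row 1, Column A</td>
        <td>Row 1, Column B</td>
    </tr>
</table>
"""

reader = TableReader()
reader.feed(table_html)
for row in reader.rows:
    print(row)
```

Each printed list corresponds to one `<tr>` element, and each string in it to one `<td>` element nested inside that row.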

HTML elements can have any number of attributes, such as IDs, which uniquely identify elements --

```html
<table id="example-table">
```

-- classes, which identify a type of element --

```html
<table class="striped-table">
```

-- and styles, which define how specific elements appear --

```html
<table style="width:95%;">
```

-- that will be useful to know about when we're scraping. 
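As a rough illustration of why attributes matter for scraping, the sketch below reads the `id`, `class`, and `style` attributes of the example `<table>` tag with Python's built-in `html.parser` module. The class name `TableAttributes` is hypothetical and only serves this example.

```python
from html.parser import HTMLParser


class TableAttributes(HTMLParser):
    """Remember the attributes of the first <table> tag we encounter."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "table" and not self.found:
            self.found = dict(attrs)


parser = TableAttributes()
parser.feed('<table id="example-table" class="striped-table" style="width: 95%"></table>')

print(parser.found["id"])     # example-table
print(parser.found["class"])  # striped-table
print(parser.found["style"])  # width: 95%
```

When scraping, an `id` or `class` value like these is often the easiest handle for picking out the one element you are after.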

### Inspecting HTML in your browser

You can look at the HTML that makes up any web page by _inspecting the source_ in a web browser. We assume you're using Chrome; in Safari or Firefox the workflow is slightly different.

#### Inspect element

You can inspect specific elements on the page by right-clicking on the page and selecting "Inspect" or "Inspect Element" from the context menu that pops up. Hover over elements in the "Elements" tab to highlight them on the page. This can be helpful when you're trying to figure out how to uniquely identify the element you want to scrape.

In this exercise we look at the HTML structure of a fictitious [online bookstore](https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html). Each of the 1000 books has its own page which shows the title, stock level, star rating, product description, and a table with other product information. Note that the prices and ratings are randomly generated and therefore the figures on your screen may deviate.

<img src="images/inspect.png" width=80%/>

In the screenshot above I right-clicked the book title ("A Light in the Attic") and chose "Inspect". The same text is highlighted in blue in the HTML code below it. The `<h1>` and `</h1>` tags surrounding the title indicate that the text is a heading on the web page. Move your pointer down to `<p class="price_color">£51.77</p>` and you'll see that the top of the screen now highlights the price (rather than the title) of the book.

This way you can easily explore the structure of any web page. As we discussed earlier, tags can be nested within other tags. The screenshot below shows this as well: the small gray triangles (▶) indicate that there is code hidden within these blocks. Click on them to expand the code and see what's inside.

<img src="images/html_structure.png" width=80%/>

#### Exercises 
1. Use the inspect tool to find the HTML element that constitutes the table header "**Number of reviews**" at the bottom of the page.
2. Look up how many elements on the page are associated with the class `sub-header` (within the Inspector screen use `Ctrl+F` on a PC or `⌘+F` on a Mac to search).
3. You can make local (only on your computer) changes to the web page by double-clicking in the Inspector and swapping the code for something else. Change the price of the book to £39.95 and assign it a 5-star rating.


### Requesting HTML
Rather than using the Inspector to look up the source code, we can use Python's `requests` library to download it. Because this library is not loaded by default, we first need to import it. The full source code contains over 9,000 characters, so we only print the part that holds the product description.

```python
import requests

# make a GET request to the "A Light in the Attic" webpage
url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
book_request = requests.get(url)

# return the source code from the request object
book_source_code = book_request.text

# print out part of the source code
print(book_source_code[5710:6860])
```

        <div id="product_description" class="sub-header">
            <h2>Product Description</h2>
        </div>
        <p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to r

Now it's your turn! Use slicing in the code below so that it prints only the title of the book (and nothing else)!

```python
print(book_source_code[... : ...])
```

As you probably noticed, extracting the right elements from the page this way is quite a hassle. Fortunately, there is a much better way using XPath and CSS selectors, which we'll discuss next.
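Part of the hassle is that hand-counted positions break as soon as the page changes by a single character. A small sketch, assuming a made-up snippet in place of the real `book_source_code`, shows how `str.find` can compute the slice boundaries from the tags themselves:

```python
# Hypothetical snippet standing in for the downloaded source code
sample = '<div id="product_description" class="sub-header"><h2>Product Description</h2></div>'

open_tag, close_tag = "<h2>", "</h2>"

start = sample.find(open_tag) + len(open_tag)  # index just after the opening tag
end = sample.find(close_tag, start)            # index where the closing tag begins

heading = sample[start:end]
print(heading)  # Product Description
```

This still treats the page as one long string, so it only works for the first match and knows nothing about nesting.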

--- 
## XPath & Selectors 

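XPath is a query language for pointing at elements in an HTML or XML document. Its basic building blocks are:

* A single forward slash `/` moves down one generation (from an element to its direct children)
* Double forward slashes `//` look for matching elements at any depth in the HTML code
* Tag names between the slashes indicate which element(s) to follow at each step
* Brackets `[]` after a tag name tell us which of the selected siblings to choose (counting starts at 1)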

In case the cell below throws a `ModuleNotFoundError`, you first need to install the `scrapy` package. Go to your terminal, type `conda install scrapy`, and enter `y` to proceed.

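```python
from scrapy import Selector

# download the page and wrap the raw HTML in a Selector
html = requests.get(url).content 
sel = Selector(text=html)

# select every <h1> element on the page
sel.xpath("//h1")
```

```
[<Selector xpath='//h1' data='<h1>A Light in the Attic</h1>'>]
```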
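The XPath rules listed at the start of this section can also be tried without an internet connection. The sketch below is a stand-in that uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. Paths are relative to the element you search from, so `./` and `.//` play the role of a leading `/` and `//`. The document `doc` is invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet (ElementTree needs valid XML)
doc = """
<html>
  <body>
    <div>
      <p>first</p>
      <p>second</p>
    </div>
    <p>third</p>
  </body>
</html>
"""
root = ET.fromstring(doc)  # root is the <html> element

# "/" steps one generation at a time: html -> body -> p (direct children only)
direct = [p.text for p in root.findall("./body/p")]

# "//" searches every generation below the starting point
anywhere = [p.text for p in root.findall(".//p")]

# "[2]" picks the second <p> among its siblings (XPath counts from 1)
second = [p.text for p in root.findall(".//div/p[2]")]

print(direct)    # ['third']
print(anywhere)  # ['first', 'second', 'third']
print(second)    # ['second']
```

Note how `./body/p` misses the two paragraphs inside the `<div>`, because they are grandchildren of `<body>` rather than direct children.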

--- 

## CSS Locators 

--- 

## Application Programming Interface
* Basic Reddit requests

* https://github.com/kimfetti/Conferences/tree/master/PyCon_2020
* https://www.youtube.com/watch?v=RUQWPJ1T6Zc&t=190s
* https://github.com/hancush/web-scraping-with-python/blob/master/session/web-scraping-with-python.ipynb#HTML-basics
* https://www.udemy.com/course/the-modern-python3-bootcamp/learn/lecture/7991196#overview
* https://campus.datacamp.com/courses/web-scraping-with-python/introduction-to-html?ex=1
* https://realpython.com/python-web-scraping-practical-introduction/
* https://github.com/CU-ITSS/Web-Data-Scraping-S2019