# Introduction to Web Scraping

---

## 1. Prerequisite: HTML

You should at least know what an HTML document looks like.

- HTML content is enclosed by HTML tags like `<tag> </tag>`. Most tags come in pairs: one opening tag and one closing tag.

- HTML tags can contain one or more attributes. 
    - Some attributes are functional, like `href` in `<a href="some_link"></a>` or `src` in `<img src="image_link">`.
    - Some attributes are for identification, like `class` and `id`. Web developers mainly use them as hooks for CSS and JavaScript.
    

Here is a simple example of HTML:

```html
<!DOCTYPE HTML>
<html>
<body>
    <h3>Interesting unknown facts about butterflies</h3>
    <ul>
        <li class="facts">The fastest butterflies can fly up to 30 miles per hour.</li>
        <li class="facts">There are around 28,000 species of butterflies in the world</li>
    </ul>
    <p id="name_origin"> Many believe butterflies got their name because they would fly around the buckets of milk on farms. While the milk was being churned into butter, many noticed these flying insects would appear and soon they were being called butterflies. </p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/3/3d/Fesoj_-_Papilio_machaon_%28by%29.jpg" style="width:100px">
</body>
</html>
```

When rendered by a browser, it looks like this:

> <html>
 <body>
     <h3>Interesting unknown facts about butterflies</h3>
     <ul>
         <li>The fastest butterflies can fly up to 30 miles per hour.</li>
         <li>There are around 28,000 species of butterflies in the world</li>
    </ul>
    <p id="name_origin"> Many believe butterflies got their name because they would fly around the buckets of milk on farms. While the milk was being churned into butter, many noticed these flying insects would appear and soon they were being called butterflies. </p>
    <img src="https://upload.wikimedia.org/wikipedia/commons/3/3d/Fesoj_-_Papilio_machaon_%28by%29.jpg" style="width:100px">
</body>
</html>

### 1.1. DOM tree

The Document Object Model (DOM) is the backbone of how a browser interprets an HTML document. Basically:

- Every HTML tag is a tree node.
- Nested tags are "children" of the enclosing tag. The text inside a tag is a node too.
- Tags at the same level are "siblings" of each other. They are ordered by their first appearance.
- Attributes of a tag are treated as properties of the node.


Following these principles, we can build a tree structure for a web page: the DOM tree. For example, the HTML above corresponds to this tree:

<figure style="text-align: center">

[Figure: DOM tree of the butterfly example; image not available]

</figure>

Visualizing the page as a DOM tree makes the relations between the HTML tags easy to see.
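To make the four rules above concrete, here is a minimal sketch that prints a DOM-like outline using Python's built-in `html.parser`. The class name `DomOutline`, the helper `dom_outline` and the shortened sample page are illustrative assumptions, not part of the original demo, and a real browser builds a much richer tree.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they never gain children.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DomOutline(HTMLParser):
    """Record one indented line per element node and text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting level in the tree
        self.lines = []  # outline, in document order

    def handle_starttag(self, tag, attrs):
        # Attributes become properties of the element node.
        props = ", ".join(f"{name}={value!r}" for name, value in attrs)
        label = f"{tag} [{props}]" if props else tag
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1  # nodes that follow are children of this tag

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1  # back to the parent's level

    def handle_data(self, data):
        text = data.strip()
        if text:  # the text inside a tag is a node too
            self.lines.append("  " * self.depth + f"text: {text!r}")


def dom_outline(html_text):
    """Return the outline of html_text as a list of indented lines."""
    parser = DomOutline()
    parser.feed(html_text)
    parser.close()
    return parser.lines


# Shortened stand-in for the butterfly page (hypothetical content).
SAMPLE = """<html><body>
<h3>Butterfly facts</h3>
<ul>
  <li class="facts">Fact one</li>
  <li class="facts">Fact two</li>
</ul>
<img src="butterfly.jpg">
</body></html>"""

print("\n".join(dom_outline(SAMPLE)))
```

Each extra level of indentation marks a child node, and lines with the same indentation under one parent are siblings in document order.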

### 1.2. XPath

Suppose we want to select the content inside a specific HTML tag. Traditionally, we use the XPath language to identify the position of that node in the DOM tree. In general, we can find a node by, for example:

- The node's tag name.
- The parent-child relations between nodes.
- The node's attributes.

I have also uploaded a cheatsheet for how to read/write the XPath of a tag.

Note that the XPath of a node is not unique: you can address it in at least two ways, by its absolute path or by a relative path. For example, to get the text of the sentence "There are...":

- The absolute path is `/html/body/ul/li[2]/text()`

- There are many ways to address it by a relative path. For example, starting from the `<ul>` node:
    - `li[position()=2]/text()`
    - `li[last()]/text()`
    - `li[contains(., "28,000")]/text()`
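As a rough illustration, the selections above can be tried with Python's built-in `xml.etree.ElementTree`. This is only a sketch under two assumptions: the page is trimmed to well-formed markup (an XML parser needs every tag closed), and ElementTree implements only a subset of XPath, with no `text()` or `contains()`. A full XPath engine, such as the one in a dedicated scraping library, would accept the expressions exactly as written above.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed version of the butterfly page.
PAGE = """<html><body>
<h3>Interesting unknown facts about butterflies</h3>
<ul>
  <li class="facts">The fastest butterflies can fly up to 30 miles per hour.</li>
  <li class="facts">There are around 28,000 species of butterflies in the world</li>
</ul>
<p id="name_origin">Many believe butterflies got their name because they would fly around the buckets of milk on farms.</p>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Follow the parent-child relations: html -> body -> ul -> second li.
by_path = root.find("./body/ul/li[2]").text

# Search anywhere below the root: the last li among its siblings.
by_position = root.find(".//li[last()]").text

# Select by attribute value.
facts = [li.text for li in root.findall(".//li[@class='facts']")]
origin = root.find(".//p[@id='name_origin']").text

# ElementTree has no contains(), so filter on the text in Python instead.
by_text = [li.text for li in root.iter("li") if "28,000" in li.text]

print(by_path)
print(by_position)
print(by_text)
```

All three selections reach the same `<li>` node, which is the point of the paragraph above: one node, several valid paths.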

---

## 2. Interactivity by JavaScript

A website that only contains static HTML is boring. Nowadays web developers usually add interactivity to their sites to make them more engaging. These features rely heavily on JavaScript, especially on AJAX.

**_You are not required to learn JavaScript syntax; you only need to understand the mechanism behind interactivity._**

### 2.1. Event listener

Interactivity on a site is created by an event listener, a JavaScript function run by the browser that performs these steps in order:

1. Wait (listen) for the user's input.
2. Trigger a callback function when an input is received.
3. Modify the DOM tree according to the callback function.
4. Ask the browser to re-render the page according to the modified DOM tree.

The callback functions mainly modify the DOM tree by

- Changing the attributes of a node. E.g. changing the `style` attribute changes the appearance of that element.

- Modifying the content of the tree. E.g. appending a new node to the DOM tree makes new content appear on the page.

Here is a sample script I wrote for the butterfly demo above. When the `<p>` tag with id `name_origin` is clicked, a new `<p>` node with the text 'Butterflies are amazing!' is added to the DOM tree as the next sibling of the `name_origin` paragraph.

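```javascript
<script>
// Run myFunction whenever the paragraph with id "name_origin" is clicked.
document.getElementById("name_origin").addEventListener("click", myFunction);

function myFunction() {
    // Create a new <p> node and set its text.
    var newPara = document.createElement('p');
    newPara.textContent = 'Butterflies are amazing!';
    // `this` is the clicked element; insert the new node right after it.
    this.insertAdjacentElement('afterend', newPara);
}
</script>
```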


This is a very simple example, but complicated interactive functions can be built in the same way: an "event listener" modifies the DOM tree so that new HTML content appears.
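For readers who prefer Python, the same listen, callback, modify, re-render cycle can be imitated with a toy model. This is only a sketch: `Node`, `dispatch`, `render` and `add_paragraph` are made-up names, and a real browser's DOM and event system are far more elaborate.

```python
class Node:
    """A tiny stand-in for a DOM element node (illustration only)."""

    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.parent = None
        self.children = []
        self.listeners = {}  # event name -> list of callbacks

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def add_event_listener(self, event, callback):
        self.listeners.setdefault(event, []).append(callback)

    def insert_after(self, new_node):
        """Insert new_node as the next sibling of this node."""
        siblings = self.parent.children
        new_node.parent = self.parent
        siblings.insert(siblings.index(self) + 1, new_node)

    def dispatch(self, event):
        """Simulate the browser receiving a user input on this node."""
        for callback in self.listeners.get(event, []):
            callback(self)


def render(node, depth=0):
    """Re-render the tree as indented text (stands in for the display)."""
    lines = ["  " * depth + f"<{node.tag}> {node.text}".rstrip()]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def add_paragraph(target):
    # The callback modifies the tree: a new <p> right after the clicked one.
    target.insert_after(Node("p", "Butterflies are amazing!"))


body = Node("body")
origin = body.append(Node("p", "Name origin paragraph"))
body.append(Node("img"))

origin.add_event_listener("click", add_paragraph)  # step 1: listen
before = render(body)
origin.dispatch("click")   # steps 2-3: the callback runs and edits the tree
after = render(body)       # step 4: re-render from the modified tree
print("\n".join(after))
```

Nothing changes on the "page" until the event is dispatched, which is why a crawler that never runs the script never sees the extra paragraph.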

On the other hand, **once JavaScript has appended the new HTML content to the DOM tree, we can scrape it just like static HTML.** We can run our crawler through a headless browser (i.e. a browser without a GUI) so that JavaScript can run during scraping.

### 2.2. AJAX

AJAX stands for **A**synchronous **J**avaScript **A**nd **X**ML. It is the name of the technique that lets a web page send data to and retrieve data from a server asynchronously (in the background), without disrupting the display and behaviour of the existing page. We can use AJAX to enhance the interactivity of a page, for example:

1. An event listener waits for the user's input.
2. The event listener triggers a callback function once an input is received.
3. The callback function tells the browser to perform an HTTP request to the server.
4. The server sends back a response (XML in the old days, usually JSON nowadays).
5. The callback function parses the response, transforms it into HTML text and then uses it to modify the DOM tree.
6. The browser re-renders the page according to the modified DOM tree.

Technically, only steps 3 and 4 involve AJAX itself. The advantage of AJAX is that the HTTP requests are processed in the background, without disrupting the user's experience. At the same time, the server does not need to send all the content at once; it can send it piece by piece, on request.


**The difficulty in scraping a site with AJAX is that it takes additional steps (sending HTTP requests) to make the HTML content appear.** But once the content is inserted into the DOM tree, scraping it is the same as scraping a static site.
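One possible way to take that additional step is to let the crawler send the background request itself and parse the response, which is what steps 3 to 5 above do inside the browser. The sketch below assumes a hypothetical endpoint that returns JSON shaped like `{"facts": [{"text": ...}]}`; both the shape and the function names are invented for illustration, and a real site will differ.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """Send the same kind of HTTP request the page's callback would send."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def facts_from_payload(payload_text):
    """Parse the JSON response and pull out the fields we want."""
    payload = json.loads(payload_text)
    return [item["text"] for item in payload["facts"]]


# Offline demonstration with a made-up response body (no network needed).
SAMPLE_RESPONSE = '{"facts": [{"text": "Fact one"}, {"text": "Fact two"}]}'
print(facts_from_payload(SAMPLE_RESPONSE))
```

In a real task, `fetch_json` would be called with the URL the page itself requests, and its result passed to `facts_from_payload`.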

---

## 3. Workflow

The steps of scraping a website in person or with an automatic crawler are basically the same.

|Steps|In person|By automatic crawler|
|:---:|:---:|:---:|
|1|Type the URL in the browser <br> then press "Enter"| Load the URL into the <br> crawler|
|2|The browser sends an HTTP <br> request to the site's server | The crawler sends an HTTP <br> request to the site's server |
|3| The site's server sends back an HTTP <br> response containing the HTML text| The site's server sends back an HTTP <br> response containing the HTML text|
|4| The browser renders the HTML text <br> into a human-readable website | - |
|5| The user identifies the data <br> they are interested in | The crawler identifies the HTML elements <br> following the user's instructions |
|6| Copy & paste the data | Copy & paste the data |
|7| Organize into tables | Organize into tables |


To scrape a website, we can think of it as doing the copy & paste ourselves. Consider these questions when constructing a scraping task:

- How do we make the HTML content appear?
- In what order should we copy & paste?
- Do we need to be redirected to another page?
- How do we organize the extracted data?

The workflow differs with the site you are scraping and the data fields you want to scrape. You must plan your own workflow.
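As a rough sketch, the crawler column of the table might translate into Python's standard library as follows. The function names, the choice of `<li class="facts">` as the target and the CSV layout are assumptions made for this illustration; a real task would substitute its own selectors and fields.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def download(url, timeout=10):
    """Steps 1-3: send an HTTP request and return the HTML text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class FactExtractor(HTMLParser):
    """Step 5: find every <li class="facts"> element."""

    def __init__(self):
        super().__init__()
        self.in_fact = False
        self.facts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "facts":
            self.in_fact = True
            self.facts.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_fact = False

    def handle_data(self, data):
        if self.in_fact:
            self.facts[-1] += data  # step 6: "copy" the text


def extract_facts(html_text):
    parser = FactExtractor()
    parser.feed(html_text)
    parser.close()
    return [fact.strip() for fact in parser.facts]


def to_csv(facts):
    """Step 7: organize the extracted data into a table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["index", "fact"])
    for index, fact in enumerate(facts, start=1):
        writer.writerow([index, fact])
    return buffer.getvalue()


# Offline demonstration on a fragment of the butterfly page.
SAMPLE = """<ul>
<li class="facts">The fastest butterflies can fly up to 30 miles per hour.</li>
<li class="facts">There are around 28,000 species of butterflies in the world</li>
</ul>"""

print(to_csv(extract_facts(SAMPLE)))
```

For a live page, the HTML would come from `download(url)` instead of `SAMPLE`; step 4 (rendering) has no counterpart because the crawler works on the HTML text directly.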


### Example: A Wikipedia index page

Let's say we want to scrape information about each Nobel Prize winner from the [List of Nobel laureates by country](https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country) on Wikipedia. First, define the data we would like to scrape:

- Name of the laureate
- Fields of the prize
- Year of the prize
- Nationality
- Date of birth
- Place of birth
- Date of death (if any)
- Place of death (if any)
- Image of the laureate

Some of this data is already on the index page, but for the rest you need to click through to each laureate's personal page. Think of how you would scrape in person:

For each laureate in the list:
1. Copy the name, fields of the prize, year of the prize and nationality.
2. Click on the laureate's name to open their personal page (in a new tab).
3. Copy the birth and death details.
4. Save the image.
5. Go back to the index page.

You can turn these instructions into a script so that your crawler does exactly what you would do in person. Here is a plan of the workflow.


<figure style="text-align: center">

[Figure: plan of the scraping workflow for the Wikipedia example; image not available]

</figure>
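The per-laureate loop could be sketched in Python as below. Everything here is hypothetical scaffolding: `crawl_index` takes the downloading and parsing steps as functions because the real selectors depend on Wikipedia's markup, which is not covered here, and the fake pages only exist so the loop can run offline.

```python
def crawl_index(index_url, fetch, parse_index, parse_detail):
    """Visit an index page, then each linked detail page, and merge the data.

    fetch(url) -> str            downloads a page
    parse_index(html) -> list    one dict per laureate, including a "url" key
    parse_detail(html) -> dict   birth/death details and the image link
    """
    records = []
    for row in parse_index(fetch(index_url)):     # for each laureate
        record = dict(row)                        # 1. data from the index page
        detail_html = fetch(row["url"])           # 2. open the personal page
        record.update(parse_detail(detail_html))  # 3-4. details and image
        records.append(record)                    # 5. continue with the index
    return records


# Offline demonstration: fake pages and parsers standing in for the real site.
FAKE_PAGES = {
    "index": "Name A|Field A|Year A|/a\nName B|Field B|Year B|/b",
    "/a": "born=Date A",
    "/b": "born=Date B",
}


def fake_fetch(url):
    return FAKE_PAGES[url]


def fake_parse_index(text):
    keys = ("name", "field", "year", "url")
    return [dict(zip(keys, line.split("|"))) for line in text.splitlines()]


def fake_parse_detail(text):
    key, value = text.split("=")
    return {key: value}


for record in crawl_index("index", fake_fetch, fake_parse_index, fake_parse_detail):
    print(record)
```

Keeping the loop separate from the parsers means the same workflow can be reused once real `fetch` and parsing functions are written for the target site.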

---

## 4. Try not to get blocked

Not every site likes people scraping its data. Imagine if all the visitors to your website were (ro)bots that just wanted to "steal" your content (and you could not earn ad views from them). Moreover, bots can be harmful because they occupy your bandwidth (e.g. DDoS). To prevent this, some servers try to detect whether a visitor is a bot, usually by these signs and checks:

- Requests sent too fast and for too many pages, faster than a human ever could.
- Unusual traffic or a high download rate from a single client or IP address within a short time span.
- Repetitive browsing patterns.
- Checking whether you are using a real browser (remember what the headers of an HTTP request do?).
- Setting up honeypots, i.e. links that are invisible to a normal user but visible to a crawler. An alarm is triggered if such a link is followed.
- Interrupting your requests with a CAPTCHA.

To avoid being detected as a bot, here are a few things we can do to make our bots "act" more like a human:

- Set a delay between requests. Never scrape too fast.
- Randomize the delay. Only robots send requests at a constant rate.
- Divide your task into shorter tasks. Do not scrape too many things per task.
- Insert some useless, random requests.
- Make use of the HTTP headers: send browser-like header information so the request looks like it comes from a real browser.
- IP rotation. Switch your connection between different proxy servers during the task.
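The first two suggestions and the header suggestion can be sketched with the standard library alone. The default delay values, the function names and the way the User-Agent string is passed in are all assumptions for illustration; suitable values depend on the site being scraped.

```python
import random
import time
from urllib.request import Request


def random_delay(base=2.0, jitter=1.5):
    """Return a pause in seconds drawn uniformly from [base, base + jitter]."""
    return base + random.uniform(0.0, jitter)


def build_request(url, user_agent):
    """Attach a User-Agent header to the request."""
    return Request(url, headers={"User-Agent": user_agent})


def polite_requests(urls, user_agent, sleep=time.sleep):
    """Yield one prepared request per URL, pausing before all but the first."""
    for position, url in enumerate(urls):
        if position > 0:
            sleep(random_delay())  # randomized, never a constant rate
        yield build_request(url, user_agent)
```

Each yielded request can then be passed to `urllib.request.urlopen`. The `sleep` parameter exists so the pause can be replaced in tests or swapped for a different waiting strategy.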

To learn more, you can read [this article](https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/).
