# 1. Overview
There are basically two ways to scrape data from websites: via HTML responses (rendered server-side) and via JSON responses (consumed client-side).
- In the first method, we extract structured data from the user interface, which is friendly to human eyes but not to computers. Because we scrape what we see, the data is available most of the time, except in cases such as when website owners protect their data by serving images rather than raw text. This crawling approach requires some basic knowledge of HTML.
- In the second method, we crawl REST API responses, which are only available on websites that expose such APIs. The advantage is that the returned data is in JSON format, so it can be easily extracted and processed. Crawling data this way is much easier, so we are going to start with it.
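
To make the contrast concrete, here is a minimal sketch that pulls the same record out of an HTML fragment and out of a JSON document. Both inputs are made-up samples (the book title, the price and the <code style='font-size:13px'>price</code> class name are assumptions, not taken from a real site), and only the standard library is used.

```python
import json
from html.parser import HTMLParser

# Hypothetical samples describing the same book in two formats.
HTML_SAMPLE = '<div class="book"><h3>Sample Book</h3><p class="price">12.50</p></div>'
JSON_SAMPLE = '{"title": "Sample Book", "price": 12.5}'


class BookParser(HTMLParser):
    """Collect the text inside <h3> and inside <p class="price">."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # name of the field whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self._field = 'title'
        elif tag == 'p' and dict(attrs).get('class') == 'price':
            self._field = 'price'

    def handle_data(self, data):
        if self._field is not None:
            value = float(data) if self._field == 'price' else data
            self.record[self._field] = value
            self._field = None


# HTML: we have to describe where each value sits in the markup.
parser = BookParser()
parser.feed(HTML_SAMPLE)
html_record = parser.record

# JSON: the structure is already machine-readable.
json_record = json.loads(JSON_SAMPLE)

print(html_record)
print(json_record)
```

The JSON route needs a single call, while the HTML route needs a description of the page layout, which is why the API approach is covered first.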

## 1.1. Requests
Instead of accessing a website through a browser such as Google Chrome, we can use the [Requests] library to download the raw content of a page and interact with it. Most of the time, we are going to use the
<code style='font-size:13px'>get()</code>
function followed by the
<code style='font-size:13px'>text</code>
attribute.

[Requests]: https://github.com/psf/requests

In [27]:
import requests

In [42]:
url = 'https://books.toscrape.com/index.html'
response = requests.get(url)
response

<Response [200]>

In [36]:
response.text[:1000]

'<!DOCTYPE html>\n<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->\n    <head>\n        <title>\n    All products | Books to Scrape - Sandbox\n</title>\n\n        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n        <meta name="created" content="24th Jun 2016 09:29" />\n        <meta name="description" content="" />\n        <meta name="viewport" content="width=device-width" />\n        <meta name="robots" content="NOARCHIVE,NOCACHE" />\n\n        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->\n        <!--[if lt IE 9]>\n        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>\n        <![endif]-->\n\n        \n            <link rel="shortcut icon" href

In [37]:
response.headers

{'Date': 'Fri, 06 Jan 2023 02:51:21 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Last-Modified': 'Thu, 26 May 2022 21:15:15 GMT', 'ETag': 'W/"628fede3-c85e"', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload', 'Content-Encoding': 'br'}

## 1.2. APIs crawling
Not all URLs point to an HTML page. For example, the URL https://api.github.com/repos/dmlc/xgboost points to a raw document in JSON format. Such a URL is called a REST API endpoint, and its response can be easily converted into a Python dictionary using the 
<code style='font-size:13px'>json()</code>
method. But APIs in practice are not always that simple; they can involve concepts such as headers and payloads. So, learning real-world API structures and how to find them will be our goal in this section.
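
As a rough illustration of what that conversion produces, the sketch below decodes a small JSON document with the standard <code style='font-size:13px'>json</code> module, used here as an offline stand-in for calling the method on a real response. The document is a made-up sample, not the actual GitHub response.

```python
import json

# Hypothetical JSON document; the field names are illustrative only.
raw = '''
{
  "name": "example-repo",
  "private": false,
  "homepage": null,
  "topics": ["scraping", "api"],
  "owner": {"login": "example-user", "id": 123}
}
'''

doc = json.loads(raw)

# JSON objects become dicts, arrays become lists,
# true/false become True/False and null becomes None.
print(type(doc).__name__)
print(type(doc['topics']).__name__)
print(doc['private'], doc['homepage'])

# Nested values are reached by chaining keys and indexes.
owner_login = doc['owner']['login']
first_topic = doc['topics'][0]
```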

In [47]:
import pandas as pd
import requests

### Documented APIs
Many organizations officially support REST APIs for data access, such as [GitHub], [Facebook], [Twitter], [Reddit] and [Clash Royale]. To start using APIs provided this way, developers usually need to register an account and generate an API key, but some of them don't require any authentication. Either way, the instructions for requesting data can be found on their documentation sites.

Now let's work through an example by requesting GitHub's [list-repository-languages] endpoint, which shows the size (in bytes) of code written in each language.
- The main component of a request is the URL, which follows a pre-defined syntax. In this case, the URL has two placeholders,
<code style='font-size:13px'>OWNER</code> and <code style='font-size:13px'>REPO</code>,
which can be handled nicely using Python formatted strings. We can use this URL to access data of any public repository.
- For private repositories, we must provide an authentication key with appropriate permissions. This additional information is sent in the headers; you can think of them as metadata of the API call.
- We might notice that there are different request methods available, such as GET, POST, PUT and DELETE, that serve different purposes. As we only want to collect data, we only need to care about GET and, sometimes, POST.

[GitHub]: https://docs.github.com/en/rest
[Facebook]: https://developers.facebook.com/docs/groups-api/reference
[Twitter]: https://developer.twitter.com/en/docs/twitter-api
[Reddit]: https://www.reddit.com/dev/api/
[Clash Royale]: https://developer.clashroyale.com/
[list-repository-languages]: https://docs.github.com/en/rest/repos/repos#list-repository-languages
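
The three points above can be sketched as a small helper that assembles the URL and headers for this endpoint. The function name <code style='font-size:13px'>build_languages_request</code>, its <code style='font-size:13px'>token</code> parameter and the private repository name are hypothetical, introduced only for illustration; the header names mirror the authenticated request shown later in this section. Nothing is sent over the network here.

```python
def build_languages_request(owner, repo, token=None):
    """Return (url, headers) for the list-repository-languages endpoint.

    `token` is optional: public repositories can be queried without it.
    """
    url = f'https://api.github.com/repos/{owner}/{repo}/languages'
    headers = {
        'Accept': 'application/vnd.github+json',
        'X-GitHub-Api-Version': '2022-11-28',
    }
    if token is not None:
        # Private repositories need an authentication key.
        headers['Authorization'] = f'Bearer {token}'
    return url, headers


public_url, public_headers = build_languages_request('dmlc', 'xgboost')
private_url, private_headers = build_languages_request(
    'some-owner', 'some-private-repo', token='<YOUR_GITHUB_TOKEN>'
)
print(public_url)
```

The returned pair can then be passed to Requests as <code style='font-size:13px'>requests.get(url, headers=headers)</code>.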

In [3]:
OWNER = 'dmlc'
REPO = 'xgboost'

url = f'https://api.github.com/repos/{OWNER}/{REPO}/languages'
response = requests.get(url)
response.json()

{'C++': 2215388,
 'Python': 1203192,
 'Cuda': 863316,
 'Scala': 470919,
 'R': 343950,
 'Java': 206895,
 'CMake': 52369,
 'Shell': 45902,
 'C': 22503,
 'Makefile': 8179,
 'PowerShell': 4308,
 'CSS': 3812,
 'Dockerfile': 2364,
 'M4': 2131,
 'Batchfile': 1383,
 'Groovy': 1251,
 'TeX': 913}
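
Since the values are byte counts, one natural follow-up is to turn them into percentage shares. The sketch below does this for the three largest entries of the output above; the remaining languages are left out to keep it short, so the shares are relative to these three only.

```python
# Three largest entries copied from the response above.
languages = {'C++': 2215388, 'Python': 1203192, 'Cuda': 863316}

total = sum(languages.values())

# Share of each language, in percent of the listed total.
shares = {name: 100 * size / total for name, size in languages.items()}

# Print languages from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:<8}{share:6.2f}%')
```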

In [5]:
OWNER = 'hungpq7'
REPO = 'courses'

url = f'https://api.github.com/repos/{OWNER}/{REPO}/languages'
headers = {
    'Accept': 'application/vnd.github+json',
    'Authorization': 'Bearer <YOUR_GITHUB_TOKEN>',  # placeholder: use your own personal access token
    'X-GitHub-Api-Version': '2022-11-28',
}
response = requests.get(url, headers=headers)
response.json()

{'Jupyter Notebook': 8166118, 'Perl': 1432, 'Shell': 1286}

<b style='color:navy'><i class="fa fa-info-circle"></i>&nbsp; Note</b><br>
We can make API calls from the command line too, using the [cURL](https://en.wikipedia.org/wiki/CURL) command.

In [6]:
!curl https://api.github.com/repos/hungpq7/courses/languages \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer <YOUR_GITHUB_TOKEN>" \
    -H "X-GitHub-Api-Version: 2022-11-28"

{
  "Jupyter Notebook": 8166118,
  "Perl": 1432,
  "Shell": 1286
}


### Hidden APIs
The increasing popularity of JavaScript-based frameworks such as React, Angular and Vue encourages websites to be rendered *client-side*. This means more websites use REST APIs to send and receive the data that fills their HTML templates, then render the page on the user's computer. Of course, such APIs are not documented, so taking advantage of them requires some tricks, including finding them and understanding their structures. In this section, we are going to inspect a page's network activity to locate scrapable APIs.

<b style='color:navy'><i class="fa fa-book"></i>&nbsp; Case study</b><br>
In the first example, we will crawl all articles on the home page of https://techcrunch.com/ with the following steps:
- Go to the target page and open the browser's developer tools. The shortcut in Google Chrome is
<code style='font-size:13px'>F12</code> or <code style='font-size:13px'>Ctrl + Shift + I</code>.
However, the tool does not record activity that happened before it was opened, so we need to press
<code style='font-size:13px'>Ctrl + R</code>
to reload the target page.
- Navigate to the Network tab to show all requests the page has made, then filter by Fetch/XHR. This filter leaves only the requests made by the page's scripts, which typically fetch JSON data, separating them from other types of response that we don't need, such as images, media and CSS. These buttons are circled in yellow in the image below.

<img src='image/rest_api_response.png' style='height:300px; margin:20px auto;'>

- At this point, one of the displayed requests returns the data we are looking for. We need to explore a bit to identify that API, starting with the names. In the TechCrunch example, the API "magazine" sounds promising. Indeed, when we click this API and preview its response, we see a list of items storing the website's articles.
- Now that we have found the API, let's learn how to use it. Switching to the Headers tab reveals the URL and the request method of this API. Other APIs may require headers as well as a payload, but this one does not.

<img src='image/rest_api_url.png' style='height:250px; margin:20px auto;'>

In [49]:
url = 'https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&cachePrevention=0'
response = requests.get(url)
response

<Response [200]>
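
The URL above carries a <code style='font-size:13px'>page=1</code> query parameter. Assuming this parameter selects the page of results (an assumption based on its name, not something verified here), the URLs for several pages could be generated as sketched below. The helper name <code style='font-size:13px'>magazine_url</code> is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = 'https://techcrunch.com/wp-json/tc/v1/magazine'


def magazine_url(page):
    """Build the endpoint URL for a given page number (assumed semantics)."""
    query = {'page': page, '_embed': 'true', 'cachePrevention': 0}
    return f'{BASE_URL}?{urlencode(query)}'


# URLs for the first three pages.
page_urls = [magazine_url(page) for page in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```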

In [50]:
data = []
for item in response.json():
    sample = {
        'id': item['id'],
        'category': item['primary_category']['slug'],
        'author': item['parselyMeta']['parsely-author'][0],
        'title': item['parselyMeta']['parsely-title'],
        'url': item['canonical_url'],
    }
    data.append(sample)

pd.DataFrame.from_dict(data).head()

| | id | category | author | title | url |
|---|---|---|---|---|---|
| 0 | 2466319 | fintech | Mary Ann Azevedo | Does everyone want to be a landlord, or what? | https://techcrunch.com/2023/01/08/does-everyon... |
| 1 | 2465989 | climate | Tim De Chant | Plant-based foods investor says her focus is m... | https://techcrunch.com/2023/01/08/plant-based-... |
| 2 | 2466520 | hardware | Haje Jan Kamps | Urine luck: these CES startups want to take a ... | https://techcrunch.com/2023/01/07/ces-urine/ |
| 3 | 2465453 | hardware | Haje Jan Kamps | A big CES 2023 trend: all battery power, every... | https://techcrunch.com/2023/01/07/batteries-ba... |
| 4 | 2466468 | gadgets | Haje Jan Kamps | Did you hear? AnkerWork is going after the wir... | https://techcrunch.com/2023/01/07/ankerwork-m6... |
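
Because a hidden API is undocumented, an item may lack one of the keys used in the loop above, in which case direct indexing raises an error. A more defensive variant, sketched below, walks the nested structure and falls back to a default value. The helper name <code style='font-size:13px'>dig</code> and the sample item are hypothetical; only the key names come from the loop above.

```python
def dig(obj, *path, default=None):
    """Follow `path` (dict keys or list indexes) into `obj`.

    Return `default` as soon as a step is missing.
    """
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Hypothetical item: 'primary_category' is absent and the author list is empty.
item = {
    'id': 1,
    'parselyMeta': {'parsely-author': [], 'parsely-title': 'Sample title'},
    'canonical_url': 'https://example.com/sample',
}

sample = {
    'id': dig(item, 'id'),
    'category': dig(item, 'primary_category', 'slug'),
    'author': dig(item, 'parselyMeta', 'parsely-author', 0),
    'title': dig(item, 'parselyMeta', 'parsely-title'),
    'url': dig(item, 'canonical_url'),
}
print(sample)
```

Missing fields end up as empty values in the resulting table instead of stopping the whole crawl.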


<b style='color:navy'><i class="fa fa-book"></i>&nbsp; Case study</b><br>
Sometimes, websites block connections from non-browser clients. For example, when inspecting the website https://tiki.vn/nha-sach-tiki/c8322, we find an API named "listings" which contains all products shown on the page. With the bare URL, we can access its response using Chrome but get blocked when using Requests. This can be easily bypassed by overriding the *user agent* header, as shown in the cells below.

<img src='image/rest_api_payload.png' style='height:300px; margin:20px auto;'>

Now, if we switch to the Payload tab, we can observe that the parameters there exactly match the components of the URL's query string. With this insight, we can rewrite the request with the URL and the payload passed separately, which is far more readable.

In [46]:
requests.utils.default_headers()

{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

In [56]:
url = 'https://tiki.vn/api/personalish/v1/blocks/listings?limit=40&category=8322&page=1&urlKey=nha-sach-tiki'
headers = {'user-agent': 'Mozilla/5.0 Chrome/108.0.0.0 Safari/537.36'}
response = requests.get(url, headers=headers)
response

<Response [200]>

In [60]:
url = 'https://tiki.vn/api/personalish/v1/blocks/listings'
headers = {'user-agent': 'Mozilla/5.0 Chrome/108.0.0.0 Safari/537.36'}
payload = {
    'limit': 40,
    'category': 8322,
    'page': 1,
    'urlKey': 'nha-sach-tiki',
}
response = requests.get(url, headers=headers, params=payload)
response

<Response [200]>
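
The two requests above are equivalent because the query string of a URL is just an encoded set of key-value pairs. The sketch below uses the standard <code style='font-size:13px'>urllib.parse</code> module (not Requests) to split the full URL into its endpoint and parameters and to put them back together; no request is sent.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

full_url = (
    'https://tiki.vn/api/personalish/v1/blocks/listings'
    '?limit=40&category=8322&page=1&urlKey=nha-sach-tiki'
)

parts = urlsplit(full_url)

# Endpoint without the query string.
endpoint = f'{parts.scheme}://{parts.netloc}{parts.path}'

# Query string decoded into a dict (all values are strings at this point).
payload = dict(parse_qsl(parts.query))

# Encoding the dict again gives back the original URL.
rebuilt = f'{endpoint}?{urlencode(payload)}'

print(endpoint)
print(payload)
print(rebuilt == full_url)
```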

In [57]:
for product in response.json()['data'][:5]:
    name = product['name']
    print(name)

Cây Cam Ngọt Của Tôi
Hành Tinh Của Một Kẻ Nghĩ Nhiều
Không Phải Sói Nhưng Cũng Đừng Là Cừu -Tặng kèm bookmark 2 mặt
Thao Túng Tâm Lý
Thiên Tài Bên Trái, Kẻ Điên Bên Phải (Tái Bản)


# 2. Basic HTML
Web scraping via APIs is technically easy, as data is already structured. The harder approach is via HTML, where data is designed for human eyes, not for computers. So, this section and the next two will guide you through some useful techniques for crawling HTML data.

## 2.1. HTML concepts
[HTML] is a language for creating web pages. The easiest way to think about HTML is as a language with the same purpose as Markdown, less readable but more expressive.

An HTML document is built from *elements*, organized in a hierarchical structure. An element is usually defined by two *tags*, an opening one and a closing one; some elements, such as form inputs, consist of a single tag. The text between the two tags is the *content* of that element. Tags can have *attributes* with corresponding *values* that configure the element, for example its style. That's enough for us to write scrapers; there is no need to know what every tag and attribute does.

[HTML]: https://en.wikipedia.org/wiki/HTML

In [73]:
%%html
<span style='color:royalblue'>Machine Learning</span>

In [62]:
%%html
<html>
<head>...</head>
<body>
<form id="loginForm">
<input name="name" type="text" value="First Name" />
<input name="name" type="text" value="Last Name" />
<input name="email" type="text" value="Business Email" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Sign Me Up" />
</form>
</body>
</html>
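
To connect these terms to code, the sketch below feeds the form above to Python's built-in <code style='font-size:13px'>html.parser.HTMLParser</code> and records each tag together with its attributes. This is only an illustration of elements, tags and attributes using the standard library, not necessarily the parser that the later sections rely on. The class name <code style='font-size:13px'>TagCollector</code> is hypothetical.

```python
from html.parser import HTMLParser

FORM_HTML = '''
<form id="loginForm">
<input name="name" type="text" value="First Name" />
<input name="name" type="text" value="Last Name" />
<input name="email" type="text" value="Business Email" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Sign Me Up" />
</form>
'''


class TagCollector(HTMLParser):
    """Record (tag, attributes) for every opening or self-closing tag."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # `attrs` arrives as a list of (name, value) pairs.
        self.elements.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(FORM_HTML)

# Keep only the attribute dicts of the <input> elements.
inputs = [attrs for tag, attrs in collector.elements if tag == 'input']
for attrs in inputs:
    print(attrs.get('name'), '|', attrs.get('value'))
```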

## 2.2. Element locating

### ID and class

### XPath

### CSS selector

# 3. HTML parser

# 4. Web driver

# References
- *gregreda.com - [Web Scraping 201: finding the API](http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/)*
- *jovian.ai - [Introduction to Web Scraping and REST APIs](https://jovian.ai/aakashns/python-web-scraping-and-rest-api)*
- *blog.devgenius.io - [Scrape Data without Selenium by Exposing Hidden APIs](https://blog.devgenius.io/scrape-data-without-selenium-by-exposing-hidden-apis-946b23850d47)*
- *medium.com - [Web Crawling Made Easy with Scrapy and REST API](https://medium.com/@geneng/web-crawling-made-easy-with-scrapy-and-rest-api-ed993e84abd3)*

---
*&#9829; By Quang Hung x Thuy Linh &#9829;*