# 1. Overview
There are two main ways to scrape data from websites: parsing HTML responses (server-side rendering) and reading JSON responses (client-side rendering).
- In the first method, we extract structured data from the user interface, which is friendly to human eyes but not to computers. Because we scrape what we see, the data is available most of the time, except when website owners protect it, for example by serving it as images rather than raw text. This approach requires some basic knowledge of HTML.
- In the second method, we crawl REST API responses, which is only possible on websites that expose such an API. The advantage is that the returned data is in JSON format, so it can be easily extracted and processed. Crawling data this way is much easier, so we start with it.
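
To make the contrast concrete, here is a minimal offline sketch that extracts the same value from a JSON body and from an HTML body. Both sample strings and the `PriceParser` class are hypothetical stand-ins, not responses from a real site.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample data standing in for two real responses.
JSON_BODY = '{"product": {"name": "Linen Shirt", "price": 120000}}'
HTML_BODY = (
    '<div class="product"><h2>Linen Shirt</h2>'
    '<span class="price">120000</span></div>'
)


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == 'span' and ('class', 'price') in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())


# JSON: a single call yields nested dicts and lists.
price_from_json = json.loads(JSON_BODY)['product']['price']

# HTML: we have to walk the markup and know which tag holds the value.
parser = PriceParser()
parser.feed(HTML_BODY)
price_from_html = int(parser.prices[0])
```

The JSON path is one line, while the HTML path depends on the page's markup and breaks whenever that markup changes.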

## 1.1. Requests
Instead of accessing websites using a browser such as Google Chrome, we can use the [Requests] library to download the raw content of a page and interact with it.

[Requests]: https://github.com/psf/requests
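
As a rough illustration of what Requests sends on our behalf, the sketch below builds a GET request and inspects it without touching the network. The `per_page` parameter is a hypothetical query parameter chosen only to show URL encoding.

```python
import requests

# Build the request without sending it, so the example runs offline.
request = requests.Request(
    'GET',
    'https://api.github.com/repos/psf/requests',
    params={'per_page': 5},                  # hypothetical query parameter
    headers={'Accept': 'application/json'},
)
prepared = request.prepare()

print(prepared.method)   # HTTP verb
print(prepared.url)      # final URL with the query string appended
print(prepared.headers)  # headers that would be sent

# Sending it is a separate step that needs network access:
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     response.raise_for_status()
```

In everyday use, `requests.get(url, params=..., headers=..., timeout=...)` performs both steps in one call.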

## 1.2. API crawling

In [2]:
import requests

### Public APIs

In [87]:
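import os

OWNER = 'psf'
REPO = 'requests'

url = f'https://api.github.com/repos/{OWNER}/{REPO}'
headers = {
    'Accept': 'application/vnd.github+json',
    # Read the token from an environment variable rather than hard-coding it.
    # The variable name GITHUB_TOKEN is an assumption.
    'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}",
    'X-GitHub-Api-Version': '2022-11-28',
}
# Pass the headers explicitly; otherwise they are never sent.
response = requests.get(url, headers=headers)
response.json()['owner']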

{'login': 'psf',
 'id': 50630501,
 'node_id': 'MDEyOk9yZ2FuaXphdGlvbjUwNjMwNTAx',
 'avatar_url': 'https://avatars.githubusercontent.com/u/50630501?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/psf',
 'html_url': 'https://github.com/psf',
 'followers_url': 'https://api.github.com/users/psf/followers',
 'following_url': 'https://api.github.com/users/psf/following{/other_user}',
 'gists_url': 'https://api.github.com/users/psf/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/psf/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/psf/subscriptions',
 'organizations_url': 'https://api.github.com/users/psf/orgs',
 'repos_url': 'https://api.github.com/users/psf/repos',
 'events_url': 'https://api.github.com/users/psf/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/psf/received_events',
 'type': 'Organization',
 'site_admin': False}

In [92]:
url

'https://api.github.com/repos/hungpq7/courses'

In [93]:
import os

OWNER = 'hungpq7'
REPO = 'courses'

url = f'https://api.github.com/repos/{OWNER}/{REPO}'
headers = {
    'Accept': 'application/vnd.github+json',
    # Read the token from an environment variable rather than hard-coding it.
    # The variable name GITHUB_TOKEN is an assumption.
    'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}",
    'X-GitHub-Api-Version': '2022-11-28',
}
response = requests.get(url, headers=headers)
response.json()

{'message': 'Not Found',
 'documentation_url': 'https://docs.github.com/rest/reference/repos#get-a-repository'}
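
A `Not Found` message like the one above does not always mean the repository is missing: GitHub also answers 404 for private repositories that the supplied token cannot see. The sketch below shows one way to keep the token out of the notebook and to interpret a few status codes. The environment variable name `GITHUB_TOKEN` and both helper functions are assumptions made for illustration.

```python
import os


def github_headers(token_env='GITHUB_TOKEN'):
    """Build GitHub API headers, adding Authorization only if a token is set.

    The variable name GITHUB_TOKEN is an assumption, not a GitHub requirement.
    """
    headers = {
        'Accept': 'application/vnd.github+json',
        'X-GitHub-Api-Version': '2022-11-28',
    }
    token = os.environ.get(token_env)
    if token:
        headers['Authorization'] = f'Bearer {token}'
    return headers


def describe_status(status_code):
    """Map a few common status codes to a hint (illustrative, not exhaustive)."""
    hints = {
        200: 'OK',
        401: 'bad or expired credentials',
        403: 'forbidden or rate limited',
        404: 'not found, or private and not visible to this token',
    }
    return hints.get(status_code, 'unexpected status')
```

With these helpers, a call could look like `requests.get(url, headers=github_headers())`, followed by `describe_status(response.status_code)` before calling `response.json()`.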

### Authentication

In [49]:
data = '''^[^{^\^"operationName^\^":^\^"SearchProductQueryV4^\^",^\^"variables^\^":^{^\^"params^\^":^\^"device=desktop&navsource=&ob=23&page=1&q=White^%^20Linen^%^20Shirt&related=true&rows=60&safe_search=false&scheme=https&shipping=&source=search&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&start=0&topads_bucket=true&unique_id=2f4091dc2e8de466e113561c8ce8b631&user_addressId=&user_cityId=176&user_districtId=2274&user_id=&user_lat=&user_long=&user_postCode=&user_warehouseId=12210375&variants=^\^"^},^\^"query^\^":^\^"query SearchProductQueryV4(^$params: String^!) ^{^\^\n  ace_search_product_v4(params: ^$params) ^{^\^\n    header ^{^\^\n      totalData^\^\n      totalDataText^\^\n      processTime^\^\n      responseCode^\^\n      errorMessage^\^\n      additionalParams^\^\n      keywordProcess^\^\n      componentId^\^\n      __typename^\^\n    ^}^\^\n    data ^{^\^\n      banner ^{^\^\n        position^\^\n        text^\^\n        imageUrl^\^\n        url^\^\n        componentId^\^\n        trackingOption^\^\n        __typename^\^\n      ^}^\^\n      backendFilters^\^\n      isQuerySafe^\^\n      ticker ^{^\^\n        text^\^\n        query^\^\n        typeId^\^\n        componentId^\^\n        trackingOption^\^\n        __typename^\^\n      ^}^\^\n      redirection ^{^\^\n        redirectUrl^\^\n        departmentId^\^\n        __typename^\^\n      ^}^\^\n      related ^{^\^\n        position^\^\n        trackingOption^\^\n        relatedKeyword^\^\n        otherRelated ^{^\^\n          keyword^\^\n          url^\^\n          product ^{^\^\n            id^\^\n            name^\^\n            price^\^\n            imageUrl^\^\n            rating^\^\n            countReview^\^\n            url^\^\n            priceStr^\^\n            wishlist^\^\n            shop ^{^\^\n              city^\^\n              isOfficial^\^\n              isPowerBadge^\^\n              __typename^\^\n            ^}^\^\n            ads ^{^\^\n              adsId: 
id^\^\n              productClickUrl^\^\n              productWishlistUrl^\^\n              shopClickUrl^\^\n              productViewUrl^\^\n              __typename^\^\n            ^}^\^\n            badges ^{^\^\n              title^\^\n              imageUrl^\^\n              show^\^\n              __typename^\^\n            ^}^\^\n            ratingAverage^\^\n            labelGroups ^{^\^\n              position^\^\n              type^\^\n              title^\^\n              url^\^\n              __typename^\^\n            ^}^\^\n            componentId^\^\n            __typename^\^\n          ^}^\^\n          componentId^\^\n          __typename^\^\n        ^}^\^\n        __typename^\^\n      ^}^\^\n      suggestion ^{^\^\n        currentKeyword^\^\n        suggestion^\^\n        suggestionCount^\^\n        instead^\^\n        insteadCount^\^\n        query^\^\n        text^\^\n        componentId^\^\n        trackingOption^\^\n        __typename^\^\n      ^}^\^\n      products ^{^\^\n        id^\^\n        name^\^\n        ads ^{^\^\n          adsId: id^\^\n          productClickUrl^\^\n          productWishlistUrl^\^\n          productViewUrl^\^\n          __typename^\^\n        ^}^\^\n        badges ^{^\^\n          title^\^\n          imageUrl^\^\n          show^\^\n          __typename^\^\n        ^}^\^\n        category: departmentId^\^\n        categoryBreadcrumb^\^\n        categoryId^\^\n        categoryName^\^\n        countReview^\^\n        customVideoURL^\^\n        discountPercentage^\^\n        gaKey^\^\n        imageUrl^\^\n        labelGroups ^{^\^\n          position^\^\n          title^\^\n          type^\^\n          url^\^\n          __typename^\^\n        ^}^\^\n        originalPrice^\^\n        price^\^\n        priceRange^\^\n        rating^\^\n        ratingAverage^\^\n        shop ^{^\^\n          shopId: id^\^\n          name^\^\n          url^\^\n          city^\^\n          isOfficial^\^\n          isPowerBadge^\^\n          
__typename^\^\n        ^}^\^\n        url^\^\n        wishlist^\^\n        sourceEngine: source_engine^\^\n        __typename^\^\n      ^}^\^\n      violation ^{^\^\n        headerText^\^\n        descriptionText^\^\n        imageURL^\^\n        ctaURL^\^\n        ctaApplink^\^\n        buttonText^\^\n        buttonType^\^\n        __typename^\^\n      ^}^\^\n      __typename^\^\n    ^}^\^\n    __typename^\^\n  ^}^\^\n^}^\^\n^\^"^}^]'''

In [54]:
payload = {
    "operationName": "SearchProductQueryV4",
    "variables": {
      "params": "device=desktop&navsource=&ob=23&page=1&q=White%20Linen%20Shirt&related=true&rows=60&safe_search=false&scheme=https&shipping=&source=search&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&start=0&topads_bucket=true&unique_id=2f4091dc2e8de466e113561c8ce8b631&user_addressId=&user_cityId=176&user_districtId=2274&user_id=&user_lat=&user_long=&user_postCode=&user_warehouseId=12210375&variants="
    },
    "query": "query SearchProductQueryV4($params: String!) {\n  ace_search_product_v4(params: $params) {\n    header {\n      totalData\n      totalDataText\n      processTime\n      responseCode\n      errorMessage\n      additionalParams\n      keywordProcess\n      componentId\n      __typename\n    }\n    data {\n      banner {\n        position\n        text\n        imageUrl\n        url\n        componentId\n        trackingOption\n        __typename\n      }\n      backendFilters\n      isQuerySafe\n      ticker {\n        text\n        query\n        typeId\n        componentId\n        trackingOption\n        __typename\n      }\n      redirection {\n        redirectUrl\n        departmentId\n        __typename\n      }\n      related {\n        position\n        trackingOption\n        relatedKeyword\n        otherRelated {\n          keyword\n          url\n          product {\n            id\n            name\n            price\n            imageUrl\n            rating\n            countReview\n            url\n            priceStr\n            wishlist\n            shop {\n              city\n              isOfficial\n              isPowerBadge\n              __typename\n            }\n            ads {\n              adsId: id\n              productClickUrl\n              productWishlistUrl\n              shopClickUrl\n              productViewUrl\n              __typename\n            }\n            badges {\n              title\n              imageUrl\n              show\n              __typename\n            }\n            ratingAverage\n            labelGroups {\n              position\n              type\n              title\n              url\n              __typename\n            }\n            componentId\n            __typename\n          }\n          componentId\n          __typename\n        }\n        __typename\n      }\n      suggestion {\n        currentKeyword\n        suggestion\n        suggestionCount\n        instead\n        
insteadCount\n        query\n        text\n        componentId\n        trackingOption\n        __typename\n      }\n      products {\n        id\n        name\n        ads {\n          adsId: id\n          productClickUrl\n          productWishlistUrl\n          productViewUrl\n          __typename\n        }\n        badges {\n          title\n          imageUrl\n          show\n          __typename\n        }\n        category: departmentId\n        categoryBreadcrumb\n        categoryId\n        categoryName\n        countReview\n        customVideoURL\n        discountPercentage\n        gaKey\n        imageUrl\n        labelGroups {\n          position\n          title\n          type\n          url\n          __typename\n        }\n        originalPrice\n        price\n        priceRange\n        rating\n        ratingAverage\n        shop {\n          shopId: id\n          name\n          url\n          city\n          isOfficial\n          isPowerBadge\n          __typename\n        }\n        url\n        wishlist\n        sourceEngine: source_engine\n        __typename\n      }\n      violation {\n        headerText\n        descriptionText\n        imageURL\n        ctaURL\n        ctaApplink\n        buttonText\n        buttonType\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n"
}
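
The `params` value in the payload is an ordinary URL query string, so it can be generated instead of edited by hand. The helper below is a sketch that encodes only a subset of the fields; whether the server accepts a reduced set, and the rule `start = (page - 1) * rows`, are assumptions (the payload only shows `page=1` with `start=0`).

```python
from urllib.parse import quote, urlencode


def build_search_params(keyword, page=1, rows=60):
    """Encode a subset of the search parameters seen in the payload.

    Assumption: `start` is the zero-based offset of the first result,
    so start = (page - 1) * rows.
    """
    fields = {
        'device': 'desktop',
        'ob': 23,
        'page': page,
        'q': keyword,
        'rows': rows,
        'source': 'search',
        'st': 'product',
        'start': (page - 1) * rows,
    }
    # quote_via=quote encodes spaces as %20 (the default quote_plus uses '+').
    return urlencode(fields, quote_via=quote)


params = build_search_params('White Linen Shirt')
```

The result can then be assigned to `payload['variables']['params']` before posting.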

In [74]:
url = 'https://gql.tokopedia.com/graphql/SearchProductQueryV4'
headers = {
    # "authority": "gql.tokopedia.com",
    # "accept": "*/*",
    # "accept-language": "en,vi;q=0.9",
    # "content-type": "application/json",
    # "origin": "https://www.tokopedia.com",
    # "referer": "https://www.tokopedia.com/search?st=product&q=White%20Linen%20Shirt&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&navsource=",
    # "sec-ch-ua": '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
    # "sec-ch-ua-mobile": "?0",
    # "sec-ch-ua-platform": '"Windows"',
    # "sec-fetch-dest": "empty",
    # "sec-fetch-mode": "cors",
    # "sec-fetch-site": "same-site",
    # "tkpd-userid": "0",
    # "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    # "x-device": "desktop-0.0",
    # "x-source": "tokopedia-lite",
    # "x-tkpd-lite-service": "zeus",
    # "x-version": "505bf79",
}
response = requests.post(url, json=payload)

response.json()

{'data': {'ace_search_product_v4': {'header': {'totalData': 2488,
    'totalDataText': '2.488',
    'processTime': 0.11293385,
    'responseCode': 0,
    'errorMessage': '',
    'additionalParams': '',
    'keywordProcess': '4',
    'componentId': '02.01.00.00',
    '__typename': 'AceSearchUnifyHeader'},
   'data': {'banner': {'position': 0,
     'text': '',
     'imageUrl': '',
     'url': '',
     'componentId': '',
     'trackingOption': 0,
     '__typename': 'AceSearchUnifyBanner'},
    'backendFilters': '',
    'isQuerySafe': True,
    'ticker': {'text': '',
     'query': '',
     'typeId': 0,
     'componentId': '',
     'trackingOption': 0,
     '__typename': 'AceSearchUnifyTicker'},
    'redirection': {'redirectUrl': '',
     'departmentId': 0,
     '__typename': 'AceSearchUnifyRedirection'},
    'related': {'position': 0,
     'trackingOption': 0,
     'relatedKeyword': '',
     'otherRelated': [],
     '__typename': 'AceSearchUnifyRelated'},
    'suggestion': {'currentKeyword

In [73]:
response.json()['data']['ace_search_product_v4']['data']['products'][2]

{'id': 6846598409,
 'name': 'Joie Linen Camp Collar Stripes Shirt White Cream Short Sleeve',
 'ads': {'adsId': '',
  'productClickUrl': '',
  'productWishlistUrl': '',
  'productViewUrl': '',
  '__typename': 'AceSearchUnifyAds'},
 'badges': [{'title': 'Official Store',
   'imageUrl': 'https://images.tokopedia.net/img/official_store_badge.png',
   'show': True,
   '__typename': 'AceSearchUnifyBadge'}],
 'category': 3579,
 'categoryBreadcrumb': 'fashion-pria/atasan-pria/kemeja-pria',
 'categoryId': 1759,
 'categoryName': 'Fashion Pria',
 'countReview': 24,
 'customVideoURL': '',
 'discountPercentage': 43,
 'gaKey': '/searchproduct/fashion-pria/atasan-pria/kemeja-pria/white+linen+shirt/tenuedeattire/joie-linen-camp-collar-stripes-shirt-white-cream-short-sleeve-l',
 'imageUrl': 'https://images.tokopedia.net/img/cache/200-square/hDjmkQ/2022/10/29/23cf7036-3d7d-4d1e-a8f9-0de610a74f83.jpg',
 'labelGroups': [{'position': 'integrity',
   'title': 'Terjual 80+',
   'type': 'textDarkGrey',
   'ur
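
Since the response is deeply nested, it can help to flatten it into a small list of records. The following sketch uses only keys visible in the output above; the `extract_products` helper and the trimmed `sample` dictionary are illustrative, not part of any Tokopedia client.

```python
def extract_products(body):
    """Return a compact list of products from a search response body.

    Falls back to empty containers so that a missing or null key yields
    an empty list instead of raising an exception.
    """
    search = (body.get('data') or {}).get('ace_search_product_v4') or {}
    products = (search.get('data') or {}).get('products') or []
    return [
        {
            'id': p.get('id'),
            'name': p.get('name'),
            'category': p.get('categoryName'),
            'reviews': p.get('countReview'),
            'discount_pct': p.get('discountPercentage'),
        }
        for p in products
    ]


# Trimmed sample mirroring the structure printed above.
sample = {'data': {'ace_search_product_v4': {'data': {'products': [
    {'id': 6846598409,
     'name': 'Joie Linen Camp Collar Stripes Shirt White Cream Short Sleeve',
     'categoryName': 'Fashion Pria',
     'countReview': 24,
     'discountPercentage': 43},
]}}}}

rows = extract_products(sample)
```

On a live response, `extract_products(response.json())` would return one record per product.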

In [45]:
requests.post?

Signature: requests.post(url, data=None, json=None, **kwargs)
Docstring:
Sends a POST request.

:param url: URL for the new :class:`Request` object.
:param data: (optional) Dictionary, list of tuples, bytes, or file-like
    object to send in the body of the :class:`Request`.
:param json: (optional) json data to send in the body of the :class:`Request`.
:param \*\*kwargs: Optional arguments that ``request`` takes.
:return: :class:`Response <Response>` object
:rtype: requests.Response
File:      c:\users\hungpq5\appdata\roaming\python\python39\site-packages\requests\api.py
Type:      function


In [4]:
url = 'https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&cachePrevention=0'
response = requests.get(url)

In [12]:
response.json()[0].keys()

dict_keys(['id', 'date', 'date_gmt', 'guid', 'modified', 'modified_gmt', 'slug', 'status', 'type', 'link', 'title', 'content', 'excerpt', 'author', 'featured_media', 'comment_status', 'ping_status', 'sticky', 'template', 'format', 'meta', 'categories', 'tags', 'crunchbase_tag', 'tc_stories_tax', 'tc_ec_category', 'tc_event', 'tc_regions_tax', 'jetpack_featured_media_url', 'parsely', 'shortlink', 'parselyMeta', 'rapidData', 'premiumContent', 'premiumCutoffPercent', 'featured', 'subtitle', 'editorialContentProvider', 'tc_cb_mapping', 'associatedEvent', 'event', 'authors', 'hide_featured_image', 'canonical_url', 'primary_category', '_links', '_embedded'])

In [30]:
data

'{"@type":"NewsArticle","mainEntityOfPage":{"@type":"WebPage"},"dateModified":"2023-01-05T02:23:52+00:00","description":"Nearly a year after Sony and Honda shared plans to jointly make and sell electric vehicles, the two companies revealed a prototype under the brand name Afeela. The four-door sedan was driven onstage at CES Wednesday as Kenichiro Yoshida, the CEO of Sony, talked through the company&#8217;s mobility philosophy, which prioritizes building vehicles that have [&hellip;]","speakable":{"@type":"SpeakableSpecification","cssSelector":[".alpha","#speakable-summary"]},"publisher":{"@type":"Organization","name":"TechCrunch","logo":{"@type":"imageObject","url":"https:\\/\\/techcrunch.com\\/wp-content\\/themes\\/techcrunch-2017\\/images\\/logo-json-ld.png","width":"600","height":"60"}}}'

In [32]:
response.json()[0]['parselyMeta']

{'parsely-title': 'Sony and Honda reveal Afeela, their joint EV brand, at CES',
 'parsely-link': 'https://techcrunch.com/2023/01/04/sony-and-honda-reveal-afeela-their-joint-ev-brand-at-ces/',
 'parsely-type': 'post',
 'parsely-pub-date': '2023-01-05T02:23:52+00:00',
 'parsely-image-url': 'https://techcrunch.com/wp-content/uploads/2023/01/Honda-Sony-Afeela-CES2023-4.jpeg?w=680',
 'parsely-author': ['Rebecca Bellan'],
 'parsely-section': 'Transportation',
 'parsely-tags': '@post-id:2465033,ces,ces 2023,honda,sony',
 'parsely-metadata': '{"@type":"NewsArticle","mainEntityOfPage":{"@type":"WebPage"},"dateModified":"2023-01-05T02:23:52+00:00","description":"Nearly a year after Sony and Honda shared plans to jointly make and sell electric vehicles, the two companies revealed a prototype under the brand name Afeela. The four-door sedan was driven onstage at CES Wednesday as Kenichiro Yoshida, the CEO of Sony, talked through the company&#8217;s mobility philosophy, which prioritizes building v
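
Note that `parsely-metadata` is itself a JSON document stored as a string, and its description contains HTML entities such as `&#8217;`. A minimal sketch of decoding both layers follows; `raw_metadata` is a shortened stand-in for the real value.

```python
import html
import json

# Shortened stand-in for response.json()[0]['parselyMeta']['parsely-metadata'].
raw_metadata = (
    '{"@type":"NewsArticle",'
    '"dateModified":"2023-01-05T02:23:52+00:00",'
    '"description":"the company&#8217;s mobility philosophy [&hellip;]"}'
)

# First layer: the string is JSON, so parse it into a dict.
metadata = json.loads(raw_metadata)

# Second layer: the text still contains HTML entities, so decode them.
description = html.unescape(metadata['description'])
```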

In [35]:
for element in response.json():
    print(element['id'])
    print()

2465033
2464751
2465031
2465029
2464075
2463175
2464961
2464604
2457917
2464588
2464279
2464867
2464718
2464852
2464750
2464023
2464706
2464559
2464545
2464377


In [14]:
print(response.headers)

{'Server': 'ATS', 'Date': 'Thu, 05 Jan 2023 03:00:03 GMT', 'Content-Type': 'application/json; charset=UTF-8', 'Content-Length': '87935', 'X-Robots-Tag': 'noindex', 'Link': '<https://techcrunch.com/wp-json/>; rel="https://api.w.org/"', 'X-Content-Type-Options': 'nosniff', 'Access-Control-Expose-Headers': 'X-WP-Total, X-WP-TotalPages, Link', 'Access-Control-Allow-Headers': 'Authorization, X-WP-Nonce, Content-Disposition, Content-MD5, Content-Type', 'X-WP-Total': '234736', 'X-WP-TotalPages': '11737', 'Cache-Control': 'max-age=60', 'Allow': 'GET', 'X-rq': 'sea1 0 2 9980', 'Content-Encoding': 'gzip', 'Age': '56', 'X-Cache': 'hit', 'Vary': 'Accept-Encoding, Origin', 'Accept-Ranges': 'bytes', 'Strict-Transport-Security': 'max-age=31536000', 'Connection': 'keep-alive', 'Expect-CT': 'max-age=31536000, report-uri="http://csp.yahoo.com/beacon/csp?src=yahoocom-expect-ct-report-only"', 'Referrer-Policy': 'no-referrer-when-downgrade', 'X-XSS-Protection': '1; mode=block', 'X-Frame-Options': 'SAMEORIG
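
The `X-WP-Total` and `X-WP-TotalPages` headers above report how many items and pages the endpoint offers, which is enough to plan a paginated crawl. The generator below is a sketch that derives page URLs from the first URL; the `page_urls` name and its `limit` parameter are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(first_url, headers, limit=None):
    """Yield one URL per page, using the X-WP-TotalPages response header.

    `limit` caps the number of pages (a hypothetical safety parameter).
    """
    total_pages = int(headers.get('X-WP-TotalPages', 1))
    if limit is not None:
        total_pages = min(total_pages, limit)

    parts = urlsplit(first_url)
    query = dict(parse_qsl(parts.query))
    for page in range(1, total_pages + 1):
        query['page'] = str(page)  # overwrite only the page number
        yield urlunsplit(parts._replace(query=urlencode(query)))


first = 'https://techcrunch.com/wp-json/tc/v1/magazine?page=1&_embed=true&cachePrevention=0'
urls = list(page_urls(first, {'X-WP-TotalPages': '11737'}, limit=3))
```

When fetching these URLs for real, it is considerate to pause between requests (for example with `time.sleep`) so the crawl does not overload the server.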

### Hidden APIs

# 2. Basic HTML

## 2.1. HTML concepts

```html
<html>
<head>...</head>
<body>
...
<form id="loginForm">
<input name="name" type="text" value="First Name" />
<input name="name" type="text" value="Last Name" />
<input name="email" type="text" value="Business Email" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Sign Me Up" />
</form>
</body>
</html>
```
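
To see how a program reads this markup, the sketch below walks the form with Python's built-in `html.parser` and collects the attributes of each `<input>` tag. The `InputCollector` class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser

FORM_HTML = '''
<form id="loginForm">
<input name="name" type="text" value="First Name" />
<input name="name" type="text" value="Last Name" />
<input name="email" type="text" value="Business Email" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Sign Me Up" />
</form>
'''


class InputCollector(HTMLParser):
    """Record the attributes of every <input> element as a dict."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also arrive here by default.
        if tag == 'input':
            self.inputs.append(dict(attrs))


collector = InputCollector()
collector.feed(FORM_HTML)
names = [field['name'] for field in collector.inputs]
```

Each element is a tag name plus a set of attributes, and two elements may share the same `name`, which is why more precise locating techniques are needed.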

## 2.2. Element locating

### ID and class

### XPath
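
As a small taste of XPath, the sketch below queries the login form from section 2.1 with the standard library's `xml.etree.ElementTree`. This module supports only a subset of XPath and requires well-formed XML, so it is an assumption that it suits this toy page; real-world HTML usually needs an HTML-aware parser.

```python
import xml.etree.ElementTree as ET

PAGE = '''<html>
<head></head>
<body>
<form id="loginForm">
<input name="name" type="text" value="First Name" />
<input name="name" type="text" value="Last Name" />
<input name="email" type="text" value="Business Email" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Sign Me Up" />
</form>
</body>
</html>'''

root = ET.fromstring(PAGE)

# Locate by attribute value.
form = root.find(".//form[@id='loginForm']")
email = root.find(".//input[@name='email']")

# Locate by position: the second <input> child of the form (1-based).
second_name = root.find(".//form/input[2]")
```

Position-based expressions are useful when attributes are not unique, as with the two `name="name"` inputs here.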

### CSS selector

# 3. HTML parser

# 4. Web driver

---
*&#9829; By Quang Hung x Thuy Linh &#9829;*