# Web Scraping Tutorial

This tutorial shows how to use Python to scrape a web page and extract data from it. We will use two packages: `requests` to download the page and `BeautifulSoup` to extract the data.

Many good references on web scraping are available online. I recommend the following resources:
1. Automate the Boring Stuff with Python by Al Sweigart (2020) has a chapter on web scraping, which can be read [online](https://automatetheboringstuff.com/2e/chapter12/).
2. Web Scraping With Python by Ryan Mitchell (2018) is somewhat dated but provides a comprehensive guide to the topic.

## Step 0: Getting to know the web page

In this tutorial, we will try to extract the cryptocurrency market prices from the CoinGecko website https://www.coingecko.com/.

Your first step should always be to familiarize yourself with the website you want to scrape. Take a look at the website and try to inspect the HTML elements on the webpage.

## Step 1: Scrape a web page

Now we are ready to download the web page we want to get data from, using the `requests` package. We will use the following function and response attributes:

* `requests.get('URL')` - make a GET request to the specified URL and return a response object `r`
* `r.status_code` - get the HTTP status code of the response
* `r.content` - get the binary content (bytes) of the page

More functions in the `requests` package are available in [its documentation](https://requests.readthedocs.io/en/latest/).

In [3]:
# First, we will import the requests package
import requests

In [4]:
# Request the webpage and display the response object
url = "https://www.coingecko.com"
res = requests.get(url)
res

<Response [200]>

In [8]:
# Type of the response object we got back
type(res)

requests.models.Response

In [9]:
# Check the status code
res.status_code

200
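
Checking `status_code` by hand is fine for a quick look. In a script it is often more convenient to let `requests` raise an exception when a request fails. The following is a minimal sketch under stated assumptions: the helper name `fetch_html` and the 10-second timeout are illustrative choices and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML as text.

    `fetch_html` and the default timeout are hypothetical, illustrative choices.
    """
    res = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    res.raise_for_status()
    return res.text
```

Calling `fetch_html("https://www.coingecko.com")` would return the page source as a string, or raise a subclass of `requests.RequestException` if the request fails.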

In [10]:
# Get the headers of the HTTP response
res.headers

{'Date': 'Tue, 04 Oct 2022 08:44:29 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'CF-Ray': '754ca2ce0fee88bf-LHR', 'Age': '9', 'Cache-Control': 'max-age=30, public, must-revalidate, s-maxage=30', 'Vary': 'Accept-Encoding', 'CF-Cache-Status': 'HIT', 'Alternate-Protocol': '443:npn-spdy/2', 'Referrer-Policy': 'strict-origin-when-cross-origin', 'X-Content-Type-Options': 'nosniff', 'X-Download-Options': 'noopen', 'X-Frame-Options': 'SAMEORIGIN', 'X-Permitted-Cross-Domain-Policies': 'none', 'X-Request-Id': '113bb2a1-404b-4c06-a833-536da3e9a450', 'X-Runtime': '14.545432', 'X-XSS-Protection': '1; mode=block', 'Set-Cookie': '__cf_bm=Tloz8aIJhqO.No.EgCkocD1CgzgN4UbGPvgxufgaKwE-1664873069-0-AUO9MKKq1k0RmzT3w+yUU4KJ5EtRHzPuixQYWb33H6ff6ttBucTiLuhyuWLSF/2qZUXOICr/l7ny/xr8FrQdMas=; path=/; expires=Tue, 04-Oct-22 09:14:29 GMT; domain=.coingecko.com; HttpOnly; Secure; SameSite=None', 'Server': 'cloudflare', 'Content-Encoding': 'br', 'alt

In [12]:
# Get the first 100 bytes of the web page content
res.content[:100]

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<script src="/cdn-cgi/apps/head/gYtXOyllgyP3-Z2iKTP8rRWGBm4.'

In [13]:
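# Save the content of the web page to the local drive.
# Use UTF-8 explicitly so the file matches the encoding used when reading it back.

with open("../data/coin_gecko.html", "w", encoding="utf-8") as f:
    f.write(res.text)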
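
An alternative to writing `res.text` is to write the raw bytes from `res.content` unchanged, which avoids choosing a text encoding at save time. The sketch below assumes a hypothetical helper named `save_page`; it is one possible approach, not the notebook's own method.

```python
from pathlib import Path


def save_page(content, path):
    """Write raw page bytes to `path`, creating parent folders if needed.

    `save_page` is a hypothetical helper used only for illustration.
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target
```

With the response from Step 1, this could be called as `save_page(res.content, "../data/coin_gecko.html")`.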

## Step 2: Extract data from the web page

After downloading the web page and saving it to the local disk, we will use the `BeautifulSoup` package to parse the HTML file and access its content. We will use the following functions:

**1. Load the web page into BeautifulSoup**
* `soup = BeautifulSoup(html_doc, 'html.parser')` - parse the HTML content into a BeautifulSoup object

**2. Get the content of the element**
* `soup.title` - get the title of the page
* `soup.title.string` - get the string in the title tag
* `soup.h1` - get the H1 element in the web page
* `soup.h1.attrs` - get all attributes of the H1 element
* `soup.h1['class']` - get the class attribute of the H1 element

**3. Look for elements in the web page**
* `soup.find('HTML_tag')` - get the first element with the specified HTML tag
* `soup.find_all('HTML_tag')` - get the list of elements that have the specified HTML tag
* `soup.select('CSS_selector')` - get the list of elements that match the specified [CSS selector](https://www.w3schools.com/cssref/css_selectors.asp)
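
The calls listed above can be tried on a small HTML string before working with a real page. This is a minimal sketch; the markup in `html_doc` is hypothetical sample data and is not taken from the CoinGecko page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only to illustrate the API
html_doc = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 class="page-title main" id="top">Prices</h1>
    <p class="note">First note</p>
    <p class="note">Second note</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # text inside <title>
h1_attrs = soup.h1.attrs         # dict of all attributes on the first <h1>
h1_classes = soup.h1["class"]    # class is multi-valued, so this is a list
first_p = soup.find("p")         # first matching element, or None
all_p = soup.find_all("p")       # all matching elements
notes = soup.select("p.note")    # elements matching a CSS selector

print(title_text)
print(h1_classes)
print([p.get_text() for p in notes])
```

Note that `find` returns a single element (or `None` when nothing matches), while `find_all` and `select` return collections that can be iterated over.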

In [14]:
# First, import BeautifulSoup from the bs4 package
from bs4 import BeautifulSoup

In [15]:
# Load the saved web page and parse it with BeautifulSoup
with open("../data/coin_gecko.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

In [16]:
# Check the type of our soup object
type(soup)

bs4.BeautifulSoup

In [19]:
# Get all text in the web page
soup.get_text().replace("\n", " ")[:500]

'         Cryptocurrency Prices, Charts, and Crypto Market Cap | CoinGecko                                                                                This website uses cookies for functionality, analytics and advertising purposes as described in our Privacy Policy. If you agree to our use of cookies, please continue to use our site.   OK                    Continue in app  Track prices in real-time   Open App                 Continue in app  Track prices in real-time   Open App               '

In [20]:
# Get the title of the page
soup.title

<title>Cryptocurrency Prices, Charts, and Crypto Market Cap | CoinGecko</title>

In [21]:
# List the attributes and methods available on the soup object (including find())
dir(soup)

['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'DEFAULT_INTERESTING_STRING_TYPES',
 'ROOT_TAG_NAME',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_decode_markup',
 '_feed',
 '_find_all',
 '_find_one',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_linkage_fixer',
 '_markup_is_url',
 '_markup_resembles_filename',
 '_most_recent_element',
 '_namespaces',
 '_popToTag',
 '_should_pretty_print',
 'append',
 'attrs',
 'builder',
 'can_be_empty_element',
 'cdata_list_

In [None]:
# Other HTML tags work too
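
As a sketch of what this cell might contain, the same attribute-style access used for `soup.title` works for other tags such as `h1` or `a`. The markup below is a hypothetical example, so the tag contents and the link path are assumptions rather than values from the saved page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration
html_doc = '<body><h1>Coins</h1><a href="/en/coins/example">Example</a></body>'
soup = BeautifulSoup(html_doc, "html.parser")

heading = soup.h1        # attribute access returns the first <h1>
link = soup.find("a")    # equivalent to soup.a

print(heading.get_text())
print(link["href"])      # attributes are read with dictionary-style access
```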


Now, we will extract the cryptocurrency market prices from the table.

In [None]:
# Get the table element in the web page
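
One way to fill in this cell is `soup.find("table")`, which returns the first `<table>` element. The sketch below runs on a hypothetical stand-in table with placeholder values; the real CoinGecko markup may differ and may need a more specific selector.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page; names and prices are placeholders
html_doc = """
<table>
  <thead><tr><th>#</th><th>Coin</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Coin A</td><td>$1.00</td></tr>
    <tr><td>2</td><td>Coin B</td><td>$2.50</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

table = soup.find("table")  # first <table> in the document, or None
if table is None:
    raise ValueError("no <table> element found")
print(table.name)
```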


In [None]:
# Get the table headers
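
A possible way to collect the column names is to read the text of every `<th>` cell. This sketch assumes the headers are marked up as `<th>` elements, which is true for the hypothetical sample below but has not been verified against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table; the column names are placeholders
html_doc = """
<table>
  <thead><tr><th> # </th><th>Coin</th><th>Price</th></tr></thead>
  <tbody><tr><td>1</td><td>Coin A</td><td>$1.00</td></tr></tbody>
</table>
"""
table = BeautifulSoup(html_doc, "html.parser").find("table")

# get_text(strip=True) removes surrounding whitespace from each header
headers = [th.get_text(strip=True) for th in table.find_all("th")]
print(headers)
```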


In [None]:
# If more than one element matches the tag,
# use soup.find_all() to retrieve all of them as a list.
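
The difference between `find` and `find_all` can be seen on table rows. The sketch below again uses hypothetical sample markup with placeholder values.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table with one header row and two data rows
html_doc = """
<table>
  <tr><th>Coin</th><th>Price</th></tr>
  <tr><td>Coin A</td><td>$1.00</td></tr>
  <tr><td>Coin B</td><td>$2.50</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_row = soup.find("tr")      # only the first <tr>
all_rows = soup.find_all("tr")   # every <tr>, in document order

print(len(all_rows))
print(first_row.get_text(" ", strip=True))
```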


In [None]:
# Iterate over rows and get the data for each coin
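
A sketch of the row loop, assuming each coin occupies one `<tr>` whose `<td>` cells line up with the `<th>` headers. The sample table and its values are hypothetical placeholders; on the real page the cells may contain nested elements that need extra cleaning.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table; names and prices are placeholders
html_doc = """
<table>
  <tr><th>#</th><th>Coin</th><th>Price</th></tr>
  <tr><td>1</td><td>Coin A</td><td>$1.00</td></tr>
  <tr><td>2</td><td>Coin B</td><td>$2.50</td></tr>
</table>
"""
table = BeautifulSoup(html_doc, "html.parser").find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if not cells:
        continue  # skip the header row, which has <th> cells but no <td>
    # Pair each cell with its column name
    records.append(dict(zip(headers, cells)))

print(records)
```

A list of dictionaries is a convenient intermediate form because it can be passed directly to `pandas.DataFrame` in the next step.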


## Step 3: Create data table and save as CSV file

Let's wrap our data table in a pandas DataFrame and save it as a CSV file.

In [None]:
import pandas as pd
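
A minimal sketch of this step, assuming the scraped rows are available as a list of dictionaries. The records, the price-cleaning rule, and the output filename `coin_prices.csv` are all hypothetical placeholders.

```python
import pandas as pd

# Placeholder records standing in for the scraped rows
records = [
    {"#": "1", "Coin": "Coin A", "Price": "$1.00"},
    {"#": "2", "Coin": "Coin B", "Price": "$1,250.50"},
]

df = pd.DataFrame(records)

# Strip "$" and thousands separators so the price can be stored as a number
df["Price"] = df["Price"].str.replace("[$,]", "", regex=True).astype(float)

# index=False leaves the DataFrame's row index out of the file
csv_path = "coin_prices.csv"  # hypothetical output location
df.to_csv(csv_path, index=False)
print(df)
```

Converting the price column to a numeric type before saving makes later sorting and arithmetic straightforward.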