# Web Scraping Tutorial

This tutorial will teach you how to use Python to scrape a web page and extract data from it. We will use two packages: `requests` to download the web page and `BeautifulSoup` to extract the data.

Many good references on web scraping are available online. I recommend the following resources:
1. Automate the Boring Stuff with Python by Al Sweigart (2020) has a chapter on web scraping, which can be read [online](https://automatetheboringstuff.com/2e/chapter12/).
2. Web Scraping with Python by Ryan Mitchell (2018) is a somewhat older book but provides a comprehensive guide to the topic.

## Step 0: Getting to know the web page

In this tutorial, we will try to extract the cryptocurrency market prices from the CoinGecko website https://www.coingecko.com/.

Your first step should always be to familiarize yourself with the website you want to scrape. Take a look at the website and try to inspect the HTML elements on the webpage.

## Step 1: Scrape a web page

Now we are ready to download the web page we want to get data from with the `requests` package. We will use the following functions:

* `requests.get('URL')` - make a GET request to the specified URL
* `r.status_code` - get the HTTP status code of the response
* `r.content` - get the raw content of the response as bytes

More functions in the `requests` package are available in [its documentation](https://requests.readthedocs.io/en/latest/).
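The calls listed above can be combined into one small helper. This is a minimal sketch rather than part of the original notebook: the name `fetch_page` and the 10-second default timeout are assumptions made for illustration.

```python
import requests


def fetch_page(url: str, timeout: float = 10.0) -> requests.Response:
    """Request a page and fail loudly on an HTTP error status.

    `fetch_page` is a hypothetical helper name, not part of requests.
    """
    # A timeout keeps the call from waiting forever on an unresponsive server.
    res = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    res.raise_for_status()
    return res
```

With this helper, `res = fetch_page("https://www.coingecko.com")` would give back the same kind of `Response` object used in the cells below, and a failed request would surface as an exception instead of an unnoticed error page.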

In [69]:
# First, we will import the requests package
import requests

from os.path import join, exists, isfile, isdir, abspath, dirname, basename, realpath
from os import makedirs, listdir, pardir, getcwd

import re

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [46]:
parent_dir = abspath(join(join(getcwd(), pardir), pardir))
data_dir = join(parent_dir, 'data')
html_file = join(data_dir, 'coin_gecko.html')

In [4]:
# Request the webpage
url = "https://www.coingecko.com"
res = requests.get(url=url)

<Response [200]>

In [8]:
# Type of the response object we got back
type(res)

requests.models.Response

In [29]:
# Check the status code
res.status_code == requests.codes.ok

True

In [30]:
# Get the Content-Type header of the response
res.headers['content-type']

'text/html; charset=utf-8'

In [12]:
# Get the content of the web page
res.content[:100]

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n<script src="/cdn-cgi/apps/head/gYtXOyllgyP3-Z2iKTP8rRWGBm4.'

In [47]:
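# Save the content of the web page to the local drive.
# Write with the same encoding (UTF-8) that we use later to read the file back.
with open(html_file, "w", encoding="utf-8") as f:
    f.write(res.text)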

1329426
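The `open()` call above fails if the `data` directory does not exist yet. As a minimal sketch, assuming we are free to create the directory ourselves, the save step could be wrapped as follows. The helper name `save_html` is hypothetical.

```python
from os import makedirs
from os.path import join


def save_html(text: str, directory: str, filename: str) -> str:
    """Write `text` to directory/filename as UTF-8 and return the file path."""
    # exist_ok=True makes this a no-op when the directory is already there.
    makedirs(directory, exist_ok=True)
    path = join(directory, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

In this notebook the call would look like `save_html(res.text, data_dir, 'coin_gecko.html')`.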

## Step 2: Extract data from the web page

After fetching the web page and saving it to the local disk, we will use the `BeautifulSoup` package to parse the HTML file and access its content. We will use the following functions:

**1. Load the web page to BeautifulSoup**
* `soup = BeautifulSoup(html_doc, 'html.parser')` - parse the HTML content into a BeautifulSoup object

**2. Get the content of the element**
* `soup.title` - get the title of the page
* `soup.title.string` - get the string in the title tag
* `soup.h1` - get the H1 element in the web page
* `soup.h1.attrs` - get all attributes in the H1 element
* `soup.h1['class']` - get the class attribute in the H1 element

**3. Look for the element in the web page**
* `soup.find('HTML_tag')` - get the first element with the specified HTML tag
* `soup.find_all('HTML_tag')` - get the list of elements that have the specified HTML tag
* `soup.select('CSS_selector')` - get the list of elements with the specified [CSS selector](https://www.w3schools.com/cssref/css_selectors.asp)
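To see these functions side by side before applying them to the real page, here is a small sketch on a made-up HTML snippet. The snippet and its class names are invented for illustration and have nothing to do with CoinGecko's markup.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document to experiment with
html_doc = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 class="headline big">Prices</h1>
    <p class="note">First note</p>
    <p class="note">Second note</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # the string inside <title>
h1_attrs = soup.h1.attrs         # all attributes of the first <h1>, as a dict
h1_classes = soup.h1["class"]    # class is multi-valued, so this is a list
first_p = soup.find("p")         # only the first matching <p>
all_p = soup.find_all("p")       # every matching <p>
notes = soup.select("p.note")    # CSS selector: <p> elements with class "note"

print(title_text, h1_classes, first_p.text, len(all_p), len(notes))
```

Note that `find()` returns a single element (or `None` when nothing matches), while `find_all()` and `select()` always return a list.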

In [14]:
# First, we will import BeautifulSoup from the bs4 package
from bs4 import BeautifulSoup

In [48]:
# Load the web page and parse it to BeautifulSoup
with open(html_file, "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

In [16]:
# Check the type of our soup object
type(soup)

bs4.BeautifulSoup

In [24]:
# Get all text in the web page
soup.get_text().strip().replace("\n", "")[:500]

'Cryptocurrency Prices, Charts, and Crypto Market Cap | CoinGeckoThis website uses cookies for functionality, analytics and advertising purposes as described in our Privacy Policy. If you agree to our use of cookies, please continue to use our site.OKContinue in appTrack prices in real-timeOpen AppContinue in appTrack prices in real-timeOpen AppEN LanguageEnglishDeutschEspañolFrançaisItalianojęzyk polskiLimba românăMagyar nyelvNederlandsPortuguêsSvenskaTiếng việtTürkçeРусский日本語简体中文繁體中文한국어العربي'
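In the output above, neighbouring pieces of text run together ("CoinGeckoThis website ...") because the newlines between them were simply deleted. One possible alternative, sketched here on a made-up snippet, is to let `get_text()` insert a separator between text fragments.

```python
from bs4 import BeautifulSoup

html_doc = "<h1>Title</h1>\n<p>Some <b>bold</b> text</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# Deleting newlines glues neighbouring blocks together
glued = soup.get_text().strip().replace("\n", "")

# separator=" " puts a space between text fragments; strip=True trims each
# fragment and skips the ones that are only whitespace
spaced = soup.get_text(separator=" ", strip=True)

print(glued)
print(spaced)
```

The second form is usually easier to read or search, at the cost of losing the original line structure.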

In [20]:
# Get the title of the page
soup.title

<title>Cryptocurrency Prices, Charts, and Crypto Market Cap | CoinGecko</title>

In [27]:
soup.h1.text.strip()

'Cryptocurrency Prices by Market Cap'

In [25]:
# We can also get the page title using the soup.find() function
soup.find("title")

<title>Cryptocurrency Prices, Charts, and Crypto Market Cap | CoinGecko</title>

In [35]:
# Other HTML tags work too
for i in range(1, 6):
    print(f"Total {len(soup.find_all(f'h{i}'))} H{i} tags")

Total 1 H1 tags
Total 4 H2 tags
Total 3 H3 tags
Total 0 H4 tags
Total 2 H5 tags


Now, we will extract the cryptocurrency market prices from the table.

In [37]:
# Get the table element in the web page
table = soup.find("table", attrs={"class": "table"})
type(table)

bs4.element.Tag

In [57]:
# Get the table headers
headers = [x.text.strip() for x in table.thead.find_all("th")]

In [54]:
# If more than one element matches the tag,
# use find_all() to retrieve all of them as a list.
rows = table.find_all("tr")
cols = [x.find_all("td") for x in rows]
data = [[x.text.strip() for x in y] for y in cols]
len(rows)
len(cols)
len(data)
# data[:5]

101

101

101
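The count of 101 presumably includes the header row in addition to the coin rows. The sketch below uses a made-up table (not CoinGecko's markup) to show why: `find_all("tr")` also returns the header row, which contains `<th>` cells and therefore yields an empty list of `<td>` cells.

```python
from bs4 import BeautifulSoup

# A made-up table with one header row and two data rows
html_doc = """
<table class="table">
  <thead>
    <tr><th>#</th><th>Coin</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Alpha</td><td>$10.00</td></tr>
    <tr><td>2</td><td>Beta</td><td>$2.50</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table", attrs={"class": "table"})

headers = [th.text.strip() for th in table.thead.find_all("th")]

# All <tr> elements, including the one inside <thead>
all_rows = table.find_all("tr")
cells_per_row = [len(tr.find_all("td")) for tr in all_rows]

# Restricting the search to <tbody> skips the header row
body_rows = [
    [td.text.strip() for td in tr.find_all("td")]
    for tr in table.tbody.find_all("tr")
]

print(headers)
print(cells_per_row)
print(body_rows)
```

Searching only inside `table.tbody` assumes the page actually has a `<tbody>` element. Filtering out empty rows afterwards is an alternative that does not rely on it.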

In [74]:
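# Iterate over rows and get the data for each coin
data = list()
for row in table.find_all("tr"):
    col_data = row.find_all("td")

    temp = list()
    for col in col_data:
        # The coin cell appears to keep its name and symbol
        # inside an element with the "tw-flex-auto" class
        spans = col.select(".tw-flex-auto")
        if spans:
            for span in spans[0].find_all("span"):
                temp.append(span.text.strip())
        else:
            # Numeric cells: keep only digits and the decimal point.
            # Note that this also drops a minus sign if the text has one.
            for span in col.find_all("span"):
                temp.append(float(re.sub(r"[^0-9.]", "", span.text.strip())))

    # Skip rows without data, e.g. the header row, which has no <td> cells
    if temp:
        data.append(temp)

data[:2]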

[['Bitcoin',
  'BTC',
  19914.09,
  0.0,
  3.7,
  3.6,
  27068124639.0,
  382219128162.0,
  418733988129.0],
 ['Ethereum', 'ETH', 1347.54, 0.1, 4.3, 0.9, 8846089254.0, 162810730982.0]]
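The regular expression used above removes everything except digits and the decimal point, so it would also remove a minus sign, and `float()` would raise an error on a cell with no digits at all. Below is a slightly more defensive sketch. The helper name `parse_number` is hypothetical, and it assumes that negative values carry a leading `-` in the cell text, which may not be how the site actually renders them.

```python
import re
from typing import Optional


def parse_number(text: str) -> Optional[float]:
    """Convert text such as '$19,914.09' or '-3.7%' to a float.

    Returns None when the text does not contain a usable number.
    """
    # Keep digits, the decimal point and the minus sign; drop the rest
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        # Raised for leftovers like '', '-' or '1-2'
        return None


print(parse_number("$19,914.09"))
print(parse_number("-3.7%"))
print(parse_number("N/A"))
```

Returning `None` instead of raising keeps one odd cell from aborting the whole loop, but it also means missing values have to be handled later.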

In [75]:
[dict(zip(headers, v)) for v in data]

[{'': 'Bitcoin',
  '#': 'BTC',
  'Coin': 19914.09,
  'Price': 0.0,
  '1h': 3.7,
  '24h': 3.6,
  '7d': 27068124639.0,
  '24h Volume': 382219128162.0,
  'Mkt Cap': 418733988129.0},
 {'': 'Ethereum',
  '#': 'ETH',
  'Coin': 1347.54,
  'Price': 0.1,
  '1h': 4.3,
  '24h': 0.9,
  '7d': 8846089254.0,
  '24h Volume': 162810730982.0},
 {'': 'Tether',
  '#': 'USDT',
  'Coin': 0.999061,
  'Price': 0.0,
  '1h': 0.1,
  '24h': 0.2,
  '7d': 31670303985.0,
  '24h Volume': 67885610166.0},
 {'': 'BNB',
  '#': 'BNB',
  'Coin': 290.49,
  'Price': 0.3,
  '1h': 1.8,
  '24h': 5.4,
  '7d': 570182494.0,
  '24h Volume': 47450255431.0,
  'Mkt Cap': 47984919219.0},
 {'': 'USD Coin',
  '#': 'USDC',
  'Coin': 1.0,
  'Price': 0.1,
  '1h': 0.0,
  '24h': 0.1,
  '7d': 3177940647.0,
  '24h Volume': 47178443705.0},
 {'': 'XRP',
  '#': 'XRP',
  'Coin': 0.460894,
  'Price': 0.4,
  '1h': 4.5,
  '24h': 1.4,
  '7d': 1671017415.0,
  '24h Volume': 22975970912.0,
  'Mkt Cap': 46043414935.0},
 {'': 'Binance USD',
  '#': 'BUSD',
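Note that the keys in this output are shifted: the table's first two header cells (`''` and `'#'`) end up paired with the coin name and symbol, so the price appears under `'Coin'` and so on. One way to avoid this is to supply our own column names that match the values we actually collected. The sketch below is only an illustration: the column names are an assumption based on the order of the values, and `row_to_record` is a hypothetical helper.

```python
# Column names are an assumption based on the order of values in each row
columns = ["Coin", "Symbol", "Price", "1h", "24h", "7d", "24h Volume", "Mkt Cap"]

# One row copied from the scraped output above
row = ["Ethereum", "ETH", 1347.54, 0.1, 4.3, 0.9, 8846089254.0, 162810730982.0]


def row_to_record(row, columns):
    """Pair each value with its column name, refusing rows of unexpected length."""
    # zip() silently stops at the shorter input, so check the lengths first
    if len(row) != len(columns):
        raise ValueError(f"expected {len(columns)} values, got {len(row)}")
    return dict(zip(columns, row))


record = row_to_record(row, columns)
print(record["Coin"], record["Price"])
```

Rows with a different number of values, such as the Bitcoin row with nine, would raise an error here and need their own handling. That is arguably better than having values land under the wrong key without any warning.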
 

## Step 3: Create data table and save as CSV file

Let's wrap our data table in a pandas DataFrame and save it as a CSV file.

In [None]:
import pandas as pd
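As a minimal sketch of this step, the example below builds a DataFrame from a list of rows and converts it to CSV. The rows and column names are made up for illustration and are not the scraped data.

```python
import pandas as pd

# Made-up sample rows shaped like the scraped data
columns = ["Coin", "Symbol", "Price"]  # hypothetical column names
rows = [
    ["Alpha", "ALP", 10.0],
    ["Beta", "BET", 2.5],
]

df = pd.DataFrame(rows, columns=columns)

# index=False leaves the row index out of the CSV.
# Without a path argument, to_csv() returns the CSV as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write a file instead, pass a path as the first argument, for example `df.to_csv(join(data_dir, 'coins.csv'), index=False)`, where the file name is again only an example.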