# MEMO
The takeaways of this notebook are:


*   CSS selector
*   getting familiar with the html structure



# Web Scraping Tutorial

This tutorial will teach you how to use Python to scrape a web page and extract data from it. We will use two packages: `requests` to download the web page and `BeautifulSoup` to extract the data.

Many good references on web scraping are available online. I would recommend the following resources:
1. Automate the Boring Stuff with Python by Al Sweigart (2020) has a chapter on web scraping, which can be read [online](https://automatetheboringstuff.com/2e/chapter12/).
2. Web Scraping with Python by Ryan Mitchell (2018) is somewhat dated but provides a comprehensive guide to the topic.

## Step 0: Getting to know the web page

In this tutorial, we will try to extract the cryptocurrency market prices from the CoinGecko website https://www.coingecko.com/.

Your first step should always be to familiarize yourself with the website you want to scrape. Take a look at the website and try to inspect the HTML elements on the webpage.

## Step 1: Scrape a web page

Now we are ready to download the web page we want to get data from, using the `requests` package. We will use the following function and attributes:

* `requests.get('URL')` - make a request to the specified URL
* `r.status_code` - get the status code of the request
* `r.content` - get the binary content of the page

More functions in the `requests` package are available in [its documentation](https://requests.readthedocs.io/en/latest/).

In [6]:
# First, we will import the requests package
import requests

In [36]:
# Request the webpage
URL = 'https://www.coingecko.com/'
r = requests.get(URL, params={'page': 1})
r.status_code  # HTTP status code; 200-299 means a successful response
# r.content

200
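
To see what the `params` argument does without sending anything over the network, the request can be prepared locally. The following is a minimal sketch; the variable names `req` and `prepared` are illustrative and not part of the original notebook.

```python
import requests

URL = 'https://www.coingecko.com/'

# Build the request object without sending it
req = requests.Request('GET', URL, params={'page': 1})
prepared = req.prepare()

# The params dict is encoded into the query string of the final URL
print(prepared.url)
```

Printing the prepared URL is a quick way to check that query parameters are spelled and encoded as intended before making the real request.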

In [37]:
# Type of the response object we got back
type(r)

requests.models.Response

In [38]:
# Check the status code
r.status_code == requests.codes.ok

True
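
A status check is usually paired with a decision about what to do on failure. Below is a minimal sketch, assuming we want to treat any 4xx or 5xx response as an error. `describe_status` is a hypothetical helper, and the `Response` objects are built by hand here only so that the example runs without network access.

```python
import requests


def describe_status(response):
    """Return 'ok' for a successful response, or the error message otherwise."""
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        return f'failed: {err}'
    return 'ok'


# Hand-built responses stand in for real ones (illustration only)
good = requests.Response()
good.status_code = 200

missing = requests.Response()
missing.status_code = 404
missing.reason = 'Not Found'
missing.url = 'https://www.coingecko.com/no-such-page'  # hypothetical URL

print(describe_status(good))
print(describe_status(missing))
```

In a real script, `describe_status(r)` would be called on the object returned by `requests.get()`.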

In [39]:
# Get the Date header of the HTTP response
r.headers['Date']

'Tue, 04 Oct 2022 08:52:50 GMT'

In [40]:
# Get the content of the web page
r.text



In [41]:
# Save the content of the web page to the local drive
with open('coinGeko.html', 'wb') as fp:
  fp.write(r.content)
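
`r.content` is a `bytes` object, which is why the file above is opened in `'wb'` mode, while `r.text` is the same data decoded to `str`. The sketch below illustrates the round trip with a small hand-written HTML string standing in for the downloaded page; the file name `sample.html` and the UTF-8 encoding are assumptions.

```python
import os
import tempfile

# Stand-in for r.content: raw bytes as they would arrive from the server
content = '<html><head><title>Caf\u00e9 prices</title></head></html>'.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.html')

    # Bytes must be written in binary mode
    with open(path, 'wb') as fp:
        fp.write(content)

    # Reading in text mode decodes the bytes, so state the encoding explicitly
    with open(path, 'r', encoding='utf-8') as fp:
        text = fp.read()

print(type(content).__name__, type(text).__name__)
```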

## Step 2: Extract data from the web page

After downloading the web page and saving it to the local disk, we will use the `BeautifulSoup` package to parse the HTML file and access its content. We will use the following functions:

**1. Load the web page to BeautifulSoup**
* `soup = BeautifulSoup(html_doc, 'html.parser')` - parse the HTML content into a BeautifulSoup object

**2. Get the content of the element**
* `soup.title` - get the title of the page
* `soup.title.string` - get the string in the title tag
* `soup.h1` - get the H1 element in the web page
* `soup.h1.attrs` - get all attributes in the H1 element
* `soup.h1['class']` - get the class attribute in the H1 element

**3. Look for the element in the web page**
* `soup.find('HTML_tag')` - get the first element with the specified HTML tag
* `soup.find_all('HTML_tag')` - get the list of elements that have the specified HTML tag
* `soup.select('CSS_selector')` - get the list of elements with the specified [CSS selector](https://www.w3schools.com/cssref/css_selectors.asp)
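
The functions listed above can be tried on a small hand-written HTML snippet before touching the real page. This is only a sketch; the snippet and its class names (`headline`, `big`, `coin`) are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 class="headline big">Coins</h1>
    <ul>
      <li class="coin">Bitcoin</li>
      <li class="coin">Ethereum</li>
      <li>Not a coin</li>
    </ul>
  </body>
</html>
"""

demo_soup = BeautifulSoup(html_doc, 'html.parser')

print(demo_soup.title.string)   # text inside <title>
print(demo_soup.h1.attrs)       # all attributes of the first <h1>
print(demo_soup.h1['class'])    # class is multi-valued, so this is a list

first_li = demo_soup.find('li')        # first matching element only
all_li = demo_soup.find_all('li')      # every <li> element
coin_li = demo_soup.select('li.coin')  # CSS selector: <li> with class "coin"

print(first_li.text, len(all_li), [li.text for li in coin_li])
```

Note the difference between `find_all('li')`, which matches on the tag name alone, and `select('li.coin')`, which can also filter on the class.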

In [42]:
# First, we will import BeautifulSoup from the bs4 package
from bs4 import BeautifulSoup

In [49]:
# Read the saved HTML file back as text (assuming it is UTF-8 encoded)
with open('/content/coinGeko.html', 'r', encoding='utf-8') as f:
    contents = f.read()

In [50]:
# Load the web page and parse it with BeautifulSoup
soup = BeautifulSoup(contents, 'html.parser')

In [51]:
# Check the type of our soup object
type(soup)

bs4.BeautifulSoup

In [52]:
# Get all text in the web page
soup.text

'\n\n\n\n\n\nwindow.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"cd4a6493ab","applicationID":"83495717","transactionName":"dV5dRBNcDlkEEU5SDF9fQB8IXQZQGQ==","queueTime":1,"applicationTime":3194,"agent":""}\n(window.NREUM||(NREUM={})).init={ajax:{deny_list:["bam.nr-data.net"]}};(window.NREUM||(NREUM={})).loader_config={xpid:"VQ4EVVBUCBAIV1VbAgYGUQ==",licenseKey:"cd4a6493ab",applicationID:"83495717"};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var i=e[n]={exports:{}};t[n][0].call(i.exports,function(e){var i=t[n][1][e];return r(i||e)},i,i.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(t,e,n){function r(t){try{s.console&&console.log(t)}catch(e){}}var i,o=t("ee"),a=t(27),s={};try{i=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,i.indexOf("dev")!==-1&&(s.
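
As the output above shows, `soup.text` also returns the contents of `<script>` tags. One common way to get only the visible text is to remove those tags first. A minimal sketch on a made-up snippet (the HTML string is an assumption, not the CoinGecko page):

```python
from bs4 import BeautifulSoup

html_doc = (
    '<html><head><script>window.tracker = {};</script>'
    '<title>Demo</title></head>'
    '<body><h1> Prices </h1><p>Bitcoin</p></body></html>'
)
demo_soup = BeautifulSoup(html_doc, 'html.parser')

# Remove script and style elements from the parsed tree in place
for tag in demo_soup.find_all(['script', 'style']):
    tag.decompose()

# Join the remaining text fragments with a space and strip whitespace
visible_text = demo_soup.get_text(separator=' ', strip=True)
print(visible_text)
```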

In [53]:
# Get the title of the page
soup.title

<title>Cryptocurrency Prices, Charts, and Crypto Market Cap | CoinGecko (Pg1)</title>

In [55]:
# We can also get the page title using soup.find() function
soup.find('title')

<title>Cryptocurrency Prices, Charts, and Crypto Market Cap | CoinGecko (Pg1)</title>

In [60]:
# Other HTML tags work too
soup.h1.text.strip()

'Cryptocurrency Prices by Market Cap'

Now, we will extract the cryptocurrency market prices from the table.

In [61]:
# Get the table element in the web page
soup.table

<table class="sort table mb-0 text-sm text-lg-normal table-scrollable" data-target="gecko-table.table portfolios-v2.table" style="overflow-y: hidden;">
<thead>
<tr>
<th class="cg-sticky-col-header cg-sticky-first-col" data-sort-method="none"></th>
<th class="table-number cg-sticky-col-header cg-sticky-second-col">
#
</th>
<th class="coin-name text-left cg-sticky-col-header cg-sticky-third-col px-0">
Coin
</th>
<th class="price text-right pl-0" data-sort-method="number">
Price
</th>
<th class="change1h text-right col-market" data-sort-method="number" style="width: 70px">
1h
</th>
<th class="change24h text-right col-market" data-sort-method="number" style="width: 70px">
24h
</th>
<th class="change7d text-right col-market" data-sort-method="number" style="width: 70px">
7d
</th>
<th class="lit text-right col-market" data-sort-method="number">
24h Volume
</th>
<th class="cap text-right col-market" data-sort-method="number">
Mkt Cap
</th>
<th class="fdv text-right col-market tw-hidden" data-

In [73]:
# Get the table headers
for i in soup.table.tr.find_all('th'):
  print(i.text.strip())


#
Coin
Price
1h
24h
7d
24h Volume
Mkt Cap
FDV
Last 7 Days


In [131]:
# If more than one element matches the tag,
# use soup.find_all() to retrieve all of them as a list.
l = list()
# Iterate over the table rows and collect the data for each coin
for i in soup.find_all('tr'):
  data = i.find_all('td')
  tmp = list()
  for col in data:
    # Prefer the spans inside a '.tw-flex-auto' element when the cell has one
    coin_col = col.select('.tw-flex-auto')
    if len(coin_col) > 0:
      for j in coin_col[0].find_all('span'):
        tmp.append(j.text.strip())
    else:
      for j in col.find_all('span'):
        tmp.append(j.text.strip())
  l.append(tmp)
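
The same kind of row-by-row extraction can also be written with CSS selectors, using `select()` and `select_one()`. The sketch below runs on a tiny hand-written table; its markup and the class names `coin-name` and `price` are assumptions for illustration and do not match the real CoinGecko markup.

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
  <thead><tr><th>Coin</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td class="coin-name"><span>Bitcoin</span><span>BTC</span></td>
        <td class="price"><span>$19,683.71</span></td></tr>
    <tr><td class="coin-name"><span>Ethereum</span><span>ETH</span></td>
        <td class="price"><span>$1,331.33</span></td></tr>
  </tbody>
</table>
"""
demo_soup = BeautifulSoup(html_doc, 'html.parser')

rows = []
# 'tbody tr' matches only body rows, so the header row is skipped
for tr in demo_soup.select('tbody tr'):
    # 'td.coin-name span' matches every <span> inside the coin-name cell
    name, symbol = [s.get_text(strip=True) for s in tr.select('td.coin-name span')]
    # select_one() returns the first match (or None if nothing matches)
    price = tr.select_one('td.price span').get_text(strip=True)
    rows.append([name, symbol, price])

print(rows)
```

Restricting the selector to `tbody tr` avoids appending an empty list for the header row, which contains only `<th>` cells.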

In [137]:
# Build a DataFrame from the collected rows
import pandas as pd
df = pd.DataFrame(l)
df

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 |  |  |  |  |  |  |  |  |  |
| 1 | Bitcoin | BTC | $19,683.71 | 0.4% | 2.5% | 2.4% | $24,630,691,871 | $377,094,523,809 | $413,121,427,046 |
| 2 | Ethereum | ETH | $1,331.33 | 0.3% | 3.0% | -0.3% | $8,192,469,516 | $160,825,581,668 |  |
| 3 | Tether | USDT | $0.996346 | -0.4% | -0.4% | -0.5% | $28,988,625,368 | $67,887,752,933 |  |
| 4 | USD Coin | USDC | $0.994543 | -0.6% | -0.6% | -0.4% | $3,000,592,326 | $47,043,037,898 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 96 | Mina Protocol | MINA | $0.593424 | -0.1% | 2.4% | -0.0% | $8,262,854 | $413,805,366 |  |
| 97 | DeFiChain | DFI | $0.690817 | 0.5% | 0.9% | -8.3% | $3,062,706 | $411,863,224 |  |
| 98 | Compound | COMP | $59.69 | 0.5% | 1.5% | -4.7% | $30,431,359 | $409,406,149 | $597,142,689 |
| 99 | Ethereum Name Service | ENS | $15.52 | -0.7% | 7.1% | 4.7% | $55,729,601 | $400,467,480 | $1,553,208,921 |


## Step 3: Create data table and save as CSV file

Let's wrap our data table in a pandas DataFrame and save it as a CSV file.
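
Before saving, it can help to drop the empty header row, name the columns, and convert the price strings to numbers. The following is a sketch on two rows copied from the output above; the column names are assumed labels chosen for this example, not something the scraping code provides.

```python
import pandas as pd

# The empty header row plus two rows copied from the scraped output
raw_rows = [
    [],
    ['Bitcoin', 'BTC', '$19,683.71', '0.4%', '2.5%', '2.4%',
     '$24,630,691,871', '$377,094,523,809', '$413,121,427,046'],
    ['Ethereum', 'ETH', '$1,331.33', '0.3%', '3.0%', '-0.3%',
     '$8,192,469,516', '$160,825,581,668'],
]

# Assumed column labels (hypothetical, for illustration)
columns = ['Name', 'Symbol', 'Price', '1h', '24h', '7d',
           '24h Volume', 'Mkt Cap', 'FDV']
demo_df = pd.DataFrame(raw_rows, columns=columns)

# Drop rows where every cell is missing (the header row)
demo_df = demo_df.dropna(how='all').reset_index(drop=True)

# '$19,683.71' -> 19683.71
demo_df['Price'] = demo_df['Price'].str.replace(r'[$,]', '', regex=True).astype(float)
# '-0.3%' -> -0.3
demo_df['7d'] = demo_df['7d'].str.rstrip('%').astype(float)

print(demo_df[['Name', 'Symbol', 'Price', '7d']])
```

The same pattern could be applied to the other currency and percentage columns; missing values such as Ethereum's FDV stay missing.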

In [None]:
import pandas as pd

In [140]:
# Save the DataFrame as a CSV file
df.to_csv('/content/out.csv')
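
By default, `to_csv()` also writes the row index as an unnamed first column, which then shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, `index=False` leaves it out. A small sketch, writing to an in-memory buffer instead of `/content/out.csv` so that it runs anywhere (the two-row DataFrame is a stand-in for the real one):

```python
import io

import pandas as pd

demo_df = pd.DataFrame({'Name': ['Bitcoin', 'Ethereum'], 'Symbol': ['BTC', 'ETH']})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
demo_df.to_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)
print(header_without_index)

# Reading the second buffer back gives the original columns
without_index.seek(0)
round_trip = pd.read_csv(without_index)
print(list(round_trip.columns))
```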