In [1]:
%run common.ipynb


# imports
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# Web scraping

Sometimes the data we need is not readily available in a database or a CSV file. It may live on a single webpage or be scattered across several websites. Copying it by hand is tedious and, for large datasets, often infeasible. In these cases, we build our dataset by scraping websites.


What is web scraping?
Web scraping is the process of extracting data from websites. The term generally refers to automated extraction carried out by programs such as ‘web crawlers’.

### Open Source tools available for web scraping

**BeautifulSoup** 
Documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
“Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with a parser to provide idiomatic ways of navigating, searching, and modifying the parse tree”. 
BeautifulSoup is beginner-friendly and well suited to smaller projects.

**Scrapy**
Documentation: https://docs.scrapy.org/en/latest/
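Scrapy is a crawling framework that handles requests asynchronously, so it is generally faster than a hand-written BeautifulSoup script when many pages are involved.
Scrapy is more complex and is aimed at larger projects.
On its own it does not execute JavaScript; dynamic pages need an additional rendering tool such as Splash or Playwright.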

To start with, let’s learn to scrape the web using BeautifulSoup.


## Before web scraping... 
1. Check the website's terms and conditions and privacy policy to see whether you are allowed to scrape it. This is a crucial step.
2. Inspect the website using your browser’s web development tools.
3. Take note of the available data and how much of it your project needs. If the page is dynamic and uses JavaScript to load its content, disable JavaScript to see how much of the data you need remains. There are also several techniques for extracting JavaScript-loaded content.
4. Design the schema for storing the extracted data. It can be stored as .csv, .json, or .xml files, or in databases such as SQLite and MongoDB.


Disable JavaScript in Chrome: 
https://developer.chrome.com/docs/devtools/javascript/disable/
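
Step 4 can be made concrete with the standard library alone. The sketch below writes the same two sample rows to CSV, JSON, and an in-memory SQLite table. The field names mirror the ones this notebook collects; the table name `chart` and the column types are placeholders chosen for illustration, not a required layout.

```python
import csv
import io
import json
import sqlite3

# Two sample rows with the fields this notebook collects.
rows = [
    {'title': 'About Damn Time', 'artist': 'Lizzo',
     'last_wk': 2, 'peak_pos': 1, 'wks_on_chart': 14},
    {'title': 'As It Was', 'artist': 'Harry Styles',
     'last_wk': 1, 'peak_pos': 1, 'wks_on_chart': 16},
]
fields = list(rows[0])

# CSV: one header line, then one line per row (written to memory here).
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# JSON: the list of dictionaries serializes directly.
json_text = json.dumps(rows, indent=2)

# SQLite: an explicit table schema with typed columns.
# The table name "chart" is a placeholder.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE chart ('
    'title TEXT, artist TEXT, last_wk INTEGER, '
    'peak_pos INTEGER, wks_on_chart INTEGER)'
)
conn.executemany(
    'INSERT INTO chart VALUES '
    '(:title, :artist, :last_wk, :peak_pos, :wks_on_chart)',
    rows,
)
stored = conn.execute('SELECT title, peak_pos FROM chart').fetchall()
print(stored)
```

Flat files such as CSV are easy to share and inspect, while a database table makes the column types explicit and supports queries across several weekly snapshots.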

## Traversing an HTML document

### Structure of an HTML document

HTML and XML documents are built from nested tags.

Basic HTML tags
* `<html>`
* `<head>`
* `<title>`
* `<body>`
* `<p>`
* `<a>`
* `<span>`
* `<div>`
* `<ul>`
* `<ol>`
* `<li>`

https://www.w3schools.com/TAgs/default.asp

### Tree of nodes

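HTML and XML documents are treated as trees of nodes. 
1. **Root element**: the topmost element of the tree.
2. **Parent node**: the node that directly contains a given node.
3. **Child node**: a node directly contained in another node.
4. **Sibling node**: a node that shares its parent with another node.
5. **Descendant node**: any node nested inside a given node, at any depth.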

https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
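
The node relationships listed above can be explored on a small hand-written document. This is a minimal sketch: the markup is invented for illustration, and it uses Python's built-in `html.parser` so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hand-written markup, kept on one line so no whitespace nodes appear.
html_doc = (
    '<html><head><title>Chart</title></head>'
    '<body><ul><li>First</li><li>Second</li></ul></body></html>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

root = soup.html              # root element of the document
first_li = soup.find('li')    # first <li> in document order

# Parent: the node that directly contains first_li.
parent_name = first_li.parent.name
# Children: nodes directly under the root.
child_names = [child.name for child in root.children]
# Sibling: the next <li> under the same parent.
sibling_text = first_li.find_next_sibling('li').text
# Descendants: every tag nested under the root, at any depth.
descendant_names = [tag.name for tag in root.find_all(True)]

print(parent_name, child_names, sibling_text, descendant_names)
```

On real pages the markup usually contains line breaks and indentation, so `.children` also yields whitespace text nodes between the tags.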

## BeautifulSoup

BeautifulSoup transforms a complex HTML document into a complex tree of Python objects.

HTML parser: HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values.
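
To see the tokens mentioned above, the following sketch feeds one tag to Python's built-in `html.parser.HTMLParser` and records what it emits. The class name `TokenLogger` and the sample markup are made up for this example, and the sketch covers only the tokenization side, not tree construction.

```python
from html.parser import HTMLParser


class TokenLogger(HTMLParser):
    """Collect the tokens the parser emits, in order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.tokens.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(('end', tag))

    def handle_data(self, data):
        # skip whitespace-only text between tags
        if data.strip():
            self.tokens.append(('text', data.strip()))


logger = TokenLogger()
logger.feed('<a href="/charts" class="link">Hot 100</a>')
logger.close()
for token in logger.tokens:
    print(token)
```

A tree builder such as the one inside BeautifulSoup consumes a token stream like this and nests each element under its parent.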

Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser (which is recommended by BeautifulSoup).

`soup = BeautifulSoup(html_text, 'lxml')`
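
Because lxml is a separate install, one possible pattern is to fall back to the built-in parser when it is missing. The helper name `make_soup` is hypothetical, and the sample paragraph is invented for the example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html_text):
    """Parse with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(html_text, 'lxml')
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python
        return BeautifulSoup(html_text, 'html.parser')


soup = make_soup('<p class="week">Week of July 30, 2022</p>')
print(soup.p.text)
```

The two parsers can build slightly different trees from invalid markup, so it is safer to stick with one parser within a project.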

For this exercise we will scrape the **Billboard Hot 100 data** from https://www.billboard.com/charts/hot-100/


In [2]:
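# url of webpage to scrape
url = 'https://www.billboard.com/charts/hot-100/'

# fetch the page; fail early on an HTTP error instead of parsing an error page
response = requests.get(url, timeout=10)
response.raise_for_status()
html_text = response.text

# parse the HTML with the lxml parser
soup = BeautifulSoup(html_text, 'lxml')

print(soup.prettify())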


<!DOCTYPE html>
<!--[if IE 6]>
<html id="ie6" lang="en-US">
<![endif]-->
<!--[if IE 7]>
<html id="ie7" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html id="ie8" lang="en-US">
<![endif]-->
<!--[if !(IE 6) | !(IE 7) | !(IE 8) ]><!-->
<html lang="en-US">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="#ffffff" name="theme-color"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <!-- Add to home screen for iOS -->
  <meta content="black-translucent" name="apple-mobile-web-app-status-bar-style"/>
  <link href="https://www.billboard.com/wp-content/themes/vip/pmc-billboard-2021/assets/app/icons/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
  <!-- Tile icons for Windows -->
  <meta content="https://www.billboard.com/wp-content/themes/vip/pmc-billboard-2021/assets/app/browserconfig.xml" name="msapplication-config"/>
  <meta content="https://www.billboard.com/wp-cont
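
For repeated downloads it can help to wrap the request in a small function. The sketch below is one way to do that; the name `fetch_html`, the User-Agent string, and the 10-second timeout are assumptions made for illustration. The `get` parameter exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising on HTTP errors."""
    # Placeholder identifier; replace with one that describes your project.
    headers = {'User-Agent': 'billboard-tutorial/0.1'}
    response = get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` then returns the same kind of string that is passed to `BeautifulSoup` in this notebook.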

Next we inspect the HTML tree structure to find the data we are looking for. Each chart row is a `div` tag with the CSS class `o-chart-results-list-row-container`.

In [3]:
# find all row elements
div = soup.find_all('div', class_='o-chart-results-list-row-container')

# check if all 100 rows are in the list
print(len(div))

100
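
The class-based lookup above can be tried offline on a small snippet. The markup and the class names `row`, `highlighted`, and `footer` below are invented for illustration; only `find_all` and `select` are real BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# Hand-written markup for illustration.
html_doc = '''
<div class="row highlighted"><h3>Song A</h3></div>
<div class="row"><h3>Song B</h3></div>
<div class="footer"><h3>Not a song</h3></div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# class_ matches any element whose class list contains "row",
# so the first div matches even though it has a second class.
rows = soup.find_all('div', class_='row')
titles = [row.h3.text for row in rows]

# The same selection written as a CSS selector.
css_titles = [h3.text for h3 in soup.select('div.row h3')]

print(len(rows), titles, css_titles)
```

Checking the length of the result, as the cell above does, is a quick way to confirm that the selector matched what was expected.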


## Schema
In this exercise, we will store the scraped data in the following dictionary:

In [4]:
# schema
billboard100 = { 
    'title': [],              # song title
    'artist': [],             # song artist
    'last_wk': [],            # rank from last week
    'peak_pos': [],           # peak rank position
    'wks_on_chart': []        # number of weeks on chart
}
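
The dictionary above is column-oriented: each key holds one list, and position `i` in every list belongs to the same song. The sketch below uses two sample values per column to show the alignment check this layout relies on and the equivalent row-oriented view. The variable names are illustrative only.

```python
# Column-oriented: one list per field.
columns = {
    'title': ['About Damn Time', 'As It Was'],
    'artist': ['Lizzo', 'Harry Styles'],
}

# Every list must have the same length, or rows would be misaligned.
lengths = {len(values) for values in columns.values()}
assert len(lengths) == 1, 'columns have different lengths'

# Row-oriented view: one dictionary per song.
records = [dict(zip(columns, values)) for values in zip(*columns.values())]
print(records)
```

`pd.DataFrame` accepts either layout, so the choice mostly depends on whether the scraper appends one field at a time or one row at a time.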

## Scrape
After inspecting the HTML document and noting the nodes that hold the required data, we extract the values and store them in the empty Python dictionary created earlier.

In [5]:
def clean(text):
    # drop the newlines and tabs that pad the text in the page source
    return text.replace('\n', '').replace('\t', '')

for row in div:

    # the nested <li> elements hold the title/artist block and the chart statistics
    items = row.ul.find('li', class_='lrv-u-width-100p')
    list_items = items.find_all('li')

    billboard100['title'].append(clean(list_items[0].h3.text))
    billboard100['artist'].append(clean(list_items[0].span.text))
    billboard100['last_wk'].append(clean(list_items[3].text))
    billboard100['peak_pos'].append(clean(list_items[4].text))
    billboard100['wks_on_chart'].append(clean(list_items[5].text))

# store in a pandas DataFrame
df = pd.DataFrame(billboard100)
df

| | title | artist | last_wk | peak_pos | wks_on_chart |
|---|---|---|---|---|---|
| 0 | About Damn Time | Lizzo | 2 | 1 | 14 |
| 1 | As It Was | Harry Styles | 1 | 1 | 16 |
| 2 | Running Up That Hill (A Deal With God) | Kate Bush | 4 | 3 | 28 |
| 3 | First Class | Jack Harlow | 3 | 1 | 15 |
| 4 | Wait For U | Future Featuring Drake & Tems | 5 | 1 | 12 |
| ... | ... | ... | ... | ... | ... |
| 95 | Arson | j-hope | - | 96 | 1 |
| 96 | Right On | Lil Baby | - | 13 | 14 |
| 97 | Cash In Cash Out | Pharrell Williams Featuring 21 Savage & Tyler,... | 99 | 26 | 6 |
| 98 | La Corriente | Bad Bunny & Tony Dize | - | 32 | 8 |
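
Every scraped value is a string, and `-` appears where a song has no rank for the previous week. One way to get numeric columns is sketched below on a few sample values taken from the output above. Converting `-` to a missing value is an assumption about how you want to treat those rows.

```python
import pandas as pd

# Sample values copied from the output above; every value is a string.
sample = pd.DataFrame({
    'title': ['About Damn Time', 'Arson'],
    'last_wk': ['2', '-'],
    'peak_pos': ['1', '96'],
    'wks_on_chart': ['14', '1'],
})

numeric_cols = ['last_wk', 'peak_pos', 'wks_on_chart']
for col in numeric_cols:
    # errors='coerce' turns non-numeric text such as '-' into NaN;
    # the nullable Int64 dtype keeps the remaining values as integers.
    sample[col] = pd.to_numeric(sample[col], errors='coerce').astype('Int64')

print(sample.dtypes)
```

With numeric columns, operations such as sorting by `peak_pos` or filtering on `wks_on_chart` behave as expected instead of comparing text.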


## Export

In [6]:
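# export results to a CSV file

# locate the chart container; its first <p> holds the chart week
results = soup.find('div', class_=re.compile('chart-results'))
# build a file-name-friendly label: spaces become underscores, commas are dropped
date = results.find('p').text.replace(' ', '_').replace(',', '')
print(date)
df.to_csv(f'billboard100_{date}.csv')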

Week_of_July_30_2022
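
If several weekly snapshots are saved, a file name based on an ISO date sorts chronologically. The sketch below assumes the label on the page reads like `Week of July 30, 2022`, which is inferred from the printed value above; the function name `chart_filename` is hypothetical.

```python
from datetime import datetime


def chart_filename(label, prefix='billboard100'):
    """Turn a label like 'Week of July 30, 2022' into a dated file name."""
    date_part = label.replace('Week of', '').strip()
    # %B is the full month name (English in the default C locale)
    chart_date = datetime.strptime(date_part, '%B %d, %Y').date()
    return f'{prefix}_{chart_date.isoformat()}.csv'


print(chart_filename('Week of July 30, 2022'))
# billboard100_2022-07-30.csv
```

`datetime.strptime` raises `ValueError` when the label does not match the expected pattern, which also serves as an early warning that the page layout has changed.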
