# HTML Parsing
<hr style="border:2px solid black">

## 1. Introduction

**HTML string?**

- the inherent hierarchical structure of an HTML file is lost when it is treated as a plain string
- extracting data from it then requires many complex, brittle regular expressions

**`BeautifulSoup`** 

- Python library that helps parse HTML code without requiring complex regex
- often used along with requests library (to parse HTML pages downloaded from websites)
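A minimal sketch of the contrast described above, assuming a made-up HTML snippet (the `snippet` string and the `minor` class are illustrative, not taken from a real page). A simple regular expression misses a list item as soon as it carries an attribute, while the parser finds both items:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <li> carries an extra attribute
snippet = '<ul class="cast"><li>Hamlet</li><li class="minor">Ophelia</li></ul>'

# Regex approach: the pattern only matches <li> tags without attributes
regex_items = re.findall(r"<li>(.*?)</li>", snippet)

# Parser approach: every <li> tag is found, whatever attributes it has
soup_items = [li.get_text() for li in BeautifulSoup(snippet, "html.parser").find_all("li")]

print(regex_items)  # ['Hamlet']
print(soup_items)   # ['Hamlet', 'Ophelia']
```

The regex could be extended to cover attributes, but each new variation in the markup needs another patch to the pattern.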

**load packages**

In [1]:
# data processing libraries
import numpy as np
import pandas as pd

# web scraping and HTML parsing libraries
import requests
import re
from bs4 import BeautifulSoup

# other libraries
import time

**installation (if required)**

`pip install beautifulsoup4`

`conda install -c conda-forge beautifulsoup4 -y`

### Warmup Exercise

In [2]:
html = """<html><head></head><body>
<h1>Header</h1>
<ul class="cast"> 
  <li>Hamlet</li>
  <li>Polonius</li>
  <li>Ophelia</li>
  <li>Claudius</li>
  <li>Ajun</li>
</ul>
<ul class="authors">
  <li>William Shakespeare</li>
</ul>
</body></html>"""

In [3]:
soup = BeautifulSoup(html,'html.parser')

In [15]:
# find() returns only the first matching tag
# find_all() returns every matching tag
soup.find('ul')

<ul class="cast">
<li>Hamlet</li>
<li>Polonius</li>
<li>Ophelia</li>
<li>Claudius</li>
<li>Ajun</li>
</ul>

In [5]:
# find_all()
# "class" is a reserved keyword in Python, so BeautifulSoup uses the keyword argument class_ (with a trailing underscore) instead
soup.find_all('ul', class_='cast')

[<ul class="cast">
 <li>Hamlet</li>
 <li>Polonius</li>
 <li>Ophelia</li>
 <li>Claudius</li>
 <li>Ajun</li>
 </ul>]

In [6]:
# .text
soup_text = soup.find('ul', class_='cast').text
soup_text 

'\nHamlet\nPolonius\nOphelia\nClaudius\nAjun\n'

In [7]:
stripped_text = soup_text.strip()
stripped_text

'Hamlet\nPolonius\nOphelia\nClaudius\nAjun'

In [8]:
stripped_text.split("\n")

['Hamlet', 'Polonius', 'Ophelia', 'Claudius', 'Ajun']

In [16]:
# .find_all(), .get(), .get_text() 

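items = []  # stays empty if no <ul class="cast"> is found
for ul in soup.find_all('ul'):
    # .get('class') returns a list of class names; default to [] if the attribute is missing
    if "cast" in ul.get('class', []):
        items = [item.get_text() for item in ul.find_all('li')]
items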

['Hamlet', 'Polonius', 'Ophelia', 'Claudius', 'Ajun']

### Teasers 

- what is the data type of the HTML document?

- what does the `find_all()` function return?

- what does the argument of the `find_all()` function refer to?

- what does the argument of the `get()` function refer to?

- what does the `get_text()` function extract?

- how would you extract the title of the play?

<hr style="border:2px solid black">

## 2. Using BeautifulSoup

- there are multiple ways to get to the solution!
- one can navigate the html tree down to the tag that contains the desired data
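As a sketch of what "multiple ways" can look like, the example below reaches list items by three routes: chained `find()` calls, dot access to the first matching descendant, and a CSS selector via `select()`. The `html_doc` string is a shortened version of the warmup HTML and is only an illustration.

```python
from bs4 import BeautifulSoup

html_doc = """<html><body>
<ul class="cast"><li>Hamlet</li><li>Ophelia</li></ul>
<ul class="authors"><li>William Shakespeare</li></ul>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Route 1: chain find() calls down the tree
first_author = soup.find("ul", class_="authors").find("li").get_text()

# Route 2: dot access returns the first descendant with that tag name
first_cast = soup.body.ul.li.get_text()

# Route 3: a CSS selector returns all matching tags
cast = [li.get_text() for li in soup.select("ul.cast li")]

print(first_author)  # William Shakespeare
print(first_cast)    # Hamlet
print(cast)          # ['Hamlet', 'Ophelia']
```

Dot access is the shortest to write but only ever sees the first match, so `find()` with attributes or `select()` is usually the safer choice on real pages.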

|method|return|
|:--:|:--:|
|`.find()`|the first instance of your query|
|`.find_all()`|a list-like object (`ResultSet`) containing every matching tag; empty if nothing matches|
|`.text`/`.get_text()`|the text content of a tag and its descendants, without the markup in < angled brackets >|
|`.get(attribute_name)`|the value corresponding to a certain attribute|
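The methods in the table can be tried on a one-line document. This is a small sketch with an invented `<p>` tag; it also shows what comes back when nothing matches, which is worth checking before chaining further calls.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for illustration
soup = BeautifulSoup('<p id="intro" class="lead big">Hello <b>world</b></p>', "html.parser")
p = soup.find("p")

print(p.get("id"))             # intro
print(p.get("class"))          # ['lead', 'big'] (class is multi-valued, so a list)
print(p.get_text())            # Hello world
print(p.get("href"))           # None: the attribute is missing
print(soup.find("table"))      # None: no matching tag
print(soup.find_all("table"))  # []: an empty ResultSet
```

Because `find()` can return `None`, a chained call such as `soup.find("table").get_text()` raises an `AttributeError` when the tag is absent.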

### Task: Extract topic and description from the DS curriculum

From the <a href="http://www.spiced-academy.com/en/program/data-science">SPICED Academy Data Science Page</a>, extract the topic and the text description for each part of the curriculum.
<br>



<img src="curriculum.png">

**Download the HTML text from the website**

+ Use the `requests` library
+ url: https://www.spiced-academy.com/en/program/data-science



In [10]:
url = 'https://www.spiced-academy.com/en/program/data-science'

#header for request
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

**send an HTTP GET request to the above address**

In [11]:
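# pass the headers defined above so the request carries a browser-like User-Agent
response = requests.get(url=url, headers=headers)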

In [12]:
response.status_code

200

**check the response status**

In [17]:
response

<Response [200]>
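Beyond reading `status_code` by eye, `requests` offers `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses. The sketch below wraps it in a hypothetical helper, `describe_response`, and runs it on hand-built `Response` objects so that it works offline; in the notebook you would pass the real `response` instead.

```python
import requests

def describe_response(response):
    """Return a short message saying whether the request succeeded."""
    try:
        response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    except requests.HTTPError as error:
        return f"request failed: {error}"
    return "request succeeded"

# Hand-built responses, for illustration only (no network access needed)
fake_ok = requests.Response()
fake_ok.status_code = 200

fake_missing = requests.Response()
fake_missing.status_code = 404

print(describe_response(fake_ok))
print(describe_response(fake_missing))
```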

**extract the html text data from the response**

In [18]:
spiced_html = response.text
print(spiced_html)

<!DOCTYPE html>
<html lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#" dir="ltr">

<head>
    <title>Data Science Bootcamp in Germany | Spiced Academy</title>
    <meta name="description" content="Get equipped with the most in-demand technologies. Led by domain experts.">
    
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <link rel='stylesheet' href='/css/main.css'>
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png?v=3">
    <link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png?v=3">
    <link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png?v=3">
    <link rel="mask-icon" href="/safari-pinned-tab.svg" color="#5bbad5">
    <meta name="msapplication-TileColor" content="#da532c">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta name="google-site-verification" content="DICJgJwgfNeeiq66MAwDzpaeqBBgDqZsBNFBMwVtoaY">
    <met

In [19]:
# What is the type of the variable?
type(spiced_html)

str

**convert the raw HTML into a BeautifulSoup object**

In [20]:
spiced_soup=BeautifulSoup(
    markup=spiced_html,
    features='html.parser'
)

In [21]:
spiced_soup

<!DOCTYPE html>

<html dir="ltr" lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
<head>
<title>Data Science Bootcamp in Germany | Spiced Academy</title>
<meta content="Get equipped with the most in-demand technologies. Led by domain experts." name="description"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport">
<link href="/css/main.css" rel="stylesheet"/>
<link href="/apple-touch-icon.png?v=3" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png?v=3" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png?v=3" rel="icon" sizes="16x16" type="image/png"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c" name="msapplication-TileColor"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="DICJgJwgfNeeiq66MAwDzpaeqBBgDqZsBNFBMwVtoaY" name="google-site-verification"/>
<meta content="#2E016D" name="theme-color">
<meta con

In [22]:
type(spiced_soup)

bs4.BeautifulSoup

In [23]:
type(spiced_soup.find(name='div', attrs={'class':'curriculum-mini-section'}))

bs4.element.Tag

In [24]:
spiced_soup.find(name='div', attrs={'class':'curriculum-mini-section'})

<div class="curriculum-mini-section">
<div class="shape-box-small">
<div class="shape-box-small--half-box"></div>
<div class="half-box-border"></div>
</div> <div class="description">
<h3>Data Analysis in Python</h3>
<h3 class="mob-hidden">Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas &amp; NumPy.</h3>
</div>
</div>

In [25]:
spiced_soup.find(name='div', attrs={'class':'curriculum-mini-section'}).find(name='h3').get_text()

'Data Analysis in Python'

In [26]:
spiced_soup.find(name='div', attrs={'class':'curriculum-mini-section'}).find(name='h3', attrs={'class':'mob-hidden'}).get_text()

'Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas & NumPy.'

In [27]:
ds_curriculum=spiced_soup.find_all(name='div', attrs={'class':'curriculum-mini-section'})
ds_curriculum

[<div class="curriculum-mini-section">
 <div class="shape-box-small">
 <div class="shape-box-small--half-box"></div>
 <div class="half-box-border"></div>
 </div> <div class="description">
 <h3>Data Analysis in Python</h3>
 <h3 class="mob-hidden">Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas &amp; NumPy.</h3>
 </div>
 </div>,
 <div class="curriculum-mini-section">
 <div class="shape-box-small">
 <div class="shape-box-small--square-border"></div>
 </div> <div class="description">
 <h3>Machine Learning</h3>
 <h3 class="mob-hidden">Delve into the world of supervised and unsupervised learning with the scikit-learn and statsmodels frameworks.</h3>
 </div>
 </div>,
 <div class="curriculum-mini-section">
 <div class="shape-box-small">
 <div class="shape-box-small--half-box half-box-dark-purple"></div>
 <div class="shape-box-small--half-box"></div>
 </div> <div class="description">
 <h3>PostgreSQL</h3>
 <h3 class="mob-hidden">Or

In [28]:
type(ds_curriculum)

bs4.element.ResultSet

In [None]:
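for index, element in enumerate(ds_curriculum):  # enumerate yields (index, element) pairs
    # the first <h3> holds the topic; the <h3 class="mob-hidden"> holds the description
    topic = element.find(name="h3").get_text()
    description = element.find(name="h3", attrs={"class": "mob-hidden"}).get_text()
    # build a file name from the topic, e.g. "data_analysis_in_python.txt"
    filename = f"{topic.lower().replace(' ', '_')}.txt"
    print(index, topic)
    print(description, filename)
    # write each description to its own text file
    with open(filename, "w", encoding="utf-8") as file:
        file.write(description)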
        

In [30]:
description

'Use Git and GitHub throughout the course to collaborate & version control your code.'
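The loop above builds file names with `topic.lower().replace(' ', '_')`, which works for simple titles but would keep characters such as `&` or `/` if a topic contained them. A possible hardening, sketched here with a hypothetical helper named `slugify` (not part of the original notebook), collapses every run of non-alphanumeric characters into a single underscore:

```python
import re

def slugify(title):
    """Turn a title into a lowercase, underscore-separated file name stem."""
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower())
    return slug.strip("_")  # drop underscores left at either end

print(slugify("Data Analysis in Python"))  # data_analysis_in_python
print(slugify("Git & GitHub / CI"))        # git_github_ci
```

The file name would then be `f"{slugify(topic)}.txt"`.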

### Task 190423 - Create the CSV file

In [None]:
# to save the song lines into CSV files
# split("\n") to split the lyrics into lines
# pd.DataFrame({"songline": [], "singer": []})
# stratify = df["singer"]
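The notes above can be sketched as follows. The lyrics text and singer name are invented placeholders, and the sketch uses the standard-library `csv` module writing to an in-memory buffer so that it runs without extra dependencies; with pandas, `pd.DataFrame(rows).to_csv(...)` would play the same role.

```python
import csv
import io

# Placeholder data, for illustration only
lyrics = "Line one\nLine two\n\nLine three"
singer = "example_singer"

# Split the lyrics into lines and drop empty ones
song_lines = [line.strip() for line in lyrics.split("\n") if line.strip()]

# One row per song line, labelled with the singer
rows = [{"songline": line, "singer": singer} for line in song_lines]

# Write to an in-memory buffer; use open("songs.csv", "w", newline="") for a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["songline", "singer"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```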

<hr style="border:2px solid black">

## 3. Exercise: Build Lyrics Corpus

1. Find out where the lyrics start and end in the HTML code
2. Extract the lyrics as a string using BeautifulSoup
3. Write a loop that extracts the lyrics of all songs
4. Collect all **song-line** strings in a list
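The four steps can be sketched on invented pages. The assumption here is that the lyrics sit in a `<div class="lyrics">` with lines separated by `<br/>` tags; real lyrics sites use different markup, so the tag and class names must be found by inspecting the actual HTML (step 1).

```python
from bs4 import BeautifulSoup

# Hypothetical song pages; in the exercise these would be downloaded HTML strings
pages = [
    '<div class="nav">Menu</div><div class="lyrics">Line A1<br/>Line A2</div>',
    '<div class="lyrics">Line B1<br/>Line B2<br/>Line B3</div>',
]

all_lines = []
for page in pages:
    soup = BeautifulSoup(page, "html.parser")
    block = soup.find("div", class_="lyrics")  # step 1: where the lyrics live
    if block is None:
        continue  # skip pages without a lyrics block
    # step 2: extract the lyrics as one string, one song line per text line
    text = block.get_text(separator="\n", strip=True)
    # steps 3 and 4: collect the song lines of every page in one list
    all_lines.extend(text.split("\n"))

print(all_lines)  # ['Line A1', 'Line A2', 'Line B1', 'Line B2', 'Line B3']
```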

<hr style="border:2px solid black">

### References

- [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)