## Parsing HTML

#### <span style="background-color: #FFFF00">**Beautiful Soup**</span> is a Python library that parses HTML code for you, so you don't have to write a lot of complex regular expressions yourself.

Beautiful Soup is often used together with the *requests* library, since its primary purpose is to parse HTML pages that are downloaded from websites.


In [3]:
import requests

In [4]:
response = requests.get('https://www.spiced-academy.com/en/program/data-science')

In [6]:
response.status_code

200

In [9]:
# response.text  # gives the HTML text

#### When we read the content, or "text", of an HTTP response in Python, we get the raw HTML code of that webpage.
- More specifically, `response.text` is a Python string.
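
A minimal sketch of the request step with a timeout and an explicit status check added. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration; they are not part of the notebook above.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a Python string."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx responses
    # instead of silently parsing an error page.
    response.raise_for_status()
    return response.text  # str, decoded using the response's encoding


# Example call (requires network access):
# html = fetch_html('https://www.spiced-academy.com/en/program/data-science')
```

Checking the status before parsing avoids scraping an error page by accident.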

#### Let's make our lives a bit easier and convert this Python string to a more meaningful object that understands the structure of HTML.

In [10]:
from bs4 import BeautifulSoup

#### Convert the raw HTML string to a BeautifulSoup object, so that you can parse the data.

In [13]:
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning
soup

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
<head>
<title>Data Science Bootcamp in Germany | Spiced Academy</title>
<meta content="Level up your career with our 12 week Data Science bootcamp in Germany." name="description"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="https://fonts.googleapis.com/css?family=Poppins:300,400,600&amp;display=swap" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono:400,500&amp;display=swap" rel="stylesheet"/>
<link href="/css/main.css" rel="stylesheet"/>
<link href="/apple-touch-icon.png?v=3" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png?v=3" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png?v=3" rel="icon" sizes="16x16" type="image/png"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c" name="msapplication-TileCo

In [14]:
type(soup)
# this BeautifulSoup object encodes the hierarchy/structure of the underlying HTML code
# i.e. it's no longer a simple Python string

bs4.BeautifulSoup

#### With our "beautiful soup-ified" object, we can now query the pieces of the HTML "text" that we care about much more easily. We no longer have to write regular expressions from scratch.

- `.find()` returns the first match for your "query", or `None` if nothing matches
- `.find_all()` returns a list-like object (called a "ResultSet") that contains every matching result; it is empty if nothing matches
- `.text` returns the text inside a tag, i.e. the part that is outside of the **angle brackets**
    

<span style="background-color:lightyellow">Hint: A good way to identify elements of an HTML page is by their **class** or **id**. In HTML, the difference between *class* and *id* is that an *id* should be unique within a page, while a *class* can be shared by many elements.
    </span>
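
The three calls listed above can be tried on a small inline snippet. This is only a sketch: the HTML string and its `note` class and `intro` id are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only
html = """
<div id="intro" class="box">
  <p class="note">First note</p>
  <p class="note">Second note</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('p', class_='note')          # first match only
all_notes = soup.find_all('p', class_='note')  # ResultSet with every match
intro = soup.find(id='intro')                  # lookup by id

print(first.text)                    # First note
print([p.text for p in all_notes])   # ['First note', 'Second note']
print(intro.name)                    # div

# No match: find() returns None, find_all() returns an empty ResultSet
print(soup.find('table'))            # None
print(len(soup.find_all('table')))   # 0
```

`class_` carries a trailing underscore because `class` is a reserved word in Python; the `attrs={'class': ...}` form used in this notebook is equivalent.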

In [17]:
soup.head.title.text
# we can already descend/ navigate the tree by accessing nested properties through
# "object-oriented" dot notation.

'Data Science Bootcamp in Germany | Spiced Academy'

#### We can use the `.find()` method to jump straight down to the part of the HTML tree that we care about.

#### We can search for specific tags:
- h1, h2, h3, ..., h6
- p
- a
- div
- body
- table

#### Tags also have attributes within them, such as:
- class
- id

In [21]:
soup.find('h3', attrs={'class': 'mob-hidden'}).text
# so now we can perform a more specific query

'Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas & NumPy.'

In [26]:
results = soup.find_all('h3', attrs={'class': 'mob-hidden'})
# note that find_all returns every match in a list-like ResultSet that you can iterate over

In [27]:
type(results)

bs4.element.ResultSet

In [32]:
paragraphs = []
for i in results:
    paragraphs.append(i.text)

paragraphs

['Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas & NumPy.',
 'Delve into the world of supervised and unsupervised learning with the scikit-learn and statsmodels frameworks.',
 'Organize data in SQL databases like PostgreSQL, fill it with data and run queries.',
 'Deploy code on remote servers using Docker and AWS, and build an online dashboard.',
 'Acquire state-of-the-art engineering tools to write, test, and deploy bigger Python applications.',
 'Use Git and GitHub throughout the course to collaborate & version control your code.']
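
The loop above can also be written as a list comprehension. The sketch below uses a made-up two-element snippet instead of the live page, and swaps in `get_text(strip=True)`, which additionally trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the structure of the page above
html = """
<h3 class="mob-hidden"> Organize data in SQL databases. </h3>
<h3 class="mob-hidden">Deploy code on remote servers.</h3>
"""
soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('h3', attrs={'class': 'mob-hidden'})

# One-line version of the for-loop with .append();
# get_text(strip=True) also removes leading/trailing whitespace.
paragraphs = [tag.get_text(strip=True) for tag in results]
print(paragraphs)
# ['Organize data in SQL databases.', 'Deploy code on remote servers.']
```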

#### Let's use our knowledge of BeautifulSoup to parse both the text and the titles.

In [36]:
result2 = soup.find_all('div', attrs={'class': 'description'})

In [37]:
result2

[<div class="description">
 <h3>Data Analysis in Python</h3>
 <h3 class="mob-hidden">Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas &amp; NumPy.</h3>
 </div>,
 <div class="description">
 <h3>Machine Learning</h3>
 <h3 class="mob-hidden">Delve into the world of supervised and unsupervised learning with the scikit-learn and statsmodels frameworks.</h3>
 </div>,
 <div class="description">
 <h3>PostgreSQL</h3>
 <h3 class="mob-hidden">Organize data in SQL databases like PostgreSQL, fill it with data and run queries.</h3>
 </div>,
 <div class="description">
 <h3>Data Infrastructure</h3>
 <h3 class="mob-hidden">Deploy code on remote servers using Docker and AWS, and build an online dashboard.</h3>
 </div>,
 <div class="description">
 <h3>Software Engineering</h3>
 <h3 class="mob-hidden">Acquire state-of-the-art engineering tools to write, test, and deploy bigger Python applications.</h3>
 </div>,
 <div class="description">
 <h3

In [46]:
final_results = []
for i in result2:
    title = i.h3.text  # .h3 is the first <h3> inside the div, i.e. the title
    print(title)
    para = i.find('h3', attrs={'class':'mob-hidden'}).text  # the description text
    result_pair = (title, para)
    final_results.append(result_pair) # append as a tuple

Data Analysis in Python
Machine Learning
PostgreSQL
Data Infrastructure
Software Engineering
Git & Bash
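
On real pages some blocks may lack one of the expected elements, in which case `.find()` returns `None` and `.text` raises an `AttributeError`. The following is a hedged sketch of a more defensive version of the loop, using a made-up snippet in which the second block has no description:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second block has no description
html = """
<div class="description">
  <h3>Machine Learning</h3>
  <h3 class="mob-hidden">Supervised and unsupervised learning.</h3>
</div>
<div class="description">
  <h3>PostgreSQL</h3>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

pairs = []
for block in soup.find_all('div', attrs={'class': 'description'}):
    title_tag = block.find('h3')
    para_tag = block.find('h3', attrs={'class': 'mob-hidden'})
    if title_tag is None:
        continue  # nothing to label this block with, so skip it
    # find() returns None when nothing matches, so guard before using .text
    para = para_tag.text if para_tag is not None else ''
    pairs.append((title_tag.text, para))

print(pairs)
# [('Machine Learning', 'Supervised and unsupervised learning.'), ('PostgreSQL', '')]
```

Using an empty string for a missing description is one possible choice; skipping the block or storing `None` would work as well.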


In [48]:
import pandas as pd

In [51]:
df = pd.DataFrame(final_results)
df

|   | 0 | 1 |
|---|---|---|
| 0 | Data Analysis in Python | Become fluent in using Python to collect, anal... |
| 1 | Machine Learning | Delve into the world of supervised and unsuper... |
| 2 | PostgreSQL | Organize data in SQL databases like PostgreSQL... |
| 3 | Data Infrastructure | Deploy code on remote servers using Docker and... |
| 4 | Software Engineering | Acquire state-of-the-art engineering tools to ... |
| 5 | Git & Bash | Use Git and GitHub throughout the course to co... |
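
The default column labels `0` and `1` are not very descriptive. A minimal sketch of naming the columns when building the DataFrame; the names `title` and `description` are assumptions chosen here, and the two tuples are shortened stand-ins for the scraped results:

```python
import pandas as pd

# Two shortened (title, description) tuples, for illustration only
final_results = [
    ('Machine Learning', 'Delve into the world of supervised and unsupervised learning.'),
    ('PostgreSQL', 'Organize data in SQL databases like PostgreSQL.'),
]

# The column names 'title' and 'description' are assumptions chosen here
df = pd.DataFrame(final_results, columns=['title', 'description'])
print(df.columns.tolist())  # ['title', 'description']
print(df.shape)             # (2, 2)
```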


#### When you scrape your data, also think about saving the scraped text to a file.

In [52]:
df.to_csv('spiced_descriptions.csv')
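
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference with `index=False`; it writes to in-memory buffers rather than a file, and the one-row DataFrame is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical one-row DataFrame for illustration only
df = pd.DataFrame([('PostgreSQL', 'Organize data in SQL databases.')],
                  columns=['title', 'description'])

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)     # ['Unnamed: 0', 'title', 'description']

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['title', 'description']
```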

### Scraping in Python:
1. Requests + RegEx
2. Requests + BeautifulSoup
3. Scrapy
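
To make the difference between the first two approaches concrete, here is a small side-by-side sketch on a made-up HTML string (no request is sent, so only the parsing half of each approach is shown):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only
html = '<h3 class="mob-hidden">Run queries in <b>SQL</b>.</h3>'

# 1. RegEx: the pattern has to spell out the tag and attribute by hand,
#    and nested tags stay in the match.
pattern = r'<h3 class="mob-hidden">(.*?)</h3>'
regex_result = re.findall(pattern, html)
print(regex_result)  # ['Run queries in <b>SQL</b>.']

# 2. BeautifulSoup: query by tag and class; .text drops the nested tags.
soup = BeautifulSoup(html, 'html.parser')
soup_result = soup.find('h3', attrs={'class': 'mob-hidden'}).text
print(soup_result)   # Run queries in SQL.
```

The regular expression also breaks as soon as the attribute order or quoting changes, which is the main reason to prefer a parser for anything beyond very simple pages.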