## Beautiful Soup and parsing data

- JSON -- data format typical of APIs
    - cleaner data, easier to parse
- HTML -- markup language used to structure web pages
    - messier data, requires specialized tools to parse
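
To make the contrast concrete, here is a minimal sketch that pulls the same two values out of a JSON string and out of an HTML snippet. Both payloads are invented for illustration and do not come from any real API or website.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical payloads, made up for this comparison.
json_payload = '{"course": {"title": "Data Science", "weeks": 12}}'
html_payload = """
<div class="course">
  <h3>Data Science</h3>
  <span class="weeks">12</span>
</div>
"""

# JSON maps directly onto Python dicts, lists, strings and numbers.
data = json.loads(json_payload)
json_title = data["course"]["title"]
json_weeks = data["course"]["weeks"]   # already an int

# HTML must be parsed into a tree and searched; values come back as strings.
tree = BeautifulSoup(html_payload, "html.parser")
html_title = tree.find("h3").get_text()
html_weeks = int(tree.find("span", class_="weeks").get_text())

print(json_title, json_weeks)
print(html_title, html_weeks)
```

The JSON version is two dictionary lookups, while the HTML version needs a parser, a search for the right tag, and a type conversion.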

### Step 1: Download HTML text from a website

In [162]:
import requests
from bs4 import BeautifulSoup as soup

In [163]:
site = 'https://www.spiced-academy.com/en/'
site2 = 'https://www.eventbrite.com/d/germany--berlin/music--events/'

In [164]:
spiced = requests.get(site)
spiced

<Response [200]>
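
A `200` response means the download worked, but it is worth checking for failures explicitly. The sketch below wraps the request in a small helper; the name `fetch_html` and the 10-second timeout are illustrative choices, not part of the notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the default timeout are hypothetical helpers
    introduced for this example.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx / 5xx
    return response.text


# Usage (needs network access):
# html_text = fetch_html('https://www.spiced-academy.com/en/')
```

Without `raise_for_status()`, a 404 or 500 page would be passed on to the parser as if it were the page you asked for.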

### Step 2: Convert the raw HTML string into a `BeautifulSoup` object so that the data can be parsed easily

In [165]:
spiced_html = soup(spiced.text, 'html.parser')  # 'html.parser' selects Python's built-in HTML parser; alternatives such as 'lxml' or 'html5lib' need a separate install

In [92]:
spiced_html

<!DOCTYPE html>

<html dir="ltr" lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
<head>
<title>Your new career starts here | Spiced Academy</title>
<meta content="Kickstart your new career with our intensive, on-site tech programs in Web Development and Data Science." name="description"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport">
<link href="https://fonts.googleapis.com/css?family=Poppins:300,400,600&amp;display=swap" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono:400,500&amp;display=swap" rel="stylesheet"/>
<link href="/css/main.css" rel="stylesheet"/>
<link href="/apple-touch-icon.png?v=3" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png?v=3" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png?v=3" rel="icon" sizes="16x16" type="image/png"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c

### Step 3: Use the `BeautifulSoup` object to navigate the HTML document tree down to the tag / element that contains the data you want

There are multiple ways to get to the solution!

Use ``.tag_name`` (for example ``.h3``) to return the first tag with that name; it takes no filters.

``.find()`` returns the first tag that matches your query.

``.find_all()`` returns every matching tag in a list-like object called a ``ResultSet``, and the query can be customised with filters such as ``attrs``.

``.get()`` returns the value of one attribute of a tag, such as ``href``.

``.get_text()`` returns the text inside a tag, i.e. everything that sits between the tags rather than inside the < angle brackets >.
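
The accessors listed above can be tried side by side on a small document. The HTML below is made up so that the example runs without a network connection; the tag contents and the `page` variable are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A small invented page, not taken from any real site.
html = """
<html><body>
  <h3>First heading</h3>
  <h3 class="quiet"></h3>
  <h3>Third heading</h3>
  <a href="/program">Programs</a>
  <a href="/about">About</a>
</body></html>
"""
page = BeautifulSoup(html, "html.parser")

first_h3 = page.h3                 # attribute access: first <h3> in the document
same_h3 = page.find("h3")          # .find(): also the first match
all_h3 = page.find_all("h3")       # .find_all(): ResultSet with every match
link = page.find("a", attrs={"href": "/program"})  # filter on an attribute value

print(first_h3.get_text())         # text between the opening and closing tag
print(len(all_h3))
print(link.get("href"))            # value of the href attribute
print(link.get("target"))          # a missing attribute gives None
```

Note that `.get()` returns `None` for an attribute the tag does not have, whereas indexing with `link["target"]` would raise a `KeyError`.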

In [115]:
spiced_html.h3.text  # first <h3> tag in the document; .text returns its text content

'Kickstart your new career with our intensive, on-site tech programs in Web Development and Data Science.'

In [123]:
spiced_html.find_all('h3')[0].text

'Kickstart your new career with our intensive, on-site tech programs in Web Development and Data Science.'

In [131]:
bool(False), bool(), bool(''), bool([]), bool(True), bool('True'), bool([1])  # empty values are falsy, non-empty values are truthy

(False, False, False, False, True, True, True)

In [124]:
for i in spiced_html.find_all('h3'):
    if i.text:
        print(i.text)
    else:
        print('im empty!')

Kickstart your new career with our intensive, on-site tech programs in Web Development and Data Science.
im empty!
Average review on Course Report
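
The same truthiness check can be used to collect only the headings that contain text. This is a minimal sketch on an invented snippet; the shortened heading strings and the variable names are assumptions, not the real page content.

```python
from bs4 import BeautifulSoup

# Invented snippet: two headings with text, one empty, one with only spaces.
html = "<h3>Kickstart your new career</h3><h3></h3><h3>  </h3><h3>Average review</h3>"
page = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims surrounding whitespace, so a tag that
# holds only spaces also ends up as an empty (falsy) string.
headings = [h.get_text(strip=True) for h in page.find_all("h3")]
non_empty = [text for text in headings if text]

print(non_empty)
```

Stripping first matters because a string of spaces is truthy, so `if i.text` alone would let whitespace-only tags through.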


In [166]:
lyrics = spiced_html.find_all('h3')[0].text

In [134]:
with open('lyrics.txt','w') as f:
    f.write(lyrics)

In [135]:
import os

In [167]:
[x for x in os.listdir('.') if 'lyrics' in x]

['lyrics.txt']

In [2]:
suffix = spiced_html.find_all('a', attrs={"href":"/program"})[0].get('href')

In [169]:
suffix

'/program'

In [170]:
site[:-1]

'https://www.spiced-academy.com/en'

In [171]:
program_page = site[:-1]+suffix

In [172]:
program_page

'https://www.spiced-academy.com/en/program'
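
Slicing and concatenating strings works here, but the standard library's `urllib.parse.urljoin` is an alternative worth knowing. The sketch below compares the two. It only builds strings, and which of these URLs the site actually serves is not checked here.

```python
from urllib.parse import urljoin

site = 'https://www.spiced-academy.com/en/'
suffix = '/program'

manual = site[:-1] + suffix          # the string-slicing approach
joined = urljoin(site, suffix)       # leading slash: resolved against the site root
relative = urljoin(site, 'program')  # no leading slash: resolved against /en/

print(manual)    # https://www.spiced-academy.com/en/program
print(joined)    # https://www.spiced-academy.com/program
print(relative)  # https://www.spiced-academy.com/en/program
```

`urljoin` follows the same resolution rules a browser uses, so an `href` that starts with `/` replaces the whole path of the base URL, including `/en/`.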

In [173]:
program_text = requests.get(program_page)

In [176]:
program_soup = soup(program_text.text, 'html.parser')

---

### Applying requests and bs4!

1. Metrolyrics.com
2. Lyrics.com
3. AZLyrics.com

---

### Ethics and legality!

Everything we do in the course material is fine, but if you branch out into your own scraping projects, be aware of the legal and ethical questions. Check the GDPR for the European legal framework, and use common sense for ethics.