# Beautiful Soup 

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

**Task:** From this [SPICED Academy page](https://www.spiced-academy.com/en/program/data-science), extract the following sentence using web scraping:
- **"Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas & NumPy."**

- Use a browser whose developer tools let you view the HTML code side by side with the rendered webpage.
    - For example, in Google Chrome, you can right-click (or ctrl+click) a webpage and click '*View Page Source*' to see the full HTML.
    - You can also click '*Inspect*' to interactively highlight the HTML that corresponds to the section of the page you're interested in.
    - Other browsers offer equivalent web developer tools.

### Step 1. Get the raw HTML text from the above website

In [2]:
sp_html = requests.get('https://www.spiced-academy.com/en/program/data-science').text
type(sp_html)

str
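The one-liner above works when the request succeeds. As a slightly more defensive sketch, the fetch can be wrapped in a small helper that sets a timeout and raises on HTTP errors. The name `fetch_html`, its `session` parameter, and the 10-second default are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP error codes.

    `session` is a hypothetical hook: any object with a requests-like
    `.get(url, timeout=...)` method, which makes the helper easy to test.
    """
    getter = requests if session is None else session
    response = getter.get(url, timeout=timeout)
    response.raise_for_status()  # raise for 4xx/5xx instead of parsing an error page
    return response.text


# Usage (performs a real network request, so it is left commented out):
# sp_html = fetch_html('https://www.spiced-academy.com/en/program/data-science')
```

Without a timeout, `requests.get` can wait indefinitely on an unresponsive server, which is why the sketch always passes one.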

In [3]:
sp_html



### Step 2. Convert the raw HTML string to a `BeautifulSoup` object, so that we can parse the data.

In [5]:
spiced_soup = BeautifulSoup(sp_html, 'html.parser')
type(spiced_soup)

bs4.BeautifulSoup

In [6]:
spiced_soup

<!DOCTYPE html>

<html dir="ltr" lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
<head>
<title>Data Science Bootcamp in Germany | Spiced Academy</title>
<meta content="Level up your career with our 12 week Data Science bootcamp in Germany." name="description"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport">
<link href="https://fonts.googleapis.com/css?family=Poppins:300,400,600&amp;display=swap" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono:400,500&amp;display=swap" rel="stylesheet"/>
<link href="/css/main.css" rel="stylesheet"/>
<link href="/apple-touch-icon.png?v=3" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png?v=3" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png?v=3" rel="icon" sizes="16x16" type="image/png"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c" name="msapplication-TileCo

In [7]:
print(spiced_soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
 <head>
  <title>
   Data Science Bootcamp in Germany | Spiced Academy
  </title>
  <meta content="Level up your career with our 12 week Data Science bootcamp in Germany." name="description"/>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport">
   <link href="https://fonts.googleapis.com/css?family=Poppins:300,400,600&amp;display=swap" rel="stylesheet"/>
   <link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono:400,500&amp;display=swap" rel="stylesheet"/>
   <link href="/css/main.css" rel="stylesheet"/>
   <link href="/apple-touch-icon.png?v=3" rel="apple-touch-icon" sizes="180x180"/>
   <link href="/favicon-32x32.png?v=3" rel="icon" sizes="32x32" type="image/png"/>
   <link href="/favicon-16x16.png?v=3" rel="icon" sizes="16x16" type="image/png"/>
   <link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
   <meta conte

In [8]:
spiced_soup.title

<title>Data Science Bootcamp in Germany | Spiced Academy</title>

In [9]:
spiced_soup.h3

<h3 class="sub-heading">Level up your career with our 12 week Data Science bootcamp in Germany.</h3>

In [10]:
spiced_soup.body.header.div

<div class="header header-left">
<a href="/">
<img alt="Spiced logo" class="header-logo" src="/img/Spiced_Logo_Dark.svg"/>
</a>
<div class="header-links">
<a href="/program">
<p class="js-header-links-left">PROGRAMS</p>
</a>
<a href="/about">
<p class="js-header-links-left">ABOUT</p>
</a>
<a class="apply-now" href="/apply">
<p class="js-header-links-left">APPLY NOW</p>
</a>
</div>
</div>

In [11]:
spiced_soup.get_text()

"\n\n\nData Science Bootcamp in Germany | Spiced Academy\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPROGRAMS\n\n\nABOUT\n\n\nAPPLY NOW\n\n\n\n\n\n\nPROGRAMS\n\n\nABOUT\n\n\nAPPLY NOW\n\n\nDE | EN\n\n\n\n\n\n\nDE | EN\n\n\n\nPrograms\n\n\nAbout\n\n\n\n\n\nGraduates\n\n\nOutcomes\n\n\nFAQ\n\n\nBlog\n\n\n\nAPPLY NOW\n\n\n\n\n\n\n\nData  Science\nLevel up your career with our 12 week Data Science bootcamp in Germany.\nCampuses in Berlin, Hamburg, Cologne and Stuttgart\n\n\nRead More\n\n\nApply Now\n\n\n \n\n\n\n\n\n \n\n\n\n\n\n            Your new career starts here\n\n\n\n\n0/4\nPreparation Phase\nGet started with the core skills that you need\nUsing Python basics, Linear Algebra and calculus, you will learn to write your very first programs. You’ll begin working with programming fundamentals such as data types, loops and conditionals. Once you’ve finished the prep course, you’re required to solve a small data analysis

In [12]:
spiced_soup.body.div.get_text()

'\n\n\n\n\n\nPROGRAMS\n\n\nABOUT\n\n\nAPPLY NOW\n\n\n'
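The outputs above contain long runs of `\n` because `get_text()` keeps the whitespace between tags. The following sketch shows dotted navigation and a cleaner text extraction on a small, made-up HTML snippet (the snippet is an assumption that only mimics the page's header, not the real page source).

```python
from bs4 import BeautifulSoup

# Made-up snippet that imitates the structure of the page header.
html = """
<html>
<head><title>Demo page</title></head>
<body>
<header>
<div class="header">
<a href="/program"><p>PROGRAMS</p></a>
<a href="/about"><p>ABOUT</p></a>
</div>
</header>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Dotted access returns the first matching tag at each step.
title_text = soup.title.string
header_div = soup.body.header.div

# Default get_text() keeps the newlines that sit between the tags.
raw = header_div.get_text()

# separator/strip join the text pieces with a space and drop empty ones.
clean = header_div.get_text(separator=" ", strip=True)

print(repr(raw))
print(clean)
```

Passing `strip=True` together with a `separator` is usually enough to turn the noisy output into readable text.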

### Step 3. Use the BeautifulSoup object to navigate the HTML document tree down to the tag that contains the data you want.
- There are multiple ways to get to the solution!
- `.find()` returns the first tag that matches your query, or `None` if nothing matches.
- `.find_all()` returns a list-like object (called a `ResultSet`) that contains all matching tags.
- `.text` returns the text content of a tag, i.e. everything between its opening and closing tags with the markup (the parts in **< angled brackets >**) stripped out.
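The three tools listed above can be compared on a small, made-up snippet (the HTML below is an assumption for illustration, loosely modelled on the page's headings):

```python
from bs4 import BeautifulSoup

# Made-up snippet; the class names mirror those used on the real page.
html = """
<h1 class="main-heading mob-hidden">Title</h1>
<h3 class="sub-heading">First heading</h3>
<h3 class="mob-hidden">Second &amp; third</h3>
<p class="mob-hidden">0/4</p>
"""

soup = BeautifulSoup(html, "html.parser")

first_h3 = soup.find("h3")      # first match only
all_h3 = soup.find_all("h3")    # every match, as a ResultSet
missing = soup.find("h4")       # no match, so this is None

# `class` is a multi-valued attribute: a tag matches if any of its
# classes equals the one you ask for, regardless of the tag name.
any_mob_hidden = soup.find_all(class_="mob-hidden")
h3_mob_hidden = soup.find_all("h3", attrs={"class": "mob-hidden"})

print(first_h3.text)
print(all_h3[1].text)           # entities such as &amp; are decoded
print(len(any_mob_hidden), len(h3_mob_hidden))
```

Because `.find()` can return `None`, it is worth checking the result before calling `.text` on it when a tag might be absent.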

In [13]:
spiced_soup.find_all('h3')

[<h3 class="sub-heading">Level up your career with our 12 week Data Science bootcamp in Germany.</h3>,
 <h3>Campuses in <a href="/program/data-science/berlin">Berlin</a>, <a href="/program/data-science/hamburg">Hamburg</a>, <a href="/program/data-science/cologne">Cologne</a> and <a href="/program/data-science/stuttgart">Stuttgart</a></h3>,
 <h3 class="preparation-short-description">Get started with the core skills that you need</h3>,
 <h3 class="preparation-short-description">Machine learning, data analysis and more</h3>,
 <h3 class="preparation-short-description">Time for the back-end.</h3>,
 <h3 class="preparation-short-description">Deep learning and your final project</h3>,
 <h3 class="preparation-short-description">A place to continue to learn and develop.</h3>,
 <h3 class="mob-hidden">Know your language at glance:</h3>,
 <h3>Data Analysis in Python</h3>,
 <h3 class="mob-hidden">Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful librarie

In [14]:
spiced_soup.find_all(class_='mob-hidden')

[<h1 class="main-heading career-heading career-start-container mob-hidden">
             Your new career starts here</h1>,
 <p class="mob-hidden">0/4</p>,
 <h1 class="main-heading career-heading career-start-container mob-hidden">
             Your new career starts here</h1>,
 <p class="mob-hidden">1/4</p>,
 <h1 class="main-heading career-heading career-start-container mob-hidden">
             Your new career starts here</h1>,
 <p class="mob-hidden">2/4</p>,
 <h1 class="main-heading career-heading career-start-container mob-hidden">
             Your new career starts here</h1>,
 <p class="mob-hidden">3/4</p>,
 <h1 class="main-heading career-heading career-start-container mob-hidden">
             Your new career starts here</h1>,
 <p class="mob-hidden">4/4</p>,
 <h3 class="mob-hidden">Know your language at glance:</h3>,
 <h3 class="mob-hidden">Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas &amp; NumPy.</h3>,
 <h3 clas

In [18]:
spiced_soup.find_all('h3', attrs = {'class' : 'mob-hidden'})

[<h3 class="mob-hidden">Know your language at glance:</h3>,
 <h3 class="mob-hidden">Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas &amp; NumPy.</h3>,
 <h3 class="mob-hidden">Delve into the world of supervised and unsupervised learning with the scikit-learn and statsmodels frameworks.</h3>,
 <h3 class="mob-hidden">Organize data in SQL databases like PostgreSQL, fill it with data and run queries.</h3>,
 <h3 class="mob-hidden">Deploy code on remote servers using Docker and AWS, and build an online dashboard.</h3>,
 <h3 class="mob-hidden">Acquire state-of-the-art engineering tools to write, test, and deploy bigger Python applications.</h3>,
 <h3 class="mob-hidden">Use Git and GitHub throughout the course to collaborate &amp; version control your code.</h3>]

In [19]:
spiced_soup.find_all('h3', attrs = {'class' : 'mob-hidden'})[1]

<h3 class="mob-hidden">Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas &amp; NumPy.</h3>
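The cell above returns the whole tag, while the task asks for the sentence itself. A minimal sketch of the last step follows, run on a shortened copy of the matching `<h3>` tags so that it works offline. Selecting by content rather than by a hard-coded index is a suggestion, not the notebook's original approach; it keeps working if the order of the headings on the page changes.

```python
from bs4 import BeautifulSoup

# Shortened copy of the matching tags from the output above.
html = """
<h3 class="mob-hidden">Know your language at glance:</h3>
<h3 class="mob-hidden">Become fluent in using Python to collect, analyze, and visualize data, focusing on the powerful libraries Pandas &amp; NumPy.</h3>
<h3 class="mob-hidden">Use Git and GitHub throughout the course to collaborate &amp; version control your code.</h3>
"""

soup = BeautifulSoup(html, "html.parser")
candidates = soup.find_all("h3", class_="mob-hidden")

# Option 1: pick by position, as in the cell above, then take the text.
by_index = candidates[1].get_text(strip=True)

# Option 2: pick by content, which does not depend on the tag order.
by_content = next(
    tag.get_text(strip=True)
    for tag in candidates
    if tag.get_text().startswith("Become fluent")
)

print(by_index)
```

Note that the `&amp;` entity in the HTML comes back as a plain `&` in the extracted text.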

### Tasks regarding the project:
After downloading the HTML for two (or more) artist pages:

1. Get all the links to their song pages and collect them in a list (either with bs4 or RegEx).
2. Loop through the list and download the lyrics pages.
3. Extract and save the lyrics; they will be the input for your model.
4. (Clean up the text.)

For more details, see the Challenges in the Course Material: http://krspiced.pythonanywhere.com/chapters/project_lyrics/web_scraping/README.html and http://krspiced.pythonanywhere.com/chapters/project_lyrics/web_scraping/beautiful_soup.html
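As a starting point for the first task, link collection could be sketched as below. Everything site-specific here is hypothetical: the base URL `https://lyrics.example.com`, the `/lyrics/` path prefix, and the sample artist page are placeholders, so adapt them to the structure of the lyrics site you actually scrape.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://lyrics.example.com"  # hypothetical site

# Made-up artist page with relative links, one duplicate, one unrelated link.
artist_html = """
<ul>
  <li><a href="/lyrics/artist-a/song-one">Song One</a></li>
  <li><a href="/lyrics/artist-a/song-two">Song Two</a></li>
  <li><a href="/about">About</a></li>
  <li><a href="/lyrics/artist-a/song-one">Song One (live)</a></li>
</ul>
"""


def extract_song_links(html, base_url, path_prefix="/lyrics/"):
    """Collect absolute, de-duplicated song URLs in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if not href.startswith(path_prefix):
            continue  # skip navigation and other non-song links
        url = urljoin(base_url, href)  # turn the relative path into a full URL
        if url not in links:
            links.append(url)
    return links


song_links = extract_song_links(artist_html, BASE_URL)
print(song_links)
```

For the second task, each URL in `song_links` can then be downloaded in a loop with `requests.get`, ideally with a short pause between requests to avoid overloading the server.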