# Beautiful Soup

Main tools used for scraping web data!

For the project, we will scrape lyrics from the web and use machine learning to identify song authors!

- Going to scrape the web using requests to grab the content

- Parse the content - breaking stuff into substructures - using Beautiful Soup

- Apply text classification to the content you've parsed - one aspect of the project will be to parse English and break it down to understand verbs, nouns, adverbs, etc. - using spaCy

- Create a bag of words - basic introduction to word vectorisation - turns lyrics and text into numerical data that we can process 

- We'll also use a Naive Bayes model

- using Scrapy - scraping tool to grab data off the web

The data we'll be using is text data - a text corpus (plural: corpora), i.e. a body of text

### Theory:

- Regex - regular grammar - a very simple computer grammar (simpler than a context-free grammar) - practically applied through regular expressions - Python's `re` module uses it

- Dependency grammar - spaCy uses it

- Naive Bayes - probability theory 

- Bag of words, Count Vectorizer, TF-IDF: ways of transforming our word data into numerical data! Each text becomes a vector of word counts (or TF-IDF weights) over a shared vocabulary, so texts that use similar words get similar vectors

- Class balance: making sure each class (here, each artist) has a comparable number of examples in your training data, so the model is not biased towards the biggest class
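To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not the Count Vectorizer itself: the function name `bag_of_words` and the whitespace tokenisation are assumptions made for illustration only.

```python
from collections import Counter


def bag_of_words(documents):
    """Turn a list of strings into count vectors over one shared vocabulary."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption;
    # real vectorisers also handle punctuation, stop words, etc.)
    tokenised = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word, in a fixed (sorted) order
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # One number per vocabulary word: how often it occurs in this text
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["only love is all maroon", "love is all"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['all', 'is', 'love', 'maroon', 'only']
print(vectors)     # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Each text is now a list of numbers of the same length, which is the kind of input a Naive Bayes model can work with.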

### 3 main ways to scrape text off the internet

We need our computer to skip all the HTML, CSS and JavaScript clutter and go straight to the lyrics

1. Requests + RegEx: Highly customisable and can be used on any text, but very hard when you have a complex page source, and it doesn't understand HTML structure!

2. Requests + Beautiful Soup: it is built for XML/HTML, it hands the markup to a real parser (such as `html.parser` or `lxml`) and gives you a tree you can navigate, which is super useful. However, it ONLY works for XML/HTML

3. Scrapy: Most powerful tool - you can specify what you search and it can trick websites and loads of other awesome stuff, BUT it's a total black box
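As a rough illustration of option 1, the sketch below pulls lyrics out of a page with `re` alone. The HTML string is a made-up stand-in for a downloaded page, not real output from lyrics.com.

```python
import re

# Hypothetical, heavily simplified page source
html = ('<html><body><pre class="lyric-body">'
        'To find <a href="#">your</a> name'
        '</pre></body></html>')

# A regular expression treats the page as plain text: grab whatever sits
# between the opening and closing <pre> tags
match = re.search(r'<pre class="lyric-body">(.*?)</pre>', html, flags=re.DOTALL)
raw = match.group(1)

# Nested tags are still in there and have to be stripped by hand
lyrics = re.sub(r'<[^>]+>', '', raw)
print(lyrics)  # To find your name
```

The pattern only matches when the tag is written exactly this way. An extra attribute or a different attribute order would break it, which is the "doesn't understand HTML structure" problem from the list above.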

### Grab and parse web content

#### Read the documentation here (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

#### Simple request - pass in the url only

In [1]:
import requests

lyrics = requests.get('https://www.lyrics.com/lyric/23507253/Perth')

In [2]:
lyrics.status_code  # 200 is the HTTP status code for OK (success)

200

In [3]:
lyrics

<Response [200]>

#### Complex request:
* GET and POST requests can both take dictionaries as keyword arguments.
* pass in custom headers (to avoid website bot blocking) using the keyword headers
* (pass in GET parameters using the keyword params)
* headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

Instead of the site thinking you're a bot and blocking you, a browser-like User-Agent header makes your request look like a regular person browsing
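To see what the `headers` and `params` keywords actually do, a request can be built and inspected without sending it. This is only a sketch: the URL `https://example.com/search` and the `q` parameter are hypothetical placeholders, not a real lyrics.com endpoint.

```python
import requests

# Hypothetical values for illustration only
headers = {'User-Agent': 'Mozilla/5.0'}
params = {'q': 'Perth'}

# Build the request object and prepare it; nothing is sent over the network
request = requests.Request('GET', 'https://example.com/search',
                           headers=headers, params=params)
prepared = request.prepare()

print(prepared.url)                    # params end up in the query string
print(prepared.headers['User-Agent'])  # headers are attached as given
```

`requests.get(url, headers=headers, params=params)` does the same preparation and then sends the request.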

In [4]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

In [6]:
lyrics = requests.get('https://www.lyrics.com/lyric/23507253/Perth', headers = headers)

* Check the status of the request

In [7]:
lyrics.status_code

200

* Get the content as text

In [8]:
lyrics.text[:200]

'\n<!doctype html>\n<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->\n<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->\n<!--[if IE 8]>    <html '

#### There will be some JavaScript in there, some HTML... but none of it is useful until we parse it

##### Common HTTP status codes - https://www.restapitutorial.com/httpstatuscodes.html
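The standard library can also look up what a status code means. A small sketch (the helper name `describe_status` is made up for this example):

```python
from http import HTTPStatus


def describe_status(code):
    """Return a status code together with its standard phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


for code in (200, 403, 404, 500):
    print(describe_status(code))
# 200 OK
# 403 Forbidden
# 404 Not Found
# 500 Internal Server Error
```

With a real response object, `lyrics.raise_for_status()` raises an exception for 4xx and 5xx codes, which saves checking `status_code` by hand.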

### BS4
* With the data we have collected, we can try out Beautiful Soup concepts
* We have some raw text data but now we need to parse it!

In [9]:
from bs4 import BeautifulSoup as soup

#### BeautifulSoup accepts text (str) data and binary (bytes) data too! 99% of the time you are parsing HTML, but sometimes XML (which needs the lxml parser installed)...

In [10]:
parse_lyrics = soup(lyrics.text, 'html.parser') # gonna give it text data

In [11]:
type(parse_lyrics) 

bs4.BeautifulSoup

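**.prettify()** returns the parsed document as a string with one tag per line and nested indentation, which makes the structure easier to read

In [1]:
# print(parse_lyrics.prettify())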

### What are we looking at? _(attributes)_

#### The main objects in a BeautifulSoup parse tree are:
* Tags - representing html tags, whose attributes are stored in a dictionary on the Tag object (the BeautifulSoup object itself behaves like a Tag object)


* NavigableString - represents the text stored within Tags. Behaves like a string but carries a reference to the whole parse tree, which is memory intensive, so convert it with str() before using it elsewhere (Comments are a special kind of NavigableString)
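A small sketch of both object types, using a one-line HTML snippet invented for the example:

```python
from bs4 import BeautifulSoup

# Made-up snippet, not taken from the lyrics page
html = '<p class="lyric" id="line-1">To find your name</p>'
doc = BeautifulSoup(html, 'html.parser')

tag = doc.p
print(tag.name)    # the tag name: p
print(tag.attrs)   # attributes as a dictionary
print(tag['id'])   # dictionary-style access to one attribute

text = tag.string  # a NavigableString, still tied to the parse tree
plain = str(text)  # a plain Python string with no tree reference
print(type(text).__name__, type(plain).__name__)
```

Note that `class` comes back as a list (`['lyric']`), because an HTML element can carry several classes at once.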

### And how do we extract information from it? _(methods)_

#### The methods we can use on a BeautifulSoup object are:
* prettify - return the parsed document as an indented, formatted html string
* find - find the first tag that matches a tag name and/or attribute filter
* find_all - find all tags that match a tag name and/or attribute filter
* There are several other search methods which replicate the above behaviour, but look in one direction through the tree, including find_next / find_all_next / find_previous / find_all_previous / find_parent / find_parents / find_next_sibling / find_previous_sibling (next_sibling and previous_sibling also exist, but they are plain attributes rather than methods)
* Generators - iterables that produce their items one at a time; the main ones are children and descendants

#### Let's combine these attributes and methods to look for some useful information

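#### 1: Generator method: children or descendants

Generators are like lists that produce each item only when it is asked for, instead of holding everything in memory at once. children yields only the direct children of a node; descendants walks the whole tree below it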

In [15]:
gen = [type(x) for x in parse_lyrics.children]
gen

[bs4.element.NavigableString,
 bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Comment,
 bs4.element.NavigableString,
 bs4.element.Comment,
 bs4.element.NavigableString,
 bs4.element.Comment,
 bs4.element.NavigableString,
 bs4.element.Comment,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Comment]

- We get back a bunch of elements - not particularly useful

- This shows us it is a nested structure

- **NavigableStrings are what we are interested in**; we want to ignore Doctype and Comment nodes
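One way to keep the text and drop the rest is to filter nodes by type. The sketch below uses a tiny invented document. In bs4, `Comment` and `Doctype` are themselves subclasses of `NavigableString`, so they have to be excluded explicitly.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype, NavigableString, Tag

# Made-up document with a doctype, a comment and some nested tags
html = ('<!doctype html><!-- a comment -->'
        '<html><body><p>Still <b>alive</b></p></body></html>')
doc = BeautifulSoup(html, 'html.parser')

# .children yields only the direct children of the document
top_level = [type(node).__name__ for node in doc.children]
print(top_level)

# .descendants walks the whole tree, so nested tags show up too
tag_names = [node.name for node in doc.descendants if isinstance(node, Tag)]
print(tag_names)

# Keep real text nodes, skipping comments and the doctype
texts = [str(node) for node in doc.descendants
         if isinstance(node, NavigableString)
         and not isinstance(node, (Comment, Doctype))]
print(texts)
```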

In [20]:
gen = [x for x in parse_lyrics.children]
gen2 = [type(x) for x in gen[0]]
gen2

[str]

#### 2: Directly access tags in the html body using their tag name

In [24]:
type(parse_lyrics.body.p)

bs4.element.Tag

#### 3: Use find or find_all to search for particular instances

find returns the first match (or None if nothing matches)

find_all returns a list of all the matches

In [25]:
parse_lyrics.find('div')

<div id="fb-root"></div>

#### Would be nice to grab tags with specific attributes!

In [27]:
parse_lyrics.find('pre', attrs={'class': 'lyric-body'})

<pre class="lyric-body" data-lang="en" dir="ltr" id="lyric-body-text">I’m <a href="https://www.definitions.net/definition/tearing" style="color:#222;">tearing</a> up, <a href="https://www.definitions.net/definition/across" style="color:#222;">across</a> your face
Move dust <a href="https://www.definitions.net/definition/through" style="color:#222;">through</a> the light
To find your name
It's <a href="https://www.definitions.net/definition/something" style="color:#222;">something</a> faint
This is not a place
Not yet awake, I'm <a href="https://www.definitions.net/definition/raised" style="color:#222;">raised</a> to make

Still <a href="https://www.definitions.net/definition/alive" style="color:#222;">alive</a> who you, love
Still <a href="https://www.definitions.net/definition/alive" style="color:#222;">alive</a> who you, love
Still <a href="https://www.definitions.net/definition/alive" style="color:#222;">alive</a> who you, love

In a <a href="https://www.definitions.net/definition/m

#### Or use a Boolean: True matches any tag that has the attribute at all

In [28]:
parse_lyrics.find('pre', attrs={'class': True})

<pre class="lyric-body" data-lang="en" dir="ltr" id="lyric-body-text">I’m <a href="https://www.definitions.net/definition/tearing" style="color:#222;">tearing</a> up, <a href="https://www.definitions.net/definition/across" style="color:#222;">across</a> your face
Move dust <a href="https://www.definitions.net/definition/through" style="color:#222;">through</a> the light
To find your name
It's <a href="https://www.definitions.net/definition/something" style="color:#222;">something</a> faint
This is not a place
Not yet awake, I'm <a href="https://www.definitions.net/definition/raised" style="color:#222;">raised</a> to make

Still <a href="https://www.definitions.net/definition/alive" style="color:#222;">alive</a> who you, love
Still <a href="https://www.definitions.net/definition/alive" style="color:#222;">alive</a> who you, love
Still <a href="https://www.definitions.net/definition/alive" style="color:#222;">alive</a> who you, love

In a <a href="https://www.definitions.net/definition/m

In [29]:
all_pre_tags = parse_lyrics.find_all('pre', attrs={'class': 'lyric-body'})

In [30]:
len(all_pre_tags)

1
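Beautiful Soup offers a few equivalent spellings for the same search. A sketch on a small invented snippet (the markup mimics the lyrics page, but is not copied from it):

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = '''
<div id="fb-root"></div>
<pre class="lyric-body" id="lyric-body-text">To find <a href="#">your</a> name</pre>
<pre class="other">ignore me</pre>
'''
doc = BeautifulSoup(html, 'html.parser')

by_attrs = doc.find('pre', attrs={'class': 'lyric-body'})  # attrs dictionary
by_keyword = doc.find('pre', class_='lyric-body')          # class_ keyword
by_css = doc.select_one('pre.lyric-body')                  # CSS selector
by_id = doc.find(id='lyric-body-text')                     # id keyword

print(by_attrs == by_keyword == by_css == by_id)
print(len(doc.find_all('pre')))  # both <pre> tags, regardless of class
print(by_attrs.get_text())       # text only, nested <a> tag stripped
```

Which spelling to use is mostly a matter of taste. The `class_` keyword has a trailing underscore because `class` is a reserved word in Python.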

In [33]:
all_pre_tags[0].text

"I’m tearing up, across your face\r\nMove dust through the light\r\nTo find your name\r\nIt's something faint\r\nThis is not a place\r\nNot yet awake, I'm raised to make\r\n\r\nStill alive who you, love\r\nStill alive who you, love\r\nStill alive who you, love\r\n\r\nIn a matter of an month\r\nFrom forests for the soft\r\nGotta know been lead aloft\r\nSo I'm ridding all your stories\r\nWhat I know, what it is, is boring, wire it up!\r\nYou're breaking your ground"

**.text gives you just the text inside a tag, with all the nested tags stripped out!**
- But now we want to get rid of the `\r\n` line breaks! Let's use regular expressions to replace them

In [34]:
import re 

In [48]:
# the r prefix makes a raw string, so backslashes are passed to the regex engine unchanged

clean_text = re.sub(r'\r\n', '. ', all_pre_tags[0].text)
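Replacing every `\r\n` on its own turns the blank line between verses into a stray `. .` in the result. A possible alternative, sketched on a short sample string rather than the real page, is to split on runs of line breaks and join the pieces:

```python
import re

# Short sample in the same shape as the scraped text (two verses)
raw = "To find your name\r\nIt's something faint\r\n\r\nStill alive who you, love"

# Split on one or more consecutive line breaks so blank lines disappear,
# then drop any empty pieces and join the lines back together
lines = [line.strip() for line in re.split(r'(?:\r?\n)+', raw) if line.strip()]
clean = '. '.join(lines)
print(clean)
# To find your name. It's something faint. Still alive who you, love
```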

In [49]:
clean_text

"I’m tearing up, across your face. Move dust through the light. To find your name. It's something faint. This is not a place. Not yet awake, I'm raised to make. . Still alive who you, love. Still alive who you, love. Still alive who you, love. . In a matter of an month. From forests for the soft. Gotta know been lead aloft. So I'm ridding all your stories. What I know, what it is, is boring, wire it up!. You're breaking your ground"

#### Let's try and get all of the Bon Iver songs:

In [110]:
bi = requests.get('https://www.lyrics.com/artist/Bon+Iver/991558')

In [111]:
bi_html = soup(bi.text, 'html.parser')

In [112]:
for i in bi_html.find_all(attrs={'class':'tal qx'}):
    a_tags = i.a
    # some cells contain no link (i.a is None) or no href, so skip those
    if a_tags is not None and a_tags.get('href'):
        print('http://www.lyrics.com' + a_tags.get('href'))

http://www.lyrics.com/lyric/13416985/Bon+Iver/Blindsided
http://www.lyrics.com/lyric/13416986/Bon+Iver/Creature+Fear
http://www.lyrics.com/lyric/13416981/Bon+Iver/Flume
http://www.lyrics.com/lyric/13416988/Bon+Iver/For+Emma
http://www.lyrics.com/lyric/13416982/Bon+Iver/Lump+Sum
http://www.lyrics.com/lyric/13416989/Bon+Iver/Re%3A+Stacks
http://www.lyrics.com/lyric/13416983/Bon+Iver/Skinny+Love
http://www.lyrics.com/lyric/13416984/Bon+Iver/The+Wolves%2C+Acts+1+%26+2
http://www.lyrics.com/lyric/14834010/Bon+Iver/Wisconsin+%5B%2A%5D
http://www.lyrics.com/lyric/26244541/Bon+Iver/Babys
http://www.lyrics.com/lyric/26244542/Bon+Iver/Beach+Baby
http://www.lyrics.com/lyric/26244543/Bon+Iver/Blood+Bank
http://www.lyrics.com/lyric/26244540/Bon+Iver/Woods
http://www.lyrics.com/lyric/16205906/Bon+Iver/Brackett%2C+WI
http://www.lyrics.com/lyric/17777238/Saint+Vincent/Rosyln
http://www.lyrics.com/lyric/19503761/Bon+Iver/Come+Talk+to+Me
http://www.lyrics.com/lyric/22402977/Kanye+West/Lost+in+the+World


In [113]:
bi_song_list=[]
for i in bi_html.find_all(attrs={'class':'tal qx'}):
    a_tags = i.a
    # some cells contain no link (i.a is None) or no href, so skip those
    if a_tags is not None and a_tags.get('href'):
        bi_song_list.append('http://www.lyrics.com' + a_tags.get('href'))

In [114]:
bi_song_list

['http://www.lyrics.com/lyric/13416985/Bon+Iver/Blindsided',
 'http://www.lyrics.com/lyric/13416986/Bon+Iver/Creature+Fear',
 'http://www.lyrics.com/lyric/13416981/Bon+Iver/Flume',
 'http://www.lyrics.com/lyric/13416988/Bon+Iver/For+Emma',
 'http://www.lyrics.com/lyric/13416982/Bon+Iver/Lump+Sum',
 'http://www.lyrics.com/lyric/13416989/Bon+Iver/Re%3A+Stacks',
 'http://www.lyrics.com/lyric/13416983/Bon+Iver/Skinny+Love',
 'http://www.lyrics.com/lyric/13416984/Bon+Iver/The+Wolves%2C+Acts+1+%26+2',
 'http://www.lyrics.com/lyric/14834010/Bon+Iver/Wisconsin+%5B%2A%5D',
 'http://www.lyrics.com/lyric/26244541/Bon+Iver/Babys',
 'http://www.lyrics.com/lyric/26244542/Bon+Iver/Beach+Baby',
 'http://www.lyrics.com/lyric/26244543/Bon+Iver/Blood+Bank',
 'http://www.lyrics.com/lyric/26244540/Bon+Iver/Woods',
 'http://www.lyrics.com/lyric/16205906/Bon+Iver/Brackett%2C+WI',
 'http://www.lyrics.com/lyric/17777238/Saint+Vincent/Rosyln',
 'http://www.lyrics.com/lyric/19503761/Bon+Iver/Come+Talk+to+Me',
 '
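Instead of gluing the domain onto each `href` by hand, the standard library's `urljoin` can resolve relative links against the page they came from. A sketch, where the `hrefs` list stands in for values read from the `<a>` tags:

```python
from urllib.parse import urljoin

# The artist page the links were scraped from
BASE_URL = 'https://www.lyrics.com/artist/Bon+Iver/991558'

# Stand-in for scraped href values; None represents a cell without a link
hrefs = ['/lyric/13416985/Bon+Iver/Blindsided',
         None,
         '/lyric/13416981/Bon+Iver/Flume']

# urljoin resolves each relative href against the base page URL
song_urls = [urljoin(BASE_URL, href) for href in hrefs if href]
print(song_urls)
```

This also copes with links that are already absolute, which plain string concatenation would mangle.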

#### Let's choose a song and find the lyrics

In [76]:
bi_song_list[0]

'http://www.lyrics.com/lyric/13416985/Bon+Iver/Blindsided'

In [78]:
song0 = requests.get(bi_song_list[0])

In [90]:
song0_html = soup(song0.text, 'html.parser')

In [93]:
lyrics_song0 = song0_html.find_all(attrs={'id':'lyric-body-text'})

In [97]:
song0_lyrics = []

In [99]:
song0_lyrics.append(lyrics_song0[0].text)

In [100]:
song0_lyrics[0]

"Bike down, down to the downtown\r\nDown to the lock down, boards, nails lie around\r\n\r\nI crouch like a crow\r\nContrasting the snow\r\nFor the agony, I'd rather know\r\n'Cause blinded I am blindsided\r\n\r\nPeek in, into the peer in\r\nI'm not really like this, I'm probably plightless\r\n\r\nI cup the window\r\nI'm crippled and slow\r\nFor the agony\r\nI'd rather know\r\n'Cause blinded I am blindsided\r\n\r\nWould you really rush out for me now?\r\n\r\nTaught line, down to the shoreline\r\nThe end of a blood line, the moon is a cold light\r\n\r\nThere's a pull to the flow\r\nMy feet melt the snow\r\nFor the irony, I'd rather know\r\n'Cause blinded I was blindsided\r\n\r\n'Cause blinded I was blindsided\r\n\r\n'Cause blinded I was blindsided"

In [101]:
import re

In [107]:
clean_song0 = re.sub('\r\n', ', ', lyrics_song0[0].text)

In [108]:
clean_song0

"Bike down, down to the downtown, Down to the lock down, boards, nails lie around, , I crouch like a crow, Contrasting the snow, For the agony, I'd rather know, 'Cause blinded I am blindsided, , Peek in, into the peer in, I'm not really like this, I'm probably plightless, , I cup the window, I'm crippled and slow, For the agony, I'd rather know, 'Cause blinded I am blindsided, , Would you really rush out for me now?, , Taught line, down to the shoreline, The end of a blood line, the moon is a cold light, , There's a pull to the flow, My feet melt the snow, For the irony, I'd rather know, 'Cause blinded I was blindsided, , 'Cause blinded I was blindsided, , 'Cause blinded I was blindsided"

#### Let's put it all in a function!

In [118]:
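def get_lyrics(url):
    """Download a song page and return its lyrics as one line of text."""
    song = requests.get(url)
    song_html = soup(song.text, 'html.parser')
    # the lyrics live in the element with id="lyric-body-text"
    lyrics_song = song_html.find(attrs={'id':'lyric-body-text'})
    # replace the Windows-style line breaks with spaces
    clean_song = re.sub(r'\r\n', ' ', lyrics_song.text)
    return clean_song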
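`get_lyrics` assumes every page contains the lyrics element. If `find` returns None, the `.text` access raises an AttributeError. A more defensive variant of the parsing step might look like the sketch below. The name `extract_lyrics` is hypothetical, and it takes an HTML string instead of a URL so it can be tried without a network connection.

```python
import re

from bs4 import BeautifulSoup


def extract_lyrics(html):
    """Return the lyrics in an HTML page as one line, or None if absent."""
    doc = BeautifulSoup(html, 'html.parser')
    body = doc.find(id='lyric-body-text')
    if body is None:
        # Page has no lyrics element (e.g. an error page)
        return None
    # Collapse every run of whitespace, including \r\n, into a single space
    return re.sub(r'\s+', ' ', body.get_text()).strip()


# Made-up page in the same shape as a lyrics page
sample = """<pre id="lyric-body-text">I am my <a href="#">mother's</a> only one\r\nIt's enough\r\n\r\nNow you know</pre>"""

print(extract_lyrics(sample))
print(extract_lyrics('<p>no lyrics here</p>'))
```

Collapsing all whitespace also avoids the double spaces that appear where a blank line separated two verses.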

In [119]:
get_lyrics(bi_song_list[2])

"I am my mother's only one It's enough  I wear my garment so it shows Now you know  Only love is all maroon Gluey feathers on a flume Sky is womb and she's the moon  I am my mother on the wall, with us all I move in water, shore to shore; Nothing's more  Only love is all maroon Lapping lakes like leary loons Leaving rope burns Reddish rouge  Only love is all maroon Gluey feathers on a flume Sky is womb and she's the moon"

In [138]:
lyrics_list = []

for i in bi_song_list:
    
    text = get_lyrics(i)
    artist = i.split('/')[-2]
    
    lyrics_list.append((artist, text))

In [149]:
for i in bi_song_list:
    print(i.split('/')[-2])

Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Saint+Vincent
Bon+Iver
Kanye+West
Jay-Z
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
The+Flaming+Lips
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
Bon+Iver
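The artist names are still URL-encoded (`+` for spaces). A small sketch using the standard library to decode them. The helper name `artist_from_url` is made up for this example.

```python
from urllib.parse import unquote_plus


def artist_from_url(url):
    """Take the second-to-last path segment and decode '+' and %XX escapes."""
    return unquote_plus(url.split('/')[-2])


print(artist_from_url('http://www.lyrics.com/lyric/13416985/Bon+Iver/Blindsided'))
# Bon Iver
print(artist_from_url('http://www.lyrics.com/lyric/17777238/Saint+Vincent/Rosyln'))
# Saint Vincent
```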


In [140]:
import pandas as pd
df = pd.DataFrame(lyrics_list)
df

| | 0 | 1 |
|---|---|---|
| 0 | Bon+Iver | Bike down, down to the downtown Down to the lo... |
| 1 | Bon+Iver | I was lost but your fool Was a long visit wron... |
| 2 | Bon+Iver | I am my mother's only one It's enough I wear ... |
| 3 | Bon+Iver | So abruptly Saw death on a sunny snow For eve... |
| 4 | Bon+Iver | Sold my cold knot A heavy stone Sold my red ho... |
| 5 | Bon+Iver | This my excavation and to Day is Qumran Everyt... |
| 6 | Bon+Iver | Come on skinny love just last the year Pour a ... |
| 7 | Bon+Iver | Someday my pain, someday my pain Will mark you... |
| 8 | Bon+Iver | You ride in the park and you're peaking Piss p... |
| 9 | Bon+Iver | Summer comes To multiply To multiply! Summer ... |


In [141]:
df.to_csv('bon_iver_songs.csv')

In [142]:
df2 = pd.read_csv('bon_iver_songs.csv')
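By default `to_csv` also writes the row index, which comes back from `read_csv` as an extra `Unnamed: 0` column. A sketch of a cleaner round trip, using an in-memory buffer instead of a file. The column names `artist` and `lyrics` and the placeholder rows are assumptions for illustration.

```python
import io

import pandas as pd

# Placeholder rows in the same (artist, text) shape as lyrics_list
rows = [('Bon+Iver', 'first placeholder text'),
        ('Saint+Vincent', 'second placeholder text')]

# Name the columns up front instead of relying on the defaults 0 and 1
df = pd.DataFrame(rows, columns=['artist', 'lyrics'])

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(list(restored.columns))  # ['artist', 'lyrics']
```

With a real file the calls are the same, with a filename in place of `buffer`.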