![b4s](img/beautiful_soup.png)

## [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup)

## Benefits of *not* scraping
![options](img/other_options.png)

### Use case

![use](img/use_case.png)

### Goal

![python](img/how_works.png)

#### Discuss
What's a website you'd like to scrape?

### Scenario

I want to analyze the Grammy Award for Song of the Year to see if I can find any patterns in country of origin, singer, song content, etc.

But where do I start finding that data? Not from an API.

Well, we can start [here](https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year).

### This is our target
![target](img/target.png)

### Learning goals:

- scrape a basic Wikipedia page using Beautiful Soup
- transform the HTML table we want into a pandas `DataFrame`
- scrape a more complex Wikipedia page
- transform the scraped data we want into a pandas `DataFrame`
- if time allows, go hunt a wild website and scrape it

## Basic Wikipedia

![vheck](img/basic.gif)

Task: Get one column from a table on Wikipedia

Let's import the libraries we need

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Pass the URL to `requests.get` and assign the `.text` of the response (the page's HTML) to `website_url`

First, a Wikipedia article where we only want one column of information: the countries!

https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area

In [3]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text
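
A slightly more defensive version of this request might check the HTTP status before using the body. The sketch below is an assumption rather than part of the original notebook: `fetch_html` is a hypothetical helper name and the timeout value is arbitrary.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, raising on HTTP errors (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is never parsed as if it were the article
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# website_url = fetch_html('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area')
```

Failing early here makes later `None` results from `soup.find` easier to diagnose, since a bad URL is ruled out first.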

Pass that HTML to `BeautifulSoup` to create a `BeautifulSoup` object

In [6]:
soup = BeautifulSoup(website_url, 'lxml')
# print(soup.prettify())
type(soup)

bs4.BeautifulSoup
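
To see what a `BeautifulSoup` object offers without the noise of a full page, here is a small sketch on a made-up HTML string (the HTML is an assumption for illustration). It uses Python's built-in `html.parser` so it runs even without `lxml`.

```python
from bs4 import BeautifulSoup

# Tiny made-up page standing in for the downloaded HTML
html = """
<html>
  <head><title>List of Asian countries by area - Wikipedia</title></head>
  <body><p>Hello <a href="/wiki/Russia" title="Russia">Russia</a></p></body>
</html>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

page_title = demo_soup.title.string   # text inside <title>
first_link = demo_soup.find('a')      # first <a> tag in the document
link_target = first_link['href']      # attributes are read like dict keys

print(page_title)
print(link_target)
```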

In [7]:
print(soup)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of Asian countries by area - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Asian_countries_by_area","wgTitle":"List of Asian countries by area","wgCurRevisionId":895418120,"wgRevisionId":895418120,"wgArticleId":47659173,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from January 2017","All articles needing additional references","Asia-related lists","Lists by area","Lists of countries by geography"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTr

In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Asian countries by area - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Asian_countries_by_area","wgTitle":"List of Asian countries by area","wgCurRevisionId":895418120,"wgRevisionId":895418120,"wgArticleId":47659173,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from January 2017","All articles needing additional references","Asia-related lists","Lists by area","Lists of countries by geography"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel

Find the table with the class of interest

In [11]:
table = soup.find('table',{'class':'wikitable sortable'})

In [12]:
table

<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Country
</th>
<th>Area (km²)
</th>
<th class="unsortable">Notes
</th></tr>
<tr>
<td>1
</td>
<td><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></span> <a href="/wiki/Russia" title="Russia">Russia</a>*
</td>
<td>13,100,000
</td>
<td>17,098,242 including European part<sup class="reference" id="cite_ref-russiaTotalAreaByCIA_1-0"><a href="#cite_note-russiaTotalAreaByCIA-1">[1]</a></sup>
</td></tr>
<tr>
<td>2
</td>
<td><span class="flagicon" style="display:inline-block;width:25px;"><img alt=""
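
One thing worth knowing about the lookup above: when the class value contains a space, Beautiful Soup compares it against the whole `class` attribute, so a table carrying an extra class would not match. The sketch below shows this on a made-up table (the extra `plainrowheaders` class is an assumption for illustration) and compares it with a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up table whose class attribute carries one extra class
html = '<table class="wikitable sortable plainrowheaders"><tr><td>1</td></tr></table>'
demo_soup = BeautifulSoup(html, 'html.parser')

# A string with a space is compared against the whole class attribute
exact = demo_soup.find('table', {'class': 'wikitable sortable'})

# A CSS selector only requires that both classes are present
by_selector = demo_soup.select_one('table.wikitable.sortable')

print(exact)
print(by_selector is not None)
```

If a `find` call with a multi-word class unexpectedly returns `None`, trying `select_one` is a cheap first check.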

Keep looking at the HTML to see what the things you want to scrape have in common...

All the country names are links! We can use the `a` tag!

In [13]:
links = table.find_all('a')

We can now iterate over `links` and build a list from each link's `title` attribute. Links without a `title`, such as the `[1]` footnote references, come back as `None`

In [14]:
Countries = []
for link in links:
    Countries.append(link.get('title'))
    
print(Countries)

['Russia', None, 'China', 'Hong Kong', 'Macau', 'India', None, 'Kazakhstan', 'Saudi Arabia', 'Iran', 'Mongolia', 'Indonesia', 'Pakistan', 'Gilgit-Baltistan', 'Azad Kashmir', 'Turkey', 'Myanmar', 'Afghanistan', 'Yemen', 'Thailand', 'Turkmenistan', 'Uzbekistan', 'Iraq', 'Japan', 'Vietnam', 'Malaysia', 'Oman', 'Philippines', 'Laos', 'Kyrgyzstan', 'Syria', 'Golan Heights', 'Cambodia', 'Bangladesh', 'Nepal', 'Tajikistan', 'North Korea', 'South Korea', 'Jordan', 'Azerbaijan', 'United Arab Emirates', 'Georgia (country)', 'Sri Lanka', 'Egypt', 'Bhutan', 'Taiwan', 'Armenia', 'Israel', 'Kuwait', 'East Timor', 'Qatar', 'Lebanon', 'Cyprus', 'Northern Cyprus', 'State of Palestine', 'Brunei', 'Bahrain', 'Singapore', 'Maldives']
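
The `None` entries can be avoided at the source by asking only for links that have a `title`. This is a minimal sketch on a made-up fragment shaped like the table above; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up fragment: country links with a title, one footnote link without
html = """
<table>
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a>*</td>
      <td>17,098,242<sup><a href="#cite_note-1">[1]</a></sup></td></tr>
  <tr><td><a href="/wiki/China" title="China">China</a></td><td></td></tr>
</table>
"""
demo_table = BeautifulSoup(html, 'html.parser')

# title=True keeps only <a> tags that actually have a title attribute
titled_links = demo_table.find_all('a', title=True)
countries = [link['title'] for link in titled_links]

print(countries)
```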


Now, let's convert that list to a `DataFrame`

In [15]:
df = pd.DataFrame()
df['Country'] = Countries

In [16]:
df.head()

| | Country |
|---|---|
| 0 | Russia |
| 1 | |
| 2 | China |
| 3 | Hong Kong |
| 4 | Macau |
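
If the empty rows are not wanted, they can be dropped once the list is in a `DataFrame`. A small sketch, using a shortened hand-written list in place of the scraped one:

```python
import pandas as pd

# Same shape as the scraped list: names plus a None from a title-less link
countries = ['Russia', None, 'China', 'Hong Kong']

# Building from a dict names the column in one step
df_countries = pd.DataFrame({'Country': countries})

# dropna removes the None row; reset_index renumbers the rows from 0
df_countries = df_countries.dropna(subset=['Country']).reset_index(drop=True)

print(df_countries)
```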


## Less Basic - Get a whole table
Let's go inspect the website to find the right tag, heading, etc. for the table we want

What are the important tags here?<br>
What class is the important one?

`table`<br>
`wikitable sortable`

**Task**<br>
Work with a partner to comment the following code and figure out what it does

In [20]:
response = requests.get('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year').text
# downloads the Grammy page and keeps its HTML as a string
soup = BeautifulSoup(website_url,'lxml')
# parses HTML into a soup; note this passes website_url (the Asian countries page
# from earlier), not response, so the table below is the countries table

#print(soup.prettify())
# would print the parsed HTML with indentation

tab = soup.find("table",{"class":"wikitable sortable"})
# gets the first table with the "wikitable sortable" class

# pd.read_html(tab.prettify())
# would parse the table's HTML straight into a list of DataFrames

rows = tab.find_all('tr')
# finds all the rows in the table

data = []
for row in rows:
    data.append([x.get_text().strip() for x in row.find_all(['th','td'])])
# loops through the rows, takes the text of every header and data cell,
# strips surrounding whitespace, and collects one list per row

df = pd.DataFrame(data)
# converts the list of lists to a DataFrame

new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
# promotes the first row to column headers and drops it from the data

In [21]:
df

| | Rank | Country | Area (km²) | Notes | None |
|---|---|---|---|---|---|
| 1 | 1.0 | Russia* | 13100000 | 17,098,242 including European part[1] | |
| 2 | 2.0 | China | 9596961 | excludes Hong Kong, Macau, Taiwan and disputed... | |
| 3 | 3.0 | India[2] | 3287263 | | |
| 4 | 4.0 | Kazakhstan* | 2455034 | 2,724,902 km² including European part | |
| 5 | 5.0 | Saudi Arabia | 2149690 | | |
| 6 | 6.0 | Iran | 1648195 | | |
| 7 | 7.0 | Mongolia | 1564110 | | |
| 8 | 8.0 | Indonesia* | 1472639 | 1,904,569 km² including Oceanian part | |
| 9 | 9.0 | Pakistan | 796095 | 882,363 km² including Gilgit-Baltistan and AJK | |
| 10 | 10.0 | Turkey* | 747272 | 783,562 km² including European part | |


In [27]:
response = requests.get('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year').text
# downloads the Grammy page and keeps its HTML as a string
soup = BeautifulSoup(response,'lxml')
# parses that HTML into a searchable BeautifulSoup object

#print(soup.prettify())
# would print the parsed HTML with indentation

tab = soup.find("table",{"class":"wikitable sortable"})
# gets the first table with the "wikitable sortable" class

# pd.read_html(tab.prettify())
# would parse the table's HTML straight into a list of DataFrames

rows = tab.find_all('tr')
# finds all the rows in the table

data1 = []
for row in rows:
    data1.append([x.get_text().strip() for x in row.find_all(['th','td'])])
# loops through the rows, takes the text of every header and data cell,
# strips surrounding whitespace, and collects one list per row

df1 = pd.DataFrame(data1)
# converts the list of lists to a DataFrame

new_header = df1.iloc[0]
df1 = df1[1:]
df1.columns = new_header
# promotes the first row to column headers and drops it from the data
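
Since the same steps come up for every table, they could be wrapped in a function. The sketch below is one possible packaging, not the notebook's own code: `table_to_dataframe` is a hypothetical name, the demo table is made up, and it assumes every row has the same number of cells as the header row.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(table):
    """Turn a <table> tag into a DataFrame, using its first row as the header."""
    rows = [
        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
        for row in table.find_all('tr')
    ]
    # First row holds the column names, the rest is data
    return pd.DataFrame(rows[1:], columns=rows[0])


# Made-up table for illustration
html = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Winner(s)</th></tr>
  <tr><td>1959</td><td>Domenico Modugno</td></tr>
  <tr><td>1960</td><td>Jimmy Driftwood</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, 'html.parser').find('table')
demo_df = table_to_dataframe(demo_table)
print(demo_df)
```

Unlike the cell above, this version numbers the data rows from 0 rather than 1, because the header row is removed before the `DataFrame` is built.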

In [29]:
df1['Winner(s)']

1                                      Domenico Modugno
2                                       Jimmy Driftwood
3                                           Ernest Gold
4                            Henry ManciniJohnny Mercer
5                         Leslie BricusseAnthony Newley
6                            Henry ManciniJohnny Mercer
7                                          Jerry Herman
8                     Paul Francis WebsterJohnny Mandel
9                             John LennonPaul McCartney
10                                           Jimmy Webb
11                                        Bobby Russell
12                                            Joe South
13                                           Paul Simon
14                                          Carole King
15                                         Ewan MacColl
16                             Norman GimbelCharles Fox
17              Alan and Marilyn BergmanMarvin Hamlisch
18                                     Stephen S
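
Notice that co-writers run together in the output above ("Henry ManciniJohnny Mercer"), probably because the names are separate elements inside one cell and `get_text()` joins them with nothing in between. Passing a separator keeps them apart. A minimal sketch on a made-up cell (the `<br>` between the names is an assumption about the markup):

```python
from bs4 import BeautifulSoup

# Made-up cell with two names separated only by a <br> tag
html = '<table><tr><td><a>Henry Mancini</a><br/><a>Johnny Mercer</a></td></tr></table>'
cell = BeautifulSoup(html, 'html.parser').find('td')

joined = cell.get_text().strip()                       # names run together
separated = cell.get_text(separator=', ', strip=True)  # names kept apart

print(joined)
print(separated)
```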

In [28]:
df1

| | Year[I] | Winner(s) | Nationality | Work | Performing artist(s)[II] | Nominees | Ref. |
|---|---|---|---|---|---|---|---|
| 1 | 1959 | Domenico Modugno | Italy | "Volare" * | Domenico Modugno | Paul Vance & Lee Pockriss for "Catch a Falling... | [10] |
| 2 | 1960 | Jimmy Driftwood | United States | "The Battle of New Orleans" | Johnny Horton | Sammy Cahn & Jimmy Van Heusen for "High Hopes"... | [11] |
| 3 | 1961 | Ernest Gold | United States Austria | "Theme of Exodus" | Instrumental (Various Artists) | Charles Randolph Grean, Joe Allison & Audrey A... | [12] |
| 4 | 1962 | Henry ManciniJohnny Mercer | United States | "Moon River" * | Henry Mancini | Jimmy Dean for "Big Bad John" performed by Jim... | [13] |
| 5 | 1963 | Leslie BricusseAnthony Newley | United Kingdom | "What Kind of Fool Am I?" | Sammy Davis Jr. | Lionel Bart for "As Long as He Needs Me" perfo... | [14] |
| 6 | 1964 | Henry ManciniJohnny Mercer | United States | "Days of Wine and Roses" * | Henry Mancini | Sammy Cahn & Jimmy Van Heusen for "Call Me Irr... | [15] |
| 7 | 1965 | Jerry Herman | United States | "Hello, Dolly!" | Louis Armstrong | John Lennon & Paul McCartney for "A Hard Day's... | [16] |
| 8 | 1966 | Paul Francis WebsterJohnny Mandel | United States | "The Shadow of Your Smile" | Tony Bennett | Michel Legrand, Norman Gimbel & Jacques Demy f... | |
| 9 | 1967 | John LennonPaul McCartney | United Kingdom | "Michelle" | The Beatles | John Barry & Don Black for "Born Free" perform... | |
| 10 | 1968 | Jimmy Webb | United States | "Up, Up, and Away" * | The 5th Dimension | Jimmy Webb for "By the Time I Get to Phoenix" ... | [17] |


### But this is hard. Is there an easier way to do this?

Another way, if you **know** there is a `table` in the `html` somewhere

In [30]:
grammies = pd.read_html('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year')

`pd.read_html` returns a `list` of `DataFrames`, one per table it finds on the page, so `grammies` is a list<br>
We still need to find the _correct_ one
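
One way to narrow the list is the `match` argument of `pd.read_html`, which keeps only tables whose text matches a string or regular expression. The sketch below runs on a made-up two-table page rather than the live article, and assumes `lxml` (or another parser pandas supports) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; only the second one mentions "Winner(s)"
html = """
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Country</td><td>United States</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Winner(s)</th></tr>
  <tr><td>1959</td><td>Domenico Modugno</td></tr>
</table>
"""

all_tables = pd.read_html(StringIO(html))
# match keeps only tables whose text matches the given string or regex
matching = pd.read_html(StringIO(html), match='Winner')

print(len(all_tables), len(matching))
print(matching[0])
```

The same argument works when a URL is passed instead of a string; the result is still a list, so the table is taken with `matching[0]`.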

In [32]:
grammies

[  Grammy Award for Song of the Year  \
 0                       Awarded for   
 1                           Country   
 2                      Presented by   
 3                     First awarded   
 4                 Currently held by   
 5                           Website   
 
                  Grammy Award for Song of the Year.1  
 0     Quality song containing both lyrics and melody  
 1                                      United States  
 2    National Academy of Recording Arts and Sciences  
 3                                               1959  
 4  Donald Glover, Ludwig Göransson & Jeffery Lama...  
 5                                         grammy.com  ,
     Year[I]                                          Winner(s)  \
 0      1959                                   Domenico Modugno   
 1      1960                                    Jimmy Driftwood   
 2      1961                                        Ernest Gold   
 3      1962                         Henry ManciniJohnny 

Another way, using the same idea: find the table with Beautiful Soup, then hand its HTML to `pd.read_html`

In [34]:
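response = requests.get('https://en.wikipedia.org/wiki List_of_American_Grammy_Award_winners_and_nominees').text
soup = BeautifulSoup(response, 'lxml')

tab = soup.find("table",{"class":"wikitable sortable"})
# tab is None when no table on the page has exactly this class string
df = pd.read_html(tab.prettify())

AttributeError: 'NoneType' object has no attribute 'prettify'

`soup.find` returns `None` when nothing matches, so calling `.prettify()` on the result fails. Note that the URL above has a space after `wiki` where a `/` would be expected, so the request probably did not reach the intended article.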
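
An `AttributeError` on `NoneType` usually means the lookup found nothing. Checking for `None` right after the lookup gives a clearer message. A minimal sketch, where `find_table` is a hypothetical helper and the demo page is made up:

```python
from bs4 import BeautifulSoup


def find_table(soup, selector):
    """Return the first tag matching `selector`, or raise a readable error (hypothetical helper)."""
    table = soup.select_one(selector)
    if table is None:
        raise ValueError(f'no element matches {selector!r}; check the URL and the page source')
    return table


# Made-up page that contains no table at all, like an error page
demo_soup = BeautifulSoup('<html><body><p>Not found</p></body></html>', 'html.parser')

try:
    find_table(demo_soup, 'table.wikitable.sortable')
    found = True
except ValueError as err:
    found = False
    message = str(err)
    print(message)
```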

## Now find a free-range website

Get in groups of four and try to scrape a website into a pandas `DataFrame`

In [None]:
# candidate site: https://www.bloomberg.com/markets/stocks

In [35]:
response = requests.get('https://www.bloomberg.com/markets/stocks').text

In [38]:
soup = BeautifulSoup(response,'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Bloomberg - Are you a robot?
  </title>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://assets.bwbx.io/font-service/css/BWHaasGrotesk-55Roman-Web,BWHaasGrotesk-75Bold-Web,BW%20Haas%20Text%20Mono%20A-55%20Roman/font-face.css" rel="stylesheet" type="text/css"/>
  <style rel="stylesheet" type="text/css">
   html, body, div, span, applet, object, iframe,
        h1, h2, h3, h4, h5, h6, p, blockquote, pre,
        a, abbr, acronym, address, big, cite, code,
        del, dfn, em, img, ins, kbd, q, s, samp,
        small, strike, strong, sub, sup, tt, var,
        b, u, i, center,
        dl, dt, dd, ol, ul, li,
        fieldset, form, label, legend,
        table, caption, tbody, tfoot, thead, tr, th, td,
        article, aside, canvas, details, embed,
        figure, figcaption, footer, header, hgroup,
        menu, nav, output, ruby, section, summary,
        time, mark, audio, video {
            mar

In [42]:
tab = soup.find("table",{"class":"data-table"})
tab

In [41]:
pd.read_html(tab.prettify())
# parses the table's HTML into a list of DataFrames

rows = tab.find_all('tr')
# finds all the rows in the table

data1 = []
for row in rows:
    data1.append([x.get_text().strip() for x in row.find_all(['th','td'])])
# loops through the rows, takes the text of every header and data cell,
# strips surrounding whitespace, and collects one list per row

df1 = pd.DataFrame(data1)
# converts the list of lists to a DataFrame

new_header = df1.iloc[0]
df1 = df1[1:]
df1.columns = new_header
# promotes the first row to column headers and drops it from the data

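AttributeError: 'NoneType' object has no attribute 'find_all'

The page Bloomberg returned is titled "Bloomberg - Are you a robot?", a bot check rather than the stocks page. `soup.find` found no `data-table` table in it and returned `None`, which is why the next call fails.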

In [43]:
response = requests.get('http://schoolprofiles.fcps.edu/schlprfl/f?p=108:9:::NO::P0_CURRENT_SCHOOL_ID,P0_EDSL:320,0')

In [47]:
soup = BeautifulSoup(response.text,'lxml')
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
  <meta content="IE=edge" http-equiv="x-ua-compatible"/>
  <link href="/edsl_edslr_img/app_ui/css/Core.min.css?v=5.1.2.00.09" rel="stylesheet" type="text/css"/>
  <link href="/edsl_edslr_img/app_ui/css/Theme-Standard.min.css?v=5.1.2.00.09" rel="stylesheet" type="text/css"/>
  <link href="/edsl_edslr_img/libraries/jquery-ui/1.10.4/themes/base/jquery-ui.min.css?v=5.1.2.00.09" rel="stylesheet" type="text/css"/>
  <link href="/edsl_edslr_img/legacy_ui/css/5.0.min.css?v=5.1.2.00.09" rel="stylesheet" type="text/css"/>
  <link href="edsl/r/108/files/static/v80/styles.css" rel="stylesheet" type="text/css"/>
  <link href="edsl/r/108/files/static/v80/external_school_profile.css" rel="stylesheet" type="text/css"/>
  <script type="text/javascript">
   var apex_img_dir = "/edsl_edslr_img/", htmldb_Img_Dir = apex_img_dir;
  </script>
  <!--[if lt IE 9]><script type="text/javascript" 

In [51]:
tab = soup.find(attrs={"id":"report_R54044944978002207"})
tab

<table cellpadding="0" cellspacing="0" class="t3standardalternatingrowcolors" id="report_R54044944978002207" width="70%">
<tr>
<th align="left" class="t3header" nowrap="" width="40%">Subject</th>
<th align="center" class="t3header" scope="col" width="30%"><a class="forscreen" href="javascript: openWindow('f?p=108:6:::::P0_CURRENT_SCHOOL_ID,P6_PAGETYPE:320,ACCREDITATION', 500, 300);">Accreditation Pass Rate</a><span class="forprint">Accreditation Pass Rate</span></th>
</tr>
<tr>
<td align="left" class="t3dataalt" nowrap="" scope="row">English</td>
<td align="center" class="t3dataalt">92</td>
</tr>
<tr>
<td align="left" class="t3data" nowrap="" scope="row">Graduation And Completion Index</td>
<td align="center" class="t3data">93</td>
</tr>
<tr>
<td align="left" class="t3dataalt" nowrap="" scope="row">History</td>
<td align="center" class="t3dataalt">90</td>
</tr>
<tr>
<td align="left" class="t3data" nowrap="" scope="row">Mathematics</td>
<td align="center" class="t3data">83</td>
</tr>
<t
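
An `id` is meant to be unique on a page, which makes it a convenient handle when a table has one. The sketch below shows three equivalent ways to look a tag up by `id`, on a made-up fragment that reuses the `id` from the page above.

```python
from bs4 import BeautifulSoup

# Made-up fragment reusing the id seen in the real page
html = '<div><table id="report_R54044944978002207"><tr><td>English</td><td>92</td></tr></table></div>'
demo_soup = BeautifulSoup(html, 'html.parser')

by_attrs = demo_soup.find(attrs={'id': 'report_R54044944978002207'})  # as in the cell above
by_keyword = demo_soup.find('table', id='report_R54044944978002207')  # keyword form
by_selector = demo_soup.select_one('#report_R54044944978002207')      # CSS selector form

print(by_attrs.name, by_keyword.name, by_selector.name)
```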

In [None]:
<table id="report_R54044944978002207" class="t3standardalternatingrowcolors" width="70%" cellpadding="0" cellspacing="0">

<tbody><tr>
   <th class="t3header" width="40%" nowrap="" align="left">Subject</th>
   <th class="t3header" align="center" scope="col" width="30%"><a class="forscreen" href="javascript: openWindow('f?p=108:6:::::P0_CURRENT_SCHOOL_ID,P6_PAGETYPE:320,ACCREDITATION', 500, 300);">Accreditation Pass Rate</a><span class="forprint">Accreditation Pass Rate</span></th>
   </tr>	
<tr>
   <td class="t3dataalt" nowrap="" align="left" scope="row">English</td>
   <td class="t3dataalt" align="center">92</td>

</tr>
<tr>
   <td class="t3data" nowrap="" align="left" scope="row">Graduation And Completion Index</td>
   <td class="t3data" align="center">93</td>

</tr>
<tr>
   <td class="t3dataalt" nowrap="" align="left" scope="row">History</td>
   <td class="t3dataalt" align="center">90</td>

</tr>
<tr>
   <td class="t3data" nowrap="" align="left" scope="row">Mathematics</td>
   <td class="t3data" align="center">83</td>

</tr>
<tr>
   <td class="t3dataalt" nowrap="" align="left" scope="row">Science</td>
   <td class="t3dataalt" align="center">86</td>

</tr>
</tbody></table>

In [53]:
pd.read_html(tab.prettify())
# parses the table's HTML into a list of DataFrames (the result is not kept here)

rows = tab.find_all('tr')
# finds all the rows in the table

data1 = []
for row in rows:
    data1.append([x.get_text().strip() for x in row.find_all(['th','td'])])
# loops through the rows, takes the text of every header and data cell,
# strips surrounding whitespace, and collects one list per row

df1 = pd.DataFrame(data1)
# converts the list of lists to a DataFrame

new_header = df1.iloc[0]
df1 = df1[1:]
df1.columns = new_header
# promotes the first row to column headers and drops it from the data
df1

| | Subject | Accreditation Pass RateAccreditation Pass Rate |
|---|---|---|
| 1 | English | 92 |
| 2 | Graduation And Completion Index | 93 |
| 3 | History | 90 |
| 4 | Mathematics | 83 |
| 5 | Science | 86 |
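
The doubled header "Accreditation Pass RateAccreditation Pass Rate" comes from the header cell holding the label twice: once in an `<a class="forscreen">` and once in a `<span class="forprint">`. One way to clean it up is to remove the print-only copy before reading the text. A sketch on the header cell alone, trimmed from the table HTML shown earlier (the `href` is shortened for readability):

```python
from bs4 import BeautifulSoup

# Header cell copied from the table above, trimmed to the relevant parts
html = (
    '<table><tr><th><a class="forscreen" href="#">Accreditation Pass Rate</a>'
    '<span class="forprint">Accreditation Pass Rate</span></th></tr></table>'
)
header_cell = BeautifulSoup(html, 'html.parser').find('th')

before = header_cell.get_text().strip()

# Remove the print-only copy so the label appears once
for span in header_cell.select('span.forprint'):
    span.decompose()

after = header_cell.get_text().strip()

print(before)
print(after)
```

Note that `decompose()` modifies the parsed tree in place, so it should run before the rows are turned into lists.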
