![b4s](img/beautiful_soup.png)

## [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup)

## Benefits of *not* scraping
![options](img/other_options.png)

### Use case

![use](img/use_case.png)

### Goal

![python](img/how_works.png)

#### Discuss
What's a website you'd like to scrape?

### Scenario

I want to analyze the Grammy Award for Song of the Year to see if I can find any patterns in country of origin, singer, song content, etc.

But where do I start finding that data? Not from an API.

Well, we can start [here](https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year).

### This is our target
![target](img/target.png)

### Learning goals:

- scrape a basic Wikipedia page using Beautiful Soup
- transform the HTML table we want into a pandas `DataFrame`
- scrape a more complex Wikipedia page
- transform the scraped data we want into a pandas `DataFrame`
- if time allows, go hunt a wild website and scrape it

## Basic Wikipedia

![vheck](img/basic.gif)

Task: Get one column from a table on Wikipedia

Let's import the libraries we need

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Pass the URL to `requests.get` and assign the response's `.text` (the page's HTML as a string) to `website_url`

First, a Wikipedia article where we only want one column of information: the countries!

https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area

In [2]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text
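A slightly more defensive version of the same download step might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original lesson; `timeout` and `raise_for_status()` are standard `requests` features.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML as text."""
    # timeout keeps the call from hanging forever on a slow server
    response = requests.get(url, timeout=timeout)
    # raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area')` would then play the same role as the `website_url` assignment above.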

Parse the HTML string into a `BeautifulSoup` object

In [3]:
soup = BeautifulSoup(website_url,'lxml')
#print(soup.prettify())

In [4]:
type(soup)

bs4.BeautifulSoup

Find the `table` with the class of interest

In [5]:
table = soup.find('table',{'class':'wikitable sortable'})
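One thing worth knowing about this lookup: passing a multi-word string such as `'wikitable sortable'` matches the *exact* value of the `class` attribute. The sketch below uses a made-up HTML snippet (not the real Wikipedia page) and the built-in `html.parser` to show the difference between that exact match and a CSS selector, which matches a table that has both classes regardless of any extra ones.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: the second table carries an extra class
html = """
<table class="wikitable"><tr><td>not sortable</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>target</td></tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: finds nothing here,
# because neither table's class value is exactly "wikitable sortable"
exact = soup_demo.find("table", {"class": "wikitable sortable"})

# CSS selector: matches any table that has both classes, in any order
by_selector = soup_demo.select_one("table.wikitable.sortable")

print(exact)
print(by_selector.get_text(strip=True))
```

If `find` ever returns `None` on a page where you can see the table, an extra class on the tag is one possible cause to check.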

In [8]:
print(soup)

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of Asian countries by area - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Asian_countries_by_area","wgTitle":"List of Asian countries by area","wgCurRevisionId":895418120,"wgRevisionId":895418120,"wgArticleId":47659173,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from January 2017","All articles needing additional references","Asia-related lists","Lists by area","Lists of countries by geography"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTr

In [9]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Asian countries by area - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Asian_countries_by_area","wgTitle":"List of Asian countries by area","wgCurRevisionId":895418120,"wgRevisionId":895418120,"wgArticleId":47659173,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from January 2017","All articles needing additional references","Asia-related lists","Lists by area","Lists of countries by geography"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel

Keep looking at the HTML to see if you can find any commonalities in what you want to scrape.

All the country names are links! We can use the `a` tag!

In [10]:
links = table.find_all('a')

We can now iterate over `links` and collect each link's `title` attribute into a list. Links without a `title` attribute yield `None`.

In [14]:
Countries = []
for link in links:
    Countries.append(link.get('title'))
    
print(Countries)


['Russia', None, 'China', 'Hong Kong', 'Macau', 'India', None, 'Kazakhstan', 'Saudi Arabia', 'Iran', 'Mongolia', 'Indonesia', 'Pakistan', 'Gilgit-Baltistan', 'Azad Kashmir', 'Turkey', 'Myanmar', 'Afghanistan', 'Yemen', 'Thailand', 'Turkmenistan', 'Uzbekistan', 'Iraq', 'Japan', 'Vietnam', 'Malaysia', 'Oman', 'Philippines', 'Laos', 'Kyrgyzstan', 'Syria', 'Golan Heights', 'Cambodia', 'Bangladesh', 'Nepal', 'Tajikistan', 'North Korea', 'South Korea', 'Jordan', 'Azerbaijan', 'United Arab Emirates', 'Georgia (country)', 'Sri Lanka', 'Egypt', 'Bhutan', 'Taiwan', 'Armenia', 'Israel', 'Kuwait', 'East Timor', 'Qatar', 'Lebanon', 'Cyprus', 'Northern Cyprus', 'State of Palestine', 'Brunei', 'Bahrain', 'Singapore', 'Maldives']
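The `None` entries come from links that have no `title` attribute, such as footnote markers. A minimal sketch of one way to skip them, using a made-up three-row table rather than the real page:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: the middle link is a footnote with no title
html = """
<table>
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a></td></tr>
  <tr><td><a href="#cite_note-1">[1]</a></td></tr>
  <tr><td><a href="/wiki/China" title="China">China</a></td></tr>
</table>
"""
demo_links = BeautifulSoup(html, "html.parser").find_all("a")

# .get('title') returns None when the attribute is missing
titles = [a.get("title") for a in demo_links]

# keep only the links that actually carry a title attribute
clean_titles = [a["title"] for a in demo_links if a.has_attr("title")]

print(titles)
print(clean_titles)
```

Note that this only removes the `None` values; entries such as 'Hong Kong' or 'Gilgit-Baltistan', which are linked inside other rows, would still need a separate decision.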


Now, let's convert that list to a `DataFrame`

In [15]:
df = pd.DataFrame()
df['Country'] = Countries

In [16]:
df.head()

|   | Country |
|---|---------|
| 0 | Russia |
| 1 | None |
| 2 | China |
| 3 | Hong Kong |
| 4 | Macau |
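Row 1 is empty because the list contained `None`. As a rough sketch, the frame can also be built in one step and the missing rows dropped; the short list below is a hand-copied sample, and the names `countries_demo` and `demo_df` are illustrative only.

```python
import pandas as pd

# Hand-copied sample of the scraped list, including one missing title
countries_demo = ["Russia", None, "China", "Hong Kong"]

demo_df = (
    pd.DataFrame({"Country": countries_demo})
    .dropna(subset=["Country"])   # remove rows where the title was missing
    .reset_index(drop=True)       # renumber the rows 0..n-1
)
print(demo_df)
```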


## Less Basic - Get a whole table
Let's go inspect the website to find the right tag, heading, etc. for the table we want

What are the important tags here?<br>
What class is the important one?

`table`<br>
`wikitable sortable`

**Task**<br>
Work with a partner to comment the following code and figure out what it does

In [18]:
response = requests.get('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year').text
# download the page and store its HTML text in `response`
print(response)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Grammy Award for Song of the Year - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Grammy_Award_for_Song_of_the_Year","wgTitle":"Grammy Award for Song of the Year","wgCurRevisionId":895745021,"wgRevisionId":895745021,"wgArticleId":44636,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use mdy dates from April 2012","Grammy Award categories","Grammy Award for Song of the Year","Songwriting awards"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wg

In [19]:
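soup = BeautifulSoup(website_url,'lxml')
# parse the HTML so it is easier to search
# NOTE: this parses `website_url` (the Asian countries page from In [2]), not `response`,
# so the outputs below show the countries table; pass `response` to parse the Grammy page
#print(soup.prettify())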

In [23]:
tab = soup.find("table",{"class":"wikitable sortable"})
# identify the table of interest in the parsed HTML
# pd.read_html(tab.prettify())
print(tab)

<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Country
</th>
<th>Area (km²)
</th>
<th class="unsortable">Notes
</th></tr>
<tr>
<td>1
</td>
<td><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></span> <a href="/wiki/Russia" title="Russia">Russia</a>*
</td>
<td>13,100,000
</td>
<td>17,098,242 including European part<sup class="reference" id="cite_ref-russiaTotalAreaByCIA_1-0"><a href="#cite_note-russiaTotalAreaByCIA-1">[1]</a></sup>
</td></tr>
<tr>
<td>2
</td>
<td><span class="flagicon" style="display:inline-block;width:25px;"><img alt=""

In [24]:
rows = tab.find_all('tr')
# collect every table row (`tr` tag) inside `tab`
print(rows)

[<tr>
<th>Rank
</th>
<th>Country
</th>
<th>Area (km²)
</th>
<th class="unsortable">Notes
</th></tr>, <tr>
<td>1
</td>
<td><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/23px-Flag_of_Russia.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/35px-Flag_of_Russia.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/f/f3/Flag_of_Russia.svg/45px-Flag_of_Russia.svg.png 2x" width="23"/></span> <a href="/wiki/Russia" title="Russia">Russia</a>*
</td>
<td>13,100,000
</td>
<td>17,098,242 including European part<sup class="reference" id="cite_ref-russiaTotalAreaByCIA_1-0"><a href="#cite_note-russiaTotalAreaByCIA-1">[1]</a></sup>
</td></tr>, <tr>
<td>2
</td>
<td><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="

In [26]:
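data = []
for row in rows:
    # for each row, take the stripped text of every header (`th`) and data (`td`) cell
    data.append([x.get_text().strip() for x in row.find_all(['th','td'])])

# `data` is now a list of lists: the header row first, then one list per country
print(data)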

[['Rank', 'Country', 'Area (km²)', 'Notes'], ['1', 'Russia*', '13,100,000', '17,098,242 including European part[1]'], ['2', 'China', '9,596,961', 'excludes Hong Kong, Macau, Taiwan and disputed areas/islands'], ['3', 'India[2]', '3,287,263', ''], ['4', 'Kazakhstan*', '2,455,034', '2,724,902\xa0km² including European part'], ['5', 'Saudi Arabia', '2,149,690', ''], ['6', 'Iran', '1,648,195', ''], ['7', 'Mongolia', '1,564,110', ''], ['8', 'Indonesia*', '1,472,639', '1,904,569\xa0km² including Oceanian part'], ['9', 'Pakistan', '796,095', '882,363\xa0km² including Gilgit-Baltistan and AJK'], ['10', 'Turkey*', '747,272', '783,562\xa0km² including European part'], ['11', 'Myanmar', '676,578', ''], ['12', 'Afghanistan', '652,230', ''], ['13', 'Yemen', '527,968', ''], ['14', 'Thailand', '513,120', ''], ['15', 'Turkmenistan', '488,100', ''], ['16', 'Uzbekistan', '447,400', ''], ['17', 'Iraq', '438,317', ''], ['18', 'Japan', '377,930', ''], ['19', 'Vietnam', '331,212', ''], ['20', 'Malaysia', '3

In [28]:
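df = pd.DataFrame(data)

new_header = df.iloc[0]   # the first row holds the column names
df = df[1:]               # keep everything after the header row
df.columns = new_header   # promote the first row to column labels

df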

|    | Rank | Country | Area (km²) | Notes | None |
|----|------|---------|------------|-------|------|
| 1 | 1.0 | Russia* | 13100000 | 17,098,242 including European part[1] | |
| 2 | 2.0 | China | 9596961 | excludes Hong Kong, Macau, Taiwan and disputed... | |
| 3 | 3.0 | India[2] | 3287263 | | |
| 4 | 4.0 | Kazakhstan* | 2455034 | 2,724,902 km² including European part | |
| 5 | 5.0 | Saudi Arabia | 2149690 | | |
| 6 | 6.0 | Iran | 1648195 | | |
| 7 | 7.0 | Mongolia | 1564110 | | |
| 8 | 8.0 | Indonesia* | 1472639 | 1,904,569 km² including Oceanian part | |
| 9 | 9.0 | Pakistan | 796095 | 882,363 km² including Gilgit-Baltistan and AJK | |
| 10 | 10.0 | Turkey* | 747272 | 783,562 km² including European part | |
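The header promotion above can also be done when the frame is created, by splitting the list of rows into a header and a body. The sketch below uses two hand-copied rows from `data` and additionally shows one possible way to turn the comma-formatted areas into integers; the names `data_demo` and `demo_table` are illustrative only.

```python
import pandas as pd

# Two hand-copied rows in the same shape as `data` above
data_demo = [
    ["Rank", "Country", "Area (km²)"],
    ["1", "Russia*", "13,100,000"],
    ["2", "China", "9,596,961"],
]

# first row becomes the column labels, the rest become the rows
header, *body = data_demo
demo_table = pd.DataFrame(body, columns=header)

# scraped cells are strings: drop the thousands separators, then convert
demo_table["Area (km²)"] = (
    demo_table["Area (km²)"].str.replace(",", "", regex=False).astype(int)
)
print(demo_table)
```

This version assumes every row has the same number of cells as the header; a row with an extra cell would raise an error instead of silently creating an unnamed column.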


### But this is hard. Is there an easier way to do this?

There is another way if you **know** there is a `table` somewhere in the HTML: let pandas parse it directly.

In [29]:
grammies = pd.read_html('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year')

`pd.read_html` returns a `list` of `DataFrame`s, one for each table it finds on the page, so `grammies` is a list<br>
We still need to find the _correct_ one

In [30]:
len(grammies)

5

In [34]:
print(grammies[3])

                                                   0  \
0                                    vteGrammy Award   
1  Categories Grammy Nominees Records Locations EGOT   
2                                     Special awards   
3                                      Ceremony year   
4                                            Related   
5                                         By Country   
6  Grammy Award Record of the Year Song of the Ye...   

                                                   1  
0                                                NaN  
1                                                NaN  
2  Legend Award Lifetime Achievement Award Truste...  
3  1959 May Nov 1961 1962 1963 1964 1965 1966 196...  
4                                      Grammy Museum  
5  American Argentine Australian Austrian Brazili...  
6                                                NaN  
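Instead of indexing into the list and printing each entry until the right one turns up, `pd.read_html` accepts a `match` argument that keeps only tables containing the given text. The sketch below runs on a made-up two-table HTML string wrapped in `StringIO`, so it does not touch the network; the table contents are illustrative, and it assumes a parser that pandas supports (such as `lxml`) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML for illustration: a navigation table and the table we want
html = """
<table>
  <thead><tr><th>Navigation</th></tr></thead>
  <tbody><tr><td>Related links</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Nominee</th><th>Wins</th><th>Nominations</th></tr></thead>
  <tbody>
    <tr><td>Quincy Jones</td><td>28</td><td>80</td></tr>
    <tr><td>Alison Krauss</td><td>27</td><td>42</td></tr>
  </tbody>
</table>
"""

all_tables = pd.read_html(StringIO(html))                   # every table found
matching = pd.read_html(StringIO(html), match="Nominee")    # only tables containing "Nominee"
winners = matching[0]
print(len(all_tables), len(matching))
print(winners)
```

On a real page the same idea would be to pass the URL plus a `match` string taken from a column header or cell you can see in the target table.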


Another way with the same concept: find the table with BeautifulSoup first, then hand only that table to `pd.read_html`.

In [60]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_American_Grammy_Award_winners_and_nominees').text


In [61]:
response

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of American Grammy Award winners and nominees - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_American_Grammy_Award_winners_and_nominees","wgTitle":"List of American Grammy Award winners and nominees","wgCurRevisionId":897031016,"wgRevisionId":897031016,"wgArticleId":56375923,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from May 2019","All articles needing additional references","Lists of Grammy Award winners and nominees by nationality","Lists of American musicians"],"wgBreakFrames":

In [62]:
soup = BeautifulSoup(response,'lxml')
soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of American Grammy Award winners and nominees - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_American_Grammy_Award_winners_and_nominees","wgTitle":"List of American Grammy Award winners and nominees","wgCurRevisionId":897031016,"wgRevisionId":897031016,"wgArticleId":56375923,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from May 2019","All articles needing additional references","Lists of Grammy Award winners and nominees by nationality","Lists of American musicians"],"wgBreakFrames":false,"wg

In [63]:
tab = soup.find("table",{"class":"wikitable sortable"})
tab

<table class="wikitable sortable">
<tbody><tr>
<th>Nominee</th>
<th>Wins</th>
<th>Nominations
</th></tr>
<tr>
<td><a href="/wiki/Quincy_Jones" title="Quincy Jones">Quincy Jones</a> <sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup></td>
<td>28</td>
<td>80
</td></tr>
<tr>
<td><a href="/wiki/Alison_Krauss" title="Alison Krauss">Alison Krauss</a> <sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup></td>
<td>27</td>
<td>42
</td></tr>
<tr>
<td><a href="/wiki/Stevie_Wonder" title="Stevie Wonder">Stevie Wonder</a> <sup class="reference" id="cite_ref-3"><a href="#cite_note-3">[3]</a></sup></td>
<td>25</td>
<td>74
</td></tr>
<tr>
<td><a href="/wiki/Vladimir_Horowitz" title="Vladimir Horowitz">Vladimir Horowitz</a> <sup class="reference" id="cite_ref-4"><a href="#cite_note-4">[4]</a></sup></td>
<td>25</td>
<td>45
</td></tr>
<tr>
<td><a href="/wiki/John_Williams" title="John Williams">John Williams</a> <sup class="reference" id="cite_ref-5"><a href=

In [68]:
# read_html still returns a list, even for a single table; the frame itself is df[0]
df = pd.read_html(tab.prettify())
df

[                                                     0     1            2
 0                                              Nominee  Wins  Nominations
 1                                    Quincy Jones  [1]    28           80
 2                                   Alison Krauss  [2]    27           42
 3                                   Stevie Wonder  [3]    25           74
 4                               Vladimir Horowitz  [4]    25           45
 5                                   John Williams  [5]    24           69
 6                                         Beyoncé  [6]    23           66
 7                                           Jay-Z  [7]    22           77
 8                                     Chick Corea  [8]    22           64
 9                                      Kanye West  [9]    21           69
 10                                    Vince Gill  [10]    21           45
 11                                 Henry Mancini  [11]    20           72
 12                      

## Now find a free-range website

Get in groups of four and try to scrape a website into a pandas `DataFrame`

In [77]:
programs = pd.read_html('https://www.nova.edu/graduate/masters.html')
df = programs[0]  # read_html already returns DataFrames; take the first table
df

|   | Program | On-Campus | Off-Campus | Online |
|---|---------|-----------|------------|--------|
| 0 | Accounting (M.Acc.) | | | |
| 1 | Anesthesiologist Assistant (M.S.) | | | |
| 2 | Athletic Training (M.S.A.T.) | | | |
| 3 | Biological Sciences (M.S.) | | | |
| 4 | Biomedical Informatics (M.S.B.I.) | | | |
| 5 | Biomedical Sciences (M.B.S.) | | | |
| 6 | Business Administration (M.B.A.) | | | |
| 7 | Business Administration in Business Intelligen... | | | |
| 8 | Business Administration in Complex Health Syst... | | | |
| 9 | Business Administration in Entrepreneurship (M... | | | |


In [69]:
website_url = requests.get('https://www.nova.edu/graduate/masters.html').text

In [70]:
soup = BeautifulSoup(website_url,'lxml')

In [102]:
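table = soup.find('table')
# iterating over a Tag yields its direct children (here whitespace, <thead> and <tbody>),
# not the individual rows; use table.find_all('tr') to loop over rows
for row in table:
    print(row)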



<thead>
<tr>
<th scope="col">Program</th>
<th scope="col">On-Campus</th>
<th scope="col">Off-Campus</th>
<th scope="col">Online</th>
</tr>
</thead>


<tbody>
<tr>
<th class="row" scope="row"><a href="https://www.business.nova.edu/masters/master-of-accounting.html">Accounting (M.Acc.)</a></th>
<td class="text-center"><img alt="On-Campus" class="center-block" src="//www.nova.edu/common-lib/includes/images/apply.gif"/></td>
<td class="text-center"> Â  </td>
<td class="text-center"><img alt="Online" class="center-block" src="//www.nova.edu/common-lib/includes/images/apply.gif"/></td>
</tr>
<tr>
<th class="row" scope="row"><a href="https://healthsciences.nova.edu/healthsciences/anesthesia/index.html">Anesthesiologist Assistant (M.S.)</a></th>
<td class="text-center"><img alt="On-Campus" class="center-block" src="//www.nova.edu/common-lib/includes/images/apply.gif"/></td>
<td class="text-center"> Â  </td>
<td class="text-center"> Â  </td>
</tr>
<tr>
<th class="row" scope="row"><a href="htt
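The raw HTML explains why the availability columns came out empty in the `read_html` result: each "yes" is an `<img>` tag rather than text. A minimal sketch of one way to recover that information by checking each cell for an image is shown below. The HTML is a shortened, made-up imitation of the structure printed above (two availability columns instead of three), and `records` and `programs_demo` are illustrative names.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened imitation of the table structure shown above
html = """
<table>
<thead><tr><th>Program</th><th>On-Campus</th><th>Online</th></tr></thead>
<tbody>
<tr><th scope="row">Accounting (M.Acc.)</th>
    <td><img alt="On-Campus" src="apply.gif"/></td>
    <td><img alt="Online" src="apply.gif"/></td></tr>
<tr><th scope="row">Anesthesiologist Assistant (M.S.)</th>
    <td><img alt="On-Campus" src="apply.gif"/></td>
    <td> </td></tr>
</tbody>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")

records = []
for tr in demo_table.find("tbody").find_all("tr"):
    name = tr.find("th").get_text(strip=True)
    # a cell counts as available when it contains an <img> tag
    flags = [td.find("img") is not None for td in tr.find_all("td")]
    records.append([name, *flags])

programs_demo = pd.DataFrame(records, columns=["Program", "On-Campus", "Online"])
print(programs_demo)
```

On the real page the column list would need the third availability column, and the assumption that an image always means "offered" should be checked against the site.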

In [82]:
type(table)

bs4.element.Tag

In [98]:
html_table = pd.read_html(table.prettify())
html_table

[                                              Program On-Campus Off-Campus  \
 0                                 Accounting (M.Acc.)       NaN          Â   
 1                   Anesthesiologist Assistant (M.S.)       NaN          Â   
 2                        Athletic Training (M.S.A.T.)       NaN          Â   
 3                          Biological Sciences (M.S.)       NaN          Â   
 4                   Biomedical Informatics (M.S.B.I.)       NaN          Â   
 5                        Biomedical Sciences (M.B.S.)       NaN          Â   
 6                    Business Administration (M.B.A.)       NaN        NaN   
 7   Business Administration in Business Intelligen...       NaN        NaN   
 8   Business Administration in Complex Health Syst...       NaN          Â   
 9   Business Administration in Entrepreneurship (M...       NaN          Â   
 10        Business Administration in Finance (M.B.A.)       NaN          Â   
 11  Business Administration in Human Resources (M..

In [99]:
type(html_table)

list

In [100]:
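# html_table is a list, so wrapping it in pd.DataFrame gives a single-cell frame;
# the parsed table itself is html_table[0]
pd.DataFrame(html_table)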

Unnamed: 0,0
0,...
