![b4s](img/beautiful_soup.png)

## [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup)

## Benefits of *not* scraping
![options](img/other_options.png)

### Use case

![use](img/use_case.png)

### Goal

![python](img/how_works.png)

#### Discuss
What's a website you'd like to scrape?

### Scenario

I want to analyze the Grammy Award for Song of the Year to see if I can find any patterns in country of origin, singer, song content, etc.

But where do I start finding that data? Not from an API.

Well, we can start [here](https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year)

### This is our target
![target](img/target.png)

### Learning goals:

- scrape a basic Wikipedia page using Beautiful Soup
- transform the HTML table we want into a pandas `DataFrame`
- scrape a more complex Wikipedia page
- transform the scraped data we want into a pandas `DataFrame`
- if time allows, go hunt a wild website and scrape it

## Basic Wikipedia

![vheck](img/basic.gif)

Task: Get one column from a table on Wikipedia

Let's import the libraries we need

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

First, a Wikipedia article where we only want one column of information: the countries!

https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area

Pass the URL to `requests.get` and assign the `.text` of the response (the page's HTML) to `website_url`

In [2]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text
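
The one-liner above assumes the request succeeds. As a minimal sketch of a more defensive fetch, the hypothetical helper `fetch_html` below checks the HTTP status before returning the HTML. The helper name, the `User-Agent` string, and the 10-second timeout are illustrative assumptions, not part of the original lesson.

```python
import requests


def fetch_html(url, get=requests.get):
    """Return the HTML of `url`, raising an exception if the request fails.

    `get` is injectable only so the helper can be tried without a network;
    in normal use, leave it as requests.get.
    """
    response = get(
        url,
        headers={"User-Agent": "scraping-lesson/0.1 (educational example)"},  # assumed value
        timeout=10,  # assumed value; avoids hanging forever on a slow server
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Usage (requires network access):
# website_url = fetch_html('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area')
```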

Pass the HTML to `BeautifulSoup` to create a `BeautifulSoup` object, then pretty-print it

In [5]:
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Asian countries by area - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Asian_countries_by_area","wgTitle":"List of Asian countries by area","wgCurRevisionId":895418120,"wgRevisionId":895418120,"wgArticleId":47659173,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from January 2017","All articles needing additional references","Asia-related lists","Lists by area","Lists of countries by geography"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel

Find the table with the class of interest

In [7]:
table = soup.find('table',{'class':'wikitable sortable'})
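
One caveat worth knowing: when the `class` filter is a string containing several class names, Beautiful Soup compares it against the whole attribute value, so extra classes or a different order cause a miss. A CSS selector only requires that each class is present. The sketch below uses made-up HTML (not the real Wikipedia markup) to show the difference.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the class attribute has an extra class and a different order.
html = """
<table class="sortable wikitable plainrowheaders">
  <tr><th>Country</th></tr>
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a></td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# The two-class string must equal the full attribute value, so this misses.
exact = demo_soup.find("table", {"class": "wikitable sortable"})

# The CSS selector matches any <table> that has both classes.
by_selector = demo_soup.select_one("table.wikitable.sortable")

print(exact)                    # None
print(by_selector is not None)  # True
```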

Keep looking at the HTML to see if you can find any commonalities in what you want to scrape...

All the country names are links! We can use the `a` tag!

In [9]:
links = table.find_all('a')

In [10]:
links

[<a href="/wiki/Russia" title="Russia">Russia</a>,
 <a href="#cite_note-russiaTotalAreaByCIA-1">[1]</a>,
 <a href="/wiki/China" title="China">China</a>,
 <a href="/wiki/Hong_Kong" title="Hong Kong">Hong Kong</a>,
 <a href="/wiki/Macau" title="Macau">Macau</a>,
 <a href="/wiki/India" title="India">India</a>,
 <a href="#cite_note-2">[2]</a>,
 <a href="/wiki/Kazakhstan" title="Kazakhstan">Kazakhstan</a>,
 <a href="/wiki/Saudi_Arabia" title="Saudi Arabia">Saudi Arabia</a>,
 <a href="/wiki/Iran" title="Iran">Iran</a>,
 <a href="/wiki/Mongolia" title="Mongolia">Mongolia</a>,
 <a href="/wiki/Indonesia" title="Indonesia">Indonesia</a>,
 <a href="/wiki/Pakistan" title="Pakistan">Pakistan</a>,
 <a href="/wiki/Gilgit-Baltistan" title="Gilgit-Baltistan">Gilgit-Baltistan</a>,
 <a href="/wiki/Azad_Kashmir" title="Azad Kashmir">AJK</a>,
 <a href="/wiki/Turkey" title="Turkey">Turkey</a>,
 <a href="/wiki/Myanmar" title="Myanmar">Myanmar</a>,
 <a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan<

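We can now iterate over `links` and collect each link's `title` attribute in a list. Citation links such as `[1]` have no `title`, so `link.get('title')` returns `None` for them.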

In [11]:
Countries = []
for link in links:
    Countries.append(link.get('title'))
    
print(Countries)

['Russia', None, 'China', 'Hong Kong', 'Macau', 'India', None, 'Kazakhstan', 'Saudi Arabia', 'Iran', 'Mongolia', 'Indonesia', 'Pakistan', 'Gilgit-Baltistan', 'Azad Kashmir', 'Turkey', 'Myanmar', 'Afghanistan', 'Yemen', 'Thailand', 'Turkmenistan', 'Uzbekistan', 'Iraq', 'Japan', 'Vietnam', 'Malaysia', 'Oman', 'Philippines', 'Laos', 'Kyrgyzstan', 'Syria', 'Golan Heights', 'Cambodia', 'Bangladesh', 'Nepal', 'Tajikistan', 'North Korea', 'South Korea', 'Jordan', 'Azerbaijan', 'United Arab Emirates', 'Georgia (country)', 'Sri Lanka', 'Egypt', 'Bhutan', 'Taiwan', 'Armenia', 'Israel', 'Kuwait', 'East Timor', 'Qatar', 'Lebanon', 'Cyprus', 'Northern Cyprus', 'State of Palestine', 'Brunei', 'Bahrain', 'Singapore', 'Maldives']
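
The `None` entries in the printed list come from the citation links. One possible way to skip them, sketched on a small made-up table rather than the live page, is to ask `find_all` only for `a` tags that actually have a `title` attribute.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics a country link followed by a citation link.
html = """
<table class="wikitable sortable">
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a><a href="#cite_note-1">[1]</a></td></tr>
  <tr><td><a href="/wiki/China" title="China">China</a></td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")

# title=True keeps only <a> tags that have a title attribute.
countries = [link["title"] for link in demo_table.find_all("a", title=True)]
print(countries)  # ['Russia', 'China']
```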


Now, let's convert that list to a data frame

In [12]:
df = pd.DataFrame()
df['Country'] = Countries

In [13]:
df.head()

| | Country |
|---|---|
| 0 | Russia |
| 1 | None |
| 2 | China |
| 3 | Hong Kong |
| 4 | Macau |


## Less Basic - Get a whole table
Let's inspect the website to find the right tag, heading, etc. for the table we want

What are the important tags here?<br>
What class is the important one?

`table`<br>
`wikitable sortable`

**Task**<br>
Work with a partner to comment the following code and figure out what it does

In [15]:
website_url = requests.get('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year').text

soup = BeautifulSoup(website_url,'lxml')

In [18]:
tab = soup.find("table",{"class":"wikitable sortable"})

In [20]:
rows = tab.find_all('tr')


In [21]:
response = requests.get('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year').text

soup = BeautifulSoup(response,'lxml')
#print(soup.prettify())

tab = soup.find("table",{"class":"wikitable sortable"})
# pd.read_html(tab.prettify())

rows = tab.find_all('tr')

data = []
for row in rows:
    data.append([x.get_text().strip() for x in row.find_all(['th','td'])])

df = pd.DataFrame(data)

new_header = df.iloc[0]
df = df[1:]
df.columns = new_header

In [29]:
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | Year[I] | Winner(s) | Nationality | Work | Performing artist(s)[II] | Nominees | Ref. |
| 1 | 1959 | Domenico Modugno | Italy | "Volare" \* | Domenico Modugno | Paul Vance & Lee Pockriss for "Catch a Falling... | [10] |
| 2 | 1960 | Jimmy Driftwood | United States | "The Battle of New Orleans" | Johnny Horton | Sammy Cahn & Jimmy Van Heusen for "High Hopes"... | [11] |
| 3 | 1961 | Ernest Gold | United States Austria | "Theme of Exodus" | Instrumental (Various Artists) | Charles Randolph Grean, Joe Allison & Audrey A... | [12] |
| 4 | 1962 | Henry ManciniJohnny Mercer | United States | "Moon River" \* | Henry Mancini | Jimmy Dean for "Big Bad John" performed by Jim... | [13] |
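
Notice that "Henry ManciniJohnny Mercer" runs two names together. This happens when a cell holds several pieces of text separated only by markup, because `get_text()` joins them with nothing in between. The sketch below assumes the names are separated by a `<br/>` tag (a guess about the page's markup, shown on made-up HTML) and passes a `separator` to keep them apart.

```python
from bs4 import BeautifulSoup

# Made-up cell: two names separated only by a line-break tag.
html = "<table><tr><td>Henry Mancini<br/>Johnny Mercer</td></tr></table>"
cell = BeautifulSoup(html, "html.parser").find("td")

joined = cell.get_text().strip()
separated = cell.get_text(separator=", ", strip=True)

print(joined)     # Henry ManciniJohnny Mercer
print(separated)  # Henry Mancini, Johnny Mercer
```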


### But this is hard. Is there an easier way to do this?

Another way, if you **know** there is a `<table>` somewhere in the HTML: `pd.read_html` parses every table it finds on the page into a `DataFrame`

In [25]:
grammies = pd.read_html('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year')

`grammies` is a `list` of `DataFrame`s<br>
We still need to find the _correct_ one

In [26]:
len(grammies)

5
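
One way to pick the right table out of that list is to look for a column name you expect to see. The sketch below uses a hand-built stand-in for the list (made-up data, so it runs without a network), and `find_table` is a hypothetical helper name. `pd.read_html` also accepts a `match` argument, a string or regex, which limits the result to tables containing matching text.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html (made-up, shortened data).
tables = [
    pd.DataFrame({"Rank": [1, 2], "Artist": ["A", "B"]}),
    pd.DataFrame({"Year": [1959, 1960],
                  "Winner(s)": ["Domenico Modugno", "Jimmy Driftwood"]}),
]


def find_table(tables, column_name):
    """Return the first DataFrame with a column label containing column_name."""
    for table in tables:
        if any(column_name in str(col) for col in table.columns):
            return table
    return None  # nothing matched


winners = find_table(tables, "Winner")
print(winners.shape)  # (2, 2)
```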

Another way with the same concept....

In [70]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_American_Grammy_Award_winners_and_nominees').text
soup = BeautifulSoup(response)

tab = soup.find("rel",{"class":"nofollow"})
df = pd.read_html(tab.prettify())

AttributeError: 'NoneType' object has no attribute 'prettify'
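
The `AttributeError` above occurs because `soup.find` returned `None`: `rel` is an attribute, not a tag name, so no element matches. Note also that `pd.read_html` only parses `<table>` markup. The sketch below uses made-up HTML and assumes the intent was to locate links carrying `rel="nofollow"`; it also shows a simple guard before calling methods on the result of `find`.

```python
from bs4 import BeautifulSoup

# Made-up HTML with one nofollow link and one ordinary link.
html = '<p><a rel="nofollow" href="https://example.com/a">A</a> <a href="/b">B</a></p>'
demo_soup = BeautifulSoup(html, "html.parser")

# There is no <rel> tag, so find returns None.
missing = demo_soup.find("rel", {"class": "nofollow"})
print(missing)  # None

# Search for <a> tags whose rel attribute contains "nofollow" instead.
nofollow_links = demo_soup.find_all("a", rel="nofollow")
hrefs = [a["href"] for a in nofollow_links]
print(hrefs)  # ['https://example.com/a']

# Guard before using the result of find.
if missing is None:
    print("no match; check the tag name and attributes")
```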

## Now find a free-range website

Get into groups of four and try to scrape a website into a pandas `DataFrame`

In [39]:
df=pd.DataFrame

In [40]:
df.head()

TypeError: head() missing 1 required positional argument: 'self'
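
This `TypeError` appears because `pd.DataFrame` without parentheses is the class itself, not a `DataFrame` instance, so `head()` has no `self` to work on. A minimal sketch of the difference (the name `df_class` is only for illustration):

```python
import pandas as pd

df_class = pd.DataFrame  # the class itself; df_class.head() raises the TypeError
df = pd.DataFrame()      # calling the class creates an (empty) DataFrame instance

print(isinstance(df_class, type))    # True
print(isinstance(df, pd.DataFrame))  # True
print(df.head().empty)               # True
```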

In [72]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

website_url=requests.get('https://www.everybuckcounts.com/best-health-insurance-companies/').text
# soup = BeautifulSoup(website_url,'lxml')

In [73]:
website_url

'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#">\n<head>\t\n\t<!-- Meta -->\n\t<meta charset="UTF-8">\n    <meta name="google-site-verification" content="ZKvLGlxa6VdC8v1UDNbgszRxAdpNHKAojzQwRsbZB1Y" />\n    <meta name="p:domain_verify" content="d066c48339c8d59c3da95e682e40a0a1"/>\n    <meta name="msvalidate.01" content="295310B72F5C10F19C38DD3D4897F04E" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\t\n\t<!-- Link -->\n\t<link rel="profile" href="http://gmpg.org/xfn/11">\n\t<link rel="pingback" href="https://www.everybuckcounts.com/xmlrpc.php">\n  <!-- Hotjar Tracking Code for https://www.everybuckcounts.com/ -->\n\t<!-- WP Head -->\n\t<title>10 Best Health Insurance Companies of 2019 - EveryBuckCounts</title>\n\n<!-- This site is optimized with the Yoast SEO plugin v11.1.1 - https://yoast.com/wordpress/plugins/seo/ -->\n<meta name="description" content="Read on for the 10 best health insurance companies of 2019, based on criteria like af

In [74]:
soup = BeautifulSoup(website_url,'lxml')

In [82]:

for link in soup.find_all('a'):
    print(link.get('href'))

# tab = soup.find("rel",{"class":"nofollow"})
# # tab = soup.find(class:"href")
# # df = pd.read_html(tab.prettify())

https://www.everybuckcounts.com/
#
https://www.everybuckcounts.com/loan-explore/
https://www.everybuckcounts.com/debt-consolidation-loans/
https://www.everybuckcounts.com/bad-credit-loans/
https://www.everybuckcounts.com/personal-loans/
https://www.everybuckcounts.com/auto-loans/
https://www.everybuckcounts.com/debt-relief-options/
https://www.everybuckcounts.com/best-personal-loan-choices/
https://www.everybuckcounts.com/best-personal-loan-offer/
https://www.everybuckcounts.com/loan-blogs/
https://www.everybuckcounts.com/best-debt-consolidation-loan/
https://www.everybuckcounts.com/personal-loan-for-good-credit/
#
https://www.everybuckcounts.com/card-explore/
https://www.everybuckcounts.com/credit-card-fundamentals/
https://www.everybuckcounts.com/credit-score/
https://www.everybuckcounts.com/debt-payoff/
https://www.everybuckcounts.com/best-balance-transfer-credit-cards/
https://www.everybuckcounts.com/best-hotel-credit-cards/
https://www.everybuckcounts.com/top-6-best-rewards-credit
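
The printed list mixes real URLs with bare `#` placeholders. One possible cleanup, sketched on a made-up snippet rather than the live page, is to skip `a` tags without an `href`, drop in-page anchors, and resolve relative paths against the page URL with `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.everybuckcounts.com/best-health-insurance-companies/"

# Made-up HTML covering the cases we want to handle.
html = """
<a href="https://www.everybuckcounts.com/">Home</a>
<a href="#">Menu</a>
<a href="/personal-loans/">Loans</a>
<a>No href</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

urls = []
for link in demo_soup.find_all("a", href=True):  # skip <a> tags without href
    href = link["href"]
    if href.startswith("#"):                     # skip in-page anchors
        continue
    urls.append(urljoin(base_url, href))         # resolve relative paths

print(urls)
# ['https://www.everybuckcounts.com/', 'https://www.everybuckcounts.com/personal-loans/']
```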

In [78]:
tab