![b4s](img/beautiful_soup.png)

## [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup)

### Learning goals:

- scrape a basic Wikipedia page using Beautiful Soup
- transform the HTML table we want into a pandas `DataFrame`
- scrape a more complex Wikipedia page
- transform the scraped data we want into a pandas `DataFrame`
- if time allows, go hunt a wild website and scrape it
- **Discuss the ethics of webscraping!!**

## Outline

* HTML Crash Course
* CSS Crashier Course
* STOP! Ethics Time!
* Beautiful Soup Documentation
* Scraping a Page Together
* Scraping a Page Apart

## Benefits of *not* scraping
![options](img/other_options.png)

### Use case

![use](img/use_case.png)

### Goal

![python](img/how_works.png)

#### Discuss
What's a website you'd like to scrape?

## HTML Crash Course

* HTML is the language of the web
* Skeleton of all websites
* Tags -> Areas and sections of the website
* The browser renders the rest
* All you need to know is [here](https://www.w3schools.com/html/default.asp).

First declare you are going to make some HTML.

```
<html>

</html>
```

Add in the main sections you need.

```
<html>
    <head>
    </head>
    
    <body>
    </body>


</html>
```

Save this to a file and open it in a browser.
This is a good moment for a live demonstration.

```
<html>
    <head>
        <title>My Website</title>
    </head>
    
    <body>
    </body>


</html>
```

```
<html>
    <head>
        <title>My Website</title>
    </head>
    
    <body>
        <h1>A Big Title</h1>
        <p>Some paragraph</p>
    </body>


</html>
```

```
<html>
    <head>
        <title>My Website</title>
    </head>
    
    <body>
        <h1>A Big Title</h1>
        <p>Some paragraph</p>
        <img src="img/beautiful_soup.png"/>
    </body>


</html>
```

## CSS Crash Course

* CSS stands for Cascading Style Sheets.
* All you need to know is [here](https://www.w3schools.com/css/).
* Essentially it's what you use to make pages look nicer.
* It can live in the HTML file if you are a psychopath, but it mostly lives in a separate file.

Let's add this file, `style.css`, to our "website".

```
body {
  background-color: lightblue;
}

h1 {
  color: red;
  text-align: center;
}

p {
  font-family: verdana;
  font-size: 20px;
}
```

Add it like so...

```
<html>
    <head>
        <title>My Website</title>
        <link rel="stylesheet" type="text/css" href="style.css" />
    </head>
    
    <body>
        <h1>A Big Title</h1>
        <p>Some paragraph</p>
        <img src="img/beautiful_soup.png"/>
    </body>


</html>
```

### Getting Trickier

Most websites are a bit more complex than the one we just coded together. Beyond HTML and CSS, they often have interactive elements, ways to store data from users, and pretty much everything else.

In order to get an idea of what a website actually looks like, let's use the ```Inspect Element``` option that we have access to with pretty much every web browser. Let's start our exploration with the website that we are eventually going to work with. 

### Scenario

I want to analyze the top song award of the Grammys to see if I can find any patterns in country of origin, singer, song content, etc.

But where do I start finding that data? Not from an API.

What we are interested in is located [here](https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year)

For demonstration purposes, I will also be using this tool to help make things a bit clearer.

Also helpful is the [CSS Selector Gadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) for Chrome.


## Ethics

**BUT FIRST, IS WHAT I AM ABOUT TO DO ETHICAL? WHAT DOES THAT MEAN? WHAT IS ETHICAL DATA SCRAPING? WHAT IS UNETHICAL DATA SCRAPING?**

Topics to consider

* Who owns the data you are scraping?
* Who owns the website that you are scraping data from?
* If you take something that you don't own, are you stealing?
* Why might a business or website make their data available?
* Are there ways you might be too greedy in taking your data?
* What if you use data with intent to harm others? 
* Is that even possible? 
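
One concrete way to avoid being "too greedy" is to check a site's `robots.txt` and pause between requests. The sketch below is only an illustration: the `robots.txt` content is made up and parsed locally so it runs offline, `example.com` is a placeholder, and `allowed` is a hypothetical helper name. It uses the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not from any real site).
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the rules without any network access


def allowed(url, user_agent="*"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


public_ok = allowed("https://example.com/wiki/Some_page")
private_ok = allowed("https://example.com/private/data.html")

# Seconds the site asks us to wait between requests; fall back to 1 if unset.
delay = parser.crawl_delay("*") or 1

print(public_ok, private_ok, delay)
```

In a real scraper you would call `time.sleep(delay)` between requests. Following `robots.txt` is a courtesy and does not settle the ethical questions above on its own.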


### This is our target
![target](img/target.png)

## Beautiful Soup 

Your go-to tool for scraping the web with Python is a package called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). It's one of the most used libraries, and its website is an excellent example of effective documentation. The examples below come from Beautiful Soup's documentation and link our HTML/CSS crash course to how Python thinks about web scraping under the hood.
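
As a small bridge from the crash course (this one is not taken from the documentation), here is a sketch that parses the toy page we wrote earlier. It assumes the built-in `html.parser` backend so that nothing beyond `bs4` needs to be installed; the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# The toy page from the HTML crash course, as a string.
html = """
<html>
    <head>
        <title>My Website</title>
    </head>
    <body>
        <h1>A Big Title</h1>
        <p>Some paragraph</p>
        <img src="img/beautiful_soup.png"/>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string        # text inside <title>
heading = soup.h1.get_text()          # text inside the first <h1>
paragraph = soup.find("p").get_text() # text inside the first <p>
image_src = soup.find("img")["src"]   # value of the src attribute

print(page_title, heading, paragraph, image_src, sep="\n")
```

Each tag from the crash course becomes an object you can navigate to by name, and attributes such as `src` are read like dictionary keys.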

## Basic Wikipedia

![vheck](img/basic.gif)

Task: Get one column from a table on Wikipedia

Let's get those libraries we want

In [17]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Pass the URL to `requests.get` and assign the returned HTML text to `website_url`.

First, a Wikipedia article where we only want to get one column of information: countries!

https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area

In [18]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text
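
A slightly more defensive way to do the same fetch, as a sketch: `fetch_html` is a hypothetical helper name (not part of `requests`), and the 10-second timeout is an arbitrary assumption. It only relies on `requests.get(..., timeout=...)` and `Response.raise_for_status()`.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of a page, raising an exception if the request fails."""
    # A timeout keeps the call from hanging forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into exceptions instead of silently parsing an error page.
    response.raise_for_status()
    return response.text


# Usage (needs network access, so it is left commented out here):
# website_url = fetch_html('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area')
```

Failing loudly here is a design choice: it is easier to debug a raised exception than an empty table further down.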

In [19]:
website_url

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of Asian countries by area - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Asian_countries_by_area","wgTitle":"List of Asian countries by area","wgCurRevisionId":922550463,"wgRevisionId":922550463,"wgArticleId":47659173,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from January 2017","All articles needing additional references","Asia-related lists","Lists by area","Lists of countries by geography"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","J

Pass the HTML text to `BeautifulSoup` to create a `BeautifulSoup` object

In [24]:
soup = BeautifulSoup(website_url,'lxml') # Use lxml parser

#print(soup.prettify())

Find the class of interest

In [25]:
table = soup.find('table',{'class':'wikitable sortable'})
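
One thing worth knowing about searching by class: passing the full string `'wikitable sortable'` matches that exact class string, so the order of the class names matters. The sketch below compares it with two alternatives on a made-up two-table snippet (the HTML is an assumption for illustration and uses `html.parser` so it runs without `lxml`).

```python
from bs4 import BeautifulSoup

# Two hypothetical tables with the same classes in a different order.
html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="sortable wikitable"><tr><td>second</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact class string: only the table whose class attribute reads "wikitable sortable".
exact = soup.find_all("table", {"class": "wikitable sortable"})

# A single class name: any table that has "wikitable" among its classes.
by_class = soup.find_all("table", class_="wikitable")

# CSS selector: any table that has both classes, in any order.
by_css = soup.select("table.wikitable.sortable")

print(len(exact), len(by_class), len(by_css))
```

If `soup.find` unexpectedly returns `None` on a real page, the class string is a good first thing to check in Inspect Element.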

Keep looking at the html to see if you can find any commonalities in what you want to scrape....

All the country names are links! We can use the `a` tag!

In [26]:
links = table.find_all('a')

We can now iterate over `links` and build a list of the link titles

In [27]:
Countries = []
for link in links:
    Countries.append(link.get('title'))
    
print(Countries)

['Russia', None, 'China', 'Hong Kong', 'Macau', 'India', None, 'Kazakhstan', 'Saudi Arabia', 'Iran', 'Mongolia', 'Indonesia', 'Pakistan', 'Turkey', 'Myanmar', 'Afghanistan', 'Yemen', 'Thailand', 'Turkmenistan', 'Uzbekistan', 'Iraq', 'Japan', 'Vietnam', 'Malaysia', 'Oman', 'Philippines', 'Laos', 'Kyrgyzstan', 'Syria', 'Golan Heights', 'Cambodia', 'Bangladesh', 'Nepal', 'Tajikistan', 'North Korea', 'South Korea', 'Jordan', 'Azerbaijan', 'United Arab Emirates', 'Georgia (country)', 'Sri Lanka', 'Egypt', 'Bhutan', 'Taiwan', 'Armenia', 'Kuwait', 'East Timor', 'Qatar', 'Lebanon', 'Israel', 'State of Palestine', 'Brunei', 'Singapore', 'Bahrain', 'Maldives']
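
The `None` entries appear because `link.get('title')` returns `None` for any `a` tag that has no `title` attribute. A minimal sketch of filtering them out, using a made-up table snippet (an assumption, not the real Wikipedia markup):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link has no title attribute.
html = """
<table class="wikitable sortable">
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a><a href="#note">[1]</a></td></tr>
  <tr><td><a href="/wiki/China" title="China">China</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

# Keep a title only when the attribute exists.
countries = [link.get("title") for link in table.find_all("a") if link.get("title")]

# Equivalent idea: ask Beautiful Soup for links that have a title at all.
titled_links = table.find_all("a", title=True)

print(countries)
```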


Now, let's convert that list to a data frame

In [28]:
df = pd.DataFrame()
df['Country'] = Countries

In [29]:
df.head()

| | Country |
|---|---|
| 0 | Russia |
| 1 | |
| 2 | China |
| 3 | Hong Kong |
| 4 | Macau |
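
Row 1 is empty because the list contained a `None`. One possible cleanup, sketched on the first five values from the list above, is to drop missing rows with `dropna` and renumber the index:

```python
import pandas as pd

# First five entries of the scraped list, including the missing one.
countries = ["Russia", None, "China", "Hong Kong", "Macau"]

df = pd.DataFrame({"Country": countries})

# Drop rows where Country is missing, then rebuild a clean 0..n-1 index.
clean = df.dropna(subset=["Country"]).reset_index(drop=True)

print(clean)
```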


## Less Basic - Get a whole table
Let's go inspect the website to find the right tag/heading/etc for the table we want

What are the important tags here?<br>
What class is the important one?

`table`<br>
`wikitable sortable`

**Task**<br>
Work with a partner to comment the following code and figure out what it does

In [30]:
response = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text

soup = BeautifulSoup(response,'lxml')
#print(soup.prettify())

tab = soup.find("table",{"class":"wikitable sortable"})
# pd.read_html(tab.prettify())

rows = tab.find_all('tr')

data = []
for row in rows:
    data.append([x.get_text().strip() for x in row.find_all(['th','td'])])

df = pd.DataFrame(data)

new_header = df.iloc[0]
df = df[1:]
df.columns = new_header

df

| | Rank | Country | Area (km²) | Notes |
|---|---|---|---|---|
| 1 | 1.0 | Russia* | 13100000.0 | 17,098,242 including European part[1] |
| 2 | 2.0 | China | 9596961.0 | excludes Hong Kong, Macau, Taiwan and disputed... |
| 3 | 3.0 | India[2] | 3287263.0 | including Punjab , Jammu and Kashmir part |
| 4 | 4.0 | Kazakhstan* | 2455034.0 | 2,724,902 km² including European part |
| 5 | 5.0 | Saudi Arabia | 2149690.0 | |
| 6 | 6.0 | Iran | 1648195.0 | |
| 7 | 7.0 | Mongolia | 1564110.0 | |
| 8 | 8.0 | Indonesia* | 1472639.0 | 1,904,569 km² including Oceanian part |
| 9 | 9.0 | Pakistan | 881913.0 | |
| 10 | 10.0 | Turkey* | 747272.0 | 783,562 km² including European part |
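
Once you have worked through the exercise, here is the same rows-to-`DataFrame` pattern on a tiny made-up table, as a sketch you can run offline. The HTML and its comma-formatted numbers are assumptions for illustration. It builds the header directly from the first row and converts the text columns to numbers, since `get_text()` always returns strings.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of a "wikitable sortable" table.
html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th><th>Area (km²)</th></tr>
  <tr><td>1</td><td>Russia*</td><td>13,100,000</td></tr>
  <tr><td>2</td><td>China</td><td>9,596,961</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
tab = soup.find("table", {"class": "wikitable sortable"})

# One list of cell texts per table row; header cells are <th>, data cells are <td>.
data = [
    [cell.get_text().strip() for cell in row.find_all(["th", "td"])]
    for row in tab.find_all("tr")
]

# The first row holds the column names, the rest is the data.
df = pd.DataFrame(data[1:], columns=data[0])

# Everything scraped is text, so convert the numeric columns explicitly.
df["Rank"] = df["Rank"].astype(int)
df["Area (km²)"] = df["Area (km²)"].str.replace(",", "", regex=False).astype(float)

print(df)
```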


### But this is hard. Is there an easier way to do this?

Another way, if you **know** there is a `table` in the `html` somewhere

In [32]:
grammies = pd.read_html('https://en.wikipedia.org/wiki/Grammy_Award_for_Song_of_the_Year')

[  Grammy Award for Song of the Year  \
 0                       Awarded for   
 1                           Country   
 2                      Presented by   
 3                     First awarded   
 4                 Currently held by   
 5                           Website   
 
                  Grammy Award for Song of the Year.1  
 0     Quality song containing both lyrics and melody  
 1                                      United States  
 2    National Academy of Recording Arts and Sciences  
 3                                               1959  
 4  Donald Glover, Ludwig Göransson & Jeffery Lama...  
 5                                         grammy.com  ,
     Year[I]                                          Winner(s)  \
 0      1959                                   Domenico Modugno   
 1      1960                                    Jimmy Driftwood   
 2      1961                                        Ernest Gold   
 3      1962                         Henry ManciniJohnny 

`grammies` returns a `list` of `DataFrames`<br>
We still need to find the _correct_ one
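
Two possible ways to pick the right table without eyeballing every index, sketched on a made-up two-table page so it runs offline: filter the list by column name, or pass `match=` to `pd.read_html` so only tables containing that text are returned. The HTML below is an assumption, and the sketch presumes a parser backend for `read_html` (such as `lxml`) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with an info box table and the table we actually want.
html = """
<table>
  <tr><td>Awarded for</td><td>Quality song</td></tr>
  <tr><td>Country</td><td>United States</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Winner(s)</th></tr>
  <tr><td>1959</td><td>Domenico Modugno</td></tr>
  <tr><td>1960</td><td>Jimmy Driftwood</td></tr>
</table>
"""

# Option 1: read every table, then keep the one with the column we expect.
tables = pd.read_html(StringIO(html))
winners = next(t for t in tables if "Winner(s)" in t.columns)

# Option 2: let read_html return only tables whose text matches a string or regex.
matched = pd.read_html(StringIO(html), match="Winner")

print(len(tables), len(matched))
print(winners)
```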

In [34]:
len(grammies)

grammies[0]
grammies[1]

| | Year[I] | Winner(s) | Nationality | Work | Performing artist(s)[II] | Nominees | Ref. |
|---|---|---|---|---|---|---|---|
| 0 | 1959 | Domenico Modugno | Italy | "Volare" * | Domenico Modugno | Paul Vance & Lee Pockriss for "Catch a Falling... | [10] |
| 1 | 1960 | Jimmy Driftwood | United States | "The Battle of New Orleans" | Johnny Horton | Sammy Cahn & Jimmy Van Heusen for "High Hopes"... | [11] |
| 2 | 1961 | Ernest Gold | United States Austria | "Theme of Exodus" | Instrumental (Various Artists) | Charles Randolph Grean, Joe Allison & Audrey A... | [12] |
| 3 | 1962 | Henry ManciniJohnny Mercer | United States | "Moon River" * | Henry Mancini | Jimmy Dean for "Big Bad John" performed by Jim... | [13] |
| 4 | 1963 | Leslie BricusseAnthony Newley | United Kingdom | "What Kind of Fool Am I?" | Sammy Davis Jr. | Lionel Bart for "As Long as He Needs Me" perfo... | [14] |
| 5 | 1964 | Henry ManciniJohnny Mercer | United States | "Days of Wine and Roses" * | Henry Mancini | Sammy Cahn & Jimmy Van Heusen for "Call Me Irr... | [15] |
| 6 | 1965 | Jerry Herman | United States | "Hello, Dolly!" | Louis Armstrong | John Lennon & Paul McCartney for "A Hard Day's... | [16] |
| 7 | 1966 | Paul Francis WebsterJohnny Mandel | United States | "The Shadow of Your Smile" | Tony Bennett | Michel Legrand, Norman Gimbel & Jacques Demy f... | |
| 8 | 1967 | John LennonPaul McCartney | United Kingdom | "Michelle" | The Beatles | John Barry & Don Black for "Born Free" perform... | |
| 9 | 1968 | Jimmy Webb | United States | "Up, Up, and Away" * | The 5th Dimension | Jimmy Webb for "By the Time I Get to Phoenix" ... | [17] |


Another way with the same concept....

In [50]:
# response = requests.get('https://en.wikipedia.org/wiki/List_of_American_Grammy_Award_winners_and_nominees').text

# response

# soup = BeautifulSoup(response, 'lxml')

# soup.find_all("table",attrs = {"class":"wikitable sortable"})

# # tab = soup.find_all("table", {"class":"wikitable sortable"})

# # tab

# # df = pd.read_html(tab.prettify())

# # df

[]

## Now find a free-range website

Get in groups of four and try to scrape a website into a pandas `DataFrame`.