# Introduction to Web Scraping

**GOALS**: 

- Introduce the structure of a webpage
- Use requests to get website data
- Use Beautiful Soup to parse a basic HTML page

## What is a website

Behind every website is HTML code.  This HTML code is accessible to you on the internet.  If we navigate to a website that contains 50 interesting facts about Kanye West (http://www.boomsbeat.com/articles/2192/20140403/50-interesting-facts-about-kanye-west-had-a-near-death-experience-in-2002-his-stylist-went-to-yale.htm), we can see the HTML behind it by viewing the page source.

I'm using a Mac and browsing with Chrome.  To get the source code, I hold `control` and click on the page, then choose the page source option from the menu.  Other browsers are similar.  The result is a new tab containing the HTML code.  The page and its source are shown below.

<img src="images/kanye_web.png" style="float: left; width: 45%; margin-right: 1%; margin-bottom: 0.5em;"> <img src="images/kanye_html.png" style="float: right; width: 45%; margin-right: 1%; margin-bottom: 0.5em;">



## HTML Tags

Tags are used to identify different objects on a website, and every tag has the same structure.  For example, to write a paragraph on a webpage we would use the paragraph tags and put our text between the tags, as shown below.

```html
<p>
This is where my text would go.
</p>
```

Here, the `<p>` starts the paragraph and the `</p>` ends the paragraph.  Tags can be embedded within other tags.  If we wanted to make a word bold and insert an image within the paragraph, we could write the following HTML code.

```html
<p>
This is a <strong>heavy</strong> paragraph.  Here's a heavy picture.
<img src="images/heavy_pic.jpg">
</p>
```
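To see this nesting from Python, here is a minimal sketch that uses the standard library's `html.parser` module (not the libraries introduced later in this notebook). The `TagPrinter` class and its `events` list are names made up for this illustration. It records each tag as the parser meets it and indents by how deeply the tag is nested.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Record each start and end tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append((self.depth, f"<{tag}>"))
        # img has no closing tag, so it does not open a new nesting level
        if tag != "img":
            self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        self.events.append((self.depth, f"</{tag}>"))


html = '<p>This is a <strong>heavy</strong> paragraph. <img src="images/heavy_pic.jpg"></p>'

parser = TagPrinter()
parser.feed(html)
parser.close()

# Indent each tag by its depth to make the nesting visible
for depth, tag in parser.events:
    print("  " * depth + tag)
```

The `<strong>` and `<img>` tags print one level deeper than `<p>`, which reflects that they sit inside the paragraph.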

Tags may also be given attributes, which can be used, for instance, to apply a style with CSS.  For example, the paragraph holding the first fact about Kanye has a `dir` attribute whose value is `ltr`.  This differentiates it from the opening paragraph, which has no attributes.

```html
<div class="caption">Source: Flickr</div>
</div>
<p>Kanye West is a Grammy-winning rapper who is currently engaged to Kim Kardashian and he is well known for his outrageous statements and for his broad musical palette.</p>
<ol>
<li dir="ltr">
<p dir="ltr">Kanye Omari West was born June 8, 1977 in Atlanta.</p>
```

We can use Python to pull the HTML of a webpage into a Jupyter notebook, and then use libraries with functions that know how to read HTML.  We will use the attributes to further fine tune parsing the pieces of interest on the webpage.  
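As a small preview of how attributes look from Python, the sketch below parses a trimmed copy of the snippet above with Beautiful Soup (introduced properly later on). The shortened opening sentence and the variable names `intro` and `fact` are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the snippet above, closed off so it is a complete fragment
html = """
<p>Kanye West is a Grammy-winning rapper.</p>
<ol>
<li dir="ltr">
<p dir="ltr">Kanye Omari West was born June 8, 1977 in Atlanta.</p>
</li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
intro, fact = soup.find_all("p")

print(intro.attrs)       # attributes come back as a dictionary (empty here)
print(fact.attrs)        # the fact paragraph carries the dir attribute
print(fact.get("dir"))   # look up a single attribute value
print(intro.get("dir"))  # .get() returns None when the attribute is missing
```

The attribute is what separates the two paragraphs, which is why it is useful for narrowing a search.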

## Getting the HTML with Requests

The requests library can be used to fetch the HTML content of our website.  We will assign the response to a variable `k`.  We can then peek at it by printing the first 400 characters of the response text.

In [4]:
import requests
k = requests.get('http://www.boomsbeat.com/articles/2192/20140403/50-interesting-facts-about-kanye-west-had-a-near-death-experience-in-2002-his-stylist-went-to-yale.htm')

In [5]:
print(k.text[:400])

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>50 interesting facts about Kanye West: Had a near death-experience in 2002, his stylist went to Yale : People : BOOMSbeat</title>
<meta content="width=device-width" name="viewport">

<meta name="Keywords" content="Kanye West, Kanye West facts, Kanye West net worth, Kanye West full name" />
<meta name="Description" content="Kanye West is a


As we wanted, we have all the HTML content that we saw in our source view.
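A request can also fail or hang, so it may be worth checking the response before parsing it. The following is a minimal sketch, not part of the original notebook: `fetch_html` is a hypothetical helper name and the 10 second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text at `url`, raising an error if the request fails."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Usage (needs network access, so it is left commented out here):
# html = fetch_html(url)  # url is the article address used above
# print(html[:400])
```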

### Parsing HTML with Beautiful Soup

Now, we will use the Beautiful Soup library to parse the HTML.  Beautiful Soup knows how to read HTML and has many functions we can use to pull out specific pieces of interest.  To begin, we pass the text of our response to `BeautifulSoup`, creating an object named `soup`.

In [6]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(k.text, 'html.parser')

Now, let us take a look at the source again and locate the structure surrounding the interesting facts.  By searching on the source page for the first fact, I find the following.

   ![](images/kanye_sourced.png)

Here, it's important to notice that the facts lie inside `<p>` paragraph tags.  These tags also have the attribute `dir="ltr"`.  We can use Beautiful Soup to locate all of these instances.  If we are correct, we should have 50 interesting facts.

In [7]:
facts = soup.find_all('p', attrs={'dir':'ltr'})

In [8]:
len(facts)

50

In [9]:
facts[0]

<p dir="ltr">Kanye Omari West was born June 8, 1977 in Atlanta.</p>

In [10]:
facts[0].text

'Kanye Omari West was born June 8, 1977 in Atlanta.'
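Beautiful Soup offers more than one way to express the same attribute filter. The sketch below runs on a small made-up HTML string (an assumption, so it works without a network connection) and compares the `attrs` dictionary used above with a keyword argument and a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up HTML with the same shape as the facts page
html = """
<p>An opening paragraph with no attributes.</p>
<ol>
<li dir="ltr"><p dir="ltr">First fact.</p></li>
<li dir="ltr"><p dir="ltr">Second fact.</p></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("p", attrs={"dir": "ltr"})  # dictionary of attributes
by_keyword = soup.find_all("p", dir="ltr")           # attribute as keyword argument
by_css = soup.select('p[dir="ltr"]')                 # CSS attribute selector

for result in (by_attrs, by_keyword, by_css):
    print([p.get_text(strip=True) for p in result])
```

All three calls skip the opening paragraph and return only the two paragraphs that carry `dir="ltr"`.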

### Creating a Table of Facts

Now, we can create a table that contains each interesting fact.  To do so, we will start with an empty list and append each interesting fact using our above syntax and a for loop.

In [13]:
table = []
for i in facts:
    fact = i.text
    table.append(fact)

In [14]:
len(table)

50

In [20]:
table[:5]

['Kanye Omari West was born June 8, 1977 in Atlanta.',
 'His father Ray West was a black panther in the 60s and 70s and he later became one of the first black photojournalists at the Atlanta-Journal Constitution and later became a Christian counselor. His mother Donda was English professor at Clark Atlanta University. He later moved to Chicago at the age of three when his parents divorced.',
 'The name Kanye means "the only one" in Swahilli.',
 'Kanye lived in China for more than a year with his mother when he was in fifth grade. His mother was a visiting professor there at the time and he joined her.',
 'Kanye attended Chicago State University/Columbia College in Chicago.  He dropped out to pursue music which is why he named his 2004 debut album, "The College Dropout."']
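The same list can be built in one line with a list comprehension. This is a sketch on a small made-up HTML string rather than the real page, and `get_text(strip=True)` is used here as an optional extra that trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the facts page
html = '<ol><li><p dir="ltr"> First fact. </p></li><li><p dir="ltr">Second fact.</p></li></ol>'
facts = BeautifulSoup(html, "html.parser").find_all("p", attrs={"dir": "ltr"})

# One text entry per paragraph, with leading and trailing whitespace removed
table = [fact.get_text(strip=True) for fact in facts]
print(table)
```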

### Pandas and DataFrames

The standard library for data analysis in Python is pandas.  Its typical row-and-column data structure is called a DataFrame.  We can convert our table data to a DataFrame as follows.

In [21]:
import pandas as pd
df = pd.DataFrame(table, columns=['Interesting Facts'])

We can use the `head()` method to examine the first 5 rows of our new DataFrame.

In [17]:
df.head()

|   | Interesting Facts |
|---|---|
| 0 | Kanye Omari West was born June 8, 1977 in Atla... |
| 1 | His father Ray West was a black panther in the... |
| 2 | The name Kanye means "the only one" in Swahilli. |
| 3 | Kanye lived in China for more than a year with... |
| 4 | Kanye attended Chicago State University/Columb... |
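The `...` in the display is only truncation for printing; the full text is still stored. The sketch below rebuilds a two-row DataFrame from facts quoted earlier (a shortened stand-in for the full table) and shows a few ways to inspect it.

```python
import pandas as pd

# Two of the facts from above, standing in for the full list
table = [
    "Kanye Omari West was born June 8, 1977 in Atlanta.",
    'The name Kanye means "the only one" in Swahilli.',
]
df = pd.DataFrame(table, columns=["Interesting Facts"])

print(df.shape)                           # (number of rows, number of columns)
print(df["Interesting Facts"].iloc[0])    # full text of the first row
print(df["Interesting Facts"].str.len())  # character count of each fact
```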


#### Save our Data

Now, we can save the DataFrame as a comma-separated values (CSV) file on our computer.  We can read it back in at any time with the `read_csv` function, as shown below.

In [22]:
df.to_csv('kanye_facts.csv', index=False, encoding='utf-8')

In [23]:
df = pd.read_csv('kanye_facts.csv', encoding='utf-8')

In [24]:
df.head(7)

|   | Interesting Facts |
|---|---|
| 0 | Kanye Omari West was born June 8, 1977 in Atla... |
| 1 | His father Ray West was a black panther in the... |
| 2 | The name Kanye means "the only one" in Swahilli. |
| 3 | Kanye lived in China for more than a year with... |
| 4 | Kanye attended Chicago State University/Columb... |
| 5 | Kanye's struggle to transition from producer t... |
| 6 | At the start of his music career, Kanye appare... |
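The `index=False` argument matters on the round trip. The sketch below writes to in-memory buffers instead of a file (an assumption made so nothing is left on disk) and compares the columns that come back with and without the index.

```python
import io

import pandas as pd

df = pd.DataFrame(
    [
        "Kanye Omari West was born June 8, 1977 in Atlanta.",
        'The name Kanye means "the only one" in Swahilli.',
    ],
    columns=["Interesting Facts"],
)

# index=False: only the data column is written
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Default index=True: the row labels are written as an unnamed first column
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(with_index.columns))
print(restored.equals(df))
```

Without `index=False`, the row labels come back as an extra column that pandas names `Unnamed: 0`. Commas and quotation marks inside a fact are quoted by `to_csv`, so the text survives the round trip.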
