# 016_Web Scraping_Key

## Downloading Webpages and Parsing HTML

Web scraping allows us to gather large amounts of data from the web quickly instead of opening the data source page by page and copying and pasting it into a file.

Examples where web scraping can come in handy:
    

* To save a table from a Wikipedia page
* To get a list of reviews from a movie site
* To get a list of trending news stories 
* To get a listing of real-estate properties in a particular area
* etc

## The Requests Library

Probably the easiest way to download a web page in Python is to use the Requests library.  

The requests module doesn't come with Python, so you have to install it with the pip install command.

In [None]:
pip install requests

Once it is installed, all you have to do to use the requests module is import it:

In [None]:
import requests

## Example 1: Scraping HackerNews Front Page

Source: Broucke S., Baesens B. (2018). Practical Web Scraping for Data Science. Github page: https://github.com/Macuyiko/webscrapingfordatascience/blob/master/python-examples/hacker-news/without_api.py

This example uses requests and Beautiful Soup to scrape the Hacker News front page.

Our goal is to create a list with the link and title of each of the 30 articles on the page, as well as each article's score and number of comments.

In [None]:
url = 'https://news.ycombinator.com/news'

We will use the requests.get() function and pass our URL to it. That's what makes it so simple: we are passing the URL in the format that we are accustomed to and the requests.get() function will format a proper HTTP request.

In [None]:
r = requests.get(url)
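To see what "formatting a proper HTTP request" means without touching the network, the sketch below builds a request object and prepares it without sending it. The query parameter p=2 is only an illustrative assumption and is not part of the original example.

```python
import requests

# Build (but do not send) a GET request so we can inspect what would be sent.
# The 'p' query parameter is a made-up example value.
request = requests.Request(
    'GET',
    'https://news.ycombinator.com/news',
    params={'p': '2'},
)
prepared = request.prepare()

print(prepared.method)  # the HTTP verb
print(prepared.url)     # the URL with the query string encoded into it
```

requests.get() performs this preparation step for us and then sends the request in the same call.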

To see the contents of the page, use r.text 

In [None]:
r.text

The page that we just downloaded is written in HTML, a markup language. 

If you don't want to print out all of it, you can print a slice instead. Sometimes the beginning of the page is enough to see the tags. 

In [None]:
print(r.text[:2001])

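The raise_for_status() method is a way to ensure that you are notified if something goes wrong: it raises an exception when the server answers with an error status code (4xx or 5xx) and does nothing otherwise. You can wrap it in a try/except clause to see the exception right away.

In [None]:
try:
    r.raise_for_status()
except requests.exceptions.HTTPError as exc:
    # Raised only when the response carries a 4xx or 5xx status code
    print('There was a problem: %s' % (exc))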
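Since the real page normally downloads without errors, the except branch is rarely seen. As a rough sketch, the block below builds a response object by hand to stand in for a failed download. The URL and the 404 status are assumptions chosen for illustration; in normal use the object comes from requests.get().

```python
import requests

# Hand-built response standing in for a failed download (illustration only).
fake_response = requests.Response()
fake_response.status_code = 404
fake_response.reason = 'Not Found'
fake_response.url = 'https://example.com/missing-page'  # hypothetical URL

try:
    fake_response.raise_for_status()
    outcome = 'no error'
except requests.exceptions.HTTPError as exc:
    outcome = 'HTTP error'
    print('There was a problem:', exc)

print(outcome)
```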


If you see "200" when you run the following line, it means everything went as expected.

In [None]:
print(r.status_code)

We can also get the status in plain English from the r.reason attribute:

In [None]:
print(r.reason)
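The pairing of a numeric status code with its plain-English phrase can also be explored without a network call. The sketch below uses http.HTTPStatus from the standard library (not part of the original example) to list a few common codes; the phrase a real server sends in r.reason usually matches these, but it is the server's choice.

```python
from http import HTTPStatus

# Look up the standard reason phrase for a few common status codes
phrases = {}
for code in (200, 301, 404, 500):
    phrases[code] = HTTPStatus(code).phrase
    print(code, phrases[code])
```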

The following line prints the headers that requests sent with our HTTP request.

In [None]:
print(r.request.headers)

Now we need to learn how to extract information from our HTML string.

## BeautifulSoup Module

Once you have downloaded a web page using the requests library, you can parse it using the BeautifulSoup module.

***
>"The Beautiful Soup library was named after a Lewis Carroll poem bearing the same name from "Alice's Adventures in Wonderland." In the tale, the poem is sung by a character called the "Mock Turtle" and goes as follows: "Beautiful Soup, so rich and green,// Waiting in a hot tureen!// Who for such dainties would not stoop?// Soup of the evening, beautiful Soup!". Just like in the story, Beautiful Soup tries to organize complexity: it helps to parse, structure and organize the oftentimes very messy web by fixing bad HTML and presenting us with an easy-to-work-with Python structure."
***

First, create a BeautifulSoup object. If you already have an HTML page contained in a string (as we have), this is straightforward.

The Beautiful Soup library depends on an HTML parser to perform most of the parsing work. We'll be using 'html.parser', which ships with Python and doesn't require any additional installation.

In [None]:
from bs4 import BeautifulSoup

In [None]:
html_soup = BeautifulSoup(r.text, 'html.parser') #Creating a BeautifulSoup object from HTML

Once you have a BeautifulSoup object, you can use its methods to locate specific parts of an HTML document.

There are two main methods to locate elements within HTML: **find** and **find_all**.

* find (name, attrs, recursive, string, **keywords)

* find_all (name, attrs, recursive, string, limit, **keywords)

Note: Beautiful Soup also accepts the older "camelCaps" method names from earlier versions of the library, so instead of find_all you'll sometimes see findAll in existing code. Prefer find_all in new code.

Let's look at the arguments that find and find_all take.

>find (**name**, attrs, recursive, string, keywords);
<br> The name argument defines the tags you wish to "find" on the page. You can pass a string, or a list of tags. Leaving this argument as an empty string simply selects all elements.

>find (name, **attrs**, recursive, string, keywords);
<br>The attrs argument takes a Python dictionary of attributes and matches HTML elements that match those attributes.

>find_all (name, attrs, recursive, string, **limit**, keywords);
<br>The limit argument is only used in the find_all method and can be used to limit the number of elements that are retrieved. Note that find is functionally equivalent to calling find_all with the limit set to 1, with the exception that the former returns the retrieved element directly, and that the latter will always return a list of items, even if it just contains a single element. 
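The difference between find and find_all, and the effect of the name, attrs, and limit arguments, can be sketched on a small, made-up HTML snippet (the list items below are assumptions for illustration, not taken from any real page):

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
sample_html = """
<ul>
  <li class="fruit">apple</li>
  <li class="fruit">banana</li>
  <li class="vegetable">carrot</li>
</ul>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

first_item = soup.find('li')                            # a single tag (or None)
limited = soup.find_all('li', limit=1)                  # a list holding one tag
fruits = soup.find_all('li', attrs={'class': 'fruit'})  # filter by attribute
mixed = soup.find_all(['ul', 'li'])                     # a list of tag names

print(first_item.get_text())
print(len(limited), len(fruits), len(mixed))
```

Because find returns None when nothing matches, it is worth checking the result before calling methods on it.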

Apart from find and find_all, there are also a number of other methods for searching the HTML tree, which are very similar to find and find_all. The difference is that they will search different parts of the HTML tree:

* **find_parent** and **find_parents** work their way up the tree, looking at a tag's parents using its parents attribute. Remember that find and find_all work their way down the tree, looking at a tag's descendants.

* **find_next_sibling** and **find_next_siblings** will iterate over and match a tag's siblings using its next_siblings attribute.
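A minimal sketch of moving up and sideways in the tree, using a made-up two-row table (the markup is an assumption that only loosely imitates a news listing):

```python
from bs4 import BeautifulSoup

# Made-up HTML: a title row followed by a row holding the score
sample_html = """
<table>
  <tr class="athing"><td><a href="https://example.com">Example story</a></td></tr>
  <tr><td><span class="score">10 points</span></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

link = soup.find('a')
row = link.find_parent('tr')             # climb up from the link to its row
next_row = row.find_next_sibling('tr')   # move sideways to the following row
score = next_row.find('span', class_='score')

print(row['class'])
print(score.get_text())
```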

For a full list of available methods and parameters, the official Beautiful Soup documentation is available at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.


### Hypertext Markup Language (HTML) Refresher

HTML defines how a web page is structured and formatted. Its main building blocks are called tags.

To help us find the tags that we need, we can use browser tools. 
<br>Nearly every browser will have a tab titled Elements or HTML. 
<br>In Chrome and Firefox, you can right-click an element on the page and select Inspect or Inspect Element. 
<br>For Internet Explorer, you need to open the Developer toolbar by pressing F12. 
<br>Then you can select items by pressing Ctrl + B.



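HTML tags that have content come in pairs (an opening tag and a closing tag); others, such as line breaks and images, stand alone. Tags are enclosed in angle brackets. Here are some of the most commonly used tags:

* *h1* to *h6*: headings
* *p*: paragraph
* *a*: link (the target is stored in its href attribute)
* *img*: image
* *ul*, *ol*, *li*: unordered list, ordered list, list item
* *table*, *tr*, *th*, *td*: table, table row, header cell, data cell
* *div*, *span*: generic block and inline containers
* *br*: line break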
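To see paired and stand-alone tags in practice, the sketch below feeds a small, made-up HTML string to the HTMLParser class from the standard library's html.parser module, the same parser module that the 'html.parser' option above refers to. The TagLogger class and the sample markup are assumptions introduced for illustration only.

```python
from html.parser import HTMLParser

# Made-up HTML: h1, p and a come in pairs, br stands alone
sample_html = '<h1>Title</h1><p>Some <a href="https://example.com">link</a><br>text</p>'


class TagLogger(HTMLParser):
    """Records every opening and closing tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


logger = TagLogger()
logger.feed(sample_html)
logger.close()

print('opened:', logger.opened)
print('closed:', logger.closed)
```

The br tag shows up among the opened tags but never among the closed ones, which is what "others do not come in pairs" looks like in practice.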

## Example 1 (Continued)

As we said before, our goal is to create a list of links to each of the 30 articles mentioned on the page, record their titles, number of points and number of comments. 

We'll be using a regular expression to find the comment count, so we need to import the *re* module.

In [None]:
import requests
from bs4 import BeautifulSoup
import re

In [None]:
url = 'https://news.ycombinator.com/news'

r = requests.get(url)

html_soup = BeautifulSoup(r.text, 'html.parser')

We'll use a list to store our list of articles and other findings.

In [None]:
articles = []

In [None]:
for item in html_soup.find_all('tr', class_='athing'):
    # NOTE: class names such as 'storylink' reflect the page's markup when this
    # example was written; if item_a is None, inspect the page again.
    item_a = item.find('a', class_='storylink')
    item_link = item_a.get('href') if item_a else None
    item_text = item_a.get_text(strip=True) if item_a else None
    # The score and the comment count live in the row after the title row
    next_row = item.find_next_sibling('tr')
    item_score = next_row.find('span', class_='score')
    item_score = item_score.get_text(strip=True) if item_score else '0 points'
    # We use a regex here to find the correct element. The raw string keeps the
    # backslashes intact, and string= is the current name of the text= argument.
    item_comments = next_row.find('a', string=re.compile(r'\d+(&nbsp;|\s)comment(s?)'))
    item_comments = item_comments.get_text(strip=True).replace('\xa0', ' ') \
                        if item_comments else '0 comments'
    
    articles.append({
        'link' : item_link,
        'title' : item_text,
        'score' : item_score,
        'comments' : item_comments})
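The regular expression from the loop can be tried on its own. The sample strings below are made up to resemble link texts that might appear under a story; '\xa0' is the non-breaking space that the loop later replaces with a regular space.

```python
import re

# Same pattern as in the loop, written as a raw string
comment_pattern = re.compile(r'\d+(&nbsp;|\s)comment(s?)')

# Made-up link texts for illustration
samples = ['42\xa0comments', '1 comment', 'discuss', 'hide']

results = {}
for text in samples:
    results[text] = bool(comment_pattern.search(text))
    print(repr(text), results[text])
```

Only the texts that contain a number followed by "comment" or "comments" match, which is why stories without a matching link fall back to '0 comments' in the loop.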

In [None]:
for article in articles:
    print(article)

In [None]:
print(len(articles))

## Practice. Scraping analytics.usa.gov

Source: Pierson, L. Python for Data Science Essential Training (2019). LinkedIn Learning. https://www.linkedin.com/learning/python-for-data-science-essential-training-part-1/web-scraping-in-practice

The goal is to get a list of all web links included in the https://analytics.usa.gov/ webpage.

First, import all the necessary modules.

In [None]:
import requests
from bs4 import BeautifulSoup
import re

Next, create a BeautifulSoup object. 

In [None]:
url = 'https://analytics.usa.gov/'

r = requests.get(url)

html_soup = BeautifulSoup(r.text, 'html.parser')

print(r.text)

Get a list of all web links included in the webpage. To do that, use a loop and the find_all method to find all *a* tags and print their href attributes.

In [None]:
for link in html_soup.find_all('a'):
    print(link.get('href'))

In [None]:
for link in html_soup.find_all('a', attrs = {'href': re.compile("^http")}):
    print(link)
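The two cells above differ in what they keep: the first prints every href (including relative links and missing values), while the second keeps only tags whose href starts with "http". A small sketch on made-up links shows the difference:

```python
import re
from bs4 import BeautifulSoup

# Made-up links used only for illustration
sample_html = """
<a href="https://www.usa.gov/">absolute</a>
<a href="/data/">relative</a>
<a href="#top">in-page anchor</a>
<a>no href at all</a>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# Every a tag, whatever its href looks like (None when the attribute is missing)
all_hrefs = [a.get('href') for a in soup.find_all('a')]

# Only a tags whose href starts with "http"
absolute_hrefs = [
    a['href'] for a in soup.find_all('a', attrs={'href': re.compile(r'^http')})
]

print(all_hrefs)
print(absolute_hrefs)
```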

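Now, save the links in a text file, one per line.

In [None]:
# The with statement closes the file automatically, even if an error occurs
with open("parsed_data.txt", "w") as file:
    for link in html_soup.find_all('a', attrs = {'href': re.compile("^http")}):
        soup_link = str(link)
        print(soup_link)
        file.write(soup_link + "\n")  # the newline keeps each link on its own line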

Check the contents of the file. It is saved in the notebook's current working directory, which the %pwd command prints.

In [None]:
%pwd
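The file can also be checked from Python by reading it back. The sketch below is self-contained: it writes two made-up links to a file in a temporary directory (so that it does not overwrite the real parsed_data.txt) and then reads the lines back.

```python
import os
import tempfile

# Made-up links standing in for the scraped ones
links = ['https://www.usa.gov/', 'https://www.gsa.gov/']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'parsed_data.txt')

    # Write one link per line
    with open(path, 'w') as file:
        for link in links:
            file.write(link + '\n')

    # Read the file back and split it into lines
    with open(path) as file:
        saved_lines = file.read().splitlines()

print(saved_lines)
```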

## Example 2: List of Game of Thrones Episodes

In this example we'll be working with the Game of Thrones Wikipedia page, which has a number of tables listing the episodes with their directors, writers, air dates, and number of viewers. 

https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes

Let's try to fetch all of this data using what we have learned.

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'

In [None]:
# Use the requests module to download the page
r = requests.get(url)



In [None]:
r.status_code

In [None]:
html_contents = r.text

In [None]:
#Create a BeautifulSoup object
html_soup = BeautifulSoup(html_contents, 'html.parser')

type(html_soup)


Inspect the episode tables on the page. What tag is used to define a table on this page?

How are the episode tables distinguished from all other tables?

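For every table, we first want to retrieve the headers to use as keys in a Python dictionary. Before doing that, it helps to see what a single tag object exposes, so the next cell explores the page's first *h1* tag.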

In [None]:
# Find the first h1 tag
first_h1 = html_soup.find('h1')

print(first_h1.name)     # h1
print(first_h1.contents) # ['List of ', [...], ' episodes']

print(str(first_h1))
# Prints out: <h1 class="firstHeading" id="firstHeading" lang="en">List of
#             <i>Game of Thrones</i> episodes</h1>

print(first_h1.text)       # List of Game of Thrones episodes
print(first_h1.get_text()) # Does the same

print(first_h1.attrs)
# Prints out: {'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}

print(first_h1.attrs['id']) # firstHeading
print(first_h1['id'])       # Does the same
print(first_h1.get('id'))   # Does the same

We'll use a list to store our episode list.

In [None]:
episodes = []

Inspect the episode tables on the page. Note how they're all defined by means of a *table* tag. However, the page also contains tables we do not want to include. Some further investigation leads us to a solution: all the episode tables have "wikiepisodetable" as a class, whereas the other tables do not. You'll often have to puzzle your way through a page first before coming up with a solid approach. In many cases, you'll have to perform multiple find and find_all iterations before ending up where you want to be.

In [None]:
#Find all episode tables and store them in a variable ep_tables
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')


Now we loop through every episode table. First, we build a list of headers from the header cells of the first row. Next, we loop through all the rows (the *tr* tags), except for the first one (the header row). For each row, we loop through the *th* and *td* tags to extract the column values (the first column is wrapped inside of a *th* tag, the others in *td* tags, which is why we need to handle both). At the end of each row, we're ready to add a new entry to the "episodes" variable. To store each entry, we use a normal Python dictionary (episode_dict). 


In [None]:
for table in ep_tables:
    rows = table.find_all('tr')
    # Start by fetching the header cells from the first row to determine
    # the field names (strip=True removes surrounding whitespace and newlines)
    headers = [header.get_text(strip=True) for header in rows[0].find_all('th')]
    # Then go through all the rows except the first one
    for row in rows[1:]:
        # And get the column cells, the first one being inside a th-tag
        values = [col.get_text(strip=True) for col in row.find_all(['th', 'td'])]
        if values:
            # zip pairs each header with its value and stops at the shorter
            # list, so a row with extra cells cannot raise an IndexError
            episode_dict = dict(zip(headers, values))
            episodes.append(episode_dict)

# Show the results
for episode in episodes:
    print(episode)
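The header-and-row logic can be tried on a tiny table first. The sketch below uses a made-up table that only imitates the structure described above (a header row of *th* cells, then rows whose first cell is a *th*); the titles and directors are placeholders, not real episode data.

```python
from bs4 import BeautifulSoup

# Made-up table imitating the structure of an episode table
sample_html = """
<table class="wikiepisodetable">
  <tr><th>No.</th><th>Title</th><th>Directed by</th></tr>
  <tr><th>1</th><td>First made-up title</td><td>Director A</td></tr>
  <tr><th>2</th><td>Second made-up title</td><td>Director B</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

rows_as_dicts = []
for table in soup.find_all('table', class_='wikiepisodetable'):
    rows = table.find_all('tr')
    # Header cells of the first row become the dictionary keys
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all('th')]
    for row in rows[1:]:
        # Each remaining row contributes one dictionary
        values = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        rows_as_dicts.append(dict(zip(headers, values)))

for entry in rows_as_dicts:
    print(entry)
```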

## Practice. Scrape MPG to Compare Cars

The task is to find the MPG for 4 cars (Ford Escape 2019, Honda CRV 2019, Hyundai Santa Fe 2019, and Toyota Rav4 2019) on the Kelley Blue Book site and store it in a file.

In [None]:
list_of_links = [
    "https://www.kbb.com/ford/escape/2019/s/?vehicleid=439781&intent=buy-new",
    "https://www.kbb.com/honda/cr-v/2019/lx/?vehicleid=439688&intent=buy-new",
    "https://www.kbb.com/hyundai/santa-fe/2019/24-se/?vehicleid=436474&intent=buy-new",
    "https://www.kbb.com/toyota/rav4/2019/le/?vehicleid=440064&intent=buy-new",
]

What tag is used to define the MPG?

Create an empty list and fill it with the data fetched from the website. Use requests, BeautifulSoup, and a for loop.

In [None]:
contents = []
for i in list_of_links:
    r = requests.get(i)
    html_contents = r.text
    html_soup = BeautifulSoup(html_contents, 'html.parser')
    fuel_economy = html_soup.find('p', {"class": "mpg-value"})
    #print(fuel_economy)
    contents.append(fuel_economy)
print(contents)

In [None]:
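contents1 = []
for tag in contents:
    # get_text() returns just the tag's text, whereas str(tag.contents) would
    # keep the list brackets and quotes. find() returns None when nothing matched.
    mpg = tag.get_text(strip=True) if tag is not None else 'not found'
    contents1.append(mpg)
    
print(contents1)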

In [None]:
output_list = []
car_list = ['Ford Escape 2019: ', 'Honda CRV 2019: ', 'Hyundai Santa Fe 2019: ', 'Toyota Rav4 2019: ']
# zip pairs each car label with the MPG value at the same position
for car, mpg in zip(car_list, contents1):
    output_list.append(car + mpg)
output_list

Next, save your results in a file.

In [None]:
import csv

In [None]:
with open('output.csv', 'w', newline='') as output_file:
    wr = csv.writer(output_file)
    for item in output_list:
        wr.writerow([item])
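One possible variation, sketched below, writes the car name and the MPG value as separate columns with a header row, which makes the file easier to load later. The car names and values here are hypothetical placeholders, not scraped results, and the sketch writes to an in-memory buffer instead of output.csv so that it can be run without touching any files.

```python
import csv
import io

# Hypothetical values for illustration only, not scraped results
mpg_by_car = {'Car A': '25', 'Car B': '30'}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['car', 'mpg'])       # header row
for car, mpg in mpg_by_car.items():
    writer.writerow([car, mpg])       # one row per car, two columns

# Read the data back to confirm what was written
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)
```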