# Part 1 of 2: Processing an HTML file

One of the richest sources of information is [the Web](http://www.computerhistory.org/revolution/networking/19/314)! In this notebook, we ask you to use string processing and regular expressions to mine a web page, which is stored in HTML format.

**The data: Yelp! reviews.** The data you will work with is a snapshot of a recent search on the [Yelp! site](https://yelp.com) for the best fried chicken restaurants in Atlanta. That snapshot is hosted here: https://cse6040.gatech.edu/datasets/yelp-example

If you go ahead and open that site, you'll see that it contains a ranked list of places:

![Top 10 Fried Chicken Spots in ATL as of September 12, 2017](https://cse6040.gatech.edu/datasets/yelp-example/ranked-list-snapshot.png)

**Your task.** In this part of the assignment, we'd like you to write some code to extract this list.

## Getting the data

First things first: you need an HTML file. The following Python code will download a particular web page that we've prepared for this exercise and store it locally in a file.

> If the file already exists locally, this code will not download it again, which reduces load on the server that hosts the file. Also, if an error occurs during the download, this cell may report that the downloaded file is corrupt; in that case, delete the local `yelp.htm` and re-run the cell, since an existing file is never overwritten.

In [3]:
import requests
import os
import hashlib

if os.path.exists('.voc'):
    data_url = 'https://cse6040.gatech.edu/datasets/yelp-example/yelp.htm'
else:
    data_url = 'https://github.com/cse6040/labs-fa17/raw/master/datasets/yelp.htm'

if not os.path.exists('yelp.htm'):
    print("Downloading: {} ...".format(data_url))
    r = requests.get(data_url)
    with open('yelp.htm', 'w', encoding=r.encoding) as f:
        f.write(r.text)

# Read with an explicit encoding; the platform default (e.g., cp1252 on
# Windows) may fail to decode the UTF-8 bytes in this file.
with open('yelp.htm', 'r', encoding='utf-8') as f:
    yelp_html = f.read().encode(encoding='utf-8')
    checksum = hashlib.md5(yelp_html).hexdigest()
    assert checksum == "4a74a0ee9cefee773e76a22a52d45a8e", "Downloaded file has incorrect checksum!"

print("'yelp.htm' is ready!")
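
Reading a file as text without naming an encoding relies on the platform default, which is not UTF-8 everywhere (some Windows setups default to cp1252). The sketch below uses a made-up sample string, not the Yelp! file, to show how the same bytes decode cleanly as UTF-8 but fail under cp1252. That is one plausible way to end up with a `'charmap'` decoding error.

```python
import hashlib

# Hypothetical sample text (not taken from yelp.htm): 'Á' is U+00C1,
# whose UTF-8 encoding is the two bytes 0xC3 0x81.
sample = 'Á la carte'
raw = sample.encode('utf-8')

# Decoding with the matching codec round-trips the text.
assert raw.decode('utf-8') == sample

# cp1252 assigns no character to byte 0x81, so decoding fails.
try:
    raw.decode('cp1252')
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("cp1252 failed:", err)

# An MD5 checksum is computed over bytes, so the encoding must be explicit.
checksum = hashlib.md5(raw).hexdigest()
print(len(checksum), "hex digits")
```

Passing `encoding='utf-8'` to `open()` makes the read independent of the platform default.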

**Viewing the raw HTML in your web browser.** The file you just downloaded is the raw HTML version of the data described previously. Before moving on, you should go back to that site and use your web browser to view the HTML source for the web page. Do that now to get an idea of what is in that file.

> If you don't know how to view the page source in your browser, try the instructions on [this site](http://www.wikihow.com/View-Source-Code).

**Reading the HTML file into a Python string.** Let's also open the file in Python and read its contents into a string named `yelp_html`.

In [3]:
with open('yelp.htm', encoding='utf-8') as yelp_file:
    yelp_html = yelp_file.read()
    
# Print first few hundred characters of this string:
print("*** type(yelp_html) == {} ***".format(type(yelp_html)))
n = 1000
print("*** Contents (first {} characters) ***\n{} ...".format(n, yelp_html[:n]))

*** type(yelp_html) == <class 'str'> ***
*** Contents (first 1000 characters) ***
<!DOCTYPE html>
<!-- saved from url=(0079)https://www.yelp.com/search?find_desc=fried+chicken&find_loc=Atlanta%2C+GA&ns=1 -->
<html xmlns:fb="http://www.facebook.com/2008/fbml" class="js gr__yelp_com" lang="en"><!--<![endif]--><head data-component-bound="true"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><link type="text/css" rel="stylesheet" href="./Best Fried chicken in Atlanta, GA - Yelp_files/css"><style type="text/css">.gm-style .gm-style-cc span,.gm-style .gm-style-cc a,.gm-style .gm-style-mtc div{font-size:10px}
</style><style type="text/css">@media print {  .gm-style .gmnoprint, .gmnoprint {    display:none  }}@media screen {  .gm-style .gmnoscreen, .gmnoscreen {    display:none  }}</style><style type="text/css">.gm-style-pbc{transition:opacity ease-in-out;background-color:rgba(0,0,0,0.45);text-align:center}.gm-style-pbt{font-size:22px;color:white;font-family:Roboto,Arial,san

Oy, what a mess! It would be great to have some code that reads and processes the information contained in this file.

## Exercise (5 points): Extracting the ranking

Write some Python code to create a variable named `rankings`, which is a list of dictionaries set up as follows:

* `rankings[i]` is a dictionary corresponding to the restaurant whose rank is `i+1`. For example, from the screenshot above, `rankings[0]` should be a dictionary with information about Gus's World Famous Fried Chicken.
* Each dictionary, `rankings[i]`, should have these keys:
    * `rankings[i]['name']`: The name of the restaurant, a string.
    * `rankings[i]['stars']`: The star rating, as a string, e.g., `'4.5'`, `'4.0'`
    * `rankings[i]['numrevs']`: The number of reviews, as an **integer.**
    * `rankings[i]['price']`: The price range, as dollar signs, e.g., `'$'`, `'$$'`, `'$$$'`, or `'$$$$'`.
    
Of course, since the current topic is regular expressions, you might try to apply them (possibly combined with other string manipulation methods) to find the particular patterns that yield the desired information.
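
To make the target structure concrete, here is a minimal sketch that pulls the four fields out of a tiny, made-up HTML fragment. The markup, the class name, and the pattern are assumptions for illustration only; the real Yelp! page is structured differently and will need its own patterns.

```python
import re

# Hypothetical markup: these tags and the class name are invented for this sketch.
toy_html = '''
<li><a class="name">Example Diner</a>
    <img alt="4.5 star rating"> <span>120 reviews</span> <span>$$</span></li>
<li><a class="name">Sample Shack</a>
    <img alt="3.0 star rating"> <span>8 reviews</span> <span>$</span></li>
'''

pattern = re.compile(
    r'class="name">(?P<name>[^<]+)</a>'   # restaurant name
    r'.*?alt="(?P<stars>\d\.\d) star'     # star rating, kept as a string
    r'.*?>(?P<numrevs>\d+) reviews<'      # review count
    r'.*?>(?P<price>\$+)<',               # price range as dollar signs
    re.DOTALL)                            # let '.' cross line breaks

toy_rankings = []
for m in pattern.finditer(toy_html):
    entry = m.groupdict()
    entry['numrevs'] = int(entry['numrevs'])  # the count must be an integer
    toy_rankings.append(entry)

print(toy_rankings)
```

Named groups map directly onto the dictionary keys, so only `numrevs` needs a type conversion.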

In [69]:
from bs4 import BeautifulSoup
import re

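yelp_soup = BeautifulSoup(yelp_html, 'html.parser')
# Each search result is an <li> element with the class "regular-search-result".
soup_li = yelp_soup.find_all('li', class_='regular-search-result')

names_raw = []
reviews_raw = []

for result in soup_li:
    # Fixed character offsets into the result's text. They are tuned to this
    # particular snapshot, so the slices still need cleaning up afterward.
    text = result.get_text()
    names_raw.append(text[25:57])
    reviews_raw.append(text[57:100])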

names_raw
#reviews_raw

['Gus’s World Famous Fried Chicken',
 'South City Kitchen - Midtown\n\n\n\n',
 'Mary Mac’s Tea Room\n\n\n\n\n\n\n\n     ',
 'Busy Bee Cafe\n\n\n\n\n\n\n\n           ',
 'Richards’ Southern Fried\n\n\n\n\n\n\n\n',
 'Greens & Gravy\n\n\n\n\n\n\n\n          ',
 'Colonnade Restaurant\n\n\n\n\n\n\n\n    ',
 'South City Kitchen Buckhead\n\n\n\n\n',
 'Poor Calvin’s\n\n\n\n\n\n\n\n           ',
 ' Rock’s Chicken & Fries\n\n\n\n\n\n\n\n ']

In [70]:
names = []
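# Without re.DOTALL, '.' does not match '\n', so '.*' captures only the first line.
name_matcher = re.compile('.*')
for name in names_raw:
    name_match = name_matcher.search(name)
    names.append(name_match.group().strip())

# The test cell expects '&' written as the HTML entity '&amp;'.
names = [line.replace('&', '&amp;') for line in names]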
names

['Gus’s World Famous Fried Chicken',
 'South City Kitchen - Midtown',
 'Mary Mac’s Tea Room',
 'Busy Bee Cafe',
 'Richards’ Southern Fried',
 'Greens &amp; Gravy',
 'Colonnade Restaurant',
 'South City Kitchen Buckhead',
 'Poor Calvin’s',
 'Rock’s Chicken &amp; Fries']
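
The cleanup step relies on two facts: without `re.DOTALL`, `.` does not match a newline, so `.*` stops at the end of the first line; and the test cell expects `&` written as the HTML entity `&amp;`. Here is a minimal sketch on two of the raw strings shown earlier. It uses `html.escape` from the standard library as one alternative to `str.replace`; that substitution is ours and is not part of the solution.

```python
import html
import re

raw = ['Busy Bee Cafe\n\n\n\n\n\n\n\n           ',
       ' Rock’s Chicken & Fries\n\n\n\n\n\n\n\n ']

first_line = re.compile('.*')  # '.' excludes '\n', so this is the first line only
cleaned = [first_line.search(s).group().strip() for s in raw]

# quote=False leaves quotation marks alone and escapes only &, <, and >.
escaped = [html.escape(s, quote=False) for s in cleaned]
print(escaped)
```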

In [71]:
reviews = []
cost = []
review_matcher = re.compile(r'\d+')   # first run of digits: the review count
cost_matcher = re.compile(r'\$+')     # first run of dollar signs: the price range

for review in reviews_raw:
    review_match = review_matcher.search(review)
    reviews.append(int(review_match.group()))
    cost_match = cost_matcher.search(review)
    cost.append(cost_match.group())
reviews
#cost

[549, 1777, 2241, 481, 108, 93, 350, 248, 1558, 67]
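
Because the contents of `reviews_raw` are not printed, here is a small sketch of what the two patterns do on a made-up string of roughly that shape. The sample text is an assumption, not data from the page.

```python
import re

# Hypothetical slice of text; the real contents of `reviews_raw` are not shown.
snippet = '\n\n    549 reviews\n\n\n    $$\n    Southern'

count = int(re.search(r'\d+', snippet).group())   # first run of digits
price = re.search(r'\$+', snippet).group()        # first run of dollar signs

# search() returns None when nothing matches, so calling .group() on the
# result would raise AttributeError for a listing with no price.
missing = re.search(r'\$+', 'no price listed')
print(count, price, missing)
```

Both patterns take the first match in the slice, so a digit or dollar sign appearing earlier in the text would be picked up instead.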

In [72]:
rates_raw = []
for result in soup_li:
    soup_img = result.find_all('img')
    for image in soup_img:
        rates_raw.append(image.get('alt', ''))

rates_raw

["Gus's World Famous Fried Chicken",
 '4.0 star rating',
 'V D.',
 'South City Kitchen - Midtown',
 '4.5 star rating',
 'Tori P.',
 "Mary Mac's Tea Room",
 '4.0 star rating',
 'Monique V.',
 'Busy Bee Cafe',
 '4.0 star rating',
 'Joe G.',
 "Richards' Southern Fried",
 '4.0 star rating',
 'Kurtis K.',
 'Greens & Gravy',
 '3.5 star rating',
 'Tammy J.',
 'Colonnade Restaurant',
 '4.0 star rating',
 'Peter S.',
 'South City Kitchen Buckhead',
 '4.5 star rating',
 'T. M.',
 "Poor Calvin's",
 '4.5 star rating',
 'Monique V.',
 "Rock's Chicken & Fries",
 '4.0 star rating',
 'Sabri3l A.']

In [73]:
rates = []
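rate_matcher = re.compile(r'[\d.]+')
for item in rates_raw:
    # Ratings look like '4.0 star rating'; names and empty alt text are skipped.
    # Slicing with [:1] avoids an IndexError when `item` is an empty string.
    if item[:1].isnumeric():
        rate_match = rate_matcher.search(item)
        rates.append(rate_match.group())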
rates

['4.0', '4.5', '4.0', '4.0', '4.0', '3.5', '4.0', '4.5', '4.5', '4.0']
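
Filtering on the first character would also accept a restaurant or reviewer name that happens to begin with a digit. A stricter sketch, offered as one possible variation and not as the required approach, matches the whole `alt` string against the rating format:

```python
import re

# A few `alt` strings copied from the output above, plus an empty one.
alts = ["Gus's World Famous Fried Chicken", '4.0 star rating', 'V D.',
        'Greens & Gravy', '3.5 star rating', 'Tammy J.', '']

star_pattern = re.compile(r'(\d\.\d) star rating')

stars = []
for alt in alts:
    m = star_pattern.fullmatch(alt)  # the entire string must fit the pattern
    if m:
        stars.append(m.group(1))
print(stars)
```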

In [74]:
rankings = [{'name': a, 'stars': b, 'numrevs': c, 'price': d} for a, b, c, d in zip(names, rates, reviews, cost)]
rankings

[{'name': 'Gus’s World Famous Fried Chicken',
  'numrevs': 549,
  'price': '$$',
  'stars': '4.0'},
 {'name': 'South City Kitchen - Midtown',
  'numrevs': 1777,
  'price': '$$',
  'stars': '4.5'},
 {'name': 'Mary Mac’s Tea Room',
  'numrevs': 2241,
  'price': '$$',
  'stars': '4.0'},
 {'name': 'Busy Bee Cafe', 'numrevs': 481, 'price': '$$', 'stars': '4.0'},
 {'name': 'Richards’ Southern Fried',
  'numrevs': 108,
  'price': '$$',
  'stars': '4.0'},
 {'name': 'Greens &amp; Gravy', 'numrevs': 93, 'price': '$$', 'stars': '3.5'},
 {'name': 'Colonnade Restaurant',
  'numrevs': 350,
  'price': '$$',
  'stars': '4.0'},
 {'name': 'South City Kitchen Buckhead',
  'numrevs': 248,
  'price': '$$',
  'stars': '4.5'},
 {'name': 'Poor Calvin’s', 'numrevs': 1558, 'price': '$$', 'stars': '4.5'},
 {'name': 'Rock’s Chicken &amp; Fries',
  'numrevs': 67,
  'price': '$',
  'stars': '4.0'}]
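
One caveat when combining the lists: `zip` stops at the shortest input without raising an error, so a single failed match upstream would silently drop restaurants. A minimal sketch of a length check, using made-up values and hypothetical `toy_` names:

```python
# Hypothetical values; `toy_numrevs` is one entry short on purpose.
toy_names = ['A', 'B', 'C']
toy_stars = ['4.0', '4.5', '3.5']
toy_numrevs = [10, 20]
toy_prices = ['$', '$$', '$']

columns = {'name': toy_names, 'stars': toy_stars,
           'numrevs': toy_numrevs, 'price': toy_prices}

# Compare the lengths before zipping so a mismatch is visible.
lengths = {key: len(values) for key, values in columns.items()}
consistent = len(set(lengths.values())) == 1

# zip truncates to the shortest list, so only two rows survive here.
merged = [dict(zip(columns, row)) for row in zip(*columns.values())]
print(lengths, consistent, len(merged))
```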

In [75]:
# Test cell: `rankings_test`

assert type(rankings) is list, "`rankings` must be a list"
assert all([type(r) is dict for r in rankings]), "All `rankings[i]` must be dictionaries"

print("=== Rankings ===")
for i, r in enumerate(rankings):
    print("{}. {} ({}): {} stars based on {} reviews".format(i+1,
                                                             r['name'],
                                                             r['price'],
                                                             r['stars'],
                                                             r['numrevs']))

assert rankings[0] == {'numrevs': 549, 'name': 'Gus’s World Famous Fried Chicken', 'stars': '4.0', 'price': '$$'}
assert rankings[1] == {'numrevs': 1777, 'name': 'South City Kitchen - Midtown', 'stars': '4.5', 'price': '$$'}
assert rankings[2] == {'numrevs': 2241, 'name': 'Mary Mac’s Tea Room', 'stars': '4.0', 'price': '$$'}
assert rankings[3] == {'numrevs': 481, 'name': 'Busy Bee Cafe', 'stars': '4.0', 'price': '$$'}
assert rankings[4] == {'numrevs': 108, 'name': 'Richards’ Southern Fried', 'stars': '4.0', 'price': '$$'}
assert rankings[5] == {'numrevs': 93, 'name': 'Greens &amp; Gravy', 'stars': '3.5', 'price': '$$'}
assert rankings[6] == {'numrevs': 350, 'name': 'Colonnade Restaurant', 'stars': '4.0', 'price': '$$'}
assert rankings[7] == {'numrevs': 248, 'name': 'South City Kitchen Buckhead', 'stars': '4.5', 'price': '$$'}
assert rankings[8] == {'numrevs': 1558, 'name': 'Poor Calvin’s', 'stars': '4.5', 'price': '$$'}
assert rankings[9] == {'numrevs': 67, 'name': 'Rock’s Chicken &amp; Fries', 'stars': '4.0', 'price': '$'}

print("\n(Passed!)")

=== Rankings ===
1. Gus’s World Famous Fried Chicken ($$): 4.0 stars based on 549 reviews
2. South City Kitchen - Midtown ($$): 4.5 stars based on 1777 reviews
3. Mary Mac’s Tea Room ($$): 4.0 stars based on 2241 reviews
4. Busy Bee Cafe ($$): 4.0 stars based on 481 reviews
5. Richards’ Southern Fried ($$): 4.0 stars based on 108 reviews
6. Greens &amp; Gravy ($$): 3.5 stars based on 93 reviews
7. Colonnade Restaurant ($$): 4.0 stars based on 350 reviews
8. South City Kitchen Buckhead ($$): 4.5 stars based on 248 reviews
9. Poor Calvin’s ($$): 4.5 stars based on 1558 reviews
10. Rock’s Chicken &amp; Fries ($): 4.0 stars based on 67 reviews

(Passed!)


**Fin!** This cell marks the end of Part 1. Don't forget to save, restart and rerun all cells, and submit it. When you are done, proceed to Part 2.