# Lab 2
UIC CS 418, Spring 2023

## Academic Integrity Policy

According to the Academic Integrity Policy of this course, all work submitted for grading must be done individually, unless otherwise specified. While we encourage you to talk to your peers and learn from them, this interaction must be superficial with regards to all work submitted for grading. This means you cannot work in teams, you cannot work side-by-side, you cannot submit someone else’s work (partial or complete) as your own. In particular, note that you are guilty of academic dishonesty if you extend or receive any kind of unauthorized assistance. Absolutely no transfer of program code between students is permitted (paper or electronic), and you may not solicit code from family, friends, or online forums. Other examples of academic dishonesty include emailing your program to another student, copying-pasting code from the internet, working in a group on a homework assignment, and allowing a tutor, TA, or another individual to write an answer for you. 
If you have questions, please ask us on Piazza.
You must cite (including URLs) any resources other than those linked to in the assignment or provided by the professor or TA.

Academic dishonesty is unacceptable, and penalties range from failure to expulsion from the university; cases are handled via the official student conduct process described at https://dos.uic.edu/conductforstudents.shtml.

We will run your code through MOSS software to detect copying and plagiarism.

## To submit this assignment:
1. Execute all commands and complete this notebook.
2. Download your Python Notebook (**.ipynb** file) and upload it to Gradescope under *Lab 2*. **Make sure you check that your .ipynb file includes all parts of your solution (including the outputs).**
3. Export your Notebook as a Python file (**.py** file) and upload it to Gradescope under *.py file for Lab 2*.


### Part 1 - Questions (50%)

The practice problems below use the Department of Transportation's "On-Time" flight data for all flights originating from SFO or OAK in January 2016. Information about the airports and airlines is contained in the comma-delimited files `airports.dat` and `airlines.dat`, respectively.  Both were sourced from http://openflights.org/data.html.

Disclaimer: There is a more direct way of dealing with time data that is not presented in these problems.  This activity is merely an academic exercise.

In [None]:
import pandas as pd

#### Setup

In [None]:
flights = pd.read_csv("/content/flights.dat", dtype={'sched_dep_time': 'f8', 'sched_arr_time': 'f8'})
# show the first few rows, by default 5
flights.head(3) 

Unnamed: 0,year,month,day,date,carrier,tailnum,flight,origin,destination,sched_dep_time,actual_dep_time,sched_arr_time,actual_arr_time
0,2016,1,1,2016-01-01,AA,N3FLAA,208,SFO,MIA,630.0,628.0,1458.0,1431.0
1,2016,1,2,2016-01-02,AA,N3APAA,208,SFO,MIA,600.0,553.0,1428.0,1401.0
2,2016,1,3,2016-01-03,AA,N3DNAA,208,SFO,MIA,630.0,626.0,1458.0,1431.0


In [None]:
airports_cols = [
    'openflights_id',
    'name',
    'city',
    'country',
    'iata',
    'icao',
    'latitude',
    'longitude',
    'altitude',
    'tz',
    'dst',
    'tz_olson',
    'type',
    'airport_dsource'
]

airports = pd.read_csv("airports.dat", names=airports_cols)
airports.head(3)

Unnamed: 0,openflights_id,name,city,country,iata,icao,latitude,longitude,altitude,tz,dst,tz_olson,type,airport_dsource
0,1,Goroka,Goroka,Papua New Guinea,GKA,AYGA,-6.081689,145.391881,5282,10.0,U,Pacific/Port_Moresby,,
1,2,Madang,Madang,Papua New Guinea,MAG,AYMD,-5.207083,145.7887,20,10.0,U,Pacific/Port_Moresby,,
2,3,Mount Hagen,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.826789,144.295861,5388,10.0,U,Pacific/Port_Moresby,,


#### Question 1.1 **(20%)**
It looks like the departure and arrival times in `flights` were read in as floating-point numbers.  Write two functions, `extract_hour` and `extract_mins`, that convert military time to hours and minutes, respectively. Hint: You may want to use modular arithmetic and integer division. Keep in mind that the data has not been cleaned and you need to check whether the extracted values are valid. Replace all the invalid values with `NaN`. The documentation for `pandas.Series.where` provided [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html) should be helpful.
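
As a rough illustration of the hint, the sketch below splits HHMM values with integer division and the modulo operator, then uses `Series.where` to blank out entries that fail a validity check. The sample values and the names `times`, `hours`, `mins`, and `valid` are made up for this example, and invalidating both parts together is a choice made here; it is one possible approach, not the required solution.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1275 has invalid minutes, 2400 has an invalid hour.
times = pd.Series([1030.0, 5.0, 1275.0, 2400.0, np.nan])

hours = times // 100   # integer division keeps the leading (hour) digits
mins = times % 100     # modulo keeps the last two (minute) digits

# Series.where keeps a value where the condition is True and writes NaN elsewhere.
# Comparisons against NaN are False, so missing inputs stay NaN.
valid = hours.between(0, 23) & mins.between(0, 59)
hours = hours.where(valid)
mins = mins.where(valid)

print(pd.DataFrame({"time": times, "hour": hours, "minute": mins}))
```

Only the first two rows survive the check here: `1030.0` becomes hour 10, minute 30, and `5.0` (five minutes past midnight) becomes hour 0, minute 5.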

In [None]:
import numpy as np

def extract_hour(time):
    """
    Extracts hour information from military time.
    
    Args: 
        time (float64): series of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): series of input dimension with hour information.  
          Should only take on integer values in 0-23
    """
    # [YOUR CODE HERE]
    # Divide by 100 and round down to keep only the hour digits, e.g. 1259 -> 12.
    hour = np.floor(time / 100)
    # Series.where works on the whole series at once (no need to iterate over elements):
    # it keeps values where the condition holds and writes NaN elsewhere.
    # Valid hours are 0 through 23 inclusive, so midnight (hour 0) is kept.
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html
    hour = hour.where((hour >= 0) & (hour <= 23))
    return hour




# test your code to receive credit
test_ser = pd.Series([1030.0, 1259.0, np.nan, 2400], dtype='float64')
extract_hour(test_ser)

# 0    10.0
# 1    12.0
# 2     NaN
# 3     NaN
# dtype: float64

0    10.0
1    12.0
2     NaN
3     NaN
dtype: float64

In [None]:
def extract_mins(time):
    """
    Extracts minute information from military time
    
    Args: 
        time (float64): series of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): series of input dimension with minute information.  
          Should only take on integer values in 0-59
    """

    hour = extract_hour(time)  # e.g. 2234 -> 22; NaN when the hour is invalid
    mins = time - hour * 100   # e.g. 2234 - 22 * 100 = 34
    # Only 0 through 59 is valid; anything else (including NaN) becomes NaN.
    mins = mins.where((mins >= 0) & (mins <= 59))
    return mins


# test your code to receive credit
test_ser = pd.Series([1030.0, 1259.0, np.nan, 2475], dtype='float64')
extract_mins(test_ser)

# 0    30.0
# 1    59.0
# 2     NaN
# 3     NaN
# dtype: float64

0    30.0
1    59.0
2     NaN
3     NaN
dtype: float64

#### Question 1.2 **(20%)**

Using your two functions above, filter the `flights` data for flights that departed 20 or more minutes later than scheduled by comparing `sched_dep_time` and `actual_dep_time`.  You need not worry about flights that were delayed to the next day for this question.
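
To see why the times need converting before they are subtracted, consider a flight scheduled for 12:55 that leaves at 13:05. The pure-Python sketch below uses a hypothetical helper, `to_minute_of_day`, which is separate from the functions you are asked to write and works on single values rather than a series.

```python
def to_minute_of_day(hhmm):
    """Convert a time such as 1305 (HHMM) to minutes since midnight."""
    hour, minute = divmod(int(hhmm), 100)
    return hour * 60 + minute

sched, actual = 1255, 1305

naive_diff = actual - sched  # treats HHMM as a plain number
true_delay = to_minute_of_day(actual) - to_minute_of_day(sched)

print(naive_diff)  # 50
print(true_delay)  # 10
```

Subtracting the raw HHMM values overstates the delay by 40 minutes whenever the two times fall in different hours, because each hour boundary adds 100 instead of 60.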

In [None]:
def convert_to_minofday(time):
    """

    Converts military time to minute of day
    
    Args:
        time (float64): series of time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
    
    Returns:
        array (float64): series of input dimension with minute of day
    
    Example: 1:03pm is converted to 783.0

    """
    
    # [YOUR CODE HERE]
    mins = extract_mins(time)
    hour = extract_hour(time)
    # Straightforward arithmetic once the hour and minute of the day are known
    min_day = hour * 60 + mins
    return min_day

# Test your code  to receive credit
ser = pd.Series([1303, 1200, 2400], dtype='float64')
convert_to_minofday(ser)

# 0    783.0
# 1    720.0
# 2      NaN
# dtype: float64

0    783.0
1    720.0
2      NaN
dtype: float64

In [None]:
def calc_time_diff(x, y):
    """
    Calculates delay times y - x
    
    Args:
        x (float64): series of scheduled time given in military format.  
          Takes on values in 0.0-2359.0 due to float64 representation.
        y (float64): series of same dimensions giving actual time
    
    Returns:
        array (float64): series of input dimension with delay time
    """
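    # [YOUR CODE HERE]
    # Element-wise difference of two float64 series. Convert both series with
    # convert_to_minofday first when the times may fall in different hours.
    return y - x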
    
#Test your code  to receive credit
sched = pd.Series([1303, 1210], dtype='float64')
actual = pd.Series([1304, 1215], dtype='float64')
calc_time_diff(sched, actual)

# 0    1.0
# 1    5.0
# dtype: float64

0    1.0
1    5.0
dtype: float64

In [None]:
### Apply your functions here to calculate delay between `sched_dep_time` and `actual_dep_time` on flights, to receive credit.
# [YOUR CODE HERE]


delay = calc_time_diff(convert_to_minofday(flights['sched_dep_time']),
                       convert_to_minofday(flights['actual_dep_time']))
# Flights that departed 20 or more minutes later than scheduled
delayed20 = flights[delay >= 20]
# 0         -2.0
# 1         -7.0
# 2         -4.0
# 3         -4.0
# 4         -8.0
#          ...  
# 16856     56.0
# 16857     74.0
# 16858    196.0
# 16859    169.0
# 16860    137.0
# Length: 16861, dtype: float64
delay

0         -2.0
1         -7.0
2         -4.0
3         -4.0
4         -8.0
         ...  
16856     56.0
16857     74.0
16858    196.0
16859    169.0
16860    137.0
Length: 16861, dtype: float64

In [None]:
flights.head(2)

Unnamed: 0,year,month,day,date,carrier,tailnum,flight,origin,destination,sched_dep_time,actual_dep_time,sched_arr_time,actual_arr_time
0,2016,1,1,2016-01-01,AA,N3FLAA,208,SFO,MIA,630.0,628.0,1458.0,1431.0
1,2016,1,2,2016-01-02,AA,N3APAA,208,SFO,MIA,600.0,553.0,1428.0,1401.0


In [None]:
airports

Unnamed: 0,openflights_id,name,city,country,iata,icao,latitude,longitude,altitude,tz,dst,tz_olson,type,airport_dsource
0,1,Goroka,Goroka,Papua New Guinea,GKA,AYGA,-6.081689,145.391881,5282,10.0,U,Pacific/Port_Moresby,,
1,2,Madang,Madang,Papua New Guinea,MAG,AYMD,-5.207083,145.788700,20,10.0,U,Pacific/Port_Moresby,,
2,3,Mount Hagen,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.826789,144.295861,5388,10.0,U,Pacific/Port_Moresby,,
3,4,Nadzab,Nadzab,Papua New Guinea,LAE,AYNZ,-6.569828,146.726242,239,10.0,U,Pacific/Port_Moresby,,
4,5,Port Moresby Jacksons Intl,Port Moresby,Papua New Guinea,POM,AYPY,-9.443383,147.220050,146,10.0,U,Pacific/Port_Moresby,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8102,9537,Mansons Landing Water Aerodrome,Mansons Landing,Canada,YMU,\N,50.066667,-124.983333,0,-8.0,A,America/Vancouver,,
8103,9538,Port McNeill Airport,Port McNeill,Canada,YMP,\N,50.575556,-127.028611,225,-8.0,A,America/Vancouver,,
8104,9539,Sullivan Bay Water Aerodrome,Sullivan Bay,Canada,YTG,\N,50.883333,-126.833333,0,-8.0,A,America/Vancouver,,
8105,9540,Deer Harbor Seaplane,Deer Harbor,United States,DHB,\N,48.618397,-123.005960,0,-8.0,A,America/Los_Angeles,,


In [None]:
airports_new = airports[['city','iata']]
airports_new.set_index('iata') # Airports iata as index with the city column

Unnamed: 0_level_0,city
iata,Unnamed: 1_level_1
GKA,Goroka
MAG,Madang
HGU,Mount Hagen
LAE,Nadzab
POM,Port Moresby
...,...
YMU,Mansons Landing
YMP,Port McNeill
YTG,Sullivan Bay
DHB,Deer Harbor


#### Question 1.3 **(10%)**

Using your answer from question 1.2, find the full name of every destination city with a flight from SFO or OAK that was delayed by 60 or more minutes.  The airport codes used in `flights` are IATA codes.  Sort the cities alphabetically. Make sure you remove duplicates. (You may find `drop_duplicates` and `sort_values` helpful.)
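
The lookup described above can be sketched on toy data. The two small frames below (`late` and `lookup`) and their contents are invented for illustration; they only mimic the shape of `flights` and `airports`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
late = pd.DataFrame({"destination": ["SEA", "LAX", "SEA", "ORD"]})
lookup = pd.DataFrame({
    "iata": ["LAX", "ORD", "SEA", "JFK"],
    "city": ["Los Angeles", "Chicago", "Seattle", "New York"],
})

cities = (
    lookup.loc[lookup["iata"].isin(late["destination"]), "city"]  # keep matching codes
    .drop_duplicates()          # one row per city
    .sort_values()              # alphabetical order
    .reset_index(drop=True)     # renumber rows from 0
)
print(cities.tolist())  # ['Chicago', 'Los Angeles', 'Seattle']
```

`isin` handles the repeated `SEA` entry on the flights side, while `drop_duplicates` guards against several airports mapping to the same city.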

In [None]:
# Complete code here to receive credit.
# HINT: You will need to use `delayed20` and `airport` dataframes
# [YOUR CODE HERE]
# Flights from SFO or OAK delayed by 60 or more minutes. `delay` shares its index with
# `flights`, so both conditions can be combined into a single boolean mask.
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html
delayed_destinations_iata = flights[(delay >= 60) & flights['origin'].isin(['SFO', 'OAK'])]
# IATA codes of the destinations that satisfy the conditions above
delayed_airports = delayed_destinations_iata['destination']
# Look up the city for each IATA code, then sort the rows by city name
delayed_destinations = airports_new[airports_new['iata'].isin(delayed_airports)].sort_values(by='city')
# Drop duplicate cities and reset the index so the original row labels are removed
delayed_destinations = delayed_destinations['city'].drop_duplicates().reset_index().drop('index', axis=1)
delayed_destinations
# 0     Albuquerque
# 1       Anchorage
# 2       Arcata CA
# 3           Aspen
# 4         Atlanta
#          ...     
# 65        Seattle
# 66        Spokane
# 67      St. Louis
# 68         Tucson
# 69     Washington
# Name: city, Length: 70, dtype: object



Unnamed: 0,city
0,Albuquerque
1,Anchorage
2,Arcata CA
3,Aspen
4,Atlanta
...,...
65,Seattle
66,Spokane
67,St. Louis
68,Tucson


## Part 2: Web scraping and data collection (50%)

### Note and Setup

Here, you will practice collecting and processing data in Python. By the end of this exercise hopefully you should look at the wonderful world wide web without fear, comforted by the fact that anything you can see with your human eyes, a computer can see with its computer eyes. In particular, we aim to give you some familiarity with:

* Using HTTP to fetch the content of a website
* HTTP Requests (and lifecycle)
* RESTful APIs
    * Authentication (OAuth)
    * Pagination
    * Rate limiting
* JSON vs. HTML (and how to parse each)
* HTML traversal (CSS selectors)

Since everyone loves food (presumably), the ultimate end goal of this homework will be to acquire the data to answer some questions and hypotheses about the restaurant scene in Chicago (which we will get to later). We will download __both__ the metadata on restaurants in Chicago from the Yelp API and with this metadata, retrieve the comments/reviews and ratings from users on restaurants.

**Library Documentation:**

For solving this part, you need to look up online documentation for the Python packages you will use:

* Standard Library: 
    * [io](https://docs.python.org/3/library/io.html)
    * [time](https://docs.python.org/3/library/time.html)
    * [json](https://docs.python.org/3/library/json.html)

* Third Party
    * [requests](https://requests.readthedocs.io/en/latest/)
    * [Beautiful Soup (version 4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
    * [yelp-fusion](https://www.yelp.com/developers/documentation/v3/get_started)

**Note:** You may come across a `yelp-python` library online. The library is deprecated and incompatible with the current Yelp API, so do not use the library.

First, import necessary libraries:

In [None]:
import io, time, json
import requests
from bs4 import BeautifulSoup

### Authentication and working with APIs



There are various authentication schemes that APIs use, listed here in relative order of complexity:

* No authentication
* [HTTP basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication)
* Cookie based user login
* OAuth (v1.0 & v2.0, see this [post](http://stackoverflow.com/questions/4113934/how-is-oauth-2-different-from-oauth-1) explaining the differences)
* API keys
* Custom Authentication

For the NYT example below (**Q2.1**), we do not need to authenticate because the page is publicly visible. HTTP basic authentication isn't too common for consumer sites/applications that have the concept of user accounts (like Facebook, LinkedIn, Twitter, etc.), but it is simple to set up quickly, and you often encounter it on individual password-protected pages/sites.

Cookie based user login is what the majority of services use when you login with a browser (i.e. username and password). Once you sign in to a service like Facebook, the response stores a cookie in your browser to remember that you have logged in (HTTP is stateless). Each subsequent request to the same domain (i.e. any page on `facebook.com`) also sends the cookie that contains the authentication information to remind Facebook's servers that you have already logged in.
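
A minimal sketch of that behaviour with `requests.Session`, which keeps a cookie jar across requests. The cookie name, value, and `example.com` URL are placeholders, and the request is only prepared, never sent, so no network access is needed.

```python
import requests

session = requests.Session()
# Pretend the server set this cookie in its login response (hypothetical values).
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# Build, but do not send, a follow-up request to the same domain.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)

# The session attaches the stored cookie to the outgoing request headers.
print(prepared.headers.get("Cookie"))
```

In a real login flow the cookie would arrive in the server's `Set-Cookie` response header and the session would store it automatically.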

Many REST APIs, however, use OAuth (authentication using tokens), which can be thought of as a programmatic way to "login" _another_ user. Using tokens, a user (or application) only needs to send the login credentials once in the initial authentication and, as a response from the server, gets a special signed token. This signed token is then sent in future requests to the server (in place of the user credentials).

A similar concept commonly used by many APIs is to assign API Keys to each client that needs access to server resources. The client must then pass the API Key along with _every_ request it makes to the API to authenticate. This is because the server is typically relatively stateless and does not maintain a session between subsequent calls from the same client. Most APIs (including Yelp) allow you to pass the API Key via a special HTTP Header: `Authorization: Bearer <API_KEY>`. Check out the [docs](https://www.yelp.com/developers/documentation/v3/authentication) for more information.
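
As an illustration, the sketch below builds such a request with `requests` and inspects it without sending anything. The key is a placeholder and the query parameters are arbitrary choices for the example.

```python
import requests

# Placeholder key and an illustrative search; nothing is sent over the network.
api_key = "YOUR_YELP_API_KEY"
request = requests.Request(
    "GET",
    "https://api.yelp.com/v3/businesses/search",
    headers={"Authorization": f"Bearer {api_key}"},
    params={"location": "Chicago", "limit": 5},
)
prepared = request.prepare()

print(prepared.url)                       # query parameters are appended to the URL
print(prepared.headers["Authorization"])  # the key travels in a header instead
```

Keeping the key in a header rather than in the query string means it does not show up in the URL itself.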

### Question 2.1: Basic HTTP Requests w/o authentication **(5%)**

First, let's do the "hello world" of making web requests with Python to get a sense for how to programmatically access web pages: an (unauthenticated) HTTP GET to download a web page.

Fill in the function to use `requests` to download and return the raw HTML content of the URL passed in as an argument. As an example try the following NYT article (on YouTube's algorithmic recommendation): [https://www.nytimes.com/2019/03/29/technology/youtube-online-extremism.html](https://www.nytimes.com/2019/03/29/technology/youtube-online-extremism.html)

Your function should return a tuple of: (`<status_code>`, `<text>`). (Hint: look at the **Library documentation** listed earlier to see how `requests` should work.) 

In [None]:
import requests
def retrieve_html(url):
    """
    Return the raw HTML at the specified URL.

    Args:
        url (string): 

    Returns:
        status_code (integer):
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """

    # [YOUR CODE HERE]
    r = requests.get(url)
    # used the request library documentation & https://www.w3schools.com/python/ref_requests_get.asp

    return r.status_code, r.text

In [None]:
#  test your function here to receive credit
test = retrieve_html('https://martinheinz.dev/blog/31')
print(test)
#(200, '<!DOCTYPE html>\n<html  data-head-attrs="">\n\n<head >\n  <title>Scraping News and Articles From Public APIs with Python | Martin Heinz .....

(200, '<!DOCTYPE html>\n<html  data-head-attrs="">\n\n<head >\n  <title>Scraping News and Articles From Public APIs with Python | Martin Heinz | Personal Website & Blog</title><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><meta name="twitter:title" content="Scraping News and Articles From Public APIs with Python"><meta name="twitter:text:title" content="Scraping News and Articles From Public APIs with Python"><meta name="og:url" content="https://martinheinz.dev/blog/31"><meta name="og:type" content="article"><meta name="article:published_time" content="2020-08-20T17:30:00Z"><meta name="article:section" content="Technology"><script type="application/ld+json">{"@context":"http://schema.org","@type":"BlogPosting","headline":"Scraping News and Articles From Public APIs with Python","description":"Scraping News and Articles From Public APIs with Python","image":"/favicon.ico","url":"https://martinheinz.dev/blog/31","datePublished":"2020-08-20T17:3

Now while this example might have been fun, we haven't yet done anything more than we could with a web browser. To really see the power of programmatically making web requests we will need to interact with an API. For the rest of this lab we will be working with the [Yelp API](https://www.yelp.com/developers/documentation/v3/get_started) and Yelp data (for an extensive data dump see their [Academic Dataset Challenge](https://www.yelp.com/dataset_challenge)). 

### Yelp API Access

The reasons for using the Yelp API are threefold:

1. Incredibly rich dataset that combines:
    * entity data (users and businesses)
    * preferences (i.e. ratings)
    * geographic data (business location and check-ins)
    * temporal data
    * text in the form of reviews
    * and even images.
2. Well [documented API](https://www.yelp.com/developers/documentation/v3/get_started) with thorough examples.
3. Extensive data coverage so that you can find data that you know personally (from your home town/city or account). This will help with understanding and interpreting your results.

Yelp used to use OAuth tokens but has now switched to API Keys. **For the sake of backwards compatibility Yelp still provides a Client ID and Secret for OAuth, but you will not need those for this assignment.** 

To access the Yelp API, we will need to go through a few more steps than we did with the first NYT example. Most large web-scale companies use a combination of authentication and rate limiting to control access to their data and to ensure that everyone using it abides by their terms of use. The first step (even before we make any request) is to set up a Yelp account if you do not have one and get API credentials.

1. Create a [Yelp](https://www.yelp.com/login) account (if you do not have one already)
2. [Generate API keys](https://www.yelp.com/developers/v3/manage_app) (if you haven't already). You will only need the API Key (not the Client ID or Client Secret) -- more on that later.

Now that we have our accounts set up, we can start making requests!


### Question 2.2: Authenticated HTTP Request with the Yelp API **(15%)**



First, store your Yelp credentials in a local file (kept out of version control) which you can read in to authenticate with the API. This file can be any format/structure since you will fill in the function stub below.

For example, you may want to store your key in a file called `yelp_api_key.txt` (run in a notebook cell, or drop the leading `!` to run it in a terminal):
```bash
!echo 'YOUR_YELP_API_KEY' > yelp_api_key.txt
```

**KEEP THE API KEY FILE PRIVATE AND OUT OF VERSION CONTROL (and definitely do not submit them to Gradescope!)**

You can then read from the file using:

In [None]:
!echo 'YOUR_YELP_API_KEY' > yelp_api_key.txt

In [None]:
with open('yelp_api_key.txt', 'r') as f:
    api_key = f.read().replace('\n','')
    print(api_key)
    # verify your api_key is correct
# DO NOT FORGET TO CLEAR THE OUTPUT TO KEEP YOUR API KEY PRIVATE



In [None]:
def read_api_key(filepath):
    """
    Read the Yelp API Key from file.
    
    Args:
        filepath (string): File containing API Key
    Returns:
        api_key (string): The API Key
    """
    
    # feel free to modify this function if you are storing the API Key differently
    with open(filepath, 'r') as f:
        return f.read().replace('\n','')

Using the Yelp API, fill in the following function stub to make an authenticated request to the [search](https://www.yelp.com/developers/documentation/v3/business_search) endpoint. Remember Yelp allows you to pass the API Key via a special HTTP Header: `Authorization: Bearer <API_KEY>`. Check out the [docs](https://www.yelp.com/developers/documentation/v3/authentication) for more information.

In [None]:
def location_search_params(api_key, location, **kwargs):
    """
    Construct url, headers and url_params. Reference API docs (link above) to use the arguments

    Returns a tuple (url, headers, url_params): the endpoint URL as a string, plus two
    dicts holding the HTTP headers and the query parameters that are sent to the API.
    """
    # [YOUR CODE HERE]
    # What is the url endpoint for search?
    url = 'https://api.yelp.com/v3/businesses/search'

    # How is Authentication performed? The API key is sent in the Authorization header;
    # the url params then narrow down the search.
    # https://docs.developer.yelp.com/docs/fusion-authentication
    headers = {
        'Authorization': f'Bearer {api_key}'
    }

    # SPACES in url is problematic. How should you handle location containing spaces?
    # Yelp's own search URLs represent spaces as '+', e.g.
    # https://www.yelp.com/search?find_desc=new+york&find_loc=Chicago%2C+IL+60608
    url_params = {'location': location.replace(' ', '+')}

    # Include keyword arguments in url_params.
    # kwargs is just another dictionary, so update() merges it into url_params.
    url_params.update(kwargs)

    return url, headers, url_params

Hint: `**kwargs` represents keyword arguments that are passed to the function. For example, suppose you call the function as `location_search_params(api_key, location, offset=0, limit=50)`. The arguments `api_key` and `location` are called *positional arguments*, and the key-value pair arguments are called **keyword arguments**. Your `kwargs` variable will be a Python dictionary with those keyword arguments.
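
A small standalone sketch of how `**kwargs` collects keyword arguments. The function `show_arguments` is a hypothetical stand-in written for this example, not part of the assignment.

```python
def show_arguments(api_key, location, **kwargs):
    """Return the positional arguments and the collected keyword arguments."""
    return (api_key, location), kwargs

positional, keyword = show_arguments("test_api_key_xyz", "Chicago", offset=0, limit=50)
print(positional)  # ('test_api_key_xyz', 'Chicago')
print(keyword)     # {'offset': 0, 'limit': 50}

# A dict can be unpacked back into keyword arguments with **.
extra = {"offset": 20, "limit": 10}
_, keyword = show_arguments("test_api_key_xyz", "Chicago", **extra)
print(keyword == extra)  # True
```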

In [None]:
# Test your code here to receive credit.
api_key = "test_api_key_xyz"
location = "Chicago"
url, headers, url_params = location_search_params(api_key, location, offset=0, limit=50)
url, headers, url_params
# ('https://api.yelp.com/v3/businesses/search',
#  {'Authorization': 'Bearer test_api_key_xyz'},
#  {'location': 'Chicago', 'offset': 0, 'limit': 50})

('https://api.yelp.com/v3/businesses/search',
 {'Authorization': 'Bearer test_api_key_xyz'},
 {'location': 'Chicago', 'offset': 0, 'limit': 50})

Now use `location_search_params(api_key, location, **kwargs)` to actually search for restaurants using the Yelp API. Most of the code is provided to you. Complete the `api_get_request` function given below.

In [None]:
def api_get_request(url, headers, url_params):
    """
    Send a HTTP GET request and return a json response

    Args:
        url (string): API endpoint url
        headers (dict): A python dictionary containing HTTP headers including Authentication to be sent
        url_params (dict): The parameters (required and optional) supported by endpoint

    Returns:
        results (json): response as json

    """

    http_method = 'GET'

    # https://www.w3schools.com/python/module_requests.asp https://www.w3schools.com/python/ref_requests_get.asp
    # requests.get accepts both headers and params, which is all the Yelp API needs.
    response = requests.get(url=url, headers=headers, params=url_params)
    # https://realpython.com/python-requests/#the-get-request
    # .json() parses the JSON response body into Python objects
    return response.json()

def yelp_search(api_key, location, offset=0):
    """
    Make an authenticated request to the Yelp API.

    Args:
        api_key (string): Your Yelp API Key for Authentication
        location (string): Business Location
        offset (int): param for pagination

    Returns:
        total (integer): total number of businesses on Yelp corresponding to the location
        businesses (list): list of dicts representing each business
    """
    url, headers, url_params = location_search_params(api_key, location, offset=offset)
    response_json = api_get_request(url, headers, url_params)
    return response_json["total"], list(response_json["businesses"])

# test your code here to receive credit
# [YOUR CODE HERE]
api_key = read_api_key('yelp_api_key.txt')
num_records, data = yelp_search(api_key, 'Chicago')
print(num_records)
#7700
print(len(data))
#20
print(list(map(lambda x: x['name'], data)))
#['Girl & The Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', 'Art Institute of Chicago', "Lou Malnati's Pizzeria" .....

7700
20
['Girl & The Goat', 'Wildberry Pancakes and Cafe', 'Au Cheval', 'The Purple Pig', "Lou Malnati's Pizzeria", 'Art Institute of Chicago', "Bavette's Bar & Boeuf", 'Cafe Ba-Ba-Reeba!', 'Quartino Ristorante', "Pequod's Pizzeria", "Joe's Seafood, Prime Steak & Stone Crab", 'Alinea', "Portillo's & Barnelli's Chicago", 'Xoco', 'The Gage', 'Sapori Trattoria', 'Millennium Park', 'Three Dots and A Dash', 'Avec - Chicago', 'RPM Italian']


### Parameterization and Pagination

Now that we have completed the "hello world" of working with the Yelp API, we are ready to really fly! The rest of the exercise will have a bit less direction since there are a variety of ways to retrieve the requested information, but you should have all the component knowledge at this point to work with the API. Yelp, being a fairly general platform, actually has many more businesses than just restaurants, but by using the flexibility of the API we can ask it to only return the restaurants.



And before we can get any reviews on restaurants, we need to actually get the metadata on ALL of the restaurants in Chicago. Notice above that while Yelp told us that there are thousands of matching businesses (`total` was 7700 in the output above), the response contained only 20 actual `Business` objects. This is due to pagination, which is a safeguard against returning __TOO__ much data in a single request (what would happen if there were 100,000 restaurants?) and can be used in conjunction with _rate limiting_ as a way to throttle and protect access to Yelp data.

> As a thought exercise, consider: If an API has 1,000,000 records, but only returns 10 records per page and limits you to 5 requests per second... how long will it take to acquire ALL of the records contained in the API?
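
One way to work through the thought exercise is to count the requests first and then divide by the allowed rate. The short calculation below assumes every request succeeds and the rate limit is the only bottleneck.

```python
import math

records = 1_000_000
per_page = 10
requests_per_second = 5

pages = math.ceil(records / per_page)   # 100,000 requests are needed
seconds = pages / requests_per_second   # 20,000 seconds at full speed
hours = seconds / 3600

print(pages, seconds, round(hours, 2))  # 100000 20000.0 5.56
```

So even at the maximum permitted rate, a full download takes roughly five and a half hours.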

One of the ways that APIs are an improvement over plain web scraping is the ability to make __parameterized__ requests. Just like the Python functions you have been writing have arguments (or parameters) that allow you to customize its behavior/actions (an output) without having to rewrite the function entirely, we can parameterize the queries we make to the Yelp API to filter the results it returns.

### Question 2.3: Acquire all of the restaurants in Chicago on Yelp **(10%)**



Again using the [API documentation](https://www.yelp.com/developers/documentation/v3/business_search) for the `search` endpoint, fill in the following function to retrieve all of the _Restaurants_ (using categories) for a given query. Again you should use your `read_api_key()` function outside of the `all_restaurants()` stub to read the API Key used for the requests. You will need to account for __pagination__ and __[rate limiting](https://www.yelp.com/developers/faq)__ to:

1. Retrieve all of the Business objects (# of business objects should equal `total` in the response). **Paginate by querying 10 restaurants each request.**
2. Pause slightly (at least 200 milliseconds) between subsequent requests so as to not overwhelm the API (and get blocked).  
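
The two requirements above can be sketched without touching the network. In the example below, `fetch_all` and `fake_fetch_page` are hypothetical names, and the fake function stands in for a real API call that returns one page of results.

```python
import time

def fetch_all(fetch_page, total, limit=10, pause=0.2):
    """Collect `total` items by calling fetch_page(offset, limit) page by page."""
    items = []
    for offset in range(0, total, limit):
        if offset > 0:
            time.sleep(pause)  # wait between consecutive requests
        items.extend(fetch_page(offset, limit))
    return items

# Stand-in for a real API call: pretend the server holds 25 numbered records.
def fake_fetch_page(offset, limit):
    return list(range(25))[offset:offset + limit]

results = fetch_all(fake_fetch_page, total=25, pause=0.01)
print(len(results))  # 25
```

The offsets visited are 0, 10, and 20, so the last page is only partly full; the default `pause=0.2` corresponds to the 200 millisecond minimum mentioned above.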

As always with API access, make sure you follow all of the [API's policies](https://www.yelp.com/developers/api_terms) and use the API responsibly and respectfully.

**DO NOT MAKE TOO MANY REQUESTS TOO QUICKLY OR YOUR KEY MAY BE BLOCKED**

In [None]:
import math 

def paginated_restaurant_search_requests(api_key, location, total):
    """
    Returns a list of tuples (url, headers, url_params) for paginated search of all restaurants
    Args:
        api_key (string): Your Yelp API Key for Authentication
        location (string): Business Location
        total (int): Total number of items to be fetched
    Returns:
        results (list): list of tuple (url, headers, url_params)
    """
    # HINT: Use total, offset and limit for pagination
    # You can reuse function location_search_params(...)
    # [YOUR CODE HERE]
    limit = 10
    # Pagination splits the results across several requests. We need total / limit
    # requests (rounded up) to cover every result at `limit` restaurants per request,
    # e.g. total = 100 and limit = 10 gives 100 / 10 = 10 requests.
    iterations = math.ceil(total / limit)

    results = []
    for i in range(iterations):
        # Each request skips the entries already covered: offsets 0, 10, 20, and so on.
        offset = i * limit
        # Reuse location_search_params(...); offset, limit and categories are passed
        # as keyword arguments and end up in the URL parameters.
        result = location_search_params(api_key, location, offset=offset, limit=limit,
                                        categories='restaurants')
        # append() takes a single element while extend() takes an iterable
        # (list, tuple, dictionary, set, string), so append() fits a single tuple.
        # source: https://www.freecodecamp.org/news/python-list-append-vs-python-list-extend
        results.append(result)

    return results  # list of tuples, e.g. [(...), (...), (...)]

# Test your code here to receive credit.
api_key = "test_api_key_xyz"
location = "Chicago"
all_restaurants_requests = paginated_restaurant_search_requests(api_key, location, 15)
all_restaurants_requests

# [('https://api.yelp.com/v3/businesses/search',
#   {'Authorization': 'Bearer test_api_key_xyz'},
#   {'location': 'Chicago',
#    'offset': 0,
#    'limit': 10,
#    'categories': 'restaurants'}),
#  ('https://api.yelp.com/v3/businesses/search',
#   {'Authorization': 'Bearer test_api_key_xyz'},
#   {'location': 'Chicago',
#    'offset': 10,
#    'limit': 10,
#    'categories': 'restaurants'})]

[('https://api.yelp.com/v3/businesses/search',
  {'Authorization': 'Bearer test_api_key_xyz'},
  {'location': 'Chicago',
   'offset': 0,
   'limit': 10,
   'categories': 'restaurants'}),
 ('https://api.yelp.com/v3/businesses/search',
  {'Authorization': 'Bearer test_api_key_xyz'},
  {'location': 'Chicago',
   'offset': 10,
   'limit': 10,
   'categories': 'restaurants'})]

In [None]:
from time import sleep
def all_restaurants(api_key, location):
    """
    Construct the pagination requests for ALL the restaurants on Yelp for a given location.

    Args:
        api_key (string): Your Yelp API Key for Authentication
        location (string): Business Location

    Returns:
        results (list): list of dicts representing each restaurant
    """
    # What keyword arguments should you pass to get first page of restaurants in Yelp
    # For the first page: offset=0 and limit=10. categories='restaurants' matches the
    # paginated requests below, so `total` counts restaurants only.
    url, headers, url_params = location_search_params(api_key, location, offset=0, limit=10,
                                                      categories='restaurants')

    response_json = api_get_request(url, headers, url_params)
    total_items = response_json["total"]

    all_restaurants_request = paginated_restaurant_search_requests(api_key, location, total_items)

    # Use returned list of (url, headers, url_params) and function api_get_request to retrieve all restaurants
    # REMEMBER to pause slightly after each request.
    # [YOUR CODE HERE]

    results = []  # list of dicts, one per restaurant
    # all_restaurants_request is a list of (url, headers, url_params) tuples,
    # which are exactly the arguments api_get_request needs.
    for url, headers, url_params in all_restaurants_request:
        response_json = api_get_request(url, headers, url_params)
        results.extend(response_json['businesses'])
        # Pause 0.5 seconds between requests (the assignment asks for at least 200 ms).
        # https://www.programiz.com/python-programming/time/sleep
        sleep(0.5)

    return results

The **all_restaurants** function retrieves information about all the restaurants in a specific location using the Yelp API.

The function takes two arguments:

* `api_key`: The Yelp API key used for authentication.
* `location`: The location of the business.

The function first constructs the request parameters needed to retrieve the first page of restaurant information from the Yelp API (the URL, headers, and URL parameters) and reads `total` from the response.

The function then uses the `paginated_restaurant_search_requests` function to build the list of all the requests that need to be made to retrieve all the restaurants.

Finally, the function uses a for loop to iterate through this list of requests, retrieve the response for each request, and add the businesses in each response to a list called `results`. The function pauses for 0.5 seconds after each request to avoid overwhelming the API server.

Once all the responses have been retrieved, the function returns the `results` list, which contains information about all the restaurants.




You can test your function with an individual neighborhood in Chicago (for example, Greektown). Chicago itself has a lot of restaurants... meaning it will take a lot of time to download them all.

In [None]:
# test your function here to receive credit.
api_key = read_api_key('yelp_api_key.txt')
data = all_restaurants(api_key, 'Greektown, Chicago, IL')
print(len(data))
# 94
print(list(map(lambda x:x['name'], data)))
#['Greek Islands Restaurant', 'Artopolis', 'Meli Cafe & Juice Bar', 'Athena Greek Restaurant', 'WJ Noodles', .....]


Now that we have the metadata on all of the restaurants in Greektown (or at least the ones listed on Yelp), we can retrieve the reviews and ratings. The Yelp API gives us aggregate information on ratings but it doesn't give us the review text or individual users' ratings for a restaurant. For that we need to turn to web scraping, but to find out what pages to scrape we first need to parse our JSON from the API to extract the URLs of the restaurants.

In general, it is a best practice to separate the act of __downloading__ data and __parsing__ data. This ensures that your data processing pipeline is modular and extensible (and autogradable ;). This decoupling also solves the problem of expensive downloading but cheap parsing (in terms of computation and time).
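
One way to picture this separation is to write the raw responses to disk once and run all parsing against the saved file. The sketch below is an assumption-laden illustration rather than part of the lab: the helper names and the `responses.json` file name are hypothetical, and the two "pages" are made-up stand-ins for real API responses.

```python
import json
import os
import tempfile

def save_responses(responses, path):
    """Download step (simulated): write the raw responses, a list of dicts, to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(responses, f)

def load_business_names(path):
    """Parse step: read the saved file and extract names, with no network access."""
    with open(path, encoding="utf-8") as f:
        responses = json.load(f)
    return [b["name"] for r in responses for b in r["businesses"]]

# Made-up responses standing in for two downloaded pages.
fake_pages = [
    {"total": 3, "businesses": [{"name": "Cafe A"}, {"name": "Cafe B"}]},
    {"total": 3, "businesses": [{"name": "Cafe C"}]},
]
cache_path = os.path.join(tempfile.mkdtemp(), "responses.json")
save_responses(fake_pages, cache_path)
names = load_business_names(cache_path)
print(names)
```

If the parsing logic changes later, only `load_business_names` has to be rerun; the slow, rate-limited download does not.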


### Question 2.4: Parse the API Responses and Extract the URLs **(5%)**




Because we want to separate the __downloading__ from the __parsing__, fill in the following function to parse the URLs pointing to the restaurants on `yelp.com`. As input your function should expect a string of [properly formatted JSON](http://www.json.org/) (which is similar to __BUT__ not the same as a Python dictionary) and as output should return a Python list of strings. Hint: print your `data` to see the JSON-formatted information you have. The input JSON will be structured as follows (same as the [sample](https://www.yelp.com/developers/documentation/v3/business_search) on the Yelp API page):

```json
{
  "total": 8228,
  "businesses": [
    {
      "rating": 4,
      "price": "$",
      "phone": "+14152520800",
      "id": "four-barrel-coffee-san-francisco",
      "is_closed": false,
      "categories": [
        {
          "alias": "coffee",
          "title": "Coffee & Tea"
        }
      ],
      "review_count": 1738,
      "name": "Four Barrel Coffee",
      "url": "https://www.yelp.com/biz/four-barrel-coffee-san-francisco",
      "coordinates": {
        "latitude": 37.7670169511878,
        "longitude": -122.42184275
      },
      "image_url": "http://s3-media2.fl.yelpcdn.com/bphoto/MmgtASP3l_t4tPCL1iAsCg/o.jpg",
      "location": {
        "city": "San Francisco",
        "country": "US",
        "address2": "",
        "address3": "",
        "state": "CA",
        "address1": "375 Valencia St",
        "zip_code": "94103"
      },
      "distance": 1604.23,
      "transactions": ["pickup", "delivery"]
    }
  ],
  "region": {
    "center": {
      "latitude": 37.767413217936834,
      "longitude": -122.42820739746094
    }
  }
}
```
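
To make the difference between a JSON string and a Python dictionary concrete, here is a small sketch using the standard `json` module on a shortened version of the sample above (the shortened string is made up for illustration).

```python
import json

# A JSON document is just text until it is parsed.
raw = (
    '{"total": 1, "businesses": [{"name": "Four Barrel Coffee", "is_closed": false, '
    '"url": "https://www.yelp.com/biz/four-barrel-coffee-san-francisco"}]}'
)

parsed = json.loads(raw)  # str -> dict
print(type(raw).__name__, type(parsed).__name__)

# JSON's lowercase `false` becomes Python's `False` after parsing.
print(parsed["businesses"][0]["is_closed"])

urls = [business["url"] for business in parsed["businesses"]]
print(urls)
```

Indexing the raw string with `raw["businesses"]` would raise a `TypeError`, which is the practical meaning of "similar to but not the same as a Python dictionary".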

In [None]:
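import json

def parse_api_response(data):
    """
    Parse Yelp API results to extract restaurant URLs.

    As I understand it, the API request returns JSON, which maps onto a Python
    dictionary once parsed. Each entry in the "businesses" list stores its link
    under the "url" key, so we collect that value from every business.

    Args:
        data (string): String of properly formatted JSON.

    Returns:
        (list): list of URLs as strings from the input JSON.
    """

    # [YOUR CODE HERE]
    # api_get_request returns an already parsed dictionary, so only decode when given a string.
    if isinstance(data, str):
        data = json.loads(data)
    # Iterate over the businesses themselves: len(data) would count the
    # top-level keys of the response rather than the businesses.
    results = []
    for business in data['businesses']:
        results.append(business['url'])
    return results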

# test your code here to receive credit

url, headers, url_params = location_search_params(api_key, "Bridgeport, Chicago, IL", offset=0)
response_text = api_get_request(url, headers, url_params)

# print(response_text['businesses'][0]['url']) <- extract from the following URL ie iterate from 0 to len()

parse_api_response(response_text)

# ['https://www.yelp.com/biz/nana-chicago?adjust_creative=RYmY_QnZRP74oo4eNQbazg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=RYmY_QnZRP74oo4eNQbazg',
#  'https://www.yelp.com/biz/jackalope-coffee-and-tea-house-chicago?adjust_creative=RYmY_QnZRP74oo4eNQbazg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=RYmY_QnZRP74oo4eNQbazg',
#  .....
# ]


['https://www.yelp.com/biz/nana-chicago?adjust_creative=ak-32x0WbDy4kisNuC36pw&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=ak-32x0WbDy4kisNuC36pw',
 'https://www.yelp.com/biz/jackalope-coffee-and-tea-house-chicago?adjust_creative=ak-32x0WbDy4kisNuC36pw&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=ak-32x0WbDy4kisNuC36pw',
 'https://www.yelp.com/biz/the-duck-inn-chicago?adjust_creative=ak-32x0WbDy4kisNuC36pw&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=ak-32x0WbDy4kisNuC36pw']

As we can see, JSON is quite trivial to parse (which is not the case with HTML as we will see in a second) and work with programmatically. This is why it is one of the most ubiquitous data serialization formats (especially for RESTful APIs) and a huge benefit of working with a well defined API if one exists. But APIs do not always exist or provide the data we might need, and as a last resort we can always scrape web pages...

### Working with Web Pages (and HTML)

Think of APIs as similar to accessing an application's database itself (something you can interactively query and receive structured data back). But the results are usually in a somewhat raw form with no formatting or visual representation (like the results from a database query). This is a benefit _AND_ a drawback depending on the end use case. For data science and _programmatic_ analysis this raw form is quite ideal, but for an end user requesting information from a _graphical interface_ (like a web browser) this is very far from ideal since it takes some cognitive overhead to interpret the raw information. Conversely, if we have HTML it is quite easy for a human to visually interpret it, but to perform some type of programmatic analysis we first need to parse the HTML into a more structured form.

> As a general rule of thumb, if the data you need can be accessed or retrieved in a structured form (either from a bulk download or API), prefer that first. But if the data you want (and need) is not available that way, as in our case, we need to resort to alternative (messier) means.

Going back to the "hello world" example of question 2.1 with the NYT, we will do something similar to retrieve the HTML of the Yelp site itself (rather than going through the API programmatically) as text. 
> However, we will use saved HTML pages to reduce excessive traffic to the Yelp website.

### Question 2.5: Parse a Yelp restaurant Page **(10%)**

Using `BeautifulSoup`, parse the HTML of a single Yelp restaurant page to extract the reviews in a structured form as well as the URL to the next page of reviews (or `None` if it is the last page). Fill in the following function stubs to parse a single page of reviews and return:
* the reviews as a structured Python dictionary
* the HTML element containing the link/url for the next page of reviews (or None).

For each review be sure to structure your Python dictionary as follows (to be graded correctly). The order of the keys doesn't matter, only the keys and the data type of the values:

```python
{
    'author': str,
    'rating': float,
    'date': str,  # 'yyyy-mm-dd'
    'description': str
}

{
    'author': 'Topsy Kretts',
    'rating': 4.7,
    'date': '2016-01-23',
    'description': "Wonderful!"
}
```
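
Because grading depends on the keys and value types, a quick self-check can catch mistakes such as leaving the rating as a string. This is a sketch only; `is_valid_review` and `EXPECTED_TYPES` are hypothetical names and not part of the grader.

```python
from datetime import date

EXPECTED_TYPES = {"author": str, "rating": float, "date": str, "description": str}

def is_valid_review(review):
    """Return True if `review` has exactly the expected keys, value types, and a parseable date."""
    if set(review) != set(EXPECTED_TYPES):
        return False
    if not all(isinstance(review[key], kind) for key, kind in EXPECTED_TYPES.items()):
        return False
    try:
        date.fromisoformat(review["date"])  # accepts 'yyyy-mm-dd'
    except ValueError:
        return False
    return True

good = {"author": "Topsy Kretts", "rating": 4.7, "date": "2016-01-23", "description": "Wonderful!"}
bad_rating = dict(good, rating="4.7")       # rating left as a string
bad_date = dict(good, date="01/23/2016")    # wrong date layout
print(is_valid_review(good), is_valid_review(bad_rating), is_valid_review(bad_date))
```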

Beautiful Soup can behave differently depending on the parser it uses. For maximum compatibility (and fewest errors), initialize the library with the default parser from the Python standard library: `BeautifulSoup(markup, "html.parser")`.
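
The `"html.parser"` option refers to the parser in Python's standard library. To show what the review markup looks like to that parser, here is a sketch that uses `html.parser.HTMLParser` directly instead of Beautiful Soup, so it runs without any third-party install. The `MetaItempropParser` class and the snippet are made up for illustration; the `itemprop` names mirror the ones in the saved Yelp pages.

```python
from html.parser import HTMLParser

class MetaItempropParser(HTMLParser):
    """Collect the `content` of every <meta itemprop=...> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = []  # (itemprop, content) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == "meta" and "itemprop" in attrs:
            self.found.append((attrs["itemprop"], attrs.get("content")))

snippet = """
<div itemprop="review">
  <meta content="Topsy Kretts" itemprop="author"/>
  <meta content="4.7" itemprop="ratingValue"/>
  <meta content="2016-01-23" itemprop="datePublished"/>
</div>
"""
parser = MetaItempropParser()
parser.feed(snippet)
print(parser.found)
```

Beautiful Soup builds a searchable tree on top of this kind of event stream, which is why `find()` and `find_all()` are far more convenient for the lab than handling tags one at a time.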

Most of the function has been provided to you:

In [None]:
from collections import defaultdict
from bs4 import BeautifulSoup
url_lookup = {
"https://www.yelp.com/biz/the-jibarito-stop-chicago-2?start=225":"parse_page_test1.html",
"https://www.yelp.com/biz/the-jibarito-stop-chicago-2?start=245":"parse_page_test2.html"
}

def html_fetcher(url):
    """
    Return the raw HTML at the specified URL.
    Args:
        url (string): URL of the page to fetch.

    Returns:
        status_code (integer): HTTP status code (200 is normal), see https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
        raw_html (string): the raw HTML content of the response, properly encoded according to the HTTP headers.
    """
    html_file = url_lookup.get(url)
    with open(html_file, 'rb') as file:
        html_text = file.read()
        return 200, html_text


def parse_page(html):
    """
    Parse the reviews on a single page of a restaurant.
    
    I first opened the HTML page in Chrome to see what I was working with.
    We need to return a list of review dictionaries structured as shown above.
    Step 1 is to find where the reviews are located in the HTML file:
      I opened the HTML file in Visual Studio and searched for the keyword "reviews".
      That gave too many results, so I searched for "author" and found the reviews.
      Each one sits in a <div> with itemprop="review", so we search only for that using find_all().
      The attribute acts as a "key" that isolates just the review entries.
    Args:
        html (string): String of HTML corresponding to a Yelp restaurant

    Returns:
        tuple(list, string): a tuple of two elements
            first element: list of dictionaries corresponding to the extracted review information
            second element: URL for the next page of reviews (or None if it is the last page)
    """
    soup = BeautifulSoup(html,'html.parser') 
    url_next = soup.find('link',rel='next')    
    if url_next:
        url_next = url_next.get('href')
    else:
        url_next = None

    # Each review is a <div itemprop="review" itemscope itemtype="http://schema.org/Review">.
    reviews = soup.find_all('div', itemprop="review")
    # https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
    # This line was already provided, but I worked through how the search works
    # so that I can use find() to isolate the individual fields below.
    # print(reviews[0])
    """
    <div itemprop="review" itemscope="" itemtype="http://schema.org/Review">
    <meta content="Jason S." itemprop="author"/>
    <div itemprop="reviewRating" itemscope="" itemtype="http://schema.org/Rating">
    <meta content="5.0" itemprop="ratingValue"/>
    </div>
    <meta content="2016-05-02" itemprop="datePublished"/>
    <p itemprop="description">This was one of my favorite food trucks but as of last fall they've opened a brick and mortar restaurant in the Pilsen neighborhood...the perfect success story of how a person can start out with a food truck and grow their business into a restaurant. The food is always delicious and the service is great!<p>
    </p></p></div>

    
    Example of one entry; now we just need to structure this into a dictionary.
            print(type(review))
            bs4.element.Tag

    'author': 'Topsy Kretts'  
    'rating': 4.7 
    'date': '2016-01-23'
    'description': "Wonderful!"


    """

        

    reviews_list = []
    # reviews is a list of bs4 Tag objects, one per review
    for review in reviews:
        d = {}
        # <meta content="Jason S." itemprop="author"/>
        # find() locates the tag by its itemprop attribute, and ['content'] reads the
        # value of its content attribute, which is a string.
        # https://stackoverflow.com/questions/42657894/beautifulsoup-scrape-itemprop-name-in-python
        # https://stackoverflow.com/questions/34301815/understand-the-find-function-in-beautiful-soup
        d['author'] = review.find('meta', itemprop='author')['content']
        # Same approach as author, converted to float as the required structure demands.
        d['rating'] = float(review.find('meta', itemprop='ratingValue')['content'])
        # Same approach as author.
        d['date'] = review.find('meta', itemprop='datePublished')['content']
        # <p itemprop="description">This wa..e is great!<p></p></p></div>
        # Find the <p> with itemprop "description"; get_text() drops the tags and
        # strip() removes surrounding whitespace without cutting into the text.
        # https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
        d['description'] = review.find('p', itemprop='description').get_text().strip()

        reviews_list.append(d)

    return reviews_list, url_next

# Test your implementation here to receive credit.
code, html = html_fetcher("https://www.yelp.com/biz/the-jibarito-stop-chicago-2?start=225")
# print(html) 
#we have a raw html as a text input to beautiful soup
reviews_list, url_next = parse_page(html)
print(len(reviews_list)) # 20
print(url_next) #https://www.yelp.com/biz/the-jibarito-stop-chicago-2?start=245

20
https://www.yelp.com/biz/the-jibarito-stop-chicago-2?start=245


### Question 2.6: Extract all Yelp reviews for a Single Restaurant **(5%)**



Now that we have parsed a single page and figured out how to go from one page to the next, we are ready to combine these two techniques and actually crawl through web pages!

Using the provided `html_fetcher` (for a real use-case you would use `requests`), programmatically retrieve __ALL__ of the reviews for a __single__ restaurant (provided as a parameter). Just like the API was paginated, the HTML paginates its reviews (it would be a very long web page to show 300 reviews on a single page) and to get all the reviews you will need to parse and traverse the HTML. As input your function will receive a URL corresponding to a Yelp restaurant. As output return a list of dictionaries (structured the same as question 2.5) containing the relevant information from the reviews. You can use `parse_page()` here.
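
The crawl itself is a loop that keeps following the next-page link until there is none. The sketch below shows that shape against a made-up, in-memory "site" so it can run without any HTML files; `crawl`, `fake_site`, and the page names are all hypothetical, and the `seen` set is an extra safeguard that the assignment does not require.

```python
def crawl(start_url, fetch, parse):
    """Follow next-page links from start_url, collecting the items found on every page."""
    items, url, seen = [], start_url, set()
    while url is not None and url not in seen:
        seen.add(url)  # guard against a page that links back to an earlier one
        page_items, url = parse(fetch(url))  # url becomes the next page, or None
        items.extend(page_items)
    return items

# Made-up site: each "page" is already an (items, next_url) pair.
fake_site = {
    "page1": (["review a", "review b"], "page2"),
    "page2": (["review c"], None),
}
all_items = crawl("page1", fetch=fake_site.get, parse=lambda page: page)
print(all_items)
```

In the lab, the role of `fetch` is played by `html_fetcher` (which also returns a status code) and the role of `parse` by `parse_page()`.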

In [None]:
def extract_reviews(url, html_fetcher):
    """
    Retrieve ALL of the reviews for a single restaurant on Yelp.

    Parameters:
        url (string): Yelp URL corresponding to the restaurant of interest.
        html_fetcher (function): A function that takes url and returns html status code and content
    
    Returns:
        reviews (list): list of dictionaries containing extracted review information
    """
    reviews = []
    # [YOUR CODE HERE]
    # HINT: Use function `parse_page(html)` multiple times until no next page exists
    # You MUST add comments explaining your thought process,
    # and the resources you used to solve this question (if any)

    # We call parse_page() until no pages are left, i.e. until the next URL is None.
    # parse_page() needs HTML, so we use html_fetcher to get each page's content,
    # then extend the list with that page's reviews until we have them all.
    while url:
        status_code, html_content = html_fetcher(url)  # fetch the HTML
        if status_code != 200:
            # Raise instead of returning a string so callers always get a list back.
            raise RuntimeError(f"Fetching {url} failed with HTTP status {status_code}")
        # Parse the current page; url becomes the next page's URL (or None on the last page).
        reviews_list, url = parse_page(html_content)
        reviews.extend(reviews_list)  # formatting is already done in parse_page
    return reviews

You can test your function with this code:

In [None]:
# test your function here to receive credit.
data = extract_reviews('https://www.yelp.com/biz/the-jibarito-stop-chicago-2?start=225', html_fetcher=html_fetcher)
print(len(data))
# 35
print(data[0])
# {'author': 'Jason S.', 'rating': 5.0, 'date': '2016-05-02', 'description': "This was one of my favorite food trucks ..."}


35
{'author': 'Jason S.', 'rating': '5.0', 'date': '2016-05-02', 'description': "This was one of my favorite food trucks but as of last fall they've opened a brick and mortar restaurant in the Pilsen neighborhood...the perfect success story of how a person can start out with a food truck and grow their business into a restaurant. The food is always delicious and the service is great"}


## Submission

You're almost done! 

After executing all commands and completing this notebook, download your Python Notebook (.IPYNB file) and Python file (.PY file) and upload them to Gradescope under *Lab 2*. Make sure you check that your **ipynb** file includes all parts of your solution **(including the outputs)**.