In [38]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 6 – APIs and Web Scraping

## DSC 80, Spring 2024

### Due Date: Wednesday, May 15th at 11:59 PM


## Instructions

Welcome to the sixth DSC 80 lab this quarter!

Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook, and **you will only submit that `lab.py` file**, not this notebook!

Some additional guidelines:
- **Unlike in DSC 10, labs will have both public tests and hidden tests.** The bulk of your grade will come from your scores on hidden tests, which you will only see on Gradescope after the assignment deadline.
- **Do not change the function names in the `lab.py` file!** The functions in the `lab.py` file are how your assignment is graded, and they are graded by their name. If you changed something you weren't supposed to, you can find the original code in the [course GitHub repository](https://github.com/dsc-courses/dsc80-2024-sp).
- Notebooks are nice for testing and experimenting with different implementations before designing your function in your `lab.py` file. You can write code here, but make sure that all of your real work is in the `lab.py` file, since that's all you're submitting.
- You are encouraged to write your own additional helper functions to solve the lab, as long as they also end up in `lab.py`.

**To ensure that all of the work you want to submit is in `lab.py`, we've included a script named `lab-validation.py` in the lab folder. You shouldn't edit it, but instead, you should call it from the command line (e.g. the Terminal) to test your work.** More details on its usage are given at the bottom of this notebook.

**Importing code from `lab.py`**:

* Below, we import the `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab.py` is compiled to bytecode (in the directory `__pycache__`). Subsequent imports of `lab` merely import the existing compiled python.

In [39]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [40]:
from lab import *

If the cell below raises a `ModuleNotFoundError`, run `!pip install lxml` in a new cell. After `lxml` is successfully installed, open the `Kernel` menu and restart the kernel. You only need to run `!pip install lxml` once.

In [41]:
import os
import pandas as pd
import numpy as np
import requests
import bs4
import lxml

In [5]:
#!pip install lxml

## Question 1 – Practice with HTML Tags 📎

In Question 2, you'll spend plenty of time parsing HTML source code. But before you get your hands dirty trying to extract information from HTML written by other people, it is a good idea to write basic HTML code yourself. This exercise will help you better understand how the code in a `.html` file is structured.

For this question, you'll create a very basic `.html` file, named `lab06_1.html`, that satisfies the following conditions:

- It must have `<title>` and `<head>` tags.
- It must also have `<body>` tags. Within the `<body>` tags, it must have:
    - At least two headers.
    - At least three images.
        - At least one image must be a local file.
        - At least one image must be linked to an online source.
        - At least one image must have default text that is shown when the image cannot be displayed.
    - At least three references (hyperlinks) to different web pages.
    - At least one table with two rows and two columns.
    

Make sure to save your file as `lab06_1.html`, and save it in the same directory as `lab.py`. **When submitting this homework to Gradescope, make sure to also upload `lab06_1.html` along with the local image that you embedded in your site.** You can upload multiple files to Gradescope at a time.
   

***Notes***:
- You can write and view basic HTML with a Jupyter Notebook, using either a Markdown cell or by using the `IPython.display.HTML` function (which takes in a string of HTML and renders it).
- If you write your HTML code within a Jupyter Notebook, you should later copy your code into a text editor and save it with the `.html` extension. You could also write your HTML in a text editor directly.
- Be sure to open your final `.html` file in a browser and make sure it looks correct on its own.
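
One way to sanity-check a page against the conditions above is to count its opening tags. The sketch below uses only Python's built-in `html.parser`; the `TagCounter` class, the `count_tags` helper, and the `sample` page are hypothetical names made up for illustration and are not part of the lab or its autograder.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags so the Question 1 conditions can be checked quickly."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, e.g. 'img', 'a', 'h1'.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html_text):
    """Return a dict mapping each tag name to the number of times it is opened."""
    parser = TagCounter()
    parser.feed(html_text)
    return parser.counts


# Hypothetical miniature page, used only to demonstrate the helper.
sample = """
<html><head><title>Demo</title></head>
<body>
  <h1>First header</h1><h2>Second header</h2>
  <img src="a.png" alt="local image">
  <img src="https://example.com/b.png" alt="online image">
  <a href="https://example.com">A link</a>
  <table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
</body></html>
"""

counts = count_tags(sample)
n_headers = sum(counts.get(f"h{level}", 0) for level in range(1, 7))
print(n_headers, counts.get("img", 0), counts.get("a", 0), counts.get("tr", 0))
```

To check your own file, you could pass `open('lab06_1.html').read()` to `count_tags` instead of `sample`. Counting tags only confirms that the elements exist; it does not replace opening the page in a browser.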

<!DOCTYPE html>
<html>
<head>
    <title>Mia's Example HTML Page</title>
</head>
<body>
    <h1>Welcome to My Page!</h1>
    <h2>Guess what! My birthday is next week.</h2>

    <p>Below are some images:</p>
    <img src="local_image.png" alt="Flier for my Research Project">
    <img src="https://cdn-media.theathletic.com/cdn-cgi/image/width=1440%2cformat=auto%2cquality=75/https://cdn-media.theathletic.com/f0XraZOhyUM1_9YbgmKjDwQ2P_1440x960.jpg" alt="Online Image">
    <img src="non_existent_image.jpg" alt="Default Text" onerror="this.onerror=null; this.src='default_image.jpg';">

    <p>Here are some references:</p>
    <ul>
        <li><a href="https://www.nytimes.com/athletic/5485410/2024/05/14/caitlin-clark-indiana-fever-wnba-debate/">Link 1</a></li>
        <li><a href="https://www.macrumors.com/guide/apple-ring/">Link 2</a></li>
        <li><a href="https://www.iamchappellroan.com/tour/">Link 3</a></li>
    </ul>

    <p>Below is a table:</p>
    <table border="1">
        <tr>
            <td>Row 1, Column 1</td>
            <td>Row 1, Column 2</td>
        </tr>
        <tr>
            <td>Row 2, Column 1</td>
            <td>Row 2, Column 2</td>
        </tr>
    </table>
</body>
</html>

In [42]:
# Don't delete this cell!
question1()

In [43]:
grader.check("q1")

## Question 2 – Scraping an Online Bookstore 📚

Browse through the following fake online bookstore: http://books.toscrape.com/. This website is meant for toying with scraping.

Your job is to scrape the website, collecting data on all books that have:
- **_at least_ a four-star rating**, and
- **a price _strictly_ less than £50**, and 
- **belong to specific categories** (more details below). 

You will extract the information into a DataFrame that looks like the one below.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>UPC</th>
      <th>Product Type</th>
      <th>Price (excl. tax)</th>
      <th>Price (incl. tax)</th>
      <th>Tax</th>
      <th>Availability</th>
      <th>Number of reviews</th>
      <th>Category</th>
      <th>Rating</th>
      <th>Description</th>
      <th>Title</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>e10e1e165dc8be4a</td>
      <td>Books</td>
      <td>Â£22.60</td>
      <td>Â£22.60</td>
      <td>Â£0.00</td>
      <td>In stock (19 available)</td>
      <td>0</td>
      <td>Default</td>
      <td>Four</td>
      <td>For readers of Laura Hillenbrand's Seabiscuit...</td>
      <td>The Boys in the Boat: Nine Americans...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>c2e46a2ee3b4a322</td>
      <td>Books</td>
      <td>Â£25.27</td>
      <td>Â£25.27</td>
      <td>Â£0.00</td>
      <td>In stock (19 available)</td>
      <td>0</td>
      <td>Romance</td>
      <td>Five</td>
      <td>A Michelin two-star chef at twenty-eight, Violette...</td>
      <td>Chase Me (Paris Nights #2)</td>
    </tr>
    <tr>
      <th>2</th>
      <td>00bfed9e18bb36f3</td>
      <td>Books</td>
      <td>Â£34.53</td>
      <td>Â£34.53</td>
      <td>Â£0.00</td>
      <td>In stock (19 available)</td>
      <td>0</td>
      <td>Romance</td>
      <td>Five</td>
      <td>No matter how busy he keeps himself...</td>
      <td>Black Dust</td>
    </tr>
  </tbody>
</table>

To do so, implement the following functions.

<br>

#### `extract_book_links`

Complete the implementation of the function `extract_book_links`, which takes in the content of a page that contains book listings as a **string of HTML**, and returns a **list** of URLs of book-specific pages for all books with **_at least_ a four-star rating and a price _strictly_ less than £50**.

For this method, the URLs you return should not contain the protocol (i.e. `'https://'`). The protocols should be added back into the URLs when you actually make the requests.
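
As a rough illustration of adding the protocol back later, the standard library's `urllib.parse.urljoin` can resolve a protocol-free link against the listing page it came from. This is only one possible approach; plain f-strings work as well.

```python
from urllib.parse import urljoin

# Listing page that the relative links were scraped from.
listing_url = "http://books.toscrape.com/catalogue/page-1.html"

# A protocol-free link of the kind extract_book_links returns.
relative_link = "scarlet-the-lunar-chronicles-2_218/index.html"

# urljoin replaces the last path segment of the base URL with the relative link.
full_url = urljoin(listing_url, relative_link)
print(full_url)
```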


<br>

#### `get_product_info`

Complete the implementation of the function `get_product_info`, which takes in the content of a book-specific page as a **string of HTML**, and a list `categories` of book categories. If the input book's category is in `categories`, `get_product_info` should return a dictionary corresponding to a row in the example DataFrame above (where the keys are the column names and the values are the row values). If the input book's category is not in `categories`, return `None`.


<br>

#### `scrape_books`

Finally, put everything together. Complete the implementation of the function `scrape_books`, which takes in an integer `k` and a list `categories` of book categories. `scrape_books` should use `requests` to scrape the first `k` pages of the bookstore and return a DataFrame of only the books that have 
- **_at least_ a four-star rating**, and
- **a price _strictly_ less than £50**, and
- **a category that is in the list `categories`**.

<br>

Some general guidance and tips:

- The first page of the bookstore is at http://books.toscrape.com/catalogue/page-1.html. Subsequent pages can be found by clicking the "Next" button at the bottom of the page. Look at how the URLs change each time you navigate to a new page; think about how to use f-strings (or some other string formatting technique) to generate these URLs.
- Use "inspect element" to view the source code of the pages you're trying to scrape. To find a book's category, look at the hyperlinks in the book-specific page for that book.
- **`scrape_books` should run in under 180 seconds on the entire bookstore (`k = 50`). `scrape_books` is also the only function that should make `GET` requests; the other two functions parse already-existing HTML.**
- When instantiating `bs4.BeautifulSoup` objects, use the optional argument `features='lxml'` to suppress any warnings.
- Don't worry about typecasting, i.e. it's fine if `'Number of reviews'` is not stored as type `int`. Also, don't worry if you run into encoding errors in your price columns (as the example DataFrame at the top of this cell contains).
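
Two of the tips above can be sketched with the standard library alone: generating listing-page URLs with an f-string, and turning a price string into a number even when the pound sign was decoded incorrectly. The helper names `page_url` and `parse_price` are hypothetical and are not required by the lab.

```python
import re


def page_url(i):
    """URL of the i-th listing page (pages are numbered starting at 1)."""
    return f"http://books.toscrape.com/catalogue/page-{i}.html"


def parse_price(price_text):
    """Convert a price string such as '£22.60' or 'Â£22.60' to a float."""
    # Search for the number itself instead of slicing off a fixed-length prefix,
    # so the result does not depend on how the currency symbol was decoded.
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no number found in {price_text!r}")
    return float(match.group())


urls = [page_url(i) for i in range(1, 4)]
print(urls)
print(parse_price("Â£22.60"), parse_price("£51.77"))
```

Searching for the digits avoids a subtle failure: slicing two characters off `'£51.77'` would also remove the leading `5`.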

In [166]:
import re
from bs4 import BeautifulSoup

def extract_book_links(text):
    soup = BeautifulSoup(text, 'lxml')
    books = soup.find_all('article', class_='product_pod')
    urls = []
    for book in books:
        # The rating is the second class name, e.g. <p class="star-rating Four">.
        rating = book.find('p', class_='star-rating')['class'][1]
        price = book.find('p', class_='price_color').get_text()
        # Keep only digits and the decimal point, so that both '£51.77' and the
        # mis-decoded 'Â£51.77' are parsed correctly.
        price = float(re.sub(r'[^\d.]', '', price))

        if rating in ['Four', 'Five'] and price < 50:
            link = book.find('div', class_='image_container').find('a')['href']
            urls.append(link)

    return urls

In [171]:
def get_product_info(text, categories):
    keys = ['UPC', 'Product Type', 'Price (excl. tax)', 'Price (incl. tax)', 'Tax', 'Availability', 'Number of reviews', 'Category', 'Rating', 'Description', 'Title']
    soup = BeautifulSoup(text, features="lxml")
    categories = [x.lower() for x in categories]

    this_cat = soup.find('ul', class_='breadcrumb').find_all('a')[-1].text
    if this_cat.lower() not in categories: return None

    first_seven = soup.find_all('td') # Get the first seven values for the dict
    values = [x.get_text() for x in first_seven]
    
    values.append(this_cat) # Add category to dictionary
    
    rating = soup.find('p', class_='star-rating')['class'][1]
    values.append(rating)
    
    description = soup.find('meta', attrs={'name': 'description'})
    description = description['content'].strip()
    values.append(description) 
    
    title = soup.find('title').get_text().strip().split('|')[0].strip()
    values.append(title)
    
    result = dict(zip(keys, values))
    
    return result

In [172]:
def scrape_books(k, categories):
    cols = ['UPC', 'Product Type', 'Price (excl. tax)', 'Price (incl. tax)', 'Tax', 'Availability', 'Number of reviews', 'Category', 'Rating', 'Description', 'Title']
    rows = []
    for i in range(1, k+1):
        page = requests.get(f'http://books.toscrape.com/catalogue/page-{i}.html')
        links = extract_book_links(page.text)
        for book in links:
            # The links are relative, so add the protocol and base path back.
            info = requests.get(f'http://books.toscrape.com/catalogue/{book}')
            info_dict = get_product_info(info.text, categories)
            if info_dict is None:
                continue
            rows.append(info_dict)
    # Build the DataFrame once at the end; this avoids repeated concatenation
    # and gives a clean 0..n-1 index.
    return pd.DataFrame(rows, columns=cols)

In [173]:
# don't delete this cell, but do run it -- it is needed for the autograder tests

# public test for extract_book_links 
extract_book_links_fp = os.path.join('data', 'products.html')
extract_book_out = extract_book_links(
    open(extract_book_links_fp, encoding='utf-8').read()
)
extract_book_url = 'scarlet-the-lunar-chronicles-2_218/index.html'

# doc tests for get product info
get_product_info_fp = os.path.join('data', 'Frankenstein.html')
get_product_info_out = get_product_info(
    open(get_product_info_fp, encoding='utf-8').read(), ['Default']
)

# public test for scrape books 
scrape_books_out = scrape_books(1, ['Mystery'])

In [174]:
grader.check("q2")

## Question 3 – API Requests 🤑

You trade stocks as a hobby. As an avid `pandas` coder, you decide to calculate statistics of your favorite stocks by pulling data from a public API. The API we will work with can be found at https://financialmodelingprep.com/developer/docs/#Stock-Historical-Price. Specifically, we will use the "**Daily Chart EOD**" endpoint (search for it at the linked page).

Some relevant definitions:
- Ticker: A short code that refers to a stock. For example, Apple's ticker is AAPL and Ford's ticker is F. 
- Open: The price of a stock at the beginning of a trading day.
- Close: The price of a stock at the end of a trading day.
- Volume: The total number of shares traded in a day.
- Percent change: The difference in price with respect to the original price, as a percentage.

To make requests to the aforementioned API, you will need an API key. In order to get one, you will need to make an account at the website. Once you've signed up, you can use the API key that comes with the free plan. It has a limit of 250 requests per day, which should be more than enough. You will have to encode your API key in the URL that you make requests to; see a complete example of such a request at the right side of the [documentation](https://site.financialmodelingprep.com/developer/docs#Stock-Historical-Price).

Implement the following two functions.

#### `stock_history`

Complete the implementation of the function `stock_history`, which takes in a string `ticker` and two integers, `year` and `month`, and returns a DataFrame containing the price history for that stock in that month. Keep all of the attributes that are returned by the API.

***Notes***:
- Read the API documentation if you get stuck!
- [`pd.date_range`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) takes in two dates and returns a sequence of all dates between them, including both endpoints by default. How might this be helpful?
- The [`requests.get`](https://docs.python-requests.org/en/master/user/quickstart/) function returns a Response object, not the data itself. Use the `json` method on the Response object to extract the relevant JSON, as we did in [Lecture 9](https://dsc80.com/resources/lectures/lec09/lec09-filled.html#Example:-GET-requests-via-requests) (you don't need to `import json` to do this).
- You can instantiate a DataFrame using a sequence of dictionaries.
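
As a small aid for restricting a request to a single month, the first and last calendar dates of that month can be computed with the standard library. This is a minimal sketch; `month_bounds` is a hypothetical helper name, and how the two dates are passed to the API depends on the endpoint's documentation.

```python
import calendar
from datetime import date


def month_bounds(year, month):
    """Return the first and last calendar dates of a month as ISO strings."""
    # monthrange returns (weekday of the first day, number of days in the month).
    _, last_day = calendar.monthrange(year, month)
    first = date(year, month, 1).isoformat()
    last = date(year, month, last_day).isoformat()
    return first, last


print(month_bounds(2019, 6))
print(month_bounds(2024, 2))
```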

<br>

#### `stock_stats`

Create a function `stock_stats` that takes in a DataFrame outputted by `stock_history` and returns a **tuple** of two numbers:
1. The percent change of the stock throughout the month as a **percentage**.
2. An estimate of the total transaction volume **in billions of dollars** for that month.

Both values in the tuple should be **strings** that contain numbers rounded to two decimal places. Add a plus or minus sign in front of the percent change, and make sure that the total transaction volume string ends in a `'B'`.

**To compute the percent change**, use the opening price on the first day of the month as the starting price and the closing price on the last day of the month as the ending price.

**To compute the total transaction volume**, assume that on any given day, the average price of a share is the midpoint of the high and low price for that day.

$$ \text{Estimated Total Transaction Volume (in dollars)} = \text{Volume (number of shares traded)} \times \text{Average Price} $$

For example, suppose there are only three days in March – March 1st, March 2nd, and March 3rd.

If BYND (Beyond Meat) opens at \\$4 on March 1st and closes at \\$5 on March 3rd, its percent change for the month of March is $$\frac{\$5-\$4}{\$4} = +25.00\%$$

Suppose the high and low prices and volumes of BYND on each day are given below.
- March 1st: high \\$5, low \\$3, volume 500 million (0.5 billion)
- March 2nd: high \\$5.5, low \\$2.5, volume 1 billion
- March 3rd: high \\$5.25, low \\$4, volume 500 million (0.5 billion)

Then, the estimated total transaction volume is
$$\frac{\$5 + \$3}{2} \cdot 0.5 B + \frac{\$5.5 + \$2.5}{2} \cdot 1 B + \frac{\$5.25 + \$4}{2} \cdot 0.5 B = 8.3125B$$
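
The worked example above can be reproduced in plain Python. The sketch below hard-codes the three hypothetical March days rather than using a DataFrame, and the variable names are chosen for illustration only.

```python
# Opening price on the first day and closing price on the last day.
first_open = 4.0
last_close = 5.0

# One (high, low, shares traded) tuple per trading day in the example.
days = [
    (5.0, 3.0, 0.5e9),
    (5.5, 2.5, 1.0e9),
    (5.25, 4.0, 0.5e9),
]

percent_change = (last_close - first_open) / first_open * 100

# Daily dollar volume is shares traded times the midpoint of high and low.
total_dollars = sum((high + low) / 2 * volume for high, low, volume in days)

formatted = (f"{percent_change:+.2f}%", f"{total_dollars / 1e9:.2f}B")
print(formatted)  # ('+25.00%', '8.31B')
```

The `+` in the format specification `+.2f` is what forces an explicit sign on positive percent changes.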

In [29]:
def stock_history(ticker, year, month):
    start_date = f"{year}-{month:02d}-01"
    end_date = f"{year}-{month:02d}-{pd.Period(start_date).days_in_month}"

    url = f"https://financialmodelingprep.com/api/v3/historical-price-full/{ticker}?from={start_date}&to={end_date}&apikey=ATgVbtI43Q4WNnBNUqiuy0VaNjyyqu7o"
    response = requests.get(url)
    data = response.json()

    historical_data = data['historical']
    df = pd.DataFrame(historical_data)

    return df

In [34]:
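def stock_stats(history):
    # Sort chronologically so that the first row is the first trading day of
    # the month, regardless of the order in which the API returned the rows.
    history = history.sort_values('date')
    open_pr = history.iloc[0]['open']
    close_pr = history.iloc[-1]['close']
    percent_change = ((close_pr - open_pr) / open_pr) * 100

    # Estimate each day's dollar volume as shares traded times the midpoint
    # of that day's high and low, then sum over the month.
    avg_pr = (history['high'] + history['low']) / 2
    total_vol = (history['volume'] * avg_pr).sum()
    total_vol_B = total_vol / 1e9

    percent_change = f"{percent_change:+.2f}%"
    total_vol_B = f"{total_vol_B:.2f}B"

    return (percent_change, total_vol_B)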

In [35]:
# don't delete this cell, but do run it -- it is needed for the autograder tests

# public test for stock_history
history = stock_history('BYND', 2019, 6)

# public test for stock_stats
stats = stock_stats(history)

In [36]:
grader.check("q3")

## Question 4 – Comment Threads 🧵

You regularly browse [Hacker News](https://news.ycombinator.com/) to keep up with the latest news in tech. One link to a Hacker News article is https://news.ycombinator.com/item?id=18344932. Note that this article has 18 comments and has a `storyid` of 18344932.

The problem now is that you don't have internet access on your phone during your morning commute to work, so you want to save the interesting stories' comment threads beforehand in a CSV. You find their [API documentation](https://github.com/HackerNews/API) and decide to get to work.

Complete the implementation of the function `get_comments`, which takes in a `storyid` and returns a DataFrame of all the comments below the news story. You can ignore "dead" comments (you will know them when you see them), as well as the children of "dead" comments. **Make sure the order of the comments in your DataFrame is from top to bottom, just as you see on the website**.

The DataFrame that `get_comments` returns should have 5 columns:
1. `'id'`: The unique ID of the comment.
2. `'by'`: The author of the comment.
3. `'text'`: The actual comment.
4. `'parent'`: The unique ID of the comment this comment is replying to.
5. `'time'`: When the comment was created (in `pd.Timestamp` format).

Some guidance:
1. The URL to make requests to is `'https://hacker-news.firebaseio.com/v0/item/{}.json'`, where the `{}` should be replaced with the ID of the story or comment you are trying to access.
2. Again, do not `import json` – instead, use the `json` method on the Response object you get back.
3. Use depth-first search when traversing the comments tree. You will have to do this manually, since you cannot use Beautiful Soup (which is only for HTML documents, not JSON objects).
4. Make sure the length of your returned DataFrame is the same as the value of the `'descendants'` key in the response JSON (both of which correspond to the number of comments for the story).
5. You are allowed to use loops in this function. You may also want to create at least one helper function.
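
The depth-first traversal described in the guidance can be sketched without any network access. In the example below, `FAKE_ITEMS` and `fetch_item` are hypothetical stand-ins for the API and for `requests.get(...).json()`; only the `'kids'`, `'dead'`, and `'parent'` keys mirror fields from the real responses.

```python
# Hypothetical in-memory stand-in for the API: item ID -> item JSON.
FAKE_ITEMS = {
    1: {"id": 1, "kids": [2, 5]},                         # the story
    2: {"id": 2, "parent": 1, "kids": [3, 4]},
    3: {"id": 3, "parent": 2},
    4: {"id": 4, "parent": 2, "dead": True, "kids": [6]},
    5: {"id": 5, "parent": 1},
    6: {"id": 6, "parent": 4},                            # reply to a dead comment
}


def fetch_item(item_id):
    """Stand-in for a GET request; returns the item's dictionary."""
    return FAKE_ITEMS[item_id]


def comment_ids_in_order(story_id):
    """Return live comment IDs in top-to-bottom page order using an explicit stack."""
    order = []
    # Reverse the kids so that the first comment ends up on top of the stack.
    stack = list(reversed(fetch_item(story_id).get("kids", [])))
    while stack:
        item = fetch_item(stack.pop())
        if item.get("dead"):
            continue  # skip the dead comment and never visit its replies
        order.append(item["id"])
        # Reverse again so replies are visited in their original order.
        stack.extend(reversed(item.get("kids", [])))
    return order


print(comment_ids_in_order(1))  # [2, 3, 5]
```

Reversing the children before pushing them matters because a stack pops the most recently pushed item first; without the reversal, replies would come out bottom to top.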

<div class="alert alert-block alert-success">
    You may find <a href="https://www.youtube.com/watch?v=uOfwW-onmpc"><b>this hint video 🎥</b></a> helpful!
</div>

In [46]:
from datetime import datetime

def format_url(code):
    url = f'https://hacker-news.firebaseio.com/v0/item/{code}.json'
    return url

def get_comment(code):
    comment = requests.get(format_url(code)).json()
    return comment

# class for stack
class Stack:
    def __init__(self):
        self.stack = []

    def push(self, item):
        self.stack.append(item)

    def pop(self):
        if not self.is_empty():
            return self.stack.pop()
        raise IndexError("pop from an empty stack")

    def is_empty(self):
        return len(self.stack) == 0

    def size(self):
        return len(self.stack)
    
def make_df(visited):
    keys = ['id', 'by', 'text', 'parent', 'time']
    result = pd.DataFrame(columns=keys)
    dead_alive = []
    for comment_id in visited:
        attributes = []
        com = get_comment(comment_id)
        # Items without a 'dead' key are alive.
        dead_alive.append(com.get('dead', False))
        for key in keys:
            # Missing attributes become NaN.
            attributes.append(com.get(key, np.nan))

        result.loc[len(result)] = attributes
    dead_series = pd.Series(dead_alive, dtype=bool)
    filtered = result[~dead_series].reset_index(drop=True)

    # Convert Unix timestamps (in seconds) to pd.Timestamp values.
    filtered['time'] = pd.to_datetime(filtered['time'], unit='s')
    return filtered
    


def get_comments(storyid):
    next_up = Stack()

    # Fetch the story that was asked for, not a hard-coded ID.
    site = requests.get(format_url(storyid)).json()
    # Push the top-level comments in reverse so the first one is popped first.
    for i in site.get('kids', [])[::-1]:
        next_up.push(i)

    # initialize list for visited comments
    visited = []

    # loop through next_up and continue adding child comments
    while not next_up.is_empty():
        top = next_up.pop()
        comment = get_comment(top)
        # Skip dead comments entirely, so their children are never visited.
        if comment is None or comment.get('dead', False):
            continue
        visited.append(top)
        # Push the children in reverse as well to keep top-to-bottom page order.
        for kid in comment.get('kids', [])[::-1]:
            next_up.push(kid)

    final_df = make_df(visited)
    return final_df

In [47]:
# don't delete this cell, but do run it -- it is needed for the autograder tests
comments = get_comments(18344932)

In [48]:
grader.check("q4")

## Congratulations! You're done with Lab 6! 🏁

As a reminder, all of the work you want to submit needs to be in `lab.py`.

To ensure that all of the work you want to submit is in `lab.py`, we've included a script named `lab-validation.py` in the lab folder. You shouldn't edit it, but instead, you should call it from the command line (e.g. the Terminal) to test your work.

Once you've finished the lab, you should open the command line and run, in the directory for this lab:

```
python lab-validation.py
```

**This will run all of the `grader.check` cells that you see in this notebook, but only using the code in `lab.py` – that is, it doesn't look at any of the code in this notebook. If all of your `grader.check` cells pass in this notebook but not all of them pass in your command line with the above command, then you likely have code in your notebook that isn't in your `lab.py`!**

You can also use `lab-validation.py` to test individual questions. For instance,

```
python lab-validation.py q1 q2 q4
```

will run the `grader.check` cells for Questions 1, 2, and 4 – again, only using the code in `lab.py`. [This video](https://www.loom.com/share/0ea254b85b2745e59322b5e5a8692e91?sid=5acc92e6-0dfe-4555-9b6a-8115b6a52f99) shows how to use the script as well.

Once `python lab-validation.py` shows that you're passing all test cases, you're ready to submit your `lab.py` (and only your `lab.py`) to Gradescope. After submitting to Gradescope, make sure to stick around until all test cases pass.

There is also a call to `grader.check_all()` below in _this_ notebook, but make sure to also follow the steps above.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()