# Lab 8 Web Scraping [Total: 3 points]

In this lab you will work through a concrete web scraping task as a coding assignment. Carry out the task in this notebook, and use the notebook to document each step of the exercise and to answer all questions.


## Required skills

This lab will let you practice the following skills:
- Download HTML
- Parse HTML

A few additional resources can be found here:
- https://docs.python-requests.org/
- https://beautiful-soup-4.readthedocs.io

## Important
* Run the two cells below before running any others. They download all required files and install the packages the code needs. If you restart the kernel or your runtime session (in Colab), rerun both cells before running any others.
* Google Colab is recommended for this assignment. If you are using Anaconda Jupyter notebook/lab, please ensure that **this notebook is kept in a new folder**. This is because the following commands will **delete all files with the extensions .csv, .py, .html, and .txt** before downloading the required files.

In [1]:
# Installing Otter-Grader and downloading required files
required_files = "https://github.com/mainuddin-rony/inst447-fall2024/raw/main/assignment/lab/lab8/required_files.zip"
! rm -rf tests
! rm -f required_files.zip *.csv *.py ._*.csv *.html *.txt
! wget $required_files && unzip -j required_files.zip
! mkdir tests && mv *.py tests
! pip install otter-grader==5.7.1

--2024-11-08 15:16:51--  https://github.com/mainuddin-rony/inst447-fall2024/raw/main/assignment/lab/lab8/required_files.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/mainuddin-rony/inst447-fall2024/main/assignment/lab/lab8/required_files.zip [following]
--2024-11-08 15:16:51--  https://raw.githubusercontent.com/mainuddin-rony/inst447-fall2024/main/assignment/lab/lab8/required_files.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5958 (5.8K) [application/zip]
Saving to: ‘required_files.zip’


2024-11-08 15:16:52 (50.4 MB/s) - ‘required_files.zip’ saved [5958/5958]

Archive:  required_fi

In [2]:
# Initialize Otter
import otter
grader = otter.Notebook()

## Q1

**Points**: 1

Write a function `gettextbytag` that extracts the text of every occurrence of a given HTML tag within an HTML file. Your function should accept 2 arguments -- the name of an HTML file as a string, and the name of an HTML tag as a string.

It should return a list of strings with one entry per occurrence of the tag in the file, each entry being the text of that occurrence.

For example, if the file `q1file.html` is:

```html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>This is a Heading</h1>
<p class="firstpara">This is first paragraph.</p>
<p>This is second paragraph.</p>
<p id="third">This is third paragraph.</p>

</body>
</html>
```

and the tag is `"h1"`, then your function should return
```
['This is a Heading']
```
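To illustrate the case where a tag occurs more than once, here is a minimal sketch of what the tag `"p"` would yield for the same document. It parses the HTML from an inline string (`sample_html`, a name made up for this illustration) rather than from `q1file.html`, so it shows the expected shape of the output and is not the graded function.

```python
from bs4 import BeautifulSoup

# Inline copy of the body of q1file.html as shown above (assumption: the
# real file has the same content).
sample_html = """
<h1>This is a Heading</h1>
<p class="firstpara">This is first paragraph.</p>
<p>This is second paragraph.</p>
<p id="third">This is third paragraph.</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag, regardless of its attributes.
paragraph_texts = [tag.get_text() for tag in soup.find_all("p")]
print(paragraph_texts)
```

A tag that never appears in the document, such as `"a"` here, should simply give an empty list.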

In [3]:
from bs4 import BeautifulSoup

def gettextbytag(html_file, tag_name):
    text_by_tag = []

    # Read the whole HTML file into a string
    with open(html_file, 'r') as f:
        data_in_file = f.read()

    beautiful_soup = BeautifulSoup(data_in_file, 'html.parser')

    # Collect the text of every occurrence of the requested tag
    find_tags = beautiful_soup.find_all(tag_name)
    for tag in find_tags:
        text_by_tag.append(tag.get_text())

    return text_by_tag

Use the cell below to run your function and see what it returns. You may want to test different tags to see if your code returns the correct answer.

In [4]:
gettextbytag("q1file.html", "a")

[]

When you're ready, run the cell below to get feedback on your answer.

In [5]:
grader.check("q1")


---

## Q2

**Points**: 1

Write a function called `getbooksprice` that scrapes the prices of the books on the front page of the website [books.toscrape.com](https://books.toscrape.com).

Your function should take no parameters. It should fetch the front page of the website and return a Python list of float values, one per book on the page, giving each price in pounds.

**Hint**: if you see the symbol `Â` being printed in the price, then make sure to set the encoding from the response, like this:
```
    response = requests.get("https://books.toscrape.com/")
    response.encoding = 'utf-8'
```
before extracting the text from the response object.
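The stray `Â` is a decoding artifact. The following is a minimal sketch of the mechanism, assuming the page body is UTF-8 but gets decoded as ISO-8859-1 (Latin-1), which is the encoding `requests` falls back to for text responses whose headers declare no charset. The price string is made up for illustration.

```python
# The pound sign occupies two bytes in UTF-8: 0xC2 0xA3.
raw_bytes = "£51.77".encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to its own character,
# so 0xC2 becomes 'Â' and 0xA3 becomes '£'.
wrong = raw_bytes.decode("latin-1")
right = raw_bytes.decode("utf-8")

print(wrong)  # Â£51.77
print(right)  # £51.77

# Once decoded correctly, only the currency symbol needs removing.
price = float(right.lstrip("£"))
print(price)  # 51.77
```

Setting `response.encoding = 'utf-8'` makes `response.text` take the second path, so the prices come out with only the `£` prefix.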

In [10]:
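from bs4 import BeautifulSoup
import requests

def getbooksprice():
    book_prices = []

    # Fetch the front page and decode it as UTF-8 so '£' is read correctly
    res = requests.get("https://books.toscrape.com/")
    res.encoding = 'utf-8'

    beautiful_soup = BeautifulSoup(res.text, 'html.parser')

    # Each price sits in a <p class="price_color"> element
    find_prices = beautiful_soup.find_all('p', {'class': 'price_color'})

    for price in find_prices:
        # Drop the currency symbol (and any stray 'Â') before converting
        book_prices.append(float(price.get_text().strip('Â£')))

    return book_prices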

Use the cell below to run your function and see what it returns.

In [11]:
getbooksprice()

[51.77,
 53.74,
 50.1,
 47.82,
 54.23,
 22.65,
 33.34,
 17.93,
 22.6,
 52.15,
 13.99,
 20.66,
 17.46,
 52.29,
 35.02,
 57.25,
 23.88,
 37.59,
 51.33,
 45.17]

When you're ready, run the cell below to get feedback on your answer.

In [7]:
grader.check("q2")


---

## Q3

**Points**: 1

Write a function called `getbookspricebonus` that scrapes the prices of all the books in the website catalog in addition to the front page. Your function should fetch one page at a time, scrape the book prices, and append them to the final return value, a NumPy array. Besides the front page, there are 50 catalog pages, each with 20 books, so the array should contain exactly 1020 prices in total. *Since the first catalog page contains the same books as the front page, the first 20 prices should appear twice in your array*.



**Hint 1**: if you see the symbol `Â` being printed in the price, then make sure to set the encoding from the response, like this:
```
    response = requests.get("https://books.toscrape.com/")
    response.encoding = 'utf-8'
```
before extracting the text from the response object.

**Hint 2**: Each additional catalog page is located at the following URL:

    https://books.toscrape.com/catalogue/page-XX.html
    
where `XX` is the page number, ranging from 1 to 50 with no zero padding (for example, `page-1.html`).
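As a quick sanity check on the URL pattern and the expected array length, here is a minimal sketch. The names `BASE_URL`, `PAGE_COUNT`, `BOOKS_PER_PAGE`, and `catalog_urls` are illustrative, and no requests are sent.

```python
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"
PAGE_COUNT = 50       # catalog pages, per the hint above
BOOKS_PER_PAGE = 20   # books listed on each page, per the task description

# range's upper bound is exclusive, so use PAGE_COUNT + 1 to include page 50.
catalog_urls = [BASE_URL.format(i) for i in range(1, PAGE_COUNT + 1)]

# Front page (20 prices) plus 50 catalog pages of 20 prices each.
expected_total = BOOKS_PER_PAGE + PAGE_COUNT * BOOKS_PER_PAGE

print(catalog_urls[0])
print(catalog_urls[-1])
print(expected_total)
```

Comparing the length of the returned array against `expected_total` is a cheap way to catch an off-by-one in the page loop.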

In [22]:
from bs4 import BeautifulSoup
import requests
import numpy as np

def getbookspricebonus():
    book_price_bonus = []

    # Front page first, then catalog pages 1 through 50
    page_urls = ["https://books.toscrape.com/"]
    for i in range(1, 51):
        page_urls.append(f"https://books.toscrape.com/catalogue/page-{i}.html")

    for url in page_urls:
        # Fetch one page at a time and decode it as UTF-8
        res = requests.get(url)
        res.encoding = 'utf-8'
        beautiful_soup = BeautifulSoup(res.text, 'html.parser')

        # Strip the currency symbol and convert each price to float
        page_prices = beautiful_soup.find_all('p', {'class': 'price_color'})
        for price in page_prices:
            book_price_bonus.append(float(price.get_text().strip('Â£')))

    return np.array(book_price_bonus)

Use the cell below to run your function and see what it returns.

In [23]:
getbookspricebonus()

array([51.77, 53.74, 50.1 , ..., 16.97, 53.98, 26.08])

When you're ready, run the cell below to get feedback on your answer.

In [24]:
grader.check("q3")


---

## Submission

Don't forget to run all cells in your notebook and then save it. To save, click on *File*, then select *Save/Save Notebook*. After that, download the notebook by going to *File --> Download* (for Anaconda Notebook) or *File --> Download .ipynb* (for Colab). Finally, submit the notebook on Gradescope using the link found on ELMS.