# Homework 8: Web Scraping and Database Management

This is a **pair-optional** assignment. Total: 20 points. Due: **<span style="color:red">Sunday, November 5, 10:00 pm</span>**.

**Overview**
In this assignment, you will develop a Python script to scrape book information from the website https://books.toscrape.com. The website contains a total of 1000 book entries spread across multiple pages. Your task is to extract information about each book, including its `title`, `category`, `price`, `availability`, and `description`, and then store this information in a MySQL database. After populating the database, you must run **at least five** meaningful MySQL queries to demonstrate your data manipulation skills.

**Objectives**
1. **Web Scraping**: Write a Python script using libraries like `requests` and `BeautifulSoup` to scrape book data from the website. Ensure you navigate through all the pages to get details of all 1000 books.
2. **Database Creation**: Set up a MySQL database and design an appropriate schema to store the scraped book data.
3. **Data Insertion**: Populate the MySQL database with the scraped data, ensuring data integrity and proper structuring.
4. **MySQL Queries**: Execute **at least five** meaningful queries on your database. These might include:
    * Aggregations (e.g., average price of books, count of books under specific categories).
    * Search queries (e.g., finding all books with a particular word in their title/description).
    * Data updates (e.g., updating prices or availability status).
    * Ordering and grouping of data based on certain criteria.
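
As a rough illustration of the query types listed above, the sketch below runs one query of each kind against a tiny in-memory table. It uses Python's built-in `sqlite3` module purely as a stand-in so that it runs without a server; the table name `books`, its columns, and the three sample rows are assumptions made up for this example, and a real submission would run equivalent statements against MySQL.

```python
import sqlite3

# Stand-in database: sqlite3 is used only so the sketch runs without a MySQL server.
# The table name, columns, and rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE books (title TEXT, category TEXT, price REAL, availability TEXT)"
)
cur.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [
        ("Book A", "Poetry", 20.0, "In stock"),
        ("Book B", "Poetry", 30.0, "In stock"),
        ("Mystery C", "Mystery", 10.0, "In stock"),
    ],
)

# Aggregation: average price and number of books per category
cur.execute(
    "SELECT category, AVG(price), COUNT(*) FROM books "
    "GROUP BY category ORDER BY category"
)
per_category = cur.fetchall()

# Search: titles containing a particular word
cur.execute("SELECT title FROM books WHERE title LIKE ?", ("%Mystery%",))
matches = cur.fetchall()

# Update: change the availability status of one book
cur.execute(
    "UPDATE books SET availability = ? WHERE title = ?", ("Sold out", "Book A")
)
updated_rows = cur.rowcount

# Ordering: most expensive book first
cur.execute("SELECT title, price FROM books ORDER BY price DESC LIMIT 1")
most_expensive = cur.fetchone()

conn.close()
```

With `mysql.connector`, the parameter placeholder is `%s` rather than `?`, and an update only becomes permanent after `connection.commit()`.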

**Deliverables**
* Python script for Web Scraping: A script that systematically navigates through https://books.toscrape.com and scrapes the relevant data.
* The Database and the Python script used to populate it: Exporting a database will be demonstrated on Friday.
* MySQL script: MySQL queries executed on the database, along with brief explanations of the purpose and outcome of each query.
* Report (Optional) - A concise report including:
    * Challenges encountered and how they were resolved.
    * Insights or interesting findings from your MySQL queries.
    * Any suggestions or advice for this assignment.
    

**Submission**
* Include the Python scripts and MySQL queries in the Jupyter Notebook. Submit the actual database as a .zip or .sql file on Gradescope, along with the Jupyter Notebook.

**Notes**
* Your code should handle exceptions and potential data inconsistencies gracefully.
* Pay attention to the pagination on the website when designing your scraper.
* You may need to visit the specific page of the book to obtain its category and description. It's possible that certain books may lack a description.
* Based on your approach, the code snippet `find_all('p', recursive=False)` could be helpful for extracting the description.
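
One way the `find_all('p', recursive=False)` hint might be applied is sketched below. The HTML string is a simplified stand-in for a book page, assuming the description is the only `<p>` that is a direct child of the `<article class="product_page">` element. The helper name `extract_description` is hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a book page (an assumption, not the site's full markup):
# the price sits in a nested <div>, the description <p> is a direct child of <article>.
SAMPLE_HTML = """
<article class="product_page">
  <div class="product_main">
    <h1>Example Book</h1>
    <p class="price_color">10.00</p>
  </div>
  <div id="product_description"><h2>Product Description</h2></div>
  <p>An example description.</p>
</article>
"""


def extract_description(html):
    """Return the description text, or None when the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article", class_="product_page")
    if article is None:
        return None
    # recursive=False looks only at direct children, skipping nested <p> tags
    paragraphs = article.find_all("p", recursive=False)
    return paragraphs[0].get_text(strip=True) if paragraphs else None


description = extract_description(SAMPLE_HTML)
# A page whose only <p> is nested inside a <div> has no description
missing = extract_description(
    '<article class="product_page"><div><p>x</p></div></article>'
)
```

Returning `None` for a missing description leaves it to the caller to decide whether to store `NULL` or an empty string.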

# For pagination use a for loop:

for page in range(1, 51):
    page_url = f'https://books.toscrape.com/catalogue/page-{page}.html'
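
The loop above only builds each listing-page URL. A minimal sketch of the next step, turning the relative links found on a listing page into absolute book URLs, is shown below using `urllib.parse.urljoin`. The helper names `listing_urls` and `absolute_book_url` and the sample `href` are assumptions for illustration; the page count of 50 comes from the loop above.

```python
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/catalogue/"


def listing_urls(last_page=50):
    """Return the URL of every listing page, from page 1 to last_page."""
    return [
        urljoin(BASE_URL, f"page-{page}.html") for page in range(1, last_page + 1)
    ]


def absolute_book_url(page_url, href):
    """Resolve a relative href found on a listing page against that page's URL."""
    return urljoin(page_url, href)


pages = listing_urls()
# Hypothetical relative link, in the form a listing page might use
book_url = absolute_book_url(pages[0], "example-book_1/index.html")
```

Resolving against the page URL rather than concatenating strings also handles links that start with `../` or `/`.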

In [3]:
!pip3 install mysql-connector-python
# Had to reinstall cuz my computer broke halfway through and I got a new one

Collecting mysql-connector-python
  Obtaining dependency information for mysql-connector-python from https://files.pythonhosted.org/packages/a7/84/b63f11124f808b6f1e3389072bc36cc907929d7574e85f94bf8f18117fe4/mysql_connector_python-8.2.0-cp311-cp311-win_amd64.whl.metadata
  Downloading mysql_connector_python-8.2.0-cp311-cp311-win_amd64.whl.metadata (2.1 kB)
Collecting protobuf<=4.21.12,>=4.21.1 (from mysql-connector-python)
  Downloading protobuf-4.21.12-cp310-abi3-win_amd64.whl (527 kB)
     -------------------------------------- 527.0/527.0 kB 3.3 MB/s eta 0:00:00
Downloading mysql_connector_python-8.2.0-cp311-c

In [6]:
# Trying to mix the two scripts because I can't figure out how to transfer data
import mysql.connector
import requests
from bs4 import BeautifulSoup

# Base URL that the relative book links are appended to
url = "https://books.toscrape.com/catalogue/"

# Scrape a single book page and return its details
def web_scrape(booki_url):
    response = requests.get(booki_url)
    data = response.text
    # Parsing the HTML content of the webpage
    soup = BeautifulSoup(data, 'html.parser')
    book_title = soup.find('h1').get_text(strip=True)
    # Select nested text: the category is the third breadcrumb item (index 2)
    book_genre = soup.select("ul.breadcrumb li")[2].get_text(strip=True)
    # select_one returns the first matching element
    book_price = soup.select_one('p.price_color').get_text(strip=True)
    book_availability = soup.find('p', class_='availability').get_text(strip=True)
    # Meta tag cuz I couldn't figure out the recursive statement;
    # fall back to an empty string when a book has no description
    meta = soup.find("meta", attrs={"name": "description"})
    book_description = meta["content"].strip() if meta and meta.has_attr("content") else ""

    return book_title, book_genre, book_price, book_availability, book_description

# Start as None so the finally block is safe even if connecting fails
connection = None
try:
    # Establish a database connection
    connection = mysql.connector.connect(user='root', password='rootpassword', host='localhost', database='HW8')
    cursor = connection.cursor()

    # SQL query for inserting data
    add_transaction = ("INSERT INTO scrapey "
                       "(title, genre, price, availability, description) "
                       "VALUES (%s, %s, %s, %s, %s)")

    # Pagination: the catalogue has 50 listing pages
    for page in range(1, 51):
        print(page)
        page_url = f'https://books.toscrape.com/catalogue/page-{page}.html'
        response = requests.get(page_url)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        # Each book's link sits inside an <h3> on the listing page
        book_containers = soup.find_all('h3')
        # Iterate through each book container and extract info from the book's individual url
        for book in book_containers:
            close_url = book.find('a')['href']
            booki_url = url + close_url
            book_title, book_genre, book_price, book_availability, book_description = web_scrape(booki_url)

            transaction_data = (book_title, book_genre, book_price, book_availability, book_description)
            # Insert new book
            cursor.execute(add_transaction, transaction_data)
    # Commit the transaction once all pages are done
    connection.commit()

except mysql.connector.Error as error:
    print(f"Failed to insert record into MySQL table {error}")

finally:
    # Close communication with the database
    if connection is not None and connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection is closed")

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
MySQL connection is closed
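
The script above inserts into a table named `scrapey` that has to exist already. A possible schema is sketched below; the column types and the `id` column are assumptions, since the original table definition is not shown. The hypothetical helper `parse_price` shows one way to turn the scraped price text into a number, so that the price could be stored as `DECIMAL` and used in aggregations.

```python
import re
from decimal import Decimal

# Assumed schema: column names match the INSERT statement, the types are a guess.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS scrapey (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    genre VARCHAR(100),
    price DECIMAL(6, 2),
    availability VARCHAR(100),
    description TEXT
)
"""


def parse_price(text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group())


price = parse_price("£51.77")
```

Running `cursor.execute(CREATE_TABLE_SQL)` once before the scraping loop would create the table if it is missing.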


### REPORT ###
## Challenges encountered and how they were resolved.
The biggest challenge that I faced was getting the web scraping to work. I tried many variations of my code to find the genre, price, and description. I ended up finding `select` and used it instead of `find` because I couldn't get `find` to work for every field.
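
To illustrate the difference mentioned above, here is a small sketch comparing `find` with `select` and `select_one` on a made-up HTML fragment. The fragment is an assumption for illustration, not the site's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration
html = """
<ul class="breadcrumb">
  <li>Home</li><li>Books</li><li>Poetry</li><li>Example Book</li>
</ul>
<p class="price_color">10.00</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find filters by tag name and attributes and returns the first match
price_with_find = soup.find("p", class_="price_color").get_text(strip=True)

# select_one takes a CSS selector and also returns the first match
price_with_select = soup.select_one("p.price_color").get_text(strip=True)

# select returns every match, which keeps nested lookups short
crumbs = [li.get_text(strip=True) for li in soup.select("ul.breadcrumb li")]
category = crumbs[2]
```

Both approaches reach the same elements; the CSS selector form is simply more compact when the target is nested inside another element.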

## Insights or interesting findings from your MySQL queries.
Only 3 books had descriptions longer than 5,000 characters; most fell in the range of 1,000-2,000 characters.
No book cost more than 60 pounds, so I'm glad that they took the time to give realistic prices instead of randomly assigning prices without a limit.

## Any suggestions or advice for this assignment.
The assignment was extremely vague; in some ways I enjoyed it and in others I hated it. With the other assignments we had the modules to help with our answers, but for this one I really only had YouTube and Reddit to help me (with the exception of the template and deep diving into the links you gave us). Maybe give a little more info on web scraping and how it works, perhaps over a second class, in the future because I was lost so many times.