# Web Scraping Script Documentation

## Introduction

Web scraping is the process of extracting content and data from a website using scripts or bots. A scraper fetches the underlying HTML of a page and, with it, the data that the site renders into that page. The scraper can then copy the complete page content (or only the extracted information) elsewhere in a more user-friendly form, such as a spreadsheet or an API.

## Purpose


This Python web scraper script was created as the first task of the internship program, following the requirements below:

1. **Website Selection:**
   - This script is designed to scrape data from the Junkybooks website (https://www.junkybooks.com/), a publicly accessible source of many different books in PDF format.


2. **Web Scraping:**
   - Using the Beautiful Soup and Requests libraries, the script extracts the names of all the books available on the website together with their corresponding download links.


3. **Data Processing:**
   - The script stores the extracted names and download links in lists and displays each book name next to its corresponding download link.


4. **Automation:**
   - The script is automated by scheduling it to run every 24 hours. This helps keep the dataset up-to-date.


5. **Documentation:**
   - Documentation that explains the purpose of the script, how to execute it, dependencies used, and additional notes is added to the script.

### 1. Import the necessary libraries:

In [18]:
import requests
from bs4 import BeautifulSoup
import schedule
import time

## Imported Libraries

1. **requests**: An HTTP library that makes it easy to send HTTP/1.1 requests. It is commonly used to fetch data from websites.

2. **BeautifulSoup**: A library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

3. **schedule**: A simple-to-use API for scheduling tasks. It enables recurring tasks to run automatically at specified intervals.

4. **time**: A standard-library module used here to introduce delays in the script. It can be used to control the timing of various operations.
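
To illustrate how the Beautiful Soup calls used later in this document fit together, here is a minimal sketch that parses an inline HTML snippet instead of a live page. The markup and the `book/example-...` paths are made up for illustration; only the class names are taken from the script, and the real site's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written to mirror the class names the script looks for.
SAMPLE_HTML = """
<div class="tab-content">
  <div class="product-details text-center">
    <h4><a href="book/example-one"> Example One </a></h4>
  </div>
  <div class="product-details text-center">
    <h4><a href="book/example-two">Example Two</a></h4>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag, or None if there is no match.
container = soup.find("div", class_="tab-content")

# find_all() returns every matching tag inside the container.
cards = container.find_all("div", class_="product-details text-center")

parsed = []
for card in cards:
    heading = card.find("h4")
    title = heading.get_text(strip=True)       # text with surrounding whitespace removed
    href = heading.find("a")["href"].strip()   # value of the link's href attribute
    parsed.append((title, href))

print(parsed)
```

Because `find()` returns `None` when nothing matches, checking its result before calling further methods on it avoids an `AttributeError`.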

### 2. Define the URL of the website to be scraped

In [4]:
_baseurl_ = "https://www.junkybooks.com/"

### 3. Define empty lists

In [16]:
book_titles = []
download_pages = []
download_links = []

These lists hold the following data, respectively:

1. **book_titles:** Holds the titles of all the books listed on the website.
2. **download_pages:** Holds the web address of the page where each book can be found.
3. **download_links:** Holds the web address from which each book is downloaded.

### 4. Define the first web scraping function

In [28]:
def scrape_web():
    # Reset the shared lists so repeated (scheduled) runs do not accumulate duplicates.
    book_titles.clear()
    download_pages.clear()
    download_links.clear()
    index = 1
    while True:
        # Listing pages are numbered from 1 upward.
        link = "https://www.junkybooks.com/books?page=" + str(index)
        page = requests.get(link)
        soup = BeautifulSoup(page.content, "html.parser")
        results = soup.find("div", class_="tab-content")
        if results is None:
            # No listing container: the last page has been passed.
            print("Parsing stops here since END OF WEB PAGES is reached")
            scrape_webpage(download_pages)
            break
        books = results.find_all("div", class_="product-details text-center")
        for book in books:
            # The <h4> holds the title; its <a> links to the book's download page.
            title = book.find("h4")
            title_text = title.get_text()
            book_titles.append(title_text)
            download_path = title.find('a')['href'].strip()
            download_link = "https://www.junkybooks.com/" + download_path
            download_pages.append(download_link)
        index += 1

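The first function (scrape_web) walks through the paginated book listing, starting at page 1 and incrementing the page number until a page without a listing container is returned. For each book on each page, it stores the title and the URL of the book's download page. Once the last page has been passed, it hands the collected download pages to scrape_webpage.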
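
The stop condition of this loop can be looked at in isolation. The sketch below assumes a hypothetical `fetch_listing` callable that returns the list of books on a page, or `None` once the page number is past the end. Neither that callable nor the `max_pages` cap is part of the script; the cap is added here only as a safety net against a site that never stops returning pages.

```python
from typing import Callable, Iterator, List, Optional


def iter_listing_pages(
    fetch_listing: Callable[[int], Optional[List[str]]],
    max_pages: int = 1000,
) -> Iterator[List[str]]:
    """Yield the books of each listing page until a page has no container."""
    for index in range(1, max_pages + 1):
        books = fetch_listing(index)
        if books is None:
            # Same signal as `results is None` in scrape_web: no more pages.
            return
        yield books


# Hypothetical site with three listing pages, used only to exercise the loop.
FAKE_SITE = {1: ["Book A", "Book B"], 2: ["Book C"], 3: []}


def fake_fetch(index: int) -> Optional[List[str]]:
    # dict.get returns None for page numbers that do not exist.
    return FAKE_SITE.get(index)


all_books = [title for page in iter_listing_pages(fake_fetch) for title in page]
print(all_books)  # ['Book A', 'Book B', 'Book C']
```

Passing the fetch step in as a parameter keeps the pagination logic testable without any network access.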

### 5. Define the second web scraping function

In [20]:
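def scrape_webpage(download_pages):
    for download_page in download_pages:
        # Fetch and parse the download page of a single book.
        response = requests.get(download_page)
        soup = BeautifulSoup(response.content, "html.parser")

        results = soup.find("div", class_="product-info-main")
        if results is None:
            # Expected container is missing: store a placeholder so that
            # download_links stays aligned with book_titles, then move on.
            download_links.append(None)
            continue
        books = results.find_all("div", class_="product-add-form")
        for book in books:
            download_button = book.find("form")
            download_subDir = download_button.find('a')['href'].strip()
            # strip("..?") removes any leading or trailing "." and "?" characters.
            download_link = "https://www.junkybooks.com/" + download_subDir.strip("..?")
            download_links.append(download_link)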

The second function (scrape_webpage) visits each of the download pages received from the first scraper function (scrape_web) and extracts the actual download link for each book.
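
One detail worth checking is how the relative `href` becomes an absolute URL. `str.strip("..?")` removes any leading or trailing `.` and `?` characters rather than a literal `..?` prefix. A possible alternative, shown here as a sketch and not as the script's method, is `urllib.parse.urljoin` from the standard library. The page URL and `href` below are invented for illustration.

```python
from urllib.parse import urljoin

# Hypothetical values; the real site may use different paths.
page_url = "https://www.junkybooks.com/book/example-title"
relative_href = "../files/example-title.pdf"

# str.strip treats its argument as a set of characters to remove from both ends.
stripped = relative_href.strip("..?")
print(stripped)  # /files/example-title.pdf

# Plain concatenation with the base URL then produces a double slash.
concatenated = "https://www.junkybooks.com/" + stripped
print(concatenated)  # https://www.junkybooks.com//files/example-title.pdf

# urljoin resolves the href relative to the page it was found on.
resolved = urljoin(page_url, relative_href)
print(resolved)  # https://www.junkybooks.com/files/example-title.pdf
```

Whether the double slash matters depends on how the server treats it, so this is something to verify against the real links rather than assume.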

In [None]:
scrape_web()

### 6. Print the scraped data (title of each book with its corresponding download link)

In [21]:
m = 0  # number of books printed; reused in the summary cell below
# zip() pairs each title with its link and stops at the shorter list,
# which avoids an IndexError if the two lists differ in length.
for n, (book_title, download_link) in enumerate(zip(book_titles, download_links)):
    print("===========================================================================================================================================||")
    print(f"||#{n} BOOK TITLE: ||  {book_title}")
    print("-------------------------------------------------------------------------------------------------------------------------------------------||")
    print(f"||DOWNLOAD LINK:  || {download_link}")
    m += 1

In [22]:
print("============================================================================================================================================||")
print(f"Total of {m} Books Scraped from the website")
print("============================================================================================================================================||")

Total of 0 Books Scraped from the website
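
Since the introduction mentions a spreadsheet as one possible destination for scraped data, the same title and link pairs could also be written out as CSV. This is only a sketch: the sample rows are made up, the column names are assumptions, and the script itself does not save anything to a file.

```python
import csv
import io

# Made-up sample data standing in for book_titles and download_links.
sample_titles = ["Example One", "Example Two"]
sample_links = [
    "https://www.junkybooks.com/files/example-one.pdf",
    "https://www.junkybooks.com/files/example-two.pdf",
]

# An in-memory buffer keeps the example self-contained; to write a real file,
# use open("books.csv", "w", newline="") in its place.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "download_link"])          # header row
writer.writerows(zip(sample_titles, sample_links))   # one row per book

csv_text = buffer.getvalue()
print(csv_text)
```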


### 7. Schedule the task to run every 24 hours

In [None]:
# Schedule the task to run every 24 hours
schedule.every(24).hours.do(scrape_web)
while True:
    schedule.run_pending()
    time.sleep(1)
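
A scraping run can fail, for example when the site is unreachable, and an unhandled exception inside the job would end a long-running loop like the one above. One way to guard against that, sketched here with the standard library only, is to wrap the job so that errors are logged and the next run can still happen. `run_safely` and `flaky_job` are hypothetical names introduced for this example.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")


def run_safely(job):
    """Wrap `job` so that an exception is logged instead of propagating."""
    @functools.wraps(job)
    def wrapper(*args, **kwargs):
        try:
            return job(*args, **kwargs)
        except Exception:
            # logger.exception records the message together with the traceback.
            logger.exception("Job %s failed; waiting for the next run", job.__name__)
            return None
    return wrapper


@run_safely
def flaky_job():
    # Hypothetical job that always fails, to show the wrapper at work.
    raise RuntimeError("site unreachable")


result = flaky_job()  # logs the error, does not raise
print(result)
```

In the scheduling cell, the wrapped function would be passed in place of the bare one, for example `schedule.every(24).hours.do(run_safely(scrape_web))`.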