# Web Scraping in Goodreads

This notebook is designed to obtain the **Image_url** of books from the Goodreads website. The books for which we need to collect this data are those listed in the datasets available [here](https://github.com/zygmuntz/goodbooks-10k). Some of these books have this feature missing.

### Importing Libraries

In [1]:
import pandas as pd               # pandas is used for data manipulation and analysis, providing data structures like DataFrames.
import numpy as np                # numpy is used for numerical operations on large, multi-dimensional arrays and matrices.
import requests                   # Library used for making HTTP requests.
from bs4 import BeautifulSoup     # Library for parsing HTML and XML documents.
from tqdm import tqdm             # To include a progress bar in the loop.
import concurrent.futures         # To make multiple http requests simultaneously.
import time                       # time is used for time-related functions
import re                         # re provides regular expression matching operations in strings.
import json                       # json is used for parsing and generating JSON (JavaScript Object Notation) data.
from datetime import datetime     # datetime is used for manipulating dates and times.
from IPython.display import Image # IPython's display module to display images within Jupyter Notebooks.

### Function Definition

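Here we define the functions used in the scraping loop to make HTTP requests to Goodreads. Together they look up the cover image URL of a book from its Goodreads work ID: `get_book_editions` lists the editions of a work, `get_missing_data` reads the image URL from an edition page, and `get_data` returns the first accessible image URL found among the editions.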

In [101]:
# The User-Agent header is needed for the simultaneous requests to work:
# Goodreads seems to block requests that identify themselves as coming from a script.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}

def get_book_editions(book_workid):
    url = f"https://www.goodreads.com/work/editions/{book_workid}"
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        #print(f"Error fetching the page: {response.status_code}")
        return []
    
    soup = BeautifulSoup(response.content, 'html.parser')

    editions = soup.find_all('div', class_='editionData')

    book_editions = []
    for edition in editions:
        edition_info = {}
        title_tag = edition.find('a', class_='bookTitle')
        if title_tag:
            edition_info['title'] = title_tag.text.strip()
            edition_info['link'] = "https://www.goodreads.com" + title_tag['href']
    
        if edition_info:
            book_editions.append(edition_info)

    return book_editions


def is_image_url_accessible(image_url):
    # True only if the image URL can be fetched and answers with HTTP 200.
    try:
        response = requests.get(image_url)
        return response.status_code == 200
    except requests.RequestException:
        return False


def get_missing_data(edition_url):
    response = requests.get(edition_url, headers=headers) # same User-Agent header as in get_book_editions

    if response.status_code != 200:
        #print(f"Error fetching the page: {response.status_code}")
        return None

    soup = BeautifulSoup(response.content, 'html.parser')
    script_tag = soup.find('script', type='application/ld+json') # The image's url is here

    if not script_tag:
        print("No JSON-LD script found")
        return None

    # Convert the JSON fragment into a dictionary
    try:
        book_details = json.loads(script_tag.string)
    except json.JSONDecodeError:
        #print("Error decoding JSON")
        return None

    image_url = book_details.get('image')

    # Discard URLs that are missing or cannot be downloaded
    if not is_image_url_accessible(image_url):
        return None

    return image_url


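def get_data(workid):
    # Return the first accessible cover image URL among the editions of a work,
    # or None if no edition provides one.
    editions = get_book_editions(workid)
    for edition in editions:
        image_url = get_missing_data(edition['link'])
        if image_url:
            return image_url
    return None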
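
The JSON-LD step in `get_missing_data` can be tried offline. The following is a minimal sketch, not the notebook's own code: it uses the standard-library `HTMLParser` in place of BeautifulSoup so that it runs without extra dependencies, and the names `JsonLdExtractor`, `extract_image_url` and the `sample_html` fragment are hypothetical, invented for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several pieces, so we concatenate.
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_image_url(html):
    """Return the 'image' field of the first parsable JSON-LD object, or None."""
    parser = JsonLdExtractor()
    parser.feed(html)
    for block in parser.blocks:
        try:
            details = json.loads(block)
        except json.JSONDecodeError:
            continue
        if isinstance(details, dict) and details.get("image"):
            return details["image"]
    return None


# Hypothetical page fragment (not real Goodreads markup).
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "Book", "name": "Example Title", "image": "https://example.com/cover.jpg"}
</script>
</head><body></body></html>
"""

print(extract_image_url(sample_html))
```

Unlike `get_missing_data`, this sketch does not check that the image URL is reachable; it only shows how the URL is read from the embedded JSON.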

### Load the dataset

Here we load the books dataset, which provides the Goodreads work IDs of the books whose cover image URL we want to find.

In [2]:
books = pd.read_csv("../data_preprocessed/books.csv")

In [3]:
columns = ['image_url']
for column in columns:
    num_duplicates = books[books[column].duplicated()].shape[0]
    print(f'Number of duplicates in books[{column}]: {num_duplicates}')

Number of duplicates in books[image_url]: 3331


In [4]:
duplicated_images = books[books['image_url'].duplicated()]['image_url'].values
print(set(duplicated_images), '\n')
# All the duplicated image URLs point to the same picture, which is actually the placeholder for a missing image:
url = duplicated_images[0]
Image(url=url, width=100)

{'https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png'} 



In [5]:
books_missing_data = books[books['image_url'] == url].reset_index(drop=True)
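
The selection above matches the exact placeholder URL. As an alternative, the check could be written against the URL path. The sketch below assumes that placeholder covers always contain a `nophoto` path segment, which is inferred only from the single URL printed above. The helper name `is_placeholder_cover` is hypothetical.

```python
from urllib.parse import urlparse


def is_placeholder_cover(image_url):
    """Return True if the URL looks like a 'no photo' placeholder.

    Assumption: placeholder covers have a 'nophoto' segment in their path.
    Missing values (None, NaN) are also treated as "no cover".
    """
    if not isinstance(image_url, str):
        return True
    return "nophoto" in urlparse(image_url).path.split("/")


placeholder = "https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png"
print(is_placeholder_cover(placeholder))
```

With pandas, such a helper could be applied through `books['image_url'].map(is_placeholder_cover)` to build the selection mask.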

### Web Scraping

We divide the work IDs of the books with a missing cover into `divs` intervals. The loop iterates over the intervals from `key_i` to `key_f` and, within each interval, over its work IDs, so we can choose how many intervals to process in each run. This gives us control over the scraping process: iterating over the whole array `books_missing_data['work_id']` in one go could fail midway (for example, if requests are blocked or the connection drops) and lose everything collected so far. For the same reason, every time an interval is completed we store the results in a tab-separated file (`books_image_missing.txt`).

In [122]:
total = len(books_missing_data['work_id']) # total number of work IDs
divs = 28 # number of intervals into which we divide the work IDs
step = total // divs # length of each interval (any remainder of total / divs is left out)
ranges = [range(step*i - step, step*i) for i in range(1,divs+1)] # a list with the intervals

key_i = 0 # number of the interval at which we start the for loop
key_f = 28 # number of the interval at which we stop the for loop

#progress_bar = tqdm(total=(key_f-key_i), bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')

for i in range(key_i, key_f): 
    
    workids = books_missing_data['work_id'][ranges[i]] # WorkIDs of the interval i
    
    progress_bar = tqdm(total=step, bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')
    
    data_dict = {
        'WorkID':[],
        'Image_url':[]
    }

    with concurrent.futures.ThreadPoolExecutor() as executor: 
        future_to_workid = {executor.submit(get_data, workid): workid for workid in workids}
        for future in concurrent.futures.as_completed(future_to_workid):
            workid = future_to_workid[future]
            try:
                image_workid = future.result()
                data_dict['WorkID'].append(workid)
                data_dict['Image_url'].append(image_workid)
            except Exception as e:
                data_dict['WorkID'].append(workid)
                data_dict['Image_url'].append(np.nan)
            progress_bar.update(1)
           
    progress_bar.close()

    new_df = pd.DataFrame(data_dict)

    try:
        existing_df = pd.read_csv("books_image_missing.txt", sep="\t")
    except FileNotFoundError:
        existing_df = pd.DataFrame()

    combined_df = pd.concat([existing_df, new_df], ignore_index=True)

    combined_df.to_csv("books_image_missing.txt", sep="\t", index=False)

    time.sleep(2)

    #progress_bar.update(1) 
    
#progress_bar.close()

100.00%|██████████████████████████████████████| 119/119 [00:00<01:47,  1.11it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:06,  1.78it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:28,  1.35it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:02,  1.91it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:02,  1.92it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:20,  1.49it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<02:48,  1.42s/it]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:07,  1.77it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:38,  1.21it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:49,  1.09it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:01,  1.95it/s]
100.00%|██████████████████████████████████████| 119/119 [00:00<01:26,  1.37it/s]
100.00%|████████████████████
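
The loop above builds its intervals from `step = total // divs`, which leaves out the last few items whenever `total` is not a multiple of `divs`. A minimal sketch of a splitting helper that covers every index is shown below. The name `split_into_intervals` is hypothetical and the helper is not part of the notebook.

```python
def split_into_intervals(total, divs):
    """Split range(total) into `divs` contiguous ranges covering every index.

    The first `total % divs` intervals get one extra element each.
    """
    if divs <= 0:
        raise ValueError("divs must be positive")
    base, extra = divmod(total, divs)
    intervals = []
    start = 0
    for i in range(divs):
        size = base + (1 if i < extra else 0)
        intervals.append(range(start, start + size))
        start += size
    return intervals


print(split_into_intervals(10, 3))
# [range(0, 4), range(4, 7), range(7, 10)]
```

Since the interval lengths can differ by one, the progress bar total would then be `len(ranges[i])` rather than a fixed `step`.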

In [131]:
combined_df[combined_df.isnull().any(axis=1)]

| | WorkID | Image_url |
|---|---|---|
| 117 | 1180927 | |
| 118 | 1766737 | |
| 142 | 4551869 | |
| 170 | 2247074 | |
| 237 | 3125926 | |
| 356 | 1244564 | |
| 427 | 25704 | |
| 713 | 968512 | |
| 754 | 3898716 | |
| 832 | 1075398 | |


The URLs that are still missing are obtained manually.