# Web Scraping in Goodreads

This notebook scrapes the **ISBN** and publication **Year** of books from the Goodreads website. The books in question are those listed in the [goodbooks-10k](https://github.com/zygmuntz/goodbooks-10k) datasets for which one or both of these fields are missing.

### Importing Libraries

In [1]:
import pandas as pd               # pandas is used for data manipulation and analysis, providing data structures like DataFrames.
import numpy as np                # numpy is used for numerical operations on large, multi-dimensional arrays and matrices.
import requests                   # Library used for making HTTP requests.
from bs4 import BeautifulSoup     # Library for parsing HTML and XML documents.
from tqdm import tqdm             # To include a progress bar in the loop.
import concurrent.futures         # To make multiple http requests simultaneously.
import time                       # time is used for time-related functions
import re                         # re provides regular expression matching operations in strings.
import json                       # json is used for parsing and generating JSON (JavaScript Object Notation) data.
from datetime import datetime     # datetime is used for manipulating dates and times.

### Function Definition

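Here we define the functions that are called in a loop to make HTTP requests to Goodreads. Given a Goodreads work ID (`work_id`), `get_book_editions` lists the editions of that work, `get_missing_data` extracts the ISBN and publication year from a single edition page, and `get_data` combines the two, returning the data of the first English edition that has an ISBN.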

In [2]:
# A browser-like User-Agent header is sent with the requests.
# Goodreads seems to block requests that identify themselves as coming from a script.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}

def get_book_editions(book_workid):
    url = f"https://www.goodreads.com/work/editions/{book_workid}"
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        #print(f"Error fetching the page: {response.status_code}")
        return []
    
    soup = BeautifulSoup(response.content, 'html.parser')

    editions = soup.find_all('div', class_='editionData')

    book_editions = []
    for edition in editions:
        edition_info = {}
        title_tag = edition.find('a', class_='bookTitle')
        if title_tag:
            edition_info['title'] = title_tag.text.strip()
            edition_info['link'] = "https://www.goodreads.com" + title_tag['href']
    
        if edition_info:
            book_editions.append(edition_info)

    return book_editions


def get_missing_data(edition_url):
    response = requests.get(edition_url, headers=headers)

    if response.status_code != 200:
        #print(f"Error fetching the page: {response.status_code}")
        return None, None

    #match = re.search(r'{"__typename":"BookDetails","[^}]*"isbn":"[^"]*"', response.text)
    match = re.search(r'{"__typename":"BookDetails","[^}]*"language":{"__typename":"Language","name":"[^"]*"}', response.text)
    
    if not match:
        #print("No JSON data found")
        return None, None

    json_data = match.group(0) + '}'  # The match ends after the nested Language object, so we close the outer object with '}'

    # Convert the JSON fragment into a dictionary
    try:
        book_details = json.loads(json_data)
    except json.JSONDecodeError:
        #print("Error decoding JSON")
        return None, None

    isbn = book_details.get('isbn')
    publication_time = book_details.get('publicationTime')
    language = book_details.get('language').get('name')

    # Convert the publication time (milliseconds since the Unix epoch) into a four-digit year string
    if publication_time:
        year = datetime.utcfromtimestamp(publication_time / 1000).strftime('%Y-%m-%d')[0:4]
    else:
        year = None

    if language != 'English':
        return None, None

    #print(book_details)
    return isbn, year 


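def get_data(workid):
    # Try each edition in turn and return the first one that yields an ISBN.
    editions = get_book_editions(workid)
    for edition in editions:
        isbn, year = get_missing_data(edition['link'])
        if isbn:
            return isbn, year
    # No edition yielded an ISBN; return a pair so callers can always unpack the result.
    return None, None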
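
The parsing step in `get_missing_data` can be tried out without any network access. The following is a minimal sketch, not the notebook's actual code path: `parse_book_details` is a hypothetical helper, and `SAMPLE_HTML` is a made-up fragment shaped like the JSON that the regular expression above expects (the ISBN is a placeholder, and the real page layout may differ or change over time). The sketch converts the millisecond timestamp by adding a `timedelta` to the epoch, since `datetime.utcfromtimestamp` can fail for dates before 1970 on some platforms.

```python
import json
import re
from datetime import datetime, timedelta, timezone

# Same pattern as in get_missing_data: it matches from the start of the
# BookDetails object up to and including the nested Language object.
BOOK_DETAILS_RE = re.compile(
    r'{"__typename":"BookDetails","[^}]*"language":'
    r'{"__typename":"Language","name":"[^"]*"}'
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_book_details(html):
    """Return (isbn, year) for an English edition, else (None, None)."""
    match = BOOK_DETAILS_RE.search(html)
    if not match:
        return None, None
    try:
        # The match stops after the Language object, so close the outer object.
        details = json.loads(match.group(0) + "}")
    except json.JSONDecodeError:
        return None, None
    if details["language"].get("name") != "English":
        return None, None
    millis = details.get("publicationTime")
    # Pure date arithmetic, so negative (pre-1970) timestamps are handled too.
    year = str((EPOCH + timedelta(milliseconds=millis)).year) if millis is not None else None
    return details.get("isbn"), year


# Hypothetical page fragment; the ISBN is a placeholder, not a real book.
SAMPLE_HTML = (
    '<script>{"__typename":"BookDetails","format":"Paperback",'
    '"isbn":"0000000000","publicationTime":1221462000000,'
    '"language":{"__typename":"Language","name":"English"}}</script>'
)

print(parse_book_details(SAMPLE_HTML))  # ('0000000000', '2008')
```

Keeping the parsing separate from the HTTP request makes it possible to check the regular expression against saved pages before starting a long scraping run.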

### Load the dataset

Here we load the books dataset, which provides the Goodreads work ID (`work_id`) of each book, and select the books whose ISBN or publication year is missing.

In [3]:
books = pd.read_csv("../data_preprocessed/books.csv")

In [4]:
books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   book_id                    10000 non-null  int64  
 1   goodreads_book_id          10000 non-null  int64  
 2   best_book_id               10000 non-null  int64  
 3   work_id                    10000 non-null  int64  
 4   books_count                10000 non-null  int64  
 5   isbn                       9300 non-null   object 
 6   isbn13                     9415 non-null   float64
 7   authors                    10000 non-null  object 
 8   original_publication_year  9979 non-null   float64
 9   original_title             9415 non-null   object 
 10  title                      10000 non-null  object 
 11  language_code              8916 non-null   object 
 12  average_rating             10000 non-null  float64
 13  ratings_count              10000 non-null  int6

In [5]:
missing_data_books = books[books['isbn'].isnull() | books['original_publication_year'].isnull()].index
books_missing_data = books.loc[missing_data_books].reset_index(drop=True)
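
To make the selection above concrete, here is a small sketch on a made-up DataFrame (the rows are invented for illustration and are not taken from the dataset). A book is kept if `isbn` or `original_publication_year` is null.

```python
import numpy as np
import pandas as pd

# Toy stand-in for books.csv; all values are invented.
toy_books = pd.DataFrame({
    "work_id": [101, 102, 103, 104],
    "isbn": ["0000000001", None, "0000000003", None],
    "original_publication_year": [1999.0, 2005.0, np.nan, np.nan],
})

# Boolean mask: True where at least one of the two fields is missing.
mask = toy_books["isbn"].isnull() | toy_books["original_publication_year"].isnull()
toy_missing = toy_books[mask].reset_index(drop=True)

print(toy_missing["work_id"].tolist())  # [102, 103, 104]
```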

### Web Scraping

We split the work IDs of the books with missing data into `divs` intervals. The loop below processes intervals `key_i` through `key_f - 1`, so a single run can cover all intervals or only some of them, and an interrupted run can be resumed from a later interval. This gives us more control than iterating over the whole array of IDs, i.e., `books_missing_data['work_id']`, in a single pass, where a failure partway through (for example, Goodreads blocking the requests) would lose every result collected so far. Each time an interval is completed, its results are appended to a tab-separated file so that no information is lost.
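
As an illustration of the interval splitting described above, the sketch below divides `total` positions into `divs` contiguous intervals and puts any remainder into the last one. `split_into_intervals` is a hypothetical helper, not part of the notebook.

```python
def split_into_intervals(total, divs):
    """Return `divs` contiguous ranges that together cover range(total)."""
    step = total // divs
    return [
        # The last interval runs to `total` so that no position is dropped.
        range(step * i, step * (i + 1) if i < divs - 1 else total)
        for i in range(divs)
    ]


intervals = split_into_intervals(total=7, divs=2)
print([list(r) for r in intervals])  # [[0, 1, 2], [3, 4, 5, 6]]
```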

In [44]:
total = len(books_missing_data['work_id']) # total number of work IDs
divs = 2 # number of intervals into which we divide the work IDs
step = total // divs # length of each interval (any remainder goes into the last one)
# an array with the intervals; the last one extends to `total` so that no work ID is skipped
ranges = [range(step*i, step*(i+1) if i < divs - 1 else total) for i in range(divs)]

key_i = 0 # index of the interval at which we start the for loop
key_f = 2 # index of the interval at which we stop the for loop (exclusive)

#progress_bar = tqdm(total=(key_f-key_i), bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')

for i in range(key_i, key_f): 
    
    workids = books_missing_data['work_id'][ranges[i]] # work IDs of interval i
    
    progress_bar = tqdm(total=len(workids), bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')
    
    data_dict = {
        'WorkID':[],
        'ISBN':[],
        'Year':[]
    }

    with concurrent.futures.ThreadPoolExecutor() as executor: 
        future_to_workid = {executor.submit(get_data, workid): workid for workid in workids}
        for future in concurrent.futures.as_completed(future_to_workid):
            workid = future_to_workid[future]
            try:
                isbn_workid, year_workid = future.result()
                data_dict['WorkID'].append(workid)
                data_dict['ISBN'].append(isbn_workid)
                data_dict['Year'].append(year_workid)
            except Exception:
                # A failed lookup is recorded as missing values so the work ID still appears in the output.
                data_dict['WorkID'].append(workid)
                data_dict['ISBN'].append(np.nan)
                data_dict['Year'].append(np.nan)
            progress_bar.update(1)
           
    progress_bar.close()

    new_df = pd.DataFrame(data_dict)

    try:
        existing_df = pd.read_csv("books_data_missing.txt", sep="\t")
    except FileNotFoundError:
        existing_df = pd.DataFrame()

    combined_df = pd.concat([existing_df, new_df], ignore_index=True)

    combined_df.to_csv("books_data_missing.txt", sep="\t", index=False)

    time.sleep(2)

    #progress_bar.update(1) 
    
#progress_bar.close()

100.00%|██████████████████████████████████████| 359/359 [00:00<06:28,  1.08s/it]
100.00%|██████████████████████████████████████| 359/359 [00:00<07:19,  1.23s/it]
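
The loop above could be combined with a resume step, so that a rerun skips work IDs that are already in the checkpoint file. The following is a rough sketch under stated assumptions: `scrape_interval` and `fake_get_data` are hypothetical names, the worker is a stub that makes no network requests, and `max_workers=4` is an arbitrary choice. The file layout (tab-separated, with columns `WorkID`, `ISBN`, and `Year`) follows the notebook.

```python
import concurrent.futures
import os
import tempfile

import pandas as pd

COLUMNS = ["WorkID", "ISBN", "Year"]


def fake_get_data(workid):
    """Stub standing in for get_data; raises for one ID to simulate a failure."""
    if workid == 3:
        raise RuntimeError("simulated request failure")
    return f"isbn-{workid}", "2000"


def scrape_interval(workids, path, worker, max_workers=4):
    """Scrape the IDs not yet in the checkpoint at `path`, then append them."""
    try:
        existing = pd.read_csv(path, sep="\t", dtype={"ISBN": str, "Year": str})
    except FileNotFoundError:
        existing = pd.DataFrame(columns=COLUMNS)
    done = set(existing["WorkID"])
    pending = [w for w in workids if w not in done]
    if not pending:
        return 0

    rows = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_workid = {executor.submit(worker, w): w for w in pending}
        for future in concurrent.futures.as_completed(future_to_workid):
            workid = future_to_workid[future]
            try:
                isbn, year = future.result()
            except Exception:
                isbn, year = None, None  # record the failure as missing values
            rows.append({"WorkID": workid, "ISBN": isbn, "Year": year})

    new = pd.DataFrame(rows, columns=COLUMNS)
    combined = new if existing.empty else pd.concat([existing, new], ignore_index=True)
    combined.to_csv(path, sep="\t", index=False)
    return len(pending)


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "books_data_missing.txt")
    first = scrape_interval([1, 2, 3], checkpoint, fake_get_data)
    second = scrape_interval([2, 3, 4], checkpoint, fake_get_data)  # only 4 is new
    result = pd.read_csv(checkpoint, sep="\t", dtype={"ISBN": str, "Year": str})

print(first, second, sorted(result["WorkID"].tolist()))
```

Note that in this sketch a failed lookup is also written to the checkpoint, so it is skipped on a rerun; to retry failures, one option would be to drop rows with a missing ISBN before building `done`.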
