# Web Scraping in Goodreads

This notebook scrapes the genres of each book from the Goodreads website, because the dataset we use (available [here](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset/data)) does not include them.

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import requests # Library used for making HTTP requests
from bs4 import BeautifulSoup # Library for parsing HTML and XML documents
from tqdm import tqdm # To include a progress bar in the loop
import concurrent.futures # To make multiple http requests simultaneously
import time # For time-related functions

### Useful Function

Here we define the function used inside the scraping loop. It makes an HTTP request to the Goodreads page of the book with the given Goodreads ID and extracts the genres listed on that page.

In [2]:
# Goodreads seems to block requests made from scripts, so we send a
# browser-like User-Agent header for the request to work.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}

def get_genres(goodreadsID):
    """Return the list of genres of a book, or NaN if the request fails."""
    url = f"https://www.goodreads.com/book/show/{goodreadsID}"
    # The 30-second timeout is an arbitrary choice; without one, a stalled
    # request would block its thread indefinitely.
    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code != 200:
        return np.nan
    soup = BeautifulSoup(response.content, 'html.parser')
    genres_section = soup.find("div", class_="BookPageMetadataSection__genres")
    # If the page has no genres block, find() returns None and the next line
    # raises AttributeError; the scraping loop catches it and records a
    # missing value for this book.
    genres_list = genres_section.find_all("span", class_="Button__labelItem")
    # Skip the '...more' button, which is not a genre
    return [item.text for item in genres_list if item.text != '...more']

### Load the dataset

Here we load the books dataset, which provides the Goodreads ID (`goodreads_book_id`) of each book whose genres we want to find.

In [3]:
books = pd.read_csv("archive/books.csv")

In [4]:
print("Books Shape: ", books.shape)
books.head(3)

Books Shape:  (10000, 23)


Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...


### Web Scraping

We split the array of Goodreads IDs into `divs` intervals. We then iterate over the intervals, and within each one over its IDs, with `key_i` and `key_f` selecting which intervals are processed in a given run. This gives us control over the scraping: iterating over the whole column at once, i.e., `books['goodreads_book_id']`, would put all the results at risk if the run were interrupted. Every time an interval is completed, we append its results to a tab-separated file so that no information is lost.
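The interval computation in the next cell uses `int(total / divs)`, which covers every ID only when `total` is divisible by `divs` (as with 10000 and 25). The following is a minimal sketch of a variant that also handles a remainder; the helper name `split_into_intervals` is hypothetical and not part of the notebook.

```python
def split_into_intervals(total, divs):
    """Split range(total) into `divs` consecutive ranges covering every index."""
    step, remainder = divmod(total, divs)
    intervals = []
    start = 0
    for i in range(divs):
        # The first `remainder` intervals take one extra element
        stop = start + step + (1 if i < remainder else 0)
        intervals.append(range(start, stop))
        start = stop
    return intervals

# Same setting as the notebook: 10000 IDs in 25 intervals of 400
intervals = split_into_intervals(10000, 25)
print(len(intervals), len(intervals[0]))

# A total that is not divisible by the number of intervals
uneven = split_into_intervals(42, 4)
print([len(r) for r in uneven])
```

With an even split this reproduces the same intervals as the notebook; with an uneven one, the leftover IDs are spread over the first intervals instead of being dropped.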

In [None]:
total = len(books['goodreads_book_id']) # total number of GoodreadsIDs
divs = 25 # number of intervals in which we divide the GoodreadsIDs
step = int(total / divs) # length of each interval
ranges = [range(step*i - step, step*i) for i in range(1,divs+1)] # an array with the intervals

key_i = 0 # number of the interval at which we start the for loop
key_f = 25 # number of the interval at which we stop the for loop

#progress_bar = tqdm(total=(key_f-key_i), bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')

for i in range(key_i, key_f): 
    
    goodreadsIDs = books['goodreads_book_id'][ranges[i]] # goodreadsIDs of the interval i
    
    progress_bar = tqdm(total=step, bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')
    
    genres_dict = {
        'Goodreads_BookID':[],
        'Genres':[]
    }

    with concurrent.futures.ThreadPoolExecutor() as executor: 
        future_to_goodreadsIDs = {executor.submit(get_genres, goodreadsID): goodreadsID for goodreadsID in goodreadsIDs}
        for future in concurrent.futures.as_completed(future_to_goodreadsIDs):
            goodreadsID = future_to_goodreadsIDs[future]
            try:
                genres_goodreadsID = future.result()
                genres_dict['Goodreads_BookID'].append(goodreadsID)
                genres_dict['Genres'].append(genres_goodreadsID)
            except Exception as e:
                #print(f"Error processing Goodreads ID {goodreadsID}: {e}")
                genres_dict['Goodreads_BookID'].append(goodreadsID)
                genres_dict['Genres'].append(np.nan)
            progress_bar.update(1)
           
    progress_bar.close()

    new_df = pd.DataFrame(genres_dict)

    try:
        existing_df = pd.read_csv("books_genres.txt", sep="\t")
    except FileNotFoundError:
        existing_df = pd.DataFrame()

    combined_df = pd.concat([existing_df, new_df], ignore_index=True)

    combined_df.to_csv("books_genres.txt", sep="\t", index=False)

    time.sleep(2)

    #progress_bar.update(1) 
    
#progress_bar.close()

100.00%|██████████████████████████████████████| 400/400 [00:00<02:04,  3.23it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<02:10,  3.06it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<02:08,  3.10it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<01:56,  3.43it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<02:13,  3.01it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<02:05,  3.20it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<02:02,  3.26it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<02:04,  3.22it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<02:10,  3.08it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<02:42,  2.46it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<02:04,  3.21it/s]
100.00%|██████████████████████████████████████| 400/400 [00:00<03:08,  2.12it/s]
100.00%|████████████████████

1. `ThreadPoolExecutor`: `concurrent.futures.ThreadPoolExecutor()` creates a pool of threads that can run several calls to the `get_genres` function concurrently. Each thread in the pool executes one task at a time.

2. `executor.submit()`: `executor.submit(get_genres, goodreadsID)` submits a task to the thread pool for execution. The first argument is the function to be executed (`get_genres`) and the remaining arguments are passed to that function (here, `goodreadsID`). It returns a `Future` object that represents the future result of the function call.

3. `{executor.submit(get_genres, goodreadsID): goodreadsID for goodreadsID in goodreadsIDs}`: This is a dictionary comprehension that creates a dictionary whose keys are the `Future` objects returned by `executor.submit()` and whose values are the corresponding Goodreads IDs. It is used to keep track of which Goodreads ID corresponds to which `Future`.

4. `concurrent.futures.as_completed()`: This function takes an iterable of `Future` objects and returns an iterator that yields each `Future` as it completes. Here we pass the `future_to_goodreadsIDs` dictionary; iterating over a dictionary yields its keys, which are the `Future` objects created earlier.

5. `for future in concurrent.futures.as_completed(future_to_goodreadsIDs):`: We iterate over the iterator returned by `as_completed()`. The loop receives the `Future` objects in the order in which they complete, which is not necessarily the order in which they were submitted.

6. `goodreadsID = future_to_goodreadsIDs[future]`: Since the `future_to_goodreadsIDs` dictionary records which Goodreads ID corresponds to which `Future`, we can use the current `Future` to look up its Goodreads ID.

7. `future.result()`: `future.result()` returns the result of the function call associated with the `Future`. If the call has not yet finished, `result()` blocks until it does, and if the call raised an exception, `result()` raises it again, which is why it is wrapped in `try`/`except`. In this case, we get the result (i.e., the list of genres of the book) and store it in the `genres_goodreadsID` variable.
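The steps above can be tried without any network access. The following is a minimal sketch in which `fake_get_genres` is a hypothetical stand-in for `get_genres`, and `collect_results` and `max_workers=4` are illustrative choices that do not appear in the notebook.

```python
import concurrent.futures

def fake_get_genres(book_id):
    """Hypothetical stand-in for get_genres: no HTTP request is made."""
    if book_id < 0:
        raise ValueError(f"invalid book id {book_id}")
    return [f"genre-of-{book_id}"]

def collect_results(book_ids, worker, max_workers=4):
    """Run `worker` on every ID in a thread pool; failed IDs map to None."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each Future back to the ID it was submitted with
        future_to_id = {executor.submit(worker, book_id): book_id
                        for book_id in book_ids}
        # Futures are yielded as they finish, not in submission order
        for future in concurrent.futures.as_completed(future_to_id):
            book_id = future_to_id[future]
            try:
                results[book_id] = future.result()
            except Exception:
                # result() re-raises whatever the worker raised
                results[book_id] = None
    return results

results = collect_results([3, 960, -1], fake_get_genres)
print(results[3])
print(results[-1])
```

Because completion order is not deterministic, the results are keyed by ID rather than collected in a positional list.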

In [5]:
combined_df = pd.read_csv("books_genres.txt", sep="\t")

In [6]:
combined_df.isnull().sum()

Goodreads_BookID     0
Genres              42
dtype: int64

In [7]:
# reset_index returns a new DataFrame, so no in-place change is made on a slice
books_missing_genres = combined_df[combined_df['Genres'].isnull()].reset_index(drop=True)

I run the web scraping again for the books that still have no genres. The problem is probably due to a failure during the request.
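An alternative to a second manual pass would be to retry each failed call a few times before giving up. The following is only a sketch of that idea, not what the notebook does: `call_with_retries`, its parameters, and the simulated `flaky` function are all hypothetical.

```python
import time

def call_with_retries(func, arg, attempts=3, base_delay=0.01):
    """Call func(arg), retrying on any exception; return None if all attempts fail."""
    for attempt in range(attempts):
        try:
            return func(arg)
        except Exception:
            if attempt < attempts - 1:
                # Wait a little longer after each failure (exponential backoff)
                time.sleep(base_delay * 2 ** attempt)
    return None

# Simulated worker that fails on its first two calls, then succeeds
calls = {"count": 0}

def flaky(book_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return ["Fiction"]

result = call_with_retries(flaky, 3)
print(result, calls["count"])
```

In a real run the delay would need to be much longer than the `0.01` seconds used here. Retrying only helps with transient failures; a book whose Goodreads ID is wrong will keep failing however many attempts are made.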

In [12]:
total = len(books_missing_genres['Goodreads_BookID']) # total number of GoodreadsIDs
divs = 1 # number of intervals in which we divide the GoodreadsIDs
step = int(total / divs) # length of each interval
ranges = [range(step*i - step, step*i) for i in range(1,divs+1)] # an array with the intervals

key_i = 0 # number of the interval at which we start the for loop
key_f = 1 # number of the interval at which we stop the for loop

#progress_bar = tqdm(total=(key_f-key_i), bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')

for i in range(key_i, key_f): 
    
    goodreadsIDs = books_missing_genres['Goodreads_BookID'][ranges[i]] # goodreadsIDs of the interval i
    
    progress_bar = tqdm(total=step, bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')
    
    genres_dict = {
        'Goodreads_BookID':[],
        'Genres':[]
    }

    with concurrent.futures.ThreadPoolExecutor() as executor: 
        future_to_goodreadsIDs = {executor.submit(get_genres, goodreadsID): goodreadsID for goodreadsID in goodreadsIDs}
        for future in concurrent.futures.as_completed(future_to_goodreadsIDs):
            goodreadsID = future_to_goodreadsIDs[future]
            try:
                genres_goodreadsID = future.result()
                genres_dict['Goodreads_BookID'].append(goodreadsID)
                genres_dict['Genres'].append(genres_goodreadsID)
            except Exception as e:
                #print(f"Error processing Goodreads ID {goodreadsID}: {e}")
                genres_dict['Goodreads_BookID'].append(goodreadsID)
                genres_dict['Genres'].append(np.nan)
            progress_bar.update(1)
           
    progress_bar.close()

    new_df = pd.DataFrame(genres_dict)

    try:
        existing_df = pd.read_csv("books_genres_v2.txt", sep="\t")
    except FileNotFoundError:
        existing_df = pd.DataFrame()

    combined_df_v2 = pd.concat([existing_df, new_df], ignore_index=True)

    combined_df_v2.to_csv("books_genres_v2.txt", sep="\t", index=False)

    time.sleep(2)

    #progress_bar.update(1) 
    
#progress_bar.close()

100.00%|████████████████████████████████████████| 42/42 [00:00<00:12,  3.40it/s]


In [10]:
combined_df = combined_df[~combined_df['Genres'].isnull()].copy()
combined_df

Unnamed: 0,Goodreads_BookID,Genres
0,2657,"['Classics', 'Fiction', 'Historical Fiction', ..."
1,11870085,"['Young Adult', 'Fiction', 'Contemporary', 'Re..."
2,3,"['Fantasy', 'Fiction', 'Young Adult', 'Magic',..."
3,2767052,"['Young Adult', 'Fiction', 'Fantasy', 'Science..."
4,960,"['Fiction', 'Mystery', 'Thriller', 'Mystery Th..."
...,...,...
9995,13616278,"['Fantasy', 'Epic Fantasy', 'Fiction', 'High F..."
9996,4769651,"['Fantasy', 'Middle Grade', 'Fairy Tales', 'My..."
9997,7130616,"['Urban Fantasy', 'Fantasy', 'Romance', 'Paran..."
9998,208324,"['Biography', 'History', 'Politics', 'Nonficti..."


In [13]:
combined_df_v2.isnull().sum()

Goodreads_BookID    0
Genres              7
dtype: int64

In [14]:
# reset_index returns a new DataFrame, so the later .loc assignments
# do not operate on a slice of combined_df_v2
books_missing_genres_v2 = combined_df_v2[combined_df_v2['Genres'].isnull()].reset_index(drop=True)
books_missing_genres_v2

Unnamed: 0,Goodreads_BookID,Genres
0,31426,
1,852460,
2,2855034,
3,6120349,
4,89959,
5,61942,
6,18906484,


The Goodreads IDs of the books above are wrong, so I correct them by hand in the next cell.

In [15]:
books_missing_genres_v2_indices = books_missing_genres_v2.index

# Goodreads_BookID = 31426
books_missing_genres_v2.loc[0,'Goodreads_BookID'] = 439286

# Goodreads_BookID = 852460
books_missing_genres_v2.loc[1,'Goodreads_BookID'] = 20742529

# Goodreads_BookID = 2855034
books_missing_genres_v2.loc[2,'Goodreads_BookID'] = 2424593

# Goodreads_BookID = 89959
books_missing_genres_v2.loc[3,'Goodreads_BookID'] = 355316

# Goodreads_BookID = 6120349
books_missing_genres_v2.loc[4,'Goodreads_BookID'] = 18652490

# Goodreads_BookID = 61942
books_missing_genres_v2.loc[5,'Goodreads_BookID'] = 8356426

# Goodreads_BookID = 18906484
books_missing_genres_v2.loc[6,'Goodreads_BookID'] = 18906484

In [16]:
books_missing_genres_v2

Unnamed: 0,Goodreads_BookID,Genres
0,439286,
1,20742529,
2,2424593,
3,355316,
4,18652490,
5,8356426,
6,18906484,


In [47]:
total = len(books_missing_genres_v2['Goodreads_BookID']) # total number of GoodreadsIDs
divs = 1 # number of intervals in which we divide the GoodreadsIDs
step = int(total / divs) # length of each interval
ranges = [range(step*i - step, step*i) for i in range(1,divs+1)] # an array with the intervals

key_i = 0 # number of the interval at which we start the for loop
key_f = 1 # number of the interval at which we stop the for loop

#progress_bar = tqdm(total=(key_f-key_i), bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')

for i in range(key_i, key_f): 
    
    goodreadsIDs = books_missing_genres_v2['Goodreads_BookID'][ranges[i]] # goodreadsIDs of the interval i
    
    progress_bar = tqdm(total=step, bar_format='{percentage:.2f}%|{bar}| {n_fmt}/{total_fmt} [{remaining}<{elapsed}, {rate_fmt}]')
    
    genres_dict = {
        'Goodreads_BookID':[],
        'Genres':[]
    }

    with concurrent.futures.ThreadPoolExecutor() as executor: 
        future_to_goodreadsIDs = {executor.submit(get_genres, goodreadsID): goodreadsID for goodreadsID in goodreadsIDs}
        for future in concurrent.futures.as_completed(future_to_goodreadsIDs):
            goodreadsID = future_to_goodreadsIDs[future]
            try:
                genres_goodreadsID = future.result()
                genres_dict['Goodreads_BookID'].append(goodreadsID)
                genres_dict['Genres'].append(genres_goodreadsID)
            except Exception as e:
                #print(f"Error processing Goodreads ID {goodreadsID}: {e}")
                genres_dict['Goodreads_BookID'].append(goodreadsID)
                genres_dict['Genres'].append([None])
            progress_bar.update(1)
           
    progress_bar.close()

    new_df = pd.DataFrame(genres_dict)

    try:
        existing_df = pd.read_csv("books_genres_v3.txt", sep="\t")
    except FileNotFoundError:
        existing_df = pd.DataFrame()

    combined_df_v3 = pd.concat([existing_df, new_df], ignore_index=True)

    combined_df_v3.to_csv("books_genres_v3.txt", sep="\t", index=False)

    time.sleep(2)

    #progress_bar.update(1) 
    
#progress_bar.close()

100.00%|██████████████████████████████████████████| 7/7 [00:00<00:04,  1.64it/s]


In [48]:
combined_df_v3

Unnamed: 0,Goodreads_BookID,Genres
0,18652490,"[Graphic Novels, Comics, Horror, Fantasy, Fict..."
1,18906484,[None]
2,8356426,"[Fantasy, Science Fiction, Fiction, Dragons, S..."
3,20742529,"[Cookbooks, Cooking, Nonfiction, Food, Referen..."
4,439286,"[Poetry, Classics, Fiction, Feminism, Literatu..."
5,355316,"[History, Nonfiction, Politics, Classics, Phil..."
6,2424593,"[Fantasy, Fiction, Young Adult, Adventure, His..."


In [56]:
combined_df_v3['Genres'].values[1] = ['Fiction', 'Mystery', 'Chick Lit', 'Contemporary', 'Audiobook', 'Adult', 'Thriller']

In [57]:
combined_df_v3

Unnamed: 0,Goodreads_BookID,Genres
0,18652490,"[Graphic Novels, Comics, Horror, Fantasy, Fict..."
1,18906484,"[Fiction, Mystery, Chick Lit, Contemporary, Au..."
2,8356426,"[Fantasy, Science Fiction, Fiction, Dragons, S..."
3,20742529,"[Cookbooks, Cooking, Nonfiction, Food, Referen..."
4,439286,"[Poetry, Classics, Fiction, Feminism, Literatu..."
5,355316,"[History, Nonfiction, Politics, Classics, Phil..."
6,2424593,"[Fantasy, Fiction, Young Adult, Adventure, His..."


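Finally, we drop the rows without genres from `combined_df` and `combined_df_v2` and merge them with `combined_df_v3` into a single DataFrame.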

In [60]:
combined_df = combined_df[~combined_df['Genres'].isnull()].copy()
combined_df_v2 = combined_df_v2[~combined_df_v2['Genres'].isnull()].copy()

print(combined_df.shape)
print(combined_df_v2.shape)
print(combined_df_v3.shape)

(9958, 2)
(35, 2)
(7, 2)


In [61]:
combined_df = pd.concat([combined_df, combined_df_v2], ignore_index=True)
combined_df = pd.concat([combined_df, combined_df_v3], ignore_index=True)
combined_df

Unnamed: 0,Goodreads_BookID,Genres
0,2657,"['Classics', 'Fiction', 'Historical Fiction', ..."
1,11870085,"['Young Adult', 'Fiction', 'Contemporary', 'Re..."
2,3,"['Fantasy', 'Fiction', 'Young Adult', 'Magic',..."
3,2767052,"['Young Adult', 'Fiction', 'Fantasy', 'Science..."
4,960,"['Fiction', 'Mystery', 'Thriller', 'Mystery Th..."
...,...,...
9995,8356426,"[Fantasy, Science Fiction, Fiction, Dragons, S..."
9996,20742529,"[Cookbooks, Cooking, Nonfiction, Food, Referen..."
9997,439286,"[Poetry, Classics, Fiction, Feminism, Literatu..."
9998,355316,"[History, Nonfiction, Politics, Classics, Phil..."


In [62]:
combined_df.to_csv("books_genres_all.txt", sep="\t", index=False)
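Because `to_csv` writes each list as its string representation, the `Genres` column holds strings such as `"['Classics', 'Fiction']"` when the file is read back. The following is a minimal sketch of how such values could be turned into lists again with the standard library; the helper name `parse_genres` is hypothetical.

```python
import ast

def parse_genres(value):
    """Turn a stored value such as "['Fiction', 'Mystery']" back into a list."""
    if isinstance(value, list):
        # Already a list (e.g. rows that were never written to disk)
        return value
    # literal_eval only evaluates Python literals, unlike eval
    return ast.literal_eval(value)

parsed = parse_genres("['Classics', 'Fiction', 'Historical Fiction']")
print(parsed)
print(len(parsed))
```

Assuming the file is loaded with `pd.read_csv("books_genres_all.txt", sep="\t")`, the helper could be applied with `df['Genres'].apply(parse_genres)`.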