# Add missing information to books

In [1]:
import pandas as pd
import numpy as np

In [5]:
books_df = pd.read_csv("../data/processed/books.csv")
ratings_df = pd.read_csv("../data/processed/ratings.csv")
user_book_ratings_df = pd.read_csv("../data/processed/user_book_ratings.csv")

## Add missing columns to user_book_ratings

Extract only the rows that have at least one missing value

In [6]:
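# .copy() makes this an independent dataframe, so later .loc assignments don't trigger SettingWithCopyWarning
user_book_ratings_df_missing = user_book_ratings_df[user_book_ratings_df.isnull().any(axis=1)].copy()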

In [None]:
import requests

Use the Open Library API to send requests by ISBN and get the title, authors, genre etc.

The API accepts multiple ISBNs per request, but 300 seems to be the upper limit. So the ISBNs are requested in chunks of 300, their information is extracted and then added to our dataframe.
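The chunking described above can be sketched on its own, without touching the network. The helper names `iter_chunks` and `build_url` are made up for illustration, and the URL format simply mirrors the one used in this notebook (bare ISBNs joined with `|`), which is an assumption about what the API accepts.

```python
from typing import Iterator, List, Sequence

BASE_URL = "http://openlibrary.org/api/volumes/brief/json/"  # same endpoint as in this notebook


def iter_chunks(items: Sequence[str], chunk_size: int = 300) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


def build_url(isbns: Sequence[str]) -> str:
    """Join the ISBNs with '|' and append them to the endpoint (assumed format)."""
    return BASE_URL + "|".join(str(isbn) for isbn in isbns)


# Made-up ISBN-like strings, only to show how 650 items split into chunks
fake_isbns = [f"{n:010d}" for n in range(650)]
chunks = list(iter_chunks(fake_isbns, chunk_size=300))
chunk_lengths = [len(chunk) for chunk in chunks]  # two full chunks and one partial chunk
first_url = build_url(chunks[0][:2])
```

The last chunk is simply shorter than the others, so no special handling is needed at the end of the dataframe.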

NOTE: The script below is way more complex than it should be, and on a normal day I wouldn't touch it with a 10-foot pole. But as this processing is a one-time thing, I didn't invest much into making it more Pythonic.

In [None]:
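chunk_size = 300

for i in range(0, len(user_book_ratings_df_missing), chunk_size):
    isbns = user_book_ratings_df_missing['isbn'].iloc[i:i+chunk_size]
    isbns_formatted = '|'.join(isbns.astype(str))  # join() needs strings
    url = f"http://openlibrary.org/api/volumes/brief/json/{isbns_formatted}"
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page

    for book_info in response.json().values():
        for record in book_info['records'].values():
            data = record['data']

            # Assumption: 'authors' and 'publishers' are lists of {'name': ...} dicts.
            # Join the names so that each cell holds a plain string.
            authors = ', '.join(a['name'] for a in data.get('authors', [])) or None
            publishers = ', '.join(p['name'] for p in data.get('publishers', [])) or None
            new_data = {'book_title': data.get('title', None),
                        'book_author': authors,
                        'publication_year': data.get('publish_date', None),  # free-text date, not always a bare year
                        'publisher': publishers}

            # Skip records whose first ISBN is not among our rows
            matches = user_book_ratings_df_missing.index[user_book_ratings_df_missing['isbn'] == record['isbns'][0]]
            if len(matches) == 0:
                continue
            user_book_ratings_df_missing.loc[matches[0], list(new_data.keys())] = list(new_data.values())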

## Add genre to books

Adding a "genre" column with empty lists to populate later

In [None]:
books_df['genre'] = [[] for _ in range(len(books_df))]  # one separate empty list per row

Same as above, this time only adding genres to the books_df.

In [None]:
chunk_size = 300

for i in range(0, len(books_df), chunk_size):
    isbns = books_df['isbn'].iloc[i:i+chunk_size]
    isbns_formatted = '|'.join(isbns.astype(str))  # join() needs strings
    url = f"http://openlibrary.org/api/volumes/brief/json/{isbns_formatted}"
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error page

    for book_info in response.json().values():
        for record in book_info['records'].values():
            data = record['data']
            genre_dicts = data.get('subjects', None)

            if genre_dicts is not None:
                genre_list = [genre['name'] for genre in genre_dicts]
                # Skip records whose first ISBN is not among our rows
                matches = books_df.index[books_df['isbn'] == record['isbns'][0]]
                if len(matches) == 0:
                    continue
                # .at writes the whole list into one cell; .loc treats a list as several values
                books_df.at[matches[0], 'genre'] = genre_list
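Both loops find the target row by comparing the whole `isbn` column against every single record that comes back, which adds up on a dataframe of this size. A minimal sketch of an alternative is to build an ISBN-to-row lookup once and reuse it. `build_isbn_index` is a hypothetical helper, not part of pandas, and the values below are made up.

```python
from typing import Dict, Hashable, Iterable


def build_isbn_index(index_labels: Iterable[Hashable], isbns: Iterable[str]) -> Dict[str, Hashable]:
    """Map each ISBN to the label of the first row that carries it."""
    lookup: Dict[str, Hashable] = {}
    for label, isbn in zip(index_labels, isbns):
        # setdefault keeps the first occurrence, matching the .index[0] behaviour
        lookup.setdefault(str(isbn), label)
    return lookup


# Toy stand-ins for books_df.index and books_df['isbn'] (made-up values)
isbn_to_row = build_isbn_index([10, 11, 12], ["0001", "0002", "0001"])
row = isbn_to_row.get("0002")
missing = isbn_to_row.get("9999")  # None means the ISBN is not in the table, so the record can be skipped
```

With the real data this would be called once as `build_isbn_index(books_df.index, books_df['isbn'])` before the loop, and `isbn_to_row.get(record['isbns'][0])` would replace the boolean-mask lookup inside it.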

## Verdict

It's...taking ages.

The connection to the API drops all the time, without even reaching the chunk size of 300. *user_book_ratings_df_missing* has 109076 rows and *books_df* has 270947 rows. At this rate the whole processing would take far too long to finish.
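For reference, dropped connections like these are commonly handled by retrying with exponential backoff. This was not tried in this notebook, so the following is only a sketch under assumptions: `call_with_retries` is a made-up helper, and the attempt count and delays are arbitrary. The `sleep` argument is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the last error propagate
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Simulated flaky call: fails twice with a dropped connection, then succeeds
calls = {"n": 0}

def flaky_request() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection dropped")
    return "ok"

delays = []  # record the waits instead of sleeping
result = call_with_retries(flaky_request, retry_on=(ConnectionError,), sleep=delays.append)
```

In the loops above, the wrapped call would be something like `lambda: requests.get(url, timeout=60)` with `retry_on=(requests.exceptions.RequestException,)`. Whether that would make the full run feasible here is untested.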

So I shifted my approach to this project to leverage my expertise with LLMs.

The goal now is to use LLMs to take the output of the recommender systems and enrich it with information like genre, drawing on the model's own knowledge (or potentially on dynamic web scraping).