# Books – Data Collection and Integration Plan

## Information Integration and Analytic Data Processing – Project Phase I

<div style="text-align: left; margin: 0;">
  <table style="margin-left: 0;">
    <tr>
      <th>Clara Saldanha</th>
      <th>Daniel João</th>
      <th>Diogo Antunes</th>
      <th>Mariana Tomás</th>
    </tr>
    <tr>
      <td><code>fc64501@alunos.fc.ul.pt</code></td>
      <td><code>fc56455@alunos.fc.ul.pt</code></td>
      <td><code>fc64337@alunos.fc.ul.pt</code></td>
      <td><code>fc60421@alunos.fc.ul.pt</code></td>
    </tr>
  </table>
</div>

**Group:** 12

**Professor:** Assistant Professor André Rodrigues from the Informatics Department




### More interesting questions:

- Is there any correlation between an author's rating and the number of units sold?
- Are books more likely to be bought if they are published by certain publishers?
- Which countries have produced the most influential authors?
- How has our cultural taste in literature changed over this decade?

In [1]:
import pandas as pd

## 1.  Datasets

### 1.1. Amazon Reviews’23 - Books
https://amazon-reviews-2023.github.io/

The **Amazon Reviews’23** dataset is a large-scale collection of Amazon product reviews assembled by the McAuley Lab in 2023. It comprises over 571 million **reviews**, with user feedback that includes ratings, review texts, and helpfulness votes. 

In addition, it provides detailed **item metadata** such as product descriptions, prices, and raw images, alongside relational data like user-item interaction graphs and bought-together links. 

Covering interactions (reviews) from May 1996 to September 2023, the dataset features fine-grained timestamps, cleaner preprocessed review datasets, and standardized data splits. It is mostly used for benchmarking recommendation systems.

We selected a pre-made subset restricted to items categorized as "Books", encompassing both physical books and *ebooks*. We took two files from it, both gzip-compressed to make data transfer easier.

- **Books.jsonl.gz** - Preprocessed file with the review data.
- **meta_books.jsonl.gz** - File with the metadata (descriptions, price, rating, etc.) of the items categorized as Books on Amazon.
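
Since both files are gzipped JSON Lines, they can be inspected without decompressing them to disk. The sketch below is a minimal illustration, assuming one JSON object per line; the helper name `read_jsonl_gz` and its `limit` parameter are hypothetical and not part of the dataset's tooling.

```python
import gzip
import json
from itertools import islice


def read_jsonl_gz(path, limit=None):
    """Yield parsed records from a gzipped JSON Lines file.

    `limit` (hypothetical convenience parameter) stops after that many
    lines, which is useful for peeking at a multi-gigabyte file.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in islice(handle, limit):
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `pd.DataFrame(list(read_jsonl_gz("Books.jsonl.gz", limit=1000)))` would load only the first 1,000 lines into pandas (the path is illustrative).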

### 1.2. Goodreads - 2017
https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html#datasets

This dataset was collected from goodreads.com in late 2017, focusing on user‑submitted public shelves that do not require login to view. All user and review identifiers have been anonymized. It covers approximately 2.36 million books (including works, book series, and authors), 876,000 users, and over 228 million user‑book interactions (ratings, reads, and other shelf statuses).

It also includes **detailed metadata about books**, authors, works, and series, as well as comprehensive review data. Subsets organized by genre (e.g., Children’s, Fantasy, Romance) are provided for more manageable exploration.

The dataset is organized into three groups: (1) metadata of the books, (2) user-book interactions (users' public shelves), and (3) users' detailed book reviews. These datasets can be merged by joining on book/user/review ids.
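
The join on ids can be sketched with pandas as follows. This is only an illustration on toy tables with made-up values; the column names `book_id`, `review_id`, and `rating` match the ones profiled later in this notebook, while `title` values here are placeholders.

```python
import pandas as pd

# Toy stand-ins for the metadata and review tables (values are made up).
books = pd.DataFrame({"book_id": [1, 2], "title": ["Book A", "Book B"]})
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "book_id": [1, 1, 3],
    "rating": [5, 4, 2],
})

# A left join keeps every review; `indicator=True` flags reviews whose
# book_id has no match in the metadata.
merged = reviews.merge(books, on="book_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
```

Inspecting `unmatched` before dropping rows gives a quick estimate of how many reviews would be lost by an inner join.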

### 1.3. Goodreads - 2019
https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks/data

This Kaggle dataset is a curated, clean collection of book information compiled using the **Goodreads API**.

Its creator built it to overcome common issues in other book datasets, such as **missing key columns** and **unclean data**, and **focused on including reliable numerical data** (like ratings and counts) along with important details such as **publisher information, publication dates, and properly formatted author names (with multiple authors delimited by '/')**. Unlike the prior Goodreads dataset (2017), this one has a single table with the book metadata.

It was **initiated in May 2019** and saw **updates until December 2020**, when <ins>changes to the Goodreads API led to its discontinuation</ins>.
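
Because multiple authors share one '/'-delimited string, analyses per author need that field split first. A minimal sketch, assuming the column is called `authors` as in the file profiled later; the title is shortened and the new column names `author_list` and `author` are our own.

```python
import pandas as pd

books = pd.DataFrame({
    "title": ["Harry Potter and the Half-Blood Prince"],
    "authors": ["J.K. Rowling/Mary GrandPré"],
})

# Split the '/'-delimited string into a list, then emit one row per author.
books["author_list"] = books["authors"].str.split("/")
per_author = books.explode("author_list").rename(columns={"author_list": "author"})
```

Note that an author name that itself contains a '/' would be split incorrectly by this approach.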

### 1.4. Book‑Crossing Community
https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset

This Kaggle dataset was collected from the Book‑Crossing community by Cai‑Nicolas Ziegler during a four‑week crawl in August/September 2004. It has been preprocessed and cleaned to remove invalid entries, and user IDs have been anonymized.

The Book‑Crossing community is a global network of readers who share and track books through the website bookcrossing.com. Members register each book online, then “release” it—either in public places for anyone to find (wild releases) or directly to other participants (controlled releases)—and record its journey on the site. By encouraging people to pass books along rather than keep them, Book‑Crossing aspires to transform the whole world into a library. Over time, this community has grown to include forums, meetups, and conventions, with more than 1.9 million members worldwide.

In total, it covers:

- **278,858 users** (with possible demographic information such as location and age),  
- **271,379 unique books** (identified by valid ISBNs and accompanied by metadata like title, author, publication year, publisher, and Amazon cover image links),  
- **1,149,780 ratings**, which may be explicit (on a 1–10 scale) or implicit (denoted by 0).

The data is organized into three main files: **Users**, **Books**, and **Ratings**. This structure **allows us to link user demographics to specific book records and the corresponding rating behavior**. 
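
The linkage between the three files can be sketched as below. The column names (`User-ID`, `ISBN`, `Book-Rating`, `Book-Title`, `Age`) are assumptions about the CSV headers and should be checked against the downloaded files; all values are made up.

```python
import pandas as pd

users = pd.DataFrame({"User-ID": [1, 2], "Age": [34.0, None]})
books = pd.DataFrame({"ISBN": ["0001", "0002"], "Book-Title": ["Book A", "Book B"]})
ratings = pd.DataFrame({
    "User-ID": [1, 1, 2],
    "ISBN": ["0001", "0002", "0001"],
    "Book-Rating": [8, 0, 0],
})

# Ratings is the bridge table: join it to Users and Books on their keys.
linked = ratings.merge(users, on="User-ID", how="left").merge(books, on="ISBN", how="left")

# A rating of 0 denotes implicit feedback, so explicit ratings (1-10)
# are kept in a separate frame.
explicit = linked[linked["Book-Rating"] > 0]
implicit = linked[linked["Book-Rating"] == 0]
```

Separating the two groups matters because averaging over the implicit zeros would drag every book's mean rating down.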

The dataset is particularly useful for exploring or benchmarking recommendation systems and other data‑intensive analyses within the realm of reading preferences and user behavior.

### 1.5. Books Sales and Ratings
https://www.kaggle.com/datasets/thedevastator/books-sales-and-ratings/data

This dataset features various attributes about books from nine different publishers, with **publishing years ranging from the 1600s to 2016**. The data includes attributes regarding sales, ratings, and book identities.

The data was sourced from the linked Kaggle dataset, but <ins>it was originally published by Josh Murrey on data.world under the name Books</ins> (https://data.world/josh-nbu).



### 1.6. Amazon sales rank data for print and kindle books
https://www.kaggle.com/datasets/ucffool/amazon-sales-rank-data-for-print-and-kindle-books

This Kaggle dataset was compiled by NovelRank.com using sales data sourced from Amazon. It encompasses over 61,000 unique ASINs with approximately 200 million sales rank data points, available in both JSON and CSV formats.

It covers a period **from January 1, 2017** to **June 29, 2018** and includes rankings for **both Kindle and Print editions**, reflecting the dynamic and category-specific nature of Amazon's sales ranking system. 

The data collection frequency ranges from:
- **hourly updates for certain consistently tracked titles**
- to updates **as infrequent as once every 24 hours when sales ranks remain unchanged**

<ins>The data therefore captures both the fluctuations and the inherent delays in rank updates</ins>.
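
Because the sampling interval is irregular, comparing titles over time may require putting the ranks on a regular grid first. The sketch below is one possible approach, assuming that carrying the last observed rank forward is acceptable when no update was recorded; the series name `sales_rank` and all values are made up.

```python
import pandas as pd

# Toy sales-rank observations for one ASIN at irregular intervals.
ranks = pd.Series(
    [1200, 1150, 1400],
    index=pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00", "2017-01-01 04:00"]),
    name="sales_rank",
)

# Put the observations on a regular hourly grid and carry the last known
# rank forward across the hours with no recorded update.
hourly = ranks.resample(pd.Timedelta(hours=1)).ffill()
```

Here the 02:00 and 03:00 slots inherit the 01:00 rank, which matches the dataset's description of sparse updates while ranks remain unchanged.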

### 1.7. Amazon Best Sellers of 2010-2020 (Top 100 Books)
https://www.kaggle.com/datasets/jiyoungkimpf/amazon-best-sellers-of-20102020-top-100-books/data


This Kaggle dataset was scraped from Amazon's best sellers webpage and covers an 11-year period, **capturing the top 100 best-selling books for each year from 2010 to 2020**.

### 1.8. Amazon Kindle Books Dataset 2023 (130K Books) 
https://www.kaggle.com/datasets/jiyoungkimpf/amazon-best-sellers-of-20102020-top-100-books/data

This Kaggle dataset comprises data on **130,000 Kindle e-books**, scraped from **publicly available information on Amazon’s Kindle Books webpage in October 2023**.

The data were systematically collected by navigating through the Kindle book category pages on amazon.com/kindle-books, capturing a range of book details and sales information.

### 1.9. Wonderbook 
https://www.kaggle.com/datasets/elvinrustam/books-dataset


This Kaggle dataset is derived from **wonderbk.com**, a popular online bookstore and Amazon competitor, using a **Python-based web scraping approach**. The data acquisition process employed libraries such as requests, Beautiful Soup (bs4), and Selenium, with two primary functions defined: one to gather URLs for individual books, and another to extract pertinent details including title, authors, description, category, publisher, starting price, and publication dates.

## 2. Data Profiling

In [6]:
import os

# Get and print the current working directory
cwd = os.getcwd()
print("Current working directory:", cwd)


Current working directory: /Users/dan/Desktop/2Semester/IPAI/Project


### 2.1 Amazon Reviews’23 - Books

### 2.2. Goodreads - 2017

Two gzipped JSON files (*json.gz*) were downloaded from this dataset's website:
- a detailed book graph encompassing the metadata of about 2.3 million books (**goodreads_books.json.gz *~2.1 GB***),
- and an English-only review subset with around 1.3 million book reviews covering 25 thousand books and 19 thousand users, parsed at sentence level, meaning each review was decomposed into a list of sentences (**goodreads_reviews_spoiler.json.gz *~591MB***)

#### 2.2.1. Goodreads Metadata - *goodreads_books.json.gz*

Due to the file's large size, which made working with it a daunting task (especially with packages like **pandas**), the file was opened and subsetted using ***Google Cloud*** and its ***BigQuery*** API.

The number of entries (rows) was reduced via query by:

- removing entries with no *publication_year* or with a *publication_year* before 2010 (<2010);
- removing entries with a *ratings_count* lower than 3000. Ratings are not the same as reviews and occur in much higher counts across the dataset.
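
The subsetting itself was done in BigQuery, and that query is not reproduced here. As a rough local equivalent, the two rules above could be applied while streaming the gzipped file line by line; the function names `keep_book` and `filter_books` are hypothetical, and the sketch assumes the raw fields are strings that may be empty.

```python
import gzip
import json


def keep_book(book, min_year=2010, min_ratings=3000):
    """Return True if the record passes both subsetting rules."""
    try:
        # Raw values may be strings or empty; treat missing values as 0.
        year = int(book.get("publication_year") or 0)
        ratings = int(book.get("ratings_count") or 0)
    except (TypeError, ValueError):
        return False
    return year >= min_year and ratings >= min_ratings


def filter_books(src_path, dst_path):
    """Stream the gzipped JSON Lines file and write only the kept records."""
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip() and keep_book(json.loads(line)):
                dst.write(line.rstrip("\n") + "\n")
                kept += 1
    return kept
```

Streaming keeps memory use flat regardless of file size, at the cost of being slower than a warehouse query.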



In [30]:
file_path = './Datasets/goodreads/goodreads_3000RCount.json'

df = pd.read_json(file_path, lines=True)


In [9]:
df.head()

Unnamed: 0,title_without_series,title,work_id,book_id,publication_year,num_pages,ratings_count,kindle_asin,publisher,authors,...,language_code,description,link,url,asin,popular_shelves,edition_information,isbn,publication_day,publication_month
0,Heaven is for Real: A Little Boy's Astounding ...,Heaven is for Real: A Little Boy's Astounding ...,11283577,7933292,2010,162.0,229153,B004A90BXS,,"[{'role': '', 'author_id': '3446736'}, {'role'...",...,eng,When Colton Burpo made it through an emergency...,https://www.goodreads.com/book/show/7933292-he...,https://www.goodreads.com/book/show/7933292-he...,,"[{'name': 'to-read', 'count': '751'}, {'name':...",,849946158.0,,
1,All the Little Children,All the Little Children,55111067,34093937,2017,320.0,6911,B01MR8641A,,"[{'role': '', 'author_id': '16370494'}]",...,eng,"When a family camping trip takes a dark turn, ...",https://www.goodreads.com/book/show/34093937-a...,https://www.goodreads.com/book/show/34093937-a...,B01MR8641A,"[{'name': 'to-read', 'count': '6135'}, {'name'...",,,1.0,9.0
2,"Hold Me Closer, Necromancer (Necromancer, #1)","Hold Me Closer, Necromancer (Necromancer, #1)",12671757,8041873,2010,343.0,12065,B003P8Q5L2,Henry Holt and Company,"[{'role': '', 'author_id': '3484883'}]",...,en-US,Sam leads a pretty normal life. He may not hav...,https://www.goodreads.com/book/show/8041873-ho...,https://www.goodreads.com/book/show/8041873-ho...,,"[{'name': 'to-read', 'count': '26606'}, {'name...",,805090983.0,12.0,10.0
3,The Violets of March,The Violets of March,14613617,9724798,2011,296.0,17601,B004IYJEYM,Plume,"[{'role': '', 'author_id': '4467375'}]",...,en-US,A LIBRARY JOURNAL BEST BOOK OF 2011\nA heartbr...,https://www.goodreads.com/book/show/9724798-th...,https://www.goodreads.com/book/show/9724798-th...,,"[{'name': 'to-read', 'count': '18278'}, {'name...",,452297036.0,26.0,4.0
4,"Forced to Kill (Nathan McBride, #2)","Forced to Kill (Nathan McBride, #2)",16417192,11953868,2012,323.0,6051,B008MMQBEC,Thomas & Mercer,"[{'role': '', 'author_id': '1544549'}]",...,en,Trained Marine sniper Nathan McBride is no str...,https://www.goodreads.com/book/show/11953868-f...,https://www.goodreads.com/book/show/11953868-f...,B008MMQBEC,"[{'name': 'to-read', 'count': '599'}, {'name':...",,,21.0,8.0


**Printing some examples of this data with the nested features more visible**

In [10]:

import json

with open(file_path, 'r') as f:
    # Read and print the first 2 JSON objects (one per line)
    for i in range(2):
        line = f.readline().strip()
        if not line:
            break
        obj = json.loads(line)
        print(json.dumps(obj, indent=2))

{
  "title_without_series": "Heaven is for Real: A Little Boy's Astounding Story of His Trip to Heaven and Back",
  "title": "Heaven is for Real: A Little Boy's Astounding Story of His Trip to Heaven and Back",
  "work_id": "11283577",
  "book_id": "7933292",
  "publication_year": "2010",
  "num_pages": "162",
  "ratings_count": "229153",
  "kindle_asin": "B004A90BXS",
  "publisher": "",
  "authors": [
    {
      "role": "",
      "author_id": "3446736"
    },
    {
      "role": "",
      "author_id": "266797"
    }
  ],
  "format": "",
  "country_code": "US",
  "series": [],
  "average_rating": 4.01,
  "similar_books": [
    "8100288",
    "8765461",
    "89375",
    "13158130",
    "6836258",
    "299795",
    "9640038",
    "97862",
    "104189",
    "6436732",
    "232631",
    "6817610",
    "13137883",
    "7570892",
    "11880626",
    "824844",
    "89376",
    "8142508"
  ],
  "image_url": "https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.

| Column Name           | Data Type         | Description                                                                                                         | Example Value                                                                                                                                                          |
|-----------------------|-------------------|---------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| title_without_series  | STRING            | Book title without series info                                                                                      | The Way of Kings (The Stormlight Archive, #1)                                                                                                                          |
| title                 | STRING            | Full book title                                                                                                     | The Way of Kings (The Stormlight Archive, #1)                                                                                                                          |
| work_id               | INT               | Unique identifier for the work (across editions)                                                                    | 8134945                                                                                                                                                                |
| book_id               | INT               | Unique identifier for this specific edition                                                                         | 9647295                                                                                                                                                                |
| publication_year      | INT              | Year the book was published                                                                                          | 2011                                                                                                                                                                   |
| num_pages             | INT               | Number of pages in the book                                                                                          | 1258                                                                                                                                                                   |
| ratings_count         | INT               | Total number of ratings the book has received                                                                        | 3114                                                                                                                                                                   |
| kindle_asin           | STRING            | ASIN for the Kindle version (if applicable)                                                                          | B003P2WO5E                                                                                                                                                             |
| publisher             | STRING            | Name of the publisher                                                                                                | Tom Doherty                                                                                                                                                            |
| authors               | ARRAY<STRUCT>     | Array of author objects; each contains fields like "role" and "author_id"                                             | `[{'role': '', 'author_id': '38550'}]`                                                                                                                                  |
| format                | STRING            | Book format (e.g., Paperback, Hardcover)                                                                             | Mass Market Paperback                                                                                                                                                  |
| country_code          | STRING            | Country code of publication                                                                                          | US                                                                                                                                                                     |
| series                | ARRAY             | List of series IDs (empty array if none)                                                                             | `['178728', '675258']`                                                                                                                                                 |
| average_rating        | FLOAT             | Average rating of the book                                                                                           | 4.64                                                                                                                                                                   |
| similar_books         | ARRAY<INT>        | List of similar book IDs                                                                                             | `[6736971, 10790277, 55398, 12499290, 1166599, 2315892, 15790883, 133664, 8752885, 2890090]`                                                                           |
| image_url             | STRING            | URL for the book's cover image                                                                                       | `https://images.gr-assets.com/books/1436456720m/9647295.jpg`                                                                                                           |
| isbn13                | STRING            | ISBN-13 identifier                                                                                                   | 9780765365279                                                                                                                                                          |
| is_ebook              | BOOLEAN           | Indicates whether the book is available as an ebook                                                                  | False                                                                                                                                                                  |
| text_reviews_count    | INT               | Number of text reviews submitted for the book                                                                        | 561                                                                                                                                                                    |
| language_code         | STRING            | ISO language code for the book                                                                                       | eng                                                                                                                                                                    |
| description           | STRING            | Detailed description of the book                                                                                     | I long for the days before the Last Desolation. Before the Heralds abandoned us and the Knights Radiant turned against us. (truncated for brevity)                   |
| link                  | STRING            | Goodreads URL for the book                                                                                           | `https://www.goodreads.com/book/show/9647295-the-way-of-kings`                                                                                                         |
| url                   | STRING            | URL for the book (often the same as "link")                                                                          | `https://www.goodreads.com/book/show/9647295-the-way-of-kings`                                                                                                         |
| asin                  | STRING            | Amazon Standard Identification Number (if available)                                                                 | (empty)                                                                                                                                                                |
| popular_shelves       | ARRAY<STRUCT>     | Array of shelf objects; each with a shelf "name" and a "count" indicating how many users added the book to that shelf  | `[{'name': 'to-read', 'count': '122552'}, {'name': 'currently-reading', 'count': '10145'}, ...]`                                                                    |
| edition_information   | STRING            | Additional edition information (often empty)                                                                         | (empty)                                                                                                                                                                |
| isbn                  | STRING            | Standard ISBN (may differ from ISBN-13)                                                                                | 0765365278                                                                                                                                                             |
| publication_day       | FLOAT             | Day of publication                                                                                                   | 24.0                                                                                                                                                                   |
| publication_month     | FLOAT             | Month of publication                                                                                                 | 5.0                                                                                                                                                                    |


#### 2.2.2. Goodreads Reviews - *goodreads_reviews_spoiler.json.gz*

This file was also large, so we subsetted it as well using ***Google Cloud*** and its ***BigQuery*** API.

The number of entries (rows) was reduced via query by:

- keeping only entries whose *book_id* is present in the previously subsetted Goodreads metadata, to drop reviews of books that would not be present in the metadata;
- randomly removing 79% of the original reviews while preserving the distribution of reviews across books.
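
One way to reproduce this kind of subsetting locally is to draw a uniform random value per review and keep the rows below a threshold, which preserves each book's share of reviews in expectation. This is a sketch under that assumption, not the BigQuery query that was actually run; the function name `subset_reviews` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def subset_reviews(reviews, kept_book_ids, keep_fraction=0.21, seed=0):
    """Keep reviews of known books, then keep each row with probability keep_fraction."""
    matched = reviews[reviews["book_id"].isin(kept_book_ids)].copy()
    rng = np.random.default_rng(seed)
    # One uniform draw in [0, 1) per review; rows under the threshold survive.
    matched["rand_val"] = rng.random(len(matched))
    return matched[matched["rand_val"] < keep_fraction]
```

If an exact fraction per book is required rather than an expected one, `reviews.groupby("book_id").sample(frac=0.21)` is an alternative in recent pandas versions.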


In [23]:
file_path = './Datasets/goodreads/reviews_matched_subset.json'

df = pd.read_json(file_path, lines=True)

In [26]:
filtered_df = df[df['book_id'] == 9647295]
print(filtered_df.head(1).to_string())


                              review_id  book_id  rating                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            review_sentences  has_spoiler  timestamp                           user_id  rand_val
52110  e02075d42b5d7096c3c4fa86acccf8f3  9647295       4  [{'text': 'Satisfying, and quite the page turner.', 'flag': '0'}, {'text': 'However, a book of this length must inevi

| Column Name      | Data Type         | Description                                                                               | Example Value                                                                                                                                                                                                                                                       |
|------------------|-------------------|-------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| review_id        | STRING            | Unique identifier for the review                                                          | e02075d42b5d7096c3c4fa86acccf8f3                                                                                                                                                                                                                                     |
| book_id          | INT               | Identifier of the book being reviewed                                                     | 9647295                                                                                                                                                                                                                                                              |
| rating           | INT               | The rating given by the reviewer (typically on a scale of 1 to 5)                           | 4                                                                                                                                                                                                                                                                    |
| review_sentences | ARRAY<STRUCT>     | List of review sentences; each element is a structure with keys like "text" and "flag"      | `[{'text': 'Satisfying, and quite the page turner.', 'flag': '0'}, {'text': 'However, a book of this length must inevitably be guilty of meandering, and meander it most certainly did.', 'flag': '0'}, …]`                                                     |
| has_spoiler      | BOOLEAN           | Indicates whether the review contains spoilers                                              | False                                                                                                                                                                                                                                                                |
| timestamp        | DATE              | Date the review was posted                                                                  | 2016-01-20                                                                                                                                                                                                                                                           |
| user_id          | STRING            | Unique identifier for the user who submitted the review                                     | 0b0eb7f583962f6f8c5fd9e08cf27042                                                                                                                                                                                                                                     |


### 2.3. Goodreads - 2019-2020

This file comes from a newer dataset than the previous one (obtained independently) and primarily contains detailed information about the books. A detailed description of each column can be found alongside it.

- **goodReads_2019_2020.csv (~1.5MB)** 


There is an issue with the file's formatting: some entries have 13 fields instead of the expected 12. We have to look more closely to fix it.

In [2]:
import pandas as pd

file_path = "./Datasets/goodreads/goodReads_2019_2020.csv"


df = pd.read_csv(file_path)

ParserError: Error tokenizing data. C error: Expected 12 fields in line 3350, saw 13


In [34]:
# Collect the lines that do not have exactly 12 fields

expected_num_columns = 12
bad_entries = []


with open(file_path, 'r', encoding='utf-8') as f:
    for line_num, line in enumerate(f, start=1):
        # Split the line on commas
        fields = line.strip().split(',')
        if len(fields) != expected_num_columns:
            bad_entries.append((line_num, len(fields), fields))

# Print out the bad entries
for entry in bad_entries:
    line_num, field_count, fields = entry
    print(f"Line {line_num} has {field_count} fields: {fields}")


Line 3350 has 13 fields: ['12224', 'Streetcar Suburbs: The Process of Growth in Boston  1870-1900', 'Sam Bass Warner', ' Jr./Sam B. Warner', '3.58', '0674842111', '9780674842113', 'en-US', '236', '61', '6', '4/20/2004', 'Harvard University Press']
Line 4704 has 13 fields: ['16914', "The Tolkien Fan's Medieval Reader", 'David E. Smith (Turgon of TheOneRing.net', ' one of the founding members of this Tolkien website)/Verlyn Flieger/Turgon (=David E. Smith)', '3.58', '1593600119', '9781593600112', 'eng', '400', '26', '4', '4/6/2004', 'Cold Spring Press']
Line 5879 has 13 fields: ['22128', 'Patriots (The Coming Collapse)', 'James Wesley', ' Rawles', '3.63', '156384155X', '9781563841552', 'eng', '342', '38', '4', '1/15/1999', 'Huntington House Publishers']
Line 8981 has 13 fields: ['34889', "Brown's Star Atlas: Showing All The Bright Stars With Full Instructions How To Find And Use Them For Navigational Purposes And Department Of Trade Examinations.", 'Brown', ' Son & Ferguson', '0.00', '08

In [35]:
# Open the file and check a specific problematic line
file_path = "./Datasets/goodreads/goodReads_2019_2020.csv"

with open(file_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f, start=1):
        # Check a few rows for demonstration
        if i == 3350:  # Based on the error message (line 3350)
            fields = line.strip().split(',')
            print(f"Line {i} has {len(fields)} fields:")
            for idx, field in enumerate(fields, start=1):
                print(f"Field {idx}: {field}")
            break


Line 3350 has 13 fields:
Field 1: 12224
Field 2: Streetcar Suburbs: The Process of Growth in Boston  1870-1900
Field 3: Sam Bass Warner
Field 4:  Jr./Sam B. Warner
Field 5: 3.58
Field 6: 0674842111
Field 7: 9780674842113
Field 8: en-US
Field 9: 236
Field 10: 61
Field 11: 6
Field 12: 4/20/2004
Field 13: Harvard University Press


**The issue seems to arise because the author's name contains a comma** ("Sam Bass Warner, Jr./Sam B. Warner") **but isn't enclosed in quotes**, <ins>so the CSV parser incorrectly splits it into two separate fields</ins>.

This results in an extra column being detected (13 fields instead of the expected 12) because the comma in the name is interpreted as a field delimiter rather than as part of the data.
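To make the effect concrete, here is a minimal sketch using Python's `csv` module on an abbreviated version of the record at line 3350 (the title is shortened and most columns are omitted, so the field counts differ from the real file). It only illustrates the quoting behaviour and is not part of the pipeline.

```python
import csv
import io

# Abbreviated copies of the same record. The first leaves the author field
# unquoted; the second wraps it in double quotes, which is how CSV protects
# commas that belong to the data.
unquoted = '12224,Streetcar Suburbs,Sam Bass Warner, Jr./Sam B. Warner,3.58\n'
quoted = '12224,Streetcar Suburbs,"Sam Bass Warner, Jr./Sam B. Warner",3.58\n'

unquoted_row = next(csv.reader(io.StringIO(unquoted)))
quoted_row = next(csv.reader(io.StringIO(quoted)))

# The unquoted record yields one field more than the quoted one.
print(len(unquoted_row), unquoted_row[2:4])
print(len(quoted_row), quoted_row[2])
```

In the quoted variant the author stays in a single field, which is the layout the fix below restores by merging fields 3 and 4.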

#### Fixing the Issue

In [36]:
import csv

input_file = "./Datasets/goodreads/goodReads_2019_2020.csv"
output_file = "./Datasets/goodreads/goodReads_2019_2020_fixed.csv"

expected_num_columns = 12  # the expected number of fields per row
fixed_count = 0

with open(input_file, 'r', encoding='utf-8', newline='') as infile, \
     open(output_file, 'w', encoding='utf-8', newline='') as outfile:
    
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    
    for row in reader:
        if len(row) == expected_num_columns + 1:
            # Increment the counter for bad rows
            fixed_count += 1
            # Merge the third and fourth fields (index 2 and 3)
            merged_field = row[2].strip() + ", " + row[3].strip()
            fixed_row = row[:2] + [merged_field] + row[4:]
            writer.writerow(fixed_row)
        else:
            writer.writerow(row)

print("Number of fixed rows:", fixed_count)


Number of fixed rows: 4


#### Open fixed File - "goodReads_2019_2020_fixed.csv"

In [39]:

# Path to the fixed CSV file
fixed_file = "./Datasets/goodreads/goodReads_2019_2020_fixed.csv"

# Read the fixed CSV file into a DataFrame
df = pd.read_csv(fixed_file)
df.head(2)

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.


In [41]:

# Examine a sample of the 'date' column to see its raw values
print("\nSample of the 'date' column values:")
print(df['publication_date'].sample(10, random_state=42))


Sample of the 'date' column values:
8663      2/1/2006
483       5/1/1996
8403     1/30/2007
6382     3/21/2006
1844    11/11/2003
398     11/19/1998
1118      9/6/2007
9951      8/1/2006
8758      9/1/2005
3445    12/31/1994
Name: publication_date, dtype: object


In [43]:


# Attempt to convert the 'publication_date' column to datetime; unparseable dates become NaT
df['publication_date_conv'] = pd.to_datetime(df['publication_date'], errors='coerce')

# Count how many dates failed to convert (NaT)
num_bad_dates = df['publication_date_conv'].isna().sum()
total_dates = len(df)
print(f"\nTotal rows: {total_dates}")
print(f"Number of rows with unparseable/invalid dates: {num_bad_dates}")

# Show the problematic date entries
bad_dates = df.loc[df['publication_date_conv'].isna(), 'publication_date'].unique()
print("\nExamples of problematic 'publication_date' entries:")
for entry in bad_dates:
    print(entry)



Total rows: 11127
Number of rows with unparseable/invalid dates: 2

Examples of problematic 'publication_date' entries:
11/31/2000
6/31/1982


The issue here is that the two problematic entries contain invalid dates: "11/31/2000" and "6/31/1982". November and June each have only 30 days, so specifying the 31st day for these months makes the dates unparseable.

**We decided to fix them manually by looking up the correct publication date of each book via its ISBN.**
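As a small illustration of why these two values fail, the sketch below parses a handful of dates with an explicit month/day/year format. The sample values are taken from the outputs above; the explicit `format` argument is our assumption about the column layout and was not used in the original cell.

```python
import pandas as pd

# A few values from the column, including the two impossible dates.
dates = pd.Series(["9/16/2006", "11/31/2000", "6/31/1982", "12/31/1994"])

# With an explicit format, anything that is not a real calendar date becomes NaT.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# Keep the raw strings that could not be parsed so they can be looked up by hand.
invalid = dates[parsed.isna()]
print(invalid.tolist())
```

Once the correct dates have been verified, they could be written back to the affected rows with `df.loc[...]` before converting the whole column.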

In [47]:

# Show a summary of the date conversion
print("\nDate conversion summary:")
df['publication_date_conv'].describe()


Date conversion summary:


count                            11125
mean     2000-08-28 13:08:24.485393152
min                1900-01-01 00:00:00
25%                1998-07-17 00:00:00
50%                2003-03-01 00:00:00
75%                2005-09-30 00:00:00
max                2020-03-31 00:00:00
Name: publication_date_conv, dtype: object

In [49]:
df.dtypes

bookID                            int64
title                            object
authors                          object
average_rating                  float64
isbn                             object
isbn13                            int64
language_code                    object
  num_pages                       int64
ratings_count                     int64
text_reviews_count                int64
publication_date                 object
publisher                        object
publication_date_conv    datetime64[ns]
dtype: object

| Column Name         | Data Type | Description                                                                                          | Example Value                                                       |
|---------------------|-----------|------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|
| bookID              | INT       | A unique identification number for each book (assigned incrementally in this dataset).             | 1                                                                    |
| title               | STRING    | The name under which the book was published.                                                       | Harry Potter and the Half-Blood Prince (Harry Potter  #6)            |
| authors             | STRING    | Names of the authors. Multiple authors are delimited by a hyphen (-) or slash (/).                   | J.K. Rowling/Mary GrandPré                                           |
| average_rating      | FLOAT     | The average rating the book received in total.                                                     | 4.57                                                                 |
| isbn                | STRING    | The International Standard Book Number (unique identifier for the book).                           | 0439785960                                                           |
| isbn13              | STRING    | A 13-digit ISBN used to identify the book.                                                         | 9780439785969                                                        |
| language_code       | STRING    | The primary language code of the book (e.g., "eng" for English).                                   | eng                                                                  |
| num_pages           | INT       | Number of pages contained in the book.                                                             | 652                                                                  |
| ratings_count       | INT       | Total number of ratings the book has received.                                                     | 2095690                                                              |
| text_reviews_count  | INT       | Total number of written text reviews the book has received.                                        | 27591                                                                |
| publication_date    | DATE      | The date when the book was first published. (Can be stored as a DATE or string in a consistent format.) | 2006-09-16  (or 9/16/2006)                                            |
| publisher           | STRING    | The name of the publisher.                                                                         | Scholastic Inc.                                                      |
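Since the `authors` column packs several names into one string, a later step will probably need one row per author. The following is a minimal sketch of that idea on two sample rows from this dataset; it assumes the slash is the only delimiter actually used, and the column name `author_list` is a hypothetical helper.

```python
import pandas as pd

books = pd.DataFrame({
    "bookID": [1, 12224],
    "authors": ["J.K. Rowling/Mary GrandPré",
                "Sam Bass Warner, Jr./Sam B. Warner"],
})

# Split on the slash only: splitting on hyphens or commas would break
# names such as "Sam Bass Warner, Jr.".
books["author_list"] = books["authors"].str.split("/")

# One row per (book, author) pair, with surrounding whitespace removed.
authors_long = books.explode("author_list")[["bookID", "author_list"]]
authors_long["author_list"] = authors_long["author_list"].str.strip()
print(authors_long)
```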


### 2.4. Book‑Crossing Community

This dataset is divided into 3 tables:


#### 2.4.1. Books - *book_crossing_Books.csv*

Books are identified by their respective ISBN, and invalid ISBNs were already removed from the dataset. ***~69.9MB***

In [106]:
file_path = "./Datasets/bookcrossing/book_crossing_Books.csv"

df = pd.read_csv(file_path)


**Check what is really causing this issue!**

It seems that some 'Year-Of-Publication' values are being read as INT and others as STR. Later we'll convert them to Unix time, though. We played around with the acceptable range of valid years to see the issue.

- 4621 entries/rows have "0" as their 'Year-Of-Publication'.
- 12 entries/rows have years after 2024 as their 'Year-Of-Publication', which is implausible since this dataset was published in 2024.
- 3 entries/rows are structurally broken: the titles contain extra escaped quotes (\") and semicolons (;) that appear to be artifacts of the export process. As a result, pandas did not parse the titles properly, the 'Book-Author' values were fused into the 'Book-Title' values, and the remaining values were shifted into the wrong columns.
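The three categories above can be told apart in a single pass. This is a minimal sketch on a made-up five-value sample (the values mimic those seen in the outputs, and the `classify` helper is hypothetical), not the code used to obtain the counts.

```python
import pandas as pd

# Mixed int/str values, similar to what pandas reads from the raw column.
years = pd.Series(["2002", 1995, "0", 2030, "DK Publishing Inc"],
                  name="Year-Of-Publication")

# Same normalisation as in the cells below: cast to str, strip, coerce to numeric.
numeric = pd.to_numeric(years.astype(str).str.strip(), errors="coerce")

max_year = 2024  # the dataset was published in 2024

def classify(value):
    """Label one numeric year value with the kind of problem it has, if any."""
    if pd.isna(value):
        return "non-numeric"   # e.g. a publisher name shifted into the column
    if value == 0:
        return "zero"
    if value > max_year:
        return "future"
    return "ok"

labels = numeric.map(classify)
print(labels.value_counts())
```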

In [114]:

df = pd.read_csv(file_path, low_memory=False)

df['Year-Of-Publication_clean'] = df['Year-Of-Publication'].astype(str).str.strip()
df['Year_numeric'] = pd.to_numeric(df['Year-Of-Publication_clean'], errors='coerce')

# Define a plausible range for valid years. Adjust min_year as needed. (some of these dates feel wrong)
min_year = 1
max_year = 3000

invalid_years = df[
    (df['Year_numeric'].isna()) |
    (df['Year_numeric'] < min_year) |
    (df['Year_numeric'] > max_year)
]

print("Entries with invalid Year-Of-Publication:")
print(invalid_years[['Year-Of-Publication', 'Year-Of-Publication_clean', 'Year_numeric']])



Entries with invalid Year-Of-Publication:
       Year-Of-Publication Year-Of-Publication_clean  Year_numeric
176                      0                         0           0.0
188                      0                         0           0.0
288                      0                         0           0.0
351                      0                         0           0.0
542                      0                         0           0.0
...                    ...                       ...           ...
270794                   0                         0           0.0
270913                   0                         0           0.0
271094                   0                         0           0.0
271182                   0                         0           0.0
271196                   0                         0           0.0

[4621 rows x 3 columns]


In [108]:
df['Year-Of-Publication_clean'] = df['Year-Of-Publication'].astype(str).str.strip()
df['Year_numeric'] = pd.to_numeric(df['Year-Of-Publication_clean'], errors='coerce')

# Define a plausible range for valid years. Adjust min_year as needed. (some of these dates feel wrong)
min_year = 0
max_year = 2024 # this dataset was published in 2024

invalid_years = df[
    (df['Year_numeric'].isna()) |
    (df['Year_numeric'] < min_year) |
    (df['Year_numeric'] > max_year)
]

print("Entries with invalid Year-Of-Publication:")
print(invalid_years[['Year-Of-Publication', 'Year-Of-Publication_clean', 'Year_numeric']])

Entries with invalid Year-Of-Publication:
       Year-Of-Publication Year-Of-Publication_clean  Year_numeric
37487                 2030                      2030        2030.0
55676                 2030                      2030        2030.0
78168                 2030                      2030        2030.0
80264                 2050                      2050        2050.0
97826                 2050                      2050        2050.0
116053                2038                      2038        2038.0
118294                2026                      2026        2026.0
192993                2030                      2030        2030.0
209538   DK Publishing Inc         DK Publishing Inc           NaN
220731           Gallimard                 Gallimard           NaN
221678   DK Publishing Inc         DK Publishing Inc           NaN
228173                2030                      2030        2030.0
240169                2030                      2030        2030.0
255409              

In [109]:
selected_rows = df.loc[[209538, 220731, 221678]]
selected_rows


Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,Year-Of-Publication_clean,Year_numeric
209538,078946697X,"DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)\"";Michael Teitelbaum""",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/078946697X.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/078946697X.01.LZZZZZZZ.jpg,,DK Publishing Inc,
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-Marie Gustave Le ClÃ?Â©zio""",2003,Gallimard,http://images.amazon.com/images/P/2070426769.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/2070426769.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/2070426769.01.LZZZZZZZ.jpg,,Gallimard,
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)\"";James Buckley""",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0789466953.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0789466953.01.LZZZZZZZ.jpg,,DK Publishing Inc,


In [110]:
selected_titles = df.loc[[209538, 220731, 221678], 'Book-Title']

for idx, title in selected_titles.items():
    print(f"Index {idx}:")
    print(title)
    print()

Index 209538:
DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)\";Michael Teitelbaum"

Index 220731:
Peuple du ciel, suivi de 'Les Bergers\";Jean-Marie Gustave Le ClÃ?Â©zio"

Index 221678:
DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)\";James Buckley"



In [115]:
# Set display option to avoid truncation
pd.set_option('display.max_colwidth', None)

# Define a function to fix the fused Book-Title field.
# It expects the problematic pattern '";' (an escaped quote followed by a semicolon)
def fix_title_and_author(fused_value):
    if '";' in fused_value:
        # Split on the problematic pattern
        parts = fused_value.split('";')
        # Remove extra quotes and whitespace from both parts
        title = parts[0].replace('"', '').strip()
        author = parts[1].replace('"', '').strip() if len(parts) > 1 else ''
        return pd.Series([title, author])
    else:
        return pd.Series([fused_value, None])

# Define the ISBNs of the rows to fix
problematic_isbns = ["078946697X", "2070426769", "0789466953"]

# Create a boolean mask for rows with these ISBNs
mask = df['ISBN'].isin(problematic_isbns)

# Apply the fix to the "Book-Title" column for the problematic rows
# The function returns a Series with [corrected title, extracted author]
fixed = df.loc[mask, 'Book-Title'].apply(fix_title_and_author)
fixed.columns = ['Book-Title', 'Book-Author']  # Name the new columns

# Update the original DataFrame with the fixed values
df.loc[mask, ['Book-Title', 'Book-Author']] = fixed

# Verify the results by printing the problematic rows (showing ISBN, Book-Title, and Book-Author)
df.loc[mask]

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L,Year-Of-Publication_clean,Year_numeric
209538,078946697X,"DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)\",Michael Teitelbaum,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/078946697X.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/078946697X.01.LZZZZZZZ.jpg,,DK Publishing Inc,
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\",Jean-Marie Gustave Le ClÃ?Â©zio,Gallimard,http://images.amazon.com/images/P/2070426769.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/2070426769.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/2070426769.01.LZZZZZZZ.jpg,,Gallimard,
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)\",James Buckley,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0789466953.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0789466953.01.LZZZZZZZ.jpg,,DK Publishing Inc,


**The other issues will be left to the data cleaning section.**
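One of the remaining issues is that, in the three fused rows, every value after the title sits one column too far to the left. The fix above also overwrites 'Book-Author', which for these rows appears to hold the publication year. The sketch below shows one possible way to realign such a row in the cleaning phase. It runs on a single hand-built row with placeholder URLs, and the `realign` helper is hypothetical.

```python
import pandas as pd

cols = ["ISBN", "Book-Title", "Book-Author", "Year-Of-Publication",
        "Publisher", "Image-URL-S", "Image-URL-M", "Image-URL-L"]

# One row mimicking the mis-parsed layout (URLs replaced by placeholders).
demo = pd.DataFrame([{
    "ISBN": "078946697X",
    "Book-Title": 'DK Readers: Creating the X-Men, How It All Began '
                  '(Level 4: Proficient Readers)\\";Michael Teitelbaum"',
    "Book-Author": "2000",
    "Year-Of-Publication": "DK Publishing Inc",
    "Publisher": "url-small",
    "Image-URL-S": "url-medium",
    "Image-URL-M": "url-large",
    "Image-URL-L": None,
}], columns=cols)

SEP = '\\";'  # backslash, double quote, semicolon

def realign(row):
    """Split the fused title and shift the trailing values one column right."""
    fused = row["Book-Title"]
    if SEP not in fused:
        return row
    title, author = fused.split(SEP, 1)
    # Values currently stored one column too early (author slot to URL-M slot).
    shifted = row[cols[2:-1]].tolist()
    out = row.copy()
    out["Book-Title"] = title.strip().strip('"')
    out["Book-Author"] = author.strip().strip('"')
    for col, value in zip(cols[3:], shifted):
        out[col] = value
    return out

fixed_demo = demo.apply(realign, axis=1)
print(fixed_demo.iloc[0].to_dict())
```

Splitting on the full three-character sequence also removes the trailing backslash that otherwise stays at the end of the title.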

In [149]:
example = df[df['Book-Author'].str.contains("Sanderson", case=False, na=False)]
print(example.head(1).to_string())


             ISBN                                                                          Book-Title          Book-Author Year-Of-Publication   Publisher                                                   Image-URL-S                                                   Image-URL-M                                                   Image-URL-L Year-Of-Publication_clean  Year_numeric
53897  0590485733  Dog to the Rescue II: Seventeen More True Tales of Dog Heroism (Dog to the Rescue)  Jeannette Sanderson                1995  Scholastic  http://images.amazon.com/images/P/0590485733.01.THUMBZZZ.jpg  http://images.amazon.com/images/P/0590485733.01.MZZZZZZZ.jpg  http://images.amazon.com/images/P/0590485733.01.LZZZZZZZ.jpg                      1995        1995.0


| Column Name               | Data Type        | Description                                                                                         | Example Value                                                                                                                          |
|---------------------------|------------------|-----------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| ISBN                      | STRING           | The International Standard Book Number that uniquely identifies the book.                         | 0590485733                                                                                                                             |
| Book-Title                | STRING           | The title under which the book was published.                                                     | Dog to the Rescue II: Seventeen More True Tales of Dog Heroism (Dog to the Rescue)                                                       |
| Book-Author               | STRING           | The name(s) of the author(s) of the book. If multiple, they are delimited (e.g., by a slash or hyphen). | Jeannette Sanderson                                                                                                                    |
| Year-Of-Publication       | INT   | The year the book was first published.                                                            | 1995                                                                                                                                   |
| Publisher                 | STRING           | The name of the publisher.                                                                          | Scholastic                                                                                                                             |
| Image-URL-S               | STRING           | URL for the small version of the book's cover image.                                                | http://images.amazon.com/images/P/0590485733.01.THUMBZZZ.jpg                                                                             |
| Image-URL-M               | STRING           | URL for the medium version of the book's cover image.                                               | http://images.amazon.com/images/P/0590485733.01.MZZZZZZZ.jpg                                                                             |
| Image-URL-L               | STRING           | URL for the large version of the book's cover image.                                                | http://images.amazon.com/images/P/0590485733.01.LZZZZZZZ.jpg                                                                             |
                                                                                                                    

#### 2.4.2. Ratings - *book_crossing_Ratings.csv*

Ratings (Book-Rating) are either explicit, expressed on a scale from 1 to 10 (higher values denoting higher appreciation), or implicit, expressed by 0. An implicit rating means the user showed interest in the book (e.g. clicked on the book's link) but there is no rating data yet; it is not a user intentionally giving a bad score.

In other words, "0" shows the user didn't rate the book but interacted with it in some way.
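Because implicit zeros are not scores, including them in an average would pull it down. A minimal sketch of separating the two kinds of rating, using the first five rows of this file as sample data:

```python
import pandas as pd

ratings = pd.DataFrame({
    "User-ID": [276725, 276726, 276727, 276729, 276729],
    "ISBN": ["034545104X", "0155061224", "0446520802",
             "052165615X", "0521795028"],
    "Book-Rating": [0, 5, 0, 3, 6],
})

# 0 marks an implicit interaction; 1-10 are explicit scores.
explicit = ratings[ratings["Book-Rating"] > 0]
implicit = ratings[ratings["Book-Rating"] == 0]

mean_all = ratings["Book-Rating"].mean()        # zeros counted as scores
mean_explicit = explicit["Book-Rating"].mean()  # explicit scores only
print(len(implicit), mean_all, mean_explicit)
```

This would also explain the low overall mean and the median of 0 in the `describe()` output further down.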

***~21.6MB***

In [153]:
file_path = "./Datasets/bookcrossing/book_crossing_Ratings.csv"

df_ratings = pd.read_csv(file_path)

In [154]:
df_ratings.dtypes

User-ID         int64
ISBN           object
Book-Rating     int64
dtype: object

In [155]:
df_ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
...,...,...,...
1149775,276704,1563526298,9
1149776,276706,0679447156,0
1149777,276709,0515107662,10
1149778,276721,0590442449,10


In [156]:
df_ratings.describe()

Unnamed: 0,User-ID,Book-Rating
count,1149780.0,1149780.0
mean,140386.4,2.86695
std,80562.28,3.854184
min,2.0,0.0
25%,70345.0,0.0
50%,141010.0,0.0
75%,211028.0,7.0
max,278854.0,10.0


| Column Name  | Data Type | Description                                                           | Example Value |
|--------------|-----------|-----------------------------------------------------------------------|---------------|
| User-ID      | INT       | A unique identifier for the user who submitted the rating.           | 276725        |
| ISBN         | STRING    | The International Standard Book Number for the book being rated.     | 034545104X    |
| Book-Rating  | INT       | The rating given by the user: 1-10 if explicit, 0 if implicit.       | 0             |


#### 2.4.3. Users - *book_crossing_Users.csv*

Contains the users. Note that user IDs (User-ID) have been anonymized and map to integers (they still correspond to their counterparts in the other tables).

Demographic data is provided (Location, Age) if available. Otherwise, these fields contain NULL-values.


***~10.5MB***

In [158]:
file_path = "./Datasets/bookcrossing/book_crossing_Users.csv"

df_users = pd.read_csv(file_path)

In [159]:
df_users.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


| Column Name | Data Type | Description                                                                 | Example Value                         |
|-------------|-----------|-----------------------------------------------------------------------------|--------------------------------------|
| User-ID     | INT       | A unique identifier for each user.                                          | 1                                    |
| Location    | STRING    | The user’s location, typically in the format "city, state/region, country". | nyc, new york, usa                   |
| Age         | FLOAT     | The age of the user. Can be missing (NaN) or imprecise; often needs cleaning. | 18.0             |
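Since Location follows a "city, state/region, country" pattern, it can be split into separate columns, which may help with country-level questions. A minimal sketch on three rows from the preview above. It assumes every value has at least three comma-separated parts, which still needs to be checked on the full file, and the new column names are hypothetical.

```python
import pandas as pd

users = pd.DataFrame({
    "User-ID": [1, 2, 4],
    "Location": ["nyc, new york, usa",
                 "stockton, california, usa",
                 "porto, v.n.gaia, portugal"],
})

# Split from the right so that the last part is always treated as the country.
parts = users["Location"].str.rsplit(",", n=2, expand=True)
parts.columns = ["city", "region", "country"]
parts = parts.apply(lambda col: col.str.strip())

users = users.join(parts)
print(users[["User-ID", "city", "region", "country"]])
```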


### 2.5. Books Sales and Ratings

This dataset has one file - ***Books_Data_Clean.csv ~158.3KB*** 

We ignored the index column; it's probably just a byproduct of the data producer exporting the file with pandas.
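A minimal sketch of how the index column could be dropped at load time, using a two-line in-memory stand-in for the file (only a few of the real columns are included). Converting 'Publishing Year' to a nullable integer is an optional extra step that we are assuming here, not something already done in this notebook.

```python
import io
import pandas as pd

# Tiny stand-in for Books_Data_Clean.csv with a subset of its columns.
csv_text = (
    "index,Publishing Year,Book Name,units sold\n"
    "0,1975.0,Beowulf,7000\n"
)

df_demo = pd.read_csv(io.StringIO(csv_text)).drop(columns=["index"])

# Optional: years are read as floats (1975.0); a nullable integer type
# keeps missing values while removing the decimal part.
df_demo["Publishing Year"] = df_demo["Publishing Year"].astype("Int64")
print(df_demo.dtypes)
```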

In [166]:
file_path = "./Datasets/sales_N_ratings/Books_Data_Clean.csv"

df_BSR = pd.read_csv(file_path)

In [168]:
df_BSR.head(1)

Unnamed: 0,index,Publishing Year,Book Name,Author,language_code,Author_Rating,Book_average_rating,Book_ratings_count,genre,gross sales,publisher revenue,sale price,sales rank,Publisher,units sold
0,0,1975.0,Beowulf,"Unknown, Seamus Heaney",en-US,Novice,3.42,155903,genre fiction,34160.0,20496.0,4.88,1,HarperCollins Publishers,7000


| Column Name           | Data Type      | Description                                                                                                  | Example Value                     |
|-----------------------|----------------|--------------------------------------------------------------------------------------------------------------|-----------------------------------|
| Publishing Year       | NUMERIC (FLOAT)| The year in which the book was published.                                                                  | 1975.0                            |
| Book Name             | STRING         | The title of the book.                                                                                       | Beowulf                           |
| Author                | STRING         | The name(s) of the author(s) of the book.                                                                    | Unknown, Seamus Heaney            |
| language_code         | STRING         | The code representing the language in which the book is written.                                             | en-US                             |
| Author_Rating         | STRING/NUMERIC | The rating of the author based on previous works (may be a category such as "Novice" or a numeric score).      | Novice                            |
| Book_average_rating   | FLOAT          | The average rating given to the book by readers.                                                             | 3.42                              |
| Book_ratings_count    | INT            | The total number of ratings the book received.                                                               | 155903                            |
| genre                 | STRING         | The genre or category of the book.                                                                           | genre fiction                     |
| gross sales           | FLOAT          | The total sales revenue generated by the book.                                                               | 34160.0                           |
| publisher revenue     | FLOAT          | The revenue earned by the publisher from selling the book.                                                   | 20496.0                           |
| sale price            | FLOAT          | The price at which the book was sold.                                                                        | 4.88                              |
| sales rank            | INT            | The rank of the book based on its sales performance.                                                         | 1                                 |
| Publisher             | STRING         | The name of the publisher.                                                                                     | HarperCollins Publishers          |
| units sold            | INT            | The number of units sold for the book.                                                                       | 7000                              |
