
# Data Cleaning

## Objectives

The purpose of this notebook is to **clean, standardize, and prepare the collected datasets** for subsequent exploratory analysis and modeling tasks.

The goal is to transform raw inputs from multiple book datasets into a **reliable, consistent, and mergeable analytical base**, ensuring data integrity and comparability across platforms.

---

## Inputs

| Dataset                    | Source                     | Description                                                               | Format |
| -------------------------- | -------------------------- | ------------------------------------------------------------------------- | ------ |
| `bbe_books.csv`            | Zenodo – *Best Books Ever* | Book metadata including title, author, rating, genres, and description.   | CSV    |
| `books.csv`, `ratings.csv` | GitHub – *Goodbooks-10k*   | Book metadata and user–book interaction data for recommendation modeling. | CSV    |

---

## Tasks in This Notebook

This notebook will execute the following cleaning and preparation steps:

1. **Standardize column formats:**
   Ensure consistent data types and naming conventions across datasets (e.g., convert `isbn` to string, align `author`, `rating`, and `title` formats).

2. **Clean and normalize missing values:**
   Replace placeholder NaNs (`9999999999999`, empty lists, or `"None"`) with `np.nan`, then impute or drop based on analytical importance.

3. **Detect and resolve duplicates:**
   Identify duplicate records using key identifiers (`bookId`, `isbn`, `title + author`) and retain the most complete or relevant entries.

4. **Validate and align categorical values:**
   Standardize genre labels, language codes, and rating scales to ensure comparability between datasets.

5. **Merge compatible datasets:**
   Integrate *BestBooksEver* and *Goodbooks-10k_books* into a unified schema while maintaining referential integrity with the ratings dataset.

6. **Outlier and consistency checks:**
   Review numerical and date fields (e.g., `pages`, `price`, `publishDate`) for unrealistic or extreme values and adjust as needed.

7. **Feature enrichment (optional):**
   Derive or enhance fields such as `popularity_score`, `recency`, or missing genre information using external APIs where beneficial.
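
As a rough illustration of step 2, the sketch below maps placeholder values to a single missing marker. The helper name `normalize_placeholder` and the exact placeholder set are assumptions made for illustration; with pandas, the same idea is usually expressed with `Series.replace(..., np.nan)`.

```python
# Hypothetical placeholder set, based on the examples listed in step 2
PLACEHOLDERS = {"9999999999999", "[]", "None", ""}

def normalize_placeholder(value):
    """Return None for placeholder values, otherwise the value unchanged."""
    if value is None:
        return None
    # NaN is the only float that is not equal to itself
    if isinstance(value, float) and value != value:
        return None
    # Real empty containers count as missing
    if isinstance(value, (list, tuple, set, dict)) and len(value) == 0:
        return None
    # Placeholder strings, including stringified empty lists
    if isinstance(value, str) and value.strip() in PLACEHOLDERS:
        return None
    return value

raw_isbns = ["9780439023481", "9999999999999", "None", float("nan")]
print([normalize_placeholder(v) for v in raw_isbns])
```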

---

## Outputs

* **Cleaned, schema-aligned datasets** ready for exploratory data analysis and modeling.
* **Summary statistics** on completeness, duplicates, and outliers.
* **Processed CSV files** saved for reproducibility in `data/processed/`.

> **Note:** This notebook focuses on *data cleaning and preparation*. Further feature engineering and model-specific transformations will follow in later notebooks.

---


## Navigate to the Parent Directory

Before combining and saving datasets, it helps to work from the project root so that relative paths for loading and saving data resolve consistently.

Before using Python's built-in `os` module to move one level up, inspect the current working directory.

In [1]:
import os

# Get the current working directory
current_dir = os.getcwd()
print(f'Current directory: {current_dir}')

Current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics\notebooks


To change to the parent directory (the project root), run the code below. If you are already in the root folder, skip this step; running it again would move one level too far up.

In [2]:
# Change the working directory to its parent
os.chdir(os.path.dirname(current_dir))
print('Changed directory to parent.')

# Get the new current working directory (the parent directory)
current_dir = os.getcwd()
print(f'New current directory: {current_dir}')

Changed directory to parent.
New current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics
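
Because `os.chdir` moves up one level each time the cell runs, a guarded variant can be safer to re-run. The sketch below is one possible approach: the helper `find_project_root` and the choice of the `notebooks` folder as a marker are assumptions for illustration, not part of the project.

```python
from pathlib import Path

def find_project_root(start, marker="notebooks"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")
```

Calling `os.chdir(find_project_root(os.getcwd()))` would then give the same result whether the notebook starts in `notebooks/` or in the root.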


## Load and Inspect Books Datasets

In this step, we load the previously collected datasets: **Goodbooks-10k** (books) and **Best Books Ever**. We will inspect their structure one more time before starting any merging or cleaning operations.

In [3]:
import pandas as pd 

# load datasets
books_raw = pd.read_csv('data/raw/books.csv')
bbe_raw = pd.read_csv('data/raw/bbe_books.csv')

# create copies for cleaning
books_clean = books_raw.copy()
bbe_clean = bbe_raw.copy()

In [None]:
from pathlib import Path

# Create data folder if not exists
interim_bbe_path = Path("data/interim/bbe")
interim_bbe_path.mkdir(parents=True, exist_ok=True)

interim_gb_path = Path("data/interim/goodbooks")
interim_gb_path.mkdir(parents=True, exist_ok=True)

version = 0

bbe_clean.to_csv(interim_bbe_path / f"bbe_clean_v{version}.csv", index=False)
books_clean.to_csv(interim_gb_path / f"books_clean_v{version}.csv", index=False)

print("Interim datasets saved successfully in data/interim/ directory.")

Interim datasets saved successfully in data/interim/ directory.


In [6]:
# Preview data
display(bbe_clean.head(3))
display(books_clean.head(3))

# Check shape and missing values
for name, df in {'BBE': bbe_clean, 'Books': books_clean,}.items():
    print(f"\n{name} — Shape: {df.shape}")
    df.info()  # prints its report directly; wrapping it in print() also prints "None"
    print(df.isna().sum().sort_values(ascending=False).head())


Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price,bookId_num
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",...,['Locus Award Nominee for Best Young Adult Boo...,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,2993816,30516,5.09,2767052.0
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",...,['Bram Stoker Award for Works for Young Reader...,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...,2632233,26923,7.38,2.0
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",...,"['Pulitzer Prize for Fiction (1961)', 'Audie A...",4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,"['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...,2269402,23328,,2657.0


Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052.0,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3.0,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865.0,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...



BBE — Shape: (52478, 26)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52478 entries, 0 to 52477
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bookId            52478 non-null  object 
 1   title             52478 non-null  object 
 2   series            23470 non-null  object 
 3   author            52478 non-null  object 
 4   rating            52478 non-null  float64
 5   description       51140 non-null  object 
 6   language          48672 non-null  object 
 7   isbn              52478 non-null  object 
 8   genres            52478 non-null  object 
 9   characters        52478 non-null  object 
 10  bookFormat        51005 non-null  object 
 11  edition           4955 non-null   object 
 12  pages             50131 non-null  object 
 13  publisher         48782 non-null  object 
 14  publishDate       51598 non-null  object 
 15  firstPublishDate  31152 non-null  object 
 16  awards        

We will check if the datasets share common identifiers and compatible data types.

In [7]:
bbe_only_columns = set(bbe_clean.columns) - set(books_clean.columns)
print(f'Columns only in BBE: {bbe_only_columns}')

goodbooks_only_columns = set(books_clean.columns) - set(bbe_clean.columns)
print(f'Columns only in Goodbooks: {goodbooks_only_columns}')

Columns only in BBE: {'awards', 'bbeVotes', 'price', 'description', 'genres', 'bookFormat', 'numRatings', 'publishDate', 'likedPercent', 'author', 'rating', 'bbeScore', 'pages', 'setting', 'language', 'coverImg', 'publisher', 'ratingsByStars', 'bookId_num', 'series', 'bookId', 'edition', 'firstPublishDate', 'characters'}
Columns only in Goodbooks: {'goodreads_book_id', 'isbn13', 'work_text_reviews_count', 'ratings_3', 'ratings_5', 'books_count', 'average_rating', 'book_id', 'ratings_count', 'ratings_2', 'original_publication_year', 'language_code', 'authors', 'image_url', 'work_ratings_count', 'best_book_id', 'original_title', 'ratings_1', 'work_id', 'small_image_url', 'ratings_4'}


Based on the initial inspection, we can create a mapping table to align columns from both datasets for merging and analysis.

| **BestBooksEver (BBE)** | **Goodbooks10k_books (GB10k)** | **Notes / Alignment Rationale** |
| --------------------------------- | ------------------------------------------------ | -------------------------------------------------------------------------- |
| `bookId` | `book_id` | Main identifier; ensure both are numeric. |
| `bookId_num` | `goodreads_book_id` | Goodreads identifier; ensure both are numeric for joining. |
| `title` | `title` | Direct match. Used as secondary join key. |
| `series` | — | Only in BBE; could enrich GB10k if available via API. |
| `author` | `authors` | Same meaning. Normalize format. |
| `rating` | `average_rating` | Equivalent — rename to unified `average_rating`. |
| `numRatings` | `ratings_count` | Same measure of total user ratings. |
| `ratingsByStars` | `ratings_1` … `ratings_5` | BBE stores a stringified list of counts, GB10k has explicit columns. Expand or aggregate accordingly. |
| `likedPercent` | — | BBE-only; optional metric of user sentiment. |
| `isbn` | `isbn` / `isbn13` | Common linking key; keep both (string). Use for merges when present. |
| `language` | `language_code` | Standardize to ISO 639-1 (lowercase). |
| `description` | — | BBE-only; valuable for NLP features. |
| `genres` | — | BBE-only; can enrich GB10k tags later. |
| `characters` | — | BBE-only; low modeling priority, but could add narrative metadata. |
| `bookFormat` | — | BBE-only; possible categorical feature. |
| `edition` | — | BBE-only. |
| `pages` | — | BBE-only; numeric, may enrich GB10k metadata. |
| `publisher` | — | BBE-only; possible future feature. |
| `publishDate` | — | BBE-only; can approximate from GB10k’s `original_publication_year`. |
| `firstPublishDate` | `original_publication_year` | Equivalent (date vs year). |
| `coverImg` | `image_url` / `small_image_url` | Same function (cover link). |
| `bbeScore` | — | BBE-only; internal popularity score. |
| `bbeVotes` | `work_ratings_count` | Comparable as popularity proxy. |
| `price` | — | BBE-only; likely non-essential for satisfaction prediction. |
| `setting` | — | BBE-only; can support content enrichment. |
| `awards` | — | BBE-only; categorical enrichment. |
| — | `goodreads_book_id` / `best_book_id` / `work_id` | GB10k-only identifiers; may be used for deeper Goodreads linking. |
| — | `books_count` | GB10k-only; number of editions per work. |
| — | `work_text_reviews_count` | GB10k-only; can complement `numRatings` as engagement metric. |
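
In the preview above, `ratingsByStars` appears as a stringified list of five counts. The sketch below shows one way to expand such a value into the `ratings_1` … `ratings_5` layout. The helper `expand_ratings_by_stars` is hypothetical, and the ordering (5 stars first, 1 star last) is an assumption that should be checked against the dataset documentation.

```python
import ast

def expand_ratings_by_stars(raw):
    """Parse a stringified list of five counts into ratings_5 ... ratings_1."""
    try:
        counts = ast.literal_eval(raw) if isinstance(raw, str) else raw
        if not isinstance(counts, (list, tuple)) or len(counts) != 5:
            return {}
        # ASSUMPTION: the list is ordered from 5 stars down to 1 star
        stars = (5, 4, 3, 2, 1)
        return {f"ratings_{star}": int(count) for star, count in zip(stars, counts)}
    except (ValueError, SyntaxError, TypeError):
        # Unparseable or non-numeric content yields no columns
        return {}

example = "['1593642', '637516', '222366', '39573', '14526']"
print(expand_ratings_by_stars(example))
```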



## Data Cleaning Steps

### Best Books Ever

- Handle identifier columns
- Standardize key columns: `author`, `language`
- Missing data handling strategies
- Normalize genre and format
- Validate for no nulls or duplicates

#### 1. Handle identifier columns
In the previous notebook, we created a new field, `bookId_num`, in the BBE dataset to align with `goodreads_book_id` in the Goodbooks-10k dataset. We also converted both fields to numeric types and confirmed that every `bookId` value yields a valid `bookId_num`. This step is therefore already complete and can be skipped here.

#### 2. Standardize key columns

**Author**

We start the standardization of key columns with `author`. In the BBE dataset, this column often contains a parenthesized qualifier such as "(Goodreads Author)" or "(Illustrator)". We remove these qualifiers to standardize the format. We also create an additional column that stores multiple authors as a list rather than a single string, so that it is ready for feature engineering later on if needed.

In [8]:
import re
import pandas as pd

def clean_and_split_authors(name):
    """
    Cleans author names and returns a list of authors.
    """
    if pd.isna(name):
        return None

    # Remove role descriptors
    cleaned = re.sub(r"\s*\([^)]*\)", "", name)
    
    # Split into list if multiple authors exist
    authors_list = [a.strip() for a in cleaned.split(",") if a.strip()]
    
    return authors_list

In [9]:
# Apply to BestBooksEver dataset
bbe_clean["authors_list"] = bbe_clean["author"].apply(clean_and_split_authors)
bbe_clean["author_clean"] = bbe_clean["authors_list"].apply(lambda x: ", ".join(x) if isinstance(x, list) else None)

# Quick check
bbe_clean[["author", "author_clean", "authors_list"]].head(5)

Unnamed: 0,author,author_clean,authors_list
0,Suzanne Collins,Suzanne Collins,[Suzanne Collins]
1,"J.K. Rowling, Mary GrandPré (Illustrator)","J.K. Rowling, Mary GrandPré","[J.K. Rowling, Mary GrandPré]"
2,Harper Lee,Harper Lee,[Harper Lee]
3,"Jane Austen, Anna Quindlen (Introduction)","Jane Austen, Anna Quindlen","[Jane Austen, Anna Quindlen]"
4,Stephenie Meyer,Stephenie Meyer,[Stephenie Meyer]


In [10]:
from pathlib import Path

version = 1

interim_bbe_path = Path("data/interim/bbe")

bbe_clean.to_csv(interim_bbe_path / f"bbe_clean_v{version}.csv", index=False)

print("Interim author datasets saved successfully in data/interim/ directory.")

Interim author datasets saved successfully in data/interim/ directory.


**Language**

The `language` column in the Best Books Ever dataset uses full names such as “English”, “German”, and “Arabic”. Before transforming the values, we check all unique values to identify any unexpected entries.

In [11]:
# Inspect unique language values
print("Unique language values in BBE dataset:")
bbe_clean['language'] = bbe_clean['language'].astype(str).str.strip()
unique_languages = bbe_clean['language'].unique()

print(f"\nTotal unique values: {len(unique_languages)}\n")
print(unique_languages)

Unique language values in BBE dataset:

Total unique values: 82

['English' 'French' 'German' 'Persian' 'Arabic' 'nan' 'Spanish'
 'Multiple languages' 'Portuguese' 'Indonesian' 'Turkish' 'Polish'
 'Bulgarian' 'Tamil' 'Japanese' 'Romanian' 'Italian'
 'French, Middle (ca.1400-1600)' 'Norwegian' 'Urdu' 'Dutch' 'Finnish'
 'Marathi' 'Chinese' 'Swedish' 'Icelandic' 'Malayalam' 'Croatian'
 'Estonian' 'Greek, Modern (1453-)' 'Russian' 'Kurdish' 'Danish' 'Hindi'
 'Filipino; Pilipino' 'Serbian' 'Bengali' 'Malay' 'Catalan; Valencian'
 'Czech' 'Vietnamese' 'Armenian' 'Georgian' 'Kannada' 'Korean' 'Nepali'
 'Slovak' 'Telugu' 'Hungarian' 'English, Middle (1100-1500)' 'Azerbaijani'
 'Farsi' 'Lithuanian' 'Ukrainian' 'Bokmål, Norwegian; Norwegian Bokmål'
 'Iranian (Other)' 'Faroese' 'Basque' 'Macedonian' 'Maltese' 'Gujarati'
 'Amharic' 'Aromanian; Arumanian; Macedo-Romanian' 'Assamese'
 'Panjabi; Punjabi' 'Albanian' 'Latvian' 'Bosnian' 'Afrikaans' 'Thai'
 'Dutch, Middle (ca.1050-1350)' 'Mongolian' 'Tag

We can see that there are some unexpected values such as:
- _historical forms_ (“English, Middle (1100-1500)”, “French, Middle (ca.1400-1600)”)
- _combined or semicolon-separated entries_ (“Filipino; Pilipino”, “Catalan; Valencian”)
- _multi-language / uncertain cases_ (“Multiple languages”, “Undetermined”)
- _rare languages or dialects_ (“Bokmål, Norwegian; Norwegian Bokmål”, “Aromanian; Arumanian; Macedo-Romanian”)

We will clean the unusual entries by mapping them to the closest language in the ISO 639-1 standard. Unrecognized values will be flagged and replaced with `"unknown"`. We keep `"unknown"` distinct from `NaN` so that unrecognized entries can still be told apart from missing ones.

In [12]:
import numpy as np

# Standardize capitalization & spacing
bbe_clean['language'] = bbe_clean['language'].astype(str).str.strip().str.title()

# Handle NaNs that became strings
bbe_clean['language'] = bbe_clean['language'].replace({'Nan': np.nan})

# Simplify and unify multi-language / dialect forms
replace_map = {
    'Multiple Languages': 'Multilingual',
    'Undetermined': 'Unknown',
    'Iranian (Other)': 'Persian',
    'Farsi': 'Persian',
    'Filipino; Pilipino': 'Filipino',
    'Catalan; Valencian': 'Catalan',
    'Panjabi; Punjabi': 'Punjabi',
    'Bokmål, Norwegian; Norwegian Bokmål': 'Norwegian',
    'Norwegian Nynorsk; Nynorsk, Norwegian': 'Norwegian',
    'Greek, Modern (1453-)': 'Greek',
    'Greek, Ancient (To 1453)': 'Greek',
    'French, Middle (Ca.1400-1600)': 'French',
    'English, Middle (1100-1500)': 'English',
    'Dutch, Middle (Ca.1050-1350)': 'Dutch',
    'Aromanian; Arumanian; Macedo-Romanian': 'Romanian',
    'Mayan Languages': 'Mayan',
    'Australian Languages': 'English'
}

bbe_clean['language'] = bbe_clean['language'].replace(replace_map)



After transforming the values, we apply a mapping to standardize the `language` column using **ISO 639-1 two-letter codes**. Values without a two-letter code keep a longer label (for example `fil`, `myn`, or `multi`).
The mapping dictionaries are stored in the `src/cleaning/mappings/` folder to keep the notebooks cleaner and improve readability.

In [13]:
import json

with open("src/cleaning/mappings/languages_dict.json", "r", encoding="utf-8") as f:
    languages_dict = json.load(f)

# Apply dictionary
bbe_clean['language_clean'] = bbe_clean['language'].str.lower().map(languages_dict)

# Fill remaining NaNs (both originally missing and unmapped languages become 'unknown')
bbe_clean['language_clean'] = bbe_clean['language_clean'].fillna('unknown')
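
Note that `fillna('unknown')` assigns the same label to languages that were missing in the source and to names that the dictionary does not cover. If the two cases should stay separate in `language_clean`, one option is sketched below. The helper `to_language_code` and the three-entry `SAMPLE_LANGUAGES` mapping are hypothetical stand-ins, not the contents of `languages_dict.json`.

```python
import math

# Hypothetical excerpt; the real mapping lives in languages_dict.json
SAMPLE_LANGUAGES = {"english": "en", "french": "fr", "persian": "fa"}

def to_language_code(name, mapping):
    """Map a language name to a code, keeping missing and unrecognized apart."""
    if name is None or (isinstance(name, float) and math.isnan(name)):
        return None  # truly missing: stays missing
    # Present but not covered by the mapping: flagged as 'unknown'
    return mapping.get(str(name).strip().lower(), "unknown")

for value in ["English", "Klingon", None]:
    print(repr(value), "->", repr(to_language_code(value, SAMPLE_LANGUAGES)))
```

Applied with `Series.map`, this would leave missing languages as `NaN` and reserve `unknown` for names the mapping does not recognize.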

In [14]:
# check again for unique language values
print("Unique language values in BBE dataset:")
unique_languages = bbe_clean['language_clean'].unique()

print(f"\nTotal unique values: {len(unique_languages)}\n")
print(unique_languages)

Unique language values in BBE dataset:

Total unique values: 62

['en' 'fr' 'de' 'fa' 'ar' 'unknown' 'es' 'multi' 'pt' 'id' 'tr' 'pl' 'bg'
 'ta' 'ja' 'ro' 'it' 'no' 'ur' 'nl' 'fi' 'mr' 'zh' 'sv' 'is' 'ml' 'hr'
 'et' 'el' 'ru' 'da' 'hi' 'fil' 'cs' 'vi' 'hy' 'ka' 'kn' 'ko' 'ne' 'sk'
 'te' 'hu' 'az' 'lt' 'uk' 'fo' 'eu' 'mk' 'mt' 'gu' 'am' 'pa' 'sq' 'lv'
 'bs' 'th' 'mn' 'tl' 'gl' 'sl' 'myn']


In [15]:
language_breakdown = (
    bbe_clean['language']
    .value_counts()
    .to_frame('count')
)

language_breakdown['percentage'] = (
    language_breakdown['count'] / len(bbe_clean) * 100
).round(2)

print(language_breakdown)


           count  percentage
language                    
English    42663       81.30
Arabic      1038        1.98
Spanish      687        1.31
French       580        1.11
German       528        1.01
...          ...         ...
Mongolian      1        0.00
Aleut          1        0.00
Unknown        1        0.00
Mayan          1        0.00
Duala          1        0.00

[71 rows x 2 columns]


In [18]:
from pathlib import Path

version = 2

interim_bbe_path = Path("data/interim/bbe")

bbe_clean.to_csv(interim_bbe_path / f"bbe_clean_v{version}.csv", index=False)

print("Interim language datasets saved successfully in data/interim/ directory.")

Interim language datasets saved successfully in data/interim/ directory.


**Dates**

The BBE dataset has two publication fields: `publishDate` and `firstPublishDate`. `firstPublishDate` represents the original publication date, while `publishDate` refers to the date of a specific, often more recent, edition or reprint. Following publishing-domain expertise, we assume that the recency of `firstPublishDate` is more relevant for modeling book satisfaction, as it reflects when the book was first introduced to readers. Therefore, we focus on cleaning and standardizing `firstPublishDate` and use `publishDate` only if `firstPublishDate` is missing.

While the majority of the dates follow the 'MM/DD/YY' format, a first cleaning attempt showed that some do not. We therefore use a more robust parsing strategy: first normalize textual formats (for example, by removing ordinal suffixes), then parse the result into datetime objects.

In [19]:
from datetime import datetime
from dateutil import parser

# Fallback for missing date parts; without it, dateutil fills them from today's date
DEFAULT_DATE = datetime(1900, 1, 1)

def clean_date_string(date_str):
    """Remove ordinal suffixes and unwanted characters from a date string."""
    if pd.isna(date_str):
        return np.nan
    # remove st, nd, rd, th (like 'April 27th 2010' → 'April 27 2010')
    cleaned = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', str(date_str))
    return cleaned.strip()

def parse_mixed_date(date_str):
    """Try to parse a variety of date formats safely."""
    if pd.isna(date_str) or date_str == '':
        return np.nan
    try:
        # Use dateutil to parse most human-readable formats;
        # year-only values such as '2003' resolve to 2003-01-01
        return parser.parse(str(date_str), fuzzy=True, default=DEFAULT_DATE)
    except Exception:
        # Anything dateutil cannot interpret is treated as missing
        return np.nan

In [20]:
# Apply cleaning to both columns
for col in ['firstPublishDate', 'publishDate']:
    bbe_clean[f'{col}_clean'] = (
        bbe_clean[col]
        .astype(str)
        .replace({'nan': np.nan, '': np.nan})
        .apply(clean_date_string)
        .apply(parse_mixed_date)
    )

In [21]:
# Combine the two columns: prefer firstPublishDate, else publishDate
bbe_clean['publication_date_clean'] = (
    bbe_clean['firstPublishDate_clean'].combine_first(bbe_clean['publishDate_clean'])
)
# Reconvert to datetime safely before using .dt
bbe_clean['publication_date_clean'] = pd.to_datetime(bbe_clean['publication_date_clean'], errors='coerce')

# Format as ISO standard
bbe_clean['publication_date_clean'] = bbe_clean['publication_date_clean'].dt.strftime("%Y-%m-%d")

# Check a sample of remaining nulls
bbe_clean[bbe_clean['publication_date_clean'].isna()][['title', 'firstPublishDate', 'publishDate', 'publication_date_clean']].head(10)

Unnamed: 0,title,firstPublishDate,publishDate,publication_date_clean
2271,To Dream the Blackbane,,Published,
2989,Betrayal In Black,,Best Books to Read When the Snow Is Falling\n\...,
3138,Stepping Beyond Intention,,,
3160,The Fyfield Plantation,,,
4359,Angles - Part I,,,
4508,لوحات ناجي العلي,,في أحضان الكتب - الجزء الثاني\n\n111 books — 2...,
7869,Mayfair Witches Collection,,"Best Horror Novels\n\n1,773 books — 5,396 vote...",
8409,Night That Jimi Died,,50 Books That Changed Me\n\n319 books — 259 vo...,
9054,الشيخ زعرب وآخرون,,"أفضل مجموعة قصصية عربية\n\n1,069 books — 401 v...",
11231,World Peace: The Voice of a Mountain Bird,,September 9th 214,


In [22]:
# Filter rows where the unified publication date is missing
total = len(bbe_clean)
bbe_missing_dates = bbe_clean.loc[bbe_clean['publication_date_clean'].isna()]
missing_count = len(bbe_missing_dates)

print(f"Missing publication dates: {missing_count} of {total} ({missing_count/total:.2%})")

# Preview key columns
bbe_missing_dates[['title', 'author', 'firstPublishDate', 'publishDate', 'publication_date_clean']].head(10)

Missing publication dates: 588 of 52478 (1.12%)


Unnamed: 0,title,author,firstPublishDate,publishDate,publication_date_clean
2271,To Dream the Blackbane,Richard J. O'Brien,,Published,
2989,Betrayal In Black,Mark M. Bello (Goodreads Author),,Best Books to Read When the Snow Is Falling\n\...,
3138,Stepping Beyond Intention,Daniel Mangena (Goodreads Author),,,
3160,The Fyfield Plantation,Andrew R. Williams (Goodreads Author),,,
4359,Angles - Part I,Erin Lockwood (Goodreads Author),,,
4508,لوحات ناجي العلي,ناجي العلي,,في أحضان الكتب - الجزء الثاني\n\n111 books — 2...,
7869,Mayfair Witches Collection,Anne Rice,,"Best Horror Novels\n\n1,773 books — 5,396 vote...",
8409,Night That Jimi Died,Darragh J Brady,,50 Books That Changed Me\n\n319 books — 259 vo...,
9054,الشيخ زعرب وآخرون,يوسف السباعي,,"أفضل مجموعة قصصية عربية\n\n1,069 books — 401 v...",
11231,World Peace: The Voice of a Mountain Bird,"Amit Ray, Banani Ray (Goodreads Author)",,September 9th 214,
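
Two-digit years in 'MM/DD/YY' strings are ambiguous. Parsers resolve them with a pivot rule, so an older title can end up with a publication date in the future. A common remedy is to shift such dates back by one century. The sketch below is a minimal version of that check: the helper `fix_future_date` is hypothetical, and the reference date is fixed only to keep the example reproducible.

```python
from datetime import datetime

def fix_future_date(parsed, today):
    """Shift a date 100 years back if it lies after `today`."""
    if parsed is None or parsed <= today:
        return parsed
    try:
        return parsed.replace(year=parsed.year - 100)
    except ValueError:
        # 29 February has no counterpart in a non-leap year
        return parsed.replace(year=parsed.year - 100, day=28)

today = datetime(2025, 1, 1)  # fixed reference date for reproducibility
print(fix_future_date(datetime(2055, 11, 1), today))
print(fix_future_date(datetime(2008, 9, 14), today))
```

Whether a century shift is appropriate depends on the data: it assumes that no title has a genuine future publication date.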


In [23]:
from pathlib import Path

version = 3

interim_bbe_path = Path("data/interim/bbe")

bbe_clean.to_csv(interim_bbe_path / f"bbe_clean_v{version}.csv", index=False)

print("Interim dates datasets saved successfully in data/interim/ directory.")

Interim dates datasets saved successfully in data/interim/ directory.


**Publisher**

Publisher names can vary significantly in formatting, including differences in capitalization, punctuation, and spacing. To standardize the `publisher` column, we will convert all entries to lowercase and strip any leading or trailing whitespace. This will help reduce variability and improve consistency across the dataset.

In [24]:
print("Sample publishers:")
print(bbe_clean['publisher'].drop_duplicates().sample(30, random_state=42).values)

Sample publishers:
['Igniter' 'Grace Press' 'A New Reality Publishing' 'Geração Editorial'
 'Linear B Editora' 'Addison-Wesley' 'Penguin Group(CA)'
 'Northern Lights ATP' 'Random House Vintage' 'World Castle Publishing'
 'Monica Schaumann' 'Weber Books' 'Editori Riuniti' 'Jugoslavija'
 'Laurann Dohner' 'Pendo' 'Gardners Books' '47North' 'منشورات القاسمي'
 'Elle Casey' 'Dial Press Trade Paperback' 'Atheneum Books'
 'Quinta Essência' 'Chatter Creek Publishing' 'Sparkplug Books'
 'Wahlström & Widstrand' 'HarperCollins Canada' 'Backwoods'
 'DMP / Dark Horse' 'La línea del horizonte']


In [25]:

# Strip, lowercase, remove extra spaces and punctuation
bbe_clean['publisher'] = (
    bbe_clean['publisher']
    .astype(str)
    .str.strip()
    .str.lower()
    .str.replace('"', '', regex=False)
    .str.replace("'", '', regex=False)
    .str.replace(r'[.,]', '', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
)


In [26]:
# Inspect unique publisher values 
bbe_clean['publisher'] = bbe_clean['publisher'].astype(str).str.strip() 
unique_publisher = bbe_clean['publisher'].unique() 

print(f"\nTotal unique publisher values: {len(unique_publisher)}\n") 


Total unique publisher values: 10741



In [27]:
# Replace publisher names that consist only of digits with 'unknown'
def clean_numeric_publishers(x):
    if re.match(r'^\d+$', x.strip()):
        return 'unknown'
    return x

bbe_clean['publisher'] = bbe_clean['publisher'].apply(clean_numeric_publishers)
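
One side effect of `astype(str)` is that missing publishers become the literal string `'nan'` and are then counted as a publisher name. The sketch below combines the normalization steps above in a single plain-Python function that keeps missing values missing. The helper `normalize_publisher` is hypothetical and is meant to be applied with `Series.map`.

```python
import re

def normalize_publisher(name):
    """Lowercase and tidy a publisher name; keep missing values as None."""
    # None and NaN stay missing instead of turning into the string 'nan'
    if name is None or (isinstance(name, float) and name != name):
        return None
    text = str(name).strip().lower()
    text = re.sub(r"[\"'.,]", "", text)       # drop quotes, periods, commas
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    if not text:
        return None
    if re.fullmatch(r"\d+", text):
        return "unknown"                      # purely numeric names carry no information
    return text

for raw in ["J. S. Sanders  and Company", "2010", float("nan")]:
    print(repr(raw), "->", repr(normalize_publisher(raw)))
```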


This cleaning step reduced the number of unique publisher names from **11,111 to 10,764**.
Since **English-language books represent 81% of the catalogue**, the analysis will focus on this segment.
We will **standardize major English-language publishing groups**, consolidating their **imprints and subsidiaries**, and apply **fuzzy matching** to unify names with **minor variations**.

In [28]:
# load publishers dictionary
with open("src/cleaning/mappings/publishers_dict.json", "r", encoding="utf-8") as f:
    publishers_dict = json.load(f)

In [29]:
from rapidfuzz import process, fuzz

# Get top 10000 most common publishers
top_n = 10000
publisher_counts = bbe_clean['publisher'].value_counts()
top_publishers = publisher_counts.head(top_n).index.tolist()

# Create a mapping for top publishers only
standardization_map = {}
processed = set()

for pub in top_publishers:
    if pub in processed:
        continue
    
    # Find similar publishers in the top list
    matches = process.extract(pub, top_publishers, scorer=fuzz.ratio, limit=5)
    
    # Group similar ones (score > 90)
    similar = [m[0] for m in matches if m[1] > 90]
    canonical = similar[0]  # Use first as canonical
    
    for similar_pub in similar:
        standardization_map[similar_pub] = canonical
        processed.add(similar_pub)

# Apply the mapping
bbe_clean['publisher_standardized'] = bbe_clean['publisher'].replace(standardization_map)

# Then apply manual mapping
bbe_clean['publisher_standardized'] = bbe_clean['publisher_standardized'].replace(publishers_dict)
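
In the loop above, a name that was already assigned to a group can be reassigned when it also matches a later publisher. The sketch below shows a variant that assigns each name once, anchored on the most frequent spelling. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `rapidfuzz` so that it runs without extra dependencies; its scores are on a 0 to 1 scale and will not match `fuzz.ratio` exactly. The helper `build_canonical_map` and the sample names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def build_canonical_map(names, threshold=0.9):
    """Group near-identical names under their most frequent spelling."""
    counts = Counter(names)
    # Most frequent names first, so each group is anchored on its commonest form
    ordered = [name for name, _ in counts.most_common()]
    canonical_map = {}
    for name in ordered:
        if name in canonical_map:
            continue  # already assigned to an earlier group
        canonical_map[name] = name
        for other in ordered:
            if other in canonical_map:
                continue
            if SequenceMatcher(None, name, other).ratio() > threshold:
                canonical_map[other] = name
    return canonical_map

publishers = (["penguin books"] * 3 + ["penguin book"]
              + ["harpercollins"] * 2 + ["harper collins"])
mapping = build_canonical_map(publishers)
print(mapping)
```

This pairwise comparison is quadratic in the number of distinct names, so on the full publisher list a library such as `rapidfuzz` remains the more practical choice.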

In [30]:
standardized_unique_publisher = bbe_clean['publisher_standardized'].unique() 

print(f"\nTotal unique publisher values: {len(standardized_unique_publisher)}\n") 

print("Sample publishers:")
print(bbe_clean['publisher_standardized'].drop_duplicates().sample(30, random_state=42).values)


Total unique publisher values: 9993

Sample publishers:
['polirom' 'central avenue publishing' 'siglo xxi ediciones'
 'oxford university press' 'harvard business review press'
 'دار البشائر الاسلامية' 'pedro sajini publishing'
 'nantier beall minoustchine publishing' 'flux' 'dark blade publishing'
 'm evans and company' 'granta uk' 'tutku yayınevi' 'دار الفكر المعاصر'
 'replica books' 'mayandree michel' 'obuolys' 'dva' 'audiogo ltd'
 'willow lane publishing' 'zeppelin publishing company' 'lulu press'
 'cavalier press' 'النور للإنتاج الإعلامى والتوزيع' 'endeavour compass'
 'feiwel & friends' 'charming gal publications' 'diamond pocket books'
 'j s sanders and company' 'el leon literary arts']


The cleaning process reduced the number of unique publisher names from **11,111 to 9,993**, a decrease of about **10%**.
Given that the dataset includes books in multiple languages and many small or independent publishers, this reduction is a **satisfactory outcome**.

To further evaluate the effectiveness of the cleaning, we will analyze the **proportion of titles associated with the most common publishers**.
This will help us assess how well the standardization process **consolidated the publisher catalog** and captured the main publishing groups.

In [31]:
# Define the core publisher groups
major_publishers = [
    'penguin random house', 'harpercollins', 'macmillan',
    'simon & schuster', 'hachette', 'bloomsbury',
    'amazon publishing', 'scholastic'
]

# Create a flag
bbe_clean['is_major_publisher'] = bbe_clean['publisher_standardized'].isin(major_publishers)

# Count results
total_books = len(bbe_clean)
major_books = bbe_clean['is_major_publisher'].sum()
share_major = major_books / total_books * 100

print(f"Books from mapped major publishers: {major_books} of {total_books} ({share_major:.2f}%)")


Books from mapped major publishers: 9301 of 52478 (17.72%)


About 18% of all titles (17.72%) now belong to one of the standardized major publisher groups.
The remaining publishers represent independent, regional, or self-published works.
Further improvements (e.g., mapping academic and international publishers) could expand this coverage to 25–30%, but we leave it as is for now.

In [32]:
from pathlib import Path

version = 4

interim_bbe_path = Path("data/interim/bbe")

bbe_clean.to_csv(interim_bbe_path / f"bbe_clean_v{version}.csv", index=False)

print("Interim publisher datasets saved successfully in data/interim/ directory.")

Interim publisher datasets saved successfully in data/interim/ directory.


**Book Format**

This step standardizes the `bookFormat` field across multiple languages and inconsistent label variations found in the dataset.  
The goal is to translate all format names into English and consolidate equivalent values (e.g., *“Capa dura”*, *“Gebundene Ausgabe”*, *“Hard back”*) under unified categories such as **Hardcover**, **Paperback**, **Ebook**, and **Audiobook**.

This cleaning ensures that:
- Format values are consistent for analysis and visualization.  
- Non-English or rare variants are translated and grouped appropriately.  
- Missing or unrecognized entries are handled under a neutral category: **Other / Unknown**.  

By applying a mapping dictionary, we make the variable suitable for aggregation, comparison, and predictive modeling. After transformation, we verify the result by inspecting the number of unique standardized values.
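
To make the mapping idea concrete, here is a minimal sketch in plain Python. The entries in `FORMAT_MAP` are a hypothetical excerpt chosen for illustration; the actual mapping is stored in `format_dict.json` and is not reproduced here. The function name `standardize_format` is also an assumption.

```python
# Hypothetical excerpt; the real mapping lives in format_dict.json
FORMAT_MAP = {
    "hardcover": "Hardcover",
    "hard cover": "Hardcover",
    "hardback": "Hardcover",
    "capa dura": "Hardcover",          # Portuguese
    "gebundene ausgabe": "Hardcover",  # German
    "paperback": "Paperback",
    "taschenbuch": "Paperback",        # German
    "broché": "Paperback",             # French
    "kindle edition": "Ebook",
    "ebook": "Ebook",
    "audio cd": "Audiobook",
}
MISSING = {"", "nan", "none"}


def standardize_format(raw, mapping=FORMAT_MAP, fallback="Other / Unknown"):
    """Normalize a raw format label and map it to a standardized category."""
    key = str(raw).strip().lower()
    if key in MISSING:
        return fallback
    # Labels that are not in the mapping also go to the fallback category
    return mapping.get(key, fallback)


print(standardize_format("  Capa dura "))
print(standardize_format("author's website"))
```

One design difference is worth noting: this sketch sends every unmapped label to the fallback category, whereas `Series.replace` with a dictionary leaves unmapped values unchanged.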


In [33]:
# Inspect unique format values 
bbe_clean['bookFormat'] = bbe_clean['bookFormat'].astype(str).str.strip() 
unique_format = bbe_clean['bookFormat'].unique() 

print(f"\nTotal unique book format values: {len(unique_format)}\n") 


Total unique book format values: 135



In [34]:
print(unique_format)

['Hardcover' 'Paperback' 'Mass Market Paperback' 'Kindle Edition'
 'Audiobook' 'ebook' 'nan' 'Board book' 'Boxed Set' 'Leather Bound'
 'Capa dura' 'Trade Paperback' 'Box Set' 'Board Book' 'Nook'
 'Library Binding' 'Capa comum' 'Pasta blanda' 'Audio Cassette'
 'Unknown Binding' 'Audio CD' 'Slipcased Hardcover' 'Broschiert'
 'Brochura' 'MP3 CD' 'Audible Audio' 'hardcover' 'cloth' 'Pasta dura'
 'Paperback/Kindle' 'paper' 'Hard Cover' 'Perfect Paperback' 'Poche'
 'Comics' 'Hardcover Slipcased' 'Unbound' 'Taschenbuch' 'Paper back'
 'Paperback, Kindle, Ebook, Audio' 'CD-ROM' 'Paperback and Kindle'
 'Hardcover im Schuber' 'paperback' 'Graphic Novels' 'Broché'
 'Science Fiction Book Club Omnibus' 'Newsprint' 'Spiral-bound'
 'Mass Market' 'Hardcover Boxed Set' 'Hardback' 'Audio' 'Novel'
 'Gebundene Ausgabe' 'softcover' 'گالینگور-وزیری' 'hardbound'
 'Hard cover, Soft cover, e-book' 'Kindle' 'Paperback/Ebook'
 'Online Fiction' 'Interactive ebook' 'Paperback mit Klappen'
 'eBook Kindle' 'ebook and

In [35]:
# Load the format mapping dictionary
import json

with open("src/cleaning/mappings/format_dict.json", "r", encoding="utf-8") as f:
    format_dict = json.load(f)

In [36]:
bbe_clean['bookFormat_clean'] = (
    bbe_clean['bookFormat']
    .astype(str)
    .str.strip()
    .str.lower()
    .replace(format_dict)
)

# Replace remaining unknowns or NaN with a unified label
bbe_clean['bookFormat_clean'] = bbe_clean['bookFormat_clean'].replace(['nan', 'none', ''], np.nan)
bbe_clean['bookFormat_clean'] = bbe_clean['bookFormat_clean'].fillna('Other / Unknown')

In [37]:
unique_format_clean = bbe_clean['bookFormat_clean'].unique() 

print(f"\nTotal unique book format values: {len(unique_format_clean)}\n") 


Total unique book format values: 11



In [39]:
unique_format_clean

array(['Hardcover', 'Paperback', 'Ebook', 'Audiobook', 'Other / Unknown',
       'Board Book', 'Boxed Set', 'Leather Bound', 'Library Binding',
       'Comics / Graphic Novel', "author's website"], dtype=object)

In [40]:
from pathlib import Path

version = 5

interim_bbe_path = Path("data/interim/bbe")

bbe_clean.to_csv(interim_bbe_path / f"bbe_clean_v{version}.csv", index=False)

print("Interim format datasets saved successfully in data/interim/ directory.")

Interim format datasets saved successfully in data/interim/ directory.


After applying the standardization mapping, the number of unique book format values dropped from **135** to **11**: ten standardized categories plus one unmapped leftover (`author's website`). This is a substantial improvement in data consistency and interpretability.
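
A quick way to audit the result is to count cleaned labels that fall outside the expected category set. The sketch below is illustrative: `unmapped_values` is a hypothetical helper, `EXPECTED` is copied from the array printed above, and `cleaned` is toy data standing in for `bbe_clean['bookFormat_clean']`.

```python
from collections import Counter

# Standardized categories, taken from the unique values printed above
EXPECTED = {
    "Hardcover", "Paperback", "Ebook", "Audiobook", "Other / Unknown",
    "Board Book", "Boxed Set", "Leather Bound", "Library Binding",
    "Comics / Graphic Novel",
}


def unmapped_values(cleaned, expected=EXPECTED):
    """Count cleaned labels that are not one of the expected categories."""
    return Counter(value for value in cleaned if value not in expected)


# Toy stand-in for the cleaned format column
cleaned = ["Hardcover", "Ebook", "author's website", "Paperback"]
print(unmapped_values(cleaned))
```

Any label reported here is a candidate for a new entry in the mapping dictionary.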

**ISBN and ASIN Cleaning**

The BBE dataset includes a single `isbn` column, which initially contained numerous missing or invalid entries (e.g. placeholder values such as `9999999999999`).

Our initial cleaning flow focused solely on standardizing **ISBN** values, but upon further inspection, we identified additional patterns such as **Amazon ASINs** (10-character alphanumeric codes) and prefixed identifiers like `10:` or `13:`.

These findings led to an adjustment to the cleaning logic and the order of operations in the pipeline.

The final cleaning process:

- Removes punctuation and non-digit characters to standardize ISBN formatting.
- Detects and separates ASINs (`asin` column) to preserve them for potential cross-dataset enrichment.
- Handles prefixed identifiers (e.g., `13:9780615700`) by removing prefixes before validation.
- Filters out placeholder or invalid entries (`999…`, `000…`) and ensures consistent string representation.
- Creates a new `isbn_clean` column containing only valid ISBN-10 or ISBN-13 values.
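
The steps above check only the length of an identifier. A possible extension, not applied in this notebook, is to verify the ISBN check digit as well. The sketch below uses the standard ISBN-10 and ISBN-13 checksum rules; the function names are hypothetical.

```python
import re


def is_valid_isbn10(s: str) -> bool:
    """ISBN-10: the sum weighted 10..1 must be divisible by 11; the last char may be X (= 10)."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", s):
        return False
    values = [10 if ch == "X" else int(ch) for ch in s]
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def is_valid_isbn13(s: str) -> bool:
    """ISBN-13: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10."""
    if not re.fullmatch(r"[0-9]{13}", s):
        return False
    return sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(s)) % 10 == 0


def is_valid_isbn(raw) -> bool:
    """Strip spaces and hyphens, then test both ISBN variants."""
    s = re.sub(r"[\s-]", "", str(raw)).upper()
    return is_valid_isbn10(s) or is_valid_isbn13(s)


print(is_valid_isbn("9780439023481"))  # True: ISBN-13 with a correct check digit
print(is_valid_isbn("080442957X"))     # True: ISBN-10 whose check character is X
print(is_valid_isbn("9780615700"))     # False: the prefixed example above with "13:" removed
```

Two caveats apply. A 10-character ISBN ending in `X` also looks like an ASIN, so the order of the two checks matters. The checksum also does not replace the placeholder filter, because an all-zero string passes it.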

In [41]:
# Inspect ISBN column
bbe_clean[['title','isbn']].head(10)

Unnamed: 0,title,isbn
0,The Hunger Games,9780439023481
1,Harry Potter and the Order of the Phoenix,9780439358071
2,To Kill a Mockingbird,9999999999999
3,Pride and Prejudice,9999999999999
4,Twilight,9780316015844
5,The Book Thief,9780375831003
6,Animal Farm,9780451526342
7,The Chronicles of Narnia,9999999999999
8,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,9780345538376
9,Gone with the Wind,9780446675536


In [42]:
# Check missing and invalid patterns
n_missing_isbn = bbe_clean['isbn'].isna().sum()
print(f'Number of missing ISBN entries: {n_missing_isbn}')

Number of missing ISBN entries: 0


In [43]:
# Identify invalid placeholders (like 9999999999999)
n_invalid_isbn = bbe_clean[bbe_clean['isbn'].astype(str).str.contains('9999999999')].shape[0]
print(f'Number of placeholder ISBN entries: {n_invalid_isbn}')

Number of placeholder ISBN entries: 4354


In [44]:
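def detect_asin(x):
    """Return x if it looks like an Amazon ASIN (10 uppercase alphanumerics incl. a letter), else NaN."""
    if pd.isna(x):
        return np.nan
    x = str(x).strip()
    # Caveat: an ISBN-10 ending in the check character 'X' also matches this pattern
    if re.fullmatch(r'[A-Z0-9]{10}', x) and not x.isdigit():  # must have at least one letter
        return x
    return np.nan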

In [45]:
bbe_clean['asin'] = bbe_clean['isbn'].apply(detect_asin)
has_asin = bbe_clean[bbe_clean['asin'].notna()] 
print(f'Books with ASINs: {len(has_asin)}')
has_asin[['title','isbn', 'asin']].head(10)

Books with ASINs: 4692


Unnamed: 0,title,isbn,asin
20,Fahrenheit 451,B0064CPN7I,B0064CPN7I
56,Lolita,B00IIAQY3Q,B00IIAQY3Q
80,1984,B003JTHWKU,B003JTHWKU
95,The Notebook,B000Q67J66,B000Q67J66
174,The Mists of Avalon,B000FC1JCQ,B000FC1JCQ
183,Bridge to Terabithia,B001UFP6JY,B001UFP6JY
193,The Clan of the Cave Bear,B00466HQ2Y,B00466HQ2Y
209,Jurassic Park,B007UH4D3G,B007UH4D3G
231,The Screwtape Letters,B002BD2V2Y,B002BD2V2Y
247,Blindness,B003T0GBOM,B003T0GBOM


In [46]:
def clean_isbn(x):
    # handle missing
    if pd.isna(x):
        return np.nan

    # detect ASIN first
    asin_val = detect_asin(x)
    if pd.notna(asin_val):
        # return NaN for ISBN cleaning, because it's an ASIN
        return np.nan  

    # clean numeric ISBNs
    s = str(x).strip()
    s = re.sub(r'^(10:|13:)', '', s)       # remove leading prefixes
    s = re.sub(r'\D', '', s)               # keep only digits

    # handle placeholders
    if re.fullmatch(r'(9{10}|9{13}|0{10}|0{13})', s):
        return np.nan

    # keep valid ISBN-10 or ISBN-13
    if len(s) in [10, 13]:
        return s

    return np.nan


In [47]:
bbe_clean['isbn_clean'] = bbe_clean['isbn'].apply(clean_isbn)

In [48]:
# Inspect ISBN columns after cleaning
bbe_clean[['title','isbn', 'isbn_clean']].head(10)

Unnamed: 0,title,isbn,isbn_clean
0,The Hunger Games,9780439023481,9780439023481.0
1,Harry Potter and the Order of the Phoenix,9780439358071,9780439358071.0
2,To Kill a Mockingbird,9999999999999,
3,Pride and Prejudice,9999999999999,
4,Twilight,9780316015844,9780316015844.0
5,The Book Thief,9780375831003,9780375831003.0
6,Animal Farm,9780451526342,9780451526342.0
7,The Chronicles of Narnia,9999999999999,
8,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,9780345538376,9780345538376.0
9,Gone with the Wind,9780446675536,9780446675536.0


In [49]:
placeholder_remaining = bbe_clean[bbe_clean['isbn_clean'].astype(str).str.fullmatch(r'(9{10}|9{13}|0{10}|0{13})', na=False)]
print(f"Remaining placeholder ISBNs: {len(placeholder_remaining)}")

Remaining placeholder ISBNs: 0


In [50]:
# Filter the rows where isbn_clean is NaN
missing_isbn_clean = bbe_clean[bbe_clean['isbn_clean'].isna()]

# Print the number of missing and show the first few examples
print(f"Missing isbn_clean: {missing_isbn_clean.shape[0]}")
missing_isbn_clean[['title', 'bookFormat', 'isbn', 'asin','isbn_clean']].head(10)

Missing isbn_clean: 9078


Unnamed: 0,title,bookFormat,isbn,asin,isbn_clean
2,To Kill a Mockingbird,Paperback,9999999999999,,
3,Pride and Prejudice,Paperback,9999999999999,,
7,The Chronicles of Narnia,Paperback,9999999999999,,
10,The Fault in Our Stars,Hardcover,9999999999999,,
11,The Hitchhiker's Guide to the Galaxy,Paperback,9999999999999,,
14,The Da Vinci Code,Paperback,9999999999999,,
16,The Picture of Dorian Gray,Paperback,9999999999999,,
19,Les Misérables,Mass Market Paperback,9999999999999,,
20,Fahrenheit 451,Kindle Edition,B0064CPN7I,B0064CPN7I,
26,The Perks of Being a Wallflower,Paperback,9999999999999,,


To check for other kinds of invalid ISBNs, we define a helper function, `isbn_type`, that classifies each raw `isbn` value as `'valid'`, `'asin'`, `'placeholder_9'`, `'placeholder_0'`, `'wrong_length'`, or `'missing'`. We then filter the rows tagged `'wrong_length'` or `'missing'` to identify any remaining issues that need to be addressed.

In [51]:
def isbn_type(x):
    if pd.isna(x):
        return 'missing'

    s = str(x).strip()

    # Detect ASIN (10-char alphanumeric, must have at least one letter)
    if re.fullmatch(r'[A-Z0-9]{10}', s.upper()) and not s.isdigit():
        return 'asin'

    # Remove non-digits for numeric checks
    x = re.sub(r'\D', '', s)

    # Placeholder patterns
    if re.fullmatch(r'9{10}|9{13}', x):
        return 'placeholder_9'
    if re.fullmatch(r'0{10}|0{13}', x):
        return 'placeholder_0'

    # Length checks
    if len(x) in [10, 13]:
        return 'valid'
    if len(x) > 0:
        return 'wrong_length'

    return 'missing'


In [52]:
bbe_clean['isbn_type'] = bbe_clean['isbn'].apply(isbn_type)
bbe_clean['isbn_type'].value_counts()

isbn_type
valid            43397
asin              4692
placeholder_9     4354
wrong_length        34
missing              1
Name: count, dtype: int64

In [53]:
# Filter rows with type either 'wrong_length' or 'missing'
invalid_isbn = bbe_clean[bbe_clean['isbn_type'].isin(['wrong_length', 'missing'])]

# Show total count
print(f"Total invalid (wrong_length + missing): {invalid_isbn.shape[0]}")

# Preview relevant columns
invalid_isbn[['title', 'author_clean', 'bookFormat', 'isbn', 'asin', 'isbn_type']].head(10)

Total invalid (wrong_length + missing): 35


Unnamed: 0,title,author_clean,bookFormat,isbn,asin,isbn_type
3670,Ice in My Veins,Kelli Sullivan,Paperback,978145208533,,wrong_length
14276,Changing the Game,Jaci Burton,Trade Paperback,978042524240,,wrong_length
16178,The Reluctant Vampire,Lynsay Sands,Mass Market Paperback,978006189459,,wrong_length
16368,Awful Auntie,David Walliams,Hardcover,978000794445,,wrong_length
18081,Clipped,Samantha Potts,Paperback,978145633431,,wrong_length
18506,The Gaze,Javier A. Robayo,Kindle Edition,978147505066,,wrong_length
18697,The Vine Keeper . . . messages in Poetry and P...,William S. Peters Sr.,Paperback,13:9780615700,,wrong_length
19101,نهضة أمة,هشام مصطفى عبد العزيز,Paperback,978977623807,,wrong_length
19212,Venom,Fiona Paul,Hardcover,978039925725,,wrong_length
19304,Scarback: There Is So Much More to Fishing Tha...,Roger Corea,Paperback,10:1496102266,,wrong_length


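Of the 52,478 records, **9,081 (≈17.3%)** do not contain a usable ISBN according to `isbn_type`: 4,692 ASINs, 4,354 placeholders, **34** `'wrong_length'` values, and **1** `'missing'` value. The remaining **43,397 (≈82.7%)** have a valid length.
The `'wrong_length'` cases are mostly truncated or prefixed identifiers, and the `isbn_type` function separates valid ISBNs, ASINs, and placeholders into distinct categories.
Because `clean_isbn` strips `10:`/`13:` prefixes before checking the length while `isbn_type` does not, a few prefixed values (e.g. `10:1496102266`) are kept in `isbn_clean` yet tagged `'wrong_length'` here, which is consistent with the slightly lower count of 9,078 missing `isbn_clean` values.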
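
The shares can be recomputed directly from the `value_counts()` output shown above. This is a small arithmetic sketch; the dictionary simply restates those printed counts.

```python
# Counts copied from the isbn_type value_counts() output
counts = {
    "valid": 43_397,
    "asin": 4_692,
    "placeholder_9": 4_354,
    "wrong_length": 34,
    "missing": 1,
}

total = sum(counts.values())
for label, n in counts.items():
    print(f"{label:<14}{n:>6}  {n / total:6.2%}")

not_valid = total - counts["valid"]
print(f"Not a valid ISBN: {not_valid} of {total} ({not_valid / total:.1%})")
```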

In [54]:
from pathlib import Path

version = 6

interim_bbe_path = Path("data/interim/bbe")

bbe_clean.to_csv(interim_bbe_path / f"bbe_clean_v{version}.csv", index=False)

print("Interim ISBN/ASIN datasets saved successfully in data/interim/ directory.")

Interim ISBN/ASIN datasets saved successfully in data/interim/ directory.


**Ratings**

In this step, we evaluate the quality and consistency of the `rating` field.
We first check for missing or invalid values and calculate the percentage of available ratings to assess data completeness. Then, we use the `describe()` method to verify whether the ratings follow the expected 1–5 scale.

In [55]:
# Filter the rows where rating is not NaN
total_books = len(bbe_clean)
has_ratings = bbe_clean[bbe_clean['rating'].notna()]
has_ratings_num = has_ratings.shape[0]
share_ratings = has_ratings_num / total_books * 100

# Print the number of titles with ratings and show the first few examples
print(f"Books with ratings: {has_ratings_num} of {total_books} ({share_ratings:.2f}%)")
has_ratings[['title', 'rating', 'numRatings','ratingsByStars']].head(10)

Books with ratings: 52478 of 52478 (100.00%)


Unnamed: 0,title,rating,numRatings,ratingsByStars
0,The Hunger Games,4.33,6376780,"['3444695', '1921313', '745221', '171994', '93..."
1,Harry Potter and the Order of the Phoenix,4.5,2507623,"['1593642', '637516', '222366', '39573', '14526']"
2,To Kill a Mockingbird,4.28,4501075,"['2363896', '1333153', '573280', '149952', '80..."
3,Pride and Prejudice,4.26,2998241,"['1617567', '816659', '373311', '113934', '767..."
4,Twilight,3.6,4964519,"['1751460', '1113682', '1008686', '542017', '5..."
5,The Book Thief,4.37,1834276,"['1048230', '524674', '186297', '48864', '26211']"
6,Animal Farm,3.95,2740713,"['986764', '958699', '545475', '165093', '84682']"
7,The Chronicles of Narnia,4.26,517740,"['254964', '167572', '74362', '15423', '5419']"
8,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,4.6,110146,"['78217', '22857', '6628', '1477', '967']"
9,Gone with the Wind,4.3,1074620,"['602138', '275517', '133535', '39008', '24422']"


In [56]:
bbe_clean['rating'].describe()

count    52478.000000
mean         4.021878
std          0.367146
min          0.000000
25%          3.820000
50%          4.030000
75%          4.230000
max          5.000000
Name: rating, dtype: float64

In [57]:
bbe_clean['rating'].unique()[:20]

array([4.33, 4.5 , 4.28, 4.26, 3.6 , 4.37, 3.95, 4.6 , 4.3 , 4.21, 4.22,
       3.86, 4.12, 4.08, 4.06, 4.13, 4.18, 3.99, 4.19, 3.69])

The inspection shows that the dataset is generally clean. However, the minimum of `0` reported by `describe()` reveals a small number of entries with a rating of `0`, which indicates missing evaluations rather than a real score. These will be replaced with `NaN` so that all remaining ratings fall within the valid range (1–5). Since all valid values already follow the standard Goodreads scale, no normalization is required.
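
An optional cross-check, not part of the original notebook, is to recompute the average rating from `ratingsByStars` and compare it with `rating`. The sketch assumes that the field is a stringified list ordered from 5 stars down to 1 star, which the previewed rows are consistent with; `mean_from_star_counts` is a hypothetical helper.

```python
import ast


def mean_from_star_counts(raw):
    """Weighted mean rating from a stringified list of star counts.

    Assumes the list is ordered from 5 stars down to 1 star.
    Returns None when no ratings are available (e.g. '[]').
    """
    counts = [int(c) for c in ast.literal_eval(raw)]
    total = sum(counts)
    if len(counts) != 5 or total == 0:
        return None
    return sum(stars * n for stars, n in zip(range(5, 0, -1), counts)) / total


# Row 1 of the preview (rating 4.5)
print(round(mean_from_star_counts("['1593642', '637516', '222366', '39573', '14526']"), 2))  # 4.5
# Books without ratings carry an empty list
print(mean_from_star_counts("[]"))  # None
```

Rows where the recomputed mean is `None` are good candidates for a missing rating, and rows where it differs noticeably from `rating` may deserve a closer look.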

In [58]:
mask = (bbe_clean['rating'] == 0)
print(f'Items with value equal 0: {bbe_clean[mask].shape[0]}')
bbe_clean[mask][['title', 'author_clean','rating','numRatings','ratingsByStars']].head()

Items with value equal 0: 71


Unnamed: 0,title,author_clean,rating,numRatings,ratingsByStars
8321,Her Beauty,M.R. Desmond,0.0,0,[]
17834,Mach Deine Träume Wahrverwirkliche Deine Ziele...,Roeland Suylen,0.0,0,[]
17907,Mindtronics! And Inquiry Alive!,William C. Bruce,0.0,0,[]
18197,Moon Secrets,J.J. Gregory,0.0,0,[]
18618,Aphrodisiac Concupiscence Deluxe,Yolanda Williams,0.0,0,[]


In [59]:
bbe_clean['rating_clean'] = bbe_clean['rating'].replace(0, np.nan)

In [60]:
bbe_clean['rating_clean'].describe()

count    52407.000000
mean         4.027327
std          0.336206
min          1.000000
25%          3.820000
50%          4.030000
75%          4.230000
max          5.000000
Name: rating_clean, dtype: float64

In [61]:
from pathlib import Path

version = 7

interim_bbe_path = Path("data/interim/bbe")

bbe_clean.to_csv(interim_bbe_path / f"bbe_clean_v{version}.csv", index=False)

print("Interim ratings datasets saved successfully in data/interim/ directory.")

Interim ratings datasets saved successfully in data/interim/ directory.
