
# Data Cleaning

## Objectives

The purpose of this notebook is to **clean, standardize, and prepare the collected datasets** for subsequent exploratory analysis and modeling tasks.

The goal is to transform raw inputs from multiple book datasets into a **reliable, consistent, and mergeable analytical base**, ensuring data integrity and comparability across platforms.

---

## Inputs

| Dataset                    | Source                     | Description                                                               | Format |
| -------------------------- | -------------------------- | ------------------------------------------------------------------------- | ------ |
| `bbe_books.csv`            | Zenodo – *Best Books Ever* | Book metadata including title, author, rating, genres, and description.   | CSV    |
| `books.csv`, `ratings.csv` | GitHub – *Goodbooks-10k*   | Book metadata and user–book interaction data for recommendation modeling. | CSV    |

---

## Tasks in This Notebook

This notebook will execute the following cleaning and preparation steps:

1. **Standardize column formats:**
   Ensure consistent data types and naming conventions across datasets (e.g., convert `isbn` to string, align `author`, `rating`, and `title` formats).

2. **Clean and normalize missing values:**
   Replace placeholder NaNs (`9999999999999`, empty lists, or `"None"`) with `np.nan`, then impute or drop based on analytical importance.

3. **Detect and resolve duplicates:**
   Identify duplicate records using key identifiers (`bookId`, `isbn`, `title + author`) and retain the most complete or relevant entries.

4. **Validate and align categorical values:**
   Standardize genre labels, language codes, and rating scales to ensure comparability between datasets.

5. **Merge compatible datasets:**
   Integrate *BestBooksEver* and *Goodbooks-10k_books* into a unified schema while maintaining referential integrity with the ratings dataset.

6. **Outlier and consistency checks:**
   Review numerical and date fields (e.g., `pages`, `price`, `publishDate`) for unrealistic or extreme values and adjust as needed.

7. **Feature enrichment (optional):**
   Derive or enhance fields such as `popularity_score`, `recency`, or missing genre information using external APIs where beneficial.

---

## Outputs

* **Cleaned, schema-aligned datasets** ready for exploratory data analysis and modeling.
* **Summary statistics** on completeness, duplicates, and outliers.
* **Processed CSV files** saved for reproducibility in `data/processed/`.

> **Note:** This notebook focuses on the *Data Cleaning and Preparation*. Further feature engineering and model-specific transformations will follow in later notebooks.

---


## Navigate to the Parent Directory

Before combining and saving datasets, it’s often helpful to move to a parent directory so that file operations (like loading or saving data) are easier and more organized. 

Before using the Python’s built-in os module to move one level up from the current working directory, it is advisable to inspect the current directory.

In [1]:
import os

# Get the current working directory
current_dir = os.getcwd()
print(f'Current directory: {current_dir}')

Current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics\notebooks


To change to parent directory (root folder), run the code below. If you are already in the root folder, you can skip this step.

In [2]:
# Change the working directory to its parent
os.chdir(os.path.dirname(current_dir))
print('Changed directory to parent.')

# Get the new current working directory (the parent directory)
current_dir = os.getcwd()
print(f'New current directory: {current_dir}')

Changed directory to parent.
New current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics


## Load and Inspect Books Datasets

In this step, we load the previously collected datasets: **Goodbooks-10k** (books) and **Best Books Ever**. We will inspect their structure one more time before starting any merging or cleaning operations.

In [3]:
import pandas as pd 

# load datasets
books = pd.read_csv('data/raw/books.csv')
bbe = pd.read_csv('data/raw/bbe_books.csv')

# create copies for cleaning
books_clean = books.copy()
bbe_clean = bbe.copy()


In [4]:
# Preview data
display(bbe_clean.head(3))
display(books_clean.head(3))

# Check shape and missing values
for name, df in {'BBE': bbe_clean, 'Books': books_clean,}.items():
    print(f"\n{name} — Shape: {df.shape}")
    print(df.info())
    print(df.isna().sum().sort_values(ascending=False).head(10))


Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,characters,...,awards,numRatings,ratingsByStars,likedPercent,setting,coverImg,bbeScore,bbeVotes,price,bookId_num
0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,English,9780439023481,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...","['Katniss Everdeen', 'Peeta Mellark', 'Cato (H...",...,['Locus Award Nominee for Best Young Adult Boo...,6376780,"['3444695', '1921313', '745221', '171994', '93...",96.0,"['District 12, Panem', 'Capitol, Panem', 'Pane...",https://i.gr-assets.com/images/S/compressed.ph...,2993816,30516,5.09,2767052.0
1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.5,There is a door at the end of a silent corrido...,English,9780439358071,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...","['Sirius Black', 'Draco Malfoy', 'Ron Weasley'...",...,['Bram Stoker Award for Works for Young Reader...,2507623,"['1593642', '637516', '222366', '39573', '14526']",98.0,['Hogwarts School of Witchcraft and Wizardry (...,https://i.gr-assets.com/images/S/compressed.ph...,2632233,26923,7.38,2.0
2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,The unforgettable novel of a childhood in a sl...,English,9999999999999,"['Classics', 'Fiction', 'Historical Fiction', ...","['Scout Finch', 'Atticus Finch', 'Jem Finch', ...",...,"['Pulitzer Prize for Fiction (1961)', 'Audie A...",4501075,"['2363896', '1333153', '573280', '149952', '80...",95.0,"['Maycomb, Alabama (United States)']",https://i.gr-assets.com/images/S/compressed.ph...,2269402,23328,,2657.0


Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052.0,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3.0,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865.0,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...



BBE — Shape: (52478, 26)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52478 entries, 0 to 52477
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bookId            52478 non-null  object 
 1   title             52478 non-null  object 
 2   series            23470 non-null  object 
 3   author            52478 non-null  object 
 4   rating            52478 non-null  float64
 5   description       51140 non-null  object 
 6   language          48672 non-null  object 
 7   isbn              52478 non-null  object 
 8   genres            52478 non-null  object 
 9   characters        52478 non-null  object 
 10  bookFormat        51005 non-null  object 
 11  edition           4955 non-null   object 
 12  pages             50131 non-null  object 
 13  publisher         48782 non-null  object 
 14  publishDate       51598 non-null  object 
 15  firstPublishDate  31152 non-null  object 
 16  awards        

We will check if the datasets share common identifiers and compatible data types.

In [5]:
bbe_only_columns = set(bbe_clean.columns) - set(books_clean.columns)
print(f'Columns only in BBE: {bbe_only_columns}')

goodbooks_only_columns = set(books_clean.columns) - set(bbe_clean.columns)
print(f'Columns only in Goodbooks: {goodbooks_only_columns}')


Columns only in BBE: {'numRatings', 'bookFormat', 'bbeScore', 'bbeVotes', 'firstPublishDate', 'author', 'genres', 'rating', 'ratingsByStars', 'bookId', 'characters', 'bookId_num', 'coverImg', 'pages', 'price', 'series', 'publisher', 'setting', 'awards', 'publishDate', 'language', 'likedPercent', 'description', 'edition'}
Columns only in Goodbooks: {'ratings_2', 'books_count', 'book_id', 'ratings_3', 'authors', 'original_title', 'ratings_5', 'ratings_1', 'ratings_4', 'best_book_id', 'average_rating', 'original_publication_year', 'isbn13', 'work_ratings_count', 'small_image_url', 'ratings_count', 'image_url', 'work_text_reviews_count', 'work_id', 'goodreads_book_id', 'language_code'}


Based on the initial inspection, we can create a mapping table to align columns from both datasets for merging and analysis.

| **BestBooksEver (BBE)**           | **Goodbooks10k_books (GB10k)**                   | **Notes / Alignment Rationale**                                            |
| --------------------------------- | ------------------------------------------------ | -------------------------------------------------------------------------- |
| `bookId`           | `book_id`                                                      | Main identifier; ensure both are numeric.                      |
| `bookId_num`           | `goodreads_book_id`                                        | Goodreads identifier; ensure both are numeric for joining.                      |
| `title`                           | `title`                                          | Direct match. Used as secondary join key.                                  |
| `series`                          | —                                                | Only in BBE; could enrich GB10k if available via API.                      |
| `author`                          | `authors`                                        | Same meaning. Normalize format.      |
| `rating`                          | `average_rating`                                 | Equivalent — rename to unified `average_rating`.                           |
| `numRatings`                      | `ratings_count`                                  | Same measure of total user ratings.                                        |
| `ratingsByStars`                  | `ratings_1` … `ratings_5`                        | BBE has dict, GB10k has explicit columns. Expand or aggregate accordingly. |
| `likePercent` (or `likedPercent`) | —                                                | BBE-only; optional metric of user sentiment.                               |
| `isbn`                            | `isbn` / `isbn13`                                | Common linking key; keep both (string). Use for merges when present.       |
| `language`                        | `language_code`                                  | Standardize to ISO 639-1 (lowercase).                                      |
| `description`                     | —                                                | BBE-only; valuable for NLP features.                                       |
| `genres`                          | —                                                | BBE-only; can enrich GB10k tags later.                                     |
| `characters`                      | —                                                | BBE-only; low modeling priority, but could add narrative metadata.         |
| `bookFormat`                      | —                                                | BBE-only; possible categorical feature.                                    |
| `edition`                         | —                                                | BBE-only.                                                                  |
| `pages`                           | —                                                | BBE-only; numeric, may enrich GB10k metadata.                              |
| `publisher`                       | —                                                | BBE-only; possible future feature.                                         |
| `publishDate`                     | —                                                | BBE-only; can approximate from GB10k’s `original_publication_year`.        |
| `firstPublishDate`                | `original_publication_year`                      | Equivalent (date vs year).                                                 |
| `coverImg`                        | `image_url` / `small_image_url`                  | Same function (cover link).                                                |
| `bbeScore`                        | —                                                | BBE-only; internal popularity score.                                       |
| `bbeVotes`                        | `work_ratings_count`                             | Comparable as popularity proxy.                                            |
| `price`                           | —                                                | BBE-only; likely non-essential for satisfaction prediction.                |
| `setting`                         | —                                                | BBE-only; can support content enrichment.                                  |
| `awards`                          | —                                                | BBE-only; categorical enrichment.                                          |
| —                                 | `goodreads_book_id` / `best_book_id` / `work_id` | GB10k-only identifiers; may be used for deeper Goodreads linking.          |
| —                                 | `books_count`                                    | GB10k-only; number of editions per work.                                   |
| —                                 | `work_text_reviews_count`                        | GB10k-only; can complement `numRatings` as engagement metric.              |


## Data Cleaning Steps

### Best Books Ever

- Handle identifier columns
- Standardize key columns: `author`, `language`
- Missing data handling strategies
- Normalize genre and format
- Validate for no nulls or duplicates

#### 1. Handle identifier columns
On the previous notebook, we created a new field `bookId_num` in the BBE dataset to align with `goodreads_book_id` in the Goodbooks10k dataset. We have also ensured that they were both converted to numeric types and that all `bookId` values generated a valid `bookId_num`. So we can skip the handle identifier columns, as it was already done. 

#### 2. Standardize key columns

**Author Column**

We will proceed with the standardization of key columns, starting with the `author` column. The author column in the BBE dataset often contains a qualifier such as "(Goodreads Author)". We will remove such qualifiers to standardize the format. We will also create an additional list column to store multiple authors as a list rather than a single string. This way, its is ready to use for feature engineering later on if needed.

In [6]:
import re
import pandas as pd

def clean_and_split_authors(name):
    """
    Cleans author names and returns a list of authors.
    """
    if pd.isna(name):
        return None

    # Remove role descriptors
    cleaned = re.sub(r"\s*\([^)]*\)", "", name)
    
    # Split into list if multiple authors exist
    authors_list = [a.strip() for a in cleaned.split(",") if a.strip()]
    
    return authors_list

In [7]:
# Apply to BestBooksEver dataset
bbe_clean["authors_list"] = bbe_clean["author"].apply(clean_and_split_authors)
bbe_clean["author_clean"] = bbe_clean["authors_list"].apply(lambda x: ", ".join(x) if isinstance(x, list) else None)

# Quick check
bbe_clean[["author", "author_clean", "authors_list"]].head(5)

Unnamed: 0,author,author_clean,authors_list
0,Suzanne Collins,Suzanne Collins,[Suzanne Collins]
1,"J.K. Rowling, Mary GrandPré (Illustrator)","J.K. Rowling, Mary GrandPré","[J.K. Rowling, Mary GrandPré]"
2,Harper Lee,Harper Lee,[Harper Lee]
3,"Jane Austen, Anna Quindlen (Introduction)","Jane Austen, Anna Quindlen","[Jane Austen, Anna Quindlen]"
4,Stephenie Meyer,Stephenie Meyer,[Stephenie Meyer]


**Language Column**

The `language` column in the Best Books Ever dataset used full names such as “English”, “German”, and “Arabic”.  Before transforming the values, we will check for all unique values to identify any unexpected entries.

In [8]:
# Inspect unique language values
print("Unique language values in BBE dataset:")
bbe_clean['language'] = bbe_clean['language'].astype(str).str.strip()
unique_languages = bbe_clean['language'].unique()

print(f"\nTotal unique values: {len(unique_languages)}\n")
print(unique_languages)


Unique language values in BBE dataset:

Total unique values: 82

['English' 'French' 'German' 'Persian' 'Arabic' 'nan' 'Spanish'
 'Multiple languages' 'Portuguese' 'Indonesian' 'Turkish' 'Polish'
 'Bulgarian' 'Tamil' 'Japanese' 'Romanian' 'Italian'
 'French, Middle (ca.1400-1600)' 'Norwegian' 'Urdu' 'Dutch' 'Finnish'
 'Marathi' 'Chinese' 'Swedish' 'Icelandic' 'Malayalam' 'Croatian'
 'Estonian' 'Greek, Modern (1453-)' 'Russian' 'Kurdish' 'Danish' 'Hindi'
 'Filipino; Pilipino' 'Serbian' 'Bengali' 'Malay' 'Catalan; Valencian'
 'Czech' 'Vietnamese' 'Armenian' 'Georgian' 'Kannada' 'Korean' 'Nepali'
 'Slovak' 'Telugu' 'Hungarian' 'English, Middle (1100-1500)' 'Azerbaijani'
 'Farsi' 'Lithuanian' 'Ukrainian' 'Bokmål, Norwegian; Norwegian Bokmål'
 'Iranian (Other)' 'Faroese' 'Basque' 'Macedonian' 'Maltese' 'Gujarati'
 'Amharic' 'Aromanian; Arumanian; Macedo-Romanian' 'Assamese'
 'Panjabi; Punjabi' 'Albanian' 'Latvian' 'Bosnian' 'Afrikaans' 'Thai'
 'Dutch, Middle (ca.1050-1350)' 'Mongolian' 'Tag

We can see that there are some unexpected values such as:
- _historical forms_ (“English, Middle (1100-1500)”, “French, Middle (ca.1400-1600)”)
- _combined or semicolon-separated entries_ (“Filipino; Pilipino”, “Catalan; Valencian”)
- _multi-language / uncertain cases_ (“Multiple languages”, “Undetermined”)
- _rare or dialects_ (“Bokmål, Norwegian; Norwegian Bokmål”, “Aromanian; Arumanian; Macedo-Romanian”)

We will clean the unusual entries by mapping them to the closest language present in the ISO 639-1 standard. Unrecognized values will be flagged and replaced with `"unknown"`. It was decided to distinguish the `"unknown"` from the `NaN` values to retain information about missingness versus unrecognized entries. 

In [9]:
import numpy as np

# Standardize capitalization & spacing
bbe_clean['language'] = bbe_clean['language'].astype(str).str.strip().str.title()

# Handle NaNs that became strings
bbe_clean['language'] = bbe_clean['language'].replace({'Nan': np.nan})

# Simplify and unify multi-language / dialect forms
replace_map = {
    'Multiple Languages': 'Multilingual',
    'Undetermined': 'Unknown',
    'Iranian (Other)': 'Persian',
    'Farsi': 'Persian',
    'Filipino; Pilipino': 'Filipino',
    'Catalan; Valencian': 'Catalan',
    'Panjabi; Punjabi': 'Punjabi',
    'Bokmål, Norwegian; Norwegian Bokmål': 'Norwegian',
    'Norwegian Nynorsk; Nynorsk, Norwegian': 'Norwegian',
    'Greek, Modern (1453-)': 'Greek',
    'Greek, Ancient (To 1453)': 'Greek',
    'French, Middle (Ca.1400-1600)': 'French',
    'English, Middle (1100-1500)': 'English',
    'Dutch, Middle (Ca.1050-1350)': 'Dutch',
    'Aromanian; Arumanian; Macedo-Romanian': 'Romanian',
    'Mayan Languages': 'Mayan',
    'Australian Languages': 'English'
}

bbe_clean['language'] = bbe_clean['language'].replace(replace_map)



After transforming the values, we will apply the mapping to standardize the `language` column to ISO 639-1 two-letter codes. 

In [10]:
language_dict = {
    "english": "en", "german": "de", "french": "fr", "arabic": "ar",
    "spanish": "es", "italian": "it", "portuguese": "pt", "russian": "ru",
    "chinese": "zh", "japanese": "ja", "hindi": "hi", "dutch": "nl",
    "swedish": "sv", "norwegian": "no", "polish": "pl", "turkish": "tr",
    "korean": "ko", "danish": "da", "finnish": "fi", "hebrew": "he",
    "greek": "el", "czech": "cs", "romanian": "ro", "indonesian": "id",
    "thai": "th", "hungarian": "hu", "vietnamese": "vi", "persian": "fa",
    "icelandic": "is", "latin": "la", "swahili": "sw", "bulgarian": "bg",
    "croatian": "hr", "estonian": "et", "tamil": "ta", "urdu": "ur",
    "malayalam": "ml", "slovak": "sk", "telugu": "te", "azerbaijani": "az",
    "lithuanian": "lt", "ukrainian": "uk", "faroese": "fo", "basque": "eu",
    "macedonian": "mk", "maltese": "mt", "gujarati": "gu", "amharic": "am",
    "albanian": "sq", "latvian": "lv", "bosnian": "bs", "afrikaan": "af",
    "mongolian": "mn", "tagalog": "tl", "galician": "gl", "slovenian": "sl",
    "armenian": "hy", "georgian": "ka", "kannada": "kn", "marathi": "mr",
    "nepali": "ne", "punjabi": "pa", "filipino": "fil", "mayan": "myn",
    "unknown": "unknown", "multilingual": "multi"
}

# Apply dictionary
bbe_clean['language'] = bbe_clean['language'].str.lower().map(language_dict)

# Fill remaining NaNs
bbe['language'] = bbe['language'].fillna('unknown')

In [11]:
# check again for unique language values
print("Unique language values in BBE dataset:")
unique_languages = bbe_clean['language'].unique()

print(f"\nTotal unique values: {len(unique_languages)}\n")
print(unique_languages)


Unique language values in BBE dataset:

Total unique values: 63

['en' 'fr' 'de' 'fa' 'ar' nan 'es' 'multi' 'pt' 'id' 'tr' 'pl' 'bg' 'ta'
 'ja' 'ro' 'it' 'no' 'ur' 'nl' 'fi' 'mr' 'zh' 'sv' 'is' 'ml' 'hr' 'et'
 'el' 'ru' 'da' 'hi' 'fil' 'cs' 'vi' 'hy' 'ka' 'kn' 'ko' 'ne' 'sk' 'te'
 'hu' 'az' 'lt' 'uk' 'fo' 'eu' 'mk' 'mt' 'gu' 'am' 'pa' 'sq' 'lv' 'bs'
 'th' 'mn' 'tl' 'gl' 'sl' 'unknown' 'myn']


**Date Columns**

BBE dataset has two publication fields: `publishDate` and `firstPublishDate`. The `firstPublishDate` represents the original publication date, while `publishDate` refers to a more recent edition or reprint date. Publishing experts assumption is that the recency of the `firstPublishDate` is more relevant for modeling book satisfaction, as it reflects when the book was first introduced to readers. Therefore, we will focus on cleaning and standardizing the `firstPublishDate` column and use `publishDate` only if `firstPublishDate` is missing.

While majority of the dates follow the 'MM/DD/YY' format, after a first attemp at cleaning, we noticed some dates do not conform to this format. Therefore, we will implement a more robust date parsing strategy, focusing first on transforming textual formats into 'MM/DD/YYYY' format before attempting to parse them into datetime objects.

In [12]:
from dateutil import parser

def clean_date_string(date_str):
    """Remove ordinal suffixes and unwanted characters from a date string."""
    if pd.isna(date_str):
        return np.nan
    # remove st, nd, rd, th (like 'April 27th 2010' → 'April 27 2010')
    cleaned = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', str(date_str))
    return cleaned.strip()

def parse_mixed_date(date_str):
    """Try to parse a variety of date formats safely."""
    if pd.isna(date_str) or date_str == '':
        return np.nan
    try:
        # Use dateutil to parse most human-readable formats
        return parser.parse(date_str, fuzzy=True)
    except Exception:
        # Try year-only fallback (e.g. '2003')
        match = re.match(r'^\d{4}$', str(date_str))
        if match:
            return pd.to_datetime(f"{date_str}-01-01")
        return np.nan

In [13]:
# Apply cleaning to both columns
for col in ['firstPublishDate', 'publishDate']:
    bbe[f'{col}_clean'] = (
        bbe[col]
        .astype(str)
        .replace({'nan': np.nan, '': np.nan})
        .apply(clean_date_string)
        .apply(parse_mixed_date)
    )

In [None]:
# Combine using your logic: prefer firstPublishDate, else publishDate
bbe['publication_date_clean'] = (
    bbe['firstPublishDate_clean'].combine_first(bbe['publishDate_clean'])
)
# Reconvert to datetime safely before using .dt
bbe['publication_date_clean'] = pd.to_datetime(bbe['publication_date_clean'], errors='coerce')

# Format as ISO standard
bbe['publication_date_clean'] = bbe['publication_date_clean'].dt.strftime("%Y-%m-%d")

# Check a sample of remaining nulls
bbe[bbe['publication_date_clean'].isna()][['title', 'firstPublishDate', 'publishDate', 'publication_date_clean']].head(10)

Unnamed: 0,title,firstPublishDate,publishDate,publication_date_clean
2271,To Dream the Blackbane,,Published,
2989,Betrayal In Black,,Best Books to Read When the Snow Is Falling\n\...,
3138,Stepping Beyond Intention,,,
3160,The Fyfield Plantation,,,
4359,Angles - Part I,,,
4508,لوحات ناجي العلي,,في أحضان الكتب - الجزء الثاني\n\n111 books — 2...,
7869,Mayfair Witches Collection,,"Best Horror Novels\n\n1,773 books — 5,396 vote...",
8409,Night That Jimi Died,,50 Books That Changed Me\n\n319 books — 259 vo...,
9054,الشيخ زعرب وآخرون,,"أفضل مجموعة قصصية عربية\n\n1,069 books — 401 v...",
11231,World Peace: The Voice of a Mountain Bird,,September 9th 214,


In [29]:
# Filter rows where the unified publication date is missing
total = len(bbe)
bbe_missing_dates = bbe.loc[bbe['publication_date_clean'].isna()]
missing_count = len(bbe_missing_dates)

print(f"Missing publication dates: {missing_count} of {total} ({missing_count/total:.2%})")

# Preview key columns
bbe_missing_dates[['title', 'author', 'firstPublishDate', 'publishDate', 'publication_date_clean']].head(10)

Missing publication dates: 588 of 52478 (1.12%)


Unnamed: 0,title,author,firstPublishDate,publishDate,publication_date_clean
2271,To Dream the Blackbane,Richard J. O'Brien,,Published,
2989,Betrayal In Black,Mark M. Bello (Goodreads Author),,Best Books to Read When the Snow Is Falling\n\...,
3138,Stepping Beyond Intention,Daniel Mangena (Goodreads Author),,,
3160,The Fyfield Plantation,Andrew R. Williams (Goodreads Author),,,
4359,Angles - Part I,Erin Lockwood (Goodreads Author),,,
4508,لوحات ناجي العلي,ناجي العلي,,في أحضان الكتب - الجزء الثاني\n\n111 books — 2...,
7869,Mayfair Witches Collection,Anne Rice,,"Best Horror Novels\n\n1,773 books — 5,396 vote...",
8409,Night That Jimi Died,Darragh J Brady,,50 Books That Changed Me\n\n319 books — 259 vo...,
9054,الشيخ زعرب وآخرون,يوسف السباعي,,"أفضل مجموعة قصصية عربية\n\n1,069 books — 401 v...",
11231,World Peace: The Voice of a Mountain Bird,"Amit Ray, Banani Ray (Goodreads Author)",,September 9th 214,
