# Data Collection
## Objectives
The goal of this notebook is to **collect, consolidate, and prepare datasets** that will be used to build data-driven insights for the *Book Subscription Optimization* project.  
This aligns with the **Data Collection and Understanding** stages of the CRISP-DM process, ensuring that the data foundation supports later stages of modeling and evaluation.

Specifically, this notebook aims to:
- Retrieve and load multiple book-related datasets from open sources.  
- Perform initial validation to assess structure, completeness, and consistency. 
- Note any data quality issues for future cleaning steps.

## Inputs
- **Goodbooks-10k Dataset:** User-book interactions and ratings, used to simulate subscription platform interactions.  
- **Best Books Ever Dataset:** Global book metadata and ratings, used for cross-platform popularity validation and content catalogue proxy.  

Each dataset contributes unique dimensions: reader behavior, book features, and market data. This structure allows for a holistic analysis of book engagement and satisfaction.

## Outputs
- Metadata summary of dataset structure, completeness, and variable distributions.  
- Preliminary insights into data coverage for later CRISP-DM stages (Data Understanding & Preparation).
- Original datasets saved in CSV format for reproducibility and future use.

> **Note:** This notebook focuses on data collection and initial assessment. Detailed cleaning and transformation will be addressed in subsequent notebooks.

---

## Navigate to the Parent Directory

Before loading or saving data, it helps to work from the project root so that relative paths (such as `data/`) resolve consistently and file operations stay organized.

First, inspect the current working directory with Python's built-in `os` module before moving one level up.

In [2]:
import os

# Get the current working directory
current_dir = os.getcwd()
print(f'Current directory: {current_dir}')

Current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics\notebooks


To change to the parent directory (the project root), run the cell below. If you are already in the root folder, skip this step: re-running the cell moves up one more level each time.

In [3]:
# Change the working directory to its parent
os.chdir(os.path.dirname(current_dir))
print('Changed directory to parent.')

# Get the new current working directory (the parent directory)
current_dir = os.getcwd()
print(f'New current directory: {current_dir}')

Changed directory to parent.
New current directory: c:\Users\reisl\OneDrive\Documents\GitHub\bookwise-analytics
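
Because the `os.chdir` cell above is not idempotent, an alternative is to locate the root by looking for a marker. The following is a minimal sketch, not part of the original notebook: `find_project_root` is a hypothetical helper, and the marker names (`.git`, `data`) are assumptions about the repository layout.

```python
from pathlib import Path

def find_project_root(start, markers=(".git", "data")):
    """Return the nearest ancestor of `start` (inclusive) that contains a marker.

    Hypothetical helper: the marker names are assumptions about the repo layout.
    """
    start = Path(start).resolve()
    # Check the starting folder first, then each parent from nearest to farthest
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No project root found at or above {start}")
```

With this in place, `os.chdir(find_project_root(os.getcwd()))` could be run repeatedly without drifting above the root.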


## Fetch data from various sources

In this section, we will fetch data from multiple open datasets and inspect their basic properties to understand their structure and content. Since the datasets are all hosted on GitHub, we will use pandas to read them directly from their raw URLs.
To streamline the process, we will define a function that loads a dataset from a given URL and prints its shape, column overview, columns with missing values, and duplicate row count.

In [4]:
import pandas as pd
from IPython.display import display  # explicit import so display() also works outside Jupyter

def load_and_inspect(path, name, show_head=False):
    """
    Load a dataset and perform initial structure validation.
    - Structure and type overview
    - Missing value summary
    - Duplicate count

    Parameters
    ----------
    path : str
        File path or URL of the dataset (CSV format expected).
    name : str
        Readable name for reporting purposes.
    show_head : bool, optional
        If True, displays the first three rows for preview. Default is False.

    Returns
    -------
    pd.DataFrame
        Loaded dataset for further exploration.
    """

    try:
        df = pd.read_csv(path)
        print(f"\n{name} loaded successfully.")
        print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
        
        # Structural Overview
        print("\nData Overview:")
        print("-" * 60)
        df.info()  # info() prints directly and returns None, so it is not wrapped in print()
        print("-" * 60)
        
        # Missing value summary
        missing_cols = df.columns[df.isnull().any()].tolist()
        if missing_cols:
            print("\nColumns with Missing Data:", ", ".join(missing_cols))
        else:
            print("\nNo missing data detected")

        # Duplicate check
        duplicate_count = df.duplicated().sum()
        print(f"\nDuplicate Rows: {duplicate_count}")

        if show_head:
            display(df.head(3))

        return df

    except FileNotFoundError:
        print(f"File not found: {path}")
    except Exception as e:
        print(f"Error loading {name}: {e}")
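
The function above lists which columns have missing values but not how many. As a complement, here is a minimal sketch of a per-column summary with counts and percentages. `missing_summary` is a hypothetical helper, and the toy frame only stands in for a loaded dataset.

```python
import pandas as pd

def missing_summary(df):
    """Return missing counts and percentages per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for a loaded dataset (values are illustrative)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "isbn": ["1", None, "3", None],
    "year": [2000.0, 2001.0, None, 2003.0],
})
print(missing_summary(toy))
```

Applied to the real data, a call such as `missing_summary(books)` would show how much of each column is affected, which helps later when choosing between imputation and exclusion.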


### Goodbooks-10k Dataset
In this step, we load two datasets from the **Goodbooks-10k** project, hosted on GitHub.
This dataset will simulate user-book interactions on a book subscription platform.

These datasets contain:

- **Books**: metadata about 10,000 books (e.g. title, authors, publication year, ratings).
- **Ratings**: over 5 million individual ratings given by readers.

In [5]:
# URL of the Goodbooks-10k books file (hosted on GitHub)
books_url   = "https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv"

# Load the books data into a pandas DataFrame
books = load_and_inspect(books_url, 'Books')      # Contains detailed book metadata


Books loaded successfully.
Shape: 10000 rows × 23 columns

Data Overview:
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   book_id                    10000 non-null  int64  
 1   goodreads_book_id          10000 non-null  int64  
 2   best_book_id               10000 non-null  int64  
 3   work_id                    10000 non-null  int64  
 4   books_count                10000 non-null  int64  
 5   isbn                       9300 non-null   object 
 6   isbn13                     9415 non-null   float64
 7   authors                    10000 non-null  object 
 8   original_publication_year  9979 non-null   float64
 9   original_title             9415 non-null   object 
 10  title                      10000 non-null  object 
 11  language_code          

In [6]:
# Preview the first 5 rows of the books dataset
print("Books Data:")
display(books.head())


Books Data:


| | book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9780439000000.0 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... |
| 1 | 2 | 3 | 3 | 4640799 | 491 | 439554934 | 9780440000000.0 | J.K. Rowling, Mary GrandPré | 1997.0 | Harry Potter and the Philosopher's Stone | ... | 4602479 | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m... | https://images.gr-assets.com/books/1474154022s... |
| 2 | 3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9780316000000.0 | Stephenie Meyer | 2005.0 | Twilight | ... | 3866839 | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m... | https://images.gr-assets.com/books/1361039443s... |
| 3 | 4 | 2657 | 2657 | 3275794 | 487 | 61120081 | 9780061000000.0 | Harper Lee | 1960.0 | To Kill a Mockingbird | ... | 3198671 | 3340896 | 72586 | 60427 | 117415 | 446835 | 1001952 | 1714267 | https://images.gr-assets.com/books/1361975680m... | https://images.gr-assets.com/books/1361975680s... |
| 4 | 5 | 4671 | 4671 | 245494 | 1356 | 743273567 | 9780743000000.0 | F. Scott Fitzgerald | 1925.0 | The Great Gatsby | ... | 2683664 | 2773745 | 51992 | 86236 | 197621 | 606158 | 936012 | 947718 | https://images.gr-assets.com/books/1490528560m... | https://images.gr-assets.com/books/1490528560s... |


The dataset includes 10,000 records and 23 attributes describing book metadata, publication details, and aggregated rating statistics. Core fields such as `title`, `authors`, and `average_rating` are complete, which supports their use in downstream recommendation modeling.
A few metadata fields (`isbn`, `isbn13`, `original_title`, `language_code`, and `original_publication_year`) contain missing values, mainly related to identification or translation details. Note that `isbn13` and `original_publication_year` are stored as `float64`, so they will need type normalization during cleaning. No duplicate records were detected. Overall, the dataset is compact and suitable for integrating with user rating data to form the analytical base of the project.
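
Identifier columns such as `isbn` and `isbn13` are codes rather than quantities, so numeric parsing can alter them. The sketch below uses a toy CSV with illustrative values (not the Goodbooks-10k file) to show how the `dtype` argument of `pd.read_csv` keeps such columns as text. Whether the upstream file still holds the full digits is not something this notebook verifies.

```python
import io
import pandas as pd

# Toy CSV standing in for a books file; the values are illustrative only
csv_text = "book_id,isbn,isbn13\n1,0439023483,9780439023481\n2,,9780316015844\n"

# Default parsing treats the identifiers as numbers and drops the leading zero
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Forcing string dtype keeps the identifiers exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"isbn": str, "isbn13": str})

print(as_numbers.dtypes)
print(as_text.head())
```

If this approach were adopted, `load_and_inspect` would need an extra argument to pass `dtype` through to `pd.read_csv`; that change is left for the cleaning notebook.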

In [7]:
# URL of the Goodbooks-10k ratings file (hosted on GitHub)
ratings_url = "https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv"

# Load the ratings data into a pandas DataFrame
ratings = load_and_inspect(ratings_url, 'Ratings')  # Contains user_id, book_id, and rating columns


Ratings loaded successfully.
Shape: 5976479 rows × 3 columns

Data Overview:
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5976479 entries, 0 to 5976478
Data columns (total 3 columns):
 #   Column   Dtype
---  ------   -----
 0   user_id  int64
 1   book_id  int64
 2   rating   int64
dtypes: int64(3)
memory usage: 136.8 MB
None
------------------------------------------------------------

No missing data detected

Duplicate Rows: 0


In [8]:
# Preview the first 5 rows of the ratings dataset
print("Ratings Data:")
display(ratings.head())

Ratings Data:


| | user_id | book_id | rating |
|---|---|---|---|
| 0 | 1 | 258 | 5 |
| 1 | 2 | 4081 | 4 |
| 2 | 2 | 260 | 5 |
| 3 | 2 | 9296 | 5 |
| 4 | 2 | 2318 | 3 |


This dataset comprises about 5.98 million user–book interaction records (5,976,479 rows) across three fields: `user_id`, `book_id`, and `rating`. There are no missing values or fully duplicated rows, so no structural issues were found at this stage.
The structure and completeness make it well suited to collaborative filtering and user engagement modeling. Combined with the book metadata, it provides a solid foundation for analyzing reading preferences and predicting personalized selections.
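
The ratings table occupies 136.8 MB because all three columns are stored as `int64`. As a minimal sketch, the toy frame below (randomly generated, with assumed value ranges) shows how `pd.to_numeric` with `downcast="integer"` picks the smallest integer type that fits each column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ratings table; the value ranges are assumptions
rng = np.random.default_rng(0)
toy_ratings = pd.DataFrame({
    "user_id": rng.integers(1, 60_000, size=1_000),
    "book_id": rng.integers(1, 10_001, size=1_000),
    "rating": rng.integers(1, 6, size=1_000),
})

# Downcast each column to the smallest integer dtype that holds its values
compact = toy_ratings.apply(pd.to_numeric, downcast="integer")

before = toy_ratings.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print(compact.dtypes)
print(f"Memory: {before} bytes -> {after} bytes")
```

The same call could be applied to `ratings` before modeling. The saving on the real data depends on the actual ID ranges, which are not inspected in this notebook.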

### Best Books Ever Dataset
In this step, we load the **Best Books Ever** dataset, which contains metadata and ratings for a wider range of books. We can reuse the same function defined earlier to load and inspect this dataset.

In [9]:

# show_head=True also previews the first few rows to give an initial view of structure and content
bbe_url = 'https://raw.githubusercontent.com/scostap/goodreads_bbe_dataset/refs/heads/main/Best_Books_Ever_dataset/books_1.Best_Books_Ever.csv'
bbe = load_and_inspect(bbe_url, 'Best Books Ever', show_head=True)


Best Books Ever loaded successfully.
Shape: 52478 rows × 25 columns

Data Overview:
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52478 entries, 0 to 52477
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bookId            52478 non-null  object 
 1   title             52478 non-null  object 
 2   series            23470 non-null  object 
 3   author            52478 non-null  object 
 4   rating            52478 non-null  float64
 5   description       51140 non-null  object 
 6   language          48672 non-null  object 
 7   isbn              52478 non-null  object 
 8   genres            52478 non-null  object 
 9   characters        52478 non-null  object 
 10  bookFormat        51005 non-null  object 
 11  edition           4955 non-null   object 
 12  pages             50131 non-null  object 
 13  publisher         48782 non-null  o

| | bookId | title | series | author | rating | description | language | isbn | genres | characters | ... | firstPublishDate | awards | numRatings | ratingsByStars | likedPercent | setting | coverImg | bbeScore | bbeVotes | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2767052-the-hunger-games | The Hunger Games | The Hunger Games #1 | Suzanne Collins | 4.33 | WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE... | English | 9780439023481 | ['Young Adult', 'Fiction', 'Dystopia', 'Fantas... | ['Katniss Everdeen', 'Peeta Mellark', 'Cato (H... | ... | | ['Locus Award Nominee for Best Young Adult Boo... | 6376780 | ['3444695', '1921313', '745221', '171994', '93... | 96.0 | ['District 12, Panem', 'Capitol, Panem', 'Pane... | https://i.gr-assets.com/images/S/compressed.ph... | 2993816 | 30516 | 5.09 |
| 1 | 2.Harry_Potter_and_the_Order_of_the_Phoenix | Harry Potter and the Order of the Phoenix | Harry Potter #5 | J.K. Rowling, Mary GrandPré (Illustrator) | 4.5 | There is a door at the end of a silent corrido... | English | 9780439358071 | ['Fantasy', 'Young Adult', 'Fiction', 'Magic',... | ['Sirius Black', 'Draco Malfoy', 'Ron Weasley'... | ... | 06/21/03 | ['Bram Stoker Award for Works for Young Reader... | 2507623 | ['1593642', '637516', '222366', '39573', '14526'] | 98.0 | ['Hogwarts School of Witchcraft and Wizardry (... | https://i.gr-assets.com/images/S/compressed.ph... | 2632233 | 26923 | 7.38 |
| 2 | 2657.To_Kill_a_Mockingbird | To Kill a Mockingbird | To Kill a Mockingbird | Harper Lee | 4.28 | The unforgettable novel of a childhood in a sl... | English | 9999999999999 | ['Classics', 'Fiction', 'Historical Fiction', ... | ['Scout Finch', 'Atticus Finch', 'Jem Finch', ... | ... | 07/11/60 | ['Pulitzer Prize for Fiction (1961)', 'Audie A... | 4501075 | ['2363896', '1333153', '573280', '149952', '80... | 95.0 | ['Maycomb, Alabama (United States)'] | https://i.gr-assets.com/images/S/compressed.ph... | 2269402 | 23328 | |


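The dataset contains 52,478 records and 25 features, providing extensive coverage of book metadata, author details, and reader engagement metrics. Most core fields (`title`, `author`, `rating`, `isbn`, `genres`) have no null values, indicating strong potential for content-based analysis. Non-null does not always mean valid, however: the preview shows an `isbn` of `9999999999999`, which looks like a placeholder, and list-like fields such as `genres` and `characters` are stored as strings that will need parsing.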

However, 12 columns exhibit missing data, mainly in auxiliary attributes such as series, book format, publisher, publication dates, and price, which may require selective imputation or exclusion in later cleaning. The presence of 50 duplicate rows suggests minor redundancy that should be addressed during the Data Preparation phase. Overall, the dataset is rich and comprehensive, with data quality issues confined to non-critical metadata fields.
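
The preview shows that `genres`, `characters`, `awards`, and similar columns hold Python-style lists encoded as strings. As a minimal sketch for the later cleaning stage, they could be parsed with `ast.literal_eval`. `parse_list_cell` is a hypothetical helper, and the toy values are illustrative.

```python
import ast
import pandas as pd

def parse_list_cell(value):
    """Convert a string such as "['Fantasy', 'Fiction']" into a Python list.

    Returns an empty list for missing or unparseable cells.
    """
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

# Toy column standing in for bbe['genres']
toy = pd.DataFrame({"genres": ["['Fantasy', 'Young Adult']", "[]", None, "not a list"]})
toy["genres_list"] = toy["genres"].apply(parse_list_cell)

# One row per genre, then count occurrences (empty lists drop out as NaN)
genre_counts = toy["genres_list"].explode().value_counts()
print(genre_counts)
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for data downloaded from an external source.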

## Overlap Inspection
After loading the datasets, it is important to inspect the overlap between them to ensure consistency and identify potential integration points for analysis. The two sources do not share a ready-made key, so we compare `goodreads_book_id` in Goodbooks-10k with the numeric prefix of `bookId` in Best Books Ever.

This will ensure that we are able to simulate user interactions on a subscription platform while validating book popularity across a broader market context.

### Overlap Analysis

In [10]:
# Extract numeric identifiers from the 'bookId' string column in the Best Books Ever dataset
bbe['bookId_num'] = (
    bbe['bookId']                                # Access the bookId column
    .astype(str)                                 # Ensure the data is treated as strings
    .str.extract(r'^(\d+)[.-]', expand=False)    # Leading digits before '.' or '-'; expand=False returns a Series
    .astype(float)                               # Float dtype keeps unmatched rows as NaN
)


In [11]:
# Ensure consistent numeric data types for key identifier columns in both datasets
bbe['bookId_num'] = bbe['bookId_num'].astype(float)
books['goodreads_book_id'] = books['goodreads_book_id'].astype(float)
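
To illustrate what the extraction does, the sketch below runs the same regular expression on the three `bookId` values from the preview above, plus one hypothetical malformed value. It also shows an alternative to the float cast: pandas' nullable `Int64` dtype, which keeps IDs as integers while still allowing missing values. This is offered as an option, not as the notebook's method.

```python
import pandas as pd

# IDs from the preview above, plus one hypothetical malformed value
sample_ids = pd.Series([
    "2767052-the-hunger-games",
    "2.Harry_Potter_and_the_Order_of_the_Phoenix",
    "2657.To_Kill_a_Mockingbird",
    "no-leading-digits",  # hypothetical bad value, not taken from the dataset
])

# Capture the leading digits that are followed by '.' or '-'
extracted = sample_ids.str.extract(r"^(\d+)[.-]", expand=False)

# Convert to nullable integers; rows without a match become <NA>
book_id_num = pd.to_numeric(extracted, errors="coerce").astype("Int64")
print(book_id_num)
```

Keeping IDs as integers avoids writing values such as `2767052.0` when the frames are later saved to CSV.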


In [12]:
# Flag Best Books Ever rows whose numeric ID also appears in Goodbooks-10k
overlap_mask = bbe['bookId_num'].isin(books['goodreads_book_id'])

In [13]:
# Calculate the number and percentage of overlapping book IDs between the two datasets
overlap_count = overlap_mask.sum()

# Percentage of overlap relative to Best Books Ever (bbe)
overlap_pct_bbe = overlap_count / len(bbe) * 100

# Percentage of overlap relative to Goodbooks-10k (books)
overlap_pct_books = overlap_count / len(books) * 100

# Display the results
print(f"Number of overlapping IDs: {overlap_count}")
print(f"Percentage overlap (relative to Best Books Ever): {overlap_pct_bbe:.2f}%")
print(f"Percentage overlap (relative to Goodbooks-10k): {overlap_pct_books:.2f}%")


Number of overlapping IDs: 8039
Percentage overlap (relative to Best Books Ever): 15.32%
Percentage overlap (relative to Goodbooks-10k): 80.39%


The overlap analysis finds 8,039 **Best Books Ever** rows whose ID also appears in **Goodbooks-10k**, corresponding to approximately 80% of the **Goodbooks-10k** catalog and 15% of the larger **Best Books Ever** collection. Because the count is taken over BBE rows, any repeated IDs in BBE (the dataset contains 50 duplicate rows) are counted more than once, so these figures are best read as close approximations until duplicates are removed.
This is a strong and realistic overlap for modeling purposes: the high proportion on the **Goodbooks** side supports reliable alignment for simulating member-book interactions, while the lower proportion on the **BBE** side indicates a broader untapped catalog.
This balance resembles the situation on subscription platforms, where licensing deals or curated partnerships cover a subset of available titles, leaving a substantial pool of additional books for potential recommendation expansion.
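
The percentages above are computed from a row-level mask, so duplicated IDs on the BBE side can inflate the count. Here is a minimal sketch of an overlap measure based on unique IDs. `id_overlap` is a hypothetical helper, and the toy IDs are illustrative.

```python
import pandas as pd

def id_overlap(left_ids, right_ids):
    """Compare two ID columns on their unique, non-missing values."""
    left = set(pd.Series(left_ids).dropna().unique())
    right = set(pd.Series(right_ids).dropna().unique())
    shared = left & right
    return {
        "shared": len(shared),
        "pct_of_left": len(shared) / len(left) * 100 if left else 0.0,
        "pct_of_right": len(shared) / len(right) * 100 if right else 0.0,
    }

# Toy IDs: the left side repeats ID 2, as a catalog with duplicate rows might
toy_bbe_ids = [1, 2, 2, 3, None, 99]
toy_books_ids = [2, 3, 4, 5]

result = id_overlap(toy_bbe_ids, toy_books_ids)
row_level = pd.Series(toy_bbe_ids).isin(toy_books_ids).sum()
print(result, "row-level count:", row_level)
```

On the real data, `id_overlap(bbe['bookId_num'], books['goodreads_book_id'])` would give the unique-ID figures to compare against the row-level count of 8,039.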

## Summary of Findings

Data collection and initial inspection were successfully completed. The datasets together provide a strong foundation for analyzing book popularity, reader engagement, and cross-platform catalog alignment.

- **Data Quality:** Most core fields (IDs, titles, authors, ratings) are complete and consistent.  
  Missing data is mainly confined to secondary metadata such as publication details, ISBNs, and languages.  
  Duplicate records were minimal and limited to the Best Books Ever dataset (50 entries).

- **Schema Harmonization:** Book identifiers were standardized into numeric format to ensure compatibility across sources.  
  This enables future merging of metadata and rating data into a unified analytical dataset.

- **Cross-Dataset Overlap:** The overlap analysis found **8,039 matching book IDs**, covering about **80% of Goodbooks-10k** and **15% of Best Books Ever**. This reflects a realistic representation of shared catalog licensing, sufficient for simulation while leaving room for novel recommendations.

- **Next Steps:**  
  The upcoming **02_Data_Cleaning** notebook will focus on cleaning and transforming the data:
  - Handling missing values and duplicates  
  - Normalizing column types and naming conventions  
  - (If needed) merging datasets based on book IDs to create a consolidated dataset for analysis and modeling.

## Save Collected Data
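In this final section, we save the collected datasets to CSV files so that later analysis and modeling steps can be reproduced and traced back to this notebook. Note that the saved `books` and `bbe` frames include the changes made during overlap inspection: `goodreads_book_id` cast to float and the derived `bookId_num` column.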

In [15]:
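# Create the output folder if it does not exist yet (os was imported earlier)
os.makedirs('data', exist_ok=True)
books.to_csv('data/books.csv', index=False)
ratings.to_csv('data/ratings.csv', index=False)
bbe.to_csv('data/bbe_books.csv', index=False)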