# Data Collection
## Objectives
The goal of this notebook is to **collect, consolidate, and prepare datasets** that will be used to build data-driven insights for the *Book Subscription Optimization* project.  
This aligns with the **Data Collection and Understanding** stages of the CRISP-DM process, ensuring that the data foundation supports later stages of modeling and evaluation.

Specifically, this notebook aims to:
- Retrieve and load multiple book-related datasets from open sources.  
- Perform initial validation to assess structure, completeness, and consistency. 
- Note any data quality issues for future cleaning steps.

## Inputs
- **Goodbooks-10k Dataset:** User-book interactions and ratings, used to simulate subscription platform interactions.  
- **Best Books Ever Dataset:** Global book metadata and ratings, used for cross-platform popularity validation and content catalogue proxy.  

Each dataset contributes unique dimensions: reader behavior, book features, and market data. This structure allows for a holistic analysis of book engagement and satisfaction.

## Outputs
- Metadata summary of dataset structure, completeness, and variable distributions.  
- Preliminary insights into data coverage for later CRISP-DM stages (Data Understanding & Preparation).
- Original datasets saved in CSV format for reproducibility and future use.

> **Note:** This notebook focuses on data collection and initial assessment. Detailed cleaning and transformation will be addressed in subsequent notebooks.

---

## Navigate to the Parent Directory

Before loading and saving datasets, it helps to work from the project's root folder so that relative paths used in file operations (loading or saving data) resolve consistently.

Before moving one level up, inspect the current working directory with Python's built-in `os` module.

In [None]:
import os

# Get the current working directory
current_dir = os.getcwd()
print(f'Current directory: {current_dir}')

To change to the parent directory (the project root), run the cell below. Skip this step if you are already in the root folder, because each run moves up one more level.

In [None]:
# Change the working directory to its parent
os.chdir(os.path.dirname(current_dir))
print('Changed directory to parent.')

# Get the new current working directory (the parent directory)
current_dir = os.getcwd()
print(f'New current directory: {current_dir}')
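
Because the cell above moves up one level every time it runs, a re-run can leave the notebook outside the project. A minimal sketch of an idempotent alternative, assuming the project root can be recognized by a marker file; the `find_project_root` helper and the `README.md` marker are hypothetical and not part of the original notebook:

```python
import os
from pathlib import Path


def find_project_root(start, marker="README.md"):
    """Return the closest directory at or above `start` that contains `marker`.

    The helper name and default marker are illustrative assumptions.
    Falls back to `start` itself when no marker is found.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return start


# Safe to re-run: once the working directory is the root, nothing changes
root = find_project_root(os.getcwd())
os.chdir(root)
print(f"Working directory: {root}")
```

Searching upward for a marker trades a little extra code for a cell that can be executed repeatedly without side effects.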

## Fetch data from various sources

In this section, we fetch data from multiple open datasets and inspect their basic properties to understand their structure and content. Since all the datasets are hosted on GitHub, we use pandas to read them directly from their raw URLs.
To streamline the process, we will define a function that loads a dataset from a given URL and prints its shape, columns, and missing values.

In [None]:
import pandas as pd 

def load_and_inspect(path, name, show_head=False):
    """
    Load a dataset and perform initial structure validation.
    - Structure and type overview
    - Missing value summary
    - Duplicate count

    Parameters
    ----------
    path : str
        File path or URL of the dataset (CSV format expected).
    name : str
        Readable name for reporting purposes.
    show_head : bool, optional
        If True, displays the first three rows for preview. Default is False.

    Returns
    -------
    pd.DataFrame
        Loaded dataset for further exploration.
    """

    try:
        df = pd.read_csv(path)
        print(f"\n{name} loaded successfully.")
        print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
        
        # Structural Overview
        print("\nData Overview:")
        print("-" * 60)
        df.info()  # info() prints its report directly and returns None
        print("-" * 60)
        
        # Missing value summary
        missing_cols = df.columns[df.isnull().any()].tolist()
        if missing_cols:
            print("\nColumns with Missing Data:", ", ".join(missing_cols))
        else:
            print("\nNo missing data detected")

        # Duplicate check
        duplicate_count = df.duplicated().sum()
        print(f"\nDuplicate Rows: {duplicate_count}")

        if show_head:
            display(df.head(3))

        return df

    except FileNotFoundError:
        print(f"File not found: {path}")
    except Exception as e:
        print(f"Error loading {name}: {e}")
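
The missing-value step above lists only the names of the affected columns. A minimal sketch of a per-column count and percentage, which may help prioritize later cleaning; the `missing_summary` helper and the toy frame are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns with gaps.

    Hypothetical helper for illustration only.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values, worst first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Toy data standing in for a loaded dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "isbn": ["111", None, None, "444"],
    "language_code": ["eng", "eng", None, "eng"],
})
print(missing_summary(toy))
```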


### Goodbooks-10k Dataset
In this step, we load two datasets from the **Goodbooks-10k** project, hosted on GitHub.
This dataset will simulate user-book interactions on a book subscription platform.

These datasets contain:

- **Books**: metadata about 10,000 books (e.g. title, authors, publication year, ratings).
- **Ratings**: nearly 6 million individual ratings given by readers.

In [None]:
# URL of the Goodbooks-10k books file (hosted on GitHub)
books_url = "https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv"

# Load the book metadata into a pandas DataFrame
books = load_and_inspect(books_url, 'Books')

In [None]:
# Preview the first 5 rows of the books dataset
print("Books Data:")
display(books.head())


The dataset includes 10,000 records and 23 attributes describing book metadata, publication details, and aggregated rating statistics. Core fields such as `title`, `authors`, and `average_rating` are complete and well-defined, which supports downstream recommendation modeling.
A few metadata fields (`isbn`, `isbn13`, `original_title`, `language_code`, and `original_publication_year`) contain missing values, mainly related to identification or translation details. No duplicate records were detected. Overall, the dataset is clean, compact, and suitable for integrating with user rating data to form the analytical base of the project.

In [None]:
# URL of the Goodbooks-10k ratings file (hosted on GitHub)
ratings_url = "https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv"

# Load the ratings into a pandas DataFrame (user_id, book_id, and rating columns)
ratings = load_and_inspect(ratings_url, 'Ratings')

In [None]:
# Preview the first 5 rows of the ratings dataset
print("Ratings Data:")
display(ratings.head())

This dataset comprises 5.97 million user–book interaction records across three fields: `user_id`, `book_id`, and `rating`. There are no missing values or duplicate entries, indicating excellent data integrity.
The structure and completeness make it highly suitable for collaborative filtering and user engagement modeling. Combined with the book metadata, it provides a robust foundation for analyzing reading preferences and predicting personalized selections.

### Best Books Ever Dataset
In this step, we load the **Best Books Ever** dataset, which contains metadata and ratings for a wider range of books. We can reuse the same function defined earlier to load and inspect this dataset.

In [None]:

# show_head=True previews the first few rows for an initial look at structure and content
bbe_url = 'https://raw.githubusercontent.com/scostap/goodreads_bbe_dataset/refs/heads/main/Best_Books_Ever_dataset/books_1.Best_Books_Ever.csv'
bbe = load_and_inspect(bbe_url, 'Best Books Ever', show_head=True)

The dataset contains 52,478 records and 25 features, providing extensive coverage of book metadata, author details, and reader engagement metrics. Most core fields (title, author, rating, ISBN, genres) are complete and well-structured, indicating strong potential for content-based analysis.

However, 12 columns exhibit missing data, mainly in auxiliary attributes such as series, book format, publisher, publication dates, and price, which may require selective imputation or exclusion in later cleaning. The presence of 50 duplicate rows suggests minor redundancy that should be addressed during the Data Preparation phase. Overall, the dataset is rich and comprehensive, with data quality issues confined to non-critical metadata fields.
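
Before dropping duplicate rows, it can help to view every copy side by side. A minimal sketch using `DataFrame.duplicated(keep=False)` on toy data; the column values below are invented for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "bookId": ["1-a", "2-b", "1-a", "3-c"],
    "title": ["A", "B", "A", "C"],
})

# keep=False flags every member of a duplicate group, not just the repeats
dupes = toy[toy.duplicated(keep=False)].sort_values("bookId")
print(dupes)

# The default keep="first" counts only the extra copies, which is what a
# df.duplicated().sum() check reports
extra_copies = int(toy.duplicated().sum())
print(f"Rows that would be dropped: {extra_copies}")
```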

## Overlap Inspection
After loading the datasets, it is important to inspect the overlap between them to ensure consistency and identify potential integration points for analysis. We will compare the numeric part of `bookId` in the Best Books Ever dataset with `goodreads_book_id` in the Goodbooks-10k dataset to find books present in both.

This will ensure that we are able to simulate user interactions on a subscription platform while validating book popularity across a broader market context.

### Overlap Analysis

In [None]:
# Extract numeric identifiers from the 'bookId' string column in the Best Books Ever dataset
bbe['bookId_num'] = (
    bbe['bookId']
    .astype(str)
    .str.extract(r'^(\d+)', expand=False)  # Leading digits only; expand=False returns a Series
    .astype(float)                         # float so that unmatched rows can be stored as NaN
)
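
To see what the pattern captures, here is a small standard-library sketch applied to a few identifier strings. The sample values are assumptions about the `bookId` format (digits followed by `.` or `-` and a slug), not rows taken from the dataset, and `leading_id` is a hypothetical helper:

```python
import re

LEADING_DIGITS = re.compile(r"^(\d+)")


def leading_id(book_id):
    """Return the leading integer of an identifier string, or None."""
    match = LEADING_DIGITS.match(str(book_id))
    return int(match.group(1)) if match else None


samples = ["2767052-the-hunger-games", "2.Harry_Potter", "no-digits-here"]
for s in samples:
    print(f"{s!r} -> {leading_id(s)}")
```

Strings without leading digits yield `None` here, which corresponds to the `NaN` entries checked in the next step.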


Next, we check for missing or inconsistent values after extracting the numeric book IDs from the Best Books Ever dataset.

In [None]:
# Count missing values
missing_count = bbe['bookId_num'].isnull().sum()
print(f"Missing bookId_num entries: {missing_count}")

# Select all rows where bookId_num is null
bbe_missing_bookId_num = bbe[bbe['bookId_num'].isnull()]

# Inspect them
bbe_missing_bookId_num.head()

In [None]:
# Ensure consistent numeric data types for key identifier columns in both datasets
bbe['bookId_num'] = bbe['bookId_num'].astype(float)
books['goodreads_book_id'] = books['goodreads_book_id'].astype(float)
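
Casting to `float` lets missing IDs be stored as `NaN`, but it also means the IDs are written with a trailing `.0` when exported to CSV. A possible alternative, sketched here on toy values, is the pandas nullable `Int64` dtype; whether it suits the later notebooks is an assumption to verify:

```python
import pandas as pd

# Toy IDs as they might look after regex extraction (one unmatched row)
ids_float = pd.Series([2767052.0, 2.0, None], name="bookId_num")

# Nullable integer dtype keeps the missing entry as <NA> instead of NaN
ids_int = ids_float.astype("Int64")

print(ids_float.to_csv(index=False))  # values carry a trailing ".0"
print(ids_int.to_csv(index=False))    # values are written as plain integers
```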


In [None]:
# Create a mask that flags Best Books Ever rows whose ID also appears in Goodbooks-10k
overlap_mask = bbe['bookId_num'].isin(books['goodreads_book_id'])

In [None]:
# Calculate the number and percentage of overlapping book IDs between the two datasets
overlap_count = overlap_mask.sum()

# Percentage of overlap relative to Best Books Ever (bbe)
overlap_pct_bbe = overlap_count / len(bbe) * 100

# Percentage of overlap relative to Goodbooks-10k (books)
overlap_pct_books = overlap_count / len(books) * 100

# Display the results
print(f"Number of overlapping IDs: {overlap_count}")
print(f"Percentage overlap (relative to Best Books Ever): {overlap_pct_bbe:.2f}%")
print(f"Percentage overlap (relative to Goodbooks-10k): {overlap_pct_books:.2f}%")


The overlap analysis reveals 8,039 shared book IDs between the **Best Books Ever** and **Goodbooks-10k** datasets, representing approximately 80% coverage of the **Goodbooks-10k** catalog and 15% coverage of the larger **Best Books Ever** collection.
This represents a strong and realistic overlap for modeling purposes: the high proportion on the **Goodbooks** side ensures reliable alignment for simulating member-book interactions, while the lower proportion on the **BBE** side indicates a broader untapped catalog. 
This balance mirrors real-world dynamics in subscription platforms, where temporary licensing deals or curated partnerships cover a subset of available titles, leaving a substantial pool of additional books for potential recommendation expansion.
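
One caveat: `overlap_mask.sum()` counts rows in the Best Books Ever frame, so any ID repeated there is counted more than once, which can slightly inflate the share reported against Goodbooks-10k. A minimal sketch of a set-based count on unique IDs, using toy values rather than the real data:

```python
import pandas as pd

# Toy identifiers; the real columns are bbe['bookId_num'] and
# books['goodreads_book_id']
bbe_ids = pd.Series([1.0, 2.0, 2.0, 3.0, None])
books_ids = pd.Series([2.0, 3.0, 4.0])

# Row-based count: the duplicated ID 2.0 is counted twice
row_overlap = int(bbe_ids.isin(books_ids).sum())

# Set-based count: each shared ID is counted once, missing values are ignored
shared_ids = set(bbe_ids.dropna()) & set(books_ids.dropna())
unique_overlap = len(shared_ids)

print(f"Row-based overlap: {row_overlap}")
print(f"Unique-ID overlap: {unique_overlap}")
print(f"Share of books_ids covered: {unique_overlap / books_ids.nunique() * 100:.2f}%")
```

If the two counts differ on the real data, the unique-ID figure is the safer one to quote for catalog coverage.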

## Summary of Findings

Data collection and initial inspection were successfully completed. The datasets together provide a strong foundation for analyzing book popularity, reader engagement, and cross-platform catalog alignment.

- **Data Quality:** Most core fields (IDs, titles, authors, ratings) are complete and consistent.  
  Missing data is mainly confined to secondary metadata such as publication details, ISBNs, and languages.  
  Duplicate records were minimal and limited to the Best Books Ever dataset (50 entries).

- **Schema Harmonization:** Book identifiers were standardized into numeric format to ensure compatibility across sources.  
  This enables future merging of metadata and rating data into a unified analytical dataset.

- **Cross-Dataset Overlap:** The overlap analysis revealed **8,039 shared book IDs**, covering **80% of Goodbooks-10k** and **15% of Best Books Ever**. This reflects a realistic representation of shared catalog licensing, sufficient for simulation while leaving room for novel recommendations.

- **Next Steps:**  
  The upcoming **02_Data_Cleaning** notebook will focus on cleaning and transforming the data:
  - Handling missing values and duplicates  
  - Normalizing column types and naming conventions  
  - (If needed) merging datasets on book IDs to create a consolidated dataset for analysis and modeling.

## Save Collected Data
In this final section, we save the collected datasets as CSV files so that later analysis and modeling steps can reproduce and trace their inputs.

In [None]:
from pathlib import Path

# Create the raw data folder if it does not exist
raw_path = Path("data/raw")
raw_path.mkdir(parents=True, exist_ok=True)

books.to_csv(raw_path / "books.csv", index=False)
ratings.to_csv(raw_path / "ratings.csv", index=False)
# Note: bbe now also carries the derived 'bookId_num' column added during overlap inspection
bbe.to_csv(raw_path / "bbe_books.csv", index=False)

print("Datasets saved successfully in data/raw/ directory.")

**Data Sample**  
> This step creates small representative samples of the full datasets (books, ratings, and Best Books Ever) to document the data structure.  
> The full raw files are stored locally (not committed to the repository) to comply with size limits, while the sample files in `data/sample/` allow evaluators to inspect dataset content and metadata consistency.


In [None]:
# Create the sample folder if it does not exist
sample_path = Path("data/sample")
sample_path.mkdir(parents=True, exist_ok=True)

# Draw random samples; the fixed random_state makes them reproducible
books_sample = books.sample(n=1000, random_state=42)
ratings_sample = ratings.sample(n=5000, random_state=42)
bbe_sample = bbe.sample(n=1000, random_state=42)

# Save sampled datasets
books_sample.to_csv(sample_path / "books_sample.csv", index=False)
ratings_sample.to_csv(sample_path / "ratings_sample.csv", index=False)
bbe_sample.to_csv(sample_path / "bbe_books_sample.csv", index=False)

print("Sample datasets saved successfully in data/sample/")