# Reading Topography - MVP Notebook

This notebook implements the first milestone: CSV ingestion, normalization, and deduplication.

## Overview
1. **Setup**: Import libraries and configure settings
2. **Load Goodreads CSV**: Read and validate the CSV export
3. **Normalize & Deduplicate**: Clean data and remove duplicates
4. **Smoke Test**: Verify the pipeline works with sample data
5. **Summary**: Review results and next steps

## 1. Setup

In [None]:
# Standard imports
import sys
import os
import pandas as pd
import numpy as np

# Set up the environment for Colab (must happen before importing from src)
if 'google.colab' in sys.modules:
    import subprocess
    from pathlib import Path
    
    # Check if repository is already cloned
    repo_path = Path('/content/ReadingTopography')
    if not repo_path.exists():
        print('Cloning repository...')
        subprocess.run(['git', 'clone', 'https://github.com/olivialynn/ReadingTopography.git'], 
                      cwd='/content', check=True)
        print('✓ Repository cloned')
    else:
        print('✓ Repository already available')
    
    # Add to Python path
    sys.path.insert(0, '/content/ReadingTopography')
else:
    # Local Jupyter - add parent directory to path
    sys.path.insert(0, os.path.abspath('..'))

# Now we can import from src (repo is cloned and in path)
from src.notebook_utils import setup_notebook_environment, get_data_path

# Complete environment initialization
env = setup_notebook_environment()

# Import our data ingestion functions
from src.data_ingestion import (
    load_goodreads_csv,
    normalize_title,
    normalize_author,
    deduplicate_books,
    process_goodreads_csv
)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.width', None)

print('✓ Setup complete')

## 2. Load Goodreads CSV

Load the Goodreads export CSV and keep only the books on the "to-read" shelf.
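
To make the load step concrete, here is a minimal sketch of what reading, validating, and shelf-filtering could look like. `sketch_load_goodreads_csv` and `REQUIRED_COLUMNS` are hypothetical names used only for illustration; the real `load_goodreads_csv` in `src/data_ingestion.py` may validate different columns or filter differently. The sketch assumes the shelf is stored in an `Exclusive Shelf` column.

```python
import io
import pandas as pd

# Hypothetical stand-in for load_goodreads_csv; the real function may differ.
REQUIRED_COLUMNS = {"Title", "Author", "Exclusive Shelf"}  # assumed column names

def sketch_load_goodreads_csv(source, filter_shelf=None):
    """Read a Goodreads-style export, validate columns, optionally filter by shelf."""
    # Read every field as a string so ISBN-like values are not parsed as floats
    df = pd.read_csv(source, dtype=str, keep_default_na=False)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"CSV is missing required columns: {sorted(missing)}")
    if filter_shelf is not None:
        df = df[df["Exclusive Shelf"] == filter_shelf]
    return df.reset_index(drop=True)

# Tiny in-memory CSV standing in for a real export
sample_csv = io.StringIO(
    "Title,Author,ISBN13,Exclusive Shelf\n"
    "Dune,Frank Herbert,9780441013593,to-read\n"
    "Emma,Jane Austen,,read\n"
)
sample_df = sketch_load_goodreads_csv(sample_csv, filter_shelf="to-read")
print(sample_df)
```

Reading with `dtype=str` and `keep_default_na=False` keeps blank ISBN fields as empty strings, which matches how the later cells compare `isbn13` against `''`.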

In [None]:
# Path to your Goodreads CSV export
# This works in both Colab and local environments
CSV_PATH = get_data_path('sample_goodreads.csv')

# Load CSV with validation
df_raw = load_goodreads_csv(str(CSV_PATH), filter_shelf='to-read')

print(f"\nLoaded {len(df_raw)} books from {CSV_PATH}")
print(f"\nColumns available: {', '.join(df_raw.columns)}")
print(f"\nFirst few rows:")
df_raw.head()

## 3. Normalize & Deduplicate

Apply normalization to title and author fields, then remove duplicates.
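
As a rough illustration of these two steps, the sketch below normalizes text and then drops repeated rows. `sketch_normalize` and `sketch_deduplicate` are hypothetical helpers, and the rules they apply (lowercasing, stripping accents and punctuation, keying on ISBN-13 when present) are assumptions; the real `normalize_title`, `normalize_author`, and `deduplicate_books` may behave differently.

```python
import re
import unicodedata
import pandas as pd

# Hypothetical sketches; the functions in src/data_ingestion.py may apply other rules.
def sketch_normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    if not isinstance(text, str):
        return ""
    # Decompose accented characters and drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def sketch_deduplicate(df):
    """Keep the first row per ISBN-13, or per (title, author) pair when ISBN is blank."""
    has_isbn = df["isbn13"] != ""
    # Use the ISBN as the key when present, otherwise the normalized title/author pair
    key = df["isbn13"].where(has_isbn, df["title_norm"] + "|" + df["author_norm"])
    return df.loc[~key.duplicated(keep="first")].reset_index(drop=True)

books = pd.DataFrame({
    "Title": ["Dune", "DUNE ", "Emma", "Emma"],
    "Author": ["Frank Herbert", "Frank Herbert", "Jane Austen", "Jane Austen"],
    "isbn13": ["9780441013593", "9780441013593", "", ""],
})
books["title_norm"] = books["Title"].apply(sketch_normalize)
books["author_norm"] = books["Author"].apply(sketch_normalize)
unique_books = sketch_deduplicate(books)
print(unique_books[["Title", "isbn13"]])
```

One limitation of this particular keying scheme: a row with an ISBN and a row without one are never merged, even if title and author match.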

In [None]:
# Apply normalization
df_raw['title_norm'] = df_raw['Title'].apply(normalize_title)
df_raw['author_norm'] = df_raw['Author'].apply(normalize_author)

# Handle ISBN13
if 'ISBN13' in df_raw.columns:
    df_raw['isbn13'] = df_raw['ISBN13'].astype(str).replace('nan', '')
elif 'ISBN' in df_raw.columns:
    df_raw['isbn13'] = df_raw['ISBN'].astype(str).replace('nan', '')
else:
    df_raw['isbn13'] = ''

print("✓ Normalization applied")
print("\nExample normalizations:")
print(df_raw[['Title', 'title_norm', 'Author', 'author_norm']].head())

In [None]:
# Deduplicate
df_clean = deduplicate_books(df_raw)

print(f"\n✓ Deduplication complete")
print(f"Original count: {len(df_raw)}")
print(f"After deduplication: {len(df_clean)}")
print(f"Duplicates removed: {len(df_raw) - len(df_clean)}")

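## 4. Smoke Test

Run the complete pipeline in one step on the sample data to verify that it works end to end.

In [None]: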
# Run complete pipeline in one step
df_processed = process_goodreads_csv(str(CSV_PATH), filter_shelf='to-read')

print(f"\n✓ Pipeline complete")
print(f"\nFinal dataset: {len(df_processed)} unique books")
print(f"\nColumns in output:")
print(df_processed.columns.tolist())

In [None]:
# Display processed data
print("Processed books:")
display_cols = ['Title', 'Author', 'title_norm', 'author_norm', 'isbn13']
df_processed[display_cols].head(10)
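
Beyond eyeballing the output, a smoke test can assert a few invariants on the processed frame. The sketch below is one possible set of checks. `smoke_check` is a hypothetical helper, and the rule that `(title_norm, author_norm)` pairs must be unique is an assumption about the pipeline, not something it is known to guarantee.

```python
import pandas as pd

# Columns the notebook displays after processing
EXPECTED_COLUMNS = ["Title", "Author", "title_norm", "author_norm", "isbn13"]

def smoke_check(df):
    """Return a list of problems found in a processed frame (empty list means OK)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns
    if df.empty:
        problems.append("no rows after processing")
    if (df["title_norm"] == "").any():
        problems.append("blank title_norm values")
    # Assumed invariant: deduplication leaves each title/author pair once
    if df.duplicated(subset=["title_norm", "author_norm"]).any():
        problems.append("duplicate (title_norm, author_norm) pairs")
    return problems

good = pd.DataFrame({
    "Title": ["Dune"], "Author": ["Frank Herbert"],
    "title_norm": ["dune"], "author_norm": ["frank herbert"],
    "isbn13": ["9780441013593"],
})
bad = pd.concat([good, good], ignore_index=True)
print(smoke_check(good))
print(smoke_check(bad))
```

In the notebook, the same helper could be called as `smoke_check(df_processed)`, with an empty list indicating that the sample data passed.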

## 5. Summary Statistics

Review the data quality and coverage.

In [None]:
# Summary statistics
print("=" * 60)
print("DATA QUALITY SUMMARY")
print("=" * 60)

total_books = len(df_processed)
has_isbn = (df_processed['isbn13'] != '').sum()
missing_isbn = total_books - has_isbn
# Avoid dividing by zero when the shelf filter leaves no books
denom = max(total_books, 1)

print(f"\nTotal unique books: {total_books}")
print("\nISBN Coverage:")
print(f"  - With ISBN13: {has_isbn} ({100*has_isbn/denom:.1f}%)")
print(f"  - Without ISBN13: {missing_isbn} ({100*missing_isbn/denom:.1f}%)")

print("\nNormalization:")
print(f"  - All normalized titles non-empty: {(df_processed['title_norm'] != '').all()}")
print(f"  - All normalized authors non-empty: {(df_processed['author_norm'] != '').all()}")

print(f"\nSample normalized identifiers:")
for idx in range(min(3, len(df_processed))):
    row = df_processed.iloc[idx]
    print(f"  {idx+1}. '{row['Title']}' by {row['Author']}")
    print(f"     → title_norm: '{row['title_norm']}'")
    print(f"     → author_norm: '{row['author_norm']}'")
    if row['isbn13']:
        print(f"     → isbn13: {row['isbn13']}")

print("\n" + "=" * 60)

## Next Steps

With the foundation in place, the next milestones are:

1. **Enrichment**: Fetch metadata from Google Books and Open Library APIs
2. **Difficulty Estimation**: Compute readability metrics from text samples
3. **Visualization**: Create an interactive Plotly scatter plot
4. **Caching**: Store enrichment results to avoid repeated API calls

See [TDD Section 15](../docs/tdd.md#15-milestones-implementation-plan) for the full implementation plan.

---

## Appendix: Testing Individual Functions

You can test individual normalization functions:

In [None]:
# Test normalization functions
test_cases = [
    ("The Great Gatsby", "F. Scott Fitzgerald"),
    ("Harry Potter and the Sorcerer's Stone", "J.K. Rowling"),
    ("The Lord of the Rings: The Fellowship of the Ring", "J.R.R. Tolkien")
]

print("Normalization examples:")
print("-" * 80)
for title, author in test_cases:
    print(f"Original: '{title}' by {author}")
    print(f"  → title_norm:  '{normalize_title(title)}'")
    print(f"  → author_norm: '{normalize_author(author)}'")
    print()