
# Exploratory Data Analysis

## Objectives

This notebook performs **exploratory data analysis (EDA)** on the enriched datasets to understand book performance, catalog diversity, and user behavior.

The goal is to extract insights that guide:

* Feature engineering
* Predictive modeling
* Dashboard visualisation
* Business interpretation (ratings, satisfaction, catalog suitability)

---

## Inputs

* `en_supply_catalog.csv` — enriched BBE (supply) catalog
* `en_internal_catalog.csv` — enriched Goodbooks catalog
* `ratings_clean.csv` — cleaned user–book interactions
* `model_dataset_warm_start.csv` — unified metadata + external signals
* `model_dataset_cold_start.csv` — unified metadata (no external signals)

---

## Tasks in This Notebook

1. **Load Data**
   Read in enriched catalogs and interaction data.

2. **Book Performance Analysis**
   Ratings, popularity, publication trends, outliers.

3. **Correlation & Feature Signal Study**
   Identify which variables show predictive potential.

4. **Internal vs Supply Catalog Diversity Comparison**
   Genre mix, popularity spread, publication-year coverage.

5. **User Behavior Exploration**
   Rating patterns, user activity, interaction sparsity.

6. **Feature Engineering Recommendations**
   Identify variables needing transformation, encoding, or imputation.

7. **Generate Dashboard-Ready Visualisations**
   Plots to be reused in the Streamlit “Data Explorer” page.

---

## Outputs

* Summary tables and plots for book performance, diversity, and behavior
* Correlation & feature signal diagnostics
* Dashboard-ready visualisations (saved to `outputs/eda_plots/`)
* Notes on recommended feature transformations for Notebook 05

>**Note:**
>This notebook focuses on **analysis only**.
>Feature engineering and modeling are completed in the following notebooks.

# Set up

## Navigate to the Parent Directory

Before loading and saving datasets, it helps to move to the parent directory so that the relative paths used in file operations resolve consistently.

Before using Python’s built-in `os` module to move one level up from the current working directory, inspect the current directory first.

In [None]:
import os

# Get the current working directory
current_dir = os.getcwd()
print(f'Current directory: {current_dir}')

To change to the parent directory (the root folder), run the code below. If you are already in the root folder, skip this step.

In [None]:
# Change the working directory to its parent
os.chdir(os.path.dirname(current_dir))
print('Changed directory to parent.')

# Get the new current working directory (the parent directory)
current_dir = os.getcwd()
print(f'New current directory: {current_dir}')

## Load Datasets

In this step, we load the previously cleaned datasets for analysis.

In [None]:
from pathlib import Path
import pandas as pd

internal_catalog = pd.read_csv(
    'outputs/datasets/cleaned/en_internal_catalog.csv',
    dtype={"isbn_clean": "string", "goodreads_id_clean": "string"}
    )
ratings_clean = pd.read_csv('outputs/datasets/cleaned/ratings_clean_v1.csv')
supply_catalog = pd.read_csv(
    "outputs/datasets/cleaned/en_supply_catalog.csv",
    dtype={"isbn_clean": "string", "goodreads_id_clean": "string"}
)

# Book Performance Analysis

In this section, we analyze book performance metrics such as ratings distribution, popularity trends, and publication year analysis.

## Correlation & Feature Signal Study

In this section, we analyze the correlations between features in the dataset and their predictive power for the target variable, `rating_clean`. This analysis helps identify which features are most relevant for our predictive modeling tasks.

To do this, we use the enriched datasets loaded in the previous step. Most of our data is textual, so we need to convert it to a numerical format before analysis.

We created a feature engineering module in `src/modeling/feature_engineering.py` to help with this task. Two columns need preparation before feature engineering: `genres_clean` and `publication_date_clean`.

In [None]:
from src.cleaning.utils.helpers import safe_literal_eval

# apply to both catalogs before feature engineering
internal_catalog['genres_clean'] = internal_catalog['genres_clean'].apply(safe_literal_eval)
supply_catalog['genres_clean'] = supply_catalog['genres_clean'].apply(safe_literal_eval)

print("Genres converted to lists successfully.")
print("\nSample after conversion:")
print(internal_catalog['genres_clean'].head(5))
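
The implementation of `safe_literal_eval` is not shown in this notebook. As a rough sketch of what such a helper might do, assuming `genres_clean` is stored in the CSV as a stringified Python list, the hypothetical `parse_list_cell` below wraps the standard library's `ast.literal_eval` and falls back to an empty list for missing or malformed values. The function name and the fallback behavior are illustrative assumptions, not the project's actual code.

```python
import ast

def parse_list_cell(value):
    """Hypothetical stand-in for safe_literal_eval (assumed behavior)."""
    # Values that are already lists pass through unchanged
    if isinstance(value, list):
        return value
    # Non-string values (e.g. NaN floats from pandas) become an empty list
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Malformed strings fall back to an empty list instead of raising
        return []
    # Only accept lists; any other literal is treated as missing
    return parsed if isinstance(parsed, list) else []

print(parse_list_cell("['fantasy', 'adventure']"))
print(parse_list_cell(float("nan")))
print(parse_list_cell("not a list"))
```

Returning an empty list rather than raising keeps a later `.apply` call from failing on a single bad row, at the cost of hiding malformed input, so counting the empty results afterwards is a sensible check.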

In [None]:
from src.cleaning.utils.dates import extract_year

# transform both catalogs' publication_date_clean to extract year
internal_catalog['publication_year_clean'] = internal_catalog['publication_date_clean'].apply(extract_year)
supply_catalog['publication_year_clean'] = supply_catalog['publication_date_clean'].apply(extract_year)

print("Year extraction complete.")
print(f"\nInternal catalog - Valid years: {internal_catalog['publication_year_clean'].notna().sum()}")
print(f"Supply catalog - Valid years: {supply_catalog['publication_year_clean'].notna().sum()}")
print(f"\nYear range (internal): {internal_catalog['publication_year_clean'].min()} - {internal_catalog['publication_year_clean'].max()}")
print(f"Year range (supply): {supply_catalog['publication_year_clean'].min()} - {supply_catalog['publication_year_clean'].max()}")
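
The `extract_year` helper is imported from the project and its code is not shown here. The sketch below illustrates one way a year could be pulled from free-form date strings, assuming the dates contain a four-digit year somewhere in the text. The name `year_from_text`, the regular expression, and the accepted year range (1000 to 2099) are all hypothetical choices for illustration.

```python
import re

# Matches a plausible four-digit year between 1000 and 2099
YEAR_PATTERN = re.compile(r"\b(1\d{3}|20\d{2})\b")

def year_from_text(value):
    """Hypothetical stand-in for extract_year (assumed behavior)."""
    # Missing or non-string values have no year to extract
    if not isinstance(value, str):
        return None
    match = YEAR_PATTERN.search(value)
    return int(match.group(1)) if match else None

print(year_from_text("2004-09-16"))
print(year_from_text("September 16th 2004"))
print(year_from_text(None))
```

Dates with two-digit years, such as `09/16/04`, yield `None` under this pattern, which is one reason to compare the count of valid years against the catalog size as the cell above does.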

In [None]:
from src.modeling.feature_engineering import fe_engineering

print("Feature engineering module imported.")

# apply feature engineering to internal catalog
internal_catalog_fe = fe_engineering(
    df=internal_catalog,
    encode_text_embeddings=False,
    top_n_authors=50,
    top_n_genres=30,
    bool_cols=['has_award', 'is_major_publisher'], 
    text_col='description_clean',
    genres_col='genres_clean',
    author_col='author_clean',
    publisher_col='publisher_clean',
    series_col='series_clean'
)

print(f"Internal catalog shape: {internal_catalog_fe.shape}")
print(f"New features added: {set(internal_catalog_fe.columns) - set(internal_catalog.columns)}")
display(internal_catalog_fe.head())

Next, we remove columns that may cause data leakage. These columns contain information that would not be available at prediction time and could artificially inflate model performance. We then run the correlation and predictive power analysis on the cleaned dataset.

In [None]:
leakage_cols = [
    # Raw rating distribution components (leakage)
    'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
    'ratings_1_share', 'ratings_2_share', 'ratings_3_share',
    'ratings_4_share', 'ratings_5_share',
]

# Drop leakage columns if they exist
internal_clean = internal_catalog_fe.drop(
    columns=[c for c in leakage_cols if c in internal_catalog_fe.columns]
)
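
To see why the rating distribution columns count as leakage, note that an average rating can be reconstructed from per-star counts. The sketch below uses made-up counts and assumes that `rating_clean` is the mean of the same star ratings that `ratings_1` to `ratings_5` count. If the target is derived differently, the leakage would be approximate rather than exact.

```python
# Made-up per-star counts for one book (illustrative values only)
star_counts = {1: 10, 2: 20, 3: 70, 4: 150, 5: 250}

total_ratings = sum(star_counts.values())

# Weighted mean of the star values recovers the average rating
mean_rating = sum(star * n for star, n in star_counts.items()) / total_ratings

# The share columns carry the same information, just normalized
shares = {star: n / total_ratings for star, n in star_counts.items()}
mean_from_shares = sum(star * share for star, share in shares.items())

print(f"Total ratings: {total_ratings}")
print(f"Mean rating from counts: {mean_rating:.2f}")
print(f"Mean rating from shares: {mean_from_shares:.2f}")
```

Both routes give the same mean, so a model that sees either the counts or the shares could recover the target without learning anything about the book itself.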

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Define the features for correlation analysis
num_cols = internal_clean.select_dtypes(include=[np.number]).columns.tolist()

# Rating share columns were already dropped as leakage in the previous cell
# Ensure rating_clean is the first column (target)
num_cols = ['rating_clean'] + [c for c in num_cols if c != 'rating_clean']

print(f"Numeric features found ({len(num_cols)}):")
print(num_cols)

correlation_features = [
    'rating_clean',
    'numratings_clean',
    'pages_clean',
    'publication_year_clean',
    'has_award_encoded',
    'is_major_publisher_encoded',
    'genre_count',
    'has_genres',
    'is_top_genre',
    'author_book_count',
    'is_top_author',
    'publisher_book_count',
    'in_series',
    'description_length',
    'description_word_count',
    'work_text_reviews_log',
]

correlation_features = [c for c in correlation_features if c in internal_clean.columns]

df_corr = internal_clean[correlation_features]

df_num = df_corr.dropna(how='all')

# Create correlation matrix
corr_matrix = df_num.corr()

print("\nCORRELATION SUMMARY")

if 'rating_clean' in df_corr.columns:
    rating_correlations = corr_matrix['rating_clean'].drop('rating_clean').sort_values(ascending=False)

    print("\nTop Features Positively Correlated with Rating:")
    print(rating_correlations[rating_correlations > 0].head(10))

    print("\nTop Features Negatively Correlated with Rating:")
    # Sort ascending so the strongest negative correlations come first
    print(rating_correlations[rating_correlations < 0].sort_values().head(10))

In [None]:
import ppscore as pps
import pandas as pd

# PPS of all features predicting rating_clean
pps_rating = pps.predictors(
    internal_clean,
    y="rating_clean",
    output="df"   # returns a tidy DataFrame
).sort_values("ppscore", ascending=False)

pps_rating.head(20)

In [None]:
import matplotlib.pyplot as plt

top = pps_rating.head(15)

plt.figure(figsize=(8, 6))
plt.barh(top["x"], top["ppscore"])
plt.gca().invert_yaxis()
plt.xlabel("Predictive Power Score")
plt.title("Top PPS predictors of rating_clean")
plt.show()
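
As a reading aid for the scores above: according to the `ppscore` documentation, the score for a numeric target measures how much a cross-validated decision tree improves on a naive baseline that always predicts the median, using mean absolute error (MAE). The sketch below reproduces only that normalization step with NumPy. The helper name `pps_from_predictions` is hypothetical, and the predictions are passed in directly rather than produced by a decision tree.

```python
import numpy as np

def pps_from_predictions(y_true, y_pred):
    """Normalize a model's MAE against a naive median baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae_model = np.abs(y_true - y_pred).mean()
    # Naive baseline: always predict the median of the target
    mae_naive = np.abs(y_true - np.median(y_true)).mean()
    if mae_naive == 0:
        # A constant target leaves nothing to improve on
        return 0.0
    # Models that do worse than the baseline are clipped to 0
    return max(0.0, 1.0 - mae_model / mae_naive)

ratings = [3.0, 4.0, 5.0]
print(pps_from_predictions(ratings, [3.0, 4.0, 5.0]))  # perfect predictions
print(pps_from_predictions(ratings, [3.5, 4.0, 4.5]))  # halves the baseline error
print(pps_from_predictions(ratings, [4.0, 4.0, 4.0]))  # no better than the median
```

Read this way, a score near 0 means the feature barely beats guessing the median rating, while a score of 0.5 means it removes about half of the baseline error.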


Both the correlation analysis and the Predictive Power Score (PPS) show weak relationships between most features and the target. To check whether this reflects the nature of the data rather than this particular dataset, we repeat the analysis on the supply catalog.

In [None]:

# apply feature engineering to supply catalog
supply_catalog_fe = fe_engineering(
    df=supply_catalog,
    encode_text_embeddings=False,
    top_n_authors=50,
    top_n_genres=30,
    bool_cols=['has_award', 'is_major_publisher'],
    text_col='description_clean',
    genres_col='genres_clean',
    author_col='author_clean',
    publisher_col='publisher_clean',
    series_col='series_clean'
)

print(f"Supply catalog shape: {supply_catalog_fe.shape}")
print(f"New features added: {set(supply_catalog_fe.columns) - set(supply_catalog.columns)}")
display(supply_catalog_fe.head())

# Drop only if the columns exist
supply_clean = supply_catalog_fe.drop(
    columns=[c for c in leakage_cols if c in supply_catalog_fe.columns],
    errors='ignore'
)

print(f"Supply catalog shape after leakage removal: {supply_clean.shape}")


correlation_features = [
    'rating_clean',
    'numRatings_clean',
    'pages_clean',
    'likedPercent_clean',
    'bookFormat_cleaned',
    'bbeVotes_clean',
    'bbeScore_clean',
    'price_clean',
    'publication_year_clean',
    'has_award_encoded',
    'is_major_publisher_encoded',
    'genre_count',
    'has_genres',
    'is_top_genre',
    'author_book_count',
    'is_top_author',
    'publisher_book_count',
    'in_series',
    'description_length',
    'description_word_count'
]

correlation_features = [
    c for c in correlation_features if c in supply_clean.columns
]

df_corr_sup = supply_clean[correlation_features]

# numeric_only=True skips any non-numeric columns left in the selection
corr_matrix_sup = df_corr_sup.corr(numeric_only=True)

print("\n=== SUPPLY CATALOG — CORRELATION SUMMARY ===")

if 'rating_clean' in correlation_features:
    rating_corr_sup = corr_matrix_sup['rating_clean'] \
                        .drop('rating_clean') \
                        .sort_values(ascending=False)

    print("\nTop Positive Correlations:")
    print(rating_corr_sup[rating_corr_sup > 0].head(10))

    print("\nTop Negative Correlations:")
    # Sort ascending so the strongest negative correlations come first
    print(rating_corr_sup[rating_corr_sup < 0].sort_values().head(10))

In [None]:
import ppscore as pps
import pandas as pd

pps_supply = (
    pps.predictors(
        supply_clean,
        y="rating_clean",
        output="df"
    )
    .sort_values("ppscore", ascending=False)
)

print("\n=== SUPPLY CATALOG — TOP PPS FEATURES ===")
display(pps_supply.head(15))


In [None]:

import matplotlib.pyplot as plt

top_sup = pps_supply.head(15)

plt.figure(figsize=(9,6))
plt.barh(top_sup["x"], top_sup["ppscore"])
plt.gca().invert_yaxis()
plt.xlabel("Predictive Power Score")
plt.title("Top PPS Predictors of rating_clean (Supply Catalog)")
plt.show()


The correlation and PPS results from the supply catalog closely match those of the internal catalog, which indicates that the low predictive signal is inherent to book metadata rather than a limitation of one dataset.

Across both catalogs, the only feature showing meaningful predictive strength is `likedPercent_clean`, which is expected because it directly reflects user satisfaction. Its high correlation (≈0.80) and strong PPS (≈0.50) support this interpretation.

This suggests that, without behavioral indicators, both correlation and predictive power will remain low for this type of data. The next step is to use the **warm-start dataset** to evaluate whether external behavioral signals correlate with internal user behavior, which would validate our modeling approach and show whether cross-platform signals can reliably improve prediction.


In [None]:
import pandas as pd
from pathlib import Path

warm_start_path = Path('outputs/datasets/modeling/model_dataset_warm_start.csv')
pd_warm_start = pd.read_csv(warm_start_path,
    dtype={"isbn_clean": "string", "goodreads_id_clean": "string"}
)
print(f"Warm-start dataset shape: {pd_warm_start.shape}")
display(pd_warm_start.head())

In [None]:
import numpy as np

# copy warm-start dataset
warm_clean = pd_warm_start.copy() 

# select only numeric columns
num_cols_warm = warm_clean.select_dtypes(include=[np.number]).columns.tolist()

# ensure gb_rating_clean is in the matrix
if 'gb_rating_clean' not in num_cols_warm:
    raise ValueError("gb_rating_clean not found in warm start dataset.")

# compute correlation matrix
corr_matrix_warm = warm_clean[num_cols_warm].corr()

# extract correlation with internal rating
warm_rating_corr = (
    corr_matrix_warm['gb_rating_clean']
    .drop('gb_rating_clean')
    .sort_values(ascending=False)
)

print("\n=== WARM START — TOP POSITIVE CORRELATIONS ===")
print(warm_rating_corr[warm_rating_corr > 0].head(10))

print("\n=== WARM START — TOP NEGATIVE CORRELATIONS ===")
# Sort ascending so the strongest negative correlations come first
print(warm_rating_corr[warm_rating_corr < 0].sort_values().head(10))


In [None]:
import ppscore as pps
import pandas as pd

pps_warm = (
    pps.predictors(
        warm_clean,
        y="gb_rating_clean",
        output="df"
    )
    .sort_values("ppscore", ascending=False)
)

print("\n=== WARM START — TOP PPS FEATURES ===")
display(pps_warm.head(15))


The warm-start analysis shows that external behavioral metrics, such as `external_rating`, `external_bbe_ratings_5_share`, `external_bbe_ratings_3_share`, and `external_likedpct`, are the strongest predictors of our internal rating (`gb_rating_clean`), with correlations as high as 0.99 and PPS scores up to 0.89. 

This confirms that the low signal observed earlier is not a dataset issue but an inherent limitation of metadata for predicting user satisfaction, and that meaningful predictive signal emerges only when behavioral indicators are available. With this evidence, we can now confidently proceed with a warm-start modeling strategy focused on cross-platform validation, using external behavioral data as a reliable proxy for internal preferences.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from src.modeling.feature_engineering import fe_engineering
from sklearn.metrics.pairwise import cosine_similarity

# Create sample data
sample_books = pd.DataFrame({
    'book_id': [1, 2, 3, 4],
    'description_clean': [
        'A dystopian novel about totalitarian surveillance and thought control.',
        'An epic fantasy adventure featuring hobbits, wizards, and dragons.',
        'A romantic story of love and social class in 19th century England.',
        None  # Test missing description
    ],
    'genres_clean': [
        ['science fiction', 'dystopia'],
        ['fantasy', 'adventure'],
        ['romance', 'classic'],
        ['fiction']
    ],
    'author_clean': ['George Orwell', 'J.R.R. Tolkien', 'Jane Austen', 'Unknown'],
    'publisher_clean': ['Penguin', 'HarperCollins', 'Penguin', 'Small Press'],
    'series_clean': [None, 'The Lord of the Rings', None, 'Series X'],
    'has_award': [True, True, False, False],
    'is_major_publisher': [True, True, True, False]
})

print("=== TESTING TEXT EMBEDDINGS ===\n")

# Apply feature engineering with embeddings enabled
result = fe_engineering(
    df=sample_books,
    encode_text_embeddings=True,
    top_n_authors=2,
    top_n_genres=2,
    bool_cols=['has_award', 'is_major_publisher'],
    text_col='description_clean',
    genres_col='genres_clean',
    author_col='author_clean',
    publisher_col='publisher_clean',
    series_col='series_clean'
)

# Display results
print(f"Total books processed: {len(result)}")
print(f"Books with embeddings: {result['text_embedding'].notna().sum()}")

# Check first embedding
first_embedding = result.loc[0, 'text_embedding']
print(f"\nFirst embedding shape: {first_embedding.shape}")
print(f"Embedding dimension: {len(first_embedding)}")
print(f"\nFirst 10 values: {first_embedding[:10]}")

# Print statistics for each book
print("\n=== EMBEDDING STATISTICS ===\n")
for idx, row in result.iterrows():
    desc = row['description_clean']
    emb = row['text_embedding']
    desc_preview = desc[:60] if isinstance(desc, str) else "None"
    print(f"Book {idx + 1}: '{desc_preview}...'")
    print(f"  Mean: {emb.mean():.4f}, Std: {emb.std():.4f}")
    print(f"  Min: {emb.min():.4f}, Max: {emb.max():.4f}")
    print(f"  L2 Norm: {np.linalg.norm(emb):.4f}\n")

# Similarity analysis
print("=== COSINE SIMILARITY ANALYSIS ===\n")
embeddings_matrix = np.vstack(result['text_embedding'].values)
similarity_matrix = cosine_similarity(embeddings_matrix)

similarity_df = pd.DataFrame(
    similarity_matrix,
    index=[f"Book {i+1}" for i in range(len(result))],
    columns=[f"Book {i+1}" for i in range(len(result))]
)
print(similarity_df.round(4))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Text Embedding Analysis', fontsize=16, fontweight='bold')

# Plot 1: Heatmap of first 50 dimensions
embeddings_50 = embeddings_matrix[:, :50]
im1 = axes[0, 0].imshow(embeddings_50, aspect='auto', cmap='coolwarm', vmin=-0.5, vmax=0.5)
axes[0, 0].set_title('First 50 Embedding Dimensions')
axes[0, 0].set_xlabel('Dimension')
axes[0, 0].set_ylabel('Book')
axes[0, 0].set_yticks(range(len(result)))
axes[0, 0].set_yticklabels([f"Book {i+1}" for i in range(len(result))])
plt.colorbar(im1, ax=axes[0, 0], label='Value')

# Plot 2: Distribution of values for Book 1
axes[0, 1].hist(first_embedding, bins=40, edgecolor='black', alpha=0.7, color='steelblue')
axes[0, 1].set_title('Embedding Value Distribution (Book 1)')
axes[0, 1].set_xlabel('Value')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].axvline(0, color='red', linestyle='--', linewidth=2, label='Zero')
axes[0, 1].legend()

# Plot 3: L2 Norm comparison
magnitudes = [np.linalg.norm(emb) for emb in result['text_embedding']]
colors = ['steelblue' if mag > 0 else 'red' for mag in magnitudes]
axes[1, 0].bar(range(len(magnitudes)), magnitudes, color=colors, alpha=0.7)
axes[1, 0].set_title('Embedding Magnitude (L2 Norm)')
axes[1, 0].set_xlabel('Book')
axes[1, 0].set_ylabel('Magnitude')
axes[1, 0].set_xticks(range(len(result)))
axes[1, 0].set_xticklabels([f"Book {i+1}" for i in range(len(result))])
axes[1, 0].grid(axis='y', alpha=0.3)

# Plot 4: Cosine similarity heatmap
im4 = axes[1, 1].imshow(similarity_matrix, cmap='RdYlGn', vmin=0, vmax=1)
axes[1, 1].set_title('Cosine Similarity Matrix')
axes[1, 1].set_xlabel('Book')
axes[1, 1].set_ylabel('Book')
axes[1, 1].set_xticks(range(len(result)))
axes[1, 1].set_yticks(range(len(result)))
axes[1, 1].set_xticklabels([f"B{i+1}" for i in range(len(result))])
axes[1, 1].set_yticklabels([f"B{i+1}" for i in range(len(result))])

# Add text annotations
for i in range(len(result)):
    for j in range(len(result)):
        axes[1, 1].text(j, i, f'{similarity_matrix[i, j]:.2f}',
                        ha="center", va="center", color="black", fontsize=10)

plt.colorbar(im4, ax=axes[1, 1], label='Similarity')

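plt.tight_layout()
# Create the output folder if needed so savefig does not fail (os is imported in the setup cell)
os.makedirs('outputs/eda_plots', exist_ok=True)
plt.savefig('outputs/eda_plots/text_embeddings_analysis.png', dpi=150, bbox_inches='tight')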
print("\n✓ Visualization saved to: outputs/eda_plots/text_embeddings_analysis.png")
plt.show()
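
The similarity matrix above relies on `cosine_similarity` from scikit-learn. As a minimal sketch of what that measure computes, the NumPy version below divides each dot product by the product of the two vector norms. It assumes, as the red-bar check in the magnitude plot suggests, that a missing description is encoded as an all-zero vector, and it defines any similarity involving a zero vector as 0. The function name `cosine_matrix` and the toy vectors are illustrative.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarity; rows with zero norm get similarity 0."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid division by zero: leave all-zero rows unscaled
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = vectors / safe_norms
    return unit @ unit.T

toy_embeddings = np.array([
    [1.0, 0.0],   # book A
    [1.0, 1.0],   # book B
    [0.0, 0.0],   # missing description (assumed zero vector)
])
print(cosine_matrix(toy_embeddings).round(4))
```

Under this convention a book without a description is dissimilar to every other book, including itself, which is worth keeping in mind when reading the heatmap row for Book 4.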