# Feature Engineering

## Objectives

This notebook implements **feature engineering** actions recommended by the exploratory analysis in Notebook 04. The goal is to transform raw and enriched datasets into modeling-ready features that address rating inflation, correct skewed distributions, and aggregate user-level signals for clustering and personalization.

Key priorities:

* Bayesian/weighted ratings to correct sample-size bias
* Log transformations for skewed count features
* User profile aggregation for segmentation and clustering
* Save feature-engineered datasets for modeling

---

## Inputs

* `internal_catalog_analysis.csv` — feature-engineered internal catalog
* `supply_catalog_analysis.csv` — feature-engineered supply catalog
* `ratings_clean_v1.csv` — cleaned user–book interactions
* `user_activity.csv`, `user_diversity.csv`, `user_genre_prefs.csv` — user-level analysis datasets
* `model_dataset_warm_start.csv` — unified metadata + external signals (for validation)

---

## Tasks in This Notebook

1. **Load Data**
   Import feature-engineered catalogs and user interaction datasets.

2. **Implement Bayesian/Weighted Ratings**
   Apply Bayesian average formula to correct rating inflation in low-volume books, authors, genres, and publishers.

3. **Apply Log Transformations**
   Transform skewed count features (ratings, author/publisher book counts) for regression compatibility.

4. **Aggregate User Profile Features**
   Merge rating, diversity, and preference metrics into user-level features for clustering and segmentation.

5. **Save Feature-Engineered Datasets**
   Export modeling-ready datasets for downstream tasks.
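
The Bayesian average named in task 2 can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `bayesian_average`, the prior weight `m`, and the toy columns are assumptions. Each item's mean rating is shrunk toward the global mean `C`, with `m` acting as the number of pseudo-ratings placed at `C`, so items with few ratings move the most.

```python
import pandas as pd

def bayesian_average(mean_rating, n_ratings, global_mean, m):
    """Shrink each item's mean rating toward the global mean.

    m is a hypothetical prior weight: the number of pseudo-ratings
    placed at global_mean. Larger m means stronger shrinkage.
    """
    return (n_ratings * mean_rating + m * global_mean) / (n_ratings + m)

# Toy data, invented for illustration
books = pd.DataFrame({
    "book_id": [1, 2, 3],
    "rating_mean": [5.0, 4.2, 3.9],
    "rating_count": [2, 500, 40],
})

# Global mean, weighted by each book's rating volume
C = (books["rating_mean"] * books["rating_count"]).sum() / books["rating_count"].sum()
m = 25  # assumed prior weight; would need tuning on real data

books["rating_bayes"] = bayesian_average(
    books["rating_mean"], books["rating_count"], C, m
)
print(books)
```

In this toy frame the book with only two ratings is pulled well below its raw 5.0, while the book with 500 ratings barely moves. The same function could be applied to author, genre, or publisher aggregates by swapping in their means and counts.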

---

## Outputs

* Feature-engineered catalogs with weighted ratings and log-transformed features
* Aggregated user profile dataset for clustering and personalization
* Modeling-ready datasets saved to `outputs/datasets/modeling/`
* Documentation of feature engineering logic and business rationale

>**Note:**
>This notebook focuses on **feature engineering only**.
>Model training and evaluation are completed in the following notebooks.

# Setup

## Navigate to the Parent Directory

Before combining and saving datasets, it helps to work from the project root so that relative paths for loading and saving data resolve consistently.

Before moving up one level with Python's built-in `os` module, inspect the current working directory.

In [None]:
import os

# Get the current working directory
current_dir = os.getcwd()
print(f'Current directory: {current_dir}')

To change to the parent directory (the root folder), run the code below. If you are already in the root folder, skip this step.

In [None]:
# Change the working directory to its parent
os.chdir(os.path.dirname(current_dir))
print('Changed directory to parent.')

# Get the new current working directory (the parent directory)
current_dir = os.getcwd()
print(f'New current directory: {current_dir}')

## Load Datasets

In this step, we load the previously cleaned datasets for analysis.

In [None]:
import pandas as pd

# catalogs
internal_catalog_fe = pd.read_csv('outputs/datasets/analysis/internal_catalog_analysis.csv')

# ratings and user-level analysis
ratings = pd.read_csv('outputs/datasets/cleaned/ratings_clean.csv')
user_activity = pd.read_csv('outputs/datasets/analysis/user_activity.csv')
user_diversity = pd.read_csv('outputs/datasets/analysis/user_diversity.csv')
book_genre_mapping = pd.read_csv('outputs/datasets/analysis/internal_book_genre_mapping.csv')

# unified metadata for validation
warm_start = pd.read_csv('outputs/datasets/cleaned/model_dataset_warm_start.csv')

print(f"Internal catalog shape: {internal_catalog_fe.shape}")
print(f"Ratings shape: {ratings.shape}")
print(f"User activity shape: {user_activity.shape}")
print(f"User diversity shape: {user_diversity.shape}")
print(f"Warm start dataset shape: {warm_start.shape}")
print(f"Internal Catalog book-genre mapping: {book_genre_mapping.shape}")

# Log Transformations

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew

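def check_skew_and_range(df, features, plot=True):
    """Summarize range, skewness, and zero/one share for numeric features.

    Columns that are missing from df or non-numeric are skipped. Returns a
    DataFrame with one row per feature, sorted by skew (highest first).
    """
    results = []
    for col in features:
        if col not in df.columns:
            continue  # feature not present in this dataset
        vals = df[col].dropna()
        if not np.issubdtype(vals.dtype, np.number):
            continue  # skewness is only meaningful for numeric data
        rng = vals.max() - vals.min()
        skw = skew(vals)
        # Fractions (0 to 1) of values exactly equal to 0 and to 1
        pct_zeros = (vals == 0).mean()
        pct_ones = (vals == 1).mean()
        results.append({
            'feature': col,
            'min': vals.min(),
            'max': vals.max(),
            'range': rng,
            'skew': skw,
            'pct_zeros': pct_zeros,
            'pct_ones': pct_ones,
            'mean': vals.mean(),
            'std': vals.std(),
        })
        if plot:
            # One compact histogram per feature, titled with skew and range
            plt.figure(figsize=(6, 2))
            sns.histplot(vals, bins=50, kde=True)
            plt.title(f"{col} (skew={skw:.2f}, range={rng:.2e})")
            plt.show()
    return pd.DataFrame(results).sort_values('skew', ascending=False)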

# internal catalog features to check
features_to_check = [
    'numRatings_clean', 'author_book_count', 'publisher_book_count',
    'work_text_reviews_count', 'pages_clean', 'description_length', 'description_word_count',
]
check_skew_and_range(internal_catalog_fe, features_to_check)

| Feature                   | Range   | Skew    | Power Law                  | Distribution Shape         | Log Transform? |
|---------------------------|---------|---------|----------------------------|---------------------------|:--------------:|
| numRatings_clean          | High    | High    | Yes, extreme long tail     | Extreme right skew, power law |      Yes       |
| work_text_reviews_count   | High    | High    | Yes, long tail             | Extreme right skew        |      Yes       |
| pages_clean               | High    | High    | Weak, some long tail       | Right skew                |      Yes       |
| author_book_count         | Medium  | High    | Yes, long tail             | Right skew                |      Yes       |
| description_length        | High    | Low     | No, mild tail              | Mild right skew           |   Optional     |
| description_word_count    | Medium  | Low     | No, mild tail              | Mild right skew           |   Optional     |
| publisher_book_count      | High    | Low     | Bimodal, some long tail    | Bimodal, moderate skew    |   Consider     |


**Summary:**  
- **Apply log transform** to features with high skew (>2) and/or very large ranges:  
  `numRatings_clean` (already done), `work_text_reviews_count` (already done), `pages_clean`, `author_book_count`
- **Optional** for features with mild skew but large range:  
  `description_length`, `description_word_count`
- **Consider** for `publisher_book_count` if model performance improves.

> **Note:** A power-law distribution is one in which the frequency of a value falls off as a power of that value, so a small number of items have very large values while most items have small ones. This produces a long tail on the right side of the distribution curve.
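
To illustrate why `log1p` suits long-tailed counts, the sketch below draws synthetic data (not taken from the catalog) from a heavy-tailed distribution and compares skewness before and after the transform. `np.log1p(x)` computes `log(1 + x)`, so zero counts map to zero rather than negative infinity. The seed, sample size, and distribution parameters are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# Synthetic long-tailed counts: heavy-tailed draws, scaled and floored to integers
counts = np.floor(rng.pareto(a=1.5, size=10_000) * 10)

# log1p keeps zeros at zero and compresses the large values
logged = np.log1p(counts)

skew_raw = skew(counts)
skew_log = skew(logged)
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

The transform compresses the right tail, which is the property the summary above relies on when it recommends log-transforming high-skew features.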

In [None]:
# Apply log1p to the selected features (including the optional and "consider" candidates)
internal_catalog_fe['author_book_count_log'] = np.log1p(internal_catalog_fe['author_book_count'])
internal_catalog_fe['pages_log'] = np.log1p(internal_catalog_fe['pages_clean'])
internal_catalog_fe['description_length_log'] = np.log1p(internal_catalog_fe['description_length'])
internal_catalog_fe['description_word_count_log'] = np.log1p(internal_catalog_fe['description_word_count'])
internal_catalog_fe['publisher_book_count_log'] = np.log1p(internal_catalog_fe['publisher_book_count'])

In [None]:
# external features to check
external_features_to_check = [
    'external_likedpct',
    'external_numratings',
    'external_votes',
    'external_score',
    'external_price'
]
check_skew_and_range(warm_start, external_features_to_check)

| Feature             | Range   | Skew    | Power Law                  | Distribution Shape         | Log Transform? |
|---------------------|---------|---------|----------------------------|---------------------------|:--------------:|
| external_score      | High    | High    | Yes, extreme long tail     | Extreme right skew        |      Yes       |
| external_votes      | High    | High    | Yes, long tail             | Extreme right skew        |      Yes       |
| external_price      | High    | High    | Yes, long tail             | Extreme right skew        |      Yes       |
| external_numratings | High    | High    | Yes, extreme long tail     | Extreme right skew        |      Yes       |
| external_likedpct   | Low     | Low     | No, concentrated           | Mild left skew            |      No        |

**Summary:**  
- **Apply log transform** to all external features with high skew, very large ranges, and power-law tails:  
  `external_score`, `external_votes`, `external_price`, `external_numratings`
- **No log transform** needed for `external_likedpct` due to low skew and narrow range.
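
`np.log1p` returns NaN for inputs below -1 and negative infinity at exactly -1, so it can help to check the inputs before transforming several columns at once. The helper below is a hypothetical convenience, not part of this notebook: `add_log1p_columns` and its `suffix` parameter are assumed names. It skips columns that are absent, raises if a column contains negative values, and leaves missing values as NaN.

```python
import numpy as np
import pandas as pd

def add_log1p_columns(df, columns, suffix="_log"):
    """Return a copy of df with log1p-transformed versions of `columns`."""
    out = df.copy()
    for col in columns:
        if col not in out.columns:
            continue  # skip absent columns rather than failing
        if (out[col].dropna() < 0).any():
            raise ValueError(f"{col} has negative values; log1p is unsuitable")
        out[f"{col}{suffix}"] = np.log1p(out[col])
    return out

# Toy frame with invented values, including one missing entry
toy = pd.DataFrame({
    "external_votes": [0, 9, 99, None],
    "external_price": [4.99, 12.5, 0.0, 30.0],
})
toy = add_log1p_columns(toy, ["external_votes", "external_price", "not_a_column"])
print(toy.filter(like="_log"))
```

Raising on negative values is a deliberately strict choice for count-like and price-like features, where a negative entry more likely signals a data problem than a valid observation.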

In [None]:
# Log-transform external features with high skew and range
warm_start['external_score_log'] = np.log1p(warm_start['external_score'])
warm_start['external_votes_log'] = np.log1p(warm_start['external_votes'])
warm_start['external_price_log'] = np.log1p(warm_start['external_price'])
warm_start['external_numratings_log'] = np.log1p(warm_start['external_numratings'])
# No log transform for external_likedpct

# Popularity Score
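
The cells in this section standardize each input to a z-score and sum the results with equal weights, so that features on different scales contribute comparably. A minimal sketch of the same idea on toy data follows. The function `composite_zscore` and its optional `weights` argument are hypothetical, and the values are invented.

```python
import pandas as pd

def composite_zscore(df, columns, weights=None):
    """Weighted sum of z-scored columns; equal weights unless `weights` is given."""
    weights = weights or {col: 1.0 for col in columns}
    # pandas .std() uses the sample standard deviation (ddof=1)
    z = (df[columns] - df[columns].mean()) / df[columns].std()
    return sum(z[col] * weights[col] for col in columns)

toy = pd.DataFrame({
    "numRatings_log": [2.0, 5.0, 8.0],
    "rating_clean": [4.5, 3.9, 4.2],
})
toy["popularity_score"] = composite_zscore(toy, ["numRatings_log", "rating_clean"])
print(toy.sort_values("popularity_score", ascending=False))
```

Because each z-scored column has mean zero, the composite score is also centered on zero. Positive values then indicate above-average items on the combined inputs.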

In [None]:
# Standardize features (z-score)
for col in ['work_text_reviews_log', 'numRatings_log', 'rating_clean']:
    internal_catalog_fe[f'{col}_z'] = (internal_catalog_fe[col] - internal_catalog_fe[col].mean()) / internal_catalog_fe[col].std()

# Create popularity score (equal weights)
internal_catalog_fe['popularity_score'] = (
    internal_catalog_fe['work_text_reviews_log_z'] +
    internal_catalog_fe['numRatings_log_z'] +
    internal_catalog_fe['rating_clean_z']
)

# Preview top popular books
internal_catalog_fe[['title_clean', 'popularity_score', 'work_text_reviews_log', 'numRatings_log', 'rating_clean']].sort_values('popularity_score', ascending=False).head(10)

In [None]:
import numpy as np

# Standardize features (z-score)
for col in ['external_rating', 'external_numratings_log', 'external_votes_log', 'external_score_log', 'external_likedpct']:
    warm_start[f'{col}_z'] = (warm_start[col] - warm_start[col].mean()) / warm_start[col].std()

# Create external popularity score (equal weights)
warm_start['external_popularity_score'] = (
    warm_start['external_rating_z'] +
    warm_start['external_numratings_log_z'] +
    warm_start['external_votes_log_z'] +
    warm_start['external_score_log_z'] +
    warm_start['external_likedpct_z']
)

# Preview top popular items
warm_start[['title_final', 'external_popularity_score', 'external_rating', 'external_numratings', 'external_votes', 'external_score', 'external_likedpct']].sort_values('external_popularity_score', ascending=False).head(10)

# User Profile Aggregation
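
This section joins ratings to book features and then collapses them to one row per user with named aggregation. The sketch below reproduces that pattern on invented data and adds two optional safeguards that are assumptions rather than part of the notebook: `validate="many_to_one"` makes the merge fail if `book_id` is duplicated in the catalog, and the final assertion checks that the result has one row per user.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "book_id": [10, 11, 10, 12, 13],
    "rating": [5, 3, 4, 4, 2],
})
catalog_toy = pd.DataFrame({
    "book_id": [10, 11, 12],
    "pages_clean": [300, 150, 420],
})

# Left join keeps ratings for books missing from the catalog (book 13 gets NaN)
user_book_toy = ratings_toy.merge(
    catalog_toy, on="book_id", how="left", validate="many_to_one"
)

profile_toy = user_book_toy.groupby("user_id").agg(
    rating_count=("rating", "count"),
    rating_mean=("rating", "mean"),
    pages_mean=("pages_clean", "mean"),  # NaN pages are ignored by mean
).reset_index()

assert profile_toy["user_id"].is_unique
print(profile_toy)
```

With a left join, users who rated only uncatalogued books still appear in the profile, with NaN for the book-derived averages.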

In [None]:
print(user_activity.columns)

In [None]:
# join ratings with internal catalog features on book_id
user_book = ratings.merge(
    internal_catalog_fe[['book_id', 'publication_year_clean', 'pages_clean', 'has_award_encoded', 'is_major_publisher_encoded', 'genre_count', 'popularity_score']],
    on='book_id', how='left'
)

# aggregate to user level
user_profile = user_book.groupby('user_id').agg(
    rating_count=('rating', 'count'),
    rating_mean=('rating', 'mean'),
    rating_std=('rating', 'std'),
    pub_year_mean=('publication_year_clean', 'mean'),
    pages_mean=('pages_clean', 'mean'),
    award_pref=('has_award_encoded', 'mean'),
    major_pub_pref=('is_major_publisher_encoded', 'mean'),
    genre_count_mean=('genre_count', 'mean'),
    popularity_mean=('popularity_score', 'mean')
).reset_index()

# merge with user activity and diversity
user_profile = user_profile.merge(
    user_diversity, on='user_id', how='left'
)
user_profile = user_profile.merge(
    user_activity, on='user_id', how='left'
)

print(user_profile.head())

# Save Datasets

In [None]:
os.makedirs('outputs/datasets/modeling', exist_ok=True)  # ensure the output folder exists
user_profile.to_csv('outputs/datasets/modeling/user_profile_features.csv', index=False)