# Data Preprocessing Decisions
- Transaction dates were converted to a standard datetime format to enable time-based analysis and feature creation.
- Missing transaction amounts and customer satisfaction scores were imputed with the column median, which is less sensitive to extreme values than the mean.
- Extreme transaction amounts were capped at the 99th percentile to reduce the impact of outliers on aggregated customer spending metrics.
- Customer satisfaction scores were clipped to a maximum value of 10 to ensure consistency with the expected rating scale.
- Duplicate records were removed from all datasets to maintain data integrity: transactions by Transaction_ID, products by Product_ID, and feedback by exact row match.
- Non-essential or redundant fields (e.g., target age group in the product dataset) were removed to simplify the dataset and focus on analytically relevant attributes.
- Transaction and feedback data were aggregated and merged at the customer level to support customer-centric analysis aligned with FinMark’s business objectives.
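
As a small illustration of the imputation and capping steps listed above, the following sketch uses made-up amounts (not FinMark data) to show why the median is a safer fill value than the mean when an extreme value is present, and how a 99th-percentile cap is applied. All variable names are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy amounts (illustrative values only): one missing, one extreme.
amounts = pd.Series([20.0, 35.0, np.nan, 50.0, 40.0, 10_000.0])

# The median is unaffected by the size of the extreme value; the mean is not.
median_fill = amounts.median()
mean_fill = amounts.mean()

# Fill the missing amount with the median.
imputed = amounts.fillna(median_fill)

# Cap everything above the 99th percentile of the imputed values.
cap = imputed.quantile(0.99)
capped = imputed.clip(upper=cap)

print(f"median fill: {median_fill}, mean fill: {mean_fill}")
print(f"cap: {cap:.2f}, max after capping: {capped.max():.2f}")
```

On a sample this small the 99th percentile sits close to the maximum, so the cap trims the extreme value only slightly. On a full transaction table the cap applies to roughly the top 1% of rows.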

# Feature Engineering Decisions
- Transaction data was aggregated per customer to generate key behavioral metrics, including total spend, average transaction value, transaction frequency, and number of unique transaction types.
- A recency feature was created as the number of days between a customer’s most recent transaction and the latest transaction date in the dataset, enabling analysis of customer engagement over time.
- The most frequent transaction type per customer was identified to capture primary customer behavior patterns.
- Customer feedback data was aggregated to compute average satisfaction and likelihood-to-recommend scores.
- For customers without feedback records, the missing feedback metrics were filled with the median of the per-customer averages so that these customers are retained in downstream analysis.
- A High Spender indicator flags customers whose total spend exceeds the 75th percentile (the top 25% of spenders), supporting segmentation and targeting use cases.
- A Loyalty Index was engineered as the product of average satisfaction score and transaction count, reflecting both sentiment and engagement in a single metric.
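
To make the last two features concrete, here is a minimal sketch on a four-customer toy table. The values are invented for illustration, and the column names simply mirror those used in this notebook.

```python
import pandas as pd

# Toy customer table (illustrative values only).
customers = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4],
    "Total_Spend": [100.0, 200.0, 300.0, 400.0],
    "Transaction_Count": [2, 5, 1, 8],
    "Avg_Satisfaction_Score": [9.0, 6.0, 10.0, 4.0],
})

# Flag customers whose total spend is strictly above the 75th percentile.
threshold = customers["Total_Spend"].quantile(0.75)
customers["Is_High_Spender"] = (customers["Total_Spend"] > threshold).astype(int)

# Loyalty Index as the product of average satisfaction and transaction count.
customers["Loyalty_Index"] = (
    customers["Avg_Satisfaction_Score"] * customers["Transaction_Count"]
)

print(customers)
```

In this toy table, customer 4 has the lowest satisfaction score but the highest Loyalty Index, because the transaction count is unbounded while the score is capped at 10. Whether that weighting is desirable depends on the use case; rescaling the count before multiplying would be one possible alternative.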

# Data Export
- Cleaned datasets and the engineered customer feature dataset were exported as CSV files to support reproducibility and downstream modeling or reporting tasks.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the datasets
transactions = pd.read_csv('Transaction_Data.csv')
products = pd.read_csv('Product_Offering_Data.csv')
feedback = pd.read_csv('Customer_Feedback_Data.csv')

In [3]:
# --- PREPROCESSING ---

# Clean Transaction Data
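# Drop duplicate transactions first so they do not skew the median and percentile below
transactions = transactions.drop_duplicates(subset=['Transaction_ID'])
# Parse dates into datetime values for time-based features
transactions['Transaction_Date'] = pd.to_datetime(transactions['Transaction_Date'])
# Impute missing amounts with the median
transactions['Transaction_Amount'] = transactions['Transaction_Amount'].fillna(transactions['Transaction_Amount'].median())
# Cap amounts at the 99th percentile
cap_val = transactions['Transaction_Amount'].quantile(0.99)
transactions['Transaction_Amount'] = transactions['Transaction_Amount'].clip(upper=cap_val)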

In [4]:
# Clean Product Offering Data
products = products.drop_duplicates(subset=['Product_ID'])
if 'Target_Age_Group' in products.columns:
    products = products.drop(columns=['Target_Age_Group'])

In [5]:
# Clean Feedback Data
# Drop exact duplicate rows first so they do not skew the median
feedback = feedback.drop_duplicates()
feedback['Satisfaction_Score'] = feedback['Satisfaction_Score'].fillna(feedback['Satisfaction_Score'].median())
# Clip scores that exceed the 10-point scale
feedback['Satisfaction_Score'] = feedback['Satisfaction_Score'].clip(upper=10.0)

In [6]:
# --- FEATURE ENGINEERING ---

latest_date = transactions['Transaction_Date'].max()

# Aggregate Transaction Data per Customer
cust_agg = transactions.groupby('Customer_ID').agg(
    Total_Spend=('Transaction_Amount', 'sum'),
    Avg_Transaction_Value=('Transaction_Amount', 'mean'),
    Transaction_Count=('Transaction_ID', 'count'),
    Last_Transaction_Date=('Transaction_Date', 'max'),
    Unique_Transaction_Types=('Transaction_Type', 'nunique')
).reset_index()

cust_agg['Recency'] = (latest_date - cust_agg['Last_Transaction_Date']).dt.days


In [7]:
# Most Frequent Transaction Type
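# Ties resolve to the first mode in sorted order; customers with no recorded type get NaN
mode_type = transactions.groupby('Customer_ID')['Transaction_Type'].agg(lambda x: next(iter(x.mode()), np.nan)).reset_index()
mode_type = mode_type.rename(columns={'Transaction_Type': 'Primary_Transaction_Type'})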
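
Two edge cases of a mode-based aggregation are worth knowing: `Series.mode()` returns every tied value in sorted order, and it returns an empty Series when all values in a group are missing, so indexing the result with `[0]` fails in that case. The sketch below shows both on toy data. The helper name `primary_type` and the transaction type labels are hypothetical, chosen only for the example.

```python
import numpy as np
import pandas as pd

def primary_type(values: pd.Series):
    """Return the most frequent non-null value, or NaN if there is none."""
    modes = values.mode()  # sorted; missing values are dropped by default
    return modes.iat[0] if not modes.empty else np.nan

# Toy data (illustrative): customer 1 has a tie, customer 2 has no recorded type.
toy = pd.DataFrame({
    "Customer_ID": [1, 1, 2, 3, 3, 3],
    "Transaction_Type": [
        "Purchase", "Bill Payment", None,
        "Loan Payment", "Purchase", "Purchase",
    ],
})

result = toy.groupby("Customer_ID")["Transaction_Type"].agg(primary_type)
print(result)
```

Because ties fall to the first value in sorted order, the primary type of a tied customer is somewhat arbitrary. Breaking ties by total amount or by most recent transaction would be possible alternatives.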

# Aggregate Feedback Data
feed_agg = feedback.groupby('Customer_ID').agg(
    Avg_Satisfaction_Score=('Satisfaction_Score', 'mean'),
    Avg_Likelihood_to_Recommend=('Likelihood_to_Recommend', 'mean')
).reset_index()

In [8]:
# Merge into Master Dataset
master_df = pd.merge(cust_agg, mode_type, on='Customer_ID', how='left')
master_df = pd.merge(master_df, feed_agg, on='Customer_ID', how='left')
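
Since each table is expected to hold one row per customer, a left merge should leave the row count unchanged. One way to check this, sketched here on toy data with assumed table names `spend` and `scores`, is to use the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Toy per-customer tables (illustrative values only).
spend = pd.DataFrame({"Customer_ID": [1, 2, 3], "Total_Spend": [120.0, 80.0, 300.0]})
scores = pd.DataFrame({"Customer_ID": [1, 3], "Avg_Satisfaction_Score": [8.0, 6.5]})

# validate="one_to_one" raises pandas.errors.MergeError if either side
# has repeated Customer_ID values; indicator=True adds a "_merge" column.
merged = pd.merge(
    spend, scores, on="Customer_ID", how="left",
    validate="one_to_one", indicator=True,
)

# Customers present only in the left table have no feedback record.
no_feedback = merged.loc[merged["_merge"] == "left_only", "Customer_ID"].tolist()

print(len(merged), no_feedback)
```

The `_merge` column also gives a direct count of how many customers will receive imputed feedback scores in the next step.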


In [9]:
# Fill missing scores for customers with no feedback using medians
master_df['Avg_Satisfaction_Score'] = master_df['Avg_Satisfaction_Score'].fillna(master_df['Avg_Satisfaction_Score'].median())
master_df['Avg_Likelihood_to_Recommend'] = master_df['Avg_Likelihood_to_Recommend'].fillna(master_df['Avg_Likelihood_to_Recommend'].median())

In [10]:
# New Metrics
master_df['Is_High_Spender'] = (master_df['Total_Spend'] > master_df['Total_Spend'].quantile(0.75)).astype(int)
master_df['Loyalty_Index'] = master_df['Avg_Satisfaction_Score'] * master_df['Transaction_Count']

In [11]:
# Export
transactions.to_csv('Cleaned_Transactions.csv', index=False)
products.to_csv('Cleaned_Products.csv', index=False)
feedback.to_csv('Cleaned_Feedback.csv', index=False)
master_df.to_csv('Engineered_Customer_Features.csv', index=False)
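
One point to keep in mind when reloading these exports: CSV does not store column types, so datetime columns come back as text unless they are parsed again. The sketch below demonstrates this with a toy frame and an in-memory buffer (used here only to avoid writing files).

```python
import io
import pandas as pd

# Toy frame with a datetime column (illustrative values only).
df = pd.DataFrame({
    "Transaction_ID": [1, 2],
    "Transaction_Date": pd.to_datetime(["2023-01-15", "2023-02-20"]),
})

# to_csv returns the CSV text when no path is given.
csv_text = df.to_csv(index=False)

# Without parse_dates the column comes back as plain strings.
plain = pd.read_csv(io.StringIO(csv_text))
# With parse_dates the datetime dtype is restored.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Transaction_Date"])

print(plain["Transaction_Date"].dtype, parsed["Transaction_Date"].dtype)
```

The same `parse_dates` argument would apply when reading `Cleaned_Transactions.csv` or `Engineered_Customer_Features.csv` back in for downstream work.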