# Santander Product Recommendation

**Dataset description:**

This dataset contains 1.5 years of customer behavior data from Santander Bank, designed to predict the likelihood of customers purchasing new products. The data begins on 2015-01-28 and includes monthly records of products each customer holds, such as "credit card," "savings account," and others.

> See the following link for further details: [SPR dataset](https://www.kaggle.com/competitions/santander-product-recommendation/overview)

## Overview

This notebook explores the Santander dataset to identify additional useful features beyond the common set specified in `knn-banking-product-recommender/common-banking-dataset-fields.txt.`

**Summary of Steps:**
- Access Datasets
    - Retrieve the train and test datasets from HuggingFace.
    - Explore the Santander Dataset
    - Investigate the dataset to understand its structure and content.

- Pre-process the Data
    - Convert column names from Spanish to English.
    - Check for and address missing and null values.
    - Apply imputation based on the findings.
    - Drop rows with excessive NaN values for specific columns.
    - Update specific column values and formatting.
    - Drop unnecessary columns.
    - Handle categorical data types.
    - Normalize the data.

- Apply KNN Model
    - Apply Principal Component Analysis (PCA).
    - Fit the KNN model.
    - Generate product recommendations.

> **Note:** Kernel - Python 3.11

## Access HuggingFace for Dataset access

### Sign in to your Hugging Face account

This will enable you to access the dataset and upload/share the model.

### Steps to get the `Access Token` from Hugging Face:

 - **Sign In or Sign Up:** If you don't have a Hugging Face account yet, you'll need to sign up. If you already have an account, sign in.

 - **Access Your Profile:** Once you're signed in, navigate to your profile settings. You can do this by clicking on your profile icon or username, usually located in the top-right corner of the Hugging Face website.
 
- **Navigate to Access Token Settings:** Within your profile settings, look for an option related to Access tokens. This is where you can manage and generate tokens.

- **Generate a New Token:** If you haven't generated a token before, you'll see a button (`New token`) to generate a new token. Click on this button. Please ensure you give the token `write` access

- **Name Your Token (Optional):** You may be prompted to give your token a name or description. This step is optional but can be helpful if you plan to generate multiple tokens for different purposes.

- **Copy Your Token:** Once your token is generated, you'll typically see it displayed on the screen. Make sure to copy the token and replace it in the `login` code below. 

In [15]:
# Log into Hugging Face
# Replace <access_token> with your access token

HUGGINGFACE_TOKEN = "<access_token>"
!huggingface-cli login --token $HUGGINGFACE_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/verosha/.cache/huggingface/token
Login successful


In [78]:
# imports required

import os
from pathlib import Path
from typing import Dict, List, Optional

import joblib
import numpy as np
import pandas as pd
import sklearn
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.model_selection import KFold
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

from huggingface_hub import hf_hub_download
from skops import hub_utils


### Config

In [17]:
REPO_ID = "MelioAI/santander-product-recommendation"
PUSH_TO_HF = True # Whether to push to Hugging Face Hub or not
HF_TRAIN_DATASET_NAME = "train_ver2.csv"
HF_TEST_DATASET_NAME = "test_ver2.csv"
ARTIFACT_SAVE_DIR = "../saved_model/"


### Helper Functions

The following cells define helper functions used throughout this notebook.

In [94]:
def update_gender_columns(df: pd.DataFrame, column_name: str = 'gender') -> pd.DataFrame:
    """
    Update gender columns in the DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the gender column.
    column_name (str): The name of the gender column. Default is 'gender'.

    Returns:
    pd.DataFrame: The DataFrame with updated gender values.
    """
    df[column_name].replace({'H': 'M', 'V': 'F'}, inplace=True)
    return df


def update_date_id(df: pd.DataFrame, column_name: str = 'date_id') -> pd.DataFrame:
    """
    Convert the date_id column in the DataFrame to YYYYMMDD format.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the date_id column.
    column_name (str): The name of the date_id column. Default is 'date_id'.

    Returns:
    pd.DataFrame: The DataFrame with date_id values in YYYYMMDD format.
    """
    # Convert column to string
    df[column_name] = df[column_name].astype(str)

    # Convert ISO 8601 (YYYY-MM-DD) to YYYYMMDD
    df[column_name] = pd.to_datetime(df[column_name], format='%Y-%m-%d', errors='coerce').dt.strftime('%Y%m%d')

    return df

def map_products_with_numbers(owned_products: List[str], product_mapping: Dict[str, int]) -> List[int]:
    """
    Replace product names with corresponding numbers from the product_mapping dictionary.
    Return a list with -1 if the list is empty.

    Parameters:
    owned_products (List[str]): List of product names to be replaced.
    product_mapping (Dict[str, int]): Dictionary mapping product names to numbers.

    Returns:
    List[int]: List of corresponding product numbers, or [-1] if input list is empty.
    """
    if not owned_products:
        return [-1]
    return [product_mapping.get(product, -1) for product in owned_products]


def get_customer_details(df: pd.DataFrame, customer_id: int) -> Optional[pd.DataFrame]:
    """
    Retrieve the details of a customer given their customer ID.

    Parameters:
    df (pd.DataFrame): DataFrame containing customer data with columns 'cust_id', 'date_id', and 'owned_products'.
    customer_id (int): The ID of the customer to retrieve details for.

    Returns:
    Optional[pd.DataFrame]: A DataFrame containing the customer's details or an empty DataFrame if the customer ID is not found.
    """
    # Filter the DataFrame to get the row for the given customer_id
    customer_details = df[df['cust_id'] == customer_id]

    if not customer_details.empty:
        # Return the relevant columns as a DataFrame
        return customer_details[['cust_id', 'date_id', 'owned_products']]
    else:
        # Return an empty DataFrame if the customer ID is not found
        return pd.DataFrame(columns=['cust_id', 'date_id', 'owned_products'])


def check_columns(train_columns, test_columns):
    """
    Check if the columns in two sets are the same and print the common and different columns.

    Parameters:
    - train_columns: Set of columns from the training set
    - test_columns: Set of columns from the test set
    """
    if set(train_columns) == set(test_columns):
        print("Train and test sets have the same columns.")
        #common_columns = set(train_columns)
        #print("Common columns:")
        #for col in common_columns:
        #    print(f"- {col}")
    else:
        different_columns_train = set(train_columns) - set(test_columns)
        different_columns_test = set(test_columns) - set(train_columns)

        print("Train and test sets have different columns.")

        if different_columns_train:
            print("Columns present in train but not in test:")
            for col in different_columns_train:
                print(f"- {col}")

        if different_columns_test:
            print("Columns present in test but not in train:")
            for col in different_columns_test:
                print(f"- {col}")


def pre_process_categorical_data(data: pd.DataFrame, columns_to_encode: List[str]) -> pd.DataFrame:
    """
    Convert categorical data to numerical data using one-hot encoding.

    Parameters:
    - data: DataFrame
    - columns_to_encode: List of columns to one-hot encode

    Returns:
    - DataFrame with one-hot encoded columns
    """
    return pd.get_dummies(data, columns=columns_to_encode)



def get_binary_item_ownership_columns(df_: pd.DataFrame, owned_items_col: str = "owned_products", nb_items: int = 25) -> pd.DataFrame:
    """
    Takes a dataframe df_, looks at the owned_product col which contains a list of item IDs,
    and converts that into a set of nb_items binary ownership columns called `owns_product_<ID>`
    which. These ownership indicator columns are returned.
    """

    # Insert dummy user that owns everything
    dummy_user_id = -999
    dummy_user = pd.DataFrame(data={owned_items_col: {dummy_user_id: list(range(nb_items))}})
    df_ = pd.concat([df_, dummy_user], ignore_index=False)

    # Explode the owned items column
    df_exploded = df_.explode(owned_items_col)

    # Get dummy columns for ownership
    ownership_cols = pd.get_dummies(df_exploded[owned_items_col].astype(int), prefix='owns_product')

    # Group by the original index and sum the dummy columns
    ownership_cols = ownership_cols.groupby(level=0).sum()

    # Drop -1 (none-item) if exists
    if 'owns_product_-1' in ownership_cols.columns:
        ownership_cols = ownership_cols.drop(columns=['owns_product_-1'])

    # Remove dummy user
    ownership_cols = ownership_cols.drop(index=dummy_user_id)

    return ownership_cols


def find_best_n_neighbors(X: np.ndarray, min_neighbors: int = 1, max_neighbors: int = 10) -> int:
    """
    Find the best number of neighbors for the K-Nearest Neighbors algorithm by evaluating different values
    using silhouette scores.

    Parameters:
    X (np.ndarray): The input data as a NumPy array.
    min_neighbors (int): The minimum number of neighbors to consider. Default is 1.
    max_neighbors (int): The maximum number of neighbors to consider. Default is 10.

    Returns:
    int: The best number of neighbors based on silhouette scores.
    """
    best_n_neighbors = min_neighbors
    best_score = -1

    for n_neighbors in range(min_neighbors, max_neighbors + 1):
        knn = NearestNeighbors(n_neighbors=n_neighbors)
        kf = KFold(n_splits=5, shuffle=True, random_state=42)
        scores = []

        for train_index, test_index in kf.split(X):
            X_train, X_test = X[train_index], X[test_index]
            knn.fit(X_train)
            distances, indices = knn.kneighbors(X_test)
            if len(X_test) > 1:
                score = silhouette_score(X_test, indices[:, 0])
                scores.append(score)

        mean_score = np.mean(scores)
        # print(f"n_neighbors = {n_neighbors}, Mean Silhouette Score = {mean_score:.4f}")

        if mean_score > best_score:
            best_score = mean_score
            best_n_neighbors = n_neighbors

    return best_n_neighbors


def get_product_recommendations(user_index, train_df, test_df, indices, product_columns, n_recommendations=2):
    """
    Generate product recommendations for a given user based on nearest neighbors.

    Parameters:
    user_index (int): The index of the user in the test set.
    train_df (pd.DataFrame): The DataFrame with training data.
    test_df (pd.DataFrame): The DataFrame with test data.
    indices (np.ndarray): Indices of the nearest neighbors.
    product_columns (List[str]): List of product columns in the DataFrame.
    n_recommendations (int): Number of products to recommend.

    Returns:
    List[str]: List of recommended product names.
    """
    # Get the indices of nearest neighbors
    neighbors_indices = indices[user_index]

    # Get the products owned by the nearest neighbors
    neighbors_products = train_df.iloc[neighbors_indices]

    # Sum the number of times each product is owned by the neighbors
    product_counts = neighbors_products[product_columns].sum().sort_values(ascending=False)

    # Get the products that the user already owns
    user_products = set(test_df.loc[user_index, product_columns].index[test_df.loc[user_index, product_columns] > 0])

    # Ensure `owns_product_24` is not in recommendations
    if 'owns_product_24' in product_counts.index:
        product_counts = product_counts.drop('owns_product_24')

    # Exclude products the user already owns
    recommended_products = [product for product in product_counts.index if product not in user_products]

    return recommended_products[:n_recommendations]

def map_product_codes_to_names(product_codes: List[str], mapping: Dict[str, str]) -> List[str]:
    """
    Map product codes to product names.

    Parameters:
    product_codes (List[str]): List of product codes to be mapped.
    mapping (Dict[str, str]): Dictionary mapping product codes to product names.

    Returns:
    List[str]: List of product names.
    """
    return [mapping.get(code, 'None') for code in product_codes]

def get_product_recommendations_for_user_id(user_id, train_df, test_df, test_df_ids, indices, product_columns, n_recommendations=2):
    """
    Generate product recommendations for a specific user ID.

    Parameters:
    user_id (int): The user ID for which recommendations are to be generated.
    train_df (pd.DataFrame): The DataFrame with training data.
    test_df (pd.DataFrame): The DataFrame with test data.
    test_df_ids (pd.DataFrame): The DataFrame containing user IDs.
    indices (np.ndarray): Indices of the nearest neighbors.
    product_columns (List[str]): List of product columns in the DataFrame.
    n_recommendations (int): Number of products to recommend.

    Returns:
    List[str]: List of recommended product names.
    """
    # Find the index of the user_id in the test DataFrame
    user_index = test_df_ids[test_df_ids['cust_id'] == user_id].index
    if user_index.empty:
        raise ValueError(f"User ID {user_id} not found in test DataFrame.")

    user_index = user_index[0]  # Get the first index (should be a single index)

    # Get the indices of nearest neighbors
    neighbors_indices = indices[user_index]

    # Get the products owned by the nearest neighbors
    neighbors_products = train_df.iloc[neighbors_indices]

    # Sum the number of times each product is owned by the neighbors
    product_counts = neighbors_products[product_columns].sum().sort_values(ascending=False)

    # Get the products that the user already owns
    user_products = set(test_df.loc[user_index, product_columns].index[test_df.loc[user_index, product_columns] > 0])

    # Ensure `owns_product_24` is not in recommendations
    if 'owns_product_24' in product_counts.index:
        product_counts = product_counts.drop('owns_product_24')

    # Exclude products the user already owns
    recommended_products = [product for product in product_counts.index if product not in user_products]

    return recommended_products[:n_recommendations]



## Access Hugging Face

The cells below will use the Hugging Face Client Library to get train and test from the Santander Product Recommendation dataset

In [19]:
# NB: This may take a few seconds to a few minutes to run

ds_train = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=HF_TRAIN_DATASET_NAME, repo_type="dataset")
)

  ds_train = pd.read_csv(


In [20]:
ds_test = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=HF_TEST_DATASET_NAME, repo_type="dataset")
)

  ds_test = pd.read_csv(


In [21]:
# Data Profiling
# Get an overview of the test and train dataframe’s structure

ds_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13647309 entries, 0 to 13647308
Data columns (total 48 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   fecha_dato             object 
 1   ncodpers               int64  
 2   ind_empleado           object 
 3   pais_residencia        object 
 4   sexo                   object 
 5   age                    object 
 6   fecha_alta             object 
 7   ind_nuevo              float64
 8   antiguedad             object 
 9   indrel                 float64
 10  ult_fec_cli_1t         object 
 11  indrel_1mes            object 
 12  tiprel_1mes            object 
 13  indresi                object 
 14  indext                 object 
 15  conyuemp               object 
 16  canal_entrada          object 
 17  indfall                object 
 18  tipodom                float64
 19  cod_prov               float64
 20  nomprov                object 
 21  ind_actividad_cliente  float64
 22  renta           

In [22]:
ds_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 929615 entries, 0 to 929614
Data columns (total 24 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   fecha_dato             929615 non-null  object 
 1   ncodpers               929615 non-null  int64  
 2   ind_empleado           929615 non-null  object 
 3   pais_residencia        929615 non-null  object 
 4   sexo                   929610 non-null  object 
 5   age                    929615 non-null  int64  
 6   fecha_alta             929615 non-null  object 
 7   ind_nuevo              929615 non-null  int64  
 8   antiguedad             929615 non-null  int64  
 9   indrel                 929615 non-null  int64  
 10  ult_fec_cli_1t         1683 non-null    object 
 11  indrel_1mes            929592 non-null  float64
 12  tiprel_1mes            929592 non-null  object 
 13  indresi                929615 non-null  object 
 14  indext                 929615 non-nu

In [23]:
# Check for nulls

ds_train.isnull().sum()

fecha_dato                      0
ncodpers                        0
ind_empleado                27734
pais_residencia             27734
sexo                        27804
age                             0
fecha_alta                  27734
ind_nuevo                   27734
antiguedad                      0
indrel                      27734
ult_fec_cli_1t           13622516
indrel_1mes                149781
tiprel_1mes                149781
indresi                     27734
indext                      27734
conyuemp                 13645501
canal_entrada              186126
indfall                     27734
tipodom                     27735
cod_prov                    93591
nomprov                     93591
ind_actividad_cliente       27734
renta                     2794375
segmento                   189368
ind_ahor_fin_ult1               0
ind_aval_fin_ult1               0
ind_cco_fin_ult1                0
ind_cder_fin_ult1               0
ind_cno_fin_ult1                0
ind_ctju_fin_u

In [24]:
# Check for nulls

ds_test.isnull().sum()

fecha_dato                    0
ncodpers                      0
ind_empleado                  0
pais_residencia               0
sexo                          5
age                           0
fecha_alta                    0
ind_nuevo                     0
antiguedad                    0
indrel                        0
ult_fec_cli_1t           927932
indrel_1mes                  23
tiprel_1mes                  23
indresi                       0
indext                        0
conyuemp                 929511
canal_entrada              2081
indfall                       0
tipodom                       0
cod_prov                   3996
nomprov                    3996
ind_actividad_cliente         0
renta                         0
segmento                   2248
dtype: int64

In [25]:
# Generate a detailed statistical summary of the train DataFrame, including both numeric and categorical columns
# Transpose for easier readability and analysis
ds_train.describe(include = 'all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
fecha_dato,13647309.0,17.0,2016-05-28,931453.0,,,,,,,
ncodpers,13647309.0,,,,834904.211501,431565.025784,15889.0,452813.0,931893.0,1199286.0,1553689.0
ind_empleado,13619575.0,5.0,N,13610977.0,,,,,,,
pais_residencia,13619575.0,118.0,ES,13553710.0,,,,,,,
sexo,13619505.0,2.0,V,7424252.0,,,,,,,
age,13647309.0,235.0,23.0,542682.0,,,,,,,
fecha_alta,13619575.0,6756.0,2014-07-28,57389.0,,,,,,,
ind_nuevo,13619575.0,,,,0.059562,0.236673,0.0,0.0,0.0,0.0,1.0
antiguedad,13647309.0,507.0,0.0,134335.0,,,,,,,
indrel,13619575.0,,,,1.178399,4.177469,1.0,1.0,1.0,1.0,99.0


## Data Pre-processing

- Convert column names from Spanish to English
- Check for and address missing and null values.
- Apply imputation based on the findings.
- Drop rows with excessive NaN values for specific columns.
- Update specific column values and formatting.
- Drop unnecessary columns.
- Map product names to numbers
- Handle categorical data types.
- Normalize the data.

In [26]:
# Convert column names from Spanish to more readable English names

col_names = {"fecha_dato": "date_id","ncodpers":"cust_id", "ind_empleado":"emp_index","pais_residencia":"cust_country_res",
            "sexo":"gender","fecha_alta":"cust_start_date_first_holder_contract","ind_nuevo":"new_cust_index","antiguedad":"cust_seniority",
            "indrel":"cust_primary_type","ult_fec_cli_1t":"cust_last_primary_date","indrel_1mes":"cust_type_at_start_month",
            "tiprel_1mes":"cust_rel_type_at_start_month","indresi":"residence_index","indext":"foreigner_index",
            "conyuemp":"spouse_index","canal_entrada":"channel_joined", "indfall":"deceased_index", "tipodom":"address_type",
            "cod_prov":"province","nomprov":"province_name", "ind_actividad_cliente":"activity_index","renta":"gross_income",
            "segmento":"cust_category", "ind_ahor_fin_ult1":"savings_acc", "ind_aval_fin_ult1":"guarantees",
            "ind_cco_fin_ult1":"current_acc", "ind_cder_fin_ult1":"derivada_acc", "ind_cno_fin_ult1":"payroll_acc",
            "ind_ctju_fin_ult1":"jnr_acc", "ind_ctma_fin_ult1":"más_particular_acc", "ind_ctop_fin_ult1":"particular_account",
            "ind_ctpp_fin_ult1":"particular_plus_account", "ind_deco_fin_ult1":"short_term_deposits",
            "ind_deme_fin_ult1":"medium_term_deposits", "ind_dela_fin_ult1":"long_term_deposits", "ind_ecue_fin_ult1":"e_acc",
            "ind_fond_fin_ult1":"funds","ind_hip_fin_ult1":"mortgage", "ind_plan_fin_ult1":"pensions_plan", "ind_pres_fin_ult1":"loans",
            "ind_reca_fin_ult1":"taxes", "ind_tjcr_fin_ult1":"credit_card", "ind_valo_fin_ult1":"securities", "ind_viv_fin_ult1":"home_acc",
            "ind_nomina_ult1":"payroll", "ind_nom_pens_ult1": "pensions", "ind_recibo_ult1":"direct_debit"}

In [27]:
# Rename columns in ds_train and ds_test according to the col_names dictionary,
ds_train_renamed = ds_train.rename(col_names, axis = 1, inplace = False)
ds_test_renamed = ds_test.rename(col_names, axis = 1, inplace = False)

# sample output of the renamed columns for ds_train
ds_train_renamed.sample(3)

Unnamed: 0,date_id,cust_id,emp_index,cust_country_res,gender,age,cust_start_date_first_holder_contract,new_cust_index,cust_seniority,cust_primary_type,...,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
11926885,2016-04-28,1542763,N,ES,V,58,2016-04-01,1.0,0,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
8308171,2015-12-28,1258914,N,ES,V,56,2014-05-20,0.0,19,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
12412790,2016-04-28,935661,N,ES,H,26,2011-08-22,0.0,56,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0


In [28]:
# check the updated column names
ds_train_renamed.columns

Index(['date_id', 'cust_id', 'emp_index', 'cust_country_res', 'gender', 'age',
       'cust_start_date_first_holder_contract', 'new_cust_index',
       'cust_seniority', 'cust_primary_type', 'cust_last_primary_date',
       'cust_type_at_start_month', 'cust_rel_type_at_start_month',
       'residence_index', 'foreigner_index', 'spouse_index', 'channel_joined',
       'deceased_index', 'address_type', 'province', 'province_name',
       'activity_index', 'gross_income', 'cust_category', 'savings_acc',
       'guarantees', 'current_acc', 'derivada_acc', 'payroll_acc', 'jnr_acc',
       'más_particular_acc', 'particular_account', 'particular_plus_account',
       'short_term_deposits', 'medium_term_deposits', 'long_term_deposits',
       'e_acc', 'funds', 'mortgage', 'pensions_plan', 'loans', 'taxes',
       'credit_card', 'securities', 'home_acc', 'payroll', 'pensions',
       'direct_debit'],
      dtype='object')

In [29]:
ds_train_renamed.dtypes

date_id                                   object
cust_id                                    int64
emp_index                                 object
cust_country_res                          object
gender                                    object
age                                       object
cust_start_date_first_holder_contract     object
new_cust_index                           float64
cust_seniority                            object
cust_primary_type                        float64
cust_last_primary_date                    object
cust_type_at_start_month                  object
cust_rel_type_at_start_month              object
residence_index                           object
foreigner_index                           object
spouse_index                              object
channel_joined                            object
deceased_index                            object
address_type                             float64
province                                 float64
province_name       

In [30]:
# Perform data type conversions on columns

ds_train_renamed.age = pd.to_numeric(ds_train_renamed.age, errors='coerce')
ds_train_renamed.gross_income = pd.to_numeric(ds_train_renamed.gross_income, errors='coerce')
ds_train_renamed.cust_seniority = pd.to_numeric(ds_train_renamed.cust_seniority, errors='coerce')
ds_train_renamed.cust_start_date_first_holder_contract = pd.to_datetime(ds_train_renamed.cust_start_date_first_holder_contract, errors = 'coerce')
ds_train_renamed['date_id'] = pd.to_datetime(ds_train_renamed['date_id'])

ds_test_renamed.age = pd.to_numeric(ds_test_renamed.age, errors='coerce')
ds_test_renamed.gross_income = pd.to_numeric(ds_test_renamed.gross_income, errors='coerce')
ds_test_renamed.cust_seniority = pd.to_numeric(ds_test_renamed.cust_seniority, errors='coerce')
ds_test_renamed.cust_start_date_first_holder_contract = pd.to_datetime(ds_test_renamed.cust_start_date_first_holder_contract, errors = 'coerce')
ds_test_renamed['date_id'] = pd.to_datetime(ds_test_renamed['date_id'])


In [31]:
# Get the percentage of missing values in each column
ds_train_renamed.isnull().sum()/ds_train_renamed.shape[0] * 100
# ds_test_renamed.isnull().sum()/ds_train_renamed.shape[0] * 100

date_id                                   0.000000
cust_id                                   0.000000
emp_index                                 0.203220
cust_country_res                          0.203220
gender                                    0.203732
age                                       0.203220
cust_start_date_first_holder_contract     0.203220
new_cust_index                            0.203220
cust_seniority                            0.203220
cust_primary_type                         0.203220
cust_last_primary_date                   99.818330
cust_type_at_start_month                  1.097513
cust_rel_type_at_start_month              1.097513
residence_index                           0.203220
foreigner_index                           0.203220
spouse_index                             99.986752
channel_joined                            1.363829
deceased_index                            0.203220
address_type                              0.203227
province                       

In [32]:
# According to the percentage of missing values, we will perform appropriate imputation for each column

# Imputation 1: Drop columns which have > 99% missing values
columns_to_drop = ['cust_last_primary_date', 'spouse_index']

# Drop columns if they exist in the DataFrame
ds_train_renamed.drop(columns=[col for col in columns_to_drop if col in ds_train_renamed.columns], axis=1, inplace=True)
ds_test_renamed.drop(columns=[col for col in columns_to_drop if col in ds_test_renamed.columns], axis=1, inplace=True)

In [33]:
# Imputation 2: Missing values that have <10% are imputed with the most common value (mode) in each column

# Columns <10% missing values
cols = ['emp_index','cust_country_res','cust_start_date_first_holder_contract','new_cust_index',
        'cust_primary_type',"cust_type_at_start_month", "cust_rel_type_at_start_month", "province","province_name",
        "activity_index","channel_joined","cust_category"]

# Impute missing values with the most common value (mode) in each column
for i in cols:
    if i in ds_train_renamed.columns and i in ds_test_renamed.columns:
        # Impute with mode for train DataFrame
        mode_value_train = ds_train_renamed[i].mode()[0]  # Use mode()[0] for the first mode
        ds_train_renamed[i].fillna(mode_value_train, inplace=True)

        # Impute with mode for test DataFrame
        mode_value_test = ds_test_renamed[i].mode()[0]  # Use mode()[0] for the first mode
        ds_test_renamed[i].fillna(mode_value_test, inplace=True)

# For features with missing value accounts for over 10%, impute their missing values based on the mean
# For gross_income, impute the missing values using the mean value
ds_train_renamed['gross_income'].fillna(ds_train_renamed['gross_income'].mean(), inplace=True)
ds_test_renamed['gross_income'].fillna(ds_test_renamed['gross_income'].mean(), inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  ds_train_renamed[i].fillna(mode_value_train, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  ds_test_renamed[i].fillna(mode_value_test, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which

In [34]:
# Get the percentage of missing values in each column
ds_train_renamed.isnull().sum()/ds_train_renamed.shape[0] * 100

date_id                                  0.000000
cust_id                                  0.000000
emp_index                                0.000000
cust_country_res                         0.000000
gender                                   0.203732
age                                      0.203220
cust_start_date_first_holder_contract    0.000000
new_cust_index                           0.000000
cust_seniority                           0.203220
cust_primary_type                        0.000000
cust_type_at_start_month                 0.000000
cust_rel_type_at_start_month             0.000000
residence_index                          0.203220
foreigner_index                          0.203220
channel_joined                           0.000000
deceased_index                           0.203220
address_type                             0.203227
province                                 0.000000
province_name                            0.000000
activity_index                           0.000000


In [35]:
# Get the percentage of missing values in each column for test
ds_test_renamed.isnull().sum()/ds_test_renamed.shape[0] * 100

date_id                                  0.000000
cust_id                                  0.000000
emp_index                                0.000000
cust_country_res                         0.000000
gender                                   0.000538
age                                      0.000000
cust_start_date_first_holder_contract    0.000000
new_cust_index                           0.000000
cust_seniority                           0.000000
cust_primary_type                        0.000000
cust_type_at_start_month                 0.000000
cust_rel_type_at_start_month             0.000000
residence_index                          0.000000
foreigner_index                          0.000000
channel_joined                           0.000000
deceased_index                           0.000000
address_type                             0.000000
province                                 0.000000
province_name                            0.000000
activity_index                           0.000000


In [36]:
# Drop rows where 'payroll' or 'pensions' is NaN in the training data
df_train_cleaned = ds_train_renamed.dropna(subset=['payroll', 'pensions'])


In [37]:
unique_values = df_train_cleaned['deceased_index'].unique()
print(unique_values)

['N' nan 'S']


In [38]:
#df_train_cleaned['cust_id','emp_index'].head(10)
df_train_cleaned[['cust_id', 'payroll_acc']].sample(5)

Unnamed: 0,cust_id,payroll_acc
4363534,979231,0
6788897,336612,0
12099308,257947,0
13393012,1477911,0
7214264,463443,1


### Features to consider

- emp_index: Employment status might affect financial needs.
 - A active, B ex employed, F filial, N not employee, P pasive

- channel_joined: The channel through which the customer joined might influence their behavior.

- cust_primary_type: This will indicate the main relationship type the customer has with the bank.
    - 1 (First/Primary), 99 (Primary customer during the month but not at the end of the month)

- cust_rel_type_at_start_month: The type of relationship at the start of the month could influence product needs.
    - Customer relation type at the beginning of the month, A (active), I (inactive), P (former customer),R (Potential)

- activity_index: Indicates whether the customer is active, which is crucial for determining product needs.
    - activity_index: Indicates whether the customer is active, which is crucial for determining product needs.

- gross_income: Income level can influence the types of products a customer might need.
    - gross income of the household

- cust_category: Customer segmentation can indicate different needs and preferences.
    - segmentation: 01 - VIP, 02 - Individuals 03 - college graduated

- All banking products

### Create Dataset with the key features
- Pre-process features selected 
- Check for nulls 
- Check that train and test have the same columns

In [39]:
df_train_cleaned.columns

Index(['date_id', 'cust_id', 'emp_index', 'cust_country_res', 'gender', 'age',
       'cust_start_date_first_holder_contract', 'new_cust_index',
       'cust_seniority', 'cust_primary_type', 'cust_type_at_start_month',
       'cust_rel_type_at_start_month', 'residence_index', 'foreigner_index',
       'channel_joined', 'deceased_index', 'address_type', 'province',
       'province_name', 'activity_index', 'gross_income', 'cust_category',
       'savings_acc', 'guarantees', 'current_acc', 'derivada_acc',
       'payroll_acc', 'jnr_acc', 'más_particular_acc', 'particular_account',
       'particular_plus_account', 'short_term_deposits',
       'medium_term_deposits', 'long_term_deposits', 'e_acc', 'funds',
       'mortgage', 'pensions_plan', 'loans', 'taxes', 'credit_card',
       'securities', 'home_acc', 'payroll', 'pensions', 'direct_debit'],
      dtype='object')

In [40]:
# Define the columns to retain in the new DataFrame
retained_columns = [
    'cust_id', 'date_id', 'age', 'gender', 'emp_index', 'channel_joined',
    'cust_primary_type', 'cust_rel_type_at_start_month', 'activity_index',
    'gross_income', 'cust_category', 'deceased_index'
]

# Define the banking product columns
product_columns = [
    'savings_acc', 'guarantees', 'current_acc', 'derivada_acc', 'payroll_acc', 'jnr_acc',
    'más_particular_acc', 'particular_account', 'particular_plus_account', 'short_term_deposits',
    'medium_term_deposits', 'long_term_deposits', 'e_acc', 'funds', 'mortgage', 'pensions_plan',
    'loans', 'taxes', 'credit_card', 'securities', 'home_acc', 'payroll', 'pensions', 'direct_debit'
]

# Combine all the columns to include in the new DataFrame
all_columns = retained_columns + product_columns

# Create the new DataFrame with the selected columns
df_train_subset_cols = df_train_cleaned[all_columns]

# Display the first few rows of the new DataFrame
df_train_subset_cols.head(5)

Unnamed: 0,cust_id,date_id,age,gender,emp_index,channel_joined,cust_primary_type,cust_rel_type_at_start_month,activity_index,gross_income,...,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
0,1375586,2015-01-28,35.0,H,N,KHL,1.0,A,1.0,87218.1,...,0,0,0,0,0,0,0,0.0,0.0,0
1,1050611,2015-01-28,23.0,V,N,KHE,1.0,I,0.0,35548.74,...,0,0,0,0,0,0,0,0.0,0.0,0
2,1050612,2015-01-28,23.0,V,N,KHE,1.0,I,0.0,122179.11,...,0,0,0,0,0,0,0,0.0,0.0,0
3,1050613,2015-01-28,22.0,H,N,KHD,1.0,I,0.0,119775.54,...,0,0,0,0,0,0,0,0.0,0.0,0
4,1050614,2015-01-28,23.0,V,N,KHE,1.0,A,1.0,134254.318238,...,0,0,0,0,0,0,0,0.0,0.0,0


In [41]:
# Get the percentage of missing values in each column
df_train_subset_cols.isnull().sum()/df_train_subset_cols.shape[0] * 100

cust_id                         0.000000
date_id                         0.000000
age                             0.087211
gender                          0.087725
emp_index                       0.000000
channel_joined                  0.000000
cust_primary_type               0.000000
cust_rel_type_at_start_month    0.000000
activity_index                  0.000000
gross_income                    0.000000
cust_category                   0.000000
deceased_index                  0.087211
savings_acc                     0.000000
guarantees                      0.000000
current_acc                     0.000000
derivada_acc                    0.000000
payroll_acc                     0.000000
jnr_acc                         0.000000
más_particular_acc              0.000000
particular_account              0.000000
particular_plus_account         0.000000
short_term_deposits             0.000000
medium_term_deposits            0.000000
long_term_deposits              0.000000
e_acc           

In [42]:
# Since age, gender and deceased_index have very low percentages of missing values (0.087211%)
# Remove rows with missing values in 'age', 'gender' 'deceased_index'

df_train_subset_cols_cleaned = df_train_subset_cols.dropna(subset=['age', 'deceased_index', 'gender'])
df_test_subset_cols_cleaned = ds_test_renamed.dropna(subset=['gender'])


In [43]:
#  Check the unique values of each column in the train DataFrame
for column in df_train_subset_cols_cleaned.columns:
    unique_values = df_train_subset_cols_cleaned[column].unique()
    print(f"Unique values in {column}: {unique_values}\n")


Unique values in cust_id: [1375586 1050611 1050612 ... 1173729 1164094 1550586]

Unique values in date_id: <DatetimeArray>
['2015-01-28 00:00:00', '2015-02-28 00:00:00', '2015-03-28 00:00:00',
 '2015-04-28 00:00:00', '2015-05-28 00:00:00', '2015-06-28 00:00:00',
 '2015-07-28 00:00:00', '2015-08-28 00:00:00', '2015-09-28 00:00:00',
 '2015-10-28 00:00:00', '2015-11-28 00:00:00', '2015-12-28 00:00:00',
 '2016-01-28 00:00:00', '2016-02-28 00:00:00', '2016-03-28 00:00:00',
 '2016-04-28 00:00:00', '2016-05-28 00:00:00']
Length: 17, dtype: datetime64[ns]

Unique values in age: [ 35.  23.  22.  24.  65.  28.  25.  26.  53.  27.  32.  37.  31.  39.
  63.  33.  55.  42.  58.  38.  50.  30.  45.  44.  36.  29.  60.  57.
  67.  47.  34.  48.  46.  54.  84.  15.  12.   8.   6.  83.  40.  77.
  69.  52.  59.  43.  10.   9.  49.  41.  51.  78.  16.  11.  73.  62.
  66.  17.  68.  82.  95.  96.  56.  61.  79.  72.  14.  19.  13.  86.
  64.  20.  89.  71.   7.  70.  74.  21.  18.  75.   4.  80.  81.   

In [44]:
#  Check the unique values of each column in the test DataFrame
for column in df_test_subset_cols_cleaned.columns:
    unique_values = df_test_subset_cols_cleaned[column].unique()
    print(f"Unique values in {column}: {unique_values}\n")

Unique values in date_id: <DatetimeArray>
['2016-06-28 00:00:00']
Length: 1, dtype: datetime64[ns]

Unique values in cust_id: [  15889 1170544 1170545 ...  660240  660243  660248]

Unique values in emp_index: ['F' 'N' 'A' 'B' 'S']

Unique values in cust_country_res: ['ES' 'CH' 'DE' 'GB' 'BE' 'DJ' 'IE' 'QA' 'US' 'VE' 'DO' 'SE' 'AR' 'CA'
 'PL' 'CN' 'CM' 'FR' 'AT' 'RO' 'LU' 'PT' 'CL' 'IT' 'MR' 'MX' 'SN' 'BR'
 'CO' 'PE' 'RU' 'LT' 'EE' 'MA' 'HN' 'BG' 'NO' 'GT' 'UA' 'NL' 'GA' 'IL'
 'JP' 'EC' 'IN' 'DZ' 'ET' 'SA' 'HU' 'JM' 'CI' 'CU' 'BO' 'TG' 'TN' 'NG'
 'AU' 'GR' 'DK' 'LB' 'UY' 'TH' 'SG' 'MD' 'SK' 'AD' 'BY' 'HK' 'HR' 'EG'
 'GQ' 'PR' 'ZA' 'PA' 'KE' 'TR' 'FI' 'BA' 'SV' 'PY' 'PK' 'KR' 'AO' 'GN'
 'IS' 'TW' 'MK' 'VN' 'CZ' 'CR' 'MZ' 'MT' 'LY' 'GH' 'KH' 'AE' 'RS' 'OM'
 'GE' 'NI' 'GI' 'NZ' 'MM' 'PH' 'KW' 'BM' 'CG' 'ML' 'AL' 'ZW' 'CF' 'GM'
 'CD' 'BZ' 'KZ' 'GW' 'SL' 'LV']

Unique values in gender: ['V' 'H']

Unique values in age: [ 56  36  22  51  41  33  23  43  63  62  32  58  71  31  30  59  45  37
 

In [45]:
# Update gender values in the data as it's currently reflected as "H" and "V" instead of "F" and "M"
df_train_subset_cols_cleaned = update_gender_columns(df_train_subset_cols_cleaned)
df_test_subset_cols_cleaned = update_gender_columns(df_test_subset_cols_cleaned)

# sample output of the gender update for ds_train
# df_train_subset_cols_cleaned.sample(5)
df_test_subset_cols_cleaned.sample(5)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[column_name].replace({'H': 'M', 'V': 'F'}, inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_name].replace({'H': 'M', 'V': 'F'}, inplace=True)


Unnamed: 0,date_id,cust_id,emp_index,cust_country_res,gender,age,cust_start_date_first_holder_contract,new_cust_index,cust_seniority,cust_primary_type,...,residence_index,foreigner_index,channel_joined,deceased_index,address_type,province,province_name,activity_index,gross_income,cust_category
431796,2016-06-28,1402755,N,ES,M,48,2015-06-22,0,12,1,...,S,N,KHM,N,1,20.0,GIPUZKOA,0,134087.870595,02 - PARTICULARES
904384,2016-06-28,648353,N,ES,F,36,2006-10-14,0,63,1,...,S,N,KFC,N,1,41.0,SEVILLA,0,119187.24,02 - PARTICULARES
182598,2016-06-28,1083081,N,ES,M,33,2012-10-17,0,44,1,...,S,N,KHE,N,1,30.0,MURCIA,1,84323.25,03 - UNIVERSITARIO
421240,2016-06-28,1407277,N,ES,M,24,2015-07-16,0,24,1,...,S,S,RED,N,1,9.0,BURGOS,0,91655.13,02 - PARTICULARES
220430,2016-06-28,1075232,N,ES,F,25,2012-10-04,0,44,1,...,S,N,KHE,N,1,10.0,CACERES,0,60780.75,03 - UNIVERSITARIO


In [46]:
# Update date_id format to YYYYMMDD (20150328)
df_train_subset_cols_cleaned = update_date_id(df_train_subset_cols_cleaned)
df_test_subset_cols_cleaned = update_date_id(df_test_subset_cols_cleaned)

# sample output of the date format update for ds_train
# df_train_subset_cols_cleaned.sample(5)
df_test_subset_cols_cleaned.sample(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_name] = df[column_name].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_name] = pd.to_datetime(df[column_name], format='%Y-%m-%d', errors='coerce').dt.strftime('%Y%m%d')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_name] = df[column_name].astype(str)
A val

Unnamed: 0,date_id,cust_id,emp_index,cust_country_res,gender,age,cust_start_date_first_holder_contract,new_cust_index,cust_seniority,cust_primary_type,...,residence_index,foreigner_index,channel_joined,deceased_index,address_type,province,province_name,activity_index,gross_income,cust_category
261834,20160628,1474184,N,ES,F,23,2015-10-09,0,8,1,...,S,N,KHQ,N,1,36.0,PONTEVEDRA,0,134087.870595,03 - UNIVERSITARIO
733049,20160628,756251,N,ES,F,90,2008-05-23,0,97,1,...,S,N,KAP,S,1,26.0,"RIOJA, LA",0,64388.73,02 - PARTICULARES
875502,20160628,711475,N,ES,M,45,2007-08-14,0,106,1,...,S,N,KFC,N,1,23.0,JAEN,0,49358.4,02 - PARTICULARES
462366,20160628,1362262,N,ES,F,21,2014-11-27,0,19,1,...,S,N,KHE,N,1,38.0,SANTA CRUZ DE TENERIFE,0,150591.99,03 - UNIVERSITARIO
687028,20160628,189339,N,ES,F,67,2000-08-28,0,190,1,...,S,N,KFC,N,1,8.0,BARCELONA,0,148151.79,02 - PARTICULARES


In [47]:
# Update the values in the cust_category column from ['02 - PARTICULARES', '03 - UNIVERSITARIO', '01 - TOP'] to [2, 3, 1]

# Define the mapping from the original values to the new values
category_mapping = {
    '02 - PARTICULARES': 2,
    '03 - UNIVERSITARIO': 3,
    '01 - TOP': 1
}

# Replace the values in the 'cust_category' column
df_train_subset_cols_cleaned['cust_category'] = df_train_subset_cols_cleaned['cust_category'].replace(category_mapping)
df_test_subset_cols_cleaned['cust_category'] = df_test_subset_cols_cleaned['cust_category'].replace(category_mapping)

# Verify the changes
df_train_subset_cols_cleaned['cust_category'].unique()

  df_train_subset_cols_cleaned['cust_category'] = df_train_subset_cols_cleaned['cust_category'].replace(category_mapping)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train_subset_cols_cleaned['cust_category'] = df_train_subset_cols_cleaned['cust_category'].replace(category_mapping)
  df_test_subset_cols_cleaned['cust_category'] = df_test_subset_cols_cleaned['cust_category'].replace(category_mapping)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test_subset_cols_cleaned['cust_category'] = df_test_subset_cols_cleaned['cust_category'].replace(category

array([2, 3, 1])

In [48]:
# Drop deceased_index
df_train_subset_cols_cleaned = df_train_subset_cols_cleaned.drop(columns=['deceased_index'])
df_test_subset_cols_cleaned = df_test_subset_cols_cleaned.drop(columns=['deceased_index'])

In [49]:
# Check that the train and test set have the same columns present
check_columns(df_train_subset_cols_cleaned.columns, df_test_subset_cols_cleaned.columns)

Train and test sets have different columns.
Columns present in train but not in test:
- pensions
- payroll_acc
- derivada_acc
- payroll
- loans
- current_acc
- mortgage
- taxes
- guarantees
- medium_term_deposits
- long_term_deposits
- e_acc
- jnr_acc
- savings_acc
- direct_debit
- securities
- short_term_deposits
- más_particular_acc
- pensions_plan
- particular_account
- home_acc
- funds
- particular_plus_account
- credit_card
Columns present in test but not in train:
- new_cust_index
- province
- cust_country_res
- residence_index
- cust_type_at_start_month
- address_type
- cust_seniority
- cust_start_date_first_holder_contract
- province_name
- foreigner_index


In [50]:
# Banking products not present in test set
# Calculate the ownership probability for each product
ownership_probabilities = {col: df_train_subset_cols_cleaned[col].mean() for col in product_columns}

print("Ownership Probabilities Based on Training Data:")
print(ownership_probabilities)

Ownership Probabilities Based on Training Data:
{'savings_acc': 0.00010250168731287568, 'guarantees': 2.320238767254206e-05, 'current_acc': 0.6562846016619959, 'derivada_acc': 0.000394440590433215, 'payroll_acc': 0.08098925582600207, 'jnr_acc': 0.00947538520369053, 'más_particular_acc': 0.00971078664317841, 'particular_account': 0.12918318490658248, 'particular_plus_account': 0.04336900724913079, 'short_term_deposits': 0.0017499446373408067, 'medium_term_deposits': 0.0016640370627304451, 'long_term_deposits': 0.04304307244255353, 'e_acc': 0.08282503461267579, 'funds': 0.018514624259359225, 'mortgage': 0.005878280861672064, 'pensions_plan': 0.009185355357783755, 'loans': 0.00253486085322522, 'taxes': 0.051892140029640314, 'credit_card': 0.044463484434722284, 'securities': 0.02564921161811102, 'home_acc': 0.003853652261410435, 'payroll': 0.05475734120608948, 'pensions': 0.05946419519140795, 'direct_debit': 0.12811146955699887}


In [51]:
# Approach for Imputing Product Ownership (test set)
# Ownership probabilities based on training data
product_probabilities = {
    'savings_acc': 0.00010250168731287568,
    'guarantees': 2.320238767254206e-05,
    'current_acc': 0.6562846016619959,
    'derivada_acc': 0.000394440590433215,
    'payroll_acc': 0.08098925582600207,
    'jnr_acc': 0.00947538520369053,
    'más_particular_acc': 0.00971078664317841,
    'particular_account': 0.12918318490658248,
    'particular_plus_account': 0.04336900724913079,
    'short_term_deposits': 0.0017499446373408067,
    'medium_term_deposits': 0.0016640370627304451,
    'long_term_deposits': 0.04304307244255353,
    'e_acc': 0.08282503461267579,
    'funds': 0.018514624259359225,
    'mortgage': 0.005878280861672064,
    'pensions_plan': 0.009185355357783755,
    'loans': 0.00253486085322522,
    'taxes': 0.051892140029640314,
    'credit_card': 0.044463484434722284,
    'securities': 0.02564921161811102,
    'home_acc': 0.003853652261410435,
    'payroll': 0.05475734120608948,
    'pensions': 0.05946419519140795,
    'direct_debit': 0.12811146955699887
}

# Add missing product columns to the test DataFrame with default NaN values
for product in product_probabilities:
    if product not in df_test_subset_cols_cleaned.columns:
        df_test_subset_cols_cleaned[product] = np.nan

# Ensure the size parameter matches the number of rows in the DataFrame
num_rows = len(df_test_subset_cols_cleaned)

# Fill product columns based on probabilities
for product, prob in product_probabilities.items():
    if product in df_test_subset_cols_cleaned.columns:
        # Ensure size matches the number of rows
        df_test_subset_cols_cleaned[product] = np.random.binomial(1, prob, size=num_rows)

# Ensure all columns are present and in the correct order
required_columns = [
    'cust_id', 'date_id', 'age', 'gender', 'emp_index', 'channel_joined',
    'cust_primary_type', 'cust_rel_type_at_start_month', 'activity_index',
    'gross_income', 'cust_category'
] + list(product_probabilities.keys())

# Ensure the DataFrame has the required columns
for col in required_columns:
    if col not in df_test_subset_cols_cleaned.columns:
        df_test_subset_cols_cleaned[col] = np.nan

# Reorder columns to match the required list
df_test_subset_cols_cleaned = df_test_subset_cols_cleaned[required_columns]


In [52]:
df_test_subset_cols_cleaned.sample(3)

Unnamed: 0,cust_id,date_id,age,gender,emp_index,channel_joined,cust_primary_type,cust_rel_type_at_start_month,activity_index,gross_income,...,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
905636,648564,20160628,50,M,N,KFC,1,A,1,251187.39,...,0,0,0,0,0,0,0,0,0,0
152026,981929,20160628,26,F,N,KHE,1,A,1,101509.38,...,0,0,0,0,0,0,0,0,0,1
212511,1047838,20160628,24,M,N,KHE,1,A,1,359744.16,...,0,0,0,0,1,1,0,0,0,0


### Add a new column
 - owned_products
    - list of products owned by the customer for the latest date_id
    - format: list e.g: ['current_acc', 'mortgage']
    - map product names to numbers eg: savings_acc = 0 and no products owned will be -1

In [53]:
# Create the 'owned_products' column
# Check if the value is not equal to 0 for the products.
# If the value is not 0, it includes the column name (product) in the list.
# NB: This may take a few minutes to finish running
df_train_subset_cols_cleaned['owned_products'] = df_train_subset_cols_cleaned[product_columns].apply(lambda row: [col for col in product_columns if row[col] != 0], axis=1)
df_test_subset_cols_cleaned['owned_products'] = df_test_subset_cols_cleaned[product_columns].apply(lambda row: [col for col in product_columns if row[col] != 0], axis=1)

In [54]:
# Check that the train and test set now have the same columns present
check_columns(df_train_subset_cols_cleaned.columns, df_test_subset_cols_cleaned.columns)

Train and test sets have the same columns.


In [55]:
df_train_subset_cols_cleaned.sample(4)

Unnamed: 0,cust_id,date_id,age,gender,emp_index,channel_joined,cust_primary_type,cust_rel_type_at_start_month,activity_index,gross_income,...,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit,owned_products
1252244,1057509,20150228,31.0,M,N,KHE,1.0,I,0.0,112128.12,...,0,0,0,0,0,0,0.0,0.0,0,[current_acc]
12101316,239586,20160428,71.0,F,N,KAT,1.0,A,1.0,59485.11,...,0,0,0,0,0,0,0.0,0.0,0,"[current_acc, particular_account]"
9749458,1086181,20160128,23.0,M,N,KHE,1.0,I,0.0,227124.39,...,0,0,0,0,0,0,0.0,0.0,0,[current_acc]
11305989,368726,20160328,102.0,M,N,KFC,1.0,A,1.0,260731.17,...,0,0,0,0,0,0,0.0,0.0,0,"[current_acc, particular_account]"


In [56]:
# Map product names to numbers eg: savings_acc = 0 and no products owned will be -1

# Create a dictionary to map product names to numbers
product_mapping = {product: idx for idx, product in enumerate(product_columns)}

In [57]:
# Replace product names with numbers and ensure non-empty lists
df_train_subset_cols_cleaned['owned_products'] = df_train_subset_cols_cleaned['owned_products'].apply(lambda x: map_products_with_numbers(x, product_mapping))
df_test_subset_cols_cleaned['owned_products'] = df_test_subset_cols_cleaned['owned_products'].apply(lambda x: map_products_with_numbers(x, product_mapping))

In [58]:
# View a small sample of the selected columns with the mapping update

df_train_subset_cols_cleaned.sample(10)

Unnamed: 0,cust_id,date_id,age,gender,emp_index,channel_joined,cust_primary_type,cust_rel_type_at_start_month,activity_index,gross_income,...,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit,owned_products
13406013,1452137,20160528,22.0,M,N,KHQ,1.0,I,0.0,134254.318238,...,0,0,0,0,0,0,0.0,0.0,0,[2]
1711286,1325752,20150328,24.0,M,N,KHE,1.0,I,0.0,60600.42,...,0,0,0,0,0,0,0.0,0.0,0,[2]
13441921,1056563,20160528,51.0,F,N,KFC,1.0,A,1.0,89244.48,...,0,0,0,0,0,0,0.0,0.0,0,[2]
4817585,443220,20150828,71.0,M,N,KFC,1.0,I,0.0,77236.44,...,0,0,0,0,0,0,0.0,0.0,0,[-1]
1208944,1103156,20150228,25.0,M,N,KHE,1.0,I,0.0,164838.54,...,0,0,0,0,0,0,0.0,0.0,0,[2]
6439212,980100,20151028,27.0,M,N,KHE,1.0,I,0.0,47676.93,...,0,0,0,0,0,0,0.0,0.0,0,[2]
5570747,930664,20150928,24.0,M,N,KHE,1.0,A,0.0,73219.32,...,0,0,0,0,0,0,0.0,0.0,0,[-1]
8819809,426328,20151228,45.0,F,N,KAT,1.0,I,0.0,239618.04,...,0,0,0,0,0,0,0.0,0.0,0,[2]
12875000,962153,20160528,24.0,M,N,KHE,1.0,I,0.0,67817.97,...,0,0,0,0,0,0,0.0,0.0,0,[2]
10195949,1420266,20160228,24.0,M,N,KHQ,1.0,A,1.0,134254.318238,...,0,0,0,0,0,0,0.0,0.0,0,[2]


In [59]:
# Get the latest record for each customer

df_train_result = df_train_subset_cols_cleaned.loc[df_train_subset_cols_cleaned.groupby('cust_id')['date_id'].idxmax()]
df_test_result = df_test_subset_cols_cleaned.loc[df_test_subset_cols_cleaned.groupby('cust_id')['date_id'].idxmax()]

In [60]:
df_test_result.sample(5)

Unnamed: 0,cust_id,date_id,age,gender,emp_index,channel_joined,cust_primary_type,cust_rel_type_at_start_month,activity_index,gross_income,...,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit,owned_products
861165,542360,20160628,43,F,N,KAT,1,I,0,259866.36,...,0,0,0,0,0,0,0,0,0,[2]
502592,253711,20160628,45,M,N,KFC,1,I,1,155851.47,...,0,0,0,0,0,0,1,0,1,"[2, 21, 23]"
796505,894474,20160628,35,F,N,KFC,1,I,0,333863.49,...,0,0,0,0,0,0,0,0,0,[2]
902180,641295,20160628,59,M,N,KAH,1,I,0,114126.78,...,0,0,0,0,0,0,0,0,0,[2]
906867,645929,20160628,43,F,N,KFC,1,A,1,127036.26,...,0,0,0,0,0,0,1,0,0,"[2, 21]"


In [61]:
# Retrieve details for customers (latest record - not for all dates)
cust_num = 1339667
customer_info = get_customer_details(df_test_result, cust_num)
customer_info

Unnamed: 0,cust_id,date_id,owned_products
365363,1339667,20160628,"[2, 23]"


In [62]:
# Check column types for encoding purposes
# Set the train and test dtypes to be the same

df_train_dtypes = df_train_result.dtypes
df_test_dtypes = df_test_result.dtypes


# Find the common columns between train and test sets
common_columns = set(df_train_dtypes.index) & set(df_test_dtypes.index)

# Check for data type mismatches and correct them
for column in common_columns:
    if df_train_dtypes[column] != df_test_dtypes[column]:
        print(f"Data type mismatch in column '{column}':")
        print(f"  Train data type: {df_train_dtypes[column]}")
        print(f"  Test data type: {df_test_dtypes[column]}")
        # Convert the test column to the train column's data type
        df_test_result[column] = df_test_result[column].astype(df_train_dtypes[column])
        print(f"  Column '{column}' in test set converted to {df_train_dtypes[column]}")

# Optionally, you can check columns only present in train or test
train_only_columns = set(df_train_dtypes.index) - set(df_test_dtypes.index)
test_only_columns = set(df_test_dtypes.index) - set(df_train_dtypes.index)

if train_only_columns:
    print("Columns only in train set:")
    for column in train_only_columns:
        print(f"  {column} (data type: {df_train_dtypes[column]})")

if test_only_columns:
    print("Columns only in test set:")
    for column in test_only_columns:
        print(f"  {column} (data type: {df_test_dtypes[column]})")


Data type mismatch in column 'pensions':
  Train data type: float64
  Test data type: int64
  Column 'pensions' in test set converted to float64
Data type mismatch in column 'age':
  Train data type: float64
  Test data type: int64
  Column 'age' in test set converted to float64
Data type mismatch in column 'payroll':
  Train data type: float64
  Test data type: int64
  Column 'payroll' in test set converted to float64
Data type mismatch in column 'activity_index':
  Train data type: float64
  Test data type: int64
  Column 'activity_index' in test set converted to float64
Data type mismatch in column 'cust_primary_type':
  Train data type: float64
  Test data type: int64
  Column 'cust_primary_type' in test set converted to float64


### Create a new row for each element in the list for 'owned_products'

In [63]:
# Product codes start from 0
# There are 24 products and -1 which represents no products owned ie: 25 total product codes
expected_num_items = 25

# Get ownership cols
## train
ownership_cols_train = get_binary_item_ownership_columns(
    df_=df_train_result,
    owned_items_col="owned_products",
    nb_items=expected_num_items
)

# ownership_cols_train.sample(10)

In [64]:
## test
ownership_cols_test = get_binary_item_ownership_columns(
    df_=df_test_result,
    owned_items_col="owned_products",
    nb_items=expected_num_items
)

# ownership_cols_test.sample(5)

In [65]:
# Replace old ownership col
## train
df_train_processed = df_train_result.drop(columns="owned_products")
df_train_processed = df_train_result.join(ownership_cols_train)

## test
df_test_processed = df_test_result.drop(columns="owned_products")
df_test_processed = df_test_result.join(ownership_cols_test)


# Fill nan ownership with all 0s
df_train_processed = df_train_processed.fillna({col: 0 for col in ownership_cols_train.columns})
df_test_processed = df_test_processed.fillna({col: 0 for col in ownership_cols_train.columns})

df_train_processed.head(3)
# df_test_processed.sample(2)


Unnamed: 0,cust_id,date_id,age,gender,emp_index,channel_joined,cust_primary_type,cust_rel_type_at_start_month,activity_index,gross_income,...,owns_product_15,owns_product_16,owns_product_17,owns_product_18,owns_product_19,owns_product_20,owns_product_21,owns_product_22,owns_product_23,owns_product_24
13026343,15889,20160528,56.0,F,F,KAT,1.0,A,1.0,326124.9,...,0,0,0,1,1,0,0,0,0,0
13026342,15890,20160528,63.0,F,A,KAT,1.0,A,1.0,71461.2,...,1,0,0,1,0,0,1,1,1,0
5319232,15891,20150828,59.0,M,N,KAT,99.0,A,0.0,134254.318238,...,0,0,0,0,0,0,0,0,0,0


### Handle categorical columns

In [66]:
# Specify columns to one-hot encode
columns_to_encode = [
    'gender',
    'emp_index',
    'channel_joined',
    'cust_rel_type_at_start_month',
]

# Convert categorical data to numerical data
df_encoded_train = pre_process_categorical_data(df_train_processed, columns_to_encode)
df_encoded_test = pre_process_categorical_data(df_test_processed, columns_to_encode)

# df_encoded_train.sample(5)
# df_encoded_test.sample(5)

In [67]:
# Drop cust_id and date_id
df_encoded_train_for_predictions = df_encoded_train.drop(['cust_id', 'date_id'], axis=1)
df_encoded_test_for_predictions = df_encoded_test.drop(['cust_id', 'date_id'], axis=1)

In [68]:
# Confirm the cust_id and date_id columns were dropped
df_encoded_train_for_predictions.columns

Index(['age', 'cust_primary_type', 'activity_index', 'gross_income',
       'cust_category', 'savings_acc', 'guarantees', 'current_acc',
       'derivada_acc', 'payroll_acc',
       ...
       'channel_joined_KHO', 'channel_joined_KHP', 'channel_joined_KHQ',
       'channel_joined_KHR', 'channel_joined_KHS', 'channel_joined_RED',
       'cust_rel_type_at_start_month_A', 'cust_rel_type_at_start_month_I',
       'cust_rel_type_at_start_month_P', 'cust_rel_type_at_start_month_R'],
      dtype='object', length=228)

In [69]:
# Select only the numerical columns for PCA
numerical_cols = df_encoded_train_for_predictions.select_dtypes(include=[np.number]).columns

df_train_numerical = df_encoded_train_for_predictions[numerical_cols]
df_test_numerical = df_encoded_test_for_predictions[numerical_cols]

In [102]:
# Paths to save the CSV files
train_csv_path = os.path.join(ARTIFACT_SAVE_DIR, 'df_train_numerical.csv')
test_csv_path = os.path.join(ARTIFACT_SAVE_DIR, 'df_test_numerical.csv')

# Save the DataFrames to CSV files
df_train_numerical.to_csv(train_csv_path, index=False)
df_test_numerical.to_csv(test_csv_path, index=False)

In [70]:
# Standardize the data
scaler = StandardScaler()
df_train_scaled = scaler.fit_transform(df_train_numerical)
df_test_scaled = scaler.transform(df_test_numerical)

# Apply PCA
pca = PCA(n_components=0.95)  # Keep 95% of the variance
df_train_pca = pca.fit_transform(df_train_scaled)
df_test_pca = pca.transform(df_test_scaled)

In [72]:
# Ensure the directory exists
os.makedirs(ARTIFACT_SAVE_DIR, exist_ok=True)

In [103]:
# Save the scaler for later use
save_scaler_path = os.path.join(ARTIFACT_SAVE_DIR, "scaler.joblib")
joblib.dump(scaler, save_scaler_path)

save_pca_path = os.path.join(ARTIFACT_SAVE_DIR, "pca.joblib")
joblib.dump(pca, save_pca_path)

['../saved_model/pca.joblib']

## Fit KNN Model and Make Recommendations

**Priority Ordering:** Recommendations are prioritized based on the frequency with which similar users own the products, with more popular products being recommended first. Products that the user already owns are excluded, and only products that are not currently owned are recommended in descending order of popularity among similar users.

In [74]:
# Initialize KNN
knn = NearestNeighbors(metric='minkowski', p=2, algorithm='ball_tree')

# Fit KNN model on the PCA transformed training data
knn.fit(df_train_pca)

In [75]:
# Save the model for later use
save_model_path = os.path.join(ARTIFACT_SAVE_DIR, "model.joblib")
joblib.dump(knn, save_model_path)


['../saved_model/model.joblib']

In [82]:
# Get the optimal number of n_neighbors to use
# NB: This can take awhile to complete running
# Uncomment to change max_neighbors so optimal n_neighbors, default best_n_neighbors = 5
# best_n_neighbors = find_best_n_neighbors(df_test_pca)
# print(f"Best n_neighbors: {best_n_neighbors}")

best_n_neighbors = 5

In [83]:
# Get the nearest neighbors for the test data
# NB: This can take over an hour to complete running
distances, indices = knn.kneighbors(df_test_pca, best_n_neighbors)

In [99]:
# Define file paths
save_distances_path = os.path.join(ARTIFACT_SAVE_DIR, 'distances.joblib')
save_indices_path = os.path.join(ARTIFACT_SAVE_DIR, 'indices.joblib')

# Save distances and indices
joblib.dump(distances, save_distances_path)
joblib.dump(indices, save_indices_path)

['../saved_model/indices.joblib']

In [86]:
# Update product columns list and product name mapping
owns_product_columns = [
    'owns_product_0', 'owns_product_1', 'owns_product_2', 'owns_product_3',
    'owns_product_4', 'owns_product_5', 'owns_product_6', 'owns_product_7',
    'owns_product_8', 'owns_product_9', 'owns_product_10', 'owns_product_11',
    'owns_product_12', 'owns_product_13', 'owns_product_14', 'owns_product_15',
    'owns_product_16', 'owns_product_17', 'owns_product_18', 'owns_product_19',
    'owns_product_20', 'owns_product_21', 'owns_product_22', 'owns_product_23',
    'owns_product_24'
]

product_mapping_naming = {
    'owns_product_0': 'savings_acc',
    'owns_product_1': 'guarantees',
    'owns_product_2': 'current_acc',
    'owns_product_3': 'derivada_acc',
    'owns_product_4': 'payroll_acc',
    'owns_product_5': 'jnr_acc',
    'owns_product_6': 'más_particular_acc',
    'owns_product_7': 'particular_account',
    'owns_product_8': 'particular_plus_account',
    'owns_product_9': 'short_term_deposits',
    'owns_product_10': 'medium_term_deposits',
    'owns_product_11': 'long_term_deposits',
    'owns_product_12': 'e_acc',
    'owns_product_13': 'funds',
    'owns_product_14': 'mortgage',
    'owns_product_15': 'pensions_plan',
    'owns_product_16': 'loans',
    'owns_product_17': 'taxes',
    'owns_product_18': 'credit_card',
    'owns_product_19': 'securities',
    'owns_product_20': 'home_acc',
    'owns_product_21': 'payroll',
    'owns_product_22': 'pensions',
    'owns_product_23': 'direct_debit',
    'owns_product_24': 'no_product_owned'
}


In [89]:
# Ensure cust_id is retained
df_encoded_test_ids = df_encoded_test[['cust_id']].reset_index(drop=True)

# Path to save the CSV file
test_ids_csv_path = os.path.join(ARTIFACT_SAVE_DIR, 'df_encoded_test_ids.csv')

# Save the DataFrame to a CSV file
df_encoded_test_ids.to_csv(test_ids_csv_path, index=False)

recommendations_list = []
for i in range(len(df_encoded_test_for_predictions)):
    # Get the client number from the separate DataFrame with cust_id
    client_number = df_encoded_test_ids.iloc[i]['cust_id']

    # Get the recommendations
    recommendations = get_product_recommendations(i, df_encoded_train_for_predictions, df_encoded_test_for_predictions, indices, owns_product_columns, n_recommendations=2)

    # Append to the list
    recommendations_list.append({'cust_id': client_number, 'Recommended_Products': recommendations})

# Convert to DataFrame
recommendations_df = pd.DataFrame(recommendations_list)


In [92]:
# Map the codes to product names and store the result in 'Recommended_Products'
recommendations_df['Recommended_Products'] = recommendations_df['Recommended_Products'].apply(
    lambda x: map_product_codes_to_names(x, product_mapping_naming)
)

In [93]:
# Save recommendations to a CSV file or any other format
# recommendations_df.to_csv('product_recommendations.csv', index=False)

# Display the recommendations DataFrame
recommendations_df.sample(5)

Unnamed: 0,cust_id,Recommended_Products
739252,1317242,"[savings_acc, funds]"
522668,1042176,"[direct_debit, pensions]"
785130,1370735,"[savings_acc, funds]"
712183,1283960,"[savings_acc, mortgage]"
438731,934627,"[savings_acc, funds]"


In [96]:
user_id =  1322956

# Generate recommendations for the specified user ID
recommendations = get_product_recommendations_for_user_id(
    user_id,
    df_encoded_train_for_predictions,
    df_encoded_test_for_predictions,
    df_encoded_test_ids,
    indices,
    owns_product_columns,
    n_recommendations=2
)

# Map the product codes to product names
recommendations_with_names = map_product_codes_to_names(recommendations, product_mapping_naming)

print(f"Recommendations for user {user_id}: {recommendations_with_names}")

Recommendations for user 1322956: ['savings_acc', 'direct_debit']


### Get recommendations using the saved model

In [104]:
# Load the KNN model
knn = joblib.load(save_model_path)

# Load scaler
scaler = joblib.load(save_scaler_path)

# Load the PCA model
pca = joblib.load(save_pca_path)

# Load distances and indices if necessary
distances = joblib.load(save_distances_path)
indices = joblib.load(save_indices_path)

In [123]:
# Pre-process data
test_data_path = os.path.join(ARTIFACT_SAVE_DIR, 'df_test_numerical.csv')
test_ids_path = os.path.join(ARTIFACT_SAVE_DIR, 'df_encoded_test_ids.csv')
train_data_path = os.path.join(ARTIFACT_SAVE_DIR, 'df_train_numerical.csv')


df_encoded_test_for_predictions = pd.read_csv(test_data_path)
df_encoded_test_ids = pd.read_csv(test_ids_path)
df_encoded_train_for_predictions = pd.read_csv(train_data_path)

# Standardize the test data
scaler = joblib.load(os.path.join(ARTIFACT_SAVE_DIR, 'scaler.joblib'))
test_data_scaled = scaler.transform(df_encoded_test_for_predictions)

# Apply PCA transformation
test_data_pca = pca.transform(test_data_scaled)

In [127]:
print(test_data_scaled.shape)
print(test_data_pca.shape)

(929610, 54)
(929610, 24)


In [124]:
user_id = 1322956

# Generate recommendations for the specified user ID
recommendations = get_product_recommendations_for_user_id(
    user_id,
    df_encoded_train_for_predictions,
    df_encoded_test_for_predictions,
    df_encoded_test_ids,
    indices,
    product_columns,
    n_recommendations=2
)

# Map the product codes to product names
recommendations_with_names = map_product_codes_to_names(recommendations, product_mapping_naming)

print(f"Recommendations for user {user_id}: {recommendations_with_names}")

Recommendations for user 1322956: ['None', 'None']


In [129]:
## What's needed?
 - Check why the recommendations aren't working for the saved model
 - Test the virtual env
 - Save to model and contents HUGGINGFACE
 - Add gitignore file to not push model and dataset to repo
 - Do the PR (Remove other notebooks)

IndentationError: unexpected indent (1048977658.py, line 2)

In [None]:
def check_install(package, version=None):
    import importlib
    try:
        importlib.import_module(package)
        print(f"{package} is already installed.")
    except ImportError:
        print(f"{package} is not installed. Installing...")
        if version:
            !pip install -q {package}=={version} --trusted-host artifactory.dev.africanbank.net
            print(f"Package {package}=={version} installed or upgraded successfully!")
        else:
            !pip install -q {package} --trusted-host artifactory.dev.africanbank.net
            print(f"Package {package} installed or upgraded successfully!")

# List of dependencies with optional versions
dependencies = [
    ("numpy", None),
    ("scikit-learn", None),
    ("skops", None),
    ("jupyter", None),
    ("ipywidgets", None),
    ("joblib", "1.3.2")
]

# Check and install each dependency
for package, version in dependencies:
    check_install(package, version)


numpy is already installed.
scikit-learn is not installed. Installing...

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Package scikit-learn installed or upgraded successfully!
skops is already installed.
jupyter is already installed.
ipywidgets is already installed.
joblib is already installed.
