### Overview
This notebook prepares the `Santander Product Recommendation` dataset in the common dataset format required by the recommender model.

> **Note:** Kernel - Python 3.11

- Common Dataset: The common dataset requires the following columns:
    - cust_id
        - unique customer identifier 
        - format: numeric data type eg: 34613
    - date_id
        - identifying the specific date associated with the data entry
        - format: YYYYMMDD eg: 20221228 (28 December 2022)
    - age
        - represents the customer age
        - format: numeric data type eg: 26
    - sex
        - represents if the customer is male or female
        - format: string (F or M)
    - owned_products
        - list of products owned by the customer for the latest date_id
        - format: list e.g: ['current_acc', 'mortgage']
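
The column requirements above can be checked programmatically. The following is a minimal sketch, assuming the common dataset is held in a pandas DataFrame; the helper name `find_schema_problems` is hypothetical and not part of this notebook.

```python
import pandas as pd

# Column names are taken from the list above; the helper name is hypothetical.
REQUIRED_COLUMNS = ["cust_id", "date_id", "age", "sex", "owned_products"]


def find_schema_problems(df: pd.DataFrame) -> list:
    """Return readable problems; an empty list means the frame conforms."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in df.columns]
    if problems:
        return problems
    # date_id must be exactly eight digits (YYYYMMDD)
    if not df["date_id"].astype(str).str.fullmatch(r"\d{8}").all():
        problems.append("date_id must be YYYYMMDD")
    # sex must be one of the two documented codes
    if not df["sex"].isin(["F", "M"]).all():
        problems.append("sex must be 'F' or 'M'")
    # owned_products must hold Python lists
    if not df["owned_products"].map(lambda v: isinstance(v, list)).all():
        problems.append("owned_products must hold lists")
    return problems


example = pd.DataFrame({
    "cust_id": [34613],
    "date_id": ["20221228"],
    "age": [26],
    "sex": ["F"],
    "owned_products": [["current_acc", "mortgage"]],
})
print(find_schema_problems(example))
```

Note that the frames built later in this notebook name the column `gender` rather than `sex`, so a rename would be needed before a check like this passes.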

## Access Hugging Face for the Dataset

### Sign in to your Hugging Face account

Signing in gives this notebook access to the dataset on the Hugging Face Hub; a token with `write` access also lets you upload and share the model.

### Steps to get the `Access Token` from Hugging Face:

- **Sign In or Sign Up:** If you don't have a Hugging Face account yet, you'll need to sign up. If you already have an account, sign in.

- **Access Your Profile:** Once you're signed in, navigate to your profile settings. You can do this by clicking on your profile icon or username, usually located in the top-right corner of the Hugging Face website.

- **Navigate to Access Token Settings:** Within your profile settings, look for an option related to Access tokens. This is where you can manage and generate tokens.

- **Generate a New Token:** If you haven't generated a token before, you'll see a button (`New token`) to generate a new token. Click on this button and make sure the token has `write` access.

- **Name Your Token (Optional):** You may be prompted to give your token a name or description. This step is optional but can be helpful if you plan to generate multiple tokens for different purposes.

- **Copy Your Token:** Once your token is generated, you'll typically see it displayed on the screen. Copy the token and paste it in place of `<access_token>` in the `login` code below.

In [57]:
# Log into Hugging Face
# Replace <access_token> with your access token

HUGGINGFACE_TOKEN = "<access_token>"
!huggingface-cli login --token $HUGGINGFACE_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/verosha/.cache/huggingface/token
Login successful


In [58]:
# imports required

from huggingface_hub import hf_hub_download
from typing import List
import pandas as pd


## Config setup

In [59]:
REPO_ID = "MelioAI/santander-product-recommendation"
HF_TRAIN_DATASET_NAME = "train_ver2.csv"
HF_TEST_DATASET_NAME = "test_ver2.csv"

## Helper Functions
The following cells define helper functions used throughout this notebook.

In [60]:
def update_gender_columns(df: pd.DataFrame, column_name: str = 'gender') -> pd.DataFrame:
    """
    Update gender columns in the DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the gender column.
    column_name (str): The name of the gender column. Default is 'gender'.

    Returns:
    pd.DataFrame: The DataFrame with updated gender values.
    """
    # Assign the result back; a chained inplace=True call operates on a copy
    df[column_name] = df[column_name].replace({'H': 'M', 'V': 'F'})
    return df


In [61]:
def update_date_id(df: pd.DataFrame, column_name: str = 'date_id') -> pd.DataFrame:
    """
    Convert the date_id column in the DataFrame to YYYYMMDD format.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the date_id column.
    column_name (str): The name of the date_id column. Default is 'date_id'.

    Returns:
    pd.DataFrame: The DataFrame with date_id values in YYYYMMDD format.
    """
    # Convert column to string (if not already)
    df[column_name] = df[column_name].astype(str)

    # Convert ISO 8601 (YYYY-MM-DD) to YYYYMMDD
    df[column_name] = pd.to_datetime(df[column_name], format='%Y-%m-%d', errors='coerce').dt.strftime('%Y%m%d')

    return df
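
To illustrate what `update_date_id` does to individual values, here is a small sketch on made-up input. It uses the same `pd.to_datetime(..., errors='coerce')` and `strftime` calls as the function above; the sample strings are invented for illustration.

```python
import pandas as pd

# Made-up sample values: two valid ISO dates and one invalid entry
raw = pd.Series(["2016-05-28", "2015-03-28", "not a date"])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# NaT becomes a missing value after strftime, so bad inputs stay visible
compact = parsed.dt.strftime("%Y%m%d")

print(compact.tolist())
print(int(compact.isna().sum()), "value(s) could not be parsed")
```

Because unparseable values become missing rather than raising an error, it may be worth counting them after the conversion, as the last line does.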

In [62]:
def map_products_with_numbers(owned_products: List[str]) -> List[int]:
    """
    Replace product names with corresponding numbers from the product_mapping dictionary.
    Return a list with -1 if the list is empty.

    Parameters:
    owned_products (List[str]): List of product names to be replaced.

    Returns:
    List[int]: List of corresponding product numbers, or [-1] if input list is empty.
    """
    if not owned_products:
        return [-1]
    return [product_mapping.get(product, -1) for product in owned_products]
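
`map_products_with_numbers` reads `product_mapping` as a global that must exist before the function is called. As a sketch of one possible alternative, the variant below takes the mapping as an explicit argument; the name `map_products` and the three-product toy mapping are assumptions made for illustration.

```python
from typing import Dict, List


def map_products(owned_products: List[str], mapping: Dict[str, int]) -> List[int]:
    """Map product names to numbers; unknown names and empty lists yield -1."""
    if not owned_products:
        return [-1]
    return [mapping.get(product, -1) for product in owned_products]


# Toy mapping built the same way as in this notebook, from a short product list
toy_mapping = {name: idx for idx, name in enumerate(["savings_acc", "guarantees", "current_acc"])}

print(map_products(["current_acc", "savings_acc"], toy_mapping))
print(map_products([], toy_mapping))
print(map_products(["mortgage"], toy_mapping))
```

Passing the mapping in makes the function independent of cell execution order, at the cost of a slightly longer call when used with `apply`.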

## Access Hugging Face

The cells below use the Hugging Face client library to download the train and test files of the Santander Product Recommendation dataset.

In [63]:
# NB: This may take a few seconds to a few minutes to run

ds_train = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=HF_TRAIN_DATASET_NAME, repo_type="dataset")
)



In [64]:
ds_test = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=HF_TEST_DATASET_NAME, repo_type="dataset")
)



In [65]:
# View a sample of the data for train set
# NB: To get details of the columns, see: https://www.kaggle.com/competitions/santander-product-recommendation/data

ds_train.sample(5)

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,fecha_alta,ind_nuevo,antiguedad,indrel,...,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1
13332432,2016-05-28,1534489,N,ES,V,27,2016-02-24,1.0,3,1.0,...,0,0,0,0,0,0,0,0.0,0.0,1
1598273,2015-03-28,1241349,N,ES,V,54,2014-02-03,0.0,17,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
11959018,2016-04-28,1503564,N,ES,H,60,2015-11-13,1.0,5,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
5786744,2015-09-28,244079,N,ES,V,44,2001-05-04,0.0,172,1.0,...,1,0,0,0,0,0,0,1.0,1.0,1
12440785,2016-04-28,1045175,N,ES,H,25,2012-08-07,0.0,44,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0


## Prepare Data
The cells below transform the data into the common dataset format.

In [66]:
# Step 1: Convert column names from Spanish to more readable English names

col_names = {"fecha_dato": "date_id","ncodpers":"cust_id", "ind_empleado":"emp_index","pais_residencia":"cust_country_res",
            "sexo":"gender","fecha_alta":"cust_start_date_first_holder_contract","ind_nuevo":"new_cust_index","antiguedad":"cust_seniority",
            "indrel":"cust_primary_type","ult_fec_cli_1t":"cust_last_primary_date","indrel_1mes":"cust_type_at_start_month",
            "tiprel_1mes":"cust_rel_type_at_start_month","indresi":"residence_index","indext":"foreigner_index",
            "conyuemp":"spouse_index","canal_entrada":"channel_joined", "indfall":"deceased_index", "tipodom":"address_type",
            "cod_prov":"province","nomprov":"province_name", "ind_actividad_cliente":"activity_index","renta":"gross_income",
            "segmento":"cust_category", "ind_ahor_fin_ult1":"savings_acc", "ind_aval_fin_ult1":"guarantees",
            "ind_cco_fin_ult1":"current_acc", "ind_cder_fin_ult1":"derivada_acc", "ind_cno_fin_ult1":"payroll_acc",
            "ind_ctju_fin_ult1":"jnr_acc", "ind_ctma_fin_ult1":"más_particular_acc", "ind_ctop_fin_ult1":"particular_account",
            "ind_ctpp_fin_ult1":"particular_plus_account", "ind_deco_fin_ult1":"short_term_deposits",
            "ind_deme_fin_ult1":"medium_term_deposits", "ind_dela_fin_ult1":"long_term_deposits", "ind_ecue_fin_ult1":"e_acc",
            "ind_fond_fin_ult1":"funds","ind_hip_fin_ult1":"mortgage", "ind_plan_fin_ult1":"pensions_plan", "ind_pres_fin_ult1":"loans",
            "ind_reca_fin_ult1":"taxes", "ind_tjcr_fin_ult1":"credit_card", "ind_valo_fin_ult1":"securities", "ind_viv_fin_ult1":"home_acc",
            "ind_nomina_ult1":"payroll", "ind_nom_pens_ult1": "pensions", "ind_recibo_ult1":"direct_debit"}

In [67]:
# Rename columns in ds_train and ds_test according to the col_names dictionary
ds_train_renamed = ds_train.rename(col_names, axis = 1, inplace = False)
ds_test_renamed = ds_test.rename(col_names, axis = 1, inplace = False)

# sample output of the renamed columns for ds_train
ds_train_renamed.sample(3)

Unnamed: 0,date_id,cust_id,emp_index,cust_country_res,gender,age,cust_start_date_first_holder_contract,new_cust_index,cust_seniority,cust_primary_type,...,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
12315234,2016-04-28,1180859,N,ES,V,25,2013-09-23,0.0,31,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
5058563,2015-08-28,1362447,N,ES,V,28,2014-11-27,0.0,9,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
7211050,2015-11-28,1423461,N,ES,H,20,2015-07-31,1.0,4,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0


In [68]:
# Step 2: Recode the gender values, which are currently stored as "H" and "V", to "M" and "F"
ds_train_renamed = update_gender_columns(ds_train_renamed)
ds_test_renamed = update_gender_columns(ds_test_renamed)

# sample output of the gender update for ds_train
ds_train_renamed.sample(5)


Unnamed: 0,date_id,cust_id,emp_index,cust_country_res,gender,age,cust_start_date_first_holder_contract,new_cust_index,cust_seniority,cust_primary_type,...,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
11360422,2016-03-28,750627,N,ES,M,63,2008-02-27,0.0,97,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
1179101,2015-02-28,1090064,N,ES,M,25,2012-10-29,0.0,33,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
12617356,2016-04-28,1265106,N,ES,M,38,2014-06-24,0.0,22,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
10574833,2016-02-28,1084948,N,ES,F,61,2012-10-19,0.0,40,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
2793671,2015-05-28,1246381,N,ES,F,49,2014-03-03,0.0,16,1.0,...,0,0,0,0,0,0,0,1.0,1.0,0


In [69]:
# Step 3: Update date_id format to YYYYMMDD (20150328)
ds_train_renamed = update_date_id(ds_train_renamed)
ds_test_renamed = update_date_id(ds_test_renamed)

# sample output of the date format update for ds_train
ds_train_renamed.sample(5)

Unnamed: 0,date_id,cust_id,emp_index,cust_country_res,gender,age,cust_start_date_first_holder_contract,new_cust_index,cust_seniority,cust_primary_type,...,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
9003999,20151228,1425479,N,ES,F,20,2015-08-01,1.0,5,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
6992583,20151028,786104,N,ES,F,53,2008-08-13,0.0,86,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
7491365,20151128,56970,N,ES,F,49,1997-02-15,0.0,225,1.0,...,0,0,0,1,0,0,0,0.0,0.0,0
2287467,20150428,1329216,N,ES,M,21,2014-10-09,0.0,9,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
4430030,20150728,1316048,N,ES,M,21,2014-09-24,0.0,10,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0


In [70]:
# Start building the common dataset with the columns needed
# Step 4: Create the column "owned_products"

# Define the columns that represent product ownership
product_columns = ['savings_acc', 'guarantees', 'current_acc', 'derivada_acc', 'payroll_acc', 'jnr_acc',
                   'más_particular_acc', 'particular_account', 'particular_plus_account', 'short_term_deposits',
                   'medium_term_deposits', 'long_term_deposits', 'e_acc', 'funds', 'mortgage', 'pensions_plan',
                   'loans', 'taxes', 'credit_card', 'securities', 'home_acc', 'payroll', 'pensions', 'direct_debit']

In [71]:
df_train = pd.DataFrame(ds_train_renamed)
df_test = pd.DataFrame(ds_test_renamed)

In [72]:
# Create the 'owned_products' column
# For each row, collect the names of the product columns whose value is not 0.
# Note: NaN != 0 evaluates to True, so a missing value is treated as an owned product.
df_train['owned_products'] = df_train[product_columns].apply(lambda row: [col for col in product_columns if row[col] != 0], axis=1)
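
The `!= 0` test in the cell above treats a missing value as "owned", because `NaN != 0` is true. The sketch below shows the effect on a made-up two-column frame and compares it with a stricter `== 1` test; the toy data and the variable names `loose` and `strict` are assumptions for illustration only.

```python
import pandas as pd

product_columns = ["current_acc", "payroll"]

# Made-up data: the second customer has a missing payroll flag
toy = pd.DataFrame({
    "current_acc": [1, 0],
    "payroll": [0.0, float("nan")],
})

# Same rule as the cell above: keep every column whose value is not 0
loose = toy[product_columns].apply(
    lambda row: [col for col in product_columns if row[col] != 0], axis=1
)

# Stricter variant: keep a column only when the flag is exactly 1
strict = toy[product_columns].apply(
    lambda row: [col for col in product_columns if row[col] == 1], axis=1
)

print(loose.tolist())
print(strict.tolist())
```

Which rule is appropriate depends on how missing flags should be interpreted; the stricter test is only one option.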


In [73]:
# Check that the "owned_products" column was created
df_train.sample(2)

Unnamed: 0,date_id,cust_id,emp_index,cust_country_res,gender,age,cust_start_date_first_holder_contract,new_cust_index,cust_seniority,cust_primary_type,...,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit,owned_products
12522396,20160428,1343257,N,ES,F,69,2014-10-28,0.0,18,1.0,...,0,0,0,0,0,0,0.0,0.0,0,[]
5136729,20150828,1050446,N,ES,F,23,2012-08-10,0.0,36,1.0,...,0,0,0,0,0,0,0.0,0.0,0,[current_acc]


In [74]:
# Step 5: Get the latest record for each customer

df_train = df_train.loc[df_train.groupby('cust_id')['date_id'].idxmax()]
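
Step 5 keeps one row per customer, the one with the greatest `date_id`. As a sketch of an equivalent approach on made-up data, the example below sorts by customer and date and keeps the last row of each group. It relies on the fact that zero-padded `YYYYMMDD` strings sort in chronological order; the toy frame is invented for illustration.

```python
import pandas as pd

# Made-up records: customer 1 appears twice, customer 2 once
toy = pd.DataFrame({
    "cust_id": [1, 1, 2],
    "date_id": ["20160428", "20160528", "20150828"],
    "age": [30, 30, 59],
})

# Sort so the most recent row of each customer comes last, then keep that row
latest = (
    toy.sort_values(["cust_id", "date_id"])
       .drop_duplicates("cust_id", keep="last")
       .reset_index(drop=True)
)

print(latest)
```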

In [75]:
# Step 6: Select the required columns

result_df_train = df_train[['cust_id', 'date_id', 'age', 'gender', 'owned_products']]

# Reset the index
result_df_train = result_df_train.reset_index(drop=True)

In [76]:
# View a small sample of the selected columns

result_df_train.sample(3)

Unnamed: 0,cust_id,date_id,age,gender,owned_products
794232,1362063,20160528,64,F,"[current_acc, direct_debit]"
257441,524200,20160528,53,F,[]
685309,1227525,20160528,25,M,[]


In [77]:
# Step 7: Map product names to numbers, e.g. savings_acc = 0; customers with no products get -1

# Create a dictionary to map product names to numbers
product_mapping = {product: idx for idx, product in enumerate(product_columns)}

In [78]:
# Replace product names with numbers and ensure non-empty lists
result_df_train['owned_products'] = result_df_train['owned_products'].apply(map_products_with_numbers)

In [79]:
# View a small sample of the selected columns with the mapping update

result_df_train.head(5)

Unnamed: 0,cust_id,date_id,age,gender,owned_products
0,15889,20160528,56,F,"[2, 8, 18, 19]"
1,15890,20160528,63,F,"[4, 8, 12, 15, 18, 21, 22, 23]"
2,15891,20150828,59,M,[-1]
3,15892,20160528,62,M,"[2, 11, 12, 17, 18, 19, 23]"
4,15893,20160528,63,F,[19]


In [80]:
df_train.head(5)

Unnamed: 0,date_id,cust_id,emp_index,cust_country_res,gender,age,cust_start_date_first_holder_contract,new_cust_index,cust_seniority,cust_primary_type,...,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit,owned_products
13026343,20160528,15889,F,ES,F,56,1995-01-16,0.0,255,1.0,...,0,0,0,1,1,0,0.0,0.0,0,"[current_acc, particular_plus_account, credit_..."
13026342,20160528,15890,A,ES,F,63,1995-01-16,0.0,256,1.0,...,1,0,0,1,0,0,1.0,1.0,1,"[payroll_acc, particular_plus_account, e_acc, ..."
5319232,20150828,15891,N,ES,M,59,2015-07-28,0.0,246,99.0,...,0,0,0,0,0,0,0.0,0.0,0,[]
13026341,20160528,15892,F,ES,M,62,1995-01-16,0.0,256,1.0,...,0,0,1,1,1,0,0.0,0.0,1,"[current_acc, long_term_deposits, e_acc, taxes..."
13026340,20160528,15893,N,ES,F,63,1997-10-03,0.0,256,1.0,...,0,0,0,0,1,0,0.0,0.0,0,[securities]


### Add additional useful features

In [None]:
# To be added based on findings from the `data_exploration_santander_product_ds.ipynb` Notebook

## Implement KNN for Product Recommendation

Overview of steps:
- Transform the dataset into a user-item matrix where rows represent customers and columns represent products.
- Fit the KNN model, which uses the KNN algorithm to find similar customers.
- Make recommendations for a given customer by finding the K nearest neighbors (similar customers).
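
The steps above can be sketched in plain Python. This is not the notebook's implementation: it swaps a library KNN model for a brute-force neighbor search using Jaccard similarity over sets of product numbers. The customer IDs, the baskets, and the function names `jaccard` and `recommend` are all made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Similarity of two product sets: size of overlap divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target: int, baskets: Dict[int, Set[int]], k: int = 2, top_n: int = 3) -> List[int]:
    """Recommend products owned by the k most similar customers but not by the target."""
    owned = baskets[target]
    # Rank every other customer by similarity to the target and keep the top k
    neighbours = sorted(
        (c for c in baskets if c != target),
        key=lambda c: jaccard(owned, baskets[c]),
        reverse=True,
    )[:k]
    # Count how often each product the target lacks appears among the neighbours
    votes = Counter(p for c in neighbours for p in baskets[c] - owned)
    return [product for product, _ in votes.most_common(top_n)]


# Toy user-item data: customer id -> set of owned product numbers
baskets = {
    1: {2, 8, 18, 19},
    2: {2, 8, 18, 23},
    3: {2, 18, 19, 23},
    4: {19},
}

print(recommend(1, baskets))
```

Each set plays the role of one row of the user-item matrix. The `-1` placeholder used for customers without products would need to be dropped before building sets like these.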

### Helper Functions

In [82]:
def pre_process_categorical_data(data: pd.DataFrame, columns_to_encode: List[str]) -> pd.DataFrame:
    """
    Convert categorical data to numerical data using one-hot encoding.

    Parameters:
    - data: DataFrame
    - columns_to_encode: List of columns to one-hot encode

    Returns:
    - DataFrame with one-hot encoded columns
    """
    return pd.get_dummies(data, columns=columns_to_encode)

### Pre-process Data
 - Check for null values
 - Handle missing values
 - Handle categorical data
 - Scale features (scikit-learn standard scaler)
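
As a sketch of what the scaling step computes, the example below standardizes a made-up `age` column with pandas alone: subtract the mean and divide by the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` uses. The sample ages are invented for illustration.

```python
import pandas as pd

# Made-up ages
ages = pd.Series([20.0, 40.0, 60.0], name="age")

# z-score: centre on the mean, then divide by the population standard deviation
scaled = (ages - ages.mean()) / ages.std(ddof=0)

print(scaled.round(4).tolist())
```

After scaling, the column has mean 0 and unit variance, so `age` does not dominate distance calculations against the 0/1 encoded columns.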


In [87]:
# Check for null values (percentage of missing values per column)
null_summary = result_df_train.isnull().mean() * 100
null_summary

cust_id           0.000000
date_id           0.000000
age               0.000000
gender            0.735487
owned_products    0.000000
dtype: float64

In [90]:
# The gender column has about 0.735% missing values.
# Impute them by assigning a new category, 'Unknown'.
# Assign the result back instead of using a chained inplace=True call.
result_df_train['gender'] = result_df_train['gender'].fillna('Unknown')

# Check that there are no null values
null_summary = result_df_train.isnull().sum()
null_summary

cust_id           0
date_id           0
age               0
gender            0
owned_products    0
dtype: int64

In [95]:
# Check that gender column has three values: F, M and Unknown
gender_values = result_df_train['gender'].unique()
gender_values

array(['F', 'M', 'Unknown'], dtype=object)

In [99]:
# Drop the 'date_id' and 'cust_id' columns
df_train_preprocess = result_df_train.drop(['date_id', 'cust_id'], axis=1)
df_train_preprocess.sample(5)

Unnamed: 0,age,gender,owned_products
167304,48.0,M,[-1]
289539,40.0,F,"[2, 7]"
640329,32.0,M,[2]
238828,,Unknown,"[21, 22, 23]"
250826,47.0,M,"[2, 7]"


In [101]:
# Create a new row for each element in the list for 'owned_products'
# This is done to avoid the TypeError: unhashable type: 'list'

df_train_exploded = df_train_preprocess.explode('owned_products')

In [105]:
# One-hot encode the 'gender' column ('owned_products' is left unchanged)
categorical_columns = ['gender']
df_train_encoded = pre_process_categorical_data(df_train_exploded, categorical_columns)
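
To make the effect of the explode and encoding steps concrete, here is a small sketch on a made-up two-customer frame. It uses the same `DataFrame.explode` and `pd.get_dummies` calls as the cells above; the toy values are invented.

```python
import pandas as pd

# Made-up frame: one customer without products (-1), one with two products
toy = pd.DataFrame({
    "age": [48, 40],
    "gender": ["M", "F"],
    "owned_products": [[-1], [2, 7]],
})

# explode() emits one row per list element and repeats the original index
exploded = toy.explode("owned_products")

# get_dummies() replaces 'gender' with one indicator column per category
encoded = pd.get_dummies(exploded, columns=["gender"])

print(encoded)
```

Because `explode` repeats the original index, a customer with several products appears on several rows that share an index value, which matters for any later step that assumes one row per customer.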

In [111]:
# Convert 'age' to numeric, setting errors='coerce' will replace non-numeric values with NaN
df_train_encoded['age'] = pd.to_numeric(df_train_encoded['age'], errors='coerce')


In [112]:
df_train_encoded.dtypes

age               float64
owned_products     object
gender_F             bool
gender_M             bool
gender_Unknown       bool
dtype: object