# Santander Product Recommendation

**Dataset description:**

This dataset contains 1.5 years of customer behavior data from Santander Bank, designed to predict the likelihood of customers purchasing new products. The data begins on 2015-01-28 and includes monthly records of products each customer holds, such as "credit card," "savings account," and others.

> See the following link for further details: [SPR dataset](https://www.kaggle.com/competitions/santander-product-recommendation/overview)

## Overview

This notebook explores the Santander dataset to identify which additional fields could serve as useful features beyond the required common set.

> **Note:** Kernel - Python 3.11

## Access Hugging Face for the dataset

### Sign in to your Hugging Face account

Signing in authenticates this notebook with the Hugging Face Hub so that it can download the dataset files.

### Steps to get the `Access Token` from Hugging Face:

- **Sign In or Sign Up:** If you don't have a Hugging Face account yet, you'll need to sign up. If you already have an account, sign in.

- **Access Your Profile:** Once you're signed in, navigate to your profile settings. You can do this by clicking on your profile icon or username, usually located in the top-right corner of the Hugging Face website.

- **Navigate to Access Token Settings:** Within your profile settings, look for an option related to Access tokens. This is where you can manage and generate tokens.

- **Generate a New Token:** If you haven't generated a token before, you'll see a button (`New token`) to generate a new token. Click on this button. Please ensure you give the token `write` access.

- **Name Your Token (Optional):** You may be prompted to give your token a name or description. This step is optional but can be helpful if you plan to generate multiple tokens for different purposes.

- **Copy Your Token:** Once your token is generated, you'll typically see it displayed on the screen. Make sure to copy the token and replace it in the `login` code below.

In [1]:
# Log into Hugging Face
# Replace <access_token> with your access token

HUGGINGFACE_TOKEN = "<access_token>"
!huggingface-cli login --token $HUGGINGFACE_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/verosha/.cache/huggingface/token
Login successful
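
Hardcoding the token in a cell makes it easy to share by accident along with the notebook. A minimal sketch of an alternative, assuming the token has been exported beforehand in an environment variable named `HF_TOKEN`; `read_hf_token` is a hypothetical helper and not part of `huggingface_hub`:

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in the given environment variable.

    Hypothetical helper for illustration; not part of huggingface_hub.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token.")
    return token


# Usage in the notebook (needs network access, so it is left commented out):
# from huggingface_hub import login
# login(token=read_hf_token())
```

With this approach the token never appears in the notebook source or its saved outputs.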


In [2]:
# imports required

from huggingface_hub import hf_hub_download
from typing import List
import pandas as pd



### Config

In [3]:
REPO_ID = "MelioAI/santander-product-recommendation"
HF_TRAIN_DATASET_NAME = "train_ver2.csv"
HF_TEST_DATASET_NAME = "test_ver2.csv"

## Access Hugging Face

The cells below use the Hugging Face client library to download the train and test files of the Santander Product Recommendation dataset.

In [4]:
# NB: This may take a few seconds to a few minutes to run

ds_train = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=HF_TRAIN_DATASET_NAME, repo_type="dataset")
)



In [5]:
ds_test = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=HF_TEST_DATASET_NAME, repo_type="dataset")
)



In [7]:
# Data Profiling
# Get an overview of the structure of the train and test DataFrames

ds_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13647309 entries, 0 to 13647308
Data columns (total 48 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   fecha_dato             object 
 1   ncodpers               int64  
 2   ind_empleado           object 
 3   pais_residencia        object 
 4   sexo                   object 
 5   age                    object 
 6   fecha_alta             object 
 7   ind_nuevo              float64
 8   antiguedad             object 
 9   indrel                 float64
 10  ult_fec_cli_1t         object 
 11  indrel_1mes            object 
 12  tiprel_1mes            object 
 13  indresi                object 
 14  indext                 object 
 15  conyuemp               object 
 16  canal_entrada          object 
 17  indfall                object 
 18  tipodom                float64
 19  cod_prov               float64
 20  nomprov                object 
 21  ind_actividad_cliente  float64
 22  renta           

In [8]:
ds_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 929615 entries, 0 to 929614
Data columns (total 24 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   fecha_dato             929615 non-null  object 
 1   ncodpers               929615 non-null  int64  
 2   ind_empleado           929615 non-null  object 
 3   pais_residencia        929615 non-null  object 
 4   sexo                   929610 non-null  object 
 5   age                    929615 non-null  int64  
 6   fecha_alta             929615 non-null  object 
 7   ind_nuevo              929615 non-null  int64  
 8   antiguedad             929615 non-null  int64  
 9   indrel                 929615 non-null  int64  
 10  ult_fec_cli_1t         1683 non-null    object 
 11  indrel_1mes            929592 non-null  float64
 12  tiprel_1mes            929592 non-null  object 
 13  indresi                929615 non-null  object 
 14  indext                 929615 non-nu
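
The two `info()` outputs above report `age` and `antiguedad` as `object` in the train set but `int64` in the test set, which suggests that the train file holds some non-numeric entries in those columns. A minimal sketch of making the load step explicit about this, assuming those columns are read as strings and converted to numbers later; the in-memory CSV and its values are made up for illustration:

```python
import io

import pandas as pd

# Tiny stand-in for train_ver2.csv; the rows are invented for illustration.
sample_csv = io.StringIO(
    "fecha_dato,ncodpers,age,antiguedad\n"
    "2015-01-28,1375586,35,6\n"
    "2015-01-28,1050611, NA,     NA\n"
)

# Read the ambiguous columns as strings so pandas does not have to guess a type.
raw = pd.read_csv(sample_csv, dtype={"age": str, "antiguedad": str})

print(raw.dtypes)
```

The same `dtype` mapping could be passed to both `pd.read_csv` calls so that train and test start from identical column types.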

In [9]:
# Check for nulls

ds_train.isnull().sum()

fecha_dato                      0
ncodpers                        0
ind_empleado                27734
pais_residencia             27734
sexo                        27804
age                             0
fecha_alta                  27734
ind_nuevo                   27734
antiguedad                      0
indrel                      27734
ult_fec_cli_1t           13622516
indrel_1mes                149781
tiprel_1mes                149781
indresi                     27734
indext                      27734
conyuemp                 13645501
canal_entrada              186126
indfall                     27734
tipodom                     27735
cod_prov                    93591
nomprov                     93591
ind_actividad_cliente       27734
renta                     2794375
segmento                   189368
ind_ahor_fin_ult1               0
ind_aval_fin_ult1               0
ind_cco_fin_ult1                0
ind_cder_fin_ult1               0
ind_cno_fin_ult1                0
ind_ctju_fin_u

In [10]:
# Check for nulls

ds_test.isnull().sum()

fecha_dato                    0
ncodpers                      0
ind_empleado                  0
pais_residencia               0
sexo                          5
age                           0
fecha_alta                    0
ind_nuevo                     0
antiguedad                    0
indrel                        0
ult_fec_cli_1t           927932
indrel_1mes                  23
tiprel_1mes                  23
indresi                       0
indext                        0
conyuemp                 929511
canal_entrada              2081
indfall                       0
tipodom                       0
cod_prov                   3996
nomprov                    3996
ind_actividad_cliente         0
renta                         0
segmento                   2248
dtype: int64

In [16]:
# Generate a detailed statistical summary of the train DataFrame, including both numeric and categorical columns
# Transpose for easier readability and analysis
ds_train.describe(include = 'all').T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fecha_dato | 13647309.0 | 17.0 | 2016-05-28 | 931453.0 | | | | | | | |
| ncodpers | 13647309.0 | | | | 834904.211501 | 431565.025784 | 15889.0 | 452813.0 | 931893.0 | 1199286.0 | 1553689.0 |
| ind_empleado | 13619575.0 | 5.0 | N | 13610977.0 | | | | | | | |
| pais_residencia | 13619575.0 | 118.0 | ES | 13553710.0 | | | | | | | |
| sexo | 13619505.0 | 2.0 | V | 7424252.0 | | | | | | | |
| age | 13647309.0 | 235.0 | 23.0 | 542682.0 | | | | | | | |
| fecha_alta | 13619575.0 | 6756.0 | 2014-07-28 | 57389.0 | | | | | | | |
| ind_nuevo | 13619575.0 | | | | 0.059562 | 0.236673 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| antiguedad | 13647309.0 | 507.0 | 0.0 | 134335.0 | | | | | | | |
| indrel | 13619575.0 | | | | 1.178399 | 4.177469 | 1.0 | 1.0 | 1.0 | 1.0 | 99.0 |


In [11]:
# Convert column names from Spanish to more readable English names

col_names = {"fecha_dato": "date_id","ncodpers":"cust_id", "ind_empleado":"emp_index","pais_residencia":"cust_country_res",
            "sexo":"gender","fecha_alta":"cust_start_date_first_holder_contract","ind_nuevo":"new_cust_index","antiguedad":"cust_seniority",
            "indrel":"cust_primary_type","ult_fec_cli_1t":"cust_last_primary_date","indrel_1mes":"cust_type_at_start_month",
            "tiprel_1mes":"cust_rel_type_at_start_month","indresi":"residence_index","indext":"foreigner_index",
            "conyuemp":"spouse_index","canal_entrada":"channel_joined", "indfall":"deceased_index", "tipodom":"address_type",
            "cod_prov":"province","nomprov":"province_name", "ind_actividad_cliente":"activity_index","renta":"gross_income",
            "segmento":"cust_category", "ind_ahor_fin_ult1":"savings_acc", "ind_aval_fin_ult1":"guarantees",
            "ind_cco_fin_ult1":"current_acc", "ind_cder_fin_ult1":"derivada_acc", "ind_cno_fin_ult1":"payroll_acc",
            "ind_ctju_fin_ult1":"jnr_acc", "ind_ctma_fin_ult1":"más_particular_acc", "ind_ctop_fin_ult1":"particular_account",
            "ind_ctpp_fin_ult1":"particular_plus_account", "ind_deco_fin_ult1":"short_term_deposits",
            "ind_deme_fin_ult1":"medium_term_deposits", "ind_dela_fin_ult1":"long_term_deposits", "ind_ecue_fin_ult1":"e_acc",
            "ind_fond_fin_ult1":"funds","ind_hip_fin_ult1":"mortgage", "ind_plan_fin_ult1":"pensions_plan", "ind_pres_fin_ult1":"loans",
            "ind_reca_fin_ult1":"taxes", "ind_tjcr_fin_ult1":"credit_card", "ind_valo_fin_ult1":"securities", "ind_viv_fin_ult1":"home_acc",
            "ind_nomina_ult1":"payroll", "ind_nom_pens_ult1": "pensions", "ind_recibo_ult1":"direct_debit"}

In [12]:
# Rename columns in ds_train and ds_test according to the col_names dictionary.
# rename() returns a new DataFrame, so the original frames are left unchanged.
ds_train_renamed = ds_train.rename(columns=col_names)
ds_test_renamed = ds_test.rename(columns=col_names)

# sample output of the renamed columns for ds_train
ds_train_renamed.sample(3)

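| | date_id | cust_id | emp_index | cust_country_res | gender | age | cust_start_date_first_holder_contract | new_cust_index | cust_seniority | cust_primary_type | ... | mortgage | pensions_plan | loans | taxes | credit_card | securities | home_acc | payroll | pensions | direct_debit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12621456 | 2016-04-28 | 1269900 | N | ES | V | 22 | 2014-07-16 | 0.0 | 21 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 11023676 | 2016-03-28 | 1277596 | N | ES | V | 21 | 2014-07-25 | 0.0 | 20 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |
| 3356142 | 2015-06-28 | 1400239 | N | ES | V | 58 | 2015-06-03 | 1.0 | 1 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 |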


In [15]:
ds_train_renamed.columns

Index(['date_id', 'cust_id', 'emp_index', 'cust_country_res', 'gender', 'age',
       'cust_start_date_first_holder_contract', 'new_cust_index',
       'cust_seniority', 'cust_primary_type', 'cust_last_primary_date',
       'cust_type_at_start_month', 'cust_rel_type_at_start_month',
       'residence_index', 'foreigner_index', 'spouse_index', 'channel_joined',
       'deceased_index', 'address_type', 'province', 'province_name',
       'activity_index', 'gross_income', 'cust_category', 'savings_acc',
       'guarantees', 'current_acc', 'derivada_acc', 'payroll_acc', 'jnr_acc',
       'más_particular_acc', 'particular_account', 'particular_plus_account',
       'short_term_deposits', 'medium_term_deposits', 'long_term_deposits',
       'e_acc', 'funds', 'mortgage', 'pensions_plan', 'loans', 'taxes',
       'credit_card', 'securities', 'home_acc', 'payroll', 'pensions',
       'direct_debit'],
      dtype='object')

In [19]:
ds_train_renamed.dtypes

date_id                                   object
cust_id                                    int64
emp_index                                 object
cust_country_res                          object
gender                                    object
age                                      float64
cust_start_date_first_holder_contract     object
new_cust_index                           float64
cust_seniority                            object
cust_primary_type                        float64
cust_last_primary_date                    object
cust_type_at_start_month                  object
cust_rel_type_at_start_month              object
residence_index                           object
foreigner_index                           object
spouse_index                              object
channel_joined                            object
deceased_index                            object
address_type                             float64
province                                 float64
province_name       

In [22]:
# Perform data type conversions on columns
# Add further conversions here if required

ds_train_renamed.age = pd.to_numeric(ds_train_renamed.age, errors='coerce')
ds_train_renamed.gross_income = pd.to_numeric(ds_train_renamed.gross_income, errors='coerce')
ds_train_renamed.cust_seniority = pd.to_numeric(ds_train_renamed.cust_seniority, errors='coerce')
ds_train_renamed.cust_start_date_first_holder_contract = pd.to_datetime(ds_train_renamed.cust_start_date_first_holder_contract, errors = 'coerce')
ds_train_renamed['date_id'] = pd.to_datetime(ds_train_renamed['date_id'])

ds_test_renamed.age = pd.to_numeric(ds_test_renamed.age, errors='coerce')
ds_test_renamed.gross_income = pd.to_numeric(ds_test_renamed.gross_income, errors='coerce')
ds_test_renamed.cust_seniority = pd.to_numeric(ds_test_renamed.cust_seniority, errors='coerce')
ds_test_renamed.cust_start_date_first_holder_contract = pd.to_datetime(ds_test_renamed.cust_start_date_first_holder_contract, errors = 'coerce')
ds_test_renamed['date_id'] = pd.to_datetime(ds_test_renamed['date_id'])
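
The same five conversions are applied to both frames, so they could be wrapped in one function to keep train and test in step. A minimal sketch, assuming the renamed column names from `col_names`; `convert_types` is a hypothetical helper and the toy values are made up:

```python
import pandas as pd


def convert_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the numeric and date columns converted.

    Hypothetical helper; mirrors the conversions in the cell above.
    """
    out = df.copy()
    for col in ("age", "gross_income", "cust_seniority"):
        # Entries that cannot be parsed as numbers become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Unparseable start dates become NaT; date_id is expected to be clean.
    out["cust_start_date_first_holder_contract"] = pd.to_datetime(
        out["cust_start_date_first_holder_contract"], errors="coerce"
    )
    out["date_id"] = pd.to_datetime(out["date_id"])
    return out


# Toy frame with invented values, including non-numeric placeholders.
toy = pd.DataFrame({
    "age": ["35", "NA"],
    "gross_income": ["87218.1", "NA"],
    "cust_seniority": ["6", "NA"],
    "cust_start_date_first_holder_contract": ["2015-01-12", None],
    "date_id": ["2015-01-28", "2015-01-28"],
})
converted = convert_types(toy)
print(converted.dtypes)
```

Note that `df.copy()` duplicates the frame, which may be costly for the full train set; assigning in place as the notebook does avoids that.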


In [23]:
# Get the percentage of missing values in each column
ds_train_renamed.isnull().sum()/ds_train_renamed.shape[0] * 100

date_id                                   0.000000
cust_id                                   0.000000
emp_index                                 0.203220
cust_country_res                          0.203220
gender                                    0.203732
age                                       0.203220
cust_start_date_first_holder_contract     0.203220
new_cust_index                            0.203220
cust_seniority                            0.203220
cust_primary_type                         0.203220
cust_last_primary_date                   99.818330
cust_type_at_start_month                  1.097513
cust_rel_type_at_start_month              1.097513
residence_index                           0.203220
foreigner_index                           0.203220
spouse_index                             99.986752
channel_joined                            1.363829
deceased_index                            0.203220
address_type                              0.203227
province                       

In [31]:
# Handle each column according to its percentage of missing values

# Imputation 1: Drop columns with more than 99% missing values
columns_to_drop = ['cust_last_primary_date', 'spouse_index']

# errors='ignore' skips columns that are already absent, so the cell can be re-run safely
ds_train_renamed.drop(columns=columns_to_drop, errors='ignore', inplace=True)
ds_test_renamed.drop(columns=columns_to_drop, errors='ignore', inplace=True)
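
The drop list above is written by hand from the missing-value percentages. A minimal sketch of deriving such a list from a threshold instead; `columns_above_missing_threshold` is a hypothetical helper and the toy frame is made up:

```python
import pandas as pd


def columns_above_missing_threshold(df: pd.DataFrame, threshold_pct: float) -> list:
    """Return the columns whose share of missing values exceeds threshold_pct."""
    # isnull().mean() gives the fraction of missing values per column.
    missing_pct = df.isnull().mean() * 100
    return missing_pct[missing_pct > threshold_pct].index.tolist()


# Toy frame with invented values: one fully empty column, one 25% empty.
toy = pd.DataFrame({
    "cust_id": [1, 2, 3, 4],
    "spouse_index": [None, None, None, None],
    "gender": ["V", "H", None, "V"],
})
to_drop = columns_above_missing_threshold(toy, 99)
print(to_drop)
```

Computing the list on the train set and reusing it for the test set keeps both frames with the same columns.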

In [None]:
# Imputation 2: Columns with fewer than 10% missing values are imputed with their most common value (mode)

cols = ['emp_index','cust_country_res','gender','cust_start_date_first_holder_contract','new_cust_index',
        'cust_primary_type',"cust_type_at_start_month", "cust_rel_type_at_start_month", "province","province_name",
        "activity_index","channel_joined","cust_category"]

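The cell above lists the columns to impute but does not show the imputation step itself. A minimal sketch of mode imputation, assuming each column is filled with its own most frequent value; `fill_with_mode` is a hypothetical helper and the toy values are made up:

```python
import pandas as pd


def fill_with_mode(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df where missing values in columns are filled with each column's mode."""
    out = df.copy()
    for col in columns:
        modes = out[col].mode(dropna=True)
        # mode() returns an empty Series when the column has no non-missing values.
        if not modes.empty:
            out[col] = out[col].fillna(modes.iloc[0])
    return out


# Toy frame with invented values; the last row is missing in both columns.
toy = pd.DataFrame({
    "gender": ["V", "V", "H", None],
    "new_cust_index": [0.0, 0.0, 1.0, None],
})
filled = fill_with_mode(toy, ["gender", "new_cust_index"])
print(filled)
```

Whether the test set should be filled with its own modes or with those computed on the train set is a choice the notebook does not state; the sketch uses the modes of whichever frame it is given.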
