<a href="https://colab.research.google.com/github/mrpintime/Santander-Customer-Transaction/blob/main/SantanderCustomerTransaction(Kaggle_competition).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customer Specific Transactions
Created by Moein aka Mrpintime

# **DataSet Description**:
`name`: SantanderCustomerSatisfaction
`version`: 3
`Author`: Banco Santander
`description`:
At Santander our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?

**Dataset taken from Kaggle**: https://www.kaggle.com/c/santander-customer-transaction-prediction/data


---


# Problem


In this challenge, we want to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted.
**Note**: The data provided for this competition has the same structure as the real data we have available to solve this problem.

# Import files from Kaggle

In [None]:
# !pip install opendatasets
# import opendatasets as op
# op.download("https://www.kaggle.com/competitions/santander-customer-transaction-prediction/data")

Install cuML and LightGBM for faster computation

In [None]:
# This cell installs cuML on Colab
# !git clone https://github.com/rapidsai/rapidsai-csp-utils.git
# !python rapidsai-csp-utils/colab/pip-install.py
# install LightGBM
# !pip install lightgbm

# Pre-processing (Cleansing)

Import necessary libraries

In [None]:
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt

In [None]:
# save files to google drive
# ! cp -r "/content/santander-customer-transaction-prediction/" "/content/drive/MyDrive/"

Read train set

In [None]:
df = pd.read_csv("/content/drive/MyDrive/santander-customer-transaction-prediction/train.csv")
df.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


**Note**: The dataset is large: 200,000 rows and 202 columns.

In [None]:
df.shape

(200000, 202)

Check for null values, inspect the feature and target variables, and clean them where needed

In [None]:
df.isnull().sum()

ID_code    0
target     0
var_0      0
var_1      0
var_2      0
          ..
var_195    0
var_196    0
var_197    0
var_198    0
var_199    0
Length: 202, dtype: int64

In [None]:
df.isnull().sum().sum()

0

In [None]:
df.columns

Index(['ID_code', 'target', 'var_0', 'var_1', 'var_2', 'var_3', 'var_4',
       'var_5', 'var_6', 'var_7',
       ...
       'var_190', 'var_191', 'var_192', 'var_193', 'var_194', 'var_195',
       'var_196', 'var_197', 'var_198', 'var_199'],
      dtype='object', length=202)

In [None]:
df.dtypes

ID_code     object
target       int64
var_0      float64
var_1      float64
var_2      float64
            ...   
var_195    float64
var_196    float64
var_197    float64
var_198    float64
var_199    float64
Length: 202, dtype: object

In [None]:
df.dtypes.unique()

array([dtype('O'), dtype('int64'), dtype('float64')], dtype=object)

In [None]:
df = df.drop(columns='ID_code')

Descriptive statistics

In [None]:
df.describe()

Unnamed: 0,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
count,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,...,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0
mean,0.10049,10.679914,-1.627622,10.715192,6.796529,11.078333,-5.065317,5.408949,16.54585,0.284162,...,3.23444,7.438408,1.927839,3.331774,17.993784,-0.142088,2.303335,8.908158,15.87072,-3.326537
std,0.300653,3.040051,4.050044,2.640894,2.043319,1.62315,7.863267,0.866607,3.418076,3.332634,...,4.559922,3.023272,1.478423,3.99203,3.135162,1.429372,5.454369,0.921625,3.010945,10.438015
min,0.0,0.4084,-15.0434,2.1171,-0.0402,5.0748,-32.5626,2.3473,5.3497,-10.5055,...,-14.0933,-2.6917,-3.8145,-11.7834,8.6944,-5.261,-14.2096,5.9606,6.2993,-38.8528
25%,0.0,8.45385,-4.740025,8.722475,5.254075,9.883175,-11.20035,4.7677,13.9438,-2.3178,...,-0.058825,5.1574,0.889775,0.5846,15.6298,-1.1707,-1.946925,8.2528,13.8297,-11.208475
50%,0.0,10.52475,-1.60805,10.58,6.825,11.10825,-4.83315,5.3851,16.4568,0.3937,...,3.2036,7.34775,1.9013,3.39635,17.95795,-0.1727,2.4089,8.8882,15.93405,-2.81955
75%,0.0,12.7582,1.358625,12.5167,8.3241,12.261125,0.9248,6.003,19.1029,2.9379,...,6.4062,9.512525,2.9495,6.2058,20.396525,0.8296,6.556725,9.5933,18.064725,4.8368
max,1.0,20.315,10.3768,19.353,13.1883,16.6714,17.2516,8.4477,27.6918,10.1513,...,18.4409,16.7165,8.4024,18.2818,27.9288,4.2729,18.3215,12.0004,26.0791,28.5007


The dataset is highly imbalanced.

In [None]:
df.target.value_counts(normalize=True)

0    0.89951
1    0.10049
Name: target, dtype: float64

There are several ways to handle this:
- Class weighting
- Downsampling (undersampling the majority class)
- Upsampling (oversampling the minority class)
- Resampling (combining both)
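
As a rough illustration of the class-weight approach, the sketch below weights each class inversely to its frequency using `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` and the toy labels are assumptions made for this example, not part of the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels with roughly the 90/10 split seen in `target` (illustrative, not the real data).
toy_labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(toy_labels)
print(weights)
```

A dictionary like this can typically be passed as the `class_weight` argument of estimators that support it, so that errors on the rare class cost more during training.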

Import the necessary libraries for handling the imbalanced dataset.

In [None]:
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks
from imblearn.over_sampling import RandomOverSampler, KMeansSMOTE, SMOTE, SMOTEN

Our dataset is clean. We first want to analyze it without a dimensionality reduction technique such as `PCA` or clustering.
You need to know your dataset well before applying one, because dimensionality reduction discards some information.



---



# Data preprocessing for predictive analysis

Import the necessary libraries

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.iloc[:, 1:]
y = df.target

X.shape, y.shape

((200000, 200), (200000,))

**Note:** In Kaggle competitions we usually have two datasets, one for training and one for testing, so after validating our model we can retrain it on the entire training set.
You should also check whether the train and test sets have the same features and similar values, or whether there are differences between them.
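
One way to check that the two sets share the same features is to compare their column names. The sketch below is a minimal version that works on plain lists of names; the helper `compare_feature_columns` and the example column lists are hypothetical.

```python
def compare_feature_columns(train_columns, test_columns, target="target"):
    """Return (missing_in_test, extra_in_test), ignoring the target column."""
    train_features = [c for c in train_columns if c != target]
    train_set = set(train_features)
    test_set = set(test_columns)
    missing_in_test = [c for c in train_features if c not in test_set]
    extra_in_test = [c for c in test_columns if c not in train_set]
    return missing_in_test, extra_in_test


# Illustrative column lists; with pandas, pass `df.columns` and the test frame's columns.
train_cols = ["target", "var_0", "var_1", "var_2"]
test_cols = ["var_0", "var_1", "var_3"]
missing, extra = compare_feature_columns(train_cols, test_cols)
print("missing in test:", missing)
print("only in test:", extra)
```

Two empty lists mean the feature names match. Comparing the value distributions of each feature is a separate check that this sketch does not cover.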

Let's split our dataset into `train` and `validate` sets.

`Tip`: If your model reaches good enough accuracy, you can try to improve it with the `pseudo-labeling technique`, but beware of **overfitting**.
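
Pseudo-labeling generally means predicting on unlabeled rows, keeping only the most confident predictions, and adding them to the training data with their predicted labels before retraining. The sketch below shows only the selection step. The function name and the `low`/`high` thresholds are assumptions chosen for illustration, and stricter thresholds reduce the risk of reinforcing wrong predictions.

```python
def select_pseudo_labels(probabilities, low=0.01, high=0.99):
    """Return [(row_index, label)] for predictions confident enough to reuse as labels."""
    selected = []
    for idx, p in enumerate(probabilities):
        if p >= high:
            selected.append((idx, 1))  # confidently positive
        elif p <= low:
            selected.append((idx, 0))  # confidently negative
    return selected


# Hypothetical predicted probabilities of class 1 on unlabeled test rows.
test_probs = [0.995, 0.50, 0.003, 0.90, 0.999]
pseudo = select_pseudo_labels(test_probs)
print(pseudo)
```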

We use the `stratify` parameter here because we want to preserve the class frequencies of the target in both the train and validation sets.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, test_size=0.3, random_state=123)

As we can see, the train and validation sets have the same relative class frequencies, but both are highly `imbalanced`.

In [None]:
y_train.value_counts(normalize=True), y_valid.value_counts(normalize=True)

(0    0.899507
 1    0.100493
 Name: target, dtype: float64,
 0    0.899517
 1    0.100483
 Name: target, dtype: float64)
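
To make the effect of stratification concrete, here is a dependency-free sketch that splits each class separately so both parts keep about the same class shares. It illustrates the idea only and is not scikit-learn's implementation; the function name and toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.3, seed=123):
    """Split row indices so each class keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


def positive_rate(labels, indices):
    """Share of rows labeled 1 among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)


# Toy labels with a 90/10 class split (illustrative, not the real data).
toy_labels = [0] * 900 + [1] * 100
train_idx, valid_idx = stratified_split_indices(toy_labels)
print(len(train_idx), len(valid_idx))
print(positive_rate(toy_labels, train_idx), positive_rate(toy_labels, valid_idx))
```

Without stratification, a random split of a 90/10 dataset can leave the two parts with noticeably different positive rates, which makes validation scores harder to compare with training behavior.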

## Data sampling for the imbalanced dataset

Because we have a large dataset and do not want to create instances that are not 100 percent real, we start with downsampling.

### Downsampling

#### RandomUnderSampler, NearMiss, TomekLinks
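
Before turning to the library classes, the following dependency-free sketch shows what random undersampling does: every class is cut down at random to the size of the smallest class. It covers only the random variant (`NearMiss` and `TomekLinks` instead pick which majority rows to drop based on distances between samples), and the function name and toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def random_undersample_indices(labels, seed=123):
    """Return sorted row indices with every class cut down to the minority size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    minority_size = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sample without replacement so no row is duplicated.
        kept.extend(rng.sample(indices, minority_size))
    return sorted(kept)


# Toy labels with a 90/10 class split (illustrative, not the real data).
toy_labels = [0] * 90 + [1] * 10
kept = random_undersample_indices(toy_labels)
kept_labels = [toy_labels[i] for i in kept]
print(len(kept), sum(kept_labels))
```

With pandas objects, the kept positions could be applied as `X_train.iloc[kept]` and `y_train.iloc[kept]`. Resampling is normally applied to the training split only, so that the validation set keeps the original class distribution.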

In [None]:
# Implement downsampling techniques.