<a href="https://www.kaggle.com/code/mariuszcha/santander-prediction-v1?scriptVersionId=128894821" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

![](https://storage.googleapis.com/kaggle-competitions/kaggle/10385/logos/thumb76_76.png?t=2019-02-15-16-53-52)The aim of this project is to predict which customers of Santander Bank will carry out a transaction, using a dataset provided by the institution. Each customer's data has been anonymized to protect their privacy. The data provided for this competition has the same structure as the real data that Santander's data scientists have available to solve this problem. To assess the quality of our predictions, we will use the ROC AUC metric (area under the Receiver Operating Characteristic curve), which summarizes a model's performance across all classification thresholds as a single number.

![](https://images.pexels.com/photos/351264/pexels-photo-351264.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=2)

Our evaluation metric is **ROC AUC.**
The ROC curve evaluates a binary classification model by plotting the True Positive Rate against the False Positive Rate at every decision threshold, which gives a visual representation of the trade-off between the two. The AUC is the area under that curve: 0.5 corresponds to random guessing and 1.0 to a perfect ranking. A high AUC and a curve that runs close to the top left corner of the graph indicate good performance.
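
One way to build intuition for the metric: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up toy data. The function name `roc_auc_pairwise` is hypothetical, and the quadratic loop is meant for illustration only, not for data sets of this size.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores, made up for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(y_true, y_score)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the result is 0.75.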

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from lightgbm import LGBMClassifier

sns.set_style("darkgrid", {"axes.facecolor": ".9"})
sns.set_palette('flare')

In [None]:
# Reading test data and train data.
train_df = pd.read_csv('/kaggle/input/santander-customer-transaction-prediction/train.csv', index_col=0)
test_df = pd.read_csv('/kaggle/input/santander-customer-transaction-prediction/test.csv', index_col=0)

# Creating list of column names.
features = [x for x in train_df if x.startswith('var')]

#X and y
X = train_df[features]
y = train_df.target

# EDA
**As a first step, we check whether our data sets contain any missing values.**

In [None]:
def count_missing(df):
    # Total number of missing values across all columns.
    total_missing = df.isna().sum().sum()
    print('Number of NAs in set:', total_missing)

count_missing(train_df)
count_missing(test_df)

Above, we counted the missing values across each whole set, and there are none.

**Next we check the distributions of the independent variables and of the dependent one. We could plot all of them, but that takes a lot of CPU time. Since we are building this model in order to learn something new, we select the 15 features with the highest absolute correlation with the target and work mostly with those. We will examine the standard deviation, mean, maximum and minimum using .describe().**

In [None]:
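# Select the 15 features with the highest absolute correlation with the target.
# corrwith avoids computing the full feature-by-feature correlation matrix.
significant_features = train_df[features].corrwith(train_df['target']).abs().sort_values(ascending=False).head(15).index.tolist()
display(train_df.describe())
display(test_df.describe())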

As the summary tables above and the histograms below show, the selected features are approximately normally distributed, with no obvious skewness. Some outlying values may be present, so we will use RobustScaler to scale the data. We have not inspected the whole set, so the remaining variables may be less normal, but the ones that matter most for modeling look fine.
One concern is kurtosis, which looks problematic for some of the features.
Kurtosis describes how heavy the tails of a distribution are, which helps to identify the presence of outliers or extreme values in a dataset.
A distribution with positive excess kurtosis has heavier tails than a normal distribution, which means it produces more extreme values. A distribution with negative excess kurtosis has lighter tails than a normal distribution, which means it produces fewer.
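
To make the two indicators concrete, here is a minimal sketch that computes skewness and excess kurtosis from central moments. The helper name `skew_and_excess_kurtosis` and the toy samples are assumptions made for illustration. It uses the plain population formulas, so its values will differ slightly from pandas' `DataFrame.skew()` and `DataFrame.kurt()`, which apply bias corrections.

```python
def skew_and_excess_kurtosis(values):
    """Population skewness and excess kurtosis (a normal distribution gives 0 for both)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("variance is zero, so the moments are undefined")
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Toy samples, made up for illustration.
flat = [1, 2, 3, 4, 5]                           # evenly spread, light tails
heavy_tail = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]   # mostly constant, two extremes

flat_skew, flat_kurt = skew_and_excess_kurtosis(flat)
heavy_skew, heavy_kurt = skew_and_excess_kurtosis(heavy_tail)
print(flat_skew, flat_kurt)
print(heavy_skew, heavy_kurt)
```

Both samples are symmetric, so their skewness is zero, but the evenly spread sample has negative excess kurtosis while the one with two extreme values has positive excess kurtosis.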


In [None]:
fig, axes = plt.subplots(3,5,figsize=(15,10))
axes = axes.flatten()

# One histogram with a KDE overlay per selected feature.
for i, feature in enumerate(significant_features):
    sns.histplot(kde=True, x=train_df[feature], ax=axes[i])

Next, we check the distribution of the target variable.

In [None]:
sns.countplot(x=train_df['target'])

This is a significant problem: our classes are imbalanced. Common remedies are oversampling the minority class, undersampling the majority class, or weighting the classes during training.
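
The model later in this notebook is created with `class_weight='balanced'`, which is one way to handle the imbalance without resampling. The sketch below shows the weighting rule that scikit-learn documents for its "balanced" mode, `n_samples / (n_classes * count_per_class)`. The helper name `balanced_class_weights` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels with a 9:1 imbalance, made up for illustration.
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, the nine negatives and the single positive contribute the same total weight (5.0 each) to the training loss.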

**We will now compare the distribution of the train data with that of the test data across all features.
One way to compare the two sets is to calculate the mean, median, standard deviation, etc. of each column in the train set and compare them with the same values for the test set. If the means or standard deviations differ significantly between the train and test sets, the model may generalize poorly.**

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Overlay the per-feature mean, median and std of the train and test sets.
for ax, stat in zip(axes, ['mean', 'median', 'std']):
    sns.histplot(x=train_df[features].agg(stat), kde=True, ax=ax, bins=50, color='b', label='train')
    sns.histplot(x=test_df[features].agg(stat), kde=True, ax=ax, bins=50, label='test')
    ax.set_title(stat)
    ax.legend()

As the graphs show, the train and test sets are quite similar, so we should not have a problem with generalization.
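
The visual comparison can be backed by numbers. The sketch below tabulates the absolute train/test difference in mean, median and standard deviation for each column. The helper name `compare_summary_stats` and the synthetic `demo_train` / `demo_test` frames are assumptions made for illustration; in the notebook, the same function could be called with `train_df`, `test_df` and `features`.

```python
import numpy as np
import pandas as pd


def compare_summary_stats(train, test, columns):
    """Absolute per-column difference between train and test summary statistics."""
    stats = ["mean", "median", "std"]
    train_stats = train[columns].agg(stats).T  # rows: columns, cols: statistics
    test_stats = test[columns].agg(stats).T
    return (train_stats - test_stats).abs()


# Synthetic data for illustration: three standard-normal columns per set.
rng = np.random.default_rng(13)
names = ["var_0", "var_1", "var_2"]
demo_train = pd.DataFrame(rng.normal(0, 1, size=(1000, 3)), columns=names)
demo_test = pd.DataFrame(rng.normal(0, 1, size=(1000, 3)), columns=names)
demo_test["var_2"] += 0.5  # inject a shift into one test column

diff = compare_summary_stats(demo_train, demo_test, names)
print(diff.sort_values("mean", ascending=False))
```

Columns at the top of the sorted table are the ones whose distribution differs most between the two sets, which is where a generalization problem would show up first.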

**We will also check whether there are columns whose values repeat often. Such columns may have low variance. Maybe we could find a similar pattern in the test columns and use it to assign the proper class?**

In [None]:
display(train_df.var().sort_values(ascending=True)[0:10])
display(test_df.var().sort_values(ascending=True)[0:10])

As we can see, there are variables with very low variance in both the train and the test data. That could be important later on.
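
Low variance and repeated values are related but not the same thing, so it may help to measure duplication directly. The sketch below, on a made-up `demo` frame with hypothetical column names, puts the variance next to the number of distinct values per column using pandas' `nunique()`.

```python
import pandas as pd

# Toy data, made up for illustration.
demo = pd.DataFrame({
    "var_a": [0.10, 0.10, 0.11, 0.11, 0.10, 0.11],  # low variance, many repeats
    "var_b": [0.10, 0.11, 0.12, 0.13, 0.14, 0.15],  # low variance, no repeats
    "var_c": [1.0, 1.0, 50.0, 50.0, 1.0, 50.0],     # high variance, many repeats
})

summary = pd.DataFrame({
    "variance": demo.var(),
    "n_unique": demo.nunique(),
    # Share of rows whose value already appeared earlier in the column.
    "duplicate_share": 1 - demo.nunique() / len(demo),
})
print(summary)
```

Here `var_b` has low variance without any repeated value, while `var_c` repeats heavily despite a high variance, so a duplicate count answers the question more directly than the variance does.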

**To sum up:
a) We will apply RobustScaler to our data and see whether it improves our scores
b) We will use LightGBM as our model for now**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_df[features], train_df['target'], random_state=13, test_size=0.2)
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
model = LGBMClassifier(random_state=13, n_jobs=-1, class_weight='balanced')
model.fit(X_train, y_train)
y_pred = model.predict_proba(X_test)[:,1]
fpr, tpr, threshold = roc_curve(y_test, y_pred)

In [None]:
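plt.plot(fpr, tpr, color='green', label='ROC Curve')
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Baseline')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
# ROC AUC takes the true labels first and the predicted scores second.
display('roc_auc_score:', roc_auc_score(y_test, y_pred))
display('accuracy:', model.score(X_test, y_test))

The ROC AUC of 65% reported for LightGBM in this run was obtained by passing hard class labels from `model.predict` to `roc_auc_score`, with the arguments in reversed order. ROC AUC is defined on predicted scores with the true labels first, so the call above uses `y_pred`, and its value will differ from that figure. The most important thing, though, is how the model performs on the competition test data, where we don't know the target values.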

In [None]:
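# Score the competition test set and write a submission file to check on Kaggle.
x = scaler.transform(test_df[features])
# The leaderboard metric is ROC AUC, so submit probabilities rather than class labels.
predictions = model.predict_proba(x)[:, 1]
submission = pd.DataFrame({'ID_code': test_df.index, 'target': predictions})

submission.to_csv('submission.csv', index=False)

The score reported for the test set is 79%, obtained with hard class labels as predictions. It's not terrible, but we will definitely try to improve it in the future. The leaderboard is ranked by ROC AUC, which is why the cell above submits probabilities, and the top entries reach about 90%.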

# To do...
* Explore more relations between variables - test variables too
* Improve competition score