# Introduction
This notebook was created to learn basic techniques of data manipulation and machine learning. The idea is to use the UCI_Credit_Card dataset to improve basic skills in data cleaning, data analysis, data visualization and machine learning. It is primarily intended to help me understand what to do and how. Any feedback is welcome.

## Variables
There are 25 variables:

* ID: ID of each client
* LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)
* AGE: Age in years
* PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
* PAY_2: Repayment status in August, 2005 (scale same as above)
* PAY_3: Repayment status in July, 2005 (scale same as above)
* PAY_4: Repayment status in June, 2005 (scale same as above)
* PAY_5: Repayment status in May, 2005 (scale same as above)
* PAY_6: Repayment status in April, 2005 (scale same as above)
* BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
* BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
* BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
* BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
* BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
* BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
* PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
* PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
* PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
* PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
* PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
* PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
* default.payment.next.month: Default payment (1=yes, 0=no)


In [7]:
# Import basic libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
# Load the data

df = pd.read_csv('/Users/Michael/Documents/GitHub/C5T2/data/credit.csv')
df.head(3)

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0


As a first step, let's check whether there are missing or anomalous data

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
ID                            30000 non-null int64
LIMIT_BAL                     30000 non-null int64
SEX                           30000 non-null int64
EDUCATION                     30000 non-null int64
MARRIAGE                      30000 non-null int64
AGE                           30000 non-null int64
PAY_0                         30000 non-null int64
PAY_2                         30000 non-null int64
PAY_3                         30000 non-null int64
PAY_4                         30000 non-null int64
PAY_5                         30000 non-null int64
PAY_6                         30000 non-null int64
BILL_AMT1                     30000 non-null int64
BILL_AMT2                     30000 non-null int64
BILL_AMT3                     30000 non-null int64
BILL_AMT4                     30000 non-null int64
BILL_AMT5                     30000 non-null int64
BILL_AMT6               

In [10]:
# Categorical variables description
print(df['SEX'].unique())
print(df['EDUCATION'].unique())
print(df['MARRIAGE'].unique())

# Note: df[['SEX', 'EDUCATION', 'MARRIAGE']].unique() does not work,
# because unique() is a Series method, so we call it column by column

[2 1]
[2 1 3 5 4 6 0]
[1 2 3 0]


No missing data, but a few anomalies:
* EDUCATION has categories 5 and 6, both labeled only as 'unknown', and it also has an undocumented label 0.
* MARRIAGE has an undocumented label 0.
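The check above was done by eye. As a minimal sketch, the same comparison against the documented labels could be automated like this. The `documented` dictionary is copied from the variable list at the top, while the helper name `undocumented_labels` and the toy frame are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Documented labels, copied from the variable list at the top of the notebook
documented = {
    'SEX': {1, 2},
    'EDUCATION': {1, 2, 3, 4, 5, 6},
    'MARRIAGE': {1, 2, 3},
}

# Toy stand-in for df (made-up values, not the real data)
toy = pd.DataFrame({
    'SEX': [1, 2, 2, 1],
    'EDUCATION': [1, 0, 6, 2],
    'MARRIAGE': [1, 2, 0, 3],
})

def undocumented_labels(frame, documented):
    """Return {column: sorted list of values missing from the documented set}."""
    found = {}
    for col, allowed in documented.items():
        extra = set(frame[col].unique()) - allowed
        if extra:
            found[col] = sorted(int(v) for v in extra)
    return found

report = undocumented_labels(toy, documented)
print(report)
```

Calling `undocumented_labels(df, documented)` on the real frame should flag the same columns discussed above, assuming the column names match.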

In [11]:
# Payment delay description
df[['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']].describe()

Unnamed: 0,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,-0.0167,-0.133767,-0.1662,-0.220667,-0.2662,-0.2911
std,1.123802,1.197186,1.196868,1.169139,1.133187,1.149988
min,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0
25%,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
50%,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0
max,8.0,8.0,8.0,8.0,8.0,8.0


They all contain an undocumented label -2, and the label 0 is not documented either. If 1, 2, 3, etc. are the months of delay, it would be more natural for 0 to mean 'pay duly' and for every negative value to be treated as 0. But we will get to that later

In [12]:
# Bill statement description
df[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].describe()

Unnamed: 0,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,51223.3309,49179.075167,47013.15,43262.948967,40311.400967,38871.7604
std,73635.860576,71173.768783,69349.39,64332.856134,60797.15577,59554.107537
min,-165580.0,-69777.0,-157264.0,-170000.0,-81334.0,-339603.0
25%,3558.75,2984.75,2666.25,2326.75,1763.0,1256.0
50%,22381.5,21200.0,20088.5,19052.0,18104.5,17071.0
75%,67091.0,64006.25,60164.75,54506.0,50190.5,49198.25
max,964511.0,983931.0,1664089.0,891586.0,927171.0,961664.0


Can negative values be interpreted as credit? This has to be investigated

In [13]:
# Previous payment description
df[['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']].describe()

Unnamed: 0,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,5663.5805,5921.163,5225.6815,4826.076867,4799.387633,5215.502567
std,16563.280354,23040.87,17606.96147,15666.159744,15278.305679,17777.465775
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,1000.0,833.0,390.0,296.0,252.5,117.75
50%,2100.0,2009.0,1800.0,1500.0,1500.0,1500.0
75%,5006.0,5000.0,4505.0,4013.25,4031.5,4000.0
max,873552.0,1684259.0,896040.0,621000.0,426529.0,528666.0


In [14]:
df.LIMIT_BAL.describe()

count      30000.000000
mean      167484.322667
std       129747.661567
min        10000.000000
25%        50000.000000
50%       140000.000000
75%       240000.000000
max      1000000.000000
Name: LIMIT_BAL, dtype: float64

The range is very broad, investigation required.

Two columns bother me because they are poorly labeled.

In [15]:
# The target column is called 'default payment next month' in this CSV and
# 'default.payment.next.month' in the documentation above; rename ignores
# keys that are not present, so we can safely list both
df = df.rename(columns={'default payment next month': 'def_pay',
                        'default.payment.next.month': 'def_pay',
                        'PAY_0': 'PAY_1'})
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,def_pay
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [16]:
# I am interested in having a general idea of the default probability
df.def_pay.sum() / len(df.def_pay)

In [4]:
# Other ways of getting this kind of number (as a reference for newbies like myself)
print(df.shape)
print(df.shape[0])
print(df.def_pay.count())
print(len(df.axes[1]))


# Categorical variables
## Cleaning
There is not much to clean, but it is a good occasion to learn how to look at a column and replace anomalous entries

### Looking at our columns with more attention

In [5]:
df.SEX.value_counts() #this is fine, more women than men


In [6]:
df['MARRIAGE'].value_counts()


In [None]:
df.EDUCATION.value_counts() # yes, I am using different ways of calling a column

### Fixing the mislabeled entries

The 0 in MARRIAGE can be safely categorized as 'Other' (thus 3).

The 0 (undocumented), 5 and 6 (label unknown) in EDUCATION can also be put in an 'Other' category (thus 4).

This is a good occasion to learn how to use the .loc indexer

In [None]:
fil = (df.EDUCATION == 5) | (df.EDUCATION == 6) | (df.EDUCATION == 0)
df.loc[fil, 'EDUCATION'] = 4
df.EDUCATION.value_counts()

In [None]:
df.loc[df.MARRIAGE == 0, 'MARRIAGE'] = 3
df.MARRIAGE.value_counts()

One might wonder what these labels mean.

"Other" in education can be an education lower than the high school level

"Other" in marriage could be, for example, "divorced". 



Numerically they are not very relevant. We can decide later if we can drop them (or, even better, if dropping them can improve our result)
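To put a number on "not very relevant", one option is to look at the share of each label rather than the raw counts. The sketch below uses a made-up column in place of `df.EDUCATION`, so the printed shares say nothing about the real data.

```python
import pandas as pd

# Toy column standing in for df.EDUCATION after the fix (made-up values)
education = pd.Series([1, 1, 2, 2, 2, 2, 3, 3, 3, 4])

# normalize=True returns fractions instead of counts
shares = education.value_counts(normalize=True).sort_index()
print(shares)
```

On the real frame the same call would be `df.EDUCATION.value_counts(normalize=True)`.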

## Analysis

We can look at how these three variables are correlated to our target 'def_pay'. The goal is to see if they can be relevant to our models or not and, most importantly, it gives us a chance of learning a few basic techniques.

### The magic use of groupby

It is a very easy tool to directly see the relation between one category and another. Just for pedagogical purposes, we can do it step by step.

Let's start by looking at the correlation between gender and default


In [None]:
df.groupby(['SEX', 'def_pay']).size()

Well, this doesn't look very good, why don't we create a dataframe out of it?

In [None]:
gender = df.groupby(['SEX', 'def_pay']).size().unstack(1)
# 1 is the default for unstack, but I put it to show explicitly what we are unstacking
gender

In [None]:
# Another, easier, way is to just use crosstab
pd.crosstab(df.SEX, df.def_pay)

We can do two things: plot directly or compute the probability for each gender to default according to our dataset

In [None]:
gender.plot(kind='bar', stacked = True)

In [None]:
gender['perc'] = (gender[1]/(gender[0] + gender[1])) 
#this creates a new column in our dataset
gender

In [None]:
# and we can visualize it
gender.perc.plot(kind = 'bar')

Considering that about 22% of the customers will default, we see a couple of things:
* there are significantly more women than men
* men are more likely to default the next month

However, we should not jump to any conclusion just yet, since there might be some lurking variable that explains the data better (and, SEX being the first variable we look at, this is most likely the case). Still, it is a nice result, so let's move on.

Now let's look at EDUCATION

In [None]:
ed = df.groupby(['EDUCATION', 'def_pay']).size().unstack()
ed.plot(kind = 'bar', stacked = True)

In [None]:
ed['perc'] = (ed[1]/(ed[0] + ed[1]))
ed

It seems that the higher the education, the lower the probability of defaulting the next month. The only exception is the category labeled "Other" which, if our guess about its meaning is right, would be lower than high school. However, numerically it will not have much weight in the final result.

At last, let's look at MARRIAGE.

In [None]:
mar = df.groupby(['MARRIAGE', 'def_pay']).size().unstack()
mar.plot(kind = 'bar', stacked = True)

In [None]:
mar['perc'] = (mar[1]/(mar[0] + mar[1]))
mar

Here it seems that married people are more likely to default, as is the mysterious category "Other" (which is again numerically less relevant than the others)

All considered, these three categories seem to affect the result we want to predict. Thus we keep them in mind for later. 

I try to explain these first results and, while I can imagine how marital status or education can determine the balance of your credit card, I can't find a way of explaining why the type of genitals can do that as well. This particular result would probably make more sense when put in the context of the society these people belong to.

Revealing gender inequalities is not our priority (at least not in a beginner notebook on Kaggle), so we move on.

One consideration: we did the same thing over and over, good for practice but still incomplete. Let's see a slightly different way of obtaining the same percentages.

In [None]:
df[["SEX", "def_pay"]].groupby(['SEX'], 
                                        as_index=False).mean().sort_values(by='def_pay', 
                                                                           ascending=False)

A newbie like myself likes to mess around with options. We remove as_index=False (so the grouping column becomes the index again) and ascending=False (so the values are sorted in ascending order, which is the default)

In [None]:
df[["SEX", "def_pay"]].groupby(['SEX']).mean().sort_values(by='def_pay')

One last thing before moving on. If you have to perform repetitive actions like this one, you want to write a function to do it for you. It is a good exercise and it will reveal what we swept under the carpet when we called the columns with gender[0] and gender[1]

In [None]:
def corr_2_cols(Col1, Col2):
    res = df.groupby([Col1, Col2]).size().unstack()
    res['perc'] = (res[1]/(res[0] + res[1]))
    return res

corr_2_cols('SEX', 'def_pay')

Looks great, it does everything we did before, life is wonderful and we can all quit recreational drugs to feel alive again because we don't need them any longer.

Not really.

If we simply try corr_2_cols('MARRIAGE', 'SEX') we get a KeyError. This is because res[1] actually selects the column labeled 1 in res, which for def_pay just happens to also be in position 1. SEX takes the values 1 and 2, so there is no column labeled 0. Let's write it in a more correct way

In [None]:
def corr_2_cols(Col1, Col2):
    res = df.groupby([Col1, Col2]).size().unstack()
    res['perc'] = (res[res.columns[1]]/(res[res.columns[0]] + res[res.columns[1]]))
    return res

corr_2_cols('SEX', 'def_pay')

In [None]:
corr_2_cols('MARRIAGE', 'SEX')

In [None]:
corr_2_cols('EDUCATION', 'SEX')

Now we are happier: we have written a function that would have saved us some time, and we can see some other correlations. For example, in our dataset the percentage of women with higher education is comparable to the one with lower education, which is not a common result in many countries.

However, our function works only if the unstacked column has exactly 2 possible values. Can you write a more general one?
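One possible answer, offered as a sketch rather than the definitive solution, is to let `pd.crosstab` normalise each row, which works for any number of values in the second column. The function name `share_by_group` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def share_by_group(frame, row_col, col_col):
    """Row-normalised contingency table: every row sums to 1."""
    return pd.crosstab(frame[row_col], frame[col_col], normalize='index')

# Toy data with three values in each column (made-up, not the real dataset)
toy = pd.DataFrame({
    'MARRIAGE':  [1, 1, 1, 1, 2, 2, 3, 3],
    'EDUCATION': [1, 2, 2, 3, 1, 1, 3, 3],
})

shares = share_by_group(toy, 'MARRIAGE', 'EDUCATION')
print(shares)
```

Unlike corr_2_cols, this returns one share per value of the second column instead of a single 'perc' column, so for a binary target the default rate is simply the column labeled 1.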



# Dealing with age

We use some seaborn functions. I found most of them in kernels of the Titanic competition but, again, the whole purpose is to practice, so here we go

In [None]:
# import the libraries we need
#import seaborn as sns
#import matplotlib.pyplot as plt
#%matplotlib inline 
# Already done in the first cell, but kept here as reference

In [None]:
g = sns.FacetGrid(df, col = 'def_pay')
g.map(plt.hist, 'AGE')

Or we can even divide them further and visualize 4 distributions

In [None]:
g = sns.FacetGrid(df, col = 'def_pay', row = 'SEX')
g.map(plt.hist, 'AGE')

It is fair to say that this doesn't really help, so let's use the hue option

In [None]:
g = sns.FacetGrid(df, col='SEX', hue='def_pay')
g.map(plt.hist, 'AGE', alpha=0.6, bins=25) #alpha is for transparency
g.add_legend()

In [None]:
g = sns.FacetGrid(df, col='def_pay', row= "MARRIAGE", hue='SEX')
g.map(plt.hist, 'AGE', alpha=0.3, bins=25) 
g.add_legend()

It can be useful to create categories out of our age distribution. We can do it in three ways (that I know of).

First, we could simply create a column and put a bunch of filters to fill it with the help of loc

In [None]:
df['AgeBin'] = 0 #creates a column of 0
df.loc[((df['AGE'] > 20) & (df['AGE'] < 30)) , 'AgeBin'] = 1
df.loc[((df['AGE'] >= 30) & (df['AGE'] < 40)) , 'AgeBin'] = 2
df.loc[((df['AGE'] >= 40) & (df['AGE'] < 50)) , 'AgeBin'] = 3
df.loc[((df['AGE'] >= 50) & (df['AGE'] < 60)) , 'AgeBin'] = 4
df.loc[((df['AGE'] >= 60) & (df['AGE'] < 70)) , 'AgeBin'] = 5
df.loc[((df['AGE'] >= 70) & (df['AGE'] < 81)) , 'AgeBin'] = 6
df.AgeBin.hist()

This works and gives you control over how big the bins are BUT, let's face it, now that we know how loc works (sort of), it is not practical. We can use the second method that I know, which is cut

In [None]:
bins = [20, 29, 39, 49, 59, 69, 81]
bins_names = [1, 2, 3, 4, 5, 6]
df['AgeBin2'] = pd.cut(df['AGE'], bins, labels=bins_names)
df.AgeBin2.hist()

We notice 2 things:
* the bins have to be defined in a slightly counterintuitive way (at first) because each bin includes its upper limit (as you can check by just changing the bins). You can play with the option "right", which is True by default
* there has to be one fewer bin name than bin edges, i.e. for a single bin you use bins = [20, 81] and bins_names = [1]
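The effect of "right" is easiest to see on values that sit exactly on the bin edges. This is a small sketch on made-up ages, not on the dataset itself.

```python
import pandas as pd

ages = pd.Series([20, 29, 30, 39, 40])  # made-up ages sitting on the bin edges

# right=True (default): intervals are (20, 29], (29, 39], (39, 49]
closed_right = pd.cut(ages, bins=[20, 29, 39, 49], labels=[1, 2, 3])

# right=False: intervals are [20, 30), [30, 40), [40, 50)
closed_left = pd.cut(ages, bins=[20, 30, 40, 50], labels=[1, 2, 3], right=False)

print(pd.DataFrame({'age': ages,
                    'right=True': closed_right,
                    'right=False': closed_left}))
```

With the default setting the age 20 falls outside the first interval and becomes NaN, which is why the lowest edge has to sit below the youngest client.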

There is actually a faster way of doing 6 bins with cut, at the price of losing control on how big these bins are

In [None]:
df['AgeBin3'] = pd.cut(df['AGE'], 6)
df.AgeBin3.value_counts()

This is slightly different from what we did so far, but also faster. To have the right names we need to add an option to the cut command

In [None]:
df['AgeBin3'] = pd.cut(df['AGE'], 6, labels=bins_names)
df.AgeBin3.hist()

Another way of cutting a continuous variable is quantile-based discretization. This is done by the function qcut

In [None]:
df['AgeBin4'] = pd.qcut(df['AGE'], 6)
df.AgeBin4.value_counts()

In [None]:
df['AgeBin4'] = pd.qcut(df['AGE'], 6, labels=bins_names)
df.AgeBin4.hist()

This can be useful if, for example, you have outliers (as there may well be in the balance variables), because those outliers would just fall into the extreme categories.

I still like my age feature, but we don't need so many bin columns. We keep only the fast one
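A minimal sketch of the difference, on made-up values with a single large outlier: equal-width bins from cut are stretched by the outlier, while qcut keeps the bins equally populated.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])  # made-up data, one outlier

# cut: two bins of equal width, so almost everything lands in the first one
equal_width = pd.cut(values, 2).value_counts().sort_index()

# qcut: two bins split at the median, so each holds half of the values
equal_count = pd.qcut(values, 2).value_counts().sort_index()

print(equal_width.tolist())
print(equal_count.tolist())
```

Here cut puts nine values in one bin and the outlier alone in the other, while qcut puts five values in each.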

In [None]:
del df['AgeBin2']
del df['AgeBin3']
del df['AgeBin4'] # we don't need these any more
df['AgeBin'] = pd.cut(df['AGE'], 6, labels = [1,2,3,4,5,6])
# because 1, 2, 3, etc. are "categories" so far and we need numbers
df['AgeBin'] = pd.to_numeric(df['AgeBin']) 
df.AgeBin.hist()

In [None]:
corr_2_cols('AgeBin', 'def_pay')

In [None]:
corr_2_cols('AgeBin', 'SEX')

I am keeping both the AGE and the AgeBin features because I am curious about what difference it makes for our models.

## Time to understand the rest

The variables regarding the credit and payment delays were not explored at all and I am not comfortable with using them in a model if I am not sure of their meaning. An important thing I have to remember is that I will make some assumptions and I have to verify them.

First, looking at the number of months of delay, we see that the 'pay duly' should be labeled as -1. So I need to know what 0 and -2 mean. I suspect they can all be called 0, but let's see, for example, their BILL_AMT

In [None]:
df[df.PAY_1 < 1][['BILL_AMT2', 'PAY_AMT1', 'BILL_AMT1', 'PAY_1']].sample(20)

Why not look at those with a lot of delay?

In [None]:
df[df.PAY_1 > 3][['BILL_AMT2', 'PAY_AMT1', 'BILL_AMT1', 'PAY_1']].sample(20)

One may wonder if that feature is useful at all once you know the amount of money the clients have to pay. The more time I spend on it, the less important this feature seems, although it can play the role of a "category" for BILL_AMT. We keep it, but I don't like having undocumented labels, so I decide to fix it to what seems more logical to me. We will test its importance later on.


In [None]:
pay_cols = ['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
for col in pay_cols:
    # collapse the labels -2, -1 and 0 into a single 'no delay' label 0
    fil = (df[col] == -2) | (df[col] == -1) | (df[col] == 0)
    df.loc[fil, col] = 0
df[pay_cols].describe()
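Since the three labels being merged are exactly the non-positive ones, the same relabeling can probably be written more compactly with clip. This is a sketch on a made-up frame, assuming the PAY_n columns contain only integers from -2 upwards, as the describe() output earlier suggests.

```python
import pandas as pd

# Toy stand-in for the PAY_n columns (made-up values)
toy = pd.DataFrame({
    'PAY_1': [-2, -1, 0, 1, 3],
    'PAY_2': [0, 2, -1, -2, 8],
})

# Everything below 0 is raised to 0; positive delays are left untouched
clipped = toy.clip(lower=0)
print(clipped)
```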

Second, the feature LIMIT_BAL is the amount of given credit. I am inclined to interpret it as the credit limit, thus the maximal amount of credit the customer can have in a month. However, since the range goes from 10000 to 1000000, and assuming we are not dealing with a very careless bank, I would interpret it as a limit for the year.

To check that, we could see if anyone has a higher BILL_AMT than the LIMIT_BAL

In [None]:
fil = ((df.LIMIT_BAL < df.BILL_AMT1) | 
      (df.LIMIT_BAL < df.BILL_AMT2) |
      (df.LIMIT_BAL < df.BILL_AMT3) |
      (df.LIMIT_BAL < df.BILL_AMT4) |
      (df.LIMIT_BAL < df.BILL_AMT5) |
      (df.LIMIT_BAL < df.BILL_AMT6))
df[fil].def_pay.value_counts()

Nope, it can happen and it doesn't necessarily lead to default. This surprises me because the bank is then asking some clients to pay more than it allows them to spend.

I am clearly missing something.

Let's have a look at BILL_AMT and PAY_AMT together to see if they make more sense

In [None]:
df[['PAY_AMT6', 'BILL_AMT6', 'PAY_AMT5',
    'BILL_AMT5', 'PAY_AMT4', 'BILL_AMT4', 'PAY_AMT3', 'BILL_AMT3',
    'PAY_AMT2', 'BILL_AMT2',
    'PAY_AMT1', 'BILL_AMT1',
    'LIMIT_BAL', 'def_pay']].sample(30)

To me it seems that it goes like this:
* I have a BILL of X, I pay Y
* The month after I have to pay X - Y + X', X' being my new expenses, and I pay Y'
* The month after I have to pay X + X' - Y - Y' + X'', and I pay Y''
* And so on
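The scheme above can be sketched with cumulative sums. All figures below are hypothetical and only illustrate my reading of the columns; interest, fees and anything else the bank may add are ignored.

```python
import numpy as np

# Hypothetical monthly figures, oldest month first (not taken from the dataset)
new_expenses = np.array([1000, 400, 250])  # X, X', X''
payments = np.array([300, 500, 200])       # Y, Y', Y''

# Bill in month n = all expenses up to month n minus all payments made before month n
paid_before = np.concatenate(([0], np.cumsum(payments)[:-1]))
bills = np.cumsum(new_expenses) - paid_before

print(bills)
```

The three bills are X, X - Y + X' and X + X' - Y - Y' + X''. Keep in mind that the dataset numbers the months backwards, so the oldest month is BILL_AMT6 and the most recent one is BILL_AMT1.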

On top of that I may or may not have months of delay.

It seems that if by September I have a bill too close to my limit, I generally default.

But this is something you discuss with your friends while making use of recreational drugs, we are here to be a little bit more scientific. Can I be more scientific than that?



In [None]:
# each comparison needs its own parentheses, because & binds more tightly than >
fil = (df.PAY_AMT1 > df.BILL_AMT2) & (df.PAY_1 > 0)
df[fil][['BILL_AMT1', 'PAY_1', 'LIMIT_BAL', 'def_pay']]

This throws me off. There are clients that paid more than they were asked to, even had a negative bill in September, and still have a month of delay, and even defaulted the next month. 
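A word of caution on filters of this shape: in Python, & binds more tightly than comparison operators, so every comparison in a combined filter needs its own parentheses. The sketch below shows the effect with plain Python values first and then builds the intended filter on a toy frame. All values are made up.

```python
import pandas as pd

# Plain Python first: & is evaluated before >
overpaid, months_late = True, 2
without_parens = overpaid & months_late > 0   # parsed as (overpaid & months_late) > 0
with_parens = overpaid & (months_late > 0)    # what was meant

print(without_parens, with_parens)

# The same idea on a toy frame (made-up values)
toy = pd.DataFrame({
    'PAY_AMT1':  [500, 500, 100, 500],
    'BILL_AMT2': [200, 200, 300, 200],
    'PAY_1':     [1, 2, 1, 0],
})

# Overpaid AND at least one month of delay
intended = (toy.PAY_AMT1 > toy.BILL_AMT2) & (toy.PAY_1 > 0)
print(intended.tolist())
```

Without the parentheses the expression becomes a bitwise AND between a boolean and an integer, and only then a comparison, which silently selects a different set of rows.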

In [None]:
fil = (df.PAY_AMT1 > df.BILL_AMT2)
print("Number of clients that tried to save themselves in the last month: ", len(df[fil]))
print("Percentage of default: ", df[fil].def_pay.mean())
# each comparison gets its own parentheses, because & binds more tightly than > and <
fil = (df.PAY_AMT1 > df.BILL_AMT2) & (df.PAY_1 > 0)
print("Percentage of default if delay: ", df[fil].def_pay.mean())
print("Value counts of delay: ", df[fil].PAY_1.value_counts())
fil = (df.PAY_AMT1 > df.BILL_AMT2) & (df.PAY_1 < 1)
print("Percentage of default if no delay: ", df[fil].def_pay.mean())

That's weird. Maybe a month of delay is assigned if the payment for that month is 0. If this is the case, my previous conclusion is wrong and the PAY_1 feature becomes more important (although the last result suggests that having 1 month of delay doesn't really matter for these clients). Let's try to verify this

In [None]:
fil = ((df.PAY_6 == 0) & 
      ((df.PAY_5 > df.PAY_6) & (df.BILL_AMT6 > 0) & (df.PAY_AMT5 == 0))
      )

df[fil][['PAY_6', 'BILL_AMT6', 'PAY_AMT5', 'PAY_5', 'BILL_AMT5', 'PAY_AMT4', 'PAY_4', 'PAY_AMT3']]

In [None]:
fil = ((df.PAY_5 == 0) & (df.PAY_6 == 0) &
      ((df.PAY_4 > df.PAY_5) & (df.BILL_AMT5 > 0) & (df.PAY_AMT4 == 0))
      )

df[fil][['BILL_AMT6', 'PAY_AMT5', 'PAY_5', 'BILL_AMT5', 'PAY_AMT4', 'PAY_4', 'BILL_AMT4', 'PAY_AMT3']]

Ok, I am inclined to ignore the PAY_n features since I can't make sense of them. However, before looking at how important it can be to understand your data before making predictions, I would like to see what we get by simply feeding the machine everything we have.

# Blind machine learning

I call it blind because I will just throw everything I have at the models and nothing more than that. My hope is to see significant improvements once I engineer some features
