In [None]:
import pandas as pd
import seaborn as sns

# load data and rename columns
d = pd.read_csv('data/creditcard.csv')
d.columns = [col.lower() for col in d.columns]
d = d.rename(columns={
    'default payment next month': 'default', 
    'pay_0': 'pay_1'})

In [None]:
d.columns

By definition, column *pay_0* refers to the most recent month, the same month that columns *bill_amt1* and *pay_amt1* describe. Renaming *pay_0* to *pay_1* gives all three column families the same index, which makes data handling easier later.

Data Exploration
==============

Missing values
---------------------

In [None]:
d.isnull().sum(axis=1).value_counts()

There are no missing values in the dataset.

In [None]:
d['default'].value_counts().plot(kind='bar', title='Credit card default (0: no default, 1: default)', ylabel='count', xlabel='default')

Dataset imbalance
------------------

In [None]:
imbalance = d['default'].value_counts()
imbalance_perc = imbalance[1]/(imbalance[0]+imbalance[1])
imbalance_perc

The dataset is not too imbalanced; a 5:1 ratio seems acceptable for further analysis. If the ratio were worse, some strategies could be adopted: 
- https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
- https://towardsdatascience.com/having-an-imbalanced-dataset-here-is-how-you-can-solve-it-1640568947eb
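One tactic commonly used against class imbalance is to weight each class inversely to its frequency. The sketch below is an illustration only, not part of the original analysis: the helper name `balanced_class_weights` and the demo labels are hypothetical, and the formula `n_samples / (n_classes * class_count)` is the one scikit-learn uses for `class_weight='balanced'`.

```python
import pandas as pd

def balanced_class_weights(y: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = y.value_counts()
    n_samples, n_classes = len(y), len(counts)
    return {int(cls): n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Synthetic labels with a 5:1 ratio (illustrative, not the real data)
y_demo = pd.Series([0] * 50 + [1] * 10)
weights = balanced_class_weights(y_demo)
print(weights)
```

The minority class gets the larger weight, so a misclassified default costs more during training than a misclassified non-default.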

Column values validation
----------------------

In [None]:
d['sex'].value_counts()

There are only two values of sex, as expected.

In [None]:
d['education'].value_counts()

There are more education levels than expected. According to the dictionary there should be only three levels plus others: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown). Merge levels 0, 5 and 6 into the others category (4). 

In [None]:
d['education'] = d['education'].replace({x: 4 for x in [0, 5, 6]})
d['education'].value_counts()

**NOTE**: should the "others" category be placed below *graduate school* or above *high school*, i.e., should the category value be 0 or 4?

In [None]:
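def correlate_columns(df: pd.DataFrame, col_a: str, col_b: str) -> pd.DataFrame:
    """Count col_b values per col_a level; 'perc' is the share of the second col_b value."""
    # fill_value=0 keeps 'perc' defined when one col_b value never occurs for a level
    res = df.groupby([col_a, col_b]).size().unstack(fill_value=0)
    res['perc'] = res[res.columns[1]] / (res[res.columns[0]] + res[res.columns[1]])
    return res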

In [None]:
correlate_columns(d, 'education', 'default')

Let's order the education categories by percentage of defaults; the *others* category should then be 0.
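Ordering categories by default rate can be sketched as follows. The frame `demo` and its counts are made up for illustration and do not reflect the real dataset; only the `groupby(...).mean().sort_values()` pattern is the point.

```python
import pandas as pd

# Synthetic data: three education levels with different default rates
demo = pd.DataFrame({
    'education': [1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4],
    'default':   [0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of defaults per level
rate = demo.groupby('education')['default'].mean().sort_values()

# Levels listed from lowest to highest default rate
order = [int(level) for level in rate.index]
print(rate)
print(order)
```

A level that sorts first has the lowest default rate and is therefore a candidate for the lowest ordinal code.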

In [None]:
d['education'] = d['education'].replace({x: 0 for x in [0, 4, 5, 6]})
correlate_columns(d, 'education', 'default')

In [None]:
d['marriage'].value_counts()

Marriage status should take only the values (1=married, 2=single, 3=others); merge the unknown value 0 into the "others" category. 

In [None]:
d['marriage'] = d['marriage'].replace({0:3})
d['marriage'].value_counts()

**NOTE**: should the "others" category be ordered below or above the other marriage statuses?

In [None]:
correlate_columns(d, 'marriage', 'default')

In [None]:
pays = []
for i in [1,2,3,4,5,6]:
    pays.append(d[f'pay_{i}'].value_counts())
pays = pd.concat(pays, axis=1)
pays

In [None]:
pays.plot(kind='bar', title='pay values comparison', ylabel='# of instances', xlabel='payment delay for x months')

According to `dictionary.txt`, valid values for the *pay_\** fields are [-1, 1, 2, ..., 9]. Additional values are present:
- 0: can possibly be interpreted as "pay duly"?
- -2: interpretation unknown

Moreover, the very small number of *1* values in *pay_\** is suspicious. 
Let's see if we can make any sense of these values.
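A quick way to list values outside the documented set is a set difference per column. This is a minimal sketch: the helper `undocumented_values` and the frame `demo` are hypothetical, and the valid set is taken from the list quoted above.

```python
import pandas as pd

# Documented codes as quoted above: -1 and 1..9
VALID_PAY_VALUES = {-1} | set(range(1, 10))

def undocumented_values(df: pd.DataFrame, cols: list, valid: set) -> dict:
    """Return, per column, the sorted values that are not in `valid`."""
    return {
        col: sorted(int(v) for v in set(df[col].unique()) - set(valid))
        for col in cols
    }

# Synthetic stand-in for the pay_* columns
demo = pd.DataFrame({
    'pay_1': [-2, -1, 0, 1, 2],
    'pay_2': [-1, 0, 2, 2, 3],
})
extra = undocumented_values(demo, ['pay_1', 'pay_2'], VALID_PAY_VALUES)
print(extra)
```

On the real data the same call would be made with `d` and the six *pay_\** column names.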

In [None]:
res = []
for col in [f'pay_{i}' for i in range(1,7)]:
    res.append(correlate_columns(d, col, 'default')[['perc']].rename(columns={'perc':col}))
res = pd.concat(res, axis=1)
res.index.name = 'pay values'
sns.heatmap(res, annot=True)

Values 0 and -2 in *pay_\** may be related to the minimum payment, e.g., the user paid only the minimum, paid more than the minimum, or repaid the whole sum. Correlation with *default* shows that values 0, -1 and -2 behave similarly; based on this assumption we will merge them into a single value: 0. A more elaborate analysis of the meaning of *pay*, and of whether the time series of amounts (billing, payment) correspond, could follow.  

In [None]:
repayment = pd.melt(d[['default']+[f'pay_{i}' for i in range(1,7)]], id_vars='default', var_name='repayment', value_name='value')
sns.boxplot(y='value', x='repayment', hue='default', data=repayment)

Column *pay_1* seems to have a large discriminative effect on default: its boxplots do not overlap.

In [None]:
cols = [f'pay_{i}' for i in range(1,7)]
# merge the negative codes (-1, -2) into 0; clip keeps the integer dtype
d[cols] = d[cols].clip(lower=0)

**TODO**
- discriminative https://www.kaggle.com/selener/prediction-of-credit-card-default
- standardization, normalization
    - for Batch-GD or SGD scaling matters:
    https://www.quora.com/How-does-feature-scaling-affect-logistic-regression-model
    - for regularization scaling matters too:
    https://www.quora.com/How-does-feature-scaling-affect-logistic-regression-model
    - scalers should be fit on train data and then only used on test data
- split test/train
    - cross-validation for regularization
- plot log-loss
- correlation matrix
- feature selection
- feature engineering
    - tuple sex X education
    - payment vs. limit
- regularization
    - https://www.kdnuggets.com/2016/06/regularization-logistic-regression.html
- f1 score as performance metric
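Two of the items above, fitting the scaler on the training data only and using the F1 score, can be sketched with plain NumPy. This is an illustration under stated assumptions: the data is synthetic, the 25% hold-out size is arbitrary, and the helper `f1` is hypothetical. In practice scikit-learn's `StandardScaler` and `f1_score` cover the same steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features and labels (not the credit card data)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] > 50.0).astype(int)

# Shuffle the row indices and hold out a quarter of the rows as the test set
idx = rng.permutation(len(X))
n_test = len(X) // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, X_test = X[train_idx], X[test_idx]

# Fit the scaling parameters on the training rows only ...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ... and reuse them unchanged on the test rows to avoid leakage
X_train_s = (X_train - mean) / std
X_test_s = (X_test - mean) / std

def f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class (label 1)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

print(f1(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0])))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected: its statistics were never used for fitting.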