### Cleaning

In [1]:
import pandas as pd
from fancyimpute import KNN  # explicit import instead of *; only KNN is used below

Using TensorFlow backend.


In [2]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
target = df_train.target
df_train = df_train.drop('target', axis = 1)

Some awkward dtype conversions are needed, because NaN is not allowed in integer columns.

In [3]:
df_train.age = df_train.age.astype(float)
df_test.marriage = df_test.marriage.astype(float)
df_test.PAY_6 = df_test.PAY_6.astype(float)
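
As a small illustration of why the cast is needed (toy values, not the real data): a default integer column has no representation for NaN, while a float column does. The nullable `Int64` dtype shown at the end is an alternative that assumes a reasonably recent pandas; the notebook itself sticks with float.

```python
import numpy as np
import pandas as pd

# Toy integer column (hypothetical values, not taken from train.csv).
ints = pd.Series([25, 40, 33])

# Casting to float makes room for NaN, which is what the cell above does.
floats = ints.astype(float)
floats.iloc[1] = np.nan

# Alternative (assumption: pandas with nullable integer support):
# "Int64" keeps integer values and stores missing entries as <NA>.
nullable = pd.Series([25, None, 33], dtype="Int64")

print(ints.dtype, floats.dtype, nullable.dtype)
print(floats.isna().sum(), nullable.isna().sum())
```

Float is still the simpler choice here, since the imputation step later works on a plain numeric array.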

I want both dataframes concatenated so that the one-hot encoding is consistent: some categorical values exist only in train.

In [4]:
df_both = pd.concat([df_train, df_test], ignore_index=True)
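
A minimal sketch of the problem this avoids, using made-up category values: encoding the two frames separately yields different columns whenever a category is missing from one of them, while encoding the concatenated frame and splitting afterwards keeps the columns identical.

```python
import pandas as pd

# Hypothetical toy frames; 'other' appears only in train.
train = pd.DataFrame({"education": ["school", "university", "other"]})
test = pd.DataFrame({"education": ["school", "university"]})

# Encoding separately: the test frame lacks the 'education_other' column.
separate_train = pd.get_dummies(train, columns=["education"])
separate_test = pd.get_dummies(test, columns=["education"])
print(list(separate_train.columns))
print(list(separate_test.columns))

# Encoding the concatenated frame, then splitting by position.
both = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(both, columns=["education"])
encoded_train = encoded.iloc[:len(train)]
encoded_test = encoded.iloc[len(train):]
print(list(encoded_train.columns) == list(encoded_test.columns))
```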

### Data impurity
For some reason, the `marriage` column contains the value 33. I am assuming it is a typo for 3.

In [5]:
df_both.marriage = df_both.marriage.replace(33, 3)  # treat 33 as a typo for 3
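
One way such a stray value can be spotted is by comparing the observed codes against the expected ones. The sketch below uses toy data, and the set of valid codes is an assumption for illustration; the real list should come from the dataset's documentation.

```python
import pandas as pd

# Toy column (hypothetical values) with one stray code and one missing entry.
marriage = pd.Series([1.0, 2.0, 3.0, 33.0, 2.0, None], name="marriage")

# Assumption: these are the valid codes; check the data dictionary.
expected_codes = {1, 2, 3}

# Frequency table including NaN, useful for eyeballing odd values.
counts = marriage.value_counts(dropna=False)
print(counts)

# Codes that occur in the data but are not expected.
unexpected = sorted(set(marriage.dropna().unique()) - expected_codes)
print(unexpected)

# Same fix as in the notebook: map 33 to 3 and leave NaN untouched.
cleaned = marriage.replace(33, 3)
```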

In [6]:
# df_both = df_both.drop(['n_children', 'profession'], axis = 1) # Maybe these two variables are not important

### Categorical variable -> One-hot-encoding

In [7]:
df = pd.get_dummies(df_both, columns = ['education', 'sex', 'marriage', 'profession'])
# df = pd.get_dummies(df, columns = [f'PAY_{x}' for x in range(1,6)]) # Seems to have no effect

In [8]:
for i in range(1,6):
    df[f'PAY_{i}_minus2'] = df[f'PAY_{i}'] == -2
    df[f'PAY_{i}_minus1'] = df[f'PAY_{i}'] == -1

### Min-max normalisation

In [18]:
normalisation_columns = (
    ['limit_balance', 'n_children', 'age']
    + [f'PAY_{i}' for i in range(1, 6)]
    + [f'BILL_AMT{i}' for i in range(1, 6)]
    + [f'PAY_AMT{i}' for i in range(1, 6)]
)

In [20]:
for col_name in normalisation_columns:
    col = df[col_name]
    df[col_name] = (col-col.min())/(col.max()-col.min())
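
The loop above maps each column to [0, 1] via (x - min) / (max - min). Putting the columns on a common scale matters because KNN imputation is distance-based. The sketch below shows the same formula on toy values, with a guard for constant columns (where max equals min and the division would give NaN). The helper name `min_max_scale` is hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd

def min_max_scale(col: pd.Series) -> pd.Series:
    """Scale a column to [0, 1]; NaN stays NaN, a constant column maps to 0."""
    span = col.max() - col.min()  # min/max skip NaN by default
    if span == 0:
        return col - col.min()    # avoid 0/0 for constant columns
    return (col - col.min()) / span

# Toy column with a missing value (hypothetical ages).
age = pd.Series([20.0, np.nan, 30.0, 40.0])
scaled = min_max_scale(age)
print(scaled.tolist())

# A constant column would otherwise turn into all NaN.
constant = min_max_scale(pd.Series([5.0, 5.0]))
print(constant.tolist())
```

Missing values pass through unchanged, so the scaling can run before imputation, as it does in this notebook.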

### Imputation with [Fancyimpute](https://stackoverflow.com/questions/45321406/missing-value-imputation-in-python-using-knn)

In [23]:
%%time
X_imputed = KNN(k=3).solve(df.values, df.isnull().values)
# X_imputed = IterativeImputer().fit_transform(X=df.values)#, missing_mask= df.isnull().values)

Imputing row 1/30000 with 0 missing, elapsed time: 277.262
Imputing row 101/30000 with 0 missing, elapsed time: 277.282
Imputing row 201/30000 with 0 missing, elapsed time: 277.283
Imputing row 301/30000 with 0 missing, elapsed time: 277.283
Imputing row 401/30000 with 0 missing, elapsed time: 277.284
Imputing row 501/30000 with 0 missing, elapsed time: 277.284
Imputing row 601/30000 with 0 missing, elapsed time: 277.285
Imputing row 701/30000 with 0 missing, elapsed time: 277.285
Imputing row 801/30000 with 0 missing, elapsed time: 277.285
Imputing row 901/30000 with 0 missing, elapsed time: 277.285
Imputing row 1001/30000 with 0 missing, elapsed time: 277.286
Imputing row 1101/30000 with 0 missing, elapsed time: 277.286
Imputing row 1201/30000 with 0 missing, elapsed time: 277.286
Imputing row 1301/30000 with 0 missing, elapsed time: 277.286
Imputing row 1401/30000 with 0 missing, elapsed time: 277.287
Imputing row 1501/30000 with 0 missing, elapsed time: 277.287
Imputing row 1601/30
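
To make the idea behind the step above concrete, here is a deliberately simplified KNN imputation in plain NumPy. It is not fancyimpute's implementation: it is unweighted and quadratic in the number of rows. The function name `knn_impute` and the toy matrix are assumptions for illustration. Each missing entry is filled with the mean of that feature over the k nearest rows, with distances computed only over features observed in both rows.

```python
import numpy as np

def knn_impute(X: np.ndarray, k: int = 3) -> np.ndarray:
    """Fill NaNs with the mean of the k nearest rows (illustrative only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        # Mean squared difference over features observed in both rows.
        shared = ~missing[i] & ~missing
        diffs = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(len(X), np.inf)
        ok = n_shared > 0
        dist[ok] = (diffs[ok] ** 2).sum(axis=1) / n_shared[ok]
        dist[i] = np.inf  # a row is not its own neighbour
        for j in np.flatnonzero(missing[i]):
            # Only rows that observe feature j and are comparable to row i.
            candidates = np.flatnonzero(~missing[:, j] & np.isfinite(dist))
            if candidates.size == 0:
                continue  # nothing to borrow from; leave NaN
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled

# Toy matrix: row 1 is missing its second feature.
demo = np.array([[0.0, 0.0],
                 [0.1, np.nan],
                 [1.0, 1.0],
                 [0.0, 0.2]])
result = knn_impute(demo, k=2)
print(result)
```

In the toy matrix, the two rows closest to row 1 on the first feature supply the fill value, while the distant row `[1.0, 1.0]` is ignored.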

In [24]:
imputed_df = pd.DataFrame(X_imputed, columns = df.columns)

In [25]:
n_train = len(df_train)  # train rows come first in df_both
# .copy() avoids the SettingWithCopyWarning when adding the target column
df_train_imputed = imputed_df.iloc[:n_train].copy()
df_train_imputed['target'] = target
df_test_imputed = imputed_df.iloc[n_train:]


In [26]:
df_train_imputed.to_csv('train_imputed_normalised.csv', index = False)
df_test_imputed.to_csv('test_imputed_normalised.csv', index = False)