# Customer Data Preprocessing

Notebook Author: Matthew Kearns

This dataset was pulled from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Credit+Approval. 

The feature names are anonymized, so we cannot perform feature selection based on prior knowledge of how each feature relates to the output label. We can, however, remove records with multiple missing values, since these samples could hurt the performance of our models. In full, we will perform the following preprocessing tasks to prepare the data:

- Locate missing data and replace it with the sentinel NaN value
- Perform record sampling to remove records with more than one missing value
- Fill missing continuous values with the column mean
- Fill missing categorical values with the column mode
- Remove outliers using statistical analysis
- Perform a one-hot encoding of the categorical features

In [38]:
import pandas as pd
import numpy as np

In [55]:
df = pd.read_csv('./customer.data', header=None)

In [78]:
# the raw file marks missing values with '?'; replace that placeholder with the sentinel NaN value
df = df.replace('?', np.nan)
df.describe()

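|       | 2        | 7        | 10      | 14          |
|-------|----------|----------|---------|-------------|
| count | 690.0    | 690.0    | 690.0   | 690.0       |
| mean  | 4.758725 | 2.223406 | 2.4     | 1017.385507 |
| std   | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min   | 0.0      | 0.0      | 0.0     | 0.0         |
| 25%   | 1.0      | 0.165    | 0.0     | 0.0         |
| 50%   | 2.75     | 1.0      | 0.0     | 5.0         |
| 75%   | 7.2075   | 2.625    | 3.0     | 395.5       |
| max   | 28.0     | 28.5     | 67.0    | 100000.0    |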
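As an alternative to replacing `'?'` after loading, the placeholder can be declared while parsing. The sketch below uses a small illustrative excerpt (an assumption, not the real file) to show that `na_values='?'` lets numeric columns keep a float dtype, whereas a column containing the string `'?'` would otherwise be read as text and left out of `describe()`.

```python
import io
import pandas as pd

# Illustrative three-row excerpt in the same comma-separated layout as the
# raw file; the rows are an assumption made for this example.
sample = io.StringIO(
    "b,30.83,0.0,u\n"
    "a,?,4.46,u\n"
    "?,24.50,0.5,?\n"
)

# na_values='?' converts the placeholder to NaN during parsing, so a
# numeric column with gaps is still read as float64.
demo = pd.read_csv(sample, header=None, na_values='?')

print(demo.dtypes)
print(demo.isna().sum())
```

Whether to parse this way or to replace afterwards is a design choice; the parsing route mainly saves a later type conversion.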


In [79]:
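# describe() treats features 2, 7, 10, and 14 as numeric, and none of these contain NaN values.
# features 1 and 13 also hold numeric values but are stored as strings, and both have missing entries.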
df.count()

0     678
1     678
2     690
3     684
4     684
5     681
6     681
7     690
8     690
9     690
10    690
11    690
12    690
13    677
14    690
15    690
dtype: int64
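The non-null counts above can also be read the other way round, as missing counts per column. A minimal sketch on a made-up frame (the column contents are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two of its three columns.
demo = pd.DataFrame({
    0: ['b', 'a', np.nan, 'b'],
    1: ['30.83', np.nan, '24.50', np.nan],
    2: [0.0, 4.46, 0.5, 1.54],
})

# isna() marks missing cells; summing per column counts them.
missing_per_column = demo.isna().sum()

# Keep only the columns that actually have gaps.
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```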

In [86]:
# each record has 16 fields, so a non-null count below 15 means more than one missing value.
# the 10 samples below meet that condition and should be removed
nan_vals = dict(df.count(axis=1))
nan_vals = {key:value for (key,value) in nan_vals.items() if value < 15}
nan_vals

{206: 11,
 270: 11,
 330: 11,
 445: 14,
 456: 11,
 479: 13,
 539: 14,
 592: 11,
 601: 13,
 622: 11}
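The same row filter can be expressed with `isna()` and `dropna(thresh=...)`, which may read more directly than counting non-null cells. This is a sketch on a hypothetical three-column frame; `max_missing` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has two missing values, rows 2 and 3 have one.
demo = pd.DataFrame({
    'a': ['x', np.nan, 'y', np.nan],
    'b': [1.0, np.nan, 2.0, 3.0],
    'c': ['p', 'q', np.nan, 'r'],
})

max_missing = 1  # tolerate at most one missing value per record

# Count missing cells per row and list the rows over the limit.
missing_per_row = demo.isna().sum(axis=1)
to_drop = missing_per_row[missing_per_row > max_missing]

# dropna(thresh=k) keeps rows with at least k non-missing values.
kept = demo.dropna(thresh=demo.shape[1] - max_missing)
print(to_drop)
print(kept)
```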

In [87]:
# drop the records with more than one missing value from the data frame
df = df.drop(index=list(nan_vals))

In [89]:
# this still leaves us with 680 samples for training/testing
len(df)

680

In [172]:
# describe() shows no missing values in the numeric columns, but other columns still have gaps.
# fill them with each column's most frequent value (the mode). this includes columns 1 and 13,
# whose numeric values are stored as strings.
fill = df.mode().iloc[0]  # first mode of every column, indexed by column label
df = df.fillna(fill)
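The plan at the top calls for the column mean on continuous features and the mode on categorical ones. A minimal sketch of that split follows, on a hypothetical two-column frame; which columns count as continuous is an assumption here and would need to be decided for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'amount' holds numbers stored as strings, as happens
# when a '?' placeholder forces a column to be read as text.
demo = pd.DataFrame({
    'category': ['u', 'u', np.nan, 'y'],
    'amount': ['1.0', np.nan, '3.0', '5.0'],
})

# Assumption: 'amount' is continuous and 'category' is categorical.
demo['amount'] = pd.to_numeric(demo['amount'])

# Continuous column: fill gaps with the column mean.
demo['amount'] = demo['amount'].fillna(demo['amount'].mean())

# Categorical column: fill gaps with the most frequent value.
demo['category'] = demo['category'].fillna(demo['category'].mode().iloc[0])
print(demo)
```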

In [177]:
df.head()

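|   | 0 | 1     | 2     | 3 | 4 | 5 | 6 | 7    | 8 | 9 | 10 | 11 | 12 | 13  | 14  | 15 |
|---|---|-------|-------|---|---|---|---|------|---|---|----|----|----|-----|-----|----|
| 0 | b | 30.83 | 0.0   | u | g | w | v | 1.25 | t | t | 1  | f  | g  | 202 | 0   | +  |
| 1 | a | 58.67 | 4.46  | u | g | q | h | 3.04 | t | t | 6  | f  | g  | 43  | 560 | +  |
| 2 | a | 24.5  | 0.5   | u | g | q | h | 1.5  | t | f | 0  | f  | g  | 280 | 824 | +  |
| 3 | b | 27.83 | 1.54  | u | g | w | v | 3.75 | t | t | 5  | t  | g  | 100 | 3   | +  |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0  | f  | s  | 120 | 0   | +  |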
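The two remaining steps in the plan, outlier removal and one-hot encoding, might look roughly like the sketch below. The 1.5 × IQR rule is an assumption chosen for illustration, since the plan only says "statistical analysis", and the frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one categorical and one continuous column.
demo = pd.DataFrame({
    'kind': ['u', 'y', 'u', 'y', 'u', 'u'],
    'amount': [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
})

# Assumed outlier rule: keep values within 1.5 * IQR of the quartiles.
q1, q3 = demo['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = demo['amount'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = demo[in_range]

# One-hot encode the categorical column; numeric columns pass through.
encoded = pd.get_dummies(trimmed, columns=['kind'])
print(encoded)
```

Other rules, such as a z-score cutoff, would slot into the same place as the `between` filter.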
