# Customer Data Preprocessing

Notebook Author: Matthew Kearns

This dataset was pulled from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Credit+Approval. 

The features carry no descriptive names, so we cannot select features based on prior knowledge of how each one relates to the output label. We can, however, remove records with multiple erroneous or missing values, since these samples could hurt the performance of our models. In full, we will perform the following preprocessing tasks to prepare the data:

    - Locate missing data/replace with sentinel NaN value
    - Perform record sampling to remove records with more than 1 missing value
    - Remove outliers using statistical analysis
    - Fill missing continuous values with column mean
    - Fill missing categorical values with column mode
    - Perform One Hot Encoding (OHE) for SVM classification
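
The fill and encoding steps in the list above can be sketched on a small example. This is a minimal illustration, not the notebook's actual implementation: the frame `toy` and its column names `num` and `cat` are hypothetical, and `pd.get_dummies` is assumed here as one convenient way to do One Hot Encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'num' is continuous, 'cat' is categorical.
toy = pd.DataFrame({
    'num': [1.0, 2.0, np.nan, 3.0],
    'cat': ['a', 'b', 'a', np.nan],
})

# Fill missing continuous values with the column mean.
toy['num'] = toy['num'].fillna(toy['num'].mean())

# Fill missing categorical values with the column mode (most frequent value).
toy['cat'] = toy['cat'].fillna(toy['cat'].mode()[0])

# One Hot Encode the categorical column (assumed approach: pd.get_dummies).
encoded = pd.get_dummies(toy, columns=['cat'])
print(encoded)
```

Filling before encoding means the imputed category gets its own indicator value, rather than leaving a row with all indicator columns set to zero.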

In [38]:
import pandas as pd
import numpy as np

In [55]:
df = pd.read_csv('./customer.data', header=None)
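
As a possible alternative to replacing `'?'` after loading, `read_csv` accepts an `na_values` argument that marks the placeholder as missing while parsing. The sketch below uses a hypothetical two-row sample (`sample`, `demo`) instead of the real file; whether this fits the rest of the notebook is an assumption.

```python
import io
import pandas as pd

# Hypothetical two-row sample shaped like the file: no header, '?' for missing.
sample = io.StringIO("b,30.83,0,u\n?,?,4.46,y\n")

# na_values='?' converts the placeholder to NaN during parsing, so a numeric
# column that contains '?' is read as float instead of as strings.
demo = pd.read_csv(sample, header=None, na_values='?')
print(demo.dtypes)
print(demo.isna().sum())
```

One consequence of this choice is that numeric columns containing `'?'` would then also appear in `describe()`, which only summarizes numeric columns by default.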

In [78]:
# replace missing values with sentinel NaN value
df = df.replace('?', np.nan)
df.describe()

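|       | 2        | 7        | 10      | 14          |
|-------|----------|----------|---------|-------------|
| count | 690.0    | 690.0    | 690.0   | 690.0       |
| mean  | 4.758725 | 2.223406 | 2.4     | 1017.385507 |
| std   | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min   | 0.0      | 0.0      | 0.0     | 0.0         |
| 25%   | 1.0      | 0.165    | 0.0     | 0.0         |
| 50%   | 2.75     | 1.0      | 0.0     | 5.0         |
| 75%   | 7.2075   | 2.625    | 3.0     | 395.5       |
| max   | 28.0     | 28.5     | 67.0    | 100000.0    |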


In [79]:
# Features 2, 7, 10, and 14 are the continuous columns reported by describe().
# None of these contain NaN values, as count() confirms (690 non-null each).
df.count()

0     678
1     678
2     690
3     684
4     684
5     681
6     681
7     690
8     690
9     690
10    690
11    690
12    690
13    677
14    690
15    690
dtype: int64
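
The same information can be read as missing counts rather than non-null counts. A minimal sketch on a hypothetical frame `toy` (not the real data), using `isna().sum()` as the complement of `count()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column labels are integers, as in the headerless data.
toy = pd.DataFrame({
    0: ['a', np.nan, 'b'],
    1: [1.0, 2.0, 3.0],
    2: [np.nan, np.nan, 'x'],
})

# isna().sum() counts missing values per column, the complement of count().
missing = toy.isna().sum()

# Show only the columns that actually have missing values.
print(missing[missing > 0])
```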

In [85]:
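# Count the non-null values in each record (16 columns in total) and keep the
# records with fewer than 15, i.e. those with more than one missing value.
# Despite its name, nan_count maps record index -> number of non-null values.
nan_count = dict(df.count(axis=1))
nan_count = {key: value for (key, value) in nan_count.items() if value < 15}
nan_count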

{206: 11,
 270: 11,
 330: 11,
 445: 14,
 456: 11,
 479: 13,
 539: 14,
 592: 11,
 601: 13,
 622: 11}
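
One way to drop such records is `DataFrame.dropna` with its `thresh` argument, which keeps rows having at least that many non-null values. The sketch below is an assumption about how the removal could be done, shown on a hypothetical 4-column frame `toy`. For the 16-column data the equivalent threshold would be 15, matching the `value < 15` filter above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 4 columns; rows have 0, 1, and 2 missing values.
toy = pd.DataFrame([
    ['a', 1.0, 'x', 10],
    ['b', np.nan, 'y', 20],
    [np.nan, np.nan, 'z', 30],
])

# Keep rows with at least (n_columns - 1) non-null values,
# i.e. drop rows with more than one missing value.
kept = toy.dropna(thresh=toy.shape[1] - 1)
print(kept.index.tolist())
```

Deriving the threshold from `toy.shape[1]` instead of hard-coding it keeps the rule "at most one missing value" intact if columns are added or removed.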