## CMT307 Lab 1 Exercise
Download the Credit Approval Data Set from the UCI Machine Learning Repository and work through the following tasks:

- practise exploratory data analysis
- deal with missing values, if any
- encode categorical features
- scale features
- if you have time, implement a classifier to predict whether a credit card application is approved (+ in the last column) or rejected (- in the last column)

You can read more about the data set at https://archive.ics.uci.edu/ml/datasets/Credit+Approval

In [4]:
import pandas as pd

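# Get Credit Approval Data Set
# Note: crx.data has no header row, so header='infer' consumes the first record as
# column names and 689 of the 690 records remain. Pass header=None to keep every record.
crx = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data', header='infer')
crx.columns = ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15', 'Target']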

# Start writing your IPython notebook............
crx

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,Target
0,a,58.67,4.460,u,g,q,h,3.040,t,t,6,f,g,00043,560,+
1,a,24.50,0.500,u,g,q,h,1.500,t,f,0,f,g,00280,824,+
2,b,27.83,1.540,u,g,w,v,3.750,t,t,5,t,g,00100,3,+
3,b,20.17,5.625,u,g,w,v,1.710,t,f,0,f,s,00120,0,+
4,b,32.08,4.000,u,g,m,v,2.500,t,f,0,t,g,00360,0,+
5,b,33.17,1.040,u,g,r,h,6.500,t,f,0,t,g,00164,31285,+
6,a,22.92,11.585,u,g,cc,v,0.040,t,f,0,f,g,00080,1349,+
7,b,54.42,0.500,y,p,k,h,3.960,t,f,0,f,g,00180,314,+
8,b,42.50,4.915,y,p,w,v,3.165,t,f,0,t,g,00052,1442,+
9,b,22.08,0.830,u,g,c,h,2.165,f,f,0,t,g,00128,0,+


## Sample solution
This sample solution only illustrates one way to implement the tasks in the exercise. It is not necessarily the most correct or the best solution, nor is it optimised for classification performance.

In [3]:
#pd.set_option('display.max_columns', None, 'max_colwidth', None, 'display.expand_frame_repr', False) # print all columns in full, prevent line break

print(crx.shape)
print(crx.head())
print(crx.info())
print(crx.describe())

(689, 16)
  A1     A2     A3 A4 A5 A6 A7    A8 A9 A10  A11 A12 A13    A14  A15 Target
0  a  58.67  4.460  u  g  q  h  3.04  t   t    6   f   g  00043  560      +
1  a  24.50  0.500  u  g  q  h  1.50  t   f    0   f   g  00280  824      +
2  b  27.83  1.540  u  g  w  v  3.75  t   t    5   t   g  00100    3      +
3  b  20.17  5.625  u  g  w  v  1.71  t   f    0   f   s  00120    0      +
4  b  32.08  4.000  u  g  m  v  2.50  t   f    0   t   g  00360    0      +
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689 entries, 0 to 688
Data columns (total 16 columns):
A1        689 non-null object
A2        689 non-null object
A3        689 non-null float64
A4        689 non-null object
A5        689 non-null object
A6        689 non-null object
A7        689 non-null object
A8        689 non-null float64
A9        689 non-null object
A10       689 non-null object
A11       689 non-null int64
A12       689 non-null object
A13       689 non-null object
A14       689 non-null object
A15     

It is clear from crx.head() that A2 and A14 are numeric attributes, but crx.info() reports them as 'object' type and crx.describe() does not include them. This indicates that A2 and A14 contain values of another data type, e.g., strings used to represent missing values. If you print the columns using the following code, you can see that A2 contains question marks '?' for missing values. You may also inspect the dataset for missing values in Excel.

In [None]:
pd.set_option('display.max_rows', None)
print(crx[['A2', 'A14']])
#print('\nA2\n', crx['A2'].to_numpy())
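Printing every row works, but counting the placeholder per column is easier to scan. The following is a minimal sketch on a small hand-made frame; `sample` and its values are hypothetical stand-ins for `crx`, chosen only to show the idea.

```python
import pandas as pd

# Hypothetical miniature of the data: '?' marks a missing value, as in crx.data.
sample = pd.DataFrame({
    'A1': ['a', 'b', '?', 'b'],
    'A2': ['58.67', '?', '27.83', '20.17'],
    'A3': [4.46, 0.5, 1.54, 5.625],
    'A14': ['00043', '00280', '?', '?'],
})

# Element-wise comparison with '?', then count the matches in each column.
question_marks = (sample == '?').sum()
print(question_marks)

# Columns holding '?' are stored as strings rather than as numbers.
print(sample.dtypes)
```

Applied to the real data, the same expression would be `(crx == '?').sum()`.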

In [None]:
import numpy as np
crx.replace(to_replace='?', value=np.nan, inplace=True)
crx[['A2', 'A14']] = crx[['A2', 'A14']].astype(float)

crx.info()
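As an alternative to replacing '?' after loading, the placeholder can be declared when the file is read. The sketch below assumes three inline records in the crx.data layout instead of the real download; the positions of the '?' values are made up for illustration, and `raw_text`, `names` and `df` are hypothetical names.

```python
import io
import pandas as pd

# Three records in the crx.data layout (no header row); '?' positions are invented.
raw_text = (
    "a,58.67,4.460,u,g,q,h,3.040,t,t,6,f,g,00043,560,+\n"
    "a,?,0.500,u,g,q,h,1.500,t,f,0,f,g,?,824,+\n"
    "?,27.83,1.540,u,g,w,v,3.750,t,t,5,t,g,00100,3,-\n"
)
names = [f'A{i}' for i in range(1, 16)] + ['Target']

# header=None keeps the first record as data; na_values='?' turns the placeholder into NaN.
df = pd.read_csv(io.StringIO(raw_text), header=None, names=names, na_values='?')

# Missing values per column, and the dtypes pandas inferred.
missing = df.isna().sum()
print(missing[missing > 0])
print(df[['A2', 'A14']].dtypes)
```

With the placeholder handled at load time, A2 and A14 are parsed as floats directly, so no separate `astype(float)` step is needed.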

In [5]:
crx_x = crx[['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15']]
crx_y = crx['Target']  # 1-D Series; a one-column DataFrame triggers a DataConversionWarning when fitting

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(crx_x, crx_y, test_size=0.3, random_state=42)
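The split above is purely random. If you want both parts to keep the same proportion of approved and rejected applications, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up toy data (`X`, `y` and the class counts are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 6 approved ('+') and 4 rejected ('-').
X = pd.DataFrame({'feature': range(10)})
y = pd.Series(['+'] * 6 + ['-'] * 4, name='Target')

# stratify=y preserves the class proportions in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

On the real data this would amount to adding `stratify=crx_y` to the call above.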

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # sklearn.impute requires scikit-learn 0.20 or later
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
#from sklearn.compose import make_column_transformer

# transformer for categorical features
categorical_features = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
categorical_transformer = Pipeline(
    [
        ('imputer_cat', SimpleImputer(strategy = 'most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
    ]
)

# transformer for numerical features
numeric_features = ['A2', 'A3', 'A8', 'A11', 'A14', 'A15']
numeric_transformer = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy = 'median')),
        ('scaler', MinMaxScaler())
    ]
)

# combine them in a single ColumnTransformer
preprocessor = ColumnTransformer(
    [
        ('categoricals', categorical_transformer, categorical_features),
        ('numericals', numeric_transformer, numeric_features)
    ],
    remainder = 'drop'
)


#crx_processed = preprocessor.fit_transform(crx_x)

#np.set_printoptions(threshold=np.inf, linewidth=np.inf, suppress=True, precision=2)
#print(crx_processed[0:10, :])
#crx_processed.shape

ModuleNotFoundError: No module named 'sklearn.impute'
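To see what such a preprocessor produces, it can help to run the same structure on a frame small enough to check by hand. The sketch below assumes scikit-learn 0.20 or later; `toy`, its column names and `toy_preprocessor` are hypothetical and mirror the imputer, encoder and scaler choices above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame with one categorical and one numeric column, each with a gap.
toy = pd.DataFrame({
    'colour': pd.Series(['red', 'blue', np.nan, 'red'], dtype=object),
    'size': [1.0, np.nan, 3.0, 5.0],
})

toy_preprocessor = ColumnTransformer([
    # Categorical: fill gaps with the most frequent value, then one-hot encode.
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['colour']),
    # Numeric: fill gaps with the median, then scale to the range [0, 1].
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', MinMaxScaler()),
    ]), ['size']),
])

toy_processed = toy_preprocessor.fit_transform(toy)

# The result can be a sparse matrix or a dense array; convert to dense for printing.
toy_dense = toy_processed.toarray() if hasattr(toy_processed, 'toarray') else np.asarray(toy_processed)
print(toy_dense.shape)
print(toy_dense)
```

The first two output columns are the one-hot encoding of `colour` (categories in sorted order) and the last column is the scaled `size`.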

In [8]:
from sklearn.neighbors import KNeighborsClassifier

myClassfier = Pipeline(
    [
     ('preprocessing', preprocessor),
     ('classifier', KNeighborsClassifier())
    ]
)
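KNeighborsClassifier uses Euclidean distance by default, so a feature with a large range can dominate the neighbour search. That is one reason to scale before this classifier. A small sketch with made-up points (`points` and `nearest_to_first` are hypothetical) shows the nearest neighbour changing after min-max scaling:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical points: column 0 has a small range, column 1 a large one.
points = np.array([
    [0.0, 1000.0],
    [10.0, 1000.0],
    [0.0, 1200.0],
    [10.0, 5000.0],
])

def nearest_to_first(data):
    """Return the row index of the point closest to row 0 (Euclidean distance)."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

# On the raw values, the large-range column decides which point is nearest.
raw_nearest = nearest_to_first(points)

# After min-max scaling, both columns lie in [0, 1] and contribute comparably.
scaled_nearest = nearest_to_first(MinMaxScaler().fit_transform(points))

print(raw_nearest, scaled_nearest)
```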

In [None]:
myClassfier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

y_pred = myClassfier.predict(X_test)
accuracy_score(y_test, y_pred)
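Accuracy alone does not show which class the errors fall on. A confusion matrix breaks the predictions down by true and predicted label. The sketch below uses made-up label lists (`y_true` and `y_hat` are hypothetical), not results from the credit data:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ['+', '+', '-', '-', '-', '+']
y_hat = ['+', '-', '-', '-', '+', '+']

acc = accuracy_score(y_true, y_hat)

# Rows are true labels and columns are predicted labels, both in the order given by `labels`.
cm = confusion_matrix(y_true, y_hat, labels=['+', '-'])

print(acc)
print(cm)
```

With the pipeline above, the equivalent call would be `confusion_matrix(y_test, y_pred, labels=['+', '-'])`.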