In [1]:
import pandas as pd

#import dataset
df = pd.read_csv('./HouseholderAtRisk.csv')

# Task 1
## Data selection and distribution (4 marks)

1. What is the proportion of householders who have high risk?


In [2]:
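counts = df['High'].value_counts()
print(counts.to_frame())

# proportion of high-risk householders, computed from the counts rather than hard-coded
total = counts.sum()
print('\n', (counts['High'] / total) * 100, '%', ' of householders have a high risk value')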

       High
High  30497
Low    9501

 76.24631231561578 %  of householders have a high risk value
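
The same proportion can be read straight from `value_counts` without any arithmetic on the counts. A minimal sketch, assuming a small made-up series in place of the real `High` column (the four values below are illustrative only):

```python
import pandas as pd

# Toy stand-in for df['High']; the values are made up for illustration.
risk = pd.Series(['High', 'High', 'High', 'Low'])

# normalize=True returns fractions that sum to 1 instead of raw counts.
proportions = risk.value_counts(normalize=True)
high_pct = proportions['High'] * 100
print(f'{high_pct:.2f}% of householders are high risk')
```

This avoids having to update the numbers by hand if rows are added or removed upstream.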


2. Did you have to fix any data quality problems? Detail them.
    Apply the imputation method(s) to the variable(s) that need it. List the variables that needed it. Justify your choice of imputation if needed.

**Data quality problems**
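1. Renamed each column to its corresponding attribute name
2. Dropped columns with over 95% NaN values (`race`, `capital_loss`, `capital_gain`, `capital_avg`)
3. Dropped rows with fewer than 8 non-null values (rows that hold little more than an ID)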
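
The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-missing values a row needs in order to be kept. A minimal sketch on a made-up three-column frame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, row 1 has only an id, row 2 has two values.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'age': [25, np.nan, 40],
    'hours': [40, np.nan, np.nan],
})

# thresh=2 keeps rows that have at least 2 non-missing values.
kept = toy.dropna(thresh=2)
print(kept['id'].tolist())
```

Only the row that holds nothing but an `id` is removed, which is the behaviour wanted for the near-empty rows in this dataset.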


**Imputation**
1. Handle missing `occupation` values by removing the rows: only a small percentage is affected and there is no obvious replacement value
2. Handle missing `weighting` values by removing the rows: only a small percentage is affected and there is no obvious replacement value
3. Impute missing `work_class` values with the mode (most frequent value), which covers a large majority of rows
4. Impute missing `country_of_origin` values with the mode (most frequent value), which covers a large majority of rows
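
For a categorical column the natural replacement value is the mode rather than a median. A minimal sketch, assuming a toy series that uses `?` as its missing marker as this dataset does; excluding `?` before taking the mode is an added safeguard, not something the notebook does:

```python
import pandas as pd

# Toy stand-in for a categorical column that uses '?' as its missing marker.
work_class = pd.Series(['Private', '?', 'Private', 'State-gov', '?'])

# Compute the mode over known values only, so '?' can never be chosen.
known = work_class[work_class != '?']
most_common = known.mode().iloc[0]

imputed = work_class.replace('?', most_common)
print(imputed.tolist())
```

Filtering first matters when the missing marker is itself very frequent, as in the toy series where `?` ties with `Private`.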

**DataTypes**
1. Change all object types to categorical, then convert categorical to representative `int` values

**Bin Values**
# TODO
1. Bin `weighting` to remove problem of large range of singular values
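
One possible way to carry out this binning is sketched below. It swaps the equal-width `pd.cut` from the commented-out cell for quantile-based `pd.qcut`, on the assumption that `weighting` is skewed; the values, the number of bins and the labels are all made up for illustration:

```python
import pandas as pd

# Toy values spanning a wide range, standing in for the `weighting` column.
weighting = pd.Series([12000, 50000, 180000, 250000, 900000, 1400000])

# Quantile-based bins: each of the 3 bins receives roughly the same number of rows.
binned = pd.qcut(weighting, q=3, labels=['low', 'mid', 'high'])
print(binned.value_counts().sort_index())
```

Equal-width bins would place most rows in the lowest bin when a few very large values stretch the range, whereas quantile bins stay balanced.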

In [3]:
# list unique values for each column
# check data matches description

# for column in df.columns:
#     print('\nColumn name: ' + column )
#     print(df[column].unique())

In [4]:
### DATA QUALITY
## 1
# rename each column to correct attribute name 
# according to task description
df.rename(columns= {
    '1': 'id',
    '25': 'age',
    ' Private': 'work_class',
    '224942': 'weighting',
    ' 11th': 'education',
    '7': 'num_years_education',
    ' Never-married': 'marital_status',
    ' Machine-op-inspct': 'occupation',
    ' Own-child': 'relationship',
    'Unnamed: 9': 'race',
    ' Male': 'gender',
    '0': 'capital_loss',
    '0.1': 'capital_gain',
    '0.2': 'capital_avg',
    '40': 'num_working_hours_per_week',
    '0.3': 'sex',
    ' US': 'country_of_origin',
    'High': 'at_risk',
    
}, inplace=True)

## 2
# Remove near empty columns
df.drop(columns=['race', 'capital_loss', 'capital_gain', 'capital_avg'], inplace=True)

## 3
# Remove rows that hold little more than an ID:
# thresh=8 keeps only rows with at least 8 non-null values
df.dropna(thresh=8, inplace=True)

## strip leading and trailing whitespace from all string columns
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

    

In [5]:
import numpy as np

### IMPUTATION
## 1
# remove rows in 'occupation' with ? value as it is hard to impute a sensible value
df.drop(df.loc[df['occupation'] == '?'].index, inplace=True)

## 2
# remove rows in 'weighting' that are NaN as it is hard to impute a sensible value
df.dropna(subset=['weighting'], inplace=True)

## 3
# impute rows with ? in 'work_class' using the mode (most frequent value)
df['work_class'] = df['work_class'].replace('?', df['work_class'].value_counts().idxmax())

## 4
# impute rows with ? in 'country_of_origin' using the mode (most frequent value)
df['country_of_origin'] = df['country_of_origin'].replace('?', df['country_of_origin'].value_counts().idxmax())

In [6]:
# ### DATA TYPES
# ## 1
# # change objects to category
# df['marital_status'] = df['marital_status'].astype('category')
# df['occupation'] = df['occupation'].astype('category')
# df['relationship'] = df['relationship'].astype('category')
# df['work_class'] = df['work_class'].astype('category')


# ### TODO USE CATEGORIES IN MODEL NOT INT
# cat_columns = df.select_dtypes(['category']).columns
# df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

# print (df.info())



In [7]:
## 2
# change from float to int
df['age'] = df['age'].astype(int)
df['num_years_education'] = df['num_years_education'].astype(int)
df['num_working_hours_per_week'] = df['num_working_hours_per_week'].astype(int)
df['weighting'] = df['weighting'].astype(int)

In [8]:
# convert to binary var
binary_map = {'High':0, 'Low': 1}
df['at_risk'] = df['at_risk'].map(binary_map)

In [9]:
## TODO
### Bin rows
##df['weighting'] = pd.cut(df['weighting'], 20)
##print(df['weighting'].value_counts())

3. The dataset may include irrelevant and redundant variables. What variables did you include in the analysis and what were their roles and measurement level set? Justify your choice.

**1. Dropped because a large majority of cells hold a single value**
`country_of_origin`, `work_class`

**2. Other**
`relationship`: `Husband` and `Wife` replaced with a single `Married` value

`gender`: dropped, the `sex` column describes the same data

`education`: dropped, `num_years_education` is a better descriptor

`id`: dropped, it is an identifier rather than data
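
Before dropping `gender` as redundant, it can be worth confirming that it really maps one-to-one onto `sex`. A minimal sketch on made-up rows; the coding of `Male` as 0 and `Female` as 1 is an assumption for illustration and is not confirmed by the notebook:

```python
import pandas as pd

# Toy stand-ins for the `gender` (text) and `sex` (numeric) columns.
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'sex': [0.0, 1.0, 0.0, 1.0, 0.0],
})

# If each gender maps to exactly one sex code, one of the columns is redundant.
codes_per_gender = toy.groupby('gender')['sex'].nunique()
is_redundant = bool((codes_per_gender == 1).all())
print(is_redundant)
```

The same check applies to `education` against `num_years_education`.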


In [10]:
# print(df['Race'].value_counts(dropna=False), '\n')
# print(df['CapitalLoss'].value_counts(bins = 5), '\n')
# print(df['CapitalGain'].value_counts(bins = 5), '\n')
# print(df['CapitalAvg'].value_counts(bins=5), '\n')
# print(df['CountryOfOrigin'].value_counts(), '\n')

In [11]:
# Merge 'Husband' and 'Wife' into a single 'Married' value
df['relationship'] = df['relationship'].replace(['Husband', 'Wife'], 'Married')

print(df['relationship'].value_counts())

Married           16696
Not-in-family      9523
Own-child          5316
Unmarried          3842
Other-relative     1099
Name: relationship, dtype: int64


In [12]:

## 1
df.drop(columns=['country_of_origin', 'work_class'], inplace=True)
## 2
# Drop
df.drop(columns=['gender', 'education', 'id'], inplace=True)



# one-hot encode the remaining categorical columns; at_risk is already numeric, so it is left as-is
df = pd.get_dummies(df)

print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36476 entries, 0 to 39997
Data columns (total 32 columns):
age                                     36476 non-null int64
weighting                               36476 non-null int64
num_years_education                     36476 non-null int64
num_working_hours_per_week              36476 non-null int64
sex                                     36476 non-null float64
at_risk                                 36476 non-null int64
marital_status_Divorced                 36476 non-null uint8
marital_status_Married-AF-spouse        36476 non-null uint8
marital_status_Married-civ-spouse       36476 non-null uint8
marital_status_Married-spouse-absent    36476 non-null uint8
marital_status_Never-married            36476 non-null uint8
marital_status_Separated                36476 non-null uint8
marital_status_Widowed                  36476 non-null uint8
occupation_Adm-clerical                 36476 non-null uint8
occupation_Armed-Forces            

4. What distribution scheme did you use? What “data partitioning allocation” did you set? Explain your selection. (Hint: Take the lead from Week 2 lecture on data distribution)
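
The split used later in this notebook is a stratified 70/30 partition (`test_size=0.3`, `stratify=y`, `random_state=10`). A minimal sketch of what stratification guarantees, using a made-up 80/20 target in place of `at_risk`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 80% class 0 and 20% class 1.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=10
)
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```

With an imbalanced target such as `at_risk`, stratifying keeps the test set representative of the class mix the model was trained on.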

In [13]:
# split data to y = target value, x  = rest
y = df['at_risk']
X = df.drop(['at_risk'], axis=1)

# Task 2
## Predictive Modelling Using Decision Trees (4 marks)
1. Build a decision tree using the default setting. Examine the tree results and answer the following:

  a. What is classification accuracy on training and test datasets?
  
     ___Train accuracy:___  0.9998041749892296
     
     ___Test accuracy:___  0.7666087910079503
  
  b. Which variable is used for the first split? What are the variables that are used for the second split?
  
  c. What are the 5 important variables in building the tree?
  
  `weighting : 0.2843118235468064`
  
`relationship_Married : 0.20194034369018554`

`age : 0.15198078285294297`

`num_years_education : 0.13849525030637658`

`num_working_hours_per_week : 0.09804571359252086`

  
  d. Report if you see any evidence of model overfitting.
  There is evidence of overfitting: training accuracy (about 0.9998) is far higher than test accuracy (about 0.77), so the tree has memorised the training data rather than learned patterns that generalise.
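
The train/test gap can be reproduced on any noisy dataset. A minimal sketch using synthetic data from `make_classification` (not the householder data; the sample size and noise level are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: flip_y=0.2 randomly reassigns about 20% of the labels.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

# An unconstrained tree keeps splitting until every training leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f'train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}')
```

A large positive gap is the signal to look for; limiting `max_depth` or raising `min_samples_leaf` usually narrows it.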

In [14]:
print(X.info())
from sklearn.model_selection import train_test_split

# setting random state
rs = 10

# To ignore any future warnings
import warnings
warnings.filterwarnings("ignore")

X_mat = X.values  # as_matrix() was removed in pandas 1.0
X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.3, stratify=y, random_state=rs)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 36476 entries, 0 to 39997
Data columns (total 31 columns):
age                                     36476 non-null int64
weighting                               36476 non-null int64
num_years_education                     36476 non-null int64
num_working_hours_per_week              36476 non-null int64
sex                                     36476 non-null float64
marital_status_Divorced                 36476 non-null uint8
marital_status_Married-AF-spouse        36476 non-null uint8
marital_status_Married-civ-spouse       36476 non-null uint8
marital_status_Married-spouse-absent    36476 non-null uint8
marital_status_Never-married            36476 non-null uint8
marital_status_Separated                36476 non-null uint8
marital_status_Widowed                  36476 non-null uint8
occupation_Adm-clerical                 36476 non-null uint8
occupation_Armed-Forces                 36476 non-null uint8
occupation_Craft-repair            

In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
# record how long processing takes to execute
import time 
startTime = time.time()

# simple decision tree training
model = DecisionTreeClassifier(random_state=rs)
model.fit(X_train, y_train)

print('processing time: ', time.time() - startTime)

processing time:  0.19585466384887695


In [16]:
print('Train accuracy: ', model.score(X_train, y_train))
print('Test accuracy: ', model.score(X_test, y_test))

Train accuracy:  0.9998041749892296
Test accuracy:  0.7666087910079503


In [17]:
import numpy as np
def analyse_feature_importance(dm_model, feature_names, n_to_display=20):
    # grab feature importances from the model
    importances = dm_model.feature_importances_

    # sort them out in descending order
    indices = np.argsort(importances)
    indices = np.flip(indices, axis=0)

    # limit to the top n_to_display features; leave this out to print everything
    indices = indices[:n_to_display]

    for i in indices:
        print(feature_names[i], ':', importances[i])

def visualize_decision_tree(dm_model, feature_names, save_name):
    import pydot
    from io import StringIO
    from sklearn.tree import export_graphviz
    
    dotfile = StringIO()
    export_graphviz(dm_model, out_file=dotfile, feature_names=feature_names)
    graph = pydot.graph_from_dot_data(dotfile.getvalue())
    graph[0].write_png(save_name) # saved in the following file

analyse_feature_importance(model, X.columns, 20)
visualize_decision_tree(model, X.columns, "2.1tree.png")

weighting : 0.2843118235468064
relationship_Married : 0.20194034369018554
age : 0.15198078285294297
num_years_education : 0.13849525030637658
num_working_hours_per_week : 0.09804571359252086
sex : 0.015443660190191265
occupation_Exec-managerial : 0.011296831918837602
occupation_Craft-repair : 0.010662237958741983
occupation_Prof-specialty : 0.009945557369829158
occupation_Sales : 0.008769050930764176
occupation_Adm-clerical : 0.008397292312861985
occupation_Transport-moving : 0.007286103578722134
occupation_Other-service : 0.006997125947644501
occupation_Machine-op-inspct : 0.0063237766830279585
occupation_Tech-support : 0.005622447035208534
occupation_Farming-fishing : 0.004277838025085874
occupation_Protective-serv : 0.0038861386776042253
occupation_Handlers-cleaners : 0.003450844061684915
relationship_Unmarried : 0.003229471688710333
marital_status_Divorced : 0.0029756053780288167


2. Build another decision tree tuned with GridSearchCV. Examine the tree results.
  
  a. What is classification accuracy on training and test datasets?
  
      `Train accuracy: 0.8246191203540516`
      
      `Test accuracy: 0.822169423375674`
  
  b. What are the parameters used? Explain your decision.
  
  c. What are the optimal parameters for this decision tree?
  
      `{'criterion': 'entropy', 'max_depth': 6, 'min_samples_leaf': 200}`
  
  d. Which variable is used for the first split? What are the variables that are used for the second split?

  ___First Split:___ `marital_status_Married-civ-spouse`

  ___Second Split:___ 
      `num_years_education <= 12.5` (true branch of the first split)
      `num_years_education <= 11.5` (false branch of the first split)

  e. What are the 5 important variables in building the tree?
  
  f. Report if you see any evidence of model overfitting.
  No evidence of overfitting: training and test accuracy differ by less than 0.01.
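
The first and second splits can also be read from the fitted estimator instead of the exported image. A minimal sketch on toy data, using the `tree_` attribute of a fitted `DecisionTreeClassifier`; the feature names `married` and `noise` are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: only the first column separates the two classes.
X_toy = np.array([[0, 1], [0, 2], [1, 1], [1, 2]])
y_toy = np.array([0, 0, 1, 1])
names = ['married', 'noise']  # hypothetical feature names

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
t = tree.tree_

# Node 0 is the root; children_left/children_right hold the ids of the
# nodes where the second-level splits (if any) are made.
root_feature = names[t.feature[0]]
root_threshold = t.threshold[0]
print(root_feature, root_threshold, t.children_left[0], t.children_right[0])
```

For the tuned model the same lookup would be applied to `cv.best_estimator_.tree_` with `X.columns` as the names.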
  




In [23]:
from sklearn.model_selection import GridSearchCV
# record processing time taken
startTime = time.time()
# grid search CV
params = {'criterion': ['gini', 'entropy'],
          'max_depth': range(2, 7),
          'min_samples_leaf': range(200, 600, 100)}

cv = GridSearchCV(param_grid=params, estimator=DecisionTreeClassifier(random_state=rs), cv=10, n_jobs=-1)
cv.fit(X_train, y_train)
print('processing time: ', time.time() - startTime)

print("Train accuracy:", cv.score(X_train, y_train))
print("Test accuracy:", cv.score(X_test, y_test))

# test the best model
y_pred = cv.predict(X_test)
print(classification_report(y_test, y_pred))

# print parameters of the best model
print(cv.best_params_)

processing time:  16.31612491607666
Train accuracy: 0.8246191203540516
Test accuracy: 0.822169423375674
              precision    recall  f1-score   support

           0       0.84      0.94      0.89      8246
           1       0.71      0.47      0.57      2697

    accuracy                           0.82     10943
   macro avg       0.78      0.70      0.73     10943
weighted avg       0.81      0.82      0.81     10943

{'criterion': 'entropy', 'max_depth': 6, 'min_samples_leaf': 200}


In [28]:
visualize_decision_tree(cv.best_estimator_, X.columns, "2.2tree.png")

3. What significant differences do you see between these two decision tree models – default (Task 2.1) and using GridSearchCV (Task 2.2)? How do they compare performance-wise? Explain why those changes may have happened.

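The default tree in Task 2.1 trains quickly (about 0.2 s) but is far less accurate on the test data than on the training data (about 0.77 versus 0.9998), so it is overfitted. The GridSearchCV model gives consistent accuracy on training and test data (about 0.82 on both) because `max_depth` and `min_samples_leaf` stop the tree from growing until every leaf is pure. The cost is processing time (about 16 s): the search fits 40 parameter combinations across 10 folds each before refitting the best one.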
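
The extra processing time follows directly from the size of the search. A minimal sketch that counts the fits implied by the grid in the GridSearchCV cell, using `ParameterGrid` from scikit-learn (the ranges are written out as lists here):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the GridSearchCV cell above.
params = {'criterion': ['gini', 'entropy'],
          'max_depth': list(range(2, 7)),
          'min_samples_leaf': list(range(200, 600, 100))}

n_candidates = len(ParameterGrid(params))  # 2 * 5 * 4 combinations
n_fits = n_candidates * 10 + 1             # 10 folds each, plus the final refit
print(n_candidates, n_fits)
```

Widening any one of the three ranges multiplies the total, so the grid is best kept small.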



4. From the better model, can you identify which householders to target for providing loan? Can you provide some descriptive summary of those householders?

From looking at the tree diagram output from the cv.best_estimator, we would likely provide a loan to 

# Task 3
## Predictive Modeling Using Regression (5.5 marks)
1. Describe why you will have to do additional preparation for variables to be
used in regression modelling. Apply transformation method(s) to the
variable(s) that need it. List the variables that needed it.

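***We have to perform additional preparation because the variables are on very different scales (for example, `weighting` ranges from 11909 to 1488540 while `sex` is 0 or 1). Standardising each input to zero mean and unit standard deviation puts them on the same scale, so the regression coefficients can be compared directly.***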
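
Standardisation replaces each value with its distance from the mean in standard deviations, z = (x - mean) / std. As a worked check, the sketch below applies that formula to the summary statistics printed for `weighting` in the "Before scaling" output of the scaling cell; the helper name `standardise` is made up for illustration:

```python
# Summary statistics for `weighting` copied from the "Before scaling" output.
w_min, w_max = 11909.0, 1488540.0
w_mean, w_std = 187746.45, 104911.88

def standardise(x, mean, std):
    """Return the z-score of x: its distance from the mean in standard deviations."""
    return (x - mean) / std

z_min = standardise(w_min, w_mean, w_std)
z_max = standardise(w_max, w_mean, w_std)
print(round(z_min, 3), round(z_max, 3))
```

Both results agree, to the precision of the rounded inputs, with the minimum and maximum reported for variable #1 in the "After scaling" output.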

In [18]:
from sklearn.preprocessing import StandardScaler

# initialise a standard scaler object
scaler = StandardScaler()

# visualise min, max, mean and standard dev of data before scaling
print("Before scaling\n-------------")
for i in range(5):
    col = X_train[:,i]
    print("Variable #{}: min {}, max {}, mean {:.2f} and std dev {:.2f}".
          format(i, min(col), max(col), np.mean(col), np.std(col)))

# learn the mean and std.dev of variables from training data
# then use the learned values to transform training data
X_train = scaler.fit_transform(X_train, y_train)

print("After scaling\n-------------")
for i in range(5):
    col = X_train[:,i]
    print("Variable #{}: min {}, max {}, mean {:.2f} and std dev {:.2f}".
          format(i, min(col), max(col), np.mean(col), np.std(col)))

# use the statistics learned from the training data to transform the test data
# NEVER learn from the test data: it is supposed to be data
# that the model has never seen before
X_test = scaler.transform(X_test)

Before scaling
-------------
Variable #0: min -1.0, max 90.0, mean 38.60 and std dev 13.23
Variable #1: min 11909.0, max 1488540.0, mean 187746.45 and std dev 104911.88
Variable #2: min 1.0, max 16.0, mean 10.13 and std dev 2.56
Variable #3: min 1.0, max 99.0, mean 41.00 and std dev 11.96
Variable #4: min 0.0, max 1.0, mean 0.32 and std dev 0.47
After scaling
-------------
Variable #0: min -2.993517135325084, max 3.8856934969275305, mean -0.00 and std dev 1.00
Variable #1: min -1.67604890258699, max 12.398915075283975, mean 0.00 and std dev 1.00
Variable #2: min -3.5712037358971154, max 2.293595010447082, mean -0.00 and std dev 1.00
Variable #3: min -3.3448369084800054, max 4.849764787561374, mean 0.00 and std dev 1.00
Variable #4: min -0.6932554171114912, max 1.4424697958599983, mean 0.00 and std dev 1.00


2. Build a regression model using the default regression method with all
inputs. Once you have completed it, build another model and tune it using GridSearchCV. Answer the following:

a. Report which variables are included in the regression model.

***age, weighting, num_years_education, num_working_hours_per_week, marital_status, sex, occupation, relationship***

b. Report the top-5 important variables (in the order) in the model.
- `marital_status_Married-civ-spouse : 0.7773651781762373`
- `num_years_education : 0.7388782937883976`
- `age : 0.36096863161847953`
- `num_working_hours_per_week : 0.3343676526528004`
- `occupation_Exec-managerial : 0.21437059680177015`
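
For a logistic regression on standardised inputs, importance is usually judged by the magnitude of a coefficient, so strong negative effects count as well. A minimal sketch that ranks by absolute value, using a handful of coefficients copied from the notebook's coefficient listing; that listing is truncated, so this is an illustration and not the full model:

```python
import numpy as np

# A few coefficients copied from the notebook output (not the full model).
coefs = {
    'age': 0.36096863161847953,
    'num_years_education': 0.7388782937883976,
    'num_working_hours_per_week': 0.3343676526528004,
    'marital_status_Married-civ-spouse': 0.7773651781762373,
    'marital_status_Never-married': -0.5223194902130501,
    'occupation_Exec-managerial': 0.21437059680177015,
}

names = np.array(list(coefs))
values = np.array(list(coefs.values()))

# Sort by absolute value, largest first, so strong negative effects rank too.
order = np.argsort(np.abs(values))[::-1]
for i in order:
    print(names[i], ':', values[i])
```

On this subset `marital_status_Never-married` ranks third by magnitude, which a ranking by raw value would miss.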

c. Report any sign of overfitting.

___No signs of overfitting, as train and test accuracy are very similar (0.828 versus 0.824)___

d. What are the parameters used? Explain your decision. What are the
optimal parameters? Which regression function is being used?

e. What is classification accuracy on training and test datasets?

- `Train accuracy: 0.8281048055457643`
- `Test accuracy: 0.8241798409942429`


In [19]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=rs)

# fit it to training data
model.fit(X_train, y_train)

# training and test accuracy
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))

# classification report on test data
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Train accuracy: 0.8281048055457643
Test accuracy: 0.8241798409942429
              precision    recall  f1-score   support

           0       0.86      0.92      0.89      8246
           1       0.68      0.54      0.60      2697

    accuracy                           0.82     10943
   macro avg       0.77      0.73      0.74     10943
weighted avg       0.82      0.82      0.82     10943



In [20]:
feature_names = X.columns
coef = model.coef_[0]

# limit to 20 features, you can comment the following line to print out everything
#coef = coef[:20]

for i in range(len(coef)):
    print(feature_names[i], ':', coef[i])

age : 0.36096863161847953
weighting : 0.060650153132075224
num_years_education : 0.7388782937883976
num_working_hours_per_week : 0.3343676526528004
sex : -0.05374196941767031
marital_status_Divorced : -0.27738314837034017
marital_status_Married-AF-spouse : 0.06707137822759268
marital_status_Married-civ-spouse : 0.7773651781762373
marital_status_Married-spouse-absent : -0.05632299499484264
marital_status_Never-married : -0.5223194902130501
marital_status_Separated : -0.12818057916797326
marital_status_Widowed : -0.12202736858371883
occupation_Adm-clerical : -0.025229968501081224
occupation_Armed-Forces : 0.029293708846537297
occupation_Craft-repair : -0.030680889903323745
occupation_Exec-managerial : 0.21437059680177015
occupation_Farming-fishing : -0.23237592255304484
occupation_Handlers-cleaners : -0.1728043599423611
occupation_Machine-op-inspct : -0.12399804860778599
occupation_Other-service : -0.31175939144680764
occupation_Priv-house-serv : -0.18400706554642052
occupation_Prof-spec

3. Build another regression model using the subset of inputs selected either
by RFE or the selection by model method. Answer the following:

a. Report which variables are included in the regression model.

b. Report the top-5 important variables (in the order) in the model.

c. Report any sign of overfitting.

d. What is classification accuracy on training and test datasets?
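
One way to approach this question is sketched below with `RFE` from scikit-learn on synthetic data. The number of features to keep (3) and the dataset are arbitrary assumptions for illustration; they are not results from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 informative features out of 10.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_syn = StandardScaler().fit_transform(X_syn)

# Recursively drop the feature with the smallest coefficient until 3 remain.
rfe = RFE(estimator=LogisticRegression(random_state=0), n_features_to_select=3)
rfe.fit(X_syn, y_syn)

X_sel = rfe.transform(X_syn)
print(rfe.support_)   # boolean mask of the retained columns
print(X_sel.shape)
```

The mask in `support_` can be applied to `X.columns` to report which variables the reduced model includes.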


4. Using the comparison statistics, which of the regression models appears to
be better? Is there any difference between the two models (i.e one with
selected variables and another with all variables)? Explain why those
changes may have happened.



5. From the better model, can you identify which householders to target for
providing loan? Can you provide some descriptive summary of those
householders?

# Task 4
## Predictive Modeling Using Neural Networks (5.5 marks)
1. Build a Neural Network model using the default setting. Answer the
following:
a. What are the parameters used? Explain your decision. What is the
network architecture?
b. How many iterations are needed to train this network?
c. Do you see any sign of over-fitting?
d. Did the training process converge and result in the best model?
e. What is classification accuracy on training and test datasets?
2. Refine this network by tuning it with GridSearchCV. Answer the
following:
a. What are the parameters used? Explain your decision. What is the
network architecture?
b. How many iterations are needed to train this network?
c. Do you see any sign of over-fitting?
d. Did the training process converge and result in the best model?
e. What is classification accuracy on training and test datasets?
3. Would feature selection help here? Build another Neural Network model
with inputs selected from RFE with regression (use the best model
generated in Task 3) and from the decision tree (use the best model
from Task 2). Answer the following for the best neural network model:
a. Did feature selection help here? Which method of feature selection
produced the best result? Any change in the network architecture?
What inputs are being used as the network input?
b. What is classification accuracy on training and test datasets? Is there
any improvement in the outcome?
c. How many iterations are now needed to train this network?
d. Do you see any sign of over-fitting?
e. Did the training process converge and result in the best model?
f. Finally, see whether the change in network architecture can further
improve the performance, use GridSearchCV to tune the network.
Report if there was any improvement.

# Task 5
## Comparing Predictive Models (4 marks)
1. Use the comparison methods to compare the best decision tree model, the
best regression model, and the best neural network model.
a. Discuss the findings led by:
(i) ROC Chart and Index;
(ii) Accuracy Score;
b. Which model would you use in deployment based on these findings?
Discuss why?
c. Do all the models agree on the householder’s characteristics? How do
they vary?
2. How can the outcome of this study be used by decision makers?
3. Can you summarise the positive and negative aspects of each predictive
modelling method based on this data analysis exercise?
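
For the ROC index comparison in question 1, a minimal sketch is given below. It uses synthetic data and two stand-in models with arbitrary settings, so the numbers it prints say nothing about the householder models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)

models = {
    'decision tree': DecisionTreeClassifier(max_depth=4, random_state=0),
    'logistic regression': LogisticRegression(random_state=0),
}

auc = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    # ROC AUC needs scores, not hard labels: use the probability of class 1.
    scores = clf.predict_proba(X_te)[:, 1]
    auc[name] = roc_auc_score(y_te, scores)
    print(name, round(auc[name], 3))
```

Unlike accuracy, the ROC index does not depend on a single decision threshold, which makes it a useful second view when the classes are imbalanced.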