In [5]:
import pandas as pd

#import dataset
df = pd.read_csv('./HouseholderAtRisk.csv')

# Task 1
## Data selection and distribution (4 marks)

1. What is the proportion of householders who have high risk?

**76.246% of householders have high risk**

In [6]:
print(df['High'].value_counts().to_frame())

total = 30497 + 9501
print('\n High percentage:', (30497 / total) * 100, '%')

       High
High  30497
Low    9501

 High percentage: 76.24631231561578 %


2. Did you have to fix any data quality problems? Detail them.
    Apply the imputation method(s) to the variable(s) that need it. List the variables that needed it. Justify your choise of imputation if needed

**Data quality problems**
1. Renamed each column with corresponding name
2. Drop rows with over 95% NaN values
3. Drop rows with more than 8 empty cells


**Imputation**
1. imputate `occupation` by removal, small percentage to remove and no simple median to apply
2. imputate rows in `weighting` by removal, small percentage to remove and no simple median to apply
3. imputate rows in `work_class` by applying median value. median value is large majority
4. imputate rows in `country_of_origin` by applying median value. median value is large majority

**DataTypes**
1. map `sex` to __boolean__ type of 0:M and 1:F. changed from __float__
2. change `age`, `num_years_education`, `num_working_hours_per_week`, `weighting` from __float__ to __int__

**Bin Values**
1. Bin `weighting` to remove problem of large range of singular values

In [7]:
# list unique values for each column
# check data matches description

# for column in df.columns:
#     print('\nColumn name: ' + column )
#     print(df[column].unique())

In [8]:
### DATA QUALITY
## 1
# rename each column to correct attribute name 
# according to task description
df.rename(columns= {
    '1': 'id',
    '25': 'age',
    ' Private': 'work_class',
    '224942': 'weighting',
    ' 11th': 'education',
    '7': 'num_years_education',
    ' Never-married': 'marital_status',
    ' Machine-op-inspct': 'occupation',
    ' Own-child': 'relationship',
    'Unnamed: 9': 'race',
    ' Male': 'gender',
    '0': 'capital_loss',
    '0.1': 'capital_gain',
    '0.2': 'capital_avg',
    '40': 'num_working_hours_per_week',
    '0.3': 'sex',
    ' US': 'country_of_origin',
    'High': 'at_risk',
    
}, inplace=True)

## 2
# Remove near empty columns
df.drop(columns=['race', 'capital_loss', 'capital_gain', 'capital_avg'], inplace=True)

## 3
# Remove rows with nothing but ID
df.dropna(thresh=8, inplace=True)

In [9]:
### IMPUTATION
## 1
# remove rows in 'occupation' with ? value as hard to imputate media
df = df[df.occupation != '?']

## 2
# remove rows in 'weighting' that are NaN as hard to imputate median
df.dropna(subset=['weighting'], inplace=True)

## 3
# imputate rows with ? in 'work_class' as median is easy to determine
df['work_class'].replace('?', df['work_class'].mode(), inplace=True)

## 4
# imputate rows ? in country_of_origin to median of origin
df['country_of_origin'].replace('?', df['country_of_origin'].mode(), inplace=True)

In [10]:
### DATA TYPES
## 1
# change sex to boolean map
df['sex'] = df['sex'].map({'M': 0, 'F':1 })

## 2
# change from float to int
df['age'] = df['age'].astype(int);
df['num_years_education'] = df['num_years_education'].astype(int)
df['num_working_hours_per_week'] = df['num_working_hours_per_week'].astype(int)
df['weighting'] = df['weighting'].astype(int)

In [11]:
### Bin rows
df['weighting'] = pd.cut(df['weighting'], 20).value_counts()

3. The dataset may include irrelevant and redundant variables. What variables did you include in the analysis and what were their roles and measurement level set? Justify your choice.

**1. Large majority of cells are single value**
`country_of_origin`, `work_class`

**2. Other**
`gender`: `sex` column describes same data



In [12]:
# print(df['Race'].value_counts(dropna=False), '\n')
# print(df['CapitalLoss'].value_counts(bins = 5), '\n')
# print(df['CapitalGain'].value_counts(bins = 5), '\n')
# print(df['CapitalAvg'].value_counts(bins=5), '\n')
# print(df['CountryOfOrigin'].value_counts(), '\n')

In [13]:
## 1
df.drop(columns=['country_of_origin', 'work_class'], inplace=True)
## 2
df.drop(columns=['gender'], inplace=True)


4. What distribution scheme did you use? What “data partitioning allocation” did you set? Explain your selection. (Hint: Take the lead from Week 2 lecture on data distribution)

# Task 2
## Predictive Modelling Using Decision Trees (4 marks)
1. Build a decision tree using the default setting. Examine the tree results and answer the followings:

In [56]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

# Ok so we need to transform all catagorical data into objects / numbers of something cause yay!

preped_df = df.drop(['sex'], axis=1)
preped_df = preped_df.drop(['weighting'], axis=1)
preped_df = df.drop(['sex'], axis=1)
print(df.info())
print(df)

ct = ColumnTransformer(
    [('oh_enc', OneHotEncoder(sparse=False), [2, 4, 5,6,8]),],  # the column numbers I want to apply this to
    remainder='passthrough'  # This leaves the rest of my columns in place
)
preped_df = ct.fit_transform(preped_df)

tree = pd.get_dummies(df)
y = preped_df['at_risk']
x = preped_df.drop(['at_risk'], axis=1).as_matrix()

print(x)

model = DecisionTreeClassifier()
model.fit(x, y)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38706 entries, 0 to 39997
Data columns (total 11 columns):
id                            38706 non-null int64
age                           38706 non-null int32
weighting                     29909 non-null float64
education                     38706 non-null object
num_years_education           38706 non-null int32
marital_status                38706 non-null object
occupation                    38692 non-null object
relationship                  38706 non-null object
num_working_hours_per_week    38706 non-null int32
sex                           0 non-null float64
at_risk                       38706 non-null object
dtypes: float64(2), int32(3), int64(1), object(5)
memory usage: 4.4+ MB
None
          id  age  weighting      education  num_years_education  \
0          2   38        NaN        HS-grad                    9   
1          3   28        NaN     Assoc-acdm                   12   
2          4   44        NaN   Some-college 

ValueError: Input contains NaN

  a. What is classification accuracy on training and test datasets?
  
  b. Which variable is used for the first split? What are the variables that are used for the second split?
  
  c. What are the 5 important variables in building the tree?
  
  d. Report if you see any evidence of model overfitting.
  
  
2. Build another decision tree tuned with GridSearchCV. Examine the tree results.
  
  a. What is classification accuracy on training and test datasets?
  
  b. What are the parameters used? Explain your decision.
  
  c. What are the optimal parameters for this decision tree?
  
  d. Which variable is used for the first split? What are the variables that are used for the second split?
  
  e. What are the 5 important variables in building the tree?
  
  f. Report if you see any evidence of model overfitting.
  
  
3. What is the significant difference do you see between these two decision tree models – default (Task 2.1) and using GridSearchCV (Task 2.2)? How do theycompare performance-wise? Explain why those changes may have happened.


4. From the better model, can you identify which householders to target for providing loan? Can you provide some descriptive summary of those householders?



# Task 3
## Predictive Modeling Using Regression (5.5 marks)
1. Describe why you will have to do additional preparation for variables to be
used in regression modelling. Apply transformation method(s) to the
variable(s) that need it. List the variables that needed it.
2. Build a regression model using the default regression method with all
inputs. Once you have completed it, build another model and tune it usingGridSearchCV. Answer the followings:
a. Report which variables are included in the regression model.
b. Report the top-5 important variables (in the order) in the model.
c. Report any sign of overfitting.
d. What are the parameters used? Explain your decision. What are the
optimal parameters? Which regression function is being used?
e. What is classification accuracy on training and test datasets?
3. Build another regression model using the subset of inputs selected either
by RFE or the selection by model method. Answer the followings:
a. Report which variables are included in the regression model.
b. Report the top-5 important variables (in the order) in the model.
c. Report any sign of overfitting.
d. What is classification accuracy on training and test datasets?
4. Using the comparison statistics, which of the regression models appears to
be better? Is there any difference between the two models (i.e one with
selected variables and another with all variables)? Explain why those
changes may have happened.
5. From the better model, can you identify which householders to target for
providing loan? Can you provide some descriptive summary of those
householders?

# Task 4
## Predictive Modeling Using Neural Networks (5.5 marks)
1. Build a Neural Network model using the default setting. Answer the
following:
a. What are the parameters used? Explain your decision. What is the
network architecture?
b. How many iterations are needed to train this network?
c. Do you see any sign of over-fitting?
d. Did the training process converge and resulted in the best model?
e. What is classification accuracy on training and test datasets?
2. Refine this network by tuning it with GridSearchCV. Answer the
following:
a. What are the parameters used? Explain your decision. What is the
network architecture?
b. How many iterations are needed to train this network?
c. Do you see any sign of over-fitting?
d. Did the training process converge and resulted in the best model?
e. What is classification accuracy on training and test datasets?
3. Would feature selection help here? Build another Neural Network model
with inputs selected from RFE with regression (use the best model
generated in Task 3) and from the decision tree (use the best model
from Task 2). Answer the following for the best neural network model:a. Did feature selection help here? Which method of feature selection
produced the best result? Any change in the network architecture?
What inputs are being used as the network input?
b. What is classification accuracy on training and test datasets? Is there
any improvement in the outcome?
c. How many iterations are now needed to train this network?
d. Do you see any sign of over-fitting?
e. Did the training process converge and resulted in the best model?
f. Finally, see whether the change in network architecture can further
improve the performance, use GridSearchCV to tune the network.
Report if there was any improvement.

# Task 5
## Comparing Predictive Models (4 marks)
1. Use the comparison methods to compare the best decision tree model, the
best regression model, and the best neural network model.
a. Discuss the findings led by:
(i) ROC Chart and Index;
(ii) Accuracy Score;
b. Which model would you use in deployment based on these findings?
Discuss why?
c. Do all the models agree on the householder’s characteristics? How do
they vary?
2. How the outcome of this study can be used by decision makers?
3. Can you summarise the positives and negative aspects of each predictive
modelling method based on this data analysis exercise?