In [134]:
import pandas as pd

#import dataset
df = pd.read_csv('./HouseholderAtRisk.csv')
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39999 entries, 0 to 39998
Data columns (total 18 columns):
ID                        39999 non-null int64
Age                       39032 non-null float64
WorkClass                 39027 non-null object
Weighting                 38707 non-null float64
Education                 39027 non-null object
NumYearsEducation         39027 non-null float64
MaritalStatus             39027 non-null object
Occupation                39013 non-null object
Relationship              39027 non-null object
Race                      45 non-null object
Gender                    39027 non-null object
CapitalLoss               39027 non-null float64
CapitalGain               39027 non-null float64
CapitalAvg                39027 non-null float64
NumWorkingHoursPerWeek    39027 non-null float64
Sex                       39027 non-null float64
Country                   39969 non-null object
AtRisk                    39999 non-null object
dtypes: float64(8), int

# Task 1
## Data selection and distribution (4 marks)

1. What is the proportion of householders who have high risk?

**About 76.25% of householders (30,498 of 39,999) are at high risk.**

In [135]:
print(df['AtRisk'].value_counts().to_frame())

# Compute the share from the counts rather than hard-coding them
counts = df['AtRisk'].value_counts()
print(f"\n High percentage: {counts['High'] / counts.sum() * 100:.2f} %")



      AtRisk
High   30498
Low     9501

 High percentage: 76.25 %


2. Did you have to fix any data quality problems?

Yes. In Country, the variants 'US', 'USA' and 'United-States' were merged into a single value, 'United-States'.

Placeholder values such as '?' and 'Undefined', along with any NaNs, had to be replaced with a real value (the most frequent value in the column).

Some rows were almost entirely empty. Rows with 14 or more nulls out of 18 columns (in practice 15 or 16 nulls, leaving only 2 or 3 populated values) were all removed. This step removes almost all of the null values in the other columns. Most of the remaining rows have exactly one null, which comes from the Race column; that column is dropped later.

In [136]:
print(df.isna().sum()) # Count of Null in columns

test_df = pd.DataFrame()
test_df['full_count'] = df.apply(lambda x: 18-x.count(), axis=1)
print(test_df['full_count'].value_counts()) # Count of nulls in rows (null values: Number of rows)


ID                            0
Age                         967
WorkClass                   972
Weighting                  1292
Education                   972
NumYearsEducation           972
MaritalStatus               972
Occupation                  986
Relationship                972
Race                      39954
Gender                      972
CapitalLoss                 972
CapitalGain                 972
CapitalAvg                  972
NumWorkingHoursPerWeek      972
Sex                         972
Country                      30
AtRisk                        0
dtype: int64
1     38648
15      947
2       334
0        45
16       25
Name: full_count, dtype: int64


In [137]:
# Remove extra white spaces
df_obj = df.select_dtypes(['object'])
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())

df['full_count'] = df.apply(lambda x: 18-x.count(), axis=1)
df.drop(df[df['full_count'] >= 14].index, inplace=True)
df = df.drop(columns='full_count') # Remove count of null values in rows as they are no longer needed for data calc

# Merge the spelling variants of the United States into one label
df['Country'] = df['Country'].replace(["USA", "US"], "United-States")
# Replace the '?' placeholder with the most frequent value of each column
for col in ['Country', 'Occupation', 'WorkClass']:
    df[col] = df[col].replace("?", df[col].value_counts().idxmax())

test_df['full_count'] = df.apply(lambda x: 18-x.count(), axis=1)
print(test_df['full_count'].value_counts()) 


df = df.apply(lambda x: x.fillna(x.value_counts().idxmax())) # Fill remaining NaNs with each column's most frequent value

test_df['full_count'] = df.apply(lambda x: 18-x.count(), axis=1)
print(test_df['full_count'].value_counts()) # Count of nulls in rows (null values: Number of rows)

1.0    38648
2.0      334
0.0       45
Name: full_count, dtype: int64
0.0    39027
Name: full_count, dtype: int64


3. The dataset may include irrelevant and redundant variables. What variables did you include in the analysis and what were their roles and measurement level set? Justify your choice.

Race is practically empty (only 45 non-null values), so it is dropped.
Sex is redundant; Gender is used instead.
ID is irrelevant: it is only a row identifier and carries no predictive information.
Education is redundant; NumYearsEducation can be used instead.
CapitalAvg is redundant, being derived as (CapitalLoss + CapitalGain) / 2.
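
Redundancy claims like the ones above can be checked before any column is dropped. The following is a minimal sketch on toy data with made-up values, not the notebook's dataset. The helper names `is_derived_average` and `maps_one_to_one` are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the householder data; all values are invented for illustration.
toy = pd.DataFrame({
    "CapitalLoss": [0.0, 100.0, 0.0, 250.0],
    "CapitalGain": [500.0, 0.0, 0.0, 750.0],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Sex": [1.0, 0.0, 0.0, 1.0],
})
toy["CapitalAvg"] = (toy["CapitalLoss"] + toy["CapitalGain"]) / 2


def is_derived_average(frame, target, part_a, part_b):
    """Return True when `target` equals the mean of `part_a` and `part_b` in every row."""
    expected = (frame[part_a] + frame[part_b]) / 2
    return bool(np.allclose(frame[target], expected))


def maps_one_to_one(frame, col_a, col_b):
    """Return True when each value of `col_a` pairs with exactly one value of `col_b` and vice versa."""
    a_to_b = frame.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = frame.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)


redundant_avg = is_derived_average(toy, "CapitalAvg", "CapitalLoss", "CapitalGain")
redundant_sex = maps_one_to_one(toy, "Gender", "Sex")
print("CapitalAvg derivable from loss and gain:", redundant_avg)
print("Sex is a re-encoding of Gender:", redundant_sex)
```

If either check returned False on the real data, the corresponding column would carry extra information and the decision to drop it would need another look.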


In [138]:
# Clean data to be prepared for the decision tree
# errors='ignore' keeps the cell re-runnable: without it, a second run
# raises KeyError: "['Race'] not found in axis"
df = df.drop(columns=['CapitalAvg', 'Sex', 'Education', 'Race', 'ID'], errors='ignore')

4. What distribution scheme did you use? What “data partitioning allocation” did you set? Explain your selection. (Hint: Take the lead from Week 2 lecture on data distribution)

In [139]:
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

# Integer-encode the categorical columns; each column's labels are kept
# in their own variable so the codes can be mapped back later
df['AtRisk'], AtRisk = pd.factorize(df['AtRisk'])
df['Country'], Country = pd.factorize(df['Country'])
df['Gender'], Gender = pd.factorize(df['Gender'])
df['WorkClass'], WorkClass = pd.factorize(df['WorkClass'])
df['MaritalStatus'], MaritalStatus = pd.factorize(df['MaritalStatus'])
df['Occupation'], Occupation = pd.factorize(df['Occupation'])
df['Relationship'], Relationship = pd.factorize(df['Relationship'])


y = df['AtRisk']
x = df.drop(['AtRisk'], axis=1).values  # input features as a NumPy array
# 70/30 train/test split, stratified on AtRisk so both parts keep the same High/Low ratio
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y)
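
The effect of `stratify=y` in the split above can be verified by comparing class shares in the two partitions. This is a small sketch on synthetic labels (roughly 76% "High", mimicking the proportion reported in Task 1); the variable names and the fixed `random_state` are assumptions added for reproducibility, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Synthetic target: roughly 76% "High" and 24% "Low"
y_demo = pd.Series(np.where(rng.rand(1000) < 0.76, "High", "Low"))
x_demo = rng.rand(1000, 3)  # three meaningless numeric inputs

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# With stratification the High share should be nearly identical in both parts
train_share = (y_tr == "High").mean()
test_share = (y_te == "High").mean()
print("High share in train:", round(train_share, 3))
print("High share in test: ", round(test_share, 3))
```

Without `stratify`, the two shares could drift apart by chance, which matters more when one class is much rarer than the other.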

# Task 2
## Predictive Modelling Using Decision Trees (4 marks)
1. Build a decision tree using the default setting. Examine the tree results and answer the following:

In [140]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
import numpy as np


model = DecisionTreeClassifier()
model.fit(x_train, y_train)

print("Train accuracy:", model.score(x_train, y_train))
x_test = np.where(np.isfinite(x_test), x_test, 0)  # replace NaN/inf in the test inputs with 0
print("Test accuracy:", model.score(x_test, y_test))

Train accuracy: 0.9999267881982575
Test accuracy: 0.8160389444017423


In [141]:
import pydot
from io import StringIO
from sklearn.tree import export_graphviz

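# visualize
dotfile = StringIO()
export_graphviz(model, out_file=dotfile)
# pydot >= 1.2 returns a list of graphs; older versions return a single graph
graphs = pydot.graph_from_dot_data(dotfile.getvalue())
graph = graphs[0] if isinstance(graphs, list) else graphs
graph.write_png("viz.png") # writes the tree to viz.png - returns True if successful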

dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.241527 to fit



True

  a. What is classification accuracy on training and test datasets?
  
  b. Which variable is used for the first split? What are the variables that are used for the second split?
  
  c. What are the 5 important variables in building the tree?
  
  d. Report if you see any evidence of model overfitting.
  
  
2. Build another decision tree tuned with GridSearchCV. Examine the tree results.
  
  a. What is classification accuracy on training and test datasets?
  
  b. What are the parameters used? Explain your decision.
  
  c. What are the optimal parameters for this decision tree?
  
  d. Which variable is used for the first split? What are the variables that are used for the second split?
  
  e. What are the 5 important variables in building the tree?
  
  f. Report if you see any evidence of model overfitting.
  
  
3. What significant differences do you see between these two decision tree models, the default (Task 2.1) and the one tuned with GridSearchCV (Task 2.2)? How do they compare performance-wise? Explain why those changes may have happened.


4. From the better model, can you identify which householders to target for providing a loan? Can you provide a descriptive summary of those householders?
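
The questions in this task (accuracy, optimal parameters, first and second splits, top-5 variables) can all be read off a fitted tree. The sketch below assumes synthetic data from `make_classification` and hypothetical feature names; the parameter grid is one plausible choice for illustration, not the grid the assignment prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; feature names are hypothetical
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Limiting depth and leaf size is a common way to curb overfitting
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 5, 8],
    "min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_tr, y_tr)
best_tree = search.best_estimator_

train_acc = best_tree.score(X_tr, y_tr)
test_acc = best_tree.score(X_te, y_te)
print("Optimal parameters:", search.best_params_)
print("Train accuracy:", train_acc, "Test accuracy:", test_acc)

# Node 0 is the root; leaf nodes carry a negative feature index
tree_ = best_tree.tree_
root_feature = feature_names[tree_.feature[0]]
children = (tree_.children_left[0], tree_.children_right[0])
second_features = [feature_names[tree_.feature[c]] for c in children if tree_.feature[c] >= 0]
print("First split:", root_feature, "Second splits:", second_features)

# Rank variables by impurity-based importance and keep the top five
order = np.argsort(best_tree.feature_importances_)[::-1][:5]
top5 = [(feature_names[i], float(best_tree.feature_importances_[i])) for i in order]
print("Top 5 variables:", top5)
```

A large gap between train and test accuracy, such as the 0.9999 versus 0.816 printed for the default tree earlier in this notebook, is the usual sign of overfitting; a tuned tree typically narrows that gap.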



# Task 3
## Predictive Modeling Using Regression (5.5 marks)
1. Describe why you will have to do additional preparation for variables to be
used in regression modelling. Apply transformation method(s) to the
variable(s) that need it. List the variables that needed it.

In [None]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(x_train, y_train)
print("Train accuracy:", logisticRegr.score(x_train, y_train))
x_test = np.where(np.isfinite(x_test), x_test, 0)  # replace NaN/inf in the test inputs with 0
print("Test accuracy:", logisticRegr.score(x_test, y_test))

2. Build a regression model using the default regression method with all inputs. Once you have completed it, build another model and tune it using GridSearchCV. Answer the following:

  a. Report which variables are included in the regression model.

  b. Report the top-5 important variables (in order) in the model.

  c. Report any sign of overfitting.

  d. What are the parameters used? Explain your decision. What are the optimal parameters? Which regression function is being used?

  e. What is classification accuracy on training and test datasets?


3. Build another regression model using the subset of inputs selected either by RFE or the selection by model method. Answer the following:

  a. Report which variables are included in the regression model.

  b. Report the top-5 important variables (in order) in the model.

  c. Report any sign of overfitting.

  d. What is classification accuracy on training and test datasets?


4. Using the comparison statistics, which of the regression models appears to be better? Is there any difference between the two models (i.e. one with selected variables and another with all variables)? Explain why those changes may have happened.


5. From the better model, can you identify which householders to target for providing a loan? Can you provide a descriptive summary of those householders?
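
Regression models are sensitive to input scale and skew, which is why extra preparation is usually needed before fitting. The sketch below shows one common recipe on synthetic data: a log transform for a skewed, non-negative column (standing in for something like CapitalGain), standardisation fitted on the training set only, a GridSearchCV search over `C`, and RFE for input selection. The data, the choice of column 0 as the skewed one, the grid, and the number of selected features are all assumptions for illustration. Nominal columns that were integer-encoded would normally also need one-hot encoding, which is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=1)
# Make column 0 non-negative and heavily right-skewed (assumed stand-in for a capital column)
X[:, 0] = np.expm1(np.abs(X[:, 0]) * 3)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

# Log-transform the skewed column, then standardise using training statistics only
X_tr[:, 0] = np.log1p(X_tr[:, 0])
X_te[:, 0] = np.log1p(X_te[:, 0])
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Tune the regularisation strength C with 5-fold cross-validation
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X_tr_s, y_tr)
best_lr = search.best_estimator_
print("Best C:", search.best_params_["C"])
print("Train accuracy:", best_lr.score(X_tr_s, y_tr), "Test accuracy:", best_lr.score(X_te_s, y_te))

# On standardised inputs, larger absolute coefficients indicate more influential variables
coef = best_lr.coef_[0]
top5 = np.argsort(np.abs(coef))[::-1][:5]
print("Top-5 input indices by |coefficient|:", top5.tolist())

# Recursive feature elimination down to 4 inputs, then refit on the selected subset
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X_tr_s, y_tr)
selected = np.flatnonzero(rfe.support_)
lr_sel = LogisticRegression(max_iter=1000).fit(rfe.transform(X_tr_s), y_tr)
sel_test_acc = lr_sel.score(rfe.transform(X_te_s), y_te)
print("Selected input indices:", selected.tolist(), "Test accuracy:", sel_test_acc)
```

Comparing coefficients is only meaningful after standardisation, since otherwise a variable's coefficient reflects its units as much as its influence.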

# Task 4
## Predictive Modeling Using Neural Networks (5.5 marks)
1. Build a Neural Network model using the default setting. Answer the following:

  a. What are the parameters used? Explain your decision. What is the network architecture?

  b. How many iterations are needed to train this network?

  c. Do you see any sign of over-fitting?

  d. Did the training process converge and result in the best model?

  e. What is classification accuracy on training and test datasets?


2. Refine this network by tuning it with GridSearchCV. Answer the following:

  a. What are the parameters used? Explain your decision. What is the network architecture?

  b. How many iterations are needed to train this network?

  c. Do you see any sign of over-fitting?

  d. Did the training process converge and result in the best model?

  e. What is classification accuracy on training and test datasets?


3. Would feature selection help here? Build another Neural Network model with inputs selected from RFE with regression (use the best model generated in Task 3) and from the decision tree (use the best model from Task 2). Answer the following for the best neural network model:

  a. Did feature selection help here? Which method of feature selection produced the best result? Any change in the network architecture? What inputs are being used as the network input?

  b. What is classification accuracy on training and test datasets? Is there any improvement in the outcome?

  c. How many iterations are now needed to train this network?

  d. Do you see any sign of over-fitting?

  e. Did the training process converge and result in the best model?

  f. Finally, see whether a change in network architecture can further improve the performance; use GridSearchCV to tune the network. Report if there was any improvement.

In [None]:
from sklearn.neural_network import MLPClassifier

nn = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2))
nn.fit(x_train, y_train)

print("Train accuracy:", nn.score(x_train, y_train))
x_test = np.where(np.isfinite(x_test), x_test, 0)  # replace NaN/inf in the test inputs with 0
print("Test accuracy:", nn.score(x_test, y_test))
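
The questions about architecture and iteration count can be answered from attributes of the fitted `MLPClassifier`. The following sketch uses synthetic data and the same `hidden_layer_sizes=(5, 2)` as the cell above; the scaling step, `max_iter=500`, and the fixed random seeds are assumptions added here, since neural networks generally train more reliably on standardised inputs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=2)

# Standardise with statistics from the training set only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

nn = MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5, 2),
                   max_iter=500, random_state=0)
nn.fit(X_tr_s, y_tr)

# Architecture: input width, hidden layer widths, output width
architecture = [X_tr_s.shape[1], *nn.hidden_layer_sizes, nn.n_outputs_]
iterations = nn.n_iter_  # iterations the solver actually ran

print("Architecture (units per layer):", architecture)
print("Iterations used:", iterations, "of at most", nn.max_iter)
print("Train accuracy:", nn.score(X_tr_s, y_tr))
print("Test accuracy:", nn.score(X_te_s, y_te))
```

If `iterations` reaches `max_iter`, the solver stopped at the limit rather than converging, and raising `max_iter` would be a reasonable next step.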

# Task 5
## Comparing Predictive Models (4 marks)
1. Use the comparison methods to compare the best decision tree model, the best regression model, and the best neural network model.

  a. Discuss the findings led by:
  (i) ROC Chart and Index;
  (ii) Accuracy Score;

  b. Which model would you use in deployment based on these findings? Discuss why.

  c. Do all the models agree on the householder’s characteristics? How do they vary?


2. How can the outcome of this study be used by decision makers?


3. Can you summarise the positive and negative aspects of each predictive modelling method based on this data analysis exercise?
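
The ROC index (area under the ROC curve) and the accuracy score can be gathered for all three model types in one loop. This is a minimal sketch on synthetic data with untuned, illustrative model settings; in the actual comparison, the best models from Tasks 2 to 4 would take the place of the three estimators defined here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=3)

# Illustrative settings only; scaling is bundled into the pipelines that need it
models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(5, 2), solver="lbfgs", max_iter=500, random_state=0),
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # predicted probability of the positive class
    fpr, tpr, _ = roc_curve(y_te, proba)     # points for plotting the ROC chart
    results[name] = {
        "accuracy": accuracy_score(y_te, model.predict(X_te)),
        "roc_auc": roc_auc_score(y_te, proba),
        "fpr": fpr,
        "tpr": tpr,
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, ROC AUC={results[name]['roc_auc']:.3f}")
```

Accuracy depends on the 0.5 decision threshold, whereas ROC AUC summarises ranking quality across all thresholds, so the two measures can favour different models on imbalanced data such as this one.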