
classification model to classify survey responses into one of the financial health categories.


Supervised Learning Classification - Self-Assessed Health Status


This project improves on the final project for the upper-year Statistics course "Statistical Learning - Classification" at the University of Waterloo, by Bolun Cui and Joe Liang.


A video explanation of this project can be found here. (Note: the video was made for the university course project, so some parts of it may not match the files in this repository.)


The dataset in this project consists of responses to the German General Social Survey (ALLBUS) between 2005 and 2019. The target variable for machine learning is the last variable, "health". It is an ordinal variable with five categories, from 1 to 5, and represents the "self-assessed financial health" reported in each survey response.

The dataset has two parts, "train.csv" and "test.csv". The samples in "train.csv" include the "health" variable and are used for model training; "test.csv" does not include the "health" variable. The goal of this project is to train a classification model on "train.csv" and use it to classify the survey responses in "test.csv" into one of the financial health categories.

File Description

  • Supervised Learning Code.Rmd: Contains the code and documentation for model training and data analysis. Highlights can be found in the next section
  • Supervised Learning Report.pdf: PDF version of "Supervised Learning Code.Rmd" for better compatibility
  • test.csv: Testing dataset for machine learning
  • train.csv: Training dataset for machine learning
  • Final Presentation.pdf: Contains a summary and explanation of the training process and the outcome of this project


Complete documentation can be found in the "Supervised Learning Code.Rmd" file.

Exploratory Data Analysis

  • Outlier analysis


  • Target variable distribution analysis and normalization


Feature Engineering

During the analysis, we used our domain knowledge to derive a new predictor (x-variable): the average living space in m² per person in the household.
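The derived variable can be sketched in base R as below. This is a minimal illustration, not the project's actual code: the column names `living_space_m2` and `household_size` are hypothetical stand-ins for the corresponding ALLBUS variables, and the three rows are made-up values.

```r
# Hypothetical column names and toy values; the real ALLBUS variable
# names in train.csv may differ.
survey <- data.frame(
  living_space_m2 = c(80, 120, 45),
  household_size  = c(2, 4, 1)
)

# Average living space per person; guard against a zero or missing
# household size so the result is NA instead of Inf.
valid_size <- !is.na(survey$household_size) & survey$household_size > 0
survey$space_per_person <- ifelse(
  valid_size,
  survey$living_space_m2 / survey$household_size,
  NA_real_
)

print(survey)
```

The same transformation would need to be applied to both the training and the test data so that the model sees the new column in both.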

Random Forest

  • Tuning the number of variables tried at each split and the number of trees using out-of-bag (OOB) samples


  • Variable importance from OOB samples (randomly permuting each variable and measuring the decrease in accuracy)


  • Tracing the performance for different numbers of trees
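The three steps above can be sketched with the randomForest package. This is an illustrative sketch under stated assumptions, not the code from the .Rmd file: the built-in `iris` data stands in for "train.csv", and the grid of `mtry` values and the choice of 300 trees are arbitrary.

```r
library(randomForest)

set.seed(1)

# Stand-in data: iris replaces train.csv purely for illustration.
x <- iris[, 1:4]
y <- iris$Species

# Step 1: fit one forest per candidate mtry (number of variables tried
# at each split) and record the OOB error after the last tree.
mtry_grid <- 1:4
oob_error <- sapply(mtry_grid, function(m) {
  fit <- randomForest(x = x, y = y, mtry = m, ntree = 300)
  fit$err.rate[fit$ntree, "OOB"]
})
best_mtry <- mtry_grid[which.min(oob_error)]

# Refit with importance = TRUE so permutation importance is computed.
rf <- randomForest(x = x, y = y, mtry = best_mtry,
                   ntree = 300, importance = TRUE)

# Step 2: permutation importance (type = 1 is the mean decrease in
# accuracy when a variable is permuted in the OOB samples).
perm_importance <- importance(rf, type = 1)
print(perm_importance)

# Step 3: OOB error after 1, 2, ..., ntree trees, for tracing how
# performance changes with the number of trees.
oob_trace <- rf$err.rate[, "OOB"]
plot(oob_trace, type = "l", xlab = "Number of trees", ylab = "OOB error")
```

Because OOB samples act as a built-in validation set, this tuning needs no separate hold-out split.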


Neural Network

  • Pipeline implementation of a neural network with two hidden layers


  • Tuning the number of epochs (passes over the training data) to balance the bias-variance tradeoff


  • Tuning the number of nodes and layers using validation cross entropy
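The validation cross entropy used as the tuning criterion can be computed by hand as in the base R sketch below. The function name `cross_entropy`, the labels, and the probability matrix are all hypothetical; in the project itself the predicted probabilities would come from the fitted network.

```r
# Mean categorical cross entropy for integer class labels 1..K.
# probs is an n x K matrix of predicted class probabilities.
cross_entropy <- function(y_true, probs, eps = 1e-15) {
  # Pick the predicted probability of the true class in each row.
  p_true <- probs[cbind(seq_along(y_true), y_true)]
  # Clip at eps to avoid log(0).
  -mean(log(pmax(p_true, eps)))
}

# Made-up validation labels and predictions for the five categories.
y_val <- c(1, 3, 5)
probs <- matrix(c(
  0.6, 0.1, 0.1, 0.1, 0.1,
  0.2, 0.2, 0.2, 0.2, 0.2,
  0.1, 0.1, 0.1, 0.1, 0.6
), nrow = 3, byrow = TRUE)

val_ce <- cross_entropy(y_val, probs)
print(val_ce)
```

Among candidate configurations (numbers of nodes, layers, or epochs), the one with the lowest validation cross entropy would be preferred; a value that starts rising while the training loss keeps falling is a sign of overfitting.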


Performance of the Model



This project uses R with R Markdown for better visualization. Please visit the official websites for documentation and installation of R and R Markdown. RStudio is recommended for opening the .Rmd file.

The packages required to execute the code in the .Rmd file are listed below and can be installed from CRAN using install.packages() in R or RStudio.

  • randomForest: a comprehensive package for training random forest models
  • caret: a machine learning framework with many integrated features, such as cross-validation
  • fastDummies: converts categorical variables into indicator (dummy) variables
  • keras: an R interface to Keras, running on TensorFlow, for neural networks (a TensorFlow installation is required)
  • gbm: provides generalized boosted regression models
  • nnet: provides multinomial logistic regression
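A small check like the following can show which of the listed packages are still missing before the .Rmd file is run. It is a convenience sketch, not part of the project code; the installation line is left commented out so that nothing is downloaded unintentionally.

```r
# Packages listed above, using their CRAN names.
required_pkgs <- c("randomForest", "caret", "fastDummies",
                   "keras", "gbm", "nnet")

# Compare against the packages already installed in this R library.
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install the missing packages from CRAN:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}

# Note: keras additionally needs a TensorFlow backend, which can be
# set up from R with keras::install_keras().
```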








