This project is an improvement of the final project for the upper-year Statistics course "Statistical Learning - Classification" at the University of Waterloo, by Bolun Cui and Joe Liang.
A video explanation of this project can be found here. (Note: the video was made for the university course project, so some parts of the video may not match the current files.)
The dataset in this project corresponds to responses to the German General Social Survey (ALLBUS) between 2005 and 2019. The target variable for machine learning is the last variable, "health". It is an ordinal variable with five categories from 1 to 5, representing the self-assessed financial health of each survey respondent.
The dataset comes in two parts, "train.csv" and "test.csv". The samples in "train.csv" include the "health" variable and are used for model training; "test.csv" does not include the "health" variable. The goal of this project is to train a classification model on "train.csv" and use it to classify the survey responses in "test.csv" into one of the financial health categories.
- Supervised Learning Code.Rmd: Contains the code and documentation for model training and data analysis. Highlights can be found in the next section.
- Supervised Learning Report.pdf: PDF version of "Supervised Learning Code.Rmd" for better compatibility
- test.csv: Testing dataset for machine learning
- train.csv: Training dataset for machine learning
- Final Presentation.pdf: Contains a summary and explanation of the training process and the outcome of this project. A video explanation of this project can be found here. (Note: the video was made for the university course project, so some parts of the video may not match the current files.)
Complete documentation can be found in the "Supervised Learning Code.Rmd" file.
- Outlier analysis
- Target variable distribution analysis and normalization
During the analysis, based on our domain knowledge, we derived a new x-variable: the average living space (in m²) per person in the household.
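A minimal sketch of how such a feature could be derived in base R. The column names `living_space` and `household_size` are hypothetical stand-ins; the actual ALLBUS variable names differ:

```r
# Hypothetical column names -- the real ALLBUS variables differ.
train <- data.frame(
  living_space   = c(80, 120, 60),  # dwelling living space in m^2
  household_size = c(2, 4, 1)       # number of persons in the household
)

# Derived x-variable: average living space in m^2 per person
train$space_per_person <- train$living_space / train$household_size
train$space_per_person
#> [1] 40 30 60
```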
- Out-of-bag (OOB) tuning of the number of variables sampled at each split and the number of trees
- Variable importance from OOB samples (randomly permuting each variable to measure the decrease in accuracy)
- Tracing performance across different numbers of trees
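The steps above can be sketched with the `randomForest` package. This is an illustration on the built-in `iris` data, not the project's actual tuning code; the real analysis uses the ALLBUS training set:

```r
library(randomForest)

set.seed(1)

# Tune mtry (variables sampled per split) by comparing OOB error rates
oob_err <- sapply(1:4, function(m) {
  rf <- randomForest(Species ~ ., data = iris, mtry = m, ntree = 200)
  rf$err.rate[200, "OOB"]  # OOB error after the final tree
})
best_mtry <- which.min(oob_err)

# Refit at the best mtry, tracking permutation importance
rf <- randomForest(Species ~ ., data = iris,
                   mtry = best_mtry, ntree = 200, importance = TRUE)

# Mean decrease in accuracy when each variable is permuted in the OOB samples
importance(rf, type = 1)

# OOB error as a function of the number of trees, for tracing performance
head(rf$err.rate)
```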
- Pipeline implementation of a neural network with two hidden layers
- Tuning the number of epochs (iterations) to balance the bias-variance tradeoff
- Tuning the number of nodes and layers using validation cross-entropy
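A hedged sketch of such a two-hidden-layer pipeline with the `keras` package (a TensorFlow installation is required). The input width `p`, the layer sizes, and the `x_train`/`y_train` objects are hypothetical placeholders, not the project's actual settings:

```r
library(keras)

# Hypothetical sizes: p dummy-coded predictors, 5 health categories.
p <- 40

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = p) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 5, activation = "softmax")

model %>% compile(
  loss = "categorical_crossentropy",  # validation cross-entropy guides tuning
  optimizer = "adam",
  metrics = "accuracy"
)

# Tune epochs with a validation split; watch val_loss for overfitting
# (x_train / y_train are placeholders for the prepared ALLBUS matrices):
# history <- model %>% fit(x_train, y_train, epochs = 50,
#                          validation_split = 0.2)
# plot(history)
```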
This project uses R with R Markdown for better visualization. Please visit the official websites for documentation and installation of R and R Markdown. RStudio is recommended for opening the .Rmd file.
The required packages to execute the code in the .Rmd file are listed below and can be installed from CRAN using
install.packages("package_name")
in R or RStudio.
- randomForest: a comprehensive package for Random Forest Model training
- caret: a machine learning platform with many integrated features, such as cross-validation
- fastDummies: a package that converts categorical variables into indicator (dummy) variables
- keras: a comprehensive package for neural networks, built on TensorFlow (a TensorFlow installation is required)
- gbm: provides the generalized boosted regression model
- nnet: provides the multinomial logistic regression model