Supervised learning models, an assignment from the Data Science and Advanced Analytics course from the Big Data & Analytics Masters @ EAE class of 2021

joseph-higaki/supervised-learning-R

Supervised Learning Models

This repository contains a course assignment on supervised machine learning models in R, from the Data Science and Advanced Analytics course of the Big Data & Analytics Masters @ EAE, class of 2021. The assignment has three sections.

  1. Regression Analysis for Child Carseat Sales
  2. Classification Analysis for Breast Cancer
  3. Classification Analysis for Iris Species

Regression Analysis for Child Carseat Sales

Given a dataset of 400 observations (locations) with 11 variables, we need to predict the sales volume.

Dataset documentation

Answer

I used linear regression with 8 different variable combinations. Model performance was evaluated using mean squared error (MSE).

R script found here: regression_hands_on.R
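A minimal sketch of the kind of comparison described above: fit linear models on several variable combinations and rank them by MSE. It is not the code from the script. The data is synthetic, the column names (`price`, `advertising`, `income`, `sales`) and the 70/30 train/test split are assumptions made for illustration, and only three combinations are shown.

```r
# Synthetic stand-in data: column names and coefficients are hypothetical,
# not taken from the assignment's dataset.
set.seed(42)
n <- 400
stores <- data.frame(
  price       = rnorm(n, mean = 115, sd = 24),
  advertising = rpois(n, lambda = 6),
  income      = rnorm(n, mean = 68, sd = 28)
)
stores$sales <- 13 - 0.05 * stores$price + 0.3 * stores$advertising + rnorm(n)

# Hold out 30% of the rows for testing (an assumed evaluation setup).
test_idx <- sample(n, size = round(0.3 * n))
train <- stores[-test_idx, ]
test  <- stores[test_idx, ]

# Candidate variable combinations, each expressed as a model formula.
candidates <- list(
  sales ~ price,
  sales ~ price + advertising,
  sales ~ price + advertising + income
)

# Fit each model on the training rows and score it on the held-out rows.
test_mse <- sapply(candidates, function(f) {
  fit <- lm(f, data = train)
  mean((test$sales - predict(fit, newdata = test))^2)
})
names(test_mse) <- sapply(candidates, function(f) paste(deparse(f), collapse = " "))

print(round(test_mse, 3))

# The combination with the lowest held-out MSE.
best <- names(which.min(test_mse))
print(best)
```

Scoring on held-out rows rather than on the training rows keeps a model with more variables from looking better simply because it has more parameters to fit.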

Classification Analysis for Breast Cancer

Given a dataset of 699 observations with 11 variables, apparently derived from imaging of breast tissue, we need to train a model that predicts whether an observation belongs to the benign or the malignant class.

Dataset documentation

Answer

I used support vector machine (SVM) models with different kernel functions. For model evaluation I added a cost matrix based on these assumptions:

Cost Matrix and assumptions

Table of multiple kernel functions

Conclusion: use model #7, as it has the lowest prediction cost. Its accuracy is ~93%, while other models reach ~95%, but those models have a higher prediction cost.

R script found here: svm_hands_on_breast_cancer.R
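To illustrate how a model with lower accuracy can still have the lower prediction cost, here is a small sketch in base R. The cost values and both confusion matrices are made up for illustration. They are not the assignment's cost matrix or results, and `model_a` and `model_b` are hypothetical names.

```r
# Rows are the actual class, columns the predicted class.
classes <- c("benign", "malignant")
labels  <- list(actual = classes, predicted = classes)

# Hypothetical cost matrix: a missed malignant case is assumed to cost
# ten times as much as a false alarm; correct predictions cost nothing.
cost <- matrix(c( 0, 1,
                 10, 0),
               nrow = 2, byrow = TRUE, dimnames = labels)

# Made-up confusion matrices for two models on the same 200 test cases.
model_a <- matrix(c(128,  2,
                      8, 62),
                  nrow = 2, byrow = TRUE, dimnames = labels)
model_b <- matrix(c(118, 12,
                      2, 68),
                  nrow = 2, byrow = TRUE, dimnames = labels)

# Share of correct predictions (the diagonal of the confusion matrix).
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Total cost: each cell count weighted by the cost of that kind of outcome.
prediction_cost <- function(cm, cost) sum(cm * cost)

results <- data.frame(
  model    = c("model_a", "model_b"),
  accuracy = c(accuracy(model_a), accuracy(model_b)),
  cost     = c(prediction_cost(model_a, cost), prediction_cost(model_b, cost))
)
print(results)
```

In this made-up case `model_a` is more accurate (0.95 against 0.93) but costs more (82 against 32), because it misses more malignant cases and those errors carry the larger weight.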

Classification Analysis for Iris Species

A dataset of 150 observations with 4 variables and a class. The purpose is to predict the Iris species: Setosa, Versicolor, or Virginica.

Dataset documentation

Answer

I also used support vector machine models. In the variable analysis, eyeballing the distribution of the species across variable pairs suggested that Sepal Width and Sepal Length are good input variables. Of the kernel functions tested, I went with the polynomial kernel of degree 3 and gamma 2.5. Another interesting takeaway from this assignment was using the plot feature to visualize observations against predictions.

SVM Classification plot

R script found here: svm_hands_on_flowers.R
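A sketch of how the chosen model could be fitted and plotted. It assumes the `e1071` package and R's built-in `iris` data frame as a stand-in for the assignment's dataset; neither is confirmed by the text above, and the script may be organised differently.

```r
library(e1071)  # assumed package; provides svm() and a plot method for fitted models

# Keep only the two sepal measurements and the class label.
flowers <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]

# Polynomial kernel with the degree and gamma mentioned above.
fit <- svm(Species ~ Sepal.Length + Sepal.Width, data = flowers,
           kernel = "polynomial", degree = 3, gamma = 2.5)

# Confusion matrix on the training data (an optimistic estimate of performance).
predicted <- predict(fit, flowers)
confusion <- table(actual = flowers$Species, predicted = predicted)
print(confusion)

# Decision regions with the observations overlaid.
plot(fit, flowers, Sepal.Width ~ Sepal.Length)
```

The plot colours the plane by predicted class and draws each observation on top, which makes it easy to see where observations fall in a region predicted as a different species.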

Professor

Professor Assistants
