- Loading the dataset
- Summarizing the dataset
- Visualizing the dataset
- Evaluating algorithms
- Making predictions
UCI Machine Learning Repository - Iris
80% of the data is used to train the models. The remaining 20% is held back as a validation dataset, which is used to estimate the accuracy of the trained model on data it has not seen.
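The 80/20 split described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's own code: the function name `train_validation_split`, the seed value, and the use of integer row ids in place of the real Iris rows are all assumptions made for the example.

```python
import random

def train_validation_split(rows, train_fraction=0.8, seed=7):
    """Shuffle rows and split them into (training, validation) lists."""
    shuffled = list(rows)                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 150 Iris rows: integers 0..149 act as row ids (assumption).
rows = list(range(150))
training, validation = train_validation_split(rows)
print(len(training), len(validation))  # 120 30
```

Shuffling before cutting matters here because the Iris file is ordered by class, so an unshuffled cut would leave one class almost entirely out of the training data.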
- Dimensions: how many rows and columns (instances and attributes)
- Types: the data type of each attribute, which gives an idea of the transforms needed to prepare the data
- Levels: the class variable is a factor with multiple class labels (levels)
- Class distribution: number of instances that belong to each class
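The four summary steps above can be sketched as follows. This is a rough illustration in plain Python under stated assumptions: `sample` is a tiny hand-typed set of rows in the Iris CSV layout (four measurements plus a class label), and `summarize` is a hypothetical helper, not part of any library or of the original project.

```python
from collections import Counter

# Tiny illustrative sample in the Iris layout (assumption, not the full dataset):
# sepal length, sepal width, petal length, petal width, class label.
sample = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
]

def summarize(rows):
    """Return dimensions, column types, class levels and class distribution."""
    labels = [row[-1] for row in rows]       # class label is the last column
    counts = Counter(labels)
    return {
        "dimensions": (len(rows), len(rows[0])),              # (rows, columns)
        "types": [type(value).__name__ for value in rows[0]],
        "levels": sorted(counts),
        # per class: (number of instances, percentage of all instances)
        "distribution": {
            label: (n, 100.0 * n / len(rows)) for label, n in counts.items()
        },
    }

summary = summarize(sample)
for key, value in summary.items():
    print(key, value)
```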
- Univariate Plots: plots of each individual variable, covering the input attributes (x) and the output attribute (y)
- Multivariate Plots: to know the relationships between the input attributes and the class values
- Box and Whisker plots: show linear separations between the classes
- A test harness is set up to use 10-fold cross-validation
- Five different models are built
- The best model is selected
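10-fold cross-validation splits the dataset into 10 subsets (folds). In each round, one fold is held out and the model is trained on the other nine, then scored on the held-out fold. The process is complete once every fold has been held out exactly once; the 10 scores are then combined (typically averaged) into a single estimate of the algorithm’s accuracy on the dataset.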
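The fold mechanics can be sketched in plain Python. This is an illustrative sketch only: `k_fold_indices` and `cross_validate` are hypothetical helpers, the seed is arbitrary, and `fit_and_score` stands for any caller-supplied function that trains on the training indices and returns a score for the held-out indices.

```python
import random

def k_fold_indices(n_rows, k=10, seed=7):
    """Yield (train_indices, held_out_indices) once for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k near-equal, disjoint folds
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]

def cross_validate(n_rows, fit_and_score, k=10, seed=7):
    """Average the score returned by fit_and_score(train, held_out) over k folds."""
    scores = [fit_and_score(train, held_out)
              for train, held_out in k_fold_indices(n_rows, k, seed)]
    return sum(scores) / len(scores)

# With 150 rows and k=10, each round trains on 135 rows and holds out 15.
splits = list(k_fold_indices(150))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 135 15
```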
The model evaluation metric is accuracy: the percentage of correctly classified instances out of all instances.
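As a small worked example of that definition, the sketch below computes accuracy as a percentage. The function name `accuracy` and the four example labels are made up for illustration.

```python
def accuracy(actual, predicted):
    """Percentage of predictions that match the actual class labels."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)

# Illustrative labels only: three of the four predictions are correct.
actual = ["setosa", "setosa", "versicolor", "virginica"]
predicted = ["setosa", "setosa", "virginica", "virginica"]
print(accuracy(actual, predicted))  # 75.0
```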
- Stacking: building multiple models (typically of differing types) and a supervisor model that learns how best to combine the predictions of the primary models.
Five different algorithms were evaluated:
- Linear Discriminant Analysis (LDA)
- Classification and Regression Trees (CART)
- k-Nearest Neighbors (KNN)
- Support Vector Machines (SVM) with a radial kernel
- Random Forest (RF)
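A comparison of these five algorithms under 10-fold cross-validation might look like the sketch below. It assumes Python with scikit-learn, which may not be the toolkit the project actually used. The bundled `load_iris` copy of the dataset, the stratified splitting, the seed values, and the default hyperparameters are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold back 20% as a validation set; the models only see the remaining 80%.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(random_state=7),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(kernel="rbf"),                  # radial (RBF) kernel
    "RF": RandomForestClassifier(random_state=7),
}

# The same 10 folds are reused for every model so the scores are comparable.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = {
    name: cross_val_score(model, X_train, y_train, cv=folds, scoring="accuracy").mean()
    for name, model in models.items()
}

best = max(results, key=results.get)           # highest mean accuracy
for name, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
print("best:", best)
```

After a best model is chosen this way, it would typically be refit on the full training split and checked once against `X_val` and `y_val` before being used to make predictions.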