# Classification (supervised learning)

This notebook studies and implements classification (supervised learning) with scikit-learn (`sklearn`). The iris dataset is used as the working example.


## Acknowledgments

- Used dataset: https://archive.ics.uci.edu/ml/datasets/iris

- Inquiries: mauricio.antelis@tec.mx


# Importing libraries

In [1]:
# Import the packages that we will be using
import numpy as np                  # For arrays
import pandas as pd                 # For data handling
import seaborn as sns               # For advanced plotting
import matplotlib.pyplot as plt     # For showing plots

# Note: specific functions of the "sklearn" package will be imported when needed to show concepts easily


# Importing data

In [None]:
# Define the column names for the iris dataset
colnames = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Flower"]

# Dataset url
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Load the dataset from the url (the file has no header row)
dataset = pd.read_csv(url, header=None, names=colnames)

dataset


Unnamed: 0,Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,Flower
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


# Understanding and preprocessing the data

1. Get a general 'feel' of the data


In [None]:
# Print dataset



In [None]:
# Print dataset shape



In [None]:
# Print column names
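A minimal sketch of the three inspection steps above. To run offline it rebuilds `dataset` from scikit-learn's bundled copy of iris, which is an assumption and may differ slightly from the UCI file; in the notebook, the `dataset` loaded earlier would be used directly.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris copy stands in for the UCI file
iris = load_iris()
colnames = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
dataset = pd.DataFrame(iris.data, columns=colnames)
dataset["Flower"] = ["Iris-" + str(iris.target_names[t]) for t in iris.target]

print(dataset.head())           # first five rows
print(dataset.shape)            # (number of rows, number of columns)
print(list(dataset.columns))    # column names
```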



2. Drop rows with any missing values


In [None]:
# Drop na
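As an illustration of this step, the sketch below uses a small hypothetical frame (`toy`, not part of the iris data) with one missing value, counts the missing entries, and drops incomplete rows with `DataFrame.dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only
toy = pd.DataFrame({
    "Petal_Length": [1.4, np.nan, 5.2],
    "Petal_Width": [0.2, 1.3, 2.0],
})

print(toy.isna().sum())                        # missing values per column
clean = toy.dropna().reset_index(drop=True)    # keep only complete rows
print(clean.shape)
```

In the notebook the same idea would read `dataset = dataset.dropna()`.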



3. Encode the categorical class label column: from string to number


In [None]:
# Encoding the categorical column: {"Iris-setosa":0, "Iris-versicolor":1, "Iris-virginica":2}


# Visualize the dataset


Now the label/category is numeric
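One possible way to do the encoding is `Series.map` with the dictionary given in the cell comment. The sketch below applies it to a short stand-in series (`flowers` is a hypothetical name) rather than the full column.

```python
import pandas as pd

# Small stand-in for the "Flower" column of the dataset
flowers = pd.Series(["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"])

# Mapping taken from the notebook's comment
mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
encoded = flowers.map(mapping)

print(encoded.tolist())
```

In the notebook this would be `dataset["Flower"] = dataset["Flower"].map(mapping)`. Labels missing from the dictionary become `NaN`, so it is worth checking for missing values afterwards.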


4. Discard columns that won't be used


In [None]:
# Drop unnecessary columns



5. Scatter plot of the data

In [None]:
# Scatter plot of Petal_Length vs Petal_Width
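A sketch of one such scatter plot with matplotlib; the other variable pairs follow the same pattern with different columns. It assumes scikit-learn's bundled iris copy, where columns 2 and 3 hold petal length and petal width.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: bundled iris copy; column order is sepal length, sepal width,
# petal length, petal width
X, y = load_iris(return_X_y=True)

fig, ax = plt.subplots()
ax.scatter(X[:, 2], X[:, 3])
ax.set_xlabel("Petal_Length")
ax.set_ylabel("Petal_Width")
ax.set_title("Petal_Length vs Petal_Width")
```

In a notebook the figure renders when the cell finishes; in a script, `plt.show()` would be called at the end.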




In [None]:
# Scatter plot of Petal_Length vs Sepal_Length




In [None]:
# Scatter plot of Petal_Length vs Sepal_Width




In [None]:
# Scatter plot of Petal_Width vs Sepal_Length




In [None]:
# Scatter plot of Petal_Width vs Sepal_Width




In [None]:
# Scatter plot of Sepal_Length vs Sepal_Width




In [None]:
# Pairplot: Scatterplot of all variables (not the flower type)






6. Scatter plot of the data, assigning each point to the class it belongs to

In [None]:
# Get dataframes for each real class



In [None]:
# Scatter plot of each real class for Petal




In [None]:
# Scatter plot of each real class for Sepal
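The per-class plots could be sketched as follows: split the data into one DataFrame per label, then draw each group with its own color. The dictionary `groups` and the axis names are hypothetical, and the labels are assumed to be already encoded as 0, 1, 2.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: bundled iris copy, labels already encoded as 0, 1, 2
iris = load_iris()
colnames = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
dataset = pd.DataFrame(iris.data, columns=colnames)
dataset["Flower"] = iris.target

# One DataFrame per real class
groups = {label: dataset[dataset["Flower"] == label] for label in (0, 1, 2)}

fig, (ax_petal, ax_sepal) = plt.subplots(1, 2, figsize=(10, 4))
for label, df in groups.items():
    ax_petal.scatter(df["Petal_Length"], df["Petal_Width"], label=f"Class {label}")
    ax_sepal.scatter(df["Sepal_Length"], df["Sepal_Width"], label=f"Class {label}")

ax_petal.set_xlabel("Petal_Length")
ax_petal.set_ylabel("Petal_Width")
ax_sepal.set_xlabel("Sepal_Length")
ax_sepal.set_ylabel("Sepal_Width")
ax_petal.legend()
```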




Recall that for this dataset we know in advance the class to which each point belongs.

# Get variables **X** and labels **y**

In [None]:
# Select variables (one, two, three, four)
X  = dataset[["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]].values
#X  = dataset[["Petal_Length", "Petal_Width"]].values
#X  = dataset[["Sepal_Length", "Sepal_Width"]].values

# Get the class of each observation
y  = dataset["Flower"].values


In [None]:
# Understand the data X


In [None]:
# Understand the data y


In [None]:
# Calculate the number of observations in the dataset



In [None]:
# Calculate the number of observations for class 0



In [None]:
# Calculate the number of observations for class 1



In [None]:
# Calculate the number of observations for class 2
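A short sketch of these checks with NumPy: array shapes for `X` and `y`, the total number of observations, and the count per class via `np.unique`. It assumes scikit-learn's bundled iris copy with labels 0, 1, 2.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: bundled iris copy, labels encoded as 0, 1, 2
X, y = load_iris(return_X_y=True)

print(X.shape)    # (observations, variables)
print(y.shape)    # (observations,)

n_total = len(y)
classes, counts = np.unique(y, return_counts=True)
print("Total observations:", n_total)
for c, n in zip(classes, counts):
    print(f"Class {c}: {n} observations")
```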



# Train a classifier

## Train the classification model

In [None]:
# Import sklearn linear_model

# Initialize the classifier

# Fit the model to the training data
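Since the cell imports `linear_model`, one plausible choice is logistic regression; that choice is an assumption, as are the `max_iter` value and the use of scikit-learn's bundled iris copy. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: bundled iris copy, labels encoded as 0, 1, 2
X, y = load_iris(return_X_y=True)

# Assumption: logistic regression; max_iter is raised so the solver
# has room to converge on unscaled features
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

print(clf.classes_)       # labels seen during training
print(clf.coef_.shape)    # one row of coefficients per class
```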



## Predict the class of a new observation

In [None]:
# Get a new observation
xnew = np.array([[5.5, 3.5, 1.5, 0.5]])
#xnew = np.array([[5.5, 2.5, 3.5, 1.5]])
#xnew = np.array([[6.5, 3.5, 5.5, 2.5]])

# Print the new observation
xnew


array([[5.5, 3.5, 1.5, 0.5]])

In [None]:
# Make the prediction using xnew


# Get the predicted class
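Prediction could look like the sketch below, which refits the assumed logistic regression model so the snippet is self-contained. `predict` returns the label and `predict_proba` the estimated class probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: bundled iris copy and a logistic regression classifier
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# New observation, in the same variable order used for training
xnew = np.array([[5.5, 3.5, 1.5, 0.5]])

ypred = clf.predict(xnew)          # predicted class label
proba = clf.predict_proba(xnew)    # estimated probability of each class

print("Predicted class:", ypred[0])
print("Class probabilities:", proba.round(3))
```

Note that `xnew` is two-dimensional (one row, four columns): scikit-learn estimators expect a 2D array even for a single observation.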



The question is: how accurate is the classification model? To answer it, we need to evaluate the performance of our classifier.

# Evaluation of a classifier

## Split data into train and test sets

Holdout: splitting the dataset into train and test sets

In [None]:
# Import sklearn train_test_split


# Split data in train and test sets



In [None]:
# Number of observations in the train set



In [None]:
# Number of observations of each class in the train set



In [None]:
# Number of observations in the test set



In [None]:
# Number of observations of each class in the test set
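A sketch of the holdout split and the counts asked for above. The 80/20 ratio, the fixed `random_state`, and the stratification are all assumptions chosen for illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris copy, labels encoded as 0, 1, 2
X, y = load_iris(return_X_y=True)

# Assumptions: 80/20 split, fixed seed, class proportions preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("Train size:", len(y_train), "per class:", np.bincount(y_train))
print("Test size: ", len(y_test), "per class:", np.bincount(y_test))
```

Without `stratify=y` the class counts in each set vary from split to split, which matters more on small datasets such as this one.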



## Train the classification model

In [None]:
# Initialize the classifier


# Fit the model to the training data



## Test the classification model

In [None]:
# Make the predictions using the test set



In [None]:
# Explore real and predicted labels
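Training on the train set and predicting on the held-out test set could be sketched as follows, again assuming logistic regression and an illustrative 80/20 stratified split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumptions: bundled iris copy, 80/20 stratified split, logistic regression
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit on the train set only, then predict the unseen test set
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Real:     ", y_test)
print("Predicted:", y_pred)
```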



## Compute the accuracy

In [None]:
# Define a function to compute accuracy


In [None]:
# Calculate total accuracy





In [None]:
# Calculate total accuracy using sklearn.metrics
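Accuracy is the fraction of observations whose predicted label equals the real label. The sketch below defines a hypothetical helper `accuracy` and compares it with `sklearn.metrics.accuracy_score`; the split and the classifier are the same illustrative assumptions as before.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the real labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Assumptions: bundled iris copy, 80/20 stratified split, logistic regression
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

acc_manual = accuracy(y_test, y_pred)
acc_sklearn = accuracy_score(y_test, y_pred)
print("Manual accuracy: ", acc_manual)
print("sklearn accuracy:", acc_sklearn)
```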



In [None]:
# Compute accuracy for class 0



In [None]:
# Compute accuracy for class 1



In [None]:
# Compute accuracy for class 2
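One reading of "accuracy for class k" is the fraction of test observations whose real label is k that were predicted as k, which matches per-class recall. Under that assumption, and the same illustrative split and classifier, a sketch is:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Assumptions: bundled iris copy, 80/20 stratified split, logistic regression
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

# Accuracy restricted to the observations of each real class
per_class = {}
for k in np.unique(y_test):
    mask = y_test == k
    per_class[int(k)] = float(np.mean(y_pred[mask] == k))
    print(f"Class {k}: {per_class[int(k)]:.3f}")

# Cross-check: the same numbers as scikit-learn's per-class recall
recalls = recall_score(y_test, y_pred, average=None)
```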



## Confusion matrix

In [None]:
# Compute the normalized confusion matrix


In [None]:
# Plot the normalized confusion matrix
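A possible implementation with `sklearn.metrics`: `confusion_matrix(..., normalize="true")` divides each row by the number of real observations of that class, and `ConfusionMatrixDisplay` plots the result. The split, the classifier, and the display labels are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumptions: bundled iris copy, 80/20 stratified split, logistic regression
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)

# Rows are real classes, columns are predicted classes; each row sums to 1
cm = confusion_matrix(y_test, y_pred, normalize="true")
print(cm.round(2))

# Plot the normalized matrix (labels follow the 0, 1, 2 encoding)
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["setosa", "versicolor", "virginica"],
)
disp.plot(cmap="Blues")
```

The diagonal of the normalized matrix holds the per-class hit rates, so off-diagonal entries show directly which classes get confused with each other.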



# Final remarks

- Evaluating a classification model is critical

- Train and test sets have to be mutually exclusive

- There are several evaluation schemes: holdout, Monte Carlo, k-fold, repeated k-fold, Leave P Out (LPO), Leave One Out (LOO), stratified k-fold

- https://scikit-learn.org/stable/modules/cross_validation.html
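As an example of one alternative from the list above, stratified k-fold cross-validation can be sketched with `cross_val_score`. The number of folds, the seed, and the classifier are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: bundled iris copy, labels encoded as 0, 1, 2
X, y = load_iris(return_X_y=True)

# Assumptions: 5 folds, shuffled with a fixed seed, logistic regression
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:    ", round(float(scores.mean()), 3))
```

Each observation is used for testing exactly once, so the mean score depends less on a single lucky or unlucky split than the holdout estimate does.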

# Activity

1) Compare the accuracy of the classification using (a) the four variables, (b) the two Petal variables, and (c) the two Sepal variables. Which provides the best classification accuracy?


2) Using the four variables, try two different classifiers. Which provides the best performance?