# Code-along notebook: Support Vector Classifier with the Iris dataset

Great to have you in this second phase of the workshop!

Now we have time to build our own AI hands-on, so that it does what we did ourselves at the start of the workshop.
Everything done here is also in the folder "solution", where you can find the notebook "iris_classifier_solved" to look into whenever you get stuck.

Before you look up the solution: Did you already ask ChatGPT for a hint or the solution?

This notebook is divided into 4 parts:

1. Load data
2. Plot data
3. Prepare data
4. Train and test

As you can see, training is only the last part of this notebook. Far more time goes into setting up and understanding the data correctly.
Every part of this notebook starts with a short summary of what should be done in that section and some pointers on where to get help with the implementation.


In [None]:
# Imports

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # plotting
import matplotlib.pyplot as plt # plotting

from sklearn.model_selection import train_test_split # to split the dataset for training and testing
from sklearn import svm # for the Support Vector Machine (SVM) algorithm
from sklearn import metrics # for checking the model accuracy

# Load data

The first step is to load the data. We use the library pandas for that, as it has built-in functions to read CSV and XLSX files. pandas is also widely supported by AI libraries, as it is the de facto standard in the industry.

Information on data loading with pandas (imported as "pd") into a __DataFrame__:

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
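
As a hint, here is a minimal sketch of `pd.read_csv` on a tiny inline sample instead of the workshop file. The sample rows, the name `df_sample` and the column names `Id`, `PetalLengthCm` and `PetalWidthCm` are assumptions for illustration; check the real CSV for its actual header.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: three illustrative rows
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,7.0,3.2,4.7,1.4,Iris-versicolor
3,6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv accepts a file path or any file-like object
df_sample = pd.read_csv(io.StringIO(csv_text))

# Drop the (assumed) "Id" column; pandas keeps its own row index
df_sample = df_sample.drop(columns="Id")

# head() shows the first rows as a quick sanity check
print(df_sample.head())
```

With a real file, the path to the CSV takes the place of the `io.StringIO(...)` object.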


In [None]:
# load data 
df_iris = 

# We can drop the ID field as pandas uses its own index


# sanity check using the head()-function



# Plot data for a better understanding

Plotting the data leads to a better understanding. The second plot is the one we used at the start of the workshop.

Plotting needs quite a few parameters to give a good overview of the data. Therefore, the code for the first plot is already provided. Use it to build the second plot, which should match the one from the morning.


In [None]:
# Start with Setosa; DataFrame.plot returns a matplotlib Axes, which is reused below via ax=fig
fig = df_iris[df_iris.Species == 'Iris-setosa'].plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm', color='red', label='Setosa', marker="1")

# Draw the other two species onto the same Axes
df_iris[df_iris.Species == 'Iris-versicolor'].plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm', color='blue', label='Versicolor', ax=fig, marker="D")
df_iris[df_iris.Species == 'Iris-virginica'].plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm', color='green', label='Virginica', ax=fig, marker="x")

fig.set_xlabel("Sepal Length")
fig.set_ylabel("Sepal Width")
fig.set_title("Sepals")

# Get the current Figure to set its size
fig = plt.gcf()
fig.set_size_inches(10, 6)

plt.show()

In [None]:
# The plot for the Petals comes here:


To get a better understanding, we can build a heatmap of the correlations. Two functions can be combined for that:

1. Correlation can be calculated by the pandas-package: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html
2. The heatmap can be plotted by seaborn (imported as sns): https://seaborn.pydata.org/generated/seaborn.heatmap.html
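
The combination of the two functions can be sketched on a small made-up table. The DataFrame `df_demo`, its column names and the styling arguments (`annot`, `cmap`, `vmin`, `vmax`) are illustrative choices, not the workshop solution.

```python
import pandas as pd
import seaborn as sns

# Made-up measurements, only to have something to correlate
df_demo = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "width": [0.5, 1.1, 1.4, 2.1, 2.4],
    "label": ["a", "a", "b", "b", "b"],
})

# Correlation is only defined for numbers, so select the numeric columns first
corr = df_demo.select_dtypes(include="number").corr()

# annot=True writes the correlation value into each cell of the heatmap
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```

Selecting the numeric columns explicitly matters for the Iris data as well, because the species column contains text.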

In [None]:
# Heatmap of the correlation matrix to get a better understanding of connections in the data

plt.figure(figsize=(10,6))
# your code for the heatmap comes here:
###

###
plt.show()

Interpretation of the heatmap:

As seen in the plotted data: the longer the petals, the wider they get. This is not true for the sepals.
There also seems to be a strong correlation between petal size and sepal length.
Sepal width does not correlate strongly with any other feature.

In a real-world example, this heatmap might lead to a selection of features to speed up the learning process, as some features are nearly perfectly correlated. As processing time is no concern with 150 records, we won't modify the data at this point.

# Prepare the data for training

Now that we have some understanding of the data and our task, we can prepare the data for training and testing.
Best practice is to train the model on only 60%-80% of the data and to use the rest as a test set to measure how well the model performs.

We already imported "train_test_split". You can find information on its use in the sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

I recommend using the names "df_train" and "df_test" for the training and testing data to make it easier to compare with the solution notebook, if needed.
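
As a hint, a minimal sketch of `train_test_split` on scikit-learn's bundled copy of the Iris data. That copy has different column names than the workshop CSV, and the names `df_demo`, `df_demo_train` and `df_demo_test` are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# scikit-learn's bundled Iris data as a DataFrame (150 rows)
df_demo = load_iris(as_frame=True).frame

# 20% of the rows go to the test set; random_state fixes the shuffle
df_demo_train, df_demo_test = train_test_split(df_demo, test_size=0.2, random_state=1)

print(len(df_demo_train), len(df_demo_test))
```

Passing a single DataFrame returns two DataFrames, which fits the "df_train" / "df_test" naming suggested above.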


In [None]:
# split into train and test set
# the parameter test_size = 0.2 uses a random sample of 20% as a test set for validation. A random state of 1 sets a seed to make results easier to compare.


Now that we have separate training and test sets, we need to differentiate between the features (x) and the result (y). In our case, the features are the 4 columns with numerical data, and the result is the species of the flower.
In the next block, we should end up with 4 different dataframes:

__df_train_x__: features of the training set <br>
__df_train_y__: results of the training set

__df_test_x__: features of the test set <br>
__df_test_y__: results of the test set

pandas documentation on how to split up dataframes by column: https://pandas.pydata.org/docs/user_guide/indexing.html#basics
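
Splitting a DataFrame into feature columns and a result column could look like the following sketch. The table and its column names (`height`, `weight`, `kind`) are made up; for the Iris data the result column is the species.

```python
import pandas as pd

# Made-up table: two numeric feature columns and one class column
df_demo = pd.DataFrame({
    "height": [1.2, 3.4, 2.2],
    "weight": [10, 30, 20],
    "kind": ["a", "b", "a"],
})

# Features: every column except the result column
df_demo_x = df_demo.drop(columns="kind")
# Alternative: select by a list of names, e.g. df_demo[["height", "weight"]]

# Result: the single column holding the class
df_demo_y = df_demo["kind"]
```

The same two steps are applied once to the training set and once to the test set.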

# Train and test

We use a Support Vector Machine (SVM) or, as this is a classification task, a Support Vector Classifier (SVC), because it does the same thing we did in the intro task: it calculates a dividing line between the classes.

More to read on SVCs: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html <br>
We need to build a new Support Vector Classifier by using its constructor.
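
To show the overall pattern of constructing, fitting and predicting, here is a sketch on six made-up 2-D points rather than on the Iris data. The names `X_demo`, `y_demo` and `demo_model` are placeholders, and `kernel="linear"` is chosen here only to get a straight dividing line; `SVC()` without arguments uses the "rbf" kernel.

```python
from sklearn import svm

# Made-up points: class "a" sits bottom-left, class "b" top-right
X_demo = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
y_demo = ["a", "a", "a", "b", "b", "b"]

# Build the classifier; the linear kernel is an assumption for this sketch
demo_model = svm.SVC(kernel="linear")

# fit() learns the boundary from features and their known classes
demo_model.fit(X_demo, y_demo)

# predict() returns one class per input row
demo_prediction = demo_model.predict([[1.2, 1.3], [6.8, 6.2]])
print(demo_prediction)
```

For the exercise, the training features and results take the place of `X_demo` and `y_demo`.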


In [None]:
# Build an instance of the Support Vector Machine
model = 


In [None]:
# We can train the model by using its "fit" method. This is also shown in the examples-section of the SVC-documentation.


We now use the model to predict the classes of the test data. <br>
This is done by calling the "predict" method of the model and feeding in the x-data of the test set.


In [None]:
df_prediction = 

Then we compare the predictions to the true classes of the test set. Accuracy is a great tool for this, as it measures which percentage of the test set is labeled correctly.
If you want to understand in full which metrics can be used to ensure the quality of a model, https://en.wikipedia.org/wiki/F-score (__CN statistics__) is a great place to start. The section "Diagnostic testing" gives a great overview of different metrics and their use cases.

We can use the imported "metrics" module of sklearn to calculate the accuracy_score. This function needs the true classes and the predictions for every record in the test set.
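
What `accuracy_score` computes can be seen on two short hand-written label lists. The lists and the name `demo_accuracy` are made up for illustration.

```python
from sklearn import metrics

# Made-up true classes and predictions for five records
y_true_demo = ["a", "b", "b", "a", "b"]
y_pred_demo = ["a", "b", "a", "a", "b"]

# Accuracy = correctly labeled records / all records, here 4 of 5
demo_accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)
print(demo_accuracy)  # 0.8
```

The true classes go in first and the predictions second.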

In [None]:

svc_accuracy = 

print(svc_accuracy)

If everything worked as intended, the model labels around 97% of the test data correctly.

DONE! There we have a first AI-model with a great accuracy.

AI is no witchcraft! In a real-world example, finding good data, understanding it, and finding a good model are the main challenges for many datasets. "Fine-tuning" models is also a way to get even better results, by diving deeper into the underlying statistics and using so-called hyperparameters to fit the model even better to the use case. <br>
To do that, many models like this one are built and compared to find a few contenders. These can be used in a real-world test and overseen by knowledgeable users, who recognize mistakes and flaws, to arrive at a final model or at another iteration of training with even better data.

 

## Time left?

Let's look into where the one label is wrong.

For that deeper dive into the labels, a confusion matrix is the tool to use. This matrix shows the true label of every flower in the test set and how the model labeled it. Both imports are documented in the sklearn documentation.
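
How to read such a matrix can be sketched with hand-written labels. The label lists and the names `labels_demo` and `cm_demo` are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up true classes and predictions for five records
labels_demo = ["a", "b"]
y_true_demo = ["a", "b", "b", "a", "b"]
y_pred_demo = ["a", "b", "a", "a", "b"]

# Rows are the true classes, columns the predicted classes,
# both in the order given by labels
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=labels_demo)
print(cm_demo)
```

In this sketch, the single off-diagonal entry is the one "b" record that was predicted as "a". To draw the matrix, `ConfusionMatrixDisplay(confusion_matrix=cm_demo, display_labels=labels_demo).plot()` can be used.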

In [None]:
# Diving into the exact mistakes of the model
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


# Caught fire? Further information!
- Fine-tuning models: https://scikit-learn.org/stable/modules/grid_search.html

- More algorithms and models: https://scikit-learn.org/stable/machine_learning_map.html

- Math behind the SVM (MIT course): https://www.youtube.com/watch?v=_PwhiWxHK8o&t

