# Iris Flower Classification

## Introduction

The Iris flower classification problem is famous in the world of machine learning. Dating back to R.A. Fisher’s 1936 paper, “[The Use of Multiple Measurements in Taxonomic Problems](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1936.tb02137.x),” the Iris dataset has long been used as an introductory machine learning example.

**Problem**

The Iris flower classification problem is to distinguish (classify) between different species of the Iris flower based on four measurements (features):
* sepal length
* sepal width 
* petal length  
* petal width

<img src="images/iris.png" />

This classical machine learning problem involves three types of Iris flowers:
* Iris Setosa
* Iris Versicolor
* Iris Virginica

These three types are shown in the following figure:

<img src="images/iris_flowers.jpg" />

*Note*: The data set we're working with is the famous [Iris data set](https://archive.ics.uci.edu/ml/datasets/Iris), which is included with this notebook.

## <center> Elementary School Science </center>

As a reminder, here is the "Parts of a Flower" lesson from elementary school science:
<img src="images/flower_arabic.jpg" />

[Image source](https://twitter.com/Noura_alhumaidi/status/912787075174404096)

## Problem Solving

### Let's get started

First, we will import the required libraries.


In [1]:
# Load Pandas - Python data analysis library
import pandas as pd

# Import scikit-learn's built-in data sets
from sklearn import datasets
# Alternatively: from sklearn.datasets import load_iris

from sklearn import tree
from sklearn.model_selection import train_test_split

# Load visualization libraries
import matplotlib.pyplot as plt
import seaborn as sb

# This line tells the notebook to show plots inside of the notebook
%matplotlib inline

### Viewing the iris dataset


Let us first have a quick look and examine what is in the dataset.
The dataset consists of:
* 150 examples
* 3 labels: species of Iris (setosa, versicolor, and virginica)
* 4 features: sepal length and width, petal length and width, in cm

Let's start by writing Python code that loads the Iris dataset:

In [2]:
iris = datasets.load_iris()

Now that we've loaded the dataset, let's examine what is in it:

In [3]:
iris.data # this outputs a NumPy array

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

Let's check the shape of the NumPy array:


In [4]:
iris.data.shape

(150, 4)

This means that the data is 150 rows by 4 columns. Let's look at the first row:

In [5]:
iris.data[0]


array([5.1, 3.5, 1.4, 0.2])

The output for the first row has four numbers. Let's determine what they mean by typing:


In [6]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

The feature names label the columns of the data. They are strings, and each one describes the flower dimension stored in the corresponding column.

Putting it all together, we have 150 examples of flowers with four measurements per flower in centimeters. For example, the first flower has measurements of 
* 5.1 cm for sepal length, 
* 3.5 cm for sepal width, 
* 1.4 cm for petal length, and
* 0.2 cm for petal width.
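
The same pairing of feature names and values can be produced in code. The following is a minimal sketch; the name `first_flower` is introduced here for illustration and is not part of the original notebook.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Pair each feature name with the matching value in the first row
first_flower = dict(zip(iris.feature_names, iris.data[0]))

for name, value in first_flower.items():
    print(f"{name}: {value}")
```

Using `zip` keeps names and values aligned by position, so the same line works for any row index.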

Now, let's look at the output variable in a similar manner:

In [7]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

The target variable is an array containing only the values 0, 1, and 2.

That means there are exactly three possible outputs (classes), one per species.

Type the following to see what the numbers refer to:

In [8]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

The output of the _iris.target_names_ variable gives the English names for the numbers in the _iris.target_ variable:
* the number **zero** corresponds to the *setosa* flower,
* the number **one** corresponds to the *versicolor* flower, and
* the number **two** corresponds to the *virginica* flower.

Look at the first row of iris.target:

In [9]:
iris.target[0]

0

This produces zero, so the first row of observations we examined before corresponds to the setosa flower.
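
The lookup from label number to species name can be applied to the whole dataset at once. The sketch below is illustrative only; the variable names `species` and `counts` are our own.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Indexing target_names with the full target array translates every label at once
species = iris.target_names[iris.target]
print(species[0])  # species name of the first flower

# Count how many examples carry each label (0, 1, 2)
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(name, count)
```

Each species appears 50 times, so the three classes are balanced.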



### Viewing the Iris dataset with Pandas
 

This section is available only in the advanced version of this notebook.
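
The advanced notebook is not reproduced here. As a rough idea of what viewing the data with Pandas could look like, here is a minimal sketch; the column name `species` and the choice of summary are assumptions of this example, not taken from the advanced version.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Build a table with one column per feature
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add a readable species column (the name "species" is our own choice)
df["species"] = iris.target_names[iris.target]

print(df.head())                     # first five flowers
print(df.groupby("species").mean())  # average measurements per species
```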

### A Machine Learning Recipe - Classification

Our task in the Iris flower problem is to classify or _predict_ the species of an Iris flower based on four measurements (features). To make predictions, we will:
* choose a machine learning algorithm (model) to solve the problem
* train the model
* make predictions
* measure how well the model performed

_Getting Ready_
* [Scikit-Learn](https://scikit-learn.org/stable/) will be used in our experiment.
* To solve the supervised classification problem, we will use a [Decision Tree](https://scikit-learn.org/stable/modules/tree.html). Because of its simplicity and interpretability, it is a commonly used algorithm.
* To measure the performance of prediction, we will split the dataset into training and test sets. The training set is the data the model learns from. The test set is the data we hold out and pretend not to know, so that we can measure how well the learned model performs on examples it has not seen. We will use [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for this task.

Let us start by splitting the dataset using [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html):

In [27]:
x=iris.data   # x contains the features
y=iris.target # y contains the labels

# test_size=0.25 holds out 25% of the examples for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)


Now,
* x_train contains the training features
* x_test contains the testing features
* y_train contains the training labels
* y_test contains the testing labels
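
To see how many examples end up in each part, the shapes of the four arrays can be printed. This is a sketch under two assumptions that are not in the cell above: `random_state=42` makes the split repeatable, and `stratify=y` keeps the species proportions similar in both parts.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x = iris.data
y = iris.target

# random_state and stratify are additions for this sketch only
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42, stratify=y
)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

With 150 examples and `test_size=0.25`, the test set is rounded up to 38 rows, which leaves 112 rows for training.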



Next, we will build and train the model. Here we train the model with the _fit_ function.

In [24]:
classifier=tree.DecisionTreeClassifier()
classifier.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
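
One reason decision trees are considered interpretable is that the learned rules can be read directly. The sketch below prints them with `tree.export_text`, which requires scikit-learn 0.21 or newer; the fixed `random_state=0` values are assumptions added here so the result is repeatable, and the exact rules depend on the split.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(x_train, y_train)

# export_text renders the learned if/else rules as plain text
rules = tree.export_text(classifier, feature_names=iris.feature_names)
print(rules)
```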

Now the model is ready to make predictions.

Predictions are made with the _predict_ function.

In [25]:
predictions=classifier.predict(x_test)
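
The _predict_ function also works for a single flower, as long as the input is a 2D array with one row per flower. The following sketch is illustrative: for brevity it trains on the full dataset rather than on a training split, and `new_flower` simply reuses the measurements of the first row examined earlier.

```python
from sklearn import datasets, tree

iris = datasets.load_iris()

# Trained on all 150 examples here only to keep the sketch short
classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(iris.data, iris.target)

# One row per flower, one column per feature (values in cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted_label = classifier.predict(new_flower)[0]

# Translate the numeric label into a species name
print(predicted_label, iris.target_names[predicted_label])
```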


These predictions can be compared with the expected output to measure the accuracy of the model.

In [26]:
from sklearn.metrics import accuracy_score
model_accuracy  = accuracy_score(y_test,predictions)
print(model_accuracy)

0.9473684210526315
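
Accuracy is the fraction of test examples whose predicted label matches the true label; a value of about 0.947 on 38 test flowers is what 36 correct predictions out of 38 would give. The sketch below computes the same quantity by hand and also prints a confusion matrix. It uses fixed `random_state=0` values, which are assumptions of this example, so its numbers may differ from the output above.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(x_train, y_train)
predictions = classifier.predict(x_test)

# Accuracy by hand: share of test flowers whose prediction equals the true label
manual_accuracy = np.mean(predictions == y_test)
print(manual_accuracy, accuracy_score(y_test, predictions))

# Rows are true species, columns are predicted species
matrix = confusion_matrix(y_test, predictions, labels=[0, 1, 2])
print(matrix)
```

The confusion matrix shows which species get mixed up, which a single accuracy number hides.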


### Exercises

* Visit the [iris demo](https://www.snaplogic.com/machine-learning-showcase/iris-flower-classification) and play with the web UI demo.
* Try different *test_size* values and write down your observations.
* Try [k-nearest neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) instead of the decision tree. What do you observe in terms of efficiency?
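
As a possible starting point for the last exercise, the sketch below trains both classifiers on the same split and records their accuracy. The dictionary names `models` and `results`, the fixed `random_state=0`, and `n_neighbors=5` are choices made for this example; feel free to change them.

```python
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Both models share the same fit/predict interface
models = {
    "decision tree": tree.DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(x_test))
    print(name, results[name])
```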