# IRIS Flowers Dataset
# KNN Classification

### In this exercise we will use the K Nearest Neighbors classifier from the sklearn library on the iris flowers dataset.

#### Let's first look at the dataset description

In [2]:
with open("datasets/iris_flowers/dataset_info.txt") as f:
    print(f.read())

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [1]:
# Importing necessary libraries
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

We will use the Pandas library to work with the dataset.

In [4]:
# Reading the dataset from csv file using pandas
df = pd.read_csv("datasets/iris_flowers/dataset.csv")
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


### We can also inspect the dataframe's metadata with the info() method and its summary statistics with the describe() method, as follows

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
class                150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 5.9 KB


In [16]:
df.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


Since we need to predict the **class**, we can treat it as the **label** and the other columns as **features**.

We will separate the features and the label into the **X** and **y** variables respectively, as follows.

In [7]:
X = df.drop(['class'], axis=1) # dropping the class column and saving the other columns into X
y = df['class'] # saving the class column into y
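
To make the effect of this split concrete, here is a minimal sketch on a hypothetical three-row dataframe. The rows are made up for illustration and only the column names follow the dataset above; `drop(columns=...)` is an equivalent spelling of `drop([...], axis=1)`.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the dataset
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 7.0, 6.3],
    "sepal width (cm)": [3.5, 3.2, 3.3],
    "petal length (cm)": [1.4, 4.7, 6.0],
    "petal width (cm)": [0.2, 1.4, 2.5],
    "class": [0, 1, 2],
})

features = toy.drop(columns=["class"])  # every column except the label
label = toy["class"]                    # the label column as a Series

print(features.shape)   # (number of rows, number of feature columns)
print(label.tolist())   # the class of each row
```

The features keep one row per flower and lose only the `class` column, while the label holds exactly one value per row.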

### Now that we have our features and labels, we can train our model.
### But we don't yet have any data to test our model's accuracy on.
### A model's accuracy is the fraction of samples for which it predicts the correct class.
### For this purpose we will use a function from sklearn that splits the data into two sets: a training set and a testing set.

In [8]:
# Hold out 15% of the rows for testing; the remaining rows are used for training
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.15)
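
By default `train_test_split` shuffles the rows randomly, so each run of the cell above can produce a different split and therefore a different accuracy. The sketch below shows one possible way to make the split repeatable. It assumes scikit-learn's bundled copy of iris (`load_iris`) as a stand-in for the CSV file, and the values `random_state=42` and `stratify=labels` are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for the notebook's CSV file
iris = load_iris(as_frame=True)
features, labels = iris.data, iris.target

# random_state fixes the shuffle so the split is the same on every run;
# stratify=labels keeps the class proportions similar in both sets
trainX, testX, trainy, testy = train_test_split(
    features, labels, test_size=0.15, random_state=42, stratify=labels
)

print(len(trainX), len(testX))                       # sizes of the two sets
print(testy.value_counts().sort_index().to_dict())   # test samples per class
```

With 150 rows and `test_size=0.15`, the test set size is rounded up to 23 rows, which leaves 127 rows for training.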

### Let's train our K Nearest Neighbors model on the training data

In [9]:
model = KNeighborsClassifier()
model.fit(trainX, trainy)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
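
The output above shows the default settings: `n_neighbors=5` with `metric='minkowski'` and `p=2`, which is the Euclidean distance, and `weights='uniform'`, which gives every neighbor an equal vote. As a rough illustration of the idea, the sketch below classifies a point by majority vote among its nearest neighbors. It is a simplified stand-in, not sklearn's implementation; the function `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors, sorting by distance only
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (petal length, petal width) in cm, made up for illustration
train_points = [
    (1.4, 0.2), (1.3, 0.2), (1.5, 0.3),   # class 0
    (4.5, 1.5), (4.7, 1.4), (4.4, 1.3),   # class 1
    (5.9, 2.1), (6.0, 2.5), (5.6, 2.2),   # class 2
]
train_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

print(knn_predict(train_points, train_labels, (1.6, 0.25), k=3))
print(knn_predict(train_points, train_labels, (4.6, 1.4), k=3))
print(knn_predict(train_points, train_labels, (5.8, 2.3), k=3))
```

Each query lands in the cluster it sits closest to, so the three calls print 0, 1 and 2. Note that "training" a KNN model mostly means storing the training data; the real work happens at prediction time.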

### We will use the model to predict the classes of the testing set and store the predictions in a variable

In [14]:
predictions = model.predict(testX)

### Let's look at predicted as well as actual values

In [13]:
pd.DataFrame(list(zip(predictions, testy)), columns=["Prediction", "Actual"])

| | Prediction | Actual |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 2 | 2 |
| 3 | 0 | 0 |
| 4 | 2 | 2 |
| 5 | 1 | 1 |
| 6 | 1 | 1 |
| 7 | 0 | 0 |
| 8 | 1 | 1 |
| 9 | 0 | 0 |


### Let's use the accuracy_score function from sklearn to see how good our model is.

This function compares the predicted values with the actual values and returns the fraction of predictions that match, as a number between 0 and 1 (so 1.0 means every prediction was correct).

In [12]:
accuracy_score(testy, predictions)

1.0
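
To see what this number means, the same calculation can be done by hand: count the matching pairs and divide by the total. The sketch below uses the ten prediction/actual pairs shown in the comparison table; the helper `manual_accuracy` and the second list with one deliberate mistake are hypothetical, added only for contrast.

```python
def manual_accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the actual label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# The ten prediction/actual pairs shown in the comparison table
predicted = [0, 0, 2, 0, 2, 1, 1, 0, 1, 0]
actual = [0, 0, 2, 0, 2, 1, 1, 0, 1, 0]
print(manual_accuracy(predicted, actual))  # 1.0: all ten match

# Hypothetical variant with one wrong prediction, for contrast
predicted_with_error = [0, 0, 2, 0, 2, 1, 2, 0, 1, 0]
print(manual_accuracy(predicted_with_error, actual))  # 0.9: nine of ten match
```

A perfect score on a small test set is encouraging, but a different random split may give a slightly lower value.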