# Iris Classification

**Objective:** Classify Iris flowers by species using the measurements provided.
For this project we will use the UCI Iris dataset: https://archive.ics.uci.edu/ml/datasets/Iris
It contains three Iris species with 50 samples each, along with four measurements for each flower.


The dataset has the following columns:
<br>
**Input variables**
<br>
1 - ID
<br>
2 - sepal length in cm
<br>
3 - sepal width in cm
<br>
4 - petal length in cm
<br>
5 - petal width in cm
<br>

**Output variable**

6 - class:
- Iris Setosa
- Iris Versicolour
- Iris Virginica



## Steps :
1. Importing Libraries
2. Exploring the Dataset
3. Exploratory Data Analysis
4. Data Preprocessing
5. Model Building
> * Logistic regression
> * Decision tree
> * KNN
> * SVM
> * Naive Bayes Classification
> * Random forest
> * Extra Tree Classifier
> * XGBoost
6. Performance Comparison
7. Conclusion

## 1. Import Libraries
Import the packages needed to process and plot the data.

### Get the Data

Use pandas to read iris.csv into a DataFrame called iris.

## 2. Exploring the Dataset

**Preview the first few rows of the data**
<br>
Use the head() method

**Check information about the columns**
<br>
Use the info() method
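
As a rough sketch of the loading and inspection steps above: the exercise itself reads `iris.csv` from disk, but to stay self-contained this example parses a six-row excerpt from an in-memory string. The column names follow a commonly distributed CSV version of the dataset and are an assumption; adjust them to match your file.

```python
import io

import pandas as pd

# Six-row excerpt standing in for iris.csv (column names are assumed).
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
51,7.0,3.2,4.7,1.4,Iris-versicolor
52,6.4,3.2,4.5,1.5,Iris-versicolor
101,6.3,3.3,6.0,2.5,Iris-virginica
102,5.8,2.7,5.1,1.9,Iris-virginica
"""

# With the real file this would be: iris = pd.read_csv("iris.csv")
iris = pd.read_csv(io.StringIO(csv_text))

print(iris.head())  # first five rows
iris.info()         # column names, dtypes and non-null counts
```

`head()` only previews rows; `info()` is the quicker way to spot missing values or columns stored with an unexpected dtype.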

## 3. Exploratory Data Analysis

Let's do some data visualization! Feel free to use whatever library you want. 

Note - Directions for a few plots are given below, we encourage you to explore further for more insights into the data!


**Use a countplot on the 'Species' column to find how the species are distributed!**

How is the distribution across species?


**Plot a correlation matrix to study the correlation between features!**

Can we eliminate any of the features? (Are any of them correlated?)
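
One possible way to get the numbers behind the count plot and the correlation matrix, sketched with the copy of the Iris data bundled with scikit-learn rather than `iris.csv`. That substitution is an assumption made to keep the example self-contained, and the bundled column names differ from those in the CSV.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays.
data = load_iris(as_frame=True)
features = data.data  # the four measurement columns

# Numeric counterpart of a countplot: number of samples per class.
class_counts = data.target.value_counts().sort_index()
print(class_counts)

# Pairwise Pearson correlations between the four features.
corr = features.corr()
print(corr.round(2))
```

The `corr` frame can be passed to a plotting function such as seaborn's `heatmap` to draw the matrix.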

**Create a scatter plot based on petal length and width.**

Which flower species seems to be the most separable?

**Now create another scatter plot based on sepal length and width.**

Which flower species seems to be the most separable?

**Use a boxplot to find out the distribution of petal width across the 3 species**

**Use a boxplot to find out the distribution of petal length across the 3 species**

**Use a boxplot to find out the distribution of sepal width across the 3 species**

**Use a boxplot to find out the distribution of sepal length across the 3 species**

## 4. Data Preprocessing

Notice that the Species column contains species names as strings. <br>

**We will convert those species names to integer labels using label encoding.**


First, let's separate the dataset into the output variable (species name) and the feature variables.

In [1]:
#Set the 'Species' column to y
#Drop the 'Species' and 'Id' column from the dataframe and set the remaining dataframe to x



Now let's assign labels to our output variable using **LabelEncoder**

In [2]:
#Import LabelEncoder and create an instance named encoder



In [3]:
#Use .fit_transform method to fit encoder to y and return encoded labels



In [4]:
#Print out y



Check which integers Iris-setosa, Iris-versicolor and Iris-virginica are converted into, respectively.
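
A minimal sketch of the encoding step, using a short hand-written list in place of the full 'Species' column (the list is illustrative only). `LabelEncoder` sorts the distinct class names and numbers them from 0.

```python
from sklearn.preprocessing import LabelEncoder

# A few species names standing in for the full 'Species' column.
species = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
y = encoder.fit_transform(species)

print(y)                 # encoded integer labels
print(encoder.classes_)  # position in this array equals the assigned label
```

`encoder.inverse_transform(y)` maps the integers back to species names, which is handy when reporting predictions.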


**Train Test Split**

In [5]:
#Import train_test_split


#Split the data set into training data and testing data in a 7:3 ratio
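
A sketch of a 7:3 split, assuming the copy of the Iris data bundled with scikit-learn in place of the `x` and `y` built above. The `random_state` and `stratify` arguments are optional additions, not requirements of the exercise: the first makes the split reproducible, the second keeps the three species equally represented in both parts.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.3 gives the 7:3 train/test ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```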



## 5. Model Building

**Since it is a classification problem we will be using the following ML algorithms :**<br>
Logistic regression<br>
Decision tree<br>
KNN<br>
SVM<br>
Naive Bayes Classification<br>
Random forest<br>
XGBoost<br>

### Logistic regression

<img src="https://image.slidesharecdn.com/logitregression-161121215510/95/intro-to-logistic-regression-4-638.jpg?cb=1479765630">
![image.png](attachment:image.png)

Logistic regression is a statistical method for modelling an outcome that depends on one or more independent variables. In its basic form the outcome is dichotomous (only two possible outcomes); for a three-class problem such as Iris, scikit-learn's LogisticRegression extends the model to the multiclass case.
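
To make the idea concrete, here is a small sketch of the two functions that turn raw model scores into probabilities: the sigmoid for a two-outcome problem and the softmax as its usual multi-class generalisation. The scores passed in are hypothetical, not fitted coefficients.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Two-class case: a score of 0 sits exactly on the decision boundary.
print(sigmoid(0.0))

# Three-class case: hypothetical scores for setosa, versicolor, virginica.
probs = softmax([2.0, 1.0, 0.1])
print(probs, probs.index(max(probs)))
```

The predicted class is simply the one with the highest probability.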

In [6]:
#Import LogisticRegression


#Create an instance of LogisticRegression() called lr_model and fit it to the training data.


#Create predictions from the test set and name the result lr_predict


In [7]:
#print out the accuracy score for LogisticRegression
#Don't forget to import accuracy_score from sklearn.metrics!  


Note the accuracy

### Random Forest Classifier

![image.png](attachment:image.png)

Random Forest is often treated as a go-to algorithm for a wide range of data science problems. On a lighter note: when you can’t think of any other algorithm, try random forest!

Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It copes well with many features, is fairly robust to outliers, and reports feature importances that can guide feature selection. It is an ensemble learning method: many decision trees, each trained on a random sample of the data, are combined into a more powerful model.
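
The ensemble idea can be seen directly on a fitted model. This sketch assumes the copy of the Iris data bundled with scikit-learn and arbitrary settings (`n_estimators=100`, `max_depth=3`, `random_state=0`); it is an illustration, not the solution to the cell below.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rfc_model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
rfc_model.fit(X, y)

# The forest is a collection of individual decision trees.
print(len(rfc_model.estimators_))

# One importance value per feature; the values sum to 1.
print(rfc_model.feature_importances_)
```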

In [8]:
#Import RandomForestClassifier


#Create an instance of RandomForestClassifier() called rfc_model and fit it to the training data.


#Create predictions from the test set and name the result rfc_predict


In [9]:
#print out the accuracy score for RandomForest


Note the accuracy

### KNN

![image.png](attachment:image.png)

K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the early 1970s.
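
A from-scratch sketch of the idea, using Euclidean distance and a majority vote over a handful of petal measurements. The helper name `knn_predict` and the tiny training set are illustrative assumptions; the exercise itself uses scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter

# (petal length, petal width) in cm for a few Iris samples, with species.
training_data = [
    ((1.4, 0.2), "setosa"),
    ((1.3, 0.2), "setosa"),
    ((4.7, 1.4), "versicolor"),
    ((4.5, 1.5), "versicolor"),
    ((6.0, 2.5), "virginica"),
    ((5.1, 1.9), "virginica"),
]


def knn_predict(point, data, k=3):
    """Classify `point` by majority vote among its k nearest neighbours."""
    by_distance = sorted(data, key=lambda item: math.dist(point, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


print(knn_predict((1.5, 0.3), training_data))
print(knn_predict((4.6, 1.4), training_data))
```

There is no training phase: all the work happens at prediction time, when distances to the stored samples are computed.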

In [10]:
#Import KNeighborsClassifier


#Create an instance of KNeighborsClassifier() with no. of neighbours = 3 called knn_model and fit it to the training data.


#Create predictions from the test set and name the result knn_predict



In [11]:
#print out the accuracy score for KNN



Note the accuracy

### Support Vector Machine

![image.png](attachment:image.png)

“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best separates the classes.

Support vectors are the observations that lie closest to the decision boundary. The Support Vector Machine is the frontier (hyperplane/line) that best segregates the classes.
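
To see which observations end up defining the boundary, a fitted scikit-learn `SVC` exposes them as `support_vectors_`. This sketch assumes a linear kernel and the copy of the Iris data bundled with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

svm_model = SVC(kernel="linear")
svm_model.fit(X, y)

# Only a subset of the training samples are kept as support vectors.
print(len(X), len(svm_model.support_vectors_))

# Number of support vectors contributed by each class.
print(svm_model.n_support_)
```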

In [12]:
#Import SVC


#Create an instance of SVC() called svm_model and fit it to the training data.


#Create predictions from the test set and name the result svm_predict



In [13]:
#print out the accuracy score for SVM



Note the accuracy

### Naive Bayes Classification

![image.png](attachment:image.png)

Naive Bayes is a simple, yet effective and commonly-used, machine learning classifier. It is a probabilistic classifier that makes classifications using the Maximum A Posteriori decision rule in a Bayesian setting. It can also be represented using a very simple Bayesian network. Naive Bayes classifiers have been especially popular for text classification, and are a traditional solution for problems such as spam detection.
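
A compact from-scratch sketch of the Gaussian variant, to show what the Maximum A Posteriori rule computes. The helper names `fit` and `predict` and the six-sample training set are illustrative assumptions; the exercise itself uses scikit-learn's `GaussianNB`.

```python
import math
from collections import defaultdict

# (petal length, petal width) in cm for a few Iris samples, with species.
samples = [
    ((1.4, 0.2), "setosa"),
    ((1.3, 0.2), "setosa"),
    ((1.5, 0.4), "setosa"),
    ((4.7, 1.4), "versicolor"),
    ((4.5, 1.5), "versicolor"),
    ((4.9, 1.5), "versicolor"),
]


def fit(data):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for features, label in data:
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9  # avoid zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(data), means, variances)
    return model


def predict(model, point):
    """Return the class with the highest unnormalised log posterior (MAP)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(point, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=log_posterior)


nb = fit(samples)
print(predict(nb, (1.4, 0.3)))
print(predict(nb, (4.6, 1.4)))
```

The "naive" part is the sum over features: each feature contributes its own likelihood term, as if the features were independent given the class.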

In [14]:
#Import GaussianNB


#Create an instance of GaussianNB() called nb_model and fit it to the training data.


#Create predictions from the test set and name the result nb_predict

In [15]:
#print out the accuracy score for NaiveBayes

Note the accuracy

### Decision Tree

![image.png](attachment:image.png)

A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
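
One common way to measure how homogeneous a set is, and the default criterion in scikit-learn's `DecisionTreeClassifier`, is Gini impurity. The sketch below computes it for the full dataset and for a hypothetical split that puts every setosa sample on one side; the function names are illustrative.

```python
def gini(class_counts):
    """Gini impurity of a node given the number of samples per class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def weighted_gini(left_counts, right_counts):
    """Impurity of a split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    total = n_left + n_right
    return (n_left / total) * gini(left_counts) + (n_right / total) * gini(right_counts)


# Counts are ordered as [setosa, versicolor, virginica].
root = [50, 50, 50]
# Hypothetical split that isolates all setosa samples on the left.
left, right = [50, 0, 0], [0, 50, 50]

print(gini(root))
print(weighted_gini(left, right))
```

A tree-building algorithm evaluates candidate splits in this way and keeps the one that lowers the weighted impurity the most.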

In [16]:
#Import DecisionTreeClassifier


#Create an instance of DecisionTreeClassifier() called dt_model with max_leaf_nodes=3 and fit it to the training data.


#Create predictions from the test set and name the result dt_predict



In [17]:
#print out the accuracy score for DecisionTree

Note the accuracy

### XGBoost

![image.png](attachment:image.png)

XGBoost is an implementation of gradient-boosted decision trees. The beauty of this powerful algorithm lies in its scalability, which drives fast learning through parallel and distributed computing and offers efficient memory usage.

It’s no wonder then that CERN recognized it as the best approach to classify signals from the Large Hadron Collider. This particular challenge posed by CERN required a solution that would be scalable to process data being generated at the rate of 3 petabytes per year and effectively distinguish an extremely rare signal from background noise in a complex physical process. XGBoost emerged as the most useful, straightforward and robust solution.

In [18]:
#Import XGBClassifier


#Create an instance of XGBClassifier() called xg_model and fit it to the training data.


#Print out the model score applied on the test set


Note the score

## 6. Performance Comparison

**You will now create an accuracy table to see all model performances at a glance**

Your first task is to create the 3 lists described below.

In [19]:
#Create an empty list named 'scores'


In [20]:
#Create a list named 'models' with the following elements - LogisticRegression(), SVC(kernel='linear'), GaussianNB(), DecisionTreeClassifier(max_leaf_nodes=3), RandomForestClassifier(max_depth=3), KNeighborsClassifier(n_neighbors=3), xgb.XGBClassifier()


In [21]:
#Create a list named 'classifiers' containing the names of the algorithms in the list 'models' which you created above. 
#E.g. SVC will be 'Support Vector Machine' and GaussianNB will be 'Naive Bayes'


Now you will use a for loop to fit each algorithm listed in 'models' above and append its accuracy score to the list 'scores'<br>
Refer to the following line-by-line format of the loop for this step

**for i in models:**

     Line 1 - set the value of a variable named 'current_model' equal to i
     Line 2 - fit current_model to the training data
     Line 3 - create predictions for the test set and store it in variable named 'current_prediction'
     Line 4 - append the value accuracy_score(y_test, current_prediction) to the list 'scores'
     


Now create a dataframe named 'models_accuracy' with the data parameter set to 'scores' and index parameter set to 'classifiers'

Congratulations! Print models_accuracy to see the dataframe you just created!
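
Put together, the comparison steps might look like the sketch below. It makes several assumptions: the copy of the Iris data bundled with scikit-learn replaces `iris.csv`, `xgb.XGBClassifier()` is left out so that the example depends only on scikit-learn and pandas, `max_iter=200` is set on `LogisticRegression` as a precaution against convergence warnings, and the column name "Accuracy" is an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scores = []
models = [
    LogisticRegression(max_iter=200),
    SVC(kernel="linear"),
    GaussianNB(),
    DecisionTreeClassifier(max_leaf_nodes=3),
    RandomForestClassifier(max_depth=3),
    KNeighborsClassifier(n_neighbors=3),
]
classifiers = [
    "Logistic Regression",
    "Support Vector Machine",
    "Naive Bayes",
    "Decision Tree",
    "Random Forest",
    "KNN",
]

for i in models:
    current_model = i
    current_model.fit(X_train, y_train)
    current_prediction = current_model.predict(X_test)
    scores.append(accuracy_score(y_test, current_prediction))

# One row per classifier, indexed by its readable name.
models_accuracy = pd.DataFrame(data=scores, index=classifiers, columns=["Accuracy"])
print(models_accuracy)
```

With only 45 test samples, differences of one or two percentage points between models correspond to a single flower, so small gaps in this table should not be over-interpreted.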

## 7. Conclusion

What insights did EDA give us about the dataset? <br>
Which model gave the highest accuracy?<br>
How did cross validation techniques and parameter tuning affect the results?