# Iris Data Analysis

## Introduction

The Iris dataset was used in R.A. Fisher's classic 1936 paper, *The Use of Multiple Measurements in Taxonomic Problems*, and is also available from the UCI Machine Learning Repository. It covers three iris species (Iris-setosa, Iris-versicolor and Iris-virginica) with 50 samples each, and records four measurements per flower: sepal length, sepal width, petal length and petal width.


**Research Work**<br>
Classify the species of an iris flower from its sepal and petal length and width.

**Hypothesis**

**$H_{0}$:-** There is no statistical relationship between the petal length and petal width of the iris flowers.<br>
**$H_{1}$:-** There is a statistical relationship between the petal length and petal width of the iris flowers.

**Assumptions**
+ At the beginning, the whole training set is treated as the root.
+ Categorical feature values are preferred. Continuous values are discretized before the model is built.
+ Records are distributed recursively on the basis of attribute values.
+ The order in which attributes are placed as the root or as internal nodes of the tree is determined by a statistical measure.

In [39]:
#Importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import csv
import statistics as s
import scipy.stats as ss
from sklearn.datasets import load_iris
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

In [38]:
#ignoring the warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

In [43]:
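# Importing the dataset from a local CSV copy. The path is specific to the author's machine;
# `data` is not used below, because the next cell loads the same dataset from scikit-learn.
data = pd.read_csv("C:/Users/apoorv.srivastava/Downloads/datasets 2/datasets/Iris/iris.data.txt", sep=',')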

In [3]:
iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.Categorical.from_codes(iris.target, iris.target_names)

In [4]:
#Printing out the features names
features.columns.values

array(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'], dtype=object)

In [5]:
#Renaming column names
features.rename(columns = {'sepal length (cm)':'SepalLength', 
                           'sepal width (cm)':'SepalWidth', 
                           'petal length (cm)':'PetalLength', 
                           'petal width (cm)':'PetalWidth'},
               inplace = True)

In [6]:
features.head()

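|   | SepalLength | SepalWidth | PetalLength | PetalWidth |
|---|-------------|------------|-------------|------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |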


In [7]:
# Testing the hypothesis with an independent two-sample t-test.
# Note: ttest_ind compares the means of the two columns; it does not
# measure the association between them (a correlation test would do that).
ttest, pVal = ss.ttest_ind(features.PetalLength, features.PetalWidth)
print("pValue = ", pVal)
if pVal < 0.05:
    print("Rejecting Null Hypothesis and Accepting Alternative Hypothesis")
else:
    print("Failing to Reject Null Hypothesis")

pValue =  3.883537681963073e-43
Rejecting Null Hypothesis and Accepting Alternative Hypothesis
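
The t-test above compares the means of the two columns. If the question is whether petal length and petal width are related, a correlation test is one option. The following is a minimal sketch, assuming a Pearson correlation test (`scipy.stats.pearsonr`) suits this question. The helper name `petal_correlation` is hypothetical and not part of the original notebook.

```python
from scipy import stats
from sklearn.datasets import load_iris


def petal_correlation():
    """Return Pearson's r and its p-value for petal length vs. petal width."""
    iris = load_iris()
    petal_length = iris.data[:, 2]  # column 2: petal length (cm)
    petal_width = iris.data[:, 3]   # column 3: petal width (cm)
    r, p_value = stats.pearsonr(petal_length, petal_width)
    return r, p_value


r, p_value = petal_correlation()
print("Pearson r =", r)
print("pValue =", p_value)
if p_value < 0.05:
    print("Reject H0: the correlation differs from zero")
else:
    print("Fail to reject H0")
```

Here the null hypothesis is that the correlation coefficient is zero, which matches the wording of $H_{0}$ more closely than a comparison of means.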


In [8]:
#Printing the categorical variable
print(target)

[setosa, setosa, setosa, setosa, setosa, ..., virginica, virginica, virginica, virginica, virginica]
Length: 150
Categories (3, object): [setosa, versicolor, virginica]


**Conclusion-1**<br>
The output above shows that the target values are categorical, with three categories: setosa, versicolor and virginica.

**Decision tree** builds classification or regression models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while incrementally growing the associated tree. The final result is a tree with decision nodes and leaf nodes. Splits are chosen with an impurity measure such as entropy (through information gain) or the Gini index; scikit-learn's `DecisionTreeClassifier` uses the Gini index by default.<br>
**Entropy:**<br>
Entropy measures how mixed the class labels in a node are: it is 0 when every sample belongs to one class and largest when the classes are evenly represented. It therefore affects where a decision tree draws its boundaries.<br>
**Information Gain:**<br>
Information gain (IG) measures how much "information" a feature gives us about the class: it is the reduction in entropy obtained by splitting the data on that feature.<br>
**Gini Index:**<br>
In decision trees, the Gini index (Gini impurity) is the probability that a randomly chosen sample in a node would be mislabeled if it were labeled at random according to the node's class distribution. It is 0 for a pure node and grows as the classes become more evenly mixed. It is distinct from the Gini coefficient used in economics to measure income inequality.
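
The three measures can be computed directly from class labels. The following is a minimal sketch in plain Python. The function names and the example split are illustrative assumptions, not part of the original notebook or of scikit-learn's internals.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Class balance of the full Iris dataset: 50 samples per species.
parent = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
# Hypothetical split that isolates setosa from the other two species.
left = ["setosa"] * 50
right = ["versicolor"] * 50 + ["virginica"] * 50

print(entropy(parent))                          # log2(3), about 1.585 bits
print(gini_impurity(parent))                    # 1 - 3 * (1/3)^2, about 0.667
print(information_gain(parent, [left, right]))  # about 0.918 bits
```

A split with higher information gain (or a larger drop in Gini impurity) leaves purer child nodes, which is why such a split is preferred when the tree is grown.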

In [9]:
# Decision trees can handle categorical targets directly, but we one-hot encode the species
# (one indicator column each for setosa, versicolor and virginica) so that the class index
# can be recovered with argmax when building the confusion matrix later.
# pandas provides get_dummies for this.
target = pd.get_dummies(target)
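
To show how the one-hot columns relate to integer class codes, here is a small sketch on a hand-written sample of four labels. The sample values and variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Four illustrative labels with the same three categories as the Iris target.
species = pd.Categorical(
    ["setosa", "virginica", "versicolor", "setosa"],
    categories=["setosa", "versicolor", "virginica"],
)

one_hot = pd.get_dummies(species)              # one indicator column per category
int_codes = species.codes                      # setosa=0, versicolor=1, virginica=2
recovered = one_hot.to_numpy().argmax(axis=1)  # column index of the set indicator

print(list(one_hot.columns))
print(int_codes.tolist())
print(recovered.tolist())
```

Taking `argmax` over each one-hot row gives back the integer code, which is the conversion the confusion-matrix cell relies on.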

In [10]:
# Hold out part of the data to evaluate the model; by default train_test_split reserves a quarter of it (test_size=0.25) for testing.
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=1)
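
For reference, the split sizes can be made explicit. The sketch below is an optional variant, not the notebook's own call: it sets `test_size=0.25` explicitly and adds `stratify`, an assumption introduced here, so that each species is represented about equally in the test set.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.25,        # same proportion as the default
    random_state=1,
    stratify=iris.target,  # keep class proportions similar in both parts
)

print(len(X_train), len(X_test))  # 112 training rows, 38 test rows
print(Counter(y_test.tolist()))   # test samples per class code
```

With 150 samples, a 25% test share comes to 38 rows, which matches the 38 predictions in the confusion matrix.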

In [11]:
# Create and train the DecisionTreeClassifier.
dtree = DecisionTreeClassifier()
model = dtree.fit(X_train, y_train)
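
`DecisionTreeClassifier()` with no arguments uses the Gini criterion. If an entropy-based tree is wanted, as in the description of entropy and information gain earlier, the criterion can be set explicitly. A minimal sketch follows; the variable name `entropy_tree` and the choice of `random_state=0` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Same estimator as in the notebook, but splitting on information gain.
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
entropy_tree.fit(iris.data, iris.target)

print(DecisionTreeClassifier().criterion)  # default criterion: gini
print(entropy_tree.criterion)              # entropy
print(entropy_tree.get_depth(), entropy_tree.get_n_leaves())
```

Fixing `random_state` makes the fitted tree reproducible between runs, since ties between equally good splits are otherwise broken at random.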

In [35]:
#Predicting using our model
y_pred = dtree.predict(X_test)

In [36]:
# This is a classification problem, so we use a confusion matrix to compare actual and predicted species.
# argmax converts each one-hot row back to its class index.
species = np.array(y_test).argmax(axis=1)
predictions = np.array(y_pred).argmax(axis=1)
confusion_matrix(species, predictions)

array([[13,  0,  0],
       [ 0, 15,  1],
       [ 0,  0,  9]], dtype=int64)

**Conclusion-2**<br>
The confusion matrix above shows that the decision tree classified 37 of the 38 test samples correctly; one versicolor sample was predicted as virginica.
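
The accuracy and the per-class figures follow directly from the confusion matrix. The sketch below recomputes them from the matrix printed above. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above (rows: actual species, columns: predicted species).
cm = np.array([[13, 0, 0],
               [0, 15, 1],
               [0, 0, 9]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # share of each actual class found
precision = np.diag(cm) / cm.sum(axis=0)  # share of each predicted class that is right

print(accuracy)   # 37/38, about 0.9737
print(recall)     # setosa 1.0, versicolor 15/16, virginica 1.0
print(precision)  # setosa 1.0, versicolor 1.0, virginica 9/10
```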

In [40]:
dtaccuracy = metrics.accuracy_score(y_test, y_pred)
print("Decision Tree Model Accuracy is {}".format(dtaccuracy*100))

Decision Tree Model Accuracy is 97.36842105263158


## **Final Conclusions**
+ The accuracy of our decision tree model is 97.37% on the 38-sample test set.
+ The t-test rejected the null hypothesis (p ≈ 3.9e-43). Strictly, this test shows that the mean petal length and mean petal width differ; it does not by itself establish a relationship between the two variables.
+ The confusion matrix shows that only one test sample was misclassified.
+ Our decision tree model gives promising results.
+ In my last repo I applied Logistic Regression to the same data and got 91.11% accuracy, which is lower than the decision tree's accuracy.
+ Next, we will try other algorithms such as SVM.
+ After that, we will know which model gives the highest accuracy.
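
The planned comparison could be sketched as follows. This is only an outline under stated assumptions: it uses 5-fold cross-validation (`cross_val_score`) rather than the single train/test split used above, an `SVC` with default settings, and hypothetical names (`candidates`, `mean_scores`). No results are claimed here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Candidate models to compare on the same data.
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

mean_scores = {}
for name, estimator in candidates.items():
    # Accuracy on each of 5 held-out folds, then averaged.
    scores = cross_val_score(estimator, iris.data, iris.target, cv=5)
    mean_scores[name] = scores.mean()
    print("{}: mean accuracy {:.4f}".format(name, mean_scores[name]))
```

Averaging over several folds gives a steadier estimate than one 38-sample test set, where a single misclassification shifts the accuracy by more than two percentage points.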