## Importing libraries

- **numpy** is a library for multidimensional arrays.
- **pandas** provides data structures for data manipulation and analysis.
- **matplotlib** is a plotting library; it is imported here but not used in the rest of this notebook.
- **sklearn** (scikit-learn) is a machine learning library that provides various algorithms. From sklearn we import:
  - **MinMaxScaler**: transforms features by scaling each feature to a given range ((0, 1) by default).
  - **tree**: the module containing the DecisionTreeClassifier we will be using.
  - **accuracy_score**: calculates the accuracy of our model on unseen data.
  - **train_test_split**: randomly splits the dataset into training and test examples.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
from sklearn.preprocessing import MinMaxScaler
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

## Importing data

Importing the iris dataset and storing it in a pandas dataframe.

In [None]:
data = pd.read_csv("/content/Iris.csv").dropna()
data.head()

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
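
If `Iris.csv` is not available at the path above, a comparable dataframe can be built from the copy of the dataset bundled with scikit-learn. This is a sketch, not the method used in this notebook: the bundled version has different column names and integer targets, so the renaming and the `Iris-` prefix below are assumptions made to mirror the CSV layout.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled iris data as a DataFrame (four features plus a "target" column)
iris = load_iris(as_frame=True)
frame = iris.frame.copy()

# Rename columns to mirror the CSV layout used in this notebook (assumed mapping)
frame.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm",
                 "PetalWidthCm", "target"]

# Convert integer targets to the "Iris-<name>" strings used in the CSV
frame["Species"] = frame["target"].map(lambda t: "Iris-" + iris.target_names[t])
frame = frame.drop(columns="target")

# Add a 1-based Id column like the one in the CSV
frame.insert(0, "Id", list(range(1, len(frame) + 1)))

print(frame.head())
```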


## Understand the data

Performing a brief exploratory data analysis: the shape of the data, the column names, and the number of unique values per column.

In [None]:
print(f"Shape of data: {data.shape}")
print(f"Columns: {data.columns}")

Shape of data: (150, 6)
Columns: Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')


In [None]:
data.nunique()

Id               150
SepalLengthCm     35
SepalWidthCm      23
PetalLengthCm     43
PetalWidthCm      22
Species            3
dtype: int64
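
Two further checks that often fit into this step are the class balance and per-feature summary statistics. The sketch below uses a small made-up dataframe (the values are illustrative, not read from `Iris.csv`) so that it runs on its own; the same two calls would apply to the notebook's `data`.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up for this example
sample = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# How many examples does each class have?
class_counts = sample["Species"].value_counts()
print(class_counts)

# Count, mean, std, min, quartiles and max of each numeric column
summary = sample.describe()
print(summary)
```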

## Changing index

Changing index of the dataframe to the Id column of the dataset.

In [None]:
# Use the Id column as the index; set_index also removes it from the columns
data = data.set_index('Id')
data

Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
146,6.7,3.0,5.2,2.3,Iris-virginica
147,6.3,2.5,5.0,1.9,Iris-virginica
148,6.5,3.0,5.2,2.0,Iris-virginica
149,6.2,3.4,5.4,2.3,Iris-virginica


## Splitting data between X and y

Splitting the dataframe into feature/input and target vectors.

In [None]:
X = np.array(data.drop(['Species'], axis = 1))
y = np.array(data['Species'])

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

X shape: (150, 4)
y shape: (150,)


## Train test split

Splitting the data into a training set (used to train the model) and a test set (used to evaluate it). 80% of the examples go to the training set and the remaining 20% to the test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (120, 4)
y_train shape: (120,)
X_test shape: (30, 4)
y_test shape: (30,)
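
With only 150 examples, a purely random split can leave the classes unevenly represented in the test set. One option, not used in this notebook, is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both parts. The sketch below uses synthetic features and labels (`X_demo`, `y_demo`) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 150 examples, 3 balanced classes of 50 each
X_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

# stratify=y_demo preserves the equal class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234, stratify=y_demo
)

print(np.bincount(y_tr))  # class counts in the training split
print(np.bincount(y_te))  # class counts in the test split
```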


## Normalizing data

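Normalizing each feature to the (0, 1) range puts all features on a common scale. Decision trees split on per-feature thresholds and so do not depend on feature scale, which makes this step optional here; it matters more for distance-based or gradient-based models. The scaler is fitted on the training set only and then applied unchanged to the test set.

In [None]:
print(f"X_train before normalization: {X_train[:5]}")

scaler = MinMaxScaler()
# Fit on the training data only, then reuse the same min/max for the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f"X_train after normalization: {X_train[:5]}")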

X_train before normalization: [[5.1 2.5 3.  1.1]
 [6.2 2.8 4.8 1.8]
 [5.  3.5 1.3 0.3]
 [6.3 2.8 5.1 1.5]
 [6.7 3.  5.  1.7]]
X_train after normalization: [[0.22222222 0.20833333 0.33898305 0.41666667]
 [0.52777778 0.33333333 0.6440678  0.70833333]
 [0.19444444 0.625      0.05084746 0.08333333]
 [0.55555556 0.33333333 0.69491525 0.58333333]
 [0.66666667 0.41666667 0.6779661  0.66666667]]
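
`MinMaxScaler` rescales each feature as `x_scaled = (x - min) / (max - min)`, where `min` and `max` are learned during fitting. The sketch below, on made-up single-feature data, checks that formula by hand and illustrates the usual pattern of fitting on training data and only transforming test data. A test value outside the training range then maps outside (0, 1), which is expected behaviour.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration
train = np.array([[2.0], [4.0], [6.0]])
test = np.array([[3.0], [8.0]])

demo_scaler = MinMaxScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns min=2, max=6 from train
test_scaled = demo_scaler.transform(test)        # reuses the training min and max

# The same computation written out by hand
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())
print(test_scaled.ravel())
```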


## Changing target categories to number

Changing the categorical species names to integer labels (0, 1, 2). scikit-learn classifiers can also work with string class labels directly, but integer labels are compact and easy to compare.

In [None]:
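targets = data['Species'].unique().tolist()

# Map each species name to an integer, in order of first appearance
labels = {name: i for i, name in enumerate(targets)}

# Replace the string labels in place (the arrays keep dtype=object)
for i in range(len(y_train)):
  y_train[i] = labels[y_train[i]]

for i in range(len(y_test)):
  y_test[i] = labels[y_test[i]]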

y_train

array([1, 2, 0, 2, 1, 0, 0, 0, 0, 1, 0, 1, 0, 2, 2, 0, 2, 2, 2, 2, 0, 2,
       2, 1, 1, 1, 1, 1, 1, 0, 0, 2, 2, 2, 0, 0, 0, 2, 1, 2, 2, 1, 0, 2,
       0, 2, 0, 1, 1, 0, 1, 0, 2, 2, 2, 1, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 2, 1, 2, 2, 1, 0, 1, 2, 0, 0, 2, 2, 1, 1, 2,
       0, 1, 2, 2, 2, 1, 0, 0, 0, 0, 2, 1, 2, 0, 0, 1, 1, 2, 1, 1, 2, 2,
       2, 0, 2, 0, 0, 2, 2, 1, 0, 0], dtype=object)
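
scikit-learn also provides `LabelEncoder` for this kind of mapping. It is shown here as an alternative, not as what this notebook does. Note that it assigns integers in sorted order of the class names rather than in order of first appearance, so the result can differ from a hand-built dictionary. The `species` array below is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up label array for illustration
species = np.array(["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)  # integers follow sorted class names

print(encoder.classes_)
print(encoded)

# inverse_transform recovers the original strings
decoded = encoder.inverse_transform(encoded)
print(decoded)
```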

## Decision tree model

Building the decision tree model using *DecisionTreeClassifier* from the sklearn library and fitting it to the training data.

In [None]:
model = tree.DecisionTreeClassifier()
model.fit(X_train, y_train.tolist())

DecisionTreeClassifier()
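
One advantage of decision trees is that the learned rules can be read directly. The sketch below uses `tree.export_text` on a tree trained on scikit-learn's bundled iris data rather than on this notebook's CSV; `max_depth=2` and `random_state=0` are choices made for this example to keep the printout short and repeatable.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth is an example choice)
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Text rendering of the split thresholds and the predicted class at each leaf
rules = tree.export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```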

## Prediction

Making predictions on the test set.

In [None]:
predictions = model.predict(X_test)
predictions

array([1, 1, 2, 0, 1, 0, 0, 0, 1, 2, 1, 0, 2, 1, 0, 1, 2, 0, 2, 1, 1, 1,
       1, 1, 2, 0, 2, 1, 2, 0])
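
The integer predictions can be turned back into species names by inverting the `labels` dictionary built in the previous section. In this sketch the mapping is written out by hand as `labels_demo`, under the assumption that the species appear in the file in the order setosa, versicolor, virginica.

```python
import numpy as np

# Assumed mapping, matching the order in which the species appear in the file
labels_demo = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# Invert it: integer label -> species name
names = {index: name for name, index in labels_demo.items()}

# First four values of the prediction output shown above
sample_predictions = np.array([1, 1, 2, 0])
predicted_names = [names[p] for p in sample_predictions]
print(predicted_names)
```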

## Evaluation

Evaluating the model to understand how well it generalizes to unseen data.

In [None]:
# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(y_test.tolist(), predictions)
print(f"Accuracy of the model: {accuracy}")

Accuracy of the model: 1.0
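
A single accuracy number on 30 test examples can be optimistic. Two complementary checks are sketched here on scikit-learn's bundled iris data rather than on this notebook's variables: a confusion matrix, which shows which classes get confused with each other, and 5-fold cross-validation, which averages accuracy over several splits. `random_state=0` and `cv=5` are example choices, not settings from the notebook.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=1234
)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1, 2])
print(cm)

# Accuracy on each of 5 folds of the full dataset
scores = cross_val_score(
    tree.DecisionTreeClassifier(random_state=0), X_iris, y_iris, cv=5
)
print(scores.mean())
```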


## Conclusion

Our decision tree model achieves a perfect accuracy of 1.0 on the 30 test examples, which it had never seen during training. This suggests that the model generalizes well to new data, although a test set this small gives only a rough estimate.


*Created by: Shayan Halder*