# Iris flower dataset

Here you can read more about the dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

### 1. Import the Data

In [1]:
# import it from sklearn
from sklearn.datasets import load_iris

# instantiate it
iris = load_iris()

In [2]:
# the dataset gives us X and y
iris.data # X = data (the features)
iris.target # y = target (the labels)

# we can think of the model as a formula: f(X) = y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [5]:
X = iris.data # the measurements (features)
y = iris.target # the species to predict

# each column is a feature, so we store the feature and target names
feature_names = iris.feature_names
target_names = iris.target_names

In [7]:
# printing them shows the column names and the possible species
feature_names, target_names

(['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 array(['setosa', 'versicolor', 'virginica'], dtype='<U10'))

In [9]:
# quick look at the type
type(X) # will be numpy array

numpy.ndarray
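
To get a feel for what X and y contain, a short sketch like the following can help. It only uses `load_iris` and NumPy; the variable names `n_samples`, `n_features` and `class_counts` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# X has one row per flower and one column per measurement
n_samples, n_features = X.shape
print(n_samples, n_features)  # 150 4

# y holds one integer label per flower; count the flowers per species
class_counts = np.bincount(y)
for name, count in zip(iris.target_names, class_counts):
    print(name, count)
```

The dataset is balanced: each of the three species has 50 rows.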

### 2. Split into testing and training set

In [14]:
# randomly split the data into a training set and a test set
from sklearn.model_selection import train_test_split

# create the train and test parts of X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # test size 20%

# view the dimensionality
print(X_train.shape)
print(X_test.shape)

(120, 4)
(30, 4)
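
Because the split is random, scores computed on it can change from run to run. A minimal sketch of a reproducible split, assuming we fix `random_state` (the value 42 is arbitrary) and pass `stratify=y` so that each species keeps the same proportion in both sets:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# random_state makes the shuffle repeatable; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# with stratification the 30 test rows contain 10 flowers of each species
test_counts = np.bincount(y_test)
print(test_counts)  # [10 10 10]
```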


### 3. Create our Model
We will be using the k-nearest neighbors (KNN) algorithm, you can read more about it here - https://scikit-learn.org/stable/modules/neighbors.html

Here you can read how the model works - https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/

* It is a classification model: it classifies a new sample by finding the closest training samples and taking a majority vote over their labels
* But there are many algorithms -> the job in machine learning is to find the one that works best for the data
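
The idea behind k-nearest neighbors can be sketched in a few lines of NumPy: to classify a new point, find the k closest training points and take a majority vote over their labels. This is an illustrative toy (the function name `knn_predict_one` and the toy data are made up here), not how scikit-learn implements it internally.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote of its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# toy data: two clusters in 2D, labelled 0 and 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict_one(X_toy, y_toy, np.array([0.3, 0.3]))
label_near_five = knn_predict_one(X_toy, y_toy, np.array([4.8, 5.0]))
print(label_near_origin, label_near_five)  # 0 1
```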

In [23]:
# import the model
from sklearn.neighbors import KNeighborsClassifier

# set the parameters: n_neighbors is how many nearest training samples vote on each prediction (3 here), it is not the number of species
knn = KNeighborsClassifier(n_neighbors=3)
# fit the model
knn.fit(X_train, y_train)

In [24]:
# start predictions
y_preds = knn.predict(X_test)

In [25]:
# import metrics to view the score of our model
from sklearn import metrics

# our model scores about 93% on this split
metrics.accuracy_score(y_test, y_preds)

0.9333333333333333
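
Accuracy is a single number; a confusion matrix shows which species get mixed up with each other. A small sketch, assuming the same kind of split and model as in this section, but with a fixed `random_state` (the value 0 is arbitrary) so it can be rerun:

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_preds = knn.predict(X_test)

# rows = true species, columns = predicted species
cm = metrics.confusion_matrix(y_test, y_preds, labels=[0, 1, 2])
print(cm)

# accuracy is the diagonal (correct predictions) divided by the total
accuracy = metrics.accuracy_score(y_test, y_preds)
print(accuracy)
```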

### 4. Improve our Model

To improve our model we can
1. Change the test size - make it bigger or smaller
2. Tune n_neighbors - this is a common step, but it is best done with cross-validation rather than by checking the test score
3. Add more features - the iris dataset only has the four we already use, but on other datasets this can help
4. Improve the algorithm / find a better one
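
One way to try different values of `n_neighbors` without relying on a single test split is cross-validation. The sketch below is only an illustration: the candidate values in `candidate_ks` and the choice of 5 folds are assumptions made here, not recommendations from the dataset or from scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

candidate_ks = [1, 3, 5, 7, 9, 11]  # hypothetical values to compare
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: train on 4/5 of the data, score on the rest, 5 times
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()
    print(k, round(mean_scores[k], 3))

# pick the k with the highest mean accuracy across the folds
best_k = max(mean_scores, key=mean_scores.get)
print("best k:", best_k)
```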

In [35]:
# import a different model
from sklearn.tree import DecisionTreeClassifier

# instantiate new model and new predictions
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_preds2 = dt.predict(X_test)

In [36]:
# import metrics to view the score of our model
from sklearn import metrics

# this model also scores 93% on this split - it doesn't change
metrics.accuracy_score(y_test, y_preds2)

0.9333333333333333
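
With only 30 test flowers, every misclassified flower moves the accuracy by 1/30 (about 3.3 percentage points), and 0.9333 is exactly 28 out of 30, so two different models can easily land on the same score. A hedged sketch of a finer comparison, assuming 10-fold cross-validation on the full dataset (the fold count and `random_state=0` are arbitrary choices made here):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

cv_means = {}
for name, model in models.items():
    # every flower is used for testing exactly once across the 10 folds
    scores = cross_val_score(model, X, y, cv=10)
    cv_means[name] = scores.mean()
    print(name, round(cv_means[name], 3))
```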

### 5. Some more

Let's say a user measured a new iris flower and sent us the four measurements - this is how we would predict its species

Here we can see all columns of the dataset - https://www.kaggle.com/datasets/arshid/iris-flower-dataset

In [39]:
# create the samples (made-up measurements, in the same column order as feature_names)
sample = [[3,5,4,2], [2,3,5,4]]

# create predictions
predictions = knn.predict(sample)
# create list with prediction species
pred_species = [iris.target_names[p] for p in predictions]

# print the predictions
print("predictions: ", pred_species)

predictions:  ['versicolor', 'virginica']
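
Real input would be four measurements in centimeters, in the same order as `feature_names`. A hedged sketch with a plausible flower (the values 5.1, 3.5, 1.4, 0.2 are the first row of the dataset, a setosa), which also calls `predict_proba`; for KNN with the default uniform weights this is the fraction of neighbors voting for each species. For brevity the model here is trained on all rows, which is an assumption of this example only.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# trained on all rows for brevity (no train/test split in this demo)
knn = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

# sepal length, sepal width, petal length, petal width (cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]

predicted_species = iris.target_names[knn.predict(new_flower)[0]]
vote_fractions = knn.predict_proba(new_flower)[0]

print(predicted_species)
for name, fraction in zip(iris.target_names, vote_fractions):
    print(name, fraction)
```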


### 6. Export our Model

For this we will be using the joblib library (a separate package that scikit-learn itself depends on), here you can read more - https://scikit-learn.org/stable/model_persistence.html

Here is more about the library - https://pypi.org/project/joblib/ , https://joblib.readthedocs.io/en/stable/

* We have now created a trained model, which we could use in applications.

But there is a problem - our model is only a toy: real models are usually trained on far more data (often millions of rows), and that is really time consuming

* Model persistence - if a user sends new input we don't want to retrain everything -> save the trained model to a file

For example, an app on an iPhone ships with the model already trained, so it does not get retrained on every use


In [44]:
# import the functions for saving and loading
from joblib import dump, load

# save the model / dump it into a file
dump(knn, 'mlbrain.joblib') # will store into a binary file

['mlbrain.joblib']

In [48]:
# instead of retraining the model we can do this
model = load('mlbrain.joblib')

# now we can predict again
sample = [[3,5,4,2], [2,3,5,4]]

# create predictions
predictions = model.predict(sample)
# create list with prediction species
pred_species = [iris.target_names[p] for p in predictions]

# print the predictions
print("predictions: ", pred_species)

predictions:  ['versicolor', 'virginica']
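
Note that the loaded model above still relies on `iris.target_names` being in memory. One possible way around that, sketched here under assumptions, is to save the model together with the label names in a single dictionary. The dictionary keys and the file name `iris_bundle.joblib` are hypothetical, and a temporary directory is used so the example leaves no files behind. Keep in mind that `load` can execute arbitrary code, so it should only be used on files you trust.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

# bundle the model with the label names so the file is self-contained
bundle = {"model": knn, "target_names": list(iris.target_names)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_bundle.joblib")  # hypothetical file name
    dump(bundle, path)
    restored = load(path)  # only load files you trust

# the restored model should give the same predictions as the original
sample = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]
original_preds = knn.predict(sample)
restored_preds = restored["model"].predict(sample)
same_predictions = np.array_equal(original_preds, restored_preds)

restored_species = [restored["target_names"][p] for p in restored_preds]
print(same_predictions, restored_species)
```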
