- 1. Project Overview
- 2. Problem Statement
- 3. Metrics
- 4. The Iris Flower Dataset
- 5. Methodology
- 6. Results
- 7. Flask Web App
- 8. Files Structure
- 9. Requirements
- 10. Running Process
- 11. Conclusion
- 12. Improvements
- 13. Acknowledgements
In this project, we analyze the Iris flower dataset, which covers three species: Setosa, Versicolor and Virginica. Each species has around 50 records in the dataset. The main goal is to build a classification model that uses the length and width of the sepals and petals to categorize new flowers.
Identifying Iris flowers by eye is difficult, especially for non-experts, but machine learning algorithms can classify a flower with high accuracy. This is a classification problem in which the model determines whether a flower is Setosa, Versicolor or Virginica. In this project, we use Logistic Regression from the scikit-learn library.
For evaluation, we use the accuracy score, which is the number of correctly classified instances divided by the total number of instances. We chose accuracy over other performance metrics because we want to know how the model performs overall; per-class sensitivity and specificity matter less in this situation.
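As a minimal sketch of the metric described above, accuracy can be computed by hand in a few lines. The helper name `accuracy` and the toy labels are illustrative assumptions, not code from this repository; scikit-learn's `accuracy_score` computes the same ratio.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (made up for illustration): 3 of the 4 predictions are correct.
y_true = ["Setosa", "Versicolor", "Virginica", "Virginica"]
y_pred = ["Setosa", "Versicolor", "Versicolor", "Virginica"]
score = accuracy(y_true, y_pred)
print(score)  # 0.75
```

Accuracy is a reasonable summary here because the three classes are roughly balanced; on a heavily imbalanced dataset it could hide poor performance on a rare class.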
The Iris flower dataset was downloaded from Kaggle as a comma-separated values (CSV) file. It contains 150 records with 5 attributes: Petal Length, Petal Width, Sepal Length, Sepal Width and Class (Species).
The data exploration and data visualization were done in /data/process_data.ipynb, but here are some of the findings:
As seen above, there are almost 50 records of each flower class in the dataset.
As shown above, sepals range from 4.3 cm to 7.9 cm in length and from 2.0 cm to 4.4 cm in width, while petals range from 1.0 cm to 6.9 cm in length and from 0.1 cm to 2.5 cm in width.
The chart also shows that Virginica has the longest sepals, reaching 7.9 cm, whereas Setosa sepal lengths range from 4.3 cm to 5.8 cm. On the other hand, Setosa has the widest sepals at 4.4 cm, and Virginica has the largest petal length and width.
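The findings above can be reproduced with a few pandas calls. This is only a sketch: it uses scikit-learn's bundled copy of Iris as a stand-in for data/dataset.csv, so the column names (for example `sepal length (cm)`) are those of scikit-learn and may differ from the Kaggle file.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for data/dataset.csv (assumption): scikit-learn's bundled Iris data.
iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
df = df.drop(columns="target")

# Number of records per species.
class_counts = df["species"].value_counts()

# Minimum and maximum of every measurement over the whole dataset.
overall_ranges = df.drop(columns="species").agg(["min", "max"]).T

# Minimum and maximum of every measurement within each species.
per_species = df.groupby("species").agg(["min", "max"])

print(class_counts)
print(overall_ranges)
print(per_species["sepal length (cm)"])
```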
The machine learning model was trained on the Iris flower dataset using the scikit-learn Python library. The model is Logistic Regression, which suits this multi-class problem because scikit-learn extends it beyond two classes, either with a one-vs-rest scheme or with a multinomial formulation, depending on the solver and library version. We used the accuracy score to measure the model's performance.
The data preprocessing was done in /data/process_data.ipynb using the pandas library. There was only one step: encoding the labels with LabelEncoder from scikit-learn, which converts the flower classes (Setosa, Versicolor and Virginica) to integer codes. LabelEncoder numbers classes from 0, so they become 0, 1 and 2. This step matters because machine learning code generally works with numeric values rather than text.
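The encoding step can be illustrated with a short sketch. The label strings below are assumptions (the Kaggle file may spell them differently); note that `LabelEncoder` sorts the class names and numbers them starting from 0.

```python
from sklearn.preprocessing import LabelEncoder

# Example labels (assumed spelling; the CSV may use different strings).
species = ["Setosa", "Versicolor", "Virginica", "Setosa"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

print(list(encoder.classes_))  # class names in sorted order
print(encoded.tolist())        # [0, 1, 2, 0]

# The mapping is reversible, which is useful when showing predictions to users.
decoded = encoder.inverse_transform(encoded)
```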
The algorithms and techniques were implemented with the scikit-learn library. The procedure consists of five phases:
- Loading the data as a pandas DataFrame from the database
- Splitting the dataset into train and test sets using the train_test_split function
- Building and training the logistic regression model
- Evaluating the model using the accuracy score
- Saving the model as a pickle file
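The five phases above can be sketched end to end as follows. This is not the contents of train_classifier.py: the in-memory database, the table name `iris`, the 80/20 split and `max_iter=500` are all assumptions made so that the example runs on its own.

```python
import pickle
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for data/dataset.db (assumption): an in-memory SQLite database
# filled from scikit-learn's bundled Iris data. The table name is hypothetical.
conn = sqlite3.connect(":memory:")
load_iris(as_frame=True).frame.to_sql("iris", conn, index=False)

# 1. Load the data as a pandas DataFrame from the database.
df = pd.read_sql_query("SELECT * FROM iris", conn)
conn.close()
X = df.drop(columns="target")
y = df["target"]

# 2. Split the dataset into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Build and train the logistic regression model.
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

# 4. Evaluate the model using the accuracy score.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")

# 5. Save the model as a pickle file (here in a temporary directory),
#    then load it back to check that the round trip works.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)
```

Only load pickle files you created yourself or otherwise trust, since unpickling can execute arbitrary code.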
In this project, we used GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator. The following hyperparameters were given to the grid search:
parameters = {
    'C': [0.1, 1, 10, 100],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'solver': ['lbfgs', 'liblinear'],
    'max_iter': [100, 500]
}
The model was evaluated with the accuracy score. GridSearchCV used five-fold cross-validation to search the given parameters for the best model, and it identified the following as the optimal hyperparameters, which achieved a 96% accuracy score:
Best parameters: {'C': 10, 'max_iter': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
Grid search was the only tuning strategy used in this project, and the best parameters gave high accuracy.
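The search described above can be sketched as follows. To keep the example runnable across scikit-learn versions, it uses a reduced grid over `C` and `max_iter` only, which is an assumption and not the grid from this project: some penalty and solver combinations are not supported (for example, `lbfgs` does not handle `l1` or `elasticnet`), and by default GridSearchCV scores such failing fits as NaN and emits a warning. The data comes from scikit-learn's bundled Iris copy as a stand-in for the project's database.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (assumption): scikit-learn's bundled Iris dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Reduced grid (assumption): only settings the default solver supports.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "max_iter": [100, 500],
}

# Five-fold cross-validation, scored by accuracy, as described in the text.
search = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", search.score(X_test, y_test))
```

The exact best parameters and scores depend on the split and the library version, so they may not match the values reported above.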
The Flask web app lets the user apply the trained model to new flowers and find their species easily.
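A minimal sketch of how a Flask app can expose a trained model is shown below. It is not the contents of app/app.py: the `/predict` route, the JSON field names and the in-process training (a stand-in for loading models/model.pkl) are all assumptions, and the real app collects its input through index.html instead.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in for loading models/model.pkl (assumption): train a small model here.
iris = load_iris()
model = LogisticRegression(max_iter=500).fit(iris.data, iris.target)

app = Flask(__name__)

# Hypothetical field names, in the same order as the training features.
FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


@app.route("/predict", methods=["POST"])
def predict():
    """Read four measurements from a JSON body and return the predicted species."""
    payload = request.get_json(silent=True) or {}
    try:
        row = [float(payload[name]) for name in FEATURES]
    except (KeyError, TypeError, ValueError):
        return jsonify(error=f"expected numeric fields: {FEATURES}"), 400
    label = int(model.predict([row])[0])
    return jsonify(species=str(iris.target_names[label]))


# To serve it on the port used by this project:
# app.run(host="0.0.0.0", port=3001)
```

With the server running, a POST request to `/predict` with a JSON body such as `{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}` returns the predicted species as JSON.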
├── app                      # Website folder
│   ├── app.py               # Responsible for running the website
│   ├── templates
│   │   └── index.html       # Allows the user to input and predict new flower properties
│   └── Static
│       └── index.css        # This file has the Cascading Style Sheets of the index.html
│
├── data
│   ├── dataset.csv          # The Iris flower dataset
│   ├── dataset.db           # The prepared dataset as SQLite database
│   └── process_data.py      # Responsible for dataset preparation
│
├── models
│   ├── model.pkl            # The Logistic Regression Model
│   └── train_classifier.py  # Responsible for creating the machine learning model
│
├── images                   # This folder contains all images for the readme file
│   └── flower.jpg
│
└── README.md                # Readme file
To run this project, you must have Python 3 installed on your machine, along with all the libraries listed in requirments.txt. Run the following command to install them:
pip3 install -r requirments.txt
This section explains how to run each part of this project from the command prompt or terminal.
To look at the data exploration and data visualization, open /data/process_data.ipynb with Jupyter Notebook.
To re-train the classifier, go to the models directory in the terminal or command prompt and run the following:
python3 train_classifier.py ../data/<database_name>.db <model_name>.pkl
To run the web app, go to the app directory in the terminal or command prompt and run the following:
python3 app.py
The website will then be available at 0.0.0.0:3001.
In conclusion, classifying Iris flower species can be challenging, especially for non-experts, but machine learning algorithms make it much easier to determine the flower class. This project built a simple but effective machine learning model based on the logistic regression algorithm from the scikit-learn Python library. We also used grid search to select the best model among the hyperparameter combinations we tried.
We are proud of our solution because it achieved such high accuracy, but there is always room for improvement. In the future, we can attempt to create a deep learning model using neural networks, which may yield even better and more accurate results. You are also welcome to fork this repository and try to enhance the solution on your own.
I would like to express my appreciation to Misk Academy and Udacity for their amazing work on the data science course and the support they gave us in building this project.