This repo contains all the project files for the Udacity Data Science Nanodegree program, fulfilling the requirement of the final Capstone Project (Starbucks).

The accompanying Medium blog post can be found here.

Launch a notebook using Binder

Table of Contents

  1. Installation
  2. Project Organization
  3. Evaluation Strategy & Results
  4. Licensing, Authors, and Acknowledgements

Installation

Libraries used

scikit-learn==0.22.2.post1
nltk==3.4.5
numpy==1.18.2
pandas==1.0.3
joblib==0.13.2
progressbar==2.5
catboost==0.20.2
matplotlib==3.2.1
seaborn==0.10.0
scikit-plot==0.3.7
scipy==1.4.1
notebook==6.0.3
jupyterlab==2.1.0
jupyter==1.0.0
bokeh==2.0.1
cookiecutter==1.7.0
ipywidgets==7.5.1
nbconvert==5.6.1
setuptools==46.0.0
tqdm==4.45.0

I used Python 3.7.7 on my local machine and prepared the project notebook in the Safari web browser.

Project Organization


├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── notebooks          <- Jupyter notebooks
│   └── project_notebook.ipynb <- Final version of the project notebook containing the end-to-end approach
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│   └── resources-citation.md <- Collection of resources I found useful (not exhaustive)
├── reports            <- Generated analysis as HTML
└── project_notebook-v2.html <- HTML version of the latest project notebook for quick viewing

Evaluation Strategy & Results

Everything is included in the Jupyter notebook provided in the notebooks folder. Alternatively, you can view the HTML version of the notebook in the project root directory.

Choice of Measurement Metrics for Multiclass Classification

Our problem falls into the category of multiclass classification, which can be summarised as follows:
Given a dataset with instances x_i and N classes, where every instance x_i belongs to exactly one class y_i, the task of a multiclass classifier is to predict that class.
After training and testing, we have a table with the correct class y_i and the predicted class a_i for every instance x_i in the test set. So for every instance, we have either a match (y_i = a_i) or a miss (y_i ≠ a_i).
Assuming we have a balanced class distribution in our training set, evaluation using a confusion matrix together with the average accuracy score should be sufficient. However, the F1-score can also be used to evaluate a multiclass problem.

Since the cost of misclassification is not high in our case (sending an offer to a non-responsive customer doesn't cost the company extra money), the F1-score is not necessary.
In this project, I prefer to use the confusion matrix and the average accuracy score as our evaluation measures.

Confusion Matrix:
A confusion matrix shows the combination of the actual and predicted classes. In the scikit-learn convention used here, each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class. It is a good measure of whether a model can account for the overlap in class properties, and it shows which classes are most easily confused.
Accuracy:
The percentage of total items classified correctly: (TP + TN) / (P + N)

  • TP: True Positive
  • TN: True Negative
  • N: Negative
  • P: Positive
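
As a minimal sketch of how these metrics can be computed with scikit-learn (the labels below are purely illustrative, not the project's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

# Hypothetical true and predicted labels for a 3-class problem.
y_test = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])

# scikit-learn convention: rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))

# Accuracy: (TP + TN) / (P + N), i.e. the fraction of matches y_i == a_i.
print("accuracy:", accuracy_score(y_test, y_pred))

# Macro-averaged F1, as an alternative when classes are imbalanced.
print("macro F1:", f1_score(y_test, y_pred, average="macro"))
```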

For unbalanced class distributions, I provide a weight for each class label, and CatBoost handles the imbalance automatically.
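
A sketch of how class weights can be passed via CatBoost's class_weights parameter (the weight values below are purely illustrative):

```python
from catboost import CatBoostClassifier

# Illustrative inverse-frequency weights for a 4-class problem;
# CatBoost scales each class's contribution to the loss by its weight.
model = CatBoostClassifier(
    loss_function="MultiClass",
    class_weights=[1.0, 2.0, 3.5, 5.0],  # one weight per class label
    verbose=False,
)
```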

Evaluation Strategy:

1. Initial Model Evaluation with fixed values of hyperparameters

  • I evaluate the CatBoost classifier with the following fixed hyperparameters on all classification problems (class_2, class_3, class_4, class_5):

    • number of iterations = 2000
    • loss_function = 'MultiClass'
    • early_stopping_rounds = 50
    • eval_metric = 'Accuracy'

    and the rest of the parameter values at the defaults provided by CatBoost.
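
A minimal sketch of this initial setup (the toy data and variable names are illustrative stand-ins, not the notebook's actual engineered features):

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the engineered features (illustrative only).
rng = np.random.RandomState(42)
X = rng.rand(500, 10)
y = rng.randint(0, 4, size=500)  # e.g. the class_4 problem
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Fixed hyperparameters from this evaluation round; everything else
# stays at CatBoost's defaults.
model = CatBoostClassifier(iterations=2000,
                           loss_function="MultiClass",
                           eval_metric="Accuracy",
                           verbose=False)

# Early stopping halts training once validation accuracy stops
# improving for 50 consecutive rounds.
model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)
print(model.score(X_val, y_val))  # mean accuracy on the validation set
```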

2. Model Evaluation with finding best values of hyperparameters using GridSearch

  • In this round of experiments, I wrote a custom GridSearch function, which searches the given range for each hyperparameter and returns the hyperparameter combination with the best average accuracy on the training data (a sketch appears at the end of this section).

  • For the CatBoost multiclass classifier, there are numerous hyperparameters to tune. An extensive list can be found here: https://catboost.ai/docs/concepts/parameter-tuning.html

  • Here I selected only the following hyperparameters, specified a recommended range for each (found via some research on the CatBoost website and Kaggle), and found the model with the best score.

    • iterations = [1000, 2000, 3000] (number of boosting iterations)
    • loss_function = ['Logloss', 'MultiClass', 'MultiClassOneVsAll']
    • depth = [4, 6, 8] (maximum depth of the trees)
    • early_stopping_rounds = [10, 20, 50] (parameter for fit(); stops training if a metric on the validation data does not improve in the last early_stopping_rounds rounds)

  • I then train the model using those identified parameters. Results are visualised as well.
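
The custom GridSearch might look roughly like the following sketch (a hypothetical reconstruction, not the author's actual function; scoring is done on a held-out set here for simplicity):

```python
from itertools import product
from catboost import CatBoostClassifier

def custom_grid_search(X_train, y_train, X_val, y_val, param_grid):
    """Try every combination in param_grid; return the combination
    with the best accuracy, together with that score."""
    best_score, best_params = -1.0, None
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        stopping = params.pop("early_stopping_rounds")  # fit() parameter, not constructor
        model = CatBoostClassifier(verbose=False, **params)
        model.fit(X_train, y_train, eval_set=(X_val, y_val),
                  early_stopping_rounds=stopping)
        score = model.score(X_val, y_val)  # mean accuracy
        if score > best_score:
            best_score = score
            best_params = {**params, "early_stopping_rounds": stopping}
    return best_params, best_score

# The ranges explored in this round; note that 'Logloss' only applies to
# the binary class_2 problem, so the grid would be filtered per problem.
param_grid = {
    "iterations": [1000, 2000, 3000],
    "loss_function": ["Logloss", "MultiClass", "MultiClassOneVsAll"],
    "depth": [4, 6, 8],
    "early_stopping_rounds": [10, 20, 50],
}
```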

Licensing, Authors, Acknowledgements

Credit must be given to StackOverflow and the Bokeh documentation for all the general information. You can find the licensing for the data and other descriptive information on the Udacity website. View the license terms of this project in the LICENSE file.

Signed-off by @Author: Hardik B.
