The Starbucks capstone project belongs to the Udacity Data Science Nanodegree program.
The main motivation for this project is to build a consistent, valid model for creating offers of interest to the customers of any company, in this case Starbucks.
The project comprises the following files. The analysis, cleaning, and exploration of the data, and the construction of models based on the resulting dataset, are split across two notebooks that should be executed in this order:
1. Starbucks_Capstone_notebook1.ipynb
2. Starbucks_Capstone_Challenge_Building models.ipynb
On the other hand, the initial datasets used for this project are the following:
- `data/portfolio.json`: offer ids and metadata about each offer (duration, type, etc.)
- `data/transcript.json`: records of transactions, offers received, offers viewed, and offers completed
- `data/profile.json`: demographic data for each customer
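The three raw files are line-delimited JSON, so they can be loaded with `pandas.read_json`. A minimal sketch using an in-memory sample record shaped like `data/portfolio.json` (the field names here are illustrative assumptions, not the project's exact schema):

```python
import io

import pandas as pd

# In-memory stand-in for one line of data/portfolio.json (field names assumed)
sample = io.StringIO(
    '{"id": "abc123", "offer_type": "bogo", "difficulty": 10, "duration": 7}\n'
)

# lines=True tells pandas each line is a separate JSON record
portfolio = pd.read_json(sample, orient="records", lines=True)
```

The real files would be read the same way, e.g. `pd.read_json("data/portfolio.json", orient="records", lines=True)`.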
As a result of the data wrangling and feature engineering operations, two new datasets are created:
- `data/portfolio_cleaned.csv`: processed information from the initial portfolio dataset
- `data/combined_data.json`: dataset prepared and built in the first notebook for building the classification models
- Numpy: the fundamental package for scientific computing with Python
- scipy: a Python-based ecosystem of open-source software for mathematics, science, and engineering
- progressbar: provides visual (yet text-based) progress for long-running operations
To find the best possible model, three different supervised classification algorithms have been used:
- Logistic Regression
- Gradient Boosting
- Random Forest
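The three algorithms can be compared with scikit-learn's standard estimators. A minimal sketch on synthetic data (the estimator settings and the `make_classification` stand-in are assumptions, not the project's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined offer/customer dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model and record its mean accuracy on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
```

In the notebooks the same comparison is done on the engineered dataset, with the metrics reported below.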
Resulting metrics:

| Metric | Log. Regression | Gradient Boosting | Random Forest |
|---|---|---|---|
| accuracy | 0.698 | 0.726 | 0.734 |
| f1 score | 0.694 | 0.725 | 0.729 |
| precision | 0.667 | 0.691 | 0.707 |
| recall | 0.725 | 0.763 | 0.749 |
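All four metrics come straight from `sklearn.metrics`. A sketch with dummy labels and predictions (illustration only, not project data):

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Dummy ground truth and predictions for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),   # fraction of correct predictions
    "f1 score": f1_score(y_true, y_pred),         # harmonic mean of precision/recall
    "precision": precision_score(y_true, y_pred), # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),       # TP / (TP + FN)
}
```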
These are the main resources (websites, books, and papers) that helped me find solutions and develop this project.
- Hands-on ML with Scikit-Learn, Keras and Tensorflow by Aurélien Géron
- Universitat Oberta de Catalunya - Data Science Degree
- kdnuggets.com: Choosing the Right Metric for Evaluating Machine Learning Models
- kdnuggets.com: More Performance Evaluation Metrics for Classification Problems You Should Know
- IArtificial.net: Precision, Recall, F1, Accuracy en clasificación
- IArtificial.net: Ensemble methods
- Google Developers: AUC-ROC curve
- RandomizedSearchCV
- Progress Bar
- Markdown: Create tables
You can find more information in my technical article published in The Startup on Medium.