Dataset Hands-On Challenge
Marcello's working notebook on the Flairbit Challenge presented at the February 7, 2019 event:
Dataset details and download instructions are at the link above, in the Flairbit section of the event.
- Data related to professional coffee machines
- Dataset categories:
- One file per category per day: `type_YYYYMMDDHHMMSS-an.csv` (e.g., `faults_20190103020001-an.csv`)
- Common dataset features
- Machine serial number
- Machine model
- Timestamp (`YYYY MM DD hh:mm:ss` and week number)
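Since every file name encodes its category and export timestamp, the daily CSVs can be indexed without opening them. A minimal sketch, assuming the `type_YYYYMMDDHHMMSS-an.csv` convention above (the helper name is my own):

```python
import re
from datetime import datetime

# File names follow the type_YYYYMMDDHHMMSS-an.csv convention described above.
FILENAME_RE = re.compile(r"^(?P<category>\w+)_(?P<ts>\d{14})-an\.csv$")

def parse_dataset_filename(name):
    """Extract the category and export timestamp from a dataset file name."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    ts = datetime.strptime(m.group("ts"), "%Y%m%d%H%M%S")
    return m.group("category"), ts

category, ts = parse_dataset_filename("faults_20190103020001-an.csv")
print(category, ts, ts.isocalendar()[1])  # category, timestamp, ISO week number
```

This also gives you the week number mentioned above for free via `datetime.isocalendar()`.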
Warm-up: let's query the CSV files
- How many connected machines?
- Counter types
- Fault distribution per model
- Distribution of cleaning misses per model
- Predict fault occurrences based on counter patterns and cleaning misses
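The first warm-up queries are one-liners once the CSVs are loaded into pandas. A sketch on toy rows, since the dataset itself is not in the repo; the column names (`serial`, `model`, `fault_code`) are assumptions, so check them against the real headers:

```python
import pandas as pd

# Toy rows standing in for a real faults CSV; column names are assumed.
faults = pd.DataFrame({
    "serial": ["A1", "A2", "A1", "B9"],
    "model":  ["espresso-x", "espresso-x", "espresso-x", "barista-2"],
    "fault_code": [3, 3, 7, 3],
})

n_machines = faults["serial"].nunique()     # how many connected machines?
per_model = faults.groupby("model").size()  # fault distribution per model
print(n_machines)
print(per_model)
```

The same `groupby` pattern answers the cleaning-miss distribution once the cleanings CSVs are concatenated in.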
Root cause analysis
- Find correlations between machine usage (counters and cleaning misses) and faults
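A first pass at this root-cause question is to aggregate usage per machine and correlate it with fault counts. A hedged sketch on made-up aggregates (the column names are illustrative, not the real schema):

```python
import pandas as pd

# Hypothetical per-machine aggregates built from the counters,
# cleanings, and faults CSVs.
usage = pd.DataFrame({
    "serial": ["A1", "A2", "B9", "C3"],
    "coffees_made": [900, 300, 1200, 150],
    "missed_cleanings": [5, 1, 8, 0],
    "faults": [4, 1, 6, 0],
})

# Pearson correlation of each usage indicator against the fault count
corr = usage[["coffees_made", "missed_cleanings", "faults"]].corr()["faults"]
print(corr)
```

Correlation is only a screening tool here; a strong value flags a candidate driver to investigate, not a confirmed cause.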
The code in the repo is organized as three Jupyter notebooks:
- Exploration.ipynb : the first file to look at; it contains the data exploration of the dataset. (Please note that the dataset itself is not in the repo! You have to look into the dataset presentation here and find the slide with the link to a Dropbox folder. The reason for this is to spread knowledge of the DataScienceSeed community and events.) This notebook also generates the intermediate dataset files (the .csv files in the repo) used for the forecast.
- LGBM Tabular 03_01_Fault_Clean_Count.ipynb : failure forecast based on the LightGBM algorithm. It also includes some data rearranging to play with different time windows on both the feature side and the target side. The results are fairly good: the model predicts a machine failure in the next 5 days with an F1 score of 85%. There is also some feature interpretation based on SHAP.
- Fastai Tabular Paperspace.ipynb : failure forecast based on deep learning, using the Fast.ai tabular_learner. I attempted to apply what is described in Lesson 4 of the Fast.ai MOOC (2018 edition). The results are worse than LightGBM's, but this is just a naive first attempt!
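The key data-rearranging step behind the 5-day forecast is building a forward-looking target: for each machine-day, does any fault occur within the next `HORIZON` days? A minimal sketch of that labelling in pandas, on toy data with an assumed schema (the notebook's actual column names may differ):

```python
import pandas as pd

HORIZON = 5  # predict a failure within the next 5 days

# Toy daily fault flags for one machine; a stand-in for the intermediate
# CSVs generated by Exploration.ipynb (the schema here is assumed).
df = pd.DataFrame({
    "serial": ["A1"] * 8,
    "day": pd.date_range("2019-01-01", periods=8, freq="D"),
    "fault": [0, 0, 0, 1, 0, 0, 0, 0],
})
df = df.sort_values(["serial", "day"])

# Target: 1 if any fault occurs in the HORIZON days after the current day.
# shift(-1) excludes the current day; the double reversal makes the
# backward-looking rolling window look forward instead.
df["target"] = df.groupby("serial")["fault"].transform(
    lambda s: s.shift(-1).iloc[::-1].rolling(HORIZON, min_periods=1).max().iloc[::-1]
)
print(df[["day", "fault", "target"]])
```

Rows whose forward window has no observations come out as NaN and should be dropped before training; changing `HORIZON` (and adding backward rolling aggregates of the counters as features) is how different time windows can be explored on both sides.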