Kaggle (Bosch, leaderboard 8 / 1373)

This repo contains a collection of Python and R scripts used for the Bosch Production Line Performance Kaggle competition. The code achieved 8th place on the leaderboard with an MCC score of 0.50888. More details about the end-to-end approach can be found here. The code does not run end-to-end and is sparsely documented, but parts of the feature engineering, visualization, and modeling code may still be of interest. Below is a short overview of our approach with links to the related notebooks. Feel free to ask for clarification.

Prerequisites

It requires the usual Python packages: Pandas, NumPy, SciPy, Seaborn and Matplotlib. For modeling, scikit-learn, XGBoost and LightGBM are also required. The datasets can be downloaded here. At least 16 GB of memory is required.

Preprocessing

After downloading the datasets from Kaggle, the following preprocessing steps are required:

  • Conversion of the data from dense to sparse format. Sparsity is around 80% on average, so this allows much faster feature generation with a smaller memory footprint (see the sketch after this list). Script
  • Creating a look-up table that helps merge data between the numeric, categorical and timestamp datasets. Step1, Step2
  • Clustering all samples based on the path they traversed through the manufacturing lines. Notebook to create paths here and clustering notebook here. Example result below.

(Figure: clustered paths)
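
As a rough illustration of the dense-to-sparse conversion, the sketch below reads the numeric data in chunks with pandas and stores it as a SciPy CSR matrix. The file names and chunk size are placeholders, not necessarily the exact ones used in the scripts, and mapping missing values to zero is one possible choice among several.

```python
# Minimal sketch of the dense-to-sparse conversion (assumed file names).
import pandas as pd
from scipy import sparse

chunks = []
for chunk in pd.read_csv('train_numeric.csv', index_col='Id', chunksize=100000):
    # Missing values become explicit zeros; with ~80% sparsity this
    # shrinks the memory footprint considerably. Note this conflates
    # "missing" with true zeros, which may or may not be acceptable.
    chunks.append(sparse.csr_matrix(chunk.fillna(0).values))

X = sparse.vstack(chunks)
sparse.save_npz('train_numeric_sparse.npz', X)  # reload with sparse.load_npz
```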

Feature engineering

  • Order:
  • Path:
  • Datetime:
  • Numeric:

Below is an image of the dashboards used to evaluate numeric features.

(Figure: numeric feature dashboard)

  • Categorical:

Modelling scripts

Level one models:

  • Data set 1 (0.477 gbm): order, raw numeric, date, categorical. GBM Model
  • Data set 2 (0.482 gbm, 0.477 xgb, 0.473 rf): order, path, raw numeric, date
  • Data set 3 (0.479 gbm, 0.473 xgb): order, path, numeric, date, refined categorical
  • Data set 4 (0.469 xgb, 0.442 rf): XGBoost model
  • Data set 5 (0.43 xgb): Faron's magic features, path, unsupervised nearest neighbors (see the sketch below)
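
For readers unfamiliar with them, the sketch below shows the common construction of Faron's magic features from the competition forums: consecutive Id differences after sorting parts by their start time. This is a generic reconstruction under assumed file and column names, not necessarily the exact variant used in this repo.

```python
# Hedged reconstruction of Faron's "magic" features (generic forum version).
import pandas as pd

df = pd.read_csv('train_date.csv')                    # assumed file name
df['start_time'] = df.drop(columns='Id').min(axis=1)  # first timestamp per part
df = df.sort_values(by=['start_time', 'Id'])

# Id differences between consecutive parts in start-time order pick up
# batches of parts that entered the production line together.
df['magic_1'] = df['Id'].diff().fillna(9999999)
df['magic_2'] = df['Id'].diff(-1).fillna(9999999)
magic = df.set_index('Id')[['magic_1', 'magic_2']]
```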

Level 2 stack models

Giving the weaker model the higher weight worked better:

  • XGBoost gbtree, 30% weight (~0.488 CV) Script
  • Random Forest, 70% weight (~0.485 CV) Script
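
A minimal sketch of such a weighted blend, including picking the classification threshold that maximizes MCC on out-of-fold predictions. Variable names and the scanned threshold range are placeholders, not the exact procedure from the scripts.

```python
# Sketch: 30/70 blend of two level-one models, thresholded on MCC.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def blend_and_threshold(p_xgb, p_rf, y_true):
    """p_xgb, p_rf: out-of-fold probabilities; y_true: 0/1 labels."""
    p_blend = 0.3 * p_xgb + 0.7 * p_rf
    # Failures are rare in this dataset, so only high percentiles of the
    # blended probability are interesting threshold candidates.
    thresholds = np.percentile(p_blend, np.arange(90, 100, 0.1))
    scores = [matthews_corrcoef(y_true, (p_blend > t).astype(int))
              for t in thresholds]
    best = int(np.argmax(scores))
    return p_blend, thresholds[best], scores[best]
```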

(Figure: typical training dashboard)

Hyperparameter optimization using HyperOpt

  • HyperOpt was used for Bayesian optimization of the hyperparameters. Script for the XGBoost model and for the Random Forest. Example output below.

(Figure: HyperOpt output)
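
A minimal sketch of the HyperOpt pattern used for this kind of tuning, with a generic XGBoost objective. The search space, CV metric and synthetic data are illustrative, not the exact settings from the scripts.

```python
# Sketch: Bayesian hyperparameter search with HyperOpt's TPE algorithm.
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

def objective(params):
    model = XGBClassifier(
        max_depth=int(params['max_depth']),
        learning_rate=params['learning_rate'],
        subsample=params['subsample'],
        n_estimators=200,
    )
    score = cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()
    # hyperopt minimizes, so return the negated CV score as the loss.
    return {'loss': -score, 'status': STATUS_OK}

space = {
    'max_depth': hp.quniform('max_depth', 3, 14, 1),
    'learning_rate': hp.loguniform('learning_rate', -4, -1),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
}

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```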

Analysis of submission files

  • This notebook was used to evaluate and compare the different level-one models.
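
As an illustration of that kind of comparison, the sketch below correlates the predictions of two submission files; the file names are placeholders, while Id and Response are the standard columns of a Bosch submission.

```python
# Sketch: compare two submission files by prediction agreement.
import pandas as pd

sub_a = pd.read_csv('submission_gbm.csv', index_col='Id')  # assumed names
sub_b = pd.read_csv('submission_xgb.csv', index_col='Id')

merged = sub_a.join(sub_b, lsuffix='_a', rsuffix='_b')
print('Pearson correlation:',
      merged['Response_a'].corr(merged['Response_b']))
print('Fraction identical:',
      (merged['Response_a'] == merged['Response_b']).mean())
```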
