This repo contains a collection of Python / R scripts used for the Bosch Production Line Performance Kaggle competition. This code was used to achieve 8th place on the leaderboard with an MCC score of 0.50888. More details about the end-to-end approach can be found here. The code does not run end-to-end and is sparsely documented, but parts of the feature engineering, visualization or modeling code may still be of interest. Below is a short overview of our approach with links to the related notebooks. Feel free to ask for clarification.
It requires the usual Python packages: Pandas, NumPy, Seaborn, SciPy and Matplotlib. For modeling, scikit-learn, XGBoost and LightGBM are required. The datasets can be downloaded here. At least 16 GB of memory is required.
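One way to keep within the 16 GB budget is to load the CSVs with explicit 32-bit dtypes instead of pandas' 64-bit defaults, roughly halving the in-memory footprint. A minimal sketch (the helper name and dtype choices are ours, not from the repo):

```python
import numpy as np
import pandas as pd

def load_numeric(path, nrows=None):
    """Load a Bosch-style numeric CSV with a reduced-memory dtype map."""
    # Peek at the header only, then build an explicit dtype per column.
    cols = pd.read_csv(path, nrows=0).columns
    dtypes = {c: np.float32 for c in cols}
    dtypes["Id"] = np.int64
    if "Response" in cols:
        dtypes["Response"] = np.int8  # binary label fits in one byte
    return pd.read_csv(path, dtype=dtypes, nrows=nrows)
```

Reading in chunks (`chunksize=...`) can reduce peak usage further if even float32 frames do not fit.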
After downloading the datasets from Kaggle, the following preprocessing steps are required:
- Conversion of the data from dense to sparse format. Sparsity is around 80% on average, so this allows much faster feature generation with a smaller memory footprint. Script
- Creating a look-up table that helps merge data between the numeric, categorical and timestamp datasets. Step1, Step2
- Clustering all samples based on the path they traversed through the manufacturing lines. Notebook to create paths here and clustering notebook here. Example result below.
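The dense-to-sparse conversion in the first step can be sketched with `scipy.sparse` (the helper below is an illustration, not the repo's script; it treats NaN as missing, which drops the ~80% empty cells from storage):

```python
import numpy as np
import pandas as pd
from scipy import sparse

def to_sparse(df):
    """Convert a dense numeric frame (NaN = missing) to CSR format.

    Caveat of this simple scheme: genuine zero measurements become
    implicit zeros too, indistinguishable from missing values.
    """
    dense = df.to_numpy(dtype=np.float32)
    mat = sparse.csr_matrix(np.nan_to_num(dense, nan=0.0))
    mat.eliminate_zeros()  # store only the cells that hold a value
    return mat
```

CSR keeps only the present values and their column indices, so row-wise feature generation touches far less memory than the dense frame.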
- Correlation with next/previous samples after sorting by tmax, tmin
- Correlation with next/previous samples after sorting by numerical values
- Same timestamps previous/next sample after sorting by tmax
- Previous / next response for duplicates
- Clustered samples by path and calculated, per sample, the absolute and relative difference between the numeric / time-difference (max - min) features and the cluster mean.
- Path-based error rate (leave-one-out method), but it did not improve the score
- Entry / exit station
- Previous and next station for S29 - S37 (one-hot encoded)
- Timestamp per station, per line, merged per path
- Max-min per station, per line, merged per path
- Kurtosis + kurtosis per line (surprisingly strong)
- Lead/lag response rate statistics after sorting by tmax, id
- Lead/lag response rate statistics after sorting by tmax, numeric, id
- Timestamp-based error rate (leave-one-out method), but it did not improve the score
- Timestamp label-based density (overfit)
- Used supervised decision trees, per station, to predict the numeric value from the date; the split points define thresholds that cluster the values by groups of dates (over 4000 features). Since adding all 4000 features did not help at all, XGBoost was used per station to shrink them to ~120 features.
- Numeric features after removing the time trend (did not improve score)
- Deviation from mean summed over station/line
- Raw numeric data (a specific selection of them)
Below is an image of the dashboards used to evaluate numeric features.
- A couple of one-hot encoded features added a small improvement
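Several of the order-based features above (correlation with previous/next samples, same-timestamp flags, lead/lag statistics) follow one pattern: sort by a key such as (tmax, Id), then compare each row with its neighbours via `shift`. A minimal sketch of that pattern, with illustrative column names:

```python
import pandas as pd

def lead_lag_features(df, sort_cols=("tmax", "Id")):
    """Neighbour features computed after sorting by e.g. (tmax, Id)."""
    out = df.sort_values(list(sort_cols)).copy()
    # Flag rows sharing a timestamp with the previous/next sample.
    out["same_tmax_prev"] = (out["tmax"] == out["tmax"].shift(1)).astype(int)
    out["same_tmax_next"] = (out["tmax"] == out["tmax"].shift(-1)).astype(int)
    # Id gap to the neighbouring samples in this ordering.
    out["id_diff_prev"] = out["Id"].diff().fillna(0)
    out["id_diff_next"] = out["Id"].diff(-1).fillna(0)
    # Restore the original row order so features align with the frame.
    return out.sort_index()
```

The same sort-then-shift idea extends to carrying over the previous/next Response for duplicates, just with a different sort key.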
- Data set 1 (0.477 gbm): order, raw numeric, date, categorical GBM Model
- Data set 2 (0.482 gbm, 0.477 xgb, 0.473 rf): order, path, raw numeric, date
- Data set 3 (0.479 gbm, 0.473 xgb): order, path, numeric, date, refined categorical
- Data set 4 (0.469 xgb, 0.442 rf): XGBoost model
- Data set 5 (0.43 xgb): has Faron's magic features, path, unsupervised nearest neighbors
Giving the weaker model a stronger weight was better:
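Blending the level-one predictions and then picking the MCC-maximising cutoff on validation data can be sketched as below; the function name, weights and threshold grid are illustrative, not the ones actually used:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def blend_and_threshold(preds, weights, y_true):
    """Weighted average of model probabilities, then pick the
    classification cutoff that maximises MCC on a validation set."""
    blend = np.average(np.column_stack(preds), axis=1, weights=weights)
    # Search cutoffs near the top of the score distribution, since
    # positives (failures) are rare in this competition.
    thresholds = np.quantile(blend, np.linspace(0.80, 0.999, 50))
    scores = [matthews_corrcoef(y_true, blend > t) for t in thresholds]
    best = int(np.argmax(scores))
    return blend, thresholds[best], scores[best]
```

Because MCC is very sensitive to the cutoff on such an imbalanced target, the threshold search matters nearly as much as the blend weights.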
- HyperOpt was used for Bayesian optimization of hyperparameters. Scripts for the XGBoost model and RandomForest. Example output below.
- This notebook was used to evaluate and compare the different level-one models.