This repo contains a collection of Python / R scripts used for the Bosch Production Line Performance Kaggle competition. This code was used to achieve 8th place on the leaderboard with an MCC score of 0.50888. More details about the end-to-end approach can be found here. The code does not run end-to-end and is sparsely documented, but parts of the feature engineering, visualization or modeling code may still be of interest. Below is a short overview of our approach with links to the related notebooks. Feel free to ask for clarification.
It requires the usual Python packages: Pandas, NumPy, Seaborn, SciPy and Matplotlib. For modeling, scikit-learn, XGBoost and LightGBM are required. The datasets can be downloaded here. At least 16 GB of memory is required.
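One way to keep within the 16 GB budget is to load the CSVs with explicit 32-bit dtypes instead of pandas' 64-bit defaults, roughly halving the in-memory footprint. A minimal sketch (the helper name and dtype choices are ours, not from the repo):

```python
import numpy as np
import pandas as pd

def load_numeric(path, nrows=None):
    """Load a Bosch-style numeric CSV with a reduced-memory dtype map."""
    # Peek at the header only, then build an explicit dtype per column.
    cols = pd.read_csv(path, nrows=0).columns
    dtypes = {c: np.float32 for c in cols}
    dtypes["Id"] = np.int64
    if "Response" in cols:
        dtypes["Response"] = np.int8  # binary label fits in one byte
    return pd.read_csv(path, dtype=dtypes, nrows=nrows)
```

Reading in chunks (`chunksize=...`) can reduce peak usage further if even float32 frames do not fit.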
After downloading the datasets from Kaggle, the following preprocessing steps are required:
- Conversion of the data from dense to sparse format. Sparsity is around 80% on average, so this allows much faster feature generation with a smaller memory footprint. Script
- Creating a look-up table that helps merge data between the numeric, categorical and timestamp datasets. Step1, Step2
- Clustering all samples based on the path they traversed through the manufacturing lines. Notebook to create paths here and clustering notebook here. Example result below.
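The dense-to-sparse conversion in the first step can be sketched with `scipy.sparse` (the helper below is an illustration, not the repo's script; it treats NaN as missing, which drops the ~80% empty cells from storage):

```python
import numpy as np
import pandas as pd
from scipy import sparse

def to_sparse(df):
    """Convert a dense numeric frame (NaN = missing) to CSR format.

    Caveat of this simple scheme: genuine zero measurements become
    implicit zeros too, indistinguishable from missing values.
    """
    dense = df.to_numpy(dtype=np.float32)
    mat = sparse.csr_matrix(np.nan_to_num(dense, nan=0.0))
    mat.eliminate_zeros()  # store only the cells that hold a value
    return mat
```

CSR keeps only the present values and their column indices, so row-wise feature generation touches far less memory than the dense frame.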
- Correlation with next/previous samples after sorting by tmax, tmin
- Correlation with next/previous samples after sorting by numerical values
- Same timestamps previous/next sample after sorting by tmax
- Previous / next response for duplicates
- Clustered samples by path and calculated, per sample, the absolute and relative difference between the numeric / time-difference (max - min) features and the cluster mean.
- Path-based error rate (leave-one-out method), but it did not improve the score
- Entry / exit station
- Previous and next station for S29 - S37 (one-hot encoded)
- Timestamp per station, per line, merged per path
- Max-min per station, per line, merged per path
- Kurtosis + kurtosis per line (surprisingly strong)
- Lead/lag response rate statistics after sorting by tmax, id
- Lead/lag response rate statistics after sorting by tmax, numeric, id
- Timestamp-based error rate (leave-one-out method), but it did not improve the score
- Timestamp label-based density (overfit)
- Used supervised decision trees, per station, to predict the numeric value from the date; the split points define thresholds that cluster the values by groups of dates (over 4000 features). Since adding all 4000 features did not help at all, XGBoost was used per station to shrink them to ~120 features.
- Numeric features after removing the time trend (did not improve score)
- Deviation from mean summed over station/line
- Raw numeric data (a specific selection of them)
Below is an image of the dashboards used to evaluate numeric features.
- A couple of one-hot encoded features added a small improvement
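Several of the order-based features above (correlation with previous/next samples, same-timestamp flags, lead/lag statistics) follow one pattern: sort by a key such as (tmax, Id), then compare each row with its neighbours via `shift`. A minimal sketch of that pattern, with illustrative column names:

```python
import pandas as pd

def lead_lag_features(df, sort_cols=("tmax", "Id")):
    """Neighbour features computed after sorting by e.g. (tmax, Id)."""
    out = df.sort_values(list(sort_cols)).copy()
    # Flag rows sharing a timestamp with the previous/next sample.
    out["same_tmax_prev"] = (out["tmax"] == out["tmax"].shift(1)).astype(int)
    out["same_tmax_next"] = (out["tmax"] == out["tmax"].shift(-1)).astype(int)
    # Id gap to the neighbouring samples in this ordering.
    out["id_diff_prev"] = out["Id"].diff().fillna(0)
    out["id_diff_next"] = out["Id"].diff(-1).fillna(0)
    # Restore the original row order so features align with the frame.
    return out.sort_index()
```

The same sort-then-shift idea extends to carrying over the previous/next Response for duplicates, just with a different sort key.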
- Data set 1 (0.477 gbm): order, raw numeric, date, categorical GBM Model
- Data set 2 (0.482 gbm, 0.477 xgb, 0.473 rf): order, path, raw numeric, date
- Data set 3 (0.479 gbm, 0.473 xgb): order, path, numeric, date, refined categorical
- Data set 4 (0.469 xgb, 0.442 rf): XGBoost model
- Data set 5 (0.43 xgb): has Faron's magic features, path, unsupervised nearest neighbors
Giving the weaker model a stronger weight was better:
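Blending the level-one predictions and then picking the MCC-maximising cutoff on validation data can be sketched as below; the function name, weights and threshold grid are illustrative, not the ones actually used:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def blend_and_threshold(preds, weights, y_true):
    """Weighted average of model probabilities, then pick the
    classification cutoff that maximises MCC on a validation set."""
    blend = np.average(np.column_stack(preds), axis=1, weights=weights)
    # Search cutoffs near the top of the score distribution, since
    # positives (failures) are rare in this competition.
    thresholds = np.quantile(blend, np.linspace(0.80, 0.999, 50))
    scores = [matthews_corrcoef(y_true, blend > t) for t in thresholds]
    best = int(np.argmax(scores))
    return blend, thresholds[best], scores[best]
```

Because MCC is very sensitive to the cutoff on such an imbalanced target, the threshold search matters nearly as much as the blend weights.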
- HyperOpt was used for Bayesian optimization of hyperparameters. Scripts for the XGBoost model and RandomForest. Example output below.
- This notebook was used to evaluate and compare the different level-one models.