# Bosch production line performance
This notebook and all dependent notebooks attempt to inform strategic decision making using the data collected on Bosch's manufacturing lines.

### Goals
Put myself in the mindset of a manufacturer and glean as much manufacturing insight as possible from this data set. Additionally, use hypothesis testing to systematically find and explain my best analysis.

### Challenges
This data set is almost entirely anonymized. Categoricals have no meaning to me, and the continuous data is only meaningful as measurements and timestamps.  
This set is HUGE. It is very unlikely that I will be able to use the entire set with my current resources (a home PC).

### References
[Dataset and description](https://www.kaggle.com/c/bosch-production-line-performance/)

## Business Case
This set contains a series of measurements, categories, and timestamps relating to a specific part traveling down Bosch's manufacturing lines. Ultimately this part will be classified as passing or failing a quality check.  
### Target
I'm looking to predict whether a part will pass or fail QC. According to Bosch's documentation, this is the feature ```Response```.

### Problem Type
Since it is either pass or fail, I'll be working on ```binary classification```

### Metric
To choose a metric, I need to understand how this data is distributed and the business consequences of my predictions. This will require some analysis.

In [7]:
#Okay, let's get started by taking a look at my distributions. I know from Bosch's
#documentation that our target is in the numeric dataset.
import pandas as pd


folder = 'bosch-production-line-performance/'

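#read_csv's squeeze argument was removed in pandas 2.0, so we squeeze the
#single-column DataFrame into a Series after reading instead.
response = pd.read_csv(folder + 'train_numeric.csv',
                       usecols = ['Response']).squeeze('columns')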

print(response.shape)
response.head()

(1183747,)


0    0
1    0
2    0
3    0
4    0
Name: Response, dtype: int64

In [8]:
#Okay, let's look at that distribution
response.value_counts()

0    1176868
1       6879
Name: Response, dtype: int64

In [10]:
percent = round(response.value_counts()[0] / response.shape[0] * 100)

print(f'{percent}% of our parts passed QC')

99.0% of our parts passed QC


#### Distribution
Okay, this set is extremely imbalanced, so accuracy is not going to be a good default metric. So let's knock out a confusion matrix and see if we can put some stakes to these predictions.
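To make the point about accuracy concrete, here is a minimal sketch using the class counts printed above. The "always pass" model is hypothetical and introduced only for illustration: it predicts that every part passes and never inspects a single feature.

```python
import math

# Class counts taken from response.value_counts() above.
n_pass, n_fail = 1176868, 6879

# Hypothetical "always pass" model: every part is predicted to pass (0).
# Treating a failure (1) as the positive class for the confusion counts:
tp, fp = 0, 0            # no part is ever predicted to fail
fn, tn = n_fail, n_pass  # every failure is missed, every pass is correct

accuracy = (tp + tn) / (n_pass + n_fail)

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
# The denominator is 0 when a model only ever predicts one class; by the
# usual convention the score is then reported as 0.
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

print(f'Accuracy: {accuracy:.4f}')
print(f'MCC: {mcc}')
```

A model that catches zero defective parts still scores above 99% accuracy here, while its MCC is 0, which is why accuracy alone says very little on this set.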

#### Confusion Matrix

| | Predicted Failed | Predicted Passed |
|-|------------------|------------------|
| Actually Failed | Sweet | Part goes to production and the product is made. It possibly fails in the field, causing loss of consumer faith, the high complexity and monetary cost of reverse logistics, and the total material cost of the other parts in the final defective product. |
| Actually Passed | Part gets binned for rework. Best case, it is identified as good during rework. Worst case, it is classified as wastage. I'd expect lower overall cost and impact. | Sweet |
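The same layout can be produced from predictions with scikit-learn. This is a small sketch on made-up toy labels, not on the Bosch data; the variable names for the four cells are my own and only mirror the table above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 0 = passed QC, 1 = failed QC.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

# labels=[1, 0] puts "failed" first, so rows and columns follow the
# table above: rows are actual classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=[1, 0])

caught_failures = matrix[0, 0]   # actually failed, predicted failed
missed_failures = matrix[0, 1]   # actually failed, predicted passed (costly)
needless_rework = matrix[1, 0]   # actually passed, predicted failed
correct_passes = matrix[1, 1]    # actually passed, predicted passed

print(matrix)
```

Passing the label order explicitly avoids having to remember that the default ordering would put the passing class (0) in the first row and column.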

### Metric: Precision
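Treating a passed part (```Response``` = 0) as the positive class, a false positive is a defective part that was predicted to pass. It seems to me the cost of letting those false positives slip through the line is much higher, so we need to prioritize precision. I will also be tracking MCC (Matthews correlation coefficient) as a secondary metric, since it will illuminate my model's overall performance. Additionally, it is the scoring method of the original competition this data came from.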

In [12]:
from sklearn.metrics import matthews_corrcoef, precision_score

def metrics(y_true, y_pred):
    #We are manually setting our positive label to 0, because our data set
    #encodes a passing part as 0.
    #We are also setting zero division to 0, because a data set this imbalanced
    #is very likely to generate warnings otherwise.
    print('Our Precision is: ', precision_score(y_true, y_pred, pos_label = 0, 
                                               zero_division = 0))
    print('Our MCC is: ', matthews_corrcoef(y_true, y_pred))
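As a quick sanity check on what these two numbers mean, here is a small worked sketch on made-up toy labels (not Bosch data). It calls the same two scikit-learn functions with the same arguments as the function above.

```python
from sklearn.metrics import matthews_corrcoef, precision_score

# Toy labels, made up for illustration: 0 = passed QC, 1 = failed QC.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

# With pos_label=0, precision answers: of the parts predicted to pass,
# what share really passed? Here 8 parts are predicted to pass and 7 did.
precision = precision_score(y_true, y_pred, pos_label=0, zero_division=0)

# MCC uses all four cells of the confusion matrix, so it also reflects
# the good part that was needlessly flagged.
mcc = matthews_corrcoef(y_true, y_pred)

print('Precision:', precision)
print('MCC:', mcc)
```

Precision comes out at 7/8 because one defective part slipped through as a predicted pass, and MCC is noticeably lower because it also accounts for the rarer failing class.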

## Hypothesis tests
I will update this grid in place as I complete each test. You may need to reference other notebooks to see more in-depth supporting work. These hypotheses were guided by the following principles:  
* __Dimension Reduction__ - I only have so much RAM available, and I need as many observations as possible because the set is highly imbalanced  
* __Numeric features are the most important__ - The numeric set is fuller, and was the most predictive in initial exploration  
* __Must be explainable__ - as few "black boxes" as possible. When used, at least use them on a class of data to help maintain storytelling

| |Hypothesis|Action|Results|Insight|Reference|
|-|----------|------|-------|-------|---------|
|0|A model that beats a guess is a good starting baseline|Create a ```uniform``` baseline measured by ```precision```| 0.0056 |A guess is a bad way to run QC|This notebook|
|1|A massive data set made of the combination of numeric, categorical, and timestamp sets would allow the most information for making predictions|Merge datasets and run a few simple models|0.0|Memory footprint was too large, for too few observations. Reverting to split data sets|wrangle_1_megaset.ipynb|
|2.0|A ```logistic regression``` model run on the train_date.csv set will beat baseline and can be stacked as a metafeature|Clean train_date.csv, fit basic model, eval|0.0|Was unpredictive, try to improve|wrangle_2_metafeature_pca_date.ipynb|
|2.1|The logistic regression model could be improved by focusing on high performing features and removing low performing features|performed ```permutation importance```|0.0|No single feature was improved or diminished by permutation importance, model wasn't improved, try a different linear model| "|
|2.2|```Boosting``` with linear learners would beat baseline|fit basic XGBoost model, eval|0.0|unpredictive. This data set may not be predictive enough on its own. Changing approach|"|
|2.3|Set may be relationally important to train_numeric.csv set|use PCA to reduce train_date.csv|na|Haven't tested yet| this notebook|
|3.0|A ```Random Forest Classifier``` run on the categorical data set would beat baseline|cleaned train_categorical.csv, fit basic RFC model| .33 | First major success, attempt to improve|wrangle_3_explore_categoricals.ipynb|
|3.1|Using ```boosting``` with a random forest will improve model|fit a basic XGBRFClassifier model| 0.0 |boosting reduced predictions significantly, reverted to previous model| " |
|3.2| Random permutation could help me identify under or over performing features| performed ```permutation importance``` on the forest model| 0.33| Very low impact on features, reverted to base model| " |
|3.3| The RFC model's feature importance could help me identify the highest performing features| fit a new RFC model with only features of > 0 importance | .20| Performance went down. Dropped well over 1k features. Likely thousands of minusculely low impact features are still better than a couple hundred mediocre features and one high impact feature|"|
|3.4| ```Stacking``` the results and probabilities from RFC model with my numeric data set will improve performance|create metafeature with predictions and probabilities| na| haven't tested yet| This notebook|
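
The uniform baseline in row 0 can be sketched with scikit-learn's `DummyClassifier`. This is a minimal sketch on synthetic labels, not the code that produced the figure in the grid: the failure share (0.58%), the sample size, and the placeholder feature column are all assumptions chosen only to mimic the imbalance of the real target.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score

# Synthetic labels, not the Bosch data: 0.58% failures (1), the rest passes (0).
n_parts, n_failures = 200_000, 1_160
y = np.zeros(n_parts, dtype=int)
y[:n_failures] = 1

# The uniform strategy ignores the features, so a placeholder column is enough.
X = np.zeros((n_parts, 1))

# strategy='uniform' guesses each class with equal probability.
baseline = DummyClassifier(strategy='uniform', random_state=0)
y_pred = baseline.fit(X, y).predict(X)

# Precision depends heavily on which class is treated as positive.
precision_pass = precision_score(y, y_pred, pos_label=0, zero_division=0)
precision_fail = precision_score(y, y_pred, pos_label=1, zero_division=0)

print(f'Precision, pass as positive: {precision_pass:.4f}')
print(f'Precision, fail as positive: {precision_fail:.4f}')
```

A coin-flip guess scores close to the share of whichever class is treated as positive, so on labels this imbalanced it lands near 0.994 for passes and near 0.006 for failures. It is worth confirming which `pos_label` a reported precision used before comparing a model against the baseline.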
