# Entry 10 - Reorder Pre-processing and Make Predictions

Using the pre-processing determined in entries 6-8, predict mass while only changing the surface pressure.

## The Problem

<font color='red'>Entry 9</font> resulted in a trained model. However, I ran into several problems that need to be addressed before a prediction can be made. These problems are:

- Categorical and scaling parameters weren't retained, so they couldn't be applied to the test data
- Target value was scaled using standardization, rendering the predictions uninterpretable

## The Options

Retaining the pre-processing transformations is the easy one as there's basically one option: retain the information so you can apply the same transformations later. This is easily accomplished with the preprocessing module of scikit-learn. The fitted transformers just have to be returned as part of the pre-processing function.
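A minimal sketch of that idea, assuming a `StandardScaler` is the scaler in use. The function name `scale_numeric` and the column names are placeholders for illustration, not the ones from the earlier entries.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_numeric(train_df, numeric_cols):
    """Fit a scaler on the training data; return the scaled copy AND the scaler."""
    scaler = StandardScaler()
    scaled = train_df.copy()
    scaled[numeric_cols] = scaler.fit_transform(train_df[numeric_cols])
    return scaled, scaler


# Toy data with hypothetical column names
numeric_cols = ["pressure", "temp"]
train = pd.DataFrame({"pressure": [1.0, 2.0, 3.0], "temp": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"pressure": [2.0, 4.0], "temp": [20.0, 40.0]})

train_scaled, scaler = scale_numeric(train, numeric_cols)

# Reuse the fitted scaler on the test data: transform only, never re-fit
test_scaled = test.copy()
test_scaled[numeric_cols] = scaler.transform(test[numeric_cols])
```

Because the scaler is returned, the test data is scaled with the training mean and standard deviation rather than its own.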

Addressing the second point requires more thought. The target value being scaled is making me reconsider the order in which the pre-processing occurs. I could just separate the target value and features at the scaling step, but I think I should address several other concerns at this point so the process to date is easier to transfer to other datasets.

## The Proposed Solution

Based on the issues encountered, I propose updating my pre-processing process to the following steps:

### Split dataset

The very first step in the standardized process after loading the data should be to split it into train, test, and reserve datasets.

- **Train**: where pre-processing is run, features are generated, and the model is trained
- **Test**: where the trained model(s) is/are tested on data never seen before and hyperparameters are evaluated
- **Reserve**: where the final model is assessed
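One way to get the three datasets is to call scikit-learn's `train_test_split` twice. This is only a sketch: the 60/20/20 proportions, the `random_state`, and the toy DataFrame are assumptions I picked for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset with 100 rows
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2.0})

# First carve off the reserve set (20% of all rows) ...
working, reserve = train_test_split(df, test_size=0.2, random_state=42)

# ... then split the remainder into train and test (25% of 80 rows = 20 rows)
train, test = train_test_split(working, test_size=0.25, random_state=42)

print(len(train), len(test), len(reserve))
```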

### Separate target value and features

The target value doesn't need to be pre-processed (except maybe for missing values, but as I'm only dealing with supervised learning at this stage, there shouldn't be any missing values in the target). The easiest way to ensure this is to split off the target from the rest of the data.
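Splitting off the target could look something like the sketch below. The helper name `split_target` and the column names (`mass`, `surface_pressure`, `body_type`) are hypothetical stand-ins for whatever the real dataset uses.

```python
import pandas as pd


def split_target(df, target_col):
    """Return (features, target) without modifying the original DataFrame."""
    features = df.drop(columns=[target_col])
    target = df[target_col]
    return features, target


# Toy training data with hypothetical column names
train = pd.DataFrame({
    "surface_pressure": [1.0, 2.0, 3.0],
    "body_type": ["rock", "gas", "rock"],
    "mass": [5.0, 50.0, 7.0],
})

X_train, y_train = split_target(train, "mass")
```

From here on, only `X_train` goes through pre-processing, so the target stays in its original units.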

### Determine collinearity

Correlation doesn't care about scaling (I checked - the values I got when running correlation on the unscaled features matched the values Sabber and I got when running it on standardized values). To speed up pre-processing, I'm going to remove collinear features before applying transformations.
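The check and the removal step can be sketched as follows. The synthetic data, the 0.95 threshold, and the rule of dropping the later column in each highly correlated pair are all assumptions for illustration, not necessarily what the earlier entries used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features: "b" is nearly a multiple of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": 3.0 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

# Pearson correlation is unchanged by standardization
corr_raw = X.corr()
X_std = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
corr_std = X_std.corr()
same = np.allclose(corr_raw, corr_std)

# Keep only the upper triangle so each pair is considered once,
# then drop the later column of any pair above the threshold
threshold = 0.95
mask = np.triu(np.ones(corr_raw.shape, dtype=bool), k=1)
upper = corr_raw.abs().where(mask)
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_reduced = X.drop(columns=to_drop)
```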

Correlation only works on numeric values. I'm going to explore determining collinearity of categorical-categorical and categorical-numeric features in a <font color='red'>future entry</font>. For now, I'm going to run collinearity on just the numeric features (due to the transformation issues listed in the 'Apply transformations' section).

### Apply transformations

This is where I encode categorical features and scale numeric features. To ensure the categorical features don't accidentally get scaled (which they did in notebook entries 8 and 9), scaling should happen first on just the numeric features, then the categorical features can be encoded. Per [this post](https://stats.stackexchange.com/questions/169350/centering-and-scaling-dummy-variables) and [this article](https://en.wikipedia.org/wiki/Categorical_variable), categorical features should never be scaled.
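One way to guarantee that scaling only ever touches numeric columns is scikit-learn's `ColumnTransformer`, which routes each transformer to a named list of columns. Note this is a swapped-in alternative to doing the two steps by hand in sequence; the column names and the choice of `OneHotEncoder` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features with hypothetical column names
X_train = pd.DataFrame({
    "surface_pressure": [1.0, 2.0, 3.0, 4.0],
    "body_type": ["rock", "gas", "rock", "ice"],
})
X_test = pd.DataFrame({"surface_pressure": [2.5], "body_type": ["gas"]})

preprocessor = ColumnTransformer(
    transformers=[
        # Numeric columns are standardized
        ("num", StandardScaler(), ["surface_pressure"]),
        # Categorical columns are one-hot encoded and never scaled
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["body_type"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on train only, then apply the same fitted transformations to test
Xt_train = preprocessor.fit_transform(X_train)
Xt_test = preprocessor.transform(X_test)
```

The fitted `preprocessor` object holds both the scaling parameters and the category mappings, so keeping it around also covers the retention problem.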

This brings up an interesting point of whether encoded categorical features should be scaled. I think I remember one of the books or tutorials recommending scaling when the values differed by a factor of 10 or more. So, as long as there are fewer than 10 (100?) categories in any one feature, leaving them unscaled shouldn't affect the training. I intend to explore categorical encoding in more detail in a <font color='red'>future entry</font>, so I'll leave categorical features unscaled for now and dive deeper into the issue then.

### Make predictions

Once the above steps have been implemented in order, I should then be able to make predictions.
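As a rough sketch of what that prediction step could look like when only the surface pressure changes: the example below uses a `LinearRegression` inside a `Pipeline` on made-up data. The model choice, the column names, and the numbers are all assumptions for illustration; they are not the model from Entry 9.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: mass is an exact linear function of the two features
X_train = pd.DataFrame({
    "surface_pressure": [1.0, 2.0, 3.0, 4.0, 5.0],
    "radius": [1.0, 1.0, 2.0, 2.0, 3.0],
})
y_train = 10.0 + 2.0 * X_train["surface_pressure"] + 3.0 * X_train["radius"]

# Features are scaled inside the pipeline; the target is left in original units
model = Pipeline([
    ("scale", StandardScaler()),
    ("reg", LinearRegression()),
])
model.fit(X_train, y_train)

# Vary only surface pressure; hold every other feature at a fixed value
grid = pd.DataFrame({
    "surface_pressure": np.linspace(1.0, 5.0, 5),
    "radius": X_train["radius"].iloc[0],
})
preds = model.predict(grid)
```

Since the target was never scaled, `preds` comes out directly in mass units and needs no inverse transform.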

## The Fail

