## Algorithm Understanding
__Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable. What algorithms can be used to automatically select the most important features (regression, etc.)? Describe at least 3.__
 
For clarity, this answer is with respect to feature selection, not feature engineering.
Two classic methods are forward selection and backward selection.  

In backward selection, you start with the “kitchen sink” model and remove a single feature each iteration, according to some selection criterion. The “kitchen sink” model means that the original model is highly parameterized. In the extreme, it contains all features (and their interactions for models like multiple regression). The removal stops when no features meet the removal criterion at that iteration.  

In forward selection, you start with the “null” model (a zero-feature model, e.g., one that predicts a constant equal to the mean) and add a single feature each iteration, according to some selection criterion. There may be other rules too, for example, that interaction terms may only be added once the constituent variables are all included.  
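The forward procedure described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `forward_select`, `toy_score`, and the `usefulness` table are hypothetical names invented for this example, and the scoring function is a stand-in for whatever selection criterion you actually use (higher is assumed to be better).

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 1e-9,
) -> List[str]:
    """Greedy forward selection: add the single best feature per iteration.

    `score` maps a feature subset to a number where higher is better
    (for example, a penalized fit statistic or a cross-validated metric).
    """
    selected: List[str] = []
    best = score(frozenset())  # the "null" model with zero features
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current model.
        gains = {f: score(frozenset(selected + [f])) - best for f in remaining}
        winner = max(remaining, key=lambda f: gains[f])
        if gains[winner] < min_gain:
            break  # no candidate meets the addition criterion
        selected.append(winner)
        remaining.remove(winner)
        best += gains[winner]
    return selected


# Hypothetical criterion: each feature has a fixed "usefulness", minus a
# per-feature complexity penalty.
usefulness = {"age": 5.0, "income": 3.0, "zip_code": 0.2, "noise": 0.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(usefulness[f] for f in subset) - 0.5 * len(subset)


chosen = forward_select(list(usefulness), toy_score)
print(chosen)
```

Backward selection is the mirror image: start from the full feature set and, at each step, drop the feature whose removal improves (or least harms) the score.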
 
These are both greedy algorithms, in that they optimize only with regard to the very next step. So they can both get caught (deterministically) in local optima. This also means that forward and backward selection rarely meet each other in the middle. Backward selection will tend to end with a large model with higher performance according to the selection criterion or objective function. Forward selection will tend to stop at a performance plateau that’s achievable by sparse models.  

As a final quick note, the selection criterion might be completely unrelated to the model’s held-out (test or validation) predictive performance metrics. Typical criteria that don’t directly measure held-out predictive performance tend to be statistical in nature (Akaike information criterion, R-squared, correlation, mutual information). Regularization-based approaches like Lasso (an L1 penalty) or Ridge (an L2 penalty) tend to be more geared toward predictive performance, since the penalty strength is usually tuned against held-out data.  

Just to keep it interesting, a third (fourth, and fifth) family of methods for feature selection comprises genetic algorithms, simulated annealing, and tree-based methods. The first two are interesting because they have built-in mechanisms to break out of local optima.  

For genetic algorithms, picture a vector of binary values (0 / 1), with one value for each feature. If a feature is included in the model then the vector has a 1 for that feature. If not, then 0. The genetic algorithm attempts to optimize the model’s performance (according to an objective function like RMSE, binary cross-entropy, etc.) by having a population of models “interact” via “genetic operations”. The population of models can “mutate” (binary values flip), “crossover” (combine/mix with other models’ vectors), etc. In this way, new model populations can be proposed that are not in the direct neighborhood of the current best proposal. A subset of each generation moves on to the next generation. This can get pretty computationally expensive.  
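As a rough sketch of the idea, the snippet below evolves binary feature masks using mutation and single-point crossover. Everything here is an assumption made for illustration: the function names, the population settings, and especially `toy_fitness`, which stands in for "train a model on the selected features and return its validation score".

```python
import random
from typing import Callable, List

Mask = List[int]  # one 0/1 entry per feature


def mutate(mask: Mask, rate: float, rng: random.Random) -> Mask:
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in mask]


def crossover(a: Mask, b: Mask, rng: random.Random) -> Mask:
    """Single-point crossover: prefix from one parent, suffix from the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def genetic_select(
    n_features: int,
    fitness: Callable[[Mask], float],
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.1,
    seed: int = 0,
) -> Mask:
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # The fitter half survives and moves on to the next generation.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = rng.sample(survivors, 2)
            children.append(mutate(crossover(mom, dad, rng), mutation_rate, rng))
        population = survivors + children
    return max(population, key=fitness)


# Hypothetical fitness: reward the first three ("useful") features and
# penalize the remaining ones.
def toy_fitness(mask: Mask) -> float:
    return sum(mask[:3]) - 0.5 * sum(mask[3:])


best_mask = genetic_select(n_features=8, fitness=toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

In a real setting each fitness call means fitting a model, which is where the computational expense comes from: this sketch alone evaluates the fitness function many hundreds of times.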
 

## Interview Readiness
__Define data leakage and overfitting.__  
__Explain the effect of data leakage and overfitting on the performance of an ML model.__
 
In short, data leakage means that our model has access to information (when predicting on unseen data) that it would not actually have in a real-world scenario.  

A quick example is scaling/normalizing the data before train/validation/test splitting. The testing data will influence the scaling parameters of the training data, when in reality unseen data will not affect these parameters. Data leakage tends to inflate model performance metrics, so you’ll have an over-optimistic expectation for performance. Taking it a step further, if you do EDA on data that’s meant to be unseen then you’re running the risk of letting it influence your feature building.  
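To make the scaling example concrete, here is a small sketch that contrasts the leaky approach with the correct one. The dataset is hypothetical and deliberately extreme so that the difference in scaling parameters is obvious.

```python
from statistics import mean, pstdev

# Hypothetical one-feature dataset; the last two values play the "unseen" test set.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0, 120.0]
train, test = values[:6], values[6:]

# Leaky: scaling parameters are computed on train + test together,
# so the test values shift the mean and standard deviation.
leaky_mu, leaky_sigma = mean(values), pstdev(values)

# Correct: parameters come from the training split only and are then
# reused, unchanged, on the test split.
mu, sigma = mean(train), pstdev(train)
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(f"leaky mean={leaky_mu:.3f}, train-only mean={mu:.3f}")
```

The same rule applies to any fitted preprocessing step (imputation values, encoders, feature selection): fit on the training split, then apply to validation and test.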

Another quick example is temporal leakage or leakage due to poor stratification. For the temporal issue, consider a time-series modeling task where some of your training data comes from after some of your test data. Information about the test period was likely used to train. For the stratification issue, consider the classic example of multiple data points from the same patient. Random splitting could give you a patient’s data in all splits.  
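For the patient example, one way to avoid the problem is to split on patient IDs rather than on individual rows. The sketch below assumes rows are `(patient_id, measurement)` tuples; `group_split` and the toy data are hypothetical and only illustrate the idea.

```python
import random
from typing import List, Tuple

Row = Tuple[str, float]  # (patient_id, measurement)


def group_split(
    rows: List[Row], test_fraction: float, seed: int = 0
) -> Tuple[List[Row], List[Row]]:
    """Split by patient so that no patient appears in both train and test."""
    patients = sorted({pid for pid, _ in rows})
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    test_ids = set(patients[:n_test])
    train = [r for r in rows if r[0] not in test_ids]
    test = [r for r in rows if r[0] in test_ids]
    return train, test


# Hypothetical data: 5 patients with 3 visits each.
rows = [(f"p{i}", float(i * 10 + v)) for i in range(5) for v in range(3)]
train_rows, test_rows = group_split(rows, test_fraction=0.4)

overlap = {pid for pid, _ in train_rows} & {pid for pid, _ in test_rows}
print(len(train_rows), len(test_rows), overlap)
```

The temporal case is handled analogously: pick a cutoff time and put everything before it in training and everything after it in test, instead of splitting at random.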

Again, the worry of these is that they tend to inflate the model’s performance on held out data, which will make you over-confident in the model’s real-world performance.  

Overfitting, on the other hand, is when the model performs very well on the training data but generalizes very poorly. For example, a highly parameterized model could be at risk of overfitting because it can use those parameters to fit the training data very tightly (or exactly). This comes at the expense of test performance as the model contorts itself to fit the training data, for example by fitting itself to noise/randomness in ways that don’t represent the data-generating process.  

In the case of overfitting, the model will have strong training performance and weak test and/or validation performance. This is (usually) easier to spot than leakage, and many training procedures monitor validation performance so that they can stop training once overfitting starts.  
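A minimal sketch of that monitoring logic (often called early stopping) is shown below. The `patience` parameter and the loss curves are assumptions for illustration: training loss keeps falling while validation loss bottoms out and then rises.

```python
from typing import Sequence


def early_stop_epoch(val_losses: Sequence[float], patience: int = 2) -> int:
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stalled; stop training
    return best_epoch


# Hypothetical curves: training loss keeps falling, validation loss turns upward.
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08, 0.04]
val_losses = [0.95, 0.70, 0.55, 0.50, 0.53, 0.58, 0.66]

print(early_stop_epoch(val_losses))  # index of the lowest validation loss seen
```

The widening gap between `train_losses` and `val_losses` after the best epoch is the signature of overfitting described above.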

For what it’s worth, this isn’t the whole story for overfitting. There are many cases where a highly over-fit model is a useful discovery/inference tool in your toolkit.    

## Interview Readiness
### Explain what outliers are in your data.
### Explain at least two methods to deal with/treat outliers in your data.
 
I’m not sure if this is still referring to the Walmart data or Capstone.
For the Walmart data there were plenty of outliers. In particular, the weekly sales had a variety of outliers, including erroneous data (e.g., the negative values) and the extremely high values. For the high outliers, you might have noticed that some data points were high outliers with respect to the overall distribution of weekly sales (e.g., that exponentially-distributed tail). But many more data points were outliers with respect to their specific department. There were also outliers in the predictive features. Some were less obvious, for example, the bimodal features: were observations that fell between the modes outliers? From a probability density perspective I’d say so.  
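The "outlier with respect to its department" point can be illustrated with a small sketch. The numbers below are made up (they are not the real Walmart data), and the detector is one common rule of thumb, a modified z-score based on the median absolute deviation with a cutoff of 3.5; other detectors would work equally well here.

```python
from statistics import median
from typing import Dict, List


def mad_outliers(values: List[float], threshold: float = 3.5) -> List[bool]:
    """Flag values whose modified z-score (based on the median absolute
    deviation) exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


# Hypothetical weekly sales for two departments.
sales: Dict[str, List[float]] = {
    "small_dept": [10.0, 12.0, 11.0, 9.0, 10.0, 60.0],
    "large_dept": [900.0, 950.0, 1000.0, 1100.0, 980.0, 1020.0],
}

# Pooled across departments, a value of 60 looks unremarkable ...
pooled = [v for dept in sales.values() for v in dept]
flag_60_overall = mad_outliers(pooled)[pooled.index(60.0)]

# ... but within its own department it is far from typical.
flag_60_in_dept = mad_outliers(sales["small_dept"])[-1]

print(flag_60_overall, flag_60_in_dept)
```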

For my capstone, we also have outlier image data. This is usually due to hardware or environmental-control errors. For example, poor lighting obscuring the image or some (of the 10) hyperspectral filters having dropout.  

Two methods to deal with outliers:  
Probably the worst way to deal with them is to just remove the outlying data. This is for many reasons, including that the data may be perfectly correct and you simply have outlying or anomalous events in your data-generating process. By removing them you blind yourself to the nature of your (seen & unseen) data. This is essentially giving your model only the “easy” data to handle.  

So where possible, don’t just remove the outliers. Even if the data is an outlier because it’s blatantly incorrect then the goal should be remediation, not removal.  

I’d say that another highly suboptimal way to handle outliers is to treat them as errors and perform mean or median imputation. This is especially unhelpful if your data set comprises a large number of small clusters or highly correlated data. For example, consider a set of 100,000 medical observations from 10,000 patients. The data is not IID. It comprises many clusters of correlated data. So one patient’s “average feature” can be very different from the overall mean/median of that feature. If you did have to replace errors/outliers with less extreme values, something like Winsorizing (replacing values beyond a chosen percentile with the value at that percentile) or nearest-neighbor imputation might be the most sensible thing to do. The latter imputes data according to other close-by data, and you can even specify which features should be considered to determine the closest data points.  
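As a sketch of Winsorizing in its usual sense (clipping extreme values to chosen percentiles rather than dropping them), consider the snippet below. The `winsorize` helper is hypothetical and uses a deliberately simple floor-index quantile; library implementations may define the cut points slightly differently.

```python
from typing import List


def winsorize(
    values: List[float], lower_pct: float = 0.05, upper_pct: float = 0.95
) -> List[float]:
    """Clip values that fall below/above the given empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_pct * (n - 1))]  # simple floor-index quantile
    hi = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


# Hypothetical feature with one extreme value.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]
clipped = winsorize(data, lower_pct=0.1, upper_pct=0.9)
print(clipped)
```

For clustered data like the patient example, it would be reasonable to compute the cut points within each cluster (per patient) instead of over the whole data set.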

Very quickly, a third way to handle outliers is simply to use robust methods, like transformations and robust models that can accommodate outliers (as predictors or outcomes). Many objective functions have also been created to be resilient to outliers and not over-weight them. Whereas RMSE and BCE are sensitive to large outliers, methods like Huber loss, earth mover’s distance (Wasserstein loss), and mean absolute error (MAE) attempt to mitigate the trade-off between small predictive errors and large predictive errors.  
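To see why some losses are more outlier-resistant than others, compare how each one penalizes a small residual versus a large one. This is a small sketch using the standard per-sample definitions; the `delta` default of 1.0 is an arbitrary choice for illustration.

```python
def squared_error(residual: float) -> float:
    """Grows quadratically, so one large residual can dominate the total."""
    return residual ** 2


def absolute_error(residual: float) -> float:
    """Grows linearly in the size of the residual."""
    return abs(residual)


def huber(residual: float, delta: float = 1.0) -> float:
    """Quadratic for |residual| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)


for r in (0.5, 1.0, 10.0):
    print(r, squared_error(r), absolute_error(r), huber(r))
```

At a residual of 10 the squared error is 100 while the Huber loss is 9.5, so a single extreme point pulls far less weight under Huber (or absolute error) than under squared error.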

## Interview Readiness
### What is feature scaling and why is it important to our model?
### Explain the difference between Normalization and Standardization.
 
Feature scaling means to perform (typically algebraic) manipulations on your feature data. This is typically done to help the model optimizer (or inference algorithm for some statistical models) navigate the parameter space in search of a high-performing model.  

For example, gradient-based methods can be highly sensitive to the scaling of features: as they try to find the path of steepest descent, two equally important variables will produce very different gradients depending on the data range of each feature. Feature scaling helps level the playing field so that the optimizer has an easier time understanding which features should be tuned next as it greedily searches the parameter space.  

Another example is an algorithm like K-means clustering. In this case, the scale of the features directly influences the objective function because “scale” determines how far away some points are from each other.  
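A tiny sketch shows how scale alone can change which points look "close" under Euclidean distance, the quantity that K-means relies on. The three customers and the feature ranges used for rescaling are hypothetical values chosen for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (age in years, income in dollars)

# Hypothetical customers.
a: Point = (25.0, 50_000.0)
b: Point = (65.0, 50_500.0)  # very different age, similar income
c: Point = (26.0, 52_000.0)  # similar age, somewhat different income

# Assumed feature ranges used for min-max scaling (illustrative only).
AGE_RANGE = (20.0, 70.0)
INCOME_RANGE = (40_000.0, 60_000.0)


def rescale(p: Point) -> Point:
    """Map each feature onto [0, 1] using the assumed ranges."""
    age, income = p
    return (
        (age - AGE_RANGE[0]) / (AGE_RANGE[1] - AGE_RANGE[0]),
        (income - INCOME_RANGE[0]) / (INCOME_RANGE[1] - INCOME_RANGE[0]),
    )


raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw features, income dominates and `b` appears closer to `a`, despite the 40-year age gap. After rescaling, `c` is the nearer neighbor.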

Normalization and Standardization are both techniques to scale features. Normalization transforms the data to ensure that everything falls into a defined range like [0,1] or [-1,1]. 
In contrast, standardization is a reference to the “standard normal distribution”, a Gaussian with mean = 0 and variance = 1 (standard deviation = 1 in this case too). Standardization transforms the data to also have a mean of 0 and a standard deviation of 1. It does this by taking each feature, subtracting the feature mean, and dividing by the feature’s standard deviation. Note that although this produces features with the desired mean and standard deviation, this transformation (alone) does not make the feature Normally (Gaussian) distributed.  
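Both transformations are short enough to write out directly. The sketch below uses a hypothetical right-skewed feature and the population standard deviation; the helper names are illustrative.

```python
from statistics import mean, median, pstdev
from typing import List


def min_max_normalize(xs: List[float]) -> List[float]:
    """Rescale values linearly onto the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs: List[float]) -> List[float]:
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


# Hypothetical right-skewed feature.
skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 19.0]
normalized = min_max_normalize(skewed)
standardized = standardize(skewed)

print(min(normalized), max(normalized))
print(round(mean(standardized), 6), round(pstdev(standardized), 6))
print(median(standardized))  # below the mean of 0: the skew is still there
```

The standardized feature has mean 0 and standard deviation 1, yet its median sits below 0, which shows that the shape of the distribution is unchanged.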