## Dimensionality Reduction & Feature Selection
---
Link: https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/
![Dimensionality%20Reduction.png](attachment:Dimensionality%20Reduction.png)
- <strong>Missing Value Ratio:</strong> A variable with too many missing values carries little usable information. We can reduce the number of variables by dropping those whose share of missing values exceeds a chosen threshold.
`[Custom function written in pipeline_functions file]`
- <strong>Low Variance Filter:</strong> We apply this approach to identify and drop constant or near-constant variables. A variable that barely varies can explain little of the variation in the target, so it can usually be dropped safely. Variance depends on scale, so the variables should be on comparable scales before a non-zero threshold is applied. `[Use sklearn's VarianceThreshold()]`
- <strong>High Correlation Filter:</strong> A pair of highly correlated variables carries largely redundant information and increases multicollinearity in the dataset. We can use this technique to find highly correlated pairs and drop one variable from each pair.
`[Custom function written in pipeline_functions file]`
- <strong>Random Forest:</strong> This is one of the most commonly used techniques for estimating the importance of each feature in the dataset. We can rank the features by importance and keep only the top-ranked ones, which reduces dimensionality.
- <strong>Backward Feature Elimination & Forward Feature Selection:</strong> Backward elimination starts with all features and removes the least useful one at a time; forward selection starts with none and adds the most useful one at a time. Both refit the model at every step, so they take a lot of computational time and are generally used only on smaller datasets. `[Custom function below]`
- <strong>Factor Analysis:</strong> This technique is best suited to situations where we have a highly correlated set of variables. It groups the variables by their correlations and represents each group with a single factor.
- <strong>Principal Component Analysis:</strong> This is one of the most widely used techniques for data with linear structure. It projects the data onto a set of orthogonal components, ordered so that each explains as much of the remaining variance as possible. `[Use sklearn's PCA()]`
- <strong>Independent Component Analysis:</strong> We can use ICA to transform the data into statistically independent components, describing the data with fewer components.
- <strong>ISOMAP:</strong> We use this technique when the data is strongly non-linear.
- <strong>t-SNE:</strong> This technique also works well when the data is strongly non-linear, and it is especially effective for visualization.
- <strong>UMAP:</strong> This technique works well for high-dimensional data. Its run time is typically shorter than t-SNE's.
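
The three filter techniques at the top of the list can be sketched in a few lines. This is a minimal illustration, not the contents of the `pipeline_functions` file referenced above: the function names (`drop_high_missing`, `drop_low_variance`, `drop_high_correlation`), the thresholds (0.5, 0.0, 0.9), and the toy data are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def drop_high_missing(df, max_missing_ratio=0.5):
    """Missing value ratio: keep columns whose share of NaNs is at most the cutoff."""
    missing_ratio = df.isna().mean()
    return df.loc[:, missing_ratio <= max_missing_ratio]


def drop_low_variance(df, threshold=0.0):
    """Low variance filter: keep columns whose variance exceeds the threshold."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(df)
    return df.loc[:, selector.get_support()]


def drop_high_correlation(df, max_abs_corr=0.9):
    """High correlation filter: drop the later column of each highly correlated pair."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_abs_corr).any()]
    return df.drop(columns=to_drop)


# Hypothetical toy data: one near-duplicate, one constant, one mostly missing column
rng = np.random.default_rng(0)
n = 100
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + 0.01 * rng.normal(size=n),
    "b": rng.normal(size=n),
    "constant": np.ones(n),
    "mostly_missing": np.where(np.arange(n) < 80, np.nan, 1.0),
})

reduced = drop_high_correlation(drop_low_variance(drop_high_missing(toy)))
print(list(reduced.columns))
```

The missing-value filter runs first so that the later steps see complete columns. Which member of a correlated pair to drop is a design choice; this sketch simply keeps whichever column appears first.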

### Backward Feature Elimination

In a regression, the p-value of a feature measures its statistical significance: the probability of seeing a coefficient estimate at least this extreme if the feature's true coefficient were zero. Removing one feature changes the p-values of the others, because the remaining features now have to explain the target without it. We can therefore remove features one at a time, refit the model, and measure the resulting p-values of the remaining features. These p-values are then used to decide whether to keep a feature or not.

This is what we are doing in the code block below:
- First we calculate the p-values for all current columns.
- Then we locate the column with the highest p-value; if that p-value is greater than the chosen significance level (SL), we drop the column.
- Then we recalculate the p-values of the remaining columns and keep dropping the least significant column until every remaining p-value is at or below the significance level.

In [None]:
import numpy as np
import statsmodels.api as sm

def backwardElimination(x, Y, sl, columns):
    """Drop the least significant column until every p-value is at or below sl."""
    columns = np.asarray(columns)
    regressor_OLS = sm.OLS(Y, x).fit()
    # Stop at one column so the model is never fitted with no features
    while x.shape[1] > 1:
        # Locate the column with the highest p-value in the current fit
        pvalues = np.asarray(regressor_OLS.pvalues, dtype=float)
        worst = int(np.argmax(pvalues))
        if pvalues[worst] <= sl:
            break  # every remaining column is statistically significant
        # Drop that column and its name together so the two stay aligned
        x = np.delete(x, worst, 1)
        columns = np.delete(columns, worst)
        # Refit so the p-values and the final summary describe the reduced model
        regressor_OLS = sm.OLS(Y, x).fit()
    return x, columns, regressor_OLS.summary()

SL = 0.05
data_modeled, stat_significant_columns, report = backwardElimination(processed_train_set.values, train_targets.values, SL, selected_columns)
processed_train_set = processed_train_set[stat_significant_columns]
print("Starting features:", len(selected_columns))
print("Remaining statistically significant features:", processed_train_set.shape[1])
print(report)
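
Forward feature selection, mentioned in the overview above, has no custom function in this notebook. As a rough sketch of the idea, scikit-learn's `SequentialFeatureSelector` can add features one at a time. Note that it is a swapped-in technique: it ranks candidates by cross-validated score rather than by p-values, so it is not equivalent to the p-value procedure above. The synthetic data, the `LinearRegression` estimator, and `n_features_to_select=2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): only columns 0 and 3 influence the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

# Start from no features and greedily add the one that most improves the CV score
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
selector.fit(X, y)

# Indices of the columns that were kept
selected = np.flatnonzero(selector.get_support())
print(selected)
```

With this data the two informative columns, 0 and 3, are the ones retained. Setting `direction="backward"` gives the elimination variant, again driven by cross-validated score rather than significance tests.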