# Week 14 Group Homework

1. Using the documentation for Recursive Feature Elimination (RFE), apply this process to the crime dataset to create the best multivariate linear regression model: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
You can select what you're trying to predict, but be sure to indicate what that is. Be sure to explain what RFE is in the markdown. You should be able to answer this using what's on the documentation page and what you already know.

The goal of Recursive Feature Elimination (RFE) is to select a minimal set of informative features for use in an ML model. We give RFE a supervised learning estimator and the number of features we want to keep. The estimator is fit on the training data and assigns each feature a weight (for example, through `coef_` or `feature_importances_`). The features are then ranked by these weights, and the least important ones are dropped from the set. The estimator is refit on the remaining features, and the process repeats until only the desired number of features remains.

Why do we care? One way to strengthen the signal in our data set (without creating the artificial signals that arise from overfitting) is to reduce the data set to the smallest set of highly informative features. Recursive Feature Elimination is one tool that lets us approach that goal.
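The elimination loop described above can be sketched by hand. This is only an illustration under stated assumptions: the data is synthetic (not the crime dataset), `manual_rfe` is a hypothetical helper, and ordinary least squares via NumPy stands in for the estimator. Because all synthetic features are on the same scale, the absolute coefficient is a reasonable importance measure here.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the crime dataset):
# only columns 0 and 2 actually drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)


def manual_rfe(X, y, n_features_to_select):
    """Hypothetical helper: drop the weakest feature until n remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Fit ordinary least squares (with an intercept column) on the remaining features.
        design = np.column_stack([X[:, remaining], np.ones(len(X))])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Rank by absolute coefficient (ignoring the intercept) and drop the smallest.
        weakest = int(np.argmin(np.abs(coefs[:-1])))
        remaining.pop(weakest)
    return remaining


kept = manual_rfe(X_demo, y_demo, n_features_to_select=2)
print("Kept feature indices:", kept)
```

sklearn's `RFE` follows the same idea but takes its importances from the fitted estimator's `coef_` or `feature_importances_` attribute.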

In [5]:
import numpy as np
import pandas as pd

from sklearn.feature_selection import RFE
from sklearn.ensemble import AdaBoostRegressor

In [6]:
crime_df = pd.read_csv("../week_13/crime_data.csv")
crime_df.head(5)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7
0,478,184,40,74,11,31,20
1,494,213,32,72,11,43,18
2,643,347,57,70,18,16,16
3,341,565,31,71,11,25,19
4,773,327,67,72,9,29,24


In [8]:
# We predict X1 from the remaining columns (X2 through X7).
# RFE wraps an AdaBoostRegressor meta-estimator and is asked to keep only one feature.

X = crime_df.drop('X1', axis=1)
y = crime_df['X1']

estimator = AdaBoostRegressor(random_state=0, n_estimators=100)
# step=1 removes one feature per iteration.
selector = RFE(estimator, n_features_to_select=1, step=1)
selector = selector.fit(X, y)

filter = selector.support_
ranking = selector.ranking_

print("Mask data: ", filter)
print("Ranking: ", ranking)


Mask data:  [ True False False False False False]
Ranking:  [1 2 3 5 6 4]


In [9]:
# Display selected features:

features = np.array(X.columns)
print('All features: ')
print(features)

print('Selected features: ')
print(features[filter])

All features: 
['X2' 'X3' 'X4' 'X5' 'X6' 'X7']
Selected features: 
['X2']
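Since the question asks for a linear regression model, the same selection could also be tried with `LinearRegression` as the estimator. The sketch below is an assumption-laden illustration, not the result for the crime data: it uses a synthetic data frame with made-up column names (`f1` to `f4`) and standardizes the features first so that coefficient magnitudes are comparable.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the crime data (assumption): the target depends on f1 and f3 only.
rng = np.random.default_rng(0)
X_syn = pd.DataFrame(rng.normal(size=(150, 4)), columns=["f1", "f2", "f3", "f4"])
y_syn = 5.0 * X_syn["f1"] + 2.0 * X_syn["f3"] + rng.normal(scale=0.1, size=150)

# Standardize so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_syn)

# RFE ranks features by the fitted model's coefficients and drops one per iteration.
lin_selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
lin_selector.fit(X_scaled, y_syn)

selected = list(X_syn.columns[lin_selector.support_])
print("Selected features:", selected)
print("Ranking:", dict(zip(X_syn.columns, lin_selector.ranking_)))
```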


2. Create a list of preprocessing steps you should try when working to build a model. Briefly describe what each step is. Work with your group to come up with the most comprehensive list you can.

We came up with a nice long set of steps as a group in class, but my notebook did not save properly, so my answer will not match those of other group members.

Preprocessing tasks fall into these categories: Preparation/cleaning, Feature Selection, Data Transformation, Feature Engineering, and Dimension Reduction.

_Preparation/cleaning:_ In this stage, we deal with incomplete, noisy, and/or inconsistent records. Tasks include:
- Validate data types.
- Correct inconsistent data.
- Remove or replace uninformative characters.
- Remove outliers.
- Handle null values, which can be tricky to spot when they are encoded as placeholders such as '?'. Either remove them or replace them using an imputation technique (e.g. replace NaN with the column mean). sklearn has a class, SimpleImputer, that performs this substitution.
- Smooth noisy data. Plots (box, scatter) can identify outliers in numeric variables. Techniques such as binning, regression, or clustering can also be used (clustering also helps with outlier detection).
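A few of the cleaning steps above can be sketched as follows. The data frame, its column names, and the '?' placeholder are hypothetical and chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw data: '?' marks a missing value, and 'age' was read as text.
raw = pd.DataFrame({"age": ["25", "?", "35"], "income": [50.0, 60.0, np.nan]})

# Replace the placeholder with NaN, then validate/convert the data type.
clean = raw.replace("?", np.nan)
clean["age"] = pd.to_numeric(clean["age"])

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(clean), columns=clean.columns)
print(imputed)
```

Mean imputation is only one option; `SimpleImputer` also accepts strategies such as `"median"` and `"most_frequent"`.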

_Feature Selection:_ Here we identify and select the most informative features. We do this to improve the signal-to-noise ratio (SNR), reduce overfitting, and cut computation time. The Recursive Feature Elimination (RFE) algorithm, which repeatedly drops the least important input variables, is widely used for this purpose.

_Data Transformation:_ This set of tasks "massages" the data into forms with properties that are amenable to modeling. Tasks include:
- Normalization: a type of re-scaling where all values fall between 0 and 1 (inclusive). It removes the "weighting" effect that can arise when features are measured on different scales (e.g. mass and age). Can use MinMaxScaler.
- Standardization: many models, including regularized linear models and distance-based methods, behave best when features are on comparable scales. We rescale each variable so that it has a mean of zero and a variance of one. This shifts and rescales a variable but does not change the shape of its distribution. We accomplish this using sklearn's StandardScaler. The scaler should be fit on the training set only, and the same fitted scaler then applied to both the training and testing data sets.
- Discretization: the transformation of numerical variables into discrete categories (useful if the model expects categorical features).
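The three transformations above can be sketched with sklearn as follows. The tiny arrays are hypothetical, and `KBinsDiscretizer` is one possible choice for discretization rather than the only one.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: mass (kg) and age (years).
X_train = np.array([[50.0, 20.0], [70.0, 40.0], [90.0, 60.0]])
X_test = np.array([[60.0, 30.0]])

# Normalization: rescale each column of the training data to [0, 1].
minmax = MinMaxScaler().fit(X_train)
X_train_norm = minmax.transform(X_train)

# Standardization: fit on the training data, then reuse the fitted scaler on the test data.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Discretization: cut each column into 3 equal-width bins, encoded as 0, 1, 2.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_train_binned = binner.fit_transform(X_train)

print(X_train_norm)
print(X_train_std)
print(X_train_binned)
```

Fitting the scalers on the training data only keeps information from the test set from leaking into the model.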
    
_Feature Engineering:_ creation of new, higher-level or numerical features from raw input features. Examples include:
- Handling categorical values. (Ordinal and nominal values require different approaches.)
    - Ordinal: encode the categories as integers that preserve their order. Either create a dictionary of category-to-rank pairs and use the .map() function, or use an encoder class. Note that LabelEncoder assigns integers in sorted (alphabetical) order and is intended for target labels, so an explicit mapping is safer when the order matters.
    - Nominal: use One-Hot Encoding, either by hand or by using pd.get_dummies().

Alongside One-Hot Encoding we need to check for multicollinearity, a property of a feature set that can influence model behavior and distort interpretation of results. The simplest way is to generate a correlation matrix and drop one of each pair of strongly correlated features. A full set of one-hot columns is itself perfectly collinear (the columns always sum to one), which is usually handled by dropping one column per encoded variable.
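The two encoding approaches can be sketched as follows. The data frame, the column names, and the size ordering are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'size' is ordinal, 'color' is nominal.
df = pd.DataFrame({"size": ["S", "L", "M"], "color": ["red", "blue", "green"]})

# Ordinal: map each category to a rank that preserves the order S < M < L.
size_order = {"S": 0, "M": 1, "L": 2}
df["size"] = df["size"].map(size_order)

# Nominal: one-hot encode, dropping the first dummy column so the
# remaining dummy columns are not perfectly collinear.
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(encoded)
```

With `drop_first=True`, the dropped category (here `blue`) becomes the baseline that is implied when all other dummy columns are zero.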

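_Dimension reduction:_ reducing the number of features by removing or combining those that are redundant or of low relevance. We can achieve this using Principal Component Analysis (PCA), which projects the data onto a smaller number of uncorrelated components.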
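A minimal PCA sketch, assuming synthetic data in which one column is nearly a copy of another, so that two components capture almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the third column is nearly a copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X_red = np.column_stack([base, base[:, 0] + rng.normal(scale=0.01, size=100)])

# PCA is sensitive to scale, so standardize first.
X_red_std = StandardScaler().fit_transform(X_red)

# Project the three original columns onto two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_red_std)
print("Reduced shape:", X_pca.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The components are linear combinations of the original features, so they are harder to interpret than a subset of the original columns.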
    
We can also perform other tasks to reduce the data set, such as filtering based on certain criteria (usually a threshold). For example:
- _Missing values ratio_: removal of features whose fraction of missing values exceeds a specified threshold.
- _Low variance filter_: removal of normalized features that have variance below a certain threshold (low variation means low information).
- _High correlation filter_: removal of one feature from each pair of normalized features whose correlation coefficient is above a certain threshold.
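The three filters above can be sketched with pandas as follows. The data frame and all thresholds (0.5, 0.01, 0.95) are hypothetical, and the variance filter assumes the columns are already on comparable scales.

```python
import numpy as np
import pandas as pd

# Hypothetical data set, chosen only for illustration.
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],           # almost exactly 2 * a
    "c": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing
    "d": [7.0, 7.0, 7.0, 7.0, 7.1],           # nearly constant
    "e": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Missing values ratio: keep columns with at most 50% missing values.
step1 = data.loc[:, data.isna().mean() <= 0.5]

# Low variance filter: keep columns whose variance exceeds the threshold.
step2 = step1.loc[:, step1.var() > 0.01]

# High correlation filter: for each strongly correlated pair, drop the later column.
corr = step2.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
filtered = step2.drop(columns=to_drop)

print("Kept columns:", list(filtered.columns))
```

Looking only at the upper triangle of the correlation matrix ensures that one column of each correlated pair is kept.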