In [17]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_digits
import numpy as np


# Feature Engineering

feature engineering is the process of using domain knowledge of the data to create features that make ML algorithms work.
it typically involves the following steps:
1. **Feature selection**: selecting the most useful features to train on among the existing ones
    it is used for the following reasons:
        * simplification of models to make them easier for researchers/users to interpret
        * shorter training times
        * avoiding the curse of dimensionality
        * improving generalization by reducing overfitting

        use ***sklearn.feature_selection***
2. **Feature extraction**: combining existing features to produce a more useful one

## Feature Selection - Remove features with low variance

**VarianceThreshold** is a simple baseline approach to feature selection.
it removes all features whose variance doesn't meet some threshold.
by default, it removes all zero-variance features, i.e. features that have the same value in all samples.


In [6]:
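X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
p = 0.8
# a boolean (Bernoulli) feature has variance p * (1 - p), so this threshold
# drops features that are 0 or 1 in more than 80% of the samples
sel = VarianceThreshold(threshold=p * (1 - p))
sel.fit_transform(X)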

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
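
The per-column variances can be checked directly. The sketch below assumes the same `X` and `p` as the cell above and compares NumPy's population variance with the `variances_` attribute that `VarianceThreshold` exposes after fitting; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
p = 0.8
threshold = p * (1 - p)  # 0.16

# population variance of each column (ddof=0)
manual_variances = X.var(axis=0)

sel = VarianceThreshold(threshold=threshold).fit(X)
kept_mask = sel.get_support()  # True for the columns that are kept

print("variances:", manual_variances)
print("kept:", kept_mask)
```

The first column has variance 5/36 ≈ 0.139, which is below the threshold of 0.16, so it is the one that gets dropped.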

## Feature Selection - Based on Statistical Test

scikit-learn offers the following selection routines as transformer objects:
* **SelectKBest** removes all but the k highest scoring features
* **SelectPercentile** removes all but a user-specified highest scoring percentage of features
* **SelectFpr** selects the features with p-values below α (false positive rate test); **SelectFdr** selects features so that the estimated false discovery rate stays below α

these objects take as input a scoring function:

for regression: f_regression, mutual_info_regression

for classification: chi2, f_classif, mutual_info_classif
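
A minimal sketch of passing a scoring function to one of these selectors. The iris dataset and `percentile=50` are choices made here for illustration only; the variable names are not from the notes.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

X_iris, y_iris = load_iris(return_X_y=True)

# keep the top 50% of features ranked by the ANOVA F-score
selector = SelectPercentile(f_classif, percentile=50)
X_top = selector.fit_transform(X_iris, y_iris)

print(X_iris.shape, "->", X_top.shape)
print("kept columns:", selector.get_support(indices=True))
```

Swapping `f_classif` for `chi2` or `mutual_info_classif` changes only the scoring; the selector interface stays the same.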

### Chi-Squared Test

the chi-squared test is used when the variables being considered are categorical.
it determines whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
the larger the chi-squared value, the stronger the evidence that the variables are related.
for feature selection, the chi-squared statistic is computed between each feature and the class variable.
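
For reference, the statistic for a contingency table can be written as follows. The symbols are notation chosen here, not taken from the notes: $O_{rc}$ is the observed count in row $r$ and column $c$, $E_{rc}$ the expected count under independence, $R_r$ and $C_c$ the row and column totals, and $N$ the grand total.

$$
\chi^2 = \sum_{r}\sum_{c} \frac{(O_{rc} - E_{rc})^2}{E_{rc}}, \qquad E_{rc} = \frac{R_r \, C_c}{N}
$$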

In [15]:
X, y = load_digits(return_X_y=True)
display(X.shape)

# keep the 20 features with the highest chi-squared scores
X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
X_new.shape

(1797, 64)

(1797, 20)
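
To see which features survive, the fitted selector can be inspected. This is a sketch assuming the same digits data; `chi2` needs non-negative features, which holds here because the digits features are pixel intensities. Reshaping the mask to 8x8 is just a convenient way to view it as an image grid.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)

selector = SelectKBest(chi2, k=20).fit(X, y)

# indices of the 20 selected pixels (0..63)
selected_idx = selector.get_support(indices=True)

# boolean mask laid out like the 8x8 digit image
mask_image = selector.get_support().reshape(8, 8)

print(selected_idx)
print(mask_image.astype(int))
```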

In [19]:
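# exercise: chi-squared contribution of each row (A-D) of a contingency table

sum_all = 650  # grand total of the table
total_col = np.array([349, 151, 150])  # column totals
ratios_cols = total_col / sum_all  # share of each column in the grand total
display('ratios_cols',ratios_cols)

# expected counts for row A = column shares * row total (150)
expected_A = ratios_cols * 150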
display('expected_A',expected_A)
observed_A = np.array([90,30,30])
chi_squared_stat_A = (((observed_A-expected_A)**2)/expected_A).sum()
display('chi2_A',chi_squared_stat_A)


expected_B = ratios_cols * 150 
#display('expected_B',expected_B)
observed_B = np.array([60,50,40])
chi_squared_stat_B = (((observed_B-expected_B)**2)/expected_B).sum()
display('chi2_B',chi_squared_stat_B)

expected_C = ratios_cols * 200
#display('expected_C',expected_C)
observed_C = np.array([104,51,45])
chi_squared_stat_C = (((observed_C-expected_C)**2)/expected_C).sum()
display('chi2_C',chi_squared_stat_C)

expected_D = ratios_cols * 150
#display('expected_D',expected_D)
observed_D = np.array([95,20,35])
chi_squared_stat_D = (((observed_D-expected_D)**2)/expected_D).sum()
display('chi2_D',chi_squared_stat_D)

'ratios_cols'

array([0.53692308, 0.23230769, 0.23076923])

'expected_A'

array([80.53846154, 34.84615385, 34.61538462])

'chi2_A'

2.4008804721152197

'chi2_B'

12.665291983191755

'chi2_C'

0.5788511167194829

'chi2_D'

8.92617928655614
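
The same exercise can be done for the whole table at once. The sketch below stacks the four observed rows into one array and builds the expected counts with an outer product of row and column totals; the variable names are illustrative.

```python
import numpy as np

observed = np.array([
    [90, 30, 30],    # A
    [60, 50, 40],    # B
    [104, 51, 45],   # C
    [95, 20, 35],    # D
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# expected count per cell under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# contribution of each row, then the overall statistic
per_row = ((observed - expected) ** 2 / expected).sum(axis=1)
chi2_total = per_row.sum()

# degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(per_row, chi2_total, dof)
```

The per-row values match the four results above, and their sum is the chi-squared statistic of the full table with 6 degrees of freedom.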

## Feature Extraction

try out various attribute combinations.
a simple way to extract features:
1. check df.corr() to see the correlation between the attributes
2. apply a mathematical operation to existing attributes to create a new feature

this is an iterative process
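
A small sketch of the two steps. The DataFrame, its column names, and its values are made up here purely for illustration; they are not taken from the housing notebook.

```python
import pandas as pd

# toy data, invented for this example
df = pd.DataFrame({
    "total_rooms": [800, 1200, 450, 2000, 1500, 950],
    "households": [200, 250, 150, 400, 500, 190],
    "median_house_value": [300000, 380000, 210000, 400000, 220000, 390000],
})

# step 2: combine two attributes into a new feature
df["rooms_per_household"] = df["total_rooms"] / df["households"]

# step 1 (repeated after each new feature): correlation with the target
corr_with_target = df.corr()["median_house_value"].sort_values(ascending=False)
print(corr_with_target)
```

If the new feature correlates more strongly with the target than its ingredients do, it is a candidate to keep; otherwise try another combination.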

@see housing