## Feature Selection
[Backward elimination](https://en.wikipedia.org/wiki/Stepwise_regression)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import f
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
%matplotlib inline

In [2]:
data = pd.read_csv('winequality-red.csv', sep=';')
#data = pd.read_csv('winequality-white.csv', sep=';')

In [3]:
X, y = data.iloc[:,:-1].to_numpy(), data.iloc[:,-1].to_numpy()

In [4]:
scaler = MinMaxScaler(feature_range=(0,10))

In [5]:
num_examples, num_features = X.shape

In [6]:
X = scaler.fit_transform(X)
y = y.reshape((num_examples, 1))

**Normal Equation**:

In [7]:
def normal_equation(X, y):
    '''
    Returns the estimated parameters
    as a (n+1,1) vector
    [n: number of features].
    '''
    return np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
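
As a quick sanity check, the closed form `pinv(X^T X) X^T y` can be compared against NumPy's least-squares solver. The sketch below is not part of the notebook: the synthetic data, the seed, and the `add_intercept` helper are assumptions made for illustration only.

```python
import numpy as np

def normal_equation(X, y):
    # Same closed form as the notebook cell: theta = pinv(X^T X) X^T y
    return np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)

def add_intercept(X):
    # Hypothetical helper: prepend a column of ones for the constant term
    return np.hstack([np.ones((X.shape[0], 1)), X])

# Synthetic data with known parameters (illustrative values)
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 3))
true_theta = np.array([[1.5], [2.0], [-1.0], [0.5]])
X_demo = add_intercept(X_raw)
y_demo = X_demo.dot(true_theta) + 0.01 * rng.normal(size=(200, 1))

theta_ne = normal_equation(X_demo, y_demo)
theta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Both solvers should agree up to floating-point error
print(np.allclose(theta_ne, theta_ls))
```

Using the pseudo-inverse keeps the computation defined even when `X^T X` is singular, for example when two columns are collinear.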

In [8]:
def hypothesis(theta, X):
    '''
    Returns the hypothesis for the LinReg
    theta: (n+1, 1) vector [n: # features]
    X: (m, n+1) matrix [m: # examples]
    '''
    return X.dot(theta)

[Root Mean Squared Error](https://en.wikipedia.org/wiki/Root-mean-square_deviation)

In [9]:
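def rmse(predicted, actual):
    # NOTE: root of the *summed* squared errors (not divided by m),
    # i.e. sqrt(m) times the conventional RMSE.
    err = predicted - actual
    return float(np.sqrt(err.T.dot(err)).item())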
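
The `rmse` helper above takes the square root of the summed squared errors, whereas the definition in the linked article divides by the number of examples first. The two differ by a constant factor of `sqrt(m)`, so comparisons between models fitted on the same data are unaffected. A minimal sketch with toy values (the names `root_sse` and `rmse_mean` are assumptions, not notebook functions):

```python
import numpy as np

def root_sse(predicted, actual):
    # Quantity returned by the notebook's rmse: sqrt of the summed squared errors
    err = predicted - actual
    return float(np.sqrt(np.sum(err ** 2)))

def rmse_mean(predicted, actual):
    # Conventional RMSE: sqrt of the mean squared error
    err = predicted - actual
    return float(np.sqrt(np.mean(err ** 2)))

predicted = np.array([[1.0], [2.0], [3.0], [4.0]])
actual = np.array([[1.0], [2.0], [3.0], [6.0]])
m = len(actual)

# The two values differ by a factor of sqrt(m)
print(root_sse(predicted, actual), rmse_mean(predicted, actual))
```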

**F-test**:

[F-test to compare regression models](https://en.wikipedia.org/wiki/F-test#Regression_problems)
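
In symbols, writing $n$ for the sample size, $p_{UR}$ and $p_R$ for the number of parameters of the unrestricted and restricted models, and $RSS$ for the residual sum of squares, the statistic computed in the next cell is:

$$
F = \frac{\left(RSS_R - RSS_{UR}\right) / \left(p_{UR} - p_R\right)}{RSS_{UR} / \left(n - p_{UR}\right)}
$$

The null hypothesis (the dropped parameters are zero) is rejected at significance level $\alpha$ when $F$ exceeds the $(1-\alpha)$ quantile of the $F$ distribution with $(p_{UR} - p_R,\; n - p_{UR})$ degrees of freedom.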

In [10]:
def f_test(residuals_UR, residuals_R, params_UR, params_R, alpha):
    '''
    Performs an F-test between the unrestricted (UR)
    and restricted (R) model, with the informed
    level of significance (alpha).
    Receives the residuals and number of parameters
    for each model.
    Returns the F-statistic and F-critical,
    respectively.
    '''
    sample_size = len(residuals_UR)
    # Un-restricted RSS
    rss_UR = residuals_UR.T.dot(residuals_UR)
    # Restricted RSS
    rss_R = residuals_R.T.dot(residuals_R)
    
    num = (rss_R-rss_UR)/(params_UR-params_R)
    den = rss_UR/(sample_size-params_UR)
    f_stat = num/den
    f_crit = f.ppf(1.0-alpha,params_UR-params_R,sample_size-params_UR)
    return f_stat, f_crit
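
To see the test in action, the sketch below applies the same statistic to synthetic data in which one feature drives the target by construction. It is an illustration under stated assumptions: `f_test_scalar` is a condensed restatement that takes the RSS values directly, and `fit_rss`, the data, and the seed are hypothetical.

```python
import numpy as np
from scipy.stats import f

def f_test_scalar(rss_ur, rss_r, params_ur, params_r, n, alpha):
    # Same statistic as f_test above, but taking the RSS values directly
    f_stat = ((rss_r - rss_ur) / (params_ur - params_r)) / (rss_ur / (n - params_ur))
    f_crit = f.ppf(1.0 - alpha, params_ur - params_r, n - params_ur)
    return f_stat, f_crit

def fit_rss(X, y):
    # Residual sum of squares of the least-squares fit
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((X.dot(theta) - y) ** 2))

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=(n, 1))
x2 = rng.normal(size=(n, 1))
# y depends on x1 only; x2 has no effect by construction
y_demo = 1.0 + 2.0 * x1 + rng.normal(size=(n, 1))
X_full = np.hstack([np.ones((n, 1)), x1, x2])

rss_ur = fit_rss(X_full, y_demo)
rss_no_x1 = fit_rss(np.delete(X_full, 1, axis=1), y_demo)

# Dropping x1 should be rejected: the F-statistic exceeds the critical value
f_stat, f_crit = f_test_scalar(rss_ur, rss_no_x1, 3, 2, n, 0.05)
print(f_stat > f_crit)
```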

### Feature selection

In [11]:
def feature_selection(X, y, alpha=0.05):
    '''
    BACKWARD ELIMINATION
    Evaluates the models with one restriction
    (one less feature), apart from the constant term
    and compares with the unrestricted model,
    using an F-test with the informed level
    of significance (alpha).
    Returns two lists whose sizes are the number
    of columns of X. Index 0 refers to the unrestricted
    model (H0_rejected[0] is always True and rmses[0] is
    its RMSE); index i >= 1 holds the F-test result
    (True: null hypothesis rejected) and the RMSE
    obtained when column i is dropped.
    X must already account for the constant term
    (i.e., it must already contain a one-valued first column).
    '''
    _, num_features = X.shape
    unrestricted_theta = normal_equation(X,y)
    # Residuals of the unrestricted model (computed once, reused in every F-test)
    unrestricted_residuals = hypothesis(unrestricted_theta,X) - y
    H0_rejected, rmses = [True], [rmse(hypothesis(unrestricted_theta,X), y)]
    
    for restrict in range(1,num_features):
        # Drop column `restrict` and refit
        restricted_X = np.delete(X,restrict,axis=1)
        restricted_theta = normal_equation(restricted_X,y)
        
        # Performing the F-test
        restricted_residuals = hypothesis(restricted_theta,restricted_X) - y
        f_stat, f_crit = f_test(unrestricted_residuals,restricted_residuals,
                                num_features,num_features-1,alpha)
        
        # Evaluating the statistical significance
        H0_rejected.append(bool(f_stat > f_crit))
        rmses.append(rmse(hypothesis(restricted_theta,restricted_X),y))

    return H0_rejected, rmses
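
`feature_selection` performs a single pass: every candidate is tested against the same full model. Backward elimination as described in the linked article is iterative, dropping the weakest feature and refitting until all remaining features are significant. A minimal sketch of that loop follows, assuming a column of ones at index 0 that is never dropped; the names `rss` and `backward_elimination`, and the synthetic data, are hypothetical and not part of the notebook.

```python
import numpy as np
from scipy.stats import f

def rss(X, y):
    # Residual sum of squares of the normal-equation fit
    theta = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    err = X.dot(theta) - y
    return float(np.sum(err ** 2))

def backward_elimination(X, y, alpha=0.05):
    # X is assumed to carry a column of ones at index 0, which is never dropped
    kept = list(range(X.shape[1]))
    m = X.shape[0]
    while len(kept) > 1:
        p = len(kept)
        rss_ur = rss(X[:, kept], y)
        f_crit = f.ppf(1.0 - alpha, 1, m - p)
        # F-statistic for dropping each remaining non-constant column
        f_stats = {}
        for col in kept[1:]:
            reduced = [c for c in kept if c != col]
            f_stats[col] = (rss(X[:, reduced], y) - rss_ur) / (rss_ur / (m - p))
        weakest = min(f_stats, key=f_stats.get)
        if f_stats[weakest] > f_crit:
            break  # every remaining feature is significant
        kept.remove(weakest)
    return kept

# Synthetic data: columns 1 and 2 drive the target, columns 3 and 4 do not
rng = np.random.default_rng(1)
m = 300
feats = rng.normal(size=(m, 4))
X_demo = np.hstack([np.ones((m, 1)), feats])
y_demo = 2.0 + 3.0 * feats[:, [0]] - 3.0 * feats[:, [1]] + rng.normal(size=(m, 1))

kept = backward_elimination(X_demo, y_demo)
print(kept)
```

Refitting after each removal matters because the significance of one feature can change once a correlated feature has been dropped.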

In [12]:
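# NOTE: X is passed here without a leading column of ones, so column 0
# (the first feature) takes the place of the constant term and is never tested.
H0_rejected, rmses = feature_selection(X,y)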

In [13]:
print('> REPORT:\n\n')
for i in range(num_features):
    # i = 0 is the unrestricted model (nothing dropped): its RMSE is the baseline
    print('Feature number: ', i)
    print('Parameter might be zero? ', not H0_rejected[i])
    print('RMSE if dropped: ', rmses[i])
    print('\n', '-'*50 + '\n')

> REPORT:


Feature number:  0
Parameter might be zero?  False
RMSE if dropped:  35.607128175661956

 --------------------------------------------------

Feature number:  1
Parameter might be zero?  True
RMSE if dropped:  35.61252403044953

 --------------------------------------------------

Feature number:  2
Parameter might be zero?  True
RMSE if dropped:  35.64830657984974

 --------------------------------------------------

Feature number:  3
Parameter might be zero?  True
RMSE if dropped:  35.614877210046814

 --------------------------------------------------

Feature number:  4
Parameter might be zero?  False
RMSE if dropped:  35.746411436249176

 --------------------------------------------------

Feature number:  5
Parameter might be zero?  False
RMSE if dropped:  35.70781299296529

 --------------------------------------------------

Feature number:  6
Parameter might be zero?  True
RMSE if dropped:  35.63184620931896

 --------------------------------------------------

Fe

#### Features to eliminate:

**White wine**: 6

**Red wine**: 1, 3, 6, 7
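
Once the indices are chosen, the corresponding columns can be removed in one call to `np.delete`. The sketch below uses a placeholder array with 11 columns instead of the wine data; `X_demo`, `to_drop`, and `kept` are illustrative names.

```python
import numpy as np

# Placeholder matrix with 11 feature columns (stand-in for the scaled wine features)
X_demo = np.arange(3 * 11, dtype=float).reshape(3, 11)

# Column indices listed above for red wine
to_drop = [1, 3, 6, 7]

# Remove all selected columns at once; the remaining columns keep their order
X_reduced = np.delete(X_demo, to_drop, axis=1)
kept = [i for i in range(X_demo.shape[1]) if i not in to_drop]

print(X_reduced.shape, kept)
```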