## Feature selection

Create your own functions to filter out features based on the following criteria:

* variance lower than *x*
* share of missing values higher than *x* %
* one feature from each pair of features whose correlation is higher than *x*

Use two data sources as input:
- output dataset from the feature engineering exercise last week.
- output dataset from the PCA exercise

Apply your functions to the combination of these two data sources to produce the final dataset that can be used for training.

> #### Note
> Don't forget to keep the target variable (`duration_seconds`) intact.

In [5]:
import pandas as pd
import numpy as np

In [117]:
# output of the PCA exercise
pca_data = pd.read_csv("./pca_data.csv")
# output of the feature engineering exercise
df_numeric = pd.read_csv("./df_numeric.csv")


In [118]:
df = pca_data.merge(df_numeric, left_index=True, right_index=True)
print(df.shape)

(80332, 71)


In [119]:
target = df[["duration_seconds"]]
df.drop("duration_seconds", axis=1, inplace=True)

### missing values

In [120]:
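def eliminateLoadsOfMissings(x, df):
    """
    Drop columns whose share of non-missing values is at most x,
    then fill the remaining gaps with the column median.
    x should be between 0 and 1
    """
    filled_share = df.count() / len(df)
    kept = df.loc[:, filled_share > x]
    return kept.fillna(kept.median())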

In [121]:
# eliminate features with 10% or more missing values
df = eliminateLoadsOfMissings(0.9, df)
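
Note that `x` in the function above is the minimum share of non-missing values, so `0.9` corresponds to a maximum of 10% missing. The same filter can be read more directly in terms of the missing share. The sketch below shows this on a small toy frame; the column names, values, and the `max_missing` threshold are made up for illustration and are not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the exercise dataset).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],            # no missing values
    "b": [1.0, np.nan, 3.0, 4.0],         # 25% missing
    "c": [np.nan, np.nan, np.nan, 4.0],   # 75% missing
})

# Share of missing values per column.
missing_share = toy.isna().mean()

# Keep only columns whose missing share does not exceed the threshold.
max_missing = 0.5  # assumed threshold for this example
kept = toy.loc[:, missing_share <= max_missing]

# Fill the remaining gaps with the median of each kept column.
filled = kept.fillna(kept.median())

print(missing_share)
print(filled)
```

Inspecting `missing_share` before choosing a threshold makes it easier to see how many features a given cut-off would remove.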

### lower variance

In [122]:
from sklearn.feature_selection import VarianceThreshold

def eliminateLowVariance(x, df):
    """
    Drop columns whose variance is not greater than x.
    x is a variance threshold and should be non-negative
    """
    vt = VarianceThreshold(x)
    vt.fit(df)
    # get_support() marks the columns whose variance exceeds the threshold
    return df.loc[:, vt.get_support()]

In [123]:
df = eliminateLowVariance(0.00002, df)

In [124]:
df.shape

(80332, 50)
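
A variance threshold depends on the scale of each feature, which is worth keeping in mind when picking a value such as `0.00002`. The sketch below illustrates this with plain pandas on made-up data: two columns carry the same information at different scales, yet only one survives the filter. It uses the population variance (`ddof=0`), on the assumption that this matches what `VarianceThreshold` computes for dense input; the threshold of `1e-3` is arbitrary.

```python
import pandas as pd

# Toy data (hypothetical): the last two columns differ only by a scale factor.
toy = pd.DataFrame({
    "constant": [5.0, 5.0, 5.0, 5.0],
    "small_scale": [0.001, 0.002, 0.003, 0.004],
    "large_scale": [1000.0, 2000.0, 3000.0, 4000.0],
})

# Population variance of every column.
variances = toy.var(ddof=0)

# Keep only columns whose variance exceeds the threshold.
threshold = 1e-3  # assumed value for this example
kept = toy.loc[:, variances > threshold]

print(variances)
print(list(kept.columns))
```

If features live on very different scales, one option is to bring them to a comparable range (for example with min-max scaling) before applying a single threshold. Standardising to unit variance would not help here, since every non-constant column would then have the same variance.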

### correlated features

In [125]:
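def eliminateCorrFeatures(coef_, df):
    """
    For each pair of features with absolute correlation above coef_,
    drop the one that comes later in the column order.
    coef_ should be between 0 and 1
    """
    # step 1: absolute pairwise correlations
    df_corr = df.corr().abs()

    # step 2: positions above the threshold; x < y keeps each pair once
    # and skips the diagonal
    indices = np.where(df_corr > coef_)
    indices = [(df_corr.index[x], df_corr.columns[y]) for x, y in zip(*indices)
               if x < y]

    # step 3: drop the second feature of each pair without modifying the input df
    to_drop = {second for _, second in indices}
    return df.drop(columns=list(to_drop))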

In [126]:
df = eliminateCorrFeatures(0.5, df)

In [127]:
df.shape

(80332, 39)
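
To see which feature of a correlated pair is removed, it can help to run the same idea on a tiny example. The sketch below is an illustration on made-up data, not the notebook's function: it masks the correlation matrix to its strict upper triangle so that each pair appears once, then drops every column that correlates above the threshold with an earlier column. The column names and the `0.9` threshold are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): x_copy is an exact multiple of x.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_copy": [2.0, 4.0, 6.0, 8.0, 10.0],
    "noise": [3.0, -1.0, 2.0, -2.0, 1.0],
})

corr = toy.corr().abs()

# Strict upper triangle: each pair appears once, the diagonal is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it is highly correlated with any earlier column.
threshold = 0.9  # assumed value for this example
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Because the later column of each pair is the one removed, the result depends on column order. Reordering the columns before filtering changes which features survive.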

In [128]:
df = df.merge(target, left_index=True, right_index=True)

In [129]:
df.shape

(80332, 40)

In [130]:
df.to_csv("df_prepared.csv", index=False)