In [None]:
"""
We will continue in the notebook from last week, where we worked on house price prediction.

We are going to implement filter-based feature selectors using the following criteria:

- Small variance
- One feature from each pair of features whose correlation with each other exceeds a threshold x

Before doing any transformations, we will extract our target variable so that it stays unchanged. We can still 
transform it later, but it is good practice to do that separately:
"""


import numpy as np
import pandas as pd

y = df_numeric["SalePrice"]
df_numeric = df_numeric.drop(columns="SalePrice")

In [None]:
"""
Part 1: Removing Features With Small Variance
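First of all, we will remove the columns with very little variance. A feature whose values are nearly identical across all 
houses can do little to tell one house from another, so it usually has little predictive power. Keep in mind that variance 
depends on the scale of a feature, so a fixed threshold treats features measured in different units differently.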

For most of our variable selection, we can use methods from sklearn:
"""

from sklearn.feature_selection import VarianceThreshold

# keep only the features whose variance is greater than 0.1
vt = VarianceThreshold(threshold=0.1)
df_transformed = vt.fit_transform(df_numeric)
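
"""
To see what the threshold does, here is a minimal sketch on a made-up DataFrame (the column names and values are 
illustrative, not taken from the house price data). It relies on scikit-learn keeping only the features whose variance 
is greater than the threshold:
"""

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# hypothetical toy data: one constant, one nearly constant and one widely spread column
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "nearly_constant": [0.0, 0.1, 0.0, 0.1],
    "spread": [1.0, 5.0, 9.0, 13.0],
})

toy_vt = VarianceThreshold(threshold=0.1)
toy_vt.fit(toy)

# variances_ holds the per-column variance computed during fit
toy_variances = pd.Series(toy_vt.variances_, index=toy.columns)
# get_support() marks the columns whose variance exceeds the threshold
toy_kept = list(toy.columns[toy_vt.get_support()])

print(toy_variances)
print(toy_kept)
```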

In [None]:
"""
By default, fit_transform() in sklearn returns a NumPy ndarray rather than a DataFrame, so we lose the column names 
and need a small trick to get them back!

We don't need column names for modeling but it helps with the interpretation of modeling results.
"""

# columns we have selected
# get_support() is a method of VarianceThreshold that returns a boolean NumPy array
# with one entry per input column (True if the column was kept)
selected_columns = df_numeric.columns[vt.get_support()]
# rebuild a DataFrame from the array, passing the column labels and the original row index explicitly
df_transformed = pd.DataFrame(df_transformed, columns=selected_columns, index=df_numeric.index)

In [None]:
"""
Part 2: Removing Correlated Features
The goal of this part is to remove one feature from each highly correlated pair.

We are going to do this in three steps:

1. Calculate a correlation matrix
2. Get pairs of highly correlated features
3. Remove correlated columns
"""

# step 1
df_corr = df_transformed.corr().abs()

# step 2
# x < y keeps each pair once (upper triangle only) and skips the diagonal
indices = np.where(df_corr > 0.8)
indices = [(df_corr.index[x], df_corr.columns[y]) for x, y in zip(*indices) if x < y]

# step 3
for idx in indices:  # each pair
    try:
        # drop the second feature of the pair
        df_transformed.drop(idx[1], axis=1, inplace=True)
    except KeyError:
        # this column was already dropped as part of an earlier pair
        pass

"""
The code above drops one column from each pair whose absolute correlation is greater than 0.8. 
A column can appear in more than one pair, in which case a second attempt to drop it raises a KeyError. 
The try-except block lets the loop continue when that happens.

We can check the correlated columns by printing the indices:
"""

print(indices)
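
"""
Steps 2 and 3 can also be written without the explicit loop. The sketch below is one common alternative, not the approach 
used above: it masks the correlation matrix to its upper triangle and drops every column that is highly correlated with 
an earlier column. The helper name drop_correlated and the toy data are hypothetical:
"""

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.8):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # boolean mask that is True strictly above the diagonal, so each pair is seen once
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(upper_mask)
    # a column is dropped if it is highly correlated with any column to its left
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# made-up data: "b" is almost a multiple of "a", "c" is only weakly related to both
toy_corr = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

reduced, dropped = drop_correlated(toy_corr, threshold=0.8)
print(dropped)
print(list(reduced.columns))
```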

In [None]:
"""
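Part 3: Selecting the k Best Features
We have removed the low-variance features and the highly correlated features. The last thing we can do before modeling is to 
select the k best features in terms of their relationship with the target variable.
We will use SelectKBest for that. Note that it is a univariate filter method: each feature is scored against the target 
on its own, rather than by fitting models on growing feature subsets as a forward wrapper method would: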
"""

from sklearn.feature_selection import f_regression, SelectKBest
skb = SelectKBest(f_regression, k=10)
X = skb.fit_transform(df_transformed, y)


"""
We need to import the SelectKBest class. Plus, we have to decide which scoring function to use for the actual selection. 
Since our target is continuous, we also imported f_regression, which scores each feature with a univariate F-test of its 
linear relationship with the target. We could use another scoring function if, for example, the target variable 
was categorical.
"""


# Convert X back to a DataFrame and assign the correct column names.
# boolean mask marking the positions of the top 10 columns
mask = skb.get_support()
# column names of the selected features
selected_features = df_transformed.columns[mask]
X = pd.DataFrame(X, columns=selected_features)
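
"""
To understand why SelectKBest picked particular columns, we can look at the scores it computed during fitting. 
A minimal sketch on synthetic data (the column names signal and noise are made up for illustration):
"""

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
# the target depends linearly on "signal" only, plus a little random noise
y_toy = 3.0 * X_toy["signal"] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(f_regression, k=1).fit(X_toy, y_toy)

# scores_ holds one F-statistic per input column; a higher value means a stronger linear relationship
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
best = list(X_toy.columns[selector.get_support()])
print(scores)
print(best)
```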