# **Dimensionality Reduction Using Feature Selection**
**Thresholding Numerical Feature Variance**   
**Problem**   
You have a set of numerical features and want to remove those with low variance (i.e., likely containing little information).     
**Solution**   
**Select a subset of features with variances above a given threshold:**     
Principle: low-variance features tend to carry less information.      
Calculate the variance of each feature, then drop the features whose variance falls below the threshold.      
The features should be on the same scale, because a feature's variance depends on its units.
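
To see why scale matters, here is a minimal sketch using made-up height values (the numbers and variable names are illustrative assumptions, not part of the original data). The same measurements have a variance 10,000 times larger in centimeters than in meters, so a threshold of 0.5 would drop one version and keep the other.

```python
import numpy as np

# hypothetical heights of five people, in meters (illustrative values)
heights_m = np.array([1.60, 1.72, 1.81, 1.68, 1.75])

# the same measurements expressed in centimeters
heights_cm = heights_m * 100

# population variance of each version of the feature
var_m = heights_m.var()
var_cm = heights_cm.var()

# rescaling by a factor of 100 multiplies the variance by 100**2
print(var_m, var_cm, var_cm / var_m)
```

Note that standardizing every feature to unit variance before thresholding would defeat the purpose, since all non-constant features would then have a variance of 1.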

In [None]:
# Thresholding numerical feature variance
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

# load iris data
iris = datasets.load_iris()

# create features and target
X = iris.data
Y = iris.target

# create VarianceThreshold object with a variance threshold of 0.5
thresholder = VarianceThreshold(threshold=0.5)

# conduct variance thresholding
X_high_variance = thresholder.fit_transform(X)

# view first five rows with features with variances above threshold
X_high_variance[0:5]


array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2],
       [4.6, 1.5, 0.2],
       [5. , 1.4, 0.2]])

In [None]:
# view the variance of each feature via the fitted variances_ attribute

thresholder.fit(X).variances_

array([0.68112222, 0.18871289, 3.09550267, 0.57713289])
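
As a cross-check, the sketch below recomputes the variances with NumPy and lists which columns pass the threshold. It assumes that `variances_` holds the population variance (`ddof=0`) of each column, which matches the values printed above; `manual_variances`, `kept_mask`, and `kept_names` are names introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

# load iris data and fit the same thresholder as above
iris = datasets.load_iris()
X = iris.data
thresholder = VarianceThreshold(threshold=0.5).fit(X)

# population variance of each column, computed directly
manual_variances = X.var(axis=0)

# boolean mask of the features kept by the thresholder
kept_mask = thresholder.get_support()

# names of the kept features
kept_names = [name for name, keep in zip(iris.feature_names, kept_mask) if keep]

print(np.allclose(manual_variances, thresholder.variances_))
print(kept_names)
```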

**Handling Highly Correlated Features**   
**Problem**    
You have a feature matrix and suspect some features are highly correlated.     
**Solution**    
Use a correlation matrix to check for highly correlated features.
If highly correlated features exist, consider dropping one of the correlated features:

In [None]:
# load libraries
import pandas as pd
import numpy as np

# create feature matrix with two highly correlated features
X = np.array([[1, 1, 1],
              [2, 2, 0],
              [3, 3, 1],
              [4, 4, 0],
              [5, 5, 1],
              [6, 6, 0],
              [7, 7, 1],
              [8, 7, 0],
              [9, 7, 1]])

# convert feature matrix into DataFrame

df = pd.DataFrame(X)

# view the data frame
df

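|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 1 | 2 | 2 | 0 |
| 2 | 3 | 3 | 1 |
| 3 | 4 | 4 | 0 |
| 4 | 5 | 5 | 1 |
| 5 | 6 | 6 | 0 |
| 6 | 7 | 7 | 1 |
| 7 | 8 | 7 | 0 |
| 8 | 9 | 7 | 1 |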


In [None]:
# identify highly correlated features and drop one from each correlated pair
# create correlation matrix of absolute values
corr_matrix = df.corr().abs()

# select upper triangle of correlation matrix (excluding the diagonal)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# find feature columns whose correlation with an earlier column exceeds 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# drop features
df.drop(columns=to_drop)

|   | 0 | 2 |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 0 |
| 2 | 3 | 1 |
| 3 | 4 | 0 |
| 4 | 5 | 1 |
| 5 | 6 | 0 |
| 6 | 7 | 1 |
| 7 | 8 | 0 |
| 8 | 9 | 1 |


In [None]:
# correlation matrix
df.corr()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 0.976103 | 0.0 |
| 1 | 0.976103 | 1.0 | -0.034503 |
| 2 | 0.0 | -0.034503 | 1.0 |


In [None]:
# upper triangle of correlation
upper

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | NaN | 0.976103 | 0.0 |
| 1 | NaN | NaN | 0.034503 |
| 2 | NaN | NaN | NaN |
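
The steps above can be wrapped in a small reusable function. This is only a sketch: `drop_highly_correlated` and its `threshold` parameter are hypothetical names introduced here, and the rule of dropping the later column of each correlated pair is one simple choice among several.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.95):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    # absolute pairwise correlations between columns
    corr = frame.corr().abs()
    # keep only the entries strictly above the diagonal so each pair is checked once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # a column is dropped if it correlates too strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# same feature matrix as above
X = np.array([[1, 1, 1], [2, 2, 0], [3, 3, 1],
              [4, 4, 0], [5, 5, 1], [6, 6, 0],
              [7, 7, 1], [8, 7, 0], [9, 7, 1]])

reduced, dropped = drop_highly_correlated(pd.DataFrame(X))
print(dropped)
print(list(reduced.columns))
```

Which member of a correlated pair to keep is a judgment call; in practice it may be better to keep the feature that is easier to interpret or cheaper to collect.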




**Removing Irrelevant Features for Classification**     
**Problem**    
You have a categorical target vector and want to remove uninformative features.     
**Chi-square (χ²)**      
If the features are categorical, calculate a chi-square (χ²) statistic between each feature and the target vector.

In [None]:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load iris data
iris = load_iris()
# Create features and target
X = iris.data
y = iris.target
# Convert to categorical data by converting data to integers
X = X.astype(int)

# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)

# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features: ', X_kbest.shape[1])

Original number of features: 4
Reduced number of features:  2
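
To see which features were kept and why, the fitted selector can be inspected. The sketch below is a hedged illustration that reuses the same integer conversion as above; `selected` is a name introduced here. Note that `chi2` expects non-negative feature values, which holds for the iris measurements.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# load iris data and apply the same integer conversion as above
iris = load_iris()
X = iris.data.astype(int)
y = iris.target

# fit the selector without transforming, so its attributes can be inspected
chi2_selector = SelectKBest(chi2, k=2).fit(X, y)

# chi-squared statistic and p-value for each feature
for name, score, p_value in zip(iris.feature_names,
                                chi2_selector.scores_,
                                chi2_selector.pvalues_):
    print(f"{name}: chi2={score:.2f}, p={p_value:.4f}")

# column indices of the two selected features
selected = chi2_selector.get_support(indices=True)
print(selected)
```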


**ANOVA F-value**      
If the features are quantitative, compute the ANOVA F-value between each feature and the target vector.

In [None]:
# Load libraries
from sklearn.datasets import load_iris

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# Load iris data
iris = load_iris()

# Create features and target
X = iris.data
y = iris.target

# Create a SelectKBest object to select the two features with the best ANOVA F-values
fvalue_selector = SelectKBest(f_classif, k=2)
# Apply the SelectKBest object to the features and target
X_kbest = fvalue_selector.fit_transform(X, y)

# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features: ', X_kbest.shape[1])

Original number of features: 4
Reduced number of features:  2
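
Instead of a fixed number of features, a share of the features can be kept. The following sketch swaps in scikit-learn's `SelectPercentile` as one possible variation; the 75% cutoff and the name `X_top` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

# load iris data
iris = load_iris()
X = iris.data
y = iris.target

# keep the top 75% of features ranked by ANOVA F-value
fvalue_selector = SelectPercentile(f_classif, percentile=75)
X_top = fvalue_selector.fit_transform(X, y)

# F-value of each feature, in the original column order
print(fvalue_selector.scores_)

print('Original number of features:', X.shape[1])
print('Reduced number of features: ', X_top.shape[1])
```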
