
### Feature Selection

Feature selection is an important part of the feature engineering process. Features that are irrelevant or unrelated to the target add nothing to the model, so they should be left out of the machine learning process.

Reasons for performing feature selection:

*   Shorter training time, because there is less noise to process
*   Lower dimensionality
*   A simpler model that generalizes better
*   Better interpretability

#### Removing features with low variance

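This method removes features whose variance falls below a chosen threshold. The extreme case is a feature that takes the same value for every instance: its variance is zero, so it cannot help distinguish one instance from another and contributes nothing to the prediction. Features that are almost constant carry very little information for the same reason, so it is usually best to remove them.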

In [None]:
import sklearn.feature_selection as fs
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

# VarianceThreshold is a transformer that drops every feature whose variance
# does not exceed `threshold`; only features with a higher variance are kept.

variance = fs.VarianceThreshold(threshold = 0.2)
variance.fit(X)
X_transformed = variance.transform(X)

print(" Original Data ")
print(" ------------- ")
print(X)

print(" ")

print(" Transformed Data ")
print(" -------------- ")
print(X_transformed)

 Original Data 
 ------------- 
[[0 0 1]
 [0 1 0]
 [1 0 0]
 [0 1 1]
 [0 1 0]
 [0 1 1]]
 
 Transformed Data 
 -------------- 
[[0 1]
 [1 0]
 [0 0]
 [1 1]
 [1 0]
 [1 1]]
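To see why the first column was dropped, it can help to look at the variances the selector computed. The following is a minimal sketch on the same array, assuming the fitted selector's `variances_` attribute and `get_support()` method are used for inspection; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

selector = VarianceThreshold(threshold=0.2)
selector.fit(X)

# Per-column variance computed during fit (population variance, ddof=0)
variances = selector.variances_
# Boolean mask: True for the columns that survive the threshold
kept_mask = selector.get_support()

for index, (var, kept) in enumerate(zip(variances, kept_mask)):
    print("column {}: variance = {:.3f}, kept = {}".format(index, var, kept))
```

The first column is 1 in only one of six rows, so its variance is about 0.139, which is below 0.2. The other two columns have variances of about 0.222 and 0.25 and are kept.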


#### Select K-Best Features



In [None]:
import sklearn.datasets as datasets

X, Y = datasets.make_classification(n_samples=300, n_features=10, n_informative=4)

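# Use f_classif (ANOVA F-value) as the scoring function and keep the 3 best features.
# Both arguments are passed by keyword; recent scikit-learn versions require this for k.
kbest = fs.SelectKBest(score_func=fs.f_classif, k=3)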
kbest.fit(X,Y)

X_transformed = kbest.transform(X)

print(" Original Data ")
print(" ------------- ")
print(X)

print(" ")

print(" Transformed Data ")
print(" -------------- ")
print(X_transformed)
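`SelectKBest` scores every feature against the target and keeps the `k` features with the highest scores. The sketch below shows one way to inspect those scores and the chosen column indices. The fixed `random_state=0` is an assumption added here so the run is repeatable, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# random_state=0 is an assumption made for reproducibility
X, Y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

kbest = SelectKBest(score_func=f_classif, k=3)
X_selected = kbest.fit_transform(X, Y)

# One F-score per original feature
scores = kbest.scores_
# Column indices of the features that were kept
selected = kbest.get_support(indices=True)
# The same indices recovered by hand: the three largest scores
top3 = np.argsort(scores)[-3:]

print("F-scores:", np.round(scores, 2))
print("Selected columns:", selected)
print("Shape after selection:", X_selected.shape)
```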

#### Feature Selection Using Another Model



In [None]:
import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import GradientBoostingClassifier
import sklearn.metrics as metrics

X, Y = datasets.make_classification(n_samples=500, n_features=20, n_informative=6,random_state=21)

gbclassifier = GradientBoostingClassifier(n_estimators=20)
gbclassifier.fit(X,Y)

print("Feature Selection using GBDT")
print(gbclassifier.feature_importances_)

gbdtmodel = fs.SelectFromModel(gbclassifier, prefit=True)

# Features whose importance falls below the threshold are removed.
# With no threshold given, the mean importance is used for this estimator.
X_transformed = gbdtmodel.transform(X)

print(" Original Data Shape ")
print(" ------------- ")
print(X.shape)

print(" ")

print(" Transformed Data Shape after Feature Selection ")
print(" -------------- ")
print(X_transformed.shape)

Feature Selection using GBDT
[0.         0.00493847 0.         0.00773537 0.         0.13557457
 0.1600879  0.         0.         0.0490113  0.04407025 0.04873479
 0.0078042  0.         0.005109   0.         0.53693415 0.
 0.         0.        ]
 Original Data Shape 
 ------------- 
(500, 20)
 
 Transformed Data Shape after Feature Selection 
 -------------- 
(500, 3)
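Only three features remain because, for an estimator like this one, `SelectFromModel` uses the mean feature importance as its default threshold. With 20 features whose importances sum to 1, that mean is 0.05, and only three importances in the output above reach it. The sketch below reproduces the selection mask by hand. The `random_state=0` on the classifier is an assumption added for repeatability, so the exact importances may differ from the output shown above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, Y = make_classification(n_samples=500, n_features=20, n_informative=6,
                           random_state=21)

# random_state=0 is an assumption made for reproducibility
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, Y)

selector = SelectFromModel(model, prefit=True)
X_reduced = selector.transform(X)

# Rebuild the mask manually: keep features at or above the mean importance
importances = model.feature_importances_
manual_mask = importances >= importances.mean()

print("Mean importance:", importances.mean())
print("Kept columns:", np.flatnonzero(manual_mask))
print("Reduced shape:", X_reduced.shape)
```

A different cut-off can be requested through the `threshold` parameter of `SelectFromModel`, for example `threshold="median"`.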


### Feature Extraction



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I have an apple.", "The apple is red", "I like the apple",
    "I like the orange", "Apple and orange are fruit", "The orange is yellow"
]

counterVec = CountVectorizer()

counterVec.fit(corpus)

print("Get all the feature names of this corpus")

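# get_feature_names() was removed in recent scikit-learn versions;
# get_feature_names_out() is its replacement.
print(counterVec.get_feature_names_out())

print("The number of features is {}".format(len(
    counterVec.get_feature_names_out())))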

corpus_data = counterVec.transform(corpus)

print("The transformed data's shape is {}".format(corpus_data.toarray().shape))

print(corpus_data.toarray())
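Each column of the resulting matrix corresponds to one token, and each row holds that document's token counts. The sketch below, which assumes the default `CountVectorizer` settings, uses the fitted `vocabulary_` mapping to find the column for "apple" and read off its counts. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I have an apple.", "The apple is red", "I like the apple",
    "I like the orange", "Apple and orange are fruit", "The orange is yellow"
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each lowercased token to its column index
vocabulary = vectorizer.vocabulary_
apple_column = vocabulary["apple"]

# Count of "apple" in each of the six documents
apple_counts = counts.toarray()[:, apple_column]

print(sorted(vocabulary))
print("apple column:", apple_column)
print("apple counts per document:", apple_counts.tolist())
```

With the default settings, text is lowercased and single-character tokens such as "I" are dropped, so the vocabulary has 12 entries and "Apple" and "apple" share one column.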