# Feature Selection Cont'd
Our proposal describes the motivation for this project and the first part of feature selection.

The correlations among these factors are relatively high, which is not surprising. To address this, we apply several feature selection methods, described in the following part.
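
One way to check this kind of correlation is to list the factor pairs whose absolute correlation exceeds a cutoff. The sketch below is illustrative only: `high_correlation_pairs`, the 0.8 cutoff, and the synthetic factors are assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(X: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so that each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    result = pairs.reset_index()
    result.columns = ["feature_1", "feature_2", "abs_corr"]
    return result


# Synthetic factors (hypothetical): factor_b is almost a copy of factor_a.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "factor_a": base,
    "factor_b": base + 0.1 * rng.normal(size=500),
    "factor_c": rng.normal(size=500),
})

pairs = high_correlation_pairs(demo, threshold=0.8)
print(pairs)
```

A list like this makes it easier to see which factors are nearly redundant before choosing a selection method.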

Here we build five models to [select features](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/tree/master/03%20feature%20selection):

* [naiveSelection.py](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/03%20feature%20selection/naiveSelection.py)
* [pcaSelection.py](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/03%20feature%20selection/pcaSelection.py)
* [SVCL1Selection.py](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/03%20feature%20selection/SVCL1Selection.py)
* [treeSelection.py](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/03%20feature%20selection/treeSelection.py)
* [varianceThresholdSelection.py](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/03%20feature%20selection/varianceThresholdSelection.py)
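
As a rough illustration of how one of these selectors could look, here is a minimal tree-based sketch that follows the same call signature as `pcaSelection`. It is not the contents of `treeSelection.py`: the function name `tree_selection_sketch`, the choice of `ExtraTreesClassifier`, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel


def tree_selection_sketch(X_train, y_train, X_test, y_test=None, returnCoef=False):
    """Hypothetical tree-based selector with the shared call signature."""
    # Fit the selector on the training set only; the test set is just filtered.
    selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))
    selector.fit(X_train, y_train)

    kept = X_train.columns[selector.get_support()]
    # Importance score of every original feature, indexed by feature name.
    coef = pd.Series(selector.estimator_.feature_importances_, index=X_train.columns)

    if returnCoef:
        return X_train[kept], X_test[kept], coef
    return X_train[kept], X_test[kept]


# Synthetic data: only f0 and f1 drive the label (assumption for illustration).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 8)), columns=[f"f{i}" for i in range(8)])
y = (X["f0"] + X["f1"] > 0).astype(int)

X_tr, X_te, scores = tree_selection_sketch(
    X.iloc[:300], y.iloc[:300], X.iloc[300:], returnCoef=True
)
print(list(X_tr.columns))
print(scores.round(3))
```

By default `SelectFromModel` keeps the features whose importance is at least the mean importance, so the number of selected features is decided by the data rather than fixed in advance.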

To limit the correlation among the selected features as much as possible, we can use [an L1 (LASSO-style) penalty in an SVC model](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/03%20feature%20selection/SVCL1Selection.py). To reduce the features to a small number of components that capture most of the variance, we can use PCA. XGBoost also performs feature selection internally. Moreover, to make the feature selection models easy to call, we wrap each of them in a function with a standard interface.
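
The L1 idea can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions, not the code in `SVCL1Selection.py`: the use of `LinearSVC` inside `SelectFromModel`, the value `C=0.02`, and the synthetic data are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data: only x0 and x1 carry signal (assumption for illustration).
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(400, 10)), columns=[f"x{i}" for i in range(10)])
y = (X["x0"] + X["x1"] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Standardize first so that the L1 penalty treats all features on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# penalty="l1" requires dual=False in LinearSVC; a smaller C gives a sparser model.
svc = LinearSVC(penalty="l1", dual=False, C=0.02, max_iter=5000, random_state=0)
selector = SelectFromModel(svc).fit(X_scaled, y)

# Features whose coefficient was not shrunk to (near) zero are kept.
selected = X.columns[selector.get_support()].tolist()
print("selected features:", selected)
print("coefficients:", selector.estimator_.coef_.round(3))
```

The penalty strength `C` controls how many features survive, so in practice it would be tuned rather than fixed as it is here.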

Below is a sample feature selection function ([pcaSelection.py](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/03%20feature%20selection/pcaSelection.py)). As Table 1 shows, 12 PCA components explain 82% of the total variance, so we take 12 as the number of components to work with.

Table 1. Share of total variance explained by different numbers of PCA components

| Number of PCA components | Total explained variance |
| ------------------------ | ------------------------ |
| 6                        | 65%                      |
| 8                        | 74%                      |
| 10                       | 79%                      |
| 12                       | 82%                      |
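
A table of this kind can be produced by fitting PCA with all components and taking the cumulative sum of `explained_variance_ratio_`. The sketch below uses synthetic data (an assumption), so its printed percentages will not match Table 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the factor matrix: 30 features driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
loadings = rng.normal(size=(5, 30))
X = latent @ loadings + 0.5 * rng.normal(size=(500, 30))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components, then accumulate the explained variance ratios.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for k in (6, 8, 10, 12):
    print(f"{k:>2} components: {cumulative[k - 1]:.0%} of total variance")

# Smallest number of components whose cumulative share reaches 80%.
k_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print("components needed for 80%:", k_80)
```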

In [1]:
import warnings

import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

warnings.filterwarnings('ignore')

def pcaSelection(X_train, y_train, X_test, y_test, verbal = None, returnCoef = False):
    '''
    Feature selection method: 'pca'.
    Fit the scaler and the PCA model on X_train only (X_test is never used
    for fitting), then transform both X_train and X_test.

    Print information about the fitted selector.
    Return the transformed X_train and X_test, plus the coef or score of
    each feature if returnCoef is True (an empty Series for PCA).
    '''
    # standardize the features with a scaler fit on the training set
    features = X_train.columns.tolist()
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train))
    X_test = pd.DataFrame(scaler.transform(X_test))
    X_train.columns = features
    X_test.columns = features
    
    pca = PCA(n_components = 12)
    X_train = pca.fit_transform(X_train)
    print ('The explained variance ratio is:')
    print(pca.explained_variance_ratio_)
    print('The total explained variance ratio is ')
    print(sum(pca.explained_variance_ratio_))
    print ('The explained variance is:')
    print(pca.explained_variance_)
    X_test = pca.transform(X_test)
    
    # PCA has no per-feature coefficient, so return an empty Series
    coef = pd.Series(dtype=float)
    # featureName = None
    
    if verbal:
        print('The total feature number is '+ str(X_train.shape[1]))
        # print('The selected feature name is '+ str(featureName))
       
    if not returnCoef:
        return(X_train, X_test)
    else:
        return(X_train, X_test, coef)
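
One reason PCA helps with correlated factors is that the resulting components are uncorrelated on the data they were fit on. The following sketch checks this on synthetic data, using a scikit-learn pipeline that mirrors the scale-then-PCA steps above; the helper `max_offdiag_corr` and the data are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, strongly correlated features: 20 columns driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.3 * rng.normal(size=(400, 20))
X_train, X_test = X[:300], X[300:]

# Scaler and PCA are fit on the training rows only, then applied to the test rows.
reducer = make_pipeline(StandardScaler(), PCA(n_components=12))
Z_train = reducer.fit_transform(X_train)
Z_test = reducer.transform(X_test)


def max_offdiag_corr(A):
    """Largest absolute correlation between two different columns of A."""
    corr = np.corrcoef(A, rowvar=False)
    return np.abs(corr - np.diag(np.diag(corr))).max()


print("max |corr| before PCA:", max_offdiag_corr(X_train))
print("max |corr| after PCA: ", max_offdiag_corr(Z_train))
```

The components are only exactly uncorrelated on the training set; on the test set the correlations are typically small but not zero.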

# PART3 Building Classifiers

Since we have already converted the task into a classification problem, we build [classifiers](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/tree/master/04%20build%20classifier%20model) based on machine learning algorithms and apply them to the selected features.

## Machine Learning Algorithms

We implement [logistic regression](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/04%20build%20classifier%20model/MyClassifier.py), [naive Bayes](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/04%20build%20classifier%20model/MyClassifier.py), [KNN](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/04%20build%20classifier%20model/MyKNNClassifier.py), [perceptron](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/04%20build%20classifier%20model/MyClassifier.py), [decision tree](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/04%20build%20classifier%20model/MyDecisionTreeClassifier.py), [SVM](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/04%20build%20classifier%20model/MySVMClassifier.py), [XGBoost](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/04%20build%20classifier%20model/MyXGBoostClassifier.py) and [a Sequential neural network model in Keras](https://github.com/eiahb3838ya/PHBS_ML_for_quant_project/blob/master/04%20build%20classifier%20model/MyDeepLearningClassifier.py) to predict whether the Wind All A Index rises or falls the next day. Below is sample code implementing some of these algorithms as classifiers.

In [2]:
from sklearn.linear_model import LogisticRegression

class MyLogisticRegClassifier:
    def __init__(self):
        self.parameter = self.getPara()
        self.model = LogisticRegression()
        
    def getPara(self):
        # placeholder: choose the hyperparameters here, e.g. by cross-validation
        return({})
        
    def fit(self, X, y):
        # placeholder for optional diagnostics or plots before fitting
        return(self.model.fit(X, y))
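
To show how a wrapper of this shape might be used, here is a minimal sketch with a hypothetical `SketchLogisticRegClassifier` that adds a `predict` method. The synthetic data and the 80/20 split are assumptions; the split keeps the rows in their original order, which we assume is what a next-day prediction task on daily data would need.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


class SketchLogisticRegClassifier:
    """Hypothetical wrapper exposing the same fit/predict interface."""

    def __init__(self):
        self.model = LogisticRegression()

    def fit(self, X, y):
        return self.model.fit(X, y)

    def predict(self, X):
        return self.model.predict(X)


# Synthetic features and up/down labels (assumption for illustration).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=list("abcd"))
y = (X["a"] - X["b"] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Ordered split: train on the first 80% of rows, evaluate on the last 20%.
split = int(len(X) * 0.8)
clf = SketchLogisticRegClassifier()
clf.fit(X.iloc[:split], y.iloc[:split])
pred = clf.predict(X.iloc[split:])

acc = accuracy_score(y.iloc[split:], pred)
print("out-of-sample accuracy:", round(acc, 3))
```

Because every wrapper exposes the same two methods, the evaluation code does not need to change when one classifier is swapped for another.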

In [3]:
from xgboost import XGBClassifier
from parametersRepo import paraXGBoost
import matplotlib.pyplot as plt

class MyXGBoostClassifier:
    def __init__(self):
        self.parameter = self.getPara()
        self.model = XGBClassifier(random_state=self.parameter['model_seed'],
                                   n_estimators=self.parameter['n_estimators'],
                                   max_depth=self.parameter['max_depth'],
                                   learning_rate=self.parameter['learning_rate'],
                                   min_child_weight=self.parameter['min_child_weight'])
        
    def getPara(self):
        # placeholder: choose the hyperparameters here, e.g. by cross-validation
        return(paraXGBoost)
        
    def fit(self, X, y):
        # fit once, then report and plot the feature importances
        self.model.fit(X, y)
        importances = self.model.feature_importances_
        print('The feature importance is :')
        print(importances)
        
        plt.figure(figsize = (20, 6))
        plt.bar(range(len(importances)), importances)
        plt.title('The feature importance')
        # save before show(): the figure may be cleared once it has been shown
        plt.savefig('The feature importance')
        plt.show()
        # print('The total score of feature importance is:')
        # print(sum(importances))
        return(self.model)
        
    def predict(self, X):
        return(self.model.predict(X))
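
The `parametersRepo` module is not shown here. Judging only from the keys read in `__init__`, it would need to define a dictionary shaped like the one below. The values are placeholders chosen for illustration, not the project's tuned settings.

```python
# Hypothetical contents of parametersRepo.py; only the key names come from
# the classifier code, the values are illustrative placeholders.
paraXGBoost = {
    "model_seed": 100,        # random seed passed to XGBClassifier
    "n_estimators": 100,      # number of boosted trees
    "max_depth": 3,           # maximum depth of each tree
    "learning_rate": 0.1,     # shrinkage applied to each boosting step
    "min_child_weight": 1,    # minimum sum of instance weight in a child
}

print(sorted(paraXGBoost))
```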


In [4]:
from parametersRepo import *
from keras.models import Sequential
from keras.layers import Dense, Dropout
import pandas as pd

class MyDeepLearningClassifier:
    def __init__(self):
        self.parameter = None
        self.model = None
        
    def getPara(self):
        # returns the input dimension recorded by fit(), or None before fitting
        if self.parameter is None:
            print('Hi~ please first use fit function to get model :)')
        else:
            print('haha! We already trained deepLearning Model~')
        return self.parameter
        
    def fit(self, X, y):
        # one hidden layer with dropout, sigmoid output for binary classification
        self.parameter = len(X.columns)
        model = Sequential()
        model.add(Dense(30,activation = 'relu',input_shape=(len(X.columns),)))
        model.add(Dropout(0.1))
        model.add(Dense(1,activation = 'sigmoid' ))
        # model.summary()
        model.compile( loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'] )
        self.model = model
        return(self.model.fit(X, y,
                              validation_split=0.2, 
                              epochs=1, batch_size=10, verbose=2))
        
    def predict(self, X):
        # predict_classes() is not available in recent Keras releases;
        # threshold the sigmoid output at 0.5 instead
        return(pd.Series((self.model.predict(X) > 0.5).flatten()).astype(bool))
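
The final `Dense(1, activation='sigmoid')` layer outputs one probability per sample, so class labels come from thresholding that output. A minimal sketch of this step, where a hand-written array stands in for the output of `model.predict(X)` (an assumption, so that the example runs without Keras):

```python
import numpy as np
import pandas as pd

# Stand-in for model.predict(X): shape (n_samples, 1), values in [0, 1].
probs = np.array([[0.12], [0.50], [0.51], [0.93]])

# Label a sample True (rise) only when its probability is strictly above 0.5.
labels = pd.Series((probs > 0.5).flatten()).astype(bool)
print(labels.tolist())  # [False, False, True, True]
```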
