# Module 6 - Feature Selection - Homework

For this homework, I would like you to conduct your own feature selection procedure on the PIMA Native American dataset distributed with this module.

## About the PIMA dataset
+ Number of Instances: 768
+ Number of Attributes: 8 plus class 
+ For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)

+ Missing Attribute Values: Yes

+ Class Distribution: (class value 1 is interpreted as "tested positive for diabetes")

+ The data file does not contain any column names; you will have to generate them yourself!
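
Since the file has no header row, here is a minimal sketch of how column names can be supplied at load time. The names in `col_names` and the two data rows are made up for illustration; they are assumptions, not part of the distributed file.

```python
import io
import pandas as pd

# Hypothetical column names, one per attribute listed above (not in the data file).
col_names = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
             "insulin", "bmi", "pedigree", "age", "outcome"]

# Two illustrative rows standing in for the real file, which has no header line.
# The empty field in the second row is read as a missing value (NaN).
raw = io.StringIO("2,120,70,30,100,28.5,0.45,35,0\n"
                  "5,160,80,,200,34.1,0.80,52,1\n")

# header=None stops pandas from treating the first data row as column names.
toy = pd.read_csv(raw, header=None, names=col_names)
print(toy.shape)
print(toy.isna().sum())
```

With the real file you would pass its path instead of the `io.StringIO` object.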

This is a binary classification problem. To complete this homework you will need to load and tidy the data. Notice that there are missing data that need to be addressed. Use Table 1 to help reveal any issues with the data distributions. The data are also not partitioned, so you will have to conduct a 70:30 split before proceeding with feature selection. I would like you to compare the filter method, Boruta, and LASSO feature selection, and validate your results in a final linear model using your reserved testing set.

## Setup
Let's get all the requirements sorted before we move on to the exercise. Most packages should be familiar at this point. NumPy, pandas, matplotlib, and seaborn were all introduced in Part I of the workshop in modules 1-3, and last week in module 5 we introduced tableone. Notice that today we will be using sklearn for the first time to do some machine learning. Don't worry too much about the models we'll be using or how to train them for now. This will be the topic for modules 7 & 8.

In [None]:
# Requirements
!pip install --upgrade ipykernel
!pip install pandas
!pip install numpy
!pip install tableone
!pip install matplotlib
!pip install scikit-learn
!pip install boruta

# Globals
seed = 1017

#imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from tableone import TableOne
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

#magic
%matplotlib inline

## Loading the data
Use Table 1 to look at how the features are distributed, grouped by the outcome. I have used the `<your code>` notation to indicate where you have to fill in your own code.

In [None]:
# download the data as a pandas dataframe 
# Note, the datafile has no column names
df = pd.read_csv(<your code>)

# Generate table 1 - group by the outcome index
TableOne(df, groupby=df.columns[<your code>],
         pval=True,
         dip_test=True,
         normal_test=True,
         tukey_test=True)

Let's address the 2 warnings raised by Table 1 and see if we have to reformat some of the features.

### Addressing the warnings
Let's have a look at the distributions for those features that appeared in the warnings.

In [None]:
#plot the feature distributions
for feat in df.columns: 
    df[[feat]].dropna().plot.kde(bw_method='scott') #use bw_method=.02 for a lower bandwidth gaussian representation
    plt.legend([feat])
    plt.show()
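
If you would like a number to go with the plots, sample skewness is one way to flag long tails. The sketch below uses synthetic data, and the cutoff of 1 is an arbitrary rule of thumb I picked for illustration, not a fixed standard.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1017)

# Synthetic stand-in data: one roughly symmetric feature, one with a long right tail.
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=70, scale=10, size=500),
    "long_tail": rng.lognormal(mean=4, sigma=0.8, size=500),
})

# Sample skewness per column; values near 0 suggest a symmetric distribution.
skew = toy.skew()

# The cutoff of 1 is an assumption made for this sketch.
long_tailed = skew[skew > 1].index.tolist()
print(skew.round(2))
print(long_tailed)
```

The same two lines (`df.skew()` and the filter) can be applied to your own dataframe once the missing values are handled.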

### Tasks:
1. Impute missing values with the feature mean.
2. Tuck in any features with long tails using a log2 transform.
3. Partition your data into 70% training and 30% testing
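
The three tasks can be sketched on a toy dataframe as follows. The column names and values are synthetic assumptions; your own code will need to work with the columns you named when loading the data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1017)

# Synthetic stand-in for the homework data; column names are hypothetical.
toy = pd.DataFrame({
    "insulin": rng.lognormal(mean=4, sigma=0.8, size=100),
    "age": rng.integers(21, 80, size=100).astype(float),
    "outcome": rng.integers(0, 2, size=100),
})
toy.loc[[3, 17, 42], "insulin"] = np.nan  # inject a few missing values

# 1. Replace each missing value with the mean of its own column.
toy = toy.fillna(toy.mean())

# 2. log2-transform the long-tailed column; values must be strictly positive.
toy["insulin"] = np.log2(toy["insulin"])

# 3. 70:30 split, stratified so both parts keep a similar class balance.
toy_train, toy_test = train_test_split(
    toy, test_size=0.30, random_state=1017, stratify=toy["outcome"])
print(toy_train.shape, toy_test.shape)
```

Stratifying on the outcome is optional here. Also note that computing the mean on the training portion only would avoid leaking test information; the order above simply follows the task list.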

In [None]:
#Impute any missing values with their column mean
df.fillna(<your code>, inplace=True)

In [None]:
#log2 transform - you will need to identify any features with long tails
df[cols] = np.log2(<your code>)

In [None]:
#70-30 partition
df_test = <your code> 
df_train = <your code> 

## Comparing Models
Let's define a function that compares the performance of the prodigious (all features) model with that of the parsimonious (selected features only) model.

In [None]:
#define function that compares selected features to full model
def compare_models(dataset, selfeat):
    """compare parsimonious and full linear model"""
    
    # get predictors and labels
    X = dataset.drop(<your code>,axis=1)  #independent columns
    y = dataset[<your code>]    #outcome

    #get selected feature indices
    isel = [X.columns.get_loc(feat) for feat in selfeat if feat in X]
    
    #70-30 split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=seed)
 

    #define the prodigious and parsimonious linear models
    #(LinearRegression is imported directly in the setup cell)
    prodmodel = LinearRegression()
    parsmodel = LinearRegression()

    #Fit the models
    prodmodel.fit(X_train, y_train)
    parsmodel.fit(X_train[selfeat], y_train) 

    #Report scores (for LinearRegression, score() returns R^2 on the test data)
    display('Prodigious Model Score: %.2f' %prodmodel.score(X_test, y_test))
    display('Parsimonious Model Score: %.2f' %parsmodel.score(X_test[selfeat], y_test))

    return

## Filter Method
Table 1 has conveniently calculated the association of each feature with the outcome. Let's select only those features that are significantly (p<.05) associated.

In [None]:
selfeat = [<your code>]
compare_models(df_train, selfeat)
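
If you want to compute the p-values yourself rather than read them off Table 1, a filter can be sketched as below. This uses sklearn's `f_classif` (a one-way ANOVA F-test) on synthetic data as a stand-in; tableone chooses its own tests, so the p-values need not match exactly.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(1017)
n = 300
y = rng.integers(0, 2, size=n)

# Synthetic stand-in data: "signal" shifts with the class, "noise" does not.
X = pd.DataFrame({
    "signal": rng.normal(size=n) + 1.5 * y,
    "noise": rng.normal(size=n),
})

# One F-test per feature against the binary outcome.
_, pvals = f_classif(X, y)
pvals = pd.Series(pvals, index=X.columns)

# Keep features whose association with the outcome has p < .05.
selfeat_toy = pvals[pvals < 0.05].index.tolist()
print(pvals)
print(selfeat_toy)
```

Because each feature is tested on its own, a filter like this ignores redundancy between features, which is one reason to compare it with Boruta and LASSO.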

## Boruta

In [None]:
# get predictors and labels
X = np.array(df_train.drop(<your code>, axis=1)) 
y = np.array(df_train[<your code>])

# define random forest classifier for boruta
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
forest.fit(X, y)

# define Boruta feature selection method
feat_selector = <your code>

# find all relevant features
feat_selector.fit(X, y)

# zip my names, ranks, and decisions in a single iterable
feature_ranks = list(zip(df.columns.drop(<your code>), 
                         feat_selector.ranking_, 
                         feat_selector.support_))

# iterate through and print out the results
for feat in feature_ranks:
    display('Feature: {:<25} Rank: {},  Keep: {}'.format(feat[0], feat[1], feat[2]))
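
To give some intuition for what Boruta does, here is a sketch of a single round of its shadow-feature idea on synthetic data. This is not BorutaPy itself: the real method repeats such rounds many times and applies a statistical test to the hit counts before accepting or rejecting a feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1017)
n = 400
y = rng.integers(0, 2, size=n)

# Synthetic stand-in data: column 0 carries signal, columns 1-2 are pure noise.
X = np.column_stack([
    rng.normal(size=n) + 2.0 * y,
    rng.normal(size=n),
    rng.normal(size=n),
])

# Shadow features: each column shuffled independently, which breaks any link to y.
shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=1017)
forest.fit(np.hstack([X, shadows]), y)

real_imp = forest.feature_importances_[:X.shape[1]]
shadow_imp = forest.feature_importances_[X.shape[1]:]

# A real feature scores a "hit" in this round if it beats the best shadow.
hits = real_imp > shadow_imp.max()
print(hits)
```

A noise column can still beat the best shadow by chance in one round, which is why the full algorithm accumulates hits over many iterations.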


## LASSO

In [None]:
from sklearn.linear_model import LassoCV

# get predictors and labels
X = np.array(df_train.drop(<your code>, axis=1)) 
y = np.array(df_train[<your code>])

#train lasso model with 5-fold cross validation
lasso = <your code>

#display the model score
lasso.score(X, y)

#plot feature importance based on coefficients
importance = np.abs(lasso.coef_)
feature_names = np.array(df.columns.drop(<your code>))
plt.bar(height=importance, x=feature_names)
plt.xticks(rotation=90)
plt.title("Feature importances via coefficients")
plt.show()
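
Here is a minimal sketch of LASSO-based selection on synthetic data, in case the pattern is unfamiliar. Standardizing the predictors first is my own addition (an assumption, not a requirement of the homework); it puts all coefficients on a comparable scale before the penalty is applied.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1017)
n = 300

# Synthetic stand-in data: only the first two of five columns drive the outcome.
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Standardize so the penalty treats every coefficient on the same scale.
X_std = StandardScaler().fit_transform(X)

# 5-fold cross-validation picks the penalty strength alpha.
lasso_toy = LassoCV(cv=5, random_state=1017).fit(X_std, y)

# Indices of features whose coefficient was not shrunk to exactly zero.
kept = np.flatnonzero(lasso_toy.coef_ != 0)
print(lasso_toy.alpha_, kept)
```

The cross-validated alpha is tuned for prediction, so it may leave a few weak features with small non-zero coefficients; the bar plot of absolute coefficients helps you judge those.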

## Report
Create a final logistic regression model with your selected features and compute its accuracy in predicting outcomes in the reserved testing set.
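
The general pattern looks something like the sketch below, shown on synthetic data so that it does not give away your feature choices. The variable names are hypothetical; in your solution the training and testing data come from your own 70:30 partition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1017)
n = 400
y = rng.integers(0, 2, size=n)

# Synthetic stand-in for the selected features; both columns shift with the class.
X = np.column_stack([rng.normal(size=n) + 1.5 * y,
                     rng.normal(size=n) + 1.0 * y])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1017, stratify=y)

# Fit on the training portion only, then score on the held-out portion.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"Test accuracy: {acc:.2f}")
```

Accuracy is the fraction of test cases classified correctly, so it is worth comparing against the share of the majority class in your testing set.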

In [None]:
#train a logistic regression model and report accuracy
<your code>