# Feature Selection

## Feature exploration
The next few cells fit linear regressions to KNN-smoothed health outcomes and use the fitted coefficients and R2 scores to explore feature importance.

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor #public import path; the sklearn.neighbors.regression submodule was removed in later scikit-learn releases
from sklearn.metrics import mean_squared_error, r2_score


In [2]:
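def health_smoothing(df,health,cols,rad=10):
    """Add a KNN-smoothed copy of column `health` to df (modified in place) and return df."""
    X=df[cols] #features
    y=df[health] #value to be predicted

    #rad is the number of neighbors averaged over, not a distance radius
    knn=KNeighborsRegressor(n_neighbors=rad).fit(X,y) #fit KNN for smoothing

    Y=knn.predict(X) #smoothed column
    df[health+'-smooth']=Y #make new column in dataframe
    return df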

def regression_details(X,y,linreg,nhb,cols,health):
    #write a report
    Y=linreg.predict(X) #predictions of the fitted linear model
    print("For",health,"the features are ",cols,"and the number of neighbors is",nhb)
    print("For",health,"the coefficients are",str(linreg.coef_))
    print("For",health,"the intercept is",str(linreg.intercept_))
    #note: r2_score is documented as r2_score(y_true, y_pred); here the predictions Y are passed first
    print("For",health,"the R2 score is",str(r2_score(Y,y)))
    print('')
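
To see what the smoothing step does in isolation, here is a minimal sketch on synthetic data. The toy frame, the `outcome` column, and the choice of 20 neighbors are illustrative assumptions and are not taken from the notebook's dataset. Each value is replaced by the mean outcome of its nearest neighbors in feature space, which should reduce the noise in the target column.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption, not the notebook's dataset):
# a weak linear signal in "density" plus Gaussian noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"density": rng.uniform(0, 100, size=200)})
toy["outcome"] = 50 - 0.1 * toy["density"] + rng.normal(0, 5, size=200)

# Fit KNN on the feature and predict on the same rows: each prediction is
# the mean outcome of the 20 nearest rows in feature space.
knn = KNeighborsRegressor(n_neighbors=20).fit(toy[["density"]], toy["outcome"])
toy["outcome-smooth"] = knn.predict(toy[["density"]])

raw_std = toy["outcome"].std()
smooth_std = toy["outcome-smooth"].std()
print("std before smoothing:", raw_std)
print("std after smoothing: ", smooth_std)
```

Because the smoothed column is itself a function of the chosen features, a regression fitted to it is describing the smoothed target rather than the raw one.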
 
    

In [3]:
df_main=pd.read_csv("data/normalized-health-and-environmental-train.csv") #read in data

In [4]:
#this dictionary has the health issue as keys and the intuitively reasonable features as values
D=({'no-asthma':['density',"pollution"], 
    'sleep >7':['density','commute','pollution','safety'],
    'no-obesity':['commute', 'safety','density','pollution'],
    'no-mental-health-prob':['commute', 'safety','density','pollution']})

for health in ['no-obesity', 'sleep >7', 'no-asthma','no-mental-health-prob']:
    df_smooth=health_smoothing(df_main,health,D[health],500) #get smoothed column using 500 neighbors (the function default is 10)
    smoothed_col=health+"-smooth" #name of smoothed column
    
    X=df_smooth[D[health]] #predictors
    y=df_smooth[smoothed_col] #value to be predicted
    linreg=LinearRegression().fit(X,y) #fit a linear model
    regression_details(X,y,linreg,500,D[health],health) #print out details of model

For no-obesity the features are  ['commute', 'safety', 'density', 'pollution'] and the number of neighbors is 500
For no-obesity the coefficients are [ 0.09676281  0.16532088  0.04569635 -0.03040415]
For no-obesity the intercept is 57.22395662824114
For no-obesity the R2 score is 0.4369665068762024

For sleep >7 the features are  ['density', 'commute', 'pollution', 'safety'] and the number of neighbors is 500
For sleep >7 the coefficients are [-0.04181844 -0.09351287  0.02204634  0.14236295]
For sleep >7 the intercept is 60.356156810529804
For sleep >7 the R2 score is 0.8119243635869375

For no-asthma the features are  ['density', 'pollution'] and the number of neighbors is 500
For no-asthma the coefficients are [-0.02245156  0.00429997]
For no-asthma the intercept is 91.27795288857524
For no-asthma the R2 score is 0.006786551957914222

For no-mental-health-prob the features are  ['commute', 'safety', 'density', 'pollution'] and the number of neighbors is 500
For no-mental-health-pr

Pollution has the smallest coefficient in magnitude in every model whose output is shown above (-0.030 for no-obesity, 0.022 for sleep >7, 0.004 for no-asthma). I believe this is due to the pollution measure I am using, which appears to be normalized in a strange way. I have thus decided to drop pollution as a feature.
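
One way to probe the normalization concern is to compare the scale of each feature column, since the size of a linear-regression coefficient depends on the units of its feature. The sketch below is only an illustration: `scale_summary` is a hypothetical helper and the toy values are made up to show two columns on very different scales.

```python
import pandas as pd

def scale_summary(df, cols):
    """Return min, max, mean, and std per column (hypothetical helper)."""
    # agg with a list of function names yields one row per statistic;
    # transposing gives one row per feature instead.
    return df[cols].agg(["min", "max", "mean", "std"]).T

# Made-up values, chosen only so the two columns have mismatched scales.
toy = pd.DataFrame({
    "density": [10.0, 40.0, 70.0, 100.0],
    "pollution": [0.001, 0.002, 0.003, 0.004],
})

summary = scale_summary(toy, ["density", "pollution"])
print(summary)
```

On the real data, the same helper could be called as `scale_summary(df_main, ['density', 'commute', 'pollution', 'safety'])`.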

Safety has the largest coefficient in the sleep model (0.142), which seems too strong a contribution; it may be tracking income rather than affecting sleep directly. I will drop this feature for sleep.
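
If an income measure were available, the suspicion that safety tracks income could be checked with a correlation matrix. Whether the dataset contains such a column is not known here, so the `income` column below is a hypothetical, synthetic stand-in, and the numbers it produces say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption): "safety" is constructed to follow
# "income" closely, while "commute" is generated independently.
rng = np.random.default_rng(1)
income = rng.normal(50, 10, size=300)
toy = pd.DataFrame({
    "income": income,
    "safety": 0.8 * income + rng.normal(0, 3, size=300),
    "commute": rng.normal(30, 5, size=300),
})

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

On the real data, `df_main[['density', 'commute', 'pollution', 'safety']].corr()` would show how strongly the existing features track one another.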

## Final feature choice

In view of the above analysis, the feature lists are reduced: pollution is dropped from every model and safety is dropped for sleep. The cells below rerun the analysis with the reduced features. The final feature choices are also backed up by public health studies that show the relevant correlations.


In [5]:
#This dictionary has the reduced feature lists as values
D=({'no-asthma':['density'], 
    'sleep >7':['density','commute'],
    'no-obesity':['commute', 'safety','density'],
    'no-mental-health-prob':['commute', 'safety','density']})

for health in ['no-obesity', 'sleep >7', 'no-asthma','no-mental-health-prob']:
    df_smooth=health_smoothing(df_main,health,D[health],500) #use 500 neighbors (the function default is 10)
    smoothed_col=health+"-smooth" #name of smoothed column
    
    X=df_smooth[D[health]]
    y=df_smooth[smoothed_col]
    linreg=LinearRegression().fit(X,y) #fit a linear model
    regression_details(X,y,linreg,500,D[health],health) #print out details of model


For no-obesity the features are  ['commute', 'safety', 'density'] and the number of neighbors is 500
For no-obesity the coefficients are [0.12986991 0.18869269 0.07403741]
For no-obesity the intercept is 52.72524906011252
For no-obesity the R2 score is 0.5639151945795638

For sleep >7 the features are  ['density', 'commute'] and the number of neighbors is 500
For sleep >7 the coefficients are [-0.07330155 -0.15554615]
For sleep >7 the intercept is 71.63079456065783
For sleep >7 the R2 score is 0.8464256317780325

For no-asthma the features are  ['density'] and the number of neighbors is 500
For no-asthma the coefficients are [-0.02849861]
For no-asthma the intercept is 91.71995955868424
For no-asthma the R2 score is 0.8495184357823231

For no-mental-health-prob the features are  ['commute', 'safety', 'density'] and the number of neighbors is 500
For no-mental-health-prob the coefficients are [ 0.05076365  0.06694088 -0.0327032 ]
For no-mental-health-prob the intercept is 83.6560185

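For the three outcomes whose scores are shown above, the R2 scores increased after reducing the features: no-obesity from 0.44 to 0.56, sleep >7 from 0.81 to 0.85, and no-asthma from 0.007 to 0.85. This suggests the feature reductions are sound.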
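
A caveat when reading these scores: scikit-learn documents the signature as `r2_score(y_true, y_pred)`, and the score is not symmetric in its two arguments, whereas the report function above passes the predictions first. The sketch below, on synthetic data unrelated to this notebook, shows how the two argument orders can give different values for the same fitted model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (an assumption): one feature with a linear effect plus noise.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(500, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=500)

# Fit ordinary least squares and predict on the training rows.
pred = LinearRegression().fit(X, y).predict(X)

r2_documented = r2_score(y, pred)  # documented order: (y_true, y_pred)
r2_swapped = r2_score(pred, y)     # same arrays with the arguments swapped
print("r2_score(y, pred):", r2_documented)
print("r2_score(pred, y):", r2_swapped)
```

With the swapped order, the residuals are compared against the variance of the predictions rather than the variance of the targets, so the value is lower whenever the fit is imperfect.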

Mental health does not seem to admit a linear model, even after smoothing, so I will drop this health issue from the analysis.
