# Modeling and Feature Selection
#### Creating a model and selecting features go hand in hand. The main objective is to build a model with the least error and the fewest parameters. One approach to modeling is to use every potential parameter to create the "best possible model", but sample size, processing limitations, and data limitations often make this approach unmanageable, prone to overfitting, and less than optimal. To combat these issues, we will use an approach that accounts for model parsimony. The idea behind model parsimony is fairly simple: adding a feature to a model (making the model more complex) carries a cost. If a given feature does not improve model accuracy beyond a specified amount, it is not worth adding that complexity to the model. This benefits the modeling process in multiple ways:
- a less complex model
- less processing
- less data to download
- fewer sources of error
- less overfitting
- a model that generalizes better
- fewer training samples needed
- ...
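
The parsimony rule described above can be sketched as a small decision function. This is a minimal illustration rather than part of the notebook's workflow: the function name `worth_adding` and the accuracy values are hypothetical.

```python
def worth_adding(acc_simple, acc_complex, threshold=0.001):
    """Return True if the extra feature improves accuracy by more than the threshold."""
    return (acc_complex - acc_simple) > threshold

# Hypothetical accuracies, for illustration only
print(worth_adding(0.700, 0.7005))  # gain of about 0.0005, below the threshold
print(worth_adding(0.700, 0.710))   # gain of about 0.010, above the threshold
```

The threshold plays the role of the cost of complexity: the larger it is, the more a feature has to contribute before it earns a place in the model.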

To demonstrate how to create a model, we will use the medoid_subset.tif image and the plots and subplots that fall within the extent of that image. Before continuing, please make sure you have worked through the [Summarizing plot data](./Summarizing_plot_data.ipynb) and [Getting The Imagery](./GettingTheImagery.ipynb) notebooks.


In [None]:
#import raster tools
from raster_tools import Raster, general
import geopandas as gpd, shapely, pandas as pd, numpy as np

## Loading the data
#### Let's look at a red, green, blue rendering of the medoid image subset.
Remember, the medoid image has 6 bands (blue, green, red, nir, swir1, swir2).

In [None]:
medoid_sub_rs=Raster('medoid_subset.tif')
rgb=medoid_sub_rs.get_bands([3,2,1]) #subset and reorder the bands (red, green, blue) so we get rgb
rgb.xdata.plot.imshow(robust=True,figsize=(15,12)) #plot the image



### Open the plot_subplot dataset.

In [None]:
#get the cleaned plots and subplots data
plot_sub=pd.read_csv('plot_subplot_data.csv')

#convert plot_sub to a geodataframe
plot_sub=gpd.GeoDataFrame(plot_sub,geometry=gpd.GeoSeries.from_wkt(plot_sub['geometry']),crs=4326)

Subset the points to the boundary of the image

In [None]:
#get the boundary of the image
img_bnd=gpd.GeoSeries(shapely.box(*rgb.bounds),crs=rgb.crs)

#get all points inside the boundary and reproject them to the image crs
pint=plot_sub.loc[plot_sub.intersects(img_bnd.to_crs(plot_sub.crs).unary_union)].to_crs(rgb.crs)
 
#explore the plot subplot
pint.explore()



## Exercise 1: Reading and displaying the data
- How many plots and subplots?
- What columns are response variables and predictor variables?
- Where did the rest of the plots go?
- Task 1: Add the image boundary to the map.

In [None]:
plot_sub.columns

## Creating the Use classification model
### There are many potential modeling techniques we can use to estimate Use class labels. To demonstrate how to create competing models and select a parsimonious model, we will use [multinomial logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and the overall accuracy statistic.

#### Get Response and Predictor variables

In [None]:
#get resp and predictor variables
resp='Use'
pred=['BLUE','GREEN', 'NIR', 'RED', 'SWIR1', 'SWIR2', 'altura2', 'aspect',
       'aspectcos', 'aspectdeg', 'aspectsin', 'brightness', 'clay_1mMed',
       'diff', 'elevation', 'evi', 'fpar', 'hand30_100', 'lai', 'mTPI', 'ndvi',
       'ocs_1mMed', 'sand_1mMed', 'savi', 'silt_1mMed', 'slope', 'topDiv',
       'wetness']

X=plot_sub[pred].values
y=plot_sub[resp].values


#### Transform the predictor variables using a PCA and determine how many components are required to account for 95% of the total variation.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

ss=StandardScaler(with_mean=False) #scaling data without centering on the mean because PCA will center the data.
ss.fit(X)
X2=ss.transform(X) # scaling our values so they are comparable

pca = PCA()
pca.fit(X2) # fit the PCA on scaled values
vexp=pd.DataFrame(pca.explained_variance_ratio_,columns=['var']) # get the proportion of variance explained by each component

#find the number of components needed to account for 95% of the variation in the data
cmp=0
s=0
for v in vexp['var']:
    s+=v
    cmp+=1
    if(s>0.95):break
    
#report the result and plot the proportion of variation explained by each component
print('95% of the variation can be explained using the first',cmp,'components')
vexp.plot(kind='barh',figsize=(15,8),title='Proportion of variation explained by each component').invert_yaxis()
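
The running-total loop above can also be written with a cumulative sum. The sketch below uses hypothetical explained-variance ratios in place of `pca.explained_variance_ratio_`; the names `ratios`, `cum`, and `n_comp` are illustrative only.

```python
import numpy as np

# Hypothetical explained-variance ratios, for illustration only
ratios = np.array([0.50, 0.30, 0.10, 0.06, 0.04])

cum = np.cumsum(ratios)                   # running total of variation explained
n_comp = int(np.argmax(cum > 0.95)) + 1   # first position where the total exceeds 95%
print(n_comp)
```

In the notebook, the same two lines could be applied to `pca.explained_variance_ratio_`, which should give the same count as the loop because both stop at the first running total greater than 0.95.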


### Transform our predictors into independent components.

In [None]:
#get our components
X3=pca.transform(X2) #what if we wanted to subset the components to the top 13? How would we do that?
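
One way to approach the question in the comment above: PCA returns components ordered by decreasing explained variance, so the leading components are simply the first columns of the transformed array. The sketch below uses a small stand-in array with hypothetical values rather than `X3`.

```python
import numpy as np

# Stand-in for the transformed predictors: 5 observations, 4 components (hypothetical values)
scores = np.arange(20, dtype=float).reshape(5, 4)

k = 2                    # number of leading components to keep
top_k = scores[:, :k]    # all rows, first k columns
print(top_k.shape)
```

Another option is to set `n_components` when creating the `PCA` object, which keeps only that many components from the start.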


#### Build our first saturated multinomial logistic regression model and calculate overall accuracy.
The overall accuracy is calculated as correctly labeled observations / total number of observations. For the sake of parsimony, let's set a threshold based on overall model accuracy: to add another parameter to our model, it must improve overall accuracy by more than 0.001 (0.1 percentage points). Now we need to decide whether to iteratively add parameters to our model using that rule (forward selection) or remove parameters from the model (backward selection). Let's start with backward selection and remove parameters from a saturated model, starting with the components that explain the least variation in the data (remember our PCA results).
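
As a quick worked example of the overall accuracy calculation, the sketch below counts matching labels by hand. The class names and labels are hypothetical and are not taken from the plot data.

```python
# Hypothetical observed and predicted labels, for illustration only
observed  = ['forest', 'pasture', 'forest', 'crop', 'pasture']
predicted = ['forest', 'pasture', 'crop',   'crop', 'forest']

# Count the positions where the predicted label matches the observed label
correct = sum(o == p for o, p in zip(observed, predicted))
overall_accuracy = correct / len(observed)
print(correct, overall_accuracy)
```

This is the same quantity that sklearn's `accuracy_score` returns, expressed as a proportion between 0 and 1.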

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#create our saturated logistic regression model
lg = LogisticRegression(random_state=0,solver='lbfgs',max_iter=1000) #increased the iteration so that the model can converge
lg.fit(X3,y)

# Get predicted labels for our training data
# Logistic regression models actually estimate each category's probability of occurrence. Using those class probabilities we can label records using a set of rules.
# By default, the predict function in sklearn's logistic class assigns class labels based on the most probable class (the class with the highest predicted probability).

#Let's calculate overall accuracy
pred_lbl=lg.predict(X3)
print("Overall accuracy:",accuracy_score(y,pred_lbl))

Using all components, we get an overall accuracy of 0.713 for our training data. We can now use a backward selection procedure to iteratively remove a given parameter and compare the accuracy achieved with the simpler model against the accuracy achieved with the more complex model. If the drop in accuracy for the simpler model is within our threshold, then we can justify removing that parameter from our model. To do this, let's convert our logistic regression procedure into a function that returns overall accuracy. That way we can create a dataframe that quantifies the impact of each predictor variable.

In [None]:
def get_stats(X,y):
    lg = LogisticRegression(random_state=0,solver='lbfgs',max_iter=1000) #increased the iteration so that the model can converge
    lg.fit(X,y)
    pred_lbl=lg.predict(X)
    oa=accuracy_score(y,pred_lbl)
    return oa

#### Now let's use our function to remove predictors (backward selection) and quantify the impact on overall accuracy.

In [None]:
threshold = 0.001 #to remove a parameter the difference in overall accuracy must be less than the threshold
pca_df=pd.DataFrame(X3)
df_clms=pca_df.columns
pt_clms=df_clms.values
oa=get_stats(pca_df.values,y)
m_vls=[['_'.join(pt_clms.astype('str')),oa]]
for i in range(df_clms.shape[0]-1,-1,-1): #count backwards to account for variation explained in principal components
   clm=df_clms[i]
   pt_clms2=np.delete(pt_clms,clm)
   oa2=get_stats(pca_df[pt_clms2].values,y)
   dif=oa-oa2 #subtract accuracies
   if(dif>threshold):
      pass
   else:
      pt_clms=pt_clms2
      oa=oa2
   m_vls.append(["_".join(pt_clms.astype('str')),oa])

ac_df=pd.DataFrame(m_vls,columns=['param','oa'])

print('Components used in the model',pt_clms)
print('Overall accuracy', oa)
display(ac_df)




#### Now let's try forward selection

In [None]:
threshold = 0.001 #to add a parameter the improvement in overall accuracy must be greater than the threshold
pca_df=pd.DataFrame(X3)
df_clms=pca_df.columns
pt_clms=[]
m_vls=[]
oa=0
for i in range(0,df_clms.shape[0]): #count forwards to account for variation explained in principal components
   clm=df_clms[i]
   pt_clms2 = pt_clms+[clm]
   oa2=get_stats(pca_df[pt_clms2].values,y)
   dif=oa2-oa #subtract accuracies
   if(dif>threshold):
      pt_clms=pt_clms2
      oa=oa2
   m_vls.append(["_".join(np.array(pt_clms).astype('str')),oa])

ac_df=pd.DataFrame(m_vls,columns=['param','oa'])

print('Components used in the model',pt_clms)
print('Overall accuracy', oa)
display(ac_df)


## Exercise 2: Feature Selection
- Why didn't we just select the first 13 components as our predictors? 
- What features were selected for the backward and forward selection routines?
- Did backward or forward selection produce better results?
- What happens if you increase the threshold?
- Can you think of a way to combine forward and backward selection methods?
- Task 1: What accuracy do you get using the first 13 components as parameters in the model?
- Task 2: Use class probabilities (predict_proba) to estimate the total number of observations in each class. Compare that to the labeled estimate of the number of observations in each class.
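
As a starting point for Task 2, the sketch below contrasts the two ways of counting on a tiny made-up probability table. The values in `proba` are hypothetical; in the notebook they would come from the model's `predict_proba` output.

```python
import numpy as np

# Hypothetical class probabilities for 4 observations and 3 classes (each row sums to 1)
proba = np.array([[0.6, 0.3, 0.1],
                  [0.4, 0.5, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.3, 0.5]])

expected_counts = proba.sum(axis=0)                              # probability-based estimate per class
labeled_counts = np.bincount(proba.argmax(axis=1), minlength=3)  # counts after labeling with the most probable class
print(expected_counts, labeled_counts)
```

The two estimates can differ because labeling discards the probability assigned to every class other than the most probable one.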

# Using sklearn's feature selection
### Sklearn also has a few built-in feature selection routines ([link](https://scikit-learn.org/stable/modules/feature_selection.html)). To demonstrate, we will use SelectFromModel with an L2-penalized logistic regression.

In [None]:
from sklearn.feature_selection import SelectFromModel

df=pd.DataFrame(X3)

lg = LogisticRegression(random_state=0,solver='lbfgs',max_iter=1000,penalty='l2') 
lg.fit(df,y)
model = SelectFromModel(lg, prefit=True)
X_new = model.transform(df)

#find components selected (get_support returns the column positions kept by the selector)
scmp=model.get_support(indices=True).tolist()

print('Components selected',scmp)
print('Overall accuracy',get_stats(X_new,y))
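
To get a feel for what this kind of selector keeps, the sketch below mimics a mean-based importance rule with plain NumPy. It assumes, as we understand sklearn's defaults for a model without an L1 penalty, that each feature's importance is the L1 norm of its coefficients across classes and that features at or above the mean importance are kept. The coefficient matrix is hypothetical.

```python
import numpy as np

# Hypothetical coefficient matrix: 3 classes (rows) by 4 features (columns)
coef = np.array([[ 0.9, -0.1,  0.4,  0.0],
                 [-0.5,  0.1, -0.2,  0.1],
                 [-0.4,  0.0, -0.2, -0.1]])

importance = np.abs(coef).sum(axis=0)                # L1 norm of each feature's coefficients across classes
threshold = importance.mean()                        # assumed "mean" threshold
selected = np.flatnonzero(importance >= threshold)   # positions of the features that are kept
print(importance, threshold, selected)
```

Because the rule depends on coefficient magnitude, it is sensitive to the scale of the predictors. Comparing `np.abs(lg.coef_).sum(axis=0)` against its mean is one way to check this sketch against the components reported in the notebook.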

## Exercise 3: sklearn's l2 feature selection
- How many features (components) were selected?
- Is model accuracy better or worse?
- What is being selected for using this method? 