# Classification 

Classification modelling is a branch of machine learning concerned with separating sets of objects.

To do this, we usually want to predict the class $y$ that a vector $\mathbf{x}$ belongs to. The classes can be different species of dog, different human faces, or the phases of a thermodynamic system, to name a few. In this example you'll be working with simulated student data to predict whether a student gets a pass or fail grade, using three different models.

Your learning goals for this project are:
* Improved understanding of the `scikit-learn` pipeline for modelling, building on the regression tutorial 
* Improved understanding of machine learning concepts: generalization, overfitting, model set, feature scale, and class bias
* Understanding the difference between linear and non-linear models 
* Understanding some important classification measurements: Area Under the Curve, Receiver Operating Characteristic, Precision, Recall
* Understanding the goal of classification modelling

## Further Reading: 


1. A high-bias, low-variance introduction to machine learning, a good introduction for physicists to machine learning concepts https://arxiv.org/abs/1803.08823
2. The Elements of Statistical Learning, a very good textbook giving an introduction to much of the theory used in machine learning https://web.stanford.edu/~hastie/Papers/ESLII.pdf

# Classifying failure in a class

In this project the aim is to predict whether a student fails or passes a course. The data has the following columns:

* `cGPA`: college GPA
* `attendance`: attendance percent
* `passed_percent`: percent of courses passed
* `sex`: self reported sex of the student
* `hsGPA`: high school GPA
* `ethnicity`: self reported ethnic identity of the student
* `failed_course`: whether the student failed the course (1) or not (0)

## Data

The data is included in the `data` folder as a Python pickle file. Pandas can read it into a data frame with `pd.read_pickle(file_name)`. If you prefer not to load pickle files from an untrusted source (unpickling can execute arbitrary code), the data can be generated from the associated notebook.

#### filename: `classification_data.pkl`



### Task 1a:  Exploration

In this task you should:
1. Visualize all the features and their distributions.
2. As in the regression task, investigate the covariance matrix. Will covariance have a different impact for classification than for regression? Which models will it affect?
3. Visualize the class distribution of the dependent variable, `failed_course`. What challenges does this distribution pose, if any?
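
As a starting point for the exploration, the sketch below computes summary statistics, the covariance and correlation matrices, and the class shares with pandas. The data frame is a hypothetical stand-in: the column names are taken from the list above, but the values and the relationship between `attendance` and `failed_course` are invented for illustration and say nothing about the real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real frame comes from classification_data.pkl.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "cGPA": rng.normal(3.0, 0.5, n).clip(0, 4),
    "attendance": rng.uniform(40, 100, n),
    "hsGPA": rng.normal(3.2, 0.4, n).clip(0, 4),
    "sex": rng.choice(["F", "M"], n),
})
# Invented relationship: failing is rare and more likely at low attendance.
p_fail = 1 / (1 + np.exp(0.1 * (df["attendance"] - 50)))
df["failed_course"] = (rng.uniform(size=n) < p_fail).astype(int)

# Restrict to numeric columns before computing second moments.
numeric = df.select_dtypes("number")
print(numeric.describe().T[["mean", "std", "min", "max"]])

cov = numeric.cov()    # depends on the scale of each feature
corr = numeric.corr()  # scale-free, easier to compare across features
print(corr.round(2))

# Share of each class in the target; a skewed split signals class imbalance.
class_share = df["failed_course"].value_counts(normalize=True)
print(class_share)
```

For the plots themselves, `numeric.hist()` and `matplotlib.pyplot.imshow(corr)` are convenient one-liners once `matplotlib` is installed.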

### Task 1b: Scaling
Different models react to heterogeneously distributed variables in different ways and for different reasons. A major part of data analysis and modelling is being aware of how the distributions of values affect a given model.

For this task you should:
1. Consider which features are suitable for scaling.
2. Consider the z-scaling function $f(x) = \frac{x - \mu}{\sigma}$, where $\mu=\langle x \rangle$ and $\sigma^2= \langle x^2 \rangle - \langle x \rangle^2$. What impact does this function have on a normally distributed feature $x \sim N(\mu, \sigma^2)$? (You don't have to solve this with pen and paper; showing it empirically by plotting is fine.)
3. Scale suitable features with a z-scaling function. Keep a copy of the unscaled data.
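
The z-scaling function above can be checked numerically. The sketch below applies it to a synthetic normal feature (the mean of 70 and standard deviation of 12 are arbitrary choices) and compares the result with `StandardScaler` from `scikit-learn`, which uses the same population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def z_scale(x):
    """Subtract the column mean and divide by the column standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


# Synthetic feature x ~ N(70, 12^2); the numbers are arbitrary.
rng = np.random.default_rng(1)
x = rng.normal(loc=70.0, scale=12.0, size=(10_000, 1))

x_scaled = z_scale(x)
x_sklearn = StandardScaler().fit_transform(x)

# After scaling the sample mean is ~0 and the sample std is ~1.
print(x_scaled.mean(), x_scaled.std())
```

When the data is later split, it is common practice to fit the scaler on the training set only and reuse its $\mu$ and $\sigma$ on the test set, so that no information from the hold-out data leaks into the model.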


## Modelling

Classification models are formulated somewhat differently from regression models, but many of them optimize the same problem and have the same minimum. In that sense Ordinary Least Squares (OLS) serves as a natural bridge to two of the models we've introduced, namely Logistic Regression and Artificial Neural Networks. The third model we've briefly discussed, Random Forests, has a slightly different formalization, but this difference is very convenient in many cases, as we'll get back to.

### Task 2: fitting models 

In this task you should: 

1. Use `scikit-learn` to split the data appropriately into separate data sets
2. Fit a classification model to the appropriate data (implementing this as a function is recommended)
3. Evaluate this model using two different classification metrics, and plot the Receiver Operating Characteristic curve. What do they tell you?

#### Extra:
4. Evaluate whether the scaling impacts the model performance
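
One possible shape for the split, fit, and evaluate steps is sketched below. It is not the reference solution: the data comes from `make_classification` as a stand-in for the student data, the helper name `fit_and_evaluate` is made up for this example, and logistic regression is just one of the three models you could pass in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Stand-in data with an imbalanced target (about 15% positives).
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3,
    weights=[0.85, 0.15], random_state=0,
)


def fit_and_evaluate(model, X, y, test_size=0.25, seed=0):
    """Split the data, fit the model, and score it on the hold-out set."""
    # stratify=y keeps the class proportions equal in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard class labels
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    fpr, tpr, _ = roc_curve(y_test, y_score)
    return {
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
        "fpr": fpr,
        "tpr": tpr,
    }


results = fit_and_evaluate(LogisticRegression(max_iter=1000), X, y)
print({k: round(v, 3) for k, v in results.items() if k not in ("fpr", "tpr")})
```

Plotting `results["tpr"]` against `results["fpr"]` gives the ROC curve. Note that precision and recall use the hard labels, while the ROC curve and AUC use the predicted probabilities.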

## Tuning

There are many parameters that can impact the performance of your model. Those that are not explicitly optimized during fitting are called hyperparameters. Finding good values for them is still an active area of research, since both the number of such parameters and the computational complexity of the models keep growing.

### Task 3: hyperparameter search

In this task you should: 
1. Write your own function to do grid search or random search over two hyperparameters of your model. Consider the spacing for each parameter: should it be linear, logarithmic, or something else? (Hint: regularization strength is a good bet for most models.)
2. Plot or tabulate the results using a metric suited to this problem (hint: `matplotlib.pyplot` has a function `imshow` that can be very useful for this).

#### Extra:
3. Using `scikit-learn`, add cross-validation to your hyperparameter search. Cross-validation gives you a better estimate of your model's performance. Why? (See [this blog](https://towardsdatascience.com/cross-validation-70289113a072) for an explanation.)
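
A minimal grid search with cross-validation could look like the sketch below. The choices are assumptions made for illustration: the model is logistic regression, the two hyperparameters are `C` (the inverse regularization strength, spaced logarithmically) and a weight on the positive class (spaced linearly), the metric is ROC AUC, and the data is synthetic. The function name `grid_search` is made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data with an imbalanced target.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3,
    weights=[0.85, 0.15], random_state=0,
)


def grid_search(X, y, c_values, pos_weights, cv=5):
    """Return a (len(c_values), len(pos_weights)) array of mean CV scores."""
    scores = np.zeros((len(c_values), len(pos_weights)))
    for i, c in enumerate(c_values):
        for j, w in enumerate(pos_weights):
            model = LogisticRegression(
                C=c, class_weight={0: 1.0, 1: w}, max_iter=1000
            )
            # mean ROC AUC over the cv folds
            scores[i, j] = cross_val_score(
                model, X, y, cv=cv, scoring="roc_auc"
            ).mean()
    return scores


c_values = np.logspace(-3, 2, 6)        # 0.001 ... 100, logarithmic spacing
pos_weights = np.linspace(1.0, 6.0, 6)  # 1 ... 6, linear spacing

scores = grid_search(X, y, c_values, pos_weights)
best_i, best_j = np.unravel_index(scores.argmax(), scores.shape)
print("best C:", c_values[best_i], "best positive-class weight:", pos_weights[best_j])
```

The `scores` array has one row per value of `C` and one column per class weight, so it can be passed directly to `imshow` for the plot in item 2.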

## Analysis 

In classification, as in regression, estimating which features are _salient_, or important to the prediction, is a central part of the analysis. The explicit procedure for doing this is model dependent, but for logistic regression the feature weights are often indicators of feature importance (think back to task 1a for caveats that might modify this). One general way of measuring feature importance is _recursive feature elimination_, which is just a fancy name for systematically removing features you don't need by greedily checking how well your model does without them.
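
The greedy idea described above can be sketched as a backward elimination loop: start from all features and repeatedly drop the one whose removal hurts the score the least. This is one reading of the description, not the course's reference implementation; the function name `backward_elimination` and the synthetic data are assumptions. Note that `sklearn.feature_selection.RFE` takes a different route and ranks features by the model's coefficients or feature importances instead of refitting without each feature.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def backward_elimination(model, x_train, x_test, y_train, y_test, score=accuracy_score):
    """Greedily drop one feature at a time; return [(feature tuple, score), ...]."""
    remaining = list(range(x_train.shape[1]))

    def evaluate(columns):
        # fit a fresh, unfitted copy of the model on the chosen columns
        fitted = clone(model).fit(x_train[:, columns], y_train)
        return score(y_test, fitted.predict(x_test[:, columns]))

    history = [(tuple(remaining), evaluate(remaining))]
    while len(remaining) > 1:
        # score every candidate set obtained by leaving out one feature
        trials = [
            (evaluate([k for k in remaining if k != f]), f) for f in remaining
        ]
        best_score, dropped = max(trials)  # the removal that hurts least
        remaining.remove(dropped)
        history.append((tuple(remaining), best_score))
    return history


# Stand-in data: 6 features, of which 3 are informative.
X, y = make_classification(n_samples=800, n_features=6, n_informative=3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

history = backward_elimination(
    LogisticRegression(max_iter=1000), x_train, x_test, y_train, y_test
)
for features, value in history:
    print(features, round(value, 3))
```

With $N$ features this fits on the order of $N^2$ models. The price is that the greedy path can miss the best subset, and scoring on the same hold-out set that is later used for reporting gives an optimistic estimate.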

### Task 4: Recursive Feature Elimination

In this task you should:

1. Use the provided code to perform feature selection. The method is model agnostic but suffers from performance issues: it runs sequentially and exhaustively searches over every feature combination, which leads to a long running time (it scales as $\mathcal{O}(2^N)$). What ways do you see of improving this algorithm?
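
To get a feel for the $\mathcal{O}(2^N)$ scaling, the short sketch below enumerates the non-empty feature subsets with the same `itertools` construction and counts them. The helper name `nonempty_subsets` is made up for this example.

```python
from itertools import chain, combinations


def nonempty_subsets(n_features):
    """Yield every non-empty tuple of feature indices."""
    indices = range(n_features)
    return chain.from_iterable(
        combinations(indices, k) for k in range(1, n_features + 1)
    )


# One model fit is needed per subset, so the count is the number of fits.
counts = {n: sum(1 for _ in nonempty_subsets(n)) for n in (3, 6, 10)}
print(counts)  # {3: 7, 6: 63, 10: 1023}
```

The count is $2^N - 1$ because the empty set is excluded. The six feature columns listed earlier give 63 fits, and every additional column (for example from one-hot encoding a categorical feature) roughly doubles that number.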


```python
from itertools import chain, combinations

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score


def feature_selection(model, x_train, x_test, y_train, y_test, score=accuracy_score):
    """
    Fit a fresh copy of `model` on every non-empty subset of the features.

    arguments:
    ----------
    model: a model implementing .fit and .predict methods
    x_train: an N-samples x features array of data to train on
    x_test: hold-out set to estimate out-of-sample (OOS) error
    y_train: targets to train against
    y_test: hold-out targets to estimate OOS error

    kwargs:
    -------
    score: a score function measuring model performance (higher is better),
        must implement __call__(y_true, y_pred)

    returns:
    --------
    which: index (indices) of the subset(s) whose score is the minimum
        within one std of the max performance
    model_performance: array of all scores as measured by score
    subsets: list of feature-index tuples, one per non-empty subset
    """
    n_features = x_train.shape[1]
    feature_indices = np.arange(n_features)

    # generate the powerset without the empty set:
    # [abc] -> [a], [b], [c], [ab], [ac], [bc], [abc]
    # each subset selects the columns of x used in one experiment
    subset_iterator = chain.from_iterable(
        combinations(feature_indices, i) for i in range(1, n_features + 1)
    )
    # there are 2**n - 1 non-empty subsets; sizing the arrays as 2**n would
    # leave a spurious trailing zero score that distorts the std and the min
    n_subsets = 2**n_features - 1
    model_performance = np.zeros(n_subsets)
    subsets = [None] * n_subsets

    for i, feature_subset in enumerate(subset_iterator):
        # select subset data (index with a list to pick columns)
        columns = list(feature_subset)
        train_subset = x_train[:, columns]
        test_subset = x_test[:, columns]
        # clone model with hyperparams but without weights to perform fitting
        subset_model = clone(model)
        subset_model.fit(train_subset, y_train)
        # evaluate performance
        subset_performance = score(y_test, subset_model.predict(test_subset))

        # save configuration and performance
        model_performance[i] = subset_performance
        subsets[i] = feature_subset

    performance_std = model_performance.std()
    best_performance = model_performance.max()
    # >= keeps the selection non-empty even when all scores are identical
    within_one_std = model_performance >= (best_performance - performance_std)
    min_within_one_std = model_performance[within_one_std].min()

    which = np.where(model_performance == min_within_one_std)
    return which, model_performance, subsets
```