# Implementation of Logistic Regression on AMP-PD 

#### Authors: Maria Castanos and William Koehler

## Introduction
The objective of this analysis is to train a regularized logistic regression on the AMP-PD dataset. For computational efficiency, the dataset's dimensionality is reduced as much as possible while retaining the information needed to achieve the highest predictive power.

## Feature Selection
Since the dataset has around 2 million features, the feature space was reduced by keeping the most relevant features according to different [feature selection methods](logistic_regression).

### Tree-based Feature Selection
This method computes impurity-based feature importances from a tree-based model and keeps only the most important features, dropping the irrelevant ones.
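A minimal sketch of this step, assuming scikit-learn with an `ExtraTreesClassifier` as the tree-based model (the report does not name the estimator). The synthetic data, the sizes, and `k` are illustrative stand-ins, not the AMP-PD data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real feature matrix (hypothetical sizes).
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

k = 10  # number of features to keep (the report uses 100, 1000, and 10000)

# threshold=-np.inf disables the importance cutoff, so max_features alone
# decides how many features are kept.
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    threshold=-np.inf,
    max_features=k,
).fit(X, y)

# Impurity-based importances of the fitted forest, one value per feature.
importances = selector.estimator_.feature_importances_
X_reduced = selector.transform(X)
```

`X_reduced` keeps the `k` columns with the largest importances; the fitted `selector` can then be applied to held-out data with `transform`.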

### Univariate Feature Selection
This method chooses the best features based on univariate statistical tests. In this case, the chi-squared statistic (chi2), the ANOVA F-value, and the mutual information score (which also captures non-linear relationships) are used to estimate the degree of dependency between each feature and the class label.
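The three scores can be compared side by side. The sketch below assumes scikit-learn's `SelectKBest` and uses random non-negative integer features as a placeholder, since chi2 requires non-negative inputs; the data and `k` are hypothetical.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import (SelectKBest, chi2, f_classif,
                                       mutual_info_classif)

rng = np.random.default_rng(0)
# Placeholder data: non-negative integer features and binary labels.
X = rng.integers(0, 3, size=(200, 50))
y = rng.integers(0, 2, size=200)

k = 10  # number of features to keep per method
score_funcs = {
    "chi2": chi2,
    "f_classif": f_classif,  # ANOVA F-value
    # Treat the integer features as discrete and fix the seed.
    "mutual_info": partial(mutual_info_classif, discrete_features=True,
                           random_state=0),
}

selected = {}
for name, score_func in score_funcs.items():
    selector = SelectKBest(score_func=score_func, k=k).fit(X, y)
    # Column indices of the k highest-scoring features.
    selected[name] = selector.get_support(indices=True)
```

Each entry of `selected` holds the indices chosen by one score, so the overlap between methods can be inspected directly.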

## Regularized Multinomial Logistic Regression 
### Training
A regularized multinomial logistic regression is trained to predict three classes (Control, Case, Other). Elastic net penalization is used to find a compromise between the ridge and lasso penalties.

The optimization problem is:
$$\max_{\beta_{0k}, \beta_{k}} \left\{ \sum_{i = 1}^{N} \log Pr(g_i|x_i) - \lambda \sum_{k = 1}^K\sum_{j = 1}^p \big(\alpha |\beta_{kj}| + (1 - \alpha)\beta_{kj}^2\big) \right\}$$
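To make the objective concrete, the following NumPy sketch evaluates it for given coefficients. It assumes a softmax model for $Pr(g_i|x_i)$ and labels encoded as integers $0, \dots, K-1$; the function name and argument layout are illustrative, not taken from the project code.

```python
import numpy as np

def penalized_log_likelihood(X, g, beta0, beta, lam, alpha):
    """Elastic-net penalized multinomial log-likelihood.

    X: (N, p) features; g: (N,) integer labels in 0..K-1;
    beta0: (K,) intercepts; beta: (K, p) coefficients.
    """
    logits = X @ beta.T + beta0                       # shape (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Sum of log Pr(g_i | x_i) over the observed classes.
    log_lik = log_probs[np.arange(len(g)), g].sum()
    # lambda * sum_k sum_j (alpha * |beta_kj| + (1 - alpha) * beta_kj^2)
    penalty = lam * (alpha * np.abs(beta) + (1 - alpha) * beta ** 2).sum()
    return log_lik - penalty
```

With all coefficients at zero, every class has probability $1/K$, so the value reduces to $-N \log K$. The intercepts $\beta_{0k}$ are not penalized, matching the formula.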

The training was done for $100$, $1000$, and $10000$ features for each feature selection method. 

### Hyperparameter Optimization
The hyperparameters $\lambda$ and $\alpha$ in the problem above are optimized with a grid search. The optimal hyperparameters are those that maximize the balanced accuracy:
$$\frac{1}{2}\left( \frac{TP}{TP + FN} + \frac{TN}{FP + TN} \right)$$
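A grid search over both hyperparameters could look like the sketch below, assuming scikit-learn. There, `C` is the inverse regularization strength (playing the role of $1/\lambda$) and `l1_ratio` is the mixing parameter (playing the role of $\alpha$); the exact scaling differs from the formula above, so values are not interchangeable. The data and grid values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class stand-in for the selected features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# The saga solver is the one that supports the elastic net penalty.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, max_iter=5000),
)

# Hypothetical grid; the report does not list the values searched.
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0],
    "logisticregression__l1_ratio": [0.0, 0.5, 1.0],
}

# Select the combination with the highest cross-validated balanced accuracy.
search = GridSearchCV(model, param_grid, scoring="balanced_accuracy", cv=5)
search.fit(X, y)
best_params = search.best_params_
best_score = search.best_score_
```

Scaling inside the pipeline keeps the preprocessing within each cross-validation fold, which avoids leaking information from the validation split.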

## Baseline Classifiers
To evaluate the performance of the logistic regression, both a random classifier and a majority-class classifier were implemented as baseline models to compare against.
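Both baselines are simple enough to sketch in NumPy. The example assumes binary labels with 1 as the positive class and a random baseline that draws labels uniformly; the report does not say whether its random classifier is uniform or stratified, and the label arrays here are made up.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return 0.5 * (tp / (tp + fn) + tn / (fp + tn))

rng = np.random.default_rng(0)
# Made-up labels: 60/40 split in training, 30/20 in test.
y_train = np.array([1] * 60 + [0] * 40)
y_test = np.array([1] * 30 + [0] * 20)

# Majority-class baseline: always predict the most frequent training label.
majority_label = np.bincount(y_train).argmax()
majority_pred = np.full_like(y_test, majority_label)

# Random baseline: predict each label uniformly at random.
random_pred = rng.integers(0, 2, size=y_test.shape)

majority_score = balanced_accuracy(y_test, majority_pred)
random_score = balanced_accuracy(y_test, random_pred)
```

A majority-class predictor gets every positive right and every negative wrong (or the reverse), so its balanced accuracy is exactly 0.5 regardless of the class proportions.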

## Results
We compare the predictions with and without PCA.
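One way to set up such a comparison is to build the same pipeline with and without a PCA step. This is a sketch under assumptions: scikit-learn, a hypothetical `n_components=10`, fixed hyperparameters, and synthetic data. It is not the pipeline that produced the tables in this section.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected features.
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)

def build_model(use_pca):
    """Scaler, optional PCA, then elastic-net logistic regression."""
    steps = [StandardScaler()]
    if use_pca:
        steps.append(PCA(n_components=10))  # hypothetical component count
    steps.append(LogisticRegression(penalty="elasticnet", solver="saga",
                                    l1_ratio=0.5, max_iter=5000))
    return make_pipeline(*steps)

# Mean cross-validated balanced accuracy for each variant.
scores = {
    label: cross_val_score(build_model(use_pca), X, y,
                           scoring="balanced_accuracy", cv=5).mean()
    for label, use_pca in [("without PCA", False), ("with PCA", True)]
}
```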
### Predictions without PCA

```python
from results import results
import pandas as pd
```

```python
# Metrics for each feature selection method and feature count, plus the baselines.
(tree_100, tree_1000, tree_10000,
 chi2_100, chi2_1000, chi2_10000,
 Ftest_100, Ftest_1000, Ftest_10000,
 random, majority_class) = results.all_metrics()
```

```python
performance_table = pd.DataFrame({"Tree 100": tree_100,
                                  "Tree 1000": tree_1000,
                                  "Tree 10000": tree_10000,
                                  "Chi2 100": chi2_100,
                                  "Chi2 1000": chi2_1000,
                                  "Chi2 10000": chi2_10000,
                                  "Ftest 100": Ftest_100,
                                  "Ftest 1000": Ftest_1000,
                                  "Ftest 10000": Ftest_10000,
                                  "Random": random,
                                  "Majority Class": majority_class
                                 })

performance_table.index = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'Balanced Accuracy']
performance_table.round(3)
```

| Metric | Tree 100 | Tree 1000 | Tree 10000 | Chi2 100 | Chi2 1000 | Chi2 10000 | Ftest 100 | Ftest 1000 | Ftest 10000 | Random | Majority Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.75 | 0.726 | 0.688 | 0.745 | 0.74 | 0.695 | 0.749 | 0.728 | 0.694 | 0.576 | 0.51 |
| Precision | 0.791 | 0.778 | 0.729 | 0.785 | 0.789 | 0.725 | 0.787 | 0.779 | 0.716 | 0.576 | 0.588 |
| Recall | 0.768 | 0.733 | 0.729 | 0.768 | 0.749 | 0.756 | 0.772 | 0.737 | 0.776 | 1.0 | 0.495 |
| F1-Score | 0.779 | 0.755 | 0.729 | 0.776 | 0.768 | 0.74 | 0.78 | 0.758 | 0.745 | 0.731 | 0.538 |
| Balanced Accuracy | 0.746 | 0.725 | 0.681 | 0.741 | 0.738 | 0.684 | 0.744 | 0.727 | 0.679 | 0.5 | 0.512 |


### Predictions with PCA

```python
from results import results
import pandas as pd
```

```python
# Metrics for each feature selection method and feature count, plus the baselines.
(tree_100, tree_1000, tree_10000,
 chi2_100, chi2_1000, chi2_10000,
 Ftest_100, Ftest_1000, Ftest_10000,
 random, majority_class) = results.all_metrics()
```

```python
performance_table = pd.DataFrame({"Tree 100": tree_100,
                                  "Tree 1000": tree_1000,
                                  "Tree 10000": tree_10000,
                                  "Chi2 100": chi2_100,
                                  "Chi2 1000": chi2_1000,
                                  "Chi2 10000": chi2_10000,
                                  "Ftest 100": Ftest_100,
                                  "Ftest 1000": Ftest_1000,
                                  "Ftest 10000": Ftest_10000,
                                  "Random": random,
                                  "Majority Class": majority_class
                                 })

performance_table.index = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'Balanced Accuracy']
performance_table.round(3)
```

| Metric | Tree 100 | Tree 1000 | Tree 10000 | Chi2 100 | Chi2 1000 | Chi2 10000 | Ftest 100 | Ftest 1000 | Ftest 10000 | Random | Majority Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.727 | 0.722 | 0.687 | 0.74 | 0.74 | 0.654 | 0.749 | 0.726 | 0.687 | 0.576 | 0.501 |
| Precision | 0.777 | 0.774 | 0.731 | 0.78 | 0.787 | 0.705 | 0.788 | 0.782 | 0.716 | 0.576 | 0.574 |
| Recall | 0.739 | 0.729 | 0.721 | 0.762 | 0.75 | 0.686 | 0.77 | 0.727 | 0.756 | 1.0 | 0.517 |
| F1-Score | 0.757 | 0.751 | 0.726 | 0.771 | 0.768 | 0.696 | 0.779 | 0.754 | 0.736 | 0.731 | 0.544 |
| Balanced Accuracy | 0.725 | 0.72 | 0.681 | 0.736 | 0.738 | 0.649 | 0.745 | 0.726 | 0.674 | 0.5 | 0.498 |


## Previous Results for 3 Classes

```python
performance_table = pd.DataFrame({"Tree Based Method": tree_based,
                                  "Univariate Method": univariate,
                                  "Random": random,
                                  "Majority Class": majority_class
                                 })

performance_table.index = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
performance_table.round(3)
```

| Metric | Tree Based Method | Univariate Method | Random | Majority Class |
|---|---|---|---|---|
| Accuracy | 0.672 | 0.679 | 0.328 | 0.585 |
| Precision | 0.648 | 0.664 | 0.472 | 0.757 |
| Recall | 0.672 | 0.679 | 0.328 | 0.585 |
| F1-Score | 0.659 | 0.653 | 0.366 | 0.432 |


## Next Steps
- Add PCA to the pipeline and compare. 
- Run with two classes (merge 'other' with 'case')
- Add Balanced Accuracy 