# Implementation of Logistic Regression on AMP-PD 

#### Authors: Maria Castanos and William Koehler

In [1]:
import pandas as pd
import numpy as np
from logistic_regression import RazorLogReg
from dimensionality_reduction import SelectFeatures
from sklearn import model_selection
import time

## Import Datasets

In [2]:
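# Local directory holding the PLINK-derived files.
path = "/Users/mdmcastanos/Documents/OccamzRazor/plink_files/"
numpy_file = path + "plink_numpy.npy"
tsv_file = path + "latest_labels.tsv"
# Load the participant labels and the feature matrix.
labels = pd.read_csv(tsv_file, sep="\t")
df = pd.DataFrame(np.load(numpy_file))
# Attach IDs and labels by row position; this assumes both files list participants in the same order.
df = df.assign(participant_id=labels["participant_id"],
               case_control_other_latest=labels["case_control_other_latest"])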

### Split data

In [3]:
X = df.drop(columns=['case_control_other_latest'])
y = df['case_control_other_latest']
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=42)

## Dimensionality Reduction 

The dataset has *x* features and *n* rows, so its dimensionality is reduced by keeping only the features ranked most relevant by an [Extra-Trees Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html). The selection is derived from the training set, and the same reduction is then applied to the test set.

In [4]:
selected_features = SelectFeatures()
train_set_reduced = selected_features.get_reduced_dataset(X_train, y_train)

In [5]:
test_set_reduced = selected_features.get_test_set_reduced(X_test, y_test)
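
The internals of `SelectFeatures` are not shown in this notebook. As a minimal sketch of tree-based feature selection, assuming scikit-learn's ensemble `ExtraTreesClassifier` combined with `SelectFromModel` (both are assumptions, as are the synthetic data and the variable names), the idea looks roughly like this:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the feature matrix: 200 samples, 50 features.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 50)).astype(float)
# Only the first two columns carry signal in this toy example.
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)

# Fit the tree ensemble, then keep the features whose importance is at
# least the mean importance (the SelectFromModel default threshold here).
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest).fit(X_demo, y_demo)

X_demo_reduced = selector.transform(X_demo)
kept_columns = selector.get_support(indices=True)
```

A fitted selector of this kind can be reused on held-out data with `selector.transform(...)`, so the test set keeps exactly the columns chosen on the training set.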

## Regularized Multinomial Logistic Regression 

### Training
A regularized multinomial logistic regression is trained to predict three classes (Control, Case, Other). A ridge (L2) penalty is used because it shrinks the coefficients of correlated columns toward each other instead of discarding any of them, which tends to improve predictive accuracy when features are correlated.

The optimization problem, shown here for the binary case, is:
$$\max_{\beta} \left\{ \sum_{i = 1}^{N} \big[y_i(\beta^Tx_i) - \log(1 + \exp(\beta^Tx_i)) \big] - \lambda \sum_{j = 1}^p \beta_j^2 \right\}$$
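
To make the objective concrete, the following sketch evaluates the ridge-penalized log-likelihood above with NumPy. The function name, the toy data, and the choice to omit an intercept are illustrative assumptions, not part of `RazorLogReg`.

```python
import numpy as np

def penalized_log_likelihood(beta, X, y, lam):
    """Ridge-penalized binary log-likelihood (no intercept in this sketch)."""
    z = X @ beta
    # np.logaddexp(0, z) computes log(1 + exp(z)) without overflow.
    log_likelihood = np.sum(y * z - np.logaddexp(0.0, z))
    return log_likelihood - lam * np.sum(beta ** 2)

# Toy data: three observations, two features.
X_toy = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
y_toy = np.array([1, 0, 1])

# With beta = 0 every observation contributes -log(2) and the penalty is 0.
value_at_zero = penalized_log_likelihood(np.zeros(2), X_toy, y_toy, lam=0.1)
```

Increasing `lam` lowers the objective for any nonzero `beta`, which is how the penalty pulls the fitted coefficients toward zero.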

### Hyperparameter Optimization
The hyperparameter $\lambda$ in the optimization problem above is tuned with [randomized search cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).

In [7]:
classifier = RazorLogReg(train_set_reduced, test_set_reduced)
logistic_regression = classifier.get_logistic_regression()
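
What `get_logistic_regression()` does internally is not shown here. A rough sketch of a randomized search over the ridge penalty, assuming scikit-learn's `LogisticRegression` and a log-uniform search range (the range, the number of iterations, and the synthetic three-class data are all assumptions), could look like this. Note that scikit-learn parameterizes the penalty through `C`, the inverse of the regularization strength, so a larger $\lambda$ corresponds to a smaller `C`.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class problem standing in for the reduced training set.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),  # L2 (ridge) penalty is the default
    param_distributions={"C": loguniform(1e-3, 1e2)},  # assumed search range
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

best_C = search.best_params_["C"]
best_model = search.best_estimator_
```

After fitting, `best_params_` holds the sampled value with the best mean cross-validation score, and `best_estimator_` is the model refitted on all the data passed to `fit`.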



In [None]:
logistic_regression.best

## Train Baseline Models

To put the performance of the logistic regression in context, two baseline models were trained for comparison: a random classifier and a majority-class classifier.

In [8]:
random_classifier = classifier.get_random_classifier()
majority_class_classifier = classifier.get_majority_class_classifier()
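
The baseline methods of `RazorLogReg` are not shown in this notebook. As an illustration only, both kinds of baseline can be built with scikit-learn's `DummyClassifier`; the strategies chosen below and the toy labels are assumptions about what such baselines typically look like.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with an imbalanced three-class distribution; the features are
# ignored by DummyClassifier, so a column of zeros is enough.
X_demo = np.zeros((10, 1))
y_demo = np.array(["Control"] * 6 + ["Case"] * 3 + ["Other"] * 1)

# Always predicts the most frequent training label.
majority = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
# Predicts each class uniformly at random.
uniform = DummyClassifier(strategy="uniform", random_state=0).fit(X_demo, y_demo)

majority_predictions = majority.predict(X_demo)
random_predictions = uniform.predict(X_demo)
```

A model is only useful if it beats both baselines: the majority-class accuracy equals the share of the largest class, so it can look deceptively strong on imbalanced data.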

## Results

In [9]:
performance_table = classifier.get_performance_table(logistic_regression, 
                                                     random_classifier, 
                                                     majority_class_classifier)
performance_table = pd.DataFrame(performance_table)
performance_table.round(3)



| Metric | Logistic Regression | Random Classifier | Majority Class Classifier |
|---|---|---|---|
| Accuracy | 0.675 | 0.337 | 0.585 |
| Precision | 0.617 | 0.465 | 0.342 |
| Recall | 0.675 | 0.337 | 0.585 |
| F1 Score | 0.643 | 0.376 | 0.432 |
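
In every column the recall equals the accuracy, which is what support-weighted averaging across classes produces. The code of `get_performance_table` is not shown, so the following is only a sketch of how such metrics can be computed with scikit-learn, assuming `average="weighted"`; the toy labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy example mimicking a majority-class classifier: every prediction is
# "Control", so "Case" and "Other" are never predicted.
y_true = np.array(["Control", "Control", "Control", "Case", "Case", "Other"])
y_pred = np.array(["Control"] * 6)

accuracy = accuracy_score(y_true, y_pred)

# zero_division=0 scores the precision of never-predicted classes as 0
# without emitting scikit-learn's UndefinedMetricWarning.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
```

Here the weighted recall equals the accuracy (0.5), and the weighted precision drops to 0.25 because only the "Control" class contributes a nonzero term. The same pattern would explain why a majority-class baseline shows a precision well below its accuracy.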


## Next Steps

- Explore alternative methods for dimensionality reduction.
- Review the literature on approaches to training models on patient data.