# TODO
* Mention more important to minimize FPR than TPR -- FPs are more costly than FNs in treating disease
* Mention we must achieve better performance than guessing -- saying "always pathogenic" gets us 70% accuracy
* Add link to https://www.desmos.com/calculator/bgontvxotm
* Print model coefficients

# Supervised vs. unsupervised learning

Machine learning can be broadly divided into *supervised* and *unsupervised* tasks. In supervised learning, you have data where you know beforehand what the "correct" answer is. For example, in this portion of the workshop, we will develop a classifier to predict whether a given single-nucleotide variant is pathogenic or benign. For this task, we already have a set of thousands of variants where we know the proper answer for each. The classifier's task, then, is to learn what features of a variant are most strongly associated with pathogenic or benign status, allowing the classifier to make predictions about variants for which we don't know the correct response.

In unsupervised learning, by contrast, you have data for which you don't know what the correct answer is. This changes the types of questions you can ask about your data&mdash;creating a classifier for variant pathogenicity would be vastly more difficult if we only had a list of existing variants, with no knowledge of whether each variant was harmful. One example of unsupervised learning is clustering, where you group your data into sets sharing common features. This task is unsupervised because, before your clustering algorithm runs, you don't know how many clusters are in the data or what the features specific to each cluster are. In fact, different clustering algorithms will produce different answers, with one answer not necessarily more correct than another. Exactly which answer you prefer will depend on the purpose to which you will put the results.

Both supervised and unsupervised learning are critical elements of your machine learning toolbox.

# The problem

[ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) is a curated database of over 100,000 mutations in the human genome, collected through research projects and clinical testing or extracted from the literature by third parties. Each single-nucleotide variant involved in Mendelian disorders is characterized according to [criteria defined by the American College of Medical Genetics and Genomics](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4544753/), such that it falls into one of the following five classes:

* Pathogenic
* Likely pathogenic
* Likely benign
* Benign
* Unknown significance

The ACMG criteria for deciding a variant's class are listed below, with each factor assigned a weighting according to how strongly human experts believe it influences pathogenicity. Note, for example, that a variant inducing a nonsense mutation in a protein whose loss of function is known to cause disease has a "very strong" influence on pathogenicity, while a minor (mutant) allele frequency (MAF) amongst the human population substantially exceeding disease prevalence strongly suggests a variant is benign.

![ClinVar criteria](images/clinvar_criteria.jpg)

Our goal is simple: using ClinVar's mutation catalogue, we will train a classifier that can predict variant pathogenicity. You can imagine working on a project where you are trying to identify a *de novo* mutation leading to a Mendelian disorder, but you have observed thousands of candidate mutations amongst your afflicted cohort. Using our classifier trained on ClinVar, you can classify each of the thousands of mutations according to its probability of pathogenicity, vastly reducing the amount of data you must sift through to locate the disease cause.

To simplify our problem, we will perform only a binary classification task of deeming a given variant *pathogenic* or *benign*. Note, however, the methods presented here can be extended to multiclass problems, such that you could produce classifications across all five of the ClinVar categories.

# Feature selection

To decide whether a variant is benign or pathogenic, we need information about the variant. Each piece of information you have is deemed a *feature*. Your goal is to select the pieces of information most informative in making predictions, while excluding irrelevant features that only add noise. In machine learning, *feature selection* is a mundane but critical task&mdash;a simple algorithm using carefully selected features will often outperform a much more complex algorithm lacking access to good features.

For our ClinVar classifier, the simplest model would use only the immediately apparent information about the variant: the chromosome, position, reference allele, and alternate allele. For example, one variant could be represented simply as "chromosome 1, position 123,456, C to T mutation". This model would fare poorly, however, as far too little information is presented to permit a good guess as to pathogenicity&mdash;all of the surrounding genomic context is lost, with such sparse information meaning almost nothing in isolation. Though we could create a classifier using only these features, it would likely perform no better than chance.

Happily, ClinVar provides a great deal of metadata about each variant permitting much more accurate predictions as to pathogenicity. We will use the following features offered by ClinVar:

* *MAF*: minor allele frequency
* *ass*: variant in acceptor splice site
* *dss*: variant in donor splice site
* *int*: variant in intron
* *nsf*: induces frameshift
* *nsm*: induces missense
* *r3*: in 3' region of gene
* *r5*: in 5' region of gene
* *ref*: coding-region variant such that one allele in set is identical to reference
* *syn*: coding-region variant such that one allele in set is synonymous (i.e., does not change amino acid)
* *u3*: in 3' untranslated region
* *u5*: in 5' untranslated region

All features except *MAF* are simply boolean (true or false) values. Excepting *MAF*, every feature is defined for each variant in ClinVar, meaning that we don't need to impute values for missing data. Unfortunately, many real-world datasets do not provide values for each feature across all data, meaning that you must determine a sensible means of providing default values when such data are missing. In our case, we will resolve cases in which no MAF is provided for a variant by simply assigning to it the mean MAF observed across all variants for which one is provided.

# Our first attempt at a linear model

To simplify our problem, imagine we were trying to determine whether a variant is pathogenic or benign using only its minor allele frequency (MAF)&mdash;that is, the frequency with which the variant appears in the human population, with all other individuals presumed to be homozygous reference for the allele in question. We can imagine that most pathogenic variants will have low MAFs, while some benign variants will have high MAFs and some will have low MAFs. To simplify our treatment of the problem, we will assume that benign variants all have high MAFs, though the final logistic regression model we train won't make this assumption.

Suppose we are given training data in which we see a number of pathogenic variants with low MAFs, and a number of benign variants with high MAFs. Our goal is to train a model to predict pathogenicity of a variant using only its MAF.

![Fitting a linear line](images/fitting_linear_line.png)

Here, the pathogenic variants in our training data are coloured red, and the benign variants are coloured green. The model we have fit to the data is the blue line.

![Predicting a novel variant](images/predicting_novel_variant.png)

You can see that, given a novel variant for which we do not know the pathogenicity status, we can predict its pathogenicity by plotting it on the blue line, giving us a probability of pathogenicity. We establish a decision boundary at $status=0.5$, meaning that any variant whose predicted pathogenicity is above 0.5 is deemed pathogenic, and any variant whose predicted pathogenicity is below 0.5 is deemed benign. Here, our predicted pathogenicity is slightly below 0.5, and so we decide this variant is benign.

# How do we fit a linear model?

Now that we've seen our model in action, the question becomes how we fit this model to our data. As our model is merely a line in two dimensions, our hypothesis (i.e., predicted pathogenicity) for a variant with MAF $x$ can be expressed as $h_\theta(x)=\theta_0+\theta_1x$, where $\theta_0$ is the intercept and $\theta_1$ is the model's slope. Now, imagine that you have several different candidate values of $\theta_0$ and $\theta_1$. You must test each to see how well the corresponding model fits the data. How can you quantify data fit to permit making an informed decision?

To quantify data fit, we will define a *cost function* $J(\theta_0,\theta_1)=\frac{1}{2M}\sum_{i=1}^{M}(h_{\theta}(x_{i})-y_i)^{2}$, in which you have $M$ variants in your training set and $y_i$ is the pathogenicity status of variant $i$ (that is, $y_i=1$ if variant $i$ is pathogenic, and $y_i=0$ if variant $i$ is benign). Thus, $J(\theta_0,\theta_1)$ will be high when the model fits the data poorly, reflecting a high *cost*. Conversely, if the hypotheses generated by the model are perfectly accurate, we will always have $h_\theta(x_i)=y_i$, implying $J(\theta_0,\theta_1)=0$. Our cost function $J(\theta_0,\theta_1)$ now lets us evaluate different values of $\theta_0$ and $\theta_1$ to determine which best fit the data. (Note that the $\frac{1}{2M}$ factor makes the math slightly cleaner when we differentiate the cost function, which we will do later on.) But can we automatically determine the ideal values of these parameters? Of course&mdash;this is the *learning* part of *machine learning*!
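
As a minimal sketch of the cost function above, the following computes $J(\theta_0,\theta_1)$ for two candidate parameter pairs. The MAF values and labels are invented toy data, and `squared_error_cost` is a hypothetical helper name rather than part of the workshop code.

```python
def squared_error_cost(theta0, theta1, mafs, labels):
    """Return J(theta0, theta1) = 1/(2M) * sum((h(x_i) - y_i)^2)."""
    m = len(mafs)
    total = 0.0
    for x, y in zip(mafs, labels):
        prediction = theta0 + theta1 * x  # linear hypothesis h_theta(x)
        total += (prediction - y) ** 2
    return total / (2 * m)


# Toy data (assumed for illustration): pathogenic variants (label 1) have
# low MAFs, benign variants (label 0) have high MAFs.
toy_mafs = [0.0, 0.1, 0.9, 1.0]
toy_labels = [1, 1, 0, 0]

# A downward-sloping line fits these points far better than an upward one,
# so its cost is much lower.
good_fit = squared_error_cost(1.0, -1.0, toy_mafs, toy_labels)
poor_fit = squared_error_cost(0.0, 1.0, toy_mafs, toy_labels)
print(good_fit, poor_fit)
```

Comparing the two printed costs is exactly the kind of informed decision between candidate parameters that the cost function is meant to support.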

![Gradient descent steps](images/gradient_descent_steps.png)

<img src="images/cost_plot.png" alt="Cost function" width="50%">

Suppose we plot the cost in three dimensions as a function of our parameters $\theta_0$ and $\theta_1$. We wish to locate the cost minimum, which we can see by inspection occurs at $\theta_0=5$ and $\theta_1=-1$. But how do we locate this point automatically? One solution is an algorithm called *gradient descent*.

<img src="images/cost_contours.png" alt="Cost function" width="50%">

Gradient descent is an iterative algorithm that uses partial derivatives of the cost function to determine what direction to move in at each step to get closer to the minimum cost. Specifically, gradient descent works thusly:

1. Set each $\theta_j$ to a random value. In our example, $j\in\{0,1\}$.
2. Update each $\theta_j$ by setting $\theta_{j}\leftarrow\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1})$.
    1. If none of the $\theta_j$ values have changed much relative to the last iteration (i.e., they have converged), terminate.
    2. Otherwise, repeat step 2.
    
The parameter $\alpha$ is your *learning rate*. Setting this value may take some trial and error. If $\alpha$ is too small, gradient descent will take a long time to converge; if it is too large, each step will overshoot the minimum, so that you oscillate around it (or even move away from it) without ever converging. If you have a reasonable $\alpha$, the steps you take will become progressively smaller as you approach the minimum, because the magnitudes of the cost function's partial derivatives shrink. So long as you have defined your cost function such that it is convex&mdash;meaning that there exists only a single minimum, rather than multiple local minima&mdash;gradient descent will locate the minimal-cost value of your parameters. We can use precisely this procedure to determine the best values of $\theta_0$ and $\theta_1$.
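
The procedure can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the workshop's implementation: the toy data, the learning rate of 0.5, the convergence tolerance, and the fixed (rather than random) starting point are all assumptions chosen to keep the example reproducible. The partial derivatives used are those of the squared-error cost: $\frac{\partial J}{\partial\theta_0}=\frac{1}{M}\sum_i(h_\theta(x_i)-y_i)$ and $\frac{\partial J}{\partial\theta_1}=\frac{1}{M}\sum_i(h_\theta(x_i)-y_i)x_i$.

```python
def gradient_descent(mafs, labels, alpha=0.5, tolerance=1e-9, max_steps=100000):
    """Fit theta0 and theta1 by gradient descent on the squared-error cost."""
    m = len(mafs)
    theta0, theta1 = 0.0, 0.0  # fixed start (an assumption) instead of a random one
    for _ in range(max_steps):
        # Prediction error h_theta(x_i) - y_i for every training point.
        errors = [theta0 + theta1 * x - y for x, y in zip(mafs, labels)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, mafs)) / m
        # Compute both updates before assigning, so they are simultaneous.
        new_theta0 = theta0 - alpha * grad0
        new_theta1 = theta1 - alpha * grad1
        converged = (abs(new_theta0 - theta0) < tolerance
                     and abs(new_theta1 - theta1) < tolerance)
        theta0, theta1 = new_theta0, new_theta1
        if converged:
            break
    return theta0, theta1


# Toy data (assumed): low-MAF pathogenic variants, high-MAF benign variants.
toy_mafs = [0.0, 0.1, 0.9, 1.0]
toy_labels = [1, 1, 0, 0]
theta0, theta1 = gradient_descent(toy_mafs, toy_labels)
print(theta0, theta1)
```

The fitted slope comes out negative, matching the intuition that a higher MAF should lower the predicted pathogenicity.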

In our full model, we will have the $\theta_0$ intercept parameter, as well as $\theta_1$ through $\theta_N$ parameters representing the weights assigned to each of our $N$ features. In our MAF example, we determined that high MAF corresponds to low probability of pathogenicity, and so the weight assigned to the MAF feature will be negative&mdash;the higher the MAF, the lower we wish our predicted pathogenicity to be. Conversely, we can imagine the $\theta_i$ parameter we establish for our "variant induces missense mutation" feature will be large and positive, as such mutations have a high likelihood of being pathogenic. Examining the weight parameters our model determines can inform us as to which features are most important in determining pathogenicity.

# Creating a better linear model

As it stands, our model is an example of *linear regression* in which we have fit a simple line to our data. This is a perfectly valid solution if we were predicting a continuous value, such as when using, for example, resting heart rate and weight to predict a patient's blood pressure. Predicting variant pathogenicity, however, is a problem of *classification* rather than *regression*&mdash;we ultimately want to determine whether a variant is pathogenic or not. Let's review our linear regression model:

![Fitting a linear line](images/fitting_linear_line.png)

This model suffers from two critical shortcomings:

1. The model's outputs are not restricted to a sensible range. For instance, a MAF of 0.99 would result in a *negative* value for pathogenicity, which can no longer be interpreted as a probability. Likewise, we can have pathogenicity predictions greater than 1, which are equally nonsensical.
2. Outliers can have a large effect on our model.

The effect of outliers is apparent if we add a pathogenic variant with high MAF. Suppose one of your training points is mislabelled, such that the outlying red point is labelled as pathogenic (hence the red), but is in fact benign.

![Outlier effect](images/outlier_effect.png)

In this case, the line we fit will be "pulled" to the right by the outlier. This means that the leftmost benign point below it will now be misclassified, since it falls on the "pathogenic" side of the line!

To create a model better suited to classification, we will use logistic regression instead of linear regression. Logistic functions take the form $y=\frac{1}{1+\exp(-z)}$, with $\exp$ being the exponential function such that $\exp(a)=e^a$. Rather than creating a linear hypothesis of the form $h_{linear}(x)=\theta_0+\theta_1x$, we will use a logistic hypothesis $h_{logistic}(x)=\frac{1}{1+\exp(-(\theta_{0}+\theta_{1}x))}$.

![Using the logistic function](images/logistic_function.png)

This neatly resolves both of our model's issues&mdash;predictions are now constrained to the range $h_\theta(x)\in[0,1]$, permitting interpretation as probabilities, while outliers have far less effect on our predictions.

![Using the logistic function](images/logistic_prediction.png)

As with our linear regression model, we can now proceed by establishing a decision boundary such that a variant with $MAF=x$ is deemed pathogenic if $h_\theta(x)\geq0.5$, and benign if $h_\theta(x)<0.5$.
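
To make the logistic hypothesis and its decision boundary concrete, here is a small sketch. The parameter values $\theta_0=5$ and $\theta_1=-10$ are hypothetical, picked only so that the curve crosses 0.5 at a MAF of 0.5; `logistic_hypothesis` and `classify` are illustrative names.

```python
import math


def logistic_hypothesis(theta0, theta1, maf):
    """Return h(x) = 1 / (1 + exp(-(theta0 + theta1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(theta0 + theta1 * maf)))


def classify(theta0, theta1, maf, boundary=0.5):
    """Deem a variant pathogenic when h(x) meets or exceeds the boundary."""
    if logistic_hypothesis(theta0, theta1, maf) >= boundary:
        return 'pathogenic'
    return 'benign'


# Hypothetical parameters, chosen for illustration only.
theta0, theta1 = 5.0, -10.0
for maf in (0.0, 0.5, 0.99):
    prob = logistic_hypothesis(theta0, theta1, maf)
    print(maf, round(prob, 3), classify(theta0, theta1, maf))
```

Unlike the linear model, a MAF of 0.99 now yields a small positive probability rather than a negative value.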

As we have changed our model to use the logistic function, we must also change the cost function to allow gradient descent to find a single cost optimum. We will use this cost function for logistic regression:

* If $y=1$ (variant is pathogenic): $\text{Cost}(h_\theta(x),y)=-\log(h_\theta(x))$
* If $y=0$ (variant is benign): $\text{Cost}(h_\theta(x),y)=-\log(1-h_\theta(x))$

Thus, regardless of whether a variant is pathogenic ($y=1$) or not ($y=0$), this cost function will punish our model based on how far the predicted pathogenicity is from the actual variant status. With the cost function defined for a single variant, we can now create a combined cost function for all our $M$ variants:

$J(\theta)=-\frac{1}{M}\sum_{i=1}^{M}\left[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$

Note that, when the variant is pathogenic, the second term within the sum becomes zero, while when the variant is benign, the first term becomes zero. Thus, we can combine both the pathogenic and benign equations, allowing us to write the cost for all variants.
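
The combined cost can be sketched directly from the formula for $J(\theta)$. The predicted probabilities below are made-up values standing in for $h_\theta(x^{(i)})$, and `logistic_cost` is a hypothetical helper name.

```python
import math


def logistic_cost(predicted_probs, labels):
    """Return J(theta) from each h_theta(x_i) and its true label y_i."""
    m = len(labels)
    total = 0.0
    for h, y in zip(predicted_probs, labels):
        # Only one of the two terms is non-zero for any given variant.
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m


toy_labels = [1, 1, 0, 0]
# Made-up predictions: the first set is mostly right, the second mostly wrong.
confident_and_right = logistic_cost([0.9, 0.8, 0.2, 0.1], toy_labels)
confident_and_wrong = logistic_cost([0.1, 0.2, 0.8, 0.9], toy_labels)
print(confident_and_right, confident_and_wrong)
```

Predictions that are confidently wrong are penalized far more heavily than predictions that are confidently right.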

# Running logistic regression

In [1]:
from pathogenicity_predictor import make_variants, assign_classes, \
  concat_training_data, plot_line_graph, eval_performance
from functools import reduce
import numpy as np
# Requires scikit-learn < 0.20: later releases replaced
# sklearn.cross_validation with sklearn.model_selection.
import sklearn.cross_validation
import sklearn.feature_extraction
import sklearn.linear_model
import sklearn.metrics

def impute_missing(variants):
    mafs = []

    for label, vars in variants.items():
        for var in vars:
            if 'MAF' in var:
                mafs.append(var['MAF'])

    missing_maf = np.mean(mafs)

    for label, vars in variants.items():
        for var in vars:
            if 'MAF' not in var:
                var['MAF'] = missing_maf
                
def vectorize_variants(variants):
    vectorizer = sklearn.feature_extraction.DictVectorizer(sparse=False)
    # List all variants (which are dictionaries) in single list regardless of
    # what class (pathogenic, benign, ...) they fall in.
    all_vars = reduce(lambda a, b: a + b, variants.values())
    vectorizer.fit(all_vars)

    vectorized_vars = {}
    for clnsig, vars in variants.items():
        vectorized_vars[clnsig] = vectorizer.transform(vars)
    return vectorized_vars

def prepare_variants(vcf_file):
    variants = make_variants(vcf_file)
    variants = assign_classes(variants)
    impute_missing(variants)
    variants = vectorize_variants(variants)
    return variants
  
def partition_into_training_and_test(vars, labels, training_proportion):
    num_vars = vars.shape[0]
    num_training = int(np.round(training_proportion * num_vars))
    
    indices = np.arange(num_vars)
    np.random.shuffle(indices)
    training_indices, test_indices = indices[:num_training], indices[num_training:]
    
    return (vars[training_indices], vars[test_indices], labels[training_indices], labels[test_indices])

In [2]:
np.random.seed(1)
variants = prepare_variants('clinvar.vcf.gz')

In [3]:
cval = 100
first_model = sklearn.linear_model.LogisticRegression(penalty='l2', C=cval)
vars, labels = concat_training_data(variants)
print(labels.shape)
training_vars, test_vars, training_labels, test_labels = partition_into_training_and_test(vars, labels, 0.8)

first_model.fit(training_vars, training_labels)
pathogenicity_probs = first_model.predict_proba(training_vars)[:,1]
eval_performance(training_labels, pathogenicity_probs)

(34095,)
{'accuracy': 0.88396392432908055, 'f1': 0.91824029345664016, 'roc_auc': 0.90353226041307522, 'pr_auc': 0.94518605771803565}


# Preventing overfitting: training, test, and validation datasets

Now that we've created a logistic regression model to predict variant pathogenicity, we want to assess its performance. The simplest way of doing so would be to simply train the model on all variants in ClinVar, with the features for each variant and its true class (pathogenic or benign) provided. Then, to test performance, we feed each variant into the trained model once more, but without its label. We would then compare the model's prediction to the true label for the variant, seeing what proportion were correctly predicted.

The problem with so simple an approach is that your model will inevitably *overfit* your data. *Overfitting* means that your model is fit not to underlying patterns common to similar data sets&mdash;such as the combination of variant features most predictive of pathogenicity, like whether a variant induces a missense mutation&mdash;but instead to noise particular to the data you have at hand. Overfitting often results when your model is trained with too little data, meaning it does not see sufficient data to generalize, or when your model is overly complex, with too many parameters that can be adjusted to precisely fit the peculiarities of your training data.

![Overfitting example](images/overfitting_curves.png)
Consider the case of trying to fit polynomials of various orders to points generated from a sinusoidal curve (taken from Bishop). The green curve represents the underlying truth, with the blue points generated as points on the curve with the addition of some noise pushing them slightly away. We see that the zeroth- and first-degree polynomials, corresponding to $M=0$ and $M=1$, fit the points poorly. The third-order polynomial comes close to approximating the green curve, and so is deemed a good fit to the data. The ninth-order polynomial, however, overfits&mdash;though it passes precisely through every observed point, its structure is intrinsically tied to the noise that pushed the blue points off the green curve, and so it poorly represents the original curve. In this case, the model was too complex&mdash;with ten different parameters (i.e., coefficients for the ninth-order polynomial), the model was able to conform *too* closely to the observed data. Constraining model complexity such that we allow only a third-order polynomial, or providing the training process with more data, would have prevented such overfitting.
 
Now that we know what overfitting is, how do we prevent it? The key is to partition your data into *training* and *validation* sets. For example, you might train on 80% of your data, but then evaluate your model's performance on the remaining 20%. If your model overfits your training set, you will see model performance decrease on the validation set relative to a non-overfit model, allowing you to choose the model that best generalizes to inputs not seen during training.
 
But now we confront a different problem. You have only a limited amount of data, and you would like to evaluate your model against all available examples. You might imagine that, if you retain 20% of ClinVar variants to validate your model, that 20% is unusually easy or unusually hard to classify relative to the whole set, meaning that you will either overestimate or underestimate model performance, respectively. The solution is to use *k-fold cross validation*. Cross validation means that, once we partition the dataset into $k$ equally sized parts, we rotate through them, using a different part as the validation set each time and training on the remainder. With $k=3$, for example, we create three separate models from three distinct training sets and evaluate the performance of each, meaning that every variant will eventually be used to validate the model. This in turn gives us a better idea of how our model generalizes.

Cross-validation is particularly useful in selecting *hyperparameters*. While your model's *parameters* are automatically inferred from your training data&mdash;that's the *learning* part of *machine learning*&mdash;there are other parameters that can't be learned, and must be set by a human engineer. Such hyperparameters can have a significant effect on performance, as we will see when discussing regularization below. Typically, to select a hyperparameter, we will repeat our entire k-fold cross validation scheme for each candidate value, selecting the value that yields the best average performance on the validation sets. This becomes difficult, however, when you have many hyperparameters. Suppose you have $n$ hyperparameters, each of which can take $k$ different values. This implies you must do a *grid search* over all possible $k^n$ combinations, repeating your entire training and validation procedure for each. Because the number of combinations grows exponentially with $n$, this is an unfortunate problem to which there is no ready solution.
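
To see how quickly a grid grows, the sketch below enumerates every combination for a hypothetical grid of $n=2$ hyperparameters with $k=3$ candidate values each. The names `C` and `max_iter` and their values are assumptions for illustration; no model is actually trained here.

```python
import itertools

# Hypothetical grid: two hyperparameters with three candidate values each.
grid = {
    'C': [0.01, 1, 100],
    'max_iter': [100, 200, 400],
}

names = sorted(grid)
combinations = [
    dict(zip(names, values))
    for values in itertools.product(*(grid[name] for name in names))
]

# Each combination would need its own full round of cross validation.
print(len(combinations))
for combo in combinations:
    print(combo)
```

Adding a third hyperparameter with three values would triple the number of cross-validation runs again.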

For our pathogenicity predictor, we will use three-fold cross validation. Each fold corresponds to a separate model we train. Observe how which portion of the data is used as validation and training changes in each fold.

| Fold | First 33% of data | Second 33% of data | Last 33% of data |
| ---- | ----------------- | ------------------ | ---------------- |
| First | Training | Training | Validation |
| Second | Training | Validation | Training |
| Third | Validation | Training | Training |
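
The rotation in the table above can be sketched with plain index arithmetic. This is a simplified illustration, not what scikit-learn does internally: `k_fold_splits` is a hypothetical helper that neither shuffles nor stratifies the data.

```python
def k_fold_splits(num_items, num_folds=3):
    """Yield (training_indices, validation_indices) for each fold."""
    indices = list(range(num_items))
    fold_size = num_items // num_folds
    for fold in range(num_folds):
        # Walk backwards through the blocks so the first fold validates on
        # the last block of data, matching the table.
        block = num_folds - 1 - fold
        start = block * fold_size
        # The last block absorbs any leftover items.
        stop = num_items if block == num_folds - 1 else start + fold_size
        validation = indices[start:stop]
        training = indices[:start] + indices[stop:]
        yield training, validation


splits = list(k_fold_splits(9))
for training, validation in splits:
    print('train:', training, 'validate:', validation)
```

Every index appears in exactly one validation set, so each variant is used for validation once.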

Now, suppose we use our three-fold cross validation scheme to evaluate different models, choosing the one that best classifies our variants. Great! We're ready to publish. To maximize the amount of data our trained model has seen, we re-train our best model using all available variants in ClinVar, using the hyperparameters established as ideal during validation. The issue, however, is that you lack an unbiased means of estimating performance&mdash;it is critical that you perform your final evaluation of your model using data it has never seen before, to prevent precisely the overfitting issue we first encountered. To resolve this problem, before we ever begin to work on a model, we remove some of our data from our training/validation set and retain it as *test* data to use in the final evaluation. Suppose at the outset of this project you randomly select 20% of the ClinVar variants as your test set. Then, you proceed with training and evaluating different models with varying hyperparameters using the three-fold cross validation scheme we discussed on the remaining 80% of variants. Once you've selected the best model and hyperparameters, you re-train that model on all variants in your training/validation set. Finally, you evaluate this model's performance on the test set held out since starting your project. The most important consideration is that you cannot adjust your model or its hyperparameters to increase performance on the test set, as the test set should represent an entirely novel set of data your model has never before encountered, granting understanding into how well your model generalizes to real-world data for which the "correct" answer is unknown.

For our present project, we won't bother using a separate test data set, as we do not intend to use this model for real-world tasks. Nevertheless, using a separate test data set is essential when developing models for use in your own work.

# Running logistic regression with three-fold cross validation

In [4]:
def run_cross_validation(vars, labels, model):
    skf = sklearn.cross_validation.StratifiedKFold(labels, n_folds=3, shuffle=True)

    validation_probs = np.zeros(labels.shape)

    for train_index, validation_index in skf:
        training_data, validation_data = vars[train_index], vars[validation_index]
        training_labels, validation_labels = labels[train_index], labels[validation_index]

        model.fit(training_data, training_labels)

        # Keep column 1 of predict_proba: the probability of the positive
        # (pathogenic) class for each validation variant.
        validation_fold_probs = model.predict_proba(validation_data)[:, 1]
        validation_probs[validation_index] = validation_fold_probs

    return validation_probs

In [5]:
second_model = sklearn.linear_model.LogisticRegression(penalty='l2', C=cval)
pathogenicity_probs = run_cross_validation(training_vars, training_labels, second_model)

In [6]:
eval_performance(training_labels, pathogenicity_probs)

{'accuracy': 0.88378061299310751, 'f1': 0.91812593625703798, 'roc_auc': 0.90327115893850585, 'pr_auc': 0.94815231025721314}


# Assessing model performance

Given different models, we need some means of comparing them to understand how well each performs. The simplest is the accuracy score: when we examine our model's predictions on our validation data, what proportion of predictions match the true labels? The advantage of the accuracy score is that it's simple to understand; the disadvantage, however, is that it requires determining an appropriate decision boundary. As our classifier produces the probability that each variant is pathogenic, we must determine the probability cutoff at which we distinguish between the two classes. For now, we'll simply deem a variant pathogenic if $P(\text{pathogenic})\geq0.5$, which seems reasonable. 

In [7]:
print('Accuracy:', sklearn.metrics.accuracy_score(training_labels, pathogenicity_probs >= 0.5)) 

0.883780612993



Choosing a decision boundary of 0.5 was arbitrary&mdash;depending on our problem domain, either higher or lower values may be better suited. Exactly which value we choose will depend on whether false positives or false negatives are more costly. Suppose, for example, you use the variant pathogenicity output by our model to decide which variants amongst thousands you've detected in your target population are the best candidates for verification in the wet lab. If your available resources are enough to test only dozens of variants, you must seek to minimize false positives, even at the cost of increasing false negatives. In this case, you might deem a variant pathogenic only if $P(\text{pathogenic})\geq0.9$. Conversely, in building a classifier to determine whether someone is afflicted by cancer based on various biomarkers, the cost of false negatives is greater than that of false positives&mdash;false-positive diagnoses will be corrected by later screening procedures, but false-negative diagnoses will mean that a cancer sufferer will not receive any treatment. As such, you might establish a decision boundary such that someone is listed as potentially suffering from cancer if $P(\text{has cancer})\geq0.1$.
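
The trade-off can be seen by counting errors at several boundaries. The probabilities and labels below are invented toy values, and `confusion_counts` is a hypothetical helper; the sketch only illustrates that raising the boundary exchanges false positives for false negatives.

```python
def confusion_counts(probs, labels, boundary):
    """Return (TP, FP, TN, FN) when predicting positive for probs >= boundary."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= boundary else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Toy predictions (assumed): label 1 is pathogenic, label 0 is benign.
toy_probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0]

for boundary in (0.1, 0.5, 0.9):
    print(boundary, confusion_counts(toy_probs, toy_labels, boundary))
```

On this toy data, the lowest boundary produces no false negatives but two false positives, while the highest produces no false positives but two false negatives.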

To understand how your model performs using a variety of decision boundaries, you can utilize both the precision-recall curve and the receiver-operating-characteristic (ROC) curve. We will discuss each in turn. Both rely on determining the relationships between true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs).

The precision-recall curve demonstrates the balance between precision and recall. Precision is defined as $\text{precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}$, while recall is $\text{recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$. Let's plot the precision-recall curve for our model.
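
As a quick worked example of these two definitions, the sketch below uses hypothetical counts (90 true positives, 10 false positives, 30 false negatives); `precision_recall` is an illustrative name, not a scikit-learn function.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return precision, recall


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(precision, recall)
```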

In [8]:
precision, recall, thresholds = sklearn.metrics.precision_recall_curve(training_labels, pathogenicity_probs)
plot_line_graph(
  xvals = recall,
  yvals = precision,
  title = 'Precision-recall for logistic regression',
  xtitle = 'Recall',
  ytitle ='Precision',
  labels = ['Decision boundary = %s' % t for t in list(thresholds) + [-1]]
)


By hovering over the curve, you can see the decision boundary set for each precision/recall point. As we would expect, we can achieve extremely high precision by accepting poor recall&mdash;we get a precision of 0.99 alongside a recall of 0.43 given a decision boundary of 0.97. This makes sense, as almost all positives (i.e., variants deemed pathogenic) will be true positives, at the cost of producing a great many false negatives that lower recall. If we wish to instead minimize the number of false negatives, we can select a decision boundary of 0.40 from the curve's far right portion, yielding a precision of 0.94 while maintaining a recall of 0.90. For most applications, this will be a superior configuration.

To make rapid judgements about a model's quality, we can also examine the area under the precision-recall curve. Clearly, the best possible model would maintain perfect precision at any recall, including perfect recall, which would yield an area under the curve of one. We can compute this easily:

In [9]:
print('Area under precision-recall curve:', sklearn.metrics.average_precision_score(training_labels, pathogenicity_probs))

Area under precision-recall curve: 0.948152310257


This metric thus gives us an efficient means of evaluating how well our model reflects a "perfect" model.

As an alternative to the precision-recall curve, we can consider the receiver operating characteristic (ROC) curve. The ROC curve is similar to the precision-recall curve, but plots the true positive rate (TPR) against the false positive rate (FPR). The TPR, defined as $\text{TPR}=\frac{\text{TP}}{\text{TP}+\text{FN}}$, indicates the proportion of actual positives that are correctly identified (it is identical to recall), while the FPR, defined as $\text{FPR}=\frac{\text{FP}}{\text{FP}+\text{TN}}$, shows the proportion of actual negatives that are incorrectly deemed positive. Once more, we can plot this curve alongside the decision boundary associated with each $(\text{FPR}, \text{TPR})$ pair, and also examine the area under the curve.

In [10]:
fpr, tpr, thresholds = sklearn.metrics.roc_curve(training_labels, pathogenicity_probs)
plot_line_graph(
  xvals = fpr,
  yvals = tpr,
  title = 'ROC for logistic regression',
  xtitle = 'FPR',
  ytitle ='TPR',
  labels = ['Decision boundary = %s' % t for t in thresholds]  # roc_curve returns one threshold per (FPR, TPR) point
)
print('Area under ROC curve:', sklearn.metrics.roc_auc_score(training_labels, pathogenicity_probs))

Area under ROC curve: 0.903271158939


Unlike the precision-recall curve, the ROC curve increases monotonically, as a higher FPR always yields at least as good a TPR. Intuitively, as we lower the decision boundary, more variants are classified as pathogenic, so both the TPR and FPR can only increase. The perfect model would have a TPR of 1 accompanying an FPR of 0, meaning the area under the curve would be one.
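Another way to read the area under the ROC curve is as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below computes the area this way on toy scores; `pairwise_auc` is a hypothetical helper, and its quadratic running time makes it suitable only for small illustrations.

```python
def pairwise_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # pathogenic variant correctly ranked higher
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy scores (hypothetical).
auc_perfect = pairwise_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
auc_mixed = pairwise_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
auc_reversed = pairwise_auc([0, 0, 1, 1], [0.9, 0.8, 0.3, 0.1])
print(auc_perfect, auc_mixed, auc_reversed)
```

A perfect ranking gives an area of one, a completely inverted ranking gives zero, and rankings in between fall according to how many pathogenic/benign pairs are ordered correctly.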

Now that we have defined accuracy, the precision-recall curve, and the ROC curve, how do we determine if our classifier is worthwhile? First, we can compare these metrics to those of other classifiers trained on the same data, granting insight into the impact that both the choice of algorithm and different hyperparameters have. The other thing we can do, which we will try here, is to compare the performance of our classifier to that of a naive classifier that simply "guesses" whether each variant is pathogenic or benign.

In [17]:
print('Actual pathogenic proportion:', np.sum(training_labels) / len(training_labels))

guesses = np.random.sample(len(pathogenicity_probs))
print('Accuracy:', sklearn.metrics.accuracy_score(training_labels, guesses >= 0.5)) 
  
precision, recall, thresholds = sklearn.metrics.precision_recall_curve(training_labels, guesses)
plot_line_graph(
  xvals = recall,
  yvals = precision,
  title = 'Precision-recall for random guessing',
  xtitle = 'Recall',
  ytitle ='Precision',
  labels = ['Decision boundary = %s' % t for t in list(thresholds) + [-1]]
)
print('Area under precision-recall curve:', sklearn.metrics.average_precision_score(training_labels, guesses))

fpr, tpr, thresholds = sklearn.metrics.roc_curve(training_labels, guesses)
plot_line_graph(
  xvals = fpr,
  yvals = tpr,
  title = 'ROC for random guessing',
  xtitle = 'FPR',
  ytitle ='TPR',
  labels = ['Decision boundary = %s' % t for t in thresholds]  # roc_curve returns one threshold per (FPR, TPR) point
)
print('Area under ROC curve:', sklearn.metrics.roc_auc_score(training_labels, guesses))

Actual pathogenic proportion: 0.697902918316
Accuracy: 0.493510778707


Area under precision-recall curve: 0.690321327847


Area under ROC curve: 0.488733852772


Intuitively, this makes sense.

* Since our guesses are probabilities uniformly drawn from the range $[0,1]$, the accuracy is approximately $0.5$, as for every variant, regardless of whether it's actually pathogenic or benign, the probability of falling above the decision boundary and being deemed pathogenic is $0.5$.

* On the precision-recall curve, ignoring noise near the left point of the curve, we see a constant precision of approximately $0.7$, matching the proportion of variants in our training set that are actually pathogenic. To understand this, we will pick an arbitrary decision boundary $B=0.4$. Now, because our guessed pathogenic probabilities are drawn uniformly, 60% of the guessed probabilities will be above the threshold and 40% below. As these probabilities are assigned randomly to variants, within both the "above threshold" set and "below threshold" set, we expect the distribution of actual pathogenicity to match that seen in the training data as a whole. This is exactly what we see&mdash;amongst the 60% of "guessed pathogenic" variants above the $B$ threshold, approximately 70% of those will actually be pathogenic, granting a precision of $0.7$. This holds true no matter what the value of $B$ is. Conversely, we expect recall to increase at the same rate as $B$ decreases. In $\text{recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$, the denominator is fixed regardless of $B$, while the numerator increases with falling $B$ as we pick up increasing numbers of true positives as the decision boundary falls.

* On the ROC curve, TPR is synonymous with recall, so we can use the argument above to show that TPR will increase as the decision boundary decreases. The same argument applies to the FPR, in which $\text{FPR}=\frac{\text{FP}}{\text{FP}+\text{TN}}$: the denominator remains fixed regardless of the decision boundary, so as the decision boundary decreases we will see TNs "converted" into FPs, meaning the FPR increases at the same rate as the TPR. The ROC curve for random guessing thus lies along the diagonal from $(0,0)$ to $(1,1)$, giving an area under the curve of approximately $0.5$.

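Consequently, for your classifier to be worthwhile, you must improve on what you can achieve by mere chance. If we compare our accuracy, ROC curve, and precision-recall curve to those of the random guesses, we see that our model manages to do exactly this. For instance, its area under the precision-recall curve is 0.948 against 0.690 for guessing, and its area under the ROC curve is 0.903 against 0.489.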
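The arguments about random guessing above can be checked with a small simulation. This is a sketch on synthetic labels, assuming roughly 70% of variants are pathogenic as in the training set; the `rates` helper, the seed, and the sample size are arbitrary choices made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
n = 100000
pathogenic_fraction = 0.7  # assumed, roughly matching the training set

# Synthetic labels and uniformly random "probabilities".
labels = [1 if random.random() < pathogenic_fraction else 0 for _ in range(n)]
guesses = [random.random() for _ in range(n)]

def rates(labels, scores, boundary):
    """Return (precision, TPR, FPR) at the given decision boundary."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        if s >= boundary:
            if y == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y == 1:
                fn += 1
            else:
                tn += 1
    return tp / (tp + fp), tp / (tp + fn), fp / (fp + tn)

results = {}
for boundary in (0.2, 0.4, 0.6, 0.8):
    results[boundary] = rates(labels, guesses, boundary)
    print('B=%.1f precision=%.3f TPR=%.3f FPR=%.3f' % ((boundary,) + results[boundary]))
```

Precision should stay near the pathogenic fraction at every boundary, while the TPR and FPR should both track $1-B$, which is why the ROC curve for guessing hugs the diagonal.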

# Regularization

Regularization is another strategy for reducing overfitting. We have seen already that we can guard against overfitting by partitioning our data into training and validation sets, then evaluating performance using only the validation data. Regularization gives us another means of ensuring our model generalizes to unseen data.

The intuition underlying regularization is simple: without extremely strong evidence supporting particular parameter values, we would prefer the magnitudes of our parameters be low. Let us return to our previous example of fitting a polynomial to points.

![Overfitting example](images/overfitting_curves.png)

As before, the ninth-degree polynomial is clearly too complex a model for the limited amount of data we have at hand; the corresponding curve is overfit, and thus unlikely to generalize to novel data. Suppose the ninth-degree polynomial takes this form:

$y = \theta_0 + \theta_1x + \theta_2x^2 + ... + \theta_8x^8 + \theta_9x^9$

The coefficients associated with the higher-degree terms in the polynomial, such as $\theta_8$ and $\theta_9$, are likely to be large in magnitude, allowing the complex curve to conform exactly to the observed data. Suppose, however, that we did not wish to declare that the $M=3$ cubic polynomial is more suited to the data, for we wanted to preserve the possibility that a more complex polynomial *could* be fit, given sufficient data to provide evidence for this need. In this case, we could penalize the higher-degree coefficients, allowing them to take large values only if they substantially improve fit on validation data.

Regularization will take a similar form when we apply it to our logistic regression model. Each feature in our data, such as a variant's MAF or a boolean indicating whether it induces a missense mutation, has an associated weight (whether positive or negative) indicating that feature's contribution to pathogenicity across the training data. Absent extremely strong evidence from the data, we want these parameters to have small magnitudes, as such values are likely to generalize better to unseen data, rather than overfitting noise present in the training data. To achieve this, we will simply modify our cost function. Recall that our cost function for logistic regression was thus:

$J(\theta)=-\frac{1}{M}\sum_{i=1}^{M}\left[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$

To perform regularization, we simply add an additional term increasing cost in accordance with the magnitudes of all our weight parameters:

$J(\theta)=-\frac{1}{M}\sum_{i=1}^{M}\left[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]+\beta\sqrt{\sum_{i=1}^{N}\theta_{i}^{2}}$

The extra term is simply $\beta$ multiplied by the L2 (Euclidean) norm of the vector containing our feature weights $[\theta_1, \theta_2, ..., \theta_N]$. As each weight is squared inside the norm, high-magnitude weights dominate the penalty. (Many implementations penalize the squared norm $\sum_{i=1}^{N}\theta_{i}^{2}$ instead, which punishes large weights more severely still.) $\beta$ is a hyperparameter termed the *regularization constant*, representing how severely we wish to constrain the magnitudes of our parameters. By incorporating regularization into the cost function and adjusting $\beta$, we can change the tradeoff we make: we can take small $\beta$, which will yield a model that better fits the training data but with larger weight parameters, or we can take large $\beta$, which will produce a model that fits the training data less well but with smaller weight parameters. By training models with different values of $\beta$ and comparing their performance on validation data, we can determine which makes the best tradeoff between data fit and model generalizability. Comparing them on the training data alone would always favour the smallest $\beta$.
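As a minimal sketch, the regularized cost above can be written out in plain Python. It assumes `theta[0]` is an unpenalized intercept $\theta_0$ and that `theta[1:]` holds the feature weights; the function names and toy data are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_cost(theta, X, y, beta):
    """Cross-entropy cost plus beta times the L2 norm of the feature weights.

    theta[0] is the intercept (not penalized, an assumption of this sketch);
    theta[1:] are the feature weights.
    """
    m = len(y)
    cross_entropy = 0.0
    for features, label in zip(X, y):
        z = theta[0] + sum(w * x for w, x in zip(theta[1:], features))
        h = sigmoid(z)
        cross_entropy -= label * math.log(h) + (1 - label) * math.log(1 - h)
    penalty = beta * math.sqrt(sum(w ** 2 for w in theta[1:]))
    return cross_entropy / m + penalty

# Toy data (hypothetical): two features per example.
toy_X = [[0.5, 1.0], [1.5, 0.0], [-1.0, 1.0], [-0.5, 0.0]]
toy_y = [1, 1, 0, 0]
toy_theta = [0.0, 3.0, -4.0]  # feature weights have L2 norm 5

unpenalized = regularized_cost(toy_theta, toy_X, toy_y, beta=0.0)
penalized = regularized_cost(toy_theta, toy_X, toy_y, beta=0.1)
print('Cost without penalty:', unpenalized)
print('Cost with beta = 0.1:', penalized)
```

The two costs differ by exactly $\beta$ times the norm of the feature weights, so an optimizer minimizing the penalized cost is pushed toward smaller weights unless the fit improves enough to compensate.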

In this case, as we have used the L2 norm, we have performed *L2 regularization*. The other form of regularization you are likely to encounter is *L1 regularization*, which uses the L1 norm instead:

$J(\theta)=-\frac{1}{M}\sum_{i=1}^{M}\left[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]+\beta\sum_{i=1}^{N}|\theta_{i}|$

Observe that only the regularization term has changed. Why would you prefer the L1 norm to the L2 norm? The L1 norm induces sparsity in the weights you learn, by readily allowing the weights for unimportant features to go to exactly zero. The L2 norm, conversely, prefers to "spread" weights more equitably amongst multiple features, making it more difficult to discern which are truly important in determining your output. Thus, to gain insight into which features of your data are most important, L1 is often preferable. This comes at a cost, however: because of its use of the absolute value, the L1 norm is not differentiable wherever a weight is exactly zero, meaning we cannot rely on it in our simple gradient descent scheme. To use the L1 norm, we must examine alternative optimization techniques beyond the scope of this workshop.
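To see where the sparsity comes from, consider a deliberately simplified one-weight problem: minimizing $\frac{1}{2}(\theta-a)^2$ plus a penalty, where $a$ is the value the weight would take without regularization. With the L1 penalty $\beta|\theta|$ the minimizer is the soft-thresholding rule, and with the squared L2 penalty $\frac{1}{2}\beta\theta^2$ (the ridge-style variant, assumed here because the un-squared norm of a single weight coincides with L1) it is $a/(1+\beta)$. The sketch below applies both to hypothetical weights; it illustrates the penalties only and is not the logistic regression optimizer.

```python
def l1_minimizer(a, beta):
    """Minimizer of 0.5 * (theta - a)**2 + beta * abs(theta) (soft-thresholding)."""
    if abs(a) <= beta:
        return 0.0  # small weights are set exactly to zero
    return a - beta if a > 0 else a + beta

def l2_minimizer(a, beta):
    """Minimizer of 0.5 * (theta - a)**2 + 0.5 * beta * theta**2."""
    return a / (1.0 + beta)  # shrinks every weight, but never to zero

# Hypothetical unpenalized weights: two important features, two weak ones.
unpenalized_weights = [2.0, 0.3, -0.1, -1.5]
beta = 0.5

l1_weights = [l1_minimizer(a, beta) for a in unpenalized_weights]
l2_weights = [l2_minimizer(a, beta) for a in unpenalized_weights]
print('L1:', l1_weights)
print('L2:', l2_weights)
```

Under L1 the two weak weights are removed entirely while the strong ones survive, whereas under the squared L2 penalty all four weights are scaled down by the same factor and none reaches zero.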

# Choosing hyperparameters

# Support vector machines

# Random forests