# Classification and Prediction in GenePattern Notebook

This notebook shows how to use k-nearest neighbors (kNN) to build a predictor, use it to classify human airway smooth muscle samples as treated or untreated with asthma medications, and assess its accuracy by cross-validation.

### K-nearest-neighbors (KNN)
KNN classifies an unknown sample by assigning it the phenotype label most frequently represented among the k nearest known samples (Golub and Slonim et al., 1999). 
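
The voting rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the GenePattern implementation: the function name `knn_predict`, the Euclidean distance, and the two-gene expression values are all assumptions made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_samples, train_labels, query, k=3):
    """Return the majority label among the k training samples closest to query."""
    # Euclidean distance from the query to every training sample (an assumed metric).
    distances = [
        (math.dist(sample, query), label)
        for sample, label in zip(train_samples, train_labels)
    ]
    # Keep the k closest neighbors and count how often each label occurs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy two-gene expression profiles (made-up values for illustration only).
train_samples = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_labels = ["untreated", "untreated", "untreated", "treated", "treated", "treated"]

prediction = knn_predict(train_samples, train_labels, query=(2.9, 3.1), k=3)
print(prediction)
```

An odd value of k avoids tied votes when there are only two classes.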

Additionally, you can select a weighting factor for the 'votes' of the nearest neighbors. For example, one might weight each vote by the reciprocal of the distance between the neighbor and the unknown sample, so that closer neighbors have a greater vote.
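
To see how reciprocal-distance weighting can change the outcome, here is a small sketch under the same assumptions as before (Euclidean distance, made-up data). The function name `weighted_knn_predict` and the `eps` guard are hypothetical and are not part of any GenePattern module.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_samples, train_labels, query, k=3, eps=1e-9):
    """Classify query by distance-weighted voting among its k nearest neighbors."""
    neighbors = sorted(
        ((math.dist(sample, query), label)
         for sample, label in zip(train_samples, train_labels)),
        key=lambda pair: pair[0],
    )[:k]
    scores = defaultdict(float)
    for distance, label in neighbors:
        # Each vote counts 1/distance; eps avoids division by zero for exact matches.
        scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)

# Made-up one-dimensional example: one very close "treated" neighbor
# and two more distant "untreated" neighbors.
train_samples = [(0.0,), (2.0,), (2.2,)]
train_labels = ["treated", "untreated", "untreated"]

weighted_prediction = weighted_knn_predict(train_samples, train_labels, query=(0.5,), k=3)
print(weighted_prediction)
```

In this toy case an unweighted majority vote would choose "untreated" (two votes to one), while the weighted vote favors the single much closer "treated" neighbor.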

<h2>1. Log in to GenePattern</h2>

<ul>
	<li>Select Broad Institute as the server.</li>
	<li>Enter your username and password.</li>
	<li>Click <em>Login to GenePattern</em>.</li>
	<li>When you are logged in, you can click the - button in the upper-right corner to collapse the cell.</li>
	<li>Alternatively, if you are prompted to log in as your username, click that button and allow a few seconds for authentication.</li>
</ul>


In [4]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))

GPAuthWidget()

<h2 id="2.-Preprocess-raw-count-data">2. Preprocess raw count data</h2>

<ul>
	<li>Preprocess RNA-Seq count data in a GCT file so that it is suitable for use in GenePattern analyses.</li>
</ul>

<div class="alert alert-info">
<p>For the <strong><em>input.file</em></strong> parameter, click &quot;Add Path or URL...&quot; then copy and paste this URL into the <em>&quot;Enter Path or URL&quot; </em>text box, and click <strong><em>Select</em></strong>:</p>

<p><a href="https://datasets.genepattern.org/data/VIB/MergedHTSeqCounts_GSE52778.gct" target="_blank">https://datasets.genepattern.org/data/VIB/MergedHTSeqCounts_GSE52778.gct</a></p>

<p>&nbsp;</p>

<p>For the <strong><em>cls.file</em></strong> parameter, click &quot;Add Path or URL...&quot; then copy and paste this URL into the <em>&quot;Enter Path or URL&quot; </em>text box, and click <strong><em>Select</em></strong>:</p>

<p><a href="https://datasets.genepattern.org/data/VIB/MergedHTSeqCounts_GSE52778.cls" target="_blank">https://datasets.genepattern.org/data/VIB/MergedHTSeqCounts_GSE52778.cls</a></p>

<p>&nbsp;</p>

<p>Click the button <strong><em>Run</em></strong> on the analysis below.</p>
</div>
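
For readers unfamiliar with the two input formats, the following sketch parses a tiny GCT (expression matrix) and CLS (class labels) file from strings. It is only an illustration of the file layouts as generally documented for GenePattern; the helper names `parse_gct` and `parse_cls`, the gene and sample names, and all values are made up, and real files may contain cases this sketch does not handle.

```python
def parse_gct(text):
    """Parse GCT v1.2 text into (sample_names, {row_name: [values]})."""
    lines = text.strip("\n").split("\n")
    if lines[0].strip() != "#1.2":
        raise ValueError("expected '#1.2' on the first line")
    # Second line: number of data rows and number of sample columns.
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    # Third line: Name, Description, then one column per sample.
    sample_names = lines[2].split("\t")[2:]
    values = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        values[fields[0]] = [float(v) for v in fields[2:]]
    if len(sample_names) != n_cols or len(values) != n_rows:
        raise ValueError("dimensions do not match the header")
    return sample_names, values

def parse_cls(text):
    """Parse categorical CLS text into one class name per sample."""
    lines = text.strip().splitlines()
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    tokens = lines[2].split()
    # Labels map to class names in order of first appearance.
    seen = []
    for token in tokens:
        if token not in seen:
            seen.append(token)
    labels = [class_names[seen.index(token)] for token in tokens]
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("label counts do not match the header")
    return labels

# Made-up miniature files for illustration only.
gct_text = "#1.2\n2\t4\nName\tDescription\tS1\tS2\tS3\tS4\nGENE_A\tna\t10\t12\t250\t240\nGENE_B\tna\t0\t1\t0\t2\n"
cls_text = "4 2 1\n# untreated treated\n0 0 1 1\n"

sample_names, values = parse_gct(gct_text)
labels = parse_cls(cls_text)
print(sample_names, values["GENE_A"], labels)
```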


In [5]:
preprocessreadcounts_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00355')
preprocessreadcounts_job_spec = preprocessreadcounts_task.make_job_spec()
preprocessreadcounts_job_spec.set_parameter("input.file", "")
preprocessreadcounts_job_spec.set_parameter("cls.file", "")
preprocessreadcounts_job_spec.set_parameter("output.file", "<input.file_basename>.preprocessed.gct")
preprocessreadcounts_job_spec.set_parameter("expression.value.filter.threshold", "1")
genepattern.display(preprocessreadcounts_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00355')

<h2 id="3.-Run-k-Nearest-Neighbors-Cross-Validation">3. Run k-Nearest Neighbors Cross Validation</h2>

<p>In the result cell for the PreprocessReadCounts job, you will see two files.</p>

<div class="alert alert-info">Click the &quot;i&quot; icon next to the MergedHTSeqCounts_GSE52778.preprocessed.gct file.</div>

<p>You will see a dialog box with several options.</p>

<div class="alert alert-info">
<ul>
	<li>Select &quot;Send to existing GenePattern Cell&quot;</li>
	<li>Choose &quot;KNNXvalidation&quot;</li>
	<li>Click Run.</li>
</ul>
</div>
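
As a rough picture of what cross-validation does, the sketch below holds out each sample in turn, predicts it from the remaining samples with an unweighted kNN vote, and reports the fraction predicted correctly. It assumes leave-one-out cross-validation and Euclidean distance, and it omits the per-fold feature selection controlled by the module's `num.features` parameter; the function names and data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_samples, train_labels, query, k):
    """Majority vote among the k training samples closest to query."""
    nearest = sorted(
        zip((math.dist(s, query) for s in train_samples), train_labels),
        key=lambda pair: pair[0],
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out(samples, labels, k=3):
    """Predict each sample from all the others; return the list of predictions."""
    predictions = []
    for i, held_out in enumerate(samples):
        # Train on every sample except the one being predicted.
        rest_samples = samples[:i] + samples[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        predictions.append(knn_predict(rest_samples, rest_labels, held_out, k))
    return predictions

# Toy two-gene expression profiles (made-up values for illustration only).
samples = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
labels = ["untreated"] * 3 + ["treated"] * 3

predictions = leave_one_out(samples, labels, k=3)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(predictions, accuracy)
```

Because the held-out sample never takes part in its own prediction, the resulting accuracy is a less optimistic estimate than accuracy measured on the training data itself.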

In [6]:
knnxvalidation_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00013')
knnxvalidation_job_spec = knnxvalidation_task.make_job_spec()
knnxvalidation_job_spec.set_parameter("data.filename", "")
knnxvalidation_job_spec.set_parameter("class.filename", "https://datasets.genepattern.org/data/VIB/input.files.list.preprocessed.cls")
knnxvalidation_job_spec.set_parameter("num.features", "10")
knnxvalidation_job_spec.set_parameter("feature.selection.statistic", "0")
knnxvalidation_job_spec.set_parameter("min.std", "")
knnxvalidation_job_spec.set_parameter("num.neighbors", "3")
knnxvalidation_job_spec.set_parameter("weighting.type", "1")
knnxvalidation_job_spec.set_parameter("distance.measure", "1")
knnxvalidation_job_spec.set_parameter("pred.results.file", "<data.filename_basename>.pred.odf")
knnxvalidation_job_spec.set_parameter("feature.summary.file", "<data.filename_basename>.feat.odf")
genepattern.display(knnxvalidation_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00013')

## 4. View prediction results

### a. Read the results into a dataframe


<div class="alert alert-info">
<ul>
	<li>Select the cell containing the job result by clicking anywhere in it.</li>
	<li>Click the &quot;i&quot; icon next to the <code>MergedHTSeqCounts_GSE52778.preprocessed.pred.odf</code> file.</li>
	<li>Select &quot;Send to DataFrame&quot;.</li>
	<li>You will see a new cell that contains 3 lines of code starting with <code>from gp.data import ODF</code>.</li>
	<li>Execute this cell.</li>
	<li>You will see the prediction results as a table.</li>
</ul>
</div>
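
Once the predictions are available as a table, overall accuracy and a confusion matrix can be tallied from the true and predicted labels. The sketch below uses plain dictionaries in place of DataFrame rows so that it runs on its own; the column names `True Class` and `Predicted Class` and all row values are assumptions for illustration and should be checked against the table actually displayed.

```python
from collections import Counter

# Hypothetical rows standing in for the prediction table; values are made up.
rows = [
    {"Samples": "s1", "True Class": "treated", "Predicted Class": "treated"},
    {"Samples": "s2", "True Class": "treated", "Predicted Class": "untreated"},
    {"Samples": "s3", "True Class": "untreated", "Predicted Class": "untreated"},
    {"Samples": "s4", "True Class": "untreated", "Predicted Class": "untreated"},
]

def summarize(rows, true_col="True Class", pred_col="Predicted Class"):
    """Return a (true, predicted) -> count mapping and the overall accuracy."""
    confusion = Counter((row[true_col], row[pred_col]) for row in rows)
    correct = sum(n for (true, pred), n in confusion.items() if true == pred)
    return confusion, correct / len(rows)

confusion, accuracy = summarize(rows)
for (true, pred), count in sorted(confusion.items()):
    print(f"true={true:<10} predicted={pred:<10} n={count}")
print("accuracy:", accuracy)
```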


### b. View prediction results

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol>
    <li>For the <q>prediction results file</q> parameter below, click the down arrow in the file input box.</li>
    <li>Right-click the <code>MergedHTSeqCounts_GSE52778.preprocessed.pred.odf</code> file above and select <q>Copy link address</q>.</li>
    <li>Paste the link into the <q>Prediction Results File</q> parameter.</li>
    <li>Click <strong><em>Run</em></strong>.</li>
    <li>You will see the prediction results in an interactive viewer.</li>
</ol>
</div>


In [8]:
predictionresultsviewer_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00019')
predictionresultsviewer_job_spec = predictionresultsviewer_task.make_job_spec()
predictionresultsviewer_job_spec.set_parameter("prediction.results.file", "")
predictionresultsviewer_job_spec.set_parameter("job.memory", "2 Gb")
predictionresultsviewer_job_spec.set_parameter("job.walltime", "02:00:00")
predictionresultsviewer_job_spec.set_parameter("job.cpuCount", "1")
genepattern.display(predictionresultsviewer_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00019')

## References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. 1984. [Classification and regression trees](https://www.amazon.com/Classification-Regression-Wadsworth-Statistics-Probability/dp/0412048418?ie=UTF8&*Version*=1&*entries*=0). Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, CA.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. [Science 286:531-537](http://science.sciencemag.org/content/286/5439/531.long).

Lu, J., Getz, G., Miska, E.A., Alvarez-Saavedra, E., Lamb, J., Peck, D., Sweet-Cordero, A., Ebert, B.L., Mak, R.H., Ferrando, A.A., Downing, J.R., Jacks, T., Horvitz, H.R., Golub, T.R. 2005. MicroRNA expression profiles classify human cancers. [Nature 435:834-838](http://www.nature.com/nature/journal/v435/n7043/full/nature03702.html).

Rifkin, R., Mukherjee, S., Tamayo, P., Ramaswamy, S., Yeang, C.-H., Angelo, M., Reich, M., Poggio, T., Lander, E.S., Golub, T.R., Mesirov, J.P. 2003. An Analytical Method for Multiclass Molecular Cancer Classification. [SIAM Review 45(4):706-723](http://epubs.siam.org/doi/abs/10.1137/S0036144502411986).

Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. 2000. Class prediction and discovery using gene expression data. In [Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB)](http://dl.acm.org/citation.cfm?id=332564). ACM Press, New York. pp. 263-272.