tumor origin detection using a deep neural network

Neural Networks for detecting tumor origin

MicroRNA results available in analysis.ipynb, viewable here

DNAm results available in analysis-dnam.ipynb, viewable here

Workflow:

This project uses virtualenv to create isolated Python environments.

MicroRNA

  • Download isoforms from 17 different classes of cancer from TCGA
In R, on nano cluster
  • Put all samples of the same type into a matrix using rptashkin's TCGA_miRNASeq_Matrix (rows are features; columns are samples)
  • Merge matrices
  • Transpose
  • Randomize, split labels
In Python, on nano cluster
  • Select features based on low NA-values
  • Put all samples of the same type into a matrix using rptashkin's TCGA_miRNASeq_Matrix (rows are features; columns are samples)
  • Merge matrices
  • Transpose
Jupyter Notebook
  • Test random forest, KNN, and SVM baselines
  • Visualize keras tuning data from cluster
  • Attempt cross validation
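
The matrix steps above (merge, transpose, randomize, split labels) can be sketched in Python with NumPy. This is only an illustration under assumptions: the README says these steps run in R with rptashkin's TCGA_miRNASeq_Matrix, and the function name `build_dataset`, the dict input, and the `seed` parameter are hypothetical, not taken from this repository.

```python
import numpy as np

def build_dataset(class_matrices, seed=0):
    """Merge per-class matrices, transpose, shuffle, and split off labels.

    class_matrices: dict mapping a cancer-type label to an array of shape
    (n_features, n_samples), i.e. rows are features and columns are samples.
    All matrices are assumed to share the same feature order.
    """
    blocks, labels = [], []
    for label, matrix in class_matrices.items():
        matrix = np.asarray(matrix, dtype=float)
        blocks.append(matrix)
        # One label per sample (column) in this class.
        labels.extend([label] * matrix.shape[1])

    merged = np.hstack(blocks)   # features x all samples
    X = merged.T                 # samples x features
    y = np.array(labels)

    # Shuffle rows and labels with the same permutation.
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    return X[order], y[order]
```

Applying one permutation to both `X` and `y` keeps each sample aligned with its label after shuffling.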

DNA Methylation

  • Download 27k Illumina samples from TCGA using TCGA2STAT
In R, on nano cluster
  • Get data from TCGA using tcga2stat.R
  • Select features based on low NA-values
  • Select for high variability (20-80 percentile)
  • Merge samples into one data matrix
  • Randomize, split labels
In Python, on nano cluster
  • Baseline models to gauge accuracy before feature selection
  • Tune nnet hyperparameters
Jupyter Notebook
  • Visualize tuning data
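
The two feature-selection steps listed above (low NA-values, then variability in the 20-80 percentile range) might look like the following NumPy sketch. It rests on assumptions: the README runs these steps in R, the 10% NA threshold is a placeholder, and "20-80 percentile" is read here as keeping features whose variance lies between the 20th and 80th percentiles of per-feature variance. The function name `select_features` and its parameters are hypothetical.

```python
import numpy as np

def select_features(X, max_na_fraction=0.1, low_pct=20, high_pct=80):
    """Return column indices that pass the NA filter and the variance band.

    X: array of shape (n_samples, n_features), with NaN for missing values.
    max_na_fraction: assumed threshold; the README does not state one.
    """
    X = np.asarray(X, dtype=float)

    # Step 1: keep features with few missing values.
    na_fraction = np.isnan(X).mean(axis=0)
    keep_na = na_fraction <= max_na_fraction

    # Step 2: per-feature variance, ignoring NaNs, for surviving features only.
    variances = np.full(X.shape[1], np.nan)
    variances[keep_na] = np.nanvar(X[:, keep_na], axis=0)

    # Keep features whose variance falls inside the percentile band.
    lo, hi = np.nanpercentile(variances, [low_pct, high_pct])
    keep = keep_na & (variances >= lo) & (variances <= hi)
    return np.flatnonzero(keep)
```

The percentiles are computed only over features that survive the NA filter, so heavily missing probes do not shift the band.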

Todo:

  • Does feature selection improve random forest model?
  • Does feature selection improve NNet model?
  • Scaling (0,1)
  • Try KNN, SVM, baselines
  • High variability feature selection
  • Process methylation data
  • Import additional metastatic datasets
  • Attempt on non-TCGA datasets
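
For the "Scaling (0,1)" item, a minimal min-max scaling sketch in NumPy follows. The function names `fit_min_max` and `apply_min_max` are hypothetical, and fitting the scaling on the training split only is an assumption about how it would be used, not something the README specifies.

```python
import numpy as np

def fit_min_max(X_train):
    """Compute per-feature minimum and range from the training data."""
    X_train = np.asarray(X_train, dtype=float)
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    # Constant features have zero range; use 1.0 to avoid division by zero.
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def apply_min_max(X, col_min, col_range):
    """Scale features using statistics computed on the training data."""
    return (np.asarray(X, dtype=float) - col_min) / col_range
```

Training rows land in [0, 1]; held-out rows can fall slightly outside that interval, since they are scaled with the training minimum and range.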

References

This work utilizes resources supported by the National Science Foundation's Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign.