GitHub - med-at-scale/pop-strat: Population stratification on adam

Population stratification with ADAM

Data sets from 1000genomes.

The list of 1000genomes samples (with population) is given in: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.panel

This file is the source of the population labels and is used to extract sample subsets (e.g. 5 sample ids per population).
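The subset extraction can be sketched as below. This is a minimal illustration, not the project's code: it assumes the panel file is tab-separated with the sample id in the first column and the population in the second, and `PanelSubset` / `samplesPerPopulation` are hypothetical names.

```scala
// Sketch: pick up to n sample ids per population from the lines of the panel file.
// ASSUMPTION: columns are tab-separated, sample id first, population second.
object PanelSubset {
  def samplesPerPopulation(lines: Seq[String], n: Int): Map[String, Seq[String]] =
    lines
      .map(_.split("\t"))
      // skip blank or malformed lines that lack a population column
      .collect { case cols if cols.length >= 2 => (cols(1), cols(0)) }
      // group sample ids by population; order within a group follows the file
      .groupBy(_._1)
      .map { case (population, pairs) => population -> pairs.map(_._2).take(n) }
}
```

With a local copy of the panel file, the lines could be read with `scala.io.Source.fromFile(path).getLines().toSeq` and passed in with `n = 5`.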

Variation subsets are extracted from the 1000genomes browser. Examples of region selections:

6:133017695-133161157 => 3481 variations

6:133017695-133031157 => 292 variations

6:133017695-133019157 => 41 variations

6:133017695-133017814 => 5 variations
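To work with these selections programmatically, a region string can be parsed into a small value type. The sketch below is illustrative only: `Region` is a hypothetical helper, and treating both coordinates as inclusive is an assumption.

```scala
// Sketch: parse a "contig:start-end" selection such as "6:133017695-133017814".
// ASSUMPTION: start and end are both inclusive positions on the contig.
final case class Region(contig: String, start: Long, end: Long) {
  // number of positions covered by the region
  def length: Long = end - start + 1

  // true when the given position falls inside the region
  def contains(c: String, pos: Long): Boolean =
    c == contig && pos >= start && pos <= end
}

object Region {
  private val Pattern = """([^:]+):(\d+)-(\d+)""".r

  // returns None when the string does not have the expected shape
  def parse(s: String): Option[Region] = s.trim match {
    case Pattern(c, a, b) => Some(Region(c, a.toLong, b.toLong))
    case _                => None
  }
}
```

In a notebook or REPL, the class and its companion object need to be defined in the same cell (or via `:paste`) so that they are compiled together.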

VCF files are obtained from the 1000genomes data browser at http://browser.1000genomes.org/Homo_sapiens/Info/Index

Deploy on spark-notebook on ec2

The build must depend on the jars of the right Spark version. Add them in the project/build.scala file:

      unmanagedJars in Compile  += file("/root/spark/lib/datanucleus-api-jdo-3.2.1.jar"),
      unmanagedJars in Compile  += file("/root/spark/lib/datanucleus-core-3.2.2.jar"),
      unmanagedJars in Compile  += file("/root/spark/lib/datanucleus-rdbms-3.2.1.jar"),
      unmanagedJars in Compile  += file("/root/spark/lib/spark-assembly-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar")

Also remove the other Spark dependencies to avoid any conflict:

      //libraryDependencies ++= Seq(
      //  sparkRepl,
      //  sparkSQL
      //),

In the notebook, import the ADAM dependencies from Maven Central using:

      resolveAndAddToJars("org.bdgenomics.adam" , "adam-core" , "0.14.0")

Since this also pulls in spark-core and other Spark artifacts, they must be removed from the jars added to the SparkContext:

      jars = jars.filterNot(_.contains("spark-"))
      // DITTO in the :cp block!
      reset()
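The effect of that filter can be checked outside the notebook on a plain list of paths. The sketch below is only an illustration: `JarFilter` and the sample paths are hypothetical, and the stricter file-name variant is a suggestion, not something the project uses.

```scala
import java.io.File

object JarFilter {
  // Same predicate as in the notebook: drops every path that contains "spark-"
  // anywhere, including in a parent directory name.
  def dropSparkJars(jars: Seq[String]): Seq[String] =
    jars.filterNot(_.contains("spark-"))

  // Stricter variant (an assumption, not from the project): only inspects the
  // jar file name, so jars stored under a "spark-..." directory are kept.
  def dropSparkJarsByName(jars: Seq[String]): Seq[String] =
    jars.filterNot(path => new File(path).getName.startsWith("spark-"))
}
```

The substring check is the simplest option, but it also removes unrelated jars whose path happens to contain "spark-". Matching on the file name avoids that at the cost of missing Spark jars that were renamed.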
