- SRA metadata scraped from downloaded XML files then initially tidied in
notebooks/xml_scraping.Rmd
- Population labels futher categorized by region, continent, and species in
notebooks/categorizing.Rmd
- Coverage of samples in tidied dataset assessed
notebooks/coverage.Rmd
- Duplicate and low quality samples removed from metadata in
source_functions/remove_samples.R
Resulting sample metadata file: data/derived_data/metadata/bovine_demo.sample_metadata.csv
- I know little to nothing about yak breeds, but where I could I tried to use the same population designations for yak samples that were used in the papers they came from. These include:
- Wild yaks (Bos mutus)
- Datong yaks which were recently developed as a cross between wild and domesticated yaks to be hornless
- Tianzhu white yaks, bred in the Qilian mountains of Gansu province
- Jinchuan yaks, which typically have an additional thoracic vertebra and are found in Sichuan province
- Qinghai-Tibet Plateau (QTP) yaks
- Cattle samples with their species listed as "composite" are known taurus/indicus hybrids. These include:
- Breeds intentionally developed within the last 100 years in America and Australia (Beefmaster, Droughtmaster, Santa Gertrudis)
- African sanga & zenga breeds (Ankole, Boran, Fulani)
- Asian advanced generation composites (all others)
Evaluation of computing resources used for each step of genotype calling can be found in notebooks/psrecord.Rmd
with results in html/psrecord.html
- Ancient samples pre-processed and haplotypes called in
source_functions/ancient_preprocess.snakefile
- Trim 5 bp from ends of reads using
bamUtil trimBam
- Realign indels using
GATK IndelRealigner
- Call haplotypes using
GATK HaplotypeCaller
- Trim 5 bp from ends of reads using
- All samples combined and genotypes called in
source_functions/joint_genotyping.snakefile
- Sample cohorts combined using
GATK CombineGVCFs
(cohorts determined insource_functions/genotyping_cohorts.R
) - Joint genotypes called using
GATK GenotypeGVCFs
- Sample cohorts combined using
- After joint genotype calling, INFO field filter values and depth of coverage at each variant on chromosome 28 extracted in
source_functions/filter_eval.snakefile
usingGATK VariantsToTable
andvcftools --site-mean-depth
. Descriptive statistics & distribution of these values explored innotebooks/filter_eval.Rmd
. Results can be found inhtml/filter_eval.html
and were used to inform filtering cutoffs in the next step - Callset filtered in
source_functions/joint_genotyping.snakefile
- Variants restricted to biallelic SNPs using
GATK SelectVariants
- Site-level
and genotype-levelfiltering annotated usingGATK VariantFiltration
. Then failing sites removedand failing genotypes set to missingusingGATK SelectVariants
- Based on results in
notebooks/filter_eval.Rmd
, data processed without genotype-level filtration starting July 23, 2020
- Based on results in
- Variants restricted to biallelic SNPs using
- Summary stats for each chromosome generated using
Picard CollectVariantCallingMetrics
insource_functions/post_process.snakefile
then evaluated insource_functions/post_process.Rmd
, VCF format checked usingGATK ValidateVariants
.CollectVariantCallingMetrics
results:- data/derived_data/joint_genotyping/bovine_demo.variant_metrics.summary_chr.csv contains a summary by chromosome
- data/derived_data/joint_genotyping/bovine_demo.variant_metrics.detail_wg.csv contains a summary by sample averaged/summed across all chromosomes
- data/derived_data/joint_genotyping/bovine_demo.variant_metrics.detail_chr.csv contains a summary by sample and by chromosome
- Duplicate samples identified based on chromosome 28 and chromosome 29 variants using
king
insource_functions/find_dups.sh
. Duplicates and other low quality samples + sites that are no longer variant after sample removal removed insource_functions/post_process.snakemake
See source_functions/phasing.snakefile
and notebooks/phasing.Rmd
- In order to phase X chromosome, missing sexes imputed and incorrectly assigned sexes fixed in
source_functions/sex_imputation.snakefile
andnotebooks/sex_imputation.Rmd
- Ended up using the ratio of average coverage on the X chromosome/average coverage on all autosomes to determine cutoffs. Of all other tested metrics, I think this the only one that should be similar across all species in the dataset + agnostic to effective population size
- Genetic map inferred using several published cattle recombination maps TODO
- Pre-phasing QC in
source_functions/plink_qc.snakefile
- For all chromosomes, sites with > 10% missingness removed
- For all chromosomes, listed sex updated to imputed sex
- Pseudo-autosomal region removed from X chromosome
- Heterozygous genotypes set to missing on Y chromosome
- Derbyshire auroch ID is too long and throws errors in downstream analyses, rename it from "CPC98_Bprimigenius_EnglandDerbyshire_5936" to "ancient_derbyshire"
- Phase autosomes and sex chromosomes separately using
SHAPEIT
TODO
- Downsampling & variant thinning performed in
source_functions/plink_qc.snakefile
- Initial QC same as pre-phasing QC above
- Variants removed from each chromosome with an X% probability of being retained
- Individuals downsampled based on dataset of interest
fastStructure
ran insource_functions/faststructure.bovine_demo.snakefile
, seenotebooks/faststructure.Rmd
for thinning/downsampling dataset designations & resultsEIGENSOFT smartpca
ran insource_functions/smartpca.snakefile
, seenotebooks/smartpca.Rmd
for thinning/downsampling dataset designations & results- For all datasets including ancient samples: eigenvectors calculated without them --> ancients samples projected onto PC space
SMC++
TreeMix