Skip to content

Demographic and selection history of cattle and related species

Notifications You must be signed in to change notification settings

harlydurbin/bovine_demo

Repository files navigation

Demographic and selection history of cattle and related species

Meta-data processing

  1. SRA metadata scraped from downloaded XML files then initially tidied in notebooks/xml_scraping.Rmd
  2. Population labels futher categorized by region, continent, and species in notebooks/categorizing.Rmd
  3. Coverage of samples in tidied dataset assessed notebooks/coverage.Rmd
  4. Duplicate and low quality samples removed from metadata in source_functions/remove_samples.R

Resulting sample metadata file: data/derived_data/metadata/bovine_demo.sample_metadata.csv

Notes about population label assignment

  • I know little to nothing about yak breeds, but where I could I tried to use the same population designations for yak samples that were used in the papers they came from. These include:
  • Cattle samples with their species listed as "composite" are known taurus/indicus hybrids. These include:
    • Breeds intentionally developed within the last 100 years in America and Australia (Beefmaster, Droughtmaster, Santa Gertrudis)
    • African sanga & zenga breeds (Ankole, Boran, Fulani)
    • Asian advanced generation composites (all others)

Genotype calling

Evaluation of computing resources used for each step of genotype calling can be found in notebooks/psrecord.Rmd with results in html/psrecord.html

  1. Ancient samples pre-processed and haplotypes called in source_functions/ancient_preprocess.snakefile
    • Trim 5 bp from ends of reads using bamUtil trimBam
    • Realign indels using GATK IndelRealigner
    • Call haplotypes using GATK HaplotypeCaller
  2. All samples combined and genotypes called in source_functions/joint_genotyping.snakefile
    • Sample cohorts combined using GATK CombineGVCFs (cohorts determined in source_functions/genotyping_cohorts.R)
    • Joint genotypes called using GATK GenotypeGVCFs
  3. After joint genotype calling, INFO field filter values and depth of coverage at each variant on chromosome 28 extracted in source_functions/filter_eval.snakefile using GATK VariantsToTable and vcftools --site-mean-depth. Descriptive statistics & distribution of these values explored in notebooks/filter_eval.Rmd. Results can be found in html/filter_eval.html and were used to inform filtering cutoffs in the next step
  4. Callset filtered in source_functions/joint_genotyping.snakefile
    • Variants restricted to biallelic SNPs using GATK SelectVariants
    • Site-level and genotype-level filtering annotated using GATK VariantFiltration. Then failing sites removed and failing genotypes set to missing using GATK SelectVariants
      • Based on results in notebooks/filter_eval.Rmd, data processed without genotype-level filtration starting July 23, 2020
  5. Summary stats for each chromosome generated using Picard CollectVariantCallingMetrics in source_functions/post_process.snakefile then evaluated in source_functions/post_process.Rmd, VCF format checked using GATK ValidateVariants. CollectVariantCallingMetrics results:
    • data/derived_data/joint_genotyping/bovine_demo.variant_metrics.summary_chr.csv contains a summary by chromosome
    • data/derived_data/joint_genotyping/bovine_demo.variant_metrics.detail_wg.csv contains a summary by sample averaged/summed across all chromosomes
    • data/derived_data/joint_genotyping/bovine_demo.variant_metrics.detail_chr.csv contains a summary by sample and by chromosome
  6. Duplicate samples identified based on chromosome 28 and chromosome 29 variants using king in source_functions/find_dups.sh. Duplicates and other low quality samples + sites that are no longer variant after sample removal removed in source_functions/post_process.snakemake

Phasing

See source_functions/phasing.snakefile and notebooks/phasing.Rmd

  1. In order to phase X chromosome, missing sexes imputed and incorrectly assigned sexes fixed in source_functions/sex_imputation.snakefile and notebooks/sex_imputation.Rmd
    • Ended up using the ratio of average coverage on the X chromosome/average coverage on all autosomes to determine cutoffs. Of all other tested metrics, I think this the only one that should be similar across all species in the dataset + agnostic to effective population size
  2. Genetic map inferred using several published cattle recombination maps TODO
  3. Pre-phasing QC in source_functions/plink_qc.snakefile
    • For all chromosomes, sites with > 10% missingness removed
    • For all chromosomes, listed sex updated to imputed sex
    • Pseudo-autosomal region removed from X chromosome
    • Heterozygous genotypes set to missing on Y chromosome
    • Derbyshire auroch ID is too long and throws errors in downstream analyses, rename it from "CPC98_Bprimigenius_EnglandDerbyshire_5936" to "ancient_derbyshire"
  4. Phase autosomes and sex chromosomes separately using SHAPEIT TODO

Exploratory analyses

  • Downsampling & variant thinning performed in source_functions/plink_qc.snakefile
    • Initial QC same as pre-phasing QC above
    • Variants removed from each chromosome with an X% probability of being retained
    • Individuals downsampled based on dataset of interest
  • fastStructure ran in source_functions/faststructure.bovine_demo.snakefile, see notebooks/faststructure.Rmd for thinning/downsampling dataset designations & results
  • EIGENSOFT smartpca ran in source_functions/smartpca.snakefile, see notebooks/smartpca.Rmd for thinning/downsampling dataset designations & results
    • For all datasets including ancient samples: eigenvectors calculated without them --> ancients samples projected onto PC space
  • SMC++
  • TreeMix

About

Demographic and selection history of cattle and related species

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages