Demographic and selection history of cattle and related species

Meta-data processing

SRA metadata scraped from downloaded XML files then initially tidied in notebooks/xml_scraping.Rmd
Population labels futher categorized by region, continent, and species in notebooks/categorizing.Rmd
Coverage of samples in tidied dataset assessed notebooks/coverage.Rmd
Duplicate and low quality samples removed from metadata in source_functions/remove_samples.R

Resulting sample metadata file: data/derived_data/metadata/bovine_demo.sample_metadata.csv

Notes about population label assignment

I know little to nothing about yak breeds, but where I could I tried to use the same population designations for yak samples that were used in the papers they came from. These include:
- Wild yaks (Bos mutus)
- Datong yaks which were recently developed as a cross between wild and domesticated yaks to be hornless
- Tianzhu white yaks, bred in the Qilian mountains of Gansu province
- Jinchuan yaks, which typically have an additional thoracic vertebra and are found in Sichuan province
- Qinghai-Tibet Plateau (QTP) yaks
Cattle samples with their species listed as "composite" are known taurus/indicus hybrids. These include:
- Breeds intentionally developed within the last 100 years in America and Australia (Beefmaster, Droughtmaster, Santa Gertrudis)
- African sanga & zenga breeds (Ankole, Boran, Fulani)
- Asian advanced generation composites (all others)

Genotype calling

Evaluation of computing resources used for each step of genotype calling can be found in notebooks/psrecord.Rmd with results in html/psrecord.html

Ancient samples pre-processed and haplotypes called in source_functions/ancient_preprocess.snakefile
- Trim 5 bp from ends of reads using bamUtil trimBam
- Realign indels using GATK IndelRealigner
- Call haplotypes using GATK HaplotypeCaller
All samples combined and genotypes called in source_functions/joint_genotyping.snakefile
- Sample cohorts combined using GATK CombineGVCFs (cohorts determined in source_functions/genotyping_cohorts.R)
- Joint genotypes called using GATK GenotypeGVCFs
After joint genotype calling, INFO field filter values and depth of coverage at each variant on chromosome 28 extracted in source_functions/filter_eval.snakefile using GATK VariantsToTable and vcftools --site-mean-depth. Descriptive statistics & distribution of these values explored in notebooks/filter_eval.Rmd. Results can be found in html/filter_eval.html and were used to inform filtering cutoffs in the next step
Callset filtered in source_functions/joint_genotyping.snakefile
- Variants restricted to biallelic SNPs using GATK SelectVariants
- Site-level ~~and genotype-level~~ filtering annotated using GATK VariantFiltration. Then failing sites removed ~~and failing genotypes set to missing~~ using GATK SelectVariants
  - Based on results in notebooks/filter_eval.Rmd, data processed without genotype-level filtration starting July 23, 2020
Summary stats for each chromosome generated using Picard CollectVariantCallingMetrics in source_functions/post_process.snakefile then evaluated in source_functions/post_process.Rmd, VCF format checked using GATK ValidateVariants. CollectVariantCallingMetrics results:
- data/derived_data/joint_genotyping/bovine_demo.variant_metrics.summary_chr.csv contains a summary by chromosome
- data/derived_data/joint_genotyping/bovine_demo.variant_metrics.detail_wg.csv contains a summary by sample averaged/summed across all chromosomes
- data/derived_data/joint_genotyping/bovine_demo.variant_metrics.detail_chr.csv contains a summary by sample and by chromosome
Duplicate samples identified based on chromosome 28 and chromosome 29 variants using king in source_functions/find_dups.sh. Duplicates and other low quality samples + sites that are no longer variant after sample removal removed in source_functions/post_process.snakemake

Phasing

See source_functions/phasing.snakefile and notebooks/phasing.Rmd

In order to phase X chromosome, missing sexes imputed and incorrectly assigned sexes fixed in source_functions/sex_imputation.snakefile and notebooks/sex_imputation.Rmd
- Ended up using the ratio of average coverage on the X chromosome/average coverage on all autosomes to determine cutoffs. Of all other tested metrics, I think this the only one that should be similar across all species in the dataset + agnostic to effective population size
Genetic map inferred using several published cattle recombination maps TODO
Pre-phasing QC in source_functions/plink_qc.snakefile
- For all chromosomes, sites with > 10% missingness removed
- For all chromosomes, listed sex updated to imputed sex
- Pseudo-autosomal region removed from X chromosome
- Heterozygous genotypes set to missing on Y chromosome
- Derbyshire auroch ID is too long and throws errors in downstream analyses, rename it from "CPC98_Bprimigenius_EnglandDerbyshire_5936" to "ancient_derbyshire"
Phase autosomes and sex chromosomes separately using SHAPEIT TODO

Exploratory analyses

Downsampling & variant thinning performed in source_functions/plink_qc.snakefile
- Initial QC same as pre-phasing QC above
- Variants removed from each chromosome with an X% probability of being retained
- Individuals downsampled based on dataset of interest
fastStructure ran in source_functions/faststructure.bovine_demo.snakefile, see notebooks/faststructure.Rmd for thinning/downsampling dataset designations & results
EIGENSOFT smartpca ran in source_functions/smartpca.snakefile, see notebooks/smartpca.Rmd for thinning/downsampling dataset designations & results
- For all datasets including ancient samples: eigenvectors calculated without them --> ancients samples projected onto PC space
SMC++
TreeMix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Demographic and selection history of cattle and related species

Meta-data processing

Notes about population label assignment

Genotype calling

Phasing

Exploratory analyses

Files

README.md

Latest commit

History

README.md

File metadata and controls

Demographic and selection history of cattle and related species

Meta-data processing

Notes about population label assignment

Genotype calling

Phasing

Exploratory analyses