Demographic and selection history of cattle and related species

Meta-data processing

SRA metadata scraped from downloaded XML files then initially tidied in notebooks/xml_scraping.Rmd
Population labels further categorized by region, continent, and species in notebooks/categorizing.Rmd
Coverage of samples in tidied dataset assessed in notebooks/coverage.Rmd
Duplicate and low-quality samples removed from metadata in source_functions/remove_samples.R

Resulting sample metadata file: data/derived_data/metadata/bovine_demo.sample_metadata.csv

Notes about population label assignment

I know little to nothing about yak breeds, but where I could, I tried to use the same population designations for yak samples that were used in the papers they came from. These include:
- Wild yaks (Bos mutus)
- Datong yaks, which were recently developed as a cross between wild and domesticated yaks and bred to be hornless
- Tianzhu white yaks, bred in the Qilian mountains of Gansu province
- Jinchuan yaks, which typically have an additional thoracic vertebra and are found in Sichuan province
- Qinghai-Tibet Plateau (QTP) yaks
Cattle samples with their species listed as "composite" are known taurus/indicus hybrids. These include:
- Breeds intentionally developed within the last 100 years in America and Australia (Beefmaster, Droughtmaster, Santa Gertrudis)
- African sanga & zenga breeds (Ankole, Boran, Fulani)
- Asian advanced generation composites (all others)

Genotype calling

Evaluation of computing resources used for each step of genotype calling can be found in notebooks/psrecord.Rmd with results in html/psrecord.html

Ancient samples pre-processed and haplotypes called in source_functions/ancient_preprocess.snakefile
- Trim 5 bp from ends of reads using bamUtil trimBam
- Realign indels using GATK IndelRealigner
- Call haplotypes using GATK HaplotypeCaller
All samples combined and genotypes called in source_functions/joint_genotyping.snakefile
- Sample cohorts combined using GATK CombineGVCFs (cohorts determined in source_functions/genotyping_cohorts.R)
- Joint genotypes called using GATK GenotypeGVCFs
After joint genotype calling, INFO field filter values and depth of coverage at each variant on chromosome 28 extracted in source_functions/filter_eval.snakefile using GATK VariantsToTable and vcftools --site-mean-depth. Descriptive statistics & distribution of these values explored in notebooks/filter_eval.Rmd. Results can be found in html/filter_eval.html and were used to inform filtering cutoffs in the next step
Callset filtered in source_functions/joint_genotyping.snakefile
- Variants restricted to biallelic SNPs using GATK SelectVariants
- Site-level ~~and genotype-level~~ filtering annotated using GATK VariantFiltration. Then failing sites removed ~~and failing genotypes set to missing~~ using GATK SelectVariants
  - Based on results in notebooks/filter_eval.Rmd, data processed without genotype-level filtration starting July 23, 2020
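As a rough illustration of what the site-level filtering step does, the sketch below flags a site when any INFO annotation crosses a threshold. The thresholds are GATK's generic hard-filter suggestions for SNPs and serve only as placeholders: the cutoffs actually chosen from notebooks/filter_eval.Rmd are not listed in this README. The names `SITE_FILTERS`, `failed_filters`, and `keep_site` are hypothetical.

```python
# Placeholder thresholds: GATK's generic SNP hard-filter suggestions,
# NOT the cutoffs chosen in notebooks/filter_eval.Rmd.
SITE_FILTERS = {
    "QD": lambda v: v < 2.0,    # low quality-by-depth
    "FS": lambda v: v > 60.0,   # strong strand bias (Fisher)
    "MQ": lambda v: v < 40.0,   # low mapping quality
    "SOR": lambda v: v > 3.0,   # strong strand bias (odds ratio)
}


def failed_filters(info):
    """Return the sorted names of site-level filters that a variant fails.

    `info` maps INFO annotation names to floats. Annotations that are
    absent are skipped rather than treated as failures.
    """
    return sorted(
        name
        for name, fails in SITE_FILTERS.items()
        if name in info and fails(info[name])
    )


def keep_site(info):
    """A site is retained only if it fails no site-level filter."""
    return not failed_filters(info)
```

Sites are dropped whole and no individual genotype is set to missing, which matches the decision to skip genotype-level filtration.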
Summary stats for each chromosome generated using Picard CollectVariantCallingMetrics in source_functions/post_process.snakefile then evaluated in source_functions/post_process.Rmd. VCF format checked using GATK ValidateVariants. CollectVariantCallingMetrics results:
- data/derived_data/joint_genotyping/bovine_demo.variant_metrics.summary_chr.csv contains a summary by chromosome
- data/derived_data/joint_genotyping/bovine_demo.variant_metrics.detail_wg.csv contains a summary by sample averaged/summed across all chromosomes
- data/derived_data/joint_genotyping/bovine_demo.variant_metrics.detail_chr.csv contains a summary by sample and by chromosome
Duplicate samples identified based on chromosome 28 and chromosome 29 variants using KING in source_functions/find_dups.sh. Duplicates and other low-quality samples, plus sites that are no longer variant after sample removal, removed in source_functions/post_process.snakemake
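A minimal sketch of how duplicate pairs could be pulled out of KING's pairwise kinship table. It assumes a table with `ID1`, `ID2`, and `Kinship` columns and uses 0.354, the lower bound KING documents for duplicates/MZ twins, as the threshold. The threshold and output file actually used are in source_functions/find_dups.sh, not here, and the function name and sample IDs below are hypothetical.

```python
# KING's documented lower bound on kinship for duplicates / MZ twins.
# Assumed here; the cutoff used in find_dups.sh may differ.
DUPLICATE_KINSHIP = 0.354


def find_duplicate_pairs(lines, threshold=DUPLICATE_KINSHIP):
    """Return (ID1, ID2, kinship) for every pair above `threshold`.

    `lines` is an iterable of whitespace-delimited rows whose first row
    is a header containing the columns ID1, ID2, and Kinship.
    """
    rows = iter(lines)
    header = next(rows).split()
    id1, id2, kin = (header.index(col) for col in ("ID1", "ID2", "Kinship"))

    pairs = []
    for line in rows:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        kinship = float(fields[kin])
        if kinship > threshold:
            pairs.append((fields[id1], fields[id2], kinship))
    return pairs


# Hypothetical rows in the layout described above
example = [
    "FID1\tID1\tFID2\tID2\tN_SNP\tHetHet\tIBS0\tKinship",
    "a\tsampleA\tb\tsampleB\t1000\t0.30\t0.0000\t0.4990",
    "a\tsampleA\tc\tsampleC\t1000\t0.10\t0.0200\t0.0312",
]
duplicates = find_duplicate_pairs(example)
```

One member of each flagged pair would then be added to the list of samples to remove.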

Phasing

See source_functions/phasing.snakefile and notebooks/phasing.Rmd

In order to phase X chromosome, missing sexes imputed and incorrectly assigned sexes fixed in source_functions/sex_imputation.snakefile and notebooks/sex_imputation.Rmd
- Ended up using the ratio of average coverage on the X chromosome/average coverage on all autosomes to determine cutoffs. Of all the metrics tested, I think this is the only one that should be similar across all species in the dataset + agnostic to effective population size
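The coverage-ratio idea can be sketched as below. Males carry one X chromosome, so their X/autosome coverage ratio should sit near 0.5, while females should sit near 1. The cutoffs `male_max` and `female_min` are hypothetical placeholders, not the values chosen in notebooks/sex_imputation.Rmd, and averaging the per-chromosome means (rather than weighting by chromosome length) is also an assumption.

```python
def x_to_autosome_ratio(x_coverage, autosome_coverages):
    """Average X coverage divided by average autosomal coverage.

    `autosome_coverages` holds one average coverage value per autosome.
    """
    autosome_mean = sum(autosome_coverages) / len(autosome_coverages)
    return x_coverage / autosome_mean


def impute_sex(ratio, male_max=0.6, female_min=0.8):
    """Assign "M", "F", or None (ambiguous) from the coverage ratio.

    The default cutoffs are illustrative only.
    """
    if ratio <= male_max:
        return "M"
    if ratio >= female_min:
        return "F"
    return None  # falls between the cutoffs; leave sex unassigned


# Hypothetical samples: (average X coverage, per-autosome average coverage)
likely_male = impute_sex(x_to_autosome_ratio(5.0, [10.0, 10.0]))
likely_female = impute_sex(x_to_autosome_ratio(9.6, [10.0, 10.0]))
```

Because the ratio is computed within each sample, it does not depend on the sample's overall sequencing depth.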
Genetic map inferred using several published cattle recombination maps TODO
Pre-phasing QC in source_functions/plink_qc.snakefile
- For all chromosomes, sites with > 10% missingness removed
- For all chromosomes, listed sex updated to imputed sex
- Pseudo-autosomal region removed from X chromosome
- Heterozygous genotypes set to missing on Y chromosome
- Derbyshire aurochs ID is too long and throws errors in downstream analyses, so it is renamed from "CPC98_Bprimigenius_EnglandDerbyshire_5936" to "ancient_derbyshire"
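Two of the QC steps above, the 10% missingness filter and masking heterozygous calls on the Y chromosome, can be sketched as follows. Genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for missing. That coding and the function names are assumptions for illustration. In PLINK, the roughly equivalent options are `--geno 0.1` and `--set-hh-missing`, though the flags actually used are in source_functions/plink_qc.snakefile.

```python
MISSING = None  # assumed coding for a missing genotype


def site_missingness(genotypes):
    """Fraction of samples with a missing call at one site."""
    return sum(g is MISSING for g in genotypes) / len(genotypes)


def filter_sites(sites, max_missing=0.10):
    """Drop sites whose missingness exceeds `max_missing`.

    `sites` maps a site ID to its list of genotypes across samples.
    """
    return {
        site: genotypes
        for site, genotypes in sites.items()
        if site_missingness(genotypes) <= max_missing
    }


def mask_y_heterozygotes(genotypes):
    """Set heterozygous calls (coded 1) to missing.

    The Y chromosome is haploid, so a heterozygous call there
    cannot be a real genotype.
    """
    return [MISSING if g == 1 else g for g in genotypes]


# Hypothetical sites with ten samples each
sites = {
    "snp1": [0, 1, 2, 0, 0, 0, 0, 0, 0, 0],        # 0% missing
    "snp2": [None, None, 2, 0, 0, 0, 0, 0, 0, 0],  # 20% missing
    "snp3": [None, 1, 2, 0, 0, 0, 0, 0, 0, 0],     # exactly 10% missing
}
kept = filter_sites(sites)
```

A site with exactly 10% missingness is kept, since only sites with more than 10% missingness are removed.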
Phase autosomes and sex chromosomes separately using SHAPEIT TODO

Exploratory analyses

Downsampling & variant thinning performed in source_functions/plink_qc.snakefile
- Initial QC same as pre-phasing QC above
- Variants thinned on each chromosome, with each variant having an X% probability of being retained
- Individuals downsampled based on dataset of interest
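Probabilistic thinning keeps each variant independently with a fixed probability, which is what PLINK's `--thin` option does. A minimal sketch, assuming a hypothetical function name and an arbitrary retention probability (the README leaves the value as X%):

```python
import random


def thin_variants(variant_ids, keep_prob, seed=None):
    """Keep each variant independently with probability `keep_prob`.

    Order is preserved. Passing `seed` makes the draw reproducible.
    """
    rng = random.Random(seed)
    return [v for v in variant_ids if rng.random() < keep_prob]


# Hypothetical variant IDs on one chromosome
variant_ids = [f"28:{pos}" for pos in range(1000, 1100)]
thinned = thin_variants(variant_ids, keep_prob=0.5, seed=1)
```

The number of retained variants varies from run to run unless the seed is fixed, so the thinned set is only approximately X% of the original.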
fastStructure run in source_functions/faststructure.bovine_demo.snakefile, see notebooks/faststructure.Rmd for thinning/downsampling dataset designations & results
EIGENSOFT smartpca run in source_functions/smartpca.snakefile, see notebooks/smartpca.Rmd for thinning/downsampling dataset designations & results
- For all datasets including ancient samples: eigenvectors calculated without them, then ancient samples projected onto PC space
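The projection idea can be sketched in plain Python: the axis is learned from the modern samples only, and ancient samples are then centered with the modern means and placed on that axis. This is a simplified illustration, not smartpca's algorithm. It finds only the first PC by power iteration and skips smartpca's allele-frequency normalization and its handling of missing data. All function names and genotype values are hypothetical.

```python
def column_means(rows):
    """Per-site mean genotype across samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def first_pc(rows, means, iterations=200):
    """Leading principal axis of `rows` by power iteration.

    Note: the fixed all-ones start vector fails if it happens to be
    orthogonal to the leading axis; fine for a sketch.
    """
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    n_sites = len(means)
    vec = [1.0] * n_sites
    for _ in range(iterations):
        # Apply X^T X to vec without forming the matrix explicitly
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
        new = [
            sum(s * row[j] for s, row in zip(scores, centered))
            for j in range(n_sites)
        ]
        norm = sum(x * x for x in new) ** 0.5
        if norm == 0:
            break
        vec = [x / norm for x in new]
    return vec


def project(rows, means, pc):
    """Coordinates of samples on `pc`, centered with the supplied means."""
    return [sum((x - m) * w for x, m, w in zip(row, means, pc)) for row in rows]


# Hypothetical genotypes (alternate-allele counts) at two sites
modern = [[0, 0], [1, 1], [2, 2]]
ancient = [[2, 2], [2, 0]]

means = column_means(modern)            # learned from modern samples only
pc1 = first_pc(modern, means)
ancient_coords = project(ancient, means, pc1)
```

Because the ancient samples never enter `column_means` or `first_pc`, their typically lower coverage and higher missingness cannot distort the axes.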
SMC++
TreeMix
