Scripts developed and used in: André, M., Brucato, N., Hudjasov, G., Pankratov, V., Yermakovich, D., …, Mondal, M., Ricaut, F.-X. Positive selection in the genomes of two Papua New Guinean populations at distinct altitude levels. Nat Commun 15, 3352 (2024). https://doi.org/10.1038/s41467-024-47735-1
We used the 1000 Genomes 30x on GRCh38 VCFs as example data. The VCF file used can be downloaded here; the phased VCFs and the CRAM files are available similarly.
You can use conda to install the dependencies necessary to run the following analyses, listed in png_selection.yml. To install conda, please visit the Anaconda webpage.
conda install --file png_selection.yml
We included in the positive coverage mask sites with a minimum base quality of 20 and a minimum alignment mapping quality of 20, downgrading the mapping quality of reads containing excessive mismatches with a coefficient of 50. From these sites, we excluded indels, sites with more than two alleles, and sites with a missing rate above 5%. We also masked sites whose depth of coverage, summed across all samples, was more than 2-fold higher or lower than the sum of the average depths across the dataset.
snakemake -s 1.GenomicMask/mask.smk
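For orientation, the quality filters above map onto standard bcftools options. Below is a minimal sketch of the idea, not the actual pipeline: the real rules live in 1.GenomicMask/mask.smk, and the reference name, CRAM list, and MEAN_DP variable are illustrative assumptions.

```bash
# Illustrative sketch only; see 1.GenomicMask/mask.smk for the real rules.
# -q 20: minimum mapping quality; -Q 20: minimum base quality;
# -C 50: downgrade mapping quality of reads with excessive mismatches.
bcftools mpileup -q 20 -Q 20 -C 50 -f GRCh38.fa -b cram_list.txt -r chr22 -Ou \
  | bcftools call -c -Oz -o chr22.calls.vcf.gz

# Keep sites whose summed depth (DP) is within 2-fold of the dataset-wide
# expected total depth MEAN_DP (an assumed, precomputed value):
bcftools query -f '%CHROM\t%POS\t%INFO/DP\n' chr22.calls.vcf.gz \
  | awk -v m="$MEAN_DP" '$3 >= m/2 && $3 <= 2*m {print $1 "\t" $2-1 "\t" $2}' \
  | bedtools merge -i - > coverage_mask.bed
```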
We create a positive mask using bedtools v2.29.2, keeping the sites present in both the coverage mask we created in step 1.1 (1.GenomicMask/mask_all_sort_chr22.bed) and gr38.mask.bed, the mappability mask used with MSMC, lifted over to GRCh38.
bedtools intersect -a 1.GenomicMask/mask_all_sort_chr22.bed -b gr38.mask.bed > 1.GenomicMask/mask_coverage_and_map_chr22.bed
Keep only the sites flagged PASS by the variant calling in the unfiltered vcf file example_dataset/example_raw_chr22.vcf.gz:
vcftools --gzvcf example_dataset/example_raw_chr22.vcf.gz --remove-filtered-all --recode --stdout |bgzip > 1.GenomicMask/sites_PASS_chr22.vcf.gz
Output the sites without the PASS flag from the unfiltered vcf file example_dataset/example_raw_chr22.vcf.gz:
bcftools isec -C -o 1.GenomicMask/sites_not_PASS_chr22.bed -Oz example_dataset/example_raw_chr22.vcf.gz 1.GenomicMask/sites_PASS_chr22.vcf.gz
Format the output file 1.GenomicMask/sites_not_PASS_chr22.bed into bed format:
awk '{print $1 "\t" ($2 - 1) "\t" $2}' 1.GenomicMask/sites_not_PASS_chr22.bed|bedtools merge -i - > 1.GenomicMask/sites_not_PASS_chr22_format.bed
bedtools subtract -a 1.GenomicMask/mask_coverage_and_map_chr22.bed -b 1.GenomicMask/sites_not_PASS_chr22_format.bed > 1.GenomicMask/full_positive_mask_chr22.bed
Convert the positive mask to fasta format using the ancestral genome from the Ensembl 93 release (homo_sapiens_ancestor_22.fa.gz). This will output the final positive mask 1.GenomicMask/full_positive_mask_chr22.fa used in the subsequent analyses:
1.GenomicMask/make_fa_mask.sh
We kept biallelic SNPs within the positive mask with at most 5% missing genotypes and a minor allele frequency of at least 0.05, and converted them to plink format:
vcftools --gzvcf example_dataset/example_raw_chr22.vcf.gz --max-alleles 2 --min-alleles 2 --remove-indels --max-missing 0.95 --remove-filtered-all --bed 1.GenomicMask/full_positive_mask_chr22.bed --maf 0.05 --plink --out 2.PCA_Admixture/example_chr22
We pruned variants in high linkage disequilibrium using plink v1.9 with the default parameters: a 50-variant window shifting by five variants and a variance inflation factor (VIF) threshold of 2.
plink --file 2.PCA_Admixture/example_chr22 --indep 50 5 2 --out 2.PCA_Admixture/example_chr22 --noweb #output example_chr22.prune.in
plink --file 2.PCA_Admixture/example_chr22 --extract 2.PCA_Admixture/example_chr22.prune.in --out 2.PCA_Admixture/example_pruned_chr22 --noweb --recode #output example_pruned_chr22.ped and example_pruned_chr22.map
We performed Principal Component Analysis (PCA) on the unrelated, LD-pruned dataset (all_biSNP_PASS_maf_unrelated_VIFLDPruned.ped in the paper) using the smartpca program from the EIGENSOFT v7.2.0 package.
smartpca -p 2.PCA_Admixture/par.file # Will output example_pruned_chr22.evec and example_pruned_chr22.eval
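If you need to write 2.PCA_Admixture/par.file yourself, a minimal sketch could look as follows; the parameter names follow EIGENSOFT conventions, the paths are this example's, and numoutevec is an assumed choice:

```
genotypename: 2.PCA_Admixture/example_pruned_chr22.ped
snpname:      2.PCA_Admixture/example_pruned_chr22.map
indivname:    2.PCA_Admixture/example_pruned_chr22.ped
evecoutname:  2.PCA_Admixture/example_pruned_chr22.evec
evaloutname:  2.PCA_Admixture/example_pruned_chr22.eval
numoutevec:   10
```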
ADMIXTURE requires plink bed format:
plink --file 2.PCA_Admixture/example_pruned_chr22 --make-bed --out 2.PCA_Admixture/example_pruned_chr22 # Will output example_pruned_chr22.bed
We ran ADMIXTURE v1.3 for components K=2 to K=10 on the unrelated, LD-pruned dataset 2.PCA_Admixture/example_pruned_chr22.bed. For each K, ADMIXTURE computes the cross-validation error using a k-fold cross-validation procedure; we set the k (fold) parameter to 100. To select the model with the most likely number of components, in this example we generated the cross-validation error 3 times for each K.
./submit.sh 2.PCA_Admixture/example_pruned_chr22
This script will create 10 trial directories, each of which contains the P and Q ADMIXTURE output files for components K=2 to K=10.
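For reference, each individual run launched by submit.sh amounts to one ADMIXTURE call per K; a sketch (the log file name is an illustrative assumption):

```bash
# One run for K=2 with 100-fold cross-validation; repeat for K=2..10.
admixture --cv=100 2.PCA_Admixture/example_pruned_chr22.bed 2 | tee log_K2.out

# Collect the cross-validation errors across all runs and components:
grep -h "CV error" log_K*.out
```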
We used shapeit4 v4.2.2. Input parameters are found in 3.Phasing/config.sk. We used the genetic map (genetic_map_hg38_chr22.txt.gz) from Eagle for GRCh38, converted to the shapeit format.
snakemake -s 3.Phasing/shapeit4.smk
This snakemake pipeline will create two directories in your working directory. The chr/ directory includes the unphased vcf files per chromosome, filtered according to the filters indicated in 3.Phasing/config.sk. The other directory, phased/, includes all the phased vcf files per chromosome.
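For orientation, the per-chromosome phasing call issued by the pipeline is roughly as follows; this is a sketch, not the exact rule (see 3.Phasing/shapeit4.smk), and the converted map and output names are assumptions:

```bash
# Illustrative sketch; the actual rule lives in 3.Phasing/shapeit4.smk.
shapeit4 --input chr/unphased_chr22.vcf.gz \
         --map genetic_map_hg38_chr22.shapeit.gmap.gz \
         --region chr22 \
         --thread 8 \
         --output phased/phased_chr22.vcf.gz
```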
We used selscan v2.0 with highlanders as the target population and lowlanders as the reference population. For this example run, we used all 99 CEU individuals from 1000G as the target population and the 103 CHB individuals from 1000G as the reference population. We used the already phased vcf, from which we kept biallelic SNPs with at most 5% missing genotypes (vcftools --max-missing 0.95). Selscan v2.0.0 is not available on Anaconda; you must install it following the instructions found in the selscan GitHub repository.
We used the genetic map (genetic_map_hg38_chr22.txt.gz) from Eagle for GRCh38. You need to convert it to plink .map format first:
tail -n+2 genetic_map_hg38_chr22.txt|awk '{print $1 , $1":"$2,$4, $2}'> 4.XPEHH/genetic_map_hg38_chr22_plink_format.txt
vcftools --gzvcf 4.XPEHH/CEU_CHB_YRI_phased_chr22_biallel_snp.vcf.gz --plink --out 4.XPEHH/chr22_input # Will output chr22_input.map
Rscript 4.XPEHH/GeneticMapIntrapolate.r 4.XPEHH/chr22_input.map 4.XPEHH/CEU_CHB_YRI_chr22_GeneticMap.map 4.XPEHH/genetic_map_hg38_chr22_plink_format.txt # Will output CEU_CHB_YRI_chr22_GeneticMap.map
Selscan uses phased vcf files separated by chromosome with no missing data. It also needs one population per file.
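Splitting the combined phased vcf into one file per population can be done with bcftools; a minimal sketch, assuming plain-text sample lists CEU.samples and CHB.samples (one sample ID per line):

```bash
# Illustrative; CEU.samples / CHB.samples are assumed sample-ID lists.
bcftools view -S CEU.samples -Oz -o 4.XPEHH/CEU_phased_chr22_biallel_snp.vcf.gz \
    4.XPEHH/CEU_CHB_YRI_phased_chr22_biallel_snp.vcf.gz
bcftools view -S CHB.samples -Oz -o 4.XPEHH/CHB_phased_chr22_biallel_snp.vcf.gz \
    4.XPEHH/CEU_CHB_YRI_phased_chr22_biallel_snp.vcf.gz
```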
selscan --xpehh --vcf 4.XPEHH/CEU_phased_chr22_biallel_snp.vcf.gz --vcf-ref 4.XPEHH/CHB_phased_chr22_biallel_snp.vcf.gz --map 4.XPEHH/CEU_CHB_YRI_chr22_GeneticMap.map --out 4.XPEHH/CEU_CHB_chr22 # Output is 4.XPEHH/CEU_CHB_chr22.xpehh.out
norm --xpehh --files 4.XPEHH/CEU_CHB_chr22.xpehh.out # Output is 4.XPEHH/CEU_CHB_chr22.xpehh.out.norm
This script will output the SNPs whose normalized XP-EHH score is in the 99th percentile.
python 4.XPEHH/XPEHH_bed_99.py 4.XPEHH/CEU_CHB_chr22.xpehh.out.norm # Output is 4.XPEHH/CEU_top_snp_XPEHH_99.bed
If you are interested in the top SNPs for the reference population, this script will output the SNPs whose normalized XP-EHH score is in the 1st percentile.
python 4.XPEHH/XPEHH_bed_1.py 4.XPEHH/CEU_CHB_chr22.xpehh.out.norm # Output is 4.XPEHH/CHB_top_snp_XPEHH_1.bed
We then merged top SNPs closer than 10 kb to each other into larger regions using bedtools v2.29.2. The XP-EHH score of a region is the top XP-EHH score of the top SNPs used to create it:
bedtools merge -i 4.XPEHH/CEU_top_snp_XPEHH_99.bed -d 10000 -c 4 -o max > 4.XPEHH/CEU_merged_top_snp_XPEHH_99.bed
bedtools merge -i 4.XPEHH/CHB_top_snp_XPEHH_1.bed -d 10000 -c 4 -o min > 4.XPEHH/CHB_merged_top_snp_XPEHH_1.bed
NB: for this example run we used only selection scores from chr22, so there are only random regions from chr22. In the paper we used selection scores from all the autosomes.
snakemake -s random_sampling/p_val.smk -kp --jobs 100 --profile [profile_file]
python 5.PBS/RunPBS.py --popfile example_dataset/example.pop --pbspop CEU:CHB:YRI --dropna 5.PBS/CEU_CHB_YRI_chr22_biallel_snp.vcf.gz > 5.PBS/CEU_example_chr22.pbs
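As a reminder of the statistic itself: the population branch statistic (Yi et al. 2010) for a target population A with reference B and outgroup C (here CEU:CHB:YRI) combines the per-SNP pairwise FST values as:

```
T_XY  = -log(1 - FST_XY)
PBS_A = (T_AB + T_AC - T_BC) / 2
```

RunPBS.py reports this per-SNP score; the --dropna flag in the command above suggests that SNPs with undefined FST values are discarded.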
We create sliding windows of X SNPs with a step of Y SNPs and give each of these regions a PBS score that is the average of the PBS of the X SNPs the window is made of (see CEU_windows_20SNP_step5.bed). We keep only the windows whose PBS score is in the 99th percentile: CEU_sliding_windows_snp_top_regions_PBS.bed. It also outputs CEU_mean_pos_PBS.bed with the mean position of each region (mean of the start and end bp) and the PBS of the region (average of the PBS of all the SNPs used to create the window).
We used windows of 20 SNPs sliding by 5 SNPs
python 5.PBS/sliding_windows_snp.py -b 5.PBS/CEU_example_chr22.pbs -w 20 -s 5
We then merged top windows (CEU_sliding_windows_snp_top_regions_PBS.bed) closer than 10 kb to each other into larger regions using bedtools v2.29.2. The PBS score of a region is the top PBS of the windows used to create it:
bedtools merge -i 5.PBS/CEU_sliding_windows_snp_top_regions_PBS.bed -d 10000 -c 4 -o max > 5.PBS/CEU_windows_and_PBS_top_windows_final_merged.bed
P-values and Z-scores are computed on log10(PBS). NB: for this example run we used only selection scores from chr22, so there are only random regions from chr22. In the paper we used selection scores from all the autosomes.
snakemake -s random_sampling/p_val.smk -kp --jobs 100 --profile [profile_file]
We computed the Fisher score for the same windows used to generate PBS, 5.PBS/CEU_windows_20SNP_step5.bed or 5.PBS/CHB_windows_20SNP_step5.bed, depending on whether we compute the score for the target population (CEU) or the reference population (CHB).
Convert the XP-EHH output to a bed file:
cat 4.XPEHH/CEU_CHB_chr22.xpehh.out.norm |awk '{print "chr22\t" $2-1 "\t" $2 "\t" $9}'|tail -n+2 > 6.FisherScore/CEU_CHB_chr22.xpehh.bed
If we are interested in the XP-EHH scores of the reference population (CHB), we need to invert the XP-EHH values:
python get_inverted_xpehh.py -b 6.FisherScore/CEU_CHB_chr22.xpehh.bed # output inverted.xpehh.bed
Extract the PBS window coordinates:
cut -f1,2,3 5.PBS/CEU_windows_20SNP_step5.bed|tail -n+2 > 6.FisherScore/CEU_windows.bed
cut -f1,2,3 5.PBS/CHB_windows_20SNP_step5.bed|tail -n+2 > 6.FisherScore/CHB_windows.bed
Make a file listing each SNP associated with each window and its XP-EHH score:
bedtools intersect -a 6.FisherScore/CEU_windows.bed -b 6.FisherScore/CEU_CHB_chr22.xpehh.bed -wa -wb > 6.FisherScore/CEU_windows_xpehh.bed
bedtools intersect -a 6.FisherScore/CHB_windows.bed -b 6.FisherScore/inverted.xpehh.bed -wa -wb > 6.FisherScore/CHB_windows_xpehh.bed
Get the top XP-EHH score of all the SNPs in each window (top score in the region):
bedtools groupby -i 6.FisherScore/CEU_windows_xpehh.bed -g 1-3 -c 7 -o max |awk 'BEGIN {print "CHR\tSTART\tEND\tNORM"} {print $0}'> 6.FisherScore/CEU_windows_XPEHH_final.bed
bedtools groupby -i 6.FisherScore/CHB_windows_xpehh.bed -g 1-3 -c 7 -o max |awk 'BEGIN {print "CHR\tSTART\tEND\tNORM"} {print $0}'> 6.FisherScore/CHB_windows_XPEHH_final.bed
Keep only the sliding windows in the 99th percentile for the Fisher score:
python 6.FisherScore/SubmitFish.py --xpehh 6.FisherScore/CEU_windows_XPEHH_final.bed --pbs 5.PBS/CEU_windows_20SNP_step5.bed --out 6.FisherScore/CEU_FisherScore_windows
python 6.FisherScore/SubmitFish.py --xpehh 6.FisherScore/CHB_windows_XPEHH_final.bed --pbs 5.PBS/CHB_windows_20SNP_step5.bed --out 6.FisherScore/CHB_FisherScore_windows
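For background, a Fisher combined score merges the empirical percentile ranks of the two statistics into a single score per window. Assuming SubmitFish.py follows the usual definition, for per-window empirical p-values p_PBS and p_XPEHH:

```
F = -2 * (ln(p_PBS) + ln(p_XPEHH))
```

Higher F indicates a window extreme for both statistics; check the script itself for the exact ranking used.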
Merge the top regions together; the score of a region is the top score of its windows. Extract the top 10 regions from it:
bedtools merge -i 6.FisherScore/CEU_FisherScore_windows.topSNP.bed -d 10000 -c 4 -o max > 6.FisherScore/CEU_top_regions_Fisher_merged.bed
bedtools merge -i 6.FisherScore/CHB_FisherScore_windows.topSNP.bed -d 10000 -c 4 -o max > 6.FisherScore/CHB_top_regions_Fisher_merged.bed
P-values and Z-scores are computed on log10(PBS). NB: for this example run we used only selection scores from chr22, so there are only random regions from chr22. In the paper we used selection scores from all the autosomes.
snakemake -s random_sampling/p_val.smk -kp --jobs 100 --profile [profile_file]
We used Relate v1.8. You must install Relate following the instructions in the Relate GitHub repository. The Relate snakemake pipeline is run with the following command:
snakemake -s Relate/Make_trees.smk
It will create several folders in the working directory.
The snakemake pipeline will create an input_relate folder including the *sample, *haps, *annot, and *dist files used by Relate. It also includes the file poplabel.txt, the population file correctly ordered for the subsequent Relate analyses.
The pipeline will create recombination trees for all the samples included in the vcf files. We used the default parameters of 1.25e-8 for the mutation rate and 3000 for the effective population size of haplotypes. The outputs are *anc and *mut files in the trees_all/ directory.
Trees will be extracted for every population you included in config.sk. The extracted trees (*anc and *mut files) are found in the extracted_trees/ directory.
Effective population size will be estimated for each of the populations you included in the config.sk file. We used the default mutation rate of 1.25e-8 and specified the first time interval from 0 to 10^2 years ago; the period between 10^2 and 10^7 years ago is then split into successive bins with a step of 10^0.1 on a log10 scale. We used the Relate defaults of 28 years per generation, the default number of cycles, and the default fraction of trees to be dropped. Outputs (*coal) are found in the pop_size/ folder.
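For reference, this step corresponds to a Relate call along these lines; a sketch with illustrative file prefixes, where --bins 2,7,0.1 encodes the 10^2 to 10^7 year range with 10^0.1 steps:

```bash
# Illustrative sketch; the actual rule lives in Relate/Make_trees.smk.
PATH_TO_RELATE/scripts/EstimatePopulationSize/EstimatePopulationSize.sh \
    -i extracted_trees/CEU \
    -m 1.25e-8 \
    --poplabels input_relate/poplabel.txt \
    --years_per_gen 28 \
    --bins 2,7,0.1 \
    -o pop_size/CEU
```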
We have created a bed file, regions_of_interest.bed, including two random regions of chr22 that we will consider as the genomic regions of interest for which we want to run CLUES. We then create another bed file including all the SNPs that fall inside these regions of interest:
bcftools query -f '%CHROM\t%POS0\t%END\n' CCDG_14151_B01_GRM_WGS_2020-08-05_chr22.filtered.shapeit2-duohmm-phased.vcf.gz -R 8.CLUES/regions_of_interest.bed -o 8.CLUES/SNPs_of_interest.bed
This will filter out the sites from SNPs_of_interest.bed that are not present in the population's extracted trees:
python bed_sites_to_kept_list.py -b 8.CLUES/SNPs_of_interest.bed -t 7.RELATE/extracted_trees/CEU -o 8.CLUES/SNPs_of_interest
We used CLUES v1.0.
snakemake -s Clues/Clues.smk
The first rule, create_clues_input, of the snakemake script extracts the local trees for each SNP in the list of interest for the population indicated in the config.sk file. It samples the branch lengths of those trees using a Relate script. The branch lengths of the focal SNP tree are resampled 200 times (--num_samples 200 option of the SampleBranchLengths.sh script), and each iteration's result is recorded to estimate the uncertainty in the age estimate of each node. To take the uncertainty in the branch length estimates into account, this is performed several times (number_of_test in the config.sk file). The output files (*timeb) are found in the clues_input/ output directory.
CLUES is run for each of the sampled branch lengths; outputs are found in the clues_output/ directory.
We used GEMMA v0.98.4. No phenotype data are available for 1000 Genomes individuals, so for the following example run we will use the example dataset provided on the GEMMA GitHub page. When working with your own vcf files and phenotype data, you must first (see the sketch after this list):

- Convert the main vcf file to plink bed file format (mouse_hs1940.bed).
- Extract the candidate SNPs from the main vcf and convert them to plink bed file format (target_SNP.bed).
- Add the phenotype info to the .fam files of both the candidate-SNP and the all-site filesets (mouse_hs1940.fam).
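A minimal sketch of those preparation steps with plink v1.9; the input vcf and range-list names are illustrative assumptions:

```bash
# Convert the main vcf to a plink bed fileset.
plink --vcf main.vcf.gz --make-bed --out 9.GEMMA/mouse_hs1940

# Keep only the candidate SNPs (candidate_regions.txt: one range per line,
# in plink's --extract range format) and write a second fileset.
plink --vcf main.vcf.gz --extract range candidate_regions.txt \
      --make-bed --out 9.GEMMA/target_SNP

# Finally, paste your phenotype values into column 6 of each .fam file.
```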
Compute the centered relatedness matrix:
gemma -bfile 9.GEMMA/mouse_hs1940 -gk 1 -o 9.GEMMA/mouse_hs1940_centered # will output 9.GEMMA/mouse_hs1940_centered.cXX.txt
This file will be used to compute associations for the target SNPs only. To run the association for each phenotype column of interest (-n):
gemma -bfile 9.GEMMA/target_SNP -k 9.GEMMA/mouse_hs1940_centered.cXX.txt -notsnp -miss 1 -lmm 4 -n 1 -outdir results_centered/ -o pheno_1 # will output pheno_1.assoc.txt
We first downloaded the publicly available summary statistics files of the UK Biobank (UKBB). Files can be downloaded from https://pan-ukb-us-east-1.s3.amazonaws.com/sumstats_flat_files/. Both the data and the index files should be downloaded into the same folder to use our code.
wget https://pan-ukb-us-east-1.s3.amazonaws.com/sumstats_flat_files/*.tsv.bgz
wget https://pan-ukb-us-east-1.s3.amazonaws.com/sumstats_flat_files_tabix/*.tbi
To find the closest SNPs present in the UKBB file set, use this command:
bash UKBB_ClosestSNP.sh snp.bed biomarkers-30600-both_sexes-irnt.tsv.bgz closest.bed # Will output closest.bed, Present.bed and notpresent.snps
You need two files to run the code. First, the targeted SNPs, written in bed format; remember UKBB is on GRCh37, so your bed file should also be on GRCh37. There can be multiple SNPs in the same file, one per line. Second, you need any of the tsv files that you have downloaded; as all the files contain exactly the same SNPs, it does not matter which one you use for this command.
It will produce a bed file, Present.bed, listing the SNPs present in the UKBB summary file and, for the SNPs that are not present, another file, closest.bed, listing the closest SNP to each non-present one. For now, if there is no SNP in UKBB within 1 kb upstream or downstream, the code will ignore that SNP.
python 10.UKBB/Run_ExtractScore.py --bed 10.UKBB/closest.bed biomarkers-30600-both_sexes-irnt.tsv.bgz
python 10.UKBB/Run_ExtractScore.py --bed 10.UKBB/Present.bed biomarkers-30600-both_sexes-irnt.tsv.bgz
This will extract the UKBB pval_EUR and beta_EUR results for the targeted SNPs present in the closest.bed file created in the previous step. As this has to be done for all the UKBB phenotypes, you can use this shell script to extract the scores for all phenotypes automatically:
sh 10.UKBB/ExtractScore.sh closest.bed <UKBB_folder>
It takes two inputs. First, the closest.bed file that we created in the previous step (or any bed file in which every SNP is present in UKBB). Second, the downloaded folder where all the UKBB summary statistics files are kept together.
It will create a folder named scores, in which it writes one file per SNP (chr_position.tsv). Inside every file you will have all the phenotypes with pval_EUR and beta_EUR together, which can then be sorted, for example in an Excel sheet, to get the significant values.
In case you want to extract UKBB values for all SNPs (which we used for random sampling), you can use this code:
sh 10.UKBB/ExtractScore_AllSNP.sh <UKBB_folder>
You just have to give the folder path where all UKBB summary files are stored.
It will create Phenotypes.pval.gz files, which are tabix-indexed, meaning you can extract any SNP using:
tabix -h 10.UKBB/biomarkers-30600-both_sexes-irnt 1:6623020-6623020
Remember that processing all SNPs and all phenotypes means reading and writing very big files. Rather than putting all the files in the same folder, we suggest breaking them into smaller folders and then running this code on each independently to make it faster.
We will use the file pval_EUR.tsv.bgz generated in the previous step to randomly sample 200 bp windows (or whichever size_windows_bp you set in 10.UKBB/config.sk) that include at least one SNP significantly associated with at least one phenotype in the UKBB, and we store how many of these random windows include at least one SNP significantly associated with at least one phenotype of the blood_count category of the UKBB (blood_count_100081_list.full.name.txt). We will compare this count distribution to the number of associations with blood phenotypes found in our target population.
snakemake -s UKBB/enrich.smk --profile [profile_file]
You can use conda to install the dependencies necessary to run the following introgression workflow, listed in png_introgression.yml. To install conda, please visit the Anaconda webpage. You will also have to install hmmix following the instructions on the hmmix GitHub page.
conda install --file png_introgression.yml
There are two folders, run_hmmix and process_res. run_hmmix contains the following scripts and implements the workflow described at https://github.com/LauritsSkov/Introgression-detection:
- processing_gvcf.sh # bcftools commands used for processing the (g)vcf files for hmmix
- Skov_pipeline.preparing.sh # prepares the specific input for hmmix
- unite_chr_per_sample.sh # combines the input for the hmmix model
- Skov_pipeline.hmm.sh # training and decoding with the hmmix model
- laucnh_Skov_pipeline.sh # launches bullet points 2-4 in sbatch mode for the regions of interest
process_res contains the following scripts for processing the hmmix output:

- Introgression_processing.workflow.R # includes the workflow for finding the most frequent haplotype per candidate region for selection
- Pairwise_comparison.chr1_GBP.R # includes the counting of the pairwise distances between archaics in the GBP region on chr1