Scripts, supporting data, and figures from analysis of Atlantic cod genome sequences through time (Northeast Arctic and Canada), as published in
Pinsky, ML, AM Eikeset, C Helmerson, IR Bradbury, P Bentzen, C Morris, A Gondek, HT Baalsrud, MSO Brieuc, OS Kjesbu, JA Godiksen, JMI Barth, M Matschiner, NC Stenseth, KS Jakobsen, SJentoft, and B Star. Genomic stability through time despite decades of exploitation in cod on both sides of the Atlantic. PNAS.
This repository has been archived on Zenodo. Cite as ML Pinsky & B Star. (2021, February 21). pinskylab/codEvol: Publication (Version v1.0). Zenodo. https://doi.org/10.5281/zenodo.4554375
These scripts contain hard links and specific environment settings. They would require adaptation to run elsewhere and on other data. The scripts were run on a Linux scientific workstation (cod) and two Linux clusters with SLURM job management (abel and saga) at the University of Oslo, plus on a MacBook Pro running R 3.*.
analysis/
: Files output by scripts. Not track by Git.data/
: Input data. Does not include the DNA sequence reads, which are at ENA with project #PRJEB41431figures/
: Output figuresoutput/
: Output files that are tracked by Git (as opposed toanalysis/
)scripts/
: Scripts to do analysestables/
: Tables output by scripts.tmp/
: Temporary files output by scripts. Not tracked by git.
kjesbu_fecundity/AFWG_Table3.11proportionmatureatage.csv
: Proportion of Northeast Arctic cod mature at each age. From Table 3.11 in ICES AFWG Report 2016. ICES CM 2016/ACOM:06. Report of the Arctic Fisheries Working Group (AFWG).kjesbu_fecundity/AFWG_Table3.16stocknumbersatage.csv
: Numbers at age from the stock assessment. Table 3.16 in ICES AFWG Report 2016.kjesbu_fecundity/AFWG_Table3.6catchnumbersatage.csv
: Numbers at age from catch data. Table 3.16 in ICES AFWG Report 2016.kjesbu_fecundity/Fekunditet Andenes 1986_2006 Excel_2.csv
: Fecundity data from a local fishing port in northern Norway covering a period of 20 years. Use Main_series == Y (Yes), Oto_Age, whole body weight (in g) in Fish_Weight, liver weight (in g) in Fish_Liver, Oto_type == 3, 4 or 5 (meaning that we exclude Coastal cod type 1 and 2), fecundity (in mill.) in Fec_Mill. Mean oocyte diameter is in Ooc_Mean_Dia in micrometers; females with small oocytes tend to show a decrease in fecundity up to spawning due to atresia (reabsorption of developing oocytes).kjesbu_fecundity/Leverdata.csv
: Relative liver size, called the hepatosomatic index, from Kjesbu et al. 2014 ICES Journal of Marine Sciencekjesbu_fecundity/Rollesfsen1953_Fig1.csv
: Female age distribution from Fig. 1 in Rollefsen 1953 Observations on the cod and cod fisheries of Lofoten. Rapports et Proces-Verbaux des Reunions, Conseil lnternationale pour l'Exploration de la Mer 136: 40-47phenotypes/AFWG_2019_3_Northeast Arctic_Cod_Table3.18.csv
: Average fishing mortality (F) for ages 5-10 on the Northeast Arctic stock, from the stock assessment. From Table 3.18 in ICES AFWG Report 2019phenotypes/AFWG_Table 3.11.csv
: Proportion mature at each age, Northeast Arctic cod. From Table. 3.11 in ICES. 2020. Arctic Fisheries Working Group (AFWG). ICES Scientific Reports 2(52), 577 ppphenotypes/Brattey et al_2018_Table18_proportion_mature.csv
: Proportion mature at each age, Canadian population. From Table 18 in Brattey, J., Cadigan, N., Dwyer, K. S., Healey, B. P., Ings, D. W., Lee, E. M., Maddock Parsons, D., Morgan, M. J., Regular, P., Rideout, R. M. 2018. Assessment of the Northern Cod (Gadus morhua) stock in NAFO Divisions 2J3KL in 2016. DFO Can. Sci. Advis. Sec. Res. Doc. 2018/018. v + 107pphenotypes/Brattey_etal_2018_CSAS_TableA2-5_NorthernCod_fishingmortality.csv
: Fishing mortality (F) by age, from Brattey et al. 2018.phenotypes/Eikeset_Age_and_length_at_maturation.csv
: Age and length at maturity in Northeast Arctic cod, from Eikeset et al. 2016 PNASinversions.csv
: Genome locations of the cod inversions. From Star et al. 2017 PNASlg_length.csv
: Length in bp of each linkage group. From Tørresen et al 2019 BMC Genomics
Some scripts make use of BEAGLE Utilities.
phenotypes.r
: Calculates the age at 50% mature in Canadian and Northeast Arctic populations. Needs data indata/phenotypes/
. Writesoutput/age_50percmature.csv
.
Paleomix_settings_historic_cod.yaml
: Settings for handling raw historic reads in Paleomix (Adapter removal, Collapse of Paired End reads, BWA, and Duplicate Removal). Requires (paired) fasta.gz and reference genome. Outcome: BAM files and summary dataPaleomix_settings_modern_cod.yaml
: Settings for handling raw modern reads in Paleomix (Adapter removal, Collapse of Paired End reads, BWA, and Duplicate Removal) Requires (paired) fasta.gz and reference genome Outcome: BAM files and summary dataRemove_clipped_reads.sh
: Script to remove clipped reads from BAM file. Any clipped reads (including entire pairs for which one reads was clipped) will has been removed. Required: BAM files. Outcome: BAM file with clipped reads removed.
Haplotype_caller_GATK.sh
: Running GATK's haplotype caller. Requires BAM files, reference genome (as used to create BAM) Outputs a g.vcf.gz file per individual for further analysesGenotype_Caller.sh
: Running GATK's genotype caller. Note we ran this per chromosome in parallel jointly on all .g.vcf files. Requires g.vcf.gz, list of chromosomes, reference genome. Outputs Unfiltered vcf.gzANGSD_snp_calling
: SNP calling using ANGSD. We followed N. O. Therkildsen, et al., Contrasting genomic shifts underlie parallel phenotypic evolution in response to fishing. Science 365, 487–490 (2019), for which N.O.T. kindly shared a template script. Specifically, we identified SNP locations that can be used for faster handling later on (and we used a subset of these based on GATK filtered data). Requires BAM files and reference genome. Outputs a range of standard ANGSD output files for further analyses, including a location file. This file is edited slightly following N.O.T.'s examples.
Create_mappability_track.sh
: We used the GEM mapper to create mappability tracks that can be uniquely mapped. K-mer size determines the size of the unique regions. Requires reference genome and choice of K-mer size. Outputs BED track with unique sections of the genome.Filter_vcf_files.sh
: Filter settings for GATK VCF file as described. Please note that the script cannot be run in a single iteration: various subscripts (such as the binomial test or variance calculations) have to be run at intermediate stages. These scripts generate lists of SNPs to be ex- or included. Requires Unfiltered VCF. Outputs filtered VCF.allele_balance_calc.R
: Script to calculate bias in the number of allelic variants in heterozygote genotypes. I.e. is there consistent bias for specific variants (expection is 50/50). Requires VCF GT and DP files. Wrapper script. Outputs list of SNPs to be filtered.allele_balance_plot.R
: Script to plot histograms of read-depth of alternate alleles for heterozygote genotypes. Requires VCF GT and DP files. Outputs a histogram.filter_allele_balance.r
: Script that performs a binomial test on heterozygote genotype calls. Reequires VCF GT and DP files. Outputs list of SNPs to be filtered.filter_allele_balance.saga.sh
: Wrapper script to submit thefilter_allele_balance.r
to the SAGA compute cluster. Requires VCFfile,allele_balance_calc.R
. Outputs list of SNPs to be filtered at certain thresholds.Calculate_variance2.sh
: Script that calculates a measure of read-depth variation between individuals on a specfic site. Requires VCF file. Outputs list of SNPs with a measure of variation in read depth between individuals. Top 5% were filtered.
Resample_ANGSD_slurm
: Not used. Random resampling of individuals. We resample to get bootstrap intervals of windowed Fst by randomly resampling the temporal comparisons. Requires BAM files, reference genome, list of individuals. Outputs population frequencies based on randomly resampled individuals.ANGSD_pop_freq.sh
: Calculating population specific frequencies. Requires BAM files, reference genome, list of individuals. Outputs population frequenciesngsLD_bypop.sh
: Calculate linkage disequilibrium using ngsLD. Needs list of high qualty SNPs (data_2020.05.07/GATK_filtered_SNP_no_dam2.tab
) and beagle files. Outputsld.*.gatk.gz
ngsLD_decay.r
: Make a plot of the genome-wide average LD decay. Run afterngsLD_bypop.sh
. Writes a figure.ngsLD_find_blocks.r
: Finds blocks of linked SNPs. Run afterngsLD_bypop.sh
. Writesanalysis/ld.blocks.gatk.nodam.csv.gz
.ngsLD_find_blocks.sh
: Submits a job withngsLD_find_blocks.r
.ngsLD_find_unlinked.r
: Identifies unlinked loci for use in downstream analyses. Run afternsdLD_find_blocks.sh
. Writesanalysis/ld.unlinked.*.gatk.nodam.csv.gz
.
ld_decay.r
: Not used. Plot linkage disequilibrium decay using output from vcftools, e.g.,
vcftools --gzvcf data_2020.05.07/Historic_dataset_no_clip.vcf.gz_HF_GQ_HWE_MISS_IND_Kmer_VAR_Binom_No_Dam2.vcf.gz --geno-r2 --ld-window-bp 5000 --keep data_2020.05.07/popLof07.txt --minGQ 15 --minDP 3 --not-chr LG01 --not-chr LG02 --not-chr LG07 --not-chr LG12 --not-chr Unplaced --chrom-map data_2019_03_18/chrom_map --out analysis/LOF_07
PCAANGSD.sh
: Runs PCA from ANGSD. Requires beagle files, reference, positions file. Outputs PCA.Make_eigen_R_PCA_ANGSD.r
: Script to rescale eigen values coming from PCAANGSD. Requires PCA ANGSD output. Outputs rescaled eigenvalues.ANGSD_admixture
: Runs ANSGD admixture for different values of K. Requires BAM files, reference genome, list of individuals. Outputs ADMIXTURE results.Plot_admixture.R
: Basic R script to plot admixture output. Requires ANGSD Admixture output. Outputs Rplot with admixture results.angsd_pcangsd_pca.sh
: Also runs PCAngsd and admixutre a few ways, including after dropping individual BM_115 and removing linked loci. Requires beagle files, output fromngsLD_find_unlinked.r
. Outputs PCA files.angsd_pcangsd_plot_pca.R
: Plot output fromangsd_pcangsd_pca.sh
Filter_prune_admixture
: Script to run ADMIXTURE for a number of K on SAGA using GATK generated VCF. Requires VCF file, chromosome names for plink conversion Outputs Admixture results.Filter_Prune_PCA_LG12.sh
: Script to run PCA using a GATK generated VCF. This is specifically run on an inversion location to determine haplotype status. Requires VCF file, chromosome names for plink conversion, inversion boundaries Outputs PCA showing inversion haplotypes.Filter_Prune_PCA_LG07.sh
: Script to run PCA using a GATK generated VCF. This is specifically run on an inversion location to determine haplotype status. Requires VCF file, chromosome names for plink conversion, inversion boundaries Outputs PCA showing inversion haplotypes.Filter_Prune_PCA_LG02.sh
: Script to run PCA using a GATK generated VCF. This is specifically run on an inversion location to determine haplotype status.Requires VCF file, chromosome names for plink conversion, inversion boundaries Outputs PCA showing inversion haplotypes.Filter_Prune_PCA_LG01.sh
: Script to run PCA using a GATK generated VCF. This is specifically run on an inversion location to determine haplotype status. Requires VCF file, chromosome names for plink conversion, inversion boundaries. Outputs PCA showing inversion haplotypes.Filter_Prune_PCA.sh
: Script to run PCA using a GATK generated VCF. Various filters are applied (E.g. a filter for LD, and removal of inversion chromosomes). Requires VCF file, chromosome names for plink conversion, inversion boundaries Outputs PCA showing inversion haplotypes.
angsd_theta.sh
: Prints the per-site pi and calculates windowed pi and Tajima's D by population. Needs *.thetas.idx and *.thetas.gz files from realSFS and thetaStat, e.g., realSFS theta/$1.saf.idx -fold 1 -P 14 > theta/$1.sfs realSFS saf2theta theta/$1.saf.idx -sfs theta/$1.sfs -outname theta_out/$1 thetaStat print theta_out/$1.thetas.idx 2>/dev/null | head thetaStat do_stat out.thetas.idx cat out.thetas.idx.pestPG thetaStat do_stat out.thetas.idx -win 50000 -step 10000 -outnames theta.thetasWindow.gzangsd_theta_bypop.R
: Calculate theta and Tajima's D for the whole genome and bootstrap across linkage groups. Run by angsd_theta_bypop.sh. Needs *.pestPG.gz files from angsd_theta.sh. Writes analysis/thetas.boot.cis.csv.angsd_theta_bypop.sh
: Submitsangsd_theta_bypop.R
to the SLURM system.kjesbu_calcs.R
: Analyses fecundity, abundance, maturity, and liver condition data to calculate generation time (average age of parents) for the Northeast Arctic population. Needs files indata\kjesbu_fecundity
.calcNe_ANGSD.r
: Calculates effective population size (Ne) from ANGSD output using the Jorde-Ryman temporal method. Bootstraps across loci. Based on equations in the manual from NeEstimator 2.1.
angsd_fst.sh
: Calculate Fst from ANGSD output (global, by site, and windowed). Outputs *.fst.AB.gz, *.slide, *.fst.idx, *.ml, *fst.gzangsd_fst_siteshuffle_null.r
: Shuffles FST A and B values across SNP positions to calculate a null distribution of maximum FST values per genome. Run afterangsd_fst.sh
. Outputsanalysis/*.*.gatk.nodam.fst.siteshuffle.csv.gz
andanalysis/*.*.gatk.nodam.ldtrim.fst.AB.csv.gz
angsd_fst_siteshuffle_null.sh
: Submits angsd_fst_siteshuffle_null.r to SLURM.angsd_fst_siteshuffle_null_stats.r
: Examine the results from shuffling FST values across SNP positions and calculate a p-value per window. Run afterangsd_fst_siteshuffle_null.r
. Writesoutput/fst_siteshuffle.angsd.gatk.csv.gz
.
angsd_theta_siteshuffle_null.r
: Shuffles per-locus theta value values across SNP positions to calculate a null distribution of maximum theta values per genome. Needs *.pestPG.gz files fromangsd_theta.sh
. Outputsanalysis/theta.siteshuffle.*.csv.gz
angsd_theta_siteshuffle_null.sh
: Submitsangsd_theta_siteshuffle_null.r
to SLURM.angsd_theta_siteshuffle_null_stats.R
: Examine the results from shuffling per-locus theta values across SNP positions and calculate a p-value per window. Run afterangsd_theta_siteshuffle_null.sh
/.r
. Writesanalysis/theta_siteshuffle.angsd.gatk.csv.gz
.
ngsLD_region_change.r
: Calculate changes in LD in windows. Needs output fromngsLD_bypop.sh
. Writesanalysis/ld_change_region_5e4_ngsLD.gatk.csv.gz
.
region_change_compare.r
: Compare Fst, LD change, Tajima's D change, pi change across populations to find regions that change substantially in more than one. Needs output from ANGSD Fst, ngsLD, and ANGSD theta change calculations. Writesanalysis/outlier_50kregions_shared_07-11-14_Can.csv.gz
.
wfs_byf1samp.r
: Function to do Wright-Fisher simulation with selection and a finite sample size using priors on Ne and s and specifying initial sample allele frequency. Samples at two time points. (i.e., Canada)wfs_byf1samp_3samps.r
: Function to do Wright-Fisher simulation with selection and a finite sample size using priors on Ne and s and specifying initial sample allele frequency. Samples at three time points (i.e., Northeast Arctic).wfs_make_sims_null_function.r
: Make simulations with the null model for a particular set of sample sizes. For two samples in time (like Canada). Uses starting sample allele frequencies that match the observed data. Uses the function inwfs_byf1samp.r
. Used bywfs_make_sims_null_sbatch.sh
.wfs_make_sims_null.r
: Superceded bywfs_make_sims_null_function.r
. Make simulations with the null model for a particular set of sample sizes. For two samples in time (like Canada). Uses starting sample allele frequencies that match the observed data. Uses the function inwfs_byf1samp.r
.wfs_make_sims_null_3times.r
: Make simulations with the null model for a particular set of sample sizes. For three samples in time (like Northeast Arctic). Uses starting sample allele frequencies that match the observed data. Uses the function inwfs_byf1samp_3samps.r
.wfs_make_sims_null_sbatch.sh
: Submits a SLURM job to make simulations withwfs_make_sims_null_function.r
.wfs_make_sims_null_makebatchscript.r
: Creates batch scripts to submit all thewfs_make_sims_null_sbatch.sh
jobs needed to make the null model simulations for a particular temporal comparison.wfs_test.r
: Code to check that the simulations match theoretical expectations.wfs_ts.r
: Function to do Wright-Fisher forward simulations with drift and selection and save all the allele frequencies through time. Used bywfs_ts_examples.r
.wfs_ts_examples.r
: Plot a null model and a selection simulation example. Useswfs_ts.r
.wfs_nullmodel_compare_sims_and_obs.r
: Not used. Code to compare the simulated and observed allele frequency changes and check that the simulations are appropriate.wfs.r
: Not used. Function to do Wright-Fisher forward simulations with drift and selection and save initial and final allele frequencies. Uses priors on Ne, s, and initial true allele frequency (as opposed to initial sample allele frequency inwfs_byf1samp.r
).
wfs_nullmodel_function.r
: Helper script that compares null model simulations to the observations and calculates a per-locus p-value. Runs in parallel across many loci at once.wfs_nullmodel_sbatch.sh
: Submitswfs_nullmodel_function.r
to run for a set of loci in a population comparison with two time points.wfs_nullmodel_sbatch_3times.sh
: Submitswfs_nullmodel_function.r
to run for a set of loci in a population comparison with three time points.wfs_nullmodel_makesbatchscript.r
: Creates batch scripts to submit all thewfs_nullmodel_sbatch.sh
jobs for a particular temporal comparison. All of those scripts should then be run to calculate the per-locus p-values.wfs_nullmodel_combine.r
: Run after wfs_nullmodel_function.r to combine the results together. Writesanalysis/wfs_nullmodel_pos&pvals_*.rds
.wfs_nullmodel_lowp_rerun.r
: Script to make more null model simulations for loci with low p-values. Run afterwfs_nullmodel_combine.r
. Writesanalysis/wfs_nullmodel_pos&pvals_*.rds
.wfs_nullmodel_analysis.r
: Examines the frequencies at which the null model produces results as extreme as our observations. Run afterwfs_nullmodel_function.r
andwfs_nullmodel_combine.r
/wfs_nullmodel_lowp_rerun.r
. Writesanalysis/wfs_nullmodel_padj.csv.gz
.wfs_nullmodel_analysis_Canada,1907-2011&2014.r
: Not used. Messy code to examine the p-values.
angsd_cut_beagle_bypop.sh
: Divides the beagle genotype file by population comparison, to get ready for running PCAngsd. Needs beagle file from ANGSD trimmed to the GATK high quality SNPs (All_ind_beagle.GATK_no_dam.gz
). Outputs beagle.gz files.angsd_pcangsdoutlier.sh
: Run the PCAngsd selection scan. Needs beagle files, output fromngsLD_find_unlinked.r
. Outputs *.cov, *.selection.npy, *.sites.angsd_pcangsd_plot_selection.R
: Run afterangsd_pcangsdoutlier.sh
to plot the output.
split_fasta.sh
: Split the reference genome fasta file apart by linkage group. Output is used byannotate_outliers.r
.annotate_outliers.r
: Compiles annotation and other information about potential outlier SNPs. Needs the VCF file (All_rerun_hist.vcf.gz_HF_GQ_HWE_MISS_0.6_IND_Norway.vcf.gz
) and the reference genome FASTA split apart by linkage group (seesplit_fasta.sh
), plusanalysis/pcangsd_outlier.gatk.nodam.unlinked.csv.gz
from running PCAngsd,analysis/wfs_nullmodel_padj.csv.gz
from the WFS null model,output/fst_siteshuffle.angsd.gatk.csv.gz
from the windowed FST outlier scan,analysis/theta_siteshuffle.angsd.gatk.csv.gz
from the windowed pi and Tajima's D outlier scan, andanalysis/outlier_50kregions_shared_07-11-14_Can.csv.gz
from the shared outlier regions across populations. Writestables/outlier_annotation.csv
andoutput/outlierSNPs_in_outlierregions.csv
slim_makeburnin.py
: Helper script to make coalescent burn-ins for the SLiM simulations. Writes a vcf file.slim_runburnins.sh
: Set up and run the SLiM burn-ins by callingslim_makeburnin.py
. Writes a set of vcf files.slim_softsweep.slim
: SLiM code to run a forward simulation and output vcf file samples before and after soft sweep.slim_run.R
: Set up and run the SLiM simulations by callingslim_softsweep.slim
. Checks which simulation output already exists so that those are not run again. Uses burnin files fromslim_makeburnin.py
. Outputs vcf files.fst_reynolds_fromvcf.R
: Helper script. Calculates Reynolds FST from .vcf files and writes a table of per-locus A and B values. Useful for processing output from SLiM. ANGSD also uses Reynolds FST.slim_calcfst_reynolds.sh
: Calculate Reynolds FST components for slim output. Processes vcf files from SLiM and callsfst_reynolds_fromvcf.R
.slim_fst_reynolds_plot.R
: Plot per-locus Reynolds Fst as output byslim_calcfst_reynolds.sh
.slim_calcfst.sh
: Not used. Calculate sliding window FST using vcftools on SLiM vcf output. Writes slim_sim_n*_s*_f*_i*.windowed.weir.fst.slim_fst_plot.R
: Not used. Makes FST Manhattan plots from vcftools output to visualize the effect of the sweeps and make sure the sims worked well. Run afterslim_calcfst.sh
.vcftobeagle.R
: Helper script to convert a vcf file to beagle format so that PCAngsd can run. Used byslim_calcpcangsdoutlier.sh
.slim_calcpcangsdoutlier.sh
: Run the PCAngsd outlier test on the vcf output from SLiM. Writes analysis/slim_sim/*.cov, *.selection.npy, and *.sites. 1slim_pcangsdoutlier_plot.R
: Plots the output fromslim_calcpcangsdoutlier.sh
.slim_fst_siteshuffle_null.r
: Conduct the windowed FST outlier test on the SLiM simulations. Needs output fromslim_calcfst_reynolds.sh
. Writesanalysis/slim_sim/slim_sim_n30000_*.fst.siteshuffle.csv.gz
.slim_fst_siteshuffle_null.sh
: Callsslim_fst_siteshuffle_null.r
across all SLiM simulations.slim_fst_siteshuffle_plot.R
: Visualize some of the results from the windowed FST outlier test. Needs output fromslim_fst_siteshuffle_null.sh
.
figures_for_paper.r
: Makes the figures for the paper. Needs output from many of the previous analyses. Writes figures/*.png and .pdf
vcf_to_trimmed_genepop.sh
: Convert vcf file to a genepop format and trim out linked loci with plink. Uses PGDSpider.vcf_to_trimmedplink.sh
: Convert vcf file to a plink format and trim out linked loci with plink. Uses PGDSpider.
age_50percmature.csv
: Age at 50% mature in Canada and NE Arctic cod. Fromscripts/phenotypes.r
fst_siteshuffle.angsd.gatk.csv.gz
: p-values from the windowed FST outlier test inangsd_fst_siteshuffle_null_stats.r
.outlierSNPs_in_outlierregions.csv
: A list of SNPs with FST>0.2 in outlier regions. Produced byannotate_outliers.r
.
The script-produced figures from the paper, plus a handful of others that were useful along the way.
outlier_annotation.csv
: Annotated list of potential outlier SNPs and regions. Produced byannotate_outliers.r
. Used byfigures_for_paper.r
tableS5.csv
: Nicely formated information on potential outlier SNPs and regions. Produced byfigures_for_paper.r
.