Skip to content

LepMap3 Details

Pavel V. Dimens edited this page Feb 3, 2022 · 7 revisions

The general workflow of LepWrap is:

Step Module Description Directory
1 ParentCall2 creates LepMap3 data file 1_ParentCall
2 Filtering2 filter LM3 data 2_Filtering
3 SeparateChromosomes2 clusters markers into possible LGs 3_SeparateChromosomes
4 JoinSingles2ALL clusters remaining markers into defined linkage groups project directory
5 OrderMarkers2 orders markers into linkage groups 4_OrderMarkers
6 Trim ends removes potential error and artifacts from edge clusters 5_Trim
7 OrderMarkers2 reorders trimmed markers 6_OrderMarkers
8 Calculate Distances/Intervals final maps and intervals 7_Distances* 7_Intervals

However, with Snakemake involved, there are a lot more rules that each do specific things. So, let's dive in to each rule and what it does:

1. parent_call

Merges your pedigree file with your filtered VCF file to create a data.call.gz file.

2. filtering

Filters the data from ParentCall2 with some kind of data tolerance value. This value is set in the config.yaml, and if you set it to 0, it will skip this step, instead creating data_f.call.gz as a relative soft link to data.call.gz.

3. separate_chromosomes

Create maps over a range of LOD score values that you have specified in the config.yaml file. By default, it's set to distortionLod=1. Maps will be output into the 3_SeparateChromosomes folder.

4. map_summary

A summary file all.maps.summary will be created combining the summaries of all the maps into a single file text file for easy assessment. The first column is the linkage group number, and the remaining columns correspond to the summary of markers per linkage group for each LOD limit that SeparateChromosomes2 was performed with.

4.5 refine a map

Depending on how your data looks at this point, you may consider cutting off the pipeline and refining a particular map before continuing. For example, if you expect 23 chromosomes/LG's but your map only has 22 and it seems like LG-1 and LG-2 have a lot more markers than the other groups, then you might want to try splitting out those over-clustered LG's. Use /scripts/refine_map.sh to iteratively test LOD scores to try splitting target LG's. You will have to play with the parameters a bit, such as setting a size limit to avoid accidentally creating too many LG's. If you perform this manual intervention, you will need to manually run JoinSingles2All (optional) or just copy your best map from that into the main directory as map.master.

5. join_singles

You are prompted in the console for the map you would like to use for the remainder of the pipeline. The parameters for this rule can be modified in the config.yaml (lodLimit=10,lodDifference=2). If not disabled in the config.yaml, this map will be ran through joinSingles2ALL with those parameters and output as map.master in the project directory. A log of your choice will be in 3_SeparateChromosomes/chosen.map.

6. order_markers

Order your markers into positions among your linkage groups based on the map you chose. You will need to specify the the number of linkage groups and number of iterations in the config.yaml. I would recommend 100 iterations or more, as my own work sometimes revealed that the 100th iteration had the best likelihood. Ordering 25,000 markers across 24 linkage groups 100x on 24 processors should take between 1 and 2 days. That's an oddly specific example, but that's the data it was tested on. Another thing to note is that your linkage groups may be processed at different rates due to the way the rules are configured and how Snakemake dispatches tasks. It's likely the pipeline will process some linkage groups to completion before others, so don't expect LepWrap to wait for all linkage groups to finish initial ordering before moving on to further steps. This is a feature (not a bug) for Snakemake to always try to use available cores to run something when it can.

7. recombination_summary

Runs an R script to provide some summary information on the recombination information into recombinations.summary

8. trim_edge_clusters

Uses LepWrapTrim.r to trim the ordered maps. It's a mechanism to identify and remove clusters of ambiguously ordered markers. A filtered map for each linkage group will be created, where the removed markers are commented out, along with a file containing the names of the markers removed and a pdf visualization of which markers were removed. These aren't actual Marey maps (physical position vs. genetic position) because the X axis is marker number sorted by genetic position, so I've come to refer to them as "Sequential Plots".

This step examines the first and last X% of markers along a linkage group and remove marker clusters that are greater than N%cM away from the next markers (both X and N need to be configured in the config.yaml, but 15 and 10 are reasonable values). The centiMorgan distance is relative to the length of that linkage group for that sex, meaning a 10% threshold (N = 10) will have a separate threshold for the female map length(female) x 0.1 and male map length(male) x 0.1. The percent of markers will look at that many markers from each side, meaning X = 10 will look at the first 10% and last 10% of markers for each sex for each linkage group. Here is an example of one of the output plots:

trimming plot example

9. trim_summary

Collates all the removed markers into a summary files trim.summary (counts per lg) and trim.details (which markers per lg).

10. reorder_markers

Uses the trimmed linkage groups to reorder the markers. Uses the same configurations (#linkage groups, #iterations) as initial ordering.

11. recombination_summary2

Summarizes reordering the same way as in the initial ordering

12. calculate_distances

The final product of the pipeline is in 3 folders: distances, distances_sexaverage, and intervals. The distances folder contains a copy of the best likelihoods from reordering. The distances_sexaverage folder contains the sex-averaged distances using OrderMarkers2 evaluateOrder=LG sexAveraged=1 where LG is each linkage group from the previous step. The intervals folder contains the intervals for each marker in a linkage group using OrderMarkers2 calculateIntervals=.