Identifying representative sequences for chloroplast genes

mossmatters/plastidTargets

citation

If you use the targets from this repository, please cite Pokorny et al. (2024), doi:10.3389/fpls.2024.1340056.

plastidTargets

Identifying representative sequences for chloroplast genes

A modified version of the k-medoids script used to design the Angiosperms353 probe set. The main change from the Angiosperms353 repository is that this version works with protein sequences rather than DNA.

The input data are from Gitzendanner et al. (2018), a phylogeny of all green plants produced as part of the One Thousand Plants Transcriptome Project (1KP).

sequence_clusters.py

  1. Reduces the alignments to just 1KP angiosperms (uses required file 1kp_angio_codes.txt)
  2. Deletes gap-only characters in reduced alignments
  3. Calculates a p-distance matrix among all 1KP sequences
  4. Uses k-medoids to select between 6 and 15 sequences that represent >95% of angiosperms with less than 15% sequence divergence
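
Steps 2 and 3 above can be illustrated with a minimal sketch. It is not taken from sequence_clusters.py: the function names, the toy sequences, and the choice to skip any column where either sequence has a gap when computing the p-distance are all assumptions made for illustration.

```python
from itertools import combinations

GAP = "-"  # assumption: gaps are encoded as "-"


def drop_gap_only_columns(seqs):
    """Remove alignment columns that are gaps in every sequence.

    seqs maps a sequence name to an aligned string; all strings
    are expected to have the same length.
    """
    names = list(seqs)
    length = len(seqs[names[0]])
    keep = [i for i in range(length) if any(seqs[n][i] != GAP for n in names)]
    return {n: "".join(seqs[n][i] for i in keep) for n in names}


def p_distance(a, b):
    """Proportion of differing residues between two aligned sequences.

    Assumption: columns where either sequence has a gap are skipped.
    """
    compared = 0
    diffs = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue
        compared += 1
        if x != y:
            diffs += 1
    # Assumption: sequences with no comparable sites get the maximum distance.
    return diffs / compared if compared else 1.0


def p_distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances as a dict of dicts."""
    names = list(seqs)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        dist[n][m] = dist[m][n] = p_distance(seqs[n], seqs[m])
    return dist


# Hypothetical toy alignment; columns 3 and 6 contain only gaps.
toy = {"sp1": "MK-LV-", "sp2": "MR-LV-", "sp3": "MK-IA-"}
degapped = drop_gap_only_columns(toy)
matrix = p_distance_matrix(degapped)
print(degapped)
print(matrix["sp1"]["sp2"])
```

In this toy case the de-gapped alignment has four columns, and sp1 and sp2 differ at one of them.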

The gene files should be in FASTA format in a folder named invididual_genes (from 1KP_Plastid) with names like accD.FAA, rbcL.FAA, etc.

python sequence_clusters.py geneName
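
The medoid selection in step 4 could be sketched roughly as follows. This is a simplified alternating k-medoids written for illustration, not the clustering code the script actually uses; the deterministic initialisation, the function names, and the rule of stopping at the smallest k that meets the coverage target are assumptions.

```python
def k_medoids(dist, k, max_iter=100):
    """Simple alternating k-medoids on a square distance matrix (list of lists).

    Assumption: medoids are initialised as the k points with the smallest
    total distance to all other points, which keeps the sketch deterministic.
    """
    n = len(dist)
    medoids = sorted(sorted(range(n), key=lambda i: sum(dist[i]))[:k])
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid;
        # a medoid always stays in its own cluster.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = i if i in clusters else min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid of a cluster is the member with the
        # smallest summed distance to the other members.
        new_medoids = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def covered_fraction(dist, medoids, threshold=0.15):
    """Fraction of sequences within `threshold` p-distance of some medoid."""
    n = len(dist)
    covered = sum(1 for i in range(n) if min(dist[i][m] for m in medoids) < threshold)
    return covered / n


def choose_medoids(dist, k_min=6, k_max=15, threshold=0.15, target=0.95):
    """Try k = k_min..k_max and stop at the first k exceeding the target.

    Assumption: if no k reaches the target, the result for the largest k
    tried is returned.
    """
    medoids = []
    for k in range(k_min, min(k_max, len(dist)) + 1):
        medoids = k_medoids(dist, k)
        if covered_fraction(dist, medoids, threshold) > target:
            break
    return medoids


# Hypothetical toy matrix: two tight groups (0.05 apart) that are 0.5 from each other.
groups = [0, 0, 0, 1, 1, 1]
toy = [
    [0.0 if i == j else (0.05 if groups[i] == groups[j] else 0.5) for j in range(6)]
    for i in range(6)
]
medoids = choose_medoids(toy, k_min=1, k_max=3)
print(medoids, covered_fraction(toy, medoids))
```

The toy call lowers k_min and k_max so that a six-sequence matrix can be used; with real alignments the defaults correspond to the 6 to 15 range, the 15% divergence threshold, and the 95% coverage target described above.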

make_target_file.py

  • Run within the medoid_alignments directory generated by the previous script
  • Collects the medoid sequences into a single FASTA file with headers in the HybPiper format >source-gene and removes all gap characters
  • Writes to standard output: python make_target_file.py > ../plastid_targets.faa
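
The renaming and gap removal described above might look like the sketch below. It works on in-memory FASTA text rather than on the files in medoid_alignments, because the file naming in that directory is not documented here; the function names, the sample sequence codes, and the assumption that each FASTA header starts with the source name are all hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples.

    Only the first whitespace-delimited token of each header is kept.
    """
    records = []
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_hybpiper_targets(alignments):
    """Build one FASTA string with >source-gene headers and no gap characters.

    alignments maps a gene name to the FASTA text of its medoid alignment.
    """
    lines = []
    for gene, text in alignments.items():
        for source, seq in parse_fasta(text):
            lines.append(f">{source}-{gene}")
            lines.append(seq.replace("-", ""))
    return "\n".join(lines) + "\n"


# Hypothetical medoid alignment for one gene, with made-up source codes.
example = {"rbcL": ">ABCD\nMS-PQ\n>EFGH\nM-SPQ\n"}
targets = to_hybpiper_targets(example)
print(targets, end="")
```

Each output record pairs the source name with the gene name, which is the header layout HybPiper expects in a target file.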

best_medoids_angiosperms.txt

For each gene:

  1. gene name
  2. alignment length
  3. number of medoid sequences
  4. number of sequences not represented (within 15% divergence) by the medoids
  5. sequence names of the medoids

plastid_targets.faa

FASTA file containing the medoid sequences with sequence names ready for use in HybPiper.

Other Output

  • onekp_only_angiosperms_pdistance: a directory containing pairwise distance matrices between all pairs of angiosperm sequences
  • onekp_only_angioperms_degapped: a directory containing the de-gapped alignments (all gap-only sites removed) for just the angiosperms
  • medoid_alignments: a directory containing alignments of just the medoid sequences for each gene
