Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run Diamond blastx and Megan on crab rna-seq reads #26

Closed
kubu4 opened this issue May 30, 2023 · 16 comments
Closed

Run Diamond blastx and Megan on crab rna-seq reads #26

kubu4 opened this issue May 30, 2023 · 16 comments
Assignees

Comments

@kubu4
Copy link
Collaborator

kubu4 commented May 30, 2023

References RobertsLab/resources#1597

@kubu4 kubu4 self-assigned this May 30, 2023
@sr320
Copy link
Collaborator

sr320 commented Jun 27, 2023

@kubu4 what is the status of this?

@kubu4
Copy link
Collaborator Author

kubu4 commented Jun 27, 2023

Technically, the blasting/"MEGANizing" is finished. However, having extreme difficulties getting files opened in MEGAN GUI (required when using the free version) to extract taxonomic breakdown - the files are extremely large and MEGAN keeps crashing...

At this point, I'm mostly resigned to just looking at the smallest file, which only consists of a single set of "MEGANized" reads (R1 only, not paired), and hoping we can consider this one file a "representative" data set? I've gotten it to open, but wanted to have the corresponding R2 reads.

It takes a very long time (an hour or two) to open a file. So, when I attempt to open a file and then MEGAN crashes, it's a massive loss of time.

Anyway, I'll get something posted here sometime on Thursday.

@sr320
Copy link
Collaborator

sr320 commented Jun 27, 2023 via email

@kubu4
Copy link
Collaborator Author

kubu4 commented Jun 27, 2023

It's not a text file - it's some sort of compressed/binary format. Heres a link to the smallest file (68GB), if you want to poke around with it:

https://gannet.fish.washington.edu/Atumefaciens/20230316-mmag-diamond-meganizer-RNAseq/CH09-13.trimmed.R1.blastx.meganized.daa

@kubu4
Copy link
Collaborator Author

kubu4 commented Jun 29, 2023

Okay, I've looked into this a bit more. It turns out that I can convert the DAA files to RMA6 files via command line. I bring this up because it turns out that when importing the MEGANized DAA files into the MEGAN GUI, the GUI is actually converting them to RMA6 format! Gah!

The RMA6 file format, unsurprisingly (since it just contains taxonomic assignment counts/data), is significantly smaller in size (like 2GB vs. 68GB).

I'll get something set up to run on Mox to get these all converted. Then, I should be able to easily, and quickly(!), get them imported into the MEGAN GUI to extract taxonomic counts for all samples.

@kubu4
Copy link
Collaborator Author

kubu4 commented Jun 29, 2023

But, to tide us over until then, here's a quick visual breakdown of taxonomic assignment for the R1 reads for

Phylogenetic tree representing read counts assigned to various taxonomies. Largest circles represent the Proteobacteria (largest) and Arhtropoda classes.

@sr320
Copy link
Collaborator

sr320 commented Jun 29, 2023

do circles indicate prevalence? eg more Proteobacter v Arthropoda?

@kubu4
Copy link
Collaborator Author

kubu4 commented Jun 29, 2023

Correct. Circles represent number of reads assigned within each taxa (legend in top left of that screenshot).

@kristamnichols
Copy link
Collaborator

kristamnichols commented Jun 29, 2023 via email

@mgavery
Copy link
Collaborator

mgavery commented Jul 10, 2023

Wow, lots of bacteria. I'm thinking about these samples and the fact that we put the whole gill filament into the tube. ..multiple filaments even because they were so small. There is a lot of surface area on those feathery gills and filaments for bacteria to hang out on. I'm wondering: 1) can MEGAN break down species any further? maybe not get to a species level but something to compare distributions of "types" of bacteria between OA and treated crabs; 2) what's up with all the not-assigned stuff? that's a big portion of the reads too.

@kubu4
Copy link
Collaborator Author

kubu4 commented Jul 11, 2023

  1. can MEGAN break down species any further?

Definitely. I just posted a quick screencap at the Phylum level for people to glance at. It goes down to the species level.

  1. what's up with all the not-assigned stuff?

Well, my guess is that's due to limitations of the BLAST results (which is what MEGAN is relying on). Since the data is only as good as the BLAST database (and there probably isn't a massive amount of crab sequencing data in NCBI), I'm guessing there are a LOT of matches to undefined proteins (or something similar). So, there's likely not much MEGAN can do with those sequencing reads to try to perform/predict taxonomic assignments.

We could possibly try to redo the analysis with some relaxed BLAST parameters and/or MEGAN parameters to see if it helps reduced the number of unassigned reads, but it will take awhile (BLASTing/MEGANizing took ~45 days and conversion to the RMA6 file for loading into the MEGAN GUI takes ~10 days)...

Another option is to extract the reads assigned to Arthropoda (and below), assemble transcriptome and then try to align the unassigned reads to the transcriptome to identify those that are likely crab sequences. Then, we'd extract those reads and use them to create an updated transcriptome.

@kristamnichols
Copy link
Collaborator

kristamnichols commented Jul 11, 2023 via email

@kubu4
Copy link
Collaborator Author

kubu4 commented Jul 18, 2023

Alrighty, I've uploaded output tables and screencaps:

https://github.com/laurahspencer/DuMOAR/tree/main/results/MEGAN

There is one table per FastQ. The format is:

Phylum Read Counts
Acidobacteria 1045
Aquificae 1253
Bacteroidetes 6709
Proteobacteria 99622
Chlamydiae 643
Planctomycetes 4057
Actinobacteria 4766
Firmicutes 1334
Microsporidia 32041
Chordata 211114
Nematoda 1718
Arthropoda 1187325
Mollusca 10101
Platyhelminthes 17161
Streptophyta 4859
Uroviricota 773

@kubu4 kubu4 closed this as completed Jul 18, 2023
@sr320
Copy link
Collaborator

sr320 commented Jul 25, 2023

added some data summary https://rpubs.com/sr320/1065657

@laurahspencer
Copy link
Owner

laurahspencer commented Jul 25, 2023

image

Red and blue colors on sample names indicate pH treatment (red=OA, blue=ambient)

Samples that we tossed during methylation analysis:
CH05-06, CH07-24, CH07-11, CH03-04, CH03-15, CH10-08, CH09-13, CH05-01.

Neither treatment nor weird methylation data seem to line up with % RNASeq reads identified as arthropoda.

@laurahspencer
Copy link
Owner

No clear correlation between % reads assigned to Arthropoda (y-axis) and genome alignment rate (x-axis).
NOTE: the % assigned to Arthropoda doesn't consider how many reads were not assigned.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants