-
Notifications
You must be signed in to change notification settings - Fork 0
/
08-AssigningTaxonomy.Rmd
211 lines (149 loc) · 10.3 KB
/
08-AssigningTaxonomy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
# **ASSIGNING TAXONOMY**
Assigning taxonomy to ASVs (Amplicon Sequence Variants) involves classifying unique genetic sequences into taxonomic categories (e.g., genus or species) to identify the organisms they originate from. This helps researchers understand the microbial composition in a sample based on high-resolution genetic data. To assign taxonomy to ASVs using a reference database, you can use a sequence classification tool like QIIME2's feature-classifier. In this case, you basically provide the ASV sequences as input, and the tool matches them to known sequences in the reference database to assign taxonomic labels.
## **Classifying ASV's**
There are three options for classifying ASVs taxonomically in QIIME:
1) `feature-classifier classify-sklearn`
2) `feature-classifier consensus-vsearch`
3) `feature-classifier classify-consensus-vsearch`
Here, I used the `classify-sklearn` approach, which is known for its speed and efficiency, and can be advantageous when working with large data sets. It's also a well-documented and widely used method, making it a reliable option with ample community support. However, it's important to note that the best choice may depend on your specific data set and research goals, so it's always a good idea to consider the strengths and limitations of each classification method before making a decision.
### Using scikit-learn
Scikit-learn (sklearn) is a powerful machine learning library, including a naive Bayes classifier, used to categorize marker-gene sequences in QIIME based on a pre-trained model (silva-138-99-nb-classifier.qza), revealing the composition of microbial communities in the data set.
To begin, a taxonomic classifier was obtained from the full length SILVA reference sequences on the Qiime2 website: (https://docs.qiime2.org/2022.2/data-resources/)
Download the classifier into the main directory with the `wget` command (this can take a long time and can also be done in a job script)
```{r, eval=FALSE}
wget https://data.qiime2.org/2023.7/common/silva-138-99-nb-classifier.qza
```
Next, run the full length 16S taxonomic classification on the read files with the following qiime feature-classifier:
```{r, eval=FALSE}
#!/bin/bash
#SBATCH --time=
#SBATCH --account=<group-ID>
#SBATCH --mem=
export APPTAINER_BIND=/project/6011879/<user-ID>/:/data
export APPTAINER_WORKDIR=${SLURM_TMPDIR}
module load qiime2/2023.5
#for sediment core samples
qiime feature-classifier classify-sklearn \
--i-reads /data/amplicon-analysis/rep-seqs_sediments.qza \
--i-classifier /data/amplicon-analysis/silva-138-99-nb-classifier.qza \
--o-classification /data/amplicon-analysis/classification_sediments.qza
#for seabird fecal swab samples:
qiime feature-classifier classify-sklearn \
--i-reads /data/amplicon-analysis/rep-seqs_birds.qza \
--i-classifier /data/amplicon-analysis/silva-138-99-nb-classifier.qza \
--o-classification /data/amplicon-analysis/classification_birds.qza
#for the merged file:
qiime feature-classifier classify-sklearn \
--i-reads /data/amplicon-analysis/merged_rep-seqs.qza \
--i-classifier /data/amplicon-analysis/silva-138-99-nb-classifier.qza \
--o-classification /data/amplicon-analysis/classification.qza
```
## **Verifying with BLASTn**
As a safety measure, use the feature-table plug-in **[17]** to run a `tabulate-seqs` **[18]** command that will generate a table with all the Feature ID's associated to each representative sequence. Using this plug-in, it's also possible to use BLASTn to evaluate the subset of taxonomic assignments in the taxa files. By comparing the taxonomic assignments from the preceding output to the leading BLASTn hits, you can verify the classifier's accuracy. Let's do this for the `rep-seqs` file for the sediment and seabird fecal samples, as well as the `merged_rep-seqs` file:
```{r, eval=FALSE}
#!/bin/bash
#SBATCH --time=
#SBATCH --account=<group-ID>
#SBATCH --mem=
export APPTAINER_BIND=/project/6011879/<user-ID>/:/data
export APPTAINER_WORKDIR=${SLURM_TMPDIR}
module load qiime2/2023.5
qiime feature-table tabulate-seqs \
--i-data /data/amplicon-analysis/rep-seqs_sediments.qza \
--o-visualization /data/amplicon-analysis/BLAST_sediments_rep-seqs.qzv
qiime feature-table tabulate-seqs \
--i-data /data/amplicon-analysis/rep-seqs_birds.qza \
--o-visualization /data/amplicon-analysis/BLAST_bird_rep-seqs.qzv
qiime feature-table tabulate-seqs \
--i-data /data/amplicon-analysis/merged_rep-seqs.qza \
--o-visualization /data/amplicon-analysis/BLAST_merged_rep-seqs.qzv
```
Open the Blast files in QIIME2 and there will be three tables that load:
1) The sequence length statistics
2) The seven-number summary of sequence lengths
3) The sequence table
BLAST a sequence against the NCBI nt database by clicking on the linked sequences within the sequence table, which will will open NCBI's blastn suite. Click on the `View Report` button and wait for the suite to load the nt data for this sequence (this can take a while). You can do this with around five sequences at random to determine whether or not the results from the taxonomic classification through QIIME2 align with the NCBI database.
### Filtering contaminants
After assigning taxonomy, it's possible to remove any contaminants or noise in the data by filtering out unwanted assignments based on the taxonomic labels given to ASVs. Here, I remove any mitochondrial and chloroplast 16S sequences by excluding any ASV which contains those terms in its taxonomic label, including any ASV that is unclassified at the phylum level. We omitted the parameter `<--p-include p__>` because we are studying a poorly characterized environment where there is potential to identify novel phyla.
```{r, eval=FALSE}
#!/bin/bash
#SBATCH --time=
#SBATCH --account=<group-ID>
#SBATCH --mem=
export APPTAINER_BIND=/project/6011879/<user-ID>/:/data
export APPTAINER_WORKDIR=${SLURM_TMPDIR}
module load qiime2/2023.5
#for sediment samples
qiime taxa filter-table \
--i-table /data/amplicon-analysis/filtered_feature-table_sediments.qza \
--i-taxonomy /data/amplicon-analysis/classification_sediments.qza \
--p-exclude mitochondria,chloroplast \
--o-filtered-table /data/amplicon-analysis/final_table-sediments.qza
qiime feature-table summarize \
--i-table /data/amplicon-analysis/final-table_sediments.qza \
--o-visualization /data/amplicon-analysis/summaries/final_table-sediments.qzv
#for bird fecal samples
qiime taxa filter-table \
--i-table /data/amplicon-analysis/filtered_feature-table_birds.qza \
--i-taxonomy /data/amplicon-analysis/classification_birds.qza \
--p-exclude mitochondria,chloroplast \
--o-filtered-table /data/amplicon-analysis/final_table-birds.qza
qiime feature-table summarize \
--i-table /data/amplicon-analysis/final-table_birds.qza \
--o-visualization /data/amplicon-analysis/summaries/final_table-birds.qzv
#for the merged table
qiime taxa filter-table \
--i-table /data/amplicon-analysis/merged-table.qza \
--i-taxonomy /data/amplicon-analysis/classification.qza \
--p-exclude mitochondria,chloroplast \
--o-filtered-table /data/amplicon-analysis/final_merged-table.qza
qiime feature-table summarize \
--i-table /data/amplicon-analysis/final_merged-table.qza \
--o-visualization /data/amplicon-analysis/summaries/final_merged-table.qzv
```
## **Grouping feature tables**
Next, let's take the merged feature table and create a grouped feature table that takes the sum of all of the feature counts from each group. This table will be useful later on in the analysis when we are visualizing data. To group all of the samples, you can use the QIIME2 platform to make a feature table that sums up all the feature counts for every featureID according to each sample group with the `qiime feature-table group` function. This is done using the accompanying [metadata.csv](https://github.com/johannabosch/QIIME2_for_Graham/blob/main) files, which specify which samples are a part of which group under the column titled `group`.
To get a better idea of what is meant by the term 'grouping', take some of the data we are using here as an example. In this table below, there are 2 northern gannet (*Morus_bassanus*) samples and 2 black-legged kittiwake (*Rissa_tridactyla*) samples. In the metadata file, the column group contains the scientific names for each seabird species, so when you run the grouping command and specifiy this column the feature table will be summed up accordingly. Right now this feature table (assigned to the term `bird_features`) contains feature counts for every sample individually, but we want it to be grouped, so that it looks like Table 2:
**Table 1. Structure of bird_features file:**
|FeatureID|NOGA-01|NOGA-02|BLKI-01|BLKI-02|
|---------|-------|-------|-------|-------|
|4595dec6ed...|3498|10022|1244|589|
**Table 2. Structure of a grouped bird_features file:**
|FeatureID|Morus_bassanus|Rissa_tridactyla|
|---------|----|----|
|4595dec6ed...|13520|1833|
To group the `final_merged-table.qza`, use the following command:
```{r, eval=FALSE}
#!/bin/bash
#SBATCH --time=
#SBATCH --account=<group-ID>
#SBATCH --mem=
export APPTAINER_BIND=/project/6011879/<user-ID>/:/data
export APPTAINER_WORKDIR=${SLURM_TMPDIR}
module load qiime2/2023.5
qiime feature-table group \
--i-table /data/amplicon-analysis/final_merged-table.qza \
--p-axis sample \
--m-metadata-file /data/amplicon-analysis/metadata/metadata.csv \
--m-metadata-column group \
--p-mode sum \
--o-grouped-table /data/amplicon-analysis/grouped-table.qza
qiime feature-table summarize \
--i-table /data/amplicon-analysis/grouped-table.qza \
--o-visualization /data/amplicon-analysis/summaries/grouped-table.qzv
```
If you're working the seabird and sediments data, you now have a table that contains all of the data, summed up by each group listed in the `metedata-grouped.csv` table.
___
<hr />
<p style="text-align: center;">Github: [johannabosch](https://github.com/johannabosch) </a></p>
<p style="text-align: center;"><span style="color: #808080;"><em>yohannabosch@gmail.com</em></span></p>
<!-- Add icon library -->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
<!-- Add font awesome icons -->
<p style="text-align: center;">
<a href="https://www.instagram.com/yohannabosch/" class="fa fa-instagram"></a>
<a href="https://www.linkedin.com/in/yan-holtz-2477534a/" class="fa fa-linkedin"></a>
<a href="https://github.com/johannabosch/" class="fa fa-github"></a>
</p>