<img src="30S.png" width="50%" alt="Image of 30S ribosomal subunit" style="float:right"/>

# Ribosomal Database Project

The Ribosomal Database Project (RDP) is a bioinformatics project that aims to provide a comprehensive database of ribosomal RNA (rRNA) sequences and related data for use in phylogenetic analysis and microbial ecology research. The project was initiated in 1984 by Carl Woese, a pioneer in the field of molecular evolution, and is currently maintained by a team of researchers at Michigan State University.

The RDP contains a curated database of rRNA sequences from bacteria, archaea, and eukaryotes. The database includes both 16S and 18S rRNA sequences, which are commonly used as molecular markers for bacterial and eukaryotic phylogenetic analysis, respectively. 

In this workshop, we'll use a curated file of approximately 15,000 16S rDNA sequences.


# Retrieving your Sequences

The 16S sequences obtained by Nanopore sequencing have been provided in a CSV file called `Discovery_23_16S_consensus.csv`.

Samples were identified using extra sequences known as 'barcodes' that were added to the PCR primers. Each sample was tagged with a unique combination of one of 12 forward and one of 8 reverse primer sequences.

Identify your sample's forward and reverse primers from the image below:

<img src="dna_samples_w5_23.png" width="85%" alt="DNA samples"/>

# Workshop tasks

Today's workshop is divided into 5 sections:
* Retrieving your 16S sequences
* Loading the reference 16S sequence database
* Searching for database matches
* Visualising your analysis results
* Evaluation and Reflection on Results

### Retrieving your 16S sequences

1. Load the file `Discovery_23_16S_consensus.csv` into memory and explore the dataframe. Retrieve your 16S sequences from the Pandas dataframe using the `.query()` method, selecting the rows that match your combination of forward and reverse primers. If neither of your group's 2 pairs of forward and reverse primers has a match, run the code below to find a combination of forward and reverse primers that does have reads, and choose one for further analysis in this workshop.
```python
import pandas as pd

# Load the dataframe
df = pd.read_csv("Discovery_23_16S_consensus.csv")
# Create a contingency table counting the reads for each
# combination of forward and reverse primers
ct = pd.crosstab(df['reverse_primer'], df['forward_primer'])
print(ct)
```

2. Extract your 16S sequences to a list using `.tolist()`. From the list, try extracting individual sequences using indexing (for example `mylist[0]` or `mylist[1]`).
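
Steps 1 and 2 could look something like the sketch below. It uses a tiny made-up dataframe in place of the workshop file: the primer labels (`F1`, `R2`), the `sequence` column name and the sequences themselves are assumptions for illustration only, so check the real column names with `df.columns` before adapting it.

```python
import pandas as pd

# Toy stand-in for the workshop dataframe (all values here are invented)
toy_df = pd.DataFrame({
    "forward_primer": ["F1", "F1", "F2"],
    "reverse_primer": ["R2", "R2", "R1"],
    "sequence": ["ATGCTGAG", "ATGCTAGG", "TTGACCGA"],  # assumed column name
})

# Keep only the rows tagged with one (hypothetical) primer combination
my_rows = toy_df.query("forward_primer == 'F1' and reverse_primer == 'R2'")

# Convert the sequence column to a plain Python list
mylist = my_rows["sequence"].tolist()

print(len(mylist))   # number of sequences found for this primer pair
print(mylist[0])     # first sequence in the list
```

Indexing starts at 0, so `mylist[0]` is the first sequence and `mylist[1]` the second.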


### Loading the reference 16S sequence database

3. Load the `16S_reference.csv` database into a pandas dataframe (use a different variable name from the one you used when loading your sequences). This is a large version of the 'mini' dataframe you used last week. How many sequences are in the database?
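
One possible way to load the database and count its sequences is sketched below. So that it runs on its own, the sketch reads a three-line in-memory CSV instead of the real file; the `species` and `sequence` column names are assumptions and may differ from those in `16S_reference.csv`.

```python
import io
import pandas as pd

# Stand-in for 16S_reference.csv so this sketch runs on its own (invented data)
toy_csv = io.StringIO(
    "species,sequence\n"
    "Species A,ATGCTGAG\n"
    "Species B,ATGCTAGG\n"
    "Species C,TTGACCGA\n"
)

# In the workshop you would pass the file name instead:
# ref_df = pd.read_csv("16S_reference.csv")
ref_df = pd.read_csv(toy_csv)

# Each row holds one sequence, so the row count is the number of sequences
n_sequences = len(ref_df)
print(n_sequences)
print(ref_df.shape)  # (rows, columns)
```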





### Searching for database matches

4. For each of your sequences, create a new column in the reference dataframe that holds the Levenshtein distance between that sequence and each sequence in the database. The code you developed at the end of the week 7 workshop is close to what you will need here. Make sure that you use a different column name for the distance to each sequence, for example 'dist_to_seq_1', 'dist_to_seq_2', etc. You can do this by running modified code for each of your group's sequences in turn (or each person can analyse a different one). If you're feeling comfortable and advanced, you could use a loop to process every sequence in your group's list.
5. Identify the best match(es) for each sequence: how close is the match? 


```python
# pip is a package manager: this command installs the levenshtein package on our Noteable instance.
!pip install levenshtein
```



```python
from Levenshtein import distance
```
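
A minimal sketch of the loop approach from tasks 4 and 5 is shown below, under some assumptions: the toy reference dataframe, its `species` and `sequence` column names, and the query sequences are all invented. The plain-Python fallback is only there so the sketch still runs if the levenshtein package is not installed; on Noteable the imported `distance` function is used.

```python
import pandas as pd

try:
    from Levenshtein import distance  # fast version from the levenshtein package
except ImportError:
    def distance(a, b):
        """Plain-Python Levenshtein distance, used only if the package is missing."""
        previous = list(range(len(b) + 1))
        for i, char_a in enumerate(a, start=1):
            current = [i]
            for j, char_b in enumerate(b, start=1):
                cost = 0 if char_a == char_b else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

# Toy reference database and query sequences (invented for illustration)
toy_ref = pd.DataFrame({
    "species": ["Species A", "Species B", "Species C"],
    "sequence": ["ATGCTGAG", "ATGCTAGG", "TTGACCGA"],
})
my_sequences = ["ATGCTGAG", "TTGACCGT"]

for n, seq in enumerate(my_sequences, start=1):
    column = f"dist_to_seq_{n}"
    # Distance from this query sequence to every database sequence
    toy_ref[column] = toy_ref["sequence"].apply(lambda ref: distance(seq, ref))
    # The best match is the row with the smallest distance
    best = toy_ref.loc[toy_ref[column].idxmin()]
    print(column, best["species"], best[column])
```

Note that `idxmin()` returns only the first row with the smallest distance. To see ties, sort with `toy_ref.sort_values(column).head()` instead.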



### Visualising your analysis results

6. Run a Needleman-Wunsch alignment between your 16S sequence(s) and the best match. How much of the sequence is aligned? Can you see poorer alignment at the ends of the sequence caused by the presence of PCR primers in your sequence?

Example code for running a Needleman-Wunsch alignment is shown below:

```python
import needleman_wunsch
seq1='ATGCTGAGCTAGCGGCTATATTCTATCGGGAGCGATTTACTACTC'
seq2='ATGCTAGGTAGCGGACTATATACTATCGCGAGCGATTAACTAGCC'
print(needleman_wunsch.align(seq1, seq2))
```
Expected output:
```
seq1: ATGCTGAGCTAGCGG-CTATATTCTATCGGGAGCGATTTACTA-CTC
      ||||| || |||||| |||||| |||||| |||||||| |||| | |
seq2: ATGCT-AGGTAGCGGACTATATACTATCGCGAGCGATTAACTAGC-C
```
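
To put a number on "how much of the sequence is aligned", one option is to count matches, mismatches and gaps across the two aligned strings. The sketch below is illustrative only: `alignment_summary` is a hypothetical helper, not part of the `needleman_wunsch` module, and the aligned strings are copied by hand from the expected output above, because the exact return format of `needleman_wunsch.align` is not assumed here.

```python
def alignment_summary(aligned1, aligned2, gap="-"):
    """Count matches, mismatches and gaps between two aligned sequences."""
    if len(aligned1) != len(aligned2) or len(aligned1) == 0:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(aligned1, aligned2):
        if a == gap or b == gap:
            gaps += 1          # one sequence has no base at this position
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        # Identity as a percentage of all alignment columns
        "percent_identity": 100 * matches / len(aligned1),
    }

# Aligned strings taken from the example output above
aligned_seq1 = "ATGCTGAGCTAGCGG-CTATATTCTATCGGGAGCGATTTACTA-CTC"
aligned_seq2 = "ATGCT-AGGTAGCGGACTATATACTATCGCGAGCGATTAACTAGC-C"

summary = alignment_summary(aligned_seq1, aligned_seq2)
print(summary)
```

Running the same helper on only the first or last 20 or so columns of an alignment is one way to check whether the ends align more poorly than the middle.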



### Evaluation and Reflection on Results

7. Draw a histogram of the sequence distances between your group's 16S sequence(s) and each of the sequences in the database. What is the species that has the most distant sequence in the database for your 16S sequence? Does that make sense when put into the context of the tree of life?
8. Is the species you've identified known to produce antimicrobial compounds? 
9. Do you think that the species classification has been accurately predicted? What factors could affect the accuracy of your sequencing hit?


```python
import seaborn as sns
```


