# A1196 Conservation Analysis
*Elliot Williams*

*August 13th, 2019*


**Note to users of this code**: If you use it, let me know! I would be interested to hear what it is being used for.
Also, if you publish anything using it, please cite me!


Question: **Is A1196/A1427 (*E. coli* / *S. cerevisiae* numbering) of the 16S/18S rRNA well conserved within RCSB-deposited ribosome structures? Is this also true of C1054?**

## Background

The Weir Lab has been conducting MD simulations of the region of the ribosome within a 35 angstrom sphere surrounding G530 of the 16S/18S rRNA. The structures for these simulations were taken from the five cryo-EM translocation intermediate structures reported in [this paper](https://elifesciences.org/articles/14874) by the Korostelev research group at UMass Medical School.

Within these simulations (presumably), it has been noticed that A1196 (which base stacks with C1054) transiently interacts via Hoogsteen/Watson-Crick base pairing with the second nucleotide of the incoming A-site codon, raising the question of a model in which the codon:anticodon base pairing is extended to 5 base pairs.

The goal of this assessment is to find out whether or not the A1427 residue is well conserved in the available [RCSB](http://www.rcsb.org/)-deposited ribosome structures. Additionally, as C1054 has been shown to [interact with the mRNA](https://academic.oup.com/nar/article/44/13/6036/2457638) during molecular dynamics simulations (as well as during our own Korostelev structure simulations), it would be interesting to look at conservation of this base within structures as well.

One would expect that conservation of a residue, either on an intraspecies or interspecies level, would suggest mechanistic importance of the residue to the ribosome's function; so, getting a sense of a particular residue's level of conservation will allow us, by proxy, to get at a residue's functional importance.

If we find intraspecies conservation but not interspecies conservation, we might expect that the residue is important for the ribosome's function at a species-specific level. If no conservation is found at all, it is unlikely that the residue has high functional relevance.

With this background stated before the analysis itself can bias my thinking, let us move on to the task at hand.




## Data Collection

### Data Source Description

The data I am collecting to conduct my analysis is sourced from the [RCSB](http://www.rcsb.org/), an online database of Protein Data Bank (PDB) files representing the 3D shapes of proteins, nucleic acids, and complex assemblies.

### RCSB API implementation

I'm reusing an RCSB API implementation I made last year to fetch data from the website, defined within `rcsb_api.py`.

The benefit of this is that I can get any data I please, if I implement the API logic.
The drawback is that any new queries must be implemented. 

See [PyPDB](http://www.wgilpin.com/pypdb_docs/html/) for a more mature project with more limited functionality.

In [1]:
# All dependencies, including the rcsb_api.py dependencies (of `requests`, `lxml`, `pandas`),
# should be installed for this to work properly.

# Additionally, you should install Clustal Omega to do the alignment later on
# Example:
# 1) Download the executable version from clustal.org/omega/
# 2) Run `sudo mv [CLUSTAL_OMEGA_PATH] /usr/bin/clustalo`
# 3) Run `sudo chmod +x /usr/bin/clustalo` to make it executable

# First, let's import the RCSB Python API I defined
import rcsb_api as rcsb
import pandas as pd

# And let's set the pandas display variables
pd.set_option('display.max_colwidth', 80)
pd.set_option('display.max_rows', 50)

In [2]:
# Then, let's query for all PDB IDs which have 'ribosome' in their title
pdb_ids = rcsb.query_for_pdb_ids(queryTitle="ribosome", verbose=False)
# And let's get all metadata associated with these IDs so we can inspect them
pdb_ids_w_info = rcsb.get_info_from_ids(pdb_ids)

# # The following lines print the DataFrame in full
# with pd.option_context('display.max_rows', None, 
#                        'display.max_columns', None,
#                        'display.max_colwidth', None):
#     display(pdb_ids_w_info)

# I don't want to keep scrolling past this, so let's actually only print some rows
display(pdb_ids_w_info)

Unnamed: 0,structureId,structureTitle,resolution,depositionDate,experimentalTechnique,residueCount
0,1AHA,THE N-GLYCOSIDASE MECHANISM OF RIBOSOME-INACTIVATING PROTEINS IMPLIED BY CRY...,2.20,1994-01-07,X-RAY DIFFRACTION,246
1,1AHB,THE N-GLYCOSIDASE MECHANISM OF RIBOSOME-INACTIVATING PROTEINS IMPLIED BY CRY...,2.20,1994-01-07,X-RAY DIFFRACTION,246
2,1AHC,THE N-GLYCOSIDASE MECHANISM OF RIBOSOME-INACTIVATING PROTEINS IMPLIED BY CRY...,2.00,1994-01-07,X-RAY DIFFRACTION,246
3,1DD5,"CRYSTAL STRUCTURE OF THERMOTOGA MARITIMA RIBOSOME RECYCLING FACTOR, RRF",2.55,1999-11-08,X-RAY DIFFRACTION,185
4,1DK1,DETAILED VIEW OF A KEY ELEMENT OF THE RIBOSOME ASSEMBLY: CRYSTAL STRUCTURE O...,2.80,1999-12-06,X-RAY DIFFRACTION,143
...,...,...,...,...,...,...
741,6S0X,Erythromycin Resistant Staphylococcus aureus 70S ribosome (delta R88 A89 uL2...,2.42,2019-06-18,ELECTRON MICROSCOPY,10148
742,6S0Z,Erythromycin Resistant Staphylococcus aureus 50S ribosome (delta R88 A89 uL2...,2.30,2019-06-18,ELECTRON MICROSCOPY,6125
743,6S12,Erythromycin Resistant Staphylococcus aureus 50S ribosome (delta R88 A89 uL22).,3.20,2019-06-18,ELECTRON MICROSCOPY,6145
744,6S13,Erythromycin Resistant Staphylococcus aureus 70S ribosome (delta R88 A89 uL22).,3.58,2019-06-18,ELECTRON MICROSCOPY,10167


## PDB ID Filtering

It is obvious if you look through the titles that only some of these structures are even of the ribosome, let alone structures containing the 16S/18S rRNA. So, to obtain only the structures we want, we need a list of what each structure contains, and then we need to apply a filter.

We only want structures with the 16S/18S rRNA, so let's get the descriptions of the chains of each structure,
and use it as a filtering condition.

As an example, let's see what the chain descriptions of one structure look like:

Let's collect this information for all of our PDB IDs,
and try to see how many possible chain names there are.

In [3]:
# Sets containing all unique chain names and descriptions in all structures
# TODO: factor fasta join out of this function, instead do it within notebook logic
# (or add an option <--- this, as jeez it slows the queries down)
chain_name_set = set()
chain_description_set = set()
counter = 0

for (index, pdb_id) in enumerate(pdb_ids):
    # Prints out string to update on progress
    if index % 50 == 0:
        print("Getting chains for %s (%d/%d)" % (pdb_id, index+1, len(pdb_ids)))
    
    # Attempts to get chain descriptions from the API 
    # (HTTP request could fail due to flakiness; will print out if so)
    try:
        chain_df = rcsb.describe_chains(pdb_id)
        for chain_name in chain_df.Name:
            chain_name_set.add(chain_name)
        for chain_description in chain_df.Description:
            chain_description_set.add(chain_description)
    except Exception:
        print("Request for '%s' failed" % pdb_id)

Getting chains for 1AHA (1/746)
Getting chains for 1RY1 (51/746)
Getting chains for 2QES (101/746)
Getting chains for 3JAJ (151/746)
Getting chains for 4CSU (201/746)
Getting chains for 4TUE (251/746)
Getting chains for 4V50 (301/746)
Getting chains for 4V7E (351/746)
Getting chains for 4W29 (401/746)
Getting chains for 5BY8 (451/746)
Getting chains for 5J88 (501/746)
Getting chains for 5NDJ (551/746)
Getting chains for 5WFS (601/746)
Getting chains for 6FKR (651/746)
Getting chains for 6MTC (701/746)


In [4]:
# Let's see how many unique chain names and chain descriptions there are
print("There are %d unique chain names" % len(chain_name_set))
print("There are %d unique chain descriptions" % len(chain_description_set))

There are 1024 unique chain names
There are 3291 unique chain descriptions


In [5]:
# What do chain names tend to look like?
display(list(chain_name_set)[1:20])

['30S ribosomal protein S27ae',
 'Guanine nucleotide-binding protein subunit beta-like protein',
 '39S ribosomal protein L27, mitochondrial',
 '40S ribosomal protein SA',
 '40S ribosomal protein S22-like protein',
 'Signal recognition particle receptor subunit alpha',
 '60S ribosomal protein L14',
 '50S ribosomal protein L31',
 '28S ribosomal protein S34, mitochondrial',
 'N-terminal acetyltransferase A complex subunit NAT1',
 '39S ribosomal protein L9, mitochondrial',
 'KRR1 small subunit processome component',
 'Protein transport protein Sec61 subunit gamma',
 '54S ribosomal protein L36, mitochondrial',
 'Ribosomal protein L35Ae, putative',
 'Ribosomal RNA-processing protein 9',
 '30S ribosomal protein S18 2',
 '39S ribosomal protein L54, mitochondrial',
 '28S ribosomal protein S31, mitochondrial']

In [6]:
# What do chain descriptions tend to look like?
display(list(chain_description_set)[1:20])

['A-SITE TRNA A9C TRP-TRNA TRP',
 'Tetracycline resistance protein TetM',
 'N-terminal acetyltransferase A complex subunit NAT1',
 '60S ribosomal protein rpL6 (L6e)',
 '60S ribosomal protein L14',
 'ms57',
 'SHORT RRNA-II OF THE LARGE RIBOSOMAL SUBUNIT',
 'PROTEIN TRANSPORT PROTEIN SEC61 SUBUNIT GAMMA',
 'Protein transport protein Sec61 subunit gamma',
 'Messenger RNA',
 'Nascent peptide',
 '40S RIBOSOMAL PROTEIN S13',
 'MITORIBOSOMAL PROTEIN ML65, MRPS30',
 '60S ribosomal protein L8',
 '40S ribosomal protein S26-B',
 'KLLA0E06843p',
 'Utp18',
 'ribosomal protein S7',
 'Signal recognition particle 19 kDa protein (SRP19)']

It seems at first glance that chain names and chain descriptions are used mostly interchangeably. 
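
One rough way to put a number on that impression would be to compare the two sets after normalizing case and whitespace. The sketch below is only an illustration: `normalized_overlap` is a hypothetical helper (it is not part of `rcsb_api.py`), and the sample strings are a handful of values copied from the outputs above rather than the full sets.

```python
def normalized_overlap(names, descriptions):
    """Fraction of distinct names that also occur as a description,
    ignoring case and surrounding whitespace."""
    norm_names = {n.strip().lower() for n in names if n}
    norm_descriptions = {d.strip().lower() for d in descriptions if d}
    if not norm_names:
        return 0.0
    return len(norm_names & norm_descriptions) / len(norm_names)


# A few values taken from the printed samples above (not the full sets)
sample_names = {
    "60S ribosomal protein L14",
    "Protein transport protein Sec61 subunit gamma",
    "N-terminal acetyltransferase A complex subunit NAT1",
    "KRR1 small subunit processome component",
}
sample_descriptions = {
    "60S ribosomal protein L14",
    "PROTEIN TRANSPORT PROTEIN SEC61 SUBUNIT GAMMA",
    "N-terminal acetyltransferase A complex subunit NAT1",
    "Messenger RNA",
}

print(normalized_overlap(sample_names, sample_descriptions))
```

Running the same function on `chain_name_set` and `chain_description_set` would give the overlap for the real data.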

In [7]:
# Let's try printing all chain names with "16" or "18" in their names,
# to give us an idea of what kind of filters would work to get all of the 16S/18S rRNA Structures
# (Note: For display purposes, only printing first 10 names)
display(list(filter(lambda x: ("16" in x) or ("18" in x), chain_name_set ))[0:10] + ["etc..."])

['30S ribosomal protein S18 2',
 '60S ribosomal protein L18-A',
 '37S ribosomal protein S16, mitochondrial',
 '50S ribosomal protein L16, putative',
 '40S ribosomal protein S16-B',
 '30S ribosomal protein S16, chloroplastic',
 'KLLA0C18216p',
 '54S ribosomal protein L16, mitochondrial',
 '60S ribosomal protein L18a',
 'Ribosomal protein L18',
 'etc...']

In [8]:
# Do the same for chain descriptions
# (Note: For display purposes, only printing first 10 descriptions)
display(list(filter(lambda x: ("16" in x) or ("18" in x), chain_description_set ))[0:10]+ ["etc..."])

['Utp18',
 'MITORIBOSOMAL 16S RRNA',
 'Mitochondrial ribosomal protein L16',
 'Fragment of 16S rRNA (h15)',
 '18S rRNA (1707-MER)',
 'UL18',
 '60S ribosomal protein L16',
 '16S rRNA (1504-MER)',
 '60S ribosomal protein eL18',
 'uL18m',
 'etc...']

Having looked through all of these chain names and descriptions, I'm convinced that the vast majority of 16S/18S rRNA candidates will have `16S` or `18S` in their descriptions.

So, let's filter our set of structures to just contain those that have 16S or 18S in their descriptions.
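
As a side note, a plain substring test is what this notebook uses, but a slightly stricter variant is possible. The sketch below is an assumption on my part, not the filter applied in this analysis: it uses a regular expression that only accepts `16S`/`18S` as a standalone token, and `looks_like_small_subunit_rrna` is a hypothetical name. The sample descriptions are the ten printed above.

```python
import re

# Matches "16S" or "18S" only when not glued to other letters or digits
RRNA_TOKEN = re.compile(r"(?<![A-Za-z0-9])1[68]S(?![A-Za-z0-9])", re.IGNORECASE)


def looks_like_small_subunit_rrna(description):
    """Return True if the description mentions 16S or 18S as its own token."""
    return bool(RRNA_TOKEN.search(description))


sample = [
    "Utp18",
    "MITORIBOSOMAL 16S RRNA",
    "Mitochondrial ribosomal protein L16",
    "Fragment of 16S rRNA (h15)",
    "18S rRNA (1707-MER)",
    "UL18",
    "60S ribosomal protein L16",
    "16S rRNA (1504-MER)",
    "60S ribosomal protein eL18",
    "uL18m",
]

hits = [d for d in sample if looks_like_small_subunit_rrna(d)]
for hit in hits:
    print(hit)
```

On these ten descriptions the regular expression and the substring test agree; they could differ on the full set, so the counts reported in this notebook apply to the substring version only.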

In [9]:
# This function returns a bool representing whether or not the
# PDB structure has any of the given substrings in any of its chain names,
# chain descriptions, or both (depending on value of query_fields).
#
# Input:
#   pdb_id : PDB ID of structure we want to evaluate
#   substring_list : List of substrings to query on
#   query_fields : Which pdb chain column(s) we want to filter on
#                  (must be sublist of ["Name", "Description"])
#   num_attempts : The number of times to try the HTTP request to the RCSB API before
#                  giving up (and returning False)
# 
# Output:
#   Whether or not the PDB structure contains any chains satisfying these conditions.
# 
# EG  `chainFieldsContainSubstrings("5JUP", ["16S", "18S"], ["Description"])`
# should return `True`, as it has an 18S chain description
# 
# And `chainFieldsContainSubstrings("5JUP", ["16S", "18S"], ["Name", "Description"])`
# should return `True`, as it has an 18S chain description (though no 18S chain name)
# 
# But `chainFieldsContainSubstrings("5JUP", ["16S", "18S"], ["Name"])`
# should return `False`, as it has no 18S chain name
def chainFieldsContainSubstrings(pdb_id, substring_list, 
                                 query_fields=["Name", "Description"],
                                 num_attempts=3):    
    # Retries the RCSB query multiple times (to reduce flakiness)
    num_attempted = 0
    success = False
    while (num_attempted < num_attempts):
        try:
            chain_df = rcsb.describe_chains(pdb_id)
            success = True
            break
        except Exception:
            num_attempted += 1
    if not success:
        print("Failed to obtain chains from '%s'" % pdb_id)
        return False
    
    # Returns `True` if any of the substrings are in any of the chain names
    if "Name" in query_fields:
        for chain_name in chain_df.Name:
            for substring in substring_list:
                if substring in chain_name:
                    return True
    # Returns `True` if any of the substrings are in any of the chain descriptions
    if "Description" in query_fields:
        for chain_description in chain_df.Description:
            for substring in substring_list:
                if substring in chain_description:
                    return True
                
    # Otherwise, return `False`
    return False
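
The retry loop at the top of this function could also be factored into a reusable helper. The following is a minimal sketch under my own assumptions: `call_with_retries` and `FlakyCall` are hypothetical names, and `FlakyCall` is a toy stand-in for a flaky HTTP request so that the example runs without network access.

```python
import time


def call_with_retries(func, *args, num_attempts=3, delay_seconds=0.0, **kwargs):
    """Call func(*args, **kwargs), retrying on any Exception.

    Returns (True, result) on success, or (False, last_exception)
    once num_attempts calls have all failed.
    """
    last_error = None
    for attempt in range(num_attempts):
        try:
            return True, func(*args, **kwargs)
        except Exception as error:
            last_error = error
            # Only wait if another attempt is still to come
            if attempt < num_attempts - 1:
                time.sleep(delay_seconds)
    return False, last_error


class FlakyCall:
    """Toy callable that fails a fixed number of times before succeeding."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self, pdb_id):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated failure for %s" % pdb_id)
        return "chains for %s" % pdb_id


flaky = FlakyCall(failures_before_success=2)
ok, result = call_with_retries(flaky, "5JUP", num_attempts=3)
print(ok, result, flaky.calls)
```

In the notebook, `func` would be `rcsb.describe_chains`; returning the last exception instead of just `False` makes it easier to see why a structure was skipped.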

In [10]:
# Let's test this function works properly
(chainFieldsContainSubstrings("5JUP", ["16S","18S"], ["Name", "Description"]) and
 not chainFieldsContainSubstrings("5JUP", ["16S","18S"], ["Name"]) and
 chainFieldsContainSubstrings("5JUP", ["16S","18S"], ["Description"]))

True

Now that I'm convinced the function works as it is supposed to, let's apply
the filter to our PDB IDs, such that we only keep IDs with a 16S or 18S chain description.

In [11]:
desired_pdb_ids = []
for (index, pdb_id) in enumerate(pdb_ids):
    # Prints out string to update on progress
    if index % 50 == 0:
        print("Filtering structure ''%s' (%d/%d)" % (pdb_id, index+1, len(pdb_ids)))

    if chainFieldsContainSubstrings(pdb_id, substring_list=["16S", "18S"],
                                    query_fields=["Description"]):
        desired_pdb_ids.append(pdb_id)

Filtering structure ''1AHA' (1/746)
Filtering structure ''1RY1' (51/746)
Filtering structure ''2QES' (101/746)
Filtering structure ''3JAJ' (151/746)
Filtering structure ''4CSU' (201/746)
Filtering structure ''4TUE' (251/746)
Filtering structure ''4V50' (301/746)
Filtering structure ''4V7E' (351/746)
Filtering structure ''4W29' (401/746)
Filtering structure ''5BY8' (451/746)
Filtering structure ''5J88' (501/746)
Filtering structure ''5NDJ' (551/746)
Filtering structure ''5WFS' (601/746)
Filtering structure ''6FKR' (651/746)
Filtering structure ''6MTC' (701/746)


In [12]:
print(len(desired_pdb_ids))

458


In [13]:
# Let's see how many 16S/18S rRNA candidates we have in all those structures
# (and which structures, if any, have repeats)
chain_list = []
id_seen_once = set()
id_seen_twice = set()

for (index, pdb_id) in enumerate(desired_pdb_ids):
    # Prints out string to update on progress
    if index % 50 == 0:
        print("Getting chains for %s (%d/%d)" % (pdb_id, index+1, len(desired_pdb_ids)))
    
    # Attempts to get chain descriptions from the API 
    # (HTTP request could fail due to flakiness; will print out if so)
    try:
        chain_df = rcsb.describe_chains(pdb_id)
        for chain_description in chain_df.Description:
            if "16S" in chain_description or "18S" in chain_description:
                chain_list.append((pdb_id, chain_description))
                if pdb_id not in id_seen_once:
                    id_seen_once.add(pdb_id)
                else:
                    id_seen_twice.add(pdb_id)
    except Exception:
        print("Failed to get chains for structure %s" % pdb_id)

Getting chains for 1EG0 (1/458)
Getting chains for 4CXG (51/458)
Getting chains for 4V4I (101/458)
Getting chains for 4V6I (151/458)
Getting chains for 4V9K (201/458)
Getting chains for 5E81 (251/458)
Getting chains for 5LL6 (301/458)
Getting chains for 5V93 (351/458)
Getting chains for 6GXP (401/458)
Getting chains for 6QZP (451/458)


In [14]:
print(len(chain_list))

507


Notably, 507 != 458. So, let's see which structures have repeats.

In [15]:
print(len(id_seen_once))
print(list(id_seen_twice))
print(len(id_seen_twice))

458
['4V7T', '2OM7', '4V7S', '3J0E', '6I7O', '1ZC8', '4V6F', '4CXH', '6BOH', '6OLE', '4V6C', '4WR6', '6OLI', '2NOQ', '6OM7', '1TRJ', '5W4K', '6HCF', '4V88', '6B4V', '6BOK', '3J0D', '4WU1', '4CXG', '5J4D', '486D', '5MYJ', '1MVR', '6OM0', '4V6D', '4V6E', '4WQ1', '6HCM', '4V7V', '6OLF']
35


After a brief look at the FASTA files of those structures, it seems that they mostly have repeated 16S rRNA chains with identical sequences. We'll treat this more rigorously later.
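
A check along these lines could make that "brief look" explicit. This is a sketch with made-up data: `summarize_repeated_chains` is a hypothetical helper, and the `TOY*` IDs and short sequences are placeholders, not real structures.

```python
from collections import defaultdict


def summarize_repeated_chains(chain_records):
    """Classify structures that have more than one matching chain.

    chain_records: iterable of (pdb_id, sequence) pairs.
    Returns {pdb_id: "identical" | "different"} for IDs seen more than once.
    """
    sequences_by_id = defaultdict(list)
    for pdb_id, sequence in chain_records:
        sequences_by_id[pdb_id].append(sequence)

    summary = {}
    for pdb_id, sequences in sequences_by_id.items():
        if len(sequences) > 1:
            all_same = len(set(sequences)) == 1
            summary[pdb_id] = "identical" if all_same else "different"
    return summary


# Placeholder records for illustration only
toy_records = [
    ("TOY1", "GGAUCC"),
    ("TOY1", "GGAUCC"),
    ("TOY2", "GGAUCC"),
    ("TOY3", "GGAUCC"),
    ("TOY3", "GGAUCA"),
]

print(summarize_repeated_chains(toy_records))
```

Structures reported as `"different"` are the ones that would need a closer look before picking a single chain per PDB ID.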

In [16]:
# Now, let's get the associated DataFrame for all ribosome structures with 16S rRNA
# (Additionally, we cut off all structures that aren't at a resolution of at least 3.5 Angstroms)
# -- I got this cutoff from the 'Gene Machine' book by Venki Ramakrishnan on pg. 82
ribosome_df = rcsb.get_info_from_ids(desired_pdb_ids)
ribosome_df = ribosome_df[ribosome_df["resolution"] <= 3.5] # amino acid resolution cutoff
display(ribosome_df)

Unnamed: 0,structureId,structureTitle,resolution,depositionDate,experimentalTechnique,residueCount
9,1VVJ,Crystal Structure of Frameshift Suppressor tRNA SufA6 bound to Codon CCC-G o...,3.44,2013-05-24,X-RAY DIFFRACTION,21420
10,1VY4,Crystal structure of the Thermus thermophilus 70S ribosome in the pre-attack...,2.60,2014-05-13,X-RAY DIFFRACTION,21748
11,1VY5,Crystal structure of the Thermus thermophilus 70S ribosome in the post-catal...,2.55,2014-05-13,X-RAY DIFFRACTION,21748
12,1VY6,Crystal structure of the Thermus thermophilus 70S ribosome in the pre-attack...,2.90,2014-05-13,X-RAY DIFFRACTION,21448
13,1VY7,Crystal structure of the Thermus thermophilus 70S ribosome in the pre-attack...,2.80,2014-05-13,X-RAY DIFFRACTION,21602
...,...,...,...,...,...,...
451,6R5Q,Structure of XBP1u-paused ribosome nascent chain complex (post-state),3.00,2019-03-25,ELECTRON MICROSCOPY,17453
453,6R6P,Structure of XBP1u-paused ribosome nascent chain complex (rotated state),3.10,2019-03-27,ELECTRON MICROSCOPY,17509
454,6RM3,Evolutionary compaction and adaptation visualized by the structure of the do...,3.40,2019-05-05,ELECTRON MICROSCOPY,15471
455,6S0X,Erythromycin Resistant Staphylococcus aureus 70S ribosome (delta R88 A89 uL2...,2.42,2019-06-18,ELECTRON MICROSCOPY,10148


Nice! So we now have our 257 candidate ribosome structures! 

Now, let's attempt to get the FASTA sequences for the 16S rRNA from this, and append them to one big data frame of the chains, the sequences, and their PDB accession IDs!

In [42]:
chain_frames = []
for (index, pdb_id) in enumerate(ribosome_df["structureId"]):
    # Prints out string to update on progress
    if index % 50 == 0:
        print("Getting chains for %s (%d/%d)" % (pdb_id, index+1, len(ribosome_df["structureId"])))
    # Fetches chain data (with FASTA sequence) from RCSB
    df = rcsb.describe_chains(pdb_id, get_fasta=True)
    # Adds a PDB_ID column (so we can identify where each chain is from)
    df["PDB_ID"] = pdb_id
    # Filters out all chains that aren't 16S or 18S rRNA
    df = df[df["Description"].str.contains("16S") | df["Description"].str.contains("18S")]
    # Collects the desired rows; they are combined once after the loop
    chain_frames.append(df)

# Combines everything into one DataFrame
# (`DataFrame.append` was removed in pandas 2.0, so `pd.concat` is used instead)
acc_chain_data = pd.concat(chain_frames, ignore_index=True)

Getting chains for 1VVJ (1/257)
Getting chains for 4V5A (51/257)
Getting chains for 4V9S (101/257)
Getting chains for 5HCR (151/257)
Getting chains for 5WFS (201/257)
Getting chains for 6QNR (251/257)


In [43]:
# Drop any duplicate rows
acc_chain_data.drop_duplicates(subset =["Taxonomy","Sequence","PDB_ID"], inplace = True) 

In [40]:
# # Let's take a look at this data
# with pd.option_context('display.max_rows', None, 
#                        'display.max_columns', None,
#                        'display.max_colwidth', 1000):
#     columnsTitles = ['PDB_ID', 'Description', 'Taxonomy', "Name", "Sequence"]
#     frame = acc_chain_data.reindex(columns=columnsTitles)
#     display(frame)

Having looked through this data, it's obvious that some of it is mislabelled.
A chain with a description of `16S rRNA` should not have any protein in it.
So, let's apply another round of filtering.

In [59]:
# 16S/18S rRNA should not have amino acids in them
acc_chain_data = acc_chain_data[~acc_chain_data["Sequence"].str.contains("B")]
# Their names shouldn't include proteins
acc_chain_data = acc_chain_data[~acc_chain_data["Name"].str.contains("protein")]
# In case of duplicate chains with different sequences within same PDB_ID, drop all but the first one
acc_chain_data.drop_duplicates(subset="PDB_ID", inplace=True)
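
An alternative to testing for a single letter would be to check each sequence against an explicit RNA alphabet. The sketch below is an assumption, not the filter used above: `is_plausible_rna` is a hypothetical helper, and allowing `N` (unknown base) is my own choice that may need adjusting for the deposited sequences.

```python
# Assumed alphabet: the four RNA bases plus N for an unknown base
RNA_ALPHABET = frozenset("ACGUN")


def is_plausible_rna(sequence, allowed=RNA_ALPHABET, max_other_fraction=0.0):
    """Return True if at most max_other_fraction of the characters
    fall outside the allowed alphabet (case-insensitive)."""
    if not sequence:
        return False
    other = sum(1 for base in sequence.upper() if base not in allowed)
    return other / len(sequence) <= max_other_fraction


print(is_plausible_rna("UUUGUUGGAGAGUUUGAUCC"))  # start of the 1VVJ 16S sequence
print(is_plausible_rna("MKVLAAGIVG"))            # made-up protein-like string
```

With pandas, this could be applied as `acc_chain_data[acc_chain_data["Sequence"].map(is_plausible_rna)]`. It may keep or drop different rows than the `"B"` test, so the count of 247 reported here applies to the original filter only.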

In [56]:
# Let's take a look at this data
# with pd.option_context('display.max_rows', None, 
#                        'display.max_columns', None,
#                        'display.max_colwidth', 1000):
#     acc_chain_data["Length"] = acc_chain_data["Sequence"].str.len()
#     columnsTitles = ['PDB_ID', 'Description', 'Taxonomy', "Name", "Length",  "Sequence"]
#     frame = acc_chain_data.reindex(columns=columnsTitles)
#     display(frame)

In [64]:
len(acc_chain_data) # Now we have 247 final sequences

247

In [70]:
acc_chain_data.head()

Unnamed: 0,Name,Taxonomy,Description,Sequence,PDB_ID,length,Length
0,,Thermus thermophilus,16S rRNA,UUUGUUGGAGAGUUUGAUCCUGGCUCAGGGUGAACGCUGGCGGCGUGCCUAAGACAUGCAAGUCGUGCGGGCCGCG...,1VVJ,0,1522
1,,Thermus thermophilus,16S Ribosomal RNA,UUGUUGGAGAGUUUGAUCCUGGCUCAGGGUGAACGCUGGCGGCGUGCCUAAGACAUGCAAGUCGUGCGGGCCGCGG...,1VY4,0,1521
2,,Thermus thermophilus,16S Ribosomal RNA,UUGUUGGAGAGUUUGAUCCUGGCUCAGGGUGAACGCUGGCGGCGUGCCUAAGACAUGCAAGUCGUGCGGGCCGCGG...,1VY5,0,1521
3,,Thermus thermophilus,16S Ribosomal RNA,UUGUUGGAGAGUUUGAUCCUGGCUCAGGGUGAACGCUGGCGGCGUGCCUAAGACAUGCAAGUCGUGCGGGCCGCGG...,1VY6,0,1521
4,,Thermus thermophilus,16S Ribosomal RNA,UUGUUGGAGAGUUUGAUCCUGGCUCAGGGUGAACGCUGGCGGCGUGCCUAAGACAUGCAAGUCGUGCGGGCCGCGG...,1VY7,0,1521


## Converting Chain Data to FASTA format

Now we have the data we want to feed into Clustal Omega! The problem is that Clustal Omega takes FASTA files as input. So, let's convert our Pandas DataFrame into that format.

In [76]:
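# Writes the chains in `df` to `file_name` in FASTA format.
#
# Input:
#   df : DataFrame with "PDB_ID", "Taxonomy", "Name", "Description" and "Sequence" columns
#   file_name : Path of the FASTA file to write (overwritten if it already exists)
#
# Each chain becomes one header line (starting with ">") followed by its
# sequence on a single line.
def convert_df_to_fasta(df, file_name):
    fasta_lines = []
    for index, row in df.iterrows():
        sequence_header = ("> PDB: %s | Taxonomy: %s | Name: %s | Descr: %s " %
                        (row["PDB_ID"], row["Taxonomy"], row["Name"], row["Description"]))
        fasta_lines.append(sequence_header)
        fasta_lines.append(row["Sequence"])

    # The `with` block closes the file even if writing fails
    with open(file_name, "w") as fasta_file:
        fasta_file.write("".join(line + "\n" for line in fasta_lines))
    
convert_df_to_fasta(acc_chain_data, "test_file.txt")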

In [77]:
%%bash
head test_file.txt

> PDB: 1VVJ | Taxonomy: Thermus thermophilus | Name:  | Descr: 16S rRNA 
UUUGUUGGAGAGUUUGAUCCUGGCUCAGGGUGAACGCUGGCGGCGUGCCUAAGACAUGCAAGUCGUGCGGGCCGCGGGGUUUUACUCCGUGGUCAGCGGCGGACGGGUGAGUAACGCGUGGGUGACCUACCCGGAAGAGGGGGACAACCCGGGGAAACUCGGGCUAAUCCCCCAUGUGGACCCGCCCCUUGGGGUGUGUCCAAAGGGCUUUGCCCGCUUCCGGAUGGGCCCGCGUCCCAUCAGCUAGUUGGUGGGGUAAUGGCCCACCAAGGCGACGACGGGUAGCCGGUCUGAGAGGAUGGCCGGCCACAGGGGCACUGAGACACGGGCCCCACUCCUACGGGAGGCAGCAGUUAGGAAUCUUCCGCAAUGGGCGCAAGCCUGACGGAGCGACGCCGCUUGGAGGAAGAAGCCCUUCGGGGUGUAAACUCCUGAACCCGGGACGAAACCCCCGACGAGGGGACUGACGGUACCGGGGUAAUAGCGCCGGCCAACUCCGUGCCAGCAGCCGCGGUAAUACGGAGGGCGCGAGCGUUACCCGGAUUCACUGGGCGUAAAGGGCGUGUAGGCGGCCUGGGGCGUCCCAUGUGAAAGACCACGGCUCAACCGUGGGGGAGCGUGGGAUACGCUCAGGCUAGACGGUGGGAGAGGGUGGUGGAAUUCCCGGAGUAGCGGUGAAAUGCGCAGAUACCGGGAGGAACGCCGAUGGCGAAGGCAGCCACCUGGUCCACCCGUGACGCUGAGGCGCGAAAGCGUGGGGAGCAAACCGGAUUAGAUACCCGGGUAGUCCACGCCCUAAACGAUGCGCGCUAGGUCUCUGGGUCUCCUGGGGGCCGAAGCUAACGCGUUAAGCGCGCCGCCUGGGGAGUACGGCCGCAAGGCUGAAACUCAAAGGAAUUGACGGGGGCCCGCACAAGCGGUGGAGCA

Success!!!
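
To double-check the file beyond eyeballing `head`, the records can be read back and the header fields recovered. This is a minimal sketch: `read_fasta` and `parse_header` are hypothetical helpers written for the header layout shown above, and the two example sequences are shortened to their first 20 bases.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.
    The leading '>' is removed from each header."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_header(header):
    """Split 'PDB: X | Taxonomy: Y | ...' into a dict of field -> value."""
    fields = {}
    for part in header.split("|"):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# Shortened example in the same layout as test_file.txt
example = [
    "> PDB: 1VVJ | Taxonomy: Thermus thermophilus | Name:  | Descr: 16S rRNA \n",
    "UUUGUUGGAGAGUUUGAUCC\n",
    "> PDB: 1VY4 | Taxonomy: Thermus thermophilus | Name:  | Descr: 16S Ribosomal RNA \n",
    "UUGUUGGAGAGUUUGAUCCU\n",
]

records = [(parse_header(h), s) for h, s in read_fasta(example)]
for fields, sequence in records:
    print(fields["PDB"], fields["Descr"], len(sequence))
```

For the real file, `read_fasta(open("test_file.txt"))` should yield one record per row of `acc_chain_data`. Note that `parse_header` would mis-split a description that itself contains a `|` character.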

## Piping this FASTA file into Clustal Omega

We're going to use the `subprocess` Python module to run Clustal Omega from the shell and make its output accessible within the Python environment.
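
A minimal sketch of what that call might look like follows. It is not the implementation for this notebook: `build_clustalo_command` and `run_clustalo` are hypothetical helpers, the output file name in the usage comment is a placeholder, and it assumes `clustalo` is on the `PATH` as described in the installation notes at the top. The flags used (`-i`, `-o`, `--outfmt`, `--force`) are standard Clustal Omega options.

```python
import shutil
import subprocess


def build_clustalo_command(input_fasta, output_fasta, executable="clustalo"):
    """Build the argument list for a Clustal Omega run with FASTA output."""
    return [
        executable,
        "-i", input_fasta,    # unaligned input sequences
        "-o", output_fasta,   # where the alignment is written
        "--outfmt=fasta",     # aligned FASTA output
        "--force",            # overwrite the output file if it exists
    ]


def run_clustalo(input_fasta, output_fasta, executable="clustalo"):
    """Run Clustal Omega and return whatever it wrote to stderr.
    Raises if the executable is missing or exits with an error."""
    if shutil.which(executable) is None:
        raise FileNotFoundError("Could not find '%s' on PATH" % executable)
    command = build_clustalo_command(input_fasta, output_fasta, executable)
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stderr


# Shows the command without running it
print(" ".join(build_clustalo_command("test_file.txt", "aligned.fasta")))
```

Calling `run_clustalo("test_file.txt", "aligned.fasta")` would then perform the alignment; with a few hundred sequences of about 1,500 to 1,800 bases each, it may take a while.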

In [78]:
# TODO: Implement this last part, and then data analysis of the results
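
For the analysis step, one possible approach (a sketch, not a finished method) would be to locate the alignment column that corresponds to a residue number in a reference sequence, then tabulate which bases the other sequences have in that column. Both function names are hypothetical, and the toy alignment is made up. It also assumes that positions in the deposited reference sequence follow the numbering of interest, which would need to be verified for each reference.

```python
from collections import Counter


def alignment_column_for_position(aligned_reference, position):
    """Return the 0-based alignment column that holds the position-th
    (1-based) non-gap residue of the aligned reference sequence."""
    seen = 0
    for column, residue in enumerate(aligned_reference):
        if residue != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("Reference has fewer than %d residues" % position)


def column_composition(aligned_sequences, column):
    """Return {residue: fraction} for one alignment column (gaps included)."""
    counts = Counter(sequence[column] for sequence in aligned_sequences)
    total = sum(counts.values())
    return {residue: count / total for residue, count in counts.items()}


# Made-up alignment; the first row plays the role of the reference
toy_alignment = [
    "GG-AUCA",
    "GGCAUCA",
    "GG-AUUA",
    "G--AUCA",
]

column = alignment_column_for_position(toy_alignment[0], 5)
print(column, column_composition(toy_alignment, column))
```

A residue would count as well conserved if a single base dominates its column, both within a species and across species.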

## Afterword: My Analysis Technique

The layout of this notebook was inspired by [this paper](https://arxiv.org/abs/1901.08152) about structuring data science analyses to maximize reproducibility, drawing from the principles of predictability, computability, and stability.

That is, if I did my job right, anyone else repeating this analysis should reach the same conclusions, because they will know which judgement calls I made along the way.