In [1]:
import pandas as pd

# Sequence Similarity Demo
In this demo, we will answer the question:

_How does the primary sequence of TMPRSS2 differ between species that one would encounter in a farm environment?_

We will address this question using sequence alignment and analysis tools from the [Biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec81) Python library.

## Outline

* Using the [Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec385) as reference
* Prerequisites
    * Reading on PSSMs: [Rice.edu](https://www.cs.rice.edu/~ogilvie/comp571/2018/09/11/pssm.html)

### Part 1: Preparing input sequences

* Intro to `Bio.Align`
* Learn how to filter sequence records in a multiple sequence alignment by:
    * Species name
    * Sequence snippets
* Find the 
* Generate a consensus sequence for the chicken homolog

### Part 2: Analyzing aligned sequences

* Compare human homolog to the mouse
    * Compute a [log odds substitution matrix](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec390)
    * What are the log odds of the following polymorphisms?
        * Hydrophobic -> hydrophilic and vice versa
        * Aromatic -> non-aromatic and vice versa
    * Construct a more generalized PSSM for the above categories of "penalized polymorphisms"
        * For instance, we want to parse `R -> Y` and `S -> I` to `hydrophilic -> hydrophobic`
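
As a rough illustration of the "generalized" categories above, a substitution such as `R -> Y` can be mapped onto property classes with a small lookup. The residue grouping in `HYDROPHOBIC` and the helper names `residue_class` and `classify_substitution` are assumptions made for this sketch, not part of Biopython; other hydrophobicity groupings are equally defensible.

```python
# Hypothetical grouping, for illustration only; choose the residue
# classes that suit your analysis.
HYDROPHOBIC = set("AVILMFWYC")


def residue_class(resi):
    """Return 'hydrophobic' or 'hydrophilic' for a one-letter amino acid code."""
    return "hydrophobic" if resi in HYDROPHOBIC else "hydrophilic"


def classify_substitution(resi_from, resi_to):
    """Translate e.g. ('R', 'Y') into 'hydrophilic -> hydrophobic'."""
    return f"{residue_class(resi_from)} -> {residue_class(resi_to)}"


print(classify_substitution("R", "Y"))
print(classify_substitution("S", "I"))
```

With this grouping, both example substitutions fall into the same `hydrophilic -> hydrophobic` category, which is what lets us count them together.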

## "Homework"
"Homework" is a recommendation. If you find yourself more interested in a different analysis, say a comparison of variants **within** Homo sapiens, feel free to do that analysis instead.

* Repeat analysis for each of the other domestic species (dog, horse, chicken, etc.)
* Generate a "generalized PSSM" for the other types of penalized polymorphisms, such as `acidic -> basic`, `bulky -> small`, `aromatic -> non-aromatic`, etc.

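We're using [Biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html) again. The cells below import `Bio.Alphabet`, which was removed in Biopython 1.78, so they need an older release to run as written. If the import commands below fail, you might need to install such a release from the command line:
```bash
pip install "biopython<1.78"

# or using poetry
poetry add "biopython<1.78"
```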

In [2]:
from Bio.Align import AlignInfo, MultipleSeqAlignment
from Bio import AlignIO, Alphabet, SeqRecord, Seq, SubsMat

# Part #1
____

# Read the alignment records
We use the `Bio.AlignIO.read` function from the Biopython package to read the trimmed alignment file. It reads the `*.txt` file in the `'fasta'` format and returns an instance of `Bio.Align.MultipleSeqAlignment` (documentation can be found [here](https://biopython.org/DIST/docs/api/Bio.Align.MultipleSeqAlignment-class.html) and [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec81)).

In [3]:
# passing the path lets Biopython open and close the file for us
alignment = AlignIO.read('./trimmed_alg.txt', format='fasta')
alignment

<<class 'Bio.Align.MultipleSeqAlignment'> instance (9757 records of length 60918, SingleLetterAlphabet()) at 7f505c310cd0>

Each element of this list-like instance is a sequence:

In [4]:
alignment[0] 

SeqRecord(seq=Seq('------------------------------------------------------...---', SingleLetterAlphabet()), id='7260.FBpp0240705', name='7260.FBpp0240705', description='7260.FBpp0240705', dbxrefs=[])

This instance of `Bio.Align.MultipleSeqAlignment` is a lot like a Python list. For instance, you can:

In [5]:
# get the number of sequences in this alignment
print("number of sequence records: ", len(alignment))

# iterate over the sequence records in the alignment
record_counter = 0
for record in alignment:
    record_counter += 1
print("number of sequence records (a different way): ", record_counter)

# get the 100th sequence record in the alignment
print("ID of the 100th sequence: ", alignment[99].id)

number of sequence records:  9757
number of sequence records (a different way):  9757
ID of the 100th sequence:  9796.ENSECAP00000016722


# Filter the sequences in the alignment
For now, we're only interested in "domestic species," or species whose scientific name is in the Python list `domestic_sp_names`:

In [6]:
domestic_sp_names = [
    'Homo sapiens', # human
    'Mus musculus', # mouse
    'Canis lupus familiaris', # dog
    'Felis catus', # cat
    'Bos taurus', # cattle
    'Equus caballus', # horse
    'Gallus gallus' # chicken
]

The sequences in the `Bio.Align.MultipleSeqAlignment` are for **all** the species that EggNOG could find, including worms, polar bears, and other species that we're not interested in.

Let's filter out sequences from species whose names are **not** in the list `domestic_sp_names`. To do this, we will:
1. Get the scientific name for each species, and load it into the `description` attribute of each sequence. This should be familiar from the [descriptive stats demo](../descriptive_stats_demo/eggNOG_alignment_metadata.ipynb).
2. Use a [list comprehension](https://github.com/wilfredinni/python-cheatsheet#list-comprehension) to get the list of sequences for species that we are interested in.
3. The above step will generate a Python list; it will need to be converted to an instance of  `Bio.Align.MultipleSeqAlignment` if we want to use fancy Biopython analysis tools on it.
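
On toy data, steps 1 and 2 amount to a dictionary lookup followed by a list comprehension. The record IDs, sequences, and the `id_to_species` mapping below are invented for illustration (only the `XXXX.unique_id` naming pattern comes from the alignment); the notebook itself does the lookup against the EggNOG metadata table.

```python
# Toy stand-ins for alignment records: (record_id, aligned_sequence) pairs.
toy_records = [
    ("9606.protA", "MP-AP"),
    ("9685.protB", "LA-AP"),
    ("6239.protC", "M--AP"),
]

# Hypothetical metadata: unique ID -> scientific name.
id_to_species = {
    "protA": "Homo sapiens",
    "protB": "Felis catus",
    "protC": "Caenorhabditis elegans",
}

species_of_interest = ["Homo sapiens", "Felis catus"]

# Step 1: look up the species for each record (None if the ID is unknown).
annotated = [
    (record_id, seq, id_to_species.get(record_id.split(".")[-1]))
    for record_id, seq in toy_records
]

# Step 2: keep only the records whose species is in the list of interest.
filtered = [rec for rec in annotated if rec[2] in species_of_interest]

print(len(toy_records), "records before filtering")
print(len(filtered), "records after filtering")
```

Records with an unknown ID get `None` as their species and are dropped by the filter, which mirrors how the notebook treats IDs that are missing from the metadata.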

## Step #1: Get scientific name for each species
This should be familiar from the [descriptive stats demo](../descriptive_stats_demo/eggNOG_alignment_metadata.ipynb).

In [7]:
!ls

 20200706_seq_sim-checkpoint.ipynb    20200713_seq_sim_mouse.ipynb
 20200706_seq_sim_cat.ipynb	     "Ty's Playbook.ipynb"
 20200713_seq_sim_cow.ipynb	      extended_members.txt
 20200713_seq_sim_dog.ipynb	      raw_alg.txt
 20200713_seq_sim_horse-Copy1.ipynb   tree.txt
 20200713_seq_sim_horse.ipynb	      trimmed_alg.txt


In [8]:
tmprss2_ext = pd.read_table('../seq_sim_demo/extended_members.txt', header=None)
tmprss2_ext.columns = ['id_1', 'id_2', 'species', '', '']
tmprss2_ext.head()

Unnamed: 0,id_1,id_2,species,Unnamed: 4,Unnamed: 5
0,CRE24749,CRE24749,Caenorhabditis remanei,31234,"aliases:DS268562,E3N945_CAERE,E3N945,CRE_24749..."
1,CRE21132,CRE-TRY-4,Caenorhabditis remanei,31234,"aliases:E3MEX0,DS268440,E3MEX0_CAERE,CRE21132,..."
2,CRE24758,CRE-TRY-6,Caenorhabditis remanei,31234,"aliases:E3N963,DS268562,E3N963_CAERE,CRE24758,..."
3,CRE18672,CRE18672,Caenorhabditis remanei,31234,"aliases:DS268410,E3LKX4_CAERE,E3LKX4,CRE_18672..."
4,CRE24729,CRE-TRY-3,Caenorhabditis remanei,31234,"aliases:E3N418,DS268522,E3N418_CAERE,CRE24729,..."


In [9]:
for record in alignment:
    
    # while we're at it, let's make sure that Biopython knows these
    # are protein sequences
    record.seq.alphabet = Alphabet.generic_protein
    
    # from visual inspection we know the name format is XXXX.unique_id,
    # so we split on "." and take the last element of the list
    id_code = record.id.split('.')[-1]
    
    # reference the metadata to get the species name
    sp_name = tmprss2_ext[tmprss2_ext['id_1'] == id_code]['species'].values
    
    try:
        sp_name = sp_name.item()
    except ValueError:
        sp_name = None
    
    # assign the species name to the description attribute
    record.description = sp_name

## Step #2: Use a list comprehension to filter to domestic species

In [10]:
dom_aln_list = [record for record in alignment
                if record.description in domestic_sp_names]

We see that the length of this filtered list is much shorter:

In [11]:
print("number of records for all species:", len(alignment))
print("number of records for domestic species:", len(dom_aln_list))

number of records for all species: 9757
number of records for domestic species: 732


## Step #3: Convert this list to a new `MultipleSeqAlignment` instance

In [12]:
dom_aln = MultipleSeqAlignment(dom_aln_list)

`dom_aln` has the same data, but is a different type of Python variable:

In [13]:
print("dom_aln_list is type:", type(dom_aln_list))
print("dom_aln is type:", type(dom_aln))

dom_aln_list is type: <class 'list'>
dom_aln is type: <class 'Bio.Align.MultipleSeqAlignment'>


# Get the sequence of human TMPRSS2
Before we start comparing sequences to each other, let's get the sequence of TMPRSS2 in `Homo sapiens`. This is the sequence that we will compare other species' homologs to.

To do this filtering, let's use a list comprehension, then convert to a `MultipleSeqAlignment`, just like we did before:

In [14]:
human_aln_list = [
    record for record in dom_aln
    if record.description == 'Homo sapiens'
]
human_aln = MultipleSeqAlignment(human_aln_list)

We see that there are many records in the alignment that have `Homo sapiens` as the species:

In [15]:
len(human_aln)

118

It would be interesting to look at the differences between these 118 variants _within_ the human species, but let's move on to our inter-species analysis for this demo.

## Get the sequence of human isoform 2

Let's find the sequence record that has the same sequence as isoform 2 on the [TMPRSS2 UniProt page](https://www.uniprot.org/uniprot/O15393#O15393-1). The first few residues of this isoform are `MPPAPPGG`:

In [16]:
isoform_aln_list = [
    record for record in human_aln
    if 'MPPAPPGG' in str(record.seq).replace("-", "")
]

In [17]:
print("number of human sequences that contain MPPAPPGG:", len(isoform_aln_list))
human_iso2 = isoform_aln_list[0]
human_iso2

number of human sequences that contain MPPAPPGG: 1


SeqRecord(seq=Seq('------------------------------------------------------...---', ProteinAlphabet()), id='9606.ENSP00000381588', name='9606.ENSP00000381588', description='Homo sapiens', dbxrefs=[])

This is an aligned sequence, so it has a lot of `-` characters that signify residues that are missing relative to other sequences in `alignment`:

In [18]:
str(human_iso2.seq)

'---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

We can remove these characters using Python's string replacement method, allowing us to more easily look at the amino acid sequence:

In [19]:
str(human_iso2.seq).replace('-', '')

'MPPAPPGGESGCEERGAAGHIEHSRYLSLLDAVDNSKMALNSGSPPAIGPYYENHGYQPENPYPAQPTVVPTVYEVHPAQYYPSPVPQYAPRVLTQASNPVVCTQPKSPSGTVCTSKTKKALCITLTLGTFLVGAALAAGLLWKFMGSKCSNSGIECDSSGTCINPSNWCDGVSHCPGGEDENRCVRLYGPNFILQVYSSQRKSWHPVCQDDWNENYGRAACRDMGYKNNFYSSQGIVDDSGSTSFMKLNTSAGNVDIYKKLYHSDACSSKAVVSLRCIACGVNLNSSRQSRIVGGESALPGAWPWQVSLHVQNVHVCGGSIITPEWIVTAAHCVEKPLNNPWHWTAFAGILRQSFMFYGAGYQVEKVISHPNYDSKTKNNDIALMKLQKPLTFNDLVKPVCLPNPGMMLQPEQLCWISGWGATEEKGKTSEVLNAAKVLLIETQRCNSRYVYDNLITPAMICAGFLQGNVDSCQGDSGGPLVTSKNNIWWLIGDTSWGSGCAKAYRPGVYGNVMVFTDWIYRQMRADG'

We also notice that most of the sequence of interest is in the middle of the aligned sequence. Let's trim the aligned sequence to generate a compact aligned sequence that starts with `MPPAPP` and ends with `ADG`. To do this, we will make use of the [`str.index`](https://docs.python.org/2/library/stdtypes.html?highlight=index#str.index) method:

In [20]:
index_nterm = str(human_iso2.seq).index('MPPAPP')
index_cterm = str(human_iso2.seq).index('ADG')

# since we want to cut at ADG^, not ^ADG, we add 3 characters to this index
index_cterm += 3

print("index of N-terminus:", index_nterm)
print("index of C-terminus:", index_cterm)

index of N-terminus: 33713
index of C-terminus: 38856


We can use these indices to trim to the compact sequence:

In [21]:
human_compact = human_iso2[index_nterm:index_cterm]
str(human_compact.seq)

'MPPAPP----------GGESG-CEE-----------------------------------------------------------R--G-A-A---------GHIEHSRYLS-L-LD---------AV-D----------N----SK-------------------------------------------------------M------------------------------------------------------------------------ALNSG-------------------------S----------P----------------------------------------------------------P---AI---G----P-Y---YENHG----------------------------------------YQPE---------NPY-----------------------------------------------------------------------------------------------------------------------------------------P-A-Q-----------------PT----------------VV-P------------------------------------------------------T----------------------------------V-------------------------------------YE-V-H---P----------A------------------------------QY--Y----P-S--------------P-V-P------Q----YAPRVLT---Q-A--SN--P-------V----V--CTQ--PK-SP---SG-----T---------------------------------------------------------------------------------------

These N-terminus and C-terminus indices will be useful when we want to trim sequence records for other species.
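
To see why the same indices can be reused, here is a toy version with two short aligned strings (both invented for this sketch). Because every record in an alignment has the same length, slicing with the human-derived indices keeps the columns in register.

```python
# Two made-up aligned sequences of equal length.
human_aligned = "----MP-PAP--ADG---"
other_aligned = "--L-MA-PSPK-AEG-Q-"

# Find the window in the reference (human) sequence.
start = human_aligned.index("MP")
end = human_aligned.index("ADG") + len("ADG")  # cut after ADG, not before

# Apply the same window to both sequences.
human_window = human_aligned[start:end]
other_window = other_aligned[start:end]

print(human_window)
print(other_window)
```

Note that `str.index` returns the first match only, so the search string has to be specific enough to be unique and must not be interrupted by `-` characters in the aligned sequence.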

# Generate consensus sequences for chicken homolog

Just like the sequence records for `Homo sapiens`, the records for the other `domestic_sp_names` have duplicates. For example, let's look at `Gallus gallus`:

In [22]:
chicken_aln_list = [
    record for record in dom_aln
    if record.description == 'Gallus gallus'
]
chicken_aln = MultipleSeqAlignment(chicken_aln_list)

In [23]:
len(chicken_aln)

51

Let's compare one sequence, instead of all 51 variants of the chicken homolog, to the human homolog. To do this, we will generate a **consensus sequence** ([Wikipedia](https://en.wikipedia.org/wiki/Consensus_sequence#:~:text=In%20molecular%20biology%20and%20bioinformatics,position%20in%20a%20sequence%20alignment.)) for the chicken variants. We do this in 2 steps:
1. Generate a `Bio.Align.AlignInfo.SummaryInfo` instance from the `MultipleSeqAlignment`
2. Call the `SummaryInfo` method `dumb_consensus`, which runs a very simple consensus sequence finding algorithm.
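
For intuition, a simple consensus can be sketched in plain Python: at each column, take the most common residue if it is frequent enough, otherwise emit an ambiguity character. This is only in the spirit of `dumb_consensus`; the function name `simple_consensus`, the 0.7 threshold, and the tie handling are assumptions of this sketch, so consult the Biopython documentation for the exact rules.

```python
from collections import Counter


def simple_consensus(aligned_seqs, threshold=0.7, ambiguous="X"):
    """Column-wise majority consensus over equal-length aligned strings.

    Gap characters ('-') are ignored when counting. A column with no
    residues, or with no residue reaching `threshold`, yields `ambiguous`.
    """
    consensus = []
    for column in zip(*aligned_seqs):
        residues = [resi for resi in column if resi != "-"]
        if not residues:
            consensus.append(ambiguous)
            continue
        top_resi, top_count = Counter(residues).most_common(1)[0]
        if top_count / len(residues) >= threshold:
            consensus.append(top_resi)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


# Made-up variants of one short aligned region.
toy_variants = ["MKV-L", "MKI-L", "MRV-L", "-KVAL"]
print(simple_consensus(toy_variants))
```

In this sketch a column that contains only gaps comes out as the ambiguity character, so a sparse alignment produces long runs of `X`.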

## Step #1

In [24]:
chicken_aln_summary = AlignInfo.SummaryInfo(chicken_aln)
chicken_aln_summary

<Bio.Align.AlignInfo.SummaryInfo at 0x7f505aefa730>

## Step #2

In [25]:
chicken_aln_consensus = chicken_aln_summary.dumb_consensus()
chicken_aln_consensus

Seq('XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...XXX', ProteinAlphabet())

Let's use the N-terminus and C-terminus locations that we calculated above to compact this consensus sequence:

In [26]:
chicken_consensus_compact = chicken_aln_consensus[index_nterm:index_cterm]
str(chicken_consensus_compact).replace('X', '-')

'----------AA---ALLL-G--------IV--------------R-----------------------------------------G--------------------------P------------------S-----------------------------------------------------G-------------------F-A------------H-----I--P--ELNVT-----H--C-----------------------D-----C--------L--PVL------S-M-----------------------L-----------------W-------------------------------T---AA---H----C----------Q--N-------------------------------R---------------I--VY----------------------DEC-----------FTD-----C----------RR---A------TV-R--C---------H-----------Y-G-----VD---L--------E-----------------------------------------PQ-----E----------V------Q------A----------C----------------------L-----T--------C-------N-RI-DN-------------------------------E---E-T--------------D---L------------------------G-----------P-----F-S------N--------------------F---L--------------PRVST-----------------N-P---S-----A--------L-------------------------------------------------------------------------------------------------

Finally, this consensus sequence is a `Seq`, not a `SeqRecord`. Let's convert it to a `SeqRecord` so we can compare it to the human sequence:

In [27]:
# convert 'X' to '-' for consistency with human sequence
# and convert to a Seq.Seq instance
chicken_replaced_str =  str(chicken_consensus_compact).replace('X', '-')
chicken_consensus_replaced = Seq.Seq(chicken_replaced_str)

# then convert to a SeqRecord.SeqRecord instance
chicken_record_compact = SeqRecord.SeqRecord(chicken_consensus_replaced, description='Gallus gallus', name='dumb_consensus')
chicken_record_compact

SeqRecord(seq=Seq('----------AA---ALLL-G--------IV--------------R--------...---'), id='<unknown id>', name='dumb_consensus', description='Gallus gallus', dbxrefs=[])

# Part 2: the fun stuff
**Finally**, we have human TMPRSS2 and a consensus sequence for chicken TMPRSS2. The sequences are aligned and ready for some more advanced analysis with the help of Biopython.

Let's start looking at ways we can compare the two sequences. To start, we will answer the question:

**At every location in the sequence, what is the percent probability that this position will be a tyrosine (Y), leucine (L), or any other amino acid?**

To do this, we will calculate a [**position specific score matrix**](https://www.cs.rice.edu/~ogilvie/comp571/2018/09/11/pssm.html) (PSSM). Let's generate a new, very short `MultipleSeqAlignment` between our human and chicken sequences:

In [28]:
hum_chicken_aln = MultipleSeqAlignment([human_compact, chicken_record_compact])
hum_chicken_aln

<<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 5143, Alphabet()) at 7f505af4bbb0>

Now we can generate a `SummaryInfo` instance like we did before, and calculate the PSSM:

In [29]:
hum_chicken_summary = AlignInfo.SummaryInfo(hum_chicken_aln)
hum_chicken_summary

<Bio.Align.AlignInfo.SummaryInfo at 0x7f505af0d8b0>

In [30]:
hum_chicken_pssm = hum_chicken_summary.pos_specific_score_matrix(human_compact)
hum_chicken_pssm

<Bio.Align.AlignInfo.PSSM at 0x7f505aefa250>

We can look at the data in the PSSM by inspecting the `pssm` attribute.

The PSSM is a Python list, where each element is a [tuple](https://github.com/wilfredinni/python-cheatsheet#tuple-data-type) of length 2. The first element of the tuple is the amino acid in the human sequence, and the second element is a Python [dictionary](https://github.com/wilfredinni/python-cheatsheet#dictionaries-and-structuring-data). The dictionary keys are all the naturally occurring amino acids, and the values are the number of times that amino acid was found at that position in the alignment.
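
The structure described above can be mimicked with a few lines of plain Python. This is a sketch of the data layout only (a list of `(reference_residue, counts)` tuples); `toy_pssm` and the two short sequences are invented, and Biopython's own implementation may differ in details such as which letters appear as dictionary keys.

```python
from collections import Counter

# Made-up reference and comparison sequences, already aligned.
reference = "MP-AG"
other = "MA-AG"

toy_pssm = []
for ref_resi, other_resi in zip(reference, other):
    # Count how often each letter occurs in this column.
    counts = Counter([ref_resi, other_resi])
    toy_pssm.append((ref_resi, dict(counts)))

for ref_resi, counts in toy_pssm:
    print(ref_resi, counts)

# Columns where the reference residue was counted twice hold the
# same amino acid in both sequences (gap columns are skipped).
matching_columns = [
    i for i, (ref_resi, counts) in enumerate(toy_pssm)
    if ref_resi != "-" and counts[ref_resi] == 2
]
print(matching_columns)
```

With only two sequences in the alignment, a count of 2 for the reference residue means both sequences agree at that column, and a count of 1 means they do not.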

## At which positions are the sequences identical?
To answer this question, we will use a familiar [for loop](https://github.com/wilfredinni/python-cheatsheet#for-loops-and-the-range-function). When we encounter a `-` in the first element of the `position` tuple, this means that the human sequence had a `-` character at that position. `-` is not an amino acid, so we skip these positions and move on using the [continue statement](https://github.com/wilfredinni/python-cheatsheet#continue-statements).

In the `print` statement at the end of the cell, we also make use of [formatted strings](https://github.com/wilfredinni/python-cheatsheet#formatted-string-literals-or-f-strings-python-36), which were introduced in Python 3.6.

In [31]:
# we want to keep track of which amino acid our
# "cursor" is on in the for loop
position_counter = 0

for position in hum_chicken_pssm.pssm:
    
    # `position` is the 2-element tuple
    # let's give each element a useful name
    resi_in_human = position[0]
    resi_dict = position[1]
    
    # skip this position if it is a '-'
    # in the human sequence record
    if resi_in_human == '-':
        continue
    else:
        # increment the counter by 1
        position_counter += 1 
    
    # if `resi_in_human` was counted twice at this position,
    # both sequences have it, meaning that the chicken
    # consensus is the same amino acid as human
    if resi_dict[resi_in_human] == 2:
        print("chicken and human are the same at position " +
              f"{position_counter}, which is amino acid {resi_in_human}")

chicken and human are the same at position 11, which is amino acid G
chicken and human are the same at position 16, which is amino acid G
chicken and human are the same at position 47, which is amino acid A
chicken and human are the same at position 63, which is amino acid Y
chicken and human are the same at position 67, which is amino acid P
chicken and human are the same at position 69, which is amino acid V
chicken and human are the same at position 91, which is amino acid P
chicken and human are the same at position 92, which is amino acid R
chicken and human are the same at position 93, which is amino acid V
chicken and human are the same at position 95, which is amino acid T
chicken and human are the same at position 143, which is amino acid W
chicken and human are the same at position 145, which is amino acid F
chicken and human are the same at position 153, which is amino acid S
chicken and human are the same at position 157, which is amino acid C
chicken and human are the same

To make sure our `position_counter` variable is working properly, let's double check that the length of the human sequence (without `-` characters) is indeed 529:

In [32]:
# position counter from the above for loop
print(f"the human sequence is {position_counter} amino acids long")

# calling len(str)
length_a_different_way = len(str(hum_chicken_aln[0].seq).replace('-', ''))
print(f"the human sequence is {length_a_different_way} amino acids long")

the human sequence is 529 amino acids long
the human sequence is 529 amino acids long


We see that `position_counter` appears to be working as expected!

## At which positions are amino acids different?
The more interesting question is how these sequences differ. We can use a similar for loop to address this question:

In [33]:
# we want to keep track of which amino acid our
# "cursor" is on in the for loop
position_counter = 0

list_to_store_same = list()

for position in hum_chicken_pssm.pssm:
    
    # `position` is the 2-element tuple
    # let's give each element a useful name
    resi_in_human = position[0]
    resi_dict = position[1]
    
    # skip this position if it is a '-'
    # in the human sequence record
    if resi_in_human == '-':
        continue
    else:
        # increment the counter by 1
        position_counter += 1
    
    # if `resi_in_human` was not counted twice at this position,
    # the chicken consensus has a different amino acid (or a gap)
    if position[1][resi_in_human] != 2:
        print("chicken and human are different at position " +
            f"{position_counter}, which is amino acid {resi_in_human} in human")
        # list_to_store_same.append(position_counter)

chicken and human are different at position 1, which is amino acid M in human
chicken and human are different at position 2, which is amino acid P in human
chicken and human are different at position 3, which is amino acid P in human
chicken and human are different at position 4, which is amino acid A in human
chicken and human are different at position 5, which is amino acid P in human
chicken and human are different at position 6, which is amino acid P in human
chicken and human are different at position 7, which is amino acid G in human
chicken and human are different at position 8, which is amino acid G in human
chicken and human are different at position 9, which is amino acid E in human
chicken and human are different at position 10, which is amino acid S in human
chicken and human are different at position 12, which is amino acid C in human
chicken and human are different at position 13, which is amino acid E in human
chicken and human are different at position 14, which is amino acid E in human
chicken and human are different at position 15, which is amino acid R in human
chicken and human are different at position 

chicken and human are different at position 324, which is amino acid T in human
chicken and human are different at position 325, which is amino acid P in human
chicken and human are different at position 326, which is amino acid E in human
chicken and human are different at position 328, which is amino acid I in human
chicken and human are different at position 329, which is amino acid V in human
chicken and human are different at position 330, which is amino acid T in human
chicken and human are different at position 335, which is amino acid V in human
chicken and human are different at position 336, which is amino acid E in human
chicken and human are different at position 337, which is amino acid K in human
chicken and human are different at position 338, which is amino acid P in human
chicken and human are different at position 339, which is amino acid L in human
chicken and human are different at position 340, which is amino acid N in human
chicken and human are different at position 341, which is amino acid N in human
chicken and human are different at position 342, which is amino acid P in human
chicken and human ar

## At which positions do we encounter a hydrophobic -> hydrophilic change (or vice versa)?
For this question, we will need to make our algorithm a little more complex. We are going to start by making a dataframe that stores amino acid properties, such as volume, hydrophobicity, charge, and so forth. We will use the CSV format of this [table of amino acid properties](https://web.nmsu.edu/~talipovm/lib/exe/fetch.php?media=world:pasted:table08.pdf) and load it into a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#object-creation).

Let's also narrow our analysis to sites that are implicated as important in cleaving the SARS-CoV-2 S protein. H296, D345 and S441 are the catalytic triad, and D435 is a binding residue ([Meng et al 2020](https://www.biorxiv.org/content/10.1101/2020.02.08.926006v3.full)).

K225 is implicated as important in binding monobasic targets such as S1/S2 domain of S protein ([Ohno et al 2020](https://www.biorxiv.org/content/10.1101/2020.06.12.149229v1.full)). Residue 225 in isoform 1 is actually a Leucine (L); this might have been a typo, since the two previous residues (223 and 224) are both lysines. We will consider both 223 and 224 as important, since they likely both contribute to the positive patch in the binding site, hypothesized to confer preference for monobasic substrates by Ohno et al.

In [34]:
for position in human_compact:
    print(type(position))
    print(position[0])
    break

<class 'str'>
M


## Pseudocode outline

```
human   MP   P   APP
chicken LA   P   ---
```

1. Iterate over each position in the aligned sequences with a for loop. We don't necessarily need the PSSM here.

```python
for position in range(len(human_compact)):
```
2. Get the amino acid at this position, for both human and chicken.

```python
resi_in_human = human_compact[position]
resi_in_chicken = chicken_record_compact[position]
```
3. Get the hydrophobicity of each amino acid (for example, 9.0 and 20.0).

```python
h_hum = get_hydrophobicity(resi_in_human)
h_chicken = get_hydrophobicity(resi_in_chicken)
```
4. Get the absolute value of the **difference** in hydrophobicity (here, 11.0).

```python
h_hum = 9.0
h_chicken = 20.0
diff = abs(h_hum - h_chicken)  # 11.0
```
5. Is this difference "large" (5.0 or more)? The answer is a boolean, `True` or `False`.

```python
if diff < 5.0:
    # not a change in hydrophobicity
    no_change_in_h.append(position_counter)
else:
    # is a change in hydrophobicity
    change_in_h.append(position_counter)

position_counter += 1
```
6. Store this boolean in a variable:
    1. A list that has the length of the human sequence
    2. Each value is this boolean
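
Put together, the outline might look like the following runnable sketch. The hydrophobicity numbers in `toy_scale` are placeholders invented for the example, not values from the amino acid properties table, and the two short sequences are made up as well.

```python
# Placeholder hydrophobicity values, for illustration only.
toy_scale = {"M": 9.0, "P": 2.0, "A": 6.0, "S": 1.0}

# Made-up aligned sequences of equal length.
human_toy = "MP-PA"
chicken_toy = "SS-P-"

change_in_h = []     # 1-indexed human positions with a large difference
no_change_in_h = []  # 1-indexed human positions without one
position_counter = 0

for resi_in_human, resi_in_chicken in zip(human_toy, chicken_toy):
    if resi_in_human == "-":
        continue  # not a residue in the human sequence
    position_counter += 1
    if resi_in_chicken == "-":
        continue  # deletion in chicken: nothing to compare
    diff = abs(toy_scale[resi_in_human] - toy_scale[resi_in_chicken])
    if diff < 5.0:
        no_change_in_h.append(position_counter)
    else:
        change_in_h.append(position_counter)

print("change:", change_in_h)
print("no change:", no_change_in_h)
```

This version stores positions in two lists rather than one boolean per position; either representation carries the same information.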

In [35]:
change_in_h

NameError: name 'change_in_h' is not defined

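In [None]:
# load the amino acid properties first, since the cells below use `aa_props`
aa_props = pd.read_csv("../../data/amino_acid_properties.csv")
aa_props.set_index('single_letter', inplace=True)
aa_props

In [None]:
aa_props['hydrophobicity']

In [None]:
# assumed intent of this unfinished cell: strip alignment gaps from a string
def replace(seq_str):
    return seq_str.replace("-", '')

In [None]:
"".replace("-", '')

In [None]:
def get_hydrophobicity(aa):
    # look up the hydrophobicity of a single-letter amino acid code
    hydrophobicity = aa_props.loc[[aa]]['hydrophobicity'].item()
    return hydrophobicity

In [None]:
aa = 'S'
get_hydrophobicity(aa)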

See the [PDF format](../../data/amino_acid_properties.pdf) for references and details on how these metrics are calculated.

Next, we will write a Python [function](https://github.com/wilfredinni/python-cheatsheet#functions), in which we pass the single-letter IDs of two amino acids, and get a Python [boolean](https://github.com/wilfredinni/python-cheatsheet#boolean-operators) (variable that stores `True` or `False`) that says whether or not these two amino acids have different hydrophobicity. We arbitrarily define "difference in hydrophobicity" here as a difference of more than 5.0 units between the amino acids' `hydrophobicity` columns.

The text at the beginning of the function that is wrapped in `"""` is a special type of [comment](https://github.com/wilfredinni/python-cheatsheet#comments) called a [function docstring](https://github.com/wilfredinni/python-cheatsheet#comments); it tells us what the function does and how to use it.

In [None]:
def is_change_in_hydrophobicity(resi1, resi2):
    """This function takes string-type amino acid identifiers `resi1` and `resi2`
    and compares their hydrophobicities. If the absolute value of the difference
    between hydrophobicities is greater than `min_diff`, return boolean True.
    Otherwise, return boolean False.
    """
    min_diff = 5.0
    print(f"comparing hydrophobicity between {resi1} and {resi2}")
    h1 = aa_props.loc[[resi1]]['hydrophobicity'].item()
    h2 = aa_props.loc[[resi2]]['hydrophobicity'].item()

    diff = abs(h1 - h2)
    print(f"the difference in hydrophobicity is {diff}")

    # the comparison itself already evaluates to True or False
    return diff > min_diff

We can quickly test our function with some examples:

In [None]:
is_change_in_hydrophobicity('M', 'S')

In [None]:
is_change_in_hydrophobicity('M', 'F')

In [None]:
is_change_in_hydrophobicity('M', 'M')

### Get list of interesting residues

Next, let's generate a list of positions in the human sequence that are residues of interest, such as the catalytic triad (H296, D345 and S441) and important binding residues (D435, K223, and K224).

It is important to remember that these positions reported in the literature are relative to the human **isoform 1** sequence, not the **isoform 2** sequence (which we have stored in the variable `human_compact`). Thankfully, the conversion is relatively simple: isoform 2 is a splice variant in which `M → MPPAPPGGESGCEERGAAGHIEHSRYLSLLDAVDNSKM` at the N-terminal methionine. This replaces 1 residue with 38, so we add 37 to the isoform 1 index to get the isoform 2 index. For instance, the catalytic serine S441 in isoform 1 is at position 441 + 37 = 478 in isoform 2.

Lastly, amino acid numbering in the literature uses 1 indexing (first amino acid is `M`), while our Python sequence uses 0 indexing. So position 478 with 1-indexing can be indexed using 477 with 0-indexing:

In [None]:
len(str(human_compact.seq).replace('-', ''))

In [None]:
str(human_compact.seq).replace('-', '')[477]
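
The two conversions (isoform offset, then 1-indexing to 0-indexing) can be wrapped in a small helper. The function name `iso1_to_iso2_index` and the constant name `ISOFORM_2_OFFSET` are made up for this sketch; the offset of 37 comes from the splice-variant description above.

```python
# Isoform 2 has 37 extra N-terminal residues relative to isoform 1.
ISOFORM_2_OFFSET = 37


def iso1_to_iso2_index(iso1_position):
    """Convert a 1-indexed isoform 1 position to a 0-indexed isoform 2 index."""
    iso2_position = iso1_position + ISOFORM_2_OFFSET  # still 1-indexed
    return iso2_position - 1                          # now 0-indexed


for label, pos in [("H296", 296), ("D345", 345), ("S441", 441)]:
    print(label, "->", iso1_to_iso2_index(pos))
```

Keeping the offset in one named constant makes it harder to mix up the two adjustments when more residues are added to the list.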

We can also check that the other residues of interest are the expected amino acids:
* H296 in isoform 1 → 296 + 37 = H333 in isoform 2 → 333 - 1 = position 332 with 0-indexing
* D345 → 381
* S441 → 477 (already checked above)
* D435 → 471
* K223 → 259
* K224 → 260

Let's store these 0-indexed positions in a list so we can use it later:

In [None]:
resi_interest = [332, 381, 477, 471, 259, 260]

Let's check that these positions are the amino acids we expect, this time using a for loop:

In [None]:
for position in resi_interest:
    resi = str(human_compact.seq).replace('-', '')[position]
    print(f"amino acid at 0-indexed position {position} is {resi}")

### Putting it all together

Let's try using our new function in a for loop. This for loop is a bit different from the previous ones; it's actually simpler. Instead of using the PSSM, we can simply iterate over the positions in the human sequence, get the equivalent amino acid in the chicken sequence, and use our function to ask whether the amino acids at that position have different hydrophobicity.

One new addition to this algorithm (besides our custom function) is the [`range()` function](https://github.com/wilfredinni/python-cheatsheet#for-loops-and-the-range-function).

In [None]:
list(range(len(human_compact)))

In [None]:
# we want to keep track of which amino acid our
# "cursor" is on in the for loop
position_counter = 0

# get the entire list of positions in the human sequence as
# integers. We include dashes in this calculation
list_of_positions_including_dashes = range(len(human_compact))

for position_with_dashes in list_of_positions_including_dashes:
    
    # get the amino acid at this position (dashes included)
    # in both human and chicken
    resi_in_human = human_compact[position_with_dashes]
    resi_in_chicken = chicken_record_compact[position_with_dashes]
    
    # skip this position if it is a '-'
    # in the human sequence record
    if resi_in_human == '-':
        continue
    elif position_counter in resi_interest:
        # detect if we are at an important amino acid
        # (position_counter is still 0-indexed at this point)
        print(f"* position {resi_in_human}{position_counter} is a residue of interest!")
        position_counter += 1
    else:
        # increment the counter by 1
        position_counter += 1
    
    # detect amino acid deletions
    if resi_in_chicken == '-':
        print(f'detected a deletion at position {position_counter}')
        continue
    
    # check changes in amino acid properties
    if is_change_in_hydrophobicity(resi_in_human, resi_in_chicken):
        print(f"detected a change in hydrophobicity at position {position_counter}")
        
    # TODO: check for other changes in amino acid properties

## Goal for the end of this week

For every position in the human sequence (compared to the chicken sequence), write an algorithm that prints every time there is a hydrophobic residue in human and a non-hydrophobic (hydrophilic) residue in chicken.

# Other useful `SummaryInfo` tools

## Compute replacement dictionary

In [None]:
hum_chicken_rep_dict = hum_chicken_summary.replacement_dictionary()
{k: hum_chicken_rep_dict[k] for k in hum_chicken_rep_dict
 if hum_chicken_rep_dict[k] > 0
 and k[0] != k[1]}

## Compute substitution and log odds matrix

In [None]:
my_arm = SubsMat.SeqMat(hum_chicken_rep_dict)
my_arm

In [None]:
my_lom = SubsMat.make_log_odds_matrix(my_arm)
my_lom

# Ty's Work Below

In [None]:
# `hum_mouse_pssm` is assumed to be built the same way as `hum_chicken_pssm`
position_counter = 0
for position in hum_mouse_pssm.pssm:
    resi_in_human = position[0]
    resi_dict = position[1]
    if resi_in_human == "-":
        continue
    else:
        position_counter += 1
    # a count above 1 means that mouse has the same amino acid as human
    if resi_dict[resi_in_human] > 1:
        print(f"mouse and human are the same at position {position_counter}, which is amino acid {resi_in_human}")
        # additionally report whether the shared residue is acidic or basic
        if resi_in_human in ("D", "E"):
            print(f"mouse and human have the same acidic amino acid at position {position_counter}, which is amino acid {resi_in_human}")
        elif resi_in_human in ("R", "H", "K"):
            print(f"mouse and human have the same basic amino acid at position {position_counter}, which is amino acid {resi_in_human}")

In [None]:
position_counter=0
for position in hum_mouse_pssm.pssm:
    resi_in_human= position[0]
    resi_dict= position[1]
    if resi_in_human == '-':
        continue
    else:
        position_counter += 1
    if resi_dict[resi_in_human] > 1:
        print(f"mouse and human are the same at position {position_counter}, which is amino acid {resi_in_human}")

# How to do all the species automatically

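1. Define a function `compute_sequence_diff` that (basically) has all the code in this notebook
2. Use a for loop to call the function once per species

```python
def compute_sequence_diff(species1, species2):
    """Compare the aligned sequences of `species1` and `species2`."""
    # do stuff with species sequence


# compare every non-human species in `domestic_sp_names` to human
species_list = [name for name in domestic_sp_names if name != 'Homo sapiens']
for species in species_list:
    compute_sequence_diff('Homo sapiens', species)
```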