# Practice 02

## A few words about modules and functions
- Module: a group of closely related classes and functions, usually stored in a single `.py` file (for example `mymodule.py`).
- `import mymodule`: load the module
    - Usage: `mymodule.myfunction()`, `mymodule.myclass()`
- `from mymodule import myclass`: load only `myclass`
    - Usage of a class defined in the module: `myclass()`
- `from mymodule import *`: load everything into the current namespace
- The difference between these forms is the namespace in which the names end up (similar to namespaces in C++).
- modules used here:
    - the SeqIO module of Biopython helps to easily read sequences
    - the numpy module can do the math
    - the matplotlib module makes plots, similar to Matlab
- Functions are defined with the `def` keyword. The function body is marked only by indentation (conventionally four spaces); there are no explicit begin/end markers.
- Functions can be called later by their name, passing the appropriate arguments.
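
As a small illustration of the points above, here is a sketch using the standard library `math` module (chosen only because it is always available; `hypotenuse` is a made-up example function, not part of this practice's code):

```python
import math                 # load the module; its names stay behind the "math." prefix
from math import sqrt       # load one name directly into the current namespace

def hypotenuse(a, b):
    """Return the length of the hypotenuse of a right triangle."""
    # the indented lines form the function body
    return sqrt(a * a + b * b)

print(math.pi)              # accessed through the module's namespace
print(hypotenuse(3, 4))     # called by name with its arguments
```

The same two import styles appear later in this notebook: `import numpy as np` keeps the NumPy names behind the `np.` prefix, while `from Bio import SeqIO` makes `SeqIO` available directly.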


## About NumPy
- Scientific module for Python: https://numpy.org/devdocs/user/quickstart.html
- The main type is the multidimensional array.
- 'Matrix' of elements of the same type (usually numbers). Basically 'tables'.
    - Axes: "dimensions"
    - Rank: number of dimensions
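
A minimal sketch of these terms, using a small made-up array (the name `table` and its values are only for illustration):

```python
import numpy as np

# a 2 x 3 "table" of integers
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(table.ndim)    # number of axes (the rank)
print(table.shape)   # size along each axis
print(table.dtype)   # the common element type (an integer type, platform dependent)

# an array of the same shape filled with 0.0, as used later to store results
zeros = np.zeros((2, 3), dtype=float)

# operations can run along one axis: axis=0 sums down the rows, giving one value per column
print(table.sum(axis=0))
```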

# Hydrophobicity plot generating code


In [None]:
from Bio import SeqIO
import numpy as np
import matplotlib.pyplot as plt
import requests

def hydrophobicity(seqname, valuefilename, wordsize):
    '''
    Draw the hydrophobicity plot of a protein.

    :param seqname:        path to the sequence.
                           The sequence file is expected to be a valid fasta file.
    :param valuefilename:  path to a tab-separated file that gives the hydrophobicity
                           value of each amino acid (one-letter code, then value).
    :param wordsize:       window size

    '''
    # read in hydrophobicity values to dictionary
    hphobdict = {}
    with open(valuefilename, 'r') as myfile:  # the file is closed automatically
        for line in myfile:
            linelist = line.rstrip("\n").split("\t")
            hphobdict[linelist[0]] = float(linelist[1])
    # print(hphobdict)

    # read in the fasta file to seq object
    # (if the file holds several records, all are printed but only the last one is plotted)
    for seqobj in SeqIO.parse(seqname, "fasta"):
        print(seqobj.id, ": ", seqobj.seq)
    seq = seqobj.seq.upper()

    values = np.zeros(len(seq), dtype=float)  # store the values here
    # window sliding; positions near the two ends that are never a window centre stay 0
    for i in range(0, len(seq) - wordsize + 1):
        value_act = 0
        for k in range(0, wordsize):
            value_act = value_act + hphobdict[seq[i + k]]
        # put the average hydrophobicity at the middle of the window
        values[i + wordsize // 2] = value_act / wordsize

    # visualization:
    plt.figure(1)
    plt.plot(values, 'b-')
    plt.xlabel("Residue position")
    plt.ylabel("Hydrophobicity")
    plt.xlim([0, len(seq)])
    plt.grid(True)
    plt.title('Hydrophobicity plot: ' + seqobj.id + ', word size=' + str(wordsize))
    plt.show()



### Data Retrieval and Hydrophobicity Plot Generation

In this section, we download the necessary input files from a GitHub repository to analyze the hydrophobicity of a protein sequence. Specifically, we are working with the **CNR1_Human.fasta** file, which contains the amino acid sequence for the cannabinoid receptor 1, and the **Kyte_Doolittle_hydrophob.txt** file, which provides the Kyte-Doolittle hydrophobicity scale for each amino acid.
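
Judging from how `hydrophobicity()` reads it, the scale file holds one amino acid per line: the one-letter code, a tab, then the value. The sketch below parses a three-line sample in that layout; `sample_text` is a made-up excerpt (the three values are the published Kyte-Doolittle values for A, R and L), not the contents of the real file.

```python
# made-up excerpt in the assumed layout: letter <TAB> value, one residue per line
sample_text = "A\t1.8\nR\t-4.5\nL\t3.8\n"

scale = {}
for line in sample_text.splitlines():
    if not line.strip():
        continue                      # skip empty lines
    letter, value = line.split("\t")[:2]
    scale[letter] = float(value)      # store the value as a number, not as text

print(scale)
```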

#### Steps:
1. **Downloading Data from GitHub:**
   - We use Python's `requests` library to download the FASTA sequence and Kyte-Doolittle scale files directly from a GitHub repository.
   - The URLs point to the raw files in the repository, and the downloaded files are saved locally in the Colab environment.

2. **Plotting Hydrophobicity:**
   - After downloading the files, we pass them as input to the `hydrophobicity()` function.
   - This function generates a hydrophobicity plot using the Kyte-Doolittle scale with a window size of 2, which gives us insights into the hydrophobic regions of the protein.
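
To see what the window does without any plotting, the averaging step can be sketched on a toy sequence. This is a simplified stand-alone version, not the function above: `window_averages`, `kd` and the five-residue sequence are assumptions made for the example, and `kd` contains only the Kyte-Doolittle values needed here.

```python
# a few Kyte-Doolittle values, enough for the toy sequence below
kd = {"A": 1.8, "R": -4.5, "L": 3.8, "I": 4.5, "K": -3.9}

def window_averages(seq, scale, wordsize):
    """Return the mean hydrophobicity of every window of length wordsize."""
    averages = []
    for start in range(len(seq) - wordsize + 1):
        window = seq[start:start + wordsize]
        averages.append(sum(scale[aa] for aa in window) / wordsize)
    return averages

toy = "LIARK"
print(window_averages(toy, kd, 2))   # four windows: LI, IA, AR, RK
print(window_averages(toy, kd, 5))   # one window covering the whole sequence
```

A sequence of length n gives n - wordsize + 1 windows, so a larger window yields fewer and smoother values.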


In [None]:
# Download sequence file from GitHub
sequence_url = 'https://raw.githubusercontent.com/nbrg-ppcu/Introduction_to_bioinfo/dev/data/dot-hydro-plot/CNR1_Human.fasta'
sequence_file = 'CNR1_Human.fasta'

response = requests.get(sequence_url)
response.raise_for_status()  # stop here if the download failed (e.g. 404)
with open(sequence_file, 'wb') as file:
    file.write(response.content)

# Download Kyte-Doolittle scale file from GitHub
kyte_doolittle_url = 'https://raw.githubusercontent.com/nbrg-ppcu/Introduction_to_bioinfo/dev/data/dot-hydro-plot/Kyte_Doolittle_hydrophob.txt'
kyte_doolittle_file = 'Kyte_Doolittle_hydrophob.txt'

response = requests.get(kyte_doolittle_url)
response.raise_for_status()  # stop here if the download failed (e.g. 404)
with open(kyte_doolittle_file, 'wb') as file:
    file.write(response.content)







In [None]:
# Draw the plot
hydrophobicity(sequence_file, kyte_doolittle_file, wordsize=2)

### Exercise:
- Change the word size (the third parameter in the function call). What trends do you observe?

# Dot plot generating code

In [None]:
from Bio import SeqIO
import numpy as np
import matplotlib.pyplot as plt
from copy import deepcopy

def dotplot(seq1name, seq2name=None,  cutoff=15, wordsize=20, isreversecompl=False):
    '''
    Draw a dotplot of two sequences. If isreversecompl=True, then
    the first sequence is compared to its reverse complement.

    :param seq1name:       path to the first sequence.
                           The sequence file is expected to be a valid fasta file.
    :param seq2name:       path to the second sequence.
                           The sequence file is expected to be a valid fasta file.
    :param cutoff:         the threshold defining whether a dot should be plotted
    :param wordsize:       window size
    :param isreversecompl: If True, then the function plots sequence 1 against
                           its reverse complement. Default is False. Note that if it is
                           True, then seq2name is ignored.
    '''
    #scoring scheme
    match_score = 1
    mis_match_score = 0 

    # read in the fasta files
    for seq1obj in SeqIO.parse(seq1name, "fasta"):
        print(seq1obj.id, ": ", seq1obj.seq)
    
    if not isreversecompl:
        for seq2obj in SeqIO.parse(seq2name, "fasta"):
            print(seq2obj.id, ": ", seq2obj.seq)

    if isreversecompl: # make the reverse complement of the sequence to find stem loops
        seq2obj=deepcopy(seq1obj)
        seq2obj.seq=seq1obj.seq.reverse_complement()
        seq2obj.id=seq1obj.id+' reverse complement'
        print(seq2obj.id, ": ", seq2obj.seq)

    seq1=seq1obj.seq.lower()
    seq2=seq2obj.seq.lower()
    nucleotide_list=('c', 'g', 't', 'a')
    matrix=np.zeros((len(seq1), len(seq2)), dtype=int) # store the dots here
    matrix2=np.zeros((len(seq1), len(seq2)), dtype=int) # store the match values here
    for i in range(0,len(seq1)-wordsize+1):
        for j in range(0, len(seq2)-wordsize+1):
            score=0
            for k in range(0, wordsize):
                if seq1[i+k]==seq2[j+k] and seq1[i+k] in nucleotide_list and seq2[j+k] in nucleotide_list: # if there is a match and it is a nucleotide
                    score+=match_score # simplest scoring scheme: match=+1 mismatch=0
                else:
                    score+=mis_match_score
            if score >= cutoff:
                matrix[i+int(wordsize/2),j+int(wordsize/2)]=1 # dot in the middle of the word if the score is not lower than the cut-off score
            matrix2[i + int(wordsize / 2), j + int(wordsize / 2)] = score # store the scores

    # visualization:
    # dotplot itself
    plt.figure(1)
    plt.pcolor(matrix, cmap=plt.cm.binary)
    plt.xlabel(seq2obj.id)
    plt.ylabel(seq1obj.id)
    plt.title('Dotplot: cut-off score= ' + str(cutoff) + ', word size=' + str(wordsize))
    plt.colorbar()
    #heatmap
    plt.figure(2)
    plt.pcolor(matrix2)
    plt.xlabel(seq2obj.id)
    plt.ylabel(seq1obj.id)
    plt.title('Dotplot: cut-off score= ' + str(cutoff) + ', word size=' + str(wordsize))
    plt.colorbar()

    plt.show()


        

## Usage examples

### DNA Sequence Dotplot Generation

In this section, we analyze two DNA sequences by generating a dotplot, which is a graphical method used to compare two biological sequences and identify regions of similarity. 

#### Steps:
1. **Downloading DNA Sequences from GitHub:**
   - Similar to the previous section, we retrieve the DNA sequences **seq1.fasta** and **seq2.fasta** from a GitHub repository using Python's `requests` library.
   - The files are downloaded and saved locally in the Colab environment.

2. **Generating the Dotplot:**
   - We use the `dotplot()` function to compare the two DNA sequences.
   - The parameters used are:
     - `cutoff=5`: A dot is drawn when at least 5 positions of the two compared words match.
     - `wordsize=10`: Specifies the length of the words (windows) that are compared.
     - `isreversecompl=False`: The two sequences are compared directly; no reverse complement is generated.
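
How `cutoff` and `wordsize` interact can be sketched on two short made-up sequences. This is a simplified stand-alone version of the scoring loop: `word_score`, `dot_positions` and the sequences are assumptions for the example, and the sketch reports the start positions of the words, whereas `dotplot()` places the dot in the middle of the word.

```python
def word_score(seq1, seq2, i, j, wordsize):
    """Count matching positions between the words starting at i and j."""
    return sum(1 for k in range(wordsize) if seq1[i + k] == seq2[j + k])

def dot_positions(seq1, seq2, wordsize, cutoff):
    """Return the (i, j) word start positions whose score reaches the cutoff."""
    dots = []
    for i in range(len(seq1) - wordsize + 1):
        for j in range(len(seq2) - wordsize + 1):
            if word_score(seq1, seq2, i, j, wordsize) >= cutoff:
                dots.append((i, j))
    return dots

a = "acgtacgt"
b = "ttacgtaa"

# cutoff equal to wordsize: only perfectly matching words give a dot
print(dot_positions(a, b, wordsize=4, cutoff=4))
# a lower cutoff tolerates mismatches, so more dots appear
print(len(dot_positions(a, b, wordsize=4, cutoff=2)))
```

With `cutoff=5` and `wordsize=10`, as in the call below, a word pair needs only half of its positions to match, which is a permissive setting.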


In [None]:
# Download gene sequence 1 from GitHub
gene_sequence_1_url = 'https://raw.githubusercontent.com/nbrg-ppcu/Introduction_to_bioinfo/dev/data/dot-hydro-plot/seq1.fasta'
gene_sequence_1_file = 'seq1.fasta'

response = requests.get(gene_sequence_1_url)
response.raise_for_status()  # stop here if the download failed (e.g. 404)
with open(gene_sequence_1_file, 'wb') as file:
    file.write(response.content)

# Download gene sequence 2 from GitHub
gene_sequence_2_url = 'https://raw.githubusercontent.com/nbrg-ppcu/Introduction_to_bioinfo/dev/data/dot-hydro-plot/seq2.fasta'
gene_sequence_2_file = 'seq2.fasta'

response = requests.get(gene_sequence_2_url)
response.raise_for_status()  # stop here if the download failed (e.g. 404)
with open(gene_sequence_2_file, 'wb') as file:
    file.write(response.content)



In [None]:
# Generate the dotplot
dotplot(gene_sequence_1_file, gene_sequence_2_file, cutoff=5, wordsize=10, isreversecompl=False)


### Other examples
### Repeats in sequence 1

In [None]:
#repeats in sequence 1
dotplot(gene_sequence_1_file, gene_sequence_1_file, cutoff=5, wordsize=10)

### Stem loops in sequence 2

In [None]:
#stem loops in sequence 2
dotplot(gene_sequence_2_file, None, cutoff=5, wordsize=10, isreversecompl=True)

### Using the code provided above:
  * you can run the program by calling the function
    * the 5th parameter (`isreversecompl`) is `True` if we search for stem loops, `False` otherwise (default is `False`)
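
Why the reverse complement reveals stem loops can be sketched in plain Python. The helper `reverse_complement` and the sequence `stem_loop` below are made up for the illustration; the `dotplot()` function itself relies on Biopython's `reverse_complement()` method instead.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

# made-up sequence: a 6-base arm, a loop of t's, then the arm's reverse complement
stem_loop = "gggcatttttatgccc"

print(reverse_complement("aacg"))
print(stem_loop)
print(reverse_complement(stem_loop))
```

The two arms of the stem read the same in the original sequence and in its reverse complement, so comparing the two puts matching words at those positions of the dotplot.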
    

## Exercise: Using an online service for the same sequences:
- http://www.bioinformatics.nl/cgi-bin/emboss/dotmatcher
- upload the two sequences by clicking the Browse button
- below you can change the parameters: window size, threshold (cut-off score)
    - what do you see compared to the results of the Python code?