# Welcome to ProtaVision: Delving into Protein Structures and Functions🧬🔬🧪🧫🦠
---
Welcome to this interactive Jupyter Notebook! 🎉

**About This Notebook:**
This notebook is dedicated to sequence alignment, a fundamental technique in bioinformatics. Using a custom package, you can input two proteins from a dictionary, efficiently determine the number of matching amino acids, and identify the indices of those that do not match. Whether you're delving into bioinformatics for the first time or seeking a streamlined tool for protein analysis, this notebook walks you through each step.

**Getting Started:**
To get started, simply navigate through the sections using the table of contents provided in the README. Each section is packed with explanations, code snippets, and visualizations to help you grasp concepts effectively.
NOTE: This notebook depends on the Biopython package, which you can install by typing `pip install biopython` in a terminal. If you still encounter an import error, run `!pip install biopython` in a notebook cell before executing the rest of the code. The error usually means that Jupyter Notebook is not using the Python environment in which Biopython was installed: Jupyter can run on different Python environments, and a package installed in one environment is not automatically available in another.
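
To check which environment the notebook kernel is actually using, a minimal sketch like the one below can help. It relies only on the standard library, and `is_installed` is a hypothetical helper name, not part of ProtaVision.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the running interpreter can find `module_name`."""
    return importlib.util.find_spec(module_name) is not None


# The path printed here is the Python environment the kernel runs in.
print("Kernel interpreter:", sys.executable)
# Biopython is imported under the name "Bio".
print("Biopython importable:", is_installed("Bio"))
```

If the second line prints `False`, installing from inside the notebook with `%pip install biopython` targets the kernel's own environment.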

**How to Use:**
Feel free to experiment with the code! Modify variables, tweak parameters, and run cells to see how different changes affect the outcomes. Don't hesitate to ask questions or seek clarification whenever needed.

**Additional Resources:**
If you're new to Jupyter Notebooks or need a refresher, check out the [Jupyter Notebook Documentation](https://jupyter-notebook.readthedocs.io/en/stable/) for comprehensive guidance.

**Questions?**
If you have any questions, please contact nichelle.sequeira@epfl.ch or aurelie.masson@epfl.ch

**Let's Dive In!**
Start exploring and discovering insights within this notebook. Happy coding! 🚀

🔧 First, let's import everything! 
---
Please run the following code. Here, we are going to import all dependencies and functions to ensure the code runs properly.

In [1]:
from Bio import SeqIO
from Bio import ExPASy
from protavision import functions
import pandas as pd
from Bio.SeqUtils.ProtParam import ProteinAnalysis
import matplotlib.pyplot as plt
import py3Dmol 
from Bio import pairwise2
import os
import ast



🔥Let's Get Started! 
---
The function **get_protein_sequence(protein_name1)** takes the name of a protein and fetches its sequence from the Swiss-Prot database using the Biopython library.

The name to enter is the protein's **UniProt** entry name (e.g., "MYG_HUMAN"); Swiss-Prot is the manually reviewed section of UniProt.

The function relies on two of the modules imported above: Bio.SeqIO (a module from Biopython used for reading and writing sequence files in different formats), and Bio.ExPASy (a module that provides access to tools and databases hosted on the ExPASy server, including the Swiss-Prot database).

### Example Usage📚
If you enter MYG_HUMAN as an input, you should get the following output:
- This is the sequence of your first protein: MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG


In [2]:
protein_name1 = input("Please enter the name of the first protein (example MYG_HUMAN): ")

# Call the function using the module name
sequence1 = functions.get_protein_sequence(protein_name1)
print(f"This is the sequence of your first protein: {sequence1}")

Please enter the name of the first protein (example MYG_HUMAN):  MYG_HUMAN


This is the sequence of your first protein: MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG


In [3]:
protein_name2 = input("Please enter the name of the second protein (example MYG_MOUSE): ")
sequence2 = functions.get_protein_sequence(protein_name2)
print(f"This is the sequence of your second protein: {sequence2}")

Please enter the name of the second protein (example MYG_MOUSE):  MYG_MOUSE


This is the sequence of your second protein: MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRHSGDFGADAQGAMSKALELFRNDIAAKYKELGFQG


🚀Next Step: Let's find the differences between the two sequences!🧬
---
This code compares the two sequences defined above (sequence1 and sequence2) character by character and reports every position where they differ, together with the amino acid found in each sequence at that index. The same approach can be used to compare any pair of sequences, such as DNA or protein sequences, and to identify variations or mutations between them.

### Expected output📚
You should obtain an output giving:
- The number of amino acids that don't match
- The number of amino acids that do match
- A table showing the index of the amino acids that don't match as well as the amino acids of both sequences
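
The position-by-position comparison described above can be sketched in a few lines. This is only an illustration on short toy strings: `find_mismatches` is a hypothetical helper, and the package's `compare_sequences` may be implemented differently (it returns a table rather than a list).

```python
def find_mismatches(seq1: str, seq2: str) -> list[tuple[int, str, str]]:
    """List (index, residue_in_seq1, residue_in_seq2) for every differing position."""
    # zip stops at the shorter sequence, so trailing residues are ignored.
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]


seq_a = "MGLSDGEW"  # toy sequences, chosen only for illustration
seq_b = "MGLTDGKW"
mismatches = find_mismatches(seq_a, seq_b)
n_compared = min(len(seq_a), len(seq_b))

print("Number of mismatches:", len(mismatches))            # 2
print("Number of matches:", n_compared - len(mismatches))  # 6
for index, aa1, aa2 in mismatches:
    print(index, aa1, aa2)
```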


In [None]:
df = functions.compare_sequences(sequence1, sequence2)
df

## 💻Amino Acid Analysis and Structure Visualization

In this section, we will analyze the properties and visualize the structures of amino acids. Amino acids are the building blocks of proteins, each with unique chemical properties and structures. We will explore various characteristics such as molecular weight, hydrophobicity, and amino acid composition, as well as visualize their structures to gain insights into their functions and interactions within proteins. The proportion of each amino acid in each sequence will be visualized as a bar chart with the help of the function proportion_amino_acid.

Let's start by looking at a few properties.

### Expected Output📚
You should see the molecular weight of each amino acid sequence, as well as two graphs showing the proportion of each amino acid in each sequence.

##### Example Usage for the molecular weight:
- The molecular weight of your first amino acid sequence is: 17183.597 g/mol
- The molecular weight of your second amino acid sequence is: 17069.467 g/mol
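
The proportions shown in the bar charts can be understood with a small sketch like the following. It only computes the fractions (no plotting), and `amino_acid_proportions` is a hypothetical name, so it may differ from what `proportion_amino_acid` does internally.

```python
from collections import Counter


def amino_acid_proportions(sequence: str) -> dict[str, float]:
    """Map each residue letter to its fraction of the sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return {residue: count / total for residue, count in sorted(counts.items())}


# First ten residues of the MYG_HUMAN sequence shown earlier.
proportions = amino_acid_proportions("MGLSDGEWQL")
for residue, fraction in proportions.items():
    print(f"{residue}: {fraction:.2f}")
```

Here `G` and `L` each appear twice in ten residues, so both are reported as 0.20.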

In [None]:
analyzed_seq1 = ProteinAnalysis(str(sequence1))
analyzed_seq2 = ProteinAnalysis(str(sequence2))

molecular_weight1 = round(analyzed_seq1.molecular_weight(), 3)
molecular_weight2 = round(analyzed_seq2.molecular_weight(), 3)


# Print the molecular weight using an f-string

print(f"The molecular weight of your first amino acid sequence is: {molecular_weight1} g/mol")
print()
print(f"The molecular weight of your second amino acid sequence is: {molecular_weight2} g/mol")

In [None]:
functions.proportion_amino_acid(sequence1, sequence2) #this shows a bar chart of the various amino acids

## 📝Conservative Substitutions of Amino Acids in Biochemistry

In biochemistry, conservative substitutions are amino acid substitutions in protein sequences that preserve the physicochemical properties or functional roles of the amino acids involved. These substitutions maintain the structure and function of the protein despite alterations in the amino acid sequence.

### Characteristics:
- **Conservation of Function**: Preserves biochemical properties or functional roles.
- **Conservation of Structure**: Maintains structural integrity, especially in critical regions.
- **Functional Redundancy**: Utilizes multiple amino acids with similar roles.

### Examples📚:
- Aromatic Residues: Phenylalanine, tyrosine, and tryptophan.
- Hydrophobic Residues: Leucine, isoleucine, and valine.
- Charge Conservation: Lysine and arginine, glutamate and aspartate.
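
As a rough illustration of the idea, the sketch below counts positions where two residues differ but fall into the same group. The groups are taken from the examples above; `count_conservative` is a hypothetical function, and the package's `count_conservative_substitutions` may use different groups or rules.

```python
# Groups taken from the examples above; the package may use different ones.
CONSERVATIVE_GROUPS = [
    set("FYW"),  # aromatic
    set("LIV"),  # hydrophobic
    set("KR"),   # positively charged
    set("DE"),   # negatively charged
]


def count_conservative(seq1: str, seq2: str) -> int:
    """Count positions where the residues differ but belong to the same group."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must have the same length")
    return sum(
        1
        for a, b in zip(seq1, seq2)
        if a != b and any(a in group and b in group for group in CONSERVATIVE_GROUPS)
    )


# K/R, L/I, F/W and D/E are counted; A/G is in none of the groups.
print(count_conservative("KLFDA", "RIWEG"))  # 4
```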

### Importance:
Understanding conservative substitutions helps predict the effects of mutations on protein structure and function, which supports informed hypotheses about protein behavior.

**Please make sure that your amino acid sequences have the same length before running this code; the comparison is not meaningful for sequences of different lengths.**

In [None]:
num_conserv_substitutions = functions.count_conservative_substitutions(sequence1, sequence2)

print("The number of conservative substitutions :", num_conserv_substitutions)

## 🎉Protein Visualization

Protein visualization is an essential aspect of structural biology and bioinformatics, allowing researchers to understand the three-dimensional structure and function of proteins. Interactive molecular visualization tools make it possible to view and analyze protein structures directly within Jupyter Notebooks. Before visualizing a protein in 3D, we need its PDB code, because the Python package we use, py3Dmol, needs the PDB code to load the structure. The first part of this section therefore converts the UniProt name (for example MYG_HUMAN) into a PDB code (for example 3RGK).

### Expected Output📚
You should be able to see the protein structures in 3D and be able to zoom in/out, and turn the protein around.


In [None]:
# Look up the PDB code of each protein and report whether one was found
pdb_id1 = functions.uniprot_to_pdb(protein_name1)
if pdb_id1:
    print(f"The PDB code associated with the first protein {protein_name1} is: {pdb_id1}")
else:
    print(f"No PDB code is associated with the first protein {protein_name1}")

pdb_id2 = functions.uniprot_to_pdb(protein_name2)
if pdb_id2:
    print(f"The PDB code associated with the second protein {protein_name2} is: {pdb_id2}")
else:
    print(f"No PDB code is associated with the second protein {protein_name2}")

In [None]:
view = py3Dmol.view(query=pdb_id1)

view.setStyle({'cartoon':{'color':'spectrum'}})

view

In [None]:
import py3Dmol  # requires the py3Dmol package to be installed

view = py3Dmol.view(query=pdb_id2)

view.setStyle({'cartoon':{'color':'spectrum'}})

view

# If nothing appears, this is expected when no PDB code exists for your protein


## 💡Gap Method for Sequence Alignment

Now let's move on to the Gap Method. This code implements a simple sequence alignment method that introduces gaps into the sequences. Sequence alignment is a fundamental task in bioinformatics used to identify regions of similarity between DNA, RNA, or protein sequences.

### Explanation:
This code is useful for aligning and comparing sequences to identify similarities and differences between them. The Needleman-Wunsch algorithm ensures optimal global alignment, considering all possible alignments to maximize similarity.

### What is the Needleman-Wunsch algorithm?
The Needleman-Wunsch algorithm is a dynamic programming algorithm used for global pairwise sequence alignment. It compares two sequences by optimizing a scoring scheme that rewards matches and penalizes mismatches and gaps. By considering all possible alignments and choosing the one with the highest score, it identifies the optimal alignment, providing insights into the evolutionary relationships and functional similarities between the sequences.
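
To make the dynamic programming idea concrete, here is a minimal sketch that computes only the optimal global alignment score (no traceback of the aligned strings). The scoring values (+1 match, -1 mismatch, -1 gap) are placeholder assumptions and are not the ones used elsewhere in this notebook; `needleman_wunsch_score` is a hypothetical name.

```python
def needleman_wunsch_score(seq1: str, seq2: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the optimal global alignment score of seq1 and seq2."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of seq1 aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of seq2 aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[i - 1] == seq2[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap    # gap inserted in seq2
            left = score[i][j - 1] + gap  # gap inserted in seq1
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]


# MGLSD / MG-SD: four matches and one gap
print(needleman_wunsch_score("MGLSD", "MGSD"))  # 3
```

Each cell depends only on its three neighbours (diagonal, up, left), which is why the table can be filled row by row and the bottom-right cell holds the best score over all possible global alignments.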

### Example Output📚
For sequences defined as: 
- sequence1 = MYG_HUMAN
- sequence2 = MYG_MOUSE

One would obtain the following as output:
  - Sequence 1 aligned: MGLSDGEWQLVLNVWGKVEADIP--GHGQEVLIR-LFKG-HPETLE-KFDKFKH-LKSEDE-MKA-SEDLKKHGA-TVLTALGG-ILKKKGH-HEA-EIK-PLAQSHATKHKIPVKYLEFISECIIQ--VLQSK--HP-GDFGADAQGAMN-KALELFRK-DM-ASN--YKELGFQG
  - Sequence 2 aligned: MGLSDGEWQLVLNVWGKVEAD--LAGHGQEVLI-GLFK-THPETL-DKFDKFK-NLKSE-EDMK-GSEDLKKHG-CTVLTAL-GTILKKKG-QH-AAEI-QPLAQSHATKHKIPVKYLEFISE-II-IEVL--KKRH-SGDFGADAQGAM-SKALELFR-ND-IA--AKYKELGFQG
  - Number of matches: 131
  - Score: 735
  - Number of gaps in Sequence 1: 23
  - Number of gaps in Sequence 2: 23

In [None]:
sequence1_aligned, sequence2_aligned, begin, end, num_matches = functions.calculate_alignment_details(sequence1, sequence2)
num_gaps_sequence1 = functions.calculate_number_of_gaps(sequence1_aligned, sequence1)
num_gaps_sequence2 = functions.calculate_number_of_gaps(sequence2_aligned, sequence2)
# Score: +10 per match, -25 per gap (only the gaps in sequence 1 are counted)
score = 10*num_matches - 25*num_gaps_sequence1

print("Sequence 1 aligned:", sequence1_aligned)
print("Sequence 2 aligned:", sequence2_aligned)
print("Number of matches:", num_matches)
print("Score:", score)
print("Number of gaps in Sequence 1:", num_gaps_sequence1)
print("Number of gaps in Sequence 2:", num_gaps_sequence2)


## 🧠 Gap Method for Sequence Alignment - Your Turn!

Now it's your turn to choose the size of the gap and the position where you want to insert it!
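
One way to picture what happens is the sketch below. It assumes the gap is inserted into the first sequence and that matches are then counted position by position; this is an assumption for illustration, and the package's `count_matches_with_gap` may behave differently. The function name is hypothetical.

```python
def count_matches_with_gap_sketch(seq1: str, seq2: str,
                                  gap_length: int, position: int) -> int:
    """Insert `gap_length` gap characters into seq1 at `position`, then count matches."""
    if gap_length < 0:
        raise ValueError("gap_length must not be negative")
    if not 0 <= position <= len(seq1):
        raise ValueError("position must lie between 0 and len(seq1)")
    gapped = seq1[:position] + "-" * gap_length + seq1[position:]
    # Compare position by position; zip stops at the shorter sequence.
    return sum(1 for a, b in zip(gapped, seq2) if a == b)


# Without a gap only the first two residues line up; one gap at index 2 fixes the rest.
print(count_matches_with_gap_sketch("MGSDGE", "MGLSDGE", gap_length=0, position=0))  # 2
print(count_matches_with_gap_sketch("MGSDGE", "MGLSDGE", gap_length=1, position=2))  # 6
```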

In [None]:
gap_length = int(input("Enter the length of the gap: "))
position = int(input("Enter the position to apply the gap: "))

num_matches = functions.count_matches_with_gap(sequence1, sequence2, gap_length, position)
print("Number of matches found with the gap method is:", num_matches)

## 🧩 Amino Acid Composition Analysis

Now let's analyze the amino acid composition of the proteins.

### First, let's visualise the amino acids


This code reads amino acid data from a text file and displays the details of each amino acid, including its name, single-letter code, three-letter code, and SMILES structure, as a table. The following steps outline the functionality of the code:

1. **Import Required Libraries**: The code uses the `os`, `ast`, and `pandas` libraries. The `os` library is used for file path manipulations, the `ast` library is used to safely evaluate a string containing a Python literal, and `pandas` is used to build the table.

2. **Determine the Path to the Data File**: 
    - The current directory is obtained using `os.path.dirname(os.path.abspath('__file__'))`, which resolves to the current working directory (normally the folder containing the notebook).
    - The project root directory is determined by navigating one level up from the current directory.
    - The full path to the `aminoacids.txt` file is constructed by joining the project root with the `data` folder and the filename.

3. **Read the Amino Acid Data from the File**:
    - The file `aminoacids.txt` is opened and its content is read into a string variable.
    - Leading and trailing whitespace is stripped; the file is expected to contain only a Python list literal.

4. **Convert the String Data to a Python List**:
    - The `ast.literal_eval` function is used to safely evaluate the string containing the list of amino acids and convert it into a Python list of dictionaries.

5. **Display Amino Acid Details**:
    - The list of amino acids is converted into a pandas DataFrame with one row per amino acid.
    - The DataFrame is displayed, showing each amino acid's name, single-letter code, three-letter code, and SMILES structure.

This code is useful for processing and displaying biochemical data related to amino acids, making it easier to analyze and work with this information in a structured way.
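
The parsing step can be tried without the data file. The sketch below uses a short inline string as a stand-in for the contents of `aminoacids.txt`; the key names (`name`, `letter`, `code`, `smiles`) are assumptions and may not match the real file.

```python
import ast

# Stand-in for the contents of aminoacids.txt; the key names are assumptions.
raw_text = """
[
    {"name": "Glycine", "letter": "G", "code": "Gly", "smiles": "NCC(=O)O"},
    {"name": "Alanine", "letter": "A", "code": "Ala", "smiles": "CC(N)C(=O)O"},
]
"""

# literal_eval only accepts Python literals, so it cannot run arbitrary code.
amino_acids = ast.literal_eval(raw_text.strip())

for entry in amino_acids:
    print(f'{entry["name"]} ({entry["letter"]}, {entry["code"]}): {entry["smiles"]}')
```

Because `ast.literal_eval` rejects anything that is not a plain literal, a file containing an assignment such as `amino_acids = [...]` would raise an error rather than be evaluated.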


In [None]:
# Determine the path to the aminoacids.txt file
# Because '__file__' is a quoted string, this resolves to the current working directory
current_dir = os.path.dirname(os.path.abspath('__file__'))
project_root = os.path.dirname(current_dir)
file_path = os.path.join(project_root, 'data', 'aminoacids.txt')

# Read the amino acids data from the file
with open(file_path, 'r') as file:
    data = file.read().strip()
    amino_acids = ast.literal_eval(data)

# Create a DataFrame
df = pd.DataFrame(amino_acids)

# Display the DataFrame
print("Amino Acid Data:")
display(df)

### 🔍 Counting Specific Types of Amino Acids in Protein Sequences (hydrophobic, basic...)

This Jupyter notebook code snippet counts and displays the number of different types of amino acids in given protein sequences. The steps below outline the functionality of the code:

1. **Count Amino Acids in Protein Sequences**:
    - The `count_amino_acids` function from the `functions` module is used to count the amino acids in the given protein sequences, `sequence1` and `sequence2`.
    - The counts for `sequence1` are stored in `counts_1rsprotein`.
    - The counts for `sequence2` are stored in `counts_2ndprotein`.

2. **Print Amino Acid Counts**:
    - The code prints the protein sequence `sequence1`.
    - It prints the number of hydrophobic, hydrophilic, acidic, and basic amino acids in `sequence1` using the counts stored in `counts_1rsprotein`.

### Example Usage

If the sequences and the `count_amino_acids` function are properly defined, running this code will provide a detailed breakdown of the types of amino acids in the protein sequences. This is useful for analyzing the properties of proteins based on their amino acid composition.

For a sequence "MVHLTPEEK", one would expect:
- Number of hydrophobic amino acids: 3
- Number of hydrophilic amino acids: 1
- Number of acidic amino acids: 2 
- Number of basic amino acids: 2
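
The counting itself can be sketched as follows. The residue groupings here are illustrative assumptions, chosen so that the short example above gives the same counts; the grouping used by the package's `count_amino_acids` may differ, and `count_residue_classes` is a hypothetical name.

```python
# Illustrative groupings only; residues outside these sets (e.g. G, P) are not counted.
RESIDUE_CLASSES = {
    "hydrophobic": set("AVLIMFW"),
    "hydrophilic": set("STNQ"),
    "acidic": set("DE"),
    "basic": set("KRH"),
}


def count_residue_classes(sequence: str) -> dict[str, int]:
    """Count how many residues of `sequence` fall into each class."""
    return {
        name: sum(1 for residue in sequence if residue in members)
        for name, members in RESIDUE_CLASSES.items()
    }


print(count_residue_classes("MVHLTPEEK"))
# {'hydrophobic': 3, 'hydrophilic': 1, 'acidic': 2, 'basic': 2}
```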

In [4]:
counts_1rsprotein = functions.count_amino_acids(sequence1)
counts_2ndprotein = functions.count_amino_acids(sequence2)
print(f"Sequence : {sequence1}")
print("Number of hydrophobic amino acids :", counts_1rsprotein["hydrophobics"])
print("Number of hydrophilic amino acids  :", counts_1rsprotein["hydrophiles"])
print("Number of acidic amino acids :", counts_1rsprotein["acids"])
print("Number of basic amino acids  :", counts_1rsprotein["bases"])


Sequence : MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
Number of hydrophobic amino acids : 59
Number of hydrophilic amino acids  : 22
Number of acidic amino acids : 22
Number of basic amino acids  : 31


# ProtaVision: Challenges and Features - General Remarks

## Introduction

ProtaVision is an interactive Jupyter Notebook dedicated to exploring protein structures and functions!

## Main Features

- **Sequence Alignment**: Efficiently compare two protein sequences to identify matching and non-matching amino acids.
- **Amino Acid Analysis**: Analyze properties of amino acids, such as molecular weight and composition.
- **Conservative Substitutions**: Explore amino acid substitutions that preserve biochemical properties.
- **Protein Visualization**: Visualize protein structures in 3D using PDB codes.
- **Gap Method for Sequence Alignment**: Implement a gap-based sequence alignment method.

## Challenges Encountered

1. **PDB Retrieval**: Finding PDB codes for protein visualization posed a challenge, especially for proteins with no assigned PDB codes.
2. **Sequence Alignment**: Writing code for sequence alignment, especially implementing the Needleman-Wunsch algorithm, required careful consideration and debugging.
3. **Package Compatibility**: Getting the different Python packages used in the notebook, such as Biopython and py3Dmol, to work together was sometimes difficult.
4. **User Input Handling**: Handling user input for gap-based sequence alignment involves validating inputs and error handling to ensure smooth execution.
5. **Using GitHub and handling conflicts**: We had difficulty working on the notebook together while learning to use features such as pushing and pulling edits, managing environments and packages, and much more.

## Tools Used

- **Biopython**: Utilized for retrieving protein sequences from the UniProt database and performing sequence alignment.
- **py3Dmol**: Used for visualizing protein structures in 3D within Jupyter Notebook.
- **Jupyter Notebook**: Provided an interactive environment for running code, visualizing data, and documenting the analysis process.

## Conclusion

ProtaVision offers a toolkit for exploring protein structures and functions through sequence alignment, visualization, and analysis. Its interactive notebook format makes it a useful resource for bioinformatics enthusiasts and researchers alike.
