# Welcome to ProtaVision: Delving into Protein Structures and Functions🧬🔬🧪🧫🦠
---
Welcome to this interactive Jupyter Notebook! 🎉

**About This Notebook:**
This notebook is dedicated to facilitating sequence alignment, a fundamental principle in bioinformatics. Using a custom package, you can input two proteins from a dictionary and efficiently determine the number of matching amino acids, as well as identify the indices of those that do not match. Whether you're delving into bioinformatics for the first time or seeking a streamlined tool for protein analysis, this notebook provides a comprehensive solution.

**Getting Started:**
To get started, navigate through the sections using the table of contents in the README. Each section contains explanations, code snippets, and visualizations to help you grasp the concepts.
NOTE: This notebook relies on the Biopython package, which you can install by typing "pip install biopython" in a terminal. If the import still fails, run this line in a notebook cell before executing the code: !pip install biopython. The error usually means that Jupyter Notebook is not using the Python environment in which Biopython was installed. Jupyter Notebook can run on different Python environments, and a package installed in one environment is not necessarily available in another.

**How to Use:**
Feel free to experiment with the code! Modify variables, tweak parameters, and run cells to see how different changes affect the outcomes. Don't hesitate to ask questions or seek clarification whenever needed.

**Additional Resources:**
If you're new to Jupyter Notebooks or need a refresher, check out the [Jupyter Notebook Documentation](https://jupyter-notebook.readthedocs.io/en/stable/) for comprehensive guidance.

**Questions?**
If you have any questions, please contact nichelle.sequeira@epfl.ch or aurelie.masson@epfl.ch

**Let's Dive In!**
Start exploring and discovering insights within this notebook. Happy coding! 🚀

Let's Get Started! 
---
The function **get_protein_sequence(protein_name1)** takes the entry name of a protein (e.g., "MYG_HUMAN") and fetches its sequence from the Swiss-Prot section of the **UniProt** database using the Biopython library.

We start by importing the required libraries: Bio.SeqIO (a Biopython module for reading and writing sequence files in different formats) and Bio.ExPASy (a module that provides access to tools and databases hosted on the ExPASy server, including the Swiss-Prot database).

In [1]:
from Bio import SeqIO
from Bio import ExPASy
# The project's functions module is assumed to be importable from the notebook's working directory
from functions import get_protein_sequence

protein_name1 = input("Please enter the name of the first protein (example MYG_HUMAN): ")

# Fetch the sequence for the entry name typed above
sequence1 = get_protein_sequence(protein_name1)
print(f"This is the sequence of your first protein: {sequence1}")

In [None]:
from functions import get_protein_sequence


protein_name2 = input("Please enter the name of the second protein (example MYG_MOUSE): ")
sequence2 = get_protein_sequence(protein_name2)
print(f"This is the sequence of your second protein: {sequence2}")

Next Step: Let's find the differences between the two sequences!🧬
---
This code snippet compares two sequences (sequence1 and sequence2) position by position and displays, for each difference, the index and the amino acid found in each sequence at that index. The same approach works for any pair of sequences, such as DNA or protein sequences, and helps identify variations or mutations between them.
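
As a rough illustration of this kind of position-by-position comparison, here is a minimal sketch in plain Python. The function name `find_differences` and the toy sequences are hypothetical and are not part of the notebook's package; the notebook's own `compare_sequences` may handle details such as unequal lengths differently.

```python
def find_differences(seq_a, seq_b):
    """Return (index, residue_a, residue_b) for each position where the sequences differ."""
    differences = []
    # zip stops at the end of the shorter sequence, so only overlapping positions are compared
    for index, (res_a, res_b) in enumerate(zip(seq_a, seq_b)):
        if res_a != res_b:
            differences.append((index, res_a, res_b))
    return differences


# Toy sequences for illustration only
toy_a = "MGLSDGEWQL"
toy_b = "MGLSDGEWQQ"
print(find_differences(toy_a, toy_b))  # [(9, 'L', 'Q')]
```

Indices here are zero-based, following Python's convention.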

In [None]:
import pandas as pd
import functions
from functions import compare_sequences
df = compare_sequences(sequence1, sequence2)

df

## Amino Acid Analysis and Structure Visualization

In this section, we will analyze the properties and visualize the structures of amino acids. Amino acids are the building blocks of proteins, each with unique chemical properties and structures. We will explore characteristics such as molecular weight, hydrophobicity, and amino acid composition, and visualize structures to gain insights into their functions and interactions within proteins. The proportion of each amino acid in each sequence is visualized as a bar chart with the function proportion_amino_acid.

Let's start with a few properties.

In [None]:
from Bio.SeqUtils.ProtParam import ProteinAnalysis
analyzed_seq1 = ProteinAnalysis(str(sequence1))
analyzed_seq2 = ProteinAnalysis(str(sequence2))

molecular_weight1 = round(analyzed_seq1.molecular_weight(), 3)
molecular_weight2 = round(analyzed_seq2.molecular_weight(), 3)


# Print the molecular weight using an f-string

print(f"The molecular weight of your first amino acid sequence is: {molecular_weight1} g/mol")
print()
print(f"The molecular weight of your second amino acid sequence is: {molecular_weight2} g/mol")

In [None]:
import matplotlib.pyplot as plt
from Bio.SeqUtils.ProtParam import ProteinAnalysis

from functions import proportion_amino_acid


proportion_amino_acid(sequence1, sequence2)
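
To make the idea of amino acid proportions concrete, the following sketch computes the fraction of each residue in a sequence using only the standard library. The helper name `amino_acid_proportions` is hypothetical; it only mirrors the kind of numbers a bar chart like the one above would display and is not the package's implementation.

```python
from collections import Counter


def amino_acid_proportions(sequence):
    """Map each residue present in the sequence to its fraction of the total length."""
    counts = Counter(sequence)
    total = len(sequence)
    # Sorting the items gives a stable, alphabetical order for plotting
    return {residue: count / total for residue, count in sorted(counts.items())}


print(amino_acid_proportions("AAGL"))  # {'A': 0.5, 'G': 0.25, 'L': 0.25}
```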


## Conservative Substitutions of Amino Acids

In biochemistry, conservative substitutions are amino acid substitutions in a protein sequence that preserve the physicochemical properties or functional roles of the residues involved. Such substitutions tend to maintain the structure and function of the protein despite the change in the amino acid sequence.

### Characteristics:
- **Conservation of Function**: Preserves biochemical properties or functional roles.
- **Conservation of Structure**: Maintains structural integrity, especially in critical regions.
- **Functional Redundancy**: Utilizes multiple amino acids with similar roles.

### Examples:
- Aromatic Residues: Phenylalanine, tyrosine, and tryptophan.
- Hydrophobic Residues: Leucine, isoleucine, and valine.
- Charge Conservation: Lysine and arginine, glutamate and aspartate.

### Importance:
Understanding conservative substitutions helps predict the effects of mutations on protein structure and function, and supports informed hypotheses about protein behavior.

**Please make sure that your two amino acid sequences have the same length before running this code: the position-by-position comparison is not meaningful otherwise.**
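
The idea can be sketched as follows, using only the residue groups listed in the examples above (aromatic, hydrophobic, and charged). The names `CONSERVATIVE_GROUPS`, `is_conservative`, and `count_conservative` are hypothetical, and the grouping is an assumption for illustration; the package's `count_conservative_substitutions` may use a different classification.

```python
# Residue groups taken from the examples above; a fuller scheme would cover all 20 amino acids
CONSERVATIVE_GROUPS = [
    set("FYW"),  # aromatic
    set("LIV"),  # hydrophobic
    set("KR"),   # positively charged
    set("DE"),   # negatively charged
]


def is_conservative(res_a, res_b):
    """True when two different residues belong to the same group."""
    return res_a != res_b and any(
        res_a in group and res_b in group for group in CONSERVATIVE_GROUPS
    )


def count_conservative(seq_a, seq_b):
    """Count positions where one residue is replaced by a similar one."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must have the same length")
    return sum(is_conservative(a, b) for a, b in zip(seq_a, seq_b))


# L->I and K->R are conservative, D->D is identical, F->A is not conservative
print(count_conservative("LKDF", "IRDA"))  # 2
```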

In [None]:
from functions import count_conservative_substitutions

# Example usage with sequence1 and sequence2

num_conserv_substitutions = count_conservative_substitutions(sequence1, sequence2)

print("Number of conservative substitutions:", num_conserv_substitutions)


## Protein Visualization

Protein visualization is an essential aspect of structural biology and bioinformatics, allowing researchers to understand the three-dimensional structure and function of proteins. Interactive molecular visualization tools provide powerful capabilities for visualizing and analyzing protein structures directly within Jupyter Notebooks. Before visualizing a protein in 3D, we need its PDB code, because the Python package we use, py3Dmol, needs this code to load the structure. The first part of this section therefore converts the UniProt entry name (for example MYG_HUMAN) into a PDB code (for example 3RGK).



In [None]:
from functions import uniprot_to_pdb


pdb_id1 = uniprot_to_pdb(protein_name1)
if pdb_id1:
    print(f"The PDB code associated with the first protein {protein_name1} is: {pdb_id1}")
else:
    print(f"No PDB code is associated with the first protein {protein_name1}")

pdb_id2 = uniprot_to_pdb(protein_name2)
if pdb_id2:
    print(f"The PDB code associated with the second protein {protein_name2} is: {pdb_id2}")
else:
    print(f"No PDB code is associated with the second protein {protein_name2}")

In [None]:
import py3Dmol  # requires the py3Dmol package to be installed

view = py3Dmol.view(query=pdb_id1)

view.setStyle({'cartoon':{'color':'spectrum'}})

view

In [None]:
import py3Dmol  # requires the py3Dmol package to be installed

view = py3Dmol.view(query=pdb_id2)

view.setStyle({'cartoon':{'color':'spectrum'}})

view

# If nothing appears, there is probably no PDB code for your protein


## Gap Method for Sequence Alignment

Now let's move on to the gap method. This code implements a simple sequence alignment method that allows gaps. Sequence alignment is a fundamental task in bioinformatics used to identify regions of similarity between DNA, RNA, or protein sequences.

### Explanation:
This code is useful for aligning and comparing sequences to identify similarities and differences between them. The Needleman-Wunsch algorithm ensures optimal global alignment, considering all possible alignments to maximize similarity.

### What is the Needleman-Wunsch algorithm?
The Needleman-Wunsch algorithm is a dynamic programming algorithm used for global pairwise sequence alignment. It compares two sequences by optimizing a scoring scheme that rewards matches and penalizes mismatches and gaps. By considering all possible alignments and choosing the one with the highest score, it identifies the optimal alignment, providing insights into the evolutionary relationships and functional similarities between the sequences.
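
To illustrate the dynamic programming idea described above, here is a compact pure-Python sketch of Needleman-Wunsch with a linear gap penalty. The function name and the scoring values (match=1, mismatch=-1, gap=-2) are assumptions chosen for the example; they are not the scores used by the notebook's cell below, which relies on Biopython.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell keeps the best of diagonal, up, and left moves
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    # Traceback from the bottom-right corner to recover one optimal alignment
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

When several alignments share the best score, this sketch returns only one of them (it prefers diagonal moves during traceback).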

In [None]:
from Bio import pairwise2
from functions import calculate_alignment_details
from functions import calculate_number_of_gaps


sequence1_aligned, sequence2_aligned, score, begin, end, num_matches = calculate_alignment_details(sequence1, sequence2)
num_gaps_sequence1 = calculate_number_of_gaps(sequence1_aligned, sequence1)
num_gaps_sequence2 = calculate_number_of_gaps(sequence2_aligned, sequence2)
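# Custom score: +10 per match and -25 per gap in sequence 1 (this replaces the score returned above)
score = 10*num_matches - 25*num_gaps_sequence1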

print("Sequence 1 aligned:", sequence1_aligned)
print("Sequence 2 aligned:", sequence2_aligned)
print("Number of matches:", num_matches)
print("Score:", score)
print("Number of gaps in Sequence 1:", num_gaps_sequence1)
print("Number of gaps in Sequence 2:", num_gaps_sequence2)


## Gap Method for Sequence Alignment - Your Turn!

Now it's your turn: choose the size of the gap and the position where you want to insert it!
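
One plausible reading of this manual gap method is sketched below: a run of gap characters is inserted into the first sequence at the chosen position, and the matches with the second sequence are then counted position by position. The function `matches_with_gap` is hypothetical, and this behaviour is an assumption; the package's `count_matches_with_gap` may treat the gap differently.

```python
def matches_with_gap(seq_a, seq_b, gap_length, position):
    """Insert gap_length '-' characters into seq_a at position, then count matching positions."""
    if gap_length < 0:
        raise ValueError("gap_length must not be negative")
    if not 0 <= position <= len(seq_a):
        raise ValueError("position must lie within the first sequence")
    gapped = seq_a[:position] + "-" * gap_length + seq_a[position:]
    # Only overlapping positions are compared; a '-' never matches a residue
    return sum(a == b for a, b in zip(gapped, seq_b))


# "ACGT" becomes "AC--GT", which matches "ACTTGT" at four positions
print(matches_with_gap("ACGT", "ACTTGT", gap_length=2, position=2))  # 4
```

Trying a few positions and gap lengths by hand shows why an algorithm such as Needleman-Wunsch is useful: it searches over all placements instead of a single user-chosen one.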

In [None]:
from functions import count_matches_with_gap

gap_length = int(input("Enter the length of the gap: "))
position = int(input("Enter the position to apply the gap: "))

num_matches = count_matches_with_gap(sequence1, sequence2, gap_length, position)
print("Number of matches found with the gap method is:", num_matches)



In [None]:
# Amino acid names, single-letter codes, three-letter codes, and SMILES codes
amino_acids = [
    {"name": "Alanine", "single_letter": "A", "three_letter": "Ala", "smiles": "CC(C(=O)O)N"},
    {"name": "Arginine", "single_letter": "R", "three_letter": "Arg", "smiles": "C(CC(C(=O)O)N)CN=C(N)N"},
    {"name": "Asparagine", "single_letter": "N", "three_letter": "Asn", "smiles": "C(C(C(=O)O)N)C(=O)N"},
    {"name": "Aspartic acid", "single_letter": "D", "three_letter": "Asp", "smiles": "C(C(C(=O)O)N)C(=O)O"},
    {"name": "Cysteine", "single_letter": "C", "three_letter": "Cys", "smiles": "C(C(C(=O)O)N)S"},
    {"name": "Glutamine", "single_letter": "Q", "three_letter": "Gln", "smiles": "C(CC(=O)N)C(C(=O)O)N"},
    {"name": "Glutamic acid", "single_letter": "E", "three_letter": "Glu", "smiles": "C(CC(=O)O)C(C(=O)O)N"},
    {"name": "Glycine", "single_letter": "G", "three_letter": "Gly", "smiles": "C(C(=O)O)N"},
    {"name": "Histidine", "single_letter": "H", "three_letter": "His", "smiles": "C1=C(NC=N1)CC(C(=O)O)N"},
    {"name": "Isoleucine", "single_letter": "I", "three_letter": "Ile", "smiles": "CCC(C)C(C(=O)O)N"},
    {"name": "Leucine", "single_letter": "L", "three_letter": "Leu", "smiles": "CC(C)CC(C(=O)O)N"},
    {"name": "Lysine", "single_letter": "K", "three_letter": "Lys", "smiles": "C(CCN)CC(C(=O)O)N"},
    {"name": "Methionine", "single_letter": "M", "three_letter": "Met", "smiles": "CSCCC(C(=O)O)N"},
    {"name": "Phenylalanine", "single_letter": "F", "three_letter": "Phe", "smiles": "c1ccc(cc1)CC(C(=O)O)N"},
    {"name": "Proline", "single_letter": "P", "three_letter": "Pro", "smiles": "C1CC(NC1)C(=O)O"},
    {"name": "Serine", "single_letter": "S", "three_letter": "Ser", "smiles": "C(C(C(=O)O)N)O"},
    {"name": "Threonine", "single_letter": "T", "three_letter": "Thr", "smiles": "CC(C(C(=O)O)N)O"},
    {"name": "Tryptophan", "single_letter": "W", "three_letter": "Trp", "smiles": "c1ccc2c(c1)c(c[nH]2)CC(C(=O)O)N"},
    {"name": "Tyrosine", "single_letter": "Y", "three_letter": "Tyr", "smiles": "c1cc(ccc1CC(C(=O)O)N)O"},
    {"name": "Valine", "single_letter": "V", "three_letter": "Val", "smiles": "CC(C)C(C(=O)O)N"}
]

# Print amino acid details
for amino_acid in amino_acids:
    print(f"Name: {amino_acid['name']}")
    print(f"Single-letter code: {amino_acid['single_letter']}")
    print(f"Three-letter code: {amino_acid['three_letter']}")
    print(f"SMILES structure: {amino_acid['smiles']}")
    print()


In [None]:
import streamlit as st
import py3Dmol
from rdkit import Chem
from rdkit.Chem import AllChem

def generate_3D(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = Chem.AddHs(mol)
    params = AllChem.ETKDGv3()
    params.randomSeed = 42
    # EmbedMolecule returns -1 when no 3D conformer could be generated
    if AllChem.EmbedMolecule(mol, params) == -1:
        return None
    molstring = Chem.MolToMolBlock(mol)
    return molstring

def display_molecule_3D(mol_block):
    viewer = py3Dmol.view(width=400, height=400)
    viewer.addModel(mol_block, format='mol')
    viewer.setStyle({'stick': {}})
    viewer.zoomTo()
    return viewer

st.title("3D representation of an amino acid")

smiles_input = st.text_input("Enter the SMILES of the amino acid:")

if smiles_input:
    molblock = generate_3D(smiles_input)
    if molblock:
        st.subheader("3D representation of the amino acid:")
        viewer = display_molecule_3D(molblock)
        st.write(viewer)
    else:
        st.error("Unable to generate 3D coordinates for the given SMILES.")


## Amino Acid Composition Analysis

Now let's analyze the amino acid composition of the proteins.
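
As an illustration of what such a composition analysis involves, the sketch below counts residues by class. The class membership used here is one simplified convention and an assumption on our part (the placement of residues such as glycine, cysteine, tyrosine, and proline varies between sources); the name `classify_counts` is hypothetical and the package's `count_amino_acids` may group residues differently.

```python
# Simplified residue classes (an assumption for illustration)
HYDROPHOBIC = set("AVLIMFWP")
HYDROPHILIC = set("RNDCQEGHKSTY")
ACIDIC = set("DE")
BASIC = set("KRH")


def classify_counts(sequence):
    """Count residues per class; charged residues are also counted as hydrophilic."""
    counts = {"hydrophobic": 0, "hydrophilic": 0, "acidic": 0, "basic": 0}
    for residue in sequence:
        if residue in HYDROPHOBIC:
            counts["hydrophobic"] += 1
        elif residue in HYDROPHILIC:
            counts["hydrophilic"] += 1
        # Acidic and basic residues are tallied in addition to their hydrophilic count
        if residue in ACIDIC:
            counts["acidic"] += 1
        elif residue in BASIC:
            counts["basic"] += 1
    return counts


print(classify_counts("ADKL"))
# {'hydrophobic': 2, 'hydrophilic': 2, 'acidic': 1, 'basic': 1}
```

Characters outside the 20 standard amino acids are ignored by this sketch.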

In [None]:

from functions import count_amino_acids
counts_1rsprotein = count_amino_acids(sequence1)
counts_2ndprotein = count_amino_acids(sequence2)
print(f"Sequence: {sequence1}")
print("Number of hydrophobic amino acids:", counts_1rsprotein["hydrophobes"])
print("Number of hydrophilic amino acids:", counts_1rsprotein["hydrophiles"])
print("Number of acidic amino acids:", counts_1rsprotein["acides"])
print("Number of basic amino acids:", counts_1rsprotein["basiques"])


# ProtaVision: Challenges and Features - General Remarks

## Introduction

ProtaVision is an interactive Jupyter Notebook dedicated to exploring protein structures and functions.

## Main Features

- **Sequence Alignment**: Efficiently compare two protein sequences to identify matching and non-matching amino acids.
- **Amino Acid Analysis**: Analyze properties of amino acids, such as molecular weight and composition.
- **Conservative Substitutions**: Explore amino acid substitutions that preserve biochemical properties.
- **Protein Visualization**: Visualize protein structures in 3D using PDB codes.
- **Gap Method for Sequence Alignment**: Implement a gap-based sequence alignment method.

## Challenges Encountered

1. **PDB Retrieval**: Finding PDB codes for protein visualization posed a challenge, especially for proteins with no assigned PDB codes.
2. **Sequence Alignment**: Writing code for sequence alignment, especially implementing the Needleman-Wunsch algorithm, required careful consideration and debugging.
3. **Package Compatibility**: Getting the different Python packages used in the notebook, such as Biopython and py3Dmol, to work together in the same environment sometimes caused issues.
4. **User Input Handling**: Handling user input for gap-based sequence alignment involved validating inputs and error handling to ensure smooth execution.

## Tools Used

- **Biopython**: Utilized for retrieving protein sequences from the UniProt database and performing sequence alignment.
- **py3Dmol**: Used for visualizing protein structures in 3D within Jupyter Notebook.
- **Jupyter Notebook**: Provided an interactive environment for running code, visualizing data, and documenting the analysis process.

## Conclusion

ProtaVision offers a comprehensive toolkit for exploring protein structures and functions, overcoming challenges in sequence alignment, visualization, and analysis. With its intuitive interface and powerful functionalities, it serves as a valuable resource for bioinformatics enthusiasts and researchers alike.
