## This notebook is for the BioComputingPython Refresher course at IIITD, by Dr. Jaspreet Kaur Dhanjal.
## This notebook was created by Prateek Paul.
* Email: prateekp@iiitd.ac.in
* LinkedIn: [linkedin.com/in/prateekpaulpro/](https://linkedin.com/in/prateekpaulpro/)

Disclaimer: 
The code and content in this notebook are compiled from various open sources, personal experience, and reference materials. It is intended solely for educational purposes. All credits for original ideas and code snippets go to their respective authors. If you find any inaccuracies or have suggestions, feel free to reach out.

# Q1
### Write a Python program that classifies a given protein sequence into one of five enzyme classes based on specific hypothetical motifs. Each enzyme class is associated with a unique motif, where X stands for any single amino acid: Hydrolase (GXSXG), Transferase (KTXN), Oxidoreductase (CXXC), Lyase (DXXD), and Ligase (RXYK).
### The program should check if any of these motifs are present in the input sequence and print the corresponding enzyme class. If no motifs are found, the program should indicate that no matching enzyme class exists.

In [None]:
#Write your code here
import re

def classify_protein(sequence):
    # Define the motif for each enzyme class ('X' stands for any single amino acid)
    motifs = {
        "Hydrolase": "GXSXG",
        "Transferase": "KTXN",
        "Oxidoreductase": "CXXC",
        "Lyase": "DXXD",
        "Ligase": "RXYK",
    }

    # Check each motif in order, treating every 'X' as the regex wildcard '.'
    for enzyme_class, motif in motifs.items():
        if re.search(motif.replace("X", "."), sequence):
            print(enzyme_class)
            return

    print("No matching enzyme class")

# Example usage ("ATDRYK" has no residue between R and Y, so RXYK does not match)
protein_sequence = input("Enter the protein sequence: ")
classify_protein(protein_sequence)


Enter the protein sequence: ATDRYK
No matching enzyme class
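
As a complementary sketch, the same wildcard idea can report every matching class together with the matched residues and their start position, rather than only the first hit. The name `find_enzyme_motifs` and the two test sequences are hypothetical and chosen only for illustration; the sketch assumes 'X' means "any single amino acid".

```python
import re

# Motifs from the question; 'X' is assumed to mean "any single amino acid"
ENZYME_MOTIFS = {
    "Hydrolase": "GXSXG",
    "Transferase": "KTXN",
    "Oxidoreductase": "CXXC",
    "Lyase": "DXXD",
    "Ligase": "RXYK",
}

def find_enzyme_motifs(sequence):
    """Return (enzyme_class, matched_text, start_index) for each motif found."""
    hits = []
    for enzyme_class, motif in ENZYME_MOTIFS.items():
        # Turn the motif into a regex: each 'X' becomes the wildcard '.'
        match = re.search(motif.replace("X", "."), sequence.upper())
        if match:
            hits.append((enzyme_class, match.group(0), match.start()))
    return hits

print(find_enzyme_motifs("AAGASAGAA"))  # hypothetical sequence containing G-A-S-A-G
print(find_enzyme_motifs("ATDRAYK"))    # hypothetical sequence containing R-A-Y-K
```

Returning a list instead of printing makes it easier to handle sequences that carry more than one motif.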


# Q2
### Write a Python program that checks if a given amino acid sequence is valid. The program should determine if the sequence contains any invalid amino acids, defined as 'B', 'J', 'O', 'U', 'X', and 'Z'.
### If an invalid amino acid is found, the program should print the first invalid amino acid and state that the sequence is invalid. If the sequence contains only valid amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y), the program should confirm that the sequence is valid.

In [None]:
#Write your code here
def check_valid_amino_acid_sequence(sequence):
    # Define the valid and invalid amino acids
    valid_amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    invalid_amino_acids = "BJOXUZ"

    # Check the sequence for invalid amino acids
    for amino_acid in sequence:
        if amino_acid in invalid_amino_acids:
            print(f"Invalid amino acid found: {amino_acid}. The sequence is invalid.")
            return

    # If no invalid amino acids are found
    print("The sequence is valid.")

# Example usage
protein_sequence = input("Enter the protein sequence: ")
check_valid_amino_acid_sequence(protein_sequence)


Enter the protein sequence: MTPFYWB
Invalid amino acid found: B. The sequence is invalid.
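
A stricter variant, sketched below, checks each residue against the 20 valid letters instead of the six invalid ones, so lowercase input is normalised and digits or other symbols are also rejected. The function name `first_invalid_residue` and the choice to return a tuple are assumptions made for this illustration.

```python
# The 20 standard amino acids listed in the question
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def first_invalid_residue(sequence):
    """Return (index, residue) of the first non-standard residue, or None."""
    for index, residue in enumerate(sequence.upper()):
        if residue not in VALID_AMINO_ACIDS:
            return index, residue
    return None

print(first_invalid_residue("MTPFYWB"))  # position and letter of the first invalid residue
print(first_invalid_residue("MTPFYW"))   # None means the sequence is valid
```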


# Q3
### To determine the ATGC content of a given DNA sequence, count the occurrences of each nucleotide and calculate their percentage content relative to the total sequence length.

### For the sequence "ATGCGATAACTAG", the contents are approximately 38.46% Adenine, 23.08% Thymine, 23.08% Guanine, and 15.38% Cytosine.

In [None]:
#Write your code here
def calculate_nucleotide_content(sequence):
    # Convert sequence to uppercase to ensure consistency
    sequence = sequence.upper()

    # Total length of the sequence
    total_length = len(sequence)

    # Count the occurrences of each nucleotide
    count_A = sequence.count('A')
    count_T = sequence.count('T')
    count_G = sequence.count('G')
    count_C = sequence.count('C')

    # Calculate the percentage content of each nucleotide
    percent_A = (count_A / total_length) * 100
    percent_T = (count_T / total_length) * 100
    percent_G = (count_G / total_length) * 100
    percent_C = (count_C / total_length) * 100

    # Print the results
    print(f"Adenine (A): {percent_A:.2f}%")
    print(f"Thymine (T): {percent_T:.2f}%")
    print(f"Guanine (G): {percent_G:.2f}%")
    print(f"Cytosine (C): {percent_C:.2f}%")

# Example usage
dna_sequence = "ATGCGATAACTAG"
calculate_nucleotide_content(dna_sequence)


Adenine (A): 38.46%
Thymine (T): 23.08%
Guanine (G): 23.08%
Cytosine (C): 15.38%
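
The same calculation can be sketched with `collections.Counter`, returning the percentages as a dictionary and guarding against an empty sequence (which would otherwise cause a division by zero). The function name `nucleotide_percentages` is hypothetical.

```python
from collections import Counter

def nucleotide_percentages(sequence):
    """Return the percentage of A, T, G and C in the sequence, rounded to 2 decimals."""
    sequence = sequence.upper()
    if not sequence:
        return {}  # nothing to count; avoids dividing by zero
    counts = Counter(sequence)
    return {base: round(100 * counts[base] / len(sequence), 2) for base in "ATGC"}

print(nucleotide_percentages("ATGCGATAACTAG"))
```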


# Q4
How can we construct the reverse complement of a given DNA sequence?

### Explanation
The reverse complement of a DNA sequence is obtained in two steps:

### Complement: Replace each nucleotide in the sequence with its complement:
* Adenine (A) is replaced with Thymine (T)
* Thymine (T) is replaced with Adenine (A)
* Guanine (G) is replaced with Cytosine (C)
* Cytosine (C) is replaced with Guanine (G)

### Reverse: Reverse the entire sequence after finding its complement.
Example:
Consider the following DNA sequence: "ATGCGATAACTAG"

Complement:

* A -> T
* T -> A
* G -> C
* C -> G
* G -> C
* A -> T
* T -> A
* A -> T
* A -> T
* C -> G
* T -> A
* A -> T
* G -> C

Complemented sequence: "TACGCTATTGATC"

Reverse:

Reversed complemented sequence: "CTAGTTATCGCAT"

In [None]:
def reverse_complement(sequence):
    # Convert sequence to uppercase to ensure consistency
    sequence = sequence.upper()

    # Replace each base with its complement
    complement_sequence = sequence.replace('A', 't').replace('T', 'a').replace('G', 'c').replace('C', 'g')
    complement_sequence = complement_sequence.upper()

    # Reverse the complemented sequence without slicing
    reverse_complement_sequence = ""
    for i in range(len(complement_sequence) - 1, -1, -1):
        reverse_complement_sequence += complement_sequence[i]

    return reverse_complement_sequence

# Example usage
sequence = "ATGCGATAACTAG"
rev_complement = reverse_complement(sequence)

print(f"Original sequence: {sequence}")
print(f"Reverse complement: {rev_complement}")

Original sequence: ATGCGATAACTAG
Reverse complement: CTAGTTATCGCAT
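
For comparison, here is a more compact sketch of the same two steps using `str.maketrans` / `str.translate` for the complement and slicing for the reversal. The name `reverse_complement_translate` is hypothetical; characters other than A, T, G and C are left unchanged.

```python
# Translation table mapping each base to its complement
COMPLEMENT_TABLE = str.maketrans("ATGC", "TACG")

def reverse_complement_translate(sequence):
    """Complement every base in one pass, then reverse the result with slicing."""
    return sequence.upper().translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("ATGCGATAACTAG"))
```

Using a translation table avoids the temporary lowercase trick needed when chaining `replace` calls.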


# Q5.
How can we calculate the melting temperature (Tm) of a given DNA sequence?

### Explanation
The melting temperature (Tm) of a DNA sequence is the temperature at which half of the DNA duplex dissociates to become single-stranded. It is a crucial parameter in various molecular biology techniques such as PCR.

### There are different methods to estimate Tm, but a commonly used method for short oligonucleotides is the Wallace rule:

### For sequences shorter than 14 nucleotides:
$T_m = 2 \times (\text{number of A/T}) + 4 \times (\text{number of G/C})$
* For longer sequences, more complex formulas involving the sequence length and the concentration of ions (salt) in the solution are used, but the Wallace rule provides a quick and useful approximation.

Example, for the sequence "ATGCGATAACTAG":

Count the Nucleotides:

* Adenine (A): 5
* Thymine (T): 3
* Guanine (G): 3
* Cytosine (C): 2

Apply the Wallace Rule:

* Number of A/T: 8
* Number of G/C: 5

$T_m = 2 \times 8 + 4 \times 5 = 16 + 20 = 36\,^{\circ}\mathrm{C}$


In [None]:
def calculate_melting_temperature(sequence):
    # Convert sequence to uppercase
    sequence = sequence.upper()

    # Initialize counts
    a_count = 0
    t_count = 0
    g_count = 0
    c_count = 0

    # Count the occurrences of each nucleotide
    for nucleotide in sequence:
        if nucleotide == 'A':
            a_count += 1
        elif nucleotide == 'T':
            t_count += 1
        elif nucleotide == 'G':
            g_count += 1
        elif nucleotide == 'C':
            c_count += 1

    # Calculate Tm using the Wallace rule
    at_count = a_count + t_count
    gc_count = g_count + c_count
    tm = 2 * at_count + 4 * gc_count

    return tm

# Example usage
sequence = "ATGCGATAACTAG"
melting_temperature = calculate_melting_temperature(sequence)

print(f"Melting temperature (Tm) of the sequence: {melting_temperature}°C")

Melting temperature (Tm) of the sequence: 36°C
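
To make the arithmetic easy to check, the sketch below returns the intermediate A/T and G/C counts along with Tm, and flags whether the sequence falls in the short range (fewer than 14 nucleotides) for which the rule is stated above. The function name `wallace_breakdown` and the dictionary keys are assumptions for this illustration.

```python
def wallace_breakdown(sequence):
    """Return the A/T count, G/C count and Wallace-rule Tm for a DNA sequence."""
    sequence = sequence.upper()
    at_count = sequence.count("A") + sequence.count("T")
    gc_count = sequence.count("G") + sequence.count("C")
    return {
        "at": at_count,
        "gc": gc_count,
        "tm": 2 * at_count + 4 * gc_count,   # Tm in degrees Celsius
        "short_oligo": len(sequence) < 14,   # range for which the rule is stated
    }

print(wallace_breakdown("ATGCGATAACTAG"))
```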


# Q6
### For a given sequence, find the most frequent k-mer and its count.
### Part 1: Finding the Most Occurring k-mer with k=3
#### Given the nucleotide sequence: ATGCGTACGTAGCTAGCTAGCTGACGTAGCTAGCTACGATCGTACGTACGATCGTACGTAGCTAGCTACGTAGCTAGCTAGCTACGTAGCTAGCTAGCTAGCTG
#### Use the kmer_count function to find the most occurring k-mer of size 3. What is the k-mer and its count in this sequence?


In [None]:
#write your code here
def kmer_count(sequence, k):
    # List to store k-mers and their counts
    kmers = []
    counts = []

    # Slide a window of size k over the sequence
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]

        # Check if kmer is already in the list
        if kmer in kmers:
            index = kmers.index(kmer)
            counts[index] += 1
        else:
            kmers.append(kmer)
            counts.append(1)

    # Find the k-mer with the highest count
    max_count = max(counts)
    max_index = counts.index(max_count)
    max_kmer = kmers[max_index]

    return max_kmer, max_count

# Given nucleotide sequence
sequence = "ATGCGTACGTAGCTAGCTAGCTGACGTAGCTAGCTACGATCGTACGTACGATCGTACGTAGCTAGCTACGTAGCTAGCTAGCTACGTAGCTAGCTAGCTAGCTG"
k = 3

# Find the most occurring k-mer
most_frequent_kmer, count = kmer_count(sequence, k)
print(f"The most occurring {k}-mer is '{most_frequent_kmer}' with a count of {count}.")


The most occurring 3-mer is 'TAG' with a count of 14.
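
The two parallel lists can also be replaced by `collections.Counter`, which does the counting and the lookup of the maximum in one place. This is only a sketch; the name `most_common_kmer` and the short test sequence are hypothetical. When several k-mers tie for the highest count, which one is returned depends on `Counter` internals, so ties should not be relied on.

```python
from collections import Counter

def most_common_kmer(sequence, k):
    """Return (kmer, count) for the most frequent k-mer in the sequence."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    # Slide a window of size k over the sequence and count every k-mer
    kmers = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return kmers.most_common(1)[0]

print(most_common_kmer("ATATAT", 2))  # hypothetical short sequence
```

Dictionary-based counting avoids the repeated `list.index` searches, which matters for long sequences.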


### Part 2: Finding the Most Occurring k-mer with k Between 10 and 15
#### Given the same nucleotide sequence, use the kmer_count function to determine the most occurring k-mer for each k value between 10 and 15. List the k-mer and its count for each value of k.

In [None]:
#Write your code here
def kmer_count(sequence, k):
    # List to store k-mers and their counts
    kmers = []
    counts = []

    # Slide a window of size k over the sequence
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]

        # Check if kmer is already in the list
        if kmer in kmers:
            index = kmers.index(kmer)
            counts[index] += 1
        else:
            kmers.append(kmer)
            counts.append(1)

    # Find the k-mer with the highest count
    max_count = max(counts)
    max_index = counts.index(max_count)
    max_kmer = kmers[max_index]

    return max_kmer, max_count

# Given nucleotide sequence
sequence = "ATGCGTACGTAGCTAGCTAGCTGACGTAGCTAGCTACGATCGTACGTACGATCGTACGTAGCTAGCTACGTAGCTAGCTAGCTACGTAGCTAGCTAGCTAGCTG"

# Iterate over k values between 10 and 15
for k in range(10, 16):
    most_frequent_kmer, count = kmer_count(sequence, k)
    print(f"The most occurring {k}-mer is '{most_frequent_kmer}' with a count of {count}.")


The most occurring 10-mer is 'TAGCTAGCTA' with a count of 7.
The most occurring 11-mer is 'ACGTAGCTAGC' with a count of 5.
The most occurring 12-mer is 'ACGTAGCTAGCT' with a count of 5.
The most occurring 13-mer is 'ACGTAGCTAGCTA' with a count of 5.
The most occurring 14-mer is 'TACGTAGCTAGCTA' with a count of 4.
The most occurring 15-mer is 'TACGTAGCTAGCTAG' with a count of 3.


# Q7
Given a chromosome sequence, identify the number of genes present in it by locating the start and stop codons within the sequence. How can this be achieved, and what is the count of genes in the provided sequence?

Hint:
To determine the number of genes in a chromosome sequence, we need to locate the start and stop codons that define the boundaries of each gene. In the context of DNA sequences, the start codon is typically "ATG" and the stop codons are "TAA", "TAG", and "TGA". The process involves scanning the sequence for these codons in all three reading frames and counting the number of valid genes.

### Here's a step-by-step approach:

1. Identify Start Codons: Search for the "ATG" codon in the sequence.
2. Identify Stop Codons: Search for the "TAA", "TAG", and "TGA" codons in the sequence.
3. Reading Frames: Since DNA can be read in three frames, repeat the search in all three frames.
4. Gene Validation: For each start codon, find the nearest downstream stop codon in the same frame. If found, count it as a valid gene.

Example
Consider the following chromosome sequence: "ATGCGATAACTAGATGTGA"

* Frame 1 (codons starting at positions 0, 3, 6, ...):
Start at position 0: "ATG"
Stop at position 6: "TAA"
Gene: "ATGCGATAA" (from position 0 to 8)
* Frame 2 (codons starting at positions 1, 4, 7, ...):
Start at position 13: "ATG"
Stop at position 16: "TGA" (the "TAG" at position 10 lies upstream of the start codon)
Gene: "ATGTGA" (from position 13 to 18)
* Frame 3 (codons starting at positions 2, 5, 8, ...):
No start codons.
* Total number of genes: 2


In [None]:
#Write your code here
def find_genes(sequence):
    start_codon = "ATG"
    stop_codons = ["TAA", "TAG", "TGA"]
    total_genes = 0

    for frame in range(3):
        i = frame
        while i < len(sequence) - 2:
            # Check for start codon
            if sequence[i:i+3] == start_codon:
                start_index = i
                # Search for the nearest stop codon in the same frame
                j = i + 3
                while j < len(sequence) - 2:
                    codon = sequence[j:j+3]
                    if codon in stop_codons:
                        total_genes += 1
                        i = j  # Move to the end of the gene
                        break
                    j += 3
            i += 3

    return total_genes

# Given chromosome sequence
sequence = "ATGCGATAACTAGATGTGA"

# Find the number of genes in the sequence
num_genes = find_genes(sequence)
print(f"Total number of genes: {num_genes}")


Total number of genes: 2
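
To inspect where each gene lies rather than only how many there are, the same frame-by-frame scan can collect coordinates. The sketch below follows the steps listed above; the name `find_gene_coordinates` and the tuple layout (start, inclusive end, gene) are assumptions for illustration.

```python
def find_gene_coordinates(sequence):
    """Return (start, end_inclusive, gene) for each start..stop pair, frame by frame."""
    start_codon = "ATG"
    stop_codons = {"TAA", "TAG", "TGA"}
    genes = []

    for frame in range(3):
        i = frame
        while i + 3 <= len(sequence):
            if sequence[i:i + 3] == start_codon:
                # Look for the nearest downstream stop codon in the same frame
                for j in range(i + 3, len(sequence) - 2, 3):
                    if sequence[j:j + 3] in stop_codons:
                        genes.append((i, j + 2, sequence[i:j + 3]))
                        i = j  # continue scanning after this gene
                        break
            i += 3
    return genes

for start, end, gene in find_gene_coordinates("ATGCGATAACTAGATGTGA"):
    print(start, end, gene)
```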


# Q8
### Write a Python program that identifies all non-overlapping occurrences of a DNA motif in a given sequence. The motif consists of substrings composed of 'A' and/or 'T' with lengths between 3 and 6 characters.
### The program should return each motif found along with its starting and ending indices in the sequence.
#### Input DNA sequence = 'AATGAAGGGCCGCTACGATAAGGAACTTCGTAATTTCAG'

In [None]:
#write your code here
import re
import sys

# Example DNA sequence
DNA_sequence = 'AATGAAGGGCCGCTACGATAAGGAACTTCGTAATTTCAG'
print('DNA_sequence:', DNA_sequence)

# Regular expression pattern to find substrings of A and/or T of lengths between 3 and 6
motif = r'(([AT]){3,6})'
print('Motif:', motif)

# Checking if motif is a valid regular expression
try:
    re.compile(motif)
except re.error:
    print('Invalid regular expression, exiting the program!')
    sys.exit()

# Find all matches of the motif in the DNA sequence with their indices
matches_with_indices = [(match.group(0), match.start(), match.end()-1) for match in re.finditer(motif, DNA_sequence)]

if matches_with_indices:
    print('List of matches with indices:')
    for match, start_idx, end_idx in matches_with_indices:
        print(f'Match: {match}, Start index: {start_idx}, End index: {end_idx}')
else:
    print('Did not find any match.')

DNA_sequence: AATGAAGGGCCGCTACGATAAGGAACTTCGTAATTTCAG
Motif: (([AT]){3,6})
List of matches with indices:
Match: AAT, Start index: 0, End index: 2
Match: ATAA, Start index: 17, End index: 20
Match: TAATTT, Start index: 30, End index: 35


# Q9
### Open reading frames (ORFs) are regions of DNA that can be translated into proteins. An ORF starts with a start codon (such as ATG) and ends with one of the stop codons (such as TAA, TAG, or TGA). Write a Python function using the re library to identify all ORFs in a given DNA sequence. Your function should return the starting and ending indices (0-based) of each ORF.

## Details:

### The DNA sequence is a string consisting of characters 'A', 'C', 'G', and 'T'. The start codon is always ATG. The stop codons are TAA, TAG, and TGA.

#### Input seq = 'ATGCGATCGACGCTAGCGATCGCGATCGATGGCGATCGCTAGCGATCGATCGCGATCGTAAAGGCTACGTGTCAGTAA'

In [None]:
#write your code here
import re

seq = "ATGCGATCGACGCTAGCGATCGCGATCGATGGCGATCGCTAGCGATCGATCGCGATCGTAAAGGCTACGTGTCAGTAA"
start_codon = "ATG"
stop_codons = ["TAA", "TAG", "TGA"]

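# Create the pattern using non-greedy matching: each match ends at the first stop codon
# found after ATG, whether or not that stop codon is in the same reading frame
pattern = re.compile(f"{start_codon}(.*?)(?:{'|'.join(stop_codons)})")

matches = pattern.finditer(seq)
for match in matches:
    # match.end() is exclusive: the last base of the match is at index match.end() - 1
    print(match.start(), match.end())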

0 16
28 42
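
If an ORF is additionally required to be in frame (a whole number of codons between ATG and the stop codon), the pattern can consume three bases at a time. The sketch below is one possible variant, not the notebook's method: it wraps the pattern in a lookahead so that overlapping ORFs are reported too, and it returns inclusive end indices. The names `ORF_PATTERN` and `find_in_frame_orfs` are hypothetical.

```python
import re

# ATG, then as few whole codons as possible, then the first in-frame stop codon.
# The lookahead (?=...) lets matches overlap; group 1 holds the ORF itself.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")

def find_in_frame_orfs(sequence):
    """Return (start, end_inclusive) for every in-frame ORF, including overlapping ones."""
    return [(m.start(1), m.end(1) - 1) for m in ORF_PATTERN.finditer(sequence)]

seq = "ATGCGATCGACGCTAGCGATCGCGATCGATGGCGATCGCTAGCGATCGATCGCGATCGTAAAGGCTACGTGTCAGTAA"
print(find_in_frame_orfs(seq))
```

With this stricter definition the reported coordinates differ from the output above, because stop codons that are out of frame no longer terminate an ORF.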


# Q10
### In bioinformatics, the analysis of DNA sequences is fundamental to understanding genetic similarities and variations among different organisms or samples. One common task is to identify regions within DNA sequences that are conserved across different species or individuals. These conserved regions, known as motifs or subsequences, can provide insights into functional or evolutionary relationships. One method to find such regions is by identifying the Longest Common Subsequence (LCS) shared by multiple DNA sequences.

### Problem:
#### Given a list of DNA sequences, you are required to write a Python function to find the longest common subsequence (LCS) shared by all the sequences in the list. The LCS is the longest sequence of nucleotides that appears in the same relative order (but not necessarily contiguously) in all the given DNA sequences.

### Example: Suppose you have the following DNA sequences:

* "ATGCTGAC"
* "CTGACGT"
* "TGAC"

The LCS shared by all these sequences is "TGAC".

### Task:
### Write a Python function lcs(sequences) that takes a list of DNA sequences as input and returns the longest common subsequence shared by all sequences in the list.

Input sequences:
* "ATGCGTACCGTACGTAGCTAGCTGACTGACGTAGCGTACGCTAGCTG"
* "CGTACGTAGCTAGCTGACTGACGTAGCGTACGCTAGCTGACGCTGAC"
* "GCTAGCTGACTGACGTAGCGTACGCTAGCTGACGCTGACTGCGTAGC"
* "TACGTAGCTAGCTGACTGACGTAGCGTACGCTAGCTGACGCTGACTG"
* "GTACGTAGCTAGCTGACTGACGTAGCGTACGCTAGCTGACGCTGACT"

In [None]:
def lcs_two_sequences(seq1, seq2):
    # Create a 2D array to store the lengths of longest common subsequence
    m, n = len(seq1), len(seq2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Fill the dp array
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq1[i - 1] == seq2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Reconstruct the LCS from the dp array
    lcs = []
    i, j = m, n
    while i > 0 and j > 0:
        if seq1[i - 1] == seq2[j - 1]:
            lcs.append(seq1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            i -= 1
        else:
            j -= 1

    return ''.join(reversed(lcs))

def lcs_multiple_sequences(sequences):
    if not sequences:
        return ""

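    # Start with the first sequence as the initial LCS
    current_lcs = sequences[0]

    # Iteratively find the LCS of the current LCS and the next sequence.
    # Note: this pairwise reduction is a heuristic. The result is always a common
    # subsequence of all inputs, but it is not guaranteed to be the longest one.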
    for seq in sequences[1:]:
        current_lcs = lcs_two_sequences(current_lcs, seq)
        if not current_lcs:
            break

    return current_lcs

# Input sequences
sequences = [
    "ATGCGTACCGTACGTAGCTAGCTGACTGACGTAGCGTACGCTAGCTG",
    "CGTACGTAGCTAGCTGACTGACGTAGCGTACGCTAGCTGACGCTGAC",
    "GCTAGCTGACTGACGTAGCGTACGCTAGCTGACGCTGACTGCGTAGC",
    "TACGTAGCTAGCTGACTGACGTAGCGTACGCTAGCTGACGCTGACTG",
    "GTACGTAGCTAGCTGACTGACGTAGCGTACGCTAGCTGACGCTGACT"
]

# Find the LCS shared by all sequences
result = lcs_multiple_sequences(sequences)
print(f"The longest common subsequence is: '{result}'")


The longest common subsequence is: 'CTAGCTGCTGACGACGTAGCGTACGCTAGCTG'
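
A quick way to sanity-check any candidate result is to confirm that it really is a subsequence of every input. The helper below is a minimal sketch of such a check (it verifies the "common subsequence" property, not that the candidate is the longest possible); the name `is_subsequence` is hypothetical.

```python
def is_subsequence(candidate, sequence):
    """Return True if candidate appears in sequence in order (not necessarily contiguously)."""
    remaining = iter(sequence)
    # 'base in remaining' advances the iterator until it finds base (or runs out)
    return all(base in remaining for base in candidate)

example = ["ATGCTGAC", "CTGACGT", "TGAC"]
print(all(is_subsequence("TGAC", seq) for seq in example))
```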


# Q11
#### Identifying Significant Changes in Protein Abundance
### In proteomics, researchers often measure protein abundance under different experimental conditions to understand how treatments, mutations, or environmental factors affect protein levels. To analyze these changes, one can compare the abundance of proteins between conditions and identify those with significant increases or decreases. For each protein, it's crucial to determine not just if its abundance has changed, but whether it has increased or decreased.
### You are provided with a list of tuples, where each tuple represents the abundance levels of a protein in two different conditions. Each tuple contains the protein identifier and its abundance levels in Condition 1 and Condition 2. You need to write a Python function that analyzes these tuples to determine which proteins are significantly different based on a given fold change threshold. Additionally, specify whether each protein is "increased" or "decreased" in Condition 2 compared to Condition 1.
Input:
### A list of tuples where each tuple is of the form (protein_id, abundance1, abundance2), where:
* protein_id is a unique identifier for the protein.
* abundance1 is the abundance level of the protein in Condition 1.
* abundance2 is the abundance level of the protein in Condition 2.
### A fold change threshold.
Output:
### A list of tuples where each tuple is of the form (protein_id, change_type), where:
* protein_id is the unique identifier for the protein.
* change_type is either 'increased' or 'decreased', depending on whether the protein's abundance in Condition 2 is higher or lower compared to Condition 1, respectively (or 'no change' if the fold change stays within the threshold).

#### Example:

### Given the following list of tuples and a fold change threshold of 1.5:

protein_abundance = [
    ('ProteinX', 5, 8),
    ('ProteinY', 20, 10),
    ('ProteinZ', 150, 300),
    ('ProteinW', 30, 15),
    ('ProteinV', 100, 100)
]
Task:
### Write a Python function analyze_protein_changes(protein_abundance, threshold) that takes the list of tuples and a fold change threshold as input and returns the list of tuples indicating the protein identifier and its abundance change status ('increased', 'decreased', or 'no change').

In [None]:
def analyze_protein_changes(protein_abundance, threshold):
    result = []

    for protein_id, abundance1, abundance2 in protein_abundance:
        if abundance1 == 0 or abundance2 == 0:  # Handle zero abundance cases
            continue

        fold_change = abundance2 / abundance1

        if fold_change >= threshold:
            result.append((protein_id, 'increased'))
        elif fold_change <= 1 / threshold:
            result.append((protein_id, 'decreased'))
        else:
            result.append((protein_id, 'no change'))

    return result

# Example usage
protein_abundance = [
    ('ProteinX', 5, 8),
    ('ProteinY', 20, 10),
    ('ProteinZ', 150, 300),
    ('ProteinW', 30, 15),
    ('ProteinV', 100, 100)
]

threshold = 1.5

# Analyze protein changes
changes = analyze_protein_changes(protein_abundance, threshold)
print(changes)


[('ProteinX', 'increased'), ('ProteinY', 'decreased'), ('ProteinZ', 'increased'), ('ProteinW', 'decreased'), ('ProteinV', 'no change')]
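
Fold changes are often compared on a log2 scale, where an increase and a decrease of the same magnitude are symmetric around zero. The sketch below applies the same threshold logic on that scale; the function name `log2_fold_changes`, the extra log2 value in each tuple, and the assumption that the threshold is greater than 1 are all illustrative choices.

```python
import math

def log2_fold_changes(protein_abundance, threshold):
    """Return (protein_id, log2 fold change, status) using a symmetric log2 cutoff."""
    cutoff = math.log2(threshold)  # assumes threshold > 1
    result = []
    for protein_id, abundance1, abundance2 in protein_abundance:
        if abundance1 <= 0 or abundance2 <= 0:
            continue  # log2 is undefined for zero or negative abundance
        log2_fc = math.log2(abundance2 / abundance1)
        if log2_fc >= cutoff:
            status = 'increased'
        elif log2_fc <= -cutoff:
            status = 'decreased'
        else:
            status = 'no change'
        result.append((protein_id, round(log2_fc, 2), status))
    return result

protein_abundance = [
    ('ProteinX', 5, 8),
    ('ProteinY', 20, 10),
    ('ProteinZ', 150, 300),
    ('ProteinW', 30, 15),
    ('ProteinV', 100, 100)
]
for row in log2_fold_changes(protein_abundance, 1.5):
    print(row)
```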


# Q12
### In genetics, the codon table is used to translate RNA sequences into proteins. Each codon, a sequence of three RNA nucleotides, corresponds to a specific amino acid or a stop signal during protein synthesis.

### Write a Python function called translate_rna_to_protein that takes a string representing an RNA sequence as input and returns the corresponding protein sequence. Use a dictionary to map RNA codons to their corresponding amino acids. For simplicity, assume the RNA sequence is always divisible by 3 and contains only the characters 'A', 'U', 'C', and 'G'.

```
# The codon table is given below:
AUG: 'Methionine', UUU: 'Phenylalanine', UUC: 'Phenylalanine', UUA: 'Leucine', UUG: 'Leucine',
CUU: 'Leucine', CUC: 'Leucine', CUA: 'Leucine', CUG: 'Leucine', AUU: 'Isoleucine', AUC: 'Isoleucine',
AUA: 'Isoleucine', GUU: 'Valine', GUC: 'Valine', GUA: 'Valine', GUG: 'Valine', UCU: 'Serine',
UCC: 'Serine', UCA: 'Serine', UCG: 'Serine', CCU: 'Proline', CCC: 'Proline', CCA: 'Proline',
CCG: 'Proline', ACU: 'Threonine', ACC: 'Threonine', ACA: 'Threonine', ACG: 'Threonine', GCU: 'Alanine',
GCC: 'Alanine', GCA: 'Alanine', GCG: 'Alanine', UAU: 'Tyrosine', UAC: 'Tyrosine', UAA: 'Stop',
UAG: 'Stop', CAU: 'Histidine', CAC: 'Histidine', CAA: 'Glutamine', CAG: 'Glutamine', AAU: 'Asparagine',
AAC: 'Asparagine', AAA: 'Lysine', AAG: 'Lysine', GAU: 'Aspartic Acid', GAC: 'Aspartic Acid',
GAA: 'Glutamic Acid', GAG: 'Glutamic Acid', UGU: 'Cysteine', UGC: 'Cysteine', UGA: 'Stop',
UGG: 'Tryptophan', CGU: 'Arginine', CGC: 'Arginine', CGA: 'Arginine', CGG: 'Arginine', AGU: 'Serine',
AGC: 'Serine', AGA: 'Arginine', AGG: 'Arginine', GGU: 'Glycine', GGC: 'Glycine', GGA: 'Glycine',
GGG: 'Glycine'
```


In [None]:
# RNA codon table from the problem statement
# (full amino acid names, 'Stop' for the three stop codons)
codon_table = {
    'AUG': 'Methionine',
    'UUU': 'Phenylalanine', 'UUC': 'Phenylalanine',
    'UUA': 'Leucine', 'UUG': 'Leucine', 'CUU': 'Leucine',
    'CUC': 'Leucine', 'CUA': 'Leucine', 'CUG': 'Leucine',
    'AUU': 'Isoleucine', 'AUC': 'Isoleucine', 'AUA': 'Isoleucine',
    'GUU': 'Valine', 'GUC': 'Valine', 'GUA': 'Valine', 'GUG': 'Valine',
    'UCU': 'Serine', 'UCC': 'Serine', 'UCA': 'Serine',
    'UCG': 'Serine', 'AGU': 'Serine', 'AGC': 'Serine',
    'CCU': 'Proline', 'CCC': 'Proline', 'CCA': 'Proline', 'CCG': 'Proline',
    'ACU': 'Threonine', 'ACC': 'Threonine', 'ACA': 'Threonine', 'ACG': 'Threonine',
    'GCU': 'Alanine', 'GCC': 'Alanine', 'GCA': 'Alanine', 'GCG': 'Alanine',
    'UAU': 'Tyrosine', 'UAC': 'Tyrosine',
    'UAA': 'Stop', 'UAG': 'Stop', 'UGA': 'Stop',
    'CAU': 'Histidine', 'CAC': 'Histidine',
    'CAA': 'Glutamine', 'CAG': 'Glutamine',
    'AAU': 'Asparagine', 'AAC': 'Asparagine',
    'AAA': 'Lysine', 'AAG': 'Lysine',
    'GAU': 'Aspartic Acid', 'GAC': 'Aspartic Acid',
    'GAA': 'Glutamic Acid', 'GAG': 'Glutamic Acid',
    'UGU': 'Cysteine', 'UGC': 'Cysteine',
    'UGG': 'Tryptophan',
    'CGU': 'Arginine', 'CGC': 'Arginine', 'CGA': 'Arginine',
    'CGG': 'Arginine', 'AGA': 'Arginine', 'AGG': 'Arginine',
    'GGU': 'Glycine', 'GGC': 'Glycine', 'GGA': 'Glycine', 'GGG': 'Glycine'
}

In [None]:
def translate_rna_to_protein(rna_sequence):
    # Initialize the protein sequence
    protein_sequence = []

    # Translate RNA sequence to protein sequence, one codon (3 nucleotides) at a time
    for i in range(0, len(rna_sequence), 3):
        codon = rna_sequence[i:i+3]
        # Unknown or incomplete codons are treated like a stop signal
        amino_acid = codon_table.get(codon, 'Stop')
        if amino_acid == 'Stop':
            break
        protein_sequence.append(amino_acid)

    return '-'.join(protein_sequence)

# Example usage
rna_sequence = "AUGUUUUCUAUGCGUACUAGCUG"
protein = translate_rna_to_protein(rna_sequence)
print(f"The protein sequence is: {protein}")

The protein sequence is: Methionine-Phenylalanine-Serine-Methionine-Arginine-Threonine-Serine
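
The example sequence has 23 nucleotides, so it ends in an incomplete codon, which is why translation stops after the seventh amino acid. The sketch below makes the codon split explicit; the helper name `split_codons` is hypothetical.

```python
def split_codons(rna_sequence):
    """Split an RNA sequence into complete codons plus any leftover nucleotides."""
    usable = len(rna_sequence) - len(rna_sequence) % 3
    codons = [rna_sequence[i:i + 3] for i in range(0, usable, 3)]
    return codons, rna_sequence[usable:]

codons, leftover = split_codons("AUGUUUUCUAUGCGUACUAGCUG")
print(codons)
print(leftover)
```

Separating the split from the lookup makes it easy to warn about leftover nucleotides instead of silently stopping.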


# Q13
### In genetics, mutations refer to changes in the DNA sequence. Given a reference DNA sequence and a mutated DNA sequence, write a Python function called find_mutations that identifies the positions where mutations have occurred. The function should take two strings as input: the reference DNA sequence and the mutated DNA sequence (both of the same length), and return a dictionary where the keys are the positions (0-based index) of the mutations and the values are tuples containing the reference nucleotide and the mutated nucleotide.

## Input sequence:
#### reference_sequence = "ATCGGCTAATCGGCTAGCTAGCTAGCTGATCGGCTAATCGGCTAGCTAGCTAGCTGATCGGCTAATCGGCTAGCTAGCTAGCTGATCGGCTAATCGGCTAGCTAGCTAGCTG"

#### mutated_sequence = "ATCGGCTGATCGGCTTGCTAGCTGGCTGATCGGCTAATCGGCTAGCTAGCTGGCTGATCGGCTAATCGGCTAGCTTGCTAGCTGATCGGCTAATCGGCTAGCTAGCTAGCAG"

In [None]:
def find_mutations(reference_sequence, mutated_sequence):
    # Ensure the sequences are of the same length
    if len(reference_sequence) != len(mutated_sequence):
        raise ValueError("Sequences must be of the same length")

    # Dictionary to store the positions and the corresponding nucleotides
    mutations = {}

    # Iterate over the sequences and compare each position
    for i in range(len(reference_sequence)):
        if reference_sequence[i] != mutated_sequence[i]:
            mutations[i] = (reference_sequence[i], mutated_sequence[i])

    return mutations

# Input sequences
reference_sequence = "ATCGGCTAATCGGCTAGCTAGCTAGCTGATCGGCTAATCGGCTAGCTAGCTAGCTGATCGGCTAATCGGCTAGCTAGCTAGCTGATCGGCTAATCGGCTAGCTAGCTAGCTG"
mutated_sequence = "ATCGGCTGATCGGCTTGCTAGCTGGCTGATCGGCTAATCGGCTAGCTAGCTGGCTGATCGGCTAATCGGCTAGCTTGCTAGCTGATCGGCTAATCGGCTAGCTAGCTAGCAG"

# Find mutations
mutations = find_mutations(reference_sequence, mutated_sequence)
print(mutations)


{7: ('A', 'G'), 15: ('A', 'T'), 23: ('A', 'G'), 51: ('A', 'G'), 75: ('A', 'T'), 110: ('T', 'A')}
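
The same comparison can be sketched more compactly with `zip` and `enumerate` in a dictionary comprehension; the number of mutations (the Hamming distance between the two sequences) is then simply the size of the dictionary. The name `find_mutations_zip` and the short test sequences are hypothetical.

```python
def find_mutations_zip(reference_sequence, mutated_sequence):
    """Return {position: (reference_base, mutated_base)} for every mismatch."""
    if len(reference_sequence) != len(mutated_sequence):
        raise ValueError("Sequences must be of the same length")
    return {
        i: (ref, mut)
        for i, (ref, mut) in enumerate(zip(reference_sequence, mutated_sequence))
        if ref != mut
    }

mutations = find_mutations_zip("ATCGGCTA", "ATCGGCTG")  # hypothetical short example
print(mutations)
print("Number of mutations:", len(mutations))
```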
