
# BBC - Bioinformatics and Computational Biology 2025

> Practicals 01 - Introduction to Bioinformatics

---

- Professor: Brochet Xavier (<a href="mailto:xavier.brochet@heig-vd.ch">xavier.brochet@heig-vd.ch</a>)
- Assistant: Thibault Schowing (<a href="mailto:thibault.schowing@heig-vd.ch">thibault.schowing@heig-vd.ch</a>)

- Student: Michael Strefeler


## Instructions

- Add your name to the filename and here above
- Answer the questions 
- Add python or markdown cells if you deem it necessary
- Return the notebook according to instructions given in class. 

## Content

- Basics of DNA structure and replication (Crash course)
- Sequences and files



> Throughout the laboratories, the questions that you should answer are highlighted as follows :
> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b>0. </b></font> This is a question.</p>

## Basic DNA biology


If you are passionate and want to review all the basics of biology on a rainy Sunday afternoon, you can watch this full biology crash course: [from carbon atoms to ecology - full YouTube Playlist](https://www.youtube.com/playlist?list=PL3EED4C1D684D3ADF). For this introductory lab, however, you only need a quick review of what DNA is and how it is replicated: watch [this video](https://www.youtube.com/watch?v=8kK2zwjRV0M&list=PL3EED4C1D684D3ADF&index=12) (13 minutes) from the series, followed by [the next one](https://www.youtube.com/watch?v=itsb2SqR-R0&list=PL3EED4C1D684D3ADF&index=12) (14 minutes), which describes the basics of DNA transcription and RNA translation.




Answer the following questions. The answers are found in the two videos above, so you can answer while watching, but feel free to search by yourself. Give simple and short answers; there is no need to ask GPT or Claude to write a paragraph for each question, the goal being for you to understand these concepts.

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> What are 3' and 5' when we talk about DNA ? </p>
- 3' and 5' are the numbers of the carbon atoms in the sugar (deoxyribose) of each nucleotide. The phosphate group is attached to the 5' carbon and a hydroxyl (OH) group to the 3' carbon; the phosphate of one nucleotide bonds to the 3' carbon of the previous one, which gives each strand a direction
- The double helix is like a twisted ladder whose 2 strands run in opposite directions (antiparallel): [5' --> 3'] is the pattern of one strand (going up) and [3' --> 5'] is the pattern of the other strand

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> What are the different steps of replication ?</p>
- Helicase unwinds the double helix and separates its 2 strands
- Each of the 2 strands is used as a template for replication (one produces the leading strand, the other the lagging strand)
- On the leading strand, RNA primase lays down a short RNA primer at the start (A pairs with U, C with G, T with A), and DNA polymerase then adds the matching DNA nucleotides continuously in the 5' --> 3' direction
- The lagging strand is harder to replicate: RNA primase lays down a primer for one section of the strand at a time and DNA polymerase works backwards, away from the fork, producing short pieces (Okazaki fragments). Once all that is done, a different DNA polymerase replaces the RNA primers with DNA and, finally, DNA ligase bonds the fragments together

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> Where is DNA located in eukaryotes ?</p>
- In the nucleus

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> How is DNA packed in the nucleus ? </p>
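- The DNA is wrapped around proteins called histones, forming nucleosomes; these coil into chromatin, which is further condensed into chromosomes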

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> What is a promoter (TATA box) ?</p>
- A promoter is a DNA sequence upstream of a gene that defines where the transcription unit begins. The TATA box is a part of the promoter, rich in T and A, that helps the enzyme (RNA polymerase) figure out where to bind to the strand

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> What strand is used to transcribe in RNA ? </p>
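- The template strand (also called the antisense strand). RNA polymerase reads it in the 3' --> 5' direction and builds the mRNA in the 5' --> 3' direction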
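
As a minimal sketch of transcription: assuming the template strand is given as a string written 3' --> 5', the mRNA is obtained by pairing each base with its RNA complement (A with U, T with A, C with G, G with C). The name `transcribe_template` is hypothetical and not part of the lab code.

```python
# Hypothetical helper: RNA base that pairs with each DNA base of the template
RNA_COMPLEMENT = {'A': 'U', 'T': 'A', 'C': 'G', 'G': 'C'}

def transcribe_template(template_3_to_5):
    """Return the mRNA (written 5'->3') for a template strand written 3'->5'."""
    return ''.join(RNA_COMPLEMENT[base] for base in template_3_to_5.upper())

# Template 3'-TACGTA-5' pairs with mRNA 5'-AUGCAU-3'
mrna = transcribe_template("TACGTA")
print(mrna)  # AUGCAU
```

The resulting mRNA has the same sequence as the other (coding) DNA strand, with U in place of T.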

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> How is the mRNA matured after transcription ?</p>
- By adding a cap (a modified guanine) to the 5' end and a poly-A tail (about 250 adenines) to the 3' end, and by splicing out the introns

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> Snurps and splicing ? Exons ? Introns ? What are they ?</p>
- RNA splicing is used to remove the non-coding segments (introns) from the mRNA
- Snurps are key players in RNA splicing. The name comes from snRNPs, Small Nuclear RibonucleoProteins. They are combinations of RNA and proteins that recognise the start and end of each segment to cut out, and they assemble into the spliceosome, which does the cutting
- Exons are the segments we keep and join together when splicing; introns are the segments that are cut out
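
Splicing can be pictured as keeping some slices of a string and dropping the rest. The sketch below is a toy illustration only: the exon coordinates are supplied by hand, whereas in the cell the splice sites are recognised by the snRNPs. The function `splice` and the toy sequence are assumptions made for this example.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a pre-mRNA; everything else (introns) is dropped.

    exons: list of (start, end) index pairs, 0-based, end exclusive.
    """
    return ''.join(pre_mrna[start:end] for start, end in exons)

# Toy pre-mRNA: exon "AUGGCC" + intron "GUAAGUUUCAG" + exon "UUUUAA"
pre_mrna = "AUGGCC" + "GUAAGUUUCAG" + "UUUUAA"
mature = splice(pre_mrna, [(0, 6), (17, 23)])
print(mature)  # AUGGCCUUUUAA
```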

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> Endoplasmic reticulum and ribosomes: what are they used for ? </p>
- The endoplasmic reticulum is a part of a transportation system of the eukaryotic cell, and has many other important functions such as protein folding.
- Ribosomes are a mixture of proteins and ribosomal RNA (rRNA). The incoming mRNA binds there, and the ribosome translates its genetic code into chains of amino acids using tRNA.

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> What is tRNA and what is it used for ? </p>
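- tRNA is transfer RNA. Each tRNA carries a specific amino acid and has an anticodon that pairs with the matching codon on the mRNA, so it does the translation from the language of nucleotides into the language of amino acids and proteins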

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> In how many dimensions can we analyse proteins ?</p>
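- Proteins can be analyzed at four levels of structure: primary (the amino acid sequence), secondary (local folds such as alpha helices and beta sheets), tertiary (the 3D shape of one chain) and quaternary (the assembly of several chains)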

---

## Sequences and files

A DNA sequence is a string of nucleotides represented by four letters:
- A (Adenine)
- T (Thymine)
- C (Cytosine)
- G (Guanine)

Here is a simple class representing a DNA sequence. Take a look at the code. 

```python
# Basic DNA Sequence Representation
class DNASequence:
    def __init__(self, sequence):
        # Validate and store the sequence
        self.sequence = sequence.upper()
        if not self._is_valid():
            raise ValueError("Invalid DNA sequence")

    def _is_valid(self):
        """Check if sequence contains only valid nucleotides"""
        valid_nucleotides = set('ATCG')
        return all(base in valid_nucleotides for base in self.sequence)

    def complement(self):
        """Generate the complement sequence"""
        complement_dict = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
        return ''.join(complement_dict[base] for base in self.sequence)

    def reverse_complement(self):
        """Generate the reverse complement sequence"""
        return self.complement()[::-1]

    def gc_content(self):
        """Calculate GC content percentage"""
        gc_count = self.sequence.count('G') + self.sequence.count('C')
        return (gc_count / len(self.sequence)) * 100

# Example usage
dna = DNASequence("ATCGATCG")
print("Original Sequence:", dna.sequence)
print("Complement:", dna.complement())
print("Reverse Complement:", dna.reverse_complement())
print("GC Content:", dna.gc_content(), "%")
```

```text
Original Sequence: ATCGATCG
Complement: TAGCTAGC
Reverse Complement: CGATCGAT
GC Content: 50.0 %
```


> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> What is the GC content used for ? (search on the interweb and give your sources)</p>
- The structural integrity of DNA is influenced by its nucleotide composition, particularly the presence of guanine-cytosine pairs. These pairs are held together by three hydrogen bonds, compared to the two bonds found in adenine-thymine pairs, so GC-rich DNA is more stable.
- GC content also affects the physical properties of DNA, influencing its melting temperature, the point at which half of the DNA strands are in the single-stranded state. This is a key parameter in techniques like the polymerase chain reaction (PCR). In PCR, high-GC templates increase the chances of non-specific binding and consequently of false-positive results, so care must be taken when selecting the template DNA and designing primers.
- The genome can broadly be divided into genes and non-coding sequences. GC-rich regions span both coding and non-coding regions, and are therefore relevant to protein formation as well as gene expression.

Sources:
- [genetic education](https://geneticeducation.co.in/what-is-the-importance-of-gc-content/)
- [Biology insights](https://biologyinsights.com/gc-content-effects-on-dna-stability-and-gene-expression/)
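
The link between GC content and melting temperature can be illustrated with the Wallace rule, a rule-of-thumb estimate for short oligonucleotides (roughly 14 to 20 bases): Tm = 2 x (A+T) + 4 x (G+C) in degrees Celsius. This is a rough sketch, not what primer design tools actually compute; the function name `wallace_tm` is chosen for this example.

```python
def wallace_tm(primer):
    """Rough melting temperature (deg C) of a short oligo: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count('A') + primer.count('T')
    gc = primer.count('G') + primer.count('C')
    return 2 * at + 4 * gc

# Two primers of the same length (16 bases) but opposite GC content
print(wallace_tm("ATATATATATATATAT"))  # 32
print(wallace_tm("GCGCGCGCGCGCGCGC"))  # 64
```

At equal length, the GC-only primer gets twice the estimated melting temperature of the AT-only one.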

A protein sequence is a string of amino acids, each represented by one or three letters depending on the chosen notation. Here we provide code that simply translates a DNA sequence into a protein (we omit the intermediate steps of transcription into mRNA (messenger RNA) and splicing (removing the non-coding parts)).


Read the code and answer the questions.

```python
# Smoothly generated by a well known LLM

# Complete Genetic Code Dictionary
# Standard genetic code mapping (64 possible codons)
genetic_code = {
    # Start Codon
    'ATG': 'M',  # Methionine (Start)

    # Phenylalanine
    'TTT': 'F', 'TTC': 'F',

    # Leucine
    'TTA': 'L', 'TTG': 'L',
    'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',

    # Isoleucine
    'ATT': 'I', 'ATC': 'I', 'ATA': 'I',

    # Valine
    'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',

    # Serine
    'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',
    'AGT': 'S', 'AGC': 'S',

    # Proline
    'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',

    # Threonine
    'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',

    # Alanine
    'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',

    # Tyrosine
    'TAT': 'Y', 'TAC': 'Y',

    # Histidine
    'CAT': 'H', 'CAC': 'H',

    # Glutamine (GLN)
    'CAA': 'Q', 'CAG': 'Q',

    # Asparagine
    'AAT': 'N', 'AAC': 'N',

    # Lysine
    'AAA': 'K', 'AAG': 'K',

    # Aspartic Acid
    'GAT': 'D', 'GAC': 'D',

    # Glutamic Acid (GLU) - Turns into Glutamate in the body
    'GAA': 'E', 'GAG': 'E',

    # Cysteine
    'TGT': 'C', 'TGC': 'C',

    # Tryptophan
    'TGG': 'W',

    # Arginine
    'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
    'AGA': 'R', 'AGG': 'R',

    # Glycine
    'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',

    # Stop Codons
    'TAA': '*', 'TAG': '*', 'TGA': '*'
}

def translate_dna_to_protein(dna_sequence, start_codon='ATG'):
    """
    Translate DNA sequence to protein sequence with advanced features

    Parameters:
    - dna_sequence: Input DNA sequence to translate
    - start_codon: Specifies the start codon (default is 'ATG')

    Returns:
    - Translated protein sequence
    - Diagnostic information about translation
    """
    # Convert sequence to uppercase to handle mixed case input
    dna_sequence = dna_sequence.upper()

    # Find the first occurrence of start codon
    start_index = dna_sequence.find(start_codon)

    # If no start codon found, return empty results
    if start_index == -1:
        return '', {'translated_length': 0, 'start_codon_found': False}

    # Translate from start codon
    protein = []
    diagnostic_info = {
        'start_codon_found': True,
        'total_codons': 0,
        'stop_codon_found': False,
        'unknown_codons': 0
    }

    # Translate in groups of 3 nucleotides
    for i in range(start_index, len(dna_sequence), 3):
        # Ensure we have a complete codon
        if i + 3 > len(dna_sequence):
            break

        codon = dna_sequence[i:i+3]
        diagnostic_info['total_codons'] += 1

        # Check if codon is a stop codon
        if codon in ['TAA', 'TAG', 'TGA']:
            diagnostic_info['stop_codon_found'] = True
            break

        # Translate codon
        amino_acid = genetic_code.get(codon, 'X')

        # Track unknown codons
        if amino_acid == 'X':
            diagnostic_info['unknown_codons'] += 1

        protein.append(amino_acid)

    # Convert protein to string
    protein_sequence = ''.join(protein)

    # Update diagnostic info
    diagnostic_info['translated_length'] = len(protein_sequence)

    return protein_sequence, diagnostic_info

def analyze_dna_translation(dna_sequence):
    """
    Comprehensive DNA translation analysis

    Provides multiple reading frames and detailed translation information
    """
    print("DNA Sequence Analysis:")
    print("-" * 40)
    print(f"Original Sequence: {dna_sequence}")
    print(f"Sequence Length: {len(dna_sequence)} nucleotides")

    # Analyze all 3 reading frames
    for frame in range(3):
        print(f"\nReading Frame {frame + 1}:")
        translated_protein, diagnostic = translate_dna_to_protein(
            dna_sequence[frame:],
            start_codon='ATG'
        )

        print(f"Translated Protein: {translated_protein}")
        print("Diagnostic Information:")
        for key, value in diagnostic.items():
            print(f"  {key.replace('_', ' ').title()}: {value}")

# Example usage
example_sequences = [
    "ATGCATCGATCGATCGATCG",  # Sample DNA sequence
    "GCATCGATCGATCGATCGATCG",  # Different starting point
    "ATGAAACCCGGGTTT"  # Another example
]

for sequence in example_sequences:
    analyze_dna_translation(sequence)
    print("\n" + "="*50 + "\n")
```

```text
DNA Sequence Analysis:
----------------------------------------
Original Sequence: ATGCATCGATCGATCGATCG
Sequence Length: 20 nucleotides

Reading Frame 1:
Translated Protein: MHRSID
Diagnostic Information:
  Start Codon Found: True
  Total Codons: 6
  Stop Codon Found: False
  Unknown Codons: 0
  Translated Length: 6

Reading Frame 2:
Translated Protein: 
Diagnostic Information:
  Translated Length: 0
  Start Codon Found: False

Reading Frame 3:
Translated Protein: 
Diagnostic Information:
  Translated Length: 0
  Start Codon Found: False


DNA Sequence Analysis:
----------------------------------------
Original Sequence: GCATCGATCGATCGATCGATCG
Sequence Length: 22 nucleotides

Reading Frame 1:
Translated Protein: 
Diagnostic Information:
  Translated Length: 0
  Start Codon Found: False

Reading Frame 2:
Translated Protein: 
Diagnostic Information:
  Translated Length: 0
  Start Codon Found: False

Reading Frame 3:
Translated Protein: 
Diagnostic Information:
  Translated Length: 0
  St
```

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> What are the differences between a DNA sequence and a Protein sequence ? Simply summarize how they are encoded to show you read the code above. </p>

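- DNA is encoded with an alphabet of 4 nucleotides (A, C, G, T), while a protein is encoded with an alphabet of 20 standard amino acids, each written as one letter. In the code above, the DNA is read three nucleotides (one codon) at a time, and the `genetic_code` dictionary maps each of the 64 possible codons to one amino acid or to a stop signal (`*`)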
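
The arithmetic behind the codon table can be checked in a few lines: with 4 nucleotides and codons of length 3 there are 4 x 4 x 4 = 64 possible codons, which is more than the 20 amino acids, so several codons map to the same amino acid. The helper `split_into_codons` is a hypothetical name used only for this illustration.

```python
from itertools import product

# Every possible triplet over the 4-letter DNA alphabet
all_codons = [''.join(triplet) for triplet in product('ACGT', repeat=3)]
print(len(all_codons))  # 64

def split_into_codons(seq):
    """Cut a sequence into consecutive triplets, ignoring incomplete trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

# 9 nucleotides give 3 codons, hence 3 amino acids
print(split_into_codons("ATGAAACCC"))  # ['ATG', 'AAA', 'CCC']
```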

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> What is a reading frame ? If you have a mutation that adds one nucleotide to your DNA sequence, what are the consequences ? </p>

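- A reading frame is one way of dividing a DNA sequence into consecutive, non-overlapping codons (groups of 3 nucleotides); depending on where reading starts, a strand can be read in 3 different frames. If a mutation inserts one nucleotide, every codon after the insertion is offset by one position (a frameshift mutation), so the amino acids downstream change and a stop codon may appear too early or too late, which usually gives a non-functional protein.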
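
The effect of a single inserted nucleotide on the reading frame can be seen by cutting the sequence into codons before and after the insertion. This is a small sketch on a toy sequence; the helper `to_codons` and the insertion position are arbitrary choices for the example.

```python
def to_codons(seq):
    """Split a sequence into complete codons, dropping incomplete trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

original = "ATGAAACCCGGGTTT"
# Insert a single G after the 4th nucleotide
mutated = original[:4] + "G" + original[4:]

print(to_codons(original))  # ['ATG', 'AAA', 'CCC', 'GGG', 'TTT']
print(to_codons(mutated))   # ['ATG', 'AGA', 'ACC', 'CGG', 'GTT']
```

Only the first codon is unchanged; every codon after the insertion point is different.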



> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> What should be changed if we use RNA instead of DNA sequences for our sequences ?  </p>
Check the DNA and RNA codon table page [here](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables) to help you answer.

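- RNA uses uracil (U) instead of thymine (T), so the code would need to be changed to work with RNA: every T in the codon table must be replaced by U, and the sequence validation must accept U instead of T. The start and stop codons are the same ones written with U: AUG for start and UAA, UAG, UGA for stop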
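
One possible way to adapt a codon table to RNA is to rewrite its keys. The sketch below uses a small excerpt of a DNA codon table (an assumption for brevity, with the same codon to amino acid layout as a full table) and derives the RNA version from it.

```python
# Small excerpt of a DNA codon table: codon -> one-letter amino acid ('*' = stop)
dna_codons = {'ATG': 'M', 'TTT': 'F', 'TAA': '*', 'TAG': '*', 'TGA': '*'}

# RNA version: same table with every T replaced by U
rna_codons = {codon.replace('T', 'U'): aa for codon, aa in dna_codons.items()}

print(rna_codons)
# {'AUG': 'M', 'UUU': 'F', 'UAA': '*', 'UAG': '*', 'UGA': '*'}
```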


Have you ever wondered what all these amino acids taste like ? Here we answer the question for you.


![](https://www.umamiinfo.com/images/what/whatisumami/table_02_pc.png)

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> Which product, often stored in a green, yellow and red cylindrical container and consumed in Switzerland, contains Glutamate ?  </p>

- Aromat


# File types overview

> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> Describe the following file formats and what they are used for. Give a small extract or an example of what the file looks like and give examples of which tools produce and/or use it (e.g. The tool X uses a Fasta file to generate Y). You can use an LLM to help you format your answer but you should cite reliable bioinformatics sources for each description (Wikipedia is ok but try to find more bioinfo resources if possible). </p>

- Fasta / Fastq
- GTF/GFF
- SAM/BAM


*Answers:*


### Fasta/Fastq

FASTA is used for representing either nucleotide sequences or amino acid (protein) sequences: each record is a header line starting with `>` followed by the sequence. FASTQ stores nucleotide reads and adds a quality score for each base.

Here's an example of a FASTA file:

```text
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
```

SSEARCH, a program from the FASTA package, takes FASTA files as input and performs protein-protein or DNA-DNA comparisons using the Smith-Waterman algorithm.
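
Reading a FASTA file by hand takes only a few lines, which shows how simple the format is. The parser below is a minimal sketch with a hypothetical name (`parse_fasta`) and toy records; for real work a library such as Biopython is the usual choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # New record: the header is everything after '>'
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped over several lines
            records[header].append(line)
    return {h: ''.join(parts) for h, parts in records.items()}

example = """>seq1 toy record
ATGC
GGTA
>seq2
TTAA
"""
records = parse_fasta(example)
print(records)  # {'seq1 toy record': 'ATGCGGTA', 'seq2': 'TTAA'}
```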

Here's an example of a FASTQ file:

```text
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```

The FASTQ format is typically used for raw sequence reads from high-throughput sequencing technologies like Illumina. Each record has four lines: an identifier starting with `@`, the sequence, a `+` separator, and the per-base quality scores encoded as ASCII characters.
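
The quality line can be decoded into numbers. The sketch below assumes the Phred+33 encoding (quality = ASCII code minus 33), which is the common convention for recent Illumina data; older files may use a different offset. The record and the function name `phred_scores` are made up for the example.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_line]

# Toy 4-line FASTQ record: identifier, sequence, separator, qualities
record = ["@SEQ_ID", "GATTA", "+", "!5I+C"]
scores = phred_scores(record[3])
print(scores)  # [0, 20, 40, 10, 34]

# A Phred score Q corresponds to an error probability of 10^(-Q/10)
error_probabilities = [10 ** (-q / 10) for q in scores]
print(error_probabilities[1])  # 0.01
```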


### GTF/GFF

The GFF (General Feature Format) format consists of one line per feature, each containing 9 tab-separated columns of data (seqname, source, feature, start, end, score, strand, frame, attribute), plus optional track definition lines.
The GTF (General Transfer Format) is identical to GFF version 2.

Both formats are used to describe and represent genomic features.

Sample GTF output from Ensembl data dump:

```text
1 transcribed_unprocessed_pseudogene  gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; 
1 processed_transcript transcript  11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
```

Sample GFF output from Ensembl export:

```text
X	Ensembl	Repeat	2419108	2419128	42	.	.	hid=trf; hstart=1; hend=21
X	Ensembl	Repeat	2419108	2419410	2502	-	.	hid=AluSx; hstart=1; hend=303
X	Ensembl	Repeat	2419108	2419128	0	.	.	hid=dust; hstart=2419108; hend=2419128
X	Ensembl	Pred.trans.	2416676	2418760	450.19	-	2	genscan=GENSCAN00000019335
X	Ensembl	Variation	2413425	2413425	.	+	.	
X	Ensembl	Variation	2413805	2413805	.	+	.
```

Programs like StringTie and Cufflinks utilize GFF and GTF files for processing and analyzing genomic data, particularly in transcriptome analysis and gene expression studies.
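
To make the 9-column layout concrete, here is a minimal sketch that splits one tab-separated GFF line into named fields. The column names follow the Ensembl description of the format; the function `parse_gff_line` is hypothetical and does no more than basic validation.

```python
GFF_COLUMNS = ['seqname', 'source', 'feature', 'start', 'end',
               'score', 'strand', 'frame', 'attribute']

def parse_gff_line(line):
    """Split one tab-separated GFF line into a dict keyed by column name."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF_COLUMNS, fields))
    # Coordinates are 1-based and inclusive
    record['start'] = int(record['start'])
    record['end'] = int(record['end'])
    return record

line = "X\tEnsembl\tRepeat\t2419108\t2419410\t2502\t-\t.\thid=AluSx; hstart=1; hend=303"
rec = parse_gff_line(line)
print(rec['feature'], rec['strand'], rec['end'] - rec['start'] + 1)  # Repeat - 303
```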


### SAM/BAM

SAM means Sequence Alignment/Map; BAM means Binary Alignment/Map.

The SAM Format is a text format for storing sequence data in a series of tab delimited ASCII columns. Most often it is generated as a human readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form.

SAM files are generated when reads are mapped (aligned) to a reference sequence.

SAM file:
```text
1:497:R:-272+13M17D24M	113	1	497	37	37M	15	100338662	0	CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG	0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>	XT:A:U	NM:i:0	SM:i:37	AM:i:0	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37
19:20389:F:275+18M2D19M	99	1	17644	0	37M	=	17919	314	TATGACTGCTAATAATACCTACACATGTTAGAACCAT	>>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9	RG:Z:UM0098:1	XT:A:R	NM:i:0	SM:i:0	AM:i:0	X0:i:4	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37
19:20389:F:275+18M2D19M	147	1	17919	0	18M2D19M	=	17644	-314	GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT	;44999;499<8<8<<<8<<><<<<><7<;<<<>><<	XT:A:R	NM:i:2	SM:i:0	AM:i:0	X0:i:4	X1:i:0	XM:i:0	XO:i:1	XG:i:2	MD:Z:18^CA19
9:21597+10M2I25M:R:-209	83	1	21678	0	8M2I27M	=	21469	-244	CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT	<;9<<5><<<<><<<>><<><>><9>><>>>9>>><>	XT:A:R	NM:i:2	SM:i:0	AM:i:0	X0:i:5	X1:i:0	XM:i:0	XO:i:1	XG:i:2	MD:Z:35
```
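
The second column of each SAM line is the FLAG, an integer whose bits describe the read. The sketch below decodes the lower 8 bits; the bit values come from the SAM specification, while the label strings and the function name `decode_flag` are chosen for this example.

```python
# Bit values from the SAM specification; labels are informal names
SAM_FLAGS = {
    0x1: 'paired',
    0x2: 'proper_pair',
    0x4: 'unmapped',
    0x8: 'mate_unmapped',
    0x10: 'reverse_strand',
    0x20: 'mate_reverse_strand',
    0x40: 'first_in_pair',
    0x80: 'second_in_pair',
}

def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(decode_flag(147))
# ['paired', 'proper_pair', 'reverse_strand', 'second_in_pair']
```

Flags 99 and 147 typically appear together: they describe the two reads of a properly aligned pair, one on each strand.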

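BAM file: BAM is a binary format, so it cannot be shown as a text extract; its content is usually inspected by converting it back to SAM, for example with `samtools view`. The extract below is not BAM but VCF (Variant Call Format), a separate text format that stores variant calls: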

```text
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
```

[samtools](https://github.com/samtools/samtools) is a toolkit, available on GitHub, used to view, sort, index and convert SAM and BAM files.

### Sources

- FASTA [wikipedia](https://en.wikipedia.org/wiki/FASTA_format), [microbe notes](https://microbenotes.com/fasta/)
- FASTQ [wikipedia](https://en.wikipedia.org/wiki/FASTQ_format), [omics tutorials](https://omicstutorials.com/efficient-processing-of-sff-fastq-files/), [Illumina](https://www.illumina.com/informatics.html)
- FASTA vs FASTQ [sequence server](https://sequenceserver.com/blog/fasta-vs-fastq-formats/)
- FASTA, FASTQ and SAM [bioinformatics stack exchange](https://bioinformatics.stackexchange.com/questions/14/what-is-the-difference-between-fasta-fastq-and-sam-file-formats)
- GTF/GFF [ensembl.org](https://www.ensembl.org/info/website/upload/gff.html?redirect=no), [agat](https://agat.readthedocs.io/en/latest/gxf.html), [biobam](https://www.biobam.com/differences-between-gtf-and-gff-files-in-genomic-data-analysis/), [GFF Utilities](https://ccb.jhu.edu/software/stringtie/gff.shtml)
- SAM/BAM [zymoresearch](https://zymoresearch.eu/blogs/blog/what-are-sam-and-bam-files), [bioinformatics uconn edu](https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/file-formats-tutorial/), [bioinformaticamente.com](https://bioinformaticamente.com/2021/03/03/sam-bam/), [samtools](https://github.com/samtools/samtools)

# Bioinformatics is not only sequence analysis

- Image detection / microscopy 
- Sequencing technologies benchmarking (DNA, RNA, Long vs Short reads)
- Mass spectrometry
- FTIR and Raman spectrometry
- Ecology and simulation
- Chemical interaction prediction (Cheminformatics)
- Etc...




> <p style="background-color:#AFEEEE;padding:3px"><font size="4"><b> </b></font> Choose a subject from the list above or related to genetics (cancer genomics, epigenetics, single cell sequencing, etc.), and write a short description of the role of bioinformatics in this domain. Cite at least two papers (or reviews) referring to the subject. </p>

Subject of choice : Metabolomics

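Metabolomics provides insight into the metabolic state of organisms through the analysis of small molecules, or metabolites, which reflect physiological processes. Bioinformatics supplies the tools to process and interpret these measurements, for example peak detection and metabolite identification in mass spectrometry or NMR data, statistical analysis, and pathway analysis. Integrating these results helps in understanding complex biological systems and can aid in disease characterization and biomarker discovery.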

Article 1: [Metabolomics technology and bioinformatics](https://academic.oup.com/bib/article-abstract/7/2/128/306184)

Article 2: [Guide to Metabolomics Analysis: A Bioinformatics Workflow](https://www.mdpi.com/2218-1989/12/4/357)