# 🧠 Part 1: The Plan for Unzipping the Train and Test Folders and Viewing the MPEG-G Files
We’ll create a local Jupyter notebook or script that:

0. 🔧 Covers the prerequisites (Step 0)
1. 🐳 Pulls and runs the Genie Docker container
2. 📤 Extracts the Train and Test files from TrainFiles.zip and TestFiles.zip
3. 🐳 Pulls the Genie Docker image
4. 📂 Mounts the folder holding one ```.mgb``` file from TrainFiles
5. 🧬 Loads the FASTQ in Python
6. 💡 Explains the layers and outputs in plain English for a data scientist unfamiliar with bioinformatics
7. 🏃 Runs steps 4 and 5 for all files in TrainFiles and TestFiles


## 📓 Let’s Start with a Local Notebook
Here's how your local notebook should look:

### 🔧 Step 0: Prerequisites

1. Install an integrated development environment (IDE) such as VS Code, or the Anaconda distribution
2. Install Docker (if you haven’t already):
👉 https://www.docker.com/products/docker-desktop/
3. Install Python
4. Install Conda
5. Create a virtual environment

Note: While doing this, you may be asked to install command-line tools such as Xcode or Git, depending on your machine. They are worth installing because you will find them useful. Some of the downloads may take more than half an hour, so be patient and follow the instructions.

 Instructions for Mac are available [here](https://docs.google.com/document/d/1Rug61nxs6FLqlYh8Qzqgzuv6nq3lyTq45HVB6cqERfE/edit?usp=sharing)  
If you already have an IDE, Python, and Conda, install Docker:
👉 https://www.docker.com/products/docker-desktop/

Important:

 We recommend installing Biopython with conda, as it also installs the required dependencies. Note that the conda package is named `biopython`, while the module you import is `Bio`.

` conda install -c conda-forge biopython`

 A better approach is to create a new conda environment for your project and then install Biopython in that environment:

 `conda create -n mpegg_env python`
 Activate the environment:

 `conda activate mpegg_env`
 Now you can install Biopython in the new environment and then import `Bio` in your code:

` conda install -c conda-forge biopython`
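Before moving on, it can help to confirm that the tools and packages are visible from the environment the notebook is running in. The following is a minimal sketch, not part of the original notebook: `check_prerequisites` is a hypothetical helper, and the default module and tool names are assumptions based on what this notebook imports later.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("Bio", "pandas", "numpy"), tools=("docker",)):
    """Return a dict mapping each requirement name to True (found) or False."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        status[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        status[tool] = shutil.which(tool) is not None
    return status


for requirement, found in check_prerequisites().items():
    print(f"{requirement}: {'found' if found else 'MISSING'}")
```

Anything reported as `MISSING` usually means the wrong conda environment is active or the install step was skipped.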



In [None]:
# Uncomment the following lines to install the required packages

# !pip install pandas --quiet
# !pip install numpy --quiet
# !pip install matplotlib --quiet
# !pip install seaborn --quiet
# !pip install biopython --quiet
# !pip install scikit-learn --quiet
# !pip install tqdm --quiet


In [1]:
# Test Docker works
!docker run hello-world


Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/



In [2]:
# Import packages
import os
import zipfile
import subprocess
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import shutil


from Bio import SeqIO
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

## 📤 Step 1: Extract the Train and Test Files from a .zip
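The notebook does not show a cell for this step, so here is a minimal sketch using the standard-library `zipfile` module. `extract_zip` is a hypothetical helper, and both the archive locations (next to the notebook) and the `Data` destination folder are assumptions; adjust them so the extracted folders match `TRAIN_MGB_PATH` and `TEST_MGB_PATH` defined later.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the number of archive entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return len(archive.namelist())


# Assumed layout: both archives sit next to the notebook, output goes under Data/
for zip_name in ("TrainFiles.zip", "TestFiles.zip"):
    if Path(zip_name).exists():
        count = extract_zip(zip_name, "Data")
        print(f"Extracted {count} entries from {zip_name}")
    else:
        print(f"{zip_name} not found, skipping")
```

Whether each archive contains a top-level folder is not stated here, so check the resulting directory names before running the decoding steps.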

## 🐳 Step 2: Pull the Genie Docker Image

In [4]:
# Ensure you have the latest version of Genie
!docker pull muefab/genie:latest

latest: Pulling from muefab/genie
Digest: sha256:c3112a3879cc18061bbab5ed8f76dec255ab1be46e2133cd59320dd5ba98ef89
Status: Image is up to date for muefab/genie:latest
docker.io/muefab/genie:latest


## 🏃 Steps 4 - 8 for one MPEG-G file

### 📂 Step 4: Mount the folder for one ```.mgb``` file in TrainFiles

In [None]:
TRAIN_MGB_PATH = "Data/Trainfiles"
TEST_MGB_PATH = "Data/Testfiles"
TRAIN_FASTQ_PATH = "Data/TrainFastQ"
TEST_FASTQ_PATH = "Data/TestFastQ"

os.makedirs(TRAIN_FASTQ_PATH, exist_ok=True)
os.makedirs(TEST_FASTQ_PATH, exist_ok=True)

In [None]:
import os

notebook_dir = os.getcwd()

# Pick one `.mgb` file from TrainFiles
mgb_filename = "ID_AAFNOT.mgb"                 # just the file name
train_dir = os.path.abspath(TRAIN_MGB_PATH)   # input folder
fastq_dir = os.path.abspath(TRAIN_FASTQ_PATH)   # output folder

# Paths
mgb_file_path = os.path.join(train_dir, mgb_filename)
mgb_filename_no_mgb = os.path.splitext(mgb_filename)[0]
output_fastq = os.path.join(fastq_dir, f"{mgb_filename_no_mgb}.fastq")

# Docker mount paths
# (we mount both input + output so you get results locally)
input_container_dir = "/input"
output_container_dir = "/output"

# Show paths
print(f"📁 Host path to `.mgb`: {mgb_file_path}")
print(f"📁 Input directory mounted: {train_dir} → {input_container_dir}")
print(f"📁 Output directory mounted: {fastq_dir} → {output_container_dir}")
print(f"📦 Container will see: {input_container_dir}/{mgb_filename}")
print(f"📄 Output FASTQ: {output_fastq}")


📁 Host path to `.mgb`: f:\Desktop\Zindi\MPEG\TrainFiles\ID_AAFNOT.mgb
📁 Input directory mounted: f:\Desktop\Zindi\MPEG\TrainFiles → /input
📁 Output directory mounted: f:\Desktop\Zindi\MPEG\TrainFastQ → /output
📦 Container will see: /input/ID_AAFNOT.mgb
📄 Output FASTQ: f:\Desktop\Zindi\MPEG\TrainFastQ\ID_AAFNOT.fastq


### 🔍 Step 5: Decode that one ```.mgb``` file to ```.fastq```

In [None]:
import os
import subprocess

def inspect_mgb_structure(mgb_filename):
    """Decode one .mgb file from TRAIN_MGB_PATH to FASTQ using the Genie Docker image."""
    # Absolute paths for mounting
    train_dir = os.path.abspath(TRAIN_MGB_PATH)
    fastq_dir = os.path.abspath(TRAIN_FASTQ_PATH)

    # File name without the .mgb extension
    mgb_filename_no_mgb = os.path.splitext(mgb_filename)[0]

    # Container paths
    input_container_dir = "/input"
    output_container_dir = "/output"

    command = [
        "docker", "run", "--rm",
        "-v", f"{train_dir}:{input_container_dir}",
        "-v", f"{fastq_dir}:{output_container_dir}",
        "muefab/genie:latest", "run",
        "-f",
        "-i", f"{input_container_dir}/{mgb_filename}",
        "-o", f"{output_container_dir}/{mgb_filename_no_mgb}.fastq"
    ]

    print("Running:", " ".join(command))
    result = subprocess.run(command, capture_output=True, text=True)
    print("\n--- STDOUT ---\n")
    print(result.stdout)
    if result.stderr:
        print("\n--- STDERR ---\n")
        print(result.stderr)

# Example call
inspect_mgb_structure("ID_AAFNOT.mgb")


Running: docker run --rm -v f:\Desktop\Zindi\MPEG\TrainFiles:/input -v f:\Desktop\Zindi\MPEG\TrainFastQ:/output muefab/genie:latest run -f -i /input/ID_AAFNOT.mgb -o /output/ID_AAFNOT.fastq

--- STDOUT ---

[INFO,      0.000s, App]:    ______           _
[INFO,      0.000s, App]:   / ____/__  ____  (_)__
[INFO,      0.000s, App]:  / / __/ _ \/ __ \/ / _ \
[INFO,      0.000s, App]: / /_/ /  __/ / / / /  __/
[INFO,      0.000s, App]: \____/\___/_/ /_/_/\___/
[INFO,      0.000s, App]: Command: /usr/local/bin/genie run -f -i /input/ID_AAFNOT.mgb -o /output/ID_AAFNOT.fastq 
[INFO,      0.002s, App/Run]: Input file 1: /input/ID_AAFNOT.mgb with size 1.34MiB
[INFO,      0.005s, App/Run]: Working directory: /output with 743GiB available
[INFO,      0.010s, App/Run]: Output file: /output/ID_AAFNOT.fastq with 743GiB available
[INFO,      0.010s, App/Run]: Threads: 24 with 24 supported
[INFO,      0.011s, Spring]: Temporary directory: /output/tmp.U1xdED8xUf/
[INFO,      0.011s, Spring]: Temporary 

### Process all of them

In [None]:
import os
import subprocess
from pathlib import Path
from tqdm import tqdm

def decode_mgb_directory(input_dir, output_dir):
    """
    Decode all .mgb files in input_dir to .fastq in output_dir using Docker genie.
    Skips files that are already decoded.
    Shows a tqdm progress bar.
    """
    input_dir = os.path.abspath(input_dir)
    output_dir = os.path.abspath(output_dir)

    os.makedirs(output_dir, exist_ok=True)

    # Container mount points
    input_container_dir = "/input"
    output_container_dir = "/output"

    # Find all .mgb files
    mgb_files = [f for f in os.listdir(input_dir) if f.endswith(".mgb")]
    if not mgb_files:
        print(f"⚠️ No .mgb files found in {input_dir}")
        return

    print(f"\n🔍 Found {len(mgb_files)} .mgb files in {input_dir}")

    for mgb_filename in tqdm(mgb_files, desc=f"Decoding {os.path.basename(input_dir)}"):
        mgb_filename_no_mgb = os.path.splitext(mgb_filename)[0]
        output_fastq = Path(output_dir) / f"{mgb_filename_no_mgb}.fastq"

        # Skip if already processed
        if output_fastq.exists():
            tqdm.write(f"⏩ Skipping {mgb_filename}, already decoded")
            continue

        command = [
            "docker", "run", "--rm",
            "-v", f"{input_dir}:{input_container_dir}",
            "-v", f"{output_dir}:{output_container_dir}",
            "muefab/genie:latest", "run",
            "-f",
            "-i", f"{input_container_dir}/{mgb_filename}",
            "-o", f"{output_container_dir}/{mgb_filename_no_mgb}.fastq"
        ]

        result = subprocess.run(command, capture_output=True, text=True)

        if result.returncode != 0:
            tqdm.write(f"⚠️ Error decoding {mgb_filename}")
            if result.stderr.strip():
                tqdm.write(result.stderr)

    print(f"✅ Finished decoding all files from {input_dir} → {output_dir}")


# Example usage
decode_mgb_directory(TRAIN_MGB_PATH, TRAIN_FASTQ_PATH)



🔍 Found 2901 .mgb files in f:\Desktop\Zindi\MPEG\TrainFiles

⏩ Skipping ID_AAFNOT.mgb, already decoded
⏩ Skipping ID_AAXPTO.mgb, already decoded
⏩ Skipping ID_AAYKAN.mgb, already decoded
[... remaining "Skipping" lines and intermediate progress-bar updates omitted ...]

Decoding TrainFiles: 100%|██████████| 2901/2901 [4:22:12<00:00,  5.42s/it]

✅ Finished decoding all files from f:\Desktop\Zindi\MPEG\TrainFiles → f:\Desktop\Zindi\MPEG\TrainFastQ





In [None]:
decode_mgb_directory(TEST_MGB_PATH, TEST_FASTQ_PATH)


🔍 Found 1068 .mgb files in f:\Desktop\Zindi\MPEG\TestFiles


Decoding TestFiles:   0%|          | 0/1068 [00:00<?, ?it/s]

⏩ Skipping ID_ABHFUP.mgb, already decoded
⏩ Skipping ID_ADBLNY.mgb, already decoded
⏩ Skipping ID_AFAEMB.mgb, already decoded
⏩ Skipping ID_AFBBWK.mgb, already decoded
⏩ Skipping ID_AGHEZK.mgb, already decoded
⏩ Skipping ID_AGKIYB.mgb, already decoded
⏩ Skipping ID_AIQHUX.mgb, already decoded
⏩ Skipping ID_AIVFAZ.mgb, already decoded
⏩ Skipping ID_AJCBOB.mgb, already decoded
⏩ Skipping ID_AJGVTS.mgb, already decoded
⏩ Skipping ID_AJKHOU.mgb, already decoded
⏩ Skipping ID_AKDVQI.mgb, already decoded
⏩ Skipping ID_AKIAKB.mgb, already decoded


Decoding TestFiles: 100%|██████████| 1068/1068 [1:32:36<00:00,  5.20s/it]

✅ Finished decoding all files from f:\Desktop\Zindi\MPEG\TestFiles → f:\Desktop\Zindi\MPEG\TestFastQ
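Because `decode_mgb_directory` skips any file whose `.fastq` already exists and only prints a warning on errors, an interrupted or failed run may leave missing or empty outputs unnoticed. The following is a minimal sketch of a follow-up check; `find_undecoded` is a hypothetical helper that is not part of the original notebook, and it assumes each output is named `<stem>.fastq`, as in the function above.

```python
from pathlib import Path


def find_undecoded(input_dir, output_dir):
    """Return names of .mgb files whose .fastq output is missing or empty."""
    out = Path(output_dir)
    problems = []
    for mgb in sorted(Path(input_dir).glob("*.mgb")):
        fastq = out / f"{mgb.stem}.fastq"
        # A zero-byte file counts as a failure: the decoder produced nothing useful
        if not fastq.exists() or fastq.stat().st_size == 0:
            problems.append(mgb.name)
    return problems
```

In the notebook this could be called as `find_undecoded(TRAIN_MGB_PATH, TRAIN_FASTQ_PATH)` and again with the test paths. Deleting the empty `.fastq` files it reports and rerunning the decode cell would retry only those files.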





### 🧠 What MPEG-G Did (Plain English)

Your `.mgb` file used the MPEG-G standard to store sequencing data efficiently. Here's what happened under the hood:

- **Access Units (AUs)**: Think of these as independent blocks, like packets or video frames. Each AU can be decoded without needing the entire file.
  
- **Descriptor Streams**:
  - `SEQUENCE`: These are the DNA letters (A, T, C, G...).
  - `QUALITY`: Confidence for each base (used to assess sequencing accuracy).
  - `READ_IDENTIFIER`: Name or ID of each read.

- **Compression Techniques**:
  - Redundancies in the reads and IDs were removed.
  - Quality scores may have been quantized or entropy-coded.
  - Optional reference-based compression could align reads to a known genome and store only differences.

- **Output Format (`.fastq`)**:
  - This format is standard in genomics: it includes the ID, DNA sequence, and quality scores for each read.
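To make the FASTQ layout described above concrete, here is a minimal standard-library sketch that reads records and converts the quality string into numeric scores. `read_fastq` is a hypothetical helper, and it assumes the simple four-lines-per-record layout with Phred+33 quality encoding; whether the decoded files follow exactly that layout is an assumption to verify on your own data.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from an iterable of FASTQ lines.

    Assumes four lines per record and Phred+33 quality encoding.
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, "").strip()
        next(it, "")  # the '+' separator line
        quality = next(it, "").strip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        # Each quality character encodes one score: ASCII code minus 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


example = ["@read_1", "ACGT", "+", "II5!"]
for read_id, seq, scores in read_fastq(example):
    print(read_id, seq, scores)  # read_1 ACGT [40, 40, 20, 0]
```

An open file handle can be passed in place of the list. The notebook already imports `SeqIO` from Biopython, and `SeqIO.parse(path, "fastq")` yields the same three pieces of information as `record.id`, `record.seq`, and `record.letter_annotations["phred_quality"]`.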

MPEG-G is to genomics what `.mp4` is to video: a way to store large data efficiently without losing critical information.