# Genomics checkpoint

**Due**: Feb 22 by 11:59 pm

**Points**: 100

As discussed in the [introduction](https://pitt-biosc1540-2024s.oasci.org/assessments/checkpoints/genomics/), we will be training a machine learning model to predict which antibiotics an *E. coli* strain is susceptible to.

The teaching team has designed this checkpoint so that you rarely need to write Python code from scratch.
Throughout the assessment, you will encounter comments such as:

```python
# DO NOT MODIFY CODE BELOW THIS LINE.
```

This directive ensures that the code following the comment remains uniform across all submissions, thereby standardizing the evaluation process.
If you inadvertently alter any part of this pre-written code, or if the teaching team advises a modification, you must restore the original code by copying and pasting the correct version directly from the designated website.

Any rubric item that you need to complete is marked either with a `TODO:` or with an instruction to assign a variable.

## Rubric

Because most of the Python has been written for you, the rubric is based on your decisions and explanations.

| # | Label | Points | Description |
| ------ | ---- | ------ | ----------- |
| 1 | AMR insight | 10 | TODO |
| 2 | Antibiotic selection | 5 | TODO |
| 3 | Gene selection | 30 | TODO |


## Genomes

Our study leverages fully assembled and annotated genomic data from the BioProject [PRJNA278886](https://www.ncbi.nlm.nih.gov/bioproject/278886).
This project has meticulously collected clinical isolates, conducted comprehensive whole genome sequencing (WGS), and pinpointed resistance mechanisms to carbapenemases or $\beta$-lactamases. The samples originate from Brigham & Women's Hospital, located in Boston, MA.

Our analysis will focus on a curated subset of 280 *E. coli* isolates, all collected in 2023. The teaching team has refined the dataset to omit any instances of sparse data. The cleaned dataset is accessible through [a CSV file](https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/blob/main/biosc1540/files/csv/checkpoints/genomics/ecoli-amr-isolates.csv).
In the following cells, we will demonstrate how to load this CSV file and showcase the first `n` isolates to illustrate the data we will analyze.

In [None]:
# DO NOT MODIFY CODE BELOW THIS LINE.
import numpy as np
import pandas as pd

In [None]:
# DO NOT MODIFY CODE BELOW THIS LINE.
ISOLATE_CSV_PATH = "https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/biosc1540/files/csv/checkpoints/genomics/ecoli-amr-isolates.csv"
df_isolates = pd.read_csv(ISOLATE_CSV_PATH)
n_isolates = len(df_isolates)
print(f"There are {n_isolates} isolates in the dataset.")
print(df_isolates.head(n=5))

Our dataset encompasses resistance data against a spectrum of 12 antibiotics, which are critical in the treatment and management of bacterial infections.
These antibiotics include: Amoxicillin-Clavulanic Acid, Cefepime, Cefoxitin, Ceftazidime, Ceftriaxone, Ciprofloxacin, Gentamicin, Levofloxacin, Piperacillin-Tazobactam, Tetracycline, Tobramycin, and Trimethoprim-Sulfamethoxazole.
For each antibiotic, the dataset provides insights into the bacterial isolates' susceptibility profile, which is crucial for understanding resistance patterns and guiding therapeutic decisions.

The dataset categorizes the response of each *E. coli* isolate to these antibiotics using three distinct labels, reflecting the degree of resistance observed. These labels are:

-   `S` (Susceptible): Indicates that the isolate is not resistant to the antibiotic, suggesting that the drug is likely to be effective in treating infections caused by this isolate.
-   `I` (Intermediate): Signifies a moderate level of resistance; the antibiotic may be effective in certain conditions, such as when a higher dose is administered or when drug concentration at the site of infection is optimized.
-   `R` (Resistant): Shows that the isolate has a high level of resistance to the antibiotic, implying that the drug is unlikely to be effective in treating infections caused by this isolate.

This nuanced classification enables healthcare professionals and researchers to make informed decisions regarding antibiotic selection, contributing to the ongoing battle against antibiotic resistance.

## Preliminary analysis

To provide valuable insights from this dataset, we calculate the frequency of each resistance label (`S`, `I`, `R`) for all antibiotics to understand the overall resistance pattern.

In [None]:
# DO NOT MODIFY CODE BELOW THIS LINE.
resistance_summary = df_isolates.iloc[:, 3:].apply(pd.Series.value_counts).T.fillna(0)
print(resistance_summary)
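
Raw counts can be hard to compare at a glance. The sketch below uses a small made-up table, not the course data, to show how a counts table laid out like `resistance_summary` could be converted into row-wise fractions. The `toy_summary` values and the `to_fractions` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical counts laid out like `resistance_summary`:
# one row per antibiotic, one column per label.
toy_summary = pd.DataFrame(
    {"I": [1.0, 0.0], "R": [3.0, 5.0], "S": [6.0, 5.0]},
    index=["drug_a", "drug_b"],
)


def to_fractions(summary):
    """Divide each row by its total so that every row sums to 1."""
    return summary.div(summary.sum(axis=1), axis=0)


toy_fractions = to_fractions(toy_summary)
print(toy_fractions.round(2))
```

Fractions make it easier to see which labels dominate for a given antibiotic, regardless of how many isolates were tested.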

TODO: AMR insight

Write any insights you draw from the data above here.
These can include the classes of antibiotics used, recommendations for hospitals, statistics, etc.

### Select antibiotic

You are tasked with selecting an antibiotic to focus on, which will later facilitate feature extraction and model training.
This step is pivotal for ensuring the effectiveness of your final classifier.
Analyze the provided antibiotic susceptibility data to identify an antibiotic that will simplify the subsequent steps of feature extraction and classifier training.

Your options are:

-   `amoxicillin-clavulanic acid`
-   `cefoxitin`
-   `ceftazidime`
-   `ciprofloxacin`
-   `levofloxacin`
-   `piperacillin-tazobactam`
-   `tetracycline`
-   `tobramycin`

The teaching team arbitrarily chose `piperacillin-tazobactam` as a placeholder.
(Hint: this is not a good choice.)

In [None]:
antibiotic_sel = "piperacillin-tazobactam"

In [None]:
# DO NOT MODIFY CODE BELOW THIS LINE.
def encode_labels(df, antibiotic_sel):
    labels = df[antibiotic_sel].to_numpy()
    resistance_mapping = {"S": 0, "I": 1, "R": 2}
    labels = np.vectorize(resistance_mapping.get)(labels)
    labels = labels.reshape(-1, 1)
    return labels


labels = encode_labels(df_isolates, antibiotic_sel)
print(labels.shape)
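
To see how many isolates fall into each encoded class, `np.unique` with `return_counts=True` can be applied to the label array. The example below is a sketch on made-up labels. `toy_labels` is hypothetical and only mimics the `(n, 1)` shape returned by `encode_labels`.

```python
import numpy as np

# Hypothetical encoded labels with the same (n, 1) shape as `labels`:
# 0 = S, 1 = I, 2 = R.
toy_labels = np.array([0, 0, 2, 1, 0, 2]).reshape(-1, 1)

# np.unique flattens the array and returns each class with its count.
classes, counts = np.unique(toy_labels, return_counts=True)
class_counts = {int(c): int(n) for c, n in zip(classes, counts)}
print(class_counts)
```

A class with very few members gives a classifier little to learn from, which is worth keeping in mind when choosing `antibiotic_sel`.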

## Loading genes

The isolates have 2,113 sequenced genes in common.
However, to simplify downstream analyses, we limit our analysis to the 1,480 genes that have the same length across all isolates.
This allows us to bypass the need to align sequences.
All sequences are stored, in the same isolate order as `df_isolates`, [as NumPy files in a 135.11 MB zip archive](https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/blob/main/large-files/genomics-checkpoint-genes.zip).

We download this file below.

In [None]:
# DO NOT MODIFY CODE BELOW THIS LINE.
!wget https://gitlab.com/oasci/courses/pitt/biosc1540-2024s/-/raw/main/large-files/genomics-checkpoint-genes.zip > /dev/null 2>&1

We extract these files into a directory called `genes`.

In [None]:
# DO NOT MODIFY CODE BELOW THIS LINE.
!unzip genomics-checkpoint-genes.zip > /dev/null 2>&1
!mv genes-array genes

Below is an array of gene names you can use.

In [None]:
gene_list = np.array(
    [
        "fadL",
        "artJ",
        "araD",
        "lysS",
        "cspE",
        "rfaP",
        "bfd",
        "fumB",
        "bfr",
        "folK",
        "fsa",
        "psuG",
        "feoC",
        "baeS",
        "eptA",
        "loiP",
        "hemW",
        "hisA",
        "ilvA",
        "rsmJ",
        "thiB",
        "thiS",
        "flhe",
        "astB",
        "ppiC",
        "thiQ",
        "ada",
        "fadH",
        "murQ",
        "nlpE",
        "fpr",
        "fepG",
        "queE",
        "ubiF",
        "purD",
        "ecnA",
        "hisB",
        "nikR",
        "kbl",
        "fadI",
        "thiM",
        "mtr",
        "priA",
        "gutQ",
        "cysE",
        "lolB",
        "glpB",
        "cysP",
        "queD",
        "entF",
        "ratA",
        "moeB",
        "hemA",
        "panC",
        "cyaY",
        "nikD",
        "kdpE",
        "tauD",
        "aroC",
        "uspG",
        "degS",
        "ilvD",
        "yehX",
        "tusA",
        "ypdE",
        "yhbQ",
        "lldD",
        "entS",
        "torD",
        "panM",
        "pspG",
        "zntA",
        "rsmC",
        "bioF",
        "menC",
        "ytfQ",
        "ugpQ",
        "rfaH",
        "nikE",
        "yajO",
        "ung",
        "hisF",
        "gss",
        "recJ",
        "basR",
        "glpD",
        "ubiH",
        "hycG",
        "rseC",
        "cysT",
        "serC",
        "rhtC",
        "ilvC",
        "entH",
        "proA",
        "cmoB",
        "uhpB",
        "thiP",
        "leuD",
        "kdpC",
        "pmbA",
        "uvrD",
        "lldP",
        "tusD",
        "pspF",
        "rsmB",
        "pldA",
        "polB",
        "rlmG",
        "fepD",
        "livM",
        "ruvX",
        "yccX",
        "poxB",
        "iclR",
        "cdh",
        "pgeF",
        "hyaE",
        "preT",
        "hisH",
        "guaB",
        "tnaB",
        "nikC",
        "metF",
        "zur",
        "araA",
        "cedA",
        "thiE",
        "ribD",
        "zupT",
        "mltB",
        "ugpB",
        "argE",
        "mnmE",
        "prlC",
        "nrdB",
        "purK",
        "ssuC",
        "metH",
        "ybgI",
        "ttdA",
        "livJ",
        "epmC",
        "copA",
        "glyS",
        "hisC",
        "pepP",
        "glpE",
        "livH",
        "norR",
        "glnG",
        "fhuB",
        "actP",
        "entE",
        "yihX",
        "pepB",
        "hprS",
        "ubiX",
        "tsaD",
        "tusE",
        "tsaB",
        "ispH",
        "ruvC",
        "metE",
        "gltB",
        "nuoF",
        "cybC",
        "adeP",
        "tnaC",
        "nfuA",
        "potI",
        "malM",
        "pxpC",
        "cysJ",
        "uxuR",
        "pbpC",
        "ubiI",
        "metR",
        "livG",
        "envC",
        "pxpA",
        "murB",
        "purH",
        "yhhW",
        "entA",
        "alr",
        "kdpD",
        "lpxH",
        "bioC",
        "yegD",
        "gshB",
        "hypD",
        "mscM",
        "fes",
        "ugpE",
        "eptB",
        "glmM",
        "serB",
        "deoB",
        "modB",
        "menH",
        "bioB",
        "lldR",
        "metL",
        "yidZ",
        "srlR",
        "fadE",
        "mobA",
        "ltaE",
        "aroE",
        "fepC",
        "ugpA",
        "yiiM",
        "bglA",
        "cyaA",
        "fhuD",
        "agp",
        "amiA",
        "gltL",
        "yjfP",
        "uhpA",
        "glpA",
        "hypE",
        "yfaE",
        "tus",
        "csiR",
        "ygfZ",
        "pcnB",
        "hycF",
        "deoD",
        "lolD",
        "pepD",
        "nadB",
        "garK",
        "cstA",
        "glpK",
        "cysN",
        "pstS",
        "yqjG",
        "kefB",
        "ispD",
        "parE",
        "wecH",
        "gudD",
        "yeaY",
        "nrfF",
        "glnE",
        "gorA",
        "thiC",
        "rlmB",
        "araB",
        "fliT",
        "pyrI",
        "murD",
        "arfB",
        "hybE",
        "ulaB",
        "rlmJ",
        "hemY",
        "yfhb",
        "panF",
        "yehY",
        "ccmF",
        "pepA",
        "dppB",
        "artM",
        "menF",
        "ileS",
        "rluF",
        "rlmKL",
        "thiF",
        "ftsX",
        "lolE",
        "norW",
        "acnB",
        "pabB",
        "hycE",
        "exuR",
        "recQ",
        "livK",
        "bioH",
        "yehW",
        "deoA",
        "tusC",
        "ttdB",
        "dmsA",
        "cysS",
        "rihA",
        "ygjG",
        "bglX",
        "fklB",
        "cysI",
        "ynjE",
        "malZ",
        "sstT",
        "menD",
        "selB",
        "uvrB",
        "tdh",
        "metC",
        "hisP",
        "truB",
        "caiC",
        "hisG",
        "hycC",
        "manA",
        "moeA",
        "gltS",
        "argT",
        "corA",
        "mdtL",
        "gcvP",
        "recD",
        "yeiG",
        "bcr",
        "speB",
        "flgD",
        "murF",
        "trmH",
        "qseC",
        "hldE",
        "nei",
        "hycB",
        "nfeF",
        "cusR",
        "baeR",
        "eutA",
        "thrA",
        "ligA",
        "ettA",
        "carA",
        "moaD",
        "menB",
        "selD",
        "cecR",
        "asmA",
        "ybfF",
        "cysH",
        "aroG",
        "leuA",
        "iscX",
        "nhaB",
        "miaB",
        "dppD",
        "mepA",
        "amiD",
        "pabA",
        "cpdB",
        "ebgR",
        "glnD",
        "pstA",
        "modC",
        "zapC",
        "rffC",
        "moaA",
        "aer",
        "cca",
        "gntR",
        "trmA",
        "speA",
        "rffA",
        "gntU",
        "rffT",
        "motB",
        "lysO",
        "allC",
        "ycdX",
        "hofM",
        "lpxB",
        "hyaB",
        "acs",
        "bioA",
        "glpX",
        "fruA",
        "asnB",
        "phoU",
        "mutL",
        "gltJ",
        "queA",
        "yojI",
        "pyrE",
        "flgI",
        "pncC",
        "gph",
        "araJ",
        "hemH",
        "uxaA",
        "nirD",
        "rfaF",
        "moaE",
        "exuT",
        "hycD",
        "aat",
        "ilvE",
        "zapE",
        "ubiD",
        "ldtE",
        "ycbX",
        "glnB",
        "cobA",
        "tsx",
        "sdhA",
        "thiG",
        "pckA",
        "nuoE",
        "ogt",
        "glmS",
        "purF",
        "phoA",
        "gltK",
        "sltY",
        "gcvH",
        "mddA",
        "gltI",
        "speE",
        "pspA",
        "frdD",
        "fumC",
        "tyrB",
        "fldA",
        "dxs",
        "arnF",
        "ruvB",
        "lpxK",
        "dnaN",
        "sufA",
        "treA",
        "yebT",
        "rnr",
        "msrQ",
        "creB",
        "fetA",
        "hybC",
        "pepE",
        "rph",
        "cnoX",
        "yajL",
        "feoB",
        "slmA",
        "hydN",
        "rlmI",
        "flgF",
        "bacA",
        "ybbO",
        "dsbC",
        "tsaA",
        "ansP",
        "ccmH",
        "ypfH",
        "aceA",
        "kefF",
        "srlD",
        "cheB",
        "era",
        "slyB",
        "malY",
        "nuoG",
        "comR",
        "sseA",
        "ptrA",
        "murE",
        "mpaA",
        "nrdG",
        "osmY",
        "pat",
        "thrB",
        "mukB",
        "pstC",
        "dapE",
        "znuB",
        "amyA",
        "nudG",
        "gpmB",
        "cyoE",
        "pdxJ",
        "recN",
        "ccmB",
        "rluD",
        "mutH",
        "pgl",
        "fadB",
        "yciH",
        "cusB",
        "ribB",
        "hisQ",
        "tag",
        "pyrF",
        "fadM",
        "queG",
        "hdfR",
        "fhuC",
        "trmL",
        "hyaA",
        "napA",
        "der",
        "hcr",
        "gloC",
        "nagC",
        "dapF",
        "prmC",
        "mglA",
        "gutM",
        "nagE",
        "pheA",
        "nirC",
        "glnA",
        "srmB",
        "nagZ",
        "pgm",
        "rapA",
        "entC",
        "hxpB",
        "dtpB",
        "tusB",
        "hxpA",
        "degQ",
        "hemN",
        "tyrP",
        "allD",
        "yedE",
        "aqpZ",
        "pnp",
        "nudB",
        "ffh",
        "crl",
        "ghrA",
        "galK",
        "ybiV",
        "yedA",
        "dapD",
        "hemC",
        "ydiB",
        "uxuA",
        "malE",
        "ucpA",
        "aceB",
        "malF",
        "cysC",
        "yjeH",
        "pepQ",
        "flgN",
        "argB",
        "orn",
        "yfeX",
        "xerC",
        "nuoM",
        "nudK",
        "secA",
        "yraP",
        "ldcA",
        "gltA",
        "glsA",
        "potD",
        "nadK",
        "pxpB",
        "mraY",
        "napH",
        "ybiI",
        "pfkB",
        "hybB",
        "selA",
        "dgkA",
        "argG",
        "menI",
        "alx",
        "wzyE",
        "lolC",
        "clpX",
        "dbpA",
        "queF",
        "msrP",
        "shoB",
        "lapB",
        "amiC",
        "hscA",
        "cydC",
        "modA",
        "flgJ",
        "bioP",
        "iadA",
        "treR",
        "ybjI",
        "maeB",
        "pspD",
        "opgB",
        "rraB",
        "viaA",
        "katG",
        "mak",
        "trxB",
        "lpcA",
        "truC",
        "ldcC",
        "hemL",
        "proS",
        "argP",
        "nrfD",
        "mutM",
        "yobA",
        "glk",
        "argC",
        "gcvT",
        "rutR",
        "mltC",
        "qseE",
        "yejF",
        "nrdD",
        "appB",
        "moaB",
        "oppB",
        "metA",
        "gpsA",
        "malP",
        "allR",
        "yebS",
        "proB",
        "dadA",
        "exbD",
        "fdoH",
        "rnhB",
        "dauA",
        "rof",
        "gpmM",
        "yeiR",
        "nuoC",
        "msrB",
        "aaeR",
        "folD",
        "glpG",
        "frsA",
        "nuoN",
        "galT",
        "yidA",
        "rhtB",
        "yciA",
        "thiD",
        "epmB",
        "panD",
        "btuE",
        "gppA",
        "mlaA",
        "trpR",
        "hybO",
        "gudP",
        "yaaA",
        "gntK",
        "sapC",
        "acpT",
        "queC",
        "fre",
        "metK",
        "trmG_rlmN",
        "gsk",
        "yfcG",
        "artQ",
        "rsxD",
        "ribF",
        "eco",
        "sbcD",
        "ppc",
        "ybaK",
        "clcA",
        "psd",
        "cysD",
        "tyrA",
        "cysQ",
        "aegA",
        "mdtJ",
        "gpmA",
        "fadD",
        "yhjD",
        "dnaT",
        "ubiA",
        "robA",
        "rarA",
        "ispF",
        "dnaQ",
        "pyrD",
        "wzxE",
        "purU",
        "rstB",
        "nirB",
        "pncB",
        "nfi",
        "pqiC",
        "mioC",
        "nemR",
        "citF",
        "hflD",
        "metB",
        "fabB",
        "astE",
        "cheR",
        "mdaB",
        "lepA",
        "pdxA",
        "murJ",
        "trpS",
        "hypB",
        "pgi",
        "nuoL",
        "dmsD",
        "yajQ",
        "soxS",
        "celF",
        "dacB",
        "fxsA",
        "gltP",
        "damX",
        "glgP",
        "rffM",
        "igaA",
        "kdpF",
        "djlA",
        "hprR",
        "dcd",
        "lysA",
        "rnd",
        "ndh",
        "rsgA",
        "phoP",
        "glnS",
        "mzrA",
        "hemB",
        "ivy",
        "narQ",
        "uvrC",
        "kduD",
        "wzzE",
        "nadA",
        "helD",
        "arnT",
        "tcyP",
        "btuD",
        "recF",
        "purR",
        "yigB",
        "ybgE",
        "fbaA",
        "glnL",
        "glyA",
        "dnaJ",
        "holB",
        "rsuA",
        "polA",
        "ahpF",
        "pheS",
        "folA",
        "ndk",
        "tolC",
        "waaA",
        "rsxG",
        "pnuC",
        "psiF",
        "dsbA",
        "entB",
        "folM",
        "srlB",
        "rpoH",
        "lamB",
        "ortT",
        "ssb1",
        "rcdA",
        "tamA",
        "galM",
        "aas",
        "ftsP",
        "gsiA",
        "oppD",
        "accA",
        "rnc",
        "nrfC",
        "tldD",
        "yieF",
        "menA",
        "oxc",
        "crcB",
        "citT",
        "ybiB",
        "hiuH",
        "adk",
        "gyrB",
        "ppa",
        "fdhE",
        "rutA",
        "nudJ",
        "kefG",
        "fdx",
        "truD",
        "flk",
        "holA",
        "prmA",
        "dtd",
        "garL",
        "hisJ",
        "smpB",
        "endA",
        "pyrB",
        "sapB",
        "tapT",
        "yehT",
        "rstA",
        "hofP",
        "tyrR",
        "inaA",
        "grxC",
        "pyk",
        "mrcA",
        "rluC",
        "plsB",
        "lplT",
        "qseG",
        "ybdG",
        "hslO",
        "murI",
        "tsaC",
        "mppA",
        "sapF",
        "ampG",
        "ybhA",
        "sfsB",
        "ghrB",
        "mdtG",
        "lysR",
        "sugE",
        "aaeA",
        "anmK",
        "plaP",
        "mgsA",
        "yedF",
        "clpA",
        "ispA",
        "mnmA",
        "ygfB",
        "hycH",
        "prfA",
        "asnC",
        "hybG",
        "copD",
        "manZ",
        "alaA",
        "gntT",
        "csrD",
        "aspS",
        "fadR",
        "rimI",
        "cysZ",
        "cbpM",
        "ygiW",
        "sthA",
        "xanP",
        "deoR",
        "hypC",
        "hemP",
        "fixX",
        "pgpB",
        "rlpA",
        "rpoN",
        "ydiK",
        "nupG",
        "ampD",
        "btuC",
        "ccmA",
        "clpB",
        "msrC",
        "narP",
        "yqhH",
        "kdgK",
        "metN",
        "nuoJ",
        "bglJ",
        "tatD",
        "malG",
        "ubiJ",
        "surE",
        "coaBC",
        "aaeB",
        "ldtD",
        "zntB",
        "ldtB",
        "yeaG",
        "znuC",
        "tar",
        "tesB",
        "tpiA",
        "rlmD",
        "hybA",
        "ppx",
        "kbp",
        "ispG",
        "narL",
        "uspC",
        "mnmG",
        "nadD",
        "ddlA",
        "recR",
        "trxC",
        "thiL",
        "yccA",
        "mdtH",
        "glnQ",
        "cpxA",
        "cheZ",
        "msbA",
        "holD",
        "oxyR",
        "sppA",
        "yejK",
        "pstB",
        "sucA",
        "aspC",
        "sodC",
        "srkA",
        "guaC",
        "potA",
        "bamD",
        "cls",
        "ampE",
        "pspC",
        "xerD",
        "psiE",
        "topA",
        "secM",
        "nudE",
        "iscS",
        "hisS",
        "greB",
        "ynaI",
        "rssA",
        "ycaR",
        "pdeH",
        "sapD",
        "lepB",
        "hemD",
        "rsmE",
        "nadC",
        "gsiC",
        "rnt",
        "mukE",
        "rutC",
        "pbgA",
        "fkpB",
        "wecA",
        "caiT",
        "glgB",
        "pdxH",
        "accC",
        "brnQ",
        "mog",
        "folE",
        "btsT",
        "yccS",
        "yfbR",
        "hflX",
        "coaD",
        "trmB",
        "rsmG",
        "npr",
        "marR",
        "acrD",
        "gloA",
        "purE",
        "ccmC",
        "purM",
        "ulaR",
        "leuS",
        "qseB",
        "yhgN",
        "stpA",
        "argA",
        "chbA",
        "uraA",
        "iraP",
        "parC",
        "mlaB",
        "cfa",
        "thpR",
        "truA",
        "pgsA",
        "satP",
        "fabD",
        "cmoA",
        "kdsA",
        "murC",
        "glgX",
        "ftsN",
        "malK",
        "tolQ",
        "mug",
        "pspB",
        "cra",
        "lapA",
        "pspE",
        "exbB",
        "speD",
        "dsbE",
        "folC",
        "rseB",
        "rsmH",
        "kduI",
        "pssA",
        "modE",
        "pldB",
        "frdC",
        "cysK",
        "cmk",
        "mrdB",
        "appC",
        "plsY",
        "ftsH",
        "slyD",
        "kup",
        "hexR",
        "yrfG",
        "ispE",
        "exoX",
        "prfB",
        "secD",
        "lgt",
        "htpG",
        "narX",
        "sspB",
        "xylA",
        "cbpA",
        "hslU",
        "ydfG",
        "elbB",
        "moaC",
        "dpiA",
        "lptE",
        "purA",
        "gloB",
        "murG",
        "rssB",
        "pntA",
        "matP",
        "zapD",
        "ftsE",
        "glnK",
        "ychE",
        "rsxA",
        "cpdA",
        "nudC",
        "tatB",
        "hslJ",
        "rbsD",
        "ppiD",
        "recX",
        "ruvA",
        "cydB",
        "nuoH",
        "ccmD",
        "elaB",
        "pqiA",
        "adhE",
        "thyA",
        "alaS",
        "mrdA",
        "yqaB",
        "mntR",
        "aroD",
        "pntB",
        "pabC",
        "dacA",
        "dsbB",
        "tatC",
        "rsxE",
        "map",
        "dacC",
        "oppA",
        "syd",
        "oppC",
        "aroF",
        "galR",
        "nrfA",
        "zwf",
        "mdh",
        "ppiB",
        "fabH",
        "dcrB",
        "ybjG",
        "purC",
        "marA",
        "recO",
        "mdtD",
        "arcB",
        "mtfA",
        "mlaD",
        "trkH",
        "fruB",
        "yrbL",
        "emtA",
        "htpX",
        "citX",
        "nagB",
        "minC",
        "cytR",
        "asnS",
        "tgt",
        "secG",
        "ftsZ",
        "arfA",
        "yceG",
        "pth",
        "rsmA",
        "yggS",
        "yajC",
        "grcA",
        "btsS",
        "thiK",
        "lon",
        "cyoB",
        "mreD",
        "znuA",
        "sdhE",
        "hda",
        "tdk",
        "narK",
        "fkpA",
        "nuoK",
        "pyrG",
        "blr",
        "tolB",
        "ispC",
        "sodB",
        "yhdE",
        "pqiB",
        "cheY",
        "efp",
        "uspD",
        "sfsA",
        "glnP",
        "napC",
        "hscB",
        "udk",
        "accD",
        "lexA",
        "ubiC",
        "fldB",
        "ybfE",
        "rpiA",
        "fabR",
        "sanA",
        "cydA",
        "kdsC",
        "fnr",
        "pdhR",
        "wrbA",
        "eno",
        "uspA",
        "galU",
        "tsaE",
        "ispB",
        "recA",
        "lspA",
        "epmA",
        "clpP",
        "rsmD",
        "pitA",
        "yqiA",
        "dnaC",
        "hycA",
        "yfgM",
        "mlaC",
        "suhB",
        "bhsA",
        "cdd",
        "epd",
        "ftsA",
        "folX",
        "ftsB",
        "can",
        "nuoB",
        "lpxD",
        "tyrS",
        "napB",
        "aroB",
        "pcm",
        "bamB",
        "frdA",
        "upp",
        "rcsF",
        "lipB",
        "hflC",
        "murA",
        "yhcN",
        "yeiP",
        "yidC",
        "mglB",
        "napG",
        "ptsN",
        "cheW",
        "fur",
        "hflK",
        "mltD",
        "cdsA",
        "gpt",
        "cbdX",
        "surA",
        "mlaF",
        "dnaA",
        "kdsD",
        "iscA",
        "diaA",
        "ppnP",
        "hinT",
        "mepM",
        "ivbL",
        "sulA",
        "yceF",
        "rho",
        "sdhD",
        "ubiB",
        "cpoB",
        "sdhB",
        "fabI",
        "yfcD",
        "rpoB",
        "purN",
        "dnaG",
        "rseA",
        "kdgA",
        "argR",
        "mdtI",
        "lipA",
        "metJ",
        "mdoG",
        "manY",
        "rplA",
        "rpoC",
        "manX",
        "atpA",
        "lptA",
        "proQ",
        "folB",
        "mgrB",
        "holE",
        "sxy",
        "luxS",
        "rapZ",
        "ilvL",
        "asd",
        "accB",
        "ilvM",
        "glyQ",
        "mtnN",
        "ecnB",
        "pal",
        "holC",
        "trmD",
        "grxD",
        "nusB",
        "minE",
        "atpH",
        "acrA",
        "dinI",
        "corC",
        "hspQ",
        "xseB",
        "ybeY",
        "fdoI",
        "lpxC",
        "rraA",
        "uspF",
        "artP",
        "nsrR",
        "pgk",
        "erpA",
        "rplY",
        "bamC",
        "pfkA",
        "ydgT",
        "btuF",
        "tesA",
        "pmrR",
        "rpoD",
        "minD",
        "alaE",
        "coaE",
        "nusG",
        "fieF",
        "pflB",
        "iscU",
        "priB",
        "nudF",
        "ftnA",
        "mepS",
        "yacG",
        "skp",
        "lolA",
        "cvpA",
        "dnaK",
        "fruK",
        "lpxA",
        "mlaE",
        "yfcE",
        "hupB",
        "nrdR",
        "ribA",
        "yjfN",
        "ubiE",
        "miaA",
        "rpsE",
        "yebG",
        "pheL",
        "rimM",
        "dam",
        "ibpA",
        "uof",
        "atpB",
        "dcuA",
        "mreB",
        "secB",
        "iscR",
        "grxB",
        "hslV",
        "greA",
        "gmk",
        "ptsG",
        "glgA",
        "frdB",
        "hpt",
        "yaiA",
        "lpxL",
        "atpD",
        "mgtS",
        "ftsQ",
        "tpx",
        "maoP",
        "sspA",
        "crp",
        "lptB",
        "rbfA",
        "osmE",
        "hfq",
        "fabZ",
        "dps",
        "ynhF",
        "rhoL",
        "seqA",
        "dksA",
        "aspA",
        "cutA",
        "ybgC",
        "uspB",
        "rplX",
        "rsfS",
        "feoA",
        "nuoI",
        "pyrH",
        "tsf",
        "plsX",
        "gapA",
        "dut",
        "mntS",
        "napD",
        "nuoA",
        "rplT",
        "atpF",
        "rpsL",
        "sdhC",
        "ompX",
        "cpxP",
        "dsrB",
        "mgtL",
        "secY",
        "cydX",
        "rplK",
        "lptC",
        "ptsI",
        "cspD",
        "ackA",
        "rpsF",
        "secE",
        "pheM",
        "bssR",
        "fusA",
        "rplO",
        "rimP",
        "rplI",
        "zapA",
        "ispU",
        "acrZ",
        "prs",
        "rpsK",
        "ypdK",
        "fis",
        "trxA",
        "frr",
        "yceD",
        "tolR",
        "yncL",
        "rnk",
        "hns",
        "rpoZ",
        "cspA",
        "rpsT",
        "rpmA",
        "ahpC",
        "rpsA",
        "aroK",
        "rplC",
        "tatA",
        "bssS",
        "yqgB",
        "rpsB",
        "apaG",
        "ydfZ",
        "rlmE",
        "zapB",
        "atpI",
        "rppH",
        "yodD",
        "rplF",
        "rpoE",
        "hupA",
        "rpsJ",
        "ibaG",
        "rpmB",
        "rplS",
        "infC",
        "rnpA",
        "rplL",
        "yoeI",
        "rpsS",
        "bcp",
        "rplN",
        "ihfB",
        "rplQ",
        "atpG",
        "tatE",
        "rplP",
        "rplD",
        "rpmG",
        "nlpI",
        "hpf",
        "rpsI",
        "ygdR",
        "rmf",
        "rpmI",
        "atpC",
        "rplJ",
        "rplU",
        "rplR",
        "rseD",
        "rpsD",
        "rpsP",
        "rpsN",
        "bolA",
        "rpmC",
        "pyrL",
        "rpoA",
        "rplB",
        "lpp",
        "rpsH",
        "rplE",
        "ptsH",
        "rpsG",
        "rplW",
        "rpmE",
        "rpsO",
        "rplV",
        "chbB",
        "csrA",
        "atpE",
        "yhbY",
        "rpsC",
        "rplM",
        "rpmD",
        "rpsU",
        "infA",
        "acpP",
        "ihfA",
        "ypfM",
        "rpmH",
        "rpmF",
        "rpsR",
        "rpsQ",
        "rpmJ",
        "yrbN",
    ]
)

You can learn more about each gene by searching for its name in [Gene](https://www.ncbi.nlm.nih.gov/gene/).
For example, we see that [fadL](https://www.ncbi.nlm.nih.gov/gene/946820) is an outer membrane protein involved with the uptake of long-chain fatty acids.

The following cells provide code to load in a gene to a NumPy array.

In [None]:
# DO NOT MODIFY CODE BELOW THIS LINE.
def load_gene_data(gene_name):
    """Loads all variants of gene into a NumPy array."""
    return np.load(f"genes/{gene_name}.npy")

In [None]:
gene_fadL = load_gene_data("fadL")
print(gene_fadL)
print(gene_fadL.shape)
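
One quick way to summarize a loaded gene is to count how many distinct sequence variants it contains across isolates. The sketch below assumes the array has one row per isolate and one single-character nucleotide per column. That layout is suggested by how the plotting code later treats these arrays, but it is an assumption here, and `toy_gene` is made-up data.

```python
from collections import Counter

import numpy as np

# Hypothetical gene array: 4 isolates (rows) x 5 nucleotide positions (columns).
toy_gene = np.array(
    [
        list("ACGTA"),
        list("ACGTA"),
        list("ACGTT"),
        list("ACCTA"),
    ]
)

# Join each row into a string so identical sequences can be counted together.
variant_counter = Counter("".join(row) for row in toy_gene)
n_variants = len(variant_counter)
print(f"{n_variants} distinct variants among {toy_gene.shape[0]} isolates")
print(variant_counter.most_common())
```

A gene with a single variant carries no information that could separate susceptible from resistant isolates.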

## Selecting genes

The ability of pathogens to evolve resistance to antibiotics can be traced back to genetic variations in specific genes.
These variations might include mutations that alter the target site of an antibiotic, the activation of efflux pumps that expel the antibiotic from the cell, or the acquisition of genes that degrade or modify the antibiotic, rendering it ineffective.
Understanding the genetic basis of antibiotic resistance is essential for developing new therapeutic strategies and predicting the emergence of resistance in bacterial populations.

### Visualize

Plotting the multiple sequence alignment (MSA) provides a visual representation of the genetic landscape of antibiotic resistance genes, highlighting areas of high variability that may be hotspots for the development of resistance.
This visualization can be particularly illuminating, as it lets you see the correlation between genetic variations and resistance phenotypes, reinforcing the concept that genetic changes drive antibiotic resistance.
Furthermore, this approach can be extended to include computational models that predict resistance based on genetic data, offering a comprehensive toolset for tackling antibiotic resistance from a genetic perspective.

The next cell sets up code that will plot the gene alignment for you.

In [None]:
# DO NOT MODIFY CODE BELOW THIS LINE.

import math
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap


def numberizer(msa_array):
    nucleotide_to_number = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}
    msa_array = np.vectorize(nucleotide_to_number.get)(msa_array)
    return msa_array


def plot_alignment(msa_array, start_seq, n_seq, nuc_per_ax=200):
    msa_array = numberizer(msa_array)
    seq_stop = start_seq + n_seq

    msa_array = msa_array[start_seq:seq_stop]

    n_nuc = msa_array.shape[1]
    n_axes = math.ceil(n_nuc / nuc_per_ax)

    custom_cmap = ListedColormap(["#264653", "#f94144", "#f9c74f", "#43aa8b"])

    fig, axs = plt.subplots(n_axes, 1, figsize=(10, 3 * n_axes), sharex=True)
    axs = np.atleast_1d(axs)  # Ensure axs is always an array for consistency

    nuc_start = 0
    for ax in axs:
        nuc_stop = min(nuc_start + nuc_per_ax, n_nuc)
        seq_sliced = msa_array[:, nuc_start:nuc_stop]  # Correct slicing

        im = ax.imshow(
            seq_sliced,  # Rows are sequences, columns are nucleotide positions
            cmap=custom_cmap,
            aspect="auto",
            interpolation="none",
        )

        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_ylabel(f"Seq {start_seq} to {seq_stop}")
        ax.set_title(f"Positions {nuc_start + 1} to {nuc_stop}")
        nuc_start += nuc_per_ax

    plt.tight_layout()
    plt.show()

Now, let's pick an example gene to plot.

In [None]:
gene_argE = load_gene_data("argE")
plot_alignment(gene_argE, 0, 50)

### Analyze variability

In this part, you are tasked with identifying genes that may play a critical role in determining the susceptibility of bacterial strains to antibiotics.
This step is crucial for developing an effective classifier that can predict antibiotic resistance or susceptibility based on genetic information.
Utilizing the NumPy array of DNA sequences provided, your goal is to uncover specific genes or genetic markers that correlate strongly with the antibiotic susceptibility outcomes you are studying.

For example, the next cell converts the `argE` alignment into numerical features with `numberizer`.

In [None]:
gene_argE = load_gene_data("argE")
gene_argE_feat = numberizer(gene_argE)

print(gene_argE)
print(gene_argE_feat)

The teaching team encourages you to base your gene selection on a creative, scientifically motivated analysis.
As an example of the workflow, the next cell computes the length of each gene and selects the genes shorter than 100 nucleotides.
(Note: This is just a demonstration and not a good selection criterion.)

In [None]:
desc_all = []
for gene_name in gene_list:
    gene_data = load_gene_data(gene_name)
    # PUT YOUR ANALYSIS HERE INSTEAD OF THE LINE BELOW.
    gene_desc = gene_data.shape[1]

    desc_all.append(gene_desc)

desc_all = np.array(desc_all)
print(desc_all)

gene_idxs = np.argwhere(desc_all < 100).ravel()
print(repr(gene_list[gene_idxs]))
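
As one hedged alternative to gene length, a descriptor could measure how much a gene varies across isolates.
The sketch below uses a small made-up alignment (not course data) and a hypothetical helper, `fraction_variable_columns`, that reports the fraction of alignment columns containing more than one distinct nucleotide.

```python
import numpy as np


def fraction_variable_columns(msa_array):
    """Return the fraction of alignment columns with more than one distinct character.

    Hypothetical helper for illustration; it is not part of the provided notebook code.
    """
    msa_array = np.asarray(msa_array)
    # A column is variable when any row differs from the first row.
    is_variable = (msa_array != msa_array[0]).any(axis=0)
    return float(is_variable.mean())


# Toy alignment: 3 sequences (rows) by 4 positions (columns).
toy_msa = np.array(
    [
        ["A", "C", "G", "T"],
        ["A", "C", "G", "A"],
        ["A", "T", "G", "T"],
    ]
)
toy_variability = fraction_variable_columns(toy_msa)
print(toy_variability)  # 0.5: the second and fourth columns vary
```

In the loop above, a value like this could take the place of `gene_data.shape[1]` as `gene_desc`, assuming each loaded array has one row per isolate and one column per aligned position.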

### Selection

In the `gene_selection` list, choose which genes you would like to train your model on.
The genes listed below are arbitrary placeholders; replace them with your own selection.

In [None]:
gene_selection = [
    "tnaC",
    "shoB",
    "kdpF",
    "cbdX",
    "ivbL",
    "ilvL",
    "pmrR",
    "pheL",
    "uof",
    "mgtS",
    "ynhF",
    "mgtL",
    "pheM",
    "ypdK",
    "yncL",
    "yoeI",
    "ypfM",
    "yrbN",
]


n_genes = len(gene_selection)
print(f"There are {n_genes} genes selected: {gene_selection}")

### Featurization

Select your value of `k`, the length of the k-mers used as features.

In [None]:
k = 6

In [None]:
import itertools
from multiprocessing import Pool

In [None]:
def generate_all_kmers(k):
    """
    Generate all possible k-mers of length k from the nucleotides A, C, G, and T.

    Args:
        k (int): Length of the k-mer.

    Returns:
        list: A list of strings, each representing a possible k-mer.
    """
    nucleotides = ["A", "C", "G", "T"]
    return ["".join(p) for p in itertools.product(nucleotides, repeat=k)]

<details><summary>generate_all_kmers</summary>

The algorithm is a Python list comprehension that generates all possible combinations of nucleotides of length `k`, where `k` is a given integer.
This specific line of code uses the `itertools.product` function to produce the Cartesian product of a sequence of nucleotides repeated `k` times.
Let's break down how it works step by step:

**Components of the Algorithm**

1.  `nucleotides`: A list containing the four DNA nucleotides: `['A', 'C', 'G', 'T']`.
2.  `itertools.product`: A function from the `itertools` module in Python's standard library that computes the Cartesian product of input iterables. The `repeat=k` argument specifies how many times the input iterable (in this case, the list of nucleotides) should be repeated in the product. The result is an iterator of tuples, where each tuple is an ordered sequence of `k` nucleotides.
3.  List comprehension: A concise way to create lists in Python. The expression `[''.join(p) for p in itertools.product(nucleotides, repeat=k)]` iterates over each tuple `p` produced by `itertools.product(nucleotides, repeat=k)`, joins the elements of the tuple into a string using `''.join(p)`, and collects these strings into a new list.

**How It Works**

- The `itertools.product` function is called with the list of nucleotides and the `repeat=k` argument. This generates all possible combinations of the nucleotides repeated `k` times. For example, if `k=2`, the product would include combinations like `('A', 'A')`, `('A', 'C')`, ..., `('T', 'T')`.
- Each tuple `p` in the iterator returned by `itertools.product` represents a unique combination of nucleotides of length `k`. The tuples might look like `('A', 'C')`, `('G', 'T')`, etc., depending on the value of `k`.
- The list comprehension iterates over these tuples. For each tuple `p`, the `''.join(p)` method is called, which concatenates the elements of the tuple into a single string. This operation transforms a tuple of nucleotides like `('A', 'C')` into a string `"AC"`.
- Finally, the list comprehension collects all these strings into a new list, which represents all possible k-mers of length `k` made up from the nucleotides 'A', 'C', 'G', 'T'.

**Example**

For `k=2`, the output would be a list of all possible 2-mer combinations of the nucleotides 'A', 'C', 'G', 'T', such as `["AA", "AC", "AG", "AT", "CA", "CC", ..., "TT"]`.

</details>
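
A quick sanity check on the size of the feature space: with four nucleotides there are $4^k$ possible k-mers, so the list grows quickly with `k`.
The sketch below restates the one-line generator so that it runs on its own; the names `k_value` and `kmer_counts_by_k` are only for this illustration.

```python
import itertools

nucleotides = ["A", "C", "G", "T"]

# Number of possible k-mers for a few values of k.
kmer_counts_by_k = {}
for k_value in (1, 2, 3, 6):
    kmers = ["".join(p) for p in itertools.product(nucleotides, repeat=k_value)]
    kmer_counts_by_k[k_value] = len(kmers)
    assert len(kmers) == 4**k_value

print(kmer_counts_by_k)  # {1: 4, 2: 16, 3: 64, 6: 4096}
```

With `k = 6`, each gene therefore contributes 4096 k-mer count features per isolate.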

In [None]:
def create_kmer_mapping(all_kmers):
    """
    Create a mapping from each k-mer to its index, facilitating quick lookups.

    Args:
        all_kmers (list): List of all possible k-mers generated by `generate_all_kmers`.

    Returns:
        dict: A dictionary with k-mers as keys and their respective indices as values.
    """
    return {kmer: i for i, kmer in enumerate(all_kmers)}

<details><summary>create_kmer_mapping</summary>

The algorithm is a Python dictionary comprehension that creates a dictionary mapping each k-mer to its index within a list of all possible k-mers.
Let's break down how this algorithm works, focusing on its components and functionality.

**Components of the Algorithm**

1. `all_kmers`: A list that contains all possible k-mer strings. A k-mer is a substring of length `k` derived from a longer sequence, commonly used in bioinformatics for analyzing genetic sequences.
2. `enumerate(all_kmers)`: The `enumerate` function takes an iterable (in this case, `all_kmers`) and returns an iterator that produces pairs of an index (`i`) and the value at that index (`kmer`) from the iterable. The index starts from 0 by default.
3. Dictionary Comprehension: `{kmer: i for i, kmer in enumerate(all_kmers)}` is a dictionary comprehension, a concise way to create dictionaries in Python. This particular comprehension iterates over each index-value pair produced by `enumerate(all_kmers)`.

**How It Works**

- For each iteration, `enumerate(all_kmers)` provides an index (`i`) and the value at that index (`kmer`), which is a specific k-mer string from the `all_kmers` list.
- The dictionary comprehension takes each `i, kmer` pair and constructs a key-value pair in the resulting dictionary, where `kmer` is the key and `i` is the value. This means each k-mer string from the `all_kmers` list is mapped to its corresponding index in that list.
- The process repeats for every item in `all_kmers`, ensuring every k-mer is included in the dictionary with its index as the value.

**Purpose and Usage**

The purpose of this algorithm is to create a quick lookup table where each k-mer can be identified by its position in the list of all k-mers. This is particularly useful in bioinformatics and computational biology for tasks such as k-mer counting, sequence alignment, and genome assembly, where knowing the index of a k-mer in a comprehensive list can be crucial for analysis and comparison.

**Example**

Suppose `all_kmers` is `['AA', 'AC', 'AG', 'AT']`. The resulting dictionary would be:

```python
{
    'AA': 0,
    'AC': 1,
    'AG': 2,
    'AT': 3
}
```

This dictionary maps each 2-mer to its index in the list `all_kmers`.

**Conclusion**

This algorithm efficiently maps each k-mer to its index in a given list of k-mers, facilitating rapid index lookups and offering a compact way to reference k-mers by their position in a predefined list. This mapping is highly beneficial for numerous applications in genomics and bioinformatics where such operations are frequently required.

</details>

In [None]:
def count_kmers_seq(sequence, k, kmer_mapping):
    """
    Count the occurrences of each k-mer in a given sequence.

    Args:
        sequence (str): The DNA sequence to count k-mers in.
        k (int): The k-mer size.
        kmer_mapping (dict): A pre-generated mapping of k-mers to indices.

    Returns:
        np.ndarray: An array of counts for each k-mer.
    """
    if not isinstance(sequence, str):
        sequence = "".join(sequence)
    kmer_counts = np.zeros(len(kmer_mapping), dtype=np.int32)

    for i in range(len(sequence) - k + 1):
        kmer = sequence[i : i + k]
        index = kmer_mapping.get(kmer)
        if index is not None:
            kmer_counts[index] += 1

    return kmer_counts

<details><summary>count_kmers_seq</summary>

This algorithm is designed to count the occurrences of each k-mer within a given DNA sequence.
It leverages a mapping where each k-mer is associated with a specific index, and it uses this mapping to efficiently tally the counts of each k-mer in an array.
Let's break down how it operates step-by-step.

**Components of the Algorithm**

1. `kmer_counts`: An array initialized with zeros, created using NumPy. The length of this array is equal to the number of unique k-mers in the `kmer_mapping`. Each position in the array is intended to hold the count of occurrences of the corresponding k-mer in the sequence. The `dtype=np.int32` specifies that the integers are 32-bit, optimizing memory usage.
2. `sequence`: A string representing the DNA sequence from which k-mers will be counted. DNA sequences consist of nucleotides denoted by the letters A, C, G, and T.
3. `k`: The length of the k-mers to be counted.
4. `kmer_mapping`: A dictionary mapping each k-mer to a unique index. This mapping is used to identify the position in the `kmer_counts` array where the count for each k-mer should be incremented.

**How It Works**

1.  The algorithm begins by creating an array of zeros, `kmer_counts`, with a size equal to the number of entries in `kmer_mapping`.
    This array is prepared to store the count of each k-mer found in the sequence.
2.  It then iterates through the sequence, starting from the first nucleotide and stopping at the last position where a full k-mer of length `k` can still be obtained.
    This is done by looping `i` over `range(len(sequence) - k + 1)`, that is, from 0 through `len(sequence) - k` inclusive. For each position `i`, a substring of length `k` (`kmer`) is extracted from the sequence.
3.  For each extracted k-mer, the algorithm looks up the corresponding index in `kmer_mapping` using `kmer_mapping.get(kmer)`.
    This method returns `None` if the k-mer is not found, ensuring that only k-mers present in the mapping are considered.
4.  If the k-mer is found in the mapping (`index` is not `None`), the count in the `kmer_counts` array at the position specified by the index is incremented by 1. This effectively tallies the occurrence of each k-mer.
5.  After iterating through the entire sequence and updating the counts for each k-mer found, the algorithm returns the `kmer_counts` array. This array now contains the total counts of each k-mer as per their mapping indices.

**Purpose and Usage**

This algorithm is particularly useful in bioinformatics for analyzing the composition and frequency of k-mers within genomic sequences. Understanding the abundance of various k-mers is crucial for tasks such as genome assembly, sequence alignment, and motif discovery.
By using a mapping and a count array, the algorithm achieves efficient and fast k-mer counting, which is essential when dealing with large genomic datasets.

**Example**

Consider a short sequence `ACGTAC` with `k=2` and a `kmer_mapping` of `{'AC': 0, 'CG': 1, 'GT': 2, 'TA': 3}`.
The algorithm would produce a `kmer_counts` array `[2, 1, 1, 1]`: `AC` appears twice (at the start and at the end of the sequence), while `CG`, `GT`, and `TA` each appear once.

This algorithm efficiently maps the complex problem of k-mer counting to a simple array operation, leveraging the power of Python dictionaries for fast lookup and NumPy arrays for efficient numerical operations.
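
The sliding window can also be cross-checked with the standard library.
The sketch below is an independent illustration using `collections.Counter` on a made-up sequence; it is not the notebook's implementation. It shows that overlapping windows are counted separately and that a sequence of length $n$ yields $n - k + 1$ windows.

```python
from collections import Counter

toy_sequence = "ATATAT"  # made-up example sequence
window = 2

# Slide a window of length `window` one nucleotide at a time.
windows = [toy_sequence[i : i + window] for i in range(len(toy_sequence) - window + 1)]
toy_counts = Counter(windows)

print(windows)  # ['AT', 'TA', 'AT', 'TA', 'AT']
print(toy_counts["AT"], toy_counts["TA"])  # 3 2
```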

</details>

In [None]:
def parallel_count_kmers(sequences, k, kmer_mapping):
    """
    Parallelize the counting of k-mers across multiple DNA sequences.

    This function utilizes multiprocessing to efficiently process multiple sequences
    in parallel, significantly improving performance on multi-core systems.

    Args:
        sequences (iterable of str): An iterable containing DNA sequences.
        k (int): The k-mer size.
        kmer_mapping (dict): Mapping from k-mers to their respective indices.

    Returns:
        np.ndarray: A 2D array where each row represents the k-mer counts for a sequence.
    """
    args_list = [(sequence, k, kmer_mapping) for sequence in sequences]

    with Pool() as pool:
        result = pool.starmap(count_kmers_seq, args_list)

    return np.array(result)

<details><summary>parallel_count_kmers</summary>

This algorithm is designed to count k-mers across multiple DNA sequences in parallel, using Python's multiprocessing capabilities to distribute the work across multiple processes. It is particularly useful for processing large datasets in bioinformatics, where analyzing sequences for k-mer counts can be computationally intensive. Here's a step-by-step breakdown of how this algorithm works:

**Components of the Algorithm**

1. `sequences`: An iterable (e.g., a list or array) of DNA sequences. Each element in this iterable is a string representing a DNA sequence from which k-mers will be counted.
2. `k`: The length of the k-mers to be counted.
3. `kmer_mapping`: A dictionary that maps each k-mer to a unique index. This is used to keep track of the counts of each k-mer in an ordered manner.
4. `args_list`: A list of tuples, where each tuple contains a single sequence from `sequences`, the value of `k`, and the `kmer_mapping`. This list is prepared as input for parallel processing, with each tuple representing a set of arguments to be passed to a function that counts k-mers in a single sequence.
5. `Pool` from `multiprocessing`: A context manager that provides a convenient way to manage a pool of worker processes. It allows for parallel execution of functions across multiple input values, distributing the workload among available CPU cores.
6. `pool.starmap`: A method of the `Pool` class that applies a function to all items in a given iterable (`args_list` in this case), unpacking each item in the iterable to use as separate arguments to the function. It is suitable for when the function to be executed in parallel takes multiple arguments.

**How It Works**

1.  The algorithm starts by creating `args_list`, a list of tuples. Each tuple contains the arguments to be passed to the `count_kmers_seq` function for one of the sequences. This effectively prepares a batch of work where each piece consists of counting k-mers in one sequence.
2.  By using `with Pool() as pool:`, a pool of worker processes is created. The number of workers is determined by the available CPU cores on the machine, allowing for parallel computation.
3.  The `pool.starmap` function is called with two arguments: `count_kmers_seq`, which is the function to be executed in parallel, and `args_list`, the iterable of arguments prepared earlier. `starmap` iterates over `args_list`, and for each tuple in the list, it unpacks the tuple and calls `count_kmers_seq` with those unpacked arguments. This process happens in parallel across different worker processes, allowing for simultaneous counting of k-mers in multiple sequences.
4.  The result of `pool.starmap` is a list where each element corresponds to the result of `count_kmers_seq` for one of the input sequences. This list is then converted to a NumPy array using `np.array(result)` for efficient numerical operations and storage. 
5.  Finally, the algorithm returns the NumPy array containing the k-mer counts for each sequence. Each row in this array corresponds to the counts of k-mers in one sequence from the input `sequences`.

**Purpose and Usage**

This algorithm can speed up k-mer counting across many sequences by leveraging multiprocessing, which is most helpful for large-scale genomic analyses. Parallel processing distributes the workload across multiple CPU cores, reducing the overall computation time when the per-sequence work outweighs the cost of starting worker processes and passing data to them.
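
To see what the argument unpacking in `starmap` does without starting any worker processes, the standard library's serial `itertools.starmap` can serve as a stand-in: like `Pool.starmap`, it calls the function once per tuple with the tuple unpacked into positional arguments.
The function `count_char` and the inputs below are hypothetical toy examples.

```python
import itertools


def count_char(sequence, char):
    """Toy stand-in for a per-sequence worker that takes several arguments."""
    return sequence.count(char)


toy_sequences = ["ACGA", "TTTT", "GATTACA"]
args_list = [(sequence, "A") for sequence in toy_sequences]

# Each tuple is unpacked, so this calls count_char("ACGA", "A"), and so on.
serial_result = list(itertools.starmap(count_char, args_list))
print(serial_result)  # [2, 0, 3]
```

`Pool.starmap` returns its results in the same order as its inputs, which is why row `i` of the returned array corresponds to sequence `i`.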

</details>

In [None]:
all_kmers = generate_all_kmers(k)
kmer_mapping = create_kmer_mapping(all_kmers)
n_kmers = len(kmer_mapping)
print(f"There are {n_kmers} unique k-mers in the k-mer mapping")

In [None]:
features = np.empty((n_genes, n_isolates, n_kmers))
for i, gene_name in enumerate(gene_selection):
    gene_data = load_gene_data(gene_name)
    features[i] = parallel_count_kmers(gene_data, k, kmer_mapping)
print(features.shape)

### Train and test split

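To avoid data leakage, we split the isolates into training and test sets before analyzing the k-mers.
The same `RANDOM_STATE` is used for every gene and for the labels so that the rows of each split stay aligned.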

In [None]:
# DO NOT CHANGE CODE BELOW THIS LINE.
from sklearn.model_selection import train_test_split

RANDOM_STATE = 472929478

features_train = []
features_test = []

for i in range(len(features)):
    f_train, f_test = train_test_split(
        features[i], test_size=0.4, random_state=RANDOM_STATE
    )
    features_train.append(f_train)
    features_test.append(f_test)

features_train = np.array(features_train)
features_test = np.array(features_test)

print(features_train.shape)
print(features_test.shape)

In [None]:
# DO NOT CHANGE CODE BELOW THIS LINE.
labels_train, labels_test = train_test_split(
    labels, test_size=0.4, random_state=RANDOM_STATE
)

print(labels_train.shape)
print(labels_test.shape)
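
One common source of leakage is computing preprocessing statistics on all isolates before splitting.
As a minimal NumPy-only sketch on made-up numbers (the arrays and variable names are hypothetical), the scaling statistics below are estimated from the training rows alone and then reused for the test rows.

```python
import numpy as np

# Made-up k-mer counts: rows are isolates, columns are k-mers.
toy_train = np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]])
toy_test = np.array([[2.0, 10.0]])

# Estimate the statistics from the training rows only.
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)

# Reuse the training statistics for both sets; the test rows never influence them.
toy_train_scaled = (toy_train - train_mean) / train_std
toy_test_scaled = (toy_test - train_mean) / train_std

print(train_mean)
print(toy_test_scaled)
```

With scikit-learn scalers, the equivalent pattern is to call `fit` on the training data and `transform` on both sets.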

### K-mer analysis

Perform any analysis of k-mers in the cell(s) below to help improve your model.
This will help you refine your `select_kmers` later on.
For example, we can calculate the standard deviation of each k-mer count across the training isolates.

In [None]:
desc_kmers = []
print(features_train.shape)
for feat in features_train:
    # feat has shape (isolates, k-mers); axis=0 gives one value per k-mer.
    desc = np.std(feat, axis=0)
    desc_kmers.append(desc)

desc_kmers = np.array(desc_kmers)
print(desc_kmers)

Now we can select, for each gene, the k-mers whose descriptor exceeds a chosen threshold.

In [None]:
kmer_idxs = np.argwhere(desc_kmers > 0.15)
n_kmers_selections = len(kmer_idxs)
print(f"You would have {n_kmers_selections} k-mers selected")

In [None]:
kmer_selections = [[] for _ in range(n_genes)]
for sel in kmer_idxs:
    gene_idx, kmer_idx = sel
    kmer_selections[gene_idx].append(kmer_idx)
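
To make the axis convention concrete, the sketch below uses a made-up array with the same layout as `features_train`, (genes, isolates, k-mers).
On the full 3-D array the isolate axis is axis 1; inside a per-gene loop over 2-D (isolates, k-mers) slices it is axis 0.
Either way, the result holds one value per gene and k-mer, and `np.argwhere` then returns `(gene_idx, kmer_idx)` pairs that can be grouped per gene. The numbers and the threshold are arbitrary.

```python
import numpy as np

# Made-up counts with shape (2 genes, 3 isolates, 4 k-mers).
toy_features = np.array(
    [
        [[1, 0, 2, 0], [1, 0, 0, 0], [1, 0, 4, 0]],
        [[0, 3, 0, 1], [0, 3, 0, 1], [0, 3, 0, 4]],
    ],
    dtype=float,
)

# Axis 1 is the isolate axis of the 3-D array, so the result has shape (genes, k-mers).
toy_desc = toy_features.std(axis=1)
print(toy_desc.shape)  # (2, 4)

# Keep k-mers whose counts vary across isolates, grouped by gene.
toy_selections = [[] for _ in range(toy_features.shape[0])]
for gene_idx, kmer_idx in np.argwhere(toy_desc > 0.5):
    toy_selections[gene_idx].append(int(kmer_idx))
print(toy_selections)  # [[2], [3]]
```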

## Model

### Preprocessing



#### Scaler

Select your scaler by changing the value after `preprocessing.`.
For example, if you wanted to select `StandardScaler`, your code would be

```python
select_scaler = preprocessing.StandardScaler
```

Your options are:

-   [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html)
-   [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
-   [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
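
For intuition, the three options differ only in how each column is rescaled.
The NumPy sketch below mirrors the per-column formulas described in the scikit-learn documentation (with default settings) on a made-up column; it is an illustration rather than a replacement for the scikit-learn classes.

```python
import numpy as np

column = np.array([2.0, 4.0, 6.0, 10.0])  # made-up counts of one k-mer

# MaxAbsScaler: divide by the largest absolute value.
max_abs_scaled = column / np.abs(column).max()

# MinMaxScaler (default range): map the minimum to 0 and the maximum to 1.
min_max_scaled = (column - column.min()) / (column.max() - column.min())

# StandardScaler: subtract the mean and divide by the standard deviation.
standard_scaled = (column - column.mean()) / column.std()

print(max_abs_scaled.tolist())  # [0.2, 0.4, 0.6, 1.0]
print(min_max_scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
print(standard_scaled.tolist())
```

`MaxAbsScaler` leaves zero counts at zero, whereas the other two shift the data, which may matter for sparse k-mer count matrices.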

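#### Specific k-mers

Set `select_kmers` to `None` to keep every k-mer, or to a list of k-mer indices to keep only those k-mers.
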
```python
select_kmers = None
# or
select_kmers = [0, 1, 2]
```

In [None]:
# TODO: Make your selections here
select_scaler = preprocessing.StandardScaler
select_kmers = None

# DO NOT MODIFY CODE BELOW THIS LINE.


def process_features(features, select_scaler=None, select_kmers=None):
    scaled_features = np.empty(features.shape)

    if select_scaler is None:
        # Fall back to standardization when no scaler class is given.
        select_scaler = preprocessing.StandardScaler
    for i in range(features.shape[0]):
        # Scale each gene's (isolates, k-mers) matrix independently.
        scaler = select_scaler().fit(features[i])
        new_features = scaler.transform(features[i])
        scaled_features[i, :] = new_features
    if select_kmers is not None:
        if not isinstance(select_kmers, list):
            raise TypeError("select_kmers must be a list")
        # Keep only the chosen k-mer columns (last axis) for every gene.
        scaled_features = scaled_features[:, :, select_kmers]
    # Join the per-gene matrices side by side: (isolates, genes * k-mers).
    scaled_features = np.concatenate(scaled_features, axis=-1)
    return scaled_features


features_model = process_features(
    features, select_scaler=select_scaler, select_kmers=select_kmers
)
print(features_model.shape)

### Model selection

You may use any of these models.

-   [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html)
-   [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
-   [NuSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html)
-   [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
-   [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)
-   [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
-   [GaussianProcessClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html)
-   [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
-   [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
-   [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html)
-   [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
-   [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)


In [None]:
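# TODO: Put your import statement for your model here.
# The placeholder selection below uses SVC, so it is imported as an example.
from sklearn.svm import SVC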

In [None]:
model_selection = SVC
model_kwargs = {"C": 1.0, "kernel": "rbf"}

### Training


In [None]:
# DO NOT CHANGE CODE BELOW THIS LINE.
model = model_selection(**model_kwargs)
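# scikit-learn expects a 2-D (isolates, features) matrix, so join the per-gene blocks.
model.fit(X=np.concatenate(features_train, axis=-1), y=labels_train)

### Testing


In [None]:
from sklearn.metrics import balanced_accuracy_score

In [None]:
# Flatten the test features the same way as the training features.
labels_pred = model.predict(np.concatenate(features_test, axis=-1))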
score = balanced_accuracy_score(labels_test, labels_pred)
print("True: ", labels_test.flatten())
print("Pred: ", labels_pred)
print(score)
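
Balanced accuracy is the average of the per-class recall, which makes it less sensitive to class imbalance than plain accuracy.
The sketch below recomputes it by hand with NumPy on made-up labels; the helper name `balanced_accuracy_by_hand` is hypothetical.

```python
import numpy as np


def balanced_accuracy_by_hand(y_true, y_pred):
    """Average the recall of each class that appears in y_true."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    recalls = [np.mean(y_pred[y_true == label] == label) for label in np.unique(y_true)]
    return float(np.mean(recalls))


# Made-up labels: four isolates of class 0 and two of class 1.
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1]

toy_score = balanced_accuracy_by_hand(toy_true, toy_pred)
print(toy_score)  # 0.75
```

Here plain accuracy would be 5/6, but the missed class-1 isolate halves that class's recall, pulling the balanced score down to 0.75.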