# Workshop 4 - Featurization

In this workshop we're going to take the data we reviewed last week during EDA and transform it into usable features to begin modeling. We have a few different options and we'll run through each of the three in turn:
1. Gene presence/absence
2. Kmer counts
3. [Bonus] Gene Sequences



Let's run through the steps together (there are some questions and some blanks to fill in as we run through).

## Imports

In [None]:
import os
from collections import defaultdict

import pandas as pd
import numpy as np
from tensorflow import keras

## 1. Load Data

Last week we took a look through all our raw data.

To allow for testing further down the line, I've held out some data in advance that we'll be using later in the course. It lives in a folder called `train_test_data` in the course project folder system.

For this workshop please download:
- the `train_test_data` folder, and place it inside `data/`

Key for data:
- train_genes = gene_match_data for train samples
- test_genes = gene_match_data for test samples
- y_train = array of S/R target values
- y_train_ids = array of genome_ids in order of y_train
- y_test_ids = array of genome_ids in order of y_test

In [None]:
seed = 130

def load_data():
    """
    Load the data needed for Workshop 4
    """
    train_genes = pd.read_csv('../data/train_test_data/train_genes.csv')
    train_genes['genome_id'] = train_genes.genome_id.astype(str)
    test_genes = pd.read_csv('../data/train_test_data/test_genes.csv')
    test_genes['genome_id'] = test_genes.genome_id.astype(str)
    y_train = np.load('../data/train_test_data/y_train.npy', allow_pickle=True)
    y_train_ids = np.load('../data/train_test_data/train_ids.npy', allow_pickle=True).astype(str)
    y_test_ids = np.load('../data/train_test_data/test_ids.npy', allow_pickle=True).astype(str)

    return train_genes, test_genes, y_train, y_train_ids, y_test_ids

train_genes, test_genes, y_train, y_train_ids, y_test_ids = load_data()

In [None]:
train_genes.head(5)

In [None]:
y_train[0:5], y_train_ids[0:5]

## 2. Presence / Absence Features

Our first and simplest feature set will be the presence/absence of each gene we've seen from our CARD alignment data.

In order to build these features we'll need to:
1. Find all unique res_genes, count which samples they're present/absent in
2. Look for correlations between genes
3. Remove highly correlated features (as seen in the assignment)

### 2a. Create a presence/absence matrix for each sample and each gene

- We're going to leverage some pandas magic to make this really simple
- The logic can be quite complex manually
  - For each gene, search each sample
  - Store a list of 0/1 for each sample for each gene
  - Ensuring correct ordering
- Pandas can do this for us using `pivot_table`
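
As a minimal sketch of the `pivot_table` idea on made-up data, one possible approach is shown here. The toy frame, the sample IDs `s1` to `s3`, the gene names and the helper column `present` are all hypothetical; only the column names `genome_id` and `res_gene` come from the workshop data. It is an illustration of the technique rather than the answer to the blanks in the next cell.

```python
import pandas as pd

# Toy alignment table: one row per (sample, gene) hit. Values are made up.
toy = pd.DataFrame({
    "genome_id": ["s1", "s1", "s2", "s3", "s3"],
    "res_gene": ["geneA", "geneB", "geneA", "geneB", "geneB"],
})

# Helper column so pivot_table has something to aggregate (hypothetical name).
toy["present"] = 1

# Rows become samples, columns become genes. "max" collapses repeated hits
# to 1, and fill_value=0 marks sample/gene pairs that never occur.
toy_pa = toy.pivot_table(
    index="genome_id",
    columns="res_gene",
    values="present",
    aggfunc="max",
    fill_value=0,
)

# Reindex to enforce an explicit sample order and gene order. A gene that is
# missing from the table (geneC here) becomes a column of zeros.
sample_order = ["s3", "s1", "s2"]
gene_order = ["geneA", "geneB", "geneC"]
toy_pa = toy_pa.reindex(index=sample_order, columns=gene_order, fill_value=0)
print(toy_pa)
```

The final `reindex` is what guarantees the rows follow the ID order of the target array and the columns follow a fixed gene list.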

In [None]:
def build_gene_presence_absence(dataset, ids, gene_names=None):
    """
    Build a matrix of samples to genes with 1 for present and 0 for absent

    Args:
      - dataset (pd.DataFrame): dataset of gene alignments
      - ids (list): ordering for IDs
      - gene_names (list): ordered list of unique genes (optional)
    """
    # If not providing genes take all unique genes from the data
    if gene_names is None:
        gene_names = dataset.res_gene.unique()

    # Count for all genes found within the data
    genes_counts = (
        ---
    )

    # Add genes missing from the data
    missing_genes = ---
    if len(missing_genes) > 0:
        for gene in missing_genes:
            genes_counts[gene] = 0

    # Make sure to return in the same order as gene_names and the same sample order
    return ---

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. Why do we need to pass in a list of gene names?

</div>

In [None]:
train_presence_absence_df = build_gene_presence_absence(train_genes, y_train_ids)
print('\nShape of PA df:', train_presence_absence_df.shape, '\n')
train_presence_absence_df.head()

In [None]:
genes_of_interest = [
    'gb|AAAGNS010000063.1|-|144-1005|ARO:3000966|TEM-103',
    'gb|AB023477.1|+|0-861|ARO:3001082|SHV-24',
    'Random Missing Gene',
    'gb|AB089595.1|+|0-1206|ARO:3000166|tet(B)',
]
build_gene_presence_absence(train_genes, y_train_ids, genes_of_interest).head()

### 2b. Review Correlations and Remove Identical Features

As seen in last week's assignment, these presence absence features have a lot of redundancies (genes that are identical across samples).

If we leave these, they can cause issues during modeling (unidentifiability/incorrect feature importances and inferences).

Two options:
1. Calculate correlations and cluster
2. Look at identical presence/absence and select one

We'll go for the simpler option 2 but you can also try clustering in your project if you believe it will help model performance.
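
If you do want to explore option 1 in your project, a rough sketch of the first step (finding highly correlated gene pairs) might look like the following. The toy matrix and the `threshold` value are hypothetical, and this only lists candidate pairs; it does not cluster them.

```python
import pandas as pd

# Toy presence/absence matrix (made-up values). geneC is the exact complement
# of geneA, so the two are perfectly (negatively) correlated.
toy_pa = pd.DataFrame(
    {
        "geneA": [1, 0, 1, 0],
        "geneB": [1, 0, 1, 1],
        "geneC": [0, 1, 0, 1],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Absolute Pearson correlation between every pair of gene columns.
corr = toy_pa.corr().abs()

# Hypothetical cut-off for calling two genes redundant.
threshold = 0.95

# Walk the upper triangle so each pair is reported once.
cols = corr.columns
pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] >= threshold
]
print(pairs)
```

Note that a complementary pair like this would not be caught by a check for identical columns, which is one reason correlation-based grouping can remove more redundancy than option 2.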

Again this seems somewhat complex - we need to look across all samples and check for identical 1/0 arrays. BUT:

- Pandas to the rescue once again

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. Can anyone think of a common pandas operation which might help us remove identical values?

</div>

In [None]:
# Transpose the data so that genes are our rows
train_presence_absence_transposed = ---

# Use "drop_duplicates()" to remove identical rows (will just keep the first)
train_unique_presence_absence = ---

# Transpose back so genes are columns
train_presence_absence_df = ---

In [None]:
print('\nShape of PA df:', train_presence_absence_df.shape, '\n')
train_presence_absence_df.head()
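
The transpose, `drop_duplicates()`, transpose pattern described in the cell comments above can be sketched on a toy matrix. The data and gene names here are made up for illustration.

```python
import pandas as pd

# Toy presence/absence matrix (made-up values): geneA and geneC are identical
# across all samples, so only one of them carries information.
toy_pa = pd.DataFrame(
    {"geneA": [1, 0, 1], "geneB": [0, 0, 1], "geneC": [1, 0, 1]},
    index=["s1", "s2", "s3"],
)

# drop_duplicates works on rows, so transpose to make each gene a row,
# drop the repeated rows (the first occurrence is kept), then transpose back.
toy_unique = toy_pa.T.drop_duplicates().T

print(toy_unique.columns.tolist())
```

Because only the first of each identical group is kept, which gene name survives depends on column order; it is worth recording the dropped names if you need them for interpretation later.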

### 2c. We also need to subset our test data

We always need to make sure our test data has the same shape and format as the training data.
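
To illustrate what "same shape and format" means for columns, here is a small sketch using `reindex` on made-up frames. The frames and gene names are hypothetical, and this shows the general idea rather than the workshop's own function.

```python
import pandas as pd

# Toy train/test presence/absence matrices (made-up values).
toy_train = pd.DataFrame({"geneA": [1, 0], "geneB": [0, 1]}, index=["s1", "s2"])
toy_test = pd.DataFrame({"geneB": [1], "geneZ": [1]}, index=["s9"])

# Keep exactly the training columns, in the training order: genes unseen in
# test (geneA) are filled with 0, and genes unseen in training (geneZ) are dropped.
toy_test_aligned = toy_test.reindex(columns=toy_train.columns, fill_value=0)

print(toy_test_aligned)
```

A model trained on the training columns can only consume test rows with the same columns in the same order, which is why genes seen only in test have to be discarded.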

In [None]:
# Get just the unique 159 genes from our training data
unique_train_genes = ---
test_presence_absence_df = build_gene_presence_absence(---)

In [None]:
test_presence_absence_df.head()

## 3. Kmer Features

Kmers are a representation of the raw sequencing data, where K is a parameter setting the length of each subsequence (e.g. 2-mer or 5-mer).

There are two main options for kmerizing data:
1. Utilize Python to count the sequence data
2. Utilize a command line tool (see [Jellyfish](https://github.com/zippav/Jellyfish-2))

In this tutorial we'll learn how to generate kmers manually on a small subset of data.

Kmer counting is an expensive operation. So that you can use Kmer features in your project, I've already generated an output file across all the data using Jellyfish (we'll loop back to this next week).

### 3a. How can we count Kmers in a single sequence?

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. How would we generate the first 7-mer from a sequence?

</div>

We're going to need to break down our sequences into discrete chunks.

As we saw in the presentation, we need to slide along our sequence doing this at each position:
1. A simple for loop will work perfectly for this
2. We can select each chunk of 7 nucleotides at each position
3. We'll need to figure out how to keep track of this efficiently
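
As a small illustration of the sliding window, here is a sketch on a short toy sequence with `k=3`. The helper name `sliding_kmers` is hypothetical, and it uses `collections.Counter` for the tally, which is just one of several ways to keep track of counts.

```python
from collections import Counter


def sliding_kmers(sequence, k):
    """Return every length-k window of `sequence`, left to right."""
    # A sequence of length n has n - k + 1 windows; the last one starts at n - k.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy_sequence = "ACGTACG"
windows = sliding_kmers(toy_sequence, k=3)
print(windows)        # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG']

# Counter tallies repeated windows; a kmer that was never seen reads as 0.
counts = Counter(windows)
print(counts["ACG"])  # 2
```

The `n - k + 1` bound is the detail that is easiest to get wrong: stopping earlier misses the final window, and stopping later produces chunks shorter than `k`.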

In [None]:
def count_kmers(sequence, k=7):
    """
    For a single sequence, slide over all nucleotides and count each k chunk

    Args:
      - sequence (str): raw nucleotide data
      - k (int): size of kmers
    """
    kmer_counts = ---
    for i_start in range(---):
        kmer = ---
        kmer_counts[kmer] += 1
    
    return kmer_counts

In [None]:
# Test with a sequence
test_sequence = 'ACGTGTGTAAGACGTGTGGCGA'
count_kmers(test_sequence)

### 3b. Apply this approach for each sample

We have a function which can turn a sequence into counts of kmers.

This still isn't a usable "feature" for modeling, though: we now need to apply it across our training data. Let's take a look at the data again:

In [None]:
train_genes.head(3)

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. What steps are we going to need to take from the data above?

</div>

Our data isn't in a single sequence per sample; it's per sample per gene.

We can either:
1. Count kmers per gene sequence and then aggregate per sample
2. Aggregate sequences to a single sequence per sample then count

We're going to take option 2, as it makes tracking the kmers much easier.

To make this computationally feasible, we're going to try it on just the first 5 samples.
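
A minimal sketch of the "aggregate, then count" idea on made-up data is shown here. The toy sequences and the helper `kmers` are hypothetical; the column names `genome_id` and `ref_gene_str` match the workshop data.

```python
import pandas as pd

# Toy per-gene table: sample s1 has two gene sequences, s2 has one.
toy = pd.DataFrame({
    "genome_id": ["s1", "s1", "s2"],
    "ref_gene_str": ["acgt", "GGA", "TTTT"],
})

# One concatenated, upper-cased string per sample; sort=False keeps the
# samples in first-seen order.
seq_per_sample = (
    toy.groupby("genome_id", sort=False)["ref_gene_str"]
    .agg("".join)
    .str.upper()
    .reset_index()
)
print(seq_per_sample)


def kmers(seq, k):
    """Set of distinct length-k windows in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Compare kmers of the joined s1 string with kmers of its two genes separately.
joined = seq_per_sample.loc[seq_per_sample["genome_id"] == "s1", "ref_gene_str"].iloc[0]
per_gene = kmers("ACGT", 3) | kmers("GGA", 3)
junction_only = kmers(joined, 3) - per_gene
print(sorted(junction_only))
```

One thing the toy output makes visible: joining genes end to end also creates kmers that span the boundary between two genes and exist in neither gene on its own. That is a trade-off of option 2 to keep in mind.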

In [None]:
def count_kmers_per_sample(dataset, sample_id_col='genome_id', gene_col='ref_gene_str'):

    # Agg to full gene string per sample
    seq_per_sample = ---
    # Upper-case via gene_col so a non-default column name also works
    seq_per_sample[gene_col] = seq_per_sample[gene_col].str.upper()

    kmer_counts = {}
    for ref_name, sequence in seq_per_sample[[sample_id_col, gene_col]].to_records(index=False):
        ---

    # Convert to DataFrame, fill empty with zero and transpose
    kmer_counts_df = ---
    
    return kmer_counts_df

In [None]:
first_5_samples = train_genes.genome_id.unique()[0:5]
train_genes_first_five = train_genes[train_genes.genome_id.isin(first_5_samples)]
test_kmer_counts = count_kmers_per_sample(train_genes_first_five)

In [None]:
test_kmer_counts.head()

<div class="question" style="color: #534646; background-color: #ffdfa3; padding: 1px; border-radius: 5px;">

#### Q. Any potential issues with the above?

</div>

In [None]:
# Check unique possible 7-mers
4**7

### Final Note

We've counted every kmer that was present in each sample.

BUT: we haven't accounted for the kmers which weren't seen anywhere in the samples
- 16,245 kmers present in the first 5 genomes
- We need the feature matrix to be consistent for future unseen samples
- We therefore need to account for ALL possible kmers

Our feature matrix would need to be `N x 16384` (there are 4^7 = 16,384 possible 7-mers).

This could get very large if we use a high K! (Remember 10-mers had over 1 million unique features)
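
To make the "ALL possible kmers" point concrete, here is a sketch that enumerates every kmer for a small `k` and pads an observed count table out to the full column set. The toy counts are made up, and `k = 3` is used only to keep the output small.

```python
from itertools import product

import pandas as pd

k = 3  # small k for illustration; the workshop example uses k=7

# Every possible kmer over the A/C/G/T alphabet, in a fixed order.
all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]

# Toy counts for two samples (made-up values); only observed kmers are columns.
observed = pd.DataFrame({"ACG": [2, 0], "TTT": [0, 1]}, index=["s1", "s2"])

# Reindex so every sample has the same 4**k columns. Unseen kmers become 0.
# Any column not in all_kmers (e.g. one containing N) would be dropped.
full = observed.reindex(columns=all_kmers, fill_value=0)
print(full.shape)  # (2, 64)

# The number of columns grows as 4**k.
for size in (7, 10, 11):
    print(size, 4 ** size)
```

The loop prints 16384, 1048576 and 4194304, which is why high values of K quickly stop being practical as a dense matrix.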

For next week I've already created an 11-mer matrix for both train and test using all unique genes seen across the data we have in this workshop.

## 4. [BONUS] Sequence Features

I've provided a simple method for featurizing the gene sequences themselves into a usable format. As mentioned, this is an optional extra if you want to try using these features to build a neural network as part of your project. It will be tricky to get it to train correctly, and you may need to invest a lot more time to get it working (see the paper I linked in the slides). This featurization approach is likely insufficient on its own, but you can use it as a jumping-off point if you wish to explore further.

At a high level the process is:
1. Extract only variant genes from CARD (ones which confer resistance through mutations)
2. Subset genes to variant genes and randomly concatenate into a single long string
3. Encode nucleotides to integers
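
To show what step 3 and the padding that follows it produce, here is a NumPy-only sketch on two tiny made-up sequences. The helpers `encode` and `pad_post` are hypothetical stand-ins written for illustration; the integer mapping mirrors the `encode_seq` function in the cells that follow, and the notebook itself uses `keras.utils.pad_sequences` for the padding.

```python
import numpy as np

# Same integer mapping as the workshop's encode_seq; anything else maps to 5.
label_enc = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(seq, unknown=5):
    """Map each nucleotide to an integer, case-insensitively."""
    return [label_enc.get(ch.upper(), unknown) for ch in seq]


def pad_post(encoded, maxlen, pad_value=0):
    """Right-pad each encoded sequence to maxlen with pad_value."""
    out = np.full((len(encoded), maxlen), pad_value, dtype=np.int32)
    for row, seq in enumerate(encoded):
        # Simplification: anything longer than maxlen is cut at the end.
        # Keras pad_sequences by default truncates from the start instead.
        trimmed = seq[:maxlen]
        out[row, :len(trimmed)] = trimmed
    return out


toy_sequences = ["acgn", "T"]
toy_features = pad_post([encode(s) for s in toy_sequences], maxlen=5)
print(toy_features)
```

Reserving 0 for padding and starting the nucleotide codes at 1 keeps "no data" distinguishable from a real base, which matters if a downstream model masks the padded positions.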

In [None]:
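# Load in just the variant genes and subset the data
# Assumes the CARD files sit under ../data/ alongside the other workshop data
card_dir = '../data/card_data/'
for file_name in os.listdir(card_dir):
    if file_name.startswith('nucleotide_fasta_protein_variant_model'):
        print(file_name)
        with open(os.path.join(card_dir, file_name)) as f:
            fasta = f.readlines()
# FASTA header lines start with '>'; the gene name is the first space-delimited token
variant_genes = [x.strip().split(' ')[0][1:] for x in fasta if x.startswith('>')]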

variant_gene_alignment_df = train_genes[train_genes.res_gene.isin(variant_genes)]

In [None]:
# Precomputed maximum length of sequences
max_length = 53859

In [None]:
def encode_seq(seq):
    label_enc = {'A':1, 'C':2, 'G':3, 'T':4}
    return [label_enc.get(x.upper(), 5) for x in seq]

def featurize_variant_sequences(variant_genes, amr_max_length, pad_char=0):
    # Concatenate all variant gene sequences per sample (summing strings joins them)
    gene_features = variant_genes.groupby('genome_id', sort=False)['ref_gene_str'].sum()
    gene_features = [encode_seq(x) for x in gene_features]
    # Post-pad each sample to amr_max_length with pad_char
    gene_features = keras.utils.pad_sequences(gene_features, maxlen=amr_max_length, padding='post', value=pad_char)

    return gene_features

In [None]:
sequence_features = featurize_variant_sequences(variant_gene_alignment_df, max_length)

In [None]:
sequence_features[0:5]

### Review:
- This is a very simple featurization scheme to get started with
- It randomly joins genes so the ordering is jumbled
- We're only taking Variant genes, which may not be predictive for all samples
  - Some may be mediated by the presence absence genes

If planning to try using sequence features for the project:
- Review the paper linked in the slides
- Think about how to represent the data
- Consider trying to build sequences using all genes if you have the computational resources to do so

## 5. Save data out for Assignment

Save the presence/absence train/test data for use in the assignment (this is also available in the `train_test_data` course data folder if needed).

In [None]:
train_presence_absence_df.to_csv('../data/train_test_data/train_pa_genes.csv')
test_presence_absence_df.to_csv('../data/train_test_data/test_pa_genes.csv')