# Final Model: Aggregated Probability
After trying several approaches in the sandbox notebooks, I settled on 
the following approach given the time constraint: aggregating the predicted
probabilities of a nearest neighbors classifier fit on sequence permutations
(TF-IDF scores) and a random forest classifier fit on the one-hot encoded
portion of the training data. The following is a distillation of the
best-scoring process in the sandbox.
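
To make "aggregated probability" concrete, here is a minimal sketch on synthetic data. It assumes that aggregation means the unweighted mean of the two `predict_proba` matrices; the feature split (`X_a`, `X_b`), the hyperparameters, and the data itself are placeholders rather than the values used in the sandbox.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the real features are built later in the notebook.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_a, X_b = X[:, :5], X[:, 5:]  # hypothetical split into two feature views

# Fit each model on its own view of the same training rows.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_a[:150], y[:150])
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_b[:150], y[:150])

# Both models saw the same labels, so their probability columns line up.
assert (knn.classes_ == rf.classes_).all()

# Aggregate: unweighted mean of the two probability matrices (an assumption).
proba = (knn.predict_proba(X_a[150:]) + rf.predict_proba(X_b[150:])) / 2
pred = knn.classes_[proba.argmax(axis=1)]
```

Averaging keeps each row a valid probability distribution, which is convenient if the final output needs per-class probabilities rather than a single predicted label.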

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from itertools import permutations
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

### Preparing The Data
Before being able to fit any models, some feature engineering needs to be performed on the training observations, and the training labels need to be collapsed down to a usable form.

In [6]:
values = pd.read_csv("train_values.csv")
labels = pd.read_csv("train_labels.csv")

In [8]:
# create the main label series

lab_ids = labels.columns[1:]

# get numpy matrix of lab_id one-hot values
lab_matrix = labels.drop(columns=['sequence_id']).values

# get array of indices to map back to lab_ids
lab_col_indices = np.asarray(lab_matrix == 1).nonzero()[1]

labels['lab_id'] = lab_ids[lab_col_indices]
y = labels[['sequence_id', 'lab_id']]
y.head()

Unnamed: 0,sequence_id,lab_id
0,9ZIMC,RYUA3GVO
1,5SAQC,RYUA3GVO
2,E7QRO,RYUA3GVO
3,CT5FP,RYUA3GVO
4,7PTD8,RYUA3GVO


In [9]:
# verify that all labels are correct
def correct_label(row_n):
    return labels.iloc[row_n][labels['lab_id'].iloc[row_n]] == 1

assert all(list(map(correct_label, range(labels.shape[0]))))
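
As a small illustration of the label collapse above, the same mapping can be checked on a toy frame. The column names here are made up, and `idxmax(axis=1)` is shown only as an alternative that should give the same result when each row contains exactly one 1.

```python
import numpy as np
import pandas as pd

# Toy one-hot label frame; the lab names are made up for illustration.
toy = pd.DataFrame({
    "sequence_id": ["s1", "s2", "s3"],
    "LAB_A": [0.0, 1.0, 0.0],
    "LAB_B": [1.0, 0.0, 0.0],
    "LAB_C": [0.0, 0.0, 1.0],
})

one_hot = toy.drop(columns=["sequence_id"])

# Same idea as the cell above: column index of the 1 in each row.
col_idx = np.asarray(one_hot.values == 1).nonzero()[1]
via_nonzero = one_hot.columns[col_idx]

# idxmax(axis=1) returns the column label of each row's maximum directly.
via_idxmax = one_hot.idxmax(axis=1)

print(list(via_nonzero))  # ['LAB_B', 'LAB_A', 'LAB_C']
```

The `nonzero` version silently misaligns if any row has zero or several 1s, so a check like the assertion above is worth keeping.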

In order to use RandomizedSearchCV effectively, observations from classes with very few members needed to be oversampled.

In [11]:
# these are the classes with fewer than 5 members
# every class needs at least 5 members for the randomized search CV
print(y['lab_id'].value_counts().tail(19))

# create resampling map
resample_map = y['lab_id'].value_counts().tail(19)
resample_map = list(zip(resample_map.index, resample_map.values))

# temporarily add 'lab_id' back to the feature values so rows can be resampled by class
over_samp = pd.concat([values, y['lab_id']], axis=1)

UMOD7PGG    4
8N5EPD5C    4
YCD71LRY    4
68OY1RK5    4
PXT3AJ7C    4
03GRNN7N    4
VDSDXJ71    4
INDCDVP0    4
W2DYAZID    4
1KZHNVYR    4
LGTP4O86    4
RZPGGEG4    4
XCWSW5T9    4
WM3Q8LBC    4
58BSUZQB    3
G2P73NZ0    3
WB78G3XF    2
0L3Y6ZB2    1
ON9AXMKF    1
Name: lab_id, dtype: int64


In [12]:
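def oversample(df, resamp_map, min_samps=5):
    """
    Randomly oversample rows that belong to classes with fewer than
    min_samps members
    
    df: pandas DataFrame object with a 'lab_id' column
    resamp_map: iterable of (class, current member count) pairs
    min_samps: minimum number of members each class should end up with
    """
    new_df = df.copy()
    for class_, members in resamp_map:
        for _ in range(min_samps - members):
            # draw one existing row of this class and add a copy of it
            new_row = new_df.loc[new_df['lab_id'] == class_].sample(n=1)
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            new_df = pd.concat([new_df, new_row])
    
    return new_df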

Now each class has a minimum of 5 members.

In [13]:
over_samp = oversample(over_samp, resample_map)
over_samp['lab_id'].value_counts().tail(20)

VDSDXJ71    5
8F0XPAZX    5
DN01XVIU    5
HCW1Y9QM    5
I3UODLOR    5
5Z4CMIY5    5
QNKGHIRB    5
W2DYAZID    5
78XDAJNS    5
G6MP6EIN    5
LGTP4O86    5
E59C5N01    5
78QGAL01    5
RZPGGEG4    5
XCWSW5T9    5
5CBNCRST    5
03GRNN7N    5
VMU0L6UM    5
XYB5NWR4    5
NDZT8PV3    5
Name: lab_id, dtype: int64
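
For comparison, here is a sketch of the same padding idea written without the row-by-row loop. `oversample_min` is a hypothetical helper, not the function used above, and it differs slightly: it draws all extra rows for a class at once, with replacement, from the original rows only.

```python
import pandas as pd

def oversample_min(df, label_col, min_samps=5, seed=0):
    """Pad every class in label_col up to min_samps rows by resampling."""
    counts = df[label_col].value_counts()
    extras = [
        # sample the missing number of rows for this class, with replacement
        df[df[label_col] == cls].sample(n=min_samps - n, replace=True,
                                        random_state=seed)
        for cls, n in counts.items() if n < min_samps
    ]
    return pd.concat([df, *extras], ignore_index=True)

# Toy data: class A already has 6 rows, B has 2, C has 1.
toy = pd.DataFrame({"x": range(9), "lab_id": ["A"] * 6 + ["B"] * 2 + ["C"]})
padded = oversample_min(toy, "lab_id")
print(padded["lab_id"].value_counts().to_dict())
```

Either way, the duplicated rows can land in both the training and validation folds during cross-validation, so scores for the smallest classes are likely optimistic.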

Now that the data is oversampled for use with RandomizedSearchCV, a function can be created that will take train and test data, and output the appropriate data set for each model.

In [14]:
# first, ngrams for sequences must be made
N_GRAMS = 4

# create the 'vocabulary' for the different nucleotides
n_tides = set(''.join(over_samp['sequence'].values))

# create a list of subsequences for features
subseqs = list(''.join(p) for p in permutations(n_tides, r=N_GRAMS))
subseqs[:5]

['TGNC', 'TGNA', 'TGCN', 'TGCA', 'TGAN']
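
One property of this vocabulary is worth spelling out: `itertools.permutations` never repeats a symbol within a tuple, so 4-grams such as `AAAA` or `ACCA` are not in `subseqs`. The sketch below compares it with `itertools.product`, assuming the alphabet is exactly the five symbols visible in the output above.

```python
from itertools import permutations, product

alphabet = "ACGTN"  # assumed: the five symbols seen in the output above
r = 4

# permutations: ordered selections without repeated symbols
perms = {"".join(p) for p in permutations(alphabet, r)}
# product: every length-r string over the alphabet, repeats allowed
prods = {"".join(p) for p in product(alphabet, repeat=r)}

print(len(perms), len(prods))            # 120 625
print("AAAA" in perms, "AAAA" in prods)  # False True
```

Separately, because `n_tides` is a `set` of strings, its iteration order can change between Python sessions; building the vocabulary from `sorted(n_tides)` would keep the feature columns in a reproducible order.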

In [None]:
def separate(data, sequences):
    """
    Splits train or testing data into one-hot
    encoded data and tokenized sequences
    """
    pass
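
The function above is still a stub. As a rough illustration of the tokenized-sequence half only, the sketch below counts overlapping occurrences of each vocabulary entry and passes the counts through `TfidfTransformer`. The helper `count_ngrams`, the overlapping-window counting rule, and the tiny vocabulary are all assumptions for the example, not the sandbox implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

def count_ngrams(seq, vocab, n):
    """Count overlapping length-n windows of seq that appear in vocab."""
    index = {gram: i for i, gram in enumerate(vocab)}
    counts = np.zeros(len(vocab), dtype=int)
    for start in range(len(seq) - n + 1):
        col = index.get(seq[start:start + n])
        if col is not None:  # windows outside the vocabulary are ignored
            counts[col] += 1
    return counts

vocab = ["ACGT", "CGTA", "GTAC"]   # tiny hypothetical vocabulary
seqs = ["ACGTACGT", "GTACGTAC"]    # made-up sequences

# One row of raw counts per sequence, one column per vocabulary entry.
count_matrix = np.vstack([count_ngrams(s, vocab, 4) for s in seqs])

# Reweight counts by inverse document frequency and L2-normalize each row.
tfidf = TfidfTransformer().fit_transform(count_matrix).toarray()
```

In a train/test setting the transformer would be fit on the training counts only and then applied to the test counts with `transform`, so the IDF weights never see the test data.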