# Train Data Contains Wildtype Groups
* Split off from @cdeotte's [excellent notebook](https://www.kaggle.com/code/cdeotte/train-data-contains-mutations-like-test-data)
* Also inspired by [Train Data - 13,000 Single Point Edit Mutations!](https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/discussion/355737) discussion and notebook
* Finally, this notebook uses the corrected training data as generated by @dschettler's [notebook](https://www.kaggle.com/code/dschettler8845/novo-esp-how-to-use-updated-training-file)

I hope this is an incremental improvement and a more robust way of extracting and grouping the train data. :)

For my version, a group is any set of MIN_GROUP_SIZE=5 or more rows that share a single wildtype, such that every row in the group is no more than a single mutation away from that wildtype.
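
To make the grouping criterion concrete, here is a minimal sketch of the "no more than a single mutation away" test. The helper name `within_one_edit` is hypothetical and is not used elsewhere in this notebook; it assumes the only edits of interest are one substitution or one deletion, since no insertions were found in the data.

```python
def within_one_edit(seq: str, wildtype: str) -> bool:
    """True if seq equals wildtype or differs by one substitution or one deletion."""
    if len(seq) == len(wildtype):
        # Same length: count mismatched positions (Hamming distance).
        return sum(a != b for a, b in zip(seq, wildtype)) <= 1
    if len(seq) == len(wildtype) - 1:
        # One residue shorter: removing some single wildtype position must give seq.
        return any(wildtype[:i] + wildtype[i + 1:] == seq for i in range(len(wildtype)))
    return False


wt = "MNIFEML"  # toy wildtype, shortened for illustration
print(within_one_edit("MNAFEML", wt))  # one substitution
print(within_one_edit("MNFEML", wt))   # one deletion
print(within_one_edit("MNAFEMA", wt))  # two substitutions
```

The notebook itself uses a faster half-string comparison first and only counts mutations exactly in a later verification pass.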

It is easy to change, but note that the output is currently saved as two separate files: the wildtype groups (`train_wildtype_groups.csv`) and the 'no wildtype' everything else (`train_no_wildtype.csv`).

Next steps:
* It would be great to get some help from the community to generate AlphaFold PDBs (and DeepDDGs) for the largest wildtypes and publish them as public dataset(s). Maybe a sign-up sheet in the comments? It would take a long time for just one person. Instructions for [generating a PDB are here](https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/discussion/354982).
* Since test data is single pH, single source, it is very reasonable to further subdivide into groups that are same pH, same source. This can be done for the wildtype groups AND for the 'everything else' group.
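
The same-pH, same-source subdivision suggested above could be sketched roughly as follows. This is an illustration on made-up toy rows, not part of the notebook's pipeline; the `subgroup` column name and the `"unknown"` placeholder for missing sources are assumptions.

```python
import pandas as pd

# Toy stand-in for train_wildtype_groups; the values are made up for illustration.
df = pd.DataFrame({
    "group": [0, 0, 0, 0, 1, 1],
    "pH": [2.0, 2.0, 6.5, 6.5, 7.0, 7.0],
    "data_source": ["a", "a", "a", None, "b", "b"],
    "tm": [38.1, 41.9, 62.9, 53.3, 50.9, 50.1],
})

# Rows with a missing data_source would otherwise be dropped by groupby,
# so give them an explicit placeholder value first.
df["data_source"] = df["data_source"].fillna("unknown")

# Number each (group, pH, data_source) combination consecutively.
df["subgroup"] = df.groupby(["group", "pH", "data_source"]).ngroup()
print(df.groupby("subgroup").size())
```

Subgroups that end up smaller than MIN_GROUP_SIZE could then be filtered out or merged back, depending on how much data the downstream model needs.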

Even more next steps: I believe that applying a 'learning to rank' algorithm to each group separately will be a better way to train than directly predicting the target variable.
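
As a first step toward that idea, one possible way to build rank-style targets is to rank `tm` within each group. This is only a sketch on made-up toy rows; the `tm_rank` column name is an assumption, and this prepares targets rather than implementing a learning-to-rank model.

```python
import pandas as pd

# Toy rows; in practice this would be train_wildtype_groups.
df = pd.DataFrame({
    "group": [0, 0, 0, 1, 1, 1],
    "tm": [38.1, 62.9, 41.9, 50.9, 48.9, 50.9],
})

# Rank tm inside each group. pct=True scales ranks into (0, 1],
# and tied values share the average of their ranks.
df["tm_rank"] = df.groupby("group")["tm"].rank(pct=True)
print(df)
```

Because the ranks are computed per group, rows from different wildtypes (or different pH values, if subdivided further) are never compared against each other.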


# Load Train

In [1]:
import pandas as pd, numpy as np
from statistics import mode, StatisticsError
from collections import defaultdict
from operator import itemgetter

In [2]:
train = pd.read_csv('../input/novo-esp-how-to-use-updated-training-file/updated_train.csv')
print('Train shape:', train.shape )
## Visually show the data is fixed. Interesting that there are still the same number of unique data sources after the "bad data" was removed.
print(train['pH'].min(), train['pH'].max(), train['data_source'].nunique())
train.head()

# train = cudf.read_csv('../input/novozymes-enzyme-stability-prediction/train.csv')
# Old train shape: (31390, 5)
# 1.9900000000000002 64.9 324

Train shape: (28981, 5)
1.99 11.0 324


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm
0,0,AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG...,7.0,doi.org/10.1038/s41592-020-0801-4,75.7
1,1,AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.5
2,2,AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA...,7.0,doi.org/10.1038/s41592-020-0801-4,40.5
3,3,AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL...,7.0,doi.org/10.1038/s41592-020-0801-4,47.2
4,4,AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA...,7.0,doi.org/10.1038/s41592-020-0801-4,49.5


# Find wildtype candidates for each length of protein string

In [3]:
train['x'] = train.protein_sequence.str.len()
vc = train.x.value_counts()
vc.head()

164    748
231    318
455    245
155    243
148    241
Name: x, dtype: int64

In [4]:
# INSERTION DELETION THRESHOLD
D_THRESHOLD = 1
# MIN GROUP SIZE
MIN_GROUP_SIZE = 5

def max_item_count(seq):
    d = defaultdict(int)
    for item in seq:
        d[item] += 1
    return max(d.items(), key=itemgetter(1))

def get_wildtype(proteins, is_retry=False):
    if not is_retry:
        ## try to get the mode, the simpler algorithm
        wildtype = []
        try:
            for i in range(len(proteins.iloc[0])):
                wildtype.append(mode([p[i] for p in proteins]))
            return ''.join(wildtype)
        except StatisticsError:
            pass
    ## Either failed mode above, or this is a retry because the resulting wildtype didn't actually fit enough proteins
    ##
    ## Two sequences with single mutation from the same wildtype are no more than 2 points different.
    ## Therefore, at least 1/3rd length consecutive string must match. Find max counts of starts, middles, and ends
    ## This technically isn't a guaranteed or precise algorithm, but it is fast and effective,
    ##   based on comparison with more precise grouping methods.
    k = len(proteins.iloc[0])//3
    starts = [p[:k] for p in proteins]
    middles = [p[k:2*k] for p in proteins]
    ends = [p[-k:] for p in proteins]
    ## get the most common substring, and the count of that substring
    start = max_item_count(starts)
    middle = max_item_count(middles)
    end = max_item_count(ends)
    ## reduce the proteins to the ones that match the most common substring
    if (start[1] >= middle[1]) and (start[1] >= end[1]) and (start[1] >= MIN_GROUP_SIZE):
        proteins = [p for p in proteins if p[:k] == start[0]]
        assert(start[1] == len(proteins))
    elif (middle[1] >= end[1]) and (middle[1] >= MIN_GROUP_SIZE):
        proteins = [p for p in proteins if p[k:2*k] == middle[0]]
        assert(middle[1] == len(proteins))
    elif end[1] >= MIN_GROUP_SIZE:
        proteins = [p for p in proteins if p[-k:] == end[0]]
        assert(end[1] == len(proteins))
    else:
        return ''
    ## use the reduced list to find the entire wildtype
    wildtype = []
    try:
        for i in range(len(proteins[0])):
            wildtype.append(mode([p[i] for p in proteins]))
        return ''.join(wildtype)
    except StatisticsError:
        return ""
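
The position-wise mode used in `get_wildtype` can be seen on a toy example. The `consensus` helper below is a hypothetical, simplified stand-in (equal-length sequences only, no retry logic). Note that on Python 3.8 and later, `statistics.mode` returns the first of several equally common values instead of raising `StatisticsError`, so on newer interpreters ties are resolved silently rather than triggering the fallback path.

```python
from statistics import mode


def consensus(seqs):
    """Most common residue at each position of equal-length sequences."""
    # zip(*seqs) yields one tuple of residues per position (column).
    return "".join(mode(column) for column in zip(*seqs))


# Four toy variants, each one substitution away from the same wildtype.
variants = ["MNAFEML", "MNIFEMA", "MNIFQML", "MKIFEML"]
print(consensus(variants))
```

Each variant disagrees with the others at a different position, so the majority vote at every column recovers the shared wildtype.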


In [5]:
train['group'] = -1
train['wildtype'] = ''
grp = 0

for k in range(len(vc)):
    if vc.iloc[k] < MIN_GROUP_SIZE:
        break
    c = vc.index[k]
    print(f'rows={vc.iloc[k]}, k:{k}, protein length:{c}')
    is_retry = False
    # SUBSET OF TRAIN DATA WITH SAME PROTEIN LENGTH (not enough deletions to matter for step 1, finding the wildtype)
    tmp = train.loc[(train.x==c)&(train.group==-1)]

    ## It is possible that the same length protein string might have multiple wildtypes in the raw data, keep searching until we've found all of them
    while len(tmp) >= MIN_GROUP_SIZE:
        if len(tmp)<=1: break
        # Ignore Levenshtein distance, which is overkill
        # Directly attempt to find wildtype
        # Drop duplicates for wildtype guesstimation
        proteins = tmp.protein_sequence.drop_duplicates()

        # Create most likely wildtype
        wildtype = get_wildtype(proteins, is_retry=is_retry)
        if wildtype == '':
            break

        # SUBSET OF TRAIN DATA WITH SAME PROTEIN LENGTH PLUS MINUS D_THRESHOLD
        tmp = train.loc[(train.x>=c-D_THRESHOLD)&(train.x<=c+D_THRESHOLD)&(train.group==-1)]
        for idx in tmp.index:
            p = train.loc[idx, 'protein_sequence']
            half = c//2
            ## Use fast method to guess that it is only a single point mutation away. Later we double check and actually count number of mutations.
            if (wildtype[:half] == p[:half]) or (wildtype[-half:] == p[-half:]):
                train.loc[idx,'group'] = grp
                train.loc[idx,'wildtype'] = wildtype
        if len(train.loc[train.group==grp]) >= MIN_GROUP_SIZE:
            print(f"{train.loc[(train.group==grp)].shape[0]}: Group {grp} results")
            grp += 1
            is_retry = False
        else:
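            ## Undo the tentative assignment for every row in this failed group, not just the last row visited
            train.loc[train.group==grp, 'wildtype'] = ''
            train.loc[train.group==grp, 'group'] = -1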
            ## to avoid an infinite loop, break out if we've already failed last time
            if is_retry:
                break
            is_retry = True

        # Get ready for next loop
        tmp = train.loc[(train.x==c)&(train.group==-1)]


rows=748, k:0, protein length:164
708: Group 0 results
rows=318, k:1, protein length:231
273: Group 1 results
rows=245, k:2, protein length:455
211: Group 2 results
rows=243, k:3, protein length:155
180: Group 3 results
34: Group 4 results
rows=241, k:4, protein length:148
194: Group 5 results
rows=233, k:5, protein length:448
130: Group 6 results
61: Group 7 results
rows=193, k:6, protein length:246
151: Group 8 results
rows=165, k:7, protein length:170
144: Group 9 results
rows=164, k:8, protein length:150
124: Group 10 results
rows=145, k:9, protein length:96
127: Group 11 results
rows=125, k:10, protein length:142
78: Group 12 results
rows=123, k:11, protein length:101
78: Group 13 results
32: Group 14 results
rows=116, k:12, protein length:109
78: Group 15 results
rows=115, k:13, protein length:485
84: Group 16 results
rows=99, k:14, protein length:268
53: Group 17 results
rows=99, k:15, protein length:537
72: Group 18 results
rows=99, k:16, protein length:144
68: Group 19 results

In [6]:
print(grp)

79


# Display Groups

In [7]:
def argsort(seq, reverse=False):
    # http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
    return sorted(range(len(seq)), key=seq.__getitem__, reverse=reverse)

groups = [0] * grp
for k in range(grp):
    groups[k] = len(train.loc[train.group==k])

groupCount = 0
rowCount = 0
for k in argsort(groups, reverse=True):
    if train.loc[train.group==k].shape[0] == 0:
        continue
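    ## Work on a copy so the '-' padding below does not touch train (and avoids chained-assignment warnings)
    proteins = train.loc[train.group==k, "protein_sequence"].copy()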
    wildtype = train.loc[train.group==k, "wildtype"].values[0]

    ## no insertions in the dataset, that I've found.
    ## Handle deletions by adding a '-' symbol in the correct place
    for i in range(len(proteins)):
        if len(proteins.iloc[i]) < len(proteins.iloc[0]):
            if proteins.iloc[i] == wildtype[:-1]:
                proteins.iloc[i] = proteins.iloc[i] + "-"
            else:
                for j in range(len(proteins.iloc[i])):
                    if proteins.iloc[i][j] != wildtype[j]:
                        ## Insert the gap at the first mismatch so the padded string is one character longer
                        proteins.iloc[i] = proteins.iloc[i][:j] + "-" + proteins.iloc[i][j:]
                        break
        assert(len(proteins.iloc[i]) == len(proteins.iloc[0]))

    ## In very rare cases, the simplified logic to group proteins will group a protein that is NOT a single mutation away from the wildtype.
    ## Ungroup those proteins.
    ungroup = []
    for p in proteins:
        mut = 0
        for j in range(len(wildtype)):
            if p[j] != wildtype[j]:
                mut += 1
        if mut > 1:
            if p not in ungroup:
                ungroup.append(p)
    for p in ungroup:
        train.loc[train.protein_sequence==p, 'group'] = -1
        train.loc[train.protein_sequence==p, 'wildtype'] = ''
    ## Remove entire group if it is now smaller than the min group size
    if train.loc[train.group==k].shape[0] < MIN_GROUP_SIZE:
        train.loc[train.group==k, 'wildtype'] = ''
        train.loc[train.group==k, 'group'] = -1
        continue

    ## Print a line for every group, and a bunch of stats for the first few groups
    print(f'{k}: {train.loc[train.group==k].shape[0]}')
    if groupCount < 5:
        display( train.loc[train.group==k] )
        for c in train.columns:
            print(c, train.loc[train.group==k, c].nunique() + train.loc[train.group==k, c].isnull().values.any())
        print(wildtype)
        print("")
    groupCount += 1
    rowCount += train.loc[train.group==k].shape[0]

print(groupCount, rowCount)

0: 708


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group,wildtype
16160,18020,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1021/bi00535a054,38.1,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
16161,18021,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,4.2,,53.3,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
16162,18022,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1038/334406a0,38.1,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
16163,18023,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,6.5,10.1038/334406a0,62.9,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
16200,18060,MNCFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,,41.9,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
...,...,...,...,...,...,...,...,...
17957,19949,MNWFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,6.5,10.1038/334406a0,56.7,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
17958,19950,MNWFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1038/334406a0,25.5,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
17971,19964,MNYFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1038/334406a0,32.4,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
17972,19965,MNYFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,6.5,10.1038/334406a0,58.8,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...


seq_id 708
protein_sequence 292
pH 54
data_source 59
tm 295
x 1
group 1
wildtype 1
MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSIWYNQTPNRAKRVITTFRTGTWDAYKNL

1: 273


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group,wildtype
15389,16542,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.00,10.1021/bi00006a025,50.9,231,1,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...
15390,16544,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,5.40,,40.0,231,1,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...
15391,16554,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.00,10.1016/j.bpc.2006.10.014,50.1,231,1,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...
15392,16555,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.00,10.1016/j.bpc.2006.10.014,48.9,231,1,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...
15393,16559,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.00,10.1016/j.bpc.2006.10.014,51.1,231,1,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...
...,...,...,...,...,...,...,...,...
15657,17299,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.00,10.1016/j.bpc.2006.10.014,48.5,231,1,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...
15658,17304,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,5.42,10.1021/bi00471a003,45.7,231,1,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...
15659,17306,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,5.40,,40.0,231,1,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...
15660,17314,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...,7.00,10.1016/j.bpc.2006.10.014,52.4,231,1,MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSC...


seq_id 273
protein_sequence 143
pH 17
data_source 17
tm 153
x 1
group 1
wildtype 1
MLVMTEYLLSAGICMAIVSILLIGMAISNVSKGQYAKRFFFFATSCLVLTLVVVSSLSSSANASQTDNGVNRSGSEDPTVYSATSTKKLHKEPATLIKAIDGDTVKLMYKGQPMTFRLLLVDTPETKHPKKGVEKYGPEASAFTKKMVENAKKIEVEFDKGQRTDKYGRGLAYIYADGKMVNEALVRQGLAKVAYVYKPNNTHEQHLRKSEAQAKKEKLNIWSEDNADSGQ

2: 211


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group,wildtype
17494,19480,MNQSVKSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...,8.0,10.1021/acscatal.9b05223,62.5,455,2,MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...
17495,19481,MNQSVSSLAEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...,8.0,10.1021/acscatal.9b05223,64.0,455,2,MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...
17496,19482,MNQSVSSLKEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...,8.0,10.1021/acscatal.9b05223,63.5,455,2,MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...
17497,19483,MNQSVSSLPEKDIQYQLHPYTDARLHQELGPLIIERGEGIYVYDDQ...,8.0,10.1021/acscatal.9b05223,62.0,455,2,MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...
17498,19484,MNQSVSSLPEKDIQYQLHPYTEARLHQELGPLIIERGEGIYVYDDQ...,8.0,10.1021/acscatal.9b05223,61.5,455,2,MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...
...,...,...,...,...,...,...,...,...
17700,19686,MNQSVSSLPEKDIQYQLHPYTNARLHQKLGPLIIERGEGIYVYDDQ...,8.0,10.1021/acscatal.9b05223,62.5,455,2,MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...
17701,19687,MNQSVSSLPEKDIQYQLHPYTNARLMQELGPLIIERGEGIYVYDDQ...,8.0,10.1021/acscatal.9b05223,62.0,455,2,MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...
17702,19688,MNQSVSSLPEKDIQYQLHPYTNLRLHQELGPLIIERGEGIYVYDDQ...,8.0,10.1021/acscatal.9b05223,62.0,455,2,MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...
17703,19689,MNQSVSSLPEKDKQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...,8.0,10.1021/acscatal.9b05223,62.0,455,2,MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQ...


seq_id 211
protein_sequence 211
pH 1
data_source 1
tm 31
x 1
group 1
wildtype 1
MNQSVSSLPEKDIQYQLHPYTNARLHQELGPLIIERGEGIYVYDDQGKGYIEAMAGLWSAALGFSNQRLIKAAEQQFNTLPFYHLFSHKSHRPSIELAEKLIEMAPVPMSKVFFTNSGSEANDTVVKMVWYLNNALGKPAKKKFISRVNGYHGITVASASLTGLPGNQRGFDLPLPGFLHVGCPHHYRFALAGESEEHFADRLAVELEQKILAEGPETIAAFIGEPLMGAGGVIVPPRTYWEKIQKVCRKYDILVIADEVICGFGRTGQMFGSQTFGIQPDIMVLSKQLSSSYQPIAAILINAPVFEGIADQSQALGALGHGFTGSGHPVATAVALENLKIIEEESLVEHAAQMGQLLRSGLQHFIDHPLVGEIRGCGLIAAVELVGDRVSKAPYQALGTLGRYMAGRAQEHGMITRAMGDAVAFCPPLIVNEQEVGMIVERFARALDDTTQWVG

5: 194


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group,wildtype
12401,13269,MKALIVLGLVLLSVTVQGAVFERCELARTLKRLGMDGYRGISLANW...,2.7,10.1046/j.1432-1327.1999.00918.x,62.7,148,5,MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANW...
12402,13270,MKALIVLGLVLLSVTVQGKAFERCELARTLKRLGMDGYRGISLANW...,2.7,10.1021/bi9621829,60.3,148,5,MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANW...
12403,13271,MKALIVLGLVLLSVTVQGKAFERCELARTLKRLGMDGYRGISLANW...,2.7,10.1006/jmbi.1998.1906,41.3,148,5,MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANW...
12404,13273,MKALIVLGLVLLSVTVQGKDFERCELARTLKRLGMDGYRGISLANW...,2.7,10.1074/jbc.M110728200,60.4,148,5,MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANW...
12405,13275,MKALIVLGLVLLSVTVQGKFFERCELARTLKRLGMDGYRGISLANW...,2.7,10.1021/bi0015717,62.2,148,5,MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANW...
...,...,...,...,...,...,...,...,...
12590,13569,MKALIVLGLVLLSVTVQGKVFERCQLARTLKRLGMDGYRGISLANW...,2.2,10.1021/bi000849s,57.8,148,5,MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANW...
12591,13570,MKALIVLGLVLLSVTVQGKVFERCQLARTLKRLGMDGYRGISLANW...,4.0,10.1021/bi000849s,76.2,148,5,MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANW...
12592,13571,MKALIVLGLVLLSVTVQGKVFERCQLARTLKRLGMDGYRGISLANW...,2.3,10.1021/bi000849s,54.6,148,5,MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANW...
12593,13573,MKALIVLGLVLLSVTVQGKYFERCELARTLKRLGMDGYRGISLANW...,2.7,10.1074/jbc.M110728200,63.7,148,5,MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANW...


seq_id 194
protein_sequence 122
pH 12
data_source 20
tm 110
x 1
group 1
wildtype 1
MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANWMCLAKWESGYNTRATNYNAGDRSTDYGIFQINSRYWCNDGKTPGAVNACHLSCSALLQDNIADAVACAKRVVRDPQGIRAWVAWRNRCQNRDVRQYVQGCGV

3: 180


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group,wildtype
14600,15685,MLKQVEIFTAGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...,3.0,10.1074/jbc.271.51.32729,58.1,155,3,MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...
14601,15686,MLKQVEIFTAGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...,9.0,10.1074/jbc.271.51.32729,60.7,155,3,MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...
14602,15687,MLKQVEIFTAGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...,5.5,10.1038/12277,62.6,155,3,MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...
14603,15688,MLKQVEIFTAGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...,5.8,,40.7,155,3,MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...
14604,15690,MLKQVEIFTDGSCLGNPGPGGYAAILRYRGREKTFSAGYTRTTNNR...,4.2,,51.5,155,3,MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...
...,...,...,...,...,...,...,...,...
14775,15875,MLKQVEIFTNGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...,9.0,10.1074/jbc.271.51.32729,53.8,155,3,MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...
14776,15876,MLKQVEIFTNGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...,3.0,10.1074/jbc.271.51.32729,47.4,155,3,MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...
14777,15878,MLKQVEIFTSGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...,9.0,10.1074/jbc.271.51.32729,56.2,155,3,MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...
14778,15879,MLKQVEIFTSGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...,3.0,10.1074/jbc.271.51.32729,52.4,155,3,MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNR...


seq_id 180
protein_sequence 74
pH 11
data_source 11
tm 93
x 2
group 1
wildtype 1
MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALKEHCEVILSTDSQYVRQGITQWIHNWKKRGWKTADKKPVKNVDLWQRLDAALGQHQIKWEWVKGHAGHPENERCDELARAAAMNPTLEDTGYQVEV

8: 151
9: 144
6: 130
11: 127
10: 124
16: 84
12: 78
13: 78
15: 78
18: 72
19: 68
30: 62
7: 61
20: 60
29: 59
36: 55
24: 54
25: 54
17: 53
26: 53
48: 52
21: 51
22: 49
53: 45
28: 41
32: 36
4: 34
14: 32
38: 31
54: 30
23: 28
35: 28
60: 27
31: 26
33: 25
47: 24
73: 21
34: 19
44: 19
46: 19
27: 16
51: 17
75: 17
37: 16
45: 16
41: 13
71: 13
72: 13
40: 12
43: 12
58: 12
62: 12
69: 12
68: 11
39: 10
49: 10
52: 10
55: 10
57: 10
59: 10
77: 10
64: 9
61: 8
67: 8
70: 8
76: 8
78: 8
56: 7
42: 6
63: 6
65: 6
74: 6
50: 5
78 4195


In [8]:
## Re-number groups from largest to smallest
groups = [0] * grp
for k in range(grp):
    groups[k] = len(train.loc[train.group==k])

n = 10000
for k in argsort(groups, reverse=True):
    train.loc[train.group==k, "group"] = n
    n += 1

train.loc[train.group>=10000, "group"] = train.loc[train.group>=10000, "group"] - 10000
train.loc[train.group==-1, "group"] = 1000
train = train.sort_values(axis=0, by=['group'], kind='mergesort').reset_index(drop=True)
train.loc[train.group==1000, "group"] = -1

# train = train.drop('x',axis=1)
train_wildtype_groups = train.loc[train.wildtype != '']
train_no_wildtype = train.loc[train.wildtype == '']
print(train_wildtype_groups.shape)
print(train_no_wildtype.shape)




(4195, 8)
(24786, 8)


# Save Groups

In [9]:
train_wildtype_groups.to_csv('train_wildtype_groups.csv',index=False)
train_wildtype_groups.head()

Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group,wildtype
0,18020,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1021/bi00535a054,38.1,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
1,18021,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,4.2,,53.3,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
2,18022,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,10.1038/334406a0,38.1,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
3,18023,MNAFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,6.5,10.1038/334406a0,62.9,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...
4,18060,MNCFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...,2.0,,41.9,164,0,MNIFEMLRIDERLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSEL...


In [10]:
train_no_wildtype.to_csv('train_no_wildtype.csv',index=False)
train_no_wildtype.head()

Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,x,group,wildtype
4195,0,AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG...,7.0,doi.org/10.1038/s41592-020-0801-4,75.7,341,-1,
4196,1,AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.5,286,-1,
4197,2,AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA...,7.0,doi.org/10.1038/s41592-020-0801-4,40.5,497,-1,
4198,3,AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL...,7.0,doi.org/10.1038/s41592-020-0801-4,47.2,265,-1,
4199,4,AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA...,7.0,doi.org/10.1038/s41592-020-0801-4,49.5,1451,-1,
