# Calculate Features
This notebook reads a dataset of protein sequences with fold type classifications and calculates a feature vector for each protein sequence using the Word2vec method.

In [1]:
# parameters
n_gram = 3 # size of n-gram
feature_col = "features" # feature vector
value_col = "foldClass" # fold class to be predicted

In [2]:
from gensim.models import Word2Vec
import pandas as pd
import numpy as np
import word2vecutils

In [3]:
df = pd.read_json("./foldClassification.json")

# Create n-grams of the Protein Sequence
The Word2vec method requires a sentence with words as input. Here we split a protein sequence string into a sequence of n-grams. These n-grams represent the "words" in a protein sequence and n is the number of characters in a word.

In this example we split sequences into overlapping 3-grams (n_gram = 3), e.g.: SRMPSPP... -> SRM RMP MPS PSP SPP...
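The splitting step can be sketched as a sliding window over the sequence string. The `ngrams_of` helper below is a hypothetical stand-in written for illustration; the actual `word2vecutils.ngrammer` is not shown in this notebook and may differ in detail.

```python
def ngrams_of(sequence, n):
    """Split a sequence string into overlapping n-grams (hypothetical helper)."""
    # Slide a window of width n over the sequence, one residue at a time.
    # Sequences shorter than n yield an empty list.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(ngrams_of("SRMPSPP", 3))
# → ['SRM', 'RMP', 'MPS', 'PSP', 'SPP']
```

A sequence of length L produces L - n + 1 overlapping n-grams, so the number of "words" still varies from protein to protein.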

In [4]:
# add column ngram to dataframe
df['ngram'] = df.sequence.apply(word2vecutils.ngrammer, n=n_gram)
df.head(3)

| Unnamed: 0 | Exptl. | FreeRvalue | R-factor | alpha | beta | coil | foldClass | length | pdbChainId | resolution | secondary_structure | sequence | ngram |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | XRAY | 0.26 | 0.19 | 0.469945 | 0.046448 | 0.483607 | alpha | 366 | 16VP.A | 2.1 | CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... | SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... | [SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ... |
| 1000 | XRAY | 0.23 | 0.18 | 0.50463 | 0.00463 | 0.490741 | alpha | 216 | 1PBW.B | 2.0 | CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT... | MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL... | [MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ... |
| 10002 | XRAY | 0.26 | 0.22 | 0.716172 | 0.006601 | 0.277228 | alpha | 303 | 4TQ3.A | 2.408 | CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC... | MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS... | [MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ... |


# Create a Word2vec Model

https://code.google.com/p/word2vec/


In [5]:
# get ngram column as a list
ngrams = list(df.ngram)

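# create word2vec model; passing the corpus to the constructor builds
# the vocabulary and already trains for `iter` epochs
# (gensim >= 4.0 renames `size` to `vector_size` and `iter` to `epochs`)
model = Word2Vec(ngrams, size=50, window=13, min_count=3, iter=5)

# continue training for another model.epochs passes over the corpus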
model.train(ngrams, total_examples=model.corpus_count, epochs=model.epochs)

(5370185, 5375965)

# Create a Fixed-sized Feature Vector
Machine learning methods require fixed-size feature vectors. However, protein chains vary in length and therefore yield different numbers of n-grams, so we need a way to create a fixed-size feature vector for each sequence.

Based on the paper (...) we average the word vectors for each sequence.
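The averaging idea can be sketched in a few lines. The `mean_word_vector` helper below is hypothetical and written for illustration only: the implementation of `word2vecutils.average_word_vec_scaled` is not shown in this notebook, and whatever scaling it applies is not reproduced here. Plain Python lists and a dict stand in for the NumPy vectors and for `model.wv`, which supports the same `in` test and `[]` lookup.

```python
def mean_word_vector(words, vectors):
    """Average the vectors of all in-vocabulary words (hypothetical helper).

    `vectors` is any mapping from word to an equal-length list of floats.
    """
    # Words below the model's min_count threshold have no vector; skip them.
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None  # no n-gram of this sequence is in the vocabulary
    # zip(*known) groups the values of each dimension across all words
    return [sum(dim) / len(known) for dim in zip(*known)]


toy_vectors = {"SRM": [1.0, 2.0], "RMP": [3.0, 4.0]}  # made-up 2-d vectors
print(mean_word_vector(["SRM", "RMP", "MPS"], toy_vectors))
# → [2.0, 3.0]
```

Whatever the number of n-grams in a sequence, the result has the same dimensionality as the word vectors (50 for the model trained in this notebook).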

In [6]:
df[feature_col] = df.ngram.apply(lambda ng: word2vecutils.average_word_vec_scaled(ng, model.wv))

df.head(3)

| Unnamed: 0 | Exptl. | FreeRvalue | R-factor | alpha | beta | coil | foldClass | length | pdbChainId | resolution | secondary_structure | sequence | ngram | features |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | XRAY | 0.26 | 0.19 | 0.469945 | 0.046448 | 0.483607 | alpha | 366 | 16VP.A | 2.1 | CCSCCCCCCCCHHHHHHHHHHHHTCTTHHHHHHHHHHCCCCCSTTS... | SRMPSPPMPVPPAALFNRLLDDLGFSAGPALCTMLDTWNEDLFSAL... | [SRM, RMP, MPS, PSP, SPP, PPM, PMP, MPV, PVP, ... | [1.1960747496336708, -1.0387311626949265, 1.37... |
| 1000 | XRAY | 0.23 | 0.18 | 0.50463 | 0.00463 | 0.490741 | alpha | 216 | 1PBW.B | 2.0 | CCCCCCCCCCCCCCHHHHCCTTSCSCHHHHHHHHHHHHHHTTCTTT... | MEADVEQQALTLPDLAEQFAPPDIAPPLLIKLVEAIEKKGLECSTL... | [MEA, EAD, ADV, DVE, VEQ, EQQ, QQA, QAL, ALT, ... | [0.9981168457395807, -1.0488222891799788, -0.0... |
| 10002 | XRAY | 0.26 | 0.22 | 0.716172 | 0.006601 | 0.277228 | alpha | 303 | 4TQ3.A | 2.408 | CCCCCCCCCCCCCCCHHHHHHCGGGGHHHHHHHHHHHHHHCCTTSC... | MDSSLANINQIDVPSKYLRLLRPVAWLCFLLPYAVGFGFGITPNAS... | [MDS, DSS, SSL, SLA, LAN, ANI, NIN, INQ, NQI, ... | [1.2901504611184786, -1.3279968309953385, 1.52... |


### Save DataFrame

In [7]:
df.to_json("./features.json")

### Save Word2vec Model

In [8]:
model.save("./word2vecmodel")

## Next step
After saving the dataset here, go back to [0-Workflow.ipynb](./0-Workflow.ipynb) to run the next step of the analysis.