[The VHSE-Based Prediction of Proteasomal Cleavage Sites](https://dx.doi.org/10.1371%2Fjournal.pone.0074506)

* What about a neural network (NN) with n * 8 input nodes (8 VHSE descriptors per residue) and n output nodes?
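
One way to picture the network in the question above: a window of n residues, each described by 8 VHSE values, is flattened into n * 8 inputs, and the network returns one score per position. The sketch below is only an illustration of those shapes in NumPy. The single hidden layer, its size (`n_hidden`), the random weights, and the sigmoid outputs are all assumptions; neither the paper nor this notebook specifies them, and `init_params` and `forward` are hypothetical names.

```python
import numpy as np

def init_params(n_positions, n_hidden=16, seed=0):
    """Random weights for an (n * 8) -> n_hidden -> n network (sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    n_in = n_positions * 8  # 8 VHSE descriptors per residue
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=(n_hidden, n_positions)),
        "b2": np.zeros(n_positions),
    }

def forward(params, x):
    """Map a batch of flattened VHSE windows to one score in (0, 1) per position."""
    hidden = np.tanh(x @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logits))  # element-wise sigmoid

n = 28                               # window length, chosen for illustration
params = init_params(n)
x = np.zeros((1, n * 8))             # one all-zero dummy window
scores = forward(params, x)          # shape (1, n): one score per position
```

The weights here are untrained, so the scores carry no meaning; the point is only that an input of width n * 8 yields n outputs.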


In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [40]:
# VHSE (principal component score vector of hydrophobic, steric, and electronic properties)
# is a set of amino acid descriptors from
# http://onlinelibrary.wiley.com/doi/10.1002/bip.20296/abstract
# Each amino acid gets eight scores, the so-called VHSE descriptors:
# VHSE1 and VHSE2 relate to hydrophobic properties,
# VHSE3 and VHSE4 to steric properties, and
# VHSE5 to VHSE8 to electronic properties.
# Each entry below is (three-letter code, VHSE1, ..., VHSE8).
vhse = {
"A": ("Ala", 0.15, -1.11, -1.35, -0.92, 0.02, -0.91, 0.36, -0.48),
"R": ("Arg", -1.47, 1.45, 1.24, 1.27, 1.55, 1.47, 1.30, 0.83),
"N": ("Asn", -0.99, 0.00, -0.37, 0.69, -0.55, 0.85, 0.73, -0.80),
"D": ("Asp", -1.15, 0.67, -0.41, -0.01, -2.68, 1.31, 0.03, 0.56),
"C": ("Cys", 0.18, -1.67, -0.46, -0.21, 0.00, 1.20, -1.61, -0.19),
"Q": ("Gln", -0.96, 0.12, 0.18, 0.16, 0.09, 0.42, -0.20, -0.41),
"E": ("Glu", -1.18, 0.40, 0.10, 0.36, -2.16, -0.17, 0.91, 0.02),
"G": ("Gly", -0.20, -1.53, -2.63, 2.28, -0.53, -1.18, 2.01, -1.34),
"H": ("His", -0.43, -0.25, 0.37, 0.19, 0.51, 1.28, 0.93, 0.65),
"I": ("Ile", 1.27, -0.14, 0.30, -1.80, 0.30, -1.61, -0.16, -0.13),
"L": ("Leu", 1.36, 0.07, 0.26, -0.80, 0.22, -1.37, 0.08, -0.62),
"K": ("Lys", -1.17, 0.70, 0.70, 0.80, 1.64, 0.67, 1.63, 0.13),
"M": ("Met", 1.01, -0.53, 0.43, 0.00, 0.23, 0.10, -0.86, -0.68),
"F": ("Phe", 1.52, 0.61, 0.96, -0.16, 0.25, 0.28, -1.33, -0.20),
"P": ("Pro", 0.22, -0.17, -0.50, 0.05, -0.01, -1.34, -0.19, 3.56),
"S": ("Ser", -0.67, -0.86, -1.07, -0.41, -0.32, 0.27, -0.64, 0.11),
"T": ("Thr", -0.34, -0.51, -0.55, -1.06, 0.01, -0.01, -0.79, 0.39),
"W": ("Trp", 1.50, 2.06, 1.79, 0.75, 0.75, -0.13, -1.06, -0.85),
"Y": ("Tyr", 0.61, 1.60, 1.17, 0.73, 0.53, 0.25, -0.96, -0.52),
"V": ("Val", 0.76, -0.92, 0.17, -1.91, 0.22, -1.40, -0.24, -0.03)}
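
A lookup table like the one above can turn a peptide string into a flat feature vector of 8 values per residue. The following is a minimal sketch of that idea, not code from the paper: `encode_peptide` is a hypothetical helper, and `vhse_excerpt` copies only three rows of the table so the example stays short.

```python
# Three rows copied from the VHSE table, in the same (name, VHSE1..VHSE8) layout.
vhse_excerpt = {
    "A": ("Ala", 0.15, -1.11, -1.35, -0.92, 0.02, -0.91, 0.36, -0.48),
    "R": ("Arg", -1.47, 1.45, 1.24, 1.27, 1.55, 1.47, 1.30, 0.83),
    "G": ("Gly", -0.20, -1.53, -2.63, 2.28, -0.53, -1.18, 2.01, -1.34),
}

def encode_peptide(peptide, table):
    """Return a flat list of 8 * len(peptide) VHSE values, in residue order."""
    features = []
    for residue in peptide:
        if residue not in table:
            raise KeyError(f"no VHSE entry for residue {residue!r}")
        features.extend(table[residue][1:])  # drop the three-letter name
    return features

encoded = encode_peptide("GAR", vhse_excerpt)  # 3 residues -> 24 values
```

With the full table, the call would be `encode_peptide(some_peptide, vhse)`.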


In [57]:
data_prefix = "/Users/matt/SpiderOak Hive/data/proteasomal_cleavage/pone.0074506.s00"

In [78]:
# Load the training set (pone.0074506.s005.xlsx) and summarize the peptide lengths
training_set = pd.read_excel(data_prefix + "5.xlsx")
training_set['len'] = training_set.Peptides.str.len()
training_set['len'].describe()


count    231.000000
mean      28.718615
std        2.367084
min       18.000000
25%       29.000000
50%       29.000000
75%       30.000000
max       32.000000
Name: len, dtype: float64

Each *Sequence* is 28 amino acids (AAs) long. The remaining columns come from merging the VHSE table onto each position of the *Sequence*: for example, the *VHSE11* column holds *VHSE1* for position 1, and *VHSE288* holds *VHSE8* for position 28.
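
The column naming described above can be sketched as follows. The scheme `VHSE{position}{descriptor}` is inferred from the two examples in the text (the last digit is the descriptor, 1 to 8; the digits before it are the position); it is an assumption about how the dataset was built, and `vhse_columns` and `table_excerpt` are hypothetical names.

```python
# Two rows copied from the VHSE table, in the same (name, VHSE1..VHSE8) layout.
table_excerpt = {
    "A": ("Ala", 0.15, -1.11, -1.35, -0.92, 0.02, -0.91, 0.36, -0.48),
    "G": ("Gly", -0.20, -1.53, -2.63, 2.28, -0.53, -1.18, 2.01, -1.34),
}

def vhse_columns(sequence, table):
    """Build one row of per-position VHSE columns, e.g. 'VHSE11' ... 'VHSE288'."""
    row = {}
    for pos, residue in enumerate(sequence, start=1):                # 1-based position
        for k, value in enumerate(table[residue][1:], start=1):      # descriptor 1..8
            row[f"VHSE{pos}{k}"] = value
    return row

row = vhse_columns("AG", table_excerpt)
# row["VHSE11"] is VHSE1 of position 1 (Ala); row["VHSE28"] is VHSE8 of position 2 (Gly)
```

A list of such rows, one per sequence, can be passed to `pd.DataFrame(rows)` to get a table with one column per position and descriptor.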