# Predicting Brightness of Green Fluorescent Protein (GFP) Mutants

In this project, my task is to predict the brightness level of green fluorescent protein (GFP) mutants. The brightness levels are binarized, so the models I train distinguish between high brightness (1) and low brightness (0) for a set of GFP mutants.

GFP is a protein that is naturally expressed in jellyfish and exhibits green fluorescence when exposed to light<sup>1</sup>. It has become an instrumental tool for biologists and other life sciences researchers for visualizing physiological processes, protein localization, gene expression, and more<sup>2</sup>. Since GFP is such an integral part of the biologist's toolbox, the design and use of GFP mutants with increased brightness and photostability is increasingly important as well. This is the motivation for this project. Using data that consists of the amino acid sequences of various GFP mutants and measurements of their brightness, I develop a model that classifies a GFP mutant as "high brightness" (1) or "low brightness" (0).

In [2]:
import warnings
warnings.filterwarnings("ignore")
%config Completer.use_jedi = False

## Loading and Visualizing the Data

In [3]:
import pandas as pd

X_train = pd.read_csv("X_train_kaggle.csv")
X_train = X_train.rename(columns={'CunstructedAASeq_cln':'ConstructedAASeq_cln'})
X_test = pd.read_csv("X_test_kaggle.csv")
X_test = X_test.rename(columns={'CunstructedAASeq_cln':'ConstructedAASeq_cln'})
y_train = pd.read_csv("y_train_kaggle.csv")

dpps = pd.read_csv("DPPS.csv")
dpps = dpps.drop(["#from "], axis = 1)
dpps = dpps.drop([0, 1])
dpps = dpps.rename(columns={'﻿10.2174/092986608786071120': "AA"})
dpps = dpps.set_index('AA')

X_train.head(10)

Unnamed: 0,ConstructedAASeq_cln,Id
0,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,11328
1,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5781
2,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,13681
3,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,30804
4,SKGEELFTGVVPILVELDGDVNGHTFSVSGEGEGDATYGELTLKFI...,30813
5,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5983
6,SKGEELFTGVVPILVELDGDVNGHKFSESGEGEGDATYGKLTLKFI...,20374
7,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22332
8,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,17800
9,SKGEELLTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22064


In [4]:
X_test.head(10)

Unnamed: 0,ConstructedAASeq_cln,Id
0,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,50579
1,SKGEELFTGVVPILVELDGDVSGHKFSVSGEGEGDATYGKLTLKFI...,37987
2,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,53977
3,SKGEELFTGVVPILVELDGDVNGHKLSVSGEGEGDATYGKLTLKFI...,10677
4,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,35653
5,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATNGKLTLKFI...,53275
6,SKGGELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,7765
7,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGGATYGKLTLKFV...,3759
8,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,39236
9,SKGEELFTGVVPILAELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,48246


In [5]:
dpps

AA,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
A,-1.02,-2.88,-0.56,0.36,-6.15,-1.68,0.04,-2.51,-1.94,-0.01
R,1.99,4.13,-4.41,-1.02,4.78,3.04,-9.06,6.71,4.41,0.07
N,-2.19,1.86,0.38,-0.13,-2.3,1.41,-5.71,-1.11,1.73,-0.19
D,-6.6,3.32,1.61,0.36,-3.25,1.95,-7.36,0.14,1.24,-0.15
C,0.21,1.12,3.42,-0.68,-2.27,-1.22,3.11,-2.98,-1.7,1.57
Q,-0.47,1.16,-0.57,0.69,0.39,1.93,-5.46,-0.84,1.93,0.85
E,-5.39,0.65,-0.98,1.39,-0.23,2.51,-6.84,-0.68,1.41,1.28
G,-2.86,-5.0,-2.97,0.53,-11.45,1.89,-2.11,-3.99,-2.16,-0.76
H,0.73,2.68,-0.66,-1.89,1.6,1.13,-1.94,-0.11,0.44,0.15
I,1.91,-3.13,0.01,1.14,2.7,-4.55,8.93,0.18,-1.1,-0.76


In [6]:
y_train.head(10)

Unnamed: 0,Brightness_Class,Id
0,0,11328
1,0,5781
2,0,13681
3,0,30804
4,0,30813
5,1,5983
6,0,20374
7,1,22332
8,1,17800
9,0,22064


Shown above are the data I am working with. X_train and X_test are dataframes in which each row is a particular GFP mutant; the columns hold the mutant's amino acid sequence (ConstructedAASeq_cln) and its Id.

The dpps dataframe consists of physicochemical descriptors of each amino acid. These descriptors provide relevant descriptions of each amino acid that can be used to predict GFP brightness. The authors of [3] (see references) gathered 119 physicochemical properties of amino acids and separated them into four groups, with each group being a type of property. The four groups are electronic, hydrophobic, steric, and hydrogen bond properties. Running PCA on each group, the authors found that "for the matrices of electronic, steric, hydrophobic and hydrogen bond properties, the first 4, 2, 2 and 2 principal components accounted for 74.44%, 72.72%, 73.78% and 77.15% variance of original data matrices, respectively"<sup>3</sup>. Thus all these properties of amino acids are summarized by 4 + 2 + 2 + 2 = 10 principal components. These 10 principal components make up the divided physicochemical property scores (dpps) dataframe shown above. Each column is one principal component score, and each row gives the ten values for one amino acid.
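To make the component counts above more concrete, here is a minimal sketch of how one might find the smallest number of principal components whose cumulative explained variance reaches a chosen threshold. The eigenvalues are made-up illustration values, not numbers from the paper, and `components_needed` is a hypothetical helper name.

```python
def explained_variance_ratios(eigenvalues):
    """Share of total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def components_needed(eigenvalues, threshold):
    """Smallest k whose cumulative explained variance reaches `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratios(ordered), start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues of a covariance matrix (illustration only)
toy_eigenvalues = [5.0, 2.5, 1.5, 0.6, 0.4]
k, covered = components_needed(toy_eigenvalues, threshold=0.70)
print(k, round(covered, 3))
```

With these toy values the first component carries half of the variance and the first two together carry three quarters, so two components are enough for a 70% threshold.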

Lastly, y_train contains the binarized brightness class for each GFP mutant. 

The X_train and X_test data are not yet ready for modeling because they contain only the amino acid sequences as strings of letters, whereas the models need the numeric values in the dpps dataframe. So next, I do some feature engineering: every letter in X_train and X_test is replaced with the ten values associated with that letter in the dpps data. The example below shows the dpps values for alanine (A). The goal is to replace every "A" in X_train and X_test with these ten values, and to do the same for all the other amino acids.

In [7]:
dpps.head(1)

AA,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
A,-1.02,-2.88,-0.56,0.36,-6.15,-1.68,0.04,-2.51,-1.94,-0.01
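The replacement described above can be sketched on a toy scale as follows. This is only an illustration: `toy_dpps` holds the first three of the ten dpps values for A and R from the table shown earlier, and `encode_sequence` is a hypothetical helper, not the function used in the cells below.

```python
# Toy descriptor table: first 3 of the 10 dpps values for A and R
toy_dpps = {
    "A": [-1.02, -2.88, -0.56],
    "R": [1.99, 4.13, -4.41],
}


def encode_sequence(sequence, descriptors):
    """Replace each residue letter with its descriptor values, in order."""
    features = []
    for residue in sequence:
        # Raises KeyError for a letter that has no descriptor row
        features.extend(descriptors[residue])
    return features


encoded = encode_sequence("ARA", toy_dpps)
print(len(encoded))   # 3 residues x 3 descriptors
print(encoded[3:6])   # the block of values contributed by "R"
```

By the same arithmetic, a full 237-residue sequence with ten descriptors per residue yields 2370 numeric features before any constant columns are dropped.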


## Feature Engineering: Transforming X_train and X_test

In [8]:
#### Transforming X_train ####

n = X_train.shape[0]
length = len(X_train["ConstructedAASeq_cln"][0])
sep_str = []
for i in range(n):
    row = []
    if (len(X_train['ConstructedAASeq_cln'][i]) == 237):
        for k in range(0, length, 1):
            p = X_train["ConstructedAASeq_cln"][i][k:k + 1]
            row.append(p)
        sep_str.append(row)
    else:
        X_train = X_train.drop(i)
        y_train = y_train.drop(i)
        
        
df_seq = pd.DataFrame(sep_str)
X_train = pd.concat([X_train, df_seq], axis = 1)
X_train = X_train.drop(['ConstructedAASeq_cln'], axis = 1).set_index("Id")
X_train = X_train.dropna()

dppsT = dpps.T
dict_dpps = dppsT.to_dict('list')

l = []
for i in range(length):
    l.append(pd.DataFrame(X_train[i].map(dict_dpps).tolist()))
X_train_dpps = pd.concat(l, axis = 1)

new_X_train = X_train_dpps.loc[:, (X_train_dpps != X_train_dpps.iloc[0]).any()]
new_X_train = new_X_train.apply(pd.to_numeric)
y_train = y_train.set_index("Id")

new_X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,0.1,1.1,2.1,3.1,4.1,5.1,6.1,7.1,8.1,9.1
0,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.10,1.26,1.15,0.91,5.90,0.74,3.71,3.32,0.25,1.33
1,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.10,1.26,1.15,0.91,5.90,0.74,3.71,3.32,0.25,1.33
2,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.10,1.26,1.15,0.91,5.90,0.74,3.71,3.32,0.25,1.33
3,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.10,1.26,1.15,0.91,5.90,0.74,3.71,3.32,0.25,1.33
4,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.10,1.26,1.15,0.91,5.90,0.74,3.71,3.32,0.25,1.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31024,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.10,1.26,1.15,0.91,5.90,0.74,3.71,3.32,0.25,1.33
31025,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,0.21,1.12,3.42,-0.68,-2.27,-1.22,3.11,-2.98,-1.70,1.57
31026,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.10,1.26,1.15,0.91,5.90,0.74,3.71,3.32,0.25,1.33
31027,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.10,1.26,1.15,0.91,5.90,0.74,3.71,3.32,0.25,1.33


In [9]:
#### Transforming X_test ####

m = X_test.shape[0]
length_2 = len(X_test["ConstructedAASeq_cln"][0])
sep_str2 = []
for i in range(m):
    row = []
    if (len(X_test['ConstructedAASeq_cln'][i]) == 237):
        for k in range(0, length_2, 1):
            p = X_test["ConstructedAASeq_cln"][i][k:k + 1]
            row.append(p)
        sep_str2.append(row)
    else:
        X_test = X_test.drop(i)
        
df_seq2 = pd.DataFrame(sep_str2)
X_test = pd.concat([X_test, df_seq2], axis = 1)
X_test = X_test.drop(['ConstructedAASeq_cln'], axis = 1).set_index("Id")
X_test = X_test.dropna()

l2 = []
for i in range(length_2):
    l2.append(pd.DataFrame(X_test[i].map(dict_dpps).tolist()))
X_test_dpps = pd.concat(l2, axis = 1)

new_X_test = X_test_dpps.loc[:, (X_test_dpps != X_test_dpps.iloc[0]).any()]
new_X_test = new_X_test.apply(pd.to_numeric)
new_X_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,0.1,1.1,2.1,3.1,4.1,5.1,6.1,7.1,8.1,9.1
0,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.1,1.26,1.15,0.91,5.9,0.74,3.71,3.32,0.25,1.33
1,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.1,1.26,1.15,0.91,5.9,0.74,3.71,3.32,0.25,1.33
2,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.1,1.26,1.15,0.91,5.9,0.74,3.71,3.32,0.25,1.33
3,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.1,1.26,1.15,0.91,5.9,0.74,3.71,3.32,0.25,1.33
4,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.1,1.26,1.15,0.91,5.9,0.74,3.71,3.32,0.25,1.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20681,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.1,1.26,1.15,0.91,5.9,0.74,3.71,3.32,0.25,1.33
20682,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.1,1.26,1.15,0.91,5.9,0.74,3.71,3.32,0.25,1.33
20683,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.1,1.26,1.15,0.91,5.9,0.74,3.71,3.32,0.25,1.33
20684,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65,...,2.1,1.26,1.15,0.91,5.9,0.74,3.71,3.32,0.25,1.33


The features have now been engineered for analysis, so I am ready to start modeling.
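One detail worth checking before modeling: in the cells above, the non-constant columns are selected separately for X_train_dpps and X_test_dpps, so the two matrices could in principle keep different columns. A minimal sketch of deciding the columns on the training data only and reusing that choice for the test data is shown below. It uses plain lists of rows and hypothetical helper names so that it runs on its own.

```python
def varying_columns(rows):
    """Indices of columns that are not constant across all rows."""
    first = rows[0]
    return [
        j for j in range(len(first))
        if any(row[j] != first[j] for row in rows)
    ]


def select_columns(rows, keep):
    """Keep only the columns listed in `keep`, preserving their order."""
    return [[row[j] for j in keep] for row in rows]


# Toy matrices: column 0 is constant everywhere, column 1 is constant only in test
train = [[2.47, 1.0, 0.5], [2.47, 2.0, 0.5], [2.47, 1.0, 0.7]]
test = [[2.47, 3.0, 0.5], [2.47, 3.0, 0.5]]

keep = varying_columns(train)            # decided on the training data only
train_sel = select_columns(train, keep)
test_sel = select_columns(test, keep)    # same columns, even if constant in test

print(keep)
print(varying_columns(test))             # would differ if computed on test alone
```

With pandas, the same idea would presumably be to compute the boolean mask once from X_train_dpps and apply it positionally to X_test_dpps, for example with `.iloc[:, mask.to_numpy()]`.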

## Modeling

In [1]:
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import RidgeClassifier
from sklearn.dummy import DummyClassifier
import numpy as np

### Logistic Regression

As this is a binary classification problem, logistic regression is a suitable first choice. I start with a simple logistic regression model without any hyperparameter tuning and measure performance using the f1-score. I then compare logistic regression with a ridge classifier and a linear support vector classifier.
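As a reminder of what the f1-score measures, the sketch below computes precision, recall, f1 and accuracy by hand for a small made-up set of labels. The labels are illustrative only and `binary_metrics` is a hypothetical helper. In the notebook itself the scores come from scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, f1 and accuracy for 0/1 labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # f1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 3 bright mutants, 5 dim mutants
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Here f1 is 2/3 while accuracy is 0.75, which shows that the two metrics are not interchangeable when the classes are unbalanced.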

#### Default Logistic Regression

In [4]:
cv = KFold(n_splits=5, random_state=15, shuffle=True)
log = LogisticRegression()
log_cv_f1 = cross_val_score(log, new_X_train, y_train, cv = cv, scoring = "f1", n_jobs = -1).mean()
print("Default Logistic f1:", log_cv_f1)

Default Logistic f1: 0.8175576534642228


#### Optimal Logistic Regression

In [23]:
#### Finding the Optimal Scaler ####

scaler = StandardScaler()
log_model = LogisticRegression()
pipe1 = Pipeline(steps=[('scaler', scaler),
                       ('logistic', log_model)])

scale_parameters =  {'scaler': [StandardScaler(), MaxAbsScaler(),
                MinMaxScaler(), Normalizer()]
}

scaled_gs = GridSearchCV(pipe1, scale_parameters, verbose=True, cv=5).fit(new_X_train, y_train)
scaled_log_model = scaled_gs.best_estimator_
scaled_gs.best_params_ # StandardScaler() is the best scaler

Fitting 5 folds for each of 4 candidates, totalling 20 fits


{'scaler': StandardScaler()}

In [24]:
%%time
#### Finding Optimal Parameters ####


C_val = np.logspace(0.1, 1, 10)
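# Note: 'elastinet' is a misspelling of 'elasticnet', so any candidate that samples it
# fails to fit; with the default error_score such candidates are scored as NaN
Penalty = ['l1', 'l2', 'elastinet']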
Solver = ['lbfgs', 'newton-cg', 'liblinear']

param_list = dict(logistic__penalty = Penalty,
         logistic__C = C_val,
         logistic__solver = Solver)


rs = RandomizedSearchCV(scaled_log_model, param_list, verbose=True, scoring='f1')
rs.fit(new_X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
CPU times: user 1h 21min 43s, sys: 1min 22s, total: 1h 23min 5s
Wall time: 57min 16s


RandomizedSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                             ('logistic',
                                              LogisticRegression())]),
                   param_distributions={'logistic__C': array([ 1.25892541,  1.58489319,  1.99526231,  2.51188643,  3.16227766,
        3.98107171,  5.01187234,  6.30957344,  7.94328235, 10.        ]),
                                        'logistic__penalty': ['l1', 'l2',
                                                              'elastinet'],
                                        'logistic__solver': ['lbfgs',
                                                             'newton-cg',
                                                             'liblinear']},
                   scoring='f1', verbose=True)
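Not every penalty in the search above works with every solver, so some sampled candidates cannot be fitted. A minimal sketch of a search space that only pairs compatible values is shown below. The compatibility table reflects my understanding of the scikit-learn LogisticRegression documentation and may differ between versions. The 'saga' solver and the `l1_ratio` values are assumptions added for illustration, and `build_param_grid` is a hypothetical helper.

```python
# (solver, penalties it accepts, extra parameters those penalties need)
COMPATIBLE = [
    ("lbfgs", ["l2"], {}),
    ("newton-cg", ["l2"], {}),
    ("liblinear", ["l1", "l2"], {}),
    ("saga", ["l1", "l2"], {}),
    ("saga", ["elasticnet"], {"l1_ratio": [0.25, 0.5, 0.75]}),
]


def build_param_grid(c_values, step="logistic"):
    """Build a list of dicts, one per compatible solver/penalty group."""
    grid = []
    for solver, penalties, extra in COMPATIBLE:
        entry = {
            f"{step}__solver": [solver],
            f"{step}__penalty": list(penalties),
            f"{step}__C": list(c_values),
        }
        for name, values in extra.items():
            entry[f"{step}__{name}"] = list(values)
        grid.append(entry)
    return grid


# Same ten C values as np.logspace(0.1, 1, 10), written without numpy
c_values = [10 ** (0.1 + 0.1 * i) for i in range(10)]
param_grid = build_param_grid(c_values)
print(len(param_grid))
print(param_grid[-1])
```

A list of dicts like this can be passed as `param_grid` to GridSearchCV, and recent scikit-learn versions also accept it as `param_distributions` for RandomizedSearchCV.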

In [25]:
print(rs.best_params_)
print()
print(rs.best_estimator_)

{'logistic__solver': 'newton-cg', 'logistic__penalty': 'l2', 'logistic__C': 1.5848931924611136}

Pipeline(steps=[('scaler', StandardScaler()),
                ('logistic',
                 LogisticRegression(C=1.5848931924611136, solver='newton-cg'))])


In [26]:
best_log = rs.best_estimator_

cv = KFold(n_splits=5, random_state=12, shuffle=True)
best_log_cv_f1 = cross_val_score(best_log, new_X_train, y_train, cv = cv, scoring = "f1", n_jobs = -1).mean()
print("Best Logistic f1:", best_log_cv_f1)

Best Logistic f1: 0.869666802939425


### Ridge Regression Classifier

Ridge classifier was chosen because, like logistic regression, it is a linear model, but it fits a least-squares regression on the class labels with an L2 penalty. The penalty introduces some bias by shrinking the coefficients of less important features, which should in theory reduce variance. In practice, the tuned ridge classifier scored lower than the tuned logistic regression.

#### Default

In [9]:
rclf = RidgeClassifier()
cv = KFold(n_splits=5, random_state=12, shuffle=True)
rclf_cv_f1 = cross_val_score(rclf, new_X_train, y_train, cv = cv, scoring = "f1", n_jobs = -1).mean()
print("Default Ridge f1:", rclf_cv_f1)

Default Ridge f1: 0.8305215563255259


#### Optimal Ridge Classifier

In [29]:
%%time
scaler = StandardScaler()
rc = RidgeClassifier()
pipe_ridgeClass = Pipeline(steps=[('scale', scaler),
                       ('estimator', rc)])

alpha = np.arange(0, 30, 1)

alpha_param = dict(estimator__alpha = alpha)

cv = KFold(n_splits=5, random_state=12, shuffle=True)
ridge_gs = GridSearchCV(pipe_ridgeClass, alpha_param, cv= cv).fit(new_X_train, y_train.values.ravel())
ridge_gs.best_params_

CPU times: user 21min 3s, sys: 2min 40s, total: 23min 44s
Wall time: 11min 29s


{'estimator__alpha': 29}

In [30]:
best_ridgeClass = ridge_gs.best_estimator_
rkf_ridge = KFold(n_splits=5, random_state=1, shuffle=True)
# No scoring argument is passed, so this is the default score (mean accuracy), not f1
best_ridge_cv = cross_val_score(best_ridgeClass, new_X_train, y_train, cv = rkf_ridge).mean()
print("Best Ridge Classifier accuracy:", best_ridge_cv)

Best Ridge Classifier accuracy: 0.8549099815805608


### Support Vector Classifier (Linear Kernel)

Support vector classifiers were also chosen here because they are effective on high-dimensional datasets. A linear support vector classifier finds the hyperplane in the feature space that separates the two classes with the largest margin. After tuning the 'C', 'penalty' and 'loss' parameters, this was the best-scoring model, with a cross-validated score of 87.15%.

It is worth noting that, as shown below, optimizing this classifier took almost three hours, and that is with only some of the parameters tuned. Although this was the highest-scoring model, it is only barely so: logistic regression took about an hour to optimize and gave an f1-score of 86.97%, and the ridge classifier had the lowest score at 85.49% but took only eleven minutes to optimize. One caveat is that the final cross-validation calls for the ridge and support vector classifiers do not pass scoring = "f1", so those two figures are mean accuracy rather than f1 and are not strictly comparable with the logistic regression figure. If time is not an issue, the support vector classifier is the best-performing model by these numbers.
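As a rough illustration of what the 'C' and 'loss' parameters control, the sketch below evaluates the standard soft-margin objective of a linear classifier: an L2 penalty on the weights plus C times the summed hinge (or squared hinge) loss. The weights and data points are made up, labels are encoded as +1 and -1, and the function names are hypothetical. The actual optimization in the notebook is done by LinearSVC.

```python
def decision(w, b, x):
    """Signed score of point x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge(y, score):
    """Zero when the point is on the correct side with margin at least 1."""
    return max(0.0, 1.0 - y * score)


def objective(w, b, X, y, C, squared=False):
    """0.5 * ||w||^2 + C * sum of (squared) hinge losses."""
    penalty = 0.5 * sum(wi * wi for wi in w)
    losses = [hinge(yi, decision(w, b, xi)) for xi, yi in zip(X, y)]
    if squared:
        losses = [loss * loss for loss in losses]
    return penalty + C * sum(losses)


# Made-up weights and three 2-D points
w, b = [1.0, -1.0], 0.0
X = [[2.0, 0.0], [0.5, 0.0], [0.0, 1.0]]
y = [1, 1, -1]

print(objective(w, b, X, y, C=2.0))                # hinge loss
print(objective(w, b, X, y, C=2.0, squared=True))  # squared hinge loss
```

Only the second point lies inside the margin, so it alone contributes to the loss. A larger C makes such margin violations more expensive relative to the penalty on the weights.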

#### Default

In [5]:
scaler = StandardScaler()
lsvc = LinearSVC(random_state=0, tol=1e-5)
svc = Pipeline(steps=[('scaler', scaler), ('SVM', lsvc)])

In [31]:
cv = KFold(n_splits=5, random_state=12, shuffle=True)
svc_cv_f1 = cross_val_score(svc, new_X_train, y_train.values.ravel(), cv = cv, scoring = "f1", n_jobs = -1).mean()
print("Default SVC f1:", svc_cv_f1)

Default SVC f1: 0.8284001702253215


#### Optimal SVC

In [6]:
%%time
C_val = np.logspace(0.1, 1, 10)
Penalty = ['l1', 'l2']
Loss = ['hinge', 'squared_hinge']

param_list = dict(SVM__penalty = Penalty,
         SVM__C = C_val,
         SVM__loss = Loss)


rs_svc = GridSearchCV(svc, param_list, verbose=True, scoring='f1')
rs_svc.fit(new_X_train, y_train.values.ravel())

Fitting 5 folds for each of 40 candidates, totalling 200 fits
CPU times: user 2h 42min 29s, sys: 2min 53s, total: 2h 45min 22s
Wall time: 2h 54min 49s


GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('SVM',
                                        LinearSVC(random_state=0, tol=1e-05))]),
             param_grid={'SVM__C': array([ 1.25892541,  1.58489319,  1.99526231,  2.51188643,  3.16227766,
        3.98107171,  5.01187234,  6.30957344,  7.94328235, 10.        ]),
                         'SVM__loss': ['hinge', 'squared_hinge'],
                         'SVM__penalty': ['l1', 'l2']},
             scoring='f1', verbose=True)

In [7]:
best_svc = rs_svc.best_estimator_
svc_ridge = KFold(n_splits=5, random_state=1, shuffle=True)
# No scoring argument is passed, so this is the default score (mean accuracy), not f1
best_svc_cv = cross_val_score(best_svc, new_X_train, y_train, cv = svc_ridge).mean()
print("Best SVC Classifier accuracy:", best_svc_cv)

Best SVC Classifier accuracy: 0.8715073271350047


### Preparing best models for Submission

In [16]:
# Refit the tuned pipeline on all training data (labels flattened to a 1-D array)
best_model = best_svc.fit(new_X_train, y_train.values.ravel())
best_pred = best_model.predict(new_X_test)
# X_test was indexed by "Id" during feature engineering, so reuse that index
# rather than looking up an "Id" column, which no longer exists
best_pred_df = pd.DataFrame({"Brightness_Class": best_pred}, index=X_test.index)

In [32]:
best_pred_df.to_csv("jtaylor_best_pred_df.csv") # SVM

GFP, as well as other fluorescent proteins, has become an almost indispensable tool for life sciences research, so it is critical that these proteins are sufficiently bright. There has been much effort to engineer brighter and more stable fluorescent proteins. This project is meant to show how machine learning can be used to help achieve this goal more quickly and with fewer laboratory resources.

# References

1. https://embryo.asu.edu/pages/green-fluorescent-protein
2. https://www.thermofisher.com/us/en/home/life-science/cell-analysis/fluorophores/green-fluorescent-protein.html
3. Toward Prediction of Binding Affinities Between the MHC Protein and Its Peptide Ligands Using Quantitative Structure-Affinity Relationship Approach, Tian, et al.
   https://pubmed.ncbi.nlm.nih.gov/19075812/