# Predict Fold Type of a Protein from Protein Sequence

Protein chains fold in regular patterns. The secondary structure describes the geometry of segments of a protein chain. The most common secondary structure elements are:
* Alpha helices
* Beta sheets

We can classify proteins into three major fold types based on their predominant secondary structure content:
* alpha: contains predominantly alpha helices
* beta: contains predominantly beta sheets
* alpha+beta: contains both alpha helices and beta sheets
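
The classification above can be sketched as a simple thresholding rule. This is a minimal illustration, not the logic used in the notebooks: the `H`/`E`/`C` string encoding, the function names, and the threshold values `min_major` and `max_minor` are all assumptions chosen for the example.

```python
def secondary_structure_fractions(ss: str) -> tuple:
    """Return the (alpha, beta) fractions of a secondary structure string.

    Assumes a 3-state encoding: 'H' = helix, 'E' = strand, 'C' = coil.
    """
    if not ss:
        return 0.0, 0.0
    return ss.count("H") / len(ss), ss.count("E") / len(ss)


def assign_fold_type(alpha: float, beta: float,
                     min_major: float = 0.25,
                     max_minor: float = 0.05) -> str:
    """Assign a fold type from helix and strand fractions.

    The thresholds are hypothetical: a class needs more than `min_major`
    of its own element and less than `max_minor` of the other one.
    """
    if alpha > min_major and beta < max_minor:
        return "alpha"
    if beta > min_major and alpha < max_minor:
        return "beta"
    if alpha > min_major and beta > min_major:
        return "alpha+beta"
    return "other"  # chains that match none of the three classes


# Toy secondary structure strings, one per outcome
examples = ["HHHHHHHHCC", "EEEEEECCCC", "HHHHEEEECC", "CCCCCCCCCC"]
fold_types = [assign_fold_type(*secondary_structure_fractions(s)) for s in examples]
print(fold_types)
```

Chains that fall into the `other` bucket would typically be dropped before training a 3-state model.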

## Goal
This notebook serves as an example of applying machine learning techniques to protein sequences. The goal is to create a simple machine learning model that predicts the fold type of a protein from its sequence. We train the model on a representative set of 3D structures from the Protein Data Bank.

Run the following notebooks to work through this example.

## 1. Create a Data Set

First, we need to create a dataset with protein secondary structure information from 3D protein chains.

Run the following notebook to extract secondary structure information from a representative set of protein chains downloaded from the RCSB Protein Data Bank and assign a fold type to each protein chain.

[1-CreateDataset.ipynb](./1-CreateDataset.ipynb)

The notebook saves the dataset in the file `secondaryStructure.json`.

## 2. Calculate Features

Protein sequences vary in length and consist of letters, so they cannot be used directly as input for most machine learning models. Here we use the Word2Vec method to calculate a fixed-size feature vector for each protein sequence.
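
To illustrate how a sequence of any length can become a fixed-size vector, the sketch below splits a sequence into overlapping n-grams ("words") and averages their embedding vectors. It is not the notebook's implementation: the helper names are hypothetical, and the tiny hand-written `toy_embeddings` table stands in for vectors that a trained Word2Vec model would supply.

```python
from typing import Dict, List


def to_ngrams(sequence: str, n: int = 2) -> List[str]:
    """Split a sequence into overlapping n-grams, e.g. 'MKV' -> ['MK', 'KV']."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def average_vector(words: List[str],
                   embeddings: Dict[str, List[float]],
                   size: int) -> List[float]:
    """Average the embedding vectors of all known words.

    Words missing from the table are skipped; if no word is known,
    a zero vector of length `size` is returned.
    """
    total = [0.0] * size
    count = 0
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]


# Hypothetical 2-dimensional embeddings; a trained model would provide these.
toy_embeddings = {"MK": [1.0, 0.0], "KV": [0.0, 1.0], "VL": [1.0, 1.0]}

words = to_ngrams("MKVL")
features = average_vector(words, toy_embeddings, size=2)
print(words, features)
```

Whatever the sequence length, the result has `size` entries, which is what makes it usable as model input.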

Run the following notebook to calculate feature vectors for each protein sequence. 

[2-CalculateFeatures.ipynb](./2-CalculateFeatures.ipynb)

The notebook saves the dataset in the file `features.json`.

## 3. Fit a Model

Next, we fit a 3-state classification model using the feature vectors as inputs and the known fold types from the Protein Data Bank dataset as labels.
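
The fit-and-evaluate workflow can be sketched with a deliberately simple stand-in model. The nearest-centroid classifier below is an assumption made to keep the example dependency-free; it is not the model fitted in the notebook, and the 2-dimensional feature vectors are made-up toy data.

```python
from collections import defaultdict
from math import dist


def fit_centroids(X, y):
    """Compute the mean feature vector (centroid) of each class."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, vec):
    """Return the label of the centroid closest to `vec` (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


# Toy training and test sets (hypothetical feature vectors)
train_X = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.9], [0.0, 0.8], [0.5, 0.5], [0.4, 0.6]]
train_y = ["alpha", "alpha", "beta", "beta", "alpha+beta", "alpha+beta"]
test_X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
test_y = ["alpha", "beta", "alpha+beta"]

centroids = fit_centroids(train_X, train_y)
predictions = [predict(centroids, vec) for vec in test_X]

# Accuracy: fraction of test chains whose predicted fold type is correct
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

Keeping the test set separate from the training set is what makes the reported accuracy a fair estimate of performance on unseen sequences.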

Run the following notebook to fit a machine learning model on a training set and evaluate its performance on a test set.

[3-FitModel.ipynb](./3-FitModel.ipynb)