# Predict Fold Type of a Protein from Protein Sequence

**The notebooks in this directory were developed to demonstrate the "Ten Rules for Reproducible Research with Jupyter Notebooks". Throughout the notebooks we mention the rules we applied.**

**For example, this notebook demonstrates:**

---

**Rule 1: Tell a Story for a Specific Audience.** This notebook was developed for biologists to learn how to apply a simple machine learning model to protein sequences.

**Rule 3: Document the Entire Workflow.** This top-level notebook links to 3 notebooks that represent the steps of a workflow. This modularity makes it easy to replace any one of the 3 steps, for example, to use a different method to calculate features or to apply a different machine learning model.

---

## Introduction

Protein chains fold in regular patterns. Secondary structure describes the local geometry of segments of a protein chain. The most common secondary structure elements are:
* Alpha helices
* Beta sheets

We can classify proteins into three major fold types based on their predominant secondary structure content:
* alpha: contains predominantly alpha helices
* beta: contains predominantly beta sheets
* alpha+beta: contains both alpha helices and beta sheets
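
As a rough illustration of this classification, the sketch below assigns a fold type from a per-residue secondary structure string. The one-letter codes (`H` for helix, `E` for strand, `C` for coil), the threshold values, and the extra `other` category are assumptions made for this example; the dataset notebook may use different codes and cutoffs.

```python
from collections import Counter


def fold_type(secondary_structure, min_fraction=0.25, max_fraction=0.05):
    """Assign a fold type from a per-residue secondary structure string.

    Assumed encoding: 'H' = alpha helix, 'E' = beta strand, 'C' = coil.
    The thresholds are illustrative defaults, not values from the notebooks.
    """
    n = len(secondary_structure)
    if n == 0:
        return "other"

    counts = Counter(secondary_structure)
    helix = counts["H"] / n  # fraction of residues in helices
    sheet = counts["E"] / n  # fraction of residues in strands

    if helix >= min_fraction and sheet < max_fraction:
        return "alpha"
    if sheet >= min_fraction and helix < max_fraction:
        return "beta"
    if helix >= min_fraction and sheet >= min_fraction:
        return "alpha+beta"
    return "other"  # chains that match none of the three fold types


print(fold_type("HHHHHHCCCC"))  # mostly helix
print(fold_type("EEEEECCCCC"))  # mostly strand
print(fold_type("HHHHEEEECC"))  # both helix and strand
```

Chains that fall between the thresholds are labeled `other` here so that they can be excluded from a 3-state classification problem.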

## Goal
This notebook serves as an example of applying machine learning techniques to protein sequences. The goal is to create a simple machine learning model that predicts the fold type of a protein from its sequence. We train the model on a representative set of 3D structures from the Protein Data Bank.

Run the following notebooks to work through this example.

## 1. Create Dataset

First, we need to create a dataset with protein secondary structure information obtained from 3D protein chains.

Run the following notebook to extract secondary structure information from a representative set of protein chains downloaded from the RCSB Protein Data Bank and assign a fold type to each protein chain.

[1-CreateDataset.ipynb](./1-CreateDataset.ipynb)

The notebook saves the dataset in the file `secondaryStructure.json`.

## 2. Calculate Features

Protein sequences cannot be used directly as input for machine learning. Here we use the Word2vec method to calculate a fixed-size feature vector for each protein sequence.

Run the following notebook to calculate feature vectors. 

[2-CalculateFeatures.ipynb](./2-CalculateFeatures.ipynb)

The notebook saves the dataset in the file `features.json`.
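
To make the idea of a fixed-size feature vector concrete, here is a minimal sketch of one common way to apply Word2vec-style embeddings to sequences: split each sequence into overlapping k-mers ("words"), look up a vector for each word, and average them. The choice of overlapping 3-mers, the averaging step, and the hand-made `toy_embeddings` dictionary are assumptions for illustration; in practice the word vectors would come from a trained Word2vec model, and the feature notebook may tokenize and combine vectors differently.

```python
def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers, e.g. 'MKVL' -> ['MKV', 'KVL']."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def sequence_vector(sequence, embeddings, k=3):
    """Average the embedding vectors of all k-mers found in the sequence.

    `embeddings` maps a k-mer to a list of floats (hypothetical stand-in
    for the word vectors of a trained Word2vec model).
    """
    words = [w for w in kmers(sequence, k) if w in embeddings]
    if not words:
        raise ValueError("no k-mer of the sequence has an embedding")

    dim = len(embeddings[words[0]])
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(embeddings[word]):
            total[i] += value
    # The result has length `dim` regardless of the sequence length
    return [t / len(words) for t in total]


# Made-up 2-dimensional vectors, for illustration only
toy_embeddings = {"MKV": [1.0, 0.0], "KVL": [0.0, 1.0]}
print(sequence_vector("MKVL", toy_embeddings))  # [0.5, 0.5]
```

Because the vectors are averaged, sequences of any length map to vectors of the same dimension, which is what a downstream classifier needs.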

## 3. Fit a Model

Next, we fit a 3-state classification model using the feature vectors as inputs and the known fold types from the Protein Data Bank dataset as labels.

Run the following notebook to fit a machine learning model on a training set and evaluate its performance on a test set.

[3-FitModel.ipynb](./3-FitModel.ipynb)
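
The fit-and-evaluate pattern can be sketched with a deliberately simple stand-in model. The example below implements a nearest-centroid classifier in plain Python: it averages the training vectors of each fold type and assigns a new vector to the closest average. This is not the model used in the linked notebook, and the 2-dimensional feature vectors are made up for illustration; a real workflow would more likely use a classifier from a library such as scikit-learn.

```python
def fit_centroids(X, y):
    """Compute the mean feature vector (centroid) for each class label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        for i, value in enumerate(x):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def squared_distance(label):
        return sum((c - v) ** 2 for c, v in zip(centroids[label], x))
    return min(centroids, key=squared_distance)


def accuracy(centroids, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(centroids, x) == label for x, label in zip(X, y))
    return correct / len(y)


# Made-up training and test data, for illustration only
X_train = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.9], [0.0, 0.8], [0.5, 0.5], [0.6, 0.6]]
y_train = ["alpha", "alpha", "beta", "beta", "alpha+beta", "alpha+beta"]
X_test = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.6]]
y_test = ["alpha", "beta", "alpha+beta"]

model = fit_centroids(X_train, y_train)
print("test accuracy:", accuracy(model, X_test, y_test))
```

Keeping the test set separate from the training set, as in this sketch, is what makes the reported accuracy an estimate of performance on unseen proteins.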

## Version and Hardware Information

---

**Rule 6: Make Your Dependencies Explicit in the Notebook Itself.** Here we use the watermark extension to print software, operating system, and hardware version information.

---

In [3]:
%load_ext watermark
%watermark -v -m -p gensim,matplotlib,numpy,pandas,sklearn -r -g

CPython 3.6.3
IPython 6.3.1

gensim 3.6.0
matplotlib 2.2.2
numpy 1.14.5
pandas 0.22.0
sklearn 0.19.1

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 17.5.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
Git hash   : ed645b92f9b003a3aaf36d7f1f4f15cf04804fe7
Git repo   : https://github.com/pwrose/ten-rules-jupyter.git
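
If the watermark extension is not available, similar information can be collected with the Python standard library. The sketch below is an assumption-laden alternative, not part of the original notebooks: the helper name `environment_report` is hypothetical, and `importlib.metadata` requires Python 3.8 or later, which is newer than the interpreter shown above.

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Collect interpreter, OS, and package version information in a dict."""
    report = {
        "python": platform.python_version(),
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Names are distribution names as installed by pip, e.g. 'scikit-learn' rather than 'sklearn'
for key, value in environment_report(
        ["gensim", "matplotlib", "numpy", "pandas", "scikit-learn"]).items():
    print(f"{key:12s}: {value}")
```

Unlike watermark, this sketch does not report the Git hash or repository; those would have to be recorded separately.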
