# Protein Folding in 2D 
Proteins are long strands of amino acids that control many important processes in the human body. It is known that proteins are stored 'folded' inside the cells of the body, and that the specific folding significantly influences their functioning. ‘Misfolded’ proteins can play a role in cancer, Alzheimer's disease and cystic fibrosis.

Due to the complexity of the protein folding problem, simplified models such as Dill's hydrophobic-polar (HP) model have become one of the major tools for studying protein structure. The HP model is based on the observation that the hydrophobic force is the main force determining the unique native conformation (and hence the functional state) of small globular proteins.

In a protein, hydrophobic amino acids (H) like to lie adjacent to one another, whereas polar amino acids (P) have no such preference. When two hydrophobic amino acids lie next to each other, an 'H-bond' is formed due to the attractive forces between the two. The more bonds, the more stable the protein. The HP model, consisting of just two types of amino acids, arranges these on a grid or lattice.
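
One common way to write this stability criterion down, assuming every H-bond contributes one unit of negative energy, is shown here. The symbols $h_i$, $r_i$ and $n$ are notation introduced for illustration only and do not appear in the code of this notebook.

$$
E = -\sum_{i=1}^{n-2}\ \sum_{j=i+2}^{n} h_i \, h_j \, \mathbb{1}\!\left[\lVert r_i - r_j \rVert_1 = 1\right]
$$

Here $n$ is the chain length, $h_i = 1$ if the $i$-th amino acid is hydrophobic and $0$ otherwise, and $r_i$ is its lattice position. The inner sum starts at $j = i + 2$ because chain neighbors are always adjacent on the lattice and are not counted as bonds (see Step 3c). Under this convention, a lower $E$ means more H-bonds and a more stable fold.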

![Example Screenshot](8_piece_protein.png)


In [1]:
import numpy as np 
import pandas as pd
import time as time

np.set_printoptions(threshold=np.inf)

## Basic Algorithm for Protein Folding
This notebook develops an algorithm to simulate and analyze protein folding using a simplified model known as the HP model. The HP model involves two types of amino acids: hydrophobic (H) and polar (P). The goal is to arrange these amino acids on a 2D grid or lattice to create stable protein configurations.

The key steps of the algorithm include:

- **Initializing Acids**: The `random_division` function divides a specified number of amino acids into hydrophobic and polar types while ensuring at least one of each type.

- **Initializing Grid**: The `initialize_grid` function initializes a 2D grid or lattice with a specified size. Hydrophobic and polar amino acids are randomly placed on the grid.

- **Calculate Energy**: The `find_H_bonds` function calculates the initial energy of a given grid by counting the H-bonds, that is, the hydrophobic amino acids that are adjacent on the grid but not consecutive in the chain.

- **Fold Proteins**: The function `reshape_grid` attempts to reshape a given grid to improve stability. It iterates through various configurations to optimize the placement of amino acids and maximize the energy function.

- **Creating Samples**: The `generate_random_samples` function generates a given number of random samples for various amino acid lengths. It uses the random division to determine the amino acid types and grid size and then initializes grids accordingly.

- **Generating and Storing Results**: The code generates random samples for amino acid lengths in steps of 5 (starting at 5 and going up to 25, 50 or 100, depending on the experiment) and calculates various metrics for each sample. It then stores the sample information, grid configuration, and metrics in a DataFrame.


### Step 1 - Initializing Acids

The `random_division` function takes the total number of amino acids as input and randomly divides them into hydrophobic and polar amino acids, before creating a random 1D amino acid string. It ensures that there is at least one hydrophobic and one polar amino acid. Adjust the `total_amino_acids` variable to set the number of amino acids to divide.

In [2]:
total_amino_acids = 5  # Change this to a desired number, 5 is used for illustration purposes

def random_division(total_amino_acids):
    num_hydrophobic = np.random.randint(1, total_amino_acids)  # Ensure at least one H and P
    num_polar = total_amino_acids - num_hydrophobic
    # randomly shuffle the amino acids
    amino_acids = ['H'] * num_hydrophobic + ['P'] * num_polar
    np.random.shuffle(amino_acids)
    return num_hydrophobic, num_polar, amino_acids

num_hydrophobic, num_polar, amino_acids = random_division(total_amino_acids)
print("Number of Hydro acids:", num_hydrophobic)
print("Number of Polar acids:", num_polar)
print("Random Amino Acid String:", "".join(amino_acids))

Number of Hydro acids: 3
Number of Polar acids: 2
Random Amino Acid String: HHPPH
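
Because `random_division` draws from NumPy's global random state, every run gives a different split. The sketch below shows one possible way to make the draw reproducible by passing in a seeded `numpy.random.Generator`. The name `random_division_seeded` and its `rng` parameter are assumptions made for this illustration and are not part of the notebook's code.

```python
import numpy as np

def random_division_seeded(total_amino_acids, rng):
    """Hypothetical variant of random_division that uses an explicit Generator."""
    # integers(low, high) excludes high, so 1 <= num_hydrophobic <= total - 1
    num_hydrophobic = int(rng.integers(1, total_amino_acids))
    num_polar = total_amino_acids - num_hydrophobic
    amino_acids = ['H'] * num_hydrophobic + ['P'] * num_polar
    rng.shuffle(amino_acids)  # shuffles the Python list in place
    return num_hydrophobic, num_polar, amino_acids

# Two generators created with the same seed produce the same division
first = random_division_seeded(10, np.random.default_rng(seed=42))
second = random_division_seeded(10, np.random.default_rng(seed=42))
print(first == second)  # True
```

Passing the generator as an argument keeps the function free of hidden global state, which may make experiments easier to repeat.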


### Step 2 - Initializing Grid
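The `initialize_grid` function creates a square grid whose side is twice the number of amino acids, places the first amino acid at the center, and then places each following amino acid on a randomly chosen free cell next to the previous one. This ensures that amino acids do not overlap, and the function tracks the order in which they were placed. If the last placed amino acid has no free neighboring cell, placement stops early, so the grid can end up holding fewer amino acids than the sequence contains.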

In [3]:
amino_acids_copy = amino_acids.copy()

def initialize_grid(amino_acids, num_hydrophobic, num_polar):
    amino_acids_copy = amino_acids.copy()
    total_amino_acids = len(amino_acids)
    grid_size = total_amino_acids * 2
    grid = np.full((grid_size, grid_size), fill_value='', dtype=object)  
    amino_acid_order = []  # List to track the order of amino acids placed
    
    # Place the first amino acid at the center of the grid
    first_amino_row = grid_size // 2
    first_amino_col = grid_size // 2
    amino_acid_type = amino_acids_copy.pop(0)
    grid[first_amino_row, first_amino_col] = amino_acid_type
    
    # Keep track of order
    amino_acid_order.append((amino_acid_type, (first_amino_row, first_amino_col)))
    total_amino_acids -= 1
    
    # Place the remaining amino acids next to the last placed amino acid
    while total_amino_acids > 0:
        last_amino_row, last_amino_col = amino_acid_order[-1][1]
        neighbors = [
            (last_amino_row - 1, last_amino_col),
            (last_amino_row + 1, last_amino_col),
            (last_amino_row, last_amino_col - 1),
            (last_amino_row, last_amino_col + 1)
        ]
        
        valid_neighbors = [(row, col) for row, col in neighbors if 0 <= row < grid_size and 0 <= col < grid_size and grid[row, col] == '']
        
        if valid_neighbors and amino_acids_copy:  # Check if there are valid neighbors and remaining amino acids
            # Shuffle the list of valid neighbors and choose one randomly
            np.random.shuffle(valid_neighbors)
            chosen_row, chosen_col = valid_neighbors[0]
            amino_acid_type = amino_acids_copy.pop(0)
            grid[chosen_row, chosen_col] = amino_acid_type
            amino_acid_order.append((amino_acid_type, (chosen_row, chosen_col)))
            total_amino_acids -= 1
        else:
            break  # No valid neighbors or no remaining amino acids, exit the loop
    
    return grid, amino_acid_order

# Example usage
initial_grid, amino_acid_order = initialize_grid(amino_acids_copy, num_hydrophobic, num_polar)

In [4]:
print("Amino Acid Order:", amino_acid_order)

Amino Acid Order: [('H', (5, 5)), ('H', (5, 4)), ('P', (4, 4)), ('P', (3, 4)), ('H', (2, 4))]
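
A quick sanity check on the placement is to confirm that the recorded order forms a self-avoiding chain: no cell is used twice, and each amino acid sits directly next to its predecessor. The helper `is_valid_chain` below is a hypothetical addition for illustration, not part of the original notebook.

```python
def is_valid_chain(amino_acid_order):
    """Return True if all positions are unique and consecutive entries are lattice neighbors."""
    positions = [pos for _, pos in amino_acid_order]
    if len(set(positions)) != len(positions):
        return False  # two amino acids share a cell
    for (r1, c1), (r2, c2) in zip(positions, positions[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False  # consecutive amino acids are not side by side
    return True

# The order printed above
order = [('H', (5, 5)), ('H', (5, 4)), ('P', (4, 4)), ('P', (3, 4)), ('H', (2, 4))]
print(is_valid_chain(order))  # True
```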


In [5]:
def trim_empty_rows_and_columns(grid):
    # Find the indices of non-empty rows and columns
    non_empty_rows = np.any(grid != '', axis=1)
    non_empty_columns = np.any(grid != '', axis=0)

    # Use boolean indexing to extract non-empty rows and columns
    trimmed_grid = grid[non_empty_rows][:, non_empty_columns]

    return trimmed_grid

# Call the function to trim empty rows and columns
trimmed_grid = trim_empty_rows_and_columns(initial_grid)

# Print the trimmed grid
print("Trimmed Grid:", '\n', trimmed_grid)

Trimmed Grid: 
 [['H' '']
 ['P' '']
 ['P' '']
 ['H' 'H']]
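
NumPy's default printout of an object array can be hard to read once grids get larger. As a small sketch, the hypothetical helper `render_grid` below joins each row into a string and shows empty cells as dots; the placeholder character is an arbitrary choice.

```python
import numpy as np

def render_grid(grid, empty='.'):
    """Return the grid as text, one line per row, with a placeholder for empty cells."""
    return '\n'.join(
        ''.join(cell if cell != '' else empty for cell in row)
        for row in grid
    )

# The trimmed grid printed above
trimmed = np.array([['H', ''], ['P', ''], ['P', ''], ['H', 'H']], dtype=object)
print(render_grid(trimmed))
# H.
# P.
# P.
# HH
```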


### Step 3a - Calculate H-bonds
The `find_H_pairs_grid` function identifies and collects sets of coordinates representing adjacent 'H' amino acids in a grid. It iterates through the entire grid, checking each position for the presence of an 'H' amino acid. If one is found, it examines the neighboring positions (up, down, left, right) for further 'H' amino acids. For each pair of adjacent 'H' amino acids, it creates a frozenset containing their coordinates (so that the order of coordinates doesn't matter) and adds this frozenset to a set. This set stores all unique pairs of adjacent 'H' amino acids found in the grid, and the function returns it.

In [6]:
def find_H_pairs_grid(grid):
    adjacent_hydrophobic_amino_acids = set()  # Use a set to automatically remove duplicates

    # Iterate through the grid to find adjacent 'H' amino acids
    for row in range(grid.shape[0]):
        for col in range(grid.shape[1]):
            current_acid = grid[row, col]

            # Check if the current amino acid is 'H'
            if current_acid == 'H':
                # Check the neighboring positions (up, down, left, right) relative to the current position
                neighbors = [
                    (row - 1, col),
                    (row + 1, col),
                    (row, col - 1),
                    (row, col + 1)
                ]

                for neighbor_row, neighbor_col in neighbors:
                    # Check if the neighbor is within the grid bounds
                    if 0 <= neighbor_row < grid.shape[0] and 0 <= neighbor_col < grid.shape[1]:
                        neighbor_acid = grid[neighbor_row, neighbor_col]

                        # Check if the neighbor is also 'H'
                        if neighbor_acid == 'H':
                            # Use frozenset to ensure that the order of coordinates doesn't matter
                            amino_acid_pair = frozenset({(row, col), (neighbor_row, neighbor_col)})
                            adjacent_hydrophobic_amino_acids.add(amino_acid_pair)
                            
    return adjacent_hydrophobic_amino_acids

find_H_pairs_grid(initial_grid)

{frozenset({(5, 4), (5, 5)})}
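
If only the number of adjacent 'H' pairs is needed, and not their coordinates, the nested loops can be replaced by comparing the grid with shifted copies of itself. The function `count_H_pairs_vectorized` below is a sketch under that assumption and is not part of the original notebook. It returns a count rather than the set of pairs that `find_H_pairs_grid` produces.

```python
import numpy as np

def count_H_pairs_vectorized(grid):
    """Count side-by-side 'H' cells using boolean masks instead of explicit loops."""
    is_h = np.asarray(grid == 'H', dtype=bool)
    horizontal = np.sum(is_h[:, :-1] & is_h[:, 1:])  # left/right neighbors
    vertical = np.sum(is_h[:-1, :] & is_h[1:, :])    # up/down neighbors
    return int(horizontal + vertical)

# Same layout as the trimmed grid from Step 2
grid = np.array([['H', ''], ['P', ''], ['P', ''], ['H', 'H']], dtype=object)
print(count_H_pairs_vectorized(grid))  # 1
```

Each pair is counted exactly once because every horizontal pair is seen only from its left cell and every vertical pair only from its upper cell.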

### Step 3b - Calculate H-bonds
The `find_H_pairs_order` function examines the order of amino acids and identifies 'H' amino acids that follow each other directly in the chain. It iterates through the amino acid order, checks whether both amino acids of each consecutive pair are of type 'H', and records these pairs as frozensets in a set to remove duplicates. These are the 'H' pairs that are adjacent because of the sequence itself rather than because of the fold.

In [7]:
def find_H_pairs_order(amino_acid_order):
    adjacent_hydrophobic_amino_acids = set()  # Use a set to automatically remove duplicates

    # Iterate through the amino acid order to find adjacent 'H' amino acids
    for i in range(len(amino_acid_order) - 1):
        current_acid, current_position = amino_acid_order[i]
        next_acid, next_position = amino_acid_order[i + 1]

        # Check if both current and next amino acids are 'H'
        if current_acid == 'H' and next_acid == 'H':
            # Use frozenset to ensure that the order of positions doesn't matter
            amino_acid_pair = frozenset({current_position, next_position})
            adjacent_hydrophobic_amino_acids.add(amino_acid_pair)

    return adjacent_hydrophobic_amino_acids

find_H_pairs_order(amino_acid_order)

{frozenset({(5, 4), (5, 5)})}
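
When only the count is of interest, the same information can be read off the 1D sequence without any coordinates. The helper `count_consecutive_H` below is a hypothetical illustration. It should match the size of the set returned by `find_H_pairs_order` only when every amino acid of the sequence was actually placed on the grid.

```python
def count_consecutive_H(sequence):
    """Count positions where an 'H' is directly followed by another 'H'."""
    return sum(a == 'H' and b == 'H' for a, b in zip(sequence, sequence[1:]))

print(count_consecutive_H("HHPPH"))  # 1
print(count_consecutive_H("HHPHH"))  # 2
```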

### Step 3c - Calculate H-bonds
The code below computes the H-pairs in two different ways and compares them. It finds the H-pairs in the grid with `find_H_pairs_grid` and the H-pairs in the amino acid order with `find_H_pairs_order`, then subtracts the latter from the former. What remains are the **H-bonds**: pairs of 'H' amino acids that are adjacent in the grid but not consecutive in the amino acid sequence.

In [8]:
def find_H_bonds(grid, amino_acid_order):
    grid_h_pairs = find_H_pairs_grid(grid)
    order_h_pairs = find_H_pairs_order(amino_acid_order)
    return grid_h_pairs - order_h_pairs

H_bonds = find_H_bonds(initial_grid, amino_acid_order)

In [9]:
print('Number of H-bonds:', len(H_bonds))

for bond in H_bonds:
    coordinates = [coord for coord in bond]
    print(coordinates)

Number of H-bonds: 0
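
The example above has no H-bonds, so a small hand-built fold may help to show the idea. The sketch below counts H-bonds directly from a chain of `(type, (row, col))` entries by skipping chain neighbors. The function `count_H_bonds` and both example folds are illustrative assumptions. For a valid chain the result should agree with the size of the set difference computed by `find_H_bonds`.

```python
def count_H_bonds(chain):
    """Count H-H contacts between amino acids that are not consecutive in the chain."""
    bonds = 0
    for i in range(len(chain)):
        for j in range(i + 2, len(chain)):  # start at i + 2 to skip chain neighbors
            (a, (r1, c1)), (b, (r2, c2)) = chain[i], chain[j]
            if a == 'H' and b == 'H' and abs(r1 - r2) + abs(c1 - c2) == 1:
                bonds += 1
    return bonds

# The sequence HPPH folded into a U shape brings both ends together
u_fold = [('H', (0, 0)), ('P', (0, 1)), ('P', (1, 1)), ('H', (1, 0))]
# The same sequence laid out in a straight line
straight = [('H', (0, 0)), ('P', (0, 1)), ('P', (0, 2)), ('H', (0, 3))]

print(count_H_bonds(u_fold))    # 1
print(count_H_bonds(straight))  # 0
```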


### Step 4 - Creating Samples
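The `generate_random_samples` function generates N random protein grid configurations for a specified total number of amino acids. For each sample, the split into hydrophobic and polar amino acids is drawn at random, and the function records the grid, the placement order, the trimmed grid and its shape, the number of H-bonds, and the ratio of H-bonds to placed amino acids.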

In [10]:
def generate_random_samples(N, total_amino_acids):
    random_samples = []

    for _ in range(N):
        num_hydrophobic, num_polar, amino_acids = random_division(total_amino_acids)
        amino_acids_copy = amino_acids.copy()
        initial_grid, amino_acid_order = initialize_grid(amino_acids_copy, num_hydrophobic, num_polar)
        trimmed_grid = trim_empty_rows_and_columns(initial_grid)
        protein_dimensions = trimmed_grid.shape
        amino_acids_on_grid = np.count_nonzero(initial_grid != '')
        hbonds = len(find_H_bonds(initial_grid, amino_acid_order))
        hratio = hbonds / amino_acids_on_grid
        
        random_samples.append((num_hydrophobic,
                               num_polar,
                               amino_acids,
                               initial_grid,
                               amino_acids_on_grid,
                               amino_acid_order,
                               trimmed_grid, 
                               protein_dimensions,
                               hbonds,
                               hratio,
                              ))

    return random_samples

# Example usage
generate_random_samples(3, total_amino_acids)[0]

(4,
 1,
 ['H', 'H', 'P', 'H', 'H'],
 array([['', '', '', '', '', '', '', '', '', ''],
        ['', '', '', '', '', '', '', '', '', ''],
        ['', '', '', '', '', '', '', '', '', ''],
        ['', '', '', '', '', '', '', '', '', ''],
        ['', '', '', '', '', '', '', '', '', ''],
        ['', '', '', '', '', 'H', 'H', 'P', 'H', ''],
        ['', '', '', '', '', '', '', '', 'H', ''],
        ['', '', '', '', '', '', '', '', '', ''],
        ['', '', '', '', '', '', '', '', '', ''],
        ['', '', '', '', '', '', '', '', '', '']], dtype=object),
 5,
 [('H', (5, 5)), ('H', (5, 6)), ('P', (5, 7)), ('H', (5, 8)), ('H', (6, 8))],
 array([['H', 'H', 'P', 'H'],
        ['', '', '', 'H']], dtype=object),
 (2, 4),
 0,
 0.0)
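
A ten-element tuple is easy to unpack in the wrong order. One possible alternative is a `typing.NamedTuple`, which still behaves like a tuple but gives each field a name. The class `FoldSample` below is a hypothetical sketch with a shortened field list, filled with values from the sample shown above.

```python
from typing import NamedTuple

class FoldSample(NamedTuple):
    """Abbreviated, hypothetical record for one random sample."""
    num_hydrophobic: int
    num_polar: int
    amino_acids: list
    amino_acids_on_grid: int
    hbonds: int

    @property
    def hratio(self):
        # H-bonds per amino acid that was actually placed on the grid
        return self.hbonds / self.amino_acids_on_grid

sample = FoldSample(4, 1, ['H', 'H', 'P', 'H', 'H'], 5, 0)
print(sample.hbonds, sample.hratio)  # 0 0.0
```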

In [11]:
def store_samples_in_dataframe(num_samples, amino_acid_lengths):
    
    data = {
        "Amino Acid Length": [],
        "Num Hydrophobic": [],
        "Num Polar": [],
        "1D protein": [],
        "2D protein": [],
        "Amino Acids on Grid": [],  
        "Trimmed 2D protein": [],
        "Shape 2D protein": [],
        "Amino Acid Order": [],
        "H-Bonds": [],
        "H-Ratio": [],
    }

    for length in amino_acid_lengths:
        random_samples = generate_random_samples(num_samples, length)
        for sample in random_samples:
            (num_hydrophobic, num_polar, amino_acids, initial_grid,
             amino_acids_on_grid, amino_acid_order, trimmed_grid,
             protein_dimensions, hbonds, hratio) = sample
            data["Amino Acid Length"].append(length)
            data["Num Hydrophobic"].append(num_hydrophobic)
            data["Num Polar"].append(num_polar)
            data["1D protein"].append(amino_acids)
            data["2D protein"].append(initial_grid)
            data["Amino Acids on Grid"].append(amino_acids_on_grid)
            data["Trimmed 2D protein"].append(trimmed_grid)
            data["Shape 2D protein"].append(protein_dimensions)
            data["Amino Acid Order"].append(amino_acid_order)
            data["H-Bonds"].append(hbonds)
            data['H-Ratio'].append(hratio)

    df = pd.DataFrame(data)
    return df

In [12]:
%%time
# Generate an experiment with 1000 samples for amino length [5, 10, 15 ..., 25]

num_samples = 1000
amino_acid_lengths = [length for length in range(5, 26, 5)]
HP25 = store_samples_in_dataframe(num_samples, amino_acid_lengths)

CPU times: user 1.36 s, sys: 22.8 ms, total: 1.38 s
Wall time: 1.38 s


In [13]:
HP25.head(2)

Unnamed: 0,Amino Acid Length,Num Hydrophobic,Num Polar,1D protein,2D protein,Amino Acids on Grid,Trimmed 2D protein,Shape 2D protein,Amino Acid Order,H-Bonds,H-Ratio
0,5,2,3,"[P, P, H, H, P]","[[, , , , , , , , , ], [, , , , , , , , , ], [...",5,"[[P, ], [P, ], [H, ], [H, P]]","(4, 2)","[(P, (5, 5)), (P, (6, 5)), (H, (7, 5)), (H, (8...",0,0.0
1,5,3,2,"[P, H, P, H, H]","[[, , , , , , , , , ], [, , , , , , , , , ], [...",5,"[[H, H, ], [, P, H], [, , P]]","(3, 3)","[(P, (5, 5)), (H, (4, 5)), (P, (4, 4)), (H, (3...",0,0.0


In [14]:
%%time
# Generate an experiment with 1000 samples for amino length [5, 10, 15 ..., 50]

num_samples = 1000
amino_acid_lengths = [length for length in range(5, 51, 5)]
HP50 = store_samples_in_dataframe(num_samples, amino_acid_lengths)

CPU times: user 6.97 s, sys: 111 ms, total: 7.08 s
Wall time: 7.09 s


In [15]:
HP50

Unnamed: 0,Amino Acid Length,Num Hydrophobic,Num Polar,1D protein,2D protein,Amino Acids on Grid,Trimmed 2D protein,Shape 2D protein,Amino Acid Order,H-Bonds,H-Ratio
0,5,3,2,"[H, H, H, P, P]","[[, , , , , , , , , ], [, , , , , , , , , ], [...",5,"[[P, P], [H, ], [H, H]]","(3, 2)","[(H, (5, 5)), (H, (5, 4)), (H, (4, 4)), (P, (3...",0,0.000000
1,5,2,3,"[P, H, H, P, P]","[[, , , , , , , , , ], [, , , , , , , , , ], [...",5,"[[P, , , ], [P, H, H, P]]","(2, 4)","[(P, (5, 5)), (H, (5, 4)), (H, (5, 3)), (P, (5...",0,0.000000
2,5,1,4,"[P, P, P, P, H]","[[, , , , , , , , , ], [, , , , , , , , , ], [...",5,"[[P, P, H], [P, , ], [P, , ]]","(3, 3)","[(P, (5, 5)), (P, (4, 5)), (P, (3, 5)), (P, (3...",0,0.000000
3,5,2,3,"[H, P, P, P, H]","[[, , , , , , , , , ], [, , , , , , , , , ], [...",5,"[[H, P, P], [, , P], [, , H]]","(3, 3)","[(H, (5, 5)), (P, (5, 6)), (P, (5, 7)), (P, (6...",0,0.000000
4,5,1,4,"[H, P, P, P, P]","[[, , , , , , , , , ], [, , , , , , , , , ], [...",5,"[[P, , ], [P, P, P], [, , H]]","(3, 3)","[(H, (5, 5)), (P, (4, 5)), (P, (4, 4)), (P, (4...",0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
9995,50,20,30,"[P, P, H, P, H, H, P, P, H, H, H, H, P, H, H, ...","[[, , , , , , , , , , , , , , , , , , , , , , ...",45,"[[, H, H, P, , , , , , ], [H, H, P, P, , , , ,...","(13, 10)","[(P, (50, 50)), (P, (50, 49)), (H, (51, 49)), ...",2,0.044444
9996,50,25,25,"[P, H, H, P, H, P, P, P, P, P, H, H, P, H, H, ...","[[, , , , , , , , , , , , , , , , , , , , , , ...",22,"[[P, H, , , , , , ], [, H, P, P, P, , , ], [, ...","(7, 8)","[(P, (50, 50)), (H, (50, 51)), (H, (51, 51)), ...",0,0.000000
9997,50,35,15,"[H, H, H, H, H, H, H, P, H, P, P, H, H, H, H, ...","[[, , , , , , , , , , , , , , , , , , , , , , ...",11,"[[H, H, H, H], [H, P, P, H], [, H, P, H]]","(3, 4)","[(H, (50, 50)), (H, (49, 50)), (H, (49, 51)), ...",0,0.000000
9998,50,46,4,"[H, H, H, H, H, H, H, H, H, H, H, H, H, H, H, ...","[[, , , , , , , , , , , , , , , , , , , , , , ...",50,"[[H, H, , , , , , , , , , , , ], [H, H, H, , ,...","(10, 14)","[(H, (50, 50)), (H, (50, 51)), (H, (49, 51)), ...",13,0.260000
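
Several rows above have fewer amino acids on the grid than the sequence length, which happens when the random walk runs out of free neighboring cells. A possible way to summarize this per length is a `groupby` with named aggregation. The sketch below uses a tiny DataFrame with made-up values purely to show the mechanics; the column "Fully Placed" and the output column names are assumptions for this illustration.

```python
import pandas as pd

# Toy data with invented values, using the same column names as the notebook
toy = pd.DataFrame({
    "Amino Acid Length":   [5, 5, 10, 10],
    "Amino Acids on Grid": [5, 5, 10, 7],
    "H-Bonds":             [0, 1, 2, 0],
})

# 1 if the whole sequence was placed, 0 if the walk got stuck early
toy["Fully Placed"] = (toy["Amino Acids on Grid"] == toy["Amino Acid Length"]).astype(int)

summary = toy.groupby("Amino Acid Length").agg(
    mean_h_bonds=("H-Bonds", "mean"),
    share_fully_placed=("Fully Placed", "mean"),
)
print(summary)
```

The same call could be applied to `HP25`, `HP50` or `HP100` in place of `toy`.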


In [16]:
%%time
# Generate an experiment with 1000 samples for amino length [5, 10, 15 ..., 100]

num_samples = 1000
amino_acid_lengths = [length for length in range(5, 101, 5)]
HP100 = store_samples_in_dataframe(num_samples, amino_acid_lengths)

CPU times: user 43.8 s, sys: 678 ms, total: 44.5 s
Wall time: 44.5 s


In [17]:
HP25.to_csv('HP25.csv', index=False)

In [18]:
HP50.to_csv('HP50.csv', index=False)

In [19]:
HP100.to_csv('HP100.csv', index=False)  # separate file so the HP25 export is not overwritten
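
One thing to keep in mind when reloading these files: CSV stores list-valued cells as plain text. The sketch below, which uses a small stand-in DataFrame and an in-memory buffer rather than the real files, shows that a list column comes back as a string and how `ast.literal_eval` can restore it.

```python
import ast
import io
import pandas as pd

# Stand-in for one row of the exported data
df = pd.DataFrame({"1D protein": [['H', 'P', 'H']], "H-Bonds": [0]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(type(loaded.loc[0, "1D protein"]).__name__)  # str

# Parse the text representation back into a Python list
loaded["1D protein"] = loaded["1D protein"].apply(ast.literal_eval)
print(loaded.loc[0, "1D protein"])  # ['H', 'P', 'H']
```

Columns that hold NumPy arrays, such as "2D protein", are written in NumPy's print format, which `ast.literal_eval` cannot parse. If those columns need to survive a round trip, a binary format such as `DataFrame.to_pickle` may be a better fit.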