## E5 - Solution Space Sampling in 2D 
In experiment 5, we enumerated the complete solution space, that is, every valid conformation of amino acid sequences of lengths {5, 10, 15}. This exhaustive enumeration serves as the benchmark for assessing how close to uniformly random the four sampling methods from the preceding experiments are. Using a 2D lattice sized to accommodate the longest protein sequence, we systematically generated every possible conformation for each length. A recursive, backtracking depth-first search (DFS) traverses the conformational landscape and guarantees that each unique conformation is counted exactly once.

![Example Screenshot](5-methods.png)

The protein chain is grown from the grid's central point, and each subsequent amino acid is placed by a move relative to the previous one, drawn from the direction set {L, M, R} (left, middle, right). Mathematically, the conformation space $C$ can therefore be defined as shown in (1), where $n$ is the number of amino acids and each $d_i$ denotes the direction of the $i$-th amino acid, except for the first amino acid, which is always 'S' (start).
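Equation (1) is referenced in the text but not reproduced in this section. One way to write it that is consistent with the description and with the `dfs` code in this notebook is sketched here; the exact notation is an assumption, not necessarily the original form:

```latex
\begin{equation}
C = \left\{ (S, d_2, \dots, d_n) \;\middle|\; d_i \in \{L, M, R\} \text{ for } 2 \le i \le n,
\ \text{and the resulting lattice path is self-avoiding} \right\}
\tag{1}
\end{equation}
```

In the code, the second direction is additionally fixed to `middle` (every stored path starts with `['Start', 'middle']`), so at most $3^{n-2}$ direction strings have to be examined for a chain of $n$ amino acids.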

The dataset generated from this exhaustive search provided the basis for benchmarking the sampling efficacy of the previous experiments. By comparing the conformational distribution obtained from the exhaustive search with those from the stochastic and deterministic sampling methods, we aimed to evaluate their performance in producing uniform random samples. This analysis is crucial for understanding the randomness of our sampling methods relative to the entire landscape of possible conformations.
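A comparison of this kind could be sketched as follows. The helper `chi_square_uniformity` is hypothetical and not part of the notebook; it assumes that both the sampled conformations and the enumerated solution space are available as direction lists, and it computes a plain Pearson chi-square statistic against a uniform expectation.

```python
from collections import Counter


def chi_square_uniformity(sampled_paths, solution_space):
    """Pearson chi-square statistic of sampled paths against a uniform
    distribution over the enumerated solution space (hypothetical helper)."""
    space = [tuple(path) for path in solution_space]
    samples = [tuple(path) for path in sampled_paths]
    counts = Counter(samples)

    # Every sampled path should be a member of the enumerated space.
    unknown = set(counts) - set(space)
    if unknown:
        raise ValueError(f"{len(unknown)} sampled path(s) are not in the solution space")

    # Under uniform sampling, each conformation is expected equally often.
    expected = len(samples) / len(space)
    statistic = sum((counts.get(path, 0) - expected) ** 2 / expected for path in space)
    degrees_of_freedom = len(space) - 1
    return statistic, degrees_of_freedom


# Toy data, for illustration only (not results from the experiments).
toy_space = [["Start", "middle", "left"],
             ["Start", "middle", "middle"],
             ["Start", "middle", "right"]]
balanced = toy_space * 2          # every conformation drawn twice
skewed = [toy_space[0]] * 6       # one conformation drawn six times
print(chi_square_uniformity(balanced, toy_space))
print(chi_square_uniformity(skewed, toy_space))
```

A statistic near zero indicates counts close to the uniform expectation; larger values indicate that some conformations are over- or under-represented.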

In [2]:
import numpy as np 
import pandas as pd
import time
np.set_printoptions(threshold=np.inf)

### Step 1 - Initializing Acids

In [3]:
total_amino_acids = 10

# Generate a random sequence of amino acids
def random_division(total_amino_acids):
    num_hydrophobic = np.random.randint(1, total_amino_acids)  # Ensure at least one H and P
    num_polar = total_amino_acids - num_hydrophobic
    # Randomly shuffle the amino acids
    amino_acids = ['H'] * num_hydrophobic + ['P'] * num_polar
    np.random.shuffle(amino_acids)
    return amino_acids

In [4]:
# Helper function: return the next grid position for a relative direction
# ('left', 'middle' or 'right'), given the previous move vector last_move
def get_next_position(position, direction, last_move):
    vectors = {
        'left': (-last_move[1], last_move[0]),
        'right': (last_move[1], -last_move[0]),
        'middle': last_move
    }
    return (position[0] + vectors[direction][0], position[1] + vectors[direction][1])
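To illustrate how the relative directions work, the following sketch repeats the helper so that it runs on its own and applies four consecutive `left` turns. Positions are treated as (row, column) pairs, as in the notebook; the starting point `(0, 0)` is an arbitrary choice for the example.

```python
def get_next_position(position, direction, last_move):
    # Same logic as the notebook helper, repeated so the snippet is self-contained.
    vectors = {
        'left': (-last_move[1], last_move[0]),
        'right': (last_move[1], -last_move[0]),
        'middle': last_move,
    }
    return (position[0] + vectors[direction][0], position[1] + vectors[direction][1])


position, last_move = (0, 0), (0, 1)  # initial heading: one column to the right
visited = [position]
for _ in range(4):
    new_position = get_next_position(position, 'left', last_move)
    # The move just taken becomes the reference for the next relative direction.
    last_move = (new_position[0] - position[0], new_position[1] - position[1])
    position = new_position
    visited.append(position)

print(visited)
```

Each `left` rotates the heading by 90 degrees, so four of them trace a unit square and return to the starting cell. This is why direction strings containing three consecutive `left` (or `right`) turns are rejected by the occupancy check.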

In [5]:
# Check if the next position is valid (within bounds and not already occupied)
def is_valid_position(grid, position):
    return 0 <= position[0] < len(grid) and 0 <= position[1] < len(grid[0]) and grid[position[0]][position[1]] == ''

In [6]:
def trimming_grid(grid):
    # Find the indices of non-empty rows and columns
    non_empty_rows = np.any(grid != '', axis=1)
    non_empty_columns = np.any(grid != '', axis=0)

    # Use boolean indexing to extract non-empty rows and columns
    trimmed_grid = grid[non_empty_rows][:, non_empty_columns]

    return trimmed_grid

In [7]:
def get_protein_dimensions(trimmed_grid):
    # Get the dimensions of the protein on the grid
    # Placeholder implementation
    return trimmed_grid.shape

In [8]:
# Recursive DFS function to generate all conformations
def dfs(grid, current_path, position, last_move, all_conformations, amino_acids, current_index):
    # Move vectors relative to last_move; the vector of the chosen direction
    # becomes last_move in the recursive call
    vectors = {
        'left': (-last_move[1], last_move[0]),
        'right': (last_move[1], -last_move[0]),
        'middle': last_move
    }

    if current_index == len(amino_acids):  # All amino acids have been placed
        all_conformations.append((['Start', 'middle'] + current_path, np.copy(grid)))  # Include the start and first straight move
        return

    for direction in ['left', 'middle', 'right']:
        next_position = get_next_position(position, direction, last_move)
        if is_valid_position(grid, next_position):
            grid[next_position] = amino_acids[current_index]  # Place the amino acid on the grid
            dfs(grid, current_path + [direction], next_position, vectors[direction], all_conformations, amino_acids, current_index + 1)
            grid[next_position] = ''  # Remove the amino acid when backtracking
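As a cross-check on the enumeration, the same backtracking idea can be used to count conformations without storing any grids. The sketch below is a simplified variant, not the notebook's implementation: `count_conformations` is a hypothetical name, and it tracks occupied cells in a set on an unbounded lattice instead of a fixed-size array.

```python
def count_conformations(n):
    """Count self-avoiding conformations of n residues with the first move fixed."""
    if n < 2:
        raise ValueError("need at least two residues")

    start, first_move = (0, 0), (0, 1)
    second = (start[0] + first_move[0], start[1] + first_move[1])
    occupied = {start, second}

    def extend(position, last_move, placed):
        if placed == n:  # all residues placed: one complete conformation
            return 1
        total = 0
        # Candidate moves: left, middle, right relative to the previous move.
        for move in ((-last_move[1], last_move[0]),
                     last_move,
                     (last_move[1], -last_move[0])):
            nxt = (position[0] + move[0], position[1] + move[1])
            if nxt not in occupied:
                occupied.add(nxt)
                total += extend(nxt, move, placed + 1)
                occupied.remove(nxt)  # backtrack
        return total

    return extend(second, first_move, 2)


print(count_conformations(5), count_conformations(10))
```

For lengths 5 and 10 this yields 25 and 4067, the same path counts the notebook reports in its timing output. Because the notebook's grid has side length `2 * total_amino_acids` and the chain starts at the center, the bounds check never excludes a conformation, so the two approaches are expected to agree.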

In [9]:
def generate_conformations(total_amino_acids, random_division, get_next_position, is_valid_position):
    """
    Generate all possible conformations for a given number of amino acids.

    Parameters:
    - total_amino_acids (int): Total number of amino acids (at least 2).
    - random_division (function): Function that returns a random H/P sequence of that length.
    - get_next_position (function): Function to get the next grid position.
    - is_valid_position (function): Function to check if a grid position is valid.

    Returns:
    - list of tuples (amino_acids, grid_state, path, trimmed_grid, protein_dimensions),
      one per conformation.
    """

    # Initialize the grid
    grid_size = total_amino_acids * 2  # Make the grid large enough
    grid = np.full((grid_size, grid_size), fill_value='', dtype=object)
    amino_acids = random_division(total_amino_acids)

    # Start the DFS from the second position
    start_position = (grid_size // 2, grid_size // 2)
    initial_move = (0, 1)  # Represents the 'straight' move
    second_position = get_next_position(start_position, 'middle', initial_move)
    grid[start_position] = amino_acids[0]  # Place the first amino acid
    grid[second_position] = amino_acids[1]  # Place the second amino acid

    # Perform DFS to find all paths from the third amino acid
    all_conformations = []
    dfs(grid, [], second_position, initial_move, all_conformations, amino_acids, 2)

    # Store the results in a list
    results = []
    for path, grid_state in all_conformations:
        trimmed_grid = trimming_grid(grid_state)
        protein_dimensions = get_protein_dimensions(trimmed_grid) 
        results.append((amino_acids, grid_state, path, trimmed_grid, protein_dimensions))
    
    return results

In [10]:
def store_samples_in_dataframe(amino_acid_lengths):
    data = {
        "Amino Acid Length": [],
        "1D Protein": [],
        "2D Protein": [], 
        "Trimmed 2D Protein": [],
        "Shape 2D Protein": [],
        "Amino Acid Path": [],
    }
    
    path_count = 0
    milestone = 500000

    for length in amino_acid_lengths:
        conformation_samples = generate_conformations(length, random_division, get_next_position, is_valid_position)
        for sample in conformation_samples:
            amino_acids, grid_state, path, trimmed_grid, protein_dimensions = sample
            data["Amino Acid Length"].append(length)
            data["1D Protein"].append(amino_acids)
            data["2D Protein"].append(grid_state)
            data["Trimmed 2D Protein"].append(trimmed_grid)
            data["Shape 2D Protein"].append(protein_dimensions)
            data["Amino Acid Path"].append(path)
            
            path_count +=1
            
            # Check if the path count has reached the next milestone
            if path_count >= milestone:
                print(f"Generated {milestone} paths for length {length}")
                milestone += 500000  # Update the next milestone

    df = pd.DataFrame(data)
    return df, path_count

# Generating conformations for lengths 5 and 10.
# Note: the function returns a tuple (DataFrame, path_count), so HP25[0] is the DataFrame.
amino_acid_lengths = [5, 10]
HP25 = store_samples_in_dataframe(amino_acid_lengths)

In [11]:
def timing_samples(amino_acid_lengths):
    total_times = {}

    for length in amino_acid_lengths:
        start_time = time.time()
        
        _, path_count = store_samples_in_dataframe([length])  # Get the path count
        
        end_time = time.time()
        
        total_times[length] = end_time - start_time
        print(f"Time taken to create and process samples for amino acid length {length}: {total_times[length]:.2f} seconds. Final path count: {path_count}")

    return total_times

### Step 4 - Creating Samples
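The code enumerates every valid protein grid conformation for each chain length using the `store_samples_in_dataframe` function, and times the enumeration for each length separately with `timing_samples`. The hydrophobic/polar sequence for each length is drawn at random by `random_division`.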

In [12]:
%%time
# Enumerate all conformations for amino acid lengths [5, 10, 15]

amino_acid_lengths = list(range(5, 16, 5))
time_data = timing_samples(amino_acid_lengths)
HP25 = store_samples_in_dataframe(amino_acid_lengths)

# amino_acid_lengths = list(range(5, 51, 5))
# time_data = timing_samples(amino_acid_lengths)
# HP50 = store_samples_in_dataframe(amino_acid_lengths)

Time taken to create and process samples for amino acid length 5: 0.00 seconds. Final path count: 25
Time taken to create and process samples for amino acid length 10: 0.16 seconds. Final path count: 4067
Generated 500000 paths for length 15
Time taken to create and process samples for amino acid length 15: 40.75 seconds. Final path count: 593611
Generated 500000 paths for length 15
CPU times: user 1min 8s, sys: 7.94 s, total: 1min 16s
Wall time: 1min 18s
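The path counts in this output can be related to the number of direction strings the DFS has to consider. With the first two entries of every path fixed, there are at most `3 ** (n - 2)` strings for `n` amino acids. The short sketch below computes the fraction of them that are self-avoiding; `valid_fraction` is a hypothetical helper, and the counts are copied from the output above.

```python
def valid_fraction(n, path_count):
    """Fraction of the 3**(n - 2) direction strings that are valid conformations."""
    return path_count / 3 ** (n - 2)


# Path counts copied from the timing output of this notebook.
path_counts = {5: 25, 10: 4067, 15: 593611}

for n, count in path_counts.items():
    print(f"n={n}: {count} of {3 ** (n - 2)} direction strings "
          f"are self-avoiding ({valid_fraction(n, count):.1%})")
```

The fraction falls from about 93% at length 5 to about 62% at length 10 and about 37% at length 15, so an increasing share of branches is pruned by the occupancy check as chains get longer.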


In [23]:
HP25[0].head()

| | Amino Acid Length | 1D Protein | 2D Protein | Trimmed 2D Protein | Shape 2D Protein | Amino Acid Path |
|---|---|---|---|---|---|---|
| 0 | 5 | [P, H, P, H, H] | [[, , , , , , , , , ], [, , , , , , , , , ], [... | [[H, H, P], [, P, H]] | (2, 3) | [Start, middle, left, left, middle] |
| 1 | 5 | [P, H, P, H, H] | [[, , , , , , , , , ], [, , , , , , , , , ], [... | [[H, ], [H, P], [P, H]] | (3, 2) | [Start, middle, left, left, right] |
| 2 | 5 | [P, H, P, H, H] | [[, , , , , , , , , ], [, , , , , , , , , ], [... | [[H, H], [, P], [P, H]] | (3, 2) | [Start, middle, left, middle, left] |
| 3 | 5 | [P, H, P, H, H] | [[, , , , , , , , , ], [, , , , , , , , , ], [... | [[, H], [, H], [, P], [P, H]] | (4, 2) | [Start, middle, left, middle, middle] |
| 4 | 5 | [P, H, P, H, H] | [[, , , , , , , , , ], [, , , , , , , , , ], [... | [[, H, H], [, P, ], [P, H, ]] | (3, 3) | [Start, middle, left, middle, right] |


In [32]:
solution_space_HP15 = HP25[0]['Amino Acid Path']
solution_space_HP15[0:25]

0         [Start, middle, left, left, middle]
1          [Start, middle, left, left, right]
2         [Start, middle, left, middle, left]
3       [Start, middle, left, middle, middle]
4        [Start, middle, left, middle, right]
5          [Start, middle, left, right, left]
6        [Start, middle, left, right, middle]
7         [Start, middle, left, right, right]
8         [Start, middle, middle, left, left]
9       [Start, middle, middle, left, middle]
10       [Start, middle, middle, left, right]
11      [Start, middle, middle, middle, left]
12    [Start, middle, middle, middle, middle]
13     [Start, middle, middle, middle, right]
14       [Start, middle, middle, right, left]
15     [Start, middle, middle, right, middle]
16      [Start, middle, middle, right, right]
17         [Start, middle, right, left, left]
18       [Start, middle, right, left, middle]
19        [Start, middle, right, left, right]
20       [Start, middle, right, middle, left]
21     [Start, middle, right, midd

In [29]:
# solution_space_HP15.to_csv('../Data/Solution Space.csv', index=False)
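
Since each stored path fully determines the conformation, lattice coordinates can be rebuilt from the direction lists alone, without the grid arrays. The sketch below shows one way to do this; `path_to_coordinates` is a hypothetical helper, and the origin `(0, 0)` and initial heading `(0, 1)` are assumptions that mirror the conventions used in `generate_conformations`.

```python
def path_to_coordinates(path, start=(0, 0), first_move=(0, 1)):
    """Rebuild (row, column) coordinates from a direction list such as
    ['Start', 'middle', 'left', ...] (hypothetical helper)."""
    if not path or path[0] != 'Start':
        raise ValueError("path must begin with 'Start'")

    position, last_move = start, first_move
    coordinates = [position]
    for direction in path[1:]:
        if direction == 'left':
            move = (-last_move[1], last_move[0])
        elif direction == 'right':
            move = (last_move[1], -last_move[0])
        elif direction == 'middle':
            move = last_move
        else:
            raise ValueError(f"unknown direction: {direction!r}")
        position = (position[0] + move[0], position[1] + move[1])
        coordinates.append(position)
        last_move = move  # the move just taken is the new reference heading
    return coordinates


example = ['Start', 'middle', 'left', 'left', 'middle']  # row 0 of the table above
print(path_to_coordinates(example))
```

For the first path in the solution space, the rebuilt coordinates span two rows and three columns, which agrees with the `(2, 3)` shape reported for that conformation.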