## About
This notebook generates the sequence files used to build the ensembles for this analysis. Specifically, it writes three files:

* `starling_comparison_data.csv` - CSV file with the STARLING SAXS comparison sequences, associated experimental data, and predictions from several other methods.
* `all_comparison_seqs.fasta` - sequences for the SAXS comparison in STARLING (i.e. the sequences in `starling_comparison_data.csv`).
* `all_comparison_seqs_GS_versions.fasta` - GS-repeat sequences, each length-matched to the corresponding sequence in `all_comparison_seqs.fasta`.
 
Ensembles were then generated using:

    starling all_comparison_seqs.fasta -c 600 -r

i.e. we generated 600 conformers instead of the default 400.

    

In [2]:
import protfasta
import pandas as pd

def build_gs(inseq):
    """
    Function which takes an input sequence and returns a GS sequence that is 
    length matched.

    Parameters
    -------------
    inseq : str
        Input sequence for reference

    Returns
    ------------
    str
        GS repeat with the same length as inseq
    """

    gs = ''
    next_res = 'G'
    while len(inseq) != len(gs):
        gs = gs + next_res
        if next_res == 'G':
            next_res='S'
        else:
            next_res='G'
    return gs
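
As a quick illustration of what `build_gs` returns, the sketch below uses a hypothetical closed-form helper, `build_gs_closed_form` (not part of the original notebook), that should produce the same alternating G/S string without a loop. The example inputs are made up for illustration.

```python
def build_gs_closed_form(inseq):
    """Hypothetical alternative to build_gs: same output, no loop."""
    n = len(inseq)
    # Repeat 'GS' enough times to cover n residues, then trim to length n
    return ('GS' * ((n + 1) // 2))[:n]

# Toy inputs for illustration only
for example in ['', 'A', 'MKV', 'ACDEFGHIKL']:
    print(repr(example), '->', repr(build_gs_closed_form(example)))
```

Every output starts with `G` and alternates, so odd-length inputs end on `G` and even-length inputs end on `S`.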
    

In [9]:
# Read our data in and print number of sequences
df = pd.read_csv('all_comparison_data.csv', delimiter=', ', engine='python')  
print(f"Initially, we have {len(df)} sequences...")

Initially, we have 137 sequences...


In [12]:
# filter down to sequences of at most 383 residues (i.e. length <= 383)
filtered_df = df[df['sequence'].str.len() <= 383]

print(f"After filtering, we have {len(filtered_df)} sequences...")

# save that filtered dataset out...
filtered_df.to_csv('starling_comparison_data.csv', index=False)

# Convert the name and sequence columns to lists, then build a
# name -> sequence dictionary from them
sequence_list = filtered_df['sequence'].tolist()
name_list = filtered_df['name'].tolist()

seqs = dict(zip(name_list, sequence_list))

# save our STARLING comparison sequences out to a FASTA file
protfasta.write_fasta(seqs,'all_comparison_seqs.fasta')    

After filtering, we have 133 sequences...
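
Because the FASTA file is written from a dictionary keyed by name, any repeated name would silently overwrite an earlier entry. A minimal sketch of a duplicate check is shown below; `find_duplicate_names` is a hypothetical helper and the names are toy values, not taken from the dataset.

```python
from collections import Counter

def find_duplicate_names(names):
    """Return names occurring more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]

# Toy names for illustration only (not taken from the dataset)
toy_names = ['protA', 'protB', 'protA', 'protC']
print(find_duplicate_names(toy_names))
```

In this notebook the same helper could be called as `find_duplicate_names(name_list)` before writing the FASTA file; an empty list would mean no entries were lost when building the dictionary.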


### Build GS equivalents 
We also build length-matched GS-repeat sequences for reference.

In [11]:
gs_seqs = {}
for index, row in filtered_df.iterrows():
    name = row['name']
    s    = row['sequence']
    gs = build_gs(s)
    gs_seqs['GS_' + name] = gs
protfasta.write_fasta(gs_seqs, 'all_comparison_seqs_GS_versions.fasta')
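
A possible sanity check on the GS versions is sketched below: every original entry should have a `GS_`-prefixed partner of the same length containing only G and S. `check_gs_versions` is a hypothetical helper, and the dictionaries are toy data standing in for `seqs` and `gs_seqs`.

```python
def check_gs_versions(seqs, gs_seqs, prefix='GS_'):
    """Raise if any GS entry is missing, has the wrong length, or is not pure G/S."""
    assert len(seqs) == len(gs_seqs), 'different number of entries'
    for name, seq in seqs.items():
        gs = gs_seqs[prefix + name]  # KeyError if the GS partner is missing
        assert len(gs) == len(seq), f'length mismatch for {name}'
        assert set(gs) <= {'G', 'S'}, f'unexpected residue in GS version of {name}'
    return True

# Toy data for illustration only
toy_seqs = {'protA': 'MKVLA', 'protB': 'ACD'}
toy_gs = {'GS_protA': 'GSGSG', 'GS_protB': 'GSG'}
print(check_gs_versions(toy_seqs, toy_gs))
```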
