# Graph-Part tutorial

In addition to the command-line interface for FASTA-formatted data, Graph-Part can be used directly from Python.

In [1]:
import pandas as pd
from graph_part import train_test_validation_split, stratified_k_fold

In [2]:
# make a tab-delimited file from a fasta file for convenience
!awk -F'[|\n]' 'BEGIN{RS=">"}BEGIN{print"ID\tkingdom\tclass\tright-pos\tleft-pos\texperimental\tsequence"}NF{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$8}' benchmarking/data/protein/NetGPI/netgpi_dataset.fasta | sed 's/\t[aA-zZ\-]*=/\t/g' > benchmarking/data/protein/NetGPI/netgpi.tsv
df = pd.read_table('benchmarking/data/protein/NetGPI/netgpi.tsv', index_col=0)
df

ID,kingdom,class,right-pos,left-pos,experimental,sequence
P15693,Animal,1,30,71,0,QQAAVPLSSETHGGEDVAIFARGPQAHLVHGVQEQNYIAHVMAFAG...
P83558,Animal,0,0,101,0,EGEVKNEFEERLKDEFKDPSRSEVAEVILLRELEVLEETLFGKEMT...
D4ASH1,Fungi,0,0,101,0,PPGSPIRPTASDYELSHRASRSWSTFGSTKEPSLPGHNTFKGFRKS...
Q6B4T5,Animal,0,0,101,0,RFLVGAVLVVVLVACATAFESDAETFKSLVVEERKCHGDGSKGCAT...
Q6B4T4,Animal,0,0,101,0,EGLLVLVLIAFVVAEFESDAEKWEALITQERACKGEGVKGCYYEAD...
...,...,...,...,...,...,...
Q95NK7,Animal,0,0,55,0,MKIFFAILLILAVCSMAIWTVNGTPFAIKCATDADCSRKCPGNPPC...
P16548,Animal,0,0,53,0,MASVKLFFIAILVVALSLNTSAAVLNPSSTAKPRFETKDRKLSAGA...
P15516,Animal,0,0,52,0,MKFFVFALILALMLSMTGADSHAKRHHGYKRKFHEKHHSHRGYRSN...
A0A1W6EVN2,Animal,0,0,51,0,MKAIMVLFYVMTLTIIGSFSMVSGSPGQNDYVNPKLQFACDLLQKA...


`stratified_k_fold` and `train_test_validation_split` accept the same parameters that are available on the command line. `labels` and `priority` are optional.

In [4]:
fold_ids = stratified_k_fold(df['sequence'], labels=df['class'], priority=df['experimental'], alignment_mode='needle', threads = 8, threshold=0.3, partitions=5)

Computing pairwise sequence identities.


  0%|          | 0/13089924 [00:00<?, ?it/s]

Full graph nr. of edges: 21218
723 [203 520]
Initialization mode slow-nn
2
Currently have this many samples: 3618
!  B1P0S1 Q59Y31 {'metric': 0.6910000000000001}  !
Need to remove! Currently have this many samples: 3618
After removal we have this many samples: 3607


The partitioning functions return one list of indices per partition; in this case, we get 5 lists. Because we passed pandas Series as inputs, the lists contain the `ID` index values. If we pass lists or NumPy arrays instead, positional indices are returned. Pandas Series, NumPy arrays, lists and ID:value dictionaries are all accepted as inputs.

In [5]:
len(fold_ids)

5

In [6]:
fold_ids[0][:10]

['P15693',
 'P83558',
 'Q4QRF7',
 'H2A0L1',
 'Q9XZ63',
 'A2QTU5',
 'A1XIH3',
 'L8FSM5',
 'A1XIH6',
 'A1XIH1']
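Since each fold is a list of `ID` labels, the folds can be combined with `DataFrame.loc` to build cross-validation splits. The following is a minimal sketch on toy data, not part of Graph-Part: `toy_df`, `toy_fold_ids` and `iter_cv_splits` are hypothetical names standing in for `df`, `fold_ids` and your own loop.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for `df` and `fold_ids` from the cells above.
toy_df = pd.DataFrame(
    {
        "class": [1, 0, 0, 1, 0, 1],
        "sequence": ["AAA", "CCC", "DDD", "EEE", "FFF", "GGG"],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5", "s6"], name="ID"),
)
# "s6" is deliberately absent: sequences removed during partitioning
# do not appear in any fold.
toy_fold_ids = [["s1", "s2"], ["s3", "s4"], ["s5"]]


def iter_cv_splits(frame, folds):
    """Yield (fold number, training rows, held-out rows) for each fold."""
    for k, held_out in enumerate(folds):
        # Training IDs are the union of all folds except fold k.
        train_ids = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield k, frame.loc[train_ids], frame.loc[held_out]


cv_splits = list(iter_cv_splits(toy_df, toy_fold_ids))
for k, train_part, eval_part in cv_splits:
    print(k, len(train_part), len(eval_part))
```

With the real data, the same loop would be called as `iter_cv_splits(df, fold_ids)`. Rows that were removed to satisfy the threshold are simply never selected.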

In [3]:
# let's use arrays this time.
sequences = df['sequence'].to_numpy()
labels = df['class'].to_numpy()
priority = df['experimental'].to_numpy()
train_idx, test_idx, valid_idx = train_test_validation_split(sequences, 
                                                             labels=labels, 
                                                             priority=priority, 
                                                             alignment_mode='needle', 
                                                             threads = 8,
                                                             threshold = 0.3,
                                                             test_size = 0.15,
                                                             valid_size = 0.05
                                                            )

Computing pairwise sequence identities.


  0%|          | 0/13089924 [00:00<?, ?it/s]

Full graph nr. of edges: 21218
180 [ 50 130]
Initialization mode slow-nn
2
Currently have this many samples: 3618
!  seq_1496 seq_2911 {'metric': 0.6859999999999999}  !
Need to remove! Currently have this many samples: 3618
After removal we have this many samples: 3612


Because the inputs were NumPy arrays, the returned values are positional indices. Use them to index the arrays directly and process your data as needed.

In [6]:
train_sequences = sequences[train_idx]
train_labels = labels[train_idx]

test_sequences = sequences[test_idx]
test_labels = labels[test_idx]

valid_sequences = sequences[valid_idx]
valid_labels = labels[valid_idx]

In [7]:
print(len(train_sequences))
print(len(test_sequences))
print(len(valid_sequences))

2953
496
163
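Before training, it can be worth checking that the splits do not overlap and seeing which fraction of the retained samples each one received. The helpers below are a hypothetical sketch using only NumPy; `check_disjoint`, `split_fractions` and `toy_splits` are illustrative names, not Graph-Part functions.

```python
import numpy as np


def check_disjoint(splits):
    """Return True if no index appears in more than one split."""
    combined = np.concatenate([np.asarray(idx) for idx in splits.values()])
    return len(np.unique(combined)) == len(combined)


def split_fractions(splits):
    """Return each split's share of all samples that were kept."""
    kept = sum(len(idx) for idx in splits.values())
    return {name: len(idx) / kept for name, idx in splits.items()}


# Toy positional indices (hypothetical) standing in for
# train_idx, test_idx and valid_idx.
toy_splits = {"train": [0, 1, 2, 5, 7, 8], "test": [3, 9], "valid": [4]}

is_disjoint = check_disjoint(toy_splits)
fractions = split_fractions(toy_splits)
print(is_disjoint, fractions)
```

With the real indices, the call would be `split_fractions({"train": train_idx, "test": test_idx, "valid": valid_idx})`. For the sizes printed above (2953, 496 and 163, summing to the 3612 retained samples), the shares are roughly 0.82, 0.14 and 0.05.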
