# Fix T. brucei fraction column order

The external *T. brucei* experiments "sax_1" and "sec_1" contain columns that are sorted alphabetically rather than numerically. These are quick-and-dirty scripts that read the original *T. brucei* "sax_1" and "sec_1" experiments (moved to `/stor/work/Marcotte/project/rmcox/leca/ppi_ml/results/elutions/elut_excavate/tryb2_archive/`), re-sort the columns, and write corrected files to `/stor/work/Marcotte/project/rmcox/leca/ppi_ml/results/elutions/elut_excavate/`. **Note:** *T. brucei* "sec_2" is fine and contains correctly ordered columns.
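To make the problem concrete, here is a minimal sketch of how a plain string sort scrambles fraction labels and how sorting on the trailing integer repairs the order. The four labels are illustrative, modeled on the "sax_1" header with the underscore suffix already removed.

```python
# Illustrative fraction labels, modeled on the sax_1 header (suffixes removed).
labels = ["PT5528.10", "PT5528.11", "PT5528.1", "PT5528.2"]

# A plain string sort compares character by character, so "10" lands before "2".
alphabetical = sorted(labels)

# Sorting on the integer after the last period restores fraction order.
numerical = sorted(labels, key=lambda label: int(label.split(".")[-1]))

print(alphabetical)
print(numerical)
```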

In [1]:
!pwd

/stor/work/Marcotte/project/rmcox/leca/notebooks


In [2]:
!ls ../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2*_1*elut

../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sax_1.prot_count_mFDRpsm001.unique.elut
../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sax_1.prot_count_mFDRpsm001.unique.filtdollo.elut
../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sax_1.prot_count_mFDRpsm001.unique.filtdollo.norm.elut
../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sax_1.prot_count_mFDRpsm001.unique.gold.norm.elut
../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sax_1.prot_count_mFDRpsm001.unique.gold.raw.elut
../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sax_1.prot_count_mFDRpsm001.unique.norm.elut
../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sec_1.prot_count_mFDRpsm001.unique.elut
../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sec_1.prot_count_mFDRpsm001.unique.filtdollo.elut
../ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sec_1.prot_count_mFDRpsm001.unique.filtdollo.norm.elut
../ppi_ml/results

In [3]:
import re
import os
import pandas as pd

In [4]:
input_dir = '../ppi_ml/results/elutions/elut_excavate/tryb2_archive/'
output_dir = '../ppi_ml/results/elutions/elut_excavate/'
sax1_pattern = 'tryb2.sax_1.*elut'
sec1_pattern = 'tryb2.sec_1.*elut'

In [5]:
def fix_tryb2(file_path, outdir, delim):
    # get file name
    fn = os.path.basename(file_path)
    # read in file, using the first column as the index
    df = pd.read_csv(file_path, sep=delim, index_col=0)
    # standardize column names to the form <prefix>.<fraction number>
    df.rename(columns=lambda name: re.sub('-', '.', name), inplace=True)  # standardize fraction number sep
    df.rename(columns=lambda name: re.sub('_.*', '', name), inplace=True)  # drop trailing underscore IDs
    # split column names on period and take integer of last item
    cols = [(col, int(col.split('.')[-1])) for col in df]
    # sort (column name, integer) tuple pairs on integer value
    cols.sort(key=lambda x: x[1])
    # rearrange data frame based on new col order
    df = df[[col[0] for col in cols]]
    # write out corrected file (always comma-delimited, whatever the input delimiter)
    df.reset_index(inplace=True)
    df.to_csv(os.path.join(outdir, fn), index=False)

## Strong anion exchange #1 ("sax_1")

The "sax_1" elut file is patterned like so:

| PT5528.10_151013230239 | PT5528.11_151014012454 | PT5528.1_151013014221 | ... |
| --- | ----------- | ----------- | ----------- |
| 0 | 0 | 1 | ... |
| 0 | 1 | 2 | ... |
| 1 | 2 | 3 | ... |

Fraction order is given by the number between 'PT5528.' and the underscore, not by the value after the underscore, so we will extract that number and use it to reorder the columns.
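As a minimal sketch of that extraction, the helper below pulls the fraction number out of a raw "sax_1" label. The name `sax_fraction_number` is hypothetical, and the pattern assumes the underscore suffix is purely numeric, as it is in the three labels shown in the table.

```python
import re

def sax_fraction_number(label):
    """Return the fraction number from a label like 'PT5528.10_151013230239'."""
    # Assumption: labels are 'PT5528.<fraction>_<digits>'; anything else is rejected.
    match = re.fullmatch(r"PT5528\.(\d+)_\d+", label)
    if match is None:
        raise ValueError(f"unexpected sax_1 column label: {label!r}")
    return int(match.group(1))

labels = ["PT5528.10_151013230239", "PT5528.11_151014012454", "PT5528.1_151013014221"]

# Sort the raw labels by fraction number, ignoring the underscore suffix.
ordered = sorted(labels, key=sax_fraction_number)
print(ordered)
```

Raising on an unexpected label is a deliberate choice here: a silently mis-parsed column would put a fraction in the wrong position without any warning.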

In [6]:
# get elut files
files = [fn for fn in os.listdir(input_dir) if re.search(sax1_pattern, fn)]
# identify tsv exception
tsv_exception = [fn for fn in files if re.search('.*unique.elut', fn)]
# get only csv files
csv_files = [fn for fn in files if fn not in tsv_exception]

In [7]:
csv_files

['tryb2.sax_1.prot_count_mFDRpsm001.unique.gold.norm.elut',
 'tryb2.sax_1.prot_count_mFDRpsm001.unique.gold.raw.elut',
 'tryb2.sax_1.prot_count_mFDRpsm001.unique.filtdollo.elut',
 'tryb2.sax_1.prot_count_mFDRpsm001.unique.filtdollo.norm.elut',
 'tryb2.sax_1.prot_count_mFDRpsm001.unique.norm.elut']

In [8]:
tsv_exception[0]

'tryb2.sax_1.prot_count_mFDRpsm001.unique.elut'

In [9]:
# fix csv files
for f in csv_files:
    fp = input_dir+f
    fix_tryb2(fp, output_dir, ',')

# fix tsv file
fix_tryb2(input_dir+tsv_exception[0], output_dir, '\t')

## Size exclusion #1 ("sec_1")

The "sec_1" elut file is patterned like so:

| PT4651-108 | PT4651-109 | PT4651-10 | ... |
| --- | ----------- | ----------- | ----------- |
| 0 | 0 | 3 | ... |
| 0 | 1 | 2 | ... |
| 1 | 2 | 1 | ... |

The order is identified by the value following the string 'PT4651', so we will extract that value & use it to reorder the cols.
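A quick way to confirm that a header is in fraction order, before or after correction, is to check that the trailing integers strictly increase. The sketch below is an assumption-laden helper rather than part of the original scripts: the names `fraction_numbers` and `is_numerically_ordered` are hypothetical, and it accepts either '-' or '.' as the separator. It allows gaps, since some fraction numbers are absent from these files (the `head()` output for "sec_1" jumps from 90 to 92).

```python
import re

def fraction_numbers(columns):
    """Extract the trailing integer from labels such as 'PT4651-108' or 'PT4651.108'."""
    return [int(re.split(r"[-.]", col)[-1]) for col in columns]

def is_numerically_ordered(columns):
    """True when fraction numbers strictly increase; gaps are allowed."""
    numbers = fraction_numbers(columns)
    return all(a < b for a, b in zip(numbers, numbers[1:]))

# Header order as found in the archive vs. the order expected after the fix.
before = ["PT4651-108", "PT4651-109", "PT4651-10"]
after = ["PT4651.10", "PT4651.108", "PT4651.109"]

print(is_numerically_ordered(before))
print(is_numerically_ordered(after))
```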

In [10]:
# get elut files
files = [fn for fn in os.listdir(input_dir) if re.search(sec1_pattern, fn)]
# identify tsv exception
tsv_exception = [fn for fn in files if re.search('.*unique.elut', fn)]
# get only csv files
csv_files = [fn for fn in files if fn not in tsv_exception]

In [11]:
csv_files

['tryb2.sec_1.prot_count_mFDRpsm001.unique.filtdollo.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.filtdollo.norm.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.gold.raw.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.norm.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.gold.norm.elut']

In [12]:
tsv_exception[0]

'tryb2.sec_1.prot_count_mFDRpsm001.unique.elut'

In [13]:
df = pd.read_csv(input_dir+csv_files[0])
df.head()

Unnamed: 0,X1,PT4651-100,PT4651-101,PT4651-102,PT4651-103,PT4651-104,PT4651-105,PT4651-106,PT4651-107,PT4651-108,...,PT4651-90,PT4651-92,PT4651-93,PT4651-94,PT4651-95,PT4651-96,PT4651-97,PT4651-98,PT4651-99,PT4651-9
0,ENOG502QPHT,1,1,2,2,1,4,1,0,3,...,1,3,0,3,3,1,0,1,2,2
1,ENOG502QPJC,32,33,36,30,28,38,38,34,36,...,3,0,2,3,5,8,19,30,29,44
2,ENOG502QPKQ,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,2
3,ENOG502QPM4,3,1,0,0,1,0,2,2,2,...,3,0,2,2,1,2,4,0,0,5
4,ENOG502QPN1,5,6,6,7,4,3,3,5,2,...,0,1,0,0,1,0,2,3,4,4


In [14]:
# fix csv files
for f in csv_files:
    fp = input_dir+f
    fix_tryb2(fp, output_dir, ',')

# fix tsv file
fix_tryb2(input_dir+tsv_exception[0], output_dir, '\t')

## Generate tidy files

Corrected .elut files will need to be melted to make corrected replacements for the .tidy files.
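For reference, "melting" here means reshaping the wide table (one column per fraction) into long format (one row per orthogroup and fraction pair). The sketch below uses a toy table with made-up orthogroup IDs and counts. The output column names `fraction_id` and `PSMs` are the ones used by `make_tidy`.

```python
import pandas as pd

# Toy wide-format elution table; IDs and values are made up for illustration.
wide = pd.DataFrame(
    {"PT5528.1": [0, 2], "PT5528.2": [1, 3]},
    index=pd.Index(["OG_A", "OG_B"], name="orthogroup"),
)

# Melt to long format: one row per (orthogroup, fraction) pair.
tidy = pd.melt(
    wide.reset_index(),
    id_vars=["orthogroup"],
    var_name="fraction_id",
    value_name="PSMs",
)
print(tidy)
```

Because `pd.melt` stacks the value columns in their existing order, the row order of the tidy file follows the column order of the .elut file, which is why the tidy files have to be regenerated from the corrected inputs.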

In [16]:
!ls /stor/work/Marcotte/project/rmcox/leca/ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2*tidy

/stor/work/Marcotte/project/rmcox/leca/ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sax_1.prot_count_mFDRpsm001.unique.filtdollo.norm.tidy
/stor/work/Marcotte/project/rmcox/leca/ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sax_1.prot_count_mFDRpsm001.unique.filtdollo.tidy
/stor/work/Marcotte/project/rmcox/leca/ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sec_1.prot_count_mFDRpsm001.unique.filtdollo.norm.tidy
/stor/work/Marcotte/project/rmcox/leca/ppi_ml/results/elutions/elut_excavate/tryb2_archive/tryb2.sec_1.prot_count_mFDRpsm001.unique.filtdollo.tidy


Some of these are missing for certain kinds of elution profiles (e.g., only filtdollo appears here), so this script needs to be generalized and run for all species/clades.
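One obstacle to generalizing is that elution files do not share a delimiter: in the archive, the `.unique.elut` files are tab-delimited while the derived files are comma-delimited. A minimal sketch of one way to handle this is to inspect the header line before reading. The helper name `sniff_delimiter` is hypothetical, and the heuristic assumes the header contains only one of the two delimiters in any quantity.

```python
def sniff_delimiter(path):
    """Guess tab vs. comma from the header line of an elution file."""
    with open(path) as handle:
        header = handle.readline()
    # Assumption: whichever candidate occurs more often in the header is the delimiter.
    return "\t" if header.count("\t") > header.count(",") else ","
```

The result could then be passed straight to the reader, e.g. `pd.read_csv(path, sep=sniff_delimiter(path), index_col=0)`, instead of reading each file once as comma-delimited and retrying on failure.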

In [61]:
input_dir = '../ppi_ml/results/elutions/elut_excavate/'
pattern = 'tryb2.*elut'
output_dir = '../ppi_ml/results/elutions/elut_excavate/'
files = [fn for fn in os.listdir(input_dir) if re.search(pattern, fn)]

In [39]:
files.sort()
files

['tryb2.sax_1.prot_count_mFDRpsm001.unique.elut',
 'tryb2.sax_1.prot_count_mFDRpsm001.unique.filtdollo.elut',
 'tryb2.sax_1.prot_count_mFDRpsm001.unique.filtdollo.norm.elut',
 'tryb2.sax_1.prot_count_mFDRpsm001.unique.gold.norm.elut',
 'tryb2.sax_1.prot_count_mFDRpsm001.unique.gold.raw.elut',
 'tryb2.sax_1.prot_count_mFDRpsm001.unique.norm.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.filtdollo.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.filtdollo.norm.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.gold.norm.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.gold.raw.elut',
 'tryb2.sec_1.prot_count_mFDRpsm001.unique.norm.elut',
 'tryb2.sec_2.prot_count_mFDRpsm001.unique.elut',
 'tryb2.sec_2.prot_count_mFDRpsm001.unique.filtdollo.elut',
 'tryb2.sec_2.prot_count_mFDRpsm001.unique.filtdollo.norm.elut',
 'tryb2.sec_2.prot_count_mFDRpsm001.unique.gold.norm.elut',
 'tryb2.sec_2.prot_count_mFDRpsm001.unique.gold.raw.elut',
 'tryb2.

In [48]:
cols = [c for c in df]
cols

['orthogroup',
 'PT5528.1',
 'PT5528.2',
 'PT5528.3',
 'PT5528.4',
 'PT5528.5',
 'PT5528.6',
 'PT5528.7',
 'PT5528.8',
 'PT5528.9',
 'PT5528.10',
 'PT5528.11',
 'PT5528.12',
 'PT5528.13',
 'PT5528.14',
 'PT5528.15',
 'PT5528.16',
 'PT5528.17',
 'PT5528.18',
 'PT5528.19',
 'PT5528.20',
 'PT5528.21',
 'PT5528.22',
 'PT5528.23',
 'PT5528.24',
 'PT5528.25',
 'PT5528.26',
 'PT5528.27',
 'PT5528.28',
 'PT5528.29',
 'PT5528.30',
 'PT5528.31',
 'PT5528.32',
 'PT5528.33',
 'PT5528.34',
 'PT5528.35',
 'PT5528.36',
 'PT5528.37',
 'PT5528.38',
 'PT5528.39',
 'PT5528.40',
 'PT5528.41',
 'PT5528.42',
 'PT5528.43',
 'PT5528.44',
 'PT5528.45',
 'PT5528.46',
 'PT5528.47',
 'PT5528.48',
 'PT5528.49',
 'PT5528.50',
 'PT5528.51',
 'PT5528.52',
 'PT5528.53',
 'PT5528.54',
 'PT5528.55',
 'PT5528.56',
 'PT5528.57',
 'PT5528.58',
 'PT5528.59',
 'PT5528.60',
 'PT5528.62',
 'PT5528.63',
 'PT5528.64',
 'PT5528.65',
 'PT5528.66',
 'PT5528.67',
 'PT5528.68',
 'PT5528.69',
 'PT5528.70',
 'PT5528.71',
 'PT5528.72',


In [59]:
def make_tidy(file_path, outdir):
    fn = os.path.basename(file_path)
    fb = os.path.splitext(fn)[0]
    print(f'Parsing {fn} ...')
    # try comma-delimited first
    df = pd.read_csv(file_path, index_col=0, sep=',')
    # fewer than two data columns suggests the file is actually tab-delimited
    if len(df.columns) < 2:
        try:
            df = pd.read_csv(file_path, index_col=0, sep='\t')
        except Exception:
            print(f'{fn} is an invalid file type.')
            return
    df.index.names = ['orthogroup']
    df.reset_index(inplace=True)
    print(df.head())
    # melt wide fraction columns into one row per (orthogroup, fraction)
    df_tidy = pd.melt(df, id_vars=['orthogroup'],
                      var_name='fraction_id', value_name='PSMs')
    print(df_tidy.head())
    file_out = fb+'.tidy'
    print(f'Writing {file_out} ...')
    # write to the outdir argument rather than the global output_dir
    df_tidy.to_csv(os.path.join(outdir, file_out), index=False)

In [62]:
for f in files:
    fp = input_dir+f
    make_tidy(fp, output_dir)

Parsing tryb2.sax_1.prot_count_mFDRpsm001.unique.norm.elut ...
    orthogroup  PT5528.1  PT5528.2  PT5528.3  PT5528.4  PT5528.5  PT5528.6  \
0  ENOG502QPHS  0.000000       0.5  0.000000       0.0  0.500000       0.0   
1  ENOG502QPHT  0.500000       0.0  0.000000       0.0  0.500000       0.5   
2  ENOG502QPI5  0.500000       0.5  1.000000       1.0  0.000000       0.5   
3  ENOG502QPJ8  0.000000       0.0  0.000000       0.0  0.000000       0.0   
4  ENOG502QPJC  0.403509       0.0  0.017544       0.0  0.017544       0.0   

   PT5528.7  PT5528.8  PT5528.9  ...  PT5528.87  PT5528.88  PT5528.89  \
0  0.000000       0.0       0.0  ...   0.000000   0.000000   1.000000   
1  0.000000       0.0       0.0  ...   0.000000   0.000000   0.000000   
2  0.000000       0.5       0.0  ...   0.000000   0.000000   0.000000   
3  0.000000       0.0       0.0  ...   0.000000   0.200000   0.000000   
4  0.017544       0.0       0.0  ...   0.333333   0.403509   0.368421   

   PT5528.90  PT5528.91  PT55

Following this, I will need to regenerate every full concatenated version of the LECA elution profiles that contains these files. The new files will have the suffix '.ordered', and the old files will be moved to `/stor/work/Marcotte/project/rmcox/leca/ppi_ml/results/elutions/archive/`.