# Create PAO1 and PA14 compendia

This notebook uses the observations from the [exploratory notebook](../explore_data/cluster_by_accessory_gene.ipynb) to bin samples into PAO1 or PA14 compendia.

A sample is considered PAO1 if the median gene expression of PA14 accessory genes is 0 and that of PAO1 accessory genes is > 0.
Similarly, a sample is considered PA14 if the median gene expression of PA14 accessory genes is > 0 and that of PAO1 accessory genes is 0.
In practice, the cutoffs used are the `same_threshold` and `opp_threshold` parameters defined below.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import pandas as pd
import seaborn as sns
from textwrap import fill
import matplotlib.pyplot as plt
from core_acc_modules import paths, utils



In [2]:
# User params
# same_threshold: a sample is binned as PAO1 if its median expression of PAO1 accessory genes
# is > same_threshold (and as PA14 if its median expression of PA14 accessory genes is > same_threshold)
# Threshold of 25 is based on comparing expression of PAO1 SRA-labeled samples vs non-PAO1 samples
same_threshold = 25

# opp_threshold: a sample is binned as PAO1 only if its median expression of PA14 accessory genes
# is < opp_threshold (and as PA14 only if its median expression of PAO1 accessory genes is < opp_threshold)
# Threshold of 25 is based on the previous plot (eyeballed to avoid samples
# on the diagonal of the explore_data/cluster_by_accessory_gene.ipynb plot)
opp_threshold = 25

## Load data

In [3]:
# Expression data files
pao1_expression_filename = paths.PAO1_GE
pa14_expression_filename = paths.PA14_GE

# File containing table to map sample id to strain name
sample_to_strain_filename = paths.SAMPLE_TO_STRAIN

In [4]:
# Load expression data
# Matrices will be sample x gene after taking the transpose
pao1_expression = pd.read_csv(pao1_expression_filename, index_col=0, header=0).T
pa14_expression = pd.read_csv(pa14_expression_filename, index_col=0, header=0).T

In [5]:
# Load metadata
# Set index to experiment id, which is what we will use to map to expression data
sample_to_strain_table_full = pd.read_csv(sample_to_strain_filename, index_col=2)

## Get core and accessory annotations

In [6]:
pao1_annot_filename = paths.GENE_PAO1_ANNOT
pa14_annot_filename = paths.GENE_PA14_ANNOT

core_acc_dict = utils.get_my_core_acc_genes(
    pao1_annot_filename, pa14_annot_filename, pao1_expression, pa14_expression
)

Number of PAO1 core genes: 5366
Number of PA14 core genes: 5363
Number of PAO1 core genes in my dataset: 5361
Number of PA14 core genes in my dataset: 5357
Number of PAO1-specific genes: 202
Number of PA14-specific genes: 534


In [7]:
pao1_acc = core_acc_dict["acc_pao1"]
pa14_acc = core_acc_dict["acc_pa14"]

## Format expression data

Format the index to include only the experiment id. This will be used to map to expression data and SRA labels later.

In [8]:
# Format expression data indices so that values can be mapped to `sample_to_strain_table`
pao1_index_processed = pao1_expression.index.str.split(".").str[0]
pa14_index_processed = pa14_expression.index.str.split(".").str[0]

print(
    f"No. of samples processed using PAO1 reference after filtering: {pao1_expression.shape}"
)
print(
    f"No. of samples processed using PA14 reference after filtering: {pa14_expression.shape}"
)
pao1_expression.index = pao1_index_processed
pa14_expression.index = pa14_index_processed

No. of samples processed using PAO1 reference after filtering: (2767, 5563)
No. of samples processed using PA14 reference after filtering: (2767, 5891)
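As a small illustration of the index formatting above, the sketch below applies the same `str.split(".").str[0]` call to a toy index. The `.salmon` suffix and the ids are made up for illustration; the real suffix in the expression files is not shown in this notebook.

```python
import pandas as pd

# Hypothetical sample ids: an experiment id, optionally followed by a suffix
toy_index = pd.Index(["ERX541572.salmon", "SRX5057740.salmon", "ERX541574"])

# Keep only the text before the first "."
toy_ids = toy_index.str.split(".").str[0]

# Truncating ids could in principle create duplicates, so it is worth checking
toy_ids_unique = toy_ids.is_unique
```

Ids without a "." are left unchanged, and the uniqueness check guards against two files collapsing onto the same experiment id.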


In [9]:
pao1_expression.head()

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA1905,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1
ERX541572,5793.218939,766.512255,1608.330977,1663.46607,176.163343,384.600886,295.846835,453.183561,611.865046,43.032267,...,1.344758,75.306467,447.804528,10.758067,65.893159,44.377025,56.47985,2033.274614,184.231893,1.344758
ERX541573,4416.506898,797.782811,1770.117221,1562.763979,313.958581,324.501966,333.873864,415.87797,550.599003,38.659079,...,1.171487,103.090877,698.206395,18.743796,138.235494,39.830566,96.061954,1315.580171,91.376005,1.171487
ERX541574,3825.086116,644.433113,1852.251003,1589.338107,260.936107,270.820051,363.729119,363.729119,423.03278,67.210815,...,1.976789,128.491265,468.49892,19.767887,73.141182,33.605408,55.350083,1759.341934,67.210815,3.953577
ERX541575,3834.097653,789.216207,1926.825153,1610.427665,289.734779,261.294555,250.629471,520.811596,666.567742,53.325419,...,1.777514,124.425979,611.464809,15.997626,177.751398,21.330168,108.428353,1486.001686,56.880447,1.777514
ERX541576,3515.165133,853.775186,2185.27713,1683.341246,183.98936,245.319146,253.388855,380.890253,551.968079,66.171612,...,1.613942,90.380738,745.641089,11.297592,130.729282,50.032194,95.222563,1273.400041,72.627379,1.613942


In [10]:
pa14_expression.head()

Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_19205,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845
ERX541572,204.761199,49.806778,8.30113,16.602259,22.136346,11.068173,13.835216,74.710167,77.47721,5.534086,...,58.107908,166.022594,2692.333064,204.761199,27.670432,16.602259,1090.215033,520.204128,110.681729,2.767043
ERX541573,163.421371,44.569465,18.908258,24.310617,9.454129,18.908258,6.752949,63.477723,56.724773,9.454129,...,55.374184,202.588477,1755.766798,67.529492,5.402359,13.505898,1493.752368,598.311301,62.127133,1.35059
ERX541574,201.758337,14.497605,7.248803,15.705739,6.040669,18.122006,9.66507,48.32535,59.198554,8.456936,...,67.65549,224.712879,2213.301042,77.32056,7.248803,13.289471,1680.514056,885.562044,164.306191,1.208134
ERX541575,186.502345,46.124236,10.027008,30.081023,10.027008,14.037811,6.016205,44.118834,42.113433,8.021606,...,40.108031,240.648187,1500.040368,64.17285,6.016205,12.032409,1251.370574,677.825728,140.378109,2.005402
ERX541576,223.958038,23.864381,12.850051,31.207268,14.685773,11.01433,11.01433,40.385876,55.071649,9.178608,...,69.757422,212.943708,1672.342397,62.414535,1.835722,14.685773,1325.39101,627.816794,104.636132,1.835722


In [11]:
# Save pre-binned expression data
pao1_expression.to_csv(paths.PAO1_PREBIN_COMPENDIUM, sep="\t")
pa14_expression.to_csv(paths.PA14_PREBIN_COMPENDIUM, sep="\t")

## Bin samples as PAO1 or PA14

In [12]:
# Create accessory df
# accessory gene ids | median accessory expression | strain label

# PAO1
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
pao1_acc_expression = pao1_expression[pao1_acc].copy()
pao1_acc_expression["median_acc_expression"] = pao1_acc_expression.median(axis=1)

# PA14
pa14_acc_expression = pa14_expression[pa14_acc].copy()
pa14_acc_expression["median_acc_expression"] = pa14_acc_expression.median(axis=1)

pao1_acc_expression.head()


Unnamed: 0,PA3498,PA4192,PA0100,PA4638,PA0499,PA2225,PA0258,PA2736,PA3500,PA1224,...,PA3067,PA3157,PA3147,PA2727,PA2186,PA1385,PA0187,PA2334,PA4623,median_acc_expression
ERX541572,84.719776,51.100817,115.649217,337.534344,40.34275,21.516133,45.721784,289.123044,20.171375,14.792342,...,51.100817,2630.347318,1892.074988,150.612934,2.689517,309.294419,6.723792,170.784309,16.1371,82.030259
ERX541573,82.004107,52.716926,151.121854,182.75201,26.944207,7.028923,83.175594,120.663186,8.200411,22.258258,...,36.316104,3045.866826,1435.07187,180.409035,1.171487,160.493752,22.258258,90.204518,14.057847,82.58985
ERX541574,86.978702,35.582196,156.166307,247.098586,43.489351,7.907155,120.58411,217.446756,19.767887,15.81431,...,21.744676,3917.995184,2115.163899,241.16822,1.976789,160.119884,13.837521,90.93228,19.767887,87.967097
ERX541575,67.545531,33.772766,147.53366,199.081566,74.655587,5.332542,92.430727,199.081566,19.552654,24.885196,...,23.107682,3284.845833,1621.092749,245.296929,5.332542,122.648465,8.88757,39.105308,21.330168,76.433101
ERX541576,85.538913,38.734602,167.849942,179.147534,69.399495,12.911534,54.87402,127.501398,17.753359,20.981243,...,25.823068,2853.449015,2154.612237,221.11002,1.613942,85.538913,24.209126,66.171612,12.911534,79.083146
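The per-sample median computed above can be checked by hand on a toy table. This is a minimal sketch with made-up gene names and values; it assumes the accessory gene list is a plain list of column names.

```python
import pandas as pd

# Toy sample x gene matrix (values are made up)
toy_expression = pd.DataFrame(
    {
        "gene_core": [500.0, 450.0],
        "gene_acc_1": [10.0, 0.0],
        "gene_acc_2": [30.0, 2.0],
        "gene_acc_3": [50.0, 4.0],
    },
    index=["sample_a", "sample_b"],
)
toy_acc_genes = ["gene_acc_1", "gene_acc_2", "gene_acc_3"]

# Select accessory genes on an explicit copy, then add the row-wise median
toy_acc = toy_expression[toy_acc_genes].copy()
toy_acc["median_acc_expression"] = toy_acc[toy_acc_genes].median(axis=1)
```

Here `sample_a` gets a median of 30.0 and `sample_b` a median of 2.0; the core gene does not contribute because it is not in the accessory list.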


In [13]:
# Merge PAO1 and PA14 accessory dataframes
pao1_pa14_acc_expression = pao1_acc_expression.merge(
    pa14_acc_expression,
    left_index=True,
    right_index=True,
    suffixes=["_pao1", "_pa14"],
)

pao1_pa14_acc_expression.head()

Unnamed: 0,PA3498,PA4192,PA0100,PA4638,PA0499,PA2225,PA0258,PA2736,PA3500,PA1224,...,PA14_58910,PA14_51540,PA14_41300,PA14_43080,PA14_54910,PA14_55080,PA14_15470,PA14_59950,PA14_49510,median_acc_expression_pa14
ERX541572,84.719776,51.100817,115.649217,337.534344,40.34275,21.516133,45.721784,289.123044,20.171375,14.792342,...,2.767043,141.119205,207.528242,55.340865,2.767043,2.767043,8.30113,2.767043,2.767043,5.534086
ERX541573,82.004107,52.716926,151.121854,182.75201,26.944207,7.028923,83.175594,120.663186,8.200411,22.258258,...,5.402359,144.513113,103.995418,74.282441,1.35059,1.35059,1.35059,1.35059,2.70118,2.70118
ERX541574,86.978702,35.582196,156.166307,247.098586,43.489351,7.907155,120.58411,217.446756,19.767887,15.81431,...,1.208134,118.397108,102.691369,1.208134,3.624401,2.416268,1.208134,1.208134,1.208134,2.416268
ERX541575,67.545531,33.772766,147.53366,199.081566,74.655587,5.332542,92.430727,199.081566,19.552654,24.885196,...,4.010803,106.286283,72.194456,2.005402,4.010803,2.005402,2.005402,2.005402,4.010803,4.010803
ERX541576,85.538913,38.734602,167.849942,179.147534,69.399495,12.911534,54.87402,127.501398,17.753359,20.981243,...,5.507165,121.157627,80.771751,69.757422,1.835722,7.342886,1.835722,3.671443,1.835722,3.671443


In [14]:
# Find PAO1 samples
pao1_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1>@same_threshold & median_acc_expression_pa14<@opp_threshold"
    ).index
)

In [15]:
# Find PA14 samples
pa14_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1<@opp_threshold & median_acc_expression_pa14>@same_threshold"
    ).index
)

In [16]:
# Check that there are no samples that are binned as both PAO1 and PA14
shared_pao1_pa14_binned_ids = list(set(pao1_binned_ids).intersection(pa14_binned_ids))

assert len(shared_pao1_pa14_binned_ids) == 0
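To make the binning rule concrete, here is a toy sketch that applies the same two conditions with boolean masks instead of `query`. The sample names and median values are made up. It also shows that some samples can end up in neither bin.

```python
import pandas as pd

same_threshold = 25
opp_threshold = 25

# Toy median accessory expression per sample (made-up values)
toy = pd.DataFrame(
    {
        "median_acc_expression_pao1": [80.0, 2.0, 40.0, 1.0],
        "median_acc_expression_pa14": [3.0, 90.0, 35.0, 0.5],
    },
    index=["sample_a", "sample_b", "sample_c", "sample_d"],
)

# PAO1: high PAO1 accessory expression and low PA14 accessory expression
is_pao1 = (toy["median_acc_expression_pao1"] > same_threshold) & (
    toy["median_acc_expression_pa14"] < opp_threshold
)
# PA14: low PAO1 accessory expression and high PA14 accessory expression
is_pa14 = (toy["median_acc_expression_pao1"] < opp_threshold) & (
    toy["median_acc_expression_pa14"] > same_threshold
)

toy_pao1_ids = list(toy.index[is_pao1])
toy_pa14_ids = list(toy.index[is_pa14])
# Samples that are high in both or low in both are left unbinned
toy_unbinned_ids = list(toy.index[~(is_pao1 | is_pa14)])
```

In this toy table, `sample_c` (high in both) and `sample_d` (low in both) are excluded from both compendia, which is why the binned counts need not sum to the total number of samples.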

## Format SRA annotations

In [17]:
# Since experiments have multiple runs there are duplicated experiment ids in the index
# We will need to remove these so that the count calculations are accurate
sample_to_strain_table_full_processed = sample_to_strain_table_full[
    ~sample_to_strain_table_full.index.duplicated(keep="first")
]

assert (
    len(sample_to_strain_table_full.index.unique())
    == sample_to_strain_table_full_processed.shape[0]
)
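The de-duplication step above can be illustrated on a toy metadata table. The column names and ids here are hypothetical; only the `index.duplicated(keep="first")` pattern is taken from the cell above.

```python
import pandas as pd

# Toy metadata: experiment exp_1 has two runs, so its id appears twice
toy_metadata = pd.DataFrame(
    {"Run": ["run_1", "run_2", "run_3"], "PAO1": [True, True, False]},
    index=pd.Index(["exp_1", "exp_1", "exp_2"], name="Experiment"),
)

# Keep only the first row seen for each experiment id
toy_dedup = toy_metadata[~toy_metadata.index.duplicated(keep="first")]
```

Note that `keep="first"` silently discards the later runs, so this is only safe if all runs of an experiment carry the same strain annotation.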

In [18]:
# Aggregate boolean labels into a single strain label
# Labels are checked in order, so a sample flagged as both PAO1 and PA14 is labeled PAO1
aggregated_label = []
for exp_id in list(sample_to_strain_table_full_processed.index):
    if sample_to_strain_table_full_processed.loc[exp_id, "PAO1"].all():
        aggregated_label.append("PAO1")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PA14"].all():
        aggregated_label.append("PA14")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PAK"].all():
        aggregated_label.append("PAK")
    elif sample_to_strain_table_full_processed.loc[exp_id, "ClinicalIsolate"].all():
        aggregated_label.append("Clinical Isolate")
    else:
        aggregated_label.append("NA")

# Assign on an explicit copy to avoid SettingWithCopyWarning
sample_to_strain_table_full_processed = sample_to_strain_table_full_processed.copy()
sample_to_strain_table_full_processed["Strain type"] = aggregated_label

sample_to_strain_table = sample_to_strain_table_full_processed["Strain type"].to_frame()

sample_to_strain_table.head()


Unnamed: 0_level_0,Strain type
Experiment,Unnamed: 1_level_1
SRX5057740,
SRX5057739,
SRX5057910,
SRX5057909,
SRX3573046,PAO1
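The label aggregation above could also be written without an explicit loop. The sketch below is one possible alternative using `numpy.select`, not the method used in this notebook; the toy table and its ids are made up, and it assumes each flag column is strictly boolean.

```python
import numpy as np
import pandas as pd

# Toy boolean strain flags, one row per experiment (values are made up)
toy_flags = pd.DataFrame(
    {
        "PAO1": [True, False, False, False, True],
        "PA14": [False, True, False, False, True],
        "PAK": [False, False, False, False, False],
        "ClinicalIsolate": [False, False, True, False, False],
    },
    index=["exp_1", "exp_2", "exp_3", "exp_4", "exp_5"],
)

# np.select picks the first condition that is True, mirroring the if/elif order
conditions = [
    toy_flags["PAO1"].to_numpy(),
    toy_flags["PA14"].to_numpy(),
    toy_flags["PAK"].to_numpy(),
    toy_flags["ClinicalIsolate"].to_numpy(),
]
choices = ["PAO1", "PA14", "PAK", "Clinical Isolate"]
toy_flags["Strain type"] = np.select(conditions, choices, default="NA")
```

As in the loop, a row flagged as both PAO1 and PA14 (`exp_5`) is labeled PAO1 because that condition is checked first, and rows with no flag set get the string "NA".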


## Create compendia

Create PAO1 and PA14 compendia

In [19]:
# Get expression data
# Note: .loc is sufficient here because the binned ids were taken from the merged
# PAO1/PA14 accessory dataframe, so every id is present in both expression matrices
pao1_expression_binned = pao1_expression.loc[pao1_binned_ids]
pa14_expression_binned = pa14_expression.loc[pa14_binned_ids]

In [20]:
assert len(pao1_binned_ids) == pao1_expression_binned.shape[0]
assert len(pa14_binned_ids) == pa14_expression_binned.shape[0]
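For context on the choice between `.loc` and `reindex` when selecting by a list of ids, here is a toy sketch of how the two differ when an id is missing. The data are made up, and the `KeyError` behavior shown assumes pandas 1.0 or later.

```python
import pandas as pd

toy_expression = pd.DataFrame(
    {"gene_1": [1.0, 2.0]}, index=["sample_a", "sample_b"]
)
toy_ids = ["sample_a", "sample_missing"]

# .loc with a list raises KeyError if any label is absent (pandas >= 1.0)
try:
    toy_expression.loc[toy_ids]
    loc_raised = False
except KeyError:
    loc_raised = True

# reindex keeps the missing label and fills its row with NaN
toy_reindexed = toy_expression.reindex(toy_ids)
```

With `.loc`, a missing id fails loudly; with `reindex`, it yields an all-NaN row that would need to be dropped or handled downstream.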

In [21]:
# Label samples with SRA annotations
# pao1_expression_label = pao1_expression_binned.join(
#    sample_to_strain_table, how='left')
pao1_expression_label = pao1_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
pa14_expression_label = pa14_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
print(pao1_expression_label.shape)
pao1_expression_label.head()

(1081, 5564)


Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
ERX541572,5793.218939,766.512255,1608.330977,1663.46607,176.163343,384.600886,295.846835,453.183561,611.865046,43.032267,...,75.306467,447.804528,10.758067,65.893159,44.377025,56.47985,2033.274614,184.231893,1.344758,
ERX541573,4416.506898,797.782811,1770.117221,1562.763979,313.958581,324.501966,333.873864,415.87797,550.599003,38.659079,...,103.090877,698.206395,18.743796,138.235494,39.830566,96.061954,1315.580171,91.376005,1.171487,
ERX541574,3825.086116,644.433113,1852.251003,1589.338107,260.936107,270.820051,363.729119,363.729119,423.03278,67.210815,...,128.491265,468.49892,19.767887,73.141182,33.605408,55.350083,1759.341934,67.210815,3.953577,
ERX541575,3834.097653,789.216207,1926.825153,1610.427665,289.734779,261.294555,250.629471,520.811596,666.567742,53.325419,...,124.425979,611.464809,15.997626,177.751398,21.330168,108.428353,1486.001686,56.880447,1.777514,
ERX541576,3515.165133,853.775186,2185.27713,1683.341246,183.98936,245.319146,253.388855,380.890253,551.968079,66.171612,...,90.380738,745.641089,11.297592,130.729282,50.032194,95.222563,1273.400041,72.627379,1.613942,


In [22]:
print(pa14_expression_label.shape)
pa14_expression_label.head()

(576, 5892)


Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845,Strain type
ERX1477379,222.428844,13.264105,3.060947,236.713266,62.239264,21.426632,18.365684,49.995474,22.446948,3.060947,...,156.108317,996.848537,236.713266,3.060947,11.223474,400.984109,262.221161,154.067686,162.230212,PA14
ERX1477380,201.56765,20.87239,3.578124,294.598874,93.627577,29.221346,17.294266,85.874975,36.973948,4.770832,...,127.619755,731.129998,149.684853,1.192708,4.770832,199.778588,353.041565,63.809878,261.799404,PA14
ERX1477381,176.601967,34.43183,14.994507,358.202102,127.17563,37.20859,19.992675,111.625771,38.319295,2.77676,...,124.954222,750.280682,129.952391,1.666056,2.77676,183.821544,380.416186,49.426337,92.188448,PA14
ERX2174773,52.298674,25.694566,8.185879,7.276337,21.374241,11.369277,5.229867,10.004964,52.753445,6.594181,...,43.430638,1032.785125,490.925382,0.227386,40.474626,125.289433,519.57596,68.215662,129.382373,PA14
ERX2174774,61.883169,20.211842,8.483983,5.988694,17.966081,11.228801,3.243876,15.221263,55.395418,4.241991,...,32.438758,939.476338,486.082313,0.249529,45.66379,130.753148,674.975696,64.128929,146.722998,PA14


In [23]:
assert pao1_expression_binned.shape[0] == pao1_expression_label.shape[0]
assert pa14_expression_binned.shape[0] == pa14_expression_label.shape[0]
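The shape assertions above matter because `merge` defaults to an inner join, which drops samples that have no metadata row. A toy sketch of the difference between the default and `how="left"` (ids and values are made up):

```python
import pandas as pd

toy_binned = pd.DataFrame(
    {"gene_1": [1.0, 2.0, 3.0]}, index=["exp_1", "exp_2", "exp_3"]
)
# exp_3 has no entry in the metadata table
toy_labels = pd.DataFrame({"Strain type": ["PAO1", "NA"]}, index=["exp_1", "exp_2"])

# Default (inner) join: exp_3 is silently dropped
toy_inner = toy_binned.merge(toy_labels, left_index=True, right_index=True)

# Left join: exp_3 is kept with a missing strain label
toy_left = toy_binned.merge(
    toy_labels, left_index=True, right_index=True, how="left"
)
```

If the row counts before and after an inner merge match, as asserted above, every binned sample had a metadata entry.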

In [24]:
sample_to_strain_table["Strain type"].value_counts()

PAO1                861
NA                  795
Clinical Isolate    601
PA14                545
PAK                  65
Name: Strain type, dtype: int64

It looks like our binned compendium sizes are fairly close to the numbers of samples that SRA annotates as PAO1 and PA14.

## Quick comparison

Quick check comparing our binned labels with the SRA annotations.

In [25]:
pao1_expression_label["Strain type"].value_counts()

PAO1                783
NA                  226
Clinical Isolate     63
PA14                  8
PAK                   1
Name: Strain type, dtype: int64

**Manually check whether these PA14-labeled samples are mislabeled**
* Clinical isolates can be removed by increasing the threshold

In [26]:
pa14_expression_label["Strain type"].value_counts()

PA14                519
NA                   37
Clinical Isolate     18
PAO1                  2
Name: Strain type, dtype: int64

## Check

Manually look up the samples we binned as PAO1 but SRA labeled as PA14. Are these cases of samples being mislabeled?

In [27]:
pao1_expression_label[pao1_expression_label["Strain type"] == "PA14"]

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
SRX4326016,1410.997874,2680.200888,650.588182,6642.1161,190.449959,130.673695,90.35947,1127.408153,670.050222,93.139761,...,467.088951,4059.225411,232.15433,52.825536,271.078409,735.387069,656.148765,51.43539,1.390146,PA14
SRX5099522,2302.048276,1677.552304,3342.614689,5860.114076,297.416207,328.641005,211.54801,786.084305,417.631681,63.230217,...,117.873615,3722.776612,24.979839,79.623236,514.428557,491.009958,1513.622112,21.857359,0.78062,PA14
SRX5099523,2505.741703,1846.89679,3413.207338,7247.29404,369.379358,342.741424,255.724171,1141.879458,479.482821,74.586217,...,72.810354,4492.931615,17.758623,85.24139,1028.224271,333.862112,1601.827793,8.879311,1.775862,PA14
SRX5099524,2351.953511,1471.202631,3501.780607,5700.625961,290.299128,357.757672,256.190876,738.25418,402.477381,81.859806,...,143.254661,3566.965267,21.222913,102.324758,523.751169,428.248061,1513.648457,22.738835,0.757961,PA14
SRX5290921,1680.640803,1314.233728,762.60989,2537.469655,226.286567,118.37767,703.8237,994.533489,264.13521,27.379869,...,148.173411,3008.564466,39.459223,63.617932,224.675987,1098.415935,810.122016,21.742837,15.300515,PA14
SRX5290922,2020.81509,1114.993589,962.546135,2144.013905,218.921635,151.561132,752.487724,697.535735,266.783045,42.543476,...,179.037126,2902.705886,38.111864,69.133148,125.857782,820.73455,912.91208,20.385415,7.090579,PA14
SRX7423386,1218.366045,2143.587132,768.114005,4487.048685,111.635336,96.91834,467.925014,876.73047,461.132554,59.55981,...,240.000249,2967.990503,38.113247,34.402551,250.314725,232.704644,520.377899,21.761029,0.062893,PA14
SRX7423388,1370.514724,2555.663245,798.517917,4779.381171,109.37057,97.982501,264.857185,841.02586,503.104624,73.176804,...,233.624559,1582.490675,36.419272,27.850032,276.921774,312.326268,520.355858,27.06076,0.112753,PA14


Note: The 7 samples listed below are the PA14-labeled samples that were binned as PAO1 using a threshold of 0. The table above also includes SRX4326016.

Most samples appear to be mislabeled:
* SRX5099522: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099522
* SRX5099523: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099523
* SRX5099524: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099524
* SRX5290921: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5290921
* SRX5290922: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5290922

Two samples appear to be PA14 samples treated with antimicrobial manuka honey.
* SRX7423386: https://www.ncbi.nlm.nih.gov/sra/?term=SRX7423386
* SRX7423388: https://www.ncbi.nlm.nih.gov/sra/?term=SRX7423388

In [28]:
pa14_label_pao1_binned_ids = list(
    pao1_expression_label[pao1_expression_label["Strain type"] == "PA14"].index
)
pao1_pa14_acc_expression.loc[
    pa14_label_pao1_binned_ids,
    ["median_acc_expression_pao1", "median_acc_expression_pa14"],
]

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14
SRX4326016,56.3009,7.214669
SRX5099522,105.774005,0.862558
SRX5099523,108.3276,1.8401
SRX5099524,100.429855,1.640387
SRX5290921,61.202061,0.88374
SRX5290922,77.996372,0.977961
SRX7423386,61.352265,0.070385
SRX7423388,70.752611,0.125314


In [29]:
# Save compendia with SRA label
pao1_expression_label.to_csv(paths.PAO1_COMPENDIUM_LABEL, sep="\t")
pa14_expression_label.to_csv(paths.PA14_COMPENDIUM_LABEL, sep="\t")

# Save compendia without SRA label
pao1_expression_binned.to_csv(paths.PAO1_COMPENDIUM, sep="\t")
pa14_expression_binned.to_csv(paths.PA14_COMPENDIUM, sep="\t")

# Save processed metadata table
sample_to_strain_table.to_csv(paths.SAMPLE_TO_STRAIN_PROCESSED, sep="\t")
