# Create PAO1 and PA14 compendia

This notebook uses the observation from the [exploratory notebook](../0_explore_data/cluster_by_accessory_gene.ipynb) to bin samples into PAO1 or PA14 compendia.

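A sample is considered PAO1 if its median expression of PA14 accessory genes is 0 and its median expression of PAO1 accessory genes is > 0.
Similarly, a sample is considered PA14 if its median expression of PA14 accessory genes is > 0 and its median expression of PAO1 accessory genes is 0. In practice, the cutoffs are relaxed from 0 to the user-defined thresholds set below (`same_threshold` and `opp_threshold`).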
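
The rule above can be sketched for a single sample as follows. This is a minimal illustration of the strict cutoff of 0, not the notebook's implementation; the helper `bin_sample` and the toy expression values are hypothetical.

```python
from statistics import median


def bin_sample(pao1_acc_values, pa14_acc_values):
    """Hypothetical helper: label one sample from its accessory gene expression."""
    pao1_median = median(pao1_acc_values)
    pa14_median = median(pa14_acc_values)
    # PAO1: PAO1 accessory genes are expressed, PA14 accessory genes are not
    if pao1_median > 0 and pa14_median == 0:
        return "PAO1"
    # PA14: the mirror image of the PAO1 rule
    if pa14_median > 0 and pao1_median == 0:
        return "PA14"
    # Anything else (both expressed, or neither) is left unbinned
    return "NA"


pao1_like = bin_sample([10, 20, 30], [0, 0, 5])
pa14_like = bin_sample([0, 0, 3], [4, 8, 9])
ambiguous = bin_sample([5, 6, 7], [1, 2, 3])
print(pao1_like, pa14_like, ambiguous)
```

Using the median rather than the mean means that a few expressed accessory genes (for example from cross-mapping reads) do not change the label.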

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import pandas as pd
import seaborn as sns
from textwrap import fill
import matplotlib.pyplot as plt
from scripts import paths, utils



In [2]:
# User param
# same_threshold: a sample is binned as PAO1 (PA14) if its median expression of
# PAO1 (PA14) accessory genes is > same_threshold.
# Threshold of 25 is based on comparing expression of PAO1 SRA-labeled samples vs non-PAO1 samples
same_threshold = 25

# opp_threshold: a sample is binned as PAO1 (PA14) only if its median expression of
# PA14 (PAO1) accessory genes is also < opp_threshold.
# Threshold of 25 is based on a previous plot (eyeballed, trying to avoid samples
# on the diagonal of the explore_data/cluster_by_accessory_gene.ipynb plot)
opp_threshold = 25

## Load data

The expression data used here is described in the [paper](link TBD), with source code [here](https://github.com/hoganlab-dartmouth/pa-seq-compendia).

In [3]:
# Expression data files
pao1_expression_filename = paths.PAO1_GE
pa14_expression_filename = paths.PA14_GE

# File containing table to map sample id to strain name
sample_to_strain_filename = paths.SAMPLE_TO_STRAIN

In [4]:
# Load expression data
pao1_expression = pd.read_csv(pao1_expression_filename, sep="\t", index_col=0, header=0)
pa14_expression = pd.read_csv(pa14_expression_filename, sep="\t", index_col=0, header=0)

In [5]:
# Load metadata
# Set index to experiment id, which is what we will use to map to expression data
sample_to_strain_table_full = pd.read_csv(sample_to_strain_filename, index_col=2)

## Get core and accessory annotations

In [6]:
# Annotations are from BACTOME
# Gene ids from PAO1 are annotated with the homologous PA14 gene id and vice versa
pao1_annot_filename = paths.GENE_PAO1_ANNOT
pa14_annot_filename = paths.GENE_PA14_ANNOT

core_acc_dict = utils.get_my_core_acc_genes(
    pao1_annot_filename, pa14_annot_filename, pao1_expression, pa14_expression
)

Number of PAO1 core genes: 5366
Number of PA14 core genes: 5363
Number of PAO1 core genes in my dataset: 5361
Number of PA14 core genes in my dataset: 5361
Number of PAO1-specific genes: 202
Number of PA14-specific genes: 530


In [7]:
pao1_acc = core_acc_dict["acc_pao1"]
pa14_acc = core_acc_dict["acc_pa14"]

## Format expression data

Format the index to include only the experiment id. This will be used to map the expression data to SRA labels later.
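
As a small illustration of this formatting step, the sketch below strips everything after the first `.` from each index entry. The `.salmon` suffix and the toy values are assumptions made up for the example; the real index suffixes may differ.

```python
import pandas as pd

# Toy expression table; the ".salmon" suffix is a made-up stand-in for
# whatever follows the experiment id in the real sample names
toy_expression = pd.DataFrame(
    {"PA0001": [5589.9, 1541.5]},
    index=["ERX541571.salmon", "SRX4326016.salmon"],
)

# Keep only the part before the first "." (the experiment id)
toy_expression.index = toy_expression.index.str.split(".").str[0]

print(list(toy_expression.index))
```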

In [8]:
# Format expression data indices so that values can be mapped to `sample_to_strain_table`
pao1_index_processed = pao1_expression.index.str.split(".").str[0]
pa14_index_processed = pa14_expression.index.str.split(".").str[0]

print(
    f"No. of samples processed using PAO1 reference after filtering: {pao1_expression.shape}"
)
print(
    f"No. of samples processed using PA14 reference after filtering: {pa14_expression.shape}"
)
pao1_expression.index = pao1_index_processed
pa14_expression.index = pa14_index_processed

No. of samples processed using PAO1 reference after filtering: (2588, 5563)
No. of samples processed using PA14 reference after filtering: (2588, 5891)


In [9]:
pao1_expression.head()

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA1905,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1
ERX541571,5589.915138,897.177641,1373.180223,1801.831763,139.560966,505.908503,480.986902,662.914591,677.867551,77.256964,...,0.0,97.194244,468.526102,12.460801,87.225604,74.764803,77.256964,2275.342185,249.216012,0.0
ERX541572,6297.494504,831.96526,1747.27326,1807.221548,190.079936,416.713706,320.211585,491.283528,663.817624,45.326754,...,0.0,80.418435,485.434914,10.235073,70.183361,46.788907,59.948288,2209.313721,198.852856,0.0
ERX541573,4948.395849,892.785667,1982.509348,1750.12249,350.549666,362.365947,372.869308,464.773715,615.759526,42.013443,...,0.0,114.224049,781.187458,19.693801,153.611651,43.326363,106.346528,1473.09635,101.094848,0.0
ERX541574,4633.161907,778.582016,2242.316207,1923.69649,313.828444,325.806628,438.401566,438.401566,510.270675,79.05602,...,0.0,153.320766,565.370326,21.560733,86.242931,38.330192,64.682198,2129.721269,79.05602,2.395637
ERX541575,4228.807727,868.906226,2124.210932,1775.07931,317.749004,286.366386,274.597904,572.732772,733.568687,56.880994,...,0.0,135.337539,672.764866,15.691309,194.179947,21.57555,117.684816,1637.780358,60.803822,0.0


In [10]:
pa14_expression.head()

Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_19205,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845
ERX541571,211.451625,56.20866,0.0,2.676603,10.706411,10.706411,18.73622,72.268277,66.915071,5.353206,...,42.825646,141.859951,3522.409344,182.008993,21.412823,2.676603,1025.13889,390.784015,152.566362,0.0
ERX541572,221.780416,51.647494,6.076176,15.190439,21.266615,9.114264,12.152352,78.990285,82.028373,3.038088,...,60.761758,179.247186,2953.021435,221.780416,27.342791,15.190439,1193.968543,568.122437,118.485428,0.0
ERX541573,168.134943,44.835985,18.214619,23.819117,8.406747,18.214619,5.604498,64.451728,57.446106,8.406747,...,56.044981,208.767554,1820.060757,68.655102,4.203374,12.610121,1548.2426,619.29704,63.050604,0.0
ERX541574,203.805778,13.505202,6.138728,14.732948,4.910983,17.188439,8.59422,47.88208,58.931791,7.366474,...,67.526011,227.132945,2248.002282,77.347976,6.138728,12.277456,1706.566451,898.709814,165.745663,0.0
ERX541575,193.98079,46.386711,8.433947,29.518816,8.433947,12.650921,4.216974,44.278224,42.169737,6.325461,...,40.06125,250.909936,1575.039679,65.363092,4.216974,10.542434,1313.58731,710.56007,145.485593,0.0


In [11]:
# Save pre-binned expression data
pao1_expression.to_csv(paths.PAO1_PREBIN_COMPENDIUM, sep="\t")
pa14_expression.to_csv(paths.PA14_PREBIN_COMPENDIUM, sep="\t")

## Bin samples as PAO1 or PA14

In [12]:
# Create accessory df
# accessory gene ids | median accessory expression | strain label

# PAO1
# Use .copy() so that adding a column does not trigger SettingWithCopyWarning
pao1_acc_expression = pao1_expression[pao1_acc].copy()
pao1_acc_expression["median_acc_expression"] = pao1_acc_expression.median(axis=1)

# PA14
pa14_acc_expression = pa14_expression[pa14_acc].copy()
pa14_acc_expression["median_acc_expression"] = pa14_acc_expression.median(axis=1)

pao1_acc_expression.head()



Unnamed: 0,PA3066,PA4103,PA3067,PA1370,PA3144,PA0712,PA0203,PA2732,PA4551,PA4623,...,PA2459,PA1393,PA3158,PA0445,PA3498,PA3510,PA3514,PA2182,PA3504,median_acc_expression
ERX541571,239.247371,4.98432,57.319683,314.012175,281.614093,94.702084,24.921601,1565.076552,37.382402,22.429441,...,1577.537353,114.639365,2743.868287,1385.641024,107.162885,0.0,52.335362,0.0,42.366722,89.717764
ERX541572,197.390703,0.0,54.099674,397.705713,273.422678,51.175367,13.15938,1745.811107,54.099674,16.083687,...,1315.93802,106.737195,2435.947491,1250.141119,90.653508,0.0,29.243067,0.0,54.099674,87.729201
ERX541573,48.578044,0.0,39.387603,329.542945,374.182228,21.006722,17.067961,2151.87604,70.897685,14.442121,...,598.691565,86.652726,2109.862597,1052.961918,90.591487,5.25168,18.380881,5.25168,70.897685,91.247947
ERX541574,50.308376,0.0,23.95637,313.828444,318.619717,105.408027,14.373822,2541.770829,91.034205,21.560733,...,1001.376255,129.364397,2448.340987,1336.765431,103.01239,4.791274,35.934555,0.0,93.429842,104.210208
ERX541575,50.996754,1.961414,23.536963,349.131621,433.472406,31.382618,27.45979,2702.827944,76.495131,21.57555,...,696.301829,90.225026,1984.950566,1190.578057,72.572303,0.0,13.729895,3.922827,58.842408,82.379371


In [13]:
# Merge PAO1 and PA14 accessory dataframes
pao1_pa14_acc_expression = pao1_acc_expression.merge(
    pa14_acc_expression,
    left_index=True,
    right_index=True,
    suffixes=["_pao1", "_pa14"],
)

pao1_pa14_acc_expression.head()

Unnamed: 0,PA3066,PA4103,PA3067,PA1370,PA3144,PA0712,PA0203,PA2732,PA4551,PA4623,...,PA14_59590,PA14_39880,PA14_00410,PA14_63440,PA14_31250,PA14_46520,PA14_31260,PA14_41290,PA14_03265,median_acc_expression_pa14
ERX541571,239.247371,4.98432,57.319683,314.012175,281.614093,94.702084,24.921601,1565.076552,37.382402,22.429441,...,0.0,18.73622,171.302582,0.0,0.0,5.353206,0.0,32.119234,2.676603,0.0
ERX541572,197.390703,0.0,54.099674,397.705713,273.422678,51.175367,13.15938,1745.811107,54.099674,16.083687,...,0.0,18.228527,173.17101,0.0,0.0,3.038088,0.0,21.266615,0.0,3.038088
ERX541573,48.578044,0.0,39.387603,329.542945,374.182228,21.006722,17.067961,2151.87604,70.897685,14.442121,...,0.0,0.0,319.456392,0.0,0.0,16.813494,0.0,4.203374,0.0,1.401125
ERX541574,50.308376,0.0,23.95637,313.828444,318.619717,105.408027,14.373822,2541.770829,91.034205,21.560733,...,0.0,4.910983,244.321384,0.0,0.0,11.049711,0.0,7.366474,0.0,1.227746
ERX541575,50.996754,1.961414,23.536963,349.131621,433.472406,31.382618,27.45979,2702.827944,76.495131,21.57555,...,0.0,4.216974,324.706975,0.0,0.0,10.542434,2.108487,12.650921,0.0,2.108487
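
The sketch below shows, on made-up data, how `suffixes` disambiguates the `median_acc_expression` column that exists in both dataframes. The sample ids and values are hypothetical.

```python
import pandas as pd

# Both toy tables share the column name "median_acc_expression"
pao1_toy = pd.DataFrame(
    {"PA3066": [239.2, 0.0], "median_acc_expression": [89.7, 0.0]},
    index=["s1", "s2"],
)
pa14_toy = pd.DataFrame(
    {"PA14_59590": [0.0, 50.0], "median_acc_expression": [0.0, 40.0]},
    index=["s1", "s2"],
)

# Merge on the sample index; only the overlapping column name gets a suffix
merged_toy = pao1_toy.merge(
    pa14_toy,
    left_index=True,
    right_index=True,
    suffixes=("_pao1", "_pa14"),
)

print(merged_toy.columns.tolist())
```

Gene columns keep their original names because PAO1 and PA14 gene ids never collide; only the shared summary column is renamed.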


In [14]:
# Find PAO1 samples
pao1_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1>@same_threshold & median_acc_expression_pa14<@opp_threshold"
    ).index
)

In [15]:
# Find PA14 samples
pa14_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1<@opp_threshold & median_acc_expression_pa14>@same_threshold"
    ).index
)

In [16]:
# Check that there are no samples that are binned as both PAO1 and PA14
shared_pao1_pa14_binned_ids = list(set(pao1_binned_ids).intersection(pa14_binned_ids))

assert len(shared_pao1_pa14_binned_ids) == 0
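
To make the two queries and the overlap check concrete, here is a minimal sketch on four made-up samples. The sample ids and medians are hypothetical; the thresholds mirror the values set earlier in the notebook.

```python
import pandas as pd

same_threshold = 25
opp_threshold = 25

# Toy medians: s1 looks PAO1, s2 looks PA14, s3 is high in both, s4 is low in both
toy_medians = pd.DataFrame(
    {
        "median_acc_expression_pao1": [90.0, 2.0, 60.0, 5.0],
        "median_acc_expression_pa14": [1.0, 80.0, 70.0, 3.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# "@name" lets DataFrame.query read a Python variable from the calling scope
toy_pao1_ids = set(
    toy_medians.query(
        "median_acc_expression_pao1 > @same_threshold & median_acc_expression_pa14 < @opp_threshold"
    ).index
)
toy_pa14_ids = set(
    toy_medians.query(
        "median_acc_expression_pao1 < @opp_threshold & median_acc_expression_pa14 > @same_threshold"
    ).index
)
toy_unbinned_ids = set(toy_medians.index) - toy_pao1_ids - toy_pa14_ids

print(toy_pao1_ids, toy_pa14_ids, toy_unbinned_ids)
```

Samples that are high in both references (or low in both) fall into neither bin, which is why the two id lists cannot overlap as long as `opp_threshold` does not exceed `same_threshold`.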

## Format SRA annotations

In [17]:
# Since experiments have multiple runs there are duplicated experiment ids in the index
# We will need to remove these so that the count calculations are accurate
# Use .copy() so that adding a column later does not trigger SettingWithCopyWarning
sample_to_strain_table_full_processed = sample_to_strain_table_full[
    ~sample_to_strain_table_full.index.duplicated(keep="first")
].copy()

assert (
    len(sample_to_strain_table_full.index.unique())
    == sample_to_strain_table_full_processed.shape[0]
)
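
A small sketch of the de-duplication step, using a made-up metadata table in which one experiment has two runs. The `Run` column and the ids are hypothetical.

```python
import pandas as pd

# Toy metadata: experiment SRX1 has two runs, so its id appears twice in the index
toy_metadata = pd.DataFrame(
    {"Run": ["SRR1", "SRR2", "SRR3"], "PAO1": [True, True, False]},
    index=pd.Index(["SRX1", "SRX1", "SRX2"], name="Experiment"),
)

# duplicated(keep="first") marks every repeat after the first occurrence,
# so negating the mask keeps exactly one row per experiment id
toy_deduped = toy_metadata[~toy_metadata.index.duplicated(keep="first")].copy()

print(toy_deduped)
```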

In [18]:
# Aggregate boolean labels into a single strain label
aggregated_label = []
for exp_id in list(sample_to_strain_table_full_processed.index):
    if sample_to_strain_table_full_processed.loc[exp_id, "PAO1"].all() == True:
        aggregated_label.append("PAO1")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PA14"].all() == True:
        aggregated_label.append("PA14")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PAK"].all() == True:
        aggregated_label.append("PAK")
    elif (
        sample_to_strain_table_full_processed.loc[exp_id, "ClinicalIsolate"].all()
        == True
    ):
        aggregated_label.append("Clinical Isolate")
    else:
        aggregated_label.append("NA")

sample_to_strain_table_full_processed["Strain type"] = aggregated_label

sample_to_strain_table = sample_to_strain_table_full_processed["Strain type"].to_frame()

sample_to_strain_table.head()


Experiment,Strain type
SRX5057740,
SRX5057739,
SRX5057910,
SRX5057909,
SRX3573046,PAO1
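
The label aggregation above can also be written without an explicit loop. The sketch below is one possible vectorized alternative using `numpy.select`, not the notebook's implementation; the toy boolean table is made up, and conditions are evaluated in the same priority order as the `if`/`elif` chain.

```python
import numpy as np
import pandas as pd

# Toy boolean strain flags; e5 is flagged as both PAO1 and PA14
toy_flags = pd.DataFrame(
    {
        "PAO1": [True, False, False, False, True],
        "PA14": [False, True, False, False, True],
        "PAK": [False, False, False, False, False],
        "ClinicalIsolate": [False, False, True, False, False],
    },
    index=["e1", "e2", "e3", "e4", "e5"],
)

# np.select picks the first condition that is True for each row
conditions = [
    toy_flags["PAO1"].to_numpy(),
    toy_flags["PA14"].to_numpy(),
    toy_flags["PAK"].to_numpy(),
    toy_flags["ClinicalIsolate"].to_numpy(),
]
choices = ["PAO1", "PA14", "PAK", "Clinical Isolate"]
toy_flags["Strain type"] = np.select(conditions, choices, default="NA")

print(toy_flags["Strain type"].tolist())
```

Because the first matching condition wins, a sample flagged as both PAO1 and PA14 is labeled PAO1, matching the order of the original loop.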


## Save pre-binned data with median accessory expression
This dataset will be used for Georgia's manuscript, which describes how we generated these compendia.

In [19]:
# Select columns with median accessory expression
pao1_pa14_acc_expression_select = pao1_pa14_acc_expression[
    ["median_acc_expression_pao1", "median_acc_expression_pa14"]
]

pao1_pa14_acc_expression_select.head()

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14
ERX541571,89.717764,0.0
ERX541572,87.729201,3.038088
ERX541573,91.247947,1.401125
ERX541574,104.210208,1.227746
ERX541575,82.379371,2.108487


In [20]:
# Add SRA strain type
pao1_pa14_acc_expression_label = pao1_pa14_acc_expression_select.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
# Rename column
pao1_pa14_acc_expression_label = pao1_pa14_acc_expression_label.rename(
    {"Strain type": "SRA label"}, axis=1
)

pao1_pa14_acc_expression_label.head()

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14,SRA label
ERX541571,89.717764,0.0,
ERX541572,87.729201,3.038088,
ERX541573,91.247947,1.401125,
ERX541574,104.210208,1.227746,
ERX541575,82.379371,2.108487,


In [21]:
# Add our binned label
pao1_pa14_acc_expression_label["Our label"] = "NA"
pao1_pa14_acc_expression_label.loc[pao1_binned_ids, "Our label"] = "PAO1-like"
pao1_pa14_acc_expression_label.loc[pa14_binned_ids, "Our label"] = "PA14-like"

pao1_pa14_acc_expression_label.head()

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14,SRA label,Our label
ERX541571,89.717764,0.0,,PAO1-like
ERX541572,87.729201,3.038088,,PAO1-like
ERX541573,91.247947,1.401125,,PAO1-like
ERX541574,104.210208,1.227746,,PAO1-like
ERX541575,82.379371,2.108487,,PAO1-like


In [22]:
# Confirm dimensions
pao1_expression_prebin_filename = paths.PAO1_PREBIN_COMPENDIUM
pa14_expression_prebin_filename = paths.PA14_PREBIN_COMPENDIUM

pao1_expression_prebin = pd.read_csv(
    pao1_expression_prebin_filename, sep="\t", index_col=0, header=0
)
pa14_expression_prebin = pd.read_csv(
    pa14_expression_prebin_filename, sep="\t", index_col=0, header=0
)

In [23]:
# There are two pre-binned expression files because the same samples were mapped to 2 different references (PAO1 and PA14).
# This assertion makes sure that the number of samples is the same in both.
# It also checks that we retained the same number of samples after adding
# the median accessory gene expression and our labels.
assert (
    pao1_expression_prebin.shape[0]
    == pa14_expression_prebin.shape[0]
    == pao1_pa14_acc_expression_label.shape[0]
)

In [24]:
# Save
pao1_pa14_acc_expression_label.to_csv(
    "prebinned_compendia_acc_expression.tsv", sep="\t"
)

## Create compendia

Create PAO1 and PA14 compendia

In [25]:
# Get expression data
# Note: `.loc` is used here because the binned ids were selected from the merged
# accessory expression table, so every id should be present in the expression index
# (the assertions in the next cell check this)
pao1_expression_binned = pao1_expression.loc[pao1_binned_ids]
pa14_expression_binned = pa14_expression.loc[pa14_binned_ids]

In [26]:
assert len(pao1_binned_ids) == pao1_expression_binned.shape[0]
assert len(pa14_binned_ids) == pa14_expression_binned.shape[0]

In [27]:
# Label samples with SRA annotations
# pao1_expression_label = pao1_expression_binned.join(
#    sample_to_strain_table, how='left')
pao1_expression_label = pao1_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
pa14_expression_label = pa14_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
print(pao1_expression_label.shape)
pao1_expression_label.head()

(1007, 5564)


Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
ERX541571,5589.915138,897.177641,1373.180223,1801.831763,139.560966,505.908503,480.986902,662.914591,677.867551,77.256964,...,97.194244,468.526102,12.460801,87.225604,74.764803,77.256964,2275.342185,249.216012,0.0,
ERX541572,6297.494504,831.96526,1747.27326,1807.221548,190.079936,416.713706,320.211585,491.283528,663.817624,45.326754,...,80.418435,485.434914,10.235073,70.183361,46.788907,59.948288,2209.313721,198.852856,0.0,
ERX541573,4948.395849,892.785667,1982.509348,1750.12249,350.549666,362.365947,372.869308,464.773715,615.759526,42.013443,...,114.224049,781.187458,19.693801,153.611651,43.326363,106.346528,1473.09635,101.094848,0.0,
ERX541574,4633.161907,778.582016,2242.316207,1923.69649,313.828444,325.806628,438.401566,438.401566,510.270675,79.05602,...,153.320766,565.370326,21.560733,86.242931,38.330192,64.682198,2129.721269,79.05602,2.395637,
ERX541575,4228.807727,868.906226,2124.210932,1775.07931,317.749004,286.366386,274.597904,572.732772,733.568687,56.880994,...,135.337539,672.764866,15.691309,194.179947,21.57555,117.684816,1637.780358,60.803822,0.0,


In [28]:
print(pa14_expression_label.shape)
pa14_expression_label.head()

(568, 5892)


Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845,Strain type
ERX1477379,249.440338,13.793936,2.298989,265.533263,68.969679,22.989893,19.541409,55.175743,24.139388,2.298989,...,174.723186,1121.906772,265.533263,2.298989,11.494946,450.6019,294.270629,172.424196,181.620154,PA14
ERX1477380,238.012983,24.013179,3.53135,348.1911,110.178117,33.900959,19.775559,100.996607,43.082469,4.94389,...,150.435506,865.180726,176.567495,0.70627,4.94389,235.894173,417.405558,74.864618,309.346251,PA14
ERX1477381,211.186691,40.638448,17.321306,429.035423,151.894528,43.969469,23.317143,133.240814,45.301877,2.664816,...,149.229712,899.375498,155.225549,1.332408,2.664816,219.847344,455.683585,58.625958,109.923672,PA14
ERX2174773,56.466716,27.616909,8.630284,7.643966,22.931898,12.082398,5.42475,10.602921,56.959875,6.904227,...,46.850114,1119.717723,532.118662,0.0,43.64458,135.618751,563.187685,73.727285,140.057183,PA14
ERX2174774,67.905545,21.993699,9.072401,6.323188,19.519408,12.096534,3.299055,16.495274,60.757593,4.39874,...,35.464839,1034.803529,535.271645,0.0,50.035665,143.783806,743.387019,70.379836,161.378765,PA14


In [29]:
assert pao1_expression_binned.shape[0] == pao1_expression_label.shape[0]
assert pa14_expression_binned.shape[0] == pa14_expression_label.shape[0]
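
The assertions above matter because `merge` defaults to an inner join, which silently drops samples that have no metadata row. The sketch below contrasts the default with `how="left"` on made-up data; `SRX_missing` is a hypothetical id with no entry in the label table.

```python
import pandas as pd

toy_binned = pd.DataFrame(
    {"PA0001": [5589.9, 1541.5, 10.0]},
    index=["ERX541571", "SRX4326016", "SRX_missing"],
)
toy_labels = pd.DataFrame(
    {"Strain type": ["NA", "PA14"]},
    index=["ERX541571", "SRX4326016"],
)

# Default (inner) join: the unmatched sample disappears
toy_inner = toy_binned.merge(toy_labels, left_index=True, right_index=True)

# Left join: every binned sample is kept, with a missing label where unmatched
toy_left = toy_binned.merge(
    toy_labels, left_index=True, right_index=True, how="left"
)

print(toy_inner.shape[0], toy_left.shape[0])
```

If the row-count assertions ever fail, switching to a left join is one way to keep all binned samples while making the missing labels visible.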

In [30]:
sample_to_strain_table["Strain type"].value_counts()

PAO1                861
NA                  795
Clinical Isolate    601
PA14                545
PAK                  65
Name: Strain type, dtype: int64

It looks like our binned compendia (1007 PAO1 and 568 PA14 samples) are fairly close in size to the number of samples SRA annotates as PAO1 (861) and PA14 (545).

## Quick comparison

A quick check comparing our binned labels with the SRA annotations.

In [31]:
pao1_expression_label["Strain type"].value_counts()

PAO1                715
NA                  230
Clinical Isolate     54
PA14                  8
Name: Strain type, dtype: int64

**Manually check whether these PA14-labeled samples are mislabeled**
* Clinical isolates can be removed by increasing the threshold

In [32]:
pa14_expression_label["Strain type"].value_counts()

PA14                491
NA                   49
Clinical Isolate     26
PAO1                  2
Name: Strain type, dtype: int64

## Check

Manually look up the samples we binned as PAO1 but SRA labeled as PA14. Are these cases of samples being mislabeled?

In [33]:
pao1_expression_label[pao1_expression_label["Strain type"] == "PA14"]

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
SRX4326016,1541.519446,2929.495043,709.950278,7262.168043,206.752115,141.381961,97.295113,1231.391274,731.233584,100.335585,...,509.279107,4437.569294,252.359199,56.248737,294.925811,802.684682,716.031222,54.728501,0.0,PA14
SRX5099522,2432.731077,1772.559821,3532.741431,6194.056805,313.581346,346.589909,222.807799,830.165354,440.664313,66.017126,...,123.78211,3934.620683,25.581636,83.346621,542.990858,518.234436,1599.264867,22.28078,0.0,PA14
SRX5099523,2507.869257,1847.997275,3416.749533,7256.813168,368.17655,341.49709,254.344187,1141.880896,478.451652,72.923858,...,71.145227,4498.156986,16.007676,83.595642,1028.048532,332.603937,1602.546241,7.114523,0.0,PA14
SRX5099524,2520.930268,1576.597266,3753.764316,6111.346105,310.443379,382.771811,273.872824,790.73667,430.719872,86.956653,...,152.783653,3823.65471,21.942333,108.898986,560.748512,458.350958,1622.107291,23.567691,0.0,PA14
SRX5290921,1708.629184,1335.94161,774.862516,2580.144741,229.346199,119.587661,715.068685,1010.761464,267.843597,27.030088,...,149.894123,3059.314479,39.316491,63.889298,227.708012,1116.424534,823.189036,21.296433,14.743684,PA14
SRX5290922,2101.996519,1159.372367,1000.73112,2230.200782,226.893876,156.796581,782.13824,724.953604,276.699849,43.349643,...,185.388899,3019.717684,38.737979,71.019628,130.048929,853.157867,949.080482,20.291322,6.45633,PA14
SRX7423386,1670.977631,2939.9752,1053.429292,6154.176712,153.028461,132.843196,641.701646,1202.403448,632.38537,81.603678,...,329.088827,4070.695079,52.188398,47.098951,343.235764,319.082456,713.644001,29.760326,0.0,PA14
SRX7423388,1898.859526,3541.02774,1106.287996,6622.260879,151.390068,135.610504,366.835788,1165.18795,696.95675,101.239178,...,323.559164,2192.578129,50.307123,38.433392,383.552751,432.610007,720.860445,37.339759,0.0,PA14


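Note: 7 of these 8 samples have a median PA14 accessory expression of 0; these are the 7 PA14-labeled samples found using a threshold of 0. SRX4326016 is binned as PAO1 only under the current threshold of 25.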

Most samples appear to be mislabeled:
* SRX5099522: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099522
* SRX5099523: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099523
* SRX5099524: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099524
* SRX5290921: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5290921
* SRX5290922: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5290922

Two samples appear to be PA14 samples treated with antimicrobial manuka honey.
* SRX7423386: https://www.ncbi.nlm.nih.gov/sra/?term=SRX7423386
* SRX7423388: https://www.ncbi.nlm.nih.gov/sra/?term=SRX7423388

In [34]:
pa14_label_pao1_binned_ids = list(
    pao1_expression_label[pao1_expression_label["Strain type"] == "PA14"].index
)
pao1_pa14_acc_expression.loc[
    pa14_label_pao1_binned_ids,
    ["median_acc_expression_pao1", "median_acc_expression_pa14"],
]

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14
SRX4326016,60.049328,6.127075
SRX5099522,110.991292,0.0
SRX5099523,106.717841,0.0
SRX5099524,106.867289,0.0
SRX5290921,61.432018,0.0
SRX5290922,80.242956,0.0
SRX7423386,84.062139,0.0
SRX7423388,97.880162,0.0


In [35]:
# Save compendia with SRA label
pao1_expression_label.to_csv(paths.PAO1_COMPENDIUM_LABEL, sep="\t")
pa14_expression_label.to_csv(paths.PA14_COMPENDIUM_LABEL, sep="\t")

# Save compendia without SRA label
pao1_expression_binned.to_csv(paths.PAO1_COMPENDIUM, sep="\t")
pa14_expression_binned.to_csv(paths.PA14_COMPENDIUM, sep="\t")

# Save processed metadata table
sample_to_strain_table.to_csv(paths.SAMPLE_TO_STRAIN_PROCESSED, sep="\t")