# Create PAO1 and PA14 compendia

This notebook uses the observations from the [exploratory notebook](../0_explore_data/cluster_by_accessory_gene.ipynb) to bin samples into PAO1 or PA14 compendia.

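A sample is considered PAO1 if the median expression of its PA14 accessory genes is 0 and the median expression of its PAO1 accessory genes is > 0.
Similarly, a sample is considered PA14 if the median expression of its PA14 accessory genes is > 0 and the median expression of its PAO1 accessory genes is 0.

In practice, the cutoffs are relaxed from 0 to the `same_threshold` and `opp_threshold` parameters set below.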
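
To make the rule concrete, here is a minimal sketch on toy numbers. The helper name `bin_sample` and the toy values are assumptions for illustration only; the strict inequalities and the default cutoffs of 25 mirror the user parameters and query cells later in this notebook.

```python
from statistics import median


def bin_sample(pao1_acc, pa14_acc, same_threshold=25, opp_threshold=25):
    """Hypothetical helper: bin one sample from its accessory gene expression."""
    pao1_median = median(pao1_acc)
    pa14_median = median(pa14_acc)
    # PAO1-like: PAO1 accessory genes expressed, PA14 accessory genes (nearly) silent
    if pao1_median > same_threshold and pa14_median < opp_threshold:
        return "PAO1"
    # PA14-like: the mirror image
    if pa14_median > same_threshold and pao1_median < opp_threshold:
        return "PA14"
    # Anything else is left unbinned
    return None


toy_pao1_like = bin_sample([90.0, 120.0, 60.0], [0.0, 3.0, 0.0])
toy_pa14_like = bin_sample([0.0, 2.0, 1.0], [80.0, 150.0, 40.0])
toy_ambiguous = bin_sample([50.0, 60.0, 70.0], [40.0, 55.0, 65.0])
```

A sample with appreciable expression of both accessory gene sets, like the third toy sample, falls into neither bin.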

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import pandas as pd
import seaborn as sns
from textwrap import fill
import matplotlib.pyplot as plt
from scripts import paths, utils



In [2]:
# User params
# same_threshold: a sample is binned as PAO1 only if the median expression of its
# PAO1 accessory genes is > same_threshold (and likewise for PA14 accessory genes in PA14 samples)
# Threshold of 25 is based on comparing expression of PAO1 SRA-labeled samples vs non-PAO1 samples
same_threshold = 25

# opp_threshold: a sample is binned as PAO1 only if the median expression of its
# PA14 accessory genes is < opp_threshold (and likewise for PAO1 accessory genes in PA14 samples)
# Threshold of 25 is based on the previous plot (eyeballed to avoid samples
# on the diagonal of the explore_data/cluster_by_accessory_gene.ipynb plot)
opp_threshold = 25

## Load data

The expression data used here are described in the [paper](link TBD), with source code [here](https://github.com/hoganlab-dartmouth/pa-seq-compendia).

In [3]:
# Expression data files
pao1_expression_filename = paths.PAO1_GE
pa14_expression_filename = paths.PA14_GE

# File containing table to map sample id to strain name
sample_to_strain_filename = paths.SAMPLE_TO_STRAIN

In [4]:
# Load expression data
pao1_expression = pd.read_csv(pao1_expression_filename, sep="\t", index_col=0, header=0)
pa14_expression = pd.read_csv(pa14_expression_filename, sep="\t", index_col=0, header=0)

In [5]:
# Load metadata
# Set index to experiment id, which is what we will use to map to expression data
sample_to_strain_table_full = pd.read_csv(sample_to_strain_filename, index_col=2)

## Get core and accessory annotations

In [6]:
# Annotations are from BACTOME
# Gene ids from PAO1 are annotated with the homologous PA14 gene id and vice versa
pao1_annot_filename = paths.GENE_PAO1_ANNOT
pa14_annot_filename = paths.GENE_PA14_ANNOT

core_acc_dict = utils.get_my_core_acc_genes(
    pao1_annot_filename, pa14_annot_filename, pao1_expression, pa14_expression
)

Number of PAO1 core genes: 5366
Number of PA14 core genes: 5363
Number of PAO1 core genes in my dataset: 5361
Number of PA14 core genes in my dataset: 5361
Number of PAO1-specific genes: 202
Number of PA14-specific genes: 530


In [7]:
pao1_acc = core_acc_dict["acc_pao1"]
pa14_acc = core_acc_dict["acc_pa14"]

In [8]:
pao1_acc_df = pd.DataFrame(pao1_acc)
pa14_acc_df = pd.DataFrame(pa14_acc)

In [9]:
# Save to files (supplementary data tables)
pao1_acc_df.to_csv("pao1_acc_gene_ids.tsv", sep="\t")
pa14_acc_df.to_csv("pa14_acc_gene_ids.tsv", sep="\t")

## Format expression data

Format the index to include only the experiment id. The experiment id will be used later to map expression data to SRA labels.

In [10]:
# Format expression data indices so that values can be mapped to `sample_to_strain_table`
pao1_index_processed = pao1_expression.index.str.split(".").str[0]
pa14_index_processed = pa14_expression.index.str.split(".").str[0]

print(
    f"No. of samples processed using PAO1 reference after filtering: {pao1_expression.shape}"
)
print(
    f"No. of samples processed using PA14 reference after filtering: {pa14_expression.shape}"
)
pao1_expression.index = pao1_index_processed
pa14_expression.index = pa14_index_processed

No. of samples processed using PAO1 reference after filtering: (2333, 5563)
No. of samples processed using PA14 reference after filtering: (2333, 5891)
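
As a sketch of what the index formatting does, the toy example below trims everything after the first `.` in each sample id. The `.salmon` suffix and the toy values are assumptions for illustration; the real suffix in the expression files may differ. The duplicate check is an extra safeguard that is not part of the original cell.

```python
import pandas as pd

# Hypothetical sample ids; the ".salmon" suffix is an assumption for illustration
toy = pd.DataFrame(
    {"PA0001": [10.0, 20.0, 30.0]},
    index=["ERX541571.salmon", "ERX541572.salmon", "SRX3573046.salmon"],
)

# Keep only the text before the first "."
trimmed_index = toy.index.str.split(".").str[0]

# Trimming could in principle collapse two distinct ids into one, so check first
assert not trimmed_index.duplicated().any()

toy.index = trimmed_index
```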


In [11]:
pao1_expression.head()

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA1905,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1
ERX541571,5621.037929,902.172829,1380.825635,1811.863764,140.337996,508.725234,483.664878,666.605479,681.641693,77.687105,...,0.0,97.73539,471.134699,12.530178,87.711247,75.181069,77.687105,2288.010535,250.603564,0.0
ERX541572,6323.898054,835.453446,1754.599065,1814.798699,190.876886,418.460865,321.554138,493.343336,666.600816,45.516796,...,0.0,80.755606,487.470201,10.277986,70.477619,46.98508,60.199633,2218.576726,199.686588,0.0
ERX541573,4954.119979,893.81841,1984.802645,1752.14697,350.95517,362.785119,373.30063,465.311348,616.471815,42.062043,...,0.0,114.356179,782.091108,19.716583,153.789344,43.376482,106.469546,1474.800376,101.21179,0.0
ERX541574,4603.356163,773.573295,2227.89109,1911.321096,311.809544,323.710671,435.581271,435.581271,506.988037,78.547442,...,0.0,152.334434,561.733224,21.42203,85.688119,38.083608,64.266089,2116.020491,78.547442,2.380226
ERX541575,4260.451254,875.408119,2140.10608,1788.361959,320.126671,288.509222,276.652679,577.018444,739.057871,57.306626,...,0.0,136.350249,677.799063,15.808725,195.632966,21.736996,118.565434,1650.03562,61.258807,0.0


In [12]:
pa14_expression.head()

Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_19205,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845
ERX541571,209.353167,55.650842,0.0,2.65004,10.60016,10.60016,18.550281,71.551082,66.251002,5.30008,...,42.400641,140.452125,3487.45276,180.202726,21.200321,2.65004,1014.965355,386.905853,151.052285,0.0
ERX541572,223.627085,52.07754,6.126769,15.316924,21.443693,9.190154,12.253539,79.648003,82.711388,3.063385,...,61.267694,180.739699,2977.609951,223.627085,27.570463,15.316924,1203.910196,572.852943,119.472004,0.0
ERX541573,167.343061,44.624816,18.128832,23.706934,8.367153,18.128832,5.578102,64.148173,57.175546,8.367153,...,55.78102,207.784301,1811.488635,68.33175,4.183577,12.55073,1540.950687,616.380275,62.753648,0.0
ERX541574,203.038931,13.454387,6.11563,14.677513,4.892504,17.123765,8.561883,47.701918,58.710052,7.338757,...,67.271935,226.278327,2239.543876,77.056944,6.11563,12.231261,1700.145269,895.3283,165.122023,0.0
ERX541575,193.971582,46.384509,8.433547,29.517415,8.433547,12.650321,4.216774,44.276122,42.167735,6.32516,...,40.059348,250.898025,1574.96491,65.35999,4.216774,10.541934,1313.524952,710.526338,145.478686,0.0


In [13]:
# Save pre-binned expression data
pao1_expression.to_csv(paths.PAO1_PREBIN_COMPENDIUM, sep="\t")
pa14_expression.to_csv(paths.PA14_PREBIN_COMPENDIUM, sep="\t")

## Bin samples as PAO1 or PA14

In [14]:
# Create accessory df
# accessory gene ids | median accessory expression | strain label

# PAO1
# Copy the column subset so that adding a column does not modify a slice of the original
pao1_acc_expression = pao1_expression[pao1_acc].copy()
pao1_acc_expression["median_acc_expression"] = pao1_acc_expression.median(axis=1)

# PA14
pa14_acc_expression = pa14_expression[pa14_acc].copy()
pa14_acc_expression["median_acc_expression"] = pa14_acc_expression.median(axis=1)

pao1_acc_expression.head()


Unnamed: 0,PA3511,PA3149,PA0645,PA0457.1,PA3065,PA1152,PA1938,PA1555.1,PA1427,PA3150,...,PA3843,PA2335,PA0257,PA2224,PA5149,PA1471,PA2229,PA3157,PA2218,median_acc_expression
ERX541571,10.024143,2972.158263,72.675033,87.711247,85.205212,167.904388,122.795746,77.687105,17.542249,5382.964544,...,65.156927,17.542249,170.410423,177.92853,110.265568,15.036214,87.711247,3004.736726,30.072428,90.217283
ERX541572,4.404851,3128.912643,70.477619,70.477619,51.389931,123.335834,67.541052,60.199633,7.341419,6259.29357,...,46.98508,19.087689,246.671668,118.930983,132.145536,7.341419,74.882471,2870.494705,32.302242,88.097024
ERX541573,3.943317,2844.445644,198.480264,153.789344,57.835309,85.438524,160.361538,106.469546,24.974338,1815.240035,...,27.603216,24.974338,350.95517,76.237453,132.758323,5.257755,40.747604,3416.226539,46.005359,91.353499
ERX541574,9.520902,3156.179044,278.486386,85.688119,73.786991,80.927668,121.391502,64.266089,16.661579,2896.734462,...,40.463834,14.281353,316.569995,142.813531,88.068344,11.901128,30.942932,4715.226762,42.844059,103.53981
ERX541575,7.904362,3440.37367,193.656875,195.632966,31.617449,69.16317,106.70889,118.565434,13.832634,2841.618229,...,35.56963,17.784815,278.628769,86.947985,122.517615,13.832634,25.689177,3649.83927,41.497902,82.995804


In [15]:
# Merge PAO1 and PA14 accessory dataframes
pao1_pa14_acc_expression = pao1_acc_expression.merge(
    pa14_acc_expression,
    left_index=True,
    right_index=True,
    suffixes=["_pao1", "_pa14"],
)

pao1_pa14_acc_expression.head()

Unnamed: 0,PA3511,PA3149,PA0645,PA0457.1,PA3065,PA1152,PA1938,PA1555.1,PA1427,PA3150,...,PA14_54070,PA14_27630,PA14_69520,PA14_22250,PA14_28770,PA14_28850,PA14_59940,PA14_36790,PA14_49030,median_acc_expression_pa14
ERX541571,10.024143,2972.158263,72.675033,87.711247,85.205212,167.904388,122.795746,77.687105,17.542249,5382.964544,...,21.200321,0.0,7.95012,5.30008,2.65004,63.600962,7.95012,0.0,0.0,0.0
ERX541572,4.404851,3128.912643,70.477619,70.477619,51.389931,123.335834,67.541052,60.199633,7.341419,6259.29357,...,3.063385,0.0,15.316924,0.0,3.063385,49.014156,6.126769,0.0,0.0,3.063385
ERX541573,3.943317,2844.445644,198.480264,153.789344,57.835309,85.438524,160.361538,106.469546,24.974338,1815.240035,...,29.285036,0.0,6.972628,11.156204,0.0,111.562041,5.578102,2.789051,0.0,1.394526
ERX541574,9.520902,3156.179044,278.486386,85.688119,73.786991,80.927668,121.391502,64.266089,16.661579,2896.734462,...,28.1319,0.0,3.669378,2.446252,0.0,66.048809,6.11563,0.0,0.0,1.223126
ERX541575,7.904362,3440.37367,193.656875,195.632966,31.617449,69.16317,106.70889,118.565434,13.832634,2841.618229,...,12.650321,0.0,0.0,2.108387,2.108387,86.443857,6.32516,0.0,0.0,2.108387
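
The merged table ends with `median_acc_expression_pa14` because `merge` applies suffixes only to column names that appear in both inputs. A minimal sketch with toy values (assumed for illustration):

```python
import pandas as pd

left = pd.DataFrame(
    {"PA3511": [10.0, 4.4], "median_acc_expression": [90.2, 88.1]},
    index=["ERX541571", "ERX541572"],
)
right = pd.DataFrame(
    {"PA14_54070": [21.2, 3.1], "median_acc_expression": [0.0, 3.1]},
    index=["ERX541571", "ERX541572"],
)

# Only the shared column name receives the suffixes; gene columns are unchanged
merged = left.merge(
    right, left_index=True, right_index=True, suffixes=("_pao1", "_pa14")
)
merged_columns = list(merged.columns)
```

Because PAO1 and PA14 gene ids never collide, the median columns are the only ones renamed.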


In [16]:
# Find PAO1 samples
pao1_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1>@same_threshold & median_acc_expression_pa14<@opp_threshold"
    ).index
)

In [17]:
# Find PA14 samples
pa14_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1<@opp_threshold & median_acc_expression_pa14>@same_threshold"
    ).index
)

In [18]:
# Check that there are no samples that are binned as both PAO1 and PA14
shared_pao1_pa14_binned_ids = list(set(pao1_binned_ids).intersection(pa14_binned_ids))

if same_threshold == 25:
    assert len(shared_pao1_pa14_binned_ids) == 0
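
The two queries cannot select the same sample whenever `same_threshold >= opp_threshold`: the PAO1 bin requires the PAO1 median to exceed `same_threshold`, while the PA14 bin requires it to be below `opp_threshold`. The sketch below, on toy values assumed for illustration, shows the equivalent boolean masks and the samples that end up in neither bin.

```python
import pandas as pd

same_threshold = 25
opp_threshold = 25

toy = pd.DataFrame(
    {
        "median_acc_expression_pao1": [90.2, 1.0, 60.0, 5.0],
        "median_acc_expression_pa14": [0.0, 80.0, 55.0, 3.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Boolean-mask equivalents of the two query strings
pao1_mask = (toy["median_acc_expression_pao1"] > same_threshold) & (
    toy["median_acc_expression_pa14"] < opp_threshold
)
pa14_mask = (toy["median_acc_expression_pao1"] < opp_threshold) & (
    toy["median_acc_expression_pa14"] > same_threshold
)

pao1_ids = list(toy.index[pao1_mask])
pa14_ids = list(toy.index[pa14_mask])
unbinned_ids = list(toy.index[~(pao1_mask | pa14_mask)])

# Same selection written with query and @-prefixed Python variables
pao1_ids_via_query = list(
    toy.query(
        "median_acc_expression_pao1>@same_threshold & median_acc_expression_pa14<@opp_threshold"
    ).index
)
```

Samples with both medians high (`s3`) or both low (`s4`) are excluded from both compendia.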

## Format SRA annotations

In [19]:
# Since experiments have multiple runs there are duplicated experiment ids in the index
# We will need to remove these so that the count calculations are accurate
sample_to_strain_table_full_processed = sample_to_strain_table_full[
    ~sample_to_strain_table_full.index.duplicated(keep="first")
]

assert (
    len(sample_to_strain_table_full.index.unique())
    == sample_to_strain_table_full_processed.shape[0]
)
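
A small sketch of the de-duplication step, using a toy metadata table (the run ids and values are assumptions for illustration). Note that `keep="first"` retains only the first run's row for each experiment, so labels on later runs are discarded.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Run": ["SRR1", "SRR2", "SRR3"], "PAO1": [True, True, False]},
    index=pd.Index(["SRX1", "SRX1", "SRX2"], name="Experiment"),
)

# duplicated() flags every repeat of an index value after its first occurrence
deduplicated = toy[~toy.index.duplicated(keep="first")]
```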

In [20]:
# Aggregate boolean labels into a single strain label
# Labels are checked in priority order: PAO1, PA14, PAK, then clinical isolate
aggregated_label = []
for exp_id in list(sample_to_strain_table_full_processed.index):
    if sample_to_strain_table_full_processed.loc[exp_id, "PAO1"].all():
        aggregated_label.append("PAO1")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PA14"].all():
        aggregated_label.append("PA14")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PAK"].all():
        aggregated_label.append("PAK")
    elif sample_to_strain_table_full_processed.loc[exp_id, "ClinicalIsolate"].all():
        aggregated_label.append("Clinical Isolate")
    else:
        aggregated_label.append("NA")

# Work on a copy so that adding a column does not trigger pandas' SettingWithCopyWarning
sample_to_strain_table_full_processed = sample_to_strain_table_full_processed.copy()
sample_to_strain_table_full_processed["Strain type"] = aggregated_label

sample_to_strain_table = sample_to_strain_table_full_processed["Strain type"].to_frame()

sample_to_strain_table.head()


Unnamed: 0_level_0,Strain type
Experiment,Unnamed: 1_level_1
SRX5057740,
SRX5057739,
SRX5057910,
SRX5057909,
SRX3573046,PAO1
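
The label aggregation could also be written without a Python loop. The sketch below is one possible alternative using `numpy.select`, not the method used in this notebook; it assumes the four strain columns hold booleans, and the toy table is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "PAO1": [True, False, False, False, True],
        "PA14": [False, True, False, False, True],
        "PAK": [False, False, False, False, False],
        "ClinicalIsolate": [False, False, True, False, False],
    },
    index=["e1", "e2", "e3", "e4", "e5"],
)

# Conditions are evaluated in order; the first True condition wins,
# which matches the if/elif priority of the loop
columns = ["PAO1", "PA14", "PAK", "ClinicalIsolate"]
conditions = [toy[column].to_numpy(dtype=bool) for column in columns]
choices = ["PAO1", "PA14", "PAK", "Clinical Isolate"]

labels = pd.Series(
    np.select(conditions, choices, default="NA"),
    index=toy.index,
    name="Strain type",
)
```

In the toy table, `e5` is flagged as both PAO1 and PA14 and is labeled PAO1, the same outcome the if/elif chain would give.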


In [26]:
test = pao1_expression.merge(sample_to_strain_table, left_index=True, right_index=True)

print(test.shape)
test.head()

(2333, 5564)


Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
ERX541571,5621.037929,902.172829,1380.825635,1811.863764,140.337996,508.725234,483.664878,666.605479,681.641693,77.687105,...,97.73539,471.134699,12.530178,87.711247,75.181069,77.687105,2288.010535,250.603564,0.0,
ERX541572,6323.898054,835.453446,1754.599065,1814.798699,190.876886,418.460865,321.554138,493.343336,666.600816,45.516796,...,80.755606,487.470201,10.277986,70.477619,46.98508,60.199633,2218.576726,199.686588,0.0,
ERX541573,4954.119979,893.81841,1984.802645,1752.14697,350.95517,362.785119,373.30063,465.311348,616.471815,42.062043,...,114.356179,782.091108,19.716583,153.789344,43.376482,106.469546,1474.800376,101.21179,0.0,
ERX541574,4603.356163,773.573295,2227.89109,1911.321096,311.809544,323.710671,435.581271,435.581271,506.988037,78.547442,...,152.334434,561.733224,21.42203,85.688119,38.083608,64.266089,2116.020491,78.547442,2.380226,
ERX541575,4260.451254,875.408119,2140.10608,1788.361959,320.126671,288.509222,276.652679,577.018444,739.057871,57.306626,...,136.350249,677.799063,15.808725,195.632966,21.736996,118.565434,1650.03562,61.258807,0.0,


In [27]:
# SRA labels for the 2,333-sample compendium
test["Strain type"].value_counts()

NA                  669
PAO1                646
Clinical Isolate    523
PA14                441
PAK                  54
Name: Strain type, dtype: int64

## Save pre-binned data with median accessory expression
This dataset will be used for Georgia's manuscript, which describes how we generated these compendia.

In [19]:
# Select columns with median accessory expression
pao1_pa14_acc_expression_select = pao1_pa14_acc_expression[
    ["median_acc_expression_pao1", "median_acc_expression_pa14"]
]

pao1_pa14_acc_expression_select.head()

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14
ERX541571,90.217283,0.0
ERX541572,88.097024,3.063385
ERX541573,91.353499,1.394526
ERX541574,103.53981,1.223126
ERX541575,82.995804,2.108387


In [20]:
# Add SRA strain type
pao1_pa14_acc_expression_label = pao1_pa14_acc_expression_select.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
# Rename column
pao1_pa14_acc_expression_label = pao1_pa14_acc_expression_label.rename(
    {"Strain type": "SRA label"}, axis=1
)

pao1_pa14_acc_expression_label.head()

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14,SRA label
ERX541571,90.217283,0.0,
ERX541572,88.097024,3.063385,
ERX541573,91.353499,1.394526,
ERX541574,103.53981,1.223126,
ERX541575,82.995804,2.108387,


In [21]:
# Add our binned label
pao1_pa14_acc_expression_label["Our label"] = "NA"
pao1_pa14_acc_expression_label.loc[pao1_binned_ids, "Our label"] = "PAO1-like"
pao1_pa14_acc_expression_label.loc[pa14_binned_ids, "Our label"] = "PA14-like"

pao1_pa14_acc_expression_label.head()

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14,SRA label,Our label
ERX541571,90.217283,0.0,,PAO1-like
ERX541572,88.097024,3.063385,,PAO1-like
ERX541573,91.353499,1.394526,,PAO1-like
ERX541574,103.53981,1.223126,,PAO1-like
ERX541575,82.995804,2.108387,,PAO1-like


In [22]:
# Confirm dimensions
pao1_expression_prebin_filename = paths.PAO1_PREBIN_COMPENDIUM
pa14_expression_prebin_filename = paths.PA14_PREBIN_COMPENDIUM

pao1_expression_prebin = pd.read_csv(
    pao1_expression_prebin_filename, sep="\t", index_col=0, header=0
)
pa14_expression_prebin = pd.read_csv(
    pa14_expression_prebin_filename, sep="\t", index_col=0, header=0
)

In [23]:
# There are two expression prebins because the same samples were mapped to 2 different references (PAO1 and PA14).
# This assertion is to make sure that the number of samples is the same in both, which it is.
# This assertion is also testing that when we added information about our accessory gene expression
# and labels we retained the same number of samples, which we did.
assert (
    pao1_expression_prebin.shape[0]
    == pa14_expression_prebin.shape[0]
    == pao1_pa14_acc_expression_label.shape[0]
)

In [24]:
# Save
pao1_pa14_acc_expression_label.to_csv(
    "prebinned_compendia_acc_expression.tsv", sep="\t"
)

## Create compendia

Create PAO1 and PA14 compendia

In [25]:
# Get expression data
# Note: .loc is safe here because the binned ids were taken from the index of the merged
# expression data, so every id is present in both expression matrices
pao1_expression_binned = pao1_expression.loc[pao1_binned_ids]
pa14_expression_binned = pa14_expression.loc[pa14_binned_ids]

In [26]:
assert len(pao1_binned_ids) == pao1_expression_binned.shape[0]
assert len(pa14_binned_ids) == pa14_expression_binned.shape[0]

In [27]:
# Label samples with SRA annotations
# pao1_expression_label = pao1_expression_binned.join(
#    sample_to_strain_table, how='left')
pao1_expression_label = pao1_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
pa14_expression_label = pa14_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
print(pao1_expression_label.shape)
pao1_expression_label.head()

(890, 5564)


Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
ERX541571,5621.037929,902.172829,1380.825635,1811.863764,140.337996,508.725234,483.664878,666.605479,681.641693,77.687105,...,97.73539,471.134699,12.530178,87.711247,75.181069,77.687105,2288.010535,250.603564,0.0,
ERX541572,6323.898054,835.453446,1754.599065,1814.798699,190.876886,418.460865,321.554138,493.343336,666.600816,45.516796,...,80.755606,487.470201,10.277986,70.477619,46.98508,60.199633,2218.576726,199.686588,0.0,
ERX541573,4954.119979,893.81841,1984.802645,1752.14697,350.95517,362.785119,373.30063,465.311348,616.471815,42.062043,...,114.356179,782.091108,19.716583,153.789344,43.376482,106.469546,1474.800376,101.21179,0.0,
ERX541574,4603.356163,773.573295,2227.89109,1911.321096,311.809544,323.710671,435.581271,435.581271,506.988037,78.547442,...,152.334434,561.733224,21.42203,85.688119,38.083608,64.266089,2116.020491,78.547442,2.380226,
ERX541575,4260.451254,875.408119,2140.10608,1788.361959,320.126671,288.509222,276.652679,577.018444,739.057871,57.306626,...,136.350249,677.799063,15.808725,195.632966,21.736996,118.565434,1650.03562,61.258807,0.0,


In [28]:
print(pa14_expression_label.shape)
pa14_expression_label.head()

(505, 5892)


Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845,Strain type
ERX1477379,248.707957,13.753435,2.292239,264.753632,68.767177,22.922392,19.484033,55.013742,24.068512,2.292239,...,174.210182,1118.612747,264.753632,2.292239,11.461196,449.27889,293.406622,171.917943,181.0869,PA14
ERX2174773,56.232678,27.502445,8.594514,7.612284,22.836852,12.03232,5.402266,10.558975,56.723793,6.875611,...,46.655934,1115.076822,529.913187,0.0,43.463686,135.056651,560.853438,73.421707,139.476687,PA14
ERX2174774,67.495419,21.860864,9.017607,6.284999,19.401517,12.023475,3.27913,16.395648,60.390638,4.372173,...,35.250644,1028.553671,532.038788,0.0,49.733467,142.915401,738.897218,69.954766,160.404093,PA14
ERX2174775,58.088081,24.116927,7.001688,3.889827,18.411847,9.335584,3.889827,16.596595,46.159279,2.85254,...,54.716898,1157.612474,485.450392,0.518644,35.267764,168.04052,817.382285,56.53215,67.164344,PA14
ERX2174776,50.207404,23.036339,9.746143,4.430065,21.264312,10.632156,4.134727,12.994858,54.342132,2.362701,...,49.321391,1087.433313,650.924232,0.295338,50.79808,132.01594,611.348984,64.67895,124.041823,PA14


In [29]:
assert pao1_expression_binned.shape[0] == pao1_expression_label.shape[0]
assert pa14_expression_binned.shape[0] == pa14_expression_label.shape[0]
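
The assertions above matter because `merge` defaults to an inner join, which silently drops any sample that has no row in the metadata table. The sketch below, on toy data assumed for illustration, shows how a left join with `indicator=True` can list such samples instead of dropping them.

```python
import pandas as pd

expression = pd.DataFrame(
    {"PA0001": [1.0, 2.0, 3.0]}, index=["ERX1", "SRX2", "SRX3"]
)
labels = pd.DataFrame({"Strain type": ["PAO1", "PA14"]}, index=["ERX1", "SRX2"])

# Default inner join: SRX3 has no label, so it disappears
inner = expression.merge(labels, left_index=True, right_index=True)

# Left join keeps every expression row; "_merge" records where each row came from
left = expression.merge(
    labels, left_index=True, right_index=True, how="left", indicator=True
)
missing_ids = list(left.index[left["_merge"] == "left_only"])
```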

In [30]:
sample_to_strain_table["Strain type"].value_counts()

PAO1                861
NA                  795
Clinical Isolate    601
PA14                545
PAK                  65
Name: Strain type, dtype: int64

Our binned compendium sizes (890 PAO1 and 505 PA14 samples) are fairly close to the number of samples that SRA annotates as PAO1 (861) and PA14 (545).

## Quick comparison

Quick check comparing our binned labels with the SRA annotations.

In [31]:
pao1_expression_label["Strain type"].value_counts()

PAO1                639
NA                  202
Clinical Isolate     42
PA14                  7
Name: Strain type, dtype: int64

**Manually check whether these PA14-labeled samples are mislabeled**
* Clinical isolates can be removed by increasing the threshold

In [32]:
pa14_expression_label["Strain type"].value_counts()

PA14                434
NA                   46
Clinical Isolate     23
PAO1                  2
Name: Strain type, dtype: int64

In [None]:
# Fraction of PAO1 SRA-labeled samples that are binned into the PAO1 compendium based on expression
# (639 of the 646 PAO1-labeled samples in the 2,333-sample compendium), and similarly for PA14 (434 of 441)
print(639 / 646)
print(434 / 441)
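
The hard-coded counts above could instead be derived from the label table. This is a sketch on toy labels (made up for illustration), using `pd.crosstab` to tabulate SRA labels against our binned labels; the column names follow `pao1_pa14_acc_expression_label`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "SRA label": ["PAO1", "PAO1", "PAO1", "PAO1", "PA14", "PA14", "NA"],
        "Our label": [
            "PAO1-like",
            "PAO1-like",
            "PAO1-like",
            "NA",
            "PA14-like",
            "PAO1-like",
            "PAO1-like",
        ],
    }
)

# Rows are SRA labels, columns are our binned labels
counts = pd.crosstab(toy["SRA label"], toy["Our label"])

# Fraction of each SRA-labeled strain that lands in the matching bin
pao1_fraction = counts.loc["PAO1", "PAO1-like"] / counts.loc["PAO1"].sum()
pa14_fraction = counts.loc["PA14", "PA14-like"] / counts.loc["PA14"].sum()
```

The full table also exposes the off-diagonal cells, such as PA14-labeled samples binned as PAO1-like, which are the cases inspected manually.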

## Check

Manually look up the samples we binned as PAO1 but SRA labeled as PA14. Are these cases of samples being mislabeled?

In [33]:
pao1_expression_label[pao1_expression_label["Strain type"] == "PA14"]

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
SRX4326016,1542.681036,2931.70252,710.48525,7267.640342,206.90791,141.488497,97.368428,1232.31917,731.784594,100.411192,...,509.662867,4440.913158,252.549361,56.291123,295.148048,803.289533,716.570777,54.769741,0.0,PA14
SRX5099522,2448.870829,1784.319722,3556.179111,6235.150761,315.661776,348.889331,224.285999,835.673017,443.587864,66.455111,...,124.603333,3960.724597,25.751355,83.899577,546.593285,521.672619,1609.875057,22.4286,0.0,PA14
SRX5099523,2517.289506,1854.938863,3429.583789,7284.071763,369.559523,342.779848,255.299574,1146.170116,480.248849,73.19778,...,71.412468,4515.053306,16.067805,83.90965,1031.910166,333.853289,1608.565848,7.141247,0.0,PA14
SRX5099524,2541.397963,1589.397824,3784.241519,6160.96476,312.963901,385.879575,276.096426,797.156744,434.216931,87.662663,...,154.024119,3854.699361,22.120485,109.783149,565.301288,462.072357,1635.277348,23.75904,0.0,PA14
SRX5290921,1682.10007,1315.199048,762.831575,2540.083998,225.785244,117.730877,703.966137,995.067827,263.68491,26.610404,...,147.566785,3011.813883,38.706042,62.897318,224.172493,1099.090314,810.407752,20.965773,14.514766,PA14
SRX7423386,1672.517997,2942.685374,1054.400381,6159.849851,153.169528,132.965656,642.29319,1203.511866,632.968326,81.678903,...,329.392192,4074.447591,52.236508,47.142369,343.552171,319.376598,714.301863,29.787761,0.0,PA14
SRX7423388,1898.810933,3540.937123,1106.259685,6622.091411,151.386193,135.607034,366.8264,1165.158132,696.938915,101.236588,...,323.550884,2192.52202,50.305835,38.432408,383.542936,432.598936,720.841998,37.338803,0.0,PA14


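Note: The list below covers the 7 PA14-labeled samples found using a threshold of 0. It includes SRX5290922, which is not in the table above, and omits SRX4326016, which is.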

Most samples appear to be mislabeled:
* SRX5099522: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099522
* SRX5099523: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099523
* SRX5099524: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099524
* SRX5290921: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5290921
* SRX5290922: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5290922

Two samples appear to be PA14 samples treated with antimicrobial manuka honey.
* SRX7423386: https://www.ncbi.nlm.nih.gov/sra/?term=SRX7423386
* SRX7423388: https://www.ncbi.nlm.nih.gov/sra/?term=SRX7423388

In [34]:
pa14_label_pao1_binned_ids = list(
    pao1_expression_label[pao1_expression_label["Strain type"] == "PA14"].index
)
pao1_pa14_acc_expression.loc[
    pa14_label_pao1_binned_ids,
    ["median_acc_expression_pao1", "median_acc_expression_pa14"],
]

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14
SRX4326016,60.094577,6.158171
SRX5099522,111.727655,0.0
SRX5099523,107.118702,0.0
SRX5099524,107.734956,0.0
SRX5290921,60.47819,0.0
SRX7423386,84.139631,0.0
SRX7423388,97.877658,0.0


In [35]:
"""# Save compendia with SRA label
pao1_expression_label.to_csv(paths.PAO1_COMPENDIUM_LABEL, sep="\t")
pa14_expression_label.to_csv(paths.PA14_COMPENDIUM_LABEL, sep="\t")

# Save compendia without SRA label
pao1_expression_binned.to_csv(paths.PAO1_COMPENDIUM, sep="\t")
pa14_expression_binned.to_csv(paths.PA14_COMPENDIUM, sep="\t")

# Save processed metadata table
sample_to_strain_table.to_csv(paths.SAMPLE_TO_STRAIN_PROCESSED, sep="\t")"""