# Create PAO1 and PA14 compendia

This notebook uses the observation from the [exploratory notebook](../explore_data/cluster_by_accessory_gene.ipynb) to bin samples into PAO1 or PA14 compendia.

A sample is considered PAO1 if the median expression of PA14 accessory genes is 0 and the median expression of PAO1 accessory genes is > 0.
Similarly, a sample is considered PA14 if the median expression of PA14 accessory genes is > 0 and the median expression of PAO1 accessory genes is 0.
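
The binning rule can be sketched on a toy table. The sample ids and values below are made up for illustration; only the column names `median_acc_expression_pao1` and `median_acc_expression_pa14` match the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical per-sample medians of accessory gene expression
toy_medians = pd.DataFrame(
    {
        "median_acc_expression_pao1": [6.3, 0.0, 4.2, 0.0],
        "median_acc_expression_pa14": [0.0, 3.1, 0.2, 0.0],
    },
    index=["sample_A", "sample_B", "sample_C", "sample_D"],
)

# PAO1: PAO1 accessory median > 0 and PA14 accessory median == 0
toy_pao1_ids = list(
    toy_medians.query(
        "median_acc_expression_pao1 > 0 & median_acc_expression_pa14 == 0"
    ).index
)

# PA14: PA14 accessory median > 0 and PAO1 accessory median == 0
toy_pa14_ids = list(
    toy_medians.query(
        "median_acc_expression_pao1 == 0 & median_acc_expression_pa14 > 0"
    ).index
)

# Samples that satisfy neither rule are left out of both compendia
toy_unbinned_ids = sorted(
    set(toy_medians.index) - set(toy_pao1_ids) - set(toy_pa14_ids)
)
```

In this toy table `sample_C` (nonzero for both gene sets) and `sample_D` (zero for both) end up in neither bin.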

In [1]:
%load_ext autoreload
%autoreload 2
import os
import pandas as pd
import seaborn as sns
from textwrap import fill
import matplotlib.pyplot as plt
from core_acc_modules import paths, utils



## Load data

In [2]:
# Expression data files
pao1_expression_filename = paths.PAO1_GE
pa14_expression_filename = paths.PA14_GE

# File containing table to map sample id to strain name
sample_to_strain_filename = paths.SAMPLE_TO_STRAIN

In [3]:
# Load expression data
# Matrices will be sample x gene after taking the transpose
pao1_expression = pd.read_csv(pao1_expression_filename, index_col=0, header=0).T

pa14_expression = pd.read_csv(pa14_expression_filename, index_col=0, header=0).T

# Drop row with gene ensembl ids
pao1_expression.drop(["X"], inplace=True)
pa14_expression.drop(["X"], inplace=True)

In [4]:
# Load metadata
# Set index to experiment id, which is what we will use to map to expression data
sample_to_strain_table_full = pd.read_csv(sample_to_strain_filename, index_col=2)

## Get core and accessory annotations

In [5]:
pao1_annot_filename = paths.GENE_PAO1_ANNOT
pa14_annot_filename = paths.GENE_PA14_ANNOT

core_acc_dict = utils.get_my_core_acc_genes(
    pao1_annot_filename, pa14_annot_filename, pao1_expression, pa14_expression
)

Number of PAO1 core genes: 5366
Number of PA14 core genes: 5363
Number of PAO1 core genes in my dataset: 5361
Number of PA14 core genes in my dataset: 5361
Number of PAO1-specific genes: 202
Number of PA14-specific genes: 530


In [6]:
pao1_acc = core_acc_dict["acc_pao1"]
pa14_acc = core_acc_dict["acc_pa14"]

## Format expression data

Format the index to include only the experiment id. This id is used later to map the expression data to the SRA labels.
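
As a minimal sketch of this formatting step, the snippet below strips everything after the first `.` from each id. The `.salmon` suffix is an assumption made for illustration; the actual suffix in the raw sample ids is not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw sample ids; the ".salmon" suffix is an assumption
toy_index = pd.Index(["ERX541571.salmon", "ERX541572.salmon", "SRX3573046"])

# Keep only the text before the first "."
toy_index_processed = toy_index.str.split(".").str[0]
```

Ids without a `.` pass through unchanged.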

In [7]:
# Format expression data indices so that values can be mapped to `sample_to_strain_table`
pao1_index_processed = pao1_expression.index.str.split(".").str[0]
pa14_index_processed = pa14_expression.index.str.split(".").str[0]

print(
    f"No. of samples processed using PAO1 reference after filtering: {pao1_expression.shape}"
)
print(
    f"No. of samples processed using PA14 reference after filtering: {pa14_expression.shape}"
)
pao1_expression.index = pao1_index_processed
pa14_expression.index = pa14_index_processed

No. of samples processed using PAO1 reference after filtering: (2643, 5563)
No. of samples processed using PA14 reference after filtering: (2619, 5891)


In [8]:
pao1_expression.head()

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA1905,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1
ERX541571,186.695,41.0399,52.6869,42.5596,11.8271,49.6818,15.1125,18.9761,39.7944,7.90016,...,0,5.16485,8.34185,2.24955,16.7872,35.8977,35.3801,117.821,133.373,0.0
ERX541572,200.587,36.65,65.5425,40.1072,15.0975,38.9103,9.27882,13.3374,35.814,4.29789,...,0,4.02765,8.1123,2.13563,13.5472,21.0032,26.6635,109.515,99.2929,0.0
ERX541573,111.42,27.4148,56.0137,25.2659,18.6952,22.6177,7.28033,8.25073,22.8004,2.72297,...,0,3.55578,8.70387,2.63218,20.6529,13.5422,33.1718,47.4317,34.1095,0.0
ERX541574,143.32,34.4773,83.4517,39.3797,23.2258,30.0783,12.3878,11.1666,26.4343,7.74561,...,0,7.79597,9.47563,4.12049,16.6373,15.0784,24.8545,97.8814,36.5146,5.00512
ERX541575,118.398,34.014,73.7324,32.2136,21.2469,23.1491,6.74443,13.0066,33.6489,4.74327,...,0,5.56938,9.57055,2.4523,32.6727,8.08117,47.4864,66.6335,26.2999,0.0


In [9]:
pa14_expression.head()

Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_19205,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845
ERX541571,7.649,5.19946,0.0,0.60384,3.25146,2.07243,4.73741,9.29522,14.7829,0.799833,...,12.9614,59.3176,216.062,15.6344,62.0945,12.2478,485.023,7.89171,20.5447,0
ERX541572,9.07994,5.32769,1.81816,2.99295,8.57561,2.01307,2.62795,11.8476,20.8028,1.0461,...,22.1461,83.6427,201.195,21.1028,90.7961,77.918,643.011,13.0426,18.0631,0
ERX541573,8.19844,5.34884,5.74693,6.51464,4.44288,4.59846,1.40991,11.5351,15.8012,3.56081,...,22.5625,113.387,140.01,8.23511,17.6324,75.5676,989.102,15.9628,11.4278,0
ERX541574,10.2302,1.51737,2.20543,3.78442,2.16879,4.43033,2.52313,9.35498,18.8926,2.66614,...,28.664,120.056,179.671,8.44258,20.247,72.9086,1107.97,24.7304,28.1454,0
ERX541575,8.83536,5.10013,2.70427,7.34835,4.38874,3.05368,0.652771,7.40535,11.4972,1.66304,...,14.6011,125.629,111.88,7.18445,16.4272,58.6684,720.376,17.0733,24.7502,0


In [10]:
# Save pre-binned expression data
pao1_expression.to_csv(paths.PAO1_PREBIN_COMPENDIUM, sep="\t")
pa14_expression.to_csv(paths.PA14_PREBIN_COMPENDIUM, sep="\t")

## Bin samples as PAO1 or PA14

In [11]:
# Create accessory df
# accessory gene ids | median accessory expression | strain label

# PAO1
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
pao1_acc_expression = pao1_expression[pao1_acc].copy()
pao1_acc_expression["median_acc_expression"] = pao1_acc_expression.median(axis=1)

# PA14
pa14_acc_expression = pa14_expression[pa14_acc].copy()
pa14_acc_expression["median_acc_expression"] = pa14_acc_expression.median(axis=1)

pao1_acc_expression.head()


Unnamed: 0,PA4102,PA4105,PA1383,PA4638,PA2105,PA2221,PA1388,PA1393,PA3154,PA3144,...,PA0453,PA1938,PA1387,PA3513,PA2734,PA1386,PA1152,PA4101,PA1224,median_acc_expression
ERX541571,1.51752,0.418828,10.9541,53.7184,7.94475,2.75331,1.31863,10.4854,122.501,237.158,...,2.63691,7.72568,3.66371,1.61181,51.6956,37.3208,24.1255,5.34724,1.75525,6.311616
ERX541572,2.51172,0.579668,11.2011,55.9173,5.81177,3.26242,2.80764,9.87491,104.343,236.762,...,2.20806,4.08008,3.44704,0.63844,44.4813,21.3089,17.593,6.74358,0.850947,6.229964
ERX541573,1.43159,0.480136,8.9898,23.7776,4.75052,1.67101,2.69804,5.27413,48.9171,240.184,...,1.90123,7.04667,4.10674,1.11013,19.8075,8.05387,8.93223,4.00277,0.965475,4.207755
ERX541574,2.33737,0.360131,11.9408,49.2154,8.52468,3.81331,2.13531,9.85984,76.3573,226.801,...,2.48247,7.67418,5.68581,2.059,23.6064,12.5105,9.98602,5.32215,0.944287,6.659673
ERX541575,1.74343,0.963773,10.7525,31.9288,4.87133,3.02249,1.23281,6.96431,65.5187,345.686,...,1.20162,5.76774,4.49896,1.27307,22.8577,7.7716,8.78002,5.5444,1.34505,5.026461


In [12]:
# Merge PAO1 and PA14 accessory dataframes
pao1_pa14_acc_expression = pao1_acc_expression.merge(
    pa14_acc_expression,
    left_index=True,
    right_index=True,
    suffixes=["_pao1", "_pa14"],
)

pao1_pa14_acc_expression.head()

Unnamed: 0,PA4102,PA4105,PA1383,PA4638,PA2105,PA2221,PA1388,PA1393,PA3154,PA3144,...,PA14_09460,PA14_33980,PA14_33970,PA14_61380,PA14_59560,PA14_69520,PA14_31240,PA14_59100,PA14_07460,median_acc_expression_pa14
ERX541571,1.51752,0.418828,10.9541,53.7184,7.94475,2.75331,1.31863,10.4854,122.501,237.158,...,3.15874,0.0,1.27689,72.9424,0,0.172729,0.0,0.0,78.9373,0.0
ERX541572,2.51172,0.579668,11.2011,55.9173,5.81177,3.26242,2.80764,9.87491,104.343,236.762,...,8.23266,0.862252,5.59599,132.318,0,0.440314,0.0,0.763529,47.2668,0.152917
ERX541573,1.43159,0.480136,8.9898,23.7776,4.75052,1.67101,2.69804,5.27413,48.9171,240.184,...,12.1562,0.463245,2.85544,132.586,0,0.246533,0.181258,0.0,159.055,0.198142
ERX541574,2.33737,0.360131,11.9408,49.2154,8.52468,3.81331,2.13531,9.85984,76.3573,226.801,...,17.4976,0.0,5.86011,151.555,0,0.104318,0.0,0.0,178.461,0.152538
ERX541575,1.74343,0.963773,10.7525,31.9288,4.87133,3.02249,1.23281,6.96431,65.5187,345.686,...,17.3923,0.63587,4.2793,114.183,0,0.0,0.0,0.0,96.5324,0.262593
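
The `_pao1` and `_pa14` suffixes are attached only to column names that occur in both inputs, which is why the gene columns keep their names while `median_acc_expression` is split in two. A small sketch with toy frames (the sample ids, gene ids, and values are illustrative):

```python
import pandas as pd

left = pd.DataFrame(
    {"PA0001": [1.0, 2.0], "median_acc_expression": [1.0, 2.0]},
    index=["s1", "s2"],
)
right = pd.DataFrame(
    {"PA14_00010": [0.0, 5.0], "median_acc_expression": [0.0, 5.0]},
    index=["s2", "s3"],
)

# Default inner join on the index: only samples present in both frames survive
merged = left.merge(
    right, left_index=True, right_index=True, suffixes=("_pao1", "_pa14")
)
```

Because the join is an inner join, a sample processed against only one of the two references would drop out at this step.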


In [13]:
# Find PAO1 samples
pao1_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1>0 & median_acc_expression_pa14==0"
    ).index
)

In [14]:
# Find PA14 samples
pa14_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1==0 & median_acc_expression_pa14>0"
    ).index
)

In [15]:
# Check that there are no samples that are binned as both PAO1 and PA14
shared_pao1_pa14_binned_ids = list(set(pao1_binned_ids).intersection(pa14_binned_ids))

assert len(shared_pao1_pa14_binned_ids) == 0

## Format SRA annotations

In [16]:
# Since experiments have multiple runs there are duplicated experiment ids in the index
# We will need to remove these so that the count calculations are accurate
# Copy so that a column can be added later without a SettingWithCopyWarning
sample_to_strain_table_full_processed = sample_to_strain_table_full[
    ~sample_to_strain_table_full.index.duplicated(keep="first")
].copy()

assert (
    len(sample_to_strain_table_full.index.unique())
    == sample_to_strain_table_full_processed.shape[0]
)
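
A short sketch of how `index.duplicated(keep="first")` behaves, using a made-up metadata table (the `Run` column and all ids are hypothetical):

```python
import pandas as pd

toy_metadata = pd.DataFrame(
    {"Run": ["run1", "run2", "run3"], "PAO1": [True, True, False]},
    index=pd.Index(["SRX1", "SRX1", "SRX2"], name="Experiment"),
)

# duplicated(keep="first") flags every repeat after the first occurrence
is_repeat = toy_metadata.index.duplicated(keep="first")

# Negating the mask keeps one row per experiment id
toy_metadata_dedup = toy_metadata[~is_repeat]
```

Only the first run of each experiment is kept, so any labels that differ between runs of the same experiment would be dropped silently.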

In [17]:
# Aggregate boolean labels into a single strain label
# Labels are checked in order, so the first matching strain wins
aggregated_label = []
for exp_id in sample_to_strain_table_full_processed.index:
    if sample_to_strain_table_full_processed.loc[exp_id, "PAO1"].all():
        aggregated_label.append("PAO1")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PA14"].all():
        aggregated_label.append("PA14")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PAK"].all():
        aggregated_label.append("PAK")
    elif sample_to_strain_table_full_processed.loc[exp_id, "ClinicalIsolate"].all():
        aggregated_label.append("Clinical Isolate")
    else:
        aggregated_label.append("NA")

sample_to_strain_table_full_processed["Strain type"] = aggregated_label

sample_to_strain_table = sample_to_strain_table_full_processed["Strain type"].to_frame()

sample_to_strain_table.head()



Unnamed: 0_level_0,Strain type
Experiment,Unnamed: 1_level_1
SRX5057740,
SRX5057739,
SRX5057910,
SRX5057909,
SRX3573046,PAO1
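
As a possible alternative to the row-by-row loop, the same first-match-wins aggregation can be sketched with `numpy.select`. This is not what the notebook runs; the toy flag table below is made up, and it assumes the four columns hold plain booleans.

```python
import numpy as np
import pandas as pd

toy_flags = pd.DataFrame(
    {
        "PAO1": [True, False, False, False, True],
        "PA14": [False, True, False, False, True],
        "PAK": [False, False, True, False, False],
        "ClinicalIsolate": [False, False, False, False, False],
    },
    index=["SRX1", "SRX2", "SRX3", "SRX4", "SRX5"],
)

# Conditions are evaluated in order; the first True one determines the label
columns = ["PAO1", "PA14", "PAK", "ClinicalIsolate"]
conditions = [toy_flags[col].to_numpy() for col in columns]
choices = ["PAO1", "PA14", "PAK", "Clinical Isolate"]

toy_labels = pd.Series(
    np.select(conditions, choices, default="NA"),
    index=toy_flags.index,
    name="Strain type",
)
```

`SRX5` is flagged as both PAO1 and PA14 and receives `PAO1`, which mirrors the order of the `if`/`elif` chain.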


## Create compendia

Create PAO1 and PA14 compendia

In [18]:
# Get expression data
# Note: `.loc` is sufficient here because the binned ids were taken from the
# index of the merged expression data, so every id exists in both matrices.
# Reindexing would only be needed if some ids could be missing.
pao1_expression_binned = pao1_expression.loc[pao1_binned_ids]
pa14_expression_binned = pa14_expression.loc[pa14_binned_ids]

"""# Missing samples are dropped
pao1_expression_binned = pao1_expression_binned.dropna()
pa14_expression_binned = pa14_expression_binned.dropna()

# Drop ambiguously mapped samples
pao1_expression_binned = pao1_expression_binned.drop(high_pao1_pa14_mapping_ids)
pa14_expression_binned = pa14_expression_binned.drop(high_pao1_pa14_mapping_ids)"""

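
The distinction between `.loc` and reindexing only matters when a requested id is absent from the index. A minimal sketch with toy data, assuming pandas 1.0 or later (older versions only warned on missing labels):

```python
import pandas as pd

toy_expression = pd.DataFrame({"PA0001": [1.0, 2.0]}, index=["s1", "s2"])
wanted_ids = ["s1", "s_missing"]

# reindex keeps every requested id and fills absent ones with NaN
reindexed = toy_expression.reindex(wanted_ids)

# .loc raises KeyError when any requested label is absent
try:
    toy_expression.loc[wanted_ids]
    loc_raised = False
except KeyError:
    loc_raised = True
```

With `reindex`, the NaN rows would then have to be removed explicitly, for example with `dropna()`.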

In [19]:
assert len(pao1_binned_ids) == pao1_expression_binned.shape[0]
assert len(pa14_binned_ids) == pa14_expression_binned.shape[0]

In [20]:
# Label samples with SRA annotations
# pao1_expression_label = pao1_expression_binned.join(
#    sample_to_strain_table, how='left')
pao1_expression_label = pao1_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
pa14_expression_label = pa14_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
print(pao1_expression_label.shape)
pao1_expression_label.head()

(1259, 5564)


Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
ERX541571,186.695,41.0399,52.6869,42.5596,11.8271,49.6818,15.1125,18.9761,39.7944,7.90016,...,5.16485,8.34185,2.24955,16.7872,35.8977,35.3801,117.821,133.373,0.0,
ERX541579,150.909,63.9683,56.863,86.0135,32.241,48.3226,21.0994,21.2195,31.5214,12.169,...,16.6068,38.2189,8.61178,28.3964,110.944,137.584,101.058,65.2106,0.0,
ERX541580,108.324,56.079,57.0017,73.4215,18.4262,46.5726,16.6067,18.5953,30.2534,15.936,...,13.9043,37.2461,11.7777,45.733,94.7995,53.4862,82.7537,23.3406,38.1136,
ERX541581,92.256,32.0571,25.3085,33.3771,14.1502,15.5453,7.91773,12.2487,18.1247,4.71911,...,2.80695,5.39659,0.671762,20.4962,21.8929,9.37972,49.2456,26.8092,2.8763,
ERX541589,96.3232,21.9846,58.9142,26.9432,14.225,18.4081,8.37227,8.3501,17.1165,5.60477,...,5.59335,6.60067,2.46414,12.7637,8.72133,18.1404,66.3317,14.5591,0.0,


In [21]:
print(pa14_expression_label.shape)
pa14_expression_label.head()

(703, 5892)


Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845,Strain type
ERX1477379,11.0302,1.3703,0.683438,63.5278,34.587,4.70905,5.27434,8.13739,5.31058,0.766974,...,82.9812,63.6503,28.8776,7.95613,56.8295,269.183,6.07431,29.0794,100.749,PA14
ERX1477380,13.4196,3.03917,1.3345,105.004,70.4931,8.84901,6.80286,18.7842,11.9783,2.08849,...,91.0909,62.4002,24.8194,3.13252,31.3252,181.774,10.8669,16.0901,233.318,PA14
ERX1477381,13.1554,5.68473,7.20822,143.201,107.85,12.7531,8.8681,27.4823,13.9512,1.25178,...,98.9094,71.4295,24.1147,6.5158,18.6166,184.357,13.1124,13.9704,93.1826,PA14
ERX2174773,4.20145,4.19461,3.28368,3.27338,20.1881,3.80429,2.66525,2.65057,20.0627,3.75513,...,35.8212,114.728,104.737,0.0,391.722,160.216,18.9805,20.9854,182.606,PA14
ERX2174774,5.01689,3.30686,3.4467,2.67676,17.1993,3.74861,1.61874,4.08309,21.225,2.37377,...,26.5991,105.848,104.402,0.0,444.485,168.539,24.907,19.7757,206.865,PA14


In [22]:
assert pao1_expression_binned.shape[0] == pao1_expression_label.shape[0]
assert pa14_expression_binned.shape[0] == pa14_expression_label.shape[0]
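
These assertions hold only when every binned sample has a row in the metadata table, because `merge` performs an inner join unless `how` is given. The toy example below (made-up ids and values) shows how an inner join and a left join differ when a sample has no metadata:

```python
import pandas as pd

toy_binned = pd.DataFrame({"PA0001": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
toy_strain = pd.DataFrame({"Strain type": ["PAO1", "NA"]}, index=["s1", "s2"])

# Inner join (the default): s3 has no metadata row and is dropped
inner = toy_binned.merge(toy_strain, left_index=True, right_index=True)

# Left join: s3 is kept and its label is missing
left = toy_binned.merge(
    toy_strain, left_index=True, right_index=True, how="left"
)
```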

## Quick comparison

Quick check comparing our binned labels with the SRA annotations.

In [23]:
pao1_expression_label["Strain type"].value_counts()

PAO1                715
NA                  349
Clinical Isolate    130
PAK                  58
PA14                  7
Name: Strain type, dtype: int64

**Manually check whether these 7 samples annotated as PA14 are mislabeled**
* Clinical isolates could be removed by increasing the threshold
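
One possible reading of "increasing the threshold" is to require the PAO1 accessory median to exceed a cutoff larger than 0. The helper `bin_pao1` and the cutoff of 1.0 below are hypothetical, and the toy values are made up; whether a given cutoff actually removes the clinical isolates would need to be checked against the data.

```python
import pandas as pd


def bin_pao1(medians: pd.DataFrame, threshold: float = 0.0) -> list:
    """Return ids whose PAO1 accessory median exceeds `threshold`
    and whose PA14 accessory median is exactly 0."""
    mask = (medians["median_acc_expression_pao1"] > threshold) & (
        medians["median_acc_expression_pa14"] == 0
    )
    return list(medians.index[mask])


toy_medians = pd.DataFrame(
    {
        "median_acc_expression_pao1": [6.3, 0.4, 0.0],
        "median_acc_expression_pa14": [0.0, 0.0, 2.5],
    },
    index=["s1", "s2", "s3"],
)

ids_default = bin_pao1(toy_medians)  # threshold of 0, as used in this notebook
ids_strict = bin_pao1(toy_medians, threshold=1.0)  # hypothetical stricter cutoff
```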

In [24]:
pa14_expression_label["Strain type"].value_counts()

PA14                476
Clinical Isolate    146
NA                   81
Name: Strain type, dtype: int64
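
The two `value_counts` outputs could also be viewed side by side as one cross-tabulation of binned label against SRA annotation. The sketch below uses a made-up table; the `binned` column is a hypothetical name that does not exist in this notebook.

```python
import pandas as pd

toy_labels = pd.DataFrame(
    {
        "binned": ["PAO1", "PAO1", "PAO1", "PA14", "PA14"],
        "Strain type": ["PAO1", "PA14", "NA", "PA14", "Clinical Isolate"],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Rows: label assigned by binning; columns: label from the SRA annotation
comparison = pd.crosstab(toy_labels["binned"], toy_labels["Strain type"])
```

Off-diagonal cells such as `comparison.loc["PAO1", "PA14"]` count the samples whose annotation disagrees with the bin.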

In [25]:
# Save compendia with SRA label
pao1_expression_label.to_csv(paths.PAO1_COMPENDIUM_LABEL, sep="\t")
pa14_expression_label.to_csv(paths.PA14_COMPENDIUM_LABEL, sep="\t")

# Save compendia without SRA label
pao1_expression_binned.to_csv(paths.PAO1_COMPENDIUM, sep="\t")
pa14_expression_binned.to_csv(paths.PA14_COMPENDIUM, sep="\t")

# Save processed metadata table
sample_to_strain_table.to_csv(paths.SAMPLE_TO_STRAIN_PROCESSED, sep="\t")