# Create PAO1 and PA14 compendia

This notebook uses the observation from the [exploratory notebook](../explore_data/cluster_by_accessory_gene.ipynb) to bin samples into PAO1 or PA14 compendia.

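A sample is considered PAO1 if the median expression of PA14 accessory genes is 0 and the median expression of PAO1 accessory genes is > 0.
Similarly, a sample is considered PA14 if the median expression of PA14 accessory genes is > 0 and the median expression of PAO1 accessory genes is 0.
Note that the binning queries in this notebook use a stricter cutoff of 10, rather than 0, for the strain's own accessory genes.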
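
The rule can be sketched on toy data as follows. This is a minimal illustration, not the code used in this notebook: `bin_samples`, its `threshold` parameter, and the toy medians are all hypothetical names and values introduced here.

```python
import pandas as pd


def bin_samples(median_pao1_acc, median_pa14_acc, threshold=0):
    """Label each sample as PAO1, PA14, or unbinned from its accessory medians.

    Hypothetical helper: `threshold` is the cutoff that the strain's own
    accessory median must exceed; the other strain's median must be exactly 0.
    """
    labels = pd.Series("unbinned", index=median_pao1_acc.index)
    is_pao1 = (median_pao1_acc > threshold) & (median_pa14_acc == 0)
    is_pa14 = (median_pa14_acc > threshold) & (median_pao1_acc == 0)
    labels[is_pao1] = "PAO1"
    labels[is_pa14] = "PA14"
    return labels


# Made-up medians for four samples
toy_pao1 = pd.Series([25.0, 0.0, 3.0, 0.0], index=["s1", "s2", "s3", "s4"])
toy_pa14 = pd.Series([0.0, 40.0, 2.0, 0.0], index=["s1", "s2", "s3", "s4"])

toy_labels = bin_samples(toy_pao1, toy_pa14)
print(toy_labels)
```

Samples that express both sets of accessory genes (`s3`) or neither (`s4`) stay unbinned under this rule.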

In [1]:
%load_ext autoreload
%autoreload 2
import os
import pandas as pd
import seaborn as sns
from textwrap import fill
import matplotlib.pyplot as plt
from core_acc_modules import paths, utils


## Load data

In [2]:
# Expression data files
pao1_expression_filename = paths.PAO1_GE
pa14_expression_filename = paths.PA14_GE

# File containing table to map sample id to strain name
sample_to_strain_filename = paths.SAMPLE_TO_STRAIN

In [3]:
# Load expression data
# Matrices will be sample x gene after taking the transpose
pao1_expression = pd.read_csv(pao1_expression_filename, index_col=0, header=0).T

pa14_expression = pd.read_csv(pa14_expression_filename, index_col=0, header=0).T

# Drop row with gene ensembl ids
pao1_expression.drop(["X"], inplace=True)
pa14_expression.drop(["X"], inplace=True)

In [4]:
# Load metadata
# Set index to experiment id, which is what we will use to map to expression data
sample_to_strain_table_full = pd.read_csv(sample_to_strain_filename, index_col=2)

## Get core and accessory annotations

In [5]:
pao1_annot_filename = paths.GENE_PAO1_ANNOT
pa14_annot_filename = paths.GENE_PA14_ANNOT

core_acc_dict = utils.get_my_core_acc_genes(
    pao1_annot_filename, pa14_annot_filename, pao1_expression, pa14_expression
)

Number of PAO1 core genes: 5366
Number of PA14 core genes: 5363
Number of PAO1 core genes in my dataset: 5361
Number of PA14 core genes in my dataset: 5361
Number of PAO1-specific genes: 202
Number of PA14-specific genes: 530


In [6]:
pao1_acc = core_acc_dict["acc_pao1"]
pa14_acc = core_acc_dict["acc_pa14"]

## Format expression data

Format the index to include only the experiment id. This id will be used to map the expression data to the SRA labels later.

In [7]:
# Format expression data indices so that values can be mapped to `sample_to_strain_table`
pao1_index_processed = pao1_expression.index.str.split(".").str[0]
pa14_index_processed = pa14_expression.index.str.split(".").str[0]

print(
    f"No. of samples processed using PAO1 reference after filtering: {pao1_expression.shape}"
)
print(
    f"No. of samples processed using PA14 reference after filtering: {pa14_expression.shape}"
)
pao1_expression.index = pao1_index_processed
pa14_expression.index = pa14_index_processed

No. of samples processed using PAO1 reference after filtering: (2643, 5563)
No. of samples processed using PA14 reference after filtering: (2619, 5891)
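
As a small sketch of what the index formatting does, the example below splits hypothetical sample names on the first `.`. The `.salmon` suffix is an assumption made for illustration only; the actual suffix in the expression files may differ.

```python
import pandas as pd

# Hypothetical sample names; the ".salmon" suffix is assumed for illustration
toy_index = pd.Index(["ERX541571.salmon", "ERX541572.salmon", "SRX5099522"])

# Keep only the text before the first "."
toy_experiment_ids = toy_index.str.split(".").str[0]
print(list(toy_experiment_ids))

# Truncating names can create duplicates, so it may be worth checking uniqueness
print(toy_experiment_ids.is_unique)
```

Names without a `.` pass through unchanged.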


In [8]:
pao1_expression.head()

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA1905,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1
ERX541571,186.695,41.0399,52.6869,42.5596,11.8271,49.6818,15.1125,18.9761,39.7944,7.90016,...,0,5.16485,8.34185,2.24955,16.7872,35.8977,35.3801,117.821,133.373,0.0
ERX541572,200.587,36.65,65.5425,40.1072,15.0975,38.9103,9.27882,13.3374,35.814,4.29789,...,0,4.02765,8.1123,2.13563,13.5472,21.0032,26.6635,109.515,99.2929,0.0
ERX541573,111.42,27.4148,56.0137,25.2659,18.6952,22.6177,7.28033,8.25073,22.8004,2.72297,...,0,3.55578,8.70387,2.63218,20.6529,13.5422,33.1718,47.4317,34.1095,0.0
ERX541574,143.32,34.4773,83.4517,39.3797,23.2258,30.0783,12.3878,11.1666,26.4343,7.74561,...,0,7.79597,9.47563,4.12049,16.6373,15.0784,24.8545,97.8814,36.5146,5.00512
ERX541575,118.398,34.014,73.7324,32.2136,21.2469,23.1491,6.74443,13.0066,33.6489,4.74327,...,0,5.56938,9.57055,2.4523,32.6727,8.08117,47.4864,66.6335,26.2999,0.0


In [9]:
pa14_expression.head()

Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_19205,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845
ERX541571,7.649,5.19946,0.0,0.60384,3.25146,2.07243,4.73741,9.29522,14.7829,0.799833,...,12.9614,59.3176,216.062,15.6344,62.0945,12.2478,485.023,7.89171,20.5447,0
ERX541572,9.07994,5.32769,1.81816,2.99295,8.57561,2.01307,2.62795,11.8476,20.8028,1.0461,...,22.1461,83.6427,201.195,21.1028,90.7961,77.918,643.011,13.0426,18.0631,0
ERX541573,8.19844,5.34884,5.74693,6.51464,4.44288,4.59846,1.40991,11.5351,15.8012,3.56081,...,22.5625,113.387,140.01,8.23511,17.6324,75.5676,989.102,15.9628,11.4278,0
ERX541574,10.2302,1.51737,2.20543,3.78442,2.16879,4.43033,2.52313,9.35498,18.8926,2.66614,...,28.664,120.056,179.671,8.44258,20.247,72.9086,1107.97,24.7304,28.1454,0
ERX541575,8.83536,5.10013,2.70427,7.34835,4.38874,3.05368,0.652771,7.40535,11.4972,1.66304,...,14.6011,125.629,111.88,7.18445,16.4272,58.6684,720.376,17.0733,24.7502,0


In [10]:
# Save pre-binned expression data
pao1_expression.to_csv(paths.PAO1_PREBIN_COMPENDIUM, sep="\t")
pa14_expression.to_csv(paths.PA14_PREBIN_COMPENDIUM, sep="\t")

## Bin samples as PAO1 or PA14

In [11]:
# Create accessory df
# accessory gene ids | median accessory expression | strain label

# PAO1
# Copy the selection so that adding a column does not raise SettingWithCopyWarning
pao1_acc_expression = pao1_expression[pao1_acc].copy()
pao1_acc_expression["median_acc_expression"] = pao1_acc_expression.median(axis=1)

# PA14
pa14_acc_expression = pa14_expression[pa14_acc].copy()
pa14_acc_expression["median_acc_expression"] = pa14_acc_expression.median(axis=1)

pao1_acc_expression.head()


Unnamed: 0,PA2101,PA2976,PA4106,PA2336,PA4193,PA4146,PA1555.1,PA3151,PA1387,PA1390,...,PA1224,PA0643,PA1152,PA1381,PA1369,PA1223,PA2818,PA2192,PA3158,median_acc_expression
ERX541571,2.05909,379.493,1.64952,1.50311,3.93058,0.553733,35.3801,18.5786,3.66371,3.46066,...,1.75525,10.4145,24.1255,6.30027,24.5901,2.35952,8.33503,0.366494,143.347,6.311616
ERX541572,2.31548,248.794,1.60289,0.769797,3.15489,0.302655,26.6635,24.9523,3.44704,1.77095,...,0.850947,8.45386,17.593,4.04601,23.5762,4.29316,6.63197,0.425682,134.148,6.229964
ERX541573,1.45339,79.628,2.28253,0.286995,1.89947,1.36516,33.1718,22.2867,4.10674,1.97604,...,0.965475,7.24225,8.93223,1.88176,20.0251,6.34905,5.10521,0.124786,84.0009,4.207755
ERX541574,5.92319,157.005,2.59091,0.293115,2.39818,1.97878,24.8545,34.0147,5.68581,3.03053,...,0.944287,11.4808,9.98602,3.26026,22.343,12.9834,6.06797,0.64848,124.559,6.659673
ERX541575,3.98432,94.9893,3.35318,0.273816,1.34895,0.562468,47.4864,39.7562,4.49896,1.98053,...,1.34505,7.53727,8.78002,2.42034,30.7739,4.6329,6.0068,0.22894,99.7044,5.026461


In [12]:
# Merge PAO1 and PA14 accessory dataframes
pao1_pa14_acc_expression = pao1_acc_expression.merge(
    pa14_acc_expression,
    left_index=True,
    right_index=True,
    suffixes=["_pao1", "_pa14"],
)

pao1_pa14_acc_expression.head()

Unnamed: 0,PA2101,PA2976,PA4106,PA2336,PA4193,PA4146,PA1555.1,PA3151,PA1387,PA1390,...,PA14_60240,PA14_59100,PA14_63450,PA14_43100,PA14_46460,PA14_15620,PA14_35750,PA14_49520,PA14_22110,median_acc_expression_pa14
ERX541571,2.05909,379.493,1.64952,1.50311,3.93058,0.553733,35.3801,18.5786,3.66371,3.46066,...,34.7077,0.0,0.0,0.722916,0.21124,0.0,0,0.476568,0,0.0
ERX541572,2.31548,248.794,1.60289,0.769797,3.15489,0.302655,26.6635,24.9523,3.44704,1.77095,...,36.5013,0.763529,1.44329,0.408432,0.0,0.468415,0,0.198355,0,0.152917
ERX541573,1.45339,79.628,2.28253,0.286995,1.89947,1.36516,33.1718,22.2867,4.10674,1.97604,...,23.7647,0.0,0.816088,1.04989,0.183557,0.0,0,0.289011,0,0.198142
ERX541574,5.92319,157.005,2.59091,0.293115,2.39818,1.97878,24.8545,34.0147,5.68581,3.03053,...,58.9577,0.0,0.0,1.48468,0.0,0.0,0,0.470385,0,0.152538
ERX541575,3.98432,94.9893,3.35318,0.273816,1.34895,0.562468,47.4864,39.7562,4.49896,1.98053,...,38.3732,0.0,0.0,1.67993,0.0,0.349279,0,0.660307,0,0.262593
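
The merge above relies on two behaviours of `DataFrame.merge`: the default join is an inner join on the index, and suffixes are applied only to column names that appear in both frames (here, `median_acc_expression`). A toy sketch with made-up values and a hypothetical PA14 gene id:

```python
import pandas as pd

# Made-up accessory tables; "PA14_00010" is a placeholder gene id
left = pd.DataFrame(
    {"PA0001": [1.0, 2.0], "median_acc_expression": [1.0, 2.0]},
    index=["s1", "s2"],
)
right = pd.DataFrame(
    {"PA14_00010": [0.0, 5.0], "median_acc_expression": [0.0, 5.0]},
    index=["s2", "s3"],
)

merged = left.merge(
    right, left_index=True, right_index=True, suffixes=("_pao1", "_pa14")
)

# Only the shared sample "s2" survives, and only the shared column is suffixed
print(merged)
```

Because of the inner join, only samples present in both expression matrices can be binned.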


In [13]:
# Find PAO1 samples
pao1_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1>10 & median_acc_expression_pa14==0"
    ).index
)

In [14]:
# Find PA14 samples
pa14_binned_ids = list(
    pao1_pa14_acc_expression.query(
        "median_acc_expression_pao1==0 & median_acc_expression_pa14>10"
    ).index
)

In [15]:
# Check that there are no samples that are binned as both PAO1 and PA14
shared_pao1_pa14_binned_ids = list(set(pao1_binned_ids).intersection(pa14_binned_ids))

assert len(shared_pao1_pa14_binned_ids) == 0

## Format SRA annotations

In [16]:
# Since experiments have multiple runs there are duplicated experiment ids in the index
# We will need to remove these so that the count calculations are accurate
sample_to_strain_table_full_processed = sample_to_strain_table_full[
    ~sample_to_strain_table_full.index.duplicated(keep="first")
]

assert (
    len(sample_to_strain_table_full.index.unique())
    == sample_to_strain_table_full_processed.shape[0]
)
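
A toy sketch of the de-duplication step, using made-up experiment and run ids: `Index.duplicated(keep="first")` marks every repeat of an index value after its first occurrence, so negating the mask keeps one row per experiment.

```python
import pandas as pd

# Made-up metadata: experiment SRX1 has two runs
toy_meta = pd.DataFrame(
    {"Run": ["r1", "r2", "r3"], "PAO1": [True, True, False]},
    index=pd.Index(["SRX1", "SRX1", "SRX2"], name="Experiment"),
)

# Keep the first row for each experiment id
deduped = toy_meta[~toy_meta.index.duplicated(keep="first")]
print(deduped)
```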

In [17]:
# Aggregate boolean labels into a single strain label
aggregated_label = []
for exp_id in list(sample_to_strain_table_full_processed.index):
    if sample_to_strain_table_full_processed.loc[exp_id, "PAO1"].all() == True:
        aggregated_label.append("PAO1")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PA14"].all() == True:
        aggregated_label.append("PA14")
    elif sample_to_strain_table_full_processed.loc[exp_id, "PAK"].all() == True:
        aggregated_label.append("PAK")
    elif (
        sample_to_strain_table_full_processed.loc[exp_id, "ClinicalIsolate"].all()
        == True
    ):
        aggregated_label.append("Clinical Isolate")
    else:
        aggregated_label.append("NA")

# Use assign to build a new dataframe instead of writing into a slice,
# which avoids SettingWithCopyWarning
sample_to_strain_table_full_processed = sample_to_strain_table_full_processed.assign(
    **{"Strain type": aggregated_label}
)

sample_to_strain_table = sample_to_strain_table_full_processed["Strain type"].to_frame()

sample_to_strain_table.head()


Experiment,Strain type
SRX5057740,
SRX5057739,
SRX5057910,
SRX5057909,
SRX3573046,PAO1
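
As an alternative to the row-by-row loop, the same first-match-wins labeling could be written with `numpy.select`. This is only a sketch on made-up boolean flags, assuming the four annotation columns hold one boolean per experiment; it is not the code that produced the table above.

```python
import numpy as np
import pandas as pd

# Made-up boolean annotations, one row per experiment
toy_flags = pd.DataFrame(
    {
        "PAO1": [True, False, False, False, False],
        "PA14": [False, True, False, False, False],
        "PAK": [False, False, True, False, False],
        "ClinicalIsolate": [False, False, False, True, False],
    },
    index=["e1", "e2", "e3", "e4", "e5"],
)

# Conditions are checked in order, like the if/elif chain
conditions = [
    toy_flags["PAO1"].to_numpy(),
    toy_flags["PA14"].to_numpy(),
    toy_flags["PAK"].to_numpy(),
    toy_flags["ClinicalIsolate"].to_numpy(),
]
choices = ["PAO1", "PA14", "PAK", "Clinical Isolate"]

toy_strain = pd.Series(
    np.select(conditions, choices, default="NA"),
    index=toy_flags.index,
    name="Strain type",
)
print(toy_strain)
```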


## Create compendia

Create PAO1 and PA14 compendia

In [18]:
# Get expression data
# Note: .loc is safe here because the binned ids were taken from the merged
# accessory dataframe, so every id exists in both expression matrices
pao1_expression_binned = pao1_expression.loc[pao1_binned_ids]
pa14_expression_binned = pa14_expression.loc[pa14_binned_ids]
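
For context on the `.loc` versus reindexing question, the toy sketch below (made-up data) shows how the two differ when a requested id is missing from the index: `.loc` with a list raises a `KeyError`, whereas `reindex` returns a row of missing values.

```python
import pandas as pd

toy_expression = pd.DataFrame({"PA0001": [1.0, 2.0]}, index=["s1", "s2"])
ids = ["s1", "s3"]  # "s3" is absent from the index

# .loc raises when any requested label is missing
try:
    toy_expression.loc[ids]
    loc_raised = False
except KeyError:
    loc_raised = True

# reindex keeps the requested labels and fills missing rows with NaN
reindexed = toy_expression.reindex(ids)

print(loc_raised)
print(reindexed)
```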

In [19]:
assert len(pao1_binned_ids) == pao1_expression_binned.shape[0]
assert len(pa14_binned_ids) == pa14_expression_binned.shape[0]

In [20]:
# Label samples with SRA annotations
# pao1_expression_label = pao1_expression_binned.join(
#    sample_to_strain_table, how='left')
pao1_expression_label = pao1_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
pa14_expression_label = pa14_expression_binned.merge(
    sample_to_strain_table, left_index=True, right_index=True
)
print(pao1_expression_label.shape)
pao1_expression_label.head()

(847, 5564)


Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
ERX541579,150.909,63.9683,56.863,86.0135,32.241,48.3226,21.0994,21.2195,31.5214,12.169,...,16.6068,38.2189,8.61178,28.3964,110.944,137.584,101.058,65.2106,0.0,
ERX541580,108.324,56.079,57.0017,73.4215,18.4262,46.5726,16.6067,18.5953,30.2534,15.936,...,13.9043,37.2461,11.7777,45.733,94.7995,53.4862,82.7537,23.3406,38.1136,
ERX541591,556.704,182.31,81.7305,157.349,85.185,116.644,36.9645,31.9735,63.1099,16.1408,...,73.032,80.3696,49.3125,69.0997,61.6223,103.911,186.031,478.965,0.0,
ERX541592,428.47,178.51,74.9532,152.142,92.3931,95.3175,33.668,31.2622,81.1315,20.1266,...,114.618,62.5837,56.4265,58.4739,55.3119,101.62,207.41,345.058,0.0,
ERX676205,555.933,530.702,295.029,336.887,160.856,196.8,24.0623,127.448,251.544,29.9338,...,59.6603,268.159,40.9292,30.2728,284.147,357.807,260.651,110.622,0.0,PAO1


In [21]:
print(pa14_expression_label.shape)
pa14_expression_label.head()

(520, 5892)


Unnamed: 0,PA14_55610,PA14_55600,PA14_55590,PA14_55580,PA14_55570,PA14_55560,PA14_55550,PA14_55540,PA14_55530,PA14_55520,...,PA14_17675,PA14_67975,PA14_36345,PA14_43405,PA14_38825,PA14_24245,PA14_28895,PA14_55117,PA14_59845,Strain type
ERX1477379,11.0302,1.3703,0.683438,63.5278,34.587,4.70905,5.27434,8.13739,5.31058,0.766974,...,82.9812,63.6503,28.8776,7.95613,56.8295,269.183,6.07431,29.0794,100.749,PA14
ERX1477380,13.4196,3.03917,1.3345,105.004,70.4931,8.84901,6.80286,18.7842,11.9783,2.08849,...,91.0909,62.4002,24.8194,3.13252,31.3252,181.774,10.8669,16.0901,233.318,PA14
ERX1477381,13.1554,5.68473,7.20822,143.201,107.85,12.7531,8.8681,27.4823,13.9512,1.25178,...,98.9094,71.4295,24.1147,6.5158,18.6166,184.357,13.1124,13.9704,93.1826,PA14
ERX2174773,4.20145,4.19461,3.28368,3.27338,20.1881,3.80429,2.66525,2.65057,20.0627,3.75513,...,35.8212,114.728,104.737,0.0,391.722,160.216,18.9805,20.9854,182.606,PA14
ERX2174774,5.01689,3.30686,3.4467,2.67676,17.1993,3.74861,1.61874,4.08309,21.225,2.37377,...,26.5991,105.848,104.402,0.0,444.485,168.539,24.907,19.7757,206.865,PA14


In [22]:
assert pao1_expression_binned.shape[0] == pao1_expression_label.shape[0]
assert pa14_expression_binned.shape[0] == pa14_expression_label.shape[0]

In [23]:
sample_to_strain_table["Strain type"].value_counts()

PAO1                861
NA                  795
Clinical Isolate    601
PA14                545
PAK                  65
Name: Strain type, dtype: int64

Our binned compendia are fairly close in size to the SRA annotations: 847 binned vs. 861 annotated PAO1 samples, and 520 binned vs. 545 annotated PA14 samples.

## Quick comparison

Quick check comparing our binned labels with the SRA annotations.

In [24]:
pao1_expression_label["Strain type"].value_counts()

PAO1                658
NA                  169
Clinical Isolate     15
PA14                  5
Name: Strain type, dtype: int64

**Manually check whether these PA14-annotated samples are mislabeled**
* The clinical isolates can be removed by increasing the threshold

In [25]:
pa14_expression_label["Strain type"].value_counts()

PA14                463
NA                   36
Clinical Isolate     21
Name: Strain type, dtype: int64
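
The two `value_counts` summaries could also be viewed as a single cross-tabulation of binned label against SRA annotation. A minimal sketch on made-up labels (the column names `binned` and `sra` are hypothetical):

```python
import pandas as pd

# Made-up labels for five samples
toy = pd.DataFrame(
    {
        "binned": ["PAO1", "PAO1", "PAO1", "PA14", "PA14"],
        "sra": ["PAO1", "PAO1", "PA14", "PA14", "NA"],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Rows: our binned label; columns: SRA annotation
confusion = pd.crosstab(toy["binned"], toy["sra"])
print(confusion)
```

Off-diagonal cells, such as samples binned as PAO1 but annotated as PA14, are the ones worth checking by hand.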

## Check

Manually look up the samples we binned as PAO1 but SRA labeled as PA14. Are these cases of samples being mislabeled?

In [26]:
pao1_expression_label[pao1_expression_label["Strain type"] == "PA14"]

Unnamed: 0,PA0001,PA0002,PA0003,PA0004,PA0005,PA0006,PA0007,PA0008,PA0009,PA0010,...,PA0195,PA4812,PA0195.1,PA0457.1,PA1552.1,PA1555.1,PA3701,PA4724.1,PA5471.1,Strain type
SRX5099522,375.431,337.019,635.994,501.447,89.1968,216.427,30.8242,127.463,101.924,31.8056,...,36.2873,299.498,26.7235,85.106,1030.22,1017.71,289.708,46.8439,0,PA14
SRX5099523,354.891,314.924,551.896,536.666,95.4369,191.199,31.3827,156.952,98.7123,31.5692,...,18.2669,308.803,15.4645,74.2603,1793.89,605.43,259.285,13.735,0,PA14
SRX5099524,389.998,300.367,688.801,500.255,91.593,243.114,37.9347,122.364,100.192,42.2384,...,45.222,292.814,22.4411,110.264,1084.12,910.3,299.204,50.1211,0,PA14
SRX7423386,212.036,529.068,190.269,431.382,39.3568,52.7005,68.0894,122.324,120.495,31.3687,...,58.9855,212.045,38.1922,29.2829,524.758,510.617,106.532,47.8645,0,PA14
SRX7423388,216.779,581.906,183.299,455.074,37.6687,47.3949,37.5524,106.549,123.094,34.3959,...,54.0478,116.447,32.8172,19.5091,571.821,653.718,111.577,48.2728,0,PA14


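Note: Using a threshold of 0, 7 PA14-labeled samples were binned as PAO1; all 7 are listed below. The table above shows only 5 of them, which is consistent with the threshold of 10 used in the binning queries in this notebook (SRX5290921 and SRX5290922 do not appear).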

Most samples appear to be mislabeled:
* SRX5099522: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099522
* SRX5099523: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099523
* SRX5099524: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5099524
* SRX5290921: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5290921
* SRX5290922: https://www.ncbi.nlm.nih.gov/sra/?term=SRX5290922

Two samples appear to be PA14 samples treated with antimicrobial manuka honey.
* SRX7423386: https://www.ncbi.nlm.nih.gov/sra/?term=SRX7423386
* SRX7423388: https://www.ncbi.nlm.nih.gov/sra/?term=SRX7423388

In [27]:
pa14_label_pao1_binned_ids = list(
    pao1_expression_label[pao1_expression_label["Strain type"] == "PA14"].index
)
pao1_pa14_acc_expression.loc[
    pa14_label_pao1_binned_ids,
    ["median_acc_expression_pao1", "median_acc_expression_pa14"],
]

Unnamed: 0,median_acc_expression_pao1,median_acc_expression_pa14
SRX5099522,42.405685,0.0
SRX5099523,32.682155,0.0
SRX5099524,40.899813,0.0
SRX7423386,25.369495,0.0
SRX7423388,23.708601,0.0


In [28]:
# Save compendia with SRA label
pao1_expression_label.to_csv(paths.PAO1_COMPENDIUM_LABEL, sep="\t")
pa14_expression_label.to_csv(paths.PA14_COMPENDIUM_LABEL, sep="\t")

# Save compendia without SRA label
pao1_expression_binned.to_csv(paths.PAO1_COMPENDIUM, sep="\t")
pa14_expression_binned.to_csv(paths.PA14_COMPENDIUM, sep="\t")

# Save processed metadata table
sample_to_strain_table.to_csv(paths.SAMPLE_TO_STRAIN_PROCESSED, sep="\t")
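
One thing to keep in mind when loading the processed metadata later: `pd.read_csv` treats the string `NA` as a missing value by default, so the `"NA"` strain label would come back as `NaN`. The round trip below is a sketch on a made-up table written to an in-memory buffer rather than to the real paths.

```python
import io

import pandas as pd

# Made-up table with the literal string "NA" as a label
toy_table = pd.DataFrame(
    {"Strain type": ["PAO1", "NA"]},
    index=pd.Index(["SRX1", "SRX2"], name="Experiment"),
)

buffer = io.StringIO()
toy_table.to_csv(buffer, sep="\t")

# Default parsing converts "NA" to NaN
buffer.seek(0)
default_read = pd.read_csv(buffer, sep="\t", index_col=0)

# keep_default_na=False preserves the literal string
buffer.seek(0)
literal_read = pd.read_csv(buffer, sep="\t", index_col=0, keep_default_na=False)

print(default_read)
print(literal_read)
```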