# Accessory genes related to least stable core genes

We previously found that the least stable core genes are more co-expressed with accessory genes, so here we look at which accessory genes those are. In our [previous notebook](../3_core_core_analysis/4_find_related_acc_genes.ipynb) we annotated the least and most stable core genes with their top co-expressed accessory genes.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import random
import scipy
import pandas as pd
import numpy as np
from scripts import paths, utils

random.seed(1)

In [2]:
# Load the transcriptional similarity dataframes
# These contain the subset of genes that we will consider
pao1_similarity_scores_filename = (
    "../3_core_core_analysis/pao1_core_similarity_associations_final_spell.tsv"
)
pa14_similarity_scores_filename = (
    "../3_core_core_analysis/pa14_core_similarity_associations_final_spell.tsv"
)

In [3]:
pao1_similarity_scores = pd.read_csv(
    pao1_similarity_scores_filename, sep="\t", header=0, index_col=0
)
pa14_similarity_scores = pd.read_csv(
    pa14_similarity_scores_filename, sep="\t", header=0, index_col=0
)

In [4]:
# Get least stable core genes
pao1_least_stable_genes = list(
    pao1_similarity_scores[pao1_similarity_scores["label"] == "least stable"].index
)
pa14_least_stable_genes = list(
    pa14_similarity_scores[pa14_similarity_scores["label"] == "least stable"].index
)

The co-expressed accessory genes are listed in the `Related acc genes` column. Each value is either a list of the accessory genes found among the core gene's top 10 co-expressed genes, or the string "No accessory genes" if none were found.

In [5]:
pao1_least_df = pao1_similarity_scores.loc[pao1_least_stable_genes]

In [6]:
pa14_least_df = pa14_similarity_scores.loc[pa14_least_stable_genes]

In [7]:
pao1_least_df.head()

PAO1 id,PA14 homolog id,Transcriptional similarity across strains,P-value,Name,label,mean expression,standard deviation expression,min expression,25% expression,50% expression,75% expression,max expression,variance expression,range expression,pathways present,Related acc genes
PA0251,PA14_03100,0.103785,2.756572e-14,,least stable,35.742954,69.060531,0.0,6.077436,19.42845,38.660155,1403.232286,4769.35696,1403.232286,[],['PA0499']
PA3954,PA14_12710,0.133231,1.300336e-22,,least stable,89.839814,158.75368,0.0,16.993923,33.312202,70.730446,1446.745723,25202.731002,1446.745723,['KEGG-Pathway-pae00920: Sulfur metabolism'],No accessory genes
PA0190,PA14_02380,0.074565,4.767006e-08,,least stable,54.573616,102.617366,0.0,12.90793,23.857577,47.583543,1105.915532,10530.323797,1105.915532,[],No accessory genes
PA1868,PA14_40320,0.12667,1.410198e-20,xqhA,least stable,117.079248,199.10781,0.0,22.414522,43.354495,107.139091,1446.643669,39643.920061,1446.643669,['KEGG-Pathway-pae03070: Bacterial secretion s...,No accessory genes
PA4905,PA14_64810,0.111597,2.710524e-16,vanB,least stable,67.118721,156.124192,0.0,9.862778,22.542214,59.593673,2777.288296,24374.763439,2777.288296,['KEGG-Pathway-pae00627: Aminobenzoate degrada...,No accessory genes


## Concatenate accessory genes into a list

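Note: We have to use `eval` here because the column was read back from a TSV file, so each list is stored as its string representation (e.g. `"['PA0499']"`) alongside the plain string "No accessory genes". In the future we could store an empty list instead of that string so that every value has the same form.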
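As a hedged alternative to `eval`, the sketch below parses the list-like strings with `ast.literal_eval`, which only accepts Python literals, and then flattens the result with `explode`. The toy Series and the helper name `parse_acc_genes` are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
import ast

import pandas as pd


def parse_acc_genes(value):
    """Return a list of gene ids from one cell of `Related acc genes`.

    Hypothetical helper: assumes each cell is either the sentinel string
    "No accessory genes" or the string form of a Python list.
    """
    if value == "No accessory genes":
        return []
    return ast.literal_eval(value)


# Toy data mimicking the column; the row labels and pairings are made up
toy = pd.Series(
    ["['PA0499']", "No accessory genes", "['PA0499', 'PA1382']"],
    index=["geneA", "geneB", "geneC"],
    name="Related acc genes",
)

# Parse every cell, flatten the lists, drop rows that had no accessory
# genes, and count how often each accessory gene appears
toy_counts = (
    toy.apply(parse_acc_genes)
    .explode()
    .dropna()
    .value_counts()
    .to_frame("counts")
)
print(toy_counts)
```

With this approach the "No accessory genes" rows do not need to be filtered out beforehand, since they become empty lists that `explode` turns into missing values.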

In [8]:
pao1_least_processed_df = pao1_least_df[
    pao1_least_df["Related acc genes"] != "No accessory genes"
]["Related acc genes"]
pa14_least_processed_df = pa14_least_df[
    pa14_least_df["Related acc genes"] != "No accessory genes"
]["Related acc genes"]

In [9]:
pao1_least_counts_df = (
    pd.Series(pao1_least_processed_df.apply(eval).sum())
    .value_counts()
    .to_frame("counts")
)
pa14_least_counts_df = (
    pd.Series(pa14_least_processed_df.apply(eval).sum())
    .value_counts()
    .to_frame("counts")
)

In [10]:
pao1_least_counts_df

,counts
PA1382,3
PA2221,3
PA1368,3
PA2104,2
PA2106,2
...,...
PA2336,1
PA1560,1
PA4146,1
PA0258,1


In [11]:
pa14_least_counts_df

,counts
PA14_40470,4
PA14_59000,4
PA14_31090,4
PA14_10090,3
PA14_35930,3
...,...
PA14_59370,1
PA14_03390,1
PA14_35720,1
PA14_14420,1


## Add gene names

In [12]:
pao1_annotation_filename = paths.GENE_PAO1_ANNOT
pa14_annotation_filename = paths.GENE_PA14_ANNOT
gene_mapping_pao1 = utils.get_pao1_pa14_gene_map(pao1_annotation_filename, "pao1")
gene_mapping_pa14 = utils.get_pao1_pa14_gene_map(pa14_annotation_filename, "pa14")

In [13]:
pao1_gene_name_map = gene_mapping_pao1["Name"].to_frame()
pa14_gene_name_map = gene_mapping_pa14["Name"].to_frame()

In [14]:
gene_mapping_pao1.head()

PAO1_ID,Name,Product.Name,GeneID.(PAO1),PA14_ID,annotation,num_mapped_genes
PA0001,dnaA,chromosomal replication initiator protein DnaA,878417.0,PA14_00010,core,1.0
PA0002,dnaN,"DNA polymerase III, beta chain",879244.0,PA14_00020,core,1.0
PA0003,recF,RecF protein,879229.0,PA14_00030,core,1.0
PA0004,gyrB,DNA gyrase subunit B,879230.0,PA14_00050,core,1.0
PA0005,lptA,"lysophosphatidic acid acyltransferase, LptA",877576.0,PA14_00060,core,1.0


In [15]:
# Add gene names (a left merge keeps every accessory gene, even if it has no name)
pao1_least_counts_df = pao1_least_counts_df.merge(
    pao1_gene_name_map["Name"], left_index=True, right_index=True, how="left"
)
pa14_least_counts_df = pa14_least_counts_df.merge(
    pa14_gene_name_map["Name"], left_index=True, right_index=True, how="left"
)
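The merge above can be illustrated on toy data. This is a minimal sketch, assuming the annotation table is indexed by gene id and has a `Name` column as shown in the `head()` output; the id `PA9999` and the count values are made up for the example. It shows that a left merge on the index keeps every counted gene and leaves `Name` missing when the annotation has no entry for it.

```python
import pandas as pd

# Toy counts indexed by gene id (values are illustrative only)
counts = pd.DataFrame({"counts": [3, 1]}, index=["PA0001", "PA9999"])

# Toy annotation table with a "Name" column; "PA9999" is deliberately absent
names = pd.DataFrame({"Name": ["dnaA", "dnaN"]}, index=["PA0001", "PA0002"])

# A left merge on the index keeps every row of `counts`
annotated = counts.merge(
    names["Name"], left_index=True, right_index=True, how="left"
)

# Gene ids that ended up without a name after the merge
unnamed = annotated[annotated["Name"].isna()].index.tolist()
print(annotated)
print(unnamed)
```

A check like `unnamed` can help distinguish genes that are genuinely unnamed from ids that failed to match the annotation index.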

In [16]:
pao1_least_counts_df.sort_values(by="counts", ascending=False)

,counts,Name
PA1382,3,
PA1368,3,
PA2221,3,
PA0977,2,
PA2771,2,
...,...,...
PA2105,1,
PA3157,1,
PA0499,1,
PA2184,1,


In [17]:
pa14_least_counts_df.sort_values(by="counts", ascending=False)

,counts,Name
PA14_40470,4,
PA14_31090,4,
PA14_59000,4,
PA14_10090,3,
PA14_35930,3,
...,...,...
PA14_48510,1,
PA14_55070,1,
PA14_59100,1,
PA14_15610,1,


In [18]:
# Save the accessory gene counts for each strain
pao1_least_counts_df.to_csv("pao1_acc_coexpressed_with_least_stable.tsv", sep="\t")
pa14_least_counts_df.to_csv("pa14_acc_coexpressed_with_least_stable.tsv", sep="\t")