# Accessory genes related to least stable core genes

We found that the least stable core genes are more co-expressed with accessory genes, so let's look at which accessory genes those are. In our [previous notebook](../3_core_core_analysis/4_find_related_acc_genes.ipynb) we annotated the least and most stable core genes with their top co-expressed accessory genes.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import random
import scipy
import pandas as pd
import numpy as np
from scripts import paths, utils

random.seed(1)

In [2]:
# Paths to the transcriptional similarity tables
# These tables contain the subset of genes that we will consider
pao1_similarity_scores_filename = (
    "../3_core_core_analysis/pao1_core_similarity_associations_final_spell.tsv"
)
pa14_similarity_scores_filename = (
    "../3_core_core_analysis/pa14_core_similarity_associations_final_spell.tsv"
)

In [3]:
pao1_similarity_scores = pd.read_csv(
    pao1_similarity_scores_filename, sep="\t", header=0, index_col=0
)
pa14_similarity_scores = pd.read_csv(
    pa14_similarity_scores_filename, sep="\t", header=0, index_col=0
)

In [4]:
# Get least stable core genes
pao1_least_stable_genes = list(
    pao1_similarity_scores[pao1_similarity_scores["label"] == "least stable"].index
)
pa14_least_stable_genes = list(
    pa14_similarity_scores[pa14_similarity_scores["label"] == "least stable"].index
)

The co-expressed accessory genes are listed in the `Related acc genes` column. Each value is either a list of the accessory genes found among the core gene's top 10 co-expressed genes, or the string "No accessory genes" if there were none.

In [5]:
pao1_least_df = pao1_similarity_scores.loc[pao1_least_stable_genes]

In [6]:
pa14_least_df = pa14_similarity_scores.loc[pa14_least_stable_genes]

In [7]:
pao1_least_df.head()

PAO1 id,PA14 homolog id,Transcriptional similarity across strains,P-value,Name,label,mean expression,standard deviation expression,min expression,25% expression,50% expression,75% expression,max expression,variance expression,range expression,pathways present,Related acc genes
PA0182,PA14_02300,0.126408,1.691876e-20,,least stable,83.111864,421.695987,0.0,18.610135,31.82349,61.014792,11811.693575,177827.5,11811.693575,"['KEGG-Pathway-pae00780: Biotin metabolism', '...",No accessory genes
PA2222,PA14_59620,0.113817,6.858471e-17,,least stable,393.11117,1226.69201,0.0,51.183444,122.52686,233.600575,16615.976605,1504773.0,16615.976605,[],"['PA2224', 'PA2227', 'PA2221', 'PA2228', 'PA22..."
PA5185,PA14_68490,0.148277,1.1237300000000001e-27,,least stable,96.393706,75.58697,0.0,45.750026,74.597653,125.890384,632.779182,5713.39,632.779182,[],No accessory genes
PA4883,PA14_64550,0.146946,3.318383e-27,,least stable,68.166681,351.089538,0.0,3.439621,7.993611,27.91423,6214.325606,123263.9,6214.325606,[],No accessory genes
PA4598,PA14_60830,0.14567,9.283626000000001e-27,mexD,least stable,602.482358,2087.48042,10.188671,99.920003,166.922821,412.896279,25997.8933,4357575.0,25987.704628,[],No accessory genes


## Concatenate accessory genes into a list

Note: We have to use `eval` here because the column of interest holds a mix of plain strings ("No accessory genes") and lists that are stored as strings once the table is read back from file. In the future we could use an empty list instead of a string.

In [8]:
pao1_least_processed_df = pao1_least_df[
    pao1_least_df["Related acc genes"] != "No accessory genes"
]["Related acc genes"]
pa14_least_processed_df = pa14_least_df[
    pa14_least_df["Related acc genes"] != "No accessory genes"
]["Related acc genes"]

In [9]:
pao1_least_counts_df = (
    pd.Series(pao1_least_processed_df.apply(eval).sum())
    .value_counts()
    .to_frame("counts")
)
pa14_least_counts_df = (
    pd.Series(pa14_least_processed_df.apply(eval).sum())
    .value_counts()
    .to_frame("counts")
)
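
As a possible alternative to `eval`, the parse-and-count step can be sketched with `ast.literal_eval`, which only parses Python literals and does not execute arbitrary code. This is a minimal sketch and not the code used in this notebook: the helper name `count_related_acc_genes` and the toy values are hypothetical, and the sketch assumes every non-placeholder value is a stringified list.

```python
import ast
from collections import Counter

import pandas as pd


def count_related_acc_genes(related_acc_genes, placeholder="No accessory genes"):
    """Count how often each accessory gene appears across core genes.

    `related_acc_genes` is assumed to be a Series of strings: either a
    stringified list such as "['PA2224', 'PA2227']" or the placeholder.
    (Hypothetical helper, not part of the `scripts` package.)
    """
    counts = Counter()
    for value in related_acc_genes:
        if value == placeholder:
            continue
        # literal_eval parses the list literal without executing code
        counts.update(ast.literal_eval(value))
    return (
        pd.Series(counts, dtype="int64")
        .sort_values(ascending=False)
        .to_frame("counts")
    )


# Toy input (hypothetical values, not taken from the real tables)
toy = pd.Series(
    ["['PA2224', 'PA2227']", "No accessory genes", "['PA2224']"],
    index=["geneA", "geneB", "geneC"],
)
toy_counts = count_related_acc_genes(toy)
```

Counting with a `Counter` also avoids building one long concatenated list via `.sum()`, although for tables of this size either approach should be fine.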

In [10]:
pao1_least_counts_df

Unnamed: 0,counts
PA2771,4
PA2225,4
PA1382,4
PA0135,3
PA0716,3
...,...
PA3499,1
PA2220,1
PA3066,1
PA3159,1


In [11]:
pa14_least_counts_df

Unnamed: 0,counts
PA14_43100,4
PA14_10830,4
PA14_54610,4
PA14_22280,3
PA14_35920,3
...,...
PA14_15470,1
PA14_46460,1
PA14_53680,1
PA14_35970,1


## Add gene names

In [12]:
pao1_annotation_filename = paths.GENE_PAO1_ANNOT
pa14_annotation_filename = paths.GENE_PA14_ANNOT
gene_mapping_pao1 = utils.get_pao1_pa14_gene_map(pao1_annotation_filename, "pao1")
gene_mapping_pa14 = utils.get_pao1_pa14_gene_map(pa14_annotation_filename, "pa14")

In [13]:
pao1_gene_name_map = gene_mapping_pao1["Name"].to_frame()
pa14_gene_name_map = gene_mapping_pa14["Name"].to_frame()

In [14]:
gene_mapping_pao1.head()

PAO1_ID,Name,Product.Name,GeneID.(PAO1),PA14_ID,annotation,num_mapped_genes
PA0001,dnaA,chromosomal replication initiator protein DnaA,878417.0,PA14_00010,core,1.0
PA0002,dnaN,"DNA polymerase III, beta chain",879244.0,PA14_00020,core,1.0
PA0003,recF,RecF protein,879229.0,PA14_00030,core,1.0
PA0004,gyrB,DNA gyrase subunit B,879230.0,PA14_00050,core,1.0
PA0005,lptA,"lysophosphatidic acid acyltransferase, LptA",877576.0,PA14_00060,core,1.0


In [15]:
# Add gene names (left join on gene id, so genes without a name are kept)
pao1_least_counts_df = pao1_least_counts_df.merge(
    pao1_gene_name_map["Name"], left_index=True, right_index=True, how="left"
)
pa14_least_counts_df = pa14_least_counts_df.merge(
    pa14_gene_name_map["Name"], left_index=True, right_index=True, how="left"
)
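
To illustrate why many rows end up without a name after this step, here is a small sketch of a left merge on the index. The toy tables and their values are hypothetical stand-ins for the counts and annotation tables. With `how="left"`, every row of the counts table is kept, and gene ids that are absent from the name table get `NaN` in the `Name` column.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the counts and annotation tables
toy_counts = pd.DataFrame({"counts": [4, 1]}, index=["geneA", "geneB"])
toy_names = pd.DataFrame({"Name": ["abcD"]}, index=["geneB"])

# how="left" keeps every row of toy_counts; ids missing from toy_names get NaN
toy_annotated = toy_counts.merge(
    toy_names["Name"], left_index=True, right_index=True, how="left"
)

# Number of accessory genes that have no name after the merge
n_unnamed = toy_annotated["Name"].isna().sum()
```

A similar `isna().sum()` check on `pao1_least_counts_df` or `pa14_least_counts_df` would show how many of the co-expressed accessory genes are unnamed, assuming unnamed genes are stored as missing values in the annotation table.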

In [16]:
pao1_least_counts_df

Unnamed: 0,counts,Name
PA2771,4,
PA2225,4,
PA1382,4,
PA0135,3,
PA0716,3,
...,...,...
PA3499,1,
PA2220,1,
PA3066,1,
PA3159,1,wbpA


In [17]:
pa14_least_counts_df

Unnamed: 0,counts,Name
PA14_43100,4,rhsP2
PA14_10830,4,
PA14_54610,4,
PA14_22280,3,
PA14_35920,3,
...,...,...
PA14_15470,1,merP
PA14_46460,1,
PA14_53680,1,
PA14_35970,1,


In [18]:
# Save the accessory gene counts, with gene names, for each strain
pao1_least_counts_df.to_csv("pao1_acc_coexpressed_with_least_stable.tsv", sep="\t")
pa14_least_counts_df.to_csv("pa14_acc_coexpressed_with_least_stable.tsv", sep="\t")