# STRONG CORRELATIONS LOCALIZER
This script finds the groups in which the strongly correlated MGE-ARG pairs obtained in the correlation study script are located. It depends on having run the copy per seq and detection study scripts beforehand.

At this point, we have a series of .json files that condense different pieces of information:

- arg_per_typology: contains all the ARG sequences that amplified, broken down by sample typology
- mge_per_typology: same idea, but with MGE sequences
- shared_different_seqs_per_place: a sort of addendum to the first two lists; it contains the sequences (both ARG and MGE; it needs a little trimming) that were shared by plastics and control but have a significantly greater NCN in plastics (none did so in control samples)
- correlations_per_typology: the pièce de résistance, a list of all the strongly correlated MGE-ARG pairs (neatly formatted) and the sample type in which each can be found

So, the idea is simple: take the list of strongly correlated MGEs, find them within the sea of amplified sequences, and locate the sample typology they belong to. Then, do the same for the ARGs. The most interesting part will be to see how many of the strongly correlated MGEs are exclusive to a typology, and whether the ARGs exclusive to that typology are also strongly correlated.

Everything could be tied up with a neat bow if those MGEs/ARGs were also strongly correlated with exclusive ASVs. But that will have to wait.
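
The lookup described above can be sketched as a single helper that works for both MGEs and ARGs. This is only a minimal sketch, not the notebook's actual code: the function name `locate_correlated` and the toy dictionaries are hypothetical, and it assumes that `corrs` maps each correlation set to `{MGE: [ARG, ...]}` and that the per-typology files map each typology to a list of sequence names.

```python
def locate_correlated(seqs_per_typology, corrs, kind="mge"):
    """Return {typology: [(sequence, correlation_set), ...]} for correlated sequences.

    kind="mge" matches against the keys of each correlation set;
    kind="arg" matches against the ARG lists stored as values.
    """
    located = {}
    for typology, seqs in seqs_per_typology.items():
        for seq in seqs:
            for corr_set, pairs in corrs.items():
                if kind == "mge":
                    hit = seq in pairs
                else:
                    hit = any(seq in args for args in pairs.values())
                if hit:
                    located.setdefault(typology, []).append((seq, corr_set))
    return located


# Toy data (hypothetical values, only to illustrate the assumed shapes)
toy_corrs = {
    "plastic_ex_corrs": {"IS26": ["sul2", "tetX"]},
    "control_ex_corrs": {"IS26": ["tetS"]},
}
toy_mges = {"ps_ew": ["IS26"], "pe_sw": ["IS91"]}
toy_args = {"ps_ew": ["sul2"], "pe_sw": ["tetS", "mphA"]}

mge_hits = locate_correlated(toy_mges, toy_corrs, kind="mge")
arg_hits = locate_correlated(toy_args, toy_corrs, kind="arg")
print(mge_hits)
print(arg_hits)
```

Typologies with no correlated sequence are simply left out of the result, which makes it easy to see at a glance which ones contain any strongly correlated element.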


In [1]:
import json
import pandas as pd

In [2]:
with open("../data/resistome_data/metadata/arg_per_typology.json") as json_file:
    arg_per_typology = json.load(json_file)
with open("../data/resistome_data/metadata/mge_per_typology.json") as json_file:
    mge_per_typology = json.load(json_file)
with open("../data/resistome_data/metadata/shared_different_seqs_per_place.json") as json_file:
    diff_seqs = json.load(json_file)
with open("../data/resistome_data/metadata/correlations_per_typology.json") as json_file:
    corrs = json.load(json_file)


In [3]:
# Build a dictionary with the same structure as mge_per_typology (one entry per typology),
# keeping only the MGEs that are strongly correlated.
# Each hit is stored together with the correlation set in which it was found.
everything_mge = {}
for category, mges in mge_per_typology.items():
    for mge in mges:
        for corr_typology, correlated_mges in corrs.items():
            if mge in correlated_mges:
                everything_mge.setdefault(category, []).append((mge, corr_typology))

everything_mge

{'pe_sw': [('IS91', 'plastic_ex_corrs'),
  ('dfrA1', 'plastic_ex_corrs'),
  ('cro', 'plastic_ex_corrs'),
  ('IS200a', 'plastic_ex_corrs'),
  ('tnpAc', 'plastic_ex_corrs'),
  ('tnpAb', 'plastic_ex_corrs'),
  ('TN5403', 'plastic_ex_corrs'),
  ('IncW_trwAB', 'plastic_ex_corrs'),
  ('IS6100', 'plastic_ex_corrs'),
  ('EAE_05855', 'plastic_ex_corrs'),
  ('dfrA12', 'plastic_ex_corrs'),
  ('ISPps1-pseud', 'plastic_ex_corrs'),
  ('Tn3', 'plastic_ex_corrs')],
 'ps_ew': [('IS26', 'plastic_ex_corrs'),
  ('IS26', 'control_ex_corrs'),
  ('IS26', 'shared_corrs')],
 'es_pw': [('mobA', 'plastic_ex_corrs')],
 'pes_w': [('IS630', 'plastic_ex_corrs'),
  ('ISEcp1', 'plastic_ex_corrs'),
  ('IS613', 'plastic_ex_corrs'),
  ('IS21-ISAs29', 'plastic_ex_corrs'),
  ('fabK', 'plastic_ex_corrs'),
  ('fabK', 'control_ex_corrs'),
  ('fabK', 'shared_corrs'),
  ('IS1247', 'plastic_ex_corrs'),
  ('orf39-IS26', 'plastic_ex_corrs'),
  ('IS256', 'plastic_ex_corrs'),
  ('ARR-3', 'plastic_ex_corrs'),
  ('ARR-3', 'control_ex_

Before continuing: it seems that every MGE except ISSm2-Xanthob either appears only in the plastic correlations or appears in all three sets at once (plastic, control and shared). ISSm2-Xanthob appears only in the plastic and control sets. I'm going to take a look to get an idea of what could be going on under each MGE.
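
Rather than reading the pattern off the output by eye, it could be checked by grouping the MGEs by the combination of correlation sets in which they appear. The following is a minimal sketch under the assumption that the input has the same shape as `everything_mge`; the helper names `corr_sets_per_mge` and `summarize_patterns` are hypothetical, and the toy input is a small subset copied from the output above.

```python
from collections import defaultdict


def corr_sets_per_mge(located):
    """Invert {typology: [(mge, corr_set), ...]} into {mge: set of corr_sets}."""
    sets_per_mge = defaultdict(set)
    for hits in located.values():
        for mge, corr_set in hits:
            sets_per_mge[mge].add(corr_set)
    return dict(sets_per_mge)


def summarize_patterns(sets_per_mge):
    """Group MGEs by the exact combination of correlation sets they appear in."""
    patterns = defaultdict(list)
    for mge, corr_sets in sets_per_mge.items():
        patterns[tuple(sorted(corr_sets))].append(mge)
    return {pattern: sorted(mges) for pattern, mges in patterns.items()}


# Toy input in the shape of `everything_mge` (hand-picked subset)
toy_located = {
    "ps_ew": [
        ("IS26", "plastic_ex_corrs"),
        ("IS26", "control_ex_corrs"),
        ("IS26", "shared_corrs"),
    ],
    "es_pw": [("mobA", "plastic_ex_corrs")],
}

patterns = summarize_patterns(corr_sets_per_mge(toy_located))
for pattern, mges in patterns.items():
    print(pattern, mges)
```

Running `summarize_patterns(corr_sets_per_mge(everything_mge))` on the real data would list any MGE whose combination differs from "plastic only" or "all three sets", which is where ISSm2-Xanthob should show up.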

In [4]:
print("######## ISSm2-Xanthob ########")
print(corrs["plastic_ex_corrs"]["ISSm2-Xanthob"])
print(corrs["control_ex_corrs"]["ISSm2-Xanthob"])
print("")
print("######## ISAba3-Acineto ########")
print(corrs["plastic_ex_corrs"]["ISAba3-Acineto"])
print(corrs["control_ex_corrs"]["ISAba3-Acineto"])
print(corrs["shared_corrs"]["ISAba3-Acineto"])

######## ISSm2-Xanthob ########
['qnrB-bob_resign', 'mphA', 'pica', "aac(6')-Ib", 'blaPSE']
['catA1', 'cefa_qacelta', 'tetS', 'QnrS1_S3_S5', 'qnrS2', 'lsa(C)', 'aadD', 'erm(B)', 'ermX', 'vanRB', 'blaOCH', 'erm(Q)', 'pikR2', 'lnu(F)', 'catA3', 'dfrA10', 'apmA', 'blaCTX-M', 'QnrB4', 'catB9', 'vanC2/vanC3', 'vgaB', 'aph_viii', 'merA-marko', 'terW', 'vanTG', 'bla-SME', 'tetX', 'blaHERA', 'sul2', 'aadA_99', 'blaIMI', 'msr(A)', 'vanXA', 'qacF/Ha', 'vga(A)LC', 'aadA5', 'aadE', 'tetH', 'vat(E)', 'erm(F)b', 'cfxA', 'ampC/blaDHA', 'blaGES', 'aac(6)-is_iu_ix', 'tetPB', 'blaLEN', 'erm(F)a', 'dfra21', 'tcrB', 'vanWB', 'mefA', 'vanA', 'ant6-ib', 'ermA/ermTR', 'erm(A)', 'catP', 'ermY', 'erm(D)', 'bl1acc', 'vanRD', 'aac(6)-iw', 'lnuB', 'mtrD', 'tet39', 'aadA2b', 'sulA/folP']

######## ISAba3-Acineto ########
['strB', 'vanSB', 'mdth', 'cmlA1', 'aac(3)-Xab', 'lsa(C)', 'QnrVC1_VC3_VC6', 'sugE', 'aadA10', 'dfrA10', 'aac(6)-ij', 'vgaB', 'aph_viii', 'terW', 'aac(3)-xaa', 'bla-L1', 'aadA17', 'aac(6)-iv_ih', 
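
The ARG lists printed above are long enough that comparing them by eye is error-prone. A set-based comparison is one way to see which ARGs are specific to one correlation set for a given MGE. This is a minimal sketch, assuming `corrs` has the `{correlation_set: {MGE: [ARG, ...]}}` shape seen in this notebook; the function name `compare_arg_sets`, the MGE name `ISexample` and the toy ARG lists are hypothetical.

```python
def compare_arg_sets(corrs, mge):
    """For one MGE, split the ARGs of each correlation set into those found
    only in that set and those also found in at least one other set."""
    per_set = {name: set(pairs.get(mge, [])) for name, pairs in corrs.items()}
    report = {}
    for name, args in per_set.items():
        # Union of the ARGs correlated with this MGE in every other set
        others = set().union(*(a for n, a in per_set.items() if n != name))
        report[name] = {
            "only_here": sorted(args - others),
            "also_elsewhere": sorted(args & others),
        }
    return report


# Toy data (hypothetical values, only to illustrate the comparison)
toy_corrs = {
    "plastic_ex_corrs": {"ISexample": ["mphA", "sul2"]},
    "control_ex_corrs": {"ISexample": ["sul2", "tetS"]},
    "shared_corrs": {},
}

report = compare_arg_sets(toy_corrs, "ISexample")
for name, parts in report.items():
    print(name, parts)
```

Whether any overlap is expected depends on how the three correlation sets were built upstream; the sketch makes no assumption about that and simply reports what it finds.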

In [6]:
corrs

{'plastic_ex_corrs': {'tnpAa': ['blaCTX-M',
   'catB9',
   'sul2',
   'qepA_1_2',
   'ant6-ia',
   'tetR',
   'tetQ',
   'catB3',
   'tet44',
   'blaOXA10',
   'qnrB-bob_resign',
   "aac(6')-Ib",
   'tetC',
   'erm(S)',
   'blaGES',
   'strA',
   'cat',
   'aadA2a',
   'blaMOX/blaCMY',
   'dfrA15',
   'tetH',
   'cefa_ampc',
   'mefA',
   'strB',
   'aadB',
   'mdth',
   'vanC',
   'aadA_99',
   'vat(E)',
   'bla-SME',
   'mef(B)',
   'mphA',
   'acrF',
   'sulA/folP',
   'ttgB',
   'pica',
   'blaOCH',
   'blaIMI',
   'sugE',
   'vgaB',
   'cfiA',
   'bla1',
   'blaPSE',
   'pbp',
   'erm(35)',
   'aadD',
   'dfrA25',
   'bexA/norM'],
  'IS3': ['cmlA5',
   'tetA',
   'blaVIM',
   "aph(3'')-ia",
   'strB',
   'norA',
   'vanSB',
   'mdth',
   'cmlA1',
   'aac(3)-Xab',
   'lsa(C)',
   'vanC',
   'QnrVC1_VC3_VC6',
   'dfrA10',
   'aac(6)-ij',
   'aph_viii',
   'merA-marko',
   'terW',
   'dfrBmulti',
   'aac(3)-xaa',
   'bla-L1',
   'aadA17',
   'vatB',
   'arsA',
   'blaPAO/PDC',
   'me