## Create interactive plots for the entire E protein
To do this, we will concatenate the relevant datasets that have been generated for each tile. This means the per-tile analysis must be run before this analysis can be completed. 

First, we will concatenate the 'host_adapt' tables for each tile, which include the average mutation effect and differential selection data for each site in both our Huh-7.5-selected and C6-36-selected conditions. Then we will re-plot in Altair and regenerate a list of the most interesting mutations for each selection condition.

In [1]:
# import necessary Python modules and packages
import glob
import os
import subprocess
import shutil

import Bio.SeqIO

import dms_tools2
from dms_tools2 import AAS
from dms_tools2.ipython_utils import showPDF
from dms_tools2.plot import COLOR_BLIND_PALETTE_GRAY as CBPALETTE
import dms_tools2.prefs
import dms_tools2.utils
print(f"Using dms_tools2 {dms_tools2.__version__}")

from IPython.display import display, HTML

import pandas as pd

import altair as alt
from plotnine import *

import numpy

import dms_variants.plotnine_themes

Using dms_tools2 2.6.10


Disable the maximum row limit in Altair. The default limit was causing a bug in the chart generation step.

In [2]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')
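
For context, Altair (at least in the 4.x releases) by default refuses to embed a dataset with more than 5,000 rows and raises a `MaxRowsError`. The last tile alone contributes 2,280 rows (its index runs to 2279 in the printout further down), so eight tiles together are likely well above that limit. The following is a minimal sketch of checking this before plotting. The names `ALTAIR_DEFAULT_MAX_ROWS` and `exceeds_altair_default` are hypothetical helpers introduced here, and the toy dataframe only stands in for the real concatenated table, assuming all tiles are about the same size.

```python
import pandas as pd

# Assumption: 5000 is Altair's default row limit (true for the 4.x releases).
ALTAIR_DEFAULT_MAX_ROWS = 5000

def exceeds_altair_default(df, limit=ALTAIR_DEFAULT_MAX_ROWS):
    """Return True if `df` has more rows than Altair embeds by default."""
    return len(df) > limit

# toy stand-in for the concatenated table: 8 tiles of 2280 rows each
toy = pd.DataFrame({'site': range(8 * 2280)})
print(len(toy), exceeds_altair_default(toy))
```

Disabling the limit embeds every row in the saved HTML file, which makes the file larger but keeps the chart self-contained.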

Create a dictionary of pandas dataframes, one for the host_adaptation.csv file in each tile's results folder, and concatenate them.

In [3]:
# create a pandas dataframe for each tile's 'host_adaptation.csv'
results = './results/'

d = {}

tile_list = ['tile_1', 'tile_2', 'tile_3', 'tile_4', 
             'tile_5', 'tile_6', 'tile_7', 'tile_8']

for tile in tile_list:
    # pass the path components separately so os.path.join builds the path
    tilepath = os.path.join(results, tile, 'host_adaptation', 'host_adaptation.csv')
    d[tile] = pd.read_csv(tilepath)

# concatenate the per-tile dataframes in tile order
alltiles_hostadapt = pd.concat([d[tile] for tile in tile_list])
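
Because this step depends on the per-tile analysis having been run, it can help to fail early with a clear message when a tile's file is missing. The sketch below is one possible way to do that, not part of the original pipeline: `load_tiles` is a hypothetical helper, and the demo writes two tiny fake tiles to a temporary directory rather than reading the real results.

```python
import os
import tempfile

import pandas as pd

def load_tiles(results_dir, tiles):
    """Read each tile's host_adaptation.csv and concatenate them in order."""
    paths = {tile: os.path.join(results_dir, tile, 'host_adaptation',
                                'host_adaptation.csv')
             for tile in tiles}
    # report every missing tile at once instead of failing on the first
    missing = [tile for tile, path in paths.items() if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError(
            f"run the per-tile analysis first; no host_adaptation.csv for: {missing}")
    return pd.concat([pd.read_csv(paths[tile]) for tile in tiles])

# demo on a throwaway directory holding two tiny fake tiles
with tempfile.TemporaryDirectory() as tmp:
    for tile, site in [('tile_1', 0), ('tile_2', 1)]:
        tiledir = os.path.join(tmp, tile, 'host_adaptation')
        os.makedirs(tiledir)
        pd.DataFrame({'site': [site], 'mutation': [f'A{site}C']}).to_csv(
            os.path.join(tiledir, 'host_adaptation.csv'), index=False)
    demo = load_tiles(tmp, ['tile_1', 'tile_2'])
    try:
        load_tiles(tmp, ['tile_1', 'tile_3'])  # tile_3 was never written
    except FileNotFoundError as err:
        demo_error = str(err)

print(demo)
print(demo_error)
```

In the notebook itself the call would look like `load_tiles('./results', tile_list)`.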

Now we can save the concatenated file in a new results folder. 

In [4]:
# create 'all_tiles' folder within results folder
alltiles_dir = './results/all_tiles'
os.makedirs(alltiles_dir, exist_ok=True)

# save concatenated dataframe as 'alltiles_host_adaptation.csv'
alltiles_file = os.path.join(alltiles_dir, 'alltiles_host_adaptation.csv')
alltiles_hostadapt.to_csv(alltiles_file, index=False)
print('Saving concatenated data to "results/all_tiles/" folder. Here are first few lines...')
print(alltiles_hostadapt)

Saving concatenated data to "results/all_tiles/" folder. Here are first few lines...
      site wildtype mutant mutation  muteffect_C636  muteffect_Huh75  \
0        0        R      A      R0A         -5.4935          -5.4756   
1        0        R      C      R0C         -6.5642          -6.5772   
2        0        R      D      R0D         -4.2448          -4.2537   
3        0        R      E      R0E         -4.6242          -4.6174   
4        0        R      F      R0F         -5.3429          -5.3218   
...    ...      ...    ...      ...             ...              ...   
2275   903        L      S    L903S         -4.0946          -8.0519   
2276   903        L      T    L903T         -7.3497          -7.3560   
2277   903        L      V    L903V         -6.9141          -6.7634   
2278   903        L      W    L903W         -4.2839          -3.6811   
2279   903        L      Y    L903Y         -5.8816          -5.8909   

      foldchange_C636  foldchange_Huh75  diffsel_H
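
Before plotting, it may be worth confirming that concatenation did not introduce repeated entries. The source does not say whether adjacent tiles overlap, so the check below rests on the assumption that each `mutation` label should appear exactly once in the combined table. `duplicated_mutations` is a hypothetical helper and the toy dataframe is made up for illustration.

```python
import pandas as pd

def duplicated_mutations(df):
    """Return all rows whose 'mutation' label occurs more than once."""
    return df[df.duplicated('mutation', keep=False)]

# toy table in which R0A appears twice
toy = pd.DataFrame({'site': [0, 0, 0],
                    'mutation': ['R0A', 'R0C', 'R0A']})
dups = duplicated_mutations(toy)
print(dups)
```

An empty result from `duplicated_mutations(alltiles_hostadapt)` would indicate that every mutation is listed once.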

Now we can produce Altair charts for the entire E gene, as we did in the per-tile analysis.

In [5]:
# select point nearest mouse
nearest = alt.selection(type='single', empty='none', nearest=True, on='mouseover')

# create the basic chart
basechart = (
 alt.Chart(alltiles_hostadapt
           .rename(columns={'muteffect_C636': 'effect C636',
                            'muteffect_Huh75': 'effect Huh75',
                            'diffsel_Huh75_vs_C636': 'Huh75 vs C636',
                            })
           .assign(dummy=0)
           )
 .add_selection(nearest)
 .encode(fill=alt.condition(nearest, alt.value('orange'), alt.value('gray')),
         opacity=alt.condition(nearest, alt.value(1), alt.value(0.4)),
         tooltip=['mutation', 'effect C636', 'effect Huh75', 'Huh75 vs C636'],
         )
 .interactive()
 )

# side-by-side interactive plots to select mutations
chart = (
 basechart.encode(x='effect C636:Q',
              y='effect Huh75:Q'
              )
      .mark_point()
      .properties(width=500,
                  height=500)
 |
 basechart.encode(x=alt.X('dummy:O', title=None),
              y='Huh75 vs C636:Q',           
              )
      .properties(width=50,
                  height=500)
      .mark_tick()
 )

# save the interactive plot
plotfile = os.path.join(alltiles_dir, 'select_muts_chart.html')
print(f"Saving interactive plot to {plotfile}")
chart.save(plotfile)

# show the chart
chart

Saving interactive plot to ./results/all_tiles/select_muts_chart.html


The interactive plots above make it easy to identify host-specific mutations.

As mentioned above, Huh-7.5-specific mutations will:
  - have *effect Huh75* $> 0$ in the scatter plot at left (be favorable in Huh-7.5 cells)
  - have *effect C636* $< 0$ in the scatter plot at left (be unfavorable in C636 cells)
  - have *Huh75 vs C636* $> 0$ in the strip chart at right (be favored in Huh-7.5 over C636)

The C636-specific mutations will:
  - have *effect Huh75* $< 0$ in the scatter plot at left (be unfavorable in Huh-7.5 cells)
  - have *effect C636* $> 0$ in the scatter plot at left (be favorable in C636 cells)
  - have *Huh75 vs C636* $< 0$ in the strip chart at right (be favored in C636 over Huh-7.5)
  
You can use the mouse to hover over marks and they will turn orange in both the scatter plot and the strip chart, and a box will appear giving detailed information on the mutations.
You can also use the mouse scroll wheel to zoom in and out.

*Note: the plot will only render interactively in the Jupyter notebook itself! If you are viewing an HTML rendering of the notebook, the plot will be static. In that case, open the interactive plot saved to the HTML file above separately.*

The best way to pick mutations will be to look at the charts above, but below we also simply list what appear to be some of the top candidates in tabular form using simple criteria.

In [6]:
print("The top Huh-7.5-specific mutations appear to be...")
display(HTML(
    alltiles_hostadapt
    .query('muteffect_Huh75 > 0')
    .sort_values('diffsel_Huh75_vs_C636', ascending=False)
    .head(n=20)
    .to_html(index=False)
    ))

print("The top C6-36-specific mutations appear to be...")
display(HTML(
    alltiles_hostadapt
    .query('muteffect_C636 > 0')
    .sort_values('diffsel_Huh75_vs_C636', ascending=True)
    .head(n=20)
    .to_html(index=False)
    ))

The top Huh-7.5-specific mutations appear to be...


| site | wildtype | mutant | mutation | muteffect_C636 | muteffect_Huh75 | foldchange_C636 | foldchange_Huh75 | diffsel_Huh75_vs_C636 |
|---|---|---|---|---|---|---|---|---|
| 697 | K | E | K697E | -2.4416 | 1.1264 | 0.1841 | 2.1831 | 3.1098 |
| 697 | K | D | K697D | -4.2018 | 1.2251 | 0.0543 | 2.3377 | 2.9749 |
| 22 | L | A | L22A | -2.4252 | 0.4004 | 0.1862 | 1.3199 | 2.8743 |
| 22 | L | S | L22S | -2.9729 | 0.334 | 0.1274 | 1.2605 | 2.828 |
| 22 | L | G | L22G | -3.2191 | 0.4154 | 0.1074 | 1.3337 | 2.7068 |
| 697 | K | L | K697L | -1.865 | 0.4207 | 0.2745 | 1.3386 | 2.5092 |
| 257 | V | T | V257T | -2.0889 | 0.5046 | 0.2351 | 1.4187 | 2.4418 |
| 697 | K | A | K697A | 0.1298 | 2.4851 | 1.0941 | 5.5987 | 2.4296 |
| 697 | K | M | K697M | -1.364 | 1.2066 | 0.3885 | 2.3079 | 2.4174 |
| 697 | K | S | K697S | -0.0674 | 2.4563 | 0.9544 | 5.4881 | 2.4104 |


The top C6-36-specific mutations appear to be...


| site | wildtype | mutant | mutation | muteffect_C636 | muteffect_Huh75 | foldchange_C636 | foldchange_Huh75 | diffsel_Huh75_vs_C636 |
|---|---|---|---|---|---|---|---|---|
| 513 | L | T | L513T | 0.1821 | -3.0962 | 1.1345 | 0.1169 | -2.6612 |
| 572 | Y | Q | Y572Q | 0.9595 | -1.9272 | 1.9446 | 0.2629 | -2.4337 |
| 299 | H | S | H299S | 0.0767 | -3.5799 | 1.0546 | 0.0836 | -2.3292 |
| 171 | W | Y | W171Y | 0.7521 | -1.4 | 1.6842 | 0.3789 | -2.0508 |
| 581 | V | T | V581T | 0.0505 | -2.1184 | 1.0356 | 0.2303 | -1.9434 |
| 4 | T | F | T4F | 0.5062 | -1.5107 | 1.4203 | 0.3509 | -1.8939 |
| 142 | T | V | T142V | 0.4554 | -1.3951 | 1.3712 | 0.3802 | -1.6768 |
| 787 | V | D | V787D | 0.395 | -2.284 | 1.3149 | 0.2053 | -1.6099 |
| 27 | Y | F | Y27F | 1.1971 | -0.726 | 2.2928 | 0.6046 | -1.5244 |
| 171 | W | F | W171F | 0.0215 | -1.8953 | 1.015 | 0.2688 | -1.5228 |
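
The tables above filter on only one mutation effect each, so a mutation such as K697A appears in the Huh-7.5 list even though its effect in C636 cells is slightly positive. The sketch below shows one way to apply all three criteria listed earlier at once. It is an illustration, not the notebook's method: `host_specific` is a hypothetical helper, and the toy dataframe holds three rows copied from the tables above.

```python
import pandas as pd

DIFFSEL = 'diffsel_Huh75_vs_C636'

def host_specific(df, host):
    """Rows meeting all three criteria for `host` ('Huh75' or 'C636')."""
    if host == 'Huh75':
        query = f'muteffect_Huh75 > 0 and muteffect_C636 < 0 and {DIFFSEL} > 0'
    elif host == 'C636':
        query = f'muteffect_C636 > 0 and muteffect_Huh75 < 0 and {DIFFSEL} < 0'
    else:
        raise ValueError(f"unknown host: {host}")
    # strongest differential selection first:
    # most positive for Huh75, most negative for C636
    return df.query(query).sort_values(DIFFSEL, ascending=(host == 'C636'))

# three rows copied from the tables above
toy = pd.DataFrame({
    'mutation': ['K697E', 'K697A', 'L513T'],
    'muteffect_C636': [-2.4416, 0.1298, 0.1821],
    'muteffect_Huh75': [1.1264, 2.4851, -3.0962],
    DIFFSEL: [3.1098, 2.4296, -2.6612],
})

huh75_hits = host_specific(toy, 'Huh75')['mutation'].tolist()
c636_hits = host_specific(toy, 'C636')['mutation'].tolist()
print(huh75_hits)  # ['K697E']
print(c636_hits)   # ['L513T']
```

The stricter filter trades recall for specificity: K697A drops out of the Huh-7.5 list despite its large differential selection, so the interactive chart remains the better tool for borderline cases.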
