In [2]:
%load_ext autoreload
%autoreload 2
from src.setup import *
from src import common

# Results


In [3]:
df_results = pd.read_csv(PATH_GWAS_FINAL_RESULTS, sep='\t', index_col=0)
df_results

| ID | #CHROM | POS | REF | ALT | A1 | TEST | OBS_CT | OR | LOG(OR)_SE | Z_STAT | P | AA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs2296651 | 14 | 70245193 | G | A | A | ADD | 434 | 48.1416 | 0.560362 | 6.91365 | 4.72326e-12 | S_35_R |
| rs9397998 | 6 | 157488340 | C | T | T | ADD | 431 | 1711340000.0 | 2.89168 | 7.35232 | 1.94792e-13 | Pol_584_T |
| rs70944751 | 6 | 29911857 | G | T | T | ADD | 430 | 0.0400096 | 0.490586 | -6.5608 | 5.352e-11 | PC_160_A |
| rs376806238 | 6 | 29912395 | T | TGG | TGG | ADD | 413 | 0.00555607 | 0.793047 | -6.54799 | 5.83178e-11 | PC_160_A |
| rs2735101 | 6 | 29913001 | T | C | C | ADD | 429 | 0.0125832 | 0.669145 | -6.53878 | 6.20211e-11 | PC_160_A |


* rs2296651 <-> S_35_R: carrying the A allele (A1, the tested allele) at that position makes you more likely (OR > 1) to be infected by a virus that has the S_35_R variant.
* rs9397998 <-> Pol_584_T: carrying the T allele (A1, the tested allele) at that position makes you much more likely (OR >> 1) to be infected by a virus that has the Pol_584_T variant.
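
To make the odds ratios above easier to read, here is a minimal sketch that recomputes the Wald z-statistic and an approximate 95% confidence interval from the `OR` and `LOG(OR)_SE` columns. The function name `summarize_odds_ratio` and the normal approximation (critical value 1.96) are assumptions made for illustration; this is not necessarily how the results file was produced.

```python
import math

def summarize_odds_ratio(odds_ratio, log_or_se, z_crit=1.96):
    """Return the Wald z-statistic and an approximate 95% CI for an odds ratio."""
    log_or = math.log(odds_ratio)
    z_stat = log_or / log_or_se
    ci_low = math.exp(log_or - z_crit * log_or_se)
    ci_high = math.exp(log_or + z_crit * log_or_se)
    return z_stat, ci_low, ci_high

# Values copied from the results table (OR, LOG(OR)_SE)
for snp, odds_ratio, se in [("rs2296651", 48.1416, 0.560362),
                            ("rs9397998", 1.71134e9, 2.89168)]:
    z, low, high = summarize_odds_ratio(odds_ratio, se)
    print(f"{snp}: z = {z:.3f}, 95% CI = [{low:.3g}, {high:.3g}]")
```

The recomputed z values should agree with the `Z_STAT` column. For rs9397998 the interval spans several orders of magnitude, which suggests that this odds ratio is poorly constrained and should be read with caution.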


# Protein sizes

First, look at the positions and variants of each gene in the **raw viral data**.

In [8]:
df_raw_viral = pd.read_csv(PATH_VIRAL_RAW_DATA, sep='\t')

In [31]:
# Collect the columns of each gene by prefix (columns are assumed to be sorted by position)
list_gene_S = [ item for item in df_raw_viral.columns.values if item.startswith('gene_S') ]
print("The last position of gene S is", list_gene_S[-1])
list_gene_C = [ item for item in df_raw_viral.columns.values if item.startswith('gene_PC') ]
print("The last position of gene C is", list_gene_C[-3:])
list_gene_Pol = [ item for item in df_raw_viral.columns.values if item.startswith('gene_Pol') ]
print("The last position of gene Pol is", list_gene_Pol[-1])
list_gene_X = [ item for item in df_raw_viral.columns.values if item.startswith('gene_X') ]
print("The last position of gene X is", list_gene_X[-3:])

The last position of gene S is gene_S_pos_0401_STOP
The last position of gene C is ['gene_PC_C_pos_0213_STOP', 'gene_PC_C_pos_0213_W', 'gene_PC_C_pos_0213_Y']
The last position of gene Pol is gene_Pol_pos_0844_STOP
The last position of gene X is ['gene_X_pos_0154_A', 'gene_X_pos_0155_STOP', 'gene_X_pos_0155_W']


From the [SnapGene file of HBV GT-C](res/HBV_GT_C.dna), which contains the reference genome described [in this notebook](tutorial/HBV%20genome.ipynb#Reference-genome):

* Gene S: STOP codon at 401
* Gene Pol: STOP codon at 844
* Gene X: STOP at 155
* Gene C: STOP at 184, but Pre-C is 29aa -> Total size of protein core is 213

All those observations are consistent.
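
The consistency check above can also be done programmatically. The sketch below assumes that column names follow the pattern `gene_<name>_pos_<NNNN>_<variant>`, as seen in the printed output; the helper `last_position` is hypothetical and not part of `src`.

```python
import re

def last_position(columns, prefix):
    """Return the highest amino-acid position among columns starting with `prefix`."""
    pattern = re.compile(re.escape(prefix) + r"_pos_(\d+)_")
    positions = [int(m.group(1)) for m in map(pattern.match, columns) if m]
    return max(positions)  # raises ValueError if no column matches

# A few column names copied from the printed output
columns = [
    "gene_S_pos_0401_STOP",
    "gene_PC_C_pos_0213_STOP", "gene_PC_C_pos_0213_W", "gene_PC_C_pos_0213_Y",
    "gene_Pol_pos_0844_STOP",
    "gene_X_pos_0154_A", "gene_X_pos_0155_STOP", "gene_X_pos_0155_W",
]

# Expected sizes from the reference genome; core = 184 + 29 pre-core residues
expected = {"gene_S": 401, "gene_PC_C": 184 + 29, "gene_Pol": 844, "gene_X": 155}
for prefix, size in expected.items():
    assert last_position(columns, prefix) == size
```

On the real table, `columns` would be `df_raw_viral.columns`. Taking the maximum rather than the last element avoids relying on the column order.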

# Compare viral reported GT and data

We merge clinical data (GT) with viral data (whole table) and keep individuals used in the study.

In [28]:
# 1. Load clinical data
with open(PATH_CLINICAL_DATA, 'rb') as file:
    df_clinical = pickle.load(file)[['gilead_id', 'GT']]
# 2. Load viral data
with open(PATH_VIRAL_DATA, 'rb') as file:
    df_viral = pickle.load(file)

In [29]:
# 3. Join the two tables
df_clinical.set_index('gilead_id', inplace=True)
df_clinical.columns = pd.MultiIndex.from_product([['GT'], [''], ['']])
df = df_viral.join(other=df_clinical, on='id')

In [30]:
# 4. Load the individuals used in the study
list_inds = pd.read_csv(PATH_ASIANS_GWAS+'.fam', sep='\t', header=None)[0].values
# 5. Map the ids IGM->GS
list_inds = common.map_ids(list_inds)
# 6. Filter out the individuals that were not in the study
df = df[ df[('id', '', '')].isin(list_inds) ]
df.shape

(435, 746)

###### S_35_R

We know that residue 35 in gene S of HBV genotype C should be a **G** ([see here](tutorial/HBV%20genome.ipynb#Reference-genome)).

In [135]:
# Select a portion of the table
ids = pd.IndexSlice
S_35 = df.loc[:, ids[['S', 'GT'], [35, '']]]
S_35['S'].sum()

pos  variant
35   G          309.0
     K          100.0
     R           50.0
dtype: float64

In [147]:
S_35_GT_C = S_35[S_35.GT == 'C']
print(S_35_GT_C.shape[0], "individuals have genotype C")
S_35_GT_C[('S', 35)].sum()

314 individuals have genotype C


variant
G    298.0
K      1.0
R     38.0
dtype: float64

Among the individuals assigned to genotype C, only 298 out of 314 have the consensus amino acid G. Moreover, some individuals carry at least two variants at that position (S_35), since the counts above sum to 337, which is greater than 314. This tells us that we shouldn't rely too much on the assigned viral genotypes.
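
To illustrate why the variant counts can exceed the number of individuals, here is a small sketch on hypothetical toy data (not taken from the study). It assumes that each individual can carry several variants at the same position, for example because of a mixed viral population, and that each variant is counted once per carrier. The helper `variant_summary` is made up for this example.

```python
from collections import Counter

def variant_summary(calls):
    """`calls` maps each individual to the set of variants observed at one position."""
    totals = Counter(v for variants in calls.values() for v in variants)
    mixed = sorted(ind for ind, variants in calls.items() if len(variants) > 1)
    return totals, mixed

# Hypothetical toy data, not taken from the study
toy_calls = {
    "ind1": {"G"},
    "ind2": {"G", "R"},   # carries two variants at the same position
    "ind3": {"R"},
    "ind4": {"G"},
}
totals, mixed = variant_summary(toy_calls)
print(dict(totals), "for", len(toy_calls), "individuals; mixed:", mixed)
```

Here the counts sum to 5 for only 4 individuals, because `ind2` is counted under both G and R.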

# rs2296651 <-> S_35_R

In [15]:
df_results.loc['rs2296651'] 

#CHROM                 14
POS              70245193
REF                     G
ALT                     A
A1                      A
TEST                  ADD
OBS_CT                434
OR                48.1416
LOG(OR)_SE       0.560362
Z_STAT            6.91365
P             4.72326e-12
AA                 S_35_R
Name: rs2296651, dtype: object

dbSNP: https://www.ncbi.nlm.nih.gov/snp/rs2296651, G->A, causes Ser267Phe, missense mutation. 

## Gene 

[SLC10A1](https://www.ncbi.nlm.nih.gov/gene/6554). Expression in the **liver**, sodium/bile acid cotransporter.

> The protein encoded by this gene belongs to the sodium/bile acid cotransporter family, which are integral membrane glycoproteins that participate in the enterohepatic circulation of bile acids.

###### Function

[virus receptor activity](http://amigo.geneontology.org/amigo/term/GO:0001618): Combining with a virus component and mediating entry of the virus into the cell. [Definition](https://www.uniprot.org/keywords/KW-1183):

> Cell surface protein used by a virus as an attachment and entry receptor. In some cases, binding to a cellular receptor is not sufficient for infection: an additional cell surface molecule, or coreceptor, is required for entry. Some viruses are able to use different receptors depending on the target cell type.

## Articles

[SLC10A1 S267F variant influences susceptibility to HBV infection and reduces cholesterol level by impairing bile acid uptake](https://onlinelibrary.wiley.com/doi/full/10.1111/jvh.13157)

> The SLC10A1 Ser267Phe (S267F) variant has been reported to severely inhibit hepatitis B virus (HBV) infection and taurocholate transport activity. 

[The p.Ser267Phe Variant in SLC10A1 Is Associated With Resistance to Chronic Hepatitis B](https://www.researchgate.net/publication/269287835_The_pSer267Phe_Variant_in_SLC10A1_Is_Associated_With_Resistance_to_Chronic_Hepatitis_B)

[Diverse Effects of the NTCP p.Ser267Phe Variant on Disease Progression During Chronic HBV Infection and on HBV preS1 Variability](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6407604/)

> The HBV virus comprises an external envelope composed of surface glycoproteins, an icosahedral nucleocapsid, and a 3.2 kb partially double-stranded DNA genome.
> 
> Recently, it was discovered that HBV entry into human hepatocytes is mediated by the receptor sodium taurocholate co-transporting polypeptide (NTCP) expressed by the host (Yan et al., 2012; Ni et al., 2014). **The preS1 domain of large envelope proteins is responsible for its binding with NTCP and involved in virus–host receptor interaction** (Barrera et al., 2005; Glebe et al., 2005; Yan et al., 2012).
> 
> Most SLC10A1 SNPs have distributions related to ethnicity, and the non-synonymous mutation that encodes the p.Ser267Phe variant (S267F, c.800 G>A, rs2296651) is **specific to Asian patient populations**

## Interpretation



Carrying the Ser267Phe mutation in gene SLC10A1 makes the S_35_R viral variant much more likely (OR > 1). The Ser267Phe mutation severely inhibits HBV infection (it makes you much less susceptible to infection), so the S_35_R variant presumably has some escape function.

## Re-run association analyses by genotypes

We want to re-run a G2G analysis for the variant S_35_R among Asian individuals infected by the same HBV genotype.



In [26]:
common.write_phenotypes(fam=PATH_ASIANS_GWAS, phenotype=('S', 35, 'R'), criteria=('GT', 'C'),
                       output_path=PATH_WORKING_PHENOTYPES)
host.run_gwas(path_phenotypes=PATH_WORKING_PHENOTYPES)

313 individuals written to 'data/working_pheno'
121 were filtered out based on the criteria ('GT', 'C')
The phenotype '('S', 35, 'R')' were included from viral data.


# rs9397998 <-> Pol_584_T

In [17]:
df_results.loc['rs9397998']

#CHROM                  6
POS             157488340
REF                     C
ALT                     T
A1                      T
TEST                  ADD
OBS_CT                431
OR            1.71134e+09
LOG(OR)_SE        2.89168
Z_STAT            7.35232
P             1.94792e-13
AA              Pol_584_T
Name: rs9397998, dtype: object

dbSNP: https://www.ncbi.nlm.nih.gov/snp/rs9397998, C>T, intron variant.

## Gene

[ARID1B AT-rich interaction domain 1B [ Homo sapiens (human) ] ](https://www.ncbi.nlm.nih.gov/gene/57492), expressed in all tissues (but not much in liver).
> This locus encodes an AT-rich DNA interacting domain-containing protein. The encoded protein is a component of the SWI/SNF chromatin remodeling complex and may play a role in cell-cycle activation. The protein encoded by this locus is similar to AT-rich interactive domain-containing protein 1A. These two proteins function as alternative, mutually exclusive ARID-subunits of the SWI/SNF complex. The associated complexes play opposing roles. Alternative splicing results in multiple transcript variants. 

## Articles

[Genetic basis of hepatitis virus-associated hepatocellular carcinoma: linkage between infection, inflammation, and tumorigenesis](https://link.springer.com/article/10.1007/s00535-016-1273-2):
> HCC tissues contain mutations of genes essential for maintaining the chromatin structure, including ARID1A, **ARID1B**, ARID2, and MLL4 [25]. Mutations of these epigenetic modifiers lead to profound epigenetic changes, including aberrant DNA methylation, histone modifications, and nucleosome positioning [16], resulting in abnormal gene expression and genomic instability, which may predispose to HCC development.

[Whole-genome sequencing of liver cancers identifies etiological influences on mutation patterns and recurrent mutations in chromatin regulators.](https://www.ncbi.nlm.nih.gov/pubmed/22634756) :
> Multiple chromatin regulators, including ARID1A, ARID1B, ARID2, MLL and MLL3, were mutated in ∼50% of the tumors. 

-> involved in epigenetic/chromatin regulation and development of hepatocellular carcinoma. 

## Interpretation

None. We found no reference to this specific SNP or to the amino acid variant. We only know that the gene containing the intronic SNP is involved in HCC predisposition and that the amino acid is in the reverse transcriptase domain of the viral polymerase. Nothing can be deduced from it.

# rs70944751 <-> PC_160_A

In [27]:
df_results.loc['rs70944751']

#CHROM                6
POS            29911857
REF                   G
ALT                   T
A1                    T
TEST                ADD
OBS_CT              430
OR            0.0400096
LOG(OR)_SE     0.490586
Z_STAT          -6.5608
P             5.352e-11
AA             PC_160_A
Name: rs70944751, dtype: object

dbSNP: https://www.ncbi.nlm.nih.gov/snp/rs70944751, G>T, HLA-A intron variant.


## Gene

[HLA-A](https://www.ncbi.nlm.nih.gov/gene/3105), expressed in nearly all tissues.
> HLA-A belongs to the HLA class I heavy chain paralogues. This class I molecule is a heterodimer consisting of a heavy chain and a light chain (beta-2 microglobulin). The heavy chain is anchored in the membrane. Class I molecules play a central role in the immune system by presenting peptides derived from the endoplasmic reticulum lumen.

In [36]:
# Individuals from the study
df[('PC', 160)].sum()

variant
A    391.0
P     36.0
dtype: float64

In [51]:
# Individuals from the raw data
DF = pd.read_csv(PATH_VIRAL_RAW_DATA, sep='\t')
lst = [ i for i in DF.columns if len(i) > 17 and i[0:18] == 'gene_PC_C_pos_0160' ]
DF[lst].sum()

gene_PC_C_pos_0160_A    735
gene_PC_C_pos_0160_G     11
gene_PC_C_pos_0160_N      2
gene_PC_C_pos_0160_P     44
gene_PC_C_pos_0160_S      1
gene_PC_C_pos_0160_V      1
dtype: int64
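
The two sets of counts above are easier to compare as frequencies. The sketch below uses the counts printed in the two outputs; the helper `variant_frequencies` is made up for this example. Since an individual may carry more than one variant, these are frequencies among variant calls, not necessarily among individuals.

```python
def variant_frequencies(counts):
    """Convert variant counts at one position into relative frequencies."""
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}

# Counts copied from the two outputs above (PC position 160)
study_counts = {"A": 391, "P": 36}
raw_counts = {"A": 735, "G": 11, "N": 2, "P": 44, "S": 1, "V": 1}

for label, counts in [("study", study_counts), ("raw", raw_counts)]:
    freqs = variant_frequencies(counts)
    print(label, {v: round(f, 3) for v, f in freqs.items()})
```

In both tables A is by far the most common residue at this position (more than 90% of the calls), and P is the second most common.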

# rs376806238 <-> PC_160_A


In [6]:
df_results.loc['rs376806238']

#CHROM                  6
POS              29912395
REF                     T
ALT                   TGG
A1                    TGG
TEST                  ADD
OBS_CT                413
OR             0.00555607
LOG(OR)_SE       0.793047
Z_STAT           -6.54799
P             5.83178e-11
AA               PC_160_A
Name: rs376806238, dtype: object

dbSNP: https://www.ncbi.nlm.nih.gov/snp/rs376806238, HLA-A non-coding transcript variant.

# rs2735101 <-> PC_160_A

dbSNP https://www.ncbi.nlm.nih.gov/snp/rs2735101, HLA-A intron variant

