# Logic Programming with the MARS Output

Since MARS outputs weighted confidences, we can turn that output into a logic program. Why would we want to?

### 1. Validation Step

Using the probabilities of the rules and the probabilities of the edges, we can see how likely each proposed mechanism of action (MoA) is.

We can also try this on some of the repurposing suggestions. Say MARS suggests a drug-BP pair that wasn't in the test set, along with an MoA. We can plug the pair into the logic program with its suggested MoAs as rules and get a probability that the pair is true.
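As a rough sketch of what this validation computes: if we assume the rule and the edges are independent, one grounded MoA path has a probability equal to the rule weight times the product of its edge probabilities. The helper name and the numbers below are illustrative only, not MARS output.

```python
from math import prod

def moa_path_probability(rule_weight, edge_probs):
    """Probability of one grounded MoA path, assuming independence."""
    return rule_weight * prod(edge_probs)

# Hypothetical values: a rule with weight 0.65 applied to a
# drug-gene edge (0.9) and a gene-BP edge (0.8).
p = moa_path_probability(0.65, [0.9, 0.8])
print(round(p, 3))
```

When several paths support the same drug-BP pair, the logic program combines them rather than reporting a single product, so this is only the per-path building block.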

### 2. Use Case

Part of our output is weighted rules. They're great for explainability, but how many ML researchers actually use them afterwards? Here we can show how.

In particular, suppose we propose a new drug that is not in the original KG, but we know its protein target. Now that we have weighted rules, we can:
- Query whether that drug targets a specific BP.
- If we know which BP it targets, query for the most likely MoA, based on our rules.
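To make the second query concrete, here is a minimal pure-Python sketch that ranks rules for one drug-BP pair by following each rule body through a toy KG. Every node name, edge probability, and rule weight is made up for illustration, and edges are assumed independent; the real pipeline would run the Scallop program instead.

```python
# Hypothetical toy KG: relation -> {(head, tail): probability}
edges = {
    "CdG": {("new_drug", "gene1"): 1.0},
    "CuG": {("new_drug", "gene2"): 1.0},
    "GiG": {("gene1", "gene3"): 0.7},
    "GpBP": {("gene2", "bp1"): 1.0, ("gene3", "bp1"): 1.0},
}

# Hypothetical weighted rules: (weight, body relations in path order)
rules = [
    (0.65, ["CuG", "GpBP"]),
    (0.57, ["CdG", "GiG", "GpBP"]),
    (0.65, ["CdG", "GpBP"]),
]

def path_probs(start, end, body):
    """Probabilities of all paths from start to end that follow body."""
    frontier = [(start, 1.0)]
    for rel in body:
        frontier = [
            (tail, p * q)
            for node, p in frontier
            for (head, tail), q in edges.get(rel, {}).items()
            if head == node
        ]
    return [p for node, p in frontier if node == end]

def rank_moas(drug, bp):
    """Rank rules by their best-scoring grounded path, highest first."""
    scored = []
    for weight, body in rules:
        probs = path_probs(drug, bp, body)
        if probs:  # rules with no matching path are left out
            scored.append((weight * max(probs), body))
    return sorted(scored, reverse=True)

ranking = rank_moas("new_drug", "bp1")
for score, body in ranking:
    print(round(score, 3), " -> ".join(body))
```

The first query (does the drug target a given BP at all?) is then just whether `ranking` is non-empty for that pair.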

In [1]:
import scallopy
import json
import os
from collections import defaultdict
import ast
import pandas as pd
import random

In [2]:
ctx = scallopy.ScallopContext(provenance="topkproofs")

Let's start with some example output.

We'll pull from the following directory:

In [3]:
EXAMPLE_DIR = '../data/output/KG_size_experiment/full_KG/PoLo-2'

First, let's get the confidences. Since we ran 5 replicates, we average the confidence scores across them.

In [4]:
# Initialize a dictionary to store the averaged values
averaged_dict = defaultdict(list)

# Iterate through subdirectories
for subdir, _, _ in os.walk(EXAMPLE_DIR):
    confidences_file = os.path.join(subdir, 'confidences.txt')

    # Skip subdirectories that have no 'confidences.txt'
    if os.path.isfile(confidences_file):
        # Despite the .txt extension, the file contains JSON
        with open(confidences_file, 'r') as f:
            current_dict = json.load(f)
        for lst in current_dict['CtBP']:
            # Drop the 'NO_OP' entries so equal rules share one key
            while 'NO_OP' in lst:
                lst.remove('NO_OP')
            # lst[0] is the confidence; the rest is the rule (head, then body)
            averaged_dict[str(lst[1:])].append(float(lst[0]))

# Calculate the average of each value in the dictionary
for key, value in averaged_dict.items():
    averaged_dict[key] = sum(value) / len(value)
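To see what the averaging does, here is the same logic on two made-up replicates. The entry layout (confidence first, then the rule head and body, then `NO_OP` entries) is inferred from the loop above, and the values are hypothetical.

```python
from collections import defaultdict

# Two hypothetical replicates, in the layout the loop above expects:
# [confidence, head, body relation, ..., optional 'NO_OP' entries]
replicates = [
    {"CtBP": [["0.6", "CtBP", "CdG", "GpBP", "NO_OP"]]},
    {"CtBP": [["0.7", "CtBP", "CdG", "GpBP", "NO_OP"],
              ["0.5", "CtBP", "CuG", "GpBP", "NO_OP"]]},
]

scores = defaultdict(list)
for rep in replicates:
    for entry in rep["CtBP"]:
        rule = tuple(r for r in entry[1:] if r != "NO_OP")
        scores[rule].append(float(entry[0]))

averaged = {rule: sum(vals) / len(vals) for rule, vals in scores.items()}
for rule, conf in averaged.items():
    print(rule, round(conf, 3))
```

Note that a rule missing from a replicate is averaged only over the replicates where it appears; it is not counted as zero for the others.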

## Adding the Relations

Additionally, for each of these rules, we need to add the relations into the scallopy logic program:

In [5]:
# first a unary predicate to represent each node:

ctx.add_relation("node", int)

In [6]:
relations = set()

for key in averaged_dict.keys():
    as_list = ast.literal_eval(key)
    as_list = [i[1::] + '_' if i.startswith('_') else i for i in as_list]  # Scallop doesn't like leading underscores
    relations.update(set(as_list))

In [7]:
relations = relations - {'NO_OP'}

In [8]:
# Add these relations to the scallopy program:
for relation in relations:
    ctx.add_relation(relation, (str, str))  # Add them as binary relations

## Adding the Probabilistic Rules

Now, we'll add those rules in:

In [9]:
for key, val in averaged_dict.items():
    key_as_list = ast.literal_eval(key)
    key_as_list = [i[1::] + '_' if i.startswith('_') else i for i in key_as_list]  # Scallop doesn't like leading underscores
    # Count how long the rule is, and from there, get the number of variables needed:
    variables = [chr(i+97) for i in range(len(key_as_list))]
    # Add the rule to the scallopy program:
    body = ' and '.join([f"{key_as_list[i]}({variables[i-1]}, {variables[i]})" for i in range(1, len(key_as_list))])
    rule = f"{key_as_list[0]}({variables[0]}, {variables[-1]}) = {body}"
    print(f"{val} :: {rule}")
    ctx.add_rule(rule, tag = val)

0.6495389892847557 :: CtBP(a, c) = CdG(a, b) and GpBP(b, c)
0.648391038677989 :: CtBP(a, c) = CuG(a, b) and GpBP(b, c)
0.672000820244284 :: CtBP(a, d) = CtBP(a, b) and GpBP_(b, c) and GpBP(c, d)
0.5732368224883798 :: CtBP(a, d) = CdG(a, b) and GiG_(b, c) and GpBP(c, d)
0.5678119136132754 :: CtBP(a, d) = CdG(a, b) and GiG(b, c) and GpBP(c, d)
0.569955952094015 :: CtBP(a, d) = CuG(a, b) and GiG_(b, c) and GpBP(c, d)
0.5651251716707305 :: CtBP(a, d) = CuG(a, b) and GiG(b, c) and GpBP(c, d)
0.548242218469801 :: CtBP(a, e) = CtBP(a, b) and GpBP_(b, c) and GiG_(c, d) and GpBP(d, e)
0.5440509825676607 :: CtBP(a, e) = CtBP(a, b) and GpBP_(b, c) and GiG(c, d) and GpBP(d, e)
0.5209817398102472 :: CtBP(a, e) = CdG(a, b) and GpBP(b, c) and GpBP_(c, d) and GpBP(d, e)
0.4417495795043223 :: CtBP(a, e) = CdG(a, b) and GiG_(b, c) and GiG_(c, d) and GpBP(d, e)
0.4413808372039352 :: CtBP(a, e) = CdG(a, b) and GiG_(b, c) and GiG(c, d) and GpBP(d, e)
0.4413692132040815 :: CtBP(a, e) = CdG(a, b) and GiG(b, 
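The rule-building loop above can be factored into a small helper, which makes the variable chaining easier to check in isolation. This is a sketch with a hypothetical function name; it mirrors the loop's logic, including moving a leading underscore to the end of the relation name.

```python
def build_rule(relations):
    """Turn [head, body_1, ..., body_n] into a Scallop rule string."""
    # Move a leading underscore to the end of the name
    names = [r[1:] + "_" if r.startswith("_") else r for r in relations]
    head, body = names[0], names[1:]
    # One variable per node on the path: a, b, c, ...
    variables = [chr(ord("a") + i) for i in range(len(body) + 1)]
    atoms = [
        f"{rel}({variables[i]}, {variables[i + 1]})"
        for i, rel in enumerate(body)
    ]
    return f"{head}({variables[0]}, {variables[-1]}) = " + " and ".join(atoms)

print(build_rule(["CtBP", "CdG", "_GiG", "GpBP"]))
```

Each body atom shares its second variable with the first variable of the next atom, so the body always describes a single path from the drug to the BP.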

## Instantiate the edges

For now, I am instantiating the edges with probabilities as follows:

- PPIs (gene-gene) with a known confidence get that confidence. Right now, we have these for most of the PPIs.
- PPIs we have no probability for get 0.5.
- All other edge types get a probability of 1.

In [10]:
drug_gene = pd.read_csv('../data/kg/splits/kg_drug_gene.tsv', sep='\t', header=None)
gene_gene = pd.read_csv('../data/kg/splits/kg_protein.tsv', sep='\t', header=None)
gene_bp = pd.read_csv('../data/kg/splits/kg_gene_bp.tsv', sep='\t', header=None)

Read in the file with PPI probabilities:

In [12]:
# read in json file
with open('../data/kg/stringdb_ppi_confidences.json', 'r') as f:
    ppi_confs = json.load(f)

In [15]:
def map_conf(row):
    if f"{row[0]}_{row[1]}" in ppi_confs:
        return ppi_confs[f"{row[0]}_{row[1]}"]
    else:
        return 0.5

gene_gene['conf'] = gene_gene.apply(map_conf, axis=1)

In [17]:
for i, row in gene_gene.iterrows():
    ctx.add_facts('GiG', [(row['conf'], (row[0], row[1]))])
    # GiG_ is the inverse relation: same confidence, head and tail swapped
    ctx.add_facts('GiG_', [(row['conf'], (row[1], row[0]))])

In [18]:
for i, row in drug_gene.iterrows():
    if row[2] == 'upregulates':
        ctx.add_facts('CuG', [(1, (row[0], row[1]))])
    elif row[2] == 'downregulates':
        ctx.add_facts('CdG', [(1, (row[0], row[1]))])

In [19]:
for i, row in gene_bp.iterrows():
    ctx.add_facts('GpBP', [(1, (row[0], row[1]))])
    # GpBP_ is the inverse relation (BP to gene), so head and tail are swapped
    ctx.add_facts('GpBP_', [(1, (row[1], row[0]))])

An example:

In [None]:
ctx.add_facts("CdG", [(0.3, ('a', 'c')), (0.22, ('c', 'b'))])

ctx.add_facts("GpBP", [(0.76, ('c', 'd'))])

## Run the Program

Don't run this cell for now: it currently crashes.

In [None]:
ctx.run()

With the example:

In [None]:
print(list(ctx.relation("CtBP")))

[(0.12501072904791985, ('a', 'd'))]


I'm leaving this here for now, but I'll work on a demonstration with fewer rules for each one.

For example, if we want to get the probability of an MoA, we only need the rule encoding that MoA along with the knowledge base (KB) / KG.

If we want the most likely MoA for a new drug, we can run the program with a different rule set each time to see how likely each MoA-rule is. 
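A minimal sketch of that per-rule setup, using only the scallopy calls that already appear in this notebook. The function names are hypothetical, and `scallopy` is imported inside the function so the helpers can be defined without it. Each call builds a fresh context that holds one weighted rule and only the relations that rule mentions.

```python
import re

def body_relations(rule):
    """Relation names used in the body of a rule string."""
    body = rule.split("=", 1)[1]
    return set(re.findall(r"(\w+)\(", body))

def score_moa(rule, weight, facts, head="CtBP"):
    """Run a fresh program containing a single weighted rule.

    facts maps a relation name to a list of
    (probability, (head_node, tail_node)) pairs.
    """
    import scallopy  # lazy import: only needed when the program is run

    ctx = scallopy.ScallopContext(provenance="topkproofs")
    needed = body_relations(rule)
    for relation in sorted(needed | {head}):
        ctx.add_relation(relation, (str, str))
    for relation in sorted(needed):
        if relation in facts:
            ctx.add_facts(relation, facts[relation])
    ctx.add_rule(rule, tag=weight)
    ctx.run()
    return list(ctx.relation(head))

# Example call (not executed here), with made-up facts and weight:
# facts = {"CdG": [(0.3, ("a", "c"))], "GpBP": [(0.76, ("c", "d"))]}
# score_moa("CtBP(a, c) = CdG(a, b) and GpBP(b, c)", 0.65, facts)
```

Looping `score_moa` over the weighted rules and keeping the highest-scoring result for a drug-BP pair would give the most likely MoA, at the cost of one program run per rule.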