### **Task 1**: Compare lower-bound with the actual cost

Details of the task:
- Set up the LP using [Gurobi](https://www.gurobi.com/features/academic-named-user-license/): need to set up my license first

    Using [Google OR-Tools](https://developers.google.com/optimization/lp/lp_example) instead
- The lower bound is given by the LP objective value
- The actual cost comes from the solution provided with the data or from an ILP solver
- Reference [here](https://github.com/algo-cancer/PhISCS-BnB) for the data & optimal solution

**Questions**:
* Why does the input data have question marks? - make it automatically 0
* When can a mutation be "eliminated" (See the [Read Me Section](https://github.com/algo-cancer/PhISCS-BnB/tree/master?tab=readme-ov-file#output))?


### **Task 2**: Does the lower bound increase when adding constraints for non-conflict column pairs to the LP (empirically tested)

Details of the task:
- First test with the given data [here](https://github.com/algo-cancer/PhISCS-BnB)
- Then test with randomly generated data
- Compare lower bound with all pairs vs. only conflict pairs in the LP
- Is the LP solution 1/2 across the board, and is that what lowers the bound of the conflict-only LP?
- If the additional constraints don’t help - why?
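
One way to probe the "1/2 across the board" question is to check directly whether setting every original 0 to 1/2 is a feasible LP point. The sketch below does that in plain Python for constraints (2) to (7) as numbered in the comments of `solve_LP`; the function names `half_point_is_feasible` and `half_point_objective` are made up for this note. If the point is feasible, the LP minimum cannot exceed half the number of zeros in the input, which would be consistent with the analysis output where the rounded cost (14949) is exactly twice the reported bound (7474.5).

```python
from itertools import combinations

def half_point_is_feasible(matrix):
    """Check x = max(I, 1/2), B_10 = B_01 = 1/2, B_11 = 1 against constraints (2)-(7)."""
    # (7) holds by construction: every x lies between the input value and 1.
    x = [[max(v, 0.5) for v in row] for row in matrix]
    b10, b01, b11 = 0.5, 0.5, 1.0  # same values for every column pair, all within [0, 1] as in (6)
    if b10 + b01 + b11 > 2:  # (5)
        return False
    n_cols = len(matrix[0])
    for p, q in combinations(range(n_cols), 2):
        for row in x:
            if row[p] - row[q] > b10:  # (2)
                return False
            if row[q] - row[p] > b01:  # (3)
                return False
            if row[p] + row[q] > 1 + b11:  # (4)
                return False
    return True

def half_point_objective(matrix):
    """Objective (1) at that point: each original 0 contributes 1/2."""
    return sum(0.5 for row in matrix for v in row if v == 0)

example = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]  # toy input with three zeros
print(half_point_is_feasible(example), half_point_objective(example))  # True 1.5
```

The check never looks at which pairs conflict, so the same point stays feasible whether constraints are added for all pairs or only for conflict pairs. This only gives an upper limit on the LP bound; it does not by itself show that the solver returns this point.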

**Questions**:
* Should I remove constraints (2), (3), (4), (5), (6)?

    Effectively remove all variables $B_{p,q,x_1,x_2}$ if $p$ and $q$ don't have a conflict
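
For reference, here is the LP as it is encoded in `solve_LP`, with equation numbers taken from the comments in that function. The symbols $I$ (the input matrix, `SCS` in the code) and $P$ (the set of selected column pairs $p < q$) are introduced here for readability and are not from the original notes.

$$
\begin{aligned}
\min \quad & \sum_{(i,j)\,:\,I_{i,j}=0} x_{i,j} && (1) \\
\text{s.t.} \quad & x_{i,p} - x_{i,q} \le B_{p,q,1,0} && (2) \\
& -x_{i,p} + x_{i,q} \le B_{p,q,0,1} && (3) \\
& x_{i,p} + x_{i,q} \le 1 + B_{p,q,1,1} && (4) \\
& B_{p,q,1,0} + B_{p,q,0,1} + B_{p,q,1,1} \le 2 && (5) \\
& 0 \le B_{p,q,x_1,x_2} \le 1 && (6) \\
& I_{i,j} \le x_{i,j} \le 1 && (7)
\end{aligned}
$$

Constraints (2) to (4) are added for every row $i$ and every pair $(p,q) \in P$, (5) and (6) for every pair in $P$, and (7) for every entry. When `ColSelector` is given, a pair outside $P$ gets neither its three $B$ variables nor constraints (2) to (6), which is the removal described above.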

### **Task 3**: Write up proof claiming that once a conflict is resolved, those 2 columns will be <, >, or ≠ in the final solution

Look at scratch.txt for additional ideas
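
To make the claim concrete, the sketch below classifies a pair of columns by the sets of rows in which each holds a 1. It assumes that "<" means the first column's set is strictly contained in the second's, ">" the reverse, and "≠" that the two sets are disjoint; that reading is an interpretation of the task statement, not something the notes spell out. The function name `column_relation` is made up, and `"!="` stands in for ≠ in the code.

```python
def column_relation(matrix, p, q):
    """Classify columns p and q by the sets of rows in which each holds a 1."""
    rows_p = {i for i, row in enumerate(matrix) if row[p] == 1}
    rows_q = {i for i, row in enumerate(matrix) if row[q] == 1}
    if rows_p == rows_q:
        return "="  # identical columns: a boundary case the task statement does not mention
    if rows_p < rows_q:
        return "<"  # p's rows are strictly contained in q's
    if rows_p > rows_q:
        return ">"  # q's rows are strictly contained in p's
    if rows_p.isdisjoint(rows_q):
        return "!="  # no row holds a 1 in both columns
    return "conflict"  # patterns (1,0), (0,1) and (1,1) all occur

resolved = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
print(column_relation(resolved, 0, 1))  # <
print(column_relation(resolved, 1, 0))  # >
print(column_relation([[1, 0], [0, 1], [1, 1]], 0, 1))  # conflict
```

Under this reading, "conflict" is exactly the case where none of the three relations holds, so a pair without a conflict always falls into one of them (or is identical).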

### Functions

* `make_random_data(rows, cols, file=None, bias=0.7)` - create random SCS data where each entry is 1 with probability `bias`; optionally write it to `file`
* `read_data(base, file)` - read the data from the file
* `get_conversion_cost(X, Y)` - calculate the number of mutations needed to convert X to Y
* `find_conflict_columns(X)` - matrix describing if a given pair of columns have a conflict
* `solve_LP(SCS, ColSelector = None, verbose = False)` - solve the LP with all pairs or conflict pairs
* `compare_SCS(SCS_array, SCS_names)` - compare SCS data among multiple DataFrames

In [1]:
# Generate random data
# Input: rows, cols, file (optional path for a tab-separated copy), bias (probability that an entry is 1)
# Output: random_data
def make_random_data(rows, cols, file=None, bias=0.7):
    random_data = [([f'cell{i}'] if file else []) + [int(random.uniform(0, 1) < bias) for j in range(cols)] for i in range(rows)]
    random_data = pd.DataFrame(random_data, columns=(['cellIDxmutID'] if file else []) + [f'mut{i}' for i in range(cols)])
    if file:
        random_data.to_csv(file, index=False, sep='\t')
    return random_data
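
Since Task 2 compares bounds on randomly generated data, it may help to make the random instances reproducible. The following is a minimal sketch, not the notebook's function: `make_random_matrix` and its `seed` parameter are hypothetical names, and it returns a plain list of lists rather than a DataFrame. As in `make_random_data`, `bias` is the probability that an entry is 1.

```python
import random

def make_random_matrix(rows, cols, bias=0.7, seed=None):
    """Return a rows x cols list of 0/1 lists; each entry is 1 with probability `bias`."""
    rng = random.Random(seed)  # private generator, so the global random state is left alone
    return [[int(rng.random() < bias) for _ in range(cols)] for _ in range(rows)]

matrix = make_random_matrix(4, 3, bias=0.7, seed=0)
same_again = make_random_matrix(4, 3, bias=0.7, seed=0)
print(matrix == same_again)  # True: the same seed reproduces the same instance
```

Wrapping the result in `pd.DataFrame(matrix, columns=[f"mut{j}" for j in range(3)])` should give something the rest of the notebook can consume in place of `make_random_data` output.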

In [2]:
# Read data function
# Input: base - folder with data, file - name of the file without extension
# Return: In_SCS, CF_SCS, MutsAtEdges
# Note: SCS converts all ? to 0
# Note: MutsAtEdges is a list with a tuple - (parent, curr_node, muts: set)
find_nodes_re = r"\[(?P<parent>[0-9]+)\]->\[(?P<node>[0-9]+)\]:"
def read_data(base, file):
    raw = pd.read_csv(base + "/" + file + ".SC", sep="\t", dtype=str)
    In_SCS = (raw.iloc[:, 1:] == "1").astype(int)
    try:
        CF_SCS = pd.read_csv(base + "/" + file + ".CFMatrix", sep="\t").iloc[:, 1:]
    except Exception:  # e.g. the .CFMatrix file is missing or unreadable
        CF_SCS = None
    try:
        MutsAtEdges = []
        with open(base + "/" + file + ".mutsAtEdges", "r") as f:
            for line in f:
                l = line.strip().split(' ')
                parent, curr_node = tuple(map(int, re.match(find_nodes_re, l[0]).groups()))
                MutsAtEdges.append((parent, curr_node, set(l[1:])))
    except Exception:  # e.g. the .mutsAtEdges file is missing or a line does not match
        MutsAtEdges = None
    return In_SCS, CF_SCS, MutsAtEdges
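
The per-line parsing of the `.mutsAtEdges` file can be tried in isolation. The sketch below reuses the same regular expression; the helper name `parse_edge_line` and the sample line are hypothetical, with the line format inferred from the regex and the space split in `read_data` rather than taken from a real data file.

```python
import re

FIND_NODES_RE = r"\[(?P<parent>[0-9]+)\]->\[(?P<node>[0-9]+)\]:"

def parse_edge_line(line):
    """Parse one line into (parent, node, set_of_mutations), or None if it does not match."""
    fields = line.strip().split(" ")
    match = re.match(FIND_NODES_RE, fields[0])
    if match is None:
        return None
    parent, node = int(match.group("parent")), int(match.group("node"))
    return parent, node, set(fields[1:])

# Hypothetical sample line: edge from node 0 to node 3 carrying two mutations
edge = parse_edge_line("[0]->[3]: mut2 mut5\n")
print(edge[0], edge[1], sorted(edge[2]))  # 0 3 ['mut2', 'mut5']
```

Returning `None` for a line that does not match is a choice made for this sketch. In `read_data`, such a line raises inside the `try` block, so the whole `MutsAtEdges` list falls back to `None`.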

In [3]:
# Conversion cost
# Input: X - from matrix, Y - to matrix
# Return: Cost of converting X into Y
def get_conversion_cost(X, Y):
    cost = 0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
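            if X.iloc[i, j] > Y.iloc[i, j]:
                # A 1 -> 0 change would be a false positive mutation, which is not allowed here
                raise ValueError(f"False positive at ({i}, {j}): 1 in X but 0 in Y")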
            if X.iloc[i, j] < Y.iloc[i, j]:
                cost += 1
    return cost

In [4]:
# Conflict column pairs
# Input: X - matrix of SCS data
# Return: n x n matrix of column pairs, True where the pair has a conflict

# Check if a column pair has conflicts - Utility function
def is_conflict(df, p, q):
    is10 = False
    is01 = False
    is11 = False
    for k in range(df.shape[0]):
        if df.iloc[k, p] == 1 and df.iloc[k, q] == 0:
            is10 = True
        if df.iloc[k, p] == 0 and df.iloc[k, q] == 1:
            is01 = True
        if df.iloc[k, p] == 1 and df.iloc[k, q] == 1:
            is11 = True
    return is10 and is01 and is11

# Get matrix of is_conflict
def find_conflict_columns(X):
    n_cols = X.shape[1]  # use the matrix's own width instead of the global n
    conflicts = []
    for p in range(n_cols):
        temp = []
        for q in range(n_cols):
            temp.append(is_conflict(X, p, q))
        conflicts.append(temp)
    conflicts = pd.DataFrame(conflicts)
    return conflicts
# TODO: Make this more efficient - how?
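
One possible answer to the TODO is to pack each column into a single Python integer and test the three patterns with bitwise operations, so each pair costs a few integer operations instead of a row-by-row `iloc` loop. This is a sketch under that assumption, working on a plain list of lists; `column_masks` and `find_conflicts_bitmask` are made-up names, and no timing has been measured here.

```python
def column_masks(matrix):
    """Pack each column into one int: bit i is set when row i holds a 1."""
    masks = [0] * len(matrix[0])
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value == 1:
                masks[j] |= 1 << i
    return masks

def find_conflicts_bitmask(matrix):
    """Return an n x n list of booleans; entry [p][q] is True when columns p and q conflict."""
    masks = column_masks(matrix)
    n_cols = len(masks)
    conflicts = [[False] * n_cols for _ in range(n_cols)]
    for p in range(n_cols):
        for q in range(p + 1, n_cols):
            both = masks[p] & masks[q]     # rows with pattern (1, 1)
            only_p = masks[p] & ~masks[q]  # rows with pattern (1, 0)
            only_q = masks[q] & ~masks[p]  # rows with pattern (0, 1)
            conflicts[p][q] = conflicts[q][p] = bool(both and only_p and only_q)
    return conflicts

toy = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
print(find_conflicts_bitmask(toy))
# [[False, True, False], [True, False, False], [False, False, False]]
```

To use it with the notebook's data, something like `pd.DataFrame(find_conflicts_bitmask(In_SCS.values.tolist()))` should give a frame that `solve_LP` can index with `ColSelector.iloc[p, q]`. The diagonal stays `False`, which matches `is_conflict(df, p, p)`.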

In [5]:
# Solve the LP & find the lower bound
# Input: SCS, ColSelector - which col pairs to add constraints for, verbose
# Return: LP_objective, LP_solution (float)
def solve_LP(SCS, ColSelector = None, verbose = False):
    
    solver = pywraplp.Solver.CreateSolver("GLOP")
    m = SCS.shape[0] # rows
    n = SCS.shape[1] # cols

    # Create variables
    vars = {}
    for p in range(n):
        for q in range(p+1, n):
            if ColSelector is None or ColSelector.iloc[p, q]: # Check cols
                vars[f"B_{p}_{q}_1_0"] = solver.NumVar(0, 1, f"B_{p}_{q}_1_0") # (6)
                vars[f"B_{p}_{q}_0_1"] = solver.NumVar(0, 1, f"B_{p}_{q}_0_1") # (6)
                vars[f"B_{p}_{q}_1_1"] = solver.NumVar(0, 1, f"B_{p}_{q}_1_1") # (6)
    for i in range(m):
        for j in range(n):
            vars[f"x_{i}_{j}"] = solver.NumVar(float(SCS.iloc[i, j]), 1, f"x_{i}_{j}") # (7)
    if verbose:
        print(solver.NumVariables(), "variables created")

    # Create constraints
    for p in range(n):
        for q in range(p+1, n):
            if ColSelector is None or ColSelector.iloc[p, q]: # Check cols
                solver.Add(vars[f"B_{p}_{q}_1_0"] + vars[f"B_{p}_{q}_0_1"] + vars[f"B_{p}_{q}_1_1"] <= 2) # (5)
                for i in range(m):
                    solver.Add(vars[f"x_{i}_{p}"] - vars[f"x_{i}_{q}"] <= vars[f"B_{p}_{q}_1_0"]) # (2)
                    solver.Add(- vars[f"x_{i}_{p}"] + vars[f"x_{i}_{q}"] <= vars[f"B_{p}_{q}_0_1"]) # (3)
                    solver.Add(vars[f"x_{i}_{p}"] + vars[f"x_{i}_{q}"] <= 1 + vars[f"B_{p}_{q}_1_1"]) # (4)
    if verbose:
        print(solver.NumConstraints(), "constraints created")

    # Define objective function
    objective = solver.Objective()
    for i in range(m):
        for j in range(n):
            if SCS.iloc[i, j] == 0: # only if they used to be 0
                objective.SetCoefficient(vars[f"x_{i}_{j}"], 1) # (1)
    objective.SetMinimization()

    # Solve & print objective
    status = solver.Solve()
    if status != pywraplp.Solver.OPTIMAL:
        # Raise instead of exit(1) so the notebook kernel keeps running
        raise RuntimeError("The problem does not have an optimal solution.")
    objective_value = objective.Value()
    if verbose:
        print(f"Solving with {solver.SolverVersion()}\n")
        print(f"Solution:\nLower bound (LP objective) = {objective_value:0.5f}")

    # Create & print the solution DF
    solution = []
    for i in range(m):
        solution.append([vars[f"x_{i}_{j}"].solution_value() for j in range(n)])
    solution = pd.DataFrame(solution)
    if verbose:
        display(solution)

    # Return
    return objective_value, solution

In [6]:
# Compare SCS data
# Input: SCS_array - list of SCS matrices, SCS_names - list of names
# Return: Matrix of each difference - (row, col, mat1_val, mat2_val, ...)
def compare_SCS(SCS_array, SCS_names):
    m = SCS_array[0].shape[0]
    n = SCS_array[0].shape[1]
    diffs = []
    for i in range(m):
        for j in range(n):
            to_add = False
            temp = [i, j]
            for SCS_DF in SCS_array:
                if SCS_DF.iloc[i, j] != SCS_array[0].iloc[i, j]:
                    to_add = True
                temp.append(SCS_DF.iloc[i, j])
            if to_add:
                diffs.append(temp)
    diffs = pd.DataFrame(diffs, columns=["row", "col"] + SCS_names)
    return diffs

### Driver

In [7]:
# Imports
import pandas as pd
from ortools.linear_solver import pywraplp
import re
import random
import time
DISPLAY_TABLES = False

In [13]:
# Create data
m = 200 # rows
n = 50 # cols
FILE = None #"./data/data3.SC"
In_SCS = make_random_data(m, n, FILE)
print("Dimensions of the created data:", In_SCS.shape)
if DISPLAY_TABLES:
    display(In_SCS)

Dimensions of the created data: (200, 50)


In [112]:
# Read data
BASE = "./data"
FILE = "data1"

In_SCS, CF_SCS, MutsAtEdges = read_data(BASE, FILE)
m = In_SCS.shape[0] # rows
n = In_SCS.shape[1] # cols

print(f"Data shape: {In_SCS.shape}")
if DISPLAY_TABLES:
    print("\nInput SCS data:")
    display(In_SCS)
    print("Conflict-free SCS (answer) data:")
    display(CF_SCS)

Data shape: (20, 20)


In [None]:
# Compare In_SCS and CF_SCS
real_cost = get_conversion_cost(In_SCS, CF_SCS)
print(f"True cost of converting Input SCS to Conflict-Free SCS: {real_cost}\n")
print("Mutations (only false negative) between Input SCS and Conflict-Free SCS:")
if DISPLAY_TABLES:
    display(compare_SCS([In_SCS, CF_SCS], ["In_SCS", "CF_SCS"]))

#### Task 1: Compare lower-bound with the actual cost

In [14]:
# Find the LP based lower bound (all columns)
LP_bound_all_columns, LP_solution_all_columns = solve_LP(In_SCS)
print("Lower bound (LP objective) with all columns:", LP_bound_all_columns)
if DISPLAY_TABLES:
    display(LP_solution_all_columns)

Lower bound (LP objective) with all columns: 1520.0


In [None]:
# Deterministic rounding & compare
RoundedLP_solution_all_columns = (LP_solution_all_columns.iloc[:, :] >= 0.5).astype(int)
RoundedLP_cost_all_columns = get_conversion_cost(In_SCS, RoundedLP_solution_all_columns)
print(f"Cost of converting rounded solution for LP with all columns to Input SCS: {RoundedLP_cost_all_columns}")
if DISPLAY_TABLES:
    display(compare_SCS([In_SCS, RoundedLP_solution_all_columns], ["In_SCS", "Rounded All Columns"]))

#### Task 2: Does the lower bound increase when adding constraints for non-conflict column pairs to the LP (empirically tested)

In [15]:
# Get the conflict columns - takes a long time
conflict_columns = find_conflict_columns(In_SCS)
if DISPLAY_TABLES:
    print("Conflict columns:")
    display(pd.DataFrame(conflict_columns))

In [16]:
# Find the LP based lower bound (conflict columns)
LP_bound_conflict_columns, LP_solution_conflict_columns = solve_LP(In_SCS, conflict_columns)
print("Lower bound (LP objective) with conflict columns:", LP_bound_conflict_columns)
if DISPLAY_TABLES:
    display(LP_solution_conflict_columns)

Lower bound (LP objective) with conflict columns: 1520.0


In [None]:
# Deterministic rounding & compare
RoundedLP_solution_conflict_columns = (LP_solution_conflict_columns.iloc[:, :] >= 0.5).astype(int)
RoundedLP_cost_conflict_columns = get_conversion_cost(In_SCS, RoundedLP_solution_conflict_columns)
print(f"Cost of converting rounded solution for LP with conflict columns to Input SCS: {RoundedLP_cost_conflict_columns}")
if DISPLAY_TABLES:
    display(compare_SCS([In_SCS, RoundedLP_solution_conflict_columns], ["In_SCS", "Rounded Conflict Columns"]))

### Analysis

In [104]:
# Compare costs
print(f"True cost of converting Input SCS to Conflict-Free SCS: {real_cost}\n")
print("Lower bound (LP objective) with all columns:", LP_bound_all_columns)
print("Lower bound (LP objective) with conflict columns:", LP_bound_conflict_columns)
print(f"Cost of converting rounded solution for LP with all columns to Input SCS: {RoundedLP_cost_all_columns}")
print(f"Cost of converting rounded solution for LP with conflict columns to Input SCS: {RoundedLP_cost_conflict_columns}")

True cost of converting Input SCS to Conflict-Free SCS: 40

Lower bound (LP objective) with all columns: 1495.0
Lower bound (LP objective) with conflict columns: 7474.5
Cost of converting rounded solution for LP with all columns to Input SCS: 14949
Cost of converting rounded solution for LP with conflict columns to Input SCS: 14949


In [17]:
# Compare LP (all columns) and LP (conflict columns)
print("LP (all columns) vs LP (conflict columns):")
display(compare_SCS([LP_solution_all_columns, LP_solution_conflict_columns], ["LP_all_columns", "LP_conflict_columns"]))

LP (all columns) vs LP (conflict columns):


row,col,LP_all_columns,LP_conflict_columns


In [None]:
# Compare Rounded LP (all columns) and Rounded LP (conflict columns)
print("Rounded LP (all columns) vs Rounded LP (conflict columns):")
display(compare_SCS([RoundedLP_solution_all_columns, RoundedLP_solution_conflict_columns],
    ["RoundedLP_all_columns", "RoundedLP_conflict_columns"]))

## TODO

- Conference
- Ask farid.rashidimehrabadi@nih.gov for data; make sure to CC Salem (and Cenk)
- Tree visualization
- Incorporate `mutsAtEdges`
- Add the list of relevant data points to the driver section