## Imports

In [1]:
import os
import sys

import io
from contextlib import redirect_stdout

# Make the helper modules in python_scripts importable (run from the repo root)
sys.path.insert(0, os.path.join(os.getcwd(), 'python_scripts'))

from transform_data import csv_to_clingo, undersample_csv_to_clingo
from single_proxy import get_single_proxies
from multi_proxy_choice_rules import get_proxy_clusters_choice_rules
from multi_proxy_hardcoded import get_proxy_clusters_hardcoded
from multi_proxy_undersampled import process_potential_implications, check_implication


## General info

* Whenever default values are mentioned, they are as follows:

| Attribute | Default Value |
|---|---|
| Minimum implication probability | 80 |
| Minimum incidence probability | 5 |
| Minimum proxy cluster size | 1 |
| Maximum proxy cluster size | 3 |

* All of the attribute values above must be **integers**.
* Running clingo programs in a Jupyter notebook can cause problems, such as an `IOPub data rate exceeded` error. If that happens, run the same instructions from a standalone Python file or from the command line.
* Functions from files in the `python_scripts` directory must be called from the root of this repo.
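
The integer requirement can be checked before any of the functions below are called. The following is a minimal sketch: `validate_thresholds` is a hypothetical helper that is not part of `python_scripts`, and the 0 to 100 range for the two probabilities is an assumption based on their being percentages.

```python
def validate_thresholds(min_implication=80, min_incidence=5,
                        min_cluster_size=1, max_cluster_size=3):
    """Hypothetical helper (not part of python_scripts): validate attribute values."""
    values = {
        "min_implication": min_implication,
        "min_incidence": min_incidence,
        "min_cluster_size": min_cluster_size,
        "max_cluster_size": max_cluster_size,
    }
    for name, value in values.items():
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer, got {value!r}")
    for name in ("min_implication", "min_incidence"):
        # Assumption: probabilities are percentages between 0 and 100.
        if not 0 <= values[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    if not 1 <= min_cluster_size <= max_cluster_size:
        raise ValueError("cluster sizes must satisfy 1 <= min <= max")
    return min_implication, min_incidence, min_cluster_size, max_cluster_size


print(validate_thresholds())             # the defaults from the table above
print(validate_thresholds(85, 1, 1, 2))  # customized values
```

Passing a float such as `80.5` raises a `TypeError` here, which surfaces the problem before clingo is invoked.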

---

## .csv data transformation

 **⚠ Edit the cell below** to use your preferred dataset

In [2]:
sourcedatafolder = "example_datasets_no_ordinals/"
outdatafolder = "clingo_data/"

dataset = "student-performance-por"
protected_attributes = ["sex"]
outcome_attribute = "G3"

# dataset = "adult"
# protected_attributes = ["gender", "race"]
# outcome_attribute = "income"

# dataset = "bank-marketing"
# protected_attributes = ["marital"]
# outcome_attribute = "deposit"

# dataset = "compas"
# protected_attributes = ["race", "sex"]
# outcome_attribute = "is_violent_recid"

# dataset = "german-credit"
# protected_attributes = ["age_cat"]
# outcome_attribute = "class"

---

Creating output directory if it does not exist already

In [3]:
!mkdir -p $outdatafolder

Creating data file readable by clingo programs

In [4]:
csv_to_clingo(sourcedatafolder, dataset, outdatafolder, protected_attributes, outcome_attribute)

 **⚠ Resulting file name** should be the following `datafile`:

In [5]:
datafile = outdatafolder + "data-" + dataset + ".lp"
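
The file name follows the pattern `data-<dataset>.lp` inside the output folder. As a hedged sketch, a small hypothetical helper (`clingo_data_path`, not part of the repo) can build that path and report whether the file was actually written:

```python
import os


def clingo_data_path(folder, dataset):
    """Hypothetical helper: path of the .lp file expected for `dataset`."""
    return os.path.join(folder, "data-" + dataset + ".lp")


path = clingo_data_path("clingo_data/", "student-performance-por")
# os.path.isfile returns False if the conversion step has not produced the file yet.
print(path, "exists" if os.path.isfile(path) else "is missing")
```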

---

# Single proxy discovery

## Alternative 1 - running clingo directly

In [6]:
!clingo $datafile clingo_scripts/single_proxy_default.lp

clingo version 5.6.2
Reading from ...o_data/data-student-performance-por.lp ...
Solving...
Answer: 1
count_facts(11682) count_items(649) outcome("G3") protected("sex")
SATISFIABLE

Models       : 1
Calls        : 1
Time         : 0.194s (Solving: 0.00s 1st Model: 0.00s Unsat: 0.00s)
CPU Time     : 0.155s


## Alternative 2 - running clingo through python 

```
get_single_proxies(
    datafile: str, 
    min_implication_probability: optional int, 
    min_incidence_probability: optional int,
)
```

### Alternative 2.1 - using default values

In [7]:
get_single_proxies(datafile)

protected("sex") outcome("G3") count_items(649) count_facts(11682)


### Alternative 2.2 - customizing minimum implication and incidence probabilities

In [8]:
get_single_proxies(datafile, 85, 1)

protected("sex") outcome("G3") count_items(649) implication("absences","1","sex","F",91,1) count_facts(11682)
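
The `implication(...)` atoms in this output can be unpacked for further processing. The sketch below assumes, from the example output and the thresholds passed in the call, that the four strings are the proxy attribute, its value, the protected attribute, and its value, and that the two numbers are the implication probability and the incidence probability. `parse_single_implications` is a hypothetical helper, not a function from `python_scripts`.

```python
import re

# Matches single-proxy atoms such as implication("absences","1","sex","F",91,1)
IMPLICATION_RE = re.compile(
    r'implication\("([^"]*)","([^"]*)","([^"]*)","([^"]*)",(\d+),(\d+)\)'
)


def parse_single_implications(output):
    """Hypothetical helper: turn implication atoms into dictionaries."""
    results = []
    for attr, value, prot, prot_value, prob, incidence in IMPLICATION_RE.findall(output):
        results.append({
            "attribute": attr,
            "value": value,
            "protected_attribute": prot,
            "protected_value": prot_value,
            "implication_probability": int(prob),  # assumed meaning of the 5th field
            "incidence_probability": int(incidence),  # assumed meaning of the 6th field
        })
    return results


sample = ('protected("sex") outcome("G3") count_items(649) '
          'implication("absences","1","sex","F",91,1) count_facts(11682)')
for row in parse_single_implications(sample):
    print(row)
```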


---

# Multiple proxy discovery

## Choice Rules method

### Alternative 1 - running clingo directly
⚠ This method is **not recommended**, since it may take longer to run.
It uses the default values listed above.

In [9]:
#!clingo -W none $datafile clingo_scripts/multi_proxy_choice_rules_default.lp 0

### Alternative 2 - running clingo through python

```
get_proxy_clusters_choice_rules(
    datafile: str, 
    min_implication_probability: optional int, 
    min_incidence_probability: optional int,
    min_cluster_size: optional int,
    max_cluster_size: optional int
)
```

#### Alternative 2.1 - using default values

In [10]:
get_proxy_clusters_choice_rules(datafile)
# Same as
# get_proxy_clusters_choice_rules(datafile, 80, 5, 1, 3)

UNSAT
proxy("Pstatus","A") proxy("romantic","yes") implication("sex","F",80,5) count_attributes_in_cluster(2)
proxy("famsize","GT3") proxy("Pstatus","A") implication("sex","F",81,5) count_attributes_in_cluster(2)
proxy("Mjob","at_home") proxy("activities","no") implication("sex","F",80,11) count_attributes_in_cluster(2)
SAT
proxy("Mjob","at_home") proxy("higher","yes") proxy("internet","yes") implication("sex","F",80,9) count_attributes_in_cluster(3)
proxy("Mjob","at_home") proxy("guardian","mother") proxy("Fjob","other") implication("sex","F",80,8) count_attributes_in_cluster(3)
proxy("Mjob","at_home") proxy("paid","no") proxy("absences","0") implication("sex","F",82,7) count_attributes_in_cluster(3)
proxy("Mjob","at_home") proxy("schoolsup","no") proxy("absences","0") implication("sex","F",81,7) count_attributes_in_cluster(3)
proxy("Mjob","at_home") proxy("Fjob","other") proxy("famsup","yes") implication("sex","F",87,7) count_attributes_in_cluster(3)
proxy("Mjob","at_home") proxy("Fj

#### Alternative 2.2 - customizing values

In [11]:
get_proxy_clusters_choice_rules(datafile, 80, 1, 1, 1)

proxy("absences","1") implication("sex","F",91,1) count_attributes_in_cluster(1)
proxy("absences","5") implication("sex","F",83,1) count_attributes_in_cluster(1)
SAT


## Hardcoded method

This method uses default values.

⚠ The `get_proxy_clusters_hardcoded` function should only be called in the root of this repo. 

The minimum implication and incidence probability values **can** be changed but they require some hardcoding. The clingo rules for this method are in the three following files:
* `clingo_scripts/multi_proxy_hardcoded_1.lp`
* `clingo_scripts/multi_proxy_hardcoded_2.lp`
* `clingo_scripts/multi_proxy_hardcoded_3.lp`

To change the minimum **implication** probability, update the following line in each of the files listed above:
> `    P >= 80,` >> `    P >= <new-minimum-implication>, `


To change the minimum **incidence** probability, update the following line in each of the files listed above:
> `    I >= 5,` >> `    I >= <new-minimum-incidence>, `
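
The two substitutions can also be scripted instead of edited by hand. The sketch below works on the text of a rule file and assumes the thresholds appear exactly as `P >= 80,` and `I >= 5,`, as quoted in this section. `set_thresholds` is a hypothetical helper, and the example text is made up for illustration; reading and writing the three `.lp` files is left out.

```python
import re


def set_thresholds(rules_text, min_implication=None, min_incidence=None):
    """Hypothetical helper: replace the hardcoded thresholds in rule text."""
    if min_implication is not None:
        # Rewrites lines of the form `P >= <number>,`
        rules_text = re.sub(r"\bP >= \d+,", f"P >= {int(min_implication)},", rules_text)
    if min_incidence is not None:
        # Rewrites lines of the form `I >= <number>,`
        rules_text = re.sub(r"\bI >= \d+,", f"I >= {int(min_incidence)},", rules_text)
    return rules_text


example = "    P >= 80,\n    I >= 5,\n"  # illustrative fragment, not a full rule
print(set_thresholds(example, min_implication=90, min_incidence=2))
```

Applying the helper to each of the three files keeps the thresholds consistent across cluster sizes.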

```
get_proxy_clusters_hardcoded(
    datafile: str
)
```

In [12]:
get_proxy_clusters_hardcoded(datafile)

protected("sex") count_items(649) count_facts(11682)


protected("sex") count_items(649) implication("famsize","GT3","Pstatus","A","sex","F",81,5) implication("Pstatus","A","famsize","GT3","sex","F",81,5) implication("Pstatus","A","romantic","yes","sex","F",80,5) implication("Mjob","at_home","activities","no","sex","F",80,11) implication("activities","no","Mjob","at_home","sex","F",80,11) implication("romantic","yes","Pstatus","A","sex","F",80,5) count_facts(11682)


protected("sex") count_items(649) implication("Mjob","at_home","address","U","school","GP","sex","F",80,6) implication("address","U","Mjob","at_home","school","GP","sex","F",80,6) implication("famsup","yes","Mjob","at_home","school","GP","sex","F",83,5) implication("Mjob","at_home","famsup","yes","school","GP","sex","F",83,5) implication("Mjob","at_home","school","GP","address","U","sex","F",80,6) implication("school","GP","Mjob","at_home","address","U","sex","F",80,6) implication("activities","no","Mjob","at_home","addres

## Undersampled Hardcoded method

⚠ This method requires previous data transformation (undersampling) and subsequent verifications.

### .csv data transformation
 **⚠ Edit the cell below** if needed


In [13]:
sourcedatafolder = "example_datasets_no_ordinals/"
undersampleddatafolder = "undersampled_clingo_data/"
n_records = 500

In [14]:
!mkdir -p $undersampleddatafolder

In [15]:
undersample_csv_to_clingo(sourcedatafolder, dataset, undersampleddatafolder, protected_attributes, outcome_attribute, n_records)

 **⚠ Resulting file name** should be the following `undersampleddatafile`:

In [16]:
undersampleddatafile = undersampleddatafolder + "recs-" + str(n_records) + "-data-" + dataset + ".lp"

### Hardcoded regular usage
The call is the same as in the Hardcoded method, except that the clingo output is redirected into a string variable.

In [17]:
clingo_output = ""

with io.StringIO() as buf, redirect_stdout(buf):
    get_proxy_clusters_hardcoded(undersampleddatafile)
    clingo_output = buf.getvalue()
    
print(clingo_output)

protected("sex") count_items(500) count_facts(9000)


protected("sex") count_items(500) implication("activities","no","Mjob","at_home","sex","F",83,11) implication("activities","no","schoolsup","yes","sex","F",82,5) implication("guardian","father","Mjob","at_home","sex","F",80,5) implication("absences","0","Mjob","at_home","sex","F",82,7) implication("romantic","yes","reason","reputation","sex","F",81,6) implication("reason","reputation","romantic","yes","sex","F",81,6) implication("Mjob","other","schoolsup","yes","sex","F",80,5) implication("Mjob","at_home","activities","no","sex","F",83,11) implication("Mjob","at_home","guardian","father","sex","F",80,5) implication("Mjob","at_home","absences","0","sex","F",82,7) implication("schoolsup","yes","activities","no","sex","F",82,5) implication("schoolsup","yes","Mjob","other","sex","F",80,5) count_facts(9000)


protected("sex") count_items(500) implication("Mjob","at_home","address","R","school","MS","sex","F",81,6) implication("Mjob","at_
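
The capture pattern used in this cell can be wrapped in a small function so that it works for any of the printing functions in this notebook. This is only a sketch; `capture_stdout` is a hypothetical helper that is not part of `python_scripts`. Note that `redirect_stdout` captures text written through Python's `sys.stdout`, not output written directly by a child process.

```python
import io
from contextlib import redirect_stdout


def capture_stdout(func, *args, **kwargs):
    """Hypothetical helper: run `func` and return what it printed to stdout."""
    with io.StringIO() as buf, redirect_stdout(buf):
        func(*args, **kwargs)
        # Read the buffer before the `with` block closes it.
        return buf.getvalue()


captured = capture_stdout(print, 'protected("sex")', "count_items(500)")
print(repr(captured))
```

With this helper, the cell above could be reduced to a single call that passes `get_proxy_clusters_hardcoded` and `undersampleddatafile` as arguments.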

### Verifying proxies against full dataset

Processing potential proxies from previous step

In [18]:
potential_proxy_string = ""

with io.StringIO() as buf, redirect_stdout(buf):
    process_potential_implications(clingo_output)
    potential_proxy_string = buf.getvalue()
    
print(potential_proxy_string)


potential_implication("activities","no","Mjob","at_home","sex","F",83,11) .
potential_implication("activities","no","schoolsup","yes","sex","F",82,5) .
potential_implication("guardian","father","Mjob","at_home","sex","F",80,5) .
potential_implication("absences","0","Mjob","at_home","sex","F",82,7) .
potential_implication("romantic","yes","reason","reputation","sex","F",81,6) .
potential_implication("reason","reputation","romantic","yes","sex","F",81,6) .
potential_implication("Mjob","other","schoolsup","yes","sex","F",80,5) .
potential_implication("Mjob","at_home","activities","no","sex","F",83,11) .
potential_implication("Mjob","at_home","guardian","father","sex","F",80,5) .
potential_implication("Mjob","at_home","absences","0","sex","F",82,7) .
potential_implication("schoolsup","yes","activities","no","sex","F",82,5) .
potential_implication("schoolsup","yes","Mjob","other","sex","F",80,5) .
potential_implication("Mjob","at_home","address","R","school","MS","sex","F",81,6) .
potential
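
The transformation visible in this output, where each `implication(...)` atom becomes a `potential_implication(...) .` fact, can be illustrated as follows. This is a sketch of the visible input and output format only; the actual implementation of `process_potential_implications` may differ, `to_potential_facts` is a hypothetical name, and the regular expression assumes that attribute values contain no parentheses.

```python
import re

# Matches implication(...) atoms that are not already part of a longer name
# such as potential_implication(...).
ATOM_RE = re.compile(r'(?<![A-Za-z_])implication\(([^()]*)\)')


def to_potential_facts(clingo_output):
    """Hypothetical helper: rewrite implication atoms as potential_implication facts."""
    return "".join(
        f"potential_implication({args}) .\n"
        for args in ATOM_RE.findall(clingo_output)
    )


sample = ('protected("sex") count_items(500) '
          'implication("activities","no","Mjob","at_home","sex","F",83,11) '
          'count_facts(9000)')
print(to_potential_facts(sample))
```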

The minimum implication and incidence probability values can be changed as previously explained. The clingo rules for this method are in the three following files:

* `clingo_scripts/multi_proxy_hardcoded_check_1.lp`
* `clingo_scripts/multi_proxy_hardcoded_check_2.lp`
* `clingo_scripts/multi_proxy_hardcoded_check_3.lp`


```
check_implication(
    potential_proxy_string: str,
    datafile: str
)
```


In [19]:
check_implication(potential_proxy_string, datafile)

cluster size = 1
protected("sex") count_items(649)




cluster size = 2
protected("sex") count_items(649) implication("activities","no","Mjob","at_home","sex","F",80,11) implication("Mjob","at_home","activities","no","sex","F",80,11)




cluster size = 3
protected("sex") count_items(649) implication("Mjob","at_home","famsize","GT3","school","MS","sex","F",80,7) implication("romantic","yes","activities","no","school","MS","sex","F",82,7) implication("Mjob","at_home","activities","no","school","MS","sex","F",85,6) implication("Mjob","at_home","absences","0","school","MS","sex","F",80,5) implication("activities","no","romantic","yes","school","MS","sex","F",82,7) implication("Mjob","at_home","Fjob","other","school","MS","sex","F",84,6) implication("famsize","GT3","Mjob","at_home","school","MS","sex","F",80,7) implication("activities","no","Mjob","at_home","school","MS","sex","F",85,6) implication("absences","0","Mjob","at_home","school","MS","sex","F",80,5) implication("Fjob","other","Mjo

---

⚠ If the previous cell yields **Notebook errors**, do the following steps instead:

In [20]:
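auxfilename = "potential_proxies_" + dataset + ".lp"
# Use a context manager so the file is flushed and closed before it is read from the command line
with open(auxfilename, "w") as f:
    f.write(potential_proxy_string)
print("datafile:", datafile)
print("dataset:", dataset)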

datafile: clingo_data/data-student-performance-por.lp
dataset: student-performance-por


Run the following in a command line in the root of the repo:

```python3 python_scripts/multi_proxy_undersampled.py potential_proxies_<dataset>.lp <datafile>```

For example:

```python3 python_scripts/multi_proxy_undersampled.py potential_proxies_student-performance-por.lp clingo_data/data-student-performance-por.lp```
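
The same command can be assembled from the notebook variables rather than typed by hand. The sketch below only builds and prints the argument list; `build_check_command` and `run_check` are hypothetical helpers, and `run_check` is defined but not called, since it must be executed from the root of the repo.

```python
import subprocess
import sys


def build_check_command(dataset, datafile):
    """Hypothetical helper: argument list for the command shown above."""
    return [
        sys.executable,  # the Python interpreter currently in use
        "python_scripts/multi_proxy_undersampled.py",
        "potential_proxies_" + dataset + ".lp",
        datafile,
    ]


def run_check(dataset, datafile):
    """Run the check in a separate process; raises if the script exits with an error."""
    return subprocess.run(build_check_command(dataset, datafile), check=True)


cmd = build_check_command("student-performance-por",
                          "clingo_data/data-student-performance-por.lp")
print(" ".join(cmd))
```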

---