## Imports

In [None]:
import os
import sys

import io
from contextlib import redirect_stdout

sys.path.insert(0, os.path.join(os.getcwd(), 'python_scripts'))

from transform_data import csv_to_clingo, undersample_csv_to_clingo
from single_proxy import get_single_proxies
from multi_proxy_choice_rules import get_proxy_clusters_choice_rules
from multi_proxy_hardcoded import get_proxy_clusters_hardcoded
from multi_proxy_undersampled import process_potential_implications, check_implication


## General info

* Whenever default values are mentioned, they are as follows:

| Attribute | Default Value |
|---|---|
| Minimum implication probability | 80 |
| Minimum incidence probability | 5 |
| Minimum proxy cluster size | 1 |
| Maximum proxy cluster size | 3 |

* All of the attribute values above must be **integers**.
* Running clingo programs in a Jupyter notebook can cause problems, for example an `IOPub data rate exceeded` error. In that case, run the required instructions from a separate Python file or from the command line.
* Functions from files in the `python_scripts` directory should be called from the root of this repo.
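
Since every threshold has to be an integer, it can help to check custom values before passing them to the discovery functions. The following is only a sketch: `validate_thresholds` is a hypothetical helper that is not part of `python_scripts`, and treating the two probabilities as percentages between 0 and 100 is an assumption based on the default values above.

```python
def validate_thresholds(min_implication_probability=80,
                        min_incidence_probability=5,
                        min_cluster_size=1,
                        max_cluster_size=3):
    """Hypothetical helper: check threshold values and return them as a dict."""
    values = {
        "min_implication_probability": min_implication_probability,
        "min_incidence_probability": min_incidence_probability,
        "min_cluster_size": min_cluster_size,
        "max_cluster_size": max_cluster_size,
    }
    for name, value in values.items():
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer, got {value!r}")
    # Assumption: probabilities are percentages in the range 0..100
    for name in ("min_implication_probability", "min_incidence_probability"):
        if not 0 <= values[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    if not 1 <= min_cluster_size <= max_cluster_size:
        raise ValueError("cluster sizes must satisfy 1 <= min <= max")
    return values


# Calling it without arguments returns the defaults from the table above
defaults = validate_thresholds()
```

A float such as `80.5` raises a `TypeError`, which surfaces the problem before clingo is started.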

---

## .csv data transformation

 **⚠ Edit cell below** to use preferred dataset

In [None]:
sourcedatafolder = "example_datasets_no_ordinals/"
outdatafolder = "clingo_data/"

#dataset = "student-performance-mat"
#protected_attributes = ["sex"]
#outcome_attribute = "G3"

dataset = "student-performance-por"
protected_attributes = ["sex"]
outcome_attribute = "G3"

#dataset = "adult"
#protected_attributes = ["gender", "race"]
#outcome_attribute = "income"

#dataset = "bank-marketing"
#protected_attributes = ["marital"]
#outcome_attribute = "deposit"

#dataset = "compas"
#protected_attributes = ["race", "sex"]
#outcome_attribute = "is_violent_recid"

#dataset = "german-credit"
#protected_attributes = ["age_cat"]
#outcome_attribute = "class"

#dataset = "credit-card-clients"
#protected_attributes = ["SEX", "MARRIAGE"]
#outcome_attribute = "default.payment.next.month"

#dataset = "diabetes"
#protected_attributes = ["gender"]
#outcome_attribute = "readmitted"

#dataset = "kdd-adult-census-income"
#protected_attributes = ["sex", "race"]
#outcome_attribute = "income"

#dataset = "law-school"
#protected_attributes = ["sex", "race", "race1", "race2"]
#outcome_attribute = "gpa"

#dataset = "open-university-learning-analytics"
#protected_attributes = ["gender"]
#outcome_attribute = "final_result"

---

Creating output directory if it does not exist already

In [None]:
!mkdir -p $outdatafolder

Creating data file readable by clingo programs

In [None]:
csv_to_clingo(sourcedatafolder, dataset, outdatafolder, protected_attributes, outcome_attribute)

 **⚠ Resulting file name** should be the following `datafile`:

In [None]:
datafile = outdatafolder + "data-" + dataset + ".lp"
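
The naming pattern above can also be wrapped in a small function, so the same logic covers the undersampled file name used later in this notebook. This is a sketch only: `clingo_data_path` is a hypothetical helper, not something provided by `python_scripts`.

```python
import os


def clingo_data_path(folder, dataset, n_records=None):
    """Hypothetical helper: build the .lp file name following the notebook's pattern.

    Without n_records: <folder>/data-<dataset>.lp
    With n_records:    <folder>/recs-<n_records>-data-<dataset>.lp
    """
    prefix = f"recs-{n_records}-" if n_records is not None else ""
    return os.path.join(folder, f"{prefix}data-{dataset}.lp")


example_path = clingo_data_path("clingo_data/", "student-performance-por")
# os.path.exists(example_path) can then confirm that csv_to_clingo wrote the file
```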

---

# Single proxy discovery

## Alternative 1 - running clingo directly

In [None]:
!clingo $datafile clingo_scripts/single_proxy_default.lp

## Alternative 2 - running clingo through python 

```
get_single_proxies(
    datafile: str, 
    min_implication_probability: optional int, 
    min_incidence_probability: optional int,
)
```

### Alternative 2.1 - using default values

In [None]:
get_single_proxies(datafile)

### Alternative 2.2 - customizing minimum implication and incidence probabilities

In [None]:
get_single_proxies(datafile, 85, 1)

---

# Multiple proxy discovery

## Choice Rules method

### Alternative 1 - running clingo directly
⚠ This method is **not recommended**, since it may take longer to run.
It uses the default values listed above.

In [None]:
!clingo -W none $datafile clingo_scripts/multi_proxy_choice_rules_default.lp 0

### Alternative 2 - running clingo through python

```
get_proxy_clusters_choice_rules(
    datafile: str, 
    min_implication_probability: optional int, 
    min_incidence_probability: optional int,
    min_cluster_size: optional int,
    max_cluster_size: optional int
)
```

#### Alternative 2.1 - using default values

In [None]:
get_proxy_clusters_choice_rules(datafile)
# Same as
# get_proxy_clusters_choice_rules(datafile, 80, 5, 1, 3)

#### Alternative 2.2 - customizing values

In [None]:
get_proxy_clusters_choice_rules(datafile, 80, 1, 1, 1)

## Hardcoded method

This method uses default values.

⚠ The `get_proxy_clusters_hardcoded` function should only be called from the root of this repo.

The minimum implication and incidence probability values **can** be changed but they require some hardcoding. The clingo rules for this method are in the three following files:
* `clingo_scripts/multi_proxy_hardcoded_1.lp`
* `clingo_scripts/multi_proxy_hardcoded_2.lp`
* `clingo_scripts/multi_proxy_hardcoded_3.lp`

To change the minimum **implication** probability, update the following line in each of the files above:
> `    P >= 80,` >> `    P >= <new-minimum-implication>, `


To change the minimum **incidence** probability, update the following line in each of the files above:
> `    I >= 5,` >> `    I >= <new-minimum-incidence>, `
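
The edit to the two lines quoted above could also be scripted instead of done by hand. The sketch below works on a string and assumes the thresholds appear in the `.lp` files exactly in the `P >= <number>,` and `I >= <number>,` form shown above. `set_thresholds` is a hypothetical helper and the sample text is a made-up fragment, not the real content of the clingo scripts.

```python
import re


def set_thresholds(rule_text, min_implication=None, min_incidence=None):
    """Hypothetical helper: replace the hardcoded thresholds in clingo rule text."""
    if min_implication is not None:
        # Matches e.g. "P >= 80," and swaps in the new integer
        rule_text = re.sub(r"\bP\s*>=\s*\d+\s*,", f"P >= {int(min_implication)},", rule_text)
    if min_incidence is not None:
        # Matches e.g. "I >= 5," and swaps in the new integer
        rule_text = re.sub(r"\bI\s*>=\s*\d+\s*,", f"I >= {int(min_incidence)},", rule_text)
    return rule_text


# Made-up fragment that only mimics the two threshold lines
sample = "    P >= 80,\n    I >= 5,\n"
updated = set_thresholds(sample, min_implication=90, min_incidence=2)
```

To apply it to the real files, each of the three scripts would be read, passed through the function and written back. Keeping a backup copy first is advisable, since the pattern has not been checked against the actual rules.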

```
get_proxy_clusters_hardcoded(
    datafile: str
)
```

In [None]:
get_proxy_clusters_hardcoded(datafile)

## Undersampled Hardcoded method

⚠ This method requires previous data transformation (undersampling) and subsequent verifications.

### .csv data transformation
 **⚠ Edit cell below** if needed


In [None]:
sourcedatafolder = "example_datasets_no_ordinals/"
undersampleddatafolder = "undersampled_clingo_data/"
n_records = 500

In [None]:
!mkdir -p $undersampleddatafolder

In [None]:
undersample_csv_to_clingo(sourcedatafolder, dataset, undersampleddatafolder, protected_attributes, outcome_attribute, n_records)

 **⚠ Resulting file name** should be the following `undersampleddatafile`:

In [None]:
undersampleddatafile = undersampleddatafolder + "recs-" + str(n_records) + "-data-" + dataset + ".lp"

### Hardcoded regular usage
This is the same call as in the hardcoded method above, except that the clingo output is redirected into a string variable.

In [None]:
clingo_output = ""

with io.StringIO() as buf, redirect_stdout(buf):
    get_proxy_clusters_hardcoded(undersampleddatafile)
    clingo_output = buf.getvalue()
    
print(clingo_output)
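
The capture pattern used in this cell is repeated in the verification step, so it could be factored into a small function. This is a minimal sketch in which `capture_stdout` is a hypothetical helper name. Note that `redirect_stdout` only captures text written through Python's `sys.stdout`, not output written directly by a child process.

```python
import io
from contextlib import redirect_stdout


def capture_stdout(func, *args, **kwargs):
    """Hypothetical helper: call func and return everything it printed as a string."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        func(*args, **kwargs)
    return buf.getvalue()


# Demonstration with the built-in print function
captured = capture_stdout(print, "hello")
```

With such a helper, the cell above would reduce to `clingo_output = capture_stdout(get_proxy_clusters_hardcoded, undersampleddatafile)`.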

### Verifying proxies against full dataset

Processing potential proxies from previous step

In [None]:
potential_proxy_string = ""

with io.StringIO() as buf, redirect_stdout(buf):
    process_potential_implications(clingo_output)
    potential_proxy_string = buf.getvalue()
    
print(potential_proxy_string)


The minimum implication and incidence probability values can be changed as previously explained. The clingo rules for this method are in the three following files:

* `clingo_scripts/multi_proxy_hardcoded_check_1.lp`
* `clingo_scripts/multi_proxy_hardcoded_check_2.lp`
* `clingo_scripts/multi_proxy_hardcoded_check_3.lp`


```
check_implication(
    potential_proxy_string: str,
    datafile: str
)
```


In [None]:
check_implication(potential_proxy_string, datafile)

---

⚠ If the previous cell yields **Notebook errors**, do the following steps instead:

In [None]:
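auxfilename = "potential_proxies_" + dataset + ".lp"
# Use a context manager so the file is flushed and closed before the external script reads it
with open(auxfilename, "w") as f:
    f.write(potential_proxy_string)
print("datafile:", datafile)
print("dataset:", dataset)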

Run the following in a command line in the root of the repo:

```python3 python_scripts/multi_proxy_undersampled.py potential_proxies_<dataset>.lp <datafile>```

For example:

```python3 python_scripts/multi_proxy_undersampled.py potential_proxies_student-performance-por.lp clingo_data/data-student-performance-por.lp```
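
If typing the command by hand is error-prone, it can be assembled from the same variables used in the notebook. The sketch below only builds and prints the command. `build_check_command` and `run_check` are hypothetical helpers, and `run_check` has to be executed from a separate Python file started in the root of the repo.

```python
import shlex
import subprocess


def build_check_command(dataset, datafile):
    """Hypothetical helper: build the command line shown above as an argument list."""
    return [
        "python3",
        "python_scripts/multi_proxy_undersampled.py",
        f"potential_proxies_{dataset}.lp",
        datafile,
    ]


def run_check(dataset, datafile):
    """Hypothetical helper: run the check script and raise if it exits with an error."""
    subprocess.run(build_check_command(dataset, datafile), check=True)


cmd = build_check_command("student-performance-por",
                          "clingo_data/data-student-performance-por.lp")
print(shlex.join(cmd))  # copy this line into a terminal
```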

---