# **MDSS API and parameters**

- `direction` specifies whether we scan for a subgroup with higher-than-expected outcomes (`"positive"`) or lower-than-expected outcomes (`"negative"`).
- `penalty` is a coefficient that lets us adjust the complexity of the highest-scoring subset. It can be thought of as a regularization constant.
- `num_iters` specifies the number of iterations and thus the number of random initializations. In each iteration, we optimize over subsets of all the attributes, starting from randomly initialized values for each attribute.
- `use_not_direction` specifies whether to include complements (NOTs) in our subsets, along with the ANDs and ORs, in order to shorten the anomalous subset's description (its number of literals).
- `cpu` specifies whether to leverage multi-core performance when scanning.

Import the MDSS and Bernoulli modules.

In [1]:
from mdss.ScoringFunctions.Bernoulli import Bernoulli
from mdss.MDSS import MDSS
import pandas as pd
import numpy as np

In [2]:
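import ssl
# Workaround for certificate errors when downloading the dataset.
# Note: this disables HTTPS certificate verification for default contexts in this process.
ssl._create_default_https_context = ssl._create_unverified_context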

In [3]:
import warnings
warnings.filterwarnings('ignore')

## Data

In [4]:
adult = pd.read_csv('https://gist.githubusercontent.com/Viktour19/b690679802c431646d36f7e2dd117b9e/raw/d8f17bf25664bd2d9fa010750b9e451c4155dd61/adult_autostrat.csv')
adult.head()

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,sex,native_country,age_bin,education_num_bin,hours_per_week_bin,capital_gain_bin,capital_loss_bin,observed,expectation
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States,17-27,1-8,40-44,0,0,0,0.236226
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States,37-47,9,45-99,0,0,0,0.236226
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States,28-36,12-16,40-44,0,0,1,0.236226
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States,37-47,10-11,40-44,7298-7978,0,1,0.236226
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States,17-27,10-11,1-39,0,0,0,0.236226


In [5]:
scoring_function = Bernoulli(direction="positive")
scanner = MDSS(scoring_function)

### 1. Direction

In the positive direction, we look for the sub-groups most favoured by our outcome, i.e. those most likely to earn above 50k.

In [6]:
scoring_function = Bernoulli(direction="positive")
scanner = MDSS(scoring_function)

subset,score = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , 
    penalty=20 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
    
print("Group most likely to earn above 50k: \n{}".format(subset), score)

Group most likely to earn above 50k: 
{'marital_status': [' Married-civ-spouse'], 'age_bin': ['28-36', '37-47', '48-90'], 'education': [' Assoc-acdm', ' Bachelors', ' Doctorate', ' Masters', ' Prof-school', ' Some-college']} 1065.8355262262485


In the negative direction, we look for the sub-groups least favoured by our outcome, i.e. those most likely to earn below 50k.

In [7]:
scoring_function = Bernoulli(direction="negative")
scanner = MDSS(scoring_function)

subset,score = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=20 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Group most likely to earn below 50k: \n{}".format(subset), score)

Group most likely to earn below 50k: 
{'capital_gain_bin': ['0', '114-2354'], 'marital_status': [' Divorced', ' Never-married', ' Separated', ' Widowed'], 'education_num_bin': ['1-8', '10-11', '9']} 1068.4273492510838
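To make the role of `direction` more concrete, here is a minimal sketch of a Bernoulli log-likelihood-ratio score for one fixed sub-group. It assumes the alternative hypothesis multiplies each record's expected odds by a constant `q`, so each record contributes `y*log(q) - log(1 - p + q*p)`; under that assumption a positive scan restricts `q > 1` and a negative scan restricts `q < 1`. This is a textbook formulation and not necessarily what `mdss` implements. `bernoulli_score` and `best_score` are hypothetical helpers, and the grid search is a simplification.

```python
import math

def bernoulli_score(observed, expectations, q):
    """Log-likelihood ratio when expected odds are multiplied by q (assumed model)."""
    return sum(y * math.log(q) - math.log(1 - p + q * p)
               for y, p in zip(observed, expectations))

def best_score(observed, expectations, direction):
    """Maximize the score over a coarse grid of q values for the given direction."""
    if direction == "positive":
        grid = [1 + k / 10 for k in range(0, 201)]   # q from 1.0 to 21.0
    elif direction == "negative":
        grid = [k / 100 for k in range(1, 101)]      # q from 0.01 to 1.0
    else:
        raise ValueError("direction must be 'positive' or 'negative'")
    return max(bernoulli_score(observed, expectations, q) for q in grid)

# Toy sub-group: 3 of 4 records have the outcome, but only 25% were expected to
observed = [1, 1, 1, 0]
expectations = [0.25, 0.25, 0.25, 0.25]

positive = best_score(observed, expectations, "positive")
negative = best_score(observed, expectations, "negative")
print(positive, negative)
```

Because the toy sub-group's observed rate is above its expectation, it scores about 2.2 in the positive direction and 0 in the negative direction.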


### 2. Penalty

The `penalty` coefficient lets us adjust the complexity of the highest-scoring subset, i.e. reduce its number of literals. The scans in this section reuse the negative-direction scanner from the previous cell.

If we don't penalize, i.e. `penalty=0`, we get a very large anomalous subset description.

In [8]:
subset,score = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=0 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Non-penalized Subset: \n{}".format(subset))

Non-penalized Subset: 
{'capital_gain_bin': ['0', '114-2354', '2407-3273', '3325-4416'], 'marital_status': [' Divorced', ' Married-civ-spouse', ' Married-spouse-absent', ' Never-married', ' Separated', ' Widowed'], 'workclass': [' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private', ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'], 'occupation': [' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair', ' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners', ' Machine-op-inspct', ' Other-service', ' Priv-house-serv', ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support', ' Transport-moving'], 'race': [' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other', ' White'], 'capital_loss_bin': ['0', '1672-1876', '213-1669'], 'education_num_bin': ['1-8', '10-11', '9'], 'hours_per_week_bin': ['1-39', '40-44', '45-99'], 'relationship': [' Not-in-family', ' Other-relative', ' Own-child', ' Unmarried'], 'sex': [' Female', ' Male'], 'age_bin': 

Increasing the penalty to 30, i.e. `penalty=30`, significantly reduces the number of literals in our subset.

In [9]:
subset,score = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=30 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Penalized Subset: \n{}".format(subset))

Penalized Subset: 
{'capital_gain_bin': ['0'], 'marital_status': [' Divorced', ' Never-married', ' Separated', ' Widowed']}


### 3. NOT / Complement

The goal is to reduce the number of literals in our anomalous subset.

In [10]:
scoring_function = Bernoulli(direction="positive")
scanner = MDSS(scoring_function)

Scan in the prime direction, i.e. without complements (`use_not_direction=False`).

In [11]:
subset,score = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=15 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Prime Subset: \n{}".format(subset))
print("Score: {}".format(score))
print("No. of literals in PRIME: {}".format(sum([len(val) for key, val in subset.items()])))

Prime Subset: 
{'marital_status': [' Married-civ-spouse'], 'age_bin': ['28-36', '37-47', '48-90'], 'education': [' Assoc-acdm', ' Bachelors', ' Doctorate', ' Masters', ' Prof-school', ' Some-college']}
Score: 1115.8355262262485
No. of literals in PRIME: 10
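This prime subset is the same 10-literal subset returned by the `penalty=20` scan in section 1, and the two scores differ by exactly 50, i.e. 5 per literal. That is consistent with the penalty being subtracted once per literal, although this notebook does not state the formula, so treat it as an assumption. The sketch below checks the arithmetic; `count_literals` and `rescore` are hypothetical helpers, not part of `mdss`.

```python
def count_literals(subset):
    """Number of attribute values (literals) in a subset description."""
    return sum(len(values) for values in subset.values())

def rescore(score, subset, old_penalty, new_penalty):
    """Re-express a score under a different penalty.

    Assumes score = unpenalized_score - penalty * number_of_literals.
    """
    return score + (old_penalty - new_penalty) * count_literals(subset)

# Subset reported by the penalty=20 scan in section 1
prime = {
    'marital_status': [' Married-civ-spouse'],
    'age_bin': ['28-36', '37-47', '48-90'],
    'education': [' Assoc-acdm', ' Bachelors', ' Doctorate',
                  ' Masters', ' Prof-school', ' Some-college'],
}

n_literals = count_literals(prime)
predicted = rescore(1065.8355262262485, prime, old_penalty=20, new_penalty=15)
print(n_literals, round(predicted, 4))
```

The predicted value agrees with the score printed above. The shortcut only applies when both penalties lead to the same subset; a different penalty can change which subset is optimal.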


Our complement subset has half as many literals (5 vs. 10) while being more anomalous than our prime subset (score 1231.03 vs. 1115.84).

In [12]:
subset_,score_ = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=15 , num_iters=1 , use_not_direction=True , num_of_subsets=1 , verbose=False , cpu=0)
print("Not/Complement Subset: \n{}".format(subset_))
print("Score: {}".format(score_))
print("No. of literals in COMPLEMENT: {}".format(sum([len(val) for key, val in subset_.items()])))

Not/Complement Subset: 
{'occupation': {' Other-service'}, 'marital_status': [' Married-civ-spouse'], 'age_bin': {'17-27'}, 'education': {' HS-grad'}, 'education_num_bin': {'1-8'}}
Score: 1231.0251383988543
No. of literals in COMPLEMENT: 5


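Translate our subset back to its prime version. In the output above, values given as sets are complements (every category except those listed), while values given as lists are ordinary inclusions.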

In [13]:
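def translate_subset(coordinates, subset):
    """Expand complemented (set) entries of `subset` into explicit lists of categories."""
    translated_subset = {}
    for key, value in subset.items():
        if isinstance(value, list):
            # Already in prime form: keep as is
            translated_subset[key] = value
        elif isinstance(value, set):
            # Complement: keep every category of this column that is NOT in the set
            all_categories = coordinates[key].unique()
            translated_subset[key] = [i for i in all_categories if i not in value]
        else:
            raise TypeError("Subset values should be list or set")
    return translated_subset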

In [14]:
translated = translate_subset( adult[adult.columns[:-2]],subset_)
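
To see what the translation does without running a scan, here is a toy version of the same logic on plain Python data. `translate_toy` and `categories` are illustrative stand-ins (the notebook's function reads the categories from the DataFrame instead), and the category lists are made up for the example.

```python
def translate_toy(categories, subset):
    """Expand set-valued (complemented) entries into explicit lists."""
    translated = {}
    for key, value in subset.items():
        if isinstance(value, set):
            # NOT {..}: keep every known category outside the set
            translated[key] = [c for c in categories[key] if c not in value]
        else:
            translated[key] = list(value)
    return translated

# Hypothetical category lists for two attributes
categories = {
    'age_bin': ['17-27', '28-36', '37-47', '48-90'],
    'marital_status': [' Divorced', ' Married-civ-spouse', ' Never-married'],
}
toy_subset = {
    'age_bin': {'17-27'},                       # complement: NOT 17-27
    'marital_status': [' Married-civ-spouse'],  # ordinary inclusion
}

toy_translated = translate_toy(categories, toy_subset)
print(toy_translated)
```

The single complemented literal `NOT '17-27'` expands into three explicit age bins, which is why complement descriptions can be shorter.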

In [15]:
# Rows matching the translated complement subset
to_choose = adult[list(translated.keys())].isin(translated).all(axis=1)
temp_df = adult.loc[to_choose]
subset_size = len(temp_df)/len(adult) * 100
"Our detected sub-group has a size of {}, which is {}% of our data. We observe {} as the probability of earning >50k in this sub-group, but our population mean is {}"\
.format(len(temp_df), subset_size,np.round(temp_df['observed'].mean(),4), np.round(temp_df['expectation'].mean(),4))

'Our detected sub-group has a size of 3771, which is 23.16196793808734% of our data. We observe 0.628 as the probability of earning >50k in this sub-group, but our population mean is 0.2362'

In [16]:
# Rows matching the prime subset, for comparison
to_choose = adult[list(subset.keys())].isin(subset).all(axis=1)
temp_df = adult.loc[to_choose]
subset_size = len(temp_df)/len(adult) * 100
"Our detected sub-group has a size of {}, which is {}% of our data. We observe {} as the probability of earning >50k in this sub-group, but our population mean is {}"\
.format(len(temp_df), subset_size,np.round(temp_df['observed'].mean(),4), np.round(temp_df['expectation'].mean(),4))

'Our detected sub-group has a size of 3577, which is 21.970394938885818% of our data. We observe 0.6324 as the probability of earning >50k in this sub-group, but our population mean is 0.2362'
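
The row selection in the two cells above relies on `DataFrame.isin` with a dict followed by `all(axis=1)`: a row belongs to the sub-group only if, for every attribute in the subset, its value is one of the listed categories (an AND across attributes, an OR within each attribute). A dependency-free sketch of that rule, using made-up records and a hypothetical `matches` helper:

```python
def matches(row, subset):
    """True if, for every attribute in the subset, the row's value is listed."""
    return all(row[attr] in values for attr, values in subset.items())

# Made-up records, for illustration only
rows = [
    {'marital_status': 'Married', 'age_bin': '28-36', 'observed': 1},
    {'marital_status': 'Married', 'age_bin': '17-27', 'observed': 0},
    {'marital_status': 'Single',  'age_bin': '37-47', 'observed': 0},
    {'marital_status': 'Married', 'age_bin': '48-90', 'observed': 1},
]
toy_subset = {'marital_status': ['Married'],
              'age_bin': ['28-36', '37-47', '48-90']}

group = [r for r in rows if matches(r, toy_subset)]
size_pct = len(group) / len(rows) * 100
observed_rate = sum(r['observed'] for r in group) / len(group)
print(len(group), size_pct, observed_rate)
```

Two of the four toy records satisfy both conditions, so the sub-group covers 50% of the data with an observed rate of 1.0.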

### 4. CPU

**Same subset and score**. Scanning with `cpu=0` (single-core) and `cpu=0.5` (multi-core) gives identical results; only the runtime differs. The advantage shows when we have multiple tests running at the same time.

In [17]:
subset,score = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=10 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Single-core subset: \n{}".format(subset))
print("Score: {}".format(score))

Single-core subset: 
{'occupation': [' Adm-clerical', ' Exec-managerial', ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support'], 'marital_status': [' Married-civ-spouse'], 'age_bin': ['28-36', '37-47', '48-90'], 'education': [' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate', ' Masters', ' Prof-school', ' Some-college']}
Score: 1179.6330771061198


In [18]:
subset,score = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=10 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0.5)
print("Multi-core subset: \n{}".format(subset))
print("Score: {}".format(score))

Multi-core subset: 
{'occupation': [' Adm-clerical', ' Exec-managerial', ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support'], 'marital_status': [' Married-civ-spouse'], 'age_bin': ['28-36', '37-47', '48-90'], 'education': [' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate', ' Masters', ' Prof-school', ' Some-college']}
Score: 1179.6330771061198
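
Since the two settings differ only in runtime, a small timing wrapper is a convenient way to compare them. The sketch below is generic: `timed` is a hypothetical helper, and the commented usage assumes the `scanner` and `adult` objects defined earlier in this notebook.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload so the sketch runs on its own
result, elapsed = timed(sum, range(1_000_000))
print(result, f"{elapsed:.4f}s")

# With the notebook's objects, the comparison could look like:
# (subset, score), t_single = timed(scanner.scan, adult[adult.columns[:-2]],
#     adult['observed'], adult['expectation'], penalty=10, num_iters=1, cpu=0)
# (subset, score), t_multi = timed(scanner.scan, adult[adult.columns[:-2]],
#     adult['observed'], adult['expectation'], penalty=10, num_iters=1, cpu=0.5)
```

For a single short scan the multi-core overhead may outweigh any gain, so it is worth timing a realistic workload before choosing a setting.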


### 5. Num-iters

Run only one iteration, i.e. `num_iters=1`.

In [19]:
subset_,score_ = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=1 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Subset: \n{}".format(subset_))
print("Score: {}".format(score_))
print("No. of literals: {}".format(sum([len(val) for key, val in subset_.items()])))

Subset: 
{'relationship': [' Husband', ' Wife'], 'occupation': [' Adm-clerical', ' Armed-Forces', ' Craft-repair', ' Exec-managerial', ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support', ' Transport-moving'], 'capital_loss_bin': ['0', '1887-1974', '1977-3770'], 'age_bin': ['28-36', '37-47', '48-90'], 'education': [' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate', ' Masters', ' Prof-school', ' Some-college']}
Score: 1345.3017616977104
No. of literals: 24


When we run 10 iterations, i.e. `num_iters=10`, we get a more anomalous subset (score 1355.32 vs. 1345.30) than the one found in a single iteration.

In [20]:
subset_,score_ = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=1 , num_iters=10 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Subset: \n{}".format(subset_))
print("Score: {}".format(score_))
print("No. of literals: {}".format(sum([len(val) for key, val in subset_.items()])))

Subset: 
{'occupation': [' Adm-clerical', ' Armed-Forces', ' Craft-repair', ' Exec-managerial', ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support', ' Transport-moving'], 'capital_loss_bin': ['0', '1887-1974', '1977-3770'], 'education_num_bin': ['10-11', '12-16'], 'marital_status': [' Married-AF-spouse', ' Married-civ-spouse'], 'age_bin': ['28-36', '37-47', '48-90']}
Score: 1355.3185519323251
No. of literals: 19
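
Why extra iterations help is easiest to see on a toy problem. Assuming each iteration runs a local search from a fresh random initialization and the best result across iterations is returned, the best of 10 restarts is never worse than the best of 1 when both use the same random seed. The sketch below uses a made-up one-dimensional landscape and a greedy hill climb; it illustrates random restarts in general and is not the search that `mdss` performs.

```python
import random

# Made-up scores with several local maxima
LANDSCAPE = [1, 3, 2, 5, 4, 9, 7, 2, 6, 8, 3]

def hill_climb(start):
    """Greedy ascent: move to the better neighbour until none improves."""
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]
        best = max(neighbours, key=lambda j: LANDSCAPE[j])
        if LANDSCAPE[best] <= LANDSCAPE[i]:
            return LANDSCAPE[i]
        i = best

def best_of_restarts(num_iters, seed=0):
    """Best local maximum found over num_iters random initializations."""
    rng = random.Random(seed)
    return max(hill_climb(rng.randrange(len(LANDSCAPE))) for _ in range(num_iters))

one = best_of_restarts(1)
ten = best_of_restarts(10)
print(one, ten)
```

More restarts raise the chance of landing near the global maximum, at the cost of proportionally more search time.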


### 6. Max Literals

We can have MDSS automatically return a subset with at most `max_literals` literals by passing `penalty=None` and `max_literals=k`. If the `max_literals` parameter is not passed, k defaults to 5.

In [21]:
subset_,score_ = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=None,
                    num_iters=10,  use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Subset: \n{}".format(subset_))
print("Score: {}".format(score_))
print("No. of literals: {}".format(sum([len(val) for key, val in subset_.items()])))

Subset: 
{'education_num_bin': ['12-16'], 'marital_status': [' Married-civ-spouse']}
Score: 781.3349029018855
No. of literals: 2


In [22]:
subset_,score_ = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=None, max_literals = 10,
                    num_iters=10,  use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Subset: \n{}".format(subset_))
print("Score: {}".format(score_))
print("No. of literals: {}".format(sum([len(val) for key, val in subset_.items()])))

Subset: 
{'age_bin': ['28-36', '37-47', '48-90'], 'marital_status': [' Married-civ-spouse'], 'education_num_bin': ['10-11', '12-16']}
Score: 1149.5470882071652
No. of literals: 6
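
The notebook does not describe how `max_literals` is enforced internally. If you wanted a similar effect using only the `penalty` parameter, one option is to raise the penalty until the returned subset is small enough. The sketch below assumes a hypothetical `scan_fn(penalty)` callable that returns `(subset, score)`; the fake scanner exists only to make the example runnable.

```python
def scan_with_literal_budget(scan_fn, max_literals, penalties):
    """Return the first (penalty, subset, score) whose subset fits the budget."""
    for penalty in penalties:
        subset, score = scan_fn(penalty)
        n_literals = sum(len(values) for values in subset.values())
        if n_literals <= max_literals:
            return penalty, subset, score
    raise ValueError("no penalty in the list met the literal budget")

def fake_scan(penalty):
    """Stand-in scanner: larger penalties yield shorter descriptions."""
    values = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    keep = max(1, len(values) - penalty // 5)
    return {'attr': values[:keep]}, 100.0 - penalty * keep

penalty, subset, score = scan_with_literal_budget(
    fake_scan, max_literals=5, penalties=range(0, 50, 5))
print(penalty, subset, score)
```

With the notebook's objects, `scan_fn` could be a small lambda that wraps `scanner.scan` and forwards the penalty, keeping the other arguments fixed.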
