# **Scoring Functions and Modes in MDSS**

MDSS currently supports the following **4** scoring functions:
- `BerkJones`: Non-parametric scoring function (_use only where expectations are constant or absent_). Can be used for all four supported outcome types: binary, continuous, nominal, and ordinal.
- `Bernoulli`: Parametric scoring function. Can be used for two of the four supported outcome types: binary and nominal.
- `Gaussian`: Parametric scoring function. Can be used for one of the four supported outcome types: continuous.
- `Poisson`: Parametric scoring function. Can be used for three of the four supported outcome types: binary, continuous, and ordinal.
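
The compatibility rules in the list above can be captured in a small lookup table. This is a minimal sketch for illustration only: `SUPPORTED_MODES` and `is_supported` are hypothetical names, not part of the `mdss` package, and the mapping simply transcribes the list.

```python
# Hypothetical lookup table transcribing the list above; not part of the mdss package.
SUPPORTED_MODES = {
    "BerkJones": {"binary", "continuous", "nominal", "ordinal"},
    "Bernoulli": {"binary", "nominal"},
    "Gaussian": {"continuous"},
    "Poisson": {"binary", "continuous", "ordinal"},
}


def is_supported(scoring_function: str, mode: str) -> bool:
    """Return True if the named scoring function is listed for the given mode."""
    return mode in SUPPORTED_MODES.get(scoring_function, set())


print(is_supported("Gaussian", "continuous"))  # True
print(is_supported("Bernoulli", "ordinal"))    # False
```

A check like this could run before building a scanner, so that an unsupported pairing is caught early.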


Modes dictate the _**type of outcome**_ we expect. There are 4 types of outcomes supported:
- `Binary` _(default)_: Yes/no outcomes. Outcomes must be 0 or 1.
- `Continuous`: Continuous outcomes. Outcomes could be any real number.
- `Nominal`: Multiclass outcomes with no rank or order between them. Outcomes must be a finite set of integers with dimensionality <= 10.
- `Ordinal`: Multiclass outcomes that are ranked in a specific order. Outcomes must be positive integers.
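
The outcome constraints above could be checked before scanning. The sketch below is a hypothetical helper (`check_outcomes` is not part of `mdss`), and it assumes that "dimensionality <= 10" means at most 10 distinct class values.

```python
import numpy as np


def check_outcomes(outcomes, mode="binary"):
    """Hypothetical validator mirroring the outcome constraints listed above."""
    values = np.asarray(outcomes)
    # Every mode requires numeric outcomes.
    if not np.issubdtype(values.dtype, np.number):
        return False
    if mode == "continuous":
        # Any real number is allowed.
        return True
    # The remaining modes all require integer-valued outcomes.
    if not np.all(values == np.round(values)):
        return False
    if mode == "binary":
        return bool(np.isin(values, [0, 1]).all())
    if mode == "nominal":
        # Assumption: "dimensionality" is read as the number of distinct classes.
        return len(np.unique(values)) <= 10
    if mode == "ordinal":
        return bool((values > 0).all())
    raise ValueError(f"Unknown mode: {mode}")


print(check_outcomes([0, 1, 1, 0]))           # True
print(check_outcomes([0, 1, 2], "ordinal"))   # False, because 0 is not positive
```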

Import the MDSS and scoring function modules:

In [1]:
from mdss.ScoringFunctions.Bernoulli import Bernoulli
from mdss.ScoringFunctions.Gaussian import Gaussian
from mdss.ScoringFunctions.BerkJones import BerkJones
from mdss.ScoringFunctions.Poisson import Poisson
from mdss.MDSS import MDSS
import numpy as np 
import pandas as pd

In [2]:
import ssl
# Disables HTTPS certificate verification for urllib-based downloads (e.g. pd.read_csv on a URL)
ssl._create_default_https_context = ssl._create_unverified_context

In [3]:
import warnings
warnings.filterwarnings('ignore')

## Adult Dataset - Bernoulli

_Outcome = Persons earning >50k_

This is a binary classification example, so `mode = "binary"`.  
It can use the `BerkJones` or the `Bernoulli` scoring function. We'll use `Bernoulli`.

In [4]:
adult = pd.read_csv('https://gist.githubusercontent.com/Viktour19/b690679802c431646d36f7e2dd117b9e/raw/d8f17bf25664bd2d9fa010750b9e451c4155dd61/adult_autostrat.csv')
adult.head()

| Unnamed: 0 | workclass | education | marital_status | occupation | relationship | race | sex | native_country | age_bin | education_num_bin | hours_per_week_bin | capital_gain_bin | capital_loss_bin | observed | expectation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | United-States | 17-27 | 1-8 | 40-44 | 0 | 0 | 0 | 0.236226 |
| 1 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | United-States | 37-47 | 9 | 45-99 | 0 | 0 | 0 | 0.236226 |
| 2 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States | 28-36 | 12-16 | 40-44 | 0 | 0 | 1 | 0.236226 |
| 3 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | United-States | 37-47 | 10-11 | 40-44 | 7298-7978 | 0 | 1 | 0.236226 |
| 4 | ? | Some-college | Never-married | ? | Own-child | White | Female | United-States | 17-27 | 10-11 | 1-39 | 0 | 0 | 0 | 0.236226 |


We use `Bernoulli` here. Since `binary` is the default mode, we don't need to pass `mode` explicitly.

In [5]:
scoring_function = Bernoulli(direction="positive")
scanner = MDSS(scoring_function)

subset,score = scanner.scan( adult[adult.columns[:-2]] , adult['observed'] , adult['expectation'] , penalty=20 , num_iters=1, use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Subset: \n{}".format(subset))
print("\nScore: {}".format(score))

Subset: 
{'marital_status': [' Married-civ-spouse'], 'age_bin': ['28-36', '37-47', '48-90'], 'education': [' Assoc-acdm', ' Bachelors', ' Doctorate', ' Masters', ' Prof-school', ' Some-college']}

Score: 1065.8355262262485


In [6]:
to_choose = adult[subset.keys()].isin(subset).all(axis=1)
temp_df = adult.loc[to_choose]
subset_size = len(temp_df)/len(adult) * 100
"Our detected sub-group has a size of {}, which is {}% of our data. We observe {} as the probability of earning >50k in this sub-group, but our population mean is {}"\
.format(len(temp_df), subset_size,np.round(temp_df['observed'].mean(),4), np.round(temp_df['expectation'].mean(),4))

'Our detected sub-group has a size of 3577, which is 21.970394938885818% of our data. We observe 0.6324 as the probability of earning >50k in this sub-group, but our population mean is 0.2362'
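
The row selection in the cell above relies on `DataFrame.isin` accepting a dict that maps column names to allowed values. A minimal sketch on made-up toy data (the column names and values here are illustrative, not taken from the Adult dataset):

```python
import pandas as pd

# Toy data; the values are made up for illustration.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age_bin": ["17-27", "28-36", "28-36", "37-47"],
    "observed": [0, 1, 1, 0],
})

# A subset in the same shape as the one printed above: feature -> allowed values.
subset = {"sex": ["Male"], "age_bin": ["28-36", "37-47"]}

# isin with a dict checks each column against its own list of values;
# all(axis=1) keeps only rows that satisfy every feature in the subset.
mask = df[list(subset)].isin(subset).all(axis=1)
selected = df.loc[mask]

print(selected.index.tolist())      # [2]
print(selected["observed"].mean())  # 1.0
```

Only row 2 matches both conditions, so the subgroup mean is computed from that single row.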

## Hospitalization Time - Poisson

_Outcome = No. of days spent in hospital_

This is an ordinal, multiclass classification example.    
Therefore `mode = "ordinal"`       
Scoring function is `Poisson`.


In [7]:
hosp = pd.read_csv('https://raw.githubusercontent.com/Adebayo-Oshingbesan/data/main/hospital.csv')

hosp = hosp[hosp['Length of Stay'] != '120 +'].fillna('Unknown')
hosp['Length of Stay'] = pd.to_numeric(hosp['Length of Stay'])
hosp['expectation'] = hosp['Length of Stay'].mean()
hosp.head()

| Unnamed: 0 | Health Service Area | Hospital County | Age Group | Zip Code - 3 digits | Gender | Race | Ethnicity | Type of Admission | Patient Disposition | APR MDC Code | ... | APR Risk of Mortality | APR Medical Surgical Description | Payment Typology 1 | Payment Typology 2 | Payment Typology 3 | Birth Weight | Abortion Edit Indicator | Emergency Department Indicator | Length of Stay | expectation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New York City | Kings | 70 or Older | 112 | M | Black/African American | Not Span/Hispanic | Emergency | Expired | 4 | ... | Extreme | Medical | Medicare | Medicare | Self-Pay | 0 | N | Y | 14 | 5.423149 |
| 1 | New York City | Queens | 0 to 17 | 113 | M | White | Spanish/Hispanic | Newborn | Home or Self Care | 15 | ... | Minor | Medical | Medicaid | Medicaid | Unknown | 3800 | N | N | 2 | 5.423149 |
| 2 | New York City | Kings | 70 or Older | 112 | M | Black/African American | Not Span/Hispanic | Emergency | Skilled Nursing Home | 4 | ... | Extreme | Medical | Medicare | Unknown | Unknown | 0 | N | Y | 13 | 5.423149 |
| 3 | New York City | Richmond | 50 to 69 | 103 | M | White | Not Span/Hispanic | Emergency | Skilled Nursing Home | 1 | ... | Minor | Medical | Medicare | Medicare | Unknown | 0 | N | Y | 3 | 5.423149 |
| 4 | Long Island | Nassau | 18 to 29 | 115 | F | White | Spanish/Hispanic | Elective | Home or Self Care | 14 | ... | Minor | Medical | Medicaid | Unknown | Unknown | 0 | N | N | 3 | 5.423149 |


In [8]:
scoring_function = Poisson(direction="positive")
scanner = MDSS(scoring_function)

subset,score = scanner.scan( hosp[hosp.columns[:-2]] , hosp['Length of Stay'] , hosp['expectation'] , penalty=10 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , mode="ordinal" , verbose=False , cpu=0)
print("Subset: \n{}".format(subset))
print("\nScore: {}".format(score))

Subset: 
{'APR Severity of Illness Description': ['Extreme']}

Score: 11220.538868561147


In [9]:
to_choose = hosp[subset.keys()].isin(subset).all(axis=1)
temp_df = hosp.loc[to_choose]
subset_size = len(temp_df)/len(hosp) * 100
"Our detected sub-group has a size of {}, which is {}% of our data. We observe {} as the average number of days spent in the hospital for this sub-group, but our population mean is {}"\
.format(len(temp_df), subset_size,np.round(temp_df['Length of Stay'].mean(),4), np.round(temp_df['expectation'].mean(),4))

'Our detected sub-group has a size of 1900, which is 6.337558372248166% of our data. We observe 15.2216 as the average number of days spent in the hospital for this sub-group, but our population mean is 5.4231'

## Insurance Costs - Gaussian

_Outcome = Insurance costs to be incurred_

Scoring function = `Gaussian`  
Mode = `continuous`

In [9]:
insurance = pd.read_csv('https://raw.githubusercontent.com/Adebayo-Oshingbesan/data/main/insurance.csv')

for col in ['bmi','age']:
        insurance[col] = pd.qcut(insurance[col], 10, duplicates='drop')
        insurance[col] = insurance[col].apply(lambda x: str(round(x.left, 2)) + ' - ' + str(round(x.right,2)))

insurance['expectation'] = insurance['charges'].mean()
insurance.head()

| Unnamed: 0 | age | sex | bmi | children | smoker | region | charges | expectation |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 - 19.0 | female | 27.36 - 28.8 | 0 | yes | southwest | 16884.924 | 13270.422265 |
| 1 | 18.0 - 19.0 | male | 33.66 - 35.86 | 1 | no | southeast | 1725.5523 | 13270.422265 |
| 2 | 24.0 - 29.0 | male | 32.03 - 33.66 | 3 | no | southeast | 4449.462 | 13270.422265 |
| 3 | 29.0 - 34.0 | male | 15.96 - 22.99 | 0 | no | northwest | 21984.47061 | 13270.422265 |
| 4 | 29.0 - 34.0 | male | 28.8 - 30.4 | 0 | no | northwest | 3866.8552 | 13270.422265 |
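
The binning step above turns continuous features into string-labelled quantile bins so they can be scanned as categorical features. A minimal sketch of the same idea on made-up data (the series below is illustrative, not the insurance dataset, and it uses 4 bins rather than 10 for brevity):

```python
import numpy as np
import pandas as pd

# Toy continuous feature: the integers 1..100 (made up for illustration).
values = pd.Series(np.arange(1, 101), name="bmi")

# Split into 4 quantile bins; duplicates="drop" merges bins whose edges coincide.
binned = pd.qcut(values, 4, duplicates="drop")

# Turn each Interval into a readable "left - right" string label.
labels = pd.Series(
    [f"{round(iv.left, 2)} - {round(iv.right, 2)}" for iv in binned],
    index=values.index,
    name="bmi",
)

print(labels.nunique())
print(labels.value_counts())
```

Because `qcut` splits on quantiles, each bin holds roughly the same number of rows.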


In [10]:
scoring_function = Gaussian(direction="positive", var=insurance["charges"].var())
scanner = MDSS(scoring_function)

subset,score = scanner.scan( insurance[insurance.columns[:-2]] , insurance['charges'] , insurance['expectation'] , penalty=10 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , mode="continuous" , verbose=False , cpu=0)
print("Subset: \n{}".format(subset))
print("\nScore: {}".format(score))

Subset: 
{'smoker': ['yes']}

Score: 90264.35743912018


In [11]:
to_choose = insurance[subset.keys()].isin(subset).all(axis=1)
temp_df = insurance.loc[to_choose]
subset_size = len(temp_df)/len(insurance) * 100
"Our detected sub-group has a size of {}, which is {}% of our data. We observe {} as the average insurance charge for this sub-group, but the population average is {}"\
.format(len(temp_df), subset_size,np.round(temp_df['charges'].mean(),4), np.round(temp_df['expectation'].mean(),4))

'Our detected sub-group has a size of 274, which is 20.47832585949178% of our data. We observe 32050.2318 as the average insurance charge for this sub-group, but the population average is 13270.4223'

## Temperature Dataset - BerkJones

_Outcome = Likelihood of experiencing negative temperatures_

Scoring Function = `BerkJones`

In [12]:
temperature = pd.read_csv('https://raw.githubusercontent.com/Adebayo-Oshingbesan/data/main/weatherHistory.csv')

for col in ['Humidity','WindSpeed','Visibility','Pressure']:
        temperature[col] = pd.qcut(temperature[col], 10, duplicates='drop')
        temperature[col] = temperature[col].apply(lambda x: str(round(x.left, 2)) + ' - ' + str(round(x.right,2)))

temperature['observed'] = (temperature['Temperature'] <= 0).astype(int)
temperature = temperature.drop(columns=['Temperature'])
temperature['expectation'] = temperature['observed'].mean()

temperature.head()

| Unnamed: 0 | Summary | PrecipType | Humidity | WindSpeed | Visibility | Pressure | DailySummary | observed | expectation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Partly Cloudy | rain | 0.87 - 0.92 | 13.14 - 15.47 | 15.15 - 15.83 | 1014.8 - 1016.45 | Partly cloudy throughout the day. | 0 | 0.111059 |
| 1 | Partly Cloudy | rain | 0.83 - 0.87 | 13.14 - 15.47 | 15.15 - 15.83 | 1014.8 - 1016.45 | Partly cloudy throughout the day. | 0 | 0.111059 |
| 2 | Mostly Cloudy | rain | 0.87 - 0.92 | 3.2 - 4.7 | 11.45 - 15.15 | 1014.8 - 1016.45 | Partly cloudy throughout the day. | 0 | 0.111059 |
| 3 | Partly Cloudy | rain | 0.78 - 0.83 | 13.14 - 15.47 | 15.15 - 15.83 | 1014.8 - 1016.45 | Partly cloudy throughout the day. | 0 | 0.111059 |
| 4 | Mostly Cloudy | rain | 0.78 - 0.83 | 9.97 - 11.21 | 15.15 - 15.83 | 1016.45 - 1018.17 | Partly cloudy throughout the day. | 0 | 0.111059 |


With `BerkJones`, we need to specify the `alpha` parameter, which we set to our expectation, i.e. the mean of `observed`.

In [13]:
alpha = temperature["observed"].mean()
scoring_function = BerkJones(direction="positive", alpha=alpha)
scanner = MDSS(scoring_function)

subset,score = scanner.scan( temperature[temperature.columns[:-2]] , temperature['observed'] , temperature['expectation'] , penalty=100 , num_iters=1 , use_not_direction=False , num_of_subsets=1 , verbose=False , cpu=0)
print("Subset: \n{}".format(subset))
print("\nScore: {}".format(score))

Subset: 
{'PrecipType': ['snow']}

Score: 23441.668505872967
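
The BerkJones setup above uses a constant expectation equal to the mean of the observed outcomes, and passes that same value as `alpha`. A minimal sketch of that setup on made-up data, with a hypothetical helper (`has_constant_expectation` is not part of `mdss`) to confirm the expectation really is constant:

```python
import numpy as np
import pandas as pd

# Toy binary outcomes (made up for illustration).
observed = pd.Series([0, 1, 0, 0, 1, 0, 0, 0])

# Constant expectation: every row gets the overall mean of the observed outcomes.
expectation = pd.Series(observed.mean(), index=observed.index)

# alpha is set to that same constant, as in the cell above.
alpha = observed.mean()


def has_constant_expectation(expectations: pd.Series) -> bool:
    """Hypothetical check: True when all expectations are numerically equal."""
    return bool(np.allclose(expectations, expectations.iloc[0]))


print(alpha)                                  # 0.25
print(has_constant_expectation(expectation))  # True
```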


In [14]:
to_choose = temperature[subset.keys()].isin(subset).all(axis=1)
temp_df = temperature.loc[to_choose]
subset_size = len(temp_df)/len(temperature) * 100
"Our detected sub-group has a size of {}, which is {}% of our data. We observe {} as the probability of getting negative temperatures in this sub-group, but our population mean is {}"\
.format(len(temp_df), subset_size,np.round(temp_df['observed'].mean(),4), np.round(temp_df['expectation'].mean(),4))

'Our detected sub-group has a size of 10712, which is 11.105927239173484% of our data. We observe 1.0 as the probability of getting negative temperatures in this sub-group, but our population mean is 0.1111'