## Loading data

The dataset used for this example comes from the `seaborn` package. It is the well-known Titanic passenger dataset. The target variable of interest is whether a passenger survived the disaster.

In [1]:
from ivpy import discretize
import seaborn as sns
import pandas as pd

d = sns.load_dataset('titanic')

### Basic Example

Using defaults, the `discretize` function returns a dictionary. Its `breaks` element contains a list of floats that correspond to optimal split points for discretizing the `x` variable. This list of breaks can be passed to the pandas `cut` function to discretize the array into a set of mutually exclusive intervals:

In [2]:
res = discretize(d['fare'], d['survived'])
print(res['breaks'])

[-inf, 7.2271, 7.731249999999999, 7.88335, 9.2875, 10.48125, 39.5, 50.9875, 74.375, 211.41875, inf]
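
To make the interval semantics concrete, here is a minimal standard-library sketch of what `pd.cut` does by default with such a list of breaks: each value is placed in the right-closed interval `(lo, hi]` that contains it. The helper name `assign_bin` is hypothetical and is not part of `ivpy` or pandas; it assumes finite, non-missing input values.

```python
from bisect import bisect_left
import math

# Breaks copied from the output above.
breaks = [-math.inf, 7.2271, 7.731249999999999, 7.88335, 9.2875, 10.48125,
          39.5, 50.9875, 74.375, 211.41875, math.inf]

def assign_bin(value, breaks):
    """Return the (lo, hi] pair from `breaks` that contains `value`.

    Hypothetical helper for illustration; assumes `value` is finite.
    """
    # bisect_left returns the first index whose break is >= value, so a value
    # that sits exactly on a break lands in the interval that ends there.
    i = bisect_left(breaks, value)
    return breaks[i - 1], breaks[i]

for fare in (7.25, 39.5, 512.3292):
    print(fare, assign_bin(fare, breaks))
```

Because the outermost breaks are `-inf` and `inf`, every finite fare falls into exactly one interval, which is what makes the bins mutually exclusive and exhaustive.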


Summarizing the target variable, `survived`, by the discretized array yields the following summary statistics:

In [3]:
x = pd.cut(d['fare'], res['breaks'])
d['survived'].groupby(x).agg(N='size', Sum='sum', Mean='mean')


| fare | N | Sum | Mean |
|---|---|---|---|
| (-inf, 7.227] | 55 | 6 | 0.109091 |
| (7.227, 7.731] | 44 | 8 | 0.181818 |
| (7.731, 7.883] | 85 | 29 | 0.341176 |
| (7.883, 9.288] | 132 | 19 | 0.143939 |
| (9.288, 10.481] | 23 | 5 | 0.217391 |
| (10.481, 39.5] | 368 | 161 | 0.4375 |
| (39.5, 50.988] | 25 | 5 | 0.2 |
| (50.988, 74.375] | 62 | 35 | 0.564516 |
| (74.375, 211.419] | 80 | 63 | 0.7875 |
| (211.419, inf] | 17 | 11 | 0.647059 |


### Monotonicity

Often in scorecard modeling, monotonicity is a desired characteristic of a discretized variable. `ivpy` supports four levels of monotonicity:

- `-1` : `y` decreases as `x` increases
- `0` : no monotonic relationship
- `1` : `y` increases as `x` increases
- `2` : `y` either increases or decreases as `x` increases

Passing `mono=1` to the function, as in the call below, yields bins in which `y` increases as `x` increases.

In [4]:
res = discretize(d['fare'], d['survived'], mono=1)
x = pd.cut(d['fare'], res['breaks'])
d['survived'].groupby(x).agg(N='size', Sum='sum', Mean='mean')

| fare | N | Sum | Mean |
|---|---|---|---|
| (-inf, 7.227] | 55 | 6 | 0.109091 |
| (7.227, 7.404] | 29 | 5 | 0.172414 |
| (7.404, 10.481] | 255 | 56 | 0.219608 |
| (10.481, 15.173] | 122 | 47 | 0.385246 |
| (15.173, 15.646] | 14 | 6 | 0.428571 |
| (15.646, 50.988] | 257 | 113 | 0.439689 |
| (50.988, 52.277] | 10 | 5 | 0.5 |
| (52.277, 74.375] | 52 | 30 | 0.576923 |
| (74.375, 79.825] | 21 | 15 | 0.714286 |
| (79.825, inf] | 76 | 59 | 0.776316 |
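
The effect of the constraint can be checked directly from the `Mean` columns of the two tables above. The sketch below is only an illustration: `direction` is a hypothetical helper, not an `ivpy` function, and it borrows the `-1` / `0` / `1` codes from the list above purely as convenient labels.

```python
def direction(means):
    """Classify a sequence of bin means.

    Returns 1 if the means never decrease, -1 if they never increase,
    and 0 otherwise. (A constant sequence is reported as 1.)
    """
    pairs = list(zip(means, means[1:]))
    if all(a <= b for a, b in pairs):
        return 1
    if all(a >= b for a, b in pairs):
        return -1
    return 0

# Mean column from the default (unconstrained) result.
default_means = [0.109091, 0.181818, 0.341176, 0.143939, 0.217391,
                 0.4375, 0.2, 0.564516, 0.7875, 0.647059]
# Mean column from the mono=1 result.
mono_means = [0.109091, 0.172414, 0.219608, 0.385246, 0.428571,
              0.439689, 0.5, 0.576923, 0.714286, 0.776316]

print(direction(default_means))  # 0: the means rise and fall
print(direction(mono_means))     # 1: the means only rise
```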


### Controlling Structure

Other arguments control the structure of the result: `maxbin` caps the number of bins returned, and `minres` constrains how many observations fall within each bin:

In [5]:
res = discretize(d['fare'], d['survived'], mono=1, maxbin=5, minres=25)
x = pd.cut(d['fare'], res['breaks'])
d['survived'].groupby(x).agg(N='size', Sum='sum', Mean='mean')

| fare | N | Sum | Mean |
|---|---|---|---|
| (-inf, 10.481] | 339 | 67 | 0.19764 |
| (10.481, 15.173] | 122 | 47 | 0.385246 |
| (15.173, 50.988] | 271 | 119 | 0.439114 |
| (50.988, 74.375] | 62 | 35 | 0.564516 |
| (74.375, inf] | 97 | 74 | 0.762887 |
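
As a quick sanity check, the table above can be compared with the arguments that produced it. The text here does not spell out exactly which per-bin quantity `minres` bounds, so this sketch makes no assumption and simply reports both the smallest bin size (`N`) and the smallest number of survivors (`Sum`); the variable names are illustrative only.

```python
# (interval, N, Sum) rows copied from the table above.
rows = [
    ("(-inf, 10.481]", 339, 67),
    ("(10.481, 15.173]", 122, 47),
    ("(15.173, 50.988]", 271, 119),
    ("(50.988, 74.375]", 62, 35),
    ("(74.375, inf]", 97, 74),
]

n_bins = len(rows)                      # compare with maxbin=5
min_n = min(n for _, n, _ in rows)      # smallest bin size
min_sum = min(s for _, _, s in rows)    # fewest survivors in any bin
total_n = sum(n for _, n, _ in rows)    # every passenger is in exactly one bin

print(n_bins, min_n, min_sum, total_n)
```

The result has exactly five bins, and both per-bin minimums are above 25, so the output is consistent with the call whichever of the two quantities `minres` refers to.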
