# Identifying biased features
This demo shows how to identify features that help your models in a way that might be too good to be true. It is sometimes hard to understand what a model is really doing behind the scenes. That's where MLDB's [`classifier.explain`][1] comes to the rescue. In particular, it can help you discover that a model is cheating, in other words that it has learned to use information that won't be available when the model is applied in real life.

To illustrate this, we are going to train a model on data in which we know one feature is biased. You can [find the details here][2]. The task is to predict whether a client will subscribe to a term deposit after receiving a call from the bank, given some information about the client (the employee calling, socioeconomic conditions at the time, etc.).

[1]: ../../../../doc/#builtin/functions/ClassifierExplain.md.html
[2]: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

In [1]:
import pymldb
mldb = pymldb.Connection()

Let's start by importing the data, which we have copied to our servers.

In [2]:
print mldb.put('/v1/procedures/_', {
    'type': 'import.text',
    'params': {
        'dataFileUrl':
            'archive+http://public.mldb.ai/datasets/bank-additional.zip#bank-additional/bank-additional-full.csv',
        'outputDataset': 'bank_raw',
        'delimiter': ';'
        }
    })

<Response [201]>


Here is a sneak peek at the data.

In [3]:
mldb.query("""
SELECT *
FROM bank_raw
LIMIT 10
""")

_rowName,age,campaign,"cons.conf.idx","cons.price.idx",contact,day_of_week,default,duration,education,"emp.var.rate",...,housing,job,loan,marital,month,"nr.employed",pdays,poutcome,previous,y
2,56,1,-36.4,93.994,telephone,mon,no,261,basic.4y,1.1,...,no,housemaid,no,married,may,5191,999,nonexistent,0,no
3,57,1,-36.4,93.994,telephone,mon,unknown,149,high.school,1.1,...,no,services,no,married,may,5191,999,nonexistent,0,no
4,37,1,-36.4,93.994,telephone,mon,no,226,high.school,1.1,...,yes,services,no,married,may,5191,999,nonexistent,0,no
5,40,1,-36.4,93.994,telephone,mon,no,151,basic.6y,1.1,...,no,admin.,no,married,may,5191,999,nonexistent,0,no
6,56,1,-36.4,93.994,telephone,mon,no,307,high.school,1.1,...,no,services,yes,married,may,5191,999,nonexistent,0,no
7,45,1,-36.4,93.994,telephone,mon,unknown,198,basic.9y,1.1,...,no,services,no,married,may,5191,999,nonexistent,0,no
8,59,1,-36.4,93.994,telephone,mon,no,139,professional.course,1.1,...,no,admin.,no,married,may,5191,999,nonexistent,0,no
9,41,1,-36.4,93.994,telephone,mon,unknown,217,unknown,1.1,...,no,blue-collar,no,married,may,5191,999,nonexistent,0,no
10,24,1,-36.4,93.994,telephone,mon,no,380,professional.course,1.1,...,yes,technician,no,single,may,5191,999,nonexistent,0,no
11,25,1,-36.4,93.994,telephone,mon,no,50,high.school,1.1,...,yes,services,no,single,may,5191,999,nonexistent,0,no


We can train a model on a random selection of 75% of the data, keeping the other 25% for testing.

In [4]:
print mldb.put('/v1/procedures/_', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
            SELECT features: {* EXCLUDING (y)}, label: y = 'yes'
            FROM bank_raw
            WHERE rowHash() % 4 != 0
            """,
        'modelFileUrl': 'file://bank_model.cls',
        'algorithm': 'bbdt',
        'functionName': 'score',
        'mode': 'boolean'
        }
    })

<Response [201]>
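The `rowHash() % 4 != 0` filter used for training gives a deterministic split: roughly three rows in four land in the training set, and a given row always lands on the same side. As a rough illustration of the idea outside MLDB, here is a minimal Python sketch. It uses MD5 as a stand-in hash, which is an assumption made for illustration only and not the hash MLDB uses, and `is_test_row` is a hypothetical helper.

```python
import hashlib

def is_test_row(row_name, buckets=4):
    """Return True when the row falls in the held-out bucket.

    MD5 is only a stand-in for MLDB's rowHash(); any stable hash works.
    """
    digest = hashlib.md5(str(row_name).encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets == 0

# Hypothetical row names, numbered the way the imported rows are.
row_names = list(range(2, 10002))
test_rows = [r for r in row_names if is_test_row(r)]
train_rows = [r for r in row_names if not is_test_row(r)]

# Fraction of rows held out for testing; expected to be close to 1/4.
test_fraction = len(test_rows) / float(len(row_names))
```

Because the split depends only on the row name, rerunning the notebook selects the same training and testing rows each time.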


Training creates a "score" function that we can use on examples from our test set. The higher the score, the more likely the client is to subscribe.

In [5]:
mldb.query("""
SELECT score({features: {* EXCLUDING (y)}}) AS *
FROM bank_raw
WHERE rowHash() % 4 = 0
LIMIT 10
""")

_rowName,score
15,-2.730311
22,-7.071012
25,-3.583807
33,-2.070334
39,0.421762
40,-0.734294
47,-7.231209
62,-1.650849
63,-0.009851
65,-2.706054


Now let's test that model on the 25% we didn't train on and get a feel for how well it should perform in real life.

In [6]:
mldb.put('/v1/procedures/_', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            SELECT score: score({features: {* EXCLUDING (y)}})[score], label: y = 'yes'
            FROM bank_raw
            WHERE rowHash() % 4 = 0
            """,
        'outputDataset': 'bank_test',
        'mode': 'boolean'
        }
    })

As we can see by inspecting the statistics returned by the `classifier.test` procedure, the model seems to be doing pretty well! The AUC is 0.95: let's ship this thing to production right now! ... Or let's be cautious!
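For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition. It is an illustration only: it is not necessarily how `classifier.test` computes the value, `auc` is a hypothetical helper, and the scores and labels are made-up toy values.

```python
def auc(scores, labels):
    """Pairwise AUC: chance that a positive outranks a negative (ties count half)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (made up): both positives outrank every negative.
toy_scores = [0.42, -0.73, -7.07, -2.07, -0.01]
toy_labels = [True, False, False, False, True]
toy_auc = auc(toy_scores, toy_labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is why a value as high as 0.95 on a hard real-world task deserves a second look.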

To understand what's going on, let's use the [`classifier.explain` function][1]. This will give us an idea of how much each feature helps (or hurts) in making the predictions.

[1]: ../../../../doc/#builtin/functions/ClassifierExplain.md.html

In [7]:
print mldb.put('/v1/functions/explain', {
    'type': 'classifier.explain',
    'params': {
        'modelFileUrl': 'file://bank_model.cls'
        }
    })

<Response [201]>


You can "explain" every single example, and see how much each feature influences the final score, like this:

In [8]:
mldb.query("""
SELECT explain({features: {* EXCLUDING (y)}, label: y = 'yes'}) AS *
FROM bank_raw
WHERE rowHash() % 4 = 0
LIMIT 10
""")

_rowName,bias,explanation."cons.conf.idx",explanation."cons.price.idx",explanation."emp.var.rate",explanation."nr.employed",explanation.age,explanation.campaign,explanation.contact,explanation.day_of_week,explanation.default,...,explanation.education,explanation.euribor3m,explanation.job,explanation.loan,explanation.marital,explanation.month,explanation.pdays,explanation.poutcome,explanation.previous,explanation.housing
15,-0.187552,0.005895,0.157162,0.709004,0.357669,0.017939,0.028919,0.083293,0.025961,-0.006558,...,0.054602,0.25969,0.046422,-0.005966,0.077895,0.30854,0.208543,0.031093,-0.010907,
22,-0.187552,0.057059,0.284745,0.854344,0.212641,-0.021539,0.008846,0.133029,0.042484,-7e-05,...,-0.053114,0.306672,0.059294,-0.005966,-0.007937,0.603028,0.208543,0.038117,0.007541,-0.061839
25,-0.187552,0.051206,0.147233,0.96603,0.255613,0.136431,0.008406,0.083293,0.030877,-0.006558,...,-0.037976,0.363411,-0.020222,-0.005966,-0.007937,0.415444,0.208543,0.038117,-0.010907,
33,-0.187552,0.061675,0.113903,0.689183,0.348905,-0.154653,0.028919,0.083293,0.030877,-7e-05,...,0.114116,0.231433,-0.023728,-0.005966,0.013465,0.303361,0.208543,0.031093,-0.006298,
39,-0.187552,0.020152,0.093249,0.580062,0.199933,-0.058259,0.028919,0.10245,0.030877,-0.006558,...,0.027709,0.118192,-0.009669,-0.021155,-0.007937,0.180797,0.208543,0.031093,-0.009972,
40,-0.187552,0.023699,0.099499,0.602229,0.27855,0.08883,0.030372,0.083293,0.030877,-0.006558,...,-0.045472,0.134022,0.063472,-0.005966,-0.007937,0.253302,0.208543,0.035534,-0.010907,
47,-0.187552,0.032773,0.261927,0.771801,0.212641,0.128008,0.008846,0.083293,0.030877,0.077405,...,0.013701,0.311449,-0.008973,-0.005966,-0.007937,0.60362,0.208543,0.038117,-0.009972,-0.061839
62,-0.187552,0.023699,0.124415,0.626461,0.293483,0.08883,0.030372,0.083293,0.030877,0.077405,...,-0.045472,0.120093,-0.024413,-0.005966,-0.007937,0.293401,0.208543,0.035534,-0.010907,
63,-0.187552,-0.02196,0.097264,0.580062,0.229247,-0.058259,0.02936,0.126855,0.011171,-0.006558,...,0.029197,0.133764,-0.047023,-0.005966,-0.004715,0.1689,0.208543,0.031093,-0.010907,
65,-0.187552,-0.009339,0.124415,0.888562,0.357669,0.090653,0.02936,0.083293,0.011171,0.077405,...,0.051012,0.176531,0.031853,-0.005966,-0.004715,0.25681,0.208543,0.031093,-0.010907,


Or you can average the explanations over all the examples. Here we then transpose the result and sort it by absolute value.

In [9]:
mldb.query("""
SELECT *
FROM transpose((
    SELECT avg({explain({features: {* EXCLUDING (y)}, label: y='yes'})[explanation] as *}) AS *
    NAMED 'explanation'
    FROM bank_raw
    WHERE rowHash() % 4 = 0
))
ORDER BY abs(explanation) DESC
""")

_rowName,explanation
duration,1.29713
"emp.var.rate",0.526305
"nr.employed",0.303379
pdays,0.170738
euribor3m,0.139221
month,0.135022
"cons.price.idx",0.056744
poutcome,0.043148
age,0.035058
"cons.conf.idx",0.020872


What is striking here is that one feature really stands out: duration. This is the actual duration of the call. Clearly, that information would not be available in a real-life setting: you can't know the duration of a call before it's over, and when it's over you already know whether the client has subscribed. If you look at the [detailed description of the data][1], you can in fact see a warning saying that using that piece of information is probably a bad idea for any realistic modeling.
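The check performed by eye above can be approximated in a few lines of Python. The sketch below copies the averaged contributions listed in the table above and flags the top feature when its absolute contribution is at least twice that of the runner-up. The helper `dominant_feature` and the 2x threshold are hypothetical choices made for illustration, not part of MLDB, and a flagged feature still needs a human to decide whether it is really a leak.

```python
# Averaged explanation values copied from the table above.
avg_explanation = {
    "duration": 1.29713,
    "emp.var.rate": 0.526305,
    "nr.employed": 0.303379,
    "pdays": 0.170738,
    "euribor3m": 0.139221,
    "month": 0.135022,
    "cons.price.idx": 0.056744,
    "poutcome": 0.043148,
    "age": 0.035058,
    "cons.conf.idx": 0.020872,
}

def dominant_feature(explanation, ratio=2.0):
    """Return the top feature if it outweighs the runner-up by `ratio`, else None.

    The ratio is an arbitrary threshold chosen for this sketch.
    """
    ranked = sorted(explanation.items(), key=lambda kv: abs(kv[1]), reverse=True)
    if len(ranked) < 2:
        return None
    (top_name, top_value), (_, second_value) = ranked[0], ranked[1]
    if abs(top_value) >= ratio * abs(second_value):
        return top_name
    return None

suspect = dominant_feature(avg_explanation)
```

With these numbers the function returns `duration`, whose average contribution is roughly 2.5 times that of the next feature.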

Now that we have identified the cause of those suspiciously good results, let's train and test again, this time ignoring that feature.

[1]: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

In [10]:
print mldb.put('/v1/procedures/_', {
    'type': 'classifier.train',
    'params': {
        'trainingData': """
            SELECT features: {* EXCLUDING (y, duration)}, label: y = 'yes'
            FROM bank_raw
            WHERE rowHash() % 4 != 0
            """,
        'modelFileUrl': 'file://bank_model.cls',
        'algorithm': 'bbdt',
        'functionName': 'score',
        'mode': 'boolean'
        }
    })

<Response [201]>


In [11]:
mldb.put('/v1/procedures/_', {
    'type': 'classifier.test',
    'params': {
        'testingData': """
            SELECT score: score({features: {* EXCLUDING (y)}})[score], label: y = 'yes'
            FROM bank_raw
            WHERE rowHash() % 4 = 0
            """,
        'outputDataset': 'bank_test',
        'mode': 'boolean'
        }
    })

Now an AUC of 0.79 sounds more reasonable!

If we run the explanation again, the highest ranking features seem more legitimate.

In [12]:
print mldb.put('/v1/functions/explain', {
    'type': 'classifier.explain',
    'params': {
        'modelFileUrl': 'file://bank_model.cls'
        }
    })

<Response [201]>


In [13]:
mldb.query("""
SELECT *
FROM transpose((
    SELECT avg({explain({features: {* EXCLUDING (y)}, label: y='yes'})[explanation] as *}) AS *
    NAMED 'explanation'
    FROM bank_raw
    WHERE rowHash() % 4 = 0
))
ORDER BY abs(explanation) DESC
""")

_rowName,explanation
"nr.employed",0.286326
"emp.var.rate",0.245805
pdays,0.169785
contact,0.058398
poutcome,0.045957
euribor3m,0.03737
campaign,0.019305
month,0.014595
age,0.01061
job,0.009837


## Where to next?

Check out the other [Tutorials and Demos](../../../../doc/#builtin/Demos.md.html).