# How to use the mb_modelbase package on real data

Welcome to a short introductory example of the mb_modelbase package.
Usually modelbase is used as a web-service backend for the graphical user interface Lumen.
However, it can also be used without the frontend as standalone software. This is what we do here.

## The Allbus2016 data set
TODO: check if we can use this data. If so: add it to the repo. If not: change to MPG data set.

For the introduction we use the German General Social Survey [ALLBUS](https://www.gesis.org/en/allbus/allbus-home/). It is published by the Leibniz Institute for the Social Sciences (GESIS) and contains many kinds of variables describing individual respondents, such as sex, age, income, place of residence, and political attitude. We learn models on a small subset and run some operations with mb_modelbase to get a brief overview of its functions and how to use them.

Please note that the ALLBUS data is only released for academic research and teaching, see their website for more information.

Let us import the data set and take a first look:

In [1]:
import pandas as pd
from mb.data import mpg

dataset = mpg.mixed()
dataset.head()

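|   | age | sex | educ | income | eastwest | lived_abroad | spectrum |
|---|-----|-----|------|--------|----------|--------------|----------|
| 0 | 47 | Female | 3 | 1800 | East | No | 1 |
| 1 | 52 | Male | 3 | 2000 | East | No | 5 |
| 2 | 61 | Male | 2 | 2500 | West | No | 6 |
| 3 | 54 | Female | 2 | 860 | West | Yes | 1 |
| 4 | 49 | Male | 3 | 2500 | West | No | 6 |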


todo: update text.
We see a data set with 7 variables, 4 continuous and 3 categorical.
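The split into continuous and categorical variables can be checked with plain pandas. The sketch below retypes the five rows shown above by hand (it does not load the real data set) and assumes that numeric columns count as continuous and string columns as categorical; how mb_modelbase itself decides this is not shown here.

```python
import pandas as pd

# The five rows from the table above, typed in by hand for illustration.
sample = pd.DataFrame({
    "age": [47, 52, 61, 54, 49],
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "educ": [3, 3, 2, 2, 3],
    "income": [1800, 2000, 2500, 860, 2500],
    "eastwest": ["East", "East", "West", "West", "West"],
    "lived_abroad": ["No", "No", "No", "Yes", "No"],
    "spectrum": [1, 5, 6, 1, 6],
})

# Assumption: numeric dtype means continuous, everything else categorical.
continuous = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print(len(continuous), "continuous:", continuous)
print(len(categorical), "categorical:", categorical)
```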

## Fitting a model and using some basic functions
The first step is to import the mb_modelbase package:

In [2]:
import mb.modelbase as mbase

Now we can create a model object and fit it to the data. Of course, without visual output it is hard to judge from here whether the chosen model class makes sense.

In [3]:
# Create an object of the model class Conditional Gaussians (with categorical and continuous variables)
mymod = mbase.MixableCondGaussianModel("Allbus_CondGauss")
# Fit the model to the data
mymod.fit(df=dataset)

<mb_modelbase.models_core.mixable_cond_gaussian.MixableCondGaussianModel at 0x7faa9c8852e8>

After the model is fitted to the data, we can execute some functions on the model:

In [4]:
mymod.aggregate("maximum")

['Female',
 'West',
 'No',
 51.37181215918501,
 3.380615927876127,
 1326.0000632802612,
 4.284749457039735]

An aggregation can be executed on the fitted model, on the data, or on both, in which case we get two different maxima. To get only the predicted maximum, we set the `mode` parameter of the model to 'model':

In [5]:
mymod.mode = 'model' # or 'data', 'both'
argmax = mymod.aggregate("maximum")
argmax

['Female',
 'West',
 'No',
 51.37181215918501,
 3.380615927876127,
 1326.0000632802612,
 4.284749457039735]

To better understand what it means to execute a query against the model or against the data, we calculate the density of the distribution at a specific point, for example the argmax:

In [6]:
mymod.density(argmax)

0.012709581532377863

What happens if we change the mode of the model?

In [7]:
mymod.mode = 'data'
mymod.density(argmax)

0.012709581532377863

Why do we get 0 as an answer? The density query against the data corresponds to the number of observations of the given point, and the argmax was obviously never observed (who would state their age as 54.593....?). So let us ask for the density of a point that we know exists, for example the first row of the table above:

In [8]:
firstrow = ['Female', 'East', 'No', 47, 3, 1800, 1]
mymod.density(firstrow)

0.0014730501919687049

This means that the data point appears only once in the whole data set. We can also ask the model for the density at this point:

In [9]:
mymod.mode = 'model'
mymod.density(firstrow)

0.0014730501919687049
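The idea behind a density query in 'data' mode, as described above, can be pictured as counting matching rows. The following is only a sketch of that idea on the five hand-typed sample rows; `count_matches` is a hypothetical helper and not part of mb_modelbase, and the library may normalize or compute the value differently.

```python
import pandas as pd

# The five rows from the table at the top, typed in by hand for illustration.
sample = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "eastwest": ["East", "East", "West", "West", "West"],
    "lived_abroad": ["No", "No", "No", "Yes", "No"],
    "age": [47, 52, 61, 54, 49],
    "educ": [3, 3, 2, 2, 3],
    "income": [1800, 2000, 2500, 860, 2500],
    "spectrum": [1, 5, 6, 1, 6],
})

def count_matches(df, point):
    """Hypothetical helper: number of rows equal to `point` in every column."""
    mask = pd.Series(True, index=df.index)
    for column, value in point.items():
        mask &= df[column] == value
    return int(mask.sum())

first_row = {"sex": "Female", "eastwest": "East", "lived_abroad": "No",
             "age": 47, "educ": 3, "income": 1800, "spectrum": 1}
# Same point, but with a non-integer age that nobody reported.
unobserved = dict(first_row, age=51.37)

print(count_matches(sample, first_row))
print(count_matches(sample, unobserved))
```

A fitted model, in contrast, assigns a positive density to points that were never observed, which is why the two modes can answer the same query differently.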

To keep our fitted model, we can save it and load it again later:

In [10]:
mymod.save(model=mymod, filename="models/Allbus_CondGauss.mdl")
loadmod = mbase.Model.load("models/Allbus_CondGauss.mdl")
loadmod.names

['sex', 'eastwest', 'lived_abroad', 'age', 'educ', 'income', 'spectrum']

For this purpose we can also create a modelbase. This is an abstract object that keeps all the models we have learned. It is important to name a directory where the models are stored and loaded from:

In [11]:
# Create the modelbase
mymodelbase = mbase.ModelBase("My Modelbase", load_all=False, model_dir="models")
# Add our fitted model mymod to the modelbase mymodelbase
mymodelbase.add(mymod)
# Save it (note the parentheses: without them the method is only referenced, not called)
mymodelbase.save_all_models()

## Marginalization and Conditionalization (basic)

Right now our model is too complex to yield useful information directly. Therefore we marginalize out some dimensions and condition other dimensions on specific values in order to concentrate on the information we are actually interested in.

### 1. Marginalization

First we show how to remove some dimensions to reduce the model to fewer variables:

In [12]:
# 'keep' lists the dimensions you want to keep
mymod_marg = mymod.copy().marginalize(keep=['income', 'sex', 'eastwest'])
# 'remove' lists the dimensions you want to remove
mymod_marg = mymod.copy().marginalize(remove=['lived_abroad', 'spectrum', 'educ', 'age'])
mymod_marg.names

['sex', 'eastwest', 'income']

So we are left with exactly the 3 dimensions we asked for. Now we can use our basic functions:

In [13]:
argmax = mymod_marg.aggregate("maximum")
argmax

['Female', 'West', 1352.9024977961801]

In [14]:
mymod_marg.density(["Male", "West", 1800])

0.10433855223064581

So we have reduced our model to 3 dimensions without any new fitting process!
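Why no refitting is needed can be seen in the simplest case of a purely continuous multivariate Gaussian: marginalizing just selects the matching entries of the mean vector and covariance matrix. The sketch below uses NumPy with made-up parameters (they are not taken from the fitted ALLBUS model), and `marginalize_gaussian` is a hypothetical helper, not the mb_modelbase implementation, which also has to handle the categorical dimensions.

```python
import numpy as np

# Made-up parameters of a 3-dimensional Gaussian, for illustration only.
names = ["age", "educ", "income"]
mean = np.array([50.0, 3.0, 1500.0])
cov = np.array([
    [100.0,  -1.0,  500.0],
    [ -1.0,   1.0,  200.0],
    [500.0, 200.0, 1.0e6],
])

def marginalize_gaussian(names, mean, cov, keep):
    """Hypothetical helper: marginal Gaussian over the dimensions in `keep`."""
    idx = [names.index(name) for name in keep]
    # The marginal keeps the selected means and the matching covariance sub-block.
    return list(keep), mean[idx], cov[np.ix_(idx, idx)]

kept, marg_mean, marg_cov = marginalize_gaussian(names, mean, cov, keep=["age", "income"])
print(kept)
print(marg_mean)
print(marg_cov)
```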

### 2. Conditioning

Now we want to condition a variable on specific values and compare the resulting models. In this example we compare the income of men and women:

In [15]:
# How to condition a variable on a specific value:
# 1. Condition the variable
mymod_cond_mann = mymod.copy().condition([mbase.Condition("sex", "==", "Male")])
mymod_cond_frau = mymod.copy().condition([mbase.Condition("sex", "==", "Female")])
# 2. Marginalize out the dimension
mymod_cond_mann.marginalize(remove=['sex'])
mymod_cond_frau.marginalize(remove=['sex'])

# Alternative: steps 1 and 2 chained in one line
mymod_cond_mann = mymod.copy().condition([mbase.Condition("sex", "==", "Male")]).marginalize(remove=['sex'])
mymod_cond_frau = mymod.copy().condition([mbase.Condition("sex", "==", "Female")]).marginalize(remove=['sex'])

After conditioning the variable 'sex' on 'Male' in one model and on 'Female' in the other, we marginalize out the remaining variables to drop unnecessary information and then compute the aggregation:

In [16]:
mymod_cond_mann.marginalize(keep=['income'])
mymod_cond_frau.marginalize(keep=['income'])
[mymod_cond_mann.aggregate("maximum"), mymod_cond_frau.aggregate("maximum")]

[[1917.6936449926814], [1290.4953763614485]]

We can see a large difference in income between men and women: the maximum of the conditioned model lies at about 1918 for men and at about 1290 for women.
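On the data side, the same two steps (condition, then marginalize) amount to filtering rows and selecting a column. The sketch below does this with pandas on the five hand-typed sample rows from the top of this page. It only illustrates the idea: the numbers come from five rows, use the group mean rather than the model's maximum, and are not comparable to the model output above.

```python
import pandas as pd

# 'sex' and 'income' of the five sample rows, typed in by hand for illustration.
sample = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "income": [1800, 2000, 2500, 860, 2500],
})

# Step 1 (condition): keep only the rows with the given value of 'sex'.
# Step 2 (marginalize): keep only the 'income' column.
male_income = sample.loc[sample["sex"] == "Male", "income"]
female_income = sample.loc[sample["sex"] == "Female", "income"]

print(male_income.mean(), female_income.mean())
```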