# How to use the mb_modelbase package on real data

Welcome to a short introductory example of the mb_modelbase package. mb_modelbase usually serves as the backend of the Moo-Software, but it can also be used standalone, without the frontend.

## The Allbus2016 data set

For this introduction we use the German General Social Survey [ALLBUS](https://www.gesis.org/en/allbus/allbus-home/). It is provided by the Leibniz Institute for the Social Sciences (GESIS) and contains many kinds of variables describing individual respondents, such as sex, age, income, place of residence, and political attitude. We will learn models on a small subset and run some operations with mb_modelbase to get a brief overview of its functions and how to use them.

Let us import the data set and take a first look:

In [20]:
import pandas as pd
dataset = pd.read_csv('allbus2016.csv', index_col=0)
dataset.head()

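|   | age | sex | educ | income | eastwest | happiness | health | lived_abroad | spectrum |
|---|-----|-----|------|--------|----------|-----------|--------|--------------|----------|
| 0 | 47 | Female | 3 | 1800 | East | 9 | 3 | No | Left |
| 1 | 52 | Male | 3 | 2000 | East | 8 | 3 | No | Center-right |
| 2 | 61 | Male | 2 | 2500 | West | 9 | 4 | No | Center-right |
| 3 | 54 | Female | 2 | 860 | West | 3 | 2 | Yes | Left |
| 4 | 49 | Male | 3 | 2500 | West | 7 | 5 | No | Center-right |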


We see a data set with 9 variables, 5 continuous and 4 categorical.
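
One way to check this split is to test whether every value in a column parses as a number. The following is a minimal sketch that uses only the standard library and the five rows shown above; the helper `is_number` and the variable names are illustrative and not part of mb_modelbase.

```python
import csv
import io

# The five rows printed by dataset.head() above, without the index column.
sample = """age,sex,educ,income,eastwest,happiness,health,lived_abroad,spectrum
47,Female,3,1800,East,9,3,No,Left
52,Male,3,2000,East,8,3,No,Center-right
61,Male,2,2500,West,9,4,No,Center-right
54,Female,2,860,West,3,2,Yes,Left
49,Male,3,2500,West,7,5,No,Center-right
"""

def is_number(text):
    """Return True if text can be parsed as a float (illustrative helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

rows = list(csv.DictReader(io.StringIO(sample)))
columns = list(rows[0].keys())

# A column counts as continuous here if all of its values are numeric.
continuous = [c for c in columns if all(is_number(r[c]) for r in rows)]
categorical = [c for c in columns if c not in continuous]

print(continuous)
print(categorical)
```

Numeric columns such as `educ`, `happiness` and `health` look like scores on a fixed scale; counting them as continuous follows the split stated above.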

## Fitting a model and using some basic functions
The first step is to import the mb_modelbase package:

In [21]:
import mb_modelbase as mbase

Now we can create a model object and fit it to the data. Of course, without visual output it is hard to judge from here whether the chosen model class makes sense.

In [74]:
#Create an object of the model class Conditional Gaussians (with categorical and continuous variables)
mymod = mbase.MixableCondGaussianModel("Allbus_CondGauss")
#Fitting process
mymod.fit(df=dataset)
mymod.aggregate()

** Mean parameters (direct) **
p [[[[ 0.02696729  0.04553492  0.0464191   0.02122016]
   [ 0.00442087  0.00486295  0.00839965  0.00088417]]

  [[ 0.07073386  0.1061008   0.06189213  0.01635721]
   [ 0.01282051  0.02431477  0.02740937  0.00265252]]]


 [[[ 0.0331565   0.04995579  0.0397878   0.03669319]
   [ 0.00530504  0.0066313   0.00839965  0.00397878]]

  [[ 0.06233422  0.11229001  0.04774536  0.03271441]
   [ 0.01768347  0.02873563  0.02431477  0.00928382]]]]
mu [[[[[  4.73267587e-01  -4.08958076e-01  -4.39004154e-01  -9.13167593e-02
      -1.09583756e-01]
    [  2.32078066e-01  -6.83576478e-02  -3.54460872e-01   1.03842158e-01
      -1.39625065e-01]
    [  1.05495343e-01   8.68477206e-02  -4.04647106e-01  -3.02295423e-01
      -1.86985197e-01]
    [ -1.10726875e-01  -3.04921039e-01  -5.36704952e-01  -5.27018977e-01
      -4.12939247e-01]]

   [[  2.30468909e-01   4.01866311e-01  -2.08318517e-01   5.50211429e-01
       2.51777922e-01]
    [ -2.04221267e-01   2.51715343e-01  -5.5203


After the model is fitted to the data, we can execute some functions on the model:

In [23]:
mymod.aggregate("maximum")

(['Female',
  'West',
  'No',
  'Center-right',
  54.593417414050819,
  3.3353064275037374,
  1358.0209846786247,
  8.1159491778774289,
  3.6650448430493272],
 ['Male',
  'West',
  'No',
  'Center-right',
  52.42017937219731,
  3.4735426008968608,
  1737.0363228699553,
  7.8278026905829599,
  3.6107623318385649])

Why do we get two different maxima? One aggregation is executed against the fitted model (the first result) and the other against the data (the second result). To get only the maximum predicted by the model, we have to set the model's `mode` attribute:

In [24]:
mymod.mode = 'model' # or 'data', 'both'
argmax = mymod.aggregate("maximum")
argmax

['Female',
 'West',
 'No',
 'Center-right',
 54.593417414050819,
 3.3353064275037374,
 1358.0209846786247,
 8.1159491778774289,
 3.6650448430493272]

To better understand what it means to execute a query against the model or against the data, we calculate the density of the distribution at a specific point, for example at the argmax:

In [25]:
mymod.density(argmax)

0.0019600700797255009

What happens if we change the mode of the model?

In [26]:
mymod.mode = 'data'
mymod.density(argmax)

0

Why do we get 0 as an answer? A density query against the data returns the number of observations equal to the given point, and the argmax was evidently never observed (nobody reports an age of 54.593...). So let us ask for the density of a point that we know exists, for example the first row of the table above:

In [27]:
firstrow = ['Female', 'East', 'No', 'Left', 47, 3, 1800, 9, 3]
mymod.density(firstrow)

1

This means that the data point appears exactly once in the whole data set. We can also ask for the model's density at this point:

In [28]:
mymod.mode = 'model'
mymod.density(firstrow)

0.0001405463987097855
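
The contrast between the two modes can be illustrated without mb_modelbase. The sketch below uses only the five ages from the table above: a data-style query counts exact matches, while a model-style query evaluates a fitted density. Here a single normal distribution from the standard library's `statistics.NormalDist` stands in for the Conditional Gaussian model, and the function name `data_density` is hypothetical.

```python
from statistics import NormalDist

# Ages of the five respondents shown by dataset.head().
ages = [47, 52, 61, 54, 49]

def data_density(observations, point):
    """Count how often `point` occurs exactly in the observations."""
    return sum(1 for value in observations if value == point)

# Stand-in "model": a normal distribution fitted to the sample.
age_model = NormalDist.from_samples(ages)

# An observed value: found once in the data, positive density under the model.
print(data_density(ages, 47), age_model.pdf(47))

# An unobserved value: zero matches in the data, but still positive density.
print(data_density(ages, 54.59), age_model.pdf(54.59))
```

The count drops to zero as soon as the queried value was never recorded, whereas the fitted density stays positive everywhere.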

To keep our fitted model, we can save it and load it another time:

In [64]:
mymod.save(model=mymod, filename="example_models/Allbus_CondGauss.mdl")
loadmod = mbase.Model.load("example_models/Allbus_CondGauss.mdl")
loadmod.names

['sex',
 'eastwest',
 'lived_abroad',
 'spectrum',
 'age',
 'educ',
 'income',
 'happiness',
 'health']
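
Conceptually, saving and loading means serializing a Python object to a file and reading it back later. How mb_modelbase does this internally is not covered here; the following is a generic sketch with the standard library's `pickle`, in which the dictionary `model_state` is a hypothetical stand-in for a fitted model.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: a name plus its dimensions.
model_state = {
    "name": "Allbus_CondGauss",
    "names": ["sex", "eastwest", "income"],
}

with tempfile.TemporaryDirectory() as model_dir:
    path = os.path.join(model_dir, "Allbus_CondGauss.mdl")

    # Save: write the object to disk.
    with open(path, "wb") as f:
        pickle.dump(model_state, f)

    # Load: read it back as a new object with the same content.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded["names"])
```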

For this purpose we can also create a modelbase, an object that holds all the models we have learned. It is important to name a directory where the models are stored and loaded from:

In [66]:
#Create the modelbase
mymodelbase = mbase.ModelBase("My Modelbase", load_all=False, model_dir="example_models")
#Add our fittedmodel mymod to the modelbase mymodelbase
mymodelbase.add(mymod) 
#Save all models of the modelbase (the parentheses are required to actually call the method)
mymodelbase.save_all_models()

## Marginalization and Conditionalization (basic)

Right now our model is too complex to yield useful information directly. We therefore marginalize out some dimensions and condition others on specific values in order to concentrate on the information we are actually interested in.

### 1. Marginalization

First we show how to remove dimensions to reduce the model to fewer variables:

In [67]:
#Option 1: 'keep' lists the dimensions you want to keep
mymod_marg = mymod.copy().marginalize(keep=['income', 'sex', 'eastwest'])
#Option 2 (equivalent here): 'remove' lists the dimensions you want to remove
mymod_marg = mymod.copy().marginalize(remove=['lived_abroad', 'spectrum', 'educ', 'happiness', 'health', 'age'])
mymod_marg.names

['sex', 'eastwest', 'income']

So we are left with exactly the 3 dimensions we asked for. Now we can use our basic functions:

In [68]:
argmax = mymod_marg.aggregate("maximum")
argmax

['Female', 'West', 1355.5720558306523]

In [69]:
mymod_marg.density(["Male", "West", 1800])

0.10685935992577861

So we have reduced our model to 3 dimensions without any new fitting process!
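
Why no refitting is needed can be seen for the Gaussian part of such a model: marginalizing a multivariate Gaussian only requires selecting the entries of the mean vector and covariance matrix that belong to the kept dimensions. The sketch below illustrates this standard property with made-up parameters. The function `marginalize_gaussian` and all numbers are hypothetical; they come neither from the fitted ALLBUS model nor from mb_modelbase.

```python
def marginalize_gaussian(names, mean, cov, keep):
    """Return names, mean and covariance restricted to the kept dimensions."""
    idx = [names.index(name) for name in keep]
    sub_mean = [mean[i] for i in idx]
    sub_cov = [[cov[i][j] for j in idx] for i in idx]
    return list(keep), sub_mean, sub_cov

# Made-up parameters of a 3-dimensional Gaussian (illustration only).
names = ["age", "income", "happiness"]
mean = [50.0, 1500.0, 7.5]
cov = [
    [300.0, 200.0, -1.0],
    [200.0, 900000.0, 50.0],
    [-1.0, 50.0, 3.0],
]

kept_names, kept_mean, kept_cov = marginalize_gaussian(
    names, mean, cov, keep=["income", "happiness"]
)
print(kept_names)
print(kept_mean)
print(kept_cov)
```

The parameters of the remaining dimensions are copied unchanged, which is why the reduced model is available immediately.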

### 2. Conditionalization

Now we want to condition a variable on specific values and compare the resulting models. In this example we compare the income of men and women:

In [70]:
#How to conditionalize a variable on a specific value:
#1. Conditionalize the variables 
mymod_cond_mann = mymod.copy().condition([mbase.Condition("sex", "==", "Male")])
mymod_cond_frau = mymod.copy().condition([mbase.Condition("sex", "==", "Female")])
#2. Marginalize out the dimension
mymod_cond_mann.marginalize(remove=['sex'])
mymod_cond_frau.marginalize(remove=['sex'])

#Alternative: Steps 1 and 2 chained in one line
mymod_cond_mann = mymod.copy().condition([mbase.Condition("sex", "==", "Male")]).marginalize(remove=['sex'])
mymod_cond_frau = mymod.copy().condition([mbase.Condition("sex", "==", "Female")]).marginalize(remove=['sex'])

After conditioning the variable 'sex' on 'Male' in one model and on 'Female' in the other, we marginalize out all remaining variables except 'income' to filter out unnecessary information, and then aggregate:

In [71]:
mymod_cond_mann.marginalize(keep=['income'])
mymod_cond_frau.marginalize(keep=['income'])
[mymod_cond_mann.aggregate("maximum"), mymod_cond_frau.aggregate("maximum")]

[[1841.9319762906237], [1293.8338215305228]]

We can see a large difference in income between men and women: the model's maximum lies at about 1842 for men and at about 1294 for women.
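
On raw data, the analogue of conditioning on `sex` and keeping only `income` is to filter the rows by sex and summarize the remaining income values. The sketch below does this for the five rows from the table at the top, using the mean as the summary. It is meant as intuition only: the model-based query uses the fitted distribution rather than row filtering, and five rows are far too few to draw conclusions. The helper `mean_income` is a hypothetical name.

```python
from statistics import mean

# (sex, income) pairs from the five rows shown by dataset.head().
rows = [
    ("Female", 1800),
    ("Male", 2000),
    ("Male", 2500),
    ("Female", 860),
    ("Male", 2500),
]

def mean_income(data, sex):
    """Condition on sex by filtering, then summarize income by its mean."""
    incomes = [income for row_sex, income in data if row_sex == sex]
    return mean(incomes)

print(mean_income(rows, "Male"))
print(mean_income(rows, "Female"))
```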