# How to build and query probabilistic models with `modelbase`: an introduction to the Python API

Welcome to a short introductory example of the `modelbase` package.
There are multiple ways of using `modelbase`.
One is to run it as a web service, for instance to connect it to [`lumen`](https://github.com/lumen-org/lumen), a graphical user interface for visually exploring models.
Another is to use its Python API, which we introduce here.

## The MPG data set
Here, we will use the popular [cars/mpg data set](https://archive.ics.uci.edu/ml/datasets/auto+mpg), which is conveniently included in the `mb-data` package.

Let us import the data set and have a first view:

In [1]:
import pandas as pd
from mb.data import mpg

dataset = mpg.mixed()
dataset.head()

| | car_size | cylinder | displacement | mpg_city | mpg_highway |
|---|---|---|---|---|---|
| 0 | compact car | 4 | 2.296 | 17.0 | 17.0 |
| 1 | compact car | 4 | 2.296 | 17.0 | 17.0 |
| 2 | compact car | 6 | 2.4436 | 21.0 | 27.0 |
| 3 | compact car | 6 | 2.6896 | 18.0 | 24.0 |
| 4 | compact car | 6 | 2.6896 | 18.0 | 23.0 |


Here, we loaded a version of the data set with only 5 attributes: one is categorical (`car_size`) and the other four are quantitative (`cylinder`, `displacement`, `mpg_city`, `mpg_highway`).

## Fitting a probabilistic model

Let's create a probabilistic model, namely a Conditional Gaussian distribution.
Next, we will train it on our data, i.e., let an algorithm fit its internal parameters to best match the data.
It should take only a second to fit the model.

In [2]:
import mb.modelbase as mbase

# Create a Conditional Gaussian model (handles categorical and continuous variables)
mpg_model = mbase.MixableCondGaussianModel("mpg_v1")

# Fit the model parameters to the data
mpg_model.fit(df=dataset)
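
To give an intuition for what fitting such a model involves, here is a minimal sketch using only NumPy. It assumes one common parameterisation of a conditional Gaussian: a probability, a mean vector and a covariance matrix per category. The function name `fit_conditional_gaussian` and the tiny data set are made up for illustration; this is not how `modelbase` implements fitting internally.

```python
import numpy as np

def fit_conditional_gaussian(categories, values):
    """Estimate p(c), mean and covariance of the numeric columns per category c."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    params = {}
    for c in np.unique(categories):
        rows = values[categories == c]
        params[str(c)] = {
            "p": len(rows) / len(values),       # relative frequency of the category
            "mean": rows.mean(axis=0),          # per-category mean vector
            "cov": np.cov(rows, rowvar=False),  # per-category sample covariance
        }
    return params

# Made-up toy data: (mpg_city, mpg_highway) for two car sizes.
sizes = ["compact", "compact", "compact", "pickup", "pickup", "pickup"]
mpg = [[17.0, 24.0], [21.0, 27.0], [18.0, 23.0],
       [14.0, 18.0], [15.0, 19.0], [13.0, 17.0]]

params = fit_conditional_gaussian(sizes, mpg)
print(params["pickup"]["p"])
print(params["pickup"]["mean"])
```

Each category ends up with its own Gaussian over the quantitative variables, weighted by how often the category occurs in the data.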

## Executing queries on models

Once it is trained, we can execute various queries on the model.
Let's show the basic ones:

**descriptive queries:**
Here, descriptive means that the query will return some data table that describes a particular aspect of the model:

 * aggregation/prediction
 * sampling
 * probability/density queries

**modifying queries:**
Here, modifying means that such queries will change the queried model.

 * filtering/conditioning
 * marginalization

## Aggregation/Prediction

We can query for point predictions (here called aggregations).
The `maximum` aggregation gives us the point of maximum density of the 5-dimensional fully probabilistic model.
Intuitively, it answers this question: if we drew a new sample point from the model, what would its most likely value be?

In [9]:
arg_max = mpg_model.aggregate("maximum")
arg_max

['pickup',
 6.395435764938331,
 4.213260050346121,
 14.653353886176635,
 18.738270671563757]
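
The idea of a maximum aggregation can be sketched outside of `modelbase`. The following assumes a conditional Gaussian stored as one probability, mean and covariance per category (a hypothetical `params` dictionary with made-up numbers). Each per-category Gaussian peaks at its mean, so the overall maximum is the category whose weighted peak is highest.

```python
import numpy as np

def mode_of_conditional_gaussian(params):
    """Return (category, location) where p(c) * N(x | mean_c, cov_c) is largest."""
    best = None
    for c, prm in params.items():
        k = len(prm["mean"])
        # A Gaussian peaks at its mean; the peak height depends on the covariance.
        peak = prm["p"] / np.sqrt((2.0 * np.pi) ** k * np.linalg.det(prm["cov"]))
        if best is None or peak > best[0]:
            best = (peak, c, prm["mean"])
    return best[1], best[2]

# Hypothetical parameters for two categories and two numeric variables.
params = {
    "compact": {"p": 0.4, "mean": np.array([19.0, 26.0]),
                "cov": np.array([[4.0, 1.0], [1.0, 4.0]])},
    "pickup": {"p": 0.6, "mean": np.array([14.0, 18.0]),
               "cov": np.array([[1.0, 0.2], [0.2, 1.0]])},
}

category, location = mode_of_conditional_gaussian(params)
print(category, location)
```

Note that the winning category is not necessarily the most frequent one: a rarer category with a very narrow distribution can have a higher peak.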

## Density

Let's query the density of the model at this point:

In [10]:
mpg_model.density(arg_max)

0.12170594972155192

We can also query the density at any other point, such as an item from our data set.

In [22]:
random_data_item = dataset.sample().values[0]
p = mpg_model.density(random_data_item)
print(f'p({random_data_item}) = {p}')

p(['compact car' 4 2.296 17.0 24.0]) = 0.034981008135752945
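
As a rough illustration of what a density query computes, the sketch below evaluates a multivariate normal density with NumPy and weights it by a category probability. The helper names `gaussian_density` and `mixed_density` and all numbers are assumptions for this example and are not part of the `modelbase` API.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Density of a multivariate normal distribution at point x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = x - mean
    # Squared Mahalanobis distance, computed without forming the inverse explicitly.
    maha = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** mean.size * np.linalg.det(cov))
    return float(np.exp(-0.5 * maha) / norm)

def mixed_density(p_category, x, mean, cov):
    """Joint density p(c, x) = p(c) * N(x | mean_c, cov_c) for one category c."""
    return p_category * gaussian_density(x, mean, cov)

# Hypothetical parameters for one category with two numeric variables.
value = mixed_density(0.6, [14.0, 18.0], mean=[14.0, 18.0], cov=np.eye(2))
print(value)

# A density is not a probability: for a narrow distribution it can exceed 1.
narrow = gaussian_density([0.0], mean=[0.0], cov=[[0.01]])
print(narrow)
```

Density values are therefore best compared with each other rather than read as probabilities.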


## Probability
Similarly, we can also query for the probability of some interval:
TODO
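
Independently of the `modelbase` API, the concept behind a probability query can be shown for a single Gaussian variable: the probability of an interval is the difference of the cumulative distribution function at its two ends. This sketch uses only the standard library; the function names are hypothetical.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a univariate normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def interval_probability(low, high, mean, std):
    """P(low <= X <= high) for X ~ N(mean, std^2)."""
    return normal_cdf(high, mean, std) - normal_cdf(low, mean, std)

# Probability mass within one standard deviation of the mean.
p = interval_probability(-1.0, 1.0, mean=0.0, std=1.0)
print(round(p, 4))  # 0.6827
```

Unlike a density, this value always lies between 0 and 1.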

# Using model queries to specialize models

We can flexibly derive 'specialized' models from our full, five-dimensional model by conditioning or marginalizing it.

By default, these operations *modify* the model they are applied to.
To keep the original model, we can copy it explicitly using `.copy()`.
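
The copy-then-modify pattern can be illustrated with a toy class. `ToyModel` below is a hypothetical stand-in, not a `modelbase` class; it only mimics the behaviour described above, where an operation changes the object in place and returns it so that calls can be chained.

```python
import copy

class ToyModel:
    """Hypothetical stand-in for a model whose operations work in place."""

    def __init__(self, names):
        self.names = list(names)

    def marginalize(self, remove):
        # Modifies this object and returns it, which allows method chaining.
        self.names = [n for n in self.names if n not in remove]
        return self

    def copy(self):
        # A deep copy shares no state with the original.
        return copy.deepcopy(self)

original = ToyModel(["car_size", "cylinder", "mpg_city"])
derived = original.copy().marginalize(remove=["car_size"])

print(original.names)  # ['car_size', 'cylinder', 'mpg_city']
print(derived.names)   # ['cylinder', 'mpg_city']
```

Without the call to `copy()`, `original` and `derived` would be the same, modified object.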

## Conditioning / filtering

Conditioning means fixing one or more random variables to a value, or restricting them to a range of values.


In [24]:
conditions = [mbase.ConditionTuple('car_size', "==", "compact car"),
              mbase.ConditionTuple('cylinder', '<', 4)]
mpg_model_conditionalized = mpg_model.copy().condition(conditions)

As is to be expected, smaller cars with fewer cylinders get more miles per gallon than the typical car overall:

In [29]:
print(mpg_model.names)
print(f"conditionalized model: {mpg_model_conditionalized.aggregate(method='maximum')}")
print(f"original model       : {mpg_model.aggregate(method='maximum')}")

['car_size', 'cylinder', 'displacement', 'mpg_city', 'mpg_highway']
conditionalized model: ['compact car', 4, 2.4870509541527945, 19.75823175695253, 26.74175955164767]
original model       : ['pickup', 6.395435764938331, 4.213260050346121, 14.653353886176635, 18.738270671563757]
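
For the purely Gaussian part of a model, conditioning on an exact value has a closed form. The sketch below applies the standard formulas for a multivariate normal with NumPy. It covers only equality conditions (not range conditions such as `<`), and `condition_gaussian` is a hypothetical helper, not the routine `modelbase` uses.

```python
import numpy as np

def condition_gaussian(mean, cov, fixed_idx, fixed_values):
    """Mean and covariance of the free variables given fixed values of the others."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    fixed_idx = list(fixed_idx)
    free_idx = [i for i in range(len(mean)) if i not in fixed_idx]

    s_ff = cov[np.ix_(free_idx, free_idx)]    # covariance among free variables
    s_fx = cov[np.ix_(free_idx, fixed_idx)]   # cross-covariance free vs. fixed
    s_xx = cov[np.ix_(fixed_idx, fixed_idx)]  # covariance among fixed variables

    gain = s_fx @ np.linalg.inv(s_xx)
    shift = np.asarray(fixed_values, dtype=float) - mean[fixed_idx]
    cond_mean = mean[free_idx] + gain @ shift
    cond_cov = s_ff - gain @ s_fx.T
    return cond_mean, cond_cov

# Two correlated variables; fix the second one at 2.0.
cond_mean, cond_cov = condition_gaussian(
    mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]],
    fixed_idx=[1], fixed_values=[2.0],
)
print(cond_mean, cond_cov)
```

Observing a value for one variable shifts the mean of the correlated variable and reduces its variance.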


## Marginalization

Marginalization means removing one or more variables from the model by integrating (or summing) them out.

Originally, our model has 5 dimensions:

In [36]:
print(f"{mpg_model.dim} dimensions with names {mpg_model.names}")

5 dimensions with names ['car_size', 'cylinder', 'displacement', 'mpg_city', 'mpg_highway']


Now, let's marginalize out the variables `car_size` and `mpg_city`:

In [38]:
mpg_model_marginalized = mpg_model.copy().marginalize(remove=['car_size', 'mpg_city'])
print(f"{mpg_model_marginalized.dim} dimensions with names {mpg_model_marginalized.names}")

3 dimensions with names ['cylinder', 'displacement', 'mpg_highway']


Alternatively, we can specify the variables to keep instead of those to remove.
This is handy when only a few variables are of interest:

In [39]:
mpg_model_marginalized = mpg_model.copy().marginalize(keep=['cylinder'])
print(f"{mpg_model_marginalized.dim} dimensions with names {mpg_model_marginalized.names}")

1 dimensions with names ['cylinder']
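
For a multivariate Gaussian, marginalizing is particularly simple: the marginal over the kept variables is obtained by selecting the matching entries of the mean vector and covariance matrix. The sketch below shows this with NumPy, supporting both a `keep` and a `remove` argument. The function `marginalize_gaussian` and the numbers are made up for illustration and do not reflect the `modelbase` internals.

```python
import numpy as np

def marginalize_gaussian(names, mean, cov, keep=None, remove=None):
    """Return names, mean and covariance of the marginal over the kept variables."""
    if keep is None:
        # Keeping is the complement of removing.
        keep = [n for n in names if n not in (remove or [])]
    idx = [names.index(n) for n in keep]
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    return list(keep), mean[idx], cov[np.ix_(idx, idx)]

# Made-up parameters for three numeric variables.
names = ["cylinder", "displacement", "mpg_highway"]
mean = [5.6, 3.4, 23.0]
cov = [[2.6, 1.9, -6.0],
       [1.9, 1.7, -5.0],
       [-6.0, -5.0, 32.0]]

kept, m, c = marginalize_gaussian(names, mean, cov, remove=["displacement"])
print(kept)
print(m)
print(c)
```

Calling the function with `keep=["cylinder", "mpg_highway"]` gives the same result as `remove=["displacement"]`.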


Again, as the marginalized model is just a model, we can run any query on it:


In [59]:
print(f"density(cylinder=4) = {mpg_model_marginalized.density([4])} \n"
      #f"probability(cylinder in [1,4]) = {mpg_model_marginalized.probability([mbase.NumericDomain(1,4)])}"
      f"argmax(model) = {mpg_model_marginalized.aggregate(method='maximum')}"
 )

density(cylinder=4) = 0.24835018618057497 
argmax(model) = [5.605323188960642]


## Other queries

There are a number of other operations.
For instance, we can save a model to disk and, of course, load it from disk again:

In [62]:
mpg_model.save(dir='.')

<mb.modelbase.models.mixable_cond_gaussian.MixableCondGaussianModel at 0x7faceb9e9940>

By default, the file name is the name of the model, followed by the `.mdl` file extension.

Hence, to load, we can simply do:

In [78]:
mpg_model2 = mbase.Model.load(mpg_model.name + '.mdl')
mpg_model2.density(mpg_model2.data.sample().iloc[0])

0.09090717745084334