## Can we help foggy gaussians to lift off with bernoulli lighthouses?

During the data science trainee program of Lea W. we [explored the gaussian mixture model](https://www.kaggle.com/allunia/hidden-treasures-in-our-groceries) using real, dirty data from the openfoodfacts app. The treasures we were looking for are the hidden product categories, which we tried to find by unsupervised clustering on nutrition table information. **This way we found nice clusters that already hold meaningful products like pasta, yoghurts, cookies and much more**. In addition we found an anomalistic cluster that was fully occupied by outliers or rare products. From a coarse-grained view all results look nice and the clustering was successful.


Let's take a look at some of these clusters! But to do so, we need to:

* Load packages
* Load data

In [None]:
########################### Load packages

# This Python 3 environment comes with many helpful analytics libraries installed
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
%matplotlib inline
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import seaborn as sns
sns.set()

from sklearn.model_selection import train_test_split

########################### Load the data

data = pd.read_csv("../input/hidden-treasures-in-our-groceries/hidden_treasures_groceries_gmm.csv",
                  index_col=0)
data["product"] = data["product"].astype(str)
data.head()

Now, let's take a look at product name words of some of our clusters obtained by Gaussian Mixture:

In [None]:
def make_word_cloud(data, cluster, subplotax, title):
    words = data[data.cluster==cluster]["product"].apply(lambda l: l.lower().split())
    cluster_words=words.apply(pd.Series).stack().reset_index(drop=True)

    text = " ".join(w for w in cluster_words)

    # Create and generate a word cloud image:
    wordcloud = WordCloud(max_font_size=30, max_words=30,
                          background_color="white", colormap="YlGnBu").generate(text)

    # Display the generated image:
    
    subplotax.imshow(wordcloud, interpolation='bilinear')
    subplotax.axis("off")
    subplotax.set_title(title)
    return subplotax

fig, ax = plt.subplots(1,2, figsize=(20,5))
make_word_cloud(data, 8, ax[0], "Yoghurt and Milk")
make_word_cloud(data, 9, ax[1], "Pasta")

These clusters look really nice and they are what we are looking for. But diving deeper into cluster statistics and looking at violin plots of the nutrient distributions per cluster revealed **strange behaviours of our model**: Besides the anomalistic cluster we found a mixed-type, but not anomalistic, cluster that showed the same nutrient distributions except for 2 features: fat and proteins. In these nutrients this normal cluster showed sharp zeros. Thus it contained products that were typed in by users with zero fat and zero proteins.

In [None]:
nutrients = ["fat_100g",
             "proteins_100g",
             "carbohydrates_100g",
             "sugars_100g", 
             "other_carbs",
             "salt_100g",
             "g_sum"]
energies = ["energy_100g", "reconstructed_energy"] 

transformed_nutrients = ["transformed_" + nutrient for nutrient in nutrients]
transformed_energies = ["transformed_" + energy for energy in energies]

def make_violin(subax, cluster, nutrients):
    pos = np.arange(1, len(nutrients)+1)
    part = subax.violinplot(
            data[data.cluster==cluster]
                [nutrients].values,
            showmeans=True,
            showextrema=False)
    subax.set_title("Feature distributions of cluster: " + str(cluster), size = 20)
    subax.set_xticks(pos)
    subax.set_xticklabels(nutrients)
    set_color(part, len(nutrients))
    return subax

def set_color(axes, num_colors):
    cm = plt.get_cmap('RdYlBu_r')  # plt.cm.get_cmap is deprecated in newer matplotlib
    NUM_COLORS=num_colors
    for n in range(len(axes["bodies"])):
        pc = axes["bodies"][n]
        pc.set_facecolor(cm(1.*n/NUM_COLORS))
        pc.set_edgecolor('black')
    return axes

fig, ax = plt.subplots(2,2,gridspec_kw = {'width_ratios':[3, 1]}, figsize=(30,10))
pair00 = make_violin(ax[0,0], 6, nutrients)
ax[0,0].set_ylim([0,100])
pair01 = make_violin(ax[0,1], 6, energies)
ax[0,1].set_ylim([0,4000])
pair10 = make_violin(ax[1,0], 12, nutrients)
ax[1,0].set_ylim([0,100])
pair11 = make_violin(ax[1,1], 12, energies)
ax[1,1].set_ylim([0,4000])

Both clusters spread widely into the feature space and contain outliers in their features. Looking at the anomalies per cluster, we can see that cluster 6 is highly anomalistic whereas cluster 12 is not. And that is really strange!

In [None]:
fig, ax = plt.subplots(1,2,figsize=(20,5))
counts_6 = data[data.cluster==6].anomaly.value_counts() / data[data.cluster==6].anomaly.count() * 100
counts_12 = data[data.cluster==12].anomaly.value_counts() / data[data.cluster==12].anomaly.count() * 100
sns.barplot(x=counts_6.index, y=counts_6.values, ax=ax[0], palette="Set2")
ax[0].set_title("Anomaly detection in cluster 6")
ax[0].set_xlabel("normal: 0 - anomal: 1")
ax[0].set_ylim([0,100])
ax[0].set_ylabel("Percentage of cluster data")
sns.barplot(x=counts_12.index, y=counts_12.values, ax=ax[1], palette="Set2")
ax[1].set_title("Anomaly detection in cluster 12")
ax[1].set_xlabel("normal: 0 - anomal: 1")
ax[1].set_ylim([0,100])
ax[1].set_ylabel("Percentage of cluster data")

### Why did that happen? 

Well... for us it's still not easy to understand. Both clusters have feature means that are close to each other, except for proteins and fat. Recalling the definition of an anomaly from the viewpoint of a gaussian mixture model, we can say that **our model detects an anomaly if the sample is placed in a low-density region of our feature space**. But this mixed-type cluster with zero-fat-zero-proteins is very dense in these two features, as all samples are placed at only one point: 0 :-). And even if all other nutrient features spread widely into the space, this **heavy sample density on the zero plane may cause the model to say that this is not anomalistic**.
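To make this argument a bit more tangible, here is a minimal sketch with made-up numbers (not taken from our data). It assumes a 2D gaussian with diagonal covariance and compares the log density of the same far-out point for a widely spread cluster and for a cluster that is squeezed onto a zero plane. The helper `diag_gaussian_logpdf` is hypothetical and only for illustration.

```python
import numpy as np

def diag_gaussian_logpdf(x, mean, std):
    """Log density of a gaussian with diagonal covariance (per-feature std)."""
    x, mean, std = map(np.asarray, (x, mean, std))
    z = (x - mean) / std
    return np.sum(-0.5 * z**2 - np.log(std) - 0.5 * np.log(2 * np.pi), axis=-1)

# Hypothetical example: feature 0 spreads widely in both clusters,
# feature 1 is either spread out or collapsed onto the zero plane.
point = np.array([60.0, 0.0])  # two standard deviations away in feature 0
wide = diag_gaussian_logpdf(point, mean=[20.0, 0.0], std=[20.0, 20.0])
flat = diag_gaussian_logpdf(point, mean=[20.0, 0.0], std=[20.0, 1e-3])

print(f"log density in the widely spread cluster: {wide:.2f}")
print(f"log density in the zero-plane cluster:    {flat:.2f}")
```

The point is equally far from the mean in both cases, but the tiny variance on the zero plane lifts its density a lot. A density threshold that flags the point in the wide cluster can therefore miss it in the zero-plane cluster.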




### And how can we escape from the mighty zero-plane?

In this kernel I will try to improve the model by gluing together Bernoullis and Gaussians that try to explain the partly discrete and partly continuous nature of nutrition table information. Hopefully the bernoullis will guide the gaussians to better results, especially in anomaly detection. In the beginning the kernel might look very crowded with math and code, as I need to derive and implement the new model. Unfortunately I could only find tools for either Gaussian or Bernoulli Mixture Models and not for mixed types. Hence I have to roll up my sleeves and put some work into the implementation.

I'm excited and curious whether this new model can lift off the gaussians! :-)

## Building lighthouses 

Before we can start to dive into implementation and analysis, we have to derive the model and its learning procedure:

$$ p(\hat{x}) = \sum_{k=1}^{K} \pi_{k} \cdot B_{k}(\chi| \nu_{k}) \cdot N_{k}(x| \mu_{k}, \Sigma_{k})$$

In contrast to the gaussian mixture model you can see that the density we need to detect anomalies is now calculated by multiplying gaussians with bernoullis. Let's try to understand this model:

$$ B(\chi_{n}| \nu_{k}) = \prod_{d=1}^{D} \nu_{k,d}^{\chi_{n,d}} \cdot (1-\nu_{k,d})^{(1-\chi_{n,d})} $$

The bernoulli distribution describes the discrete nature of the features. To put it to work we need new features $\chi_{n,d}$:
* A discrete nutrient feature that holds 1 if a nutrient of a product is zero (for example zero fat) and 0 in the case where it's greater than zero. This way we **describe the zeroness of a nutrient**. In contrast to pure nutrients like fat, proteins etc. the energy features and the g_sum feature should be treated differently:
* For energies we would like to know if there is a **discrepancy between the user-given energy and the reconstructed energy**. We could set 1 in cases where the reconstructed energy is at least as high as the user-given energy and 0 the other way round. As almost all samples have discrepancies between these energies, you might think that this feature is useless, but I think it acts like a separator that makes it possible to explain some more patterns, like products where all nutrients are zero but the energy is positive and hence higher than the reconstructed one.
* In addition we can try to cover the errors where the sum of all nutrients exceeds 100g (g_sum) and the case where the energy exceeds 3700. We exclude these kinds of errors during fitting, but for testing model performance on test data these **new features of exceeding nature** could be very helpful.
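As a small illustration of the bernoulli part, the following sketch evaluates $\ln B(\chi_{n}| \nu_{k})$ for all samples and components at once. The function name `bernoulli_log_prob`, the clipping constant `eps` and the toy numbers are my own assumptions and not part of the final model code.

```python
import numpy as np

def bernoulli_log_prob(chi, nu, eps=1e-10):
    """Log of B(chi_n | nu_k) for all samples n and components k.

    chi: (N, D) binary feature matrix, nu: (K, D) bernoulli parameters.
    Returns an (N, K) matrix of log probabilities.
    """
    nu = np.clip(nu, eps, 1 - eps)  # assumption: clip to avoid log(0)
    return chi @ np.log(nu).T + (1 - chi) @ np.log(1 - nu).T

# Toy example: 2 products, 2 binary features, 2 components
chi = np.array([[1, 0],
                [0, 0]])
nu = np.array([[0.9, 0.1],
               [0.2, 0.3]])
log_b = bernoulli_log_prob(chi, nu)
print(np.exp(log_b))
```

Working with logs turns the product over the $D$ features into a sum, which is numerically much safer once $D$ grows.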

Let's explore the discreteness of our data :-)

## The islands where lighthouses live

First of all, let's try to understand couplings, for example one might ask: "Do products with zero proteins often have zero fat as well?" and the other way round: "Do products with zero fat often have zero proteins?" 

### Can we find some coupled zero-zero features?

In [None]:
lighthouses = pd.DataFrame(index=data.index)
lighthouses["zero_proteins"] = np.where(data.proteins_100g == 0, 1, 0)
lighthouses["zero_fat"] = np.where(data.fat_100g == 0, 1, 0)
lighthouses["zero_sugars"] = np.where(data.sugars_100g == 0, 1, 0)
lighthouses["zero_carbs"] = np.where(data.carbohydrates_100g == 0, 1, 0)
lighthouses["zero_other_carbs"] = np.where(data.other_carbs == 0, 1, 0)
lighthouses["zero_salt"] = np.where(data.salt_100g == 0, 1, 0)

lighthouses["energy_separator"] = np.where(data.reconstructed_energy >= data.energy_100g, 1, 0)
lighthouses["exceeds_energy"] = np.where(data.energy_100g > 3700, 1, 0)
lighthouses["exceeds_reconstructed_energy"] = np.where(data.reconstructed_energy > 3700, 1, 0)

In [None]:
cols_of_interest = [col for col in lighthouses.columns if "zero" in col]

In [None]:
def get_percentage(feature):
    result = lighthouses[lighthouses[feature] == 1].sum(axis=0) 
    result /= lighthouses[lighthouses[feature] == 1][feature].count() 
    return result

zero_heatmap = pd.DataFrame(index=lighthouses.columns)

for col in lighthouses.columns:
    zero_heatmap[col] = np.round(get_percentage(col) * 100)

plt.figure(figsize=(10,5))
sns.heatmap(zero_heatmap.loc[cols_of_interest, cols_of_interest].transpose(),
            cmap="coolwarm", annot=True, cbar=False, fmt="g")
plt.ylabel("Products with ")
plt.xlabel("contain this percentage of ")

### Take-Away

Ok, first the questions and then in general:

* Products with zero proteins have zero fat as well in 82 % of cases. And the other way round: Products with zero fat have zero proteins in 62 % of cases. We can see an imbalanced coupling of the zeroness of proteins and fat. Think of it - does it make sense? What could be a product with zero fat? .... Water, drinks in general, vegetables, perhaps fruits... these products are likely to have no proteins as well. I'm curious if we can reveal what is hidden behind such groups.
* The next obvious pattern is between zero carbs, zero sugars and zero other carbs. It makes sense that products with zero carbohydrates have zero sugars and zero other carbs, as the latter are themselves carbohydrates.
* Now let's take a look at the smoother patterns: Products with zero salt often have zero fat and proteins as well. And one antipattern: products with zero salt, proteins and fat often consist of carbohydrates.
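The imbalance between the two directions is just a matter of different denominators. As a hedged sketch (with a toy frame instead of our data, and a hypothetical helper `conditional_percentages`), the same kind of table can be obtained from co-occurrence counts:

```python
import pandas as pd

def conditional_percentages(flags):
    """Entry [a, b]: percentage of rows with flags[a] == 1 that also have flags[b] == 1."""
    co_counts = flags.T @ flags   # how often a and b are 1 together
    totals = flags.sum(axis=0)    # how often a is 1 at all
    return co_counts.div(totals, axis=0) * 100

# Toy data, invented for illustration only
toy = pd.DataFrame({"zero_proteins": [1, 1, 0, 0, 1],
                    "zero_fat":      [1, 1, 1, 1, 0]})
result = conditional_percentages(toy)
print(result.round(1))
```

Both directions share the same co-occurrence count, but it is divided by the number of zero-protein products in one row and by the number of zero-fat products in the other. Note that a flag that never occurs would produce a division by zero here.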

Next topic to discover: 

### Which is the feature with the highest count of zeroness over all products of our data?

In [None]:
zeroness_per_product = lighthouses[cols_of_interest].sum(axis=1).value_counts() / lighthouses.shape[0] * 100
percentage_zeroness = lighthouses.sum(axis=0) / lighthouses.count(axis=0) * 100
percentage_zeroness = percentage_zeroness.loc[cols_of_interest]
percentage_zeroness = percentage_zeroness.sort_values()

fig, ax = plt.subplots(1,2,figsize=(20,5))
sns.barplot(x=percentage_zeroness.index, y=percentage_zeroness.values, order=percentage_zeroness.index, 
           palette="Reds", ax=ax[0])
ax[0].set_ylabel("% in data")
# Use keyword arguments and draw explicitly on the second subplot
sns.barplot(x=zeroness_per_product.index, y=zeroness_per_product.values, ax=ax[1])
ax[1].set_xlabel("Number of zero nutrients per product")
ax[1].set_ylabel("% in data")

### Take-Away

* We can see that most products have zero fat, zero other carbohydrates or zero proteins. Interestingly carbohydrates themselves and sugars are often higher than zero. Perhaps the data given is full of sweets or perhaps the industry likes sugar in each product? :-)
* The second plot shows that most products have no zero nutrients. These are all data spots that are not stuck to the mighty zero planes. In addition we can see that products with more than 3 zero nutrients are rare. They have probably not caused the problem that clusters stuck to zero planes are of high density even though they consist of many outliers. Hence our problem is mainly caused by the products with 1, 2 or 3 zero nutrients.

## Foggy Gaussians

Now we have gained an impression of what the discrete features of nutrients can tell us. We keep in mind that zero fat often comes with zero proteins, and we know that there are roughly 55 % of products with no zero nutrients and 45 % with one or more. The latter probably cause our anomaly detection problem, as high counts on the zero planes or axes lead to very high densities. We changed our model by introducing bernoulli distributions $B_{k}$ that explain the discreteness of each cluster:

$$ p(\hat{x}) = \sum_{k=1}^{K} \pi_{k} \cdot B_{k}(\chi| \nu_{k}) \cdot N_{k}(x| \mu_{k}, \Sigma_{k})$$


Doing so they change the way we calculate densities, and hopefully this will change the cluster formation process such that we can observe even nicer patterns and detect anomalies sufficiently well. Now, let's take a look at the second part of our model, the gaussians $N_{k}$:

$$ N(x_{n}| \mu_{k}, \Sigma_{k}) = \frac{1}{\sqrt{(2\pi)^{d}\det\Sigma_{k}}} \cdot \exp \left( -\frac{1}{2} \cdot (x_{n} - \mu_{k})^{T}  \Sigma_{k}^{-1} (x_{n} - \mu_{k}) \right) $$ 


In contrast to the Bernoullis they describe the continuous nature of our nutrients. Each nutrient should be allowed to cover a range of 0 up to 100g. As this is not described by the zero-features we need the original ones. That's the reason why we have $\chi$ for Bernoulli and $x$ for the original continuous features. 
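For the gaussian part, a minimal sketch of the log density could look like this. It assumes a full, positive definite covariance matrix and uses a Cholesky factorization instead of an explicit inverse. The name `gaussian_log_pdf` and the toy inputs are assumptions for illustration.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Log of N(x_n | mu, Sigma) for each row of x. Returns shape (N,)."""
    d = mu.shape[0]
    chol = np.linalg.cholesky(sigma)               # Sigma = L L^T
    diff = np.linalg.solve(chol, (x - mu).T)       # L^{-1} (x - mu), shape (d, N)
    maha = np.sum(diff**2, axis=0)                 # squared Mahalanobis distance
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))  # ln det Sigma
    return -0.5 * (d * np.log(2 * np.pi) + log_det + maha)

# Toy check with a standard normal in 2 dimensions
x = np.array([[0.0, 0.0],
              [1.0, 0.0]])
log_n = gaussian_log_pdf(x, mu=np.zeros(2), sigma=np.eye(2))
print(log_n)
```

The Cholesky route fails loudly if a covariance matrix is not positive definite, which is a useful warning sign when a cluster collapses onto a zero plane.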

If you take a look at two nutrients of your choice of the following,

* proteins_100g
* fat_100g
* salt_100g
* sugars_100g
* carbohydrates_100g
* other_carbs

you can see again how different clusters 6 and 12 are. Both clusters contain products that differ strongly in the way zeroness occurs in their nutrients. This is the second motivation to extend the old Gaussian Mixture Model: Adding the discreteness as a second aspect for clustering may help to detect more patterns. We expect that products that are similar in their zeroness form their own groups! :-) Besides anomaly detection, there should be an improvement of the clustering as well!

In [None]:
features = ["sugars_100g", "salt_100g"]

In [None]:
sns.set()
fig, ax = plt.subplots(1,2, figsize=(20,5))
ax[0].scatter(data[(data.cluster==6) & (data.anomaly==1)]["transformed_" + features[0]].values,
              data[(data.cluster==6) & (data.anomaly==1)]["transformed_" + features[1]].values, s=1, alpha=0.5, color="coral")
ax[0].set_title("Anomalistic cluster of old GMM")
ax[0].set_xlabel("Boxcox transformed " + features[0])
ax[0].set_ylabel("Boxcox transformed " + features[1])
ax[1].scatter(data[(data.cluster==12) & (data.anomaly==0)]["transformed_" + features[0]].values,
              data[(data.cluster==12) & (data.anomaly==0)]["transformed_" + features[1]].values, s=1, alpha=0.5, color="mediumaquamarine")
ax[1].set_title("Non-Anomalistic counterpart cluster of old GMM")
ax[1].set_xlabel("Boxcox transformed " + features[0])
ax[1].set_ylabel("Boxcox transformed " + features[1])

## Expectation maximization

Ok, now we will dive deeper and set up the learning process for our model with the expectation maximization algorithm. It's a general method to make mixture models learn and has a well-founded theory that is beyond the scope of this kernel. I will heavily build upon the explanations in the book "Pattern recognition and machine learning" by [Christopher Bishop](https://www.microsoft.com/en-us/research/people/cmbishop/), which is one of my favorites.

The next lines will be full of math and some of these lines are just a recap for myself. So if you don't want to go into details for yourself, make a jump to the next chapter. ;-)

### Log-Likelihood of incomplete data $D = [x_{n}]_{n=1}^{N}$

Let's take a look at our model again:

$$ p(\hat{x}) = \sum_{k=1}^{K} \pi_{k} \cdot B_{k}(\chi| \nu_{k}) \cdot N_{k}(x| \mu_{k}, \Sigma_{k})$$

It describes the probability density given by our samples within the feature space over all nutrients we take into account. We want the model to describe our data well. It should fit our data as closely as it can, and the only way to shape it as we want is by tuning its parameters: the component probabilities $\pi_{k}$, the bernoulli centers $\nu_{k}$, as well as the gaussian centers $\mu_{k}$ and their covariances $\Sigma_{k}$. We are searching for a maximum likelihood solution with respect to these parameters. As we have gaussians with exponentials, we expect that it's easier to consider the log of our probability density:

$$ \ln p(\hat{x}) = \ln \left( \sum_{k=1}^{K} \pi_{k} \cdot B_{k}(\chi| \nu_{k}) \cdot N_{k}(x| \mu_{k}, \Sigma_{k}) \right)  $$

Oh, here comes the trouble: As we have a sum over the $K$ cluster components, the $\ln$ can't act directly on our bernoullis and gaussians :-( . Setting the derivatives of this function with respect to our parameters to zero does not yield a closed-form solution. For this reason we need a new approach!
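Even if we can't maximize this expression in closed form, we can still evaluate it, for example to monitor convergence. The sketch below assumes that the log bernoulli and log gaussian terms are already given as `(N, K)` matrices and that the total log-likelihood is the sum of the per-sample values. It uses a log-sum-exp over the components to avoid underflow. The name `mixture_log_likelihood` and the toy numbers are hypothetical.

```python
import numpy as np

def mixture_log_likelihood(log_pi, log_b, log_n):
    """Sum over n of ln( sum_k pi_k * B_k * N_k ), computed in log space.

    log_pi: (K,), log_b and log_n: (N, K). Returns (total, per_sample).
    """
    log_joint = log_pi[None, :] + log_b + log_n           # ln(pi_k B_k N_k)
    per_sample = np.logaddexp.reduce(log_joint, axis=1)   # log-sum-exp over k
    return per_sample.sum(), per_sample

# Toy example with one sample and two components
log_pi = np.log(np.array([0.5, 0.5]))
log_b = np.log(np.array([[0.2, 0.4]]))
log_n = np.log(np.array([[1.0, 0.5]]))
total, per_sample = mixture_log_likelihood(log_pi, log_b, log_n)
print(total)
```

A naive `np.log(np.sum(np.exp(...)))` would return `-inf` as soon as all component densities underflow, which can easily happen for outliers.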

### Log-Likelihood of complete data $D = [x_{n}, z_{n}]_{n=1}^{N}$

Things would become much nicer if we could place the $\ln$ inside the sum. But without a clear motivation this idea is not worth trying. In our current state we have something like this:

$$\ln p(X|\theta) = \ln \left( \sum_{Z} p(X,Z|\theta) \right) $$

The $Z$ stands for our hidden latent variables, the product categories we would like to obtain by clustering. For each data spot $x_{n}$ there is a vector $\vec{z}_{n}$ whose element $z_{n,k}$ describes which component $k$ is related to it. For this component $z_{n,k}$ holds one (hot), whereas all other elements of the vector hold zeros. Now imagine you already knew $\vec{z}_{n}$. In this case you would have the complete data, the nutrition information and the product categories, and it would be easy to compute $\ln p(X,Z)$. As we don't have it, the least we can do is the following:

* Choose some parameters $\theta^{*}$ randomly and compute expectations of $\ln p(X,Z|\theta)$ under some probability distribution $\tilde{p}(j)$:

$$E\left[ \ln p(X,Z)\right] = \sum_{j} \tilde{p}(j) \ln p(X,Z|\theta) = \sum_{Z} p(Z|X, \theta^{*}) \cdot \ln p(X,Z|\theta) $$

* Given parameters $\theta^{*}$ and our data spots $X$ we are able to compute the responsibilities of each component for generating the data spots. This is described by the probability distribution $p(Z|X, \theta^{*})$. Computing them is part of the **E-Step**.
* If we know these responsibilities we can then maximize our expectation with respect to the parameters $\theta$. This part, called the **M-Step**, is nice because we can compute $\ln p(X,Z|\theta)$ as the $\ln$ acts directly on our bernoullis and gaussians. This way we obtain new parameters $\theta^{*}$ and we can repeat the procedure.

#### E-Step:

Ok, first part for us is to derive the E-Step for our model...

coming soon

....

$$ \gamma_{nk} = \frac{\pi_{k} \cdot B(\chi_{n}|\nu_{k}) \cdot N(x_{n}|\mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K}\pi_{j} \cdot B(\chi_{n}|\nu_{j}) \cdot N(x_{n}|\mu_{j}, \Sigma_{j})}$$
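A minimal sketch of how the responsibilities could be computed: it assumes the log bernoulli and log gaussian terms are available as `(N, K)` matrices and normalizes in log space, so that very small densities do not turn into 0/0. The function name `e_step` and the toy numbers are my assumptions.

```python
import numpy as np

def e_step(log_pi, log_b, log_n):
    """Responsibilities gamma_nk, shape (N, K). Each row sums to one."""
    log_joint = log_pi[None, :] + log_b + log_n                        # numerator in log space
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)   # denominator in log space
    return np.exp(log_joint - log_norm)

# Toy example with one sample and two components
log_pi = np.log(np.array([0.5, 0.5]))
log_b = np.log(np.array([[0.2, 0.4]]))
log_n = np.log(np.array([[1.0, 0.5]]))
gamma = e_step(log_pi, log_b, log_n)
print(gamma)
```

In this toy case both components contribute 0.1 to the numerator, so the sample is shared equally between them.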


#### M-Step:

$$ N_{k} = \sum_{n=1}^{N} \gamma_{nk}$$
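Only $N_{k}$ is derived so far. As a hedged sketch of where this is heading, the code below assumes that the remaining updates take the usual responsibility-weighted forms known from plain Gaussian and Bernoulli mixtures (as in Bishop's book), which should carry over because the bernoulli and gaussian factors are multiplied within each component. The function `m_step` and the small ridge `reg` on the covariances are my own assumptions.

```python
import numpy as np

def m_step(gamma, x, chi, reg=1e-6):
    """Parameter updates from responsibilities gamma (N, K).

    x: (N, d) continuous features, chi: (N, D) binary features.
    """
    n_k = gamma.sum(axis=0)                   # N_k = sum_n gamma_nk
    pi = n_k / gamma.shape[0]                 # component probabilities
    mu = (gamma.T @ x) / n_k[:, None]         # weighted means, (K, d)
    nu = (gamma.T @ chi) / n_k[:, None]       # weighted zeroness rates, (K, D)
    dim = x.shape[1]
    sigma = np.empty((len(n_k), dim, dim))
    for k in range(len(n_k)):
        diff = x - mu[k]
        sigma[k] = (gamma[:, k, None] * diff).T @ diff / n_k[k]
        sigma[k] += reg * np.eye(dim)         # assumption: ridge keeps Sigma invertible
    return n_k, pi, mu, nu, sigma

# Toy example with hard assignments of 4 products to 2 components
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
x = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
chi = np.array([[1, 0], [1, 1], [0, 0], [0, 0]])
n_k, pi, mu, nu, sigma = m_step(gamma, x, chi)
print(n_k, pi)
```

The ridge term matters for exactly the zero-plane clusters discussed earlier: without it, a feature that is constant within a component gives a singular covariance matrix.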