### Exploratory analysis of the 2021 tabular playground target data

Dumping the dataset into several ML algorithms, picking the best one, and then running hyperparameter tuning and automatic feature engineering does not teach you anything about the problem.

I like to go manual and apply a lot of common sense and experimentation. To support this, I always visualize my hypotheses and findings. I stare a lot at pairplots: sometimes distorted, coloured by different aspects of the data, zoomed in and out. I do the same with my estimations and errors, down to the individual record.

In [1]:
# imports
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture as GMM

# suppress warnings
import warnings
warnings.filterwarnings("ignore")

# reading only the train data; no estimations are made yet
train = pd.read_csv(
    '/kaggle/input/tabular-playground-series-jan-2021/train.csv', 
    index_col=0
)

### Let's look at the shape of the target distribution


In [2]:
fig, ax = plt.subplots(2,1, figsize=(20, 5), sharex=True)
sns.kdeplot(train.target, ax=ax[0])
# pass the target as x explicitly so the box plot is horizontal and shares the x-axis with the KDE
sns.boxplot(x=train.target, ax=ax[1], fliersize=10, flierprops={'alpha': .2})
fig.show()

### Trying to find the distributions that make up this curve
This looks like a few normal distributions stacked on top of each other, plus some outliers around 0.
There is a model built for exactly this, called the [Gaussian Mixture Model](https://scikit-learn.org/stable/modules/mixture.html) (GMM): it describes the data as a weighted sum of normal distributions. Now I try to model the target distribution (blue) with a GMM (red).
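
To make the idea of a mixture concrete, here is a minimal sketch on synthetic data (not the competition data). It assumes scikit-learn's default `covariance_type='full'`, and the two-component sample `y` is made up purely for illustration. It checks that the density reported by `score_samples` is the weighted sum of the fitted normal components.

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a multimodal target (illustrative values only)
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(7.0, 0.3, 2000), rng.normal(8.5, 0.5, 3000)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(y.reshape(-1, 1))

# Build the mixture density by hand: sum of weight * normal pdf per component
grid = np.linspace(5, 10, 200)
manual_density = np.zeros_like(grid)
for weight, mean, var in zip(
    gmm.weights_, gmm.means_[:, 0], gmm.covariances_[:, 0, 0]
):
    manual_density += weight * stats.norm(mean, np.sqrt(var)).pdf(grid)

# score_samples returns the log-density of the fitted mixture
model_density = np.exp(gmm.score_samples(grid.reshape(-1, 1)))

print(np.allclose(manual_density, model_density))
```

The same decomposition is what the plot further down draws: one gray area per weighted component, and the red curve as their sum.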

In [3]:
# fitting the GMM model to the target data
clf = (
    GMM(
        n_components=5, 
        max_iter=200, 
        random_state = 0
    )
    .fit(
        np.array(train.target)
        .reshape(-1, 1)
    )
)
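
The choice of 5 components above is a judgment call. One common way to sanity-check such a choice is to compare the Bayesian information criterion (BIC) across several component counts. The sketch below is an assumption on my part rather than something done in this notebook: it uses synthetic data, and the names `bic_by_k` and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-mode sample standing in for the target (illustrative only)
rng = np.random.default_rng(0)
y = np.concatenate(
    [rng.normal(7.0, 0.3, 2000), rng.normal(8.5, 0.4, 3000)]
).reshape(-1, 1)

# Fit one GMM per candidate component count and record its BIC
bic_by_k = {}
for k in range(1, 7):
    model = GaussianMixture(n_components=k, max_iter=200, random_state=0).fit(y)
    bic_by_k[k] = model.bic(y)

# Lower BIC is better; it trades goodness of fit against model size
best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k)
print("lowest BIC at n_components =", best_k)
```

On the real target, the same loop could be run with `np.array(train.target).reshape(-1, 1)` in place of `y`.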

### Plotting the actual data, the estimated normal distributions, and their sum

In [4]:
# configuring plot
plt.figure(figsize=(20,5))
plt.title('The estimated gaussians behind the multimodal target data')

# plotting the original kde (in blue)
sns.kdeplot(train.target)

# plotting the estimation (in red)
xpdf = np.linspace(5,10,100000).reshape((-1,1))
density = np.exp(clf.score_samples(xpdf))
plt.plot(xpdf, density, '-r', alpha=.5)

# plotting the estimated underlying normal distributions
for i in range(clf.n_components):
    pdf = (
        clf.weights_[i]
        * stats.norm(
            clf.means_[i, 0],
            np.sqrt(clf.covariances_[i, 0, 0])  # scalar variance of component i
        ).pdf(xpdf)
    )
    plt.fill(xpdf, pdf, facecolor='gray',
             edgecolor='none', alpha=0.3)

# trimming the outlier at 0
plt.xlim(5, 10)

# putting the 5 distributions GMM has found into a DF
gaussians = pd.concat(
    [
        pd.DataFrame(clf.means_, columns=['means']),
        pd.DataFrame(clf.covariances_.squeeze(), columns=['covariances']),
        pd.DataFrame(clf.weights_, columns=['weights'])
    ],
    axis=1)

plt.table(
    cellText=gaussians.values,
    rowLabels=gaussians.index,
    colLabels=gaussians.columns,
    cellLoc = 'right', rowLoc = 'center', loc='bottom',
    bbox=[.015,.45,.35,.5])

plt.show()


Honestly, I'm a bit disappointed. I was hoping for a near-perfect fit from the GMM, but for the sake of the experiment I will push forward with this. Next, I will try to find or build features that are strong predictors of one of the 5 Gaussians, and see where that path takes us.
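
As a possible first step in that direction, each record can be labelled with the component it most likely belongs to, so that the label can later be compared against the features. This is a minimal sketch on synthetic data, not the approach the notebook ends up using: the frame `demo` and the columns `component` and `confidence` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for train.target (illustrative values only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "target": np.concatenate(
        [rng.normal(7.0, 0.3, 2000), rng.normal(8.5, 0.4, 3000)]
    )
})

X = demo[["target"]].to_numpy()
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Hard assignment: index of the most likely component for each record
demo["component"] = gmm.predict(X)

# Soft assignment: posterior probability of each component per record
responsibilities = gmm.predict_proba(X)
demo["confidence"] = responsibilities.max(axis=1)

print(demo.groupby("component")["target"].agg(["count", "mean"]))
```

With the fitted `clf` from this notebook, the equivalent calls would be `clf.predict(...)` and `clf.predict_proba(...)` on the reshaped target. Records with low `confidence` sit where components overlap, so any feature will struggle to separate them.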