## Import utils

In [None]:
%run utils.py

In [None]:
%pylab inline

In [None]:
import numpy
import pandas

In [None]:
# parameters for histograms
hist_kw = dict(bins=100, density=True, alpha=0.5)

# KS investigation

## KS metric pdf generation

### TODO
Build the KS-metric pdf by repeatedly generating a pair of samples with `numpy.random.random` and computing the KS metric for each pair with the `ks_2samp` function.

In [None]:
from scipy.stats import ks_2samp
n1 = 10000
n2 = 20000

# finish the function
def generate_ks_pdf(n1, n2, points=30000):
    ks = []
    # for each point 
    for step in range(points):
        # generate pdf1 and pdf2 from the same distributions
        ...
        # compute ks for generated pdfs
        ks.append(...)
    return numpy.array(ks)

In [None]:
ks_pdf = generate_ks_pdf(n1, n2)

In [None]:
hist(ks_pdf, **hist_kw)
pass

## Assumption:

The KS metric pdf is the same no matter which distribution is used to generate the samples.
To be more precise, the samples may be generated from any continuous distribution, not only the uniform one.

## Check the assumption!


### TODO:
* Generate two samples from a normal distribution and build the KS metric pdf for this case.
* Are the two KS metric distributions similar? The first, $PDF_{metric,u}$, is obtained by generating two samples from a uniform pdf; the second, $PDF_{metric,n}$, is obtained by generating two samples from a normal pdf.

In [None]:
def generate_ks_pdf_normal(n1, n2, points=30000):
    ks = []
    # for each point 
    for step in range(points):
        # generate pdf1 and pdf2 from the same distributions
        ...
        # compute ks for generated pdfs
        ks.append(...)
    return numpy.array(ks)

In [None]:
ks_pdf_normal = generate_ks_pdf_normal(n1, n2)

In [None]:
hist(ks_pdf_normal, **hist_kw)
pass

In [None]:
hist(ks_pdf, **hist_kw)
hist(ks_pdf_normal, **hist_kw)
pass

## Do you check these two KS metric distributions by eye? 

To check whether $PDF_{metric,u}$ and $PDF_{metric,n}$ are similar, compute the KS metric between them!


## How can one understand what this metric value means?

There are two options:

### 1. KS test:

The Kolmogorov-Smirnov statistic is used to test the hypothesis that two samples come from the same distribution.

Statistic:

$q = \sqrt{\frac{n m}{n + m}} KS_{nm}$

$K_{\alpha} \approx \sqrt{-0.5 \ln{\frac{1 - \alpha}{2}}}$,

where $\alpha$ is the confidence level (so $1 - \alpha$ is the significance level) and $KS_{nm}$ is the Kolmogorov-Smirnov metric for samples of sizes $n$ and $m$.

If $q > K_{\alpha}$, then the hypothesis (that both samples come from the same distribution) is rejected.
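
As a minimal sketch of the rule above, assuming $\alpha$ is passed as a confidence level such as 0.95: the helper names `ks_test_statistic` and `ks_test_threshold` and the numbers used below are hypothetical and serve only to show the mechanics.

```python
import math

def ks_test_threshold(confidence):
    """K_alpha from the formula above; `confidence` plays the role of alpha."""
    return math.sqrt(-0.5 * math.log((1.0 - confidence) / 2.0))

def ks_test_statistic(ks_value, n, m):
    """q = sqrt(n * m / (n + m)) * KS_nm."""
    return math.sqrt(n * m / float(n + m)) * ks_value

# hypothetical sample sizes and KS value, for illustration only
n, m = 10000, 20000
ks_value = 0.02

q = ks_test_statistic(ks_value, n, m)
k_alpha = ks_test_threshold(0.95)
reject = q > k_alpha  # True means the "same distribution" hypothesis is rejected
```

In practice the KS value would come from `ks_2samp(sample1, sample2)[0]` rather than from a hard-coded number.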

### 2. P-value calculated using the KS metric pdf:

You can calculate the p-value from the generated KS pdf to test the hypothesis.
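
A minimal sketch of this idea, assuming the p-value is taken as the fraction of generated metric values that are at least as large as the observed one. The function name `empirical_p_value` and the toy null sample are hypothetical; in the notebook the null sample would be the generated KS pdf.

```python
import numpy

def empirical_p_value(null_values, observed):
    """Fraction of values generated under the null hypothesis that are >= observed."""
    null_values = numpy.asarray(null_values, dtype=float)
    return numpy.mean(null_values >= observed)

# toy null distribution, used only to show how the function behaves
rng = numpy.random.RandomState(42)
toy_null = rng.exponential(size=10000)

p_typical = empirical_p_value(toy_null, 0.1)  # observed value well inside the bulk
p_extreme = empirical_p_value(toy_null, 5.0)  # observed value far in the tail
```

A small p-value means that values as extreme as the observed one are rare under the null hypothesis.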

### TODO:
Check that $PDF_{metric,u}$ and $PDF_{metric,n}$ come from the same distribution (this would mean that our assumption holds).

Use the p-value returned by `ks_2samp` to interpret the KS-metric value between $PDF_{metric,u}$ and $PDF_{metric,n}$.

If the p-value is not small, we cannot reject the hypothesis (so we treat the two samples as coming from the same distribution).

## Can you now answer whether the KS metric pdf depends on the initial continuous distribution from which the two samples are generated? Can you prove this behaviour?

If you remember how the KS metric is calculated using the ROC curve, you will see that the KS metric uses only the relative ordering of the sample1 and sample2 elements in the pooled sample, $ks=\max{|fpr - tpr|}$. Thus, the two samples are reduced to a sequence of zeros and ones, where zero means that an element came from sample1 and one means that it came from sample2. If both samples were generated from the same continuous distribution, then all $\frac{(n1 + n2)!}{n1!n2!}$ such sequences are equally likely, so each of them has probability $\frac{n1!n2!}{(n1 + n2)!}$, which doesn't depend on the initial distribution. Thus, the initial distribution doesn't matter and we can generate the KS pdf using the uniform distribution.
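
The ordering argument can be checked numerically. The sketch below uses a small hand-written `ks_metric` helper (a hypothetical stand-in for `ks_2samp`, written with numpy only) and shows that a strictly increasing transformation of both samples, which preserves their relative ordering, leaves the KS metric unchanged.

```python
import numpy

def ks_metric(sample1, sample2):
    """Maximum distance between the empirical CDFs of two samples."""
    sample1 = numpy.sort(sample1)
    sample2 = numpy.sort(sample2)
    pooled = numpy.concatenate([sample1, sample2])
    # empirical CDFs evaluated at every pooled point
    cdf1 = numpy.searchsorted(sample1, pooled, side='right') / float(len(sample1))
    cdf2 = numpy.searchsorted(sample2, pooled, side='right') / float(len(sample2))
    return numpy.max(numpy.abs(cdf1 - cdf2))

rng = numpy.random.RandomState(0)
a = rng.normal(size=300)
b = rng.normal(size=500)

ks_original = ks_metric(a, b)
# exp is strictly increasing, so the zeros-ones sequence stays the same
ks_transformed = ks_metric(numpy.exp(a), numpy.exp(b))
```

Any continuous distribution can be mapped to the uniform one by its own CDF, which is a monotone transformation of this kind.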

### TODO:

Above you checked the similarity of $PDF_{metric,u}$ and $PDF_{metric,n}$ using the KS statistic.

Now try option 2 to check this: generate the KS metric distribution $PDF_{KS}$ for samples of the same sizes as $PDF_{metric,u}$ and $PDF_{metric,n}$ and compute the p-value of their distance using $PDF_{KS}$.

In [None]:
# compute p-value using the KS pdf 


**Note:** the two p-values obtained by the two methods (KS statistic, KS metric distribution) are very similar.

## Example 

Two similar normal distributions are generated. By eye they look very similar, but the KS test says that the hypothesis (the same distribution) should be rejected.

In [None]:
pdf1_g = numpy.random.normal(loc=10, scale=5, size=n1)
pdf2_g = numpy.random.normal(loc=10.2, scale=5.3, size=n2)
hist(pdf1_g, **hist_kw)
hist(pdf2_g, **hist_kw)
pass

In [None]:
ks_2samp(pdf1_g, pdf2_g)

In [None]:
numpy.mean(ks_pdf > ks_2samp(pdf1_g, pdf2_g)[0])

p-value < 0.005, thus the KS test says that samples were generated from different distributions. 

# CvM investigation

A mass sample and several predictions are generated to explore the CvM metric:

* mass = 1000 * Exp(1) (exponential distribution with unit scale), restricted to the range (0, 4000)
* prediction_sin = sin(mass / 1000.)
* prediction_rand = numpy.random.random
* prediction_cut = ones inside the (1000, 1500) mass region and zeros outside it

In [None]:
mass = 1000 * numpy.random.exponential(1, size=10000)
mass = mass[mass < 4000]
prediction_cut = numpy.zeros(len(mass))
prediction_cut[(mass > 1000) & (mass < 1500)] = 1
prediction_rand = numpy.random.random(len(mass))
prediction_sin = numpy.sin(mass / 1000.)

## Plot:
* mass histogram
* mass histogram for each prediction selection > 0.5

In [None]:
hist(mass, **hist_kw)
pass

In [None]:
hist(mass[prediction_cut > 0.5], **hist_kw)
pass

In [None]:
hist(mass[prediction_rand > 0.5], **hist_kw)
pass

In [None]:
hist(mass[prediction_sin > 0.5], **hist_kw)
pass

### For which predictions does the selection change the initial pdf? Check your guess using the CvM metric.

#### Which predictions must always be uncorrelated with the mass?

CvM metric can be calculated using:

* `from utils import compute_cvm`
* `compute_cvm(predictions, mass)`

There is no test statistic for the CvM metric analogous to the KS statistic, so hypotheses can only be tested by generating the metric's pdf.
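
To give a feeling for what a CvM-style flatness measure does, here is a simplified sketch. It is a hypothetical stand-in and not the `compute_cvm` from `utils.py`: the name `cvm_flatness_sketch`, the equally populated mass bins and the `n_bins` parameter are all assumptions made for illustration. It compares the prediction CDF inside each mass bin with the global prediction CDF and averages the squared differences.

```python
import numpy

def cvm_flatness_sketch(predictions, mass, n_bins=10):
    """Average squared difference between local (per mass bin) and global prediction CDFs."""
    predictions = numpy.asarray(predictions, dtype=float)
    mass = numpy.asarray(mass, dtype=float)
    sorted_global = numpy.sort(predictions)
    n_total = float(len(sorted_global))
    # global CDF evaluated at every prediction value
    global_cdf = numpy.searchsorted(sorted_global, sorted_global, side='right') / n_total
    result = []
    # equally populated mass bins: split the mass-ordered events into chunks
    for indices in numpy.array_split(numpy.argsort(mass), n_bins):
        local_sorted = numpy.sort(predictions[indices])
        local_cdf = numpy.searchsorted(local_sorted, sorted_global, side='right') / float(len(indices))
        result.append(numpy.mean((local_cdf - global_cdf) ** 2))
    return numpy.mean(result)

rng = numpy.random.RandomState(1)
toy_mass = 1000 * rng.exponential(1, size=5000)
toy_cut = ((toy_mass > 1000) & (toy_mass < 1500)).astype(float)
toy_rand = rng.random_sample(len(toy_mass))

cvm_cut = cvm_flatness_sketch(toy_cut, toy_mass)    # prediction fully determined by mass
cvm_rand = cvm_flatness_sketch(toy_rand, toy_mass)  # prediction independent of mass
```

A prediction that is independent of the mass gives a value close to zero, while a prediction that depends on the mass gives a noticeably larger one.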

In [None]:
from utils import compute_cvm
# compute cvm for all predictions

Generate the CvM metric distribution using an appropriate distribution to generate the predictions.

**Random predictions are always uncorrelated with the mass**

In [None]:
def generate_cvm_pdf(mass, points=5000):
    cvm = []
    # for each point 
    for step in range(points):
        # generate uncorrelated predictions for mass
        ...
        # compute cvm for generated prediction and mass
        cvm.append(...)
    return numpy.array(cvm)

In [None]:
cvm_pdf = generate_cvm_pdf(mass)

In [None]:
hist(cvm_pdf, **hist_kw)
pass

Compute p-value to check correlation

# Classifiers training:
## check influence of the correlation restriction on the quality

Download 

* `training.csv`, 
* `check_correlation.csv`, 

to the folder `datasets/` from https://www.kaggle.com/c/flavours-of-physics/data

In [None]:
data = pandas.read_csv('datasets/training.csv')
data_correlation = pandas.read_csv('datasets/check_correlation.csv')

In [None]:
data.head()

In [None]:
data_correlation.head()

In [None]:
train_features = list(set(data_correlation.columns) - {'id'})
train_features_wo_mass = list(set(data_correlation.columns) - {'id', 'mass'})

In [None]:
hist(data[data.signal == 1]['mass'].values, label='signal MC', **hist_kw)
hist(data[data.signal == 0]['mass'].values, label='bck real', **hist_kw)
xlim(1600, 1950)
legend()

In [None]:
hist(data_correlation['mass'].values, label='correlation bck', **hist_kw)
legend()

### TODO:

* Train any linear model and any ensemble model on `train_features`.
* Are the models correlated with the mass?
* Compute the AUC for the models.
* Do they differ substantially?
* Use the feature `(mass - mass.mean())**2` instead of `mass`. How does it influence the models?

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [None]:
# Divide train on train, test
train_index, test_index = train_test_split(range(len(data)))
# copy so that new columns can be added later without pandas warnings
train = data.iloc[train_index, :].copy()
test = data.iloc[test_index, :].copy()

In [None]:
# generate cvm pdf
tau_cvm = generate_cvm_pdf(data_correlation.mass.values)

In [None]:
hist(tau_cvm, **hist_kw)
pass

In [None]:
# define function to test model on cvm and calculate quality
def test_model(model, features, cvm_pdf):
    model_cvm = model.predict_proba(data_correlation[features])[:, 1]
    # argument order follows `compute_cvm(predictions, mass)` as described above
    model_corr = compute_cvm(model_cvm, data_correlation.mass.values)
    print('Correlation', model_corr, 'p-value', numpy.mean(cvm_pdf > model_corr))
    print('AUC', roc_auc_score(test.signal.values, model.predict_proba(test[features])[:, 1]))

In [None]:
# train ensemble
# test on correlation and quality

In [None]:
# train linear model
# test on correlation and quality

#### Add new feature for linear model

In [None]:
train['mean'] = (train.mass.values - train.mass.mean())**2
test['mean'] = (test.mass.values - train.mass.mean())**2
data_correlation['mean'] = (data_correlation.mass.values - train.mass.mean())**2

In [None]:
hist(train[train.signal == 1]['mean'].values, label='signal MC', **hist_kw)
hist(train[train.signal == 0]['mean'].values, label='bck real', **hist_kw)
# xlim(1600, 1950)
legend()

In [None]:
# train ensemble using new feature
# test on correlation and quality

In [None]:
# train linear using new feature
# test on correlation and quality

Result: ...

### How does the shape of the mass distribution in the correlation dataset change depending on the models' thresholds?

In [None]:
def compare_shape(model, features):
    probs = model.predict_proba(data_correlation[features])[:, 1]
    hist_kw['bins'] = 30
    for thr in [0.001, 0.05, 0.1, 0.5]:
        hist(data_correlation[probs > thr]['mass'].values, label='thr=%1.3f' % thr, **hist_kw)
    legend()

In [None]:
compare_shape(..., train_features)

Result: ...

## How do feature bagging and event bagging affect the CvM and quality?

In [None]:
# train the same ensemble using another max_features
for max_features in [20, 15, 10, 5]:
    ...

In [None]:
# train the same ensemble using another subsample
for subsample in [0.2, 0.5, 0.9]:
    ...

Both the CvM and the quality decrease as max_features/subsample decreases, but this is not sufficient to remove the correlation. It is useful for preventing correlation when a feature that is not strongly correlated with the mass is used in training.

## Produce a new feature from the mass, use it in training and make the CvM much lower.

**Note:** The competition sets a CvM threshold of 0.002. Your model should pass it.

### Answer the question: 
* What quality can you get if `mass` is not used in the training? Is it different? Then why? 
* What model is better, linear or ensemble? Compare their cvm.

In [None]:
# train ensemble without mass
# test on correlation and quality

In [None]:
# train linear model without mass
# test on correlation and quality

## Example

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=3)
rf.fit(train[train_features_wo_mass], train['signal'].values)
test_model(rf, train_features_wo_mass, tau_cvm)

## Explain why this model is correlated with the mass although the mass feature is absent from the training sample.

## Overfitting possibilities:
* overfitting-difference (difference between prediction's pdf for training and test datasets) - check by comparing distributions
* overfitting-complexity (overfitting because too many estimators used in training)

### Overfitting-difference

In [None]:
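probs_train = rf.predict_proba(train[train_features_wo_mass])[:, 1]
# predictions on the test part, used in the comparison plots below
probs = rf.predict_proba(test[train_features_wo_mass])[:, 1]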

In [None]:
hist_kw['bins'] = 60
hist_kw['alpha'] = 0.5
hist_kw['range'] = (0, 1)
hist(probs_train[train.signal.values == 1], color='r', label='train signal', **hist_kw)
hist(probs[test.signal.values == 1], color='g', label='test signal', **hist_kw)
legend(loc='upper center')

In [None]:
hist(probs_train[train.signal.values == 0], color='b', label='train bck', **hist_kw)
hist(probs[test.signal.values == 0], color='y', label='test bck', **hist_kw)
legend(loc='upper center')

Thus tree structure influences the correlation!

## Feature bagging invention

Write your own feature bagging algorithm: features will be chosen with some probability.

Compare using knn-method:

* standard feature bagging and this one
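
The sampling step itself can be sketched as follows, assuming each feature already has a non-negative score. The function name `sample_feature_indices` and the toy AUC values are hypothetical, and the choice of `|AUC - 0.5|` as the score is an assumption, not necessarily the intended solution.

```python
import numpy

def sample_feature_indices(scores, max_features, random_state=None):
    """Draw distinct feature indices with probability proportional to their scores."""
    rng = numpy.random.RandomState(random_state)
    scores = numpy.asarray(scores, dtype=float)
    probabilities = scores / scores.sum()
    # replace=False guarantees that no feature is picked twice
    return rng.choice(len(scores), size=max_features, replace=False, p=probabilities)

# toy per-feature AUC values; the distance from 0.5 is used as the score
toy_aucs = numpy.array([0.52, 0.71, 0.51, 0.64, 0.45])
toy_scores = numpy.abs(toy_aucs - 0.5)

chosen = sample_feature_indices(toy_scores, max_features=3, random_state=0)
```

Features with a larger score are picked more often, but weak features still have a chance of being selected, which keeps the bagged models diverse.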

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSampler(BaseEstimator, TransformerMixin):
    def fit(self, X, y, sample_weight=None):
        # finish method: sample features with probability proportional to their information
        # compute information using roc auc score
        # use numpy.random.choice to generate features indices
        aucs = []
        self.n_features = X.shape[1]
        self.max_features = numpy.random.randint(2, high=8)
        
        for feature in range(self.n_features):
            auc = ...
            aucs.append(auc)
        self.features_information = ...
        self.used_features = ...
        return self
    
    def transform(self, X):
        # finish method returning necessary features         
        return ...

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
simple_knn = KNeighborsClassifier(n_neighbors=50)
pp_sampler = Pipeline([('sampler', FeatureSampler()), ('knn', simple_knn)])

### train simple knn

### train knn with standard bagging

### own feature bagging 

#### Does bagging improve knn algorithm? Does new feature sampler help?

### Can some scaler help to improve knn quality?
As an example, there is `sklearn.preprocessing.StandardScaler`
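
A small numpy-only sketch of why scaling may matter for a distance-based method such as knn. The toy data and the helper `share_of_first_feature` are hypothetical; the standardization below (subtract the mean, divide by the standard deviation) is what `StandardScaler` does with its default settings.

```python
import numpy

rng = numpy.random.RandomState(0)
# two toy features on very different scales
X = numpy.column_stack([rng.normal(0, 1, size=1000),
                        rng.normal(0, 1000, size=1000)])
# per-feature standardization
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

def share_of_first_feature(data):
    """Share of the squared distances to the first event contributed by feature 0."""
    squared_diffs = (data[1:] - data[0]) ** 2
    return squared_diffs[:, 0].sum() / squared_diffs.sum()

share_raw = share_of_first_feature(X)            # feature 0 is almost ignored
share_scaled = share_of_first_feature(X_scaled)  # both features contribute
```

Without scaling, the neighbours are determined almost entirely by the feature with the largest spread, regardless of how informative it is.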

## Summarize:

* Now you understand the trade-off between correlation and quality
* You can invent new algorithms that are trained on the mass column but are not correlated with the mass