In this notebook I investigate the log loss function and try to understand, graphically, how the score relates to the ratio of positives when using constant predictions.

Yeah, I know this has already been done by [David Thaler](https://www.kaggle.com/davidthaler) in [How many 1's are in the Public LB?](https://www.kaggle.com/davidthaler)

Submitting the calculated ratio, **0.17426778573248283**, as a constant prediction scores **0.46258** on the public LB.

#### Summary: positive rates on train and test

|     data      |        train        |         test        |
|:-------------:|:-------------------:|:-------------------:|
| positive rate | 0.36919785302629282 | 0.17426778573248283 |

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 
from math import log

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from sklearn.metrics import log_loss

%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
df_test = pd.read_csv('../input/test.csv',
                     usecols=['test_id'])
df_test['is_duplicate'] = 0.5
df_test.to_csv('all-half.csv', index=False)
df_test.head()

This all-0.5 submission scores **0.69315** on the public LB.

According to the [sklearn docs](http://scikit-learn.org/stable/modules/model_evaluation.html#log-loss):

> For binary classification with a true label  $y \in \{0,1\}$
and a probability estimate $p = \operatorname{Pr}(y = 1)$,
the log loss per sample is the negative log-likelihood
of the classifier given the true label:

$$L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y \log (p) + (1 - y) \log (1 - p))$$

Here p = 0.5:

$$L_{\log}(y, 0.5) = -\log \operatorname{Pr}(y|0.5) = -(y \log (0.5) + (1 - y) \log (1 - 0.5)) = -(y \log (0.5) + (1 - y) \log (0.5)) $$

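Finally, since $y + (1 - y) = 1$, we get $L_{\log}(y, 0.5) = -\log (0.5) = \log (2)$.

This value does not depend on the labels, so we don't get the number of positives. But now we are sure that the log-loss function used by Kaggle **uses a logarithm in base e** ($\ln$): $\ln 2 \approx 0.69315$, which matches the LB score.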
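
The fact that a constant 0.5 scores the same whatever the labels are can be checked with a minimal sketch in plain Python. The helper `mean_log_loss` and the two tiny label lists are made up for illustration; they are not part of the competition data.

```python
from math import log

def mean_log_loss(labels, p):
    """Mean natural-log loss of a constant prediction p over 0/1 labels."""
    return -sum(y * log(p) + (1 - y) * log(1 - p) for y in labels) / len(labels)

# Hypothetical label sets with very different positive rates.
mostly_negative = [0, 0, 0, 1]
mostly_positive = [1, 1, 1, 0]

loss_neg = mean_log_loss(mostly_negative, 0.5)
loss_pos = mean_log_loss(mostly_positive, 0.5)

# Both losses should agree with ln(2), up to floating-point rounding.
print(loss_neg, loss_pos, log(2))
```

Any other constant than 0.5 would give different losses for the two lists, which is what makes the ratio recoverable later on.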

In [None]:
log(2)

## Log loss graph

How does the log loss of a single question pair vary with the predicted probability, depending on whether the pair is a duplicate or not?

In [None]:
eps = 1E-6
xs = np.linspace(0 + eps, 1 - eps)
y1s = -np.log(xs)
y0s = -np.log(1 - xs)


plt.figure(figsize=(12, 5))
# Raw strings keep the LaTeX backslashes intact
plt.plot(xs, y1s, label=r"is duplicate: $-\log(p)$")
plt.plot(xs, y0s, label=r"isn't duplicate: $-\log(1-p)$")
plt.legend(loc='upper center')
axes = plt.gca()
axes.set_ylim([0,2])
plt.title('log loss in case of a duplicate or non-duplicate question')
plt.xlabel('Predicted probability')
plt.ylabel('log loss')
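
A few numeric values read off these curves may help. This is a small sketch using only the standard library; the helper name `sample_log_loss` and the probabilities 0.1, 0.5 and 0.9 are arbitrary choices for illustration.

```python
from math import log

def sample_log_loss(y, p):
    """Log loss of one sample with true label y (0 or 1) and predicted probability p."""
    return -(y * log(p) + (1 - y) * log(1 - p))

for p in (0.1, 0.5, 0.9):
    print("p={:.1f}  duplicate: {:.3f}  not duplicate: {:.3f}".format(
        p, sample_log_loss(1, p), sample_log_loss(0, p)))
```

A confident wrong prediction (p = 0.9 for a non-duplicate) costs about 2.303, while a confident right one (p = 0.9 for a duplicate) costs only about 0.105.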

## Try to predict the duplicate rate on the train set

With a constant value in y_pred, the log loss depends only on the ratio of duplicates in the data, so the ratio is easy to recover from the score:

$$\mathrm{ratio} = \frac{ \mathrm{log\_loss} + \log(1-y_{pred})}{\log(1-y_{pred}) - \log(y_{pred})}$$
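
For completeness, here is a short derivation of this formula. The symbols are shorthand introduced here: $r$ is the ratio of duplicates, $p$ is the constant $y_{pred}$, and $\ell$ is the mean log loss. A fraction $r$ of the samples contributes $-\log(p)$ each and the remaining fraction $1 - r$ contributes $-\log(1 - p)$ each, so:

$$\ell = -\bigl(r \log(p) + (1 - r) \log(1 - p)\bigr) = r \bigl(\log(1 - p) - \log(p)\bigr) - \log(1 - p)$$

Solving for $r$:

$$r = \frac{\ell + \log(1 - p)}{\log(1 - p) - \log(p)}$$

The denominator is zero at $p = 0.5$, which is consistent with the all-0.5 submission carrying no information about the ratio.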

In [None]:
df_train = pd.read_csv('../input/train.csv',
                       usecols=['is_duplicate'])
# Fraction of duplicates in the training set
df_train['is_duplicate'].mean()

In [None]:
def create_array(val, df=df_train):
    """Return a constant float array with value val, of the same length as df."""
    return np.full(len(df), val, dtype=float)
    
ll = log_loss(df_train["is_duplicate"], create_array(0.2))
ll

In [None]:
# Invert the log-loss formula for the constant prediction 0.2 (so 1 - 0.2 = 0.8)
ratio = (ll + log(0.8)) / (log(0.8) - log(0.2))
ratio
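
The same inversion can be wrapped in a small helper and checked on synthetic labels. This is only a sketch: the function name `positive_rate_from_loss` and the synthetic 30% positive rate are assumptions made for the example, and the loss is computed directly from the definition rather than with `sklearn`.

```python
from math import log

def positive_rate_from_loss(loss, p):
    """Recover the positive rate from the mean log loss of a constant prediction p.

    p must differ from 0.5, otherwise the denominator is zero.
    """
    return (loss + log(1 - p)) / (log(1 - p) - log(p))

# Synthetic labels (hypothetical data): 300 positives out of 1000.
y_true = [1] * 300 + [0] * 700
p = 0.2

# Mean log loss of the constant prediction, straight from the definition.
loss = -sum(y * log(p) + (1 - y) * log(1 - p) for y in y_true) / len(y_true)

recovered = positive_rate_from_loss(loss, p)
print(recovered)  # should be very close to 0.3
```

The recovered value matches the true rate only up to floating-point rounding, and up to the rounding of the score when the loss comes from a leaderboard.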

### Verify that the best constant prediction is the duplicate ratio

Let's do it graphically.

In [None]:
xs = np.linspace(0, 1, 100)
lls = [log_loss(df_train["is_duplicate"], create_array(x)) for x in xs]
# Find the grid point with the smallest log loss
min_index = int(np.argmin(lls))
x_min = xs[min_index]
y_min = lls[min_index]

plt.figure(figsize=(12, 5))
plt.plot(xs, lls, '-gD', markevery=[min_index])

plt.annotate('minimum ({:.3f}, {:.3f})'.format(x_min, y_min), xy=(x_min, y_min), xytext=(x_min, y_min - 0.1))
plt.title('log loss vs predicted constant probability')
plt.xlabel('Predicted probability')
plt.ylabel('log loss')
axes = plt.gca()
axes.set_ylim([0,2])
plt.show()

In [None]:
log_loss(df_train["is_duplicate"], create_array(ratio))
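
The location of the minimum can also be found analytically. As a sketch, with $r$ the ratio of duplicates and $p$ the constant prediction (shorthand introduced here), the mean log loss is $\ell(p) = -\bigl(r \log(p) + (1 - r) \log(1 - p)\bigr)$ and its derivative is:

$$\frac{d\ell}{dp} = -\frac{r}{p} + \frac{1 - r}{1 - p}$$

Setting it to zero gives $r (1 - p) = (1 - r)\, p$, that is $p = r$. The second derivative

$$\frac{d^2\ell}{dp^2} = \frac{r}{p^2} + \frac{1 - r}{(1 - p)^2} > 0$$

is positive, so this is a minimum. The grid search above can only land on the nearest grid point, which is why its minimum may differ slightly from the exact ratio.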

## Easy, let's do the same on the test data

In [None]:
df_test = pd.read_csv('../input/test.csv',
                      usecols=['test_id']
                      )
df_test['is_duplicate'] = 0.2
df_test.to_csv('submission-0.2.csv', index=False)

**Score on LB: 0.46473**

We can now calculate the ratio from this score and submit it as a constant prediction.

In [None]:
ll = 0.46473  # public LB score of the constant 0.2 submission
ratio = (ll + log(0.8)) / (log(0.8) - log(0.2))
ratio

In [None]:
df_test['is_duplicate'] = ratio
df_test.to_csv('submission-ratio.csv', index=False)

**Score on LB: 0.46258**
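
As a consistency check: when the constant prediction equals the positive rate, the log loss reduces to the binary entropy of that rate. The sketch below evaluates it for the ratio computed from the 0.2 submission (the helper name `binary_entropy` is an assumption of this example). The result should come out close to the LB score of 0.46258; a small gap is expected because the ratio was derived from a score rounded to five decimals.

```python
from math import log

def binary_entropy(r):
    """Log loss (natural log) of the constant prediction r when the positive rate is r."""
    return -(r * log(r) + (1 - r) * log(1 - r))

ratio = 0.17426778573248283  # ratio recovered from the 0.2 submission
expected_score = binary_entropy(ratio)
print(round(expected_score, 5))
```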