# Leaderboard Distribution (Santander Value Prediction)

This notebook builds an interactive histogram of the score distribution on the public leaderboard of the [Santander Value Prediction Challenge](https://www.kaggle.com/c/santander-value-prediction-challenge). The interactive control is time, expressed as the number of days remaining until the end of the competition.

The data comes from the [competition leaderboard page on Kaggle](https://www.kaggle.com/c/santander-value-prediction-challenge/leaderboard).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

import ipywidgets as widgets
from ipywidgets import interact

# read the CSV file into a DataFrame
df = pd.read_csv('santanderlb.csv')
print('Shape: ', df.shape)
df.head()

Shape:  (21483, 4)


Unnamed: 0,TeamId,TeamName,SubmissionDate,Score
0,1791654,Yuya Takagi,2018-06-18 22:45:52,2.08
1,1791631,Yuriy Nazarov,2018-06-18 22:53:41,1.92
2,1791674,JohnM,2018-06-18 23:32:47,2.09
3,1791674,JohnM,2018-06-18 23:40:36,1.82
4,1791703,Rhostam,2018-06-18 23:52:14,1.86


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21483 entries, 0 to 21482
Data columns (total 4 columns):
TeamId            21483 non-null int64
TeamName          21480 non-null object
SubmissionDate    21483 non-null object
Score             21483 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 671.4+ KB


In [3]:
df['Score'].describe()

count    21483.000000
mean         1.518039
std          2.122953
min          0.470000
25%          0.750000
50%          1.410000
75%          1.510000
max        196.970000
Name: Score, dtype: float64

In [4]:
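# drop outliers: keep only scores below 2.5 (the maximum is 196.97)
# .copy() makes the result independent, so new columns can be added safely
df = df[df['Score'] < 2.5].copy()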
print('Shape: ', df.shape)

Shape:  (20753, 4)
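
The cutoff of 2.5 used above is a manual choice. As a rough sketch of how one might count the rows such a cutoff removes, the following uses a small made-up frame; the name `toy` and its values are illustrative assumptions, not leaderboard data.

```python
import pandas as pd

# Made-up scores for illustration only; not taken from the leaderboard.
toy = pd.DataFrame({'Score': [0.5, 1.4, 1.5, 2.1, 196.97]})

cutoff = 2.5  # same threshold as in the cell above
mask = toy['Score'] < cutoff

# .copy() returns an independent frame rather than a possible view.
kept = toy[mask].copy()
n_removed = int((~mask).sum())

print('Kept:', len(kept), 'Removed:', n_removed)
```

Counting the rows that fail the mask before filtering makes it easy to see how sensitive the result is to the chosen cutoff.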


In [5]:
# convert to datetime
df['SubmissionDate'] = pd.to_datetime(df['SubmissionDate'])

# define the deadline date
deadline = pd.to_datetime('2018-08-20')

# create new column by subtracting the SubmissionDate from the deadline
df['DaysUntilDeadline'] = deadline - df['SubmissionDate']

# keep only the whole-day component (remaining hours are dropped)
df['DaysUntilDeadline'] = df['DaysUntilDeadline'].dt.days
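
To illustrate what `.dt.days` returns, here is a minimal sketch using two timestamps that appear in the data shown in this notebook. The variable names `submitted`, `remaining` and `whole_days` are chosen for this example only.

```python
import pandas as pd

# '2018-08-20' parses as midnight at the start of that day.
deadline = pd.to_datetime('2018-08-20')

# Two submission timestamps that appear in the notebook output.
submitted = pd.Series(pd.to_datetime(['2018-08-17 12:18:44',
                                      '2018-06-18 22:45:52']))

remaining = deadline - submitted   # Timedelta values
whole_days = remaining.dt.days     # day component only; hours are dropped

print(whole_days.tolist())  # [2, 62]
```

Because only the day component is kept, a submission made about two and a half days before the deadline is counted as 2 days.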

In [7]:
# check the df
df.tail()

Unnamed: 0,TeamId,TeamName,SubmissionDate,Score,DaysUntilDeadline
21478,1803914,Can you spare some change?,2018-08-17 12:18:44,1.86,2
21479,1932112,No train no gain,2018-08-17 12:22:17,0.5,2
21480,1837174,EpicDad,2018-08-17 12:28:05,0.56,2
21481,1797730,TrailBlazers,2018-08-17 12:32:31,0.57,2
21482,1892950,celestial712,2018-08-17 12:33:25,0.56,2


In [8]:
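def f(x):
    # plot scores of submissions made more than x days before the deadline
    mask = df['DaysUntilDeadline'] > x
    df.loc[mask, 'Score'].hist(bins='scott', figsize=(15, 10))
    plt.show()

# the slider sets x, the number of days before the deadline
interact(f, x=widgets.IntSlider(min=0, max=63, step=1, value=10))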

interactive(children=(IntSlider(value=10, description='x', max=63), Output()), _dom_classes=('widget-interact'…

<function __main__.f(x)>
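
The filtering step inside `f` can be checked without a widget or a plot. The sketch below is one possible way to do that, using a hypothetical helper `scores_before` and a hand-typed four-row frame `toy` whose values mirror rows shown earlier; neither name is part of the original notebook.

```python
import pandas as pd

def scores_before(frame, days):
    """Return scores of submissions made more than `days` days before the deadline."""
    return frame.loc[frame['DaysUntilDeadline'] > days, 'Score']

# Hand-typed sample for illustration; not the full leaderboard.
toy = pd.DataFrame({
    'Score': [2.08, 1.92, 0.57, 0.56],
    'DaysUntilDeadline': [62, 62, 2, 2],
})

print(len(scores_before(toy, 0)))   # all four rows pass the filter
print(len(scores_before(toy, 10)))  # only the two early submissions remain
```

Moving the slider to the right therefore shrinks the histogram to earlier submissions, and once `x` reaches the largest value in `DaysUntilDeadline` nothing is left to plot.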