# Kaggle May 2021 monthly challenge
This notebook was built to explore the Kaggle May 2021 monthly challenge.

## Imports

In [1]:
import pandas as pd
import numpy as np

In [2]:
from matplotlib import pyplot as plt
import seaborn as sns

In [3]:
%matplotlib inline

## The rundown
There are 50 features, and we don't know exactly what they represent. Each row belongs to one of four classes, and the task is to predict the probability that a row belongs to each class.

There are 100,000 rows in the training dataset and 50,000 in the test set.

## Load the data
We'll load the training data and split off the target so we can examine the class/feature relationships. We'll also load the test data and glue the two feature sets together to examine them as one.

In [4]:
train_all = pd.read_csv('~/Data/Kaggle/May2021/train.csv') 
test_X = pd.read_csv('~/Data/Kaggle/May2021/test.csv')

In [6]:
train_y = train_all['target']

In [8]:
train_X = train_all.drop('target', axis=1)

In [11]:
data_X = pd.concat([train_X, test_X])
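
A minimal sketch of one way to glue the two sets together while keeping track of where each row came from. The tiny frames below are made-up stand-ins for `train_X` and `test_X`, and the `source` column is a hypothetical helper that is not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up stand-ins for train_X and test_X (illustration only)
toy_train = pd.DataFrame({"id": [0, 1, 2], "feature_0": [0, 1, 0]})
toy_test = pd.DataFrame({"id": [3, 4], "feature_0": [2, 0]})

# Tag each row with its origin, then stack the frames.
# ignore_index=True rebuilds the row labels so they are not duplicated.
toy_all = pd.concat(
    [toy_train.assign(source="train"), toy_test.assign(source="test")],
    ignore_index=True,
)

print(toy_all["source"].value_counts())
```

Without `ignore_index=True`, the combined frame keeps each input's original row labels, so label-based lookups with `.loc` can return more than one row.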

## Investigate the data

In [15]:
# Summary statistics; a count equal to the row count means that column has no missing values
test_X.describe()

statistic,id,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,...,feature_40,feature_41,feature_42,feature_43,feature_44,feature_45,feature_46,feature_47,feature_48,feature_49
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,...,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,124999.5,0.25462,0.44348,0.11672,0.58444,0.61264,0.1615,0.746,1.23804,0.88644,...,0.71916,0.59714,0.53242,0.61422,0.13378,0.36058,0.52708,0.38822,0.9876,0.56562
std,14433.901067,0.910607,2.004536,0.524807,1.814083,2.81958,0.60576,2.352495,2.728055,3.364154,...,1.749077,2.058195,2.347675,2.335616,0.623451,1.500722,2.191986,1.450816,2.63717,1.700115
min,100000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,112499.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,124999.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,137499.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
max,149999.0,10.0,31.0,6.0,25.0,38.0,9.0,25.0,29.0,35.0,...,23.0,31.0,36.0,30.0,9.0,29.0,29.0,26.0,46.0,21.0


In [14]:
train_X.describe()

statistic,id,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,...,feature_40,feature_41,feature_42,feature_43,feature_44,feature_45,feature_46,feature_47,feature_48,feature_49
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,...,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,49999.5,0.25783,0.43172,0.11413,0.59055,0.59977,0.16082,0.73149,1.22892,0.90335,...,0.71227,0.58207,0.52923,0.61631,0.1351,0.35866,0.51681,0.39004,0.97085,0.55712
std,28867.657797,0.929033,1.977862,0.519584,1.844558,2.785531,0.601149,2.343465,2.692732,3.415258,...,1.721863,2.003114,2.300826,2.360955,0.627592,1.464187,2.171415,1.48735,2.576615,1.68093
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,24999.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,49999.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,74999.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
max,99999.0,10.0,31.0,6.0,26.0,38.0,10.0,27.0,31.0,39.0,...,21.0,32.0,37.0,33.0,9.0,26.0,29.0,25.0,44.0,20.0


From the above, the features all look heavily skewed toward small values, which matches the data overview given in the challenge docs. Most of the entries in each feature are 0: the median is 0 for every column shown, and the 75th percentile is 0 for most of them. Feature 42 has a minimum value of -2 in the training set and 0 in the test set, so we'll see how weird that turns out to be. Otherwise, the summaries of the training and test sets don't reveal any material difference in distribution.
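
The eyeball comparison above can also be done programmatically. The following is a sketch only: `compare_sets` is a hypothetical helper, and the data is synthetic (Poisson counts with one planted negative value), not the challenge data.

```python
import numpy as np
import pandas as pd

def compare_sets(train, test):
    """Per-column share of zeros and min/max for two frames with the same columns."""
    return pd.DataFrame({
        "train_zero_frac": (train == 0).mean(),
        "test_zero_frac": (test == 0).mean(),
        "train_min": train.min(),
        "test_min": test.min(),
        "train_max": train.max(),
        "test_max": test.max(),
    })

# Synthetic stand-in data: mostly-zero counts, as in the summaries above
rng = np.random.default_rng(0)
cols = ["feature_0", "feature_1", "feature_2"]
toy_train = pd.DataFrame(rng.poisson(0.3, size=(1000, 3)), columns=cols)
toy_test = pd.DataFrame(rng.poisson(0.3, size=(500, 3)), columns=cols)
toy_train.loc[0, "feature_2"] = -2  # plant a negative value, like the one in feature_42

summary = compare_sets(toy_train, toy_test)

# Columns whose minimum differs between the two sets
min_mismatch = summary[summary["train_min"] != summary["test_min"]]
print(min_mismatch)
```

On the real data, the same helper could be called as `compare_sets(train_X, test_X)`, which would also cover the columns hidden behind the `...` in the `describe()` output.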

In [18]:
# Any missing values?
missing_vals = data_X.isnull().sum()
missing_vals.sum()
# Nope

0

In [None]:
### Check cross-correlation
# No, made my computer puke, this was a bad idea
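
What exactly made the attempt too heavy isn't recorded here. If the cost came from working on every row at once, one lighter option is to compute the correlation matrix on a random sample of rows. This is a sketch under that assumption: `sampled_corr` and its `n_rows` and `seed` parameters are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def sampled_corr(df, n_rows=10_000, seed=0):
    """Pearson correlation matrix computed on a random subset of rows."""
    sample = df.sample(n=min(n_rows, len(df)), random_state=seed)
    return sample.corr()

# Synthetic stand-in for the feature table (independent Poisson counts)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.poisson(0.5, size=(50_000, 5)),
    columns=[f"feature_{i}" for i in range(5)],
)

corr = sampled_corr(toy)

# Mask the diagonal, then report the largest absolute off-diagonal correlation
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
print(off_diag.abs().max().max())
```

The resulting matrix is small (one row and column per feature), so it can be passed to `sns.heatmap` without plotting every row of the data.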