In this notebook we look at exactly what the dataset consists of and make an initial submission with a very simple neural network.

If you fork this or find it useful, please upvote.

In [None]:
"""Initial exploratory analysis"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Constants
PATH = '../input/'
TRAIN = 'train.csv'
TEST = 'test.csv'

# Load Data
train_df = pd.read_csv(PATH + TRAIN)

print('Train Set')
train_df.head()

Our dataset has 378 columns: the example ID, the target variable ***y*** and 376 features. Some of the features appear to be categorical, with letters as categories, and others appear to be binary.

The feature names are not very insightful either: they run from **X0**, **X1**, ... to **X385**, skipping some numbers (e.g. **X7**, **X9**, ...).

In [None]:
# Simple Metrics of Dataset
print('Number of examples: {}'.format(train_df.shape[0]))
print('Number of Features: {}'.format(train_df.shape[1] - 2))

# Distribution of target variable
print('\nMean of target variable: {}'.format(train_df['y'].mean()))
print('Unbiased Variance of target variable: {}'.format(train_df['y'].var()))
plt.figure(figsize=(12,8))
# sns.distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(train_df['y'].values, bins=50, kde=False)
plt.xlabel('y variable', fontsize=12)
plt.ylabel('Frequency')
plt.show()

We have only 4209 examples!
This is quite a small dataset, which is no surprise since the train and test sets combined are only 343.21 KB.

The distribution of the target variable appears to be centered around 100, with considerable spread.
Let's now have a look at our features.

In [None]:
# Feature types and distributions
print('Feature Types and #')
print(train_df.dtypes.value_counts())

# Categorical Features
categoricals = train_df.columns[train_df.dtypes == object]
print('\nCategorical Features:')
print(categoricals.values,'\n')

# Let's Look at how many categories there are
for feature in categoricals:
    print('Feature {}: {} Categories'.format(str(feature), len(train_df[feature].unique())))

We have 369 64-bit integer columns (the example ID plus 368 features) and 8 categoricals; the single float column is the target variable.

Furthermore, we see that our categorical features are **X0**, **X1**, **X2**, **X3**, **X4**, **X5**, **X6** and **X8**.
Each of them has between 4 and 47 categories, and together these 8 features have 195 categories, i.e. 195 bits in a one-hot encoding.
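
As a rough sketch of how that total can be tallied, the snippet below counts the categories per feature and sums them. The `toy_df` frame and its values are made up for illustration (they are not the competition data), and `select_dtypes(exclude='number')` is used here as an assumption to pick out the non-numeric columns.

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    'X0': ['a', 'b', 'c', 'a'],
    'X1': ['v', 'v', 'w', 'w'],
    'X10': [0, 1, 0, 1],
})

# Pick the non-numeric (categorical) columns.
cat_cols = toy_df.select_dtypes(exclude='number').columns

# Number of distinct categories per categorical feature.
per_feature = toy_df[cat_cols].nunique()

# Sum of the category counts = number of one-hot columns needed.
total = int(per_feature.sum())

print(per_feature)
print('Total one-hot columns:', total)
```

Applied to `train_df`, the same two lines (`nunique()` followed by `sum()`) should give the per-feature counts and their total.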

Let's now have a look at the integer features

In [None]:
# Let's Now look at the values for the int64 features
int_features = train_df.columns[train_df.dtypes == 'int64']

values_dict = {}

for feature in int_features:
    values_dict[str(feature)] = len(train_df[feature].unique())

del values_dict['ID']
print('# Of unique Values for each int Feature')
print(values_dict)

As we suspected, the integer variables are in fact binary. Moreover, some features take a single constant value, and these can be dropped since they add no information to our model (a constant state carries 0 bits of information).
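
To make the "0 bits" remark concrete, here is a small sketch that computes the empirical Shannon entropy of a column. The helper name `entropy_bits` and the toy series are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd


def entropy_bits(column):
    """Empirical Shannon entropy of a column, in bits."""
    # Relative frequency of each distinct value (all strictly positive).
    p = column.value_counts(normalize=True).to_numpy()
    # H = -sum(p * log2(p)); adding 0.0 turns a possible -0.0 into 0.0.
    return float(-(p * np.log2(p)).sum() + 0.0)


constant_feature = pd.Series([0, 0, 0, 0])   # a single state
balanced_feature = pd.Series([0, 1, 0, 1])   # two equally likely states

print('constant:', entropy_bits(constant_feature))
print('balanced:', entropy_bits(balanced_feature))
```

A column with a single state has entropy zero, so it cannot help the model distinguish between examples, while a balanced binary column carries one bit.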

In [None]:
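# Collect the features that take a single value in the training set
drop = [key for key in values_dict if values_dict[key] == 1]

# Remove them from the bookkeeping dict; `drop` is kept so that the same
# columns can be removed from the test set later
for feature in drop:
    print('Dropped Feature {}'.format(feature))
    del values_dict[feature]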

We dropped the 12 features that carry no information for our model. We need to keep track of which features were dropped, because we have to drop them from the test set as well.
We are now left with 356 binary features (the 369 integer columns minus the ID and the 12 dropped features) and 8 categoricals, which make up 551 bits. Quite manageable. If we use a sparse one-hot representation to encode our categorical features, our feature vector will be 551x1.
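
A minimal sketch of that bookkeeping, assuming the list of constant columns is computed on the training set only and then applied to both frames. `toy_train`, `toy_test` and `constant_cols` are illustrative names with made-up values.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df; values are invented.
toy_train = pd.DataFrame({'ID': [0, 1, 2], 'X10': [0, 1, 0], 'X11': [0, 0, 0]})
toy_test = pd.DataFrame({'ID': [3, 4, 5], 'X10': [1, 1, 0], 'X11': [0, 1, 0]})

# Columns with a single unique value in the training set.
constant_cols = [c for c in toy_train.columns if toy_train[c].nunique() == 1]

# Drop the same columns from both sets so their layouts stay aligned,
# even if a column is not constant in the test set.
toy_train = toy_train.drop(columns=constant_cols)
toy_test = toy_test.drop(columns=constant_cols)

print('Dropped:', constant_cols)
print('Remaining columns:', list(toy_train.columns))
```

The decision is driven by the training set alone because the model never sees any variation in those columns during training.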

Now let's have a quick look at the test set just to be sure that it is not very different from our training set

In [None]:
# Loading Test Set

test_df = pd.read_csv(PATH + TEST)
print('TEST SET')
print(test_df.head())
print('shape = ',test_df.shape,'\n')

test_categoricals = test_df.columns[test_df.dtypes == object]

for feature in test_categoricals:
    print('Feature {}: {} Categories'.format(str(feature), len(test_df[feature].unique())))

Oops, features X0, X2 and X5 don't agree on the number of categories between the train and test sets. Let's take a closer look:

In [None]:
# Compare the categories of each categorical feature in train and test
for feature in categoricals:
    test_feature = np.sort(test_df[feature].unique())
    train_feature = np.sort(train_df[feature].unique())
    # Sorted union of the categories seen in either set
    union = np.union1d(train_feature, test_feature)

    print('\n\nTest {}: {}'.format(feature, test_feature))
    print('\nTrain {}: {}'.format(feature, train_feature))
    print('\nUnion size: ', len(union))

We can see that the train and test sets each have some categories that do not appear in the other set. It would be wise to check the other categorical features as well.
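
One way to run that check for every categorical feature at once is a pair of set differences. The sketch below uses made-up toy frames; `exclusive` is a hypothetical name for the result.

```python
import pandas as pd

# Toy stand-ins for the categorical columns of train_df and test_df.
toy_train = pd.DataFrame({'X0': ['a', 'b', 'c'], 'X5': ['x', 'y', 'y']})
toy_test = pd.DataFrame({'X0': ['a', 'b', 'd'], 'X5': ['x', 'y', 'x']})

exclusive = {}
for feature in toy_train.columns:
    train_cats = set(toy_train[feature].unique())
    test_cats = set(toy_test[feature].unique())
    exclusive[feature] = {
        # Categories that appear in one set but not in the other.
        'train_only': sorted(train_cats - test_cats),
        'test_only': sorted(test_cats - train_cats),
    }
    print(feature, exclusive[feature])
```

Features for which both lists come back empty share exactly the same categories in both sets.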

We are then left with two options: we can encode each categorical feature over the union of its train and test categories, or we can drop the examples whose categories are exclusive to either set.
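
The first option could look roughly like the sketch below, which is only one possible approach and not necessarily what the final model will use. It assumes `pd.CategoricalDtype` is used to fix the category list before calling `pd.get_dummies`; the toy frames and their values are made up.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df with one categorical feature.
toy_train = pd.DataFrame({'X0': ['a', 'b', 'c']})
toy_test = pd.DataFrame({'X0': ['a', 'd', 'd']})

for feature in ['X0']:
    # Union of the categories seen in either set, in a fixed order.
    union = sorted(set(toy_train[feature]) | set(toy_test[feature]))
    dtype = pd.CategoricalDtype(categories=union)
    toy_train[feature] = toy_train[feature].astype(dtype)
    toy_test[feature] = toy_test[feature].astype(dtype)

# get_dummies emits one column per declared category, including
# categories that never occur in that particular frame.
train_encoded = pd.get_dummies(toy_train)
test_encoded = pd.get_dummies(toy_test)

print(train_encoded.shape, test_encoded.shape)
```

Fixing the category list up front keeps the train and test matrices column-aligned, at the cost of some columns that are all zeros in one of the two sets.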

**MORE TO COME**

**KEEP ON THE LOOKOUT**