**Exploration Topics: ID - 10 of train**

 - correlations of y and features
 - feature dynamics

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import kagglegym

%matplotlib inline

In [None]:
# Create environment
env = kagglegym.make()

# Get first observation
observation = env.reset()

# Get the train dataframe
train = observation.train

In [None]:
train.shape

In [None]:
len(train.id.unique())

In [None]:
len(train.timestamp.unique())

In [None]:
def getidtraindata(instrument):
    return train.loc[train.id==instrument,:]

train10 = getidtraindata(10)
train10.head()    

In [None]:
train10.describe()

Let's have a look at the features that are most strongly correlated with the y-values of id 10:

In [None]:
x_cols = [col for col in train.columns if col not in ['id','timestamp','y']]
labels = []
values = []
nan_counts = []
for col in x_cols:
    labels.append(col)
    values.append(np.corrcoef(train10[col].values, train10.y.values)[0,1])
    nan_counts.append(train10[col].isnull().sum())

ind = np.arange(len(labels))
width = 0.9
fig, ax = plt.subplots(figsize=(6,40))
# center each bar on its tick so the labels line up with the bars
rects = ax.barh(ind, np.array(values), height=width, align='center', color='y')
ax.set_yticks(ind)
ax.set_yticklabels(labels, rotation='horizontal')
ax.set_xlabel("Correlation coefficient")
ax.set_title("Correlation of each feature with y (id 10)")
plt.show()

In [None]:
print(nan_counts)
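
A side note on the NaN counts: `np.corrcoef` returns NaN as soon as one of its inputs contains a NaN, so every feature with missing values gets no usable coefficient in the plot above. The following is a minimal sketch of one possible workaround, assuming it is acceptable to correlate only the rows where both the feature and y are present. The toy frame and its column name `feature_a` are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train10; the column name "feature_a" is hypothetical.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y":         [2.0, 4.0, 6.0, 8.0, 10.0],
})

# np.corrcoef propagates NaN: one missing value makes the coefficient NaN.
naive = np.corrcoef(toy["feature_a"].values, toy["y"].values)[0, 1]

# Series.corr only uses the rows where both values are present.
pairwise = toy["feature_a"].corr(toy["y"])

print(naive, pairwise)
```

Keep in mind that a coefficient computed from only a few complete rows is not very reliable, so the NaN count is still worth checking next to it.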

To work with the correlation coefficients of y and the features, we store them in a dataframe. Now, let's have a closer look at three features that have an absolute coefficient above 0.2 and the lowest number of NaN values:

In [None]:
y_feature_correlations = pd.DataFrame(values, index=labels, columns=['coefficient'])
y_feature_correlations['Nr of NaN'] = nan_counts
y_feature_correlations.head()

In [None]:
sorted_y_feature_correlations = y_feature_correlations.sort_values(['coefficient'], ascending=False)
sorted_y_feature_correlations.head()
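
The sort above orders by coefficient only. As a rough sketch of how the two criteria from the text (absolute coefficient above 0.2, few NaN values) could be combined, here is a filter plus a two-key sort. The toy frame, its index labels and its numbers are invented for illustration, and the 0.2 cut-off is simply the one mentioned in the text.

```python
import pandas as pd

# Toy stand-in for y_feature_correlations; labels and values are made up.
toy_corr = pd.DataFrame(
    {"coefficient": [0.35, -0.28, 0.05, 0.22, -0.40],
     "Nr of NaN":   [0, 12, 0, 3, 0]},
    index=["f_a", "f_b", "f_c", "f_d", "f_e"],
)

threshold = 0.2  # cut-off taken from the text

# Keep features whose absolute coefficient exceeds the threshold.
# Rows with a NaN coefficient compare as False and are dropped as well.
strong = toy_corr[toy_corr["coefficient"].abs() > threshold]

# Fewest NaN values first, then the largest coefficient first.
candidates = strong.sort_values(["Nr of NaN", "coefficient"],
                                ascending=[True, False])
print(candidates)
```

With the real dataframe, `toy_corr` would be replaced by `y_feature_correlations`.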

Now we see that for id 10, fundamental_50, fundamental_39 and fundamental_46 have zero NaN values and are the most strongly correlated with our y-values. Let's plot the time evolution of y and these fundamentals:

In [None]:
plt.figure()
plt.plot(train10.timestamp, train10.fundamental_50, '.')
plt.xlabel('timestamp')
plt.ylabel('fundamental_50')

plt.figure()
plt.plot(train10.timestamp, train10.fundamental_39, '.')
plt.xlabel('timestamp')
plt.ylabel('fundamental_39')

plt.figure()
plt.plot(train10.timestamp, train10.fundamental_46, '.')
plt.xlabel('timestamp')
plt.ylabel('fundamental_46')

plt.figure()
plt.plot(train10.timestamp, train10.y, '.-')
plt.xlabel('timestamp')
plt.ylabel('y')

Whoa, there are crazy jumps at nearly the same time! But in the beginning they do not behave the same way. In contrast, let's have a look at the time evolution of the anti-correlated features:

In [None]:
sorted_y_feature_correlations = y_feature_correlations.sort_values(['coefficient'], ascending=True)
sorted_y_feature_correlations.head()

In [None]:
plt.figure()
plt.plot(train10.timestamp, train10.fundamental_36, '.')
plt.xlabel('timestamp')
plt.ylabel('fundamental_36')

plt.figure()
plt.plot(train10.timestamp, train10.fundamental_30, '.')
plt.xlabel('timestamp')
plt.ylabel('fundamental_30')

plt.figure()
plt.plot(train10.timestamp, train10.technical_27, '.')
plt.xlabel('timestamp')
plt.ylabel('technical_27')

For the next analysis part, the feature data is scaled to values between 0 and 1. This way it is much easier to visualize the data:

In [None]:
def scale(values):
    # min-max scaling to [0, 1]; min() and max() skip NaN values
    return (values - values.min()) / (values.max() - values.min())

def scale_all_features(data):
    scaled_data = pd.DataFrame(data.timestamp)
    # items() replaces iteritems(), which was removed in pandas 2.0
    for col, old_values in data.items():
        if col not in ['id','timestamp','y']:
            scaled_data[str(col)] = scale(old_values)
    return scaled_data

scaled_train10 = scale_all_features(train10)
scaled_train10.head()

In [None]:
plt.figure()
for col, col_values in scaled_train10.items():
    if col not in ['id','timestamp','y']:
        plt.plot(scaled_train10.timestamp, col_values, '.')
plt.xlabel('timestamp')
plt.ylabel('scaled feature values')

This looks funny and reminds me of bifurcations, the Feigenbaum constant, etc. I think we have entered the world of nonlinear dynamics and perhaps of chaotic systems. The features are probably correlated in a highly nonlinear way, and there seem to be some kind of phase transitions (for example, many features "jump" close to time points 20, 40 and 85).
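
To make the "many features jump at the same time" impression a bit more quantitative, one could count, per timestamp, how many scaled features change by more than some threshold from one step to the next. This is only a sketch under assumptions: the toy frame stands in for `scaled_train10`, the feature names are hypothetical, and the threshold of 0.5 is an arbitrary choice that would need tuning on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for scaled_train10: three hypothetical features in [0, 1],
# two of which jump between timestamps 4 and 5.
toy_scaled = pd.DataFrame({
    "timestamp": np.arange(10),
    "f_a": [0.0, 0.0, 0.1, 0.1, 0.1, 0.9, 1.0, 1.0, 1.0, 1.0],
    "f_b": [1.0, 1.0, 0.9, 0.9, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
    "f_c": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0],
})

jump_threshold = 0.5  # assumed cut-off for calling a step a "jump"

features = toy_scaled.set_index("timestamp")

# Absolute change of every feature from one timestamp to the next.
steps = features.diff().abs()

# Number of features that jump at each timestamp (NaN steps count as no jump).
jumps_per_timestamp = (steps > jump_threshold).sum(axis=1)

print(jumps_per_timestamp[jumps_per_timestamp > 0])
```

Timestamps where this count is high would be candidates for the transitions seen in the plot.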