In this notebook, we take a quick look at some individual IDs and how their y values change over time (the timestamp column). This may give a bit more insight into modeling on a per-ID basis.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in newer pandas
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
with pd.HDFStore("../input/train.h5", "r") as train:
    # Note that the "train" dataframe is the only dataframe in the file
    df = train.get("train")

In [None]:
df.head()

In [None]:
print("{0} unique IDs".format(len(df.id.unique())))
print("{0} unique timestamps".format(len(df.timestamp.unique())))

### Plot timestamp vs y by individual IDs
I wanted to look at individual IDs to see whether they provide any insight, so I pulled out the rows for ten randomly chosen IDs and plotted each ID's y against its timestamp.

In [None]:
fig = plt.figure(figsize=(8, 20))
for plot_count in range(1, 11):
    plt.subplot(10, 1, plot_count)
    # Sample one row and take its id (IDs with more rows are more likely to be drawn)
    randID = np.array(df.id.sample(1))[0]
    dften = df[df['id'] == randID]
    plt.plot(dften['timestamp'], dften['y'])
    plt.title("ID = " + str(randID))
    plt.xlim(0, 1900)
# Adjust the layout once, after all subplots exist
plt.tight_layout()
plt.show()
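
Sampling a row and reading its id means the same ID can be drawn more than once, and IDs with many rows are favored. A minimal sketch of an alternative, assuming we would rather pick distinct IDs uniformly: draw from the unique IDs without replacement. The `toy` frame, the seed, and the name `sampled_ids` are made up for illustration and are not part of the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training frame: id 1 has far more rows than the others.
toy = pd.DataFrame({
    "id": [1] * 50 + [2] * 5 + [3] * 5 + [4] * 5,
    "timestamp": list(range(50)) + list(range(5)) * 3,
})

rng = np.random.default_rng(0)  # fixed seed so the choice is repeatable

# Draw from the unique IDs, without replacement, so no ID is picked twice
# and IDs with many rows are not favored.
sampled_ids = rng.choice(toy["id"].unique(), size=3, replace=False)

for sampled_id in sampled_ids:
    one_id = toy[toy["id"] == sampled_id]
    print(sampled_id, len(one_id))
```

The loop body is where the per-ID plot would go.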

### Plot moving average of "timestamp" vs "y" by individual IDs
Moving averages tend to show the trend in y more clearly than the raw values, so I arbitrarily chose a window of 100 rows and plotted the result.

In [None]:
fig = plt.figure(figsize=(8, 20))
for plot_count in range(1, 11):
    plt.subplot(10, 1, plot_count)
    randID = np.array(df.id.sample(1))[0]
    # .copy() avoids a SettingWithCopyWarning when adding the ymean column
    dften = df[df['id'] == randID].copy()
    # pd.rolling_mean was removed from pandas; use the .rolling() accessor instead
    dften['ymean'] = dften['y'].rolling(window=100).mean()
    plt.plot(dften['timestamp'], dften['ymean'])
    plt.title("ID = " + str(randID))
    plt.xlim(0, 1900)
plt.tight_layout()
plt.show()
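
One thing to keep in mind when reading these plots: with a window of 100, the first 99 rows of each ID have no moving average, and an ID with fewer than 100 rows produces an empty plot. The small sketch below shows that behavior on a made-up five-value series with a window of 3, along with the `min_periods` argument of `Series.rolling`, which relaxes it. The series values are invented for illustration.

```python
import pandas as pd

y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# With window=3, the first two positions see fewer than 3 observations, so they are NaN.
strict = y.rolling(window=3).mean()

# min_periods=1 averages whatever observations are available so far.
lenient = y.rolling(window=3, min_periods=1).mean()

print(strict.tolist())   # [nan, nan, 2.0, 3.0, 4.0]
print(lenient.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

Whether the early, partially filled averages are useful is a judgment call; they are noisier than the full-window values.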

### Plot Both
Now that we have seen that the moving average has some shape rather than being a flat line, we can plot it directly above the raw y values for the same ID to see how they compare.

In [None]:
fig = plt.figure(figsize=(8, 20))
plot_count = 0
for i in range(0, 5):
    randID = np.array(df.id.sample(1))[0]
    # .copy() avoids a SettingWithCopyWarning when adding the ymean column
    dften = df[df['id'] == randID].copy()
    # pd.rolling_mean was removed from pandas; use the .rolling() accessor instead
    dften['ymean'] = dften['y'].rolling(window=100).mean()
    # Moving average for this ID
    plot_count += 1
    plt.subplot(10, 1, plot_count)
    plt.plot(dften['timestamp'], dften['ymean'])
    plt.title("Moving Average ID = " + str(randID))
    plt.xlim(0, 1900)
    # Raw y for the same ID, directly below
    plot_count += 1
    plt.subplot(10, 1, plot_count)
    plt.plot(dften['timestamp'], dften['y'])
    plt.title("ID = " + str(randID))
    plt.xlim(0, 1900)
plt.tight_layout()
plt.show()

### Identifying correlation on the original Y
Before running correlations on the moving average, I wanted a baseline to compare against: the correlation of every column with the raw y for a single ID (870).

In [None]:
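targetID = 870
dften = df[df['id'] == targetID]
cor = dften.corr(method='pearson')
cordf = pd.DataFrame(cor['y'])
# Rank by absolute correlation so strong negative correlations are not buried
cordf['sort'] = cordf.y.abs()
print(targetID)  # the ID actually analyzed, not the last randomly sampled one
cordf.sort_values('sort', ascending=False).drop('sort', axis=1).head(10)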

### Identifying correlation on moving averages
The moving average may be a better target than the raw y value, so let's look at its correlation coefficients with the other columns.

In [None]:
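targetID = 870
# .copy() avoids a SettingWithCopyWarning when adding the ymean column
dften = df[df['id'] == targetID].copy()
dften['ymean'] = dften['y'].rolling(window=100).mean()
cor = dften.corr(method='pearson')
cordf = pd.DataFrame(cor['ymean'])
# Rank by absolute correlation so strong negative correlations are not buried
cordf['sort'] = cordf.ymean.abs()
print(targetID)  # the ID actually analyzed, not the last randomly sampled one
cordf.sort_values('sort', ascending=False).drop('sort', axis=1).head(10)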
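
The two correlation cells repeat the same ranking steps, and the target column itself always lands at the top of the list with a correlation of 1. A minimal sketch of a reusable helper that drops the target before ranking is shown below. The function name `top_correlated` and the toy columns are hypothetical and only illustrate the idea on synthetic data.

```python
import numpy as np
import pandas as pd

def top_correlated(frame, target, n=10):
    """Return the n columns with the largest absolute Pearson correlation to target."""
    # Correlation of every column with the target, excluding the target itself
    corr = frame.corr(method="pearson")[target].drop(target)
    # Order by absolute value but keep the signed coefficients in the result
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(n)

# Synthetic data: one strongly (negatively) related column, one weak, one pure noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "strong_neg": -x + 0.1 * rng.normal(size=200),
    "weak": x + 5.0 * rng.normal(size=200),
    "noise": rng.normal(size=200),
    "y": x,
})

ranked = top_correlated(toy, "y", n=2)
print(ranked)
```

When the target is the moving average, it would probably make sense to drop the raw y column as well, since the moving average is computed from it.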

### Plotting the metrics with high correlation
After a few iterations, I picked some metrics with reasonably high correlation and wanted to see what the data looked like. I inspected the values, dropped the rows where ymean was NaN, and plotted the rest in a scatter matrix.

In [None]:
dften.loc[:,['fundamental_20','fundamental_45','fundamental_26','ymean']]

In [None]:
# Drop rows without a moving average (the first 99 rows of the window)
dften = dften[np.isfinite(dften['ymean'])]
print(dften.shape)
cols = ['fundamental_20', 'fundamental_45', 'fundamental_26', 'ymean']
axs = scatter_matrix(dften.loc[:, cols], alpha=0.3, figsize=(9, 9), diagonal='hist')

### Conclusion
It may be worth using a moving average together with linear regression at the individual ID level. However, because there are more than 1400 individual IDs, calculating the top 4 correlated metrics and fitting a linear model for each ID may not complete within the time limit. One possible way around this is to first identify IDs whose moving averages correlate with each other and group them in the regression analysis.
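
The grouping idea can be sketched roughly as follows, assuming one moving-average column per ID and a pairwise Pearson correlation between those columns. Everything here is illustrative: the toy data, the window of 20, the 0.8 threshold, and the names `wide`, `id_corr`, and `partners` are assumptions, not something tested on the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
timestamps = np.arange(300)
trend = np.sin(timestamps / 30.0)

# Hypothetical toy data: ids 1 and 2 share a trend, id 3 follows the opposite one.
frames = []
for id_, sign in [(1, 1.0), (2, 1.0), (3, -1.0)]:
    frames.append(pd.DataFrame({
        "id": id_,
        "timestamp": timestamps,
        "y": sign * trend + 0.3 * rng.normal(size=timestamps.size),
    }))
toy = pd.concat(frames, ignore_index=True)

# Moving average of y computed separately within each id
toy = toy.sort_values(["id", "timestamp"])
toy["ymean"] = toy.groupby("id")["y"].transform(
    lambda s: s.rolling(window=20).mean()
)

# One column per id, one row per timestamp, then id-by-id correlation
wide = toy.pivot(index="timestamp", columns="id", values="ymean")
id_corr = wide.corr(method="pearson")

threshold = 0.8  # arbitrary cutoff for "moves together"
partners = {}
for id_ in id_corr.columns:
    mask = (id_corr[id_].to_numpy() > threshold) & (id_corr.index.to_numpy() != id_)
    partners[id_] = id_corr.index[mask].tolist()

print(partners)
```

IDs that list each other as partners could then share one regression. Note that the full correlation matrix for more than 1400 IDs is itself sizable, and IDs with little timestamp overlap would yield unreliable or missing coefficients.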