This challenge consists of two sets of files:

* `X.CSV` for various values of `X`: contains data similar to what you were to produce for the C++ part of this challenge. Specifically, each file contains the average, min, max, and standard deviation of the total acceleration in 10 second long time windows for a single video. The `start` and `end` columns indicate the bounds of the time window in milliseconds from the beginning of the video.
* `X.jumps.csv` for various values of `X`: these files go with the matching `X.CSV` files and indicate when a kite surfer jumped. That is, if one row of `X.jumps.csv` has `start == 53280` and `end == 55540` then a kite surfer jumped, leaving the water at 53280 milliseconds from the beginning of the video and landing back on the water at 55540 milliseconds from the beginning of the video.
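
As a rough illustration of what each row of `X.CSV` holds, here is a sketch that computes the four statistics over 10 second windows. The 500 ms step matches the `start` values shown later in this notebook; the sample signal, the function name `window_stats`, and the use of the population standard deviation are assumptions.

```python
import numpy as np

def window_stats(t_ms, accel, window_ms=10000, step_ms=500):
    """Yield (start, end, avg, min, max, stddev) for each full window."""
    start = 0
    while start + window_ms <= t_ms[-1]:
        end = start + window_ms
        # Samples whose timestamp falls inside [start, end)
        in_window = (t_ms >= start) & (t_ms < end)
        a = accel[in_window]
        yield (start, end, a.mean(), a.min(), a.max(), a.std())
        start += step_ms

# Made-up signal: one sample every 100 ms for 12 s
t_ms = np.arange(0, 12000, 100)
accel = 1.0 + 0.1 * np.sin(t_ms / 1000.0)

rows = list(window_stats(t_ms, accel))
```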

The goal is to predict which time windows contain kite jumps. This Jupyter Notebook is a (bad) attempt at building a model to do that. Your challenge is to identify the various ways this attempt could be improved. You do not need to actually fix the code (though that is permitted). You can simply add new Markdown cells to the notebook indicating the mistakes you observe and suggesting improvements.

In [6]:
import pandas as pd
import numpy as np
import glob
import os.path
from keras.layers import Input, Dense
from keras.models import Model
from keras import optimizers
import sklearn.model_selection as ms

In [7]:
data_dir='challenge_data'

Read all the forces and jump data into dictionaries whose keys are the names of the forces files and whose values are the Pandas DataFrame instances holding the data.

In [8]:
force_files = glob.glob(os.path.join(data_dir, '*.CSV'))
jump_files = glob.glob(os.path.join(data_dir, '*.jumps.csv'))

forces = {}
for ff in force_files:
    forces[ff] = pd.read_csv(ff)
jumps = {}
for ff, jf in zip(force_files, jump_files):
    # Use the forces filename as the key so we can easily match the jump times with
    # the corresponding forces
    jumps[ff] = pd.read_csv(jf)
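
The pairing above relies on `glob.glob` returning both lists in matching order, which it does not guarantee. A minimal sketch of pairing by file stem instead, assuming the `X.CSV` / `X.jumps.csv` naming scheme described at the top of this notebook (the helper name `pair_by_stem` is hypothetical):

```python
import os.path

def pair_by_stem(force_files, jump_files):
    """Map each forces path to the jumps path that shares its stem."""
    suffix = '.jumps.csv'
    # Index jump files by stem: 'dir/V1.jumps.csv' -> 'V1'
    jumps_by_stem = {}
    for jf in jump_files:
        name = os.path.basename(jf)
        if name.lower().endswith(suffix):
            jumps_by_stem[name[:-len(suffix)]] = jf
    pairs = {}
    for ff in force_files:
        name = os.path.basename(ff)
        if name.lower().endswith(suffix):
            continue  # a case-insensitive '*.CSV' glob may also match jump files
        stem = os.path.splitext(name)[0]
        if stem not in jumps_by_stem:
            raise KeyError('no jumps file for ' + ff)
        pairs[ff] = jumps_by_stem[stem]
    return pairs

pairs = pair_by_stem(
    ['challenge_data/V3.CSV', 'challenge_data/V1.CSV'],
    ['challenge_data/V1.jumps.csv', 'challenge_data/V3.jumps.csv'],
)
```

Each forces file is then matched to its own jumps file regardless of the order in which the two lists were produced.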

In [9]:
forces.keys()

['challenge_data/V3.CSV',
 'challenge_data/V4.CSV',
 'challenge_data/V1.CSV',
 'challenge_data/V2.CSV']

In [10]:
forces['challenge_data/V1.CSV'].head()

Unnamed: 0,avg,max,min,start,stddev,end
0,1.193093,3.077215,0.120284,0,0.568452,10000
1,1.188233,3.077215,0.120284,500,0.572365,10500
2,1.188903,3.077215,0.120284,1000,0.575058,11000
3,1.194125,3.077215,0.120284,1500,0.577237,11500
4,1.221625,3.113024,0.120284,2000,0.595027,12000


In [11]:
jumps.keys()

['challenge_data/V3.CSV',
 'challenge_data/V4.CSV',
 'challenge_data/V1.CSV',
 'challenge_data/V2.CSV']

In [12]:
jumps['challenge_data/V1.CSV'].head()

Unnamed: 0,Start,End
0,295800.0,298000.0
1,379600.0,384300.0
2,558300.0,562800.0
3,1056300.0,1060700.0
4,1125400.0,1129500.0


Now we want to join the computed data (the summary statistics for each time window) against the jumps data so we have a target for supervised learning. Specifically, if the window for a row contains a jump then our target is `True`. Otherwise it is `False`.

In [13]:
for k in forces.keys():
    cur_f = forces[k]
    cur_f['Target'] = False
    jt = jumps[k]
    for i in range(jt.shape[0]):
        start = jt.Start.iloc[i]
        end = jt.End.iloc[i]
        cur_f.loc[(cur_f.start <= start) & (cur_f.end >= end), 'Target'] = True
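
To make the labelling rule concrete: a window is marked `True` only when it fully contains a jump. A small sketch on made-up numbers (the toy frames below are illustrative and not taken from the challenge data):

```python
import pandas as pd

# Toy windows: 10 s long, starting every 2 s (illustrative values only)
windows = pd.DataFrame({'start': [0, 2000, 4000, 6000, 8000]})
windows['end'] = windows['start'] + 10000

# One toy jump from 7.5 s to 9.5 s
toy_jumps = pd.DataFrame({'Start': [7500.0], 'End': [9500.0]})

windows['Target'] = False
for start, end in zip(toy_jumps['Start'], toy_jumps['End']):
    # Same rule as above: the window must contain the whole jump
    contains = (windows['start'] <= start) & (windows['end'] >= end)
    windows.loc[contains, 'Target'] = True

print(windows)
```

The last window overlaps the jump but begins after the take-off, so it stays `False`. Whether such partial overlaps should count as positives is a labelling choice worth stating explicitly.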

In [14]:
jumps['challenge_data/V1.CSV'].head()

Unnamed: 0,Start,End
0,295800.0,298000.0
1,379600.0,384300.0
2,558300.0,562800.0
3,1056300.0,1060700.0
4,1125400.0,1129500.0


In [15]:
# Make sure we set our target correctly
f1 = forces['challenge_data/V1.CSV']
f1[(f1.start > 280000) & (f1.start < 310000)]

Unnamed: 0,avg,max,min,start,stddev,end,Target
561,1.038614,1.619985,0.668307,280500,0.140928,290500,False
562,1.05116,1.619985,0.668307,281000,0.139471,291000,False
563,1.049243,1.619985,0.668307,281500,0.13683,291500,False
564,1.040384,1.619985,0.668307,282000,0.133006,292000,False
565,1.047557,1.619985,0.668307,282500,0.143634,292500,False
566,1.050927,1.619985,0.668307,283000,0.145682,293000,False
567,1.047278,1.619985,0.668307,283500,0.139229,293500,False
568,1.052787,1.588935,0.668307,284000,0.13372,294000,False
569,1.04715,1.588935,0.800859,284500,0.130529,294500,False
570,1.045195,1.588935,0.749637,285000,0.134458,295000,False


In [16]:
# Now concatenate things into one big data set
all_data = pd.concat(forces.values(), ignore_index=True)

No data is held out here as an actual test set that is independent of both the training and validation sets.
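
One way to keep such a test set, sketched under the assumption that whole videos are held out. The `video` column and the toy frame are hypothetical, since the notebook does not record which file each row came from. Holding out whole videos also keeps heavily overlapping consecutive windows from landing on both sides of a split.

```python
import pandas as pd

# Toy stand-in for all_data with a hypothetical 'video' column
toy = pd.DataFrame({
    'video': ['V1'] * 4 + ['V2'] * 4 + ['V3'] * 4 + ['V4'] * 4,
    'avg': range(16),
})

test_videos = ['V4']    # never touched until the final evaluation
valid_videos = ['V3']   # used for tuning and early stopping

toy_test = toy[toy['video'].isin(test_videos)]
toy_valid = toy[toy['video'].isin(valid_videos)]
toy_train = toy[~toy['video'].isin(test_videos + valid_videos)]
```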

In [17]:
all_data.head()

Unnamed: 0,avg,max,min,start,stddev,end,Target
0,0.97924,1.434954,0.654423,0,0.061798,10000,False
1,0.978753,1.434954,0.654423,500,0.060129,10500,False
2,0.976366,1.434954,0.654423,1000,0.059742,11000,False
3,0.976729,1.434954,0.654423,1500,0.053761,11500,False
4,0.975602,1.434954,0.479002,2000,0.060523,12000,False


In [18]:
all_data.shape

(8397, 7)

In [19]:
sum([x.shape[0] for x in forces.values()])

8397

Now split the data into training and validation data sets.

In [20]:
train, valid = ms.train_test_split(all_data, test_size=0.2)

In [21]:
train.head()

Unnamed: 0,avg,max,min,start,stddev,end,Target
4574,1.286037,4.157926,0.119931,748000,0.593629,758000,False
1101,1.616383,6.50306,0.358433,550500,0.835429,560500,False
5958,1.040207,1.457351,0.804496,445000,0.11125,455000,False
2459,1.636622,6.260625,0.370994,1229500,0.92498,1239500,False
458,1.460342,3.744293,0.405085,229000,0.544053,239000,False


In [22]:
valid.head()

Unnamed: 0,avg,max,min,start,stddev,end,Target
453,1.466148,3.508835,0.405085,226500,0.527799,236500,False
1249,1.996287,4.575622,0.236244,624500,0.92271,634500,False
6975,1.376434,4.062463,0.406345,198000,0.648803,208000,False
8166,1.269831,3.446483,0.404536,793500,0.523956,803500,False
1949,1.427231,7.853846,0.470511,974500,0.776878,984500,False


In [23]:
train.shape

(6717, 7)

In [24]:
valid.shape

(1680, 7)

Now let's see if we can train a neural network on this data.

In [25]:
inputs = Input(shape=(4,))
l1 = Dense(10, activation='sigmoid')(inputs)
l2 = Dense(15, activation='sigmoid')(l1)
out = Dense(1, activation='sigmoid')(l2)
model = Model(inputs=inputs, outputs=out)

In [26]:
def to_data_and_target(df):
    """Given either our training or test data set return a pair of (data, targets) where data is just the
    columns that should be input and targets are just the corresponding targets."""
    data = df[['avg', 'min', 'max', 'stddev']]
    targets = df.Target
    return (data, targets)

In [27]:
train_d, train_t = to_data_and_target(train)

In [28]:
train_d.head()

Unnamed: 0,avg,min,max,stddev
4574,1.286037,0.119931,4.157926,0.593629
1101,1.616383,0.358433,6.50306,0.835429
5958,1.040207,0.804496,1.457351,0.11125
2459,1.636622,0.370994,6.260625,0.92498
458,1.460342,0.405085,3.744293,0.544053


In [29]:
train_t.head()

4574    False
1101    False
5958    False
2459    False
458     False
Name: Target, dtype: bool

The model is trained on data that has not been normalized. Because the variables differ considerably in magnitude (e.g., `max` is always much greater than `min`), some of them are likely to be weighted disproportionately during training.
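
A minimal sketch of standardizing the inputs, assuming the statistics are computed on the training rows only and then reused for validation. The arrays below are made-up values in the same column order as `to_data_and_target`.

```python
import numpy as np

# Columns: avg, min, max, stddev (made-up rows for illustration)
train_x = np.array([[1.2, 0.1, 4.0, 0.6],
                    [1.6, 0.4, 6.0, 0.8],
                    [1.0, 0.7, 2.0, 0.1]])
valid_x = np.array([[1.4, 0.4, 3.0, 0.5]])

# Per-column statistics from the training data only
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)

train_scaled = (train_x - mean) / std
valid_scaled = (valid_x - mean) / std   # reuse the training statistics
```

After this step every training column has zero mean and unit variance, so no single feature dominates purely because of its scale.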

In [31]:
opt = optimizers.SGD(lr=0.02, momentum=0.2, decay=1e-6)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x=train_d.values, y=train_t.values, epochs=50, verbose=2)

Epoch 1/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 2/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 3/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 4/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 5/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 6/50
 - 1s - loss: 0.1634 - acc: 0.9614
Epoch 7/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 8/50
 - 1s - loss: 0.1634 - acc: 0.9614
Epoch 9/50
 - 1s - loss: 0.1634 - acc: 0.9614
Epoch 10/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 11/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 12/50
 - 1s - loss: 0.1634 - acc: 0.9614
Epoch 13/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 14/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 15/50
 - 1s - loss: 0.1634 - acc: 0.9614
Epoch 16/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 17/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 18/50
 - 1s - loss: 0.1634 - acc: 0.9614
Epoch 19/50
 - 1s - loss: 0.1634 - acc: 0.9614
Epoch 20/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 21/50
 - 1s - loss: 0.1635 - acc: 0.9614
Epoch 22/50
 - 1s - lo

<keras.callbacks.History at 0x7f8700e73e90>

There is no early stopping, creating the potential for overfitting.
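
Early stopping needs a validation loss to monitor, which the `fit` call above does not compute. In Keras this is typically done by passing `validation_data` and a `keras.callbacks.EarlyStopping` callback to `fit`. The patience rule itself can be sketched in plain Python (the function name and loss values are made up for illustration):

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss      # improvement: remember it and reset the counter
            waited = 0
        else:
            waited += 1      # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

# Made-up validation losses: improve for three epochs, then stall
losses = [0.50, 0.40, 0.35, 0.36, 0.35, 0.37, 0.34]
print(stopping_epoch(losses, patience=3))
```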

In [32]:
valid_d, valid_t = to_data_and_target(valid)
model.evaluate(valid_d.values, valid_t.values)



[0.15801479589371453, 0.96309523809523812]

It looks like things were still improving after 50 epochs so maybe we should try more.

In [None]:
opt = optimizers.SGD(lr=0.02, momentum=0.2, decay=1e-6)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x=train_d.values, y=train_t.values, epochs=100, verbose=2)

In [None]:
valid_d, valid_t = to_data_and_target(valid)
model.evaluate(valid_d.values, valid_t.values)

And it looks like even after 100 epochs things were still improving, though slowly. So maybe we should try more and/or increase the learning rate and/or decrease the decay. Let's try something like that.

In [None]:
opt = optimizers.SGD(lr=0.04, momentum=0.2, decay=1e-7)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x=train_d.values, y=train_t.values, epochs=200, verbose=2)

That looks pretty good! Let's assess accuracy on our validation data set.

In [None]:
valid_d, valid_t = to_data_and_target(valid)
model.evaluate(valid_d.values, valid_t.values)



Awesome! Better than 95% accuracy on our validation set. This is a good model!

This accuracy is actually unimpressive. There is a serious class imbalance, and the model produces very low outputs (below 0.04 in the sample shown below) for both jumps and non-jumps. Such outputs all fall under the usual 0.5 decision threshold, so every window is predicted as a non-jump and the high accuracy simply reflects the share of non-jump windows in the data.

In [33]:

print(model.predict(valid_d))


[[ 0.03806587]
 [ 0.0366641 ]
 [ 0.03794405]
 ..., 
 [ 0.03821696]
 [ 0.03717816]
 [ 0.03700233]]


The accuracy of the model should be measured on each class (jump and non-jump) before deeming it a good model.
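
A minimal sketch of such a per-class check, using made-up labels and scores that mimic the situation above (few positives, all outputs near 0.04). The 0.5 threshold is an assumption.

```python
import numpy as np

# Made-up validation labels: 4 jumps among 100 windows
y_true = np.zeros(100, dtype=bool)
y_true[:4] = True
# Made-up model outputs, all far below the threshold
scores = np.full(100, 0.04)

y_pred = scores >= 0.5

accuracy = (y_pred == y_true).mean()
recall_jump = (y_pred & y_true).sum() / y_true.sum()            # jumps found
recall_non_jump = (~y_pred & ~y_true).sum() / (~y_true).sum()   # non-jumps kept
baseline = max(y_true.mean(), 1 - y_true.mean())                # always predict the majority class

print(accuracy, recall_jump, recall_non_jump, baseline)
```

In this toy case the accuracy equals the majority-class baseline while not a single jump is detected, which is the pattern described above.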

This model is likely too simple: it would need more training data, more features, and more parameters to have a chance of finding a good fit. Given those requirements, a neural network is probably not the ideal choice here. A method that is less susceptible to class imbalance, most likely a random forest, would be a better starting point.
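
A sketch of what a random forest attempt might look like, assuming scikit-learn's `RandomForestClassifier` with `class_weight='balanced'` to counter the imbalance. The data below is synthetic and only stands in for the four window statistics, so the fit says nothing about how well this would work on the real videos.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for (avg, min, max, stddev): 500 non-jumps, 25 jumps
x_neg = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
x_pos = rng.normal(loc=3.0, scale=1.0, size=(25, 4))
x = np.vstack([x_neg, x_pos])
y = np.array([False] * 500 + [True] * 25)

# class_weight='balanced' reweights classes inversely to their frequency
forest = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                random_state=0)
forest.fit(x, y)

proba = forest.predict_proba(x)[:, 1]   # estimated probability of a jump
```

As with the neural network, the result should be judged on held-out data and per class, not on overall accuracy.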