Overall strategy:

Every tenth sample in the training data is labeled, so the labels were extended to cover the entire train_time_series set by giving each contiguous stretch of samples the label of the labeled samples inside it.
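
The notebook assigns the label ranges by hand. As a point of comparison, here is a minimal sketch of one automatic way to spread sparse labels over a full series, assuming the labels share the index of the time series. The toy frame and the names `series`, `sparse_labels`, and `dense` are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): 30 samples, with one label every tenth sample.
series = pd.DataFrame({"x": np.linspace(0.0, 1.0, 30)})
sparse_labels = pd.Series([1, 2, 4], index=[3, 13, 23], name="label")

# Align the labels to the full index; unlabeled rows become NaN.
dense = sparse_labels.reindex(series.index)

# Carry each label forward to the next labeled row, then back-fill the
# rows that come before the first label.
series["label"] = dense.ffill().bfill().astype(int)

print(series["label"].value_counts().sort_index())
```

Unlike the hand-picked ranges, this fills every row, including the samples around activity transitions, so it may mislabel a few samples near each change of activity.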

A simple rolling window was used to compute the mean of the samples in each window (for all three axes). The NaNs this produced at the start of the series were replaced with the first value returned by the rolling window.
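
To make the smoothing step concrete, here is a small sketch on made-up data. The frame `raw`, the window size of 4, and the random values are assumptions for illustration only; the notebook itself uses windows of 80 and 20 on the real accelerometer data.

```python
import numpy as np
import pandas as pd

# Toy three-axis frame (hypothetical values).
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(size=(12, 3)), columns=["x", "y", "z"])

window = 4
smoothed = raw.rolling(window).mean()  # the first window-1 rows are NaN

# Replace the leading NaNs with the first complete window mean,
# which sits at row position window-1.
first_valid = smoothed.iloc[window - 1]
smoothed.iloc[: window - 1] = first_valid.to_numpy()

print(smoothed.head(window + 1))
```

Filling the leading rows keeps the smoothed frame the same length as the labels, so no labeled samples have to be dropped.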

To help the final classifier, the mean-transformed data was inspected visually to locate the "standing" labels, since the standing samples are quite obvious to the naked eye. A simple boolean flag records whether the value of the 'x' feature falls at or below a threshold. The flag was added as three extra columns to reinforce the "standing" data points (I presume a weight could have been used instead).

The classifier is a stacking classifier: a Random Forest classifier and a linear SVC serve as base estimators, and their outputs feed a KNN classifier in the hope of cleaning up any erratic, one-off predictions.

The results looked very good on the train/test split. However, there is probably a substantial amount of overfitting, because the accuracy tracks the train-to-test data ratio. I tried adjusting the classifiers but could not keep the model from overfitting.
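
One possible explanation, not verified on this data, is that after smoothing, neighboring samples are nearly identical, so a shuffled split places near-duplicates of each test row in the training set. The sketch below compares a shuffled holdout with a contiguous, time-ordered holdout on synthetic data. The signal, the block layout, and the helper `holdout_accuracy` are all hypothetical, and a plain random forest stands in for the stacked model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical): four activity blocks, repeated twice,
# each with a noisy three-axis signal that is then smoothed.
rng = np.random.default_rng(0)
labels = np.repeat([1, 2, 3, 4, 1, 2, 3, 4], 100)
signal = labels[:, None] * 0.1 + rng.normal(scale=0.3, size=(labels.size, 3))
X = pd.DataFrame(signal, columns=["x", "y", "z"]).rolling(20, min_periods=1).mean()

def holdout_accuracy(shuffle):
    # shuffle=False keeps the last 30% of the timeline as one contiguous test block.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, train_size=0.7, shuffle=shuffle,
        random_state=0 if shuffle else None,
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

shuffled_acc = holdout_accuracy(shuffle=True)
ordered_acc = holdout_accuracy(shuffle=False)
print(f"shuffled: {shuffled_acc:.3f}  time-ordered: {ordered_acc:.3f}")
```

If the time-ordered score comes out clearly lower than the shuffled one, the shuffled estimate is likely optimistic, and the contiguous holdout is the safer number to report.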

Thanks for reviewing, and I am excited to see everyone's solutions!

In [70]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
import time


#start timer
start = time.time()

train_labels = pd.read_csv("train_labels.csv", index_col=0)
train_time_series = pd.read_csv("train_time_series.csv", index_col=0)
test_time_series = pd.read_csv("test_time_series.csv", index_col=0)
test_labels = pd.read_csv("test_labels.csv", index_col=0)

In [71]:
#Final test data set
test_time = test_time_series[['x','y','z']]

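#Since every tenth sample is labeled, extend the labels to the whole large dataset to have more data to work with.
#.copy() avoids pandas' SettingWithCopyWarning on the assignments below.
lrgset = train_time_series[['x','y','z']].copy()
lrgset['label'] = train_labels['label']
#Positional row ranges picked by inspecting the labeled samples; rows between ranges stay unlabeled.
label_col = lrgset.columns.get_loc('label')
lrgset.iloc[0:100, label_col] = 1.0
lrgset.iloc[105:980, label_col] = 2.0
lrgset.iloc[1000:1350, label_col] = 4.0
lrgset.iloc[1380:1530, label_col] = 1.0
lrgset.iloc[1540:1870, label_col] = 3.0
lrgset.iloc[1880:2350, label_col] = 2.0
lrgset.iloc[2360:2890, label_col] = 3.0
lrgset.iloc[2900:3660, label_col] = 2.0
lrgset.iloc[3670:3760, label_col] = 4.0
#Drop the rows that are still unlabeled.
lrgset = lrgset[pd.notnull(lrgset['label'])]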

#Splitting features from labels and resetting the index, keeping the large dataset isolated from X,y.
X = lrgset[['x','y','z']].reset_index(drop=True)
y = lrgset[['label']].reset_index(drop=True)

#At rest the y axis reads -1 (1 g from gravity). Adding 1 removes that offset so y is centered near zero.
X['y'] = X.y + 1

#Same actions taken on the final test data as above (index reset first so the assignment works on a fresh frame)
test_time = test_time.reset_index(drop=True)
test_time['y'] = test_time.y + 1

In [72]:
def data_rolling_mean(X,y,windows,remove_nan=True):
    '''Smooth dataset X with a rolling mean. 'y' holds the labels for X.

    'windows' is a list of integer window sizes. The rolling mean is applied
    once per entry, each pass running on the output of the previous pass.
    The NaNs each pass leaves at the beginning of the data are replaced with
    the first complete window mean. If 'remove_nan' is True (default), any
    rows that still contain NaNs are dropped from X and y. If False, X and y
    are returned untouched together with the row positions of the NaNs.'''
    for win in windows:
        X_mean = X.rolling(win).mean()
        #rows 0..win-2 are NaN; fill them from row win-1, the first full window
        X_mean.iloc[:win-1] = X_mean.iloc[win-1].to_numpy()
        X = X_mean
    #row positions that still hold a NaN in any column
    nans = np.flatnonzero(X.isna().any(axis=1))
    if not remove_nan:
        return(X,y,nans)
    return(X.drop(X.index[nans]),y.drop(y.index[nans]))

def data_rolling_mean_test(X,windows,remove_nan=True):
    '''Same as data_rolling_mean, except that it does not expect y. Used for the final test data.'''
    for win in windows:
        X_mean = X.rolling(win).mean()
        #rows 0..win-2 are NaN; fill them from row win-1, the first full window
        X_mean.iloc[:win-1] = X_mean.iloc[win-1].to_numpy()
        X = X_mean
    #row positions that still hold a NaN in any column
    nans = np.flatnonzero(X.isna().any(axis=1))
    if not remove_nan:
        return(X,nans)
    return(X.drop(X.index[nans]))

In [73]:
#Passing train data and test data into mean functions from above.  X_known,y_known is the data to be trained with.

X_known,y_known= data_rolling_mean(X,y,[80,20],remove_nan=True)
X_test = data_rolling_mean_test(test_time,[80,20],remove_nan=True)

#Adding extra columns to help the classifier identify label "1".  Threshold value based on visually inspecting data.
#Each flag is True where the smoothed x value does not exceed the threshold.
#Direct assignment replaces the chained inplace .where() calls, which newer pandas versions may not apply to the frame.
for frame in (X_test, X_known):
    for col in ('xones','yones','zones'):
        frame[col] = ~(frame.x > .090)


Below is the training cell. After tweaking some of the variables and trying different classifiers, the accuracy seems to top out at about 96-97%. I suspect that some overfitting is present.

In [74]:
#This cell is for training the classifier only. 

xtrain,xtest,ytrain,ytest = train_test_split(X_known,y_known,train_size=.7, shuffle=True)

#Init classifiers
knn = KNeighborsClassifier(n_neighbors=10)
rfc = RandomForestClassifier(n_estimators=1000, n_jobs=-1, criterion='entropy')

#List of estimators used in the stack.
estimators = [('rf',rfc),('svr', make_pipeline(StandardScaler(),LinearSVC(random_state=42)))]

#Init the stacking classifier with estimators from above
stack_clf = StackingClassifier(estimators=estimators, final_estimator = knn)

#Fit, predict, and print accuracy. ravel() passes the labels as the 1-D array scikit-learn expects.
stack_clf.fit(xtrain,ytrain.values.ravel())
p = stack_clf.predict(xtest)

print(accuracy_score(ytest, p))


0.9598173515981735


In [75]:
#This cell is the final test.  

#Re-init the classifiers
knn = KNeighborsClassifier(n_neighbors=10)

#Note: this forest uses the default criterion and a fixed seed, unlike the entropy forest in the training cell.
estimators = [('rf', RandomForestClassifier(n_estimators=1000, random_state=42)),('svr', make_pipeline(StandardScaler(),LinearSVC(random_state=42)))]
stack_clf = StackingClassifier(estimators=estimators, final_estimator = knn)

#Fit and predict. ravel() passes the labels as the 1-D array scikit-learn expects.
stack_clf.fit(X_known,y_known.values.ravel())
prediction = stack_clf.predict(X_test)


print("The total time for the code to complete was:",round(time.time()-start, ndigits=2), "seconds")



The total time for the code to complete was: 42.3 seconds


In [76]:
#Saving every 10th test prediction to the test_labels dataframe and saving as csv.
prediction = np.array(prediction)
test_labels['label'] = prediction[::10]
#A relative path resolves against the working directory, so os.path.abspath (os was never imported) is not needed.
test_labels.to_csv('test_labels.csv')