# Logistic Regression to Classify Terrain by IMU and Odometry Data from TurtleBot3
### By Jacob Laframboise, Jack Demeter
Logistic regression performs well when the data is randomly split into train and test sets (accuracy in the high 90s, in percent), but it struggles when the split is based on the trial number in which the data was collected. This further supports the hypothesis that data from a single run is more similar to itself than to other data from the same terrain.
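
As a minimal sketch of the two splitting strategies contrasted above, the following uses synthetic data (not the TurtleBot3 recordings) in which each run carries its own offset. `GroupShuffleSplit` is one possible way to hold out whole runs; the run count, offsets, and variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 8 runs alternating between 2 terrains,
# each run shifted by its own random offset (a "run signature").
n_runs, n_per_run, n_features = 8, 200, 4
X_parts, y_parts, g_parts = [], [], []
for run in range(n_runs):
    offset = rng.normal(scale=3.0, size=n_features)
    X_parts.append(offset + rng.normal(size=(n_per_run, n_features)))
    y_parts.append(np.full(n_per_run, run % 2))  # terrain label
    g_parts.append(np.full(n_per_run, run))      # run identifier
X = np.vstack(X_parts)
y = np.concatenate(y_parts)
groups = np.concatenate(g_parts)

# Random split: samples from every run appear in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
random_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Grouped split: whole runs are held out, so no run is shared.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))
grouped_acc = (
    LogisticRegression()
    .fit(X[train_idx], y[train_idx])
    .score(X[test_idx], y[test_idx])
)

print("random split accuracy: ", random_acc)
print("grouped split accuracy:", grouped_acc)
```

Comparing the two printed scores on real data is one way to see how much of the random-split accuracy depends on run-specific signatures rather than on terrain.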



In [3]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

from sklearn.feature_selection import SelectKBest, chi2

import plotly as ply
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
%matplotlib notebook

pd.set_option('display.max_columns', 120)
pd.set_option('display.max_rows', 80)

## Data Pre-Processing
We load the individual CSV files that were recorded with rosbag on the TurtleBot3 running ROS.
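
The file names listed in this notebook appear to follow the pattern `g<Terrain>_s<speed>_t<trial>.csv`. As a hedged sketch, a regular expression can pull out the three fields in one step and keeps multi-digit trial numbers such as 10 whole. The pattern and the helper name `parse_filename` are assumptions for illustration, not part of the original notebook.

```python
import re

# Hypothetical pattern inferred from names such as "gTurf_s15_t10.csv".
FILENAME_PATTERN = re.compile(
    r"^g(?P<terrain>[A-Za-z]+)_s(?P<speed>\d+)_t(?P<trial>\d+)\.csv$"
)


def parse_filename(name):
    """Return (terrain, speed, trial) parsed from a data file name."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError("Unexpected file name: {}".format(name))
    return match["terrain"], int(match["speed"]), int(match["trial"])


print(parse_filename("gMitTile_s15_t8.csv"))
print(parse_filename("gTurf_s15_t10.csv"))
```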

In [3]:
# Set dataFolder path to data file location
dataFolder = r"C:\Users\jaker\Documents\Experiment3Data-2019-11-21"

dataFiles = [
    r"gMitTile_s15_t8.csv",
    r"gMitTile_s15_t9.csv",
    r"gMitTile_s15_t10.csv",
    r"gTurf_s15_t3.csv",
    r"gTurf_s15_t4.csv",
    r"gTurf_s15_t5.csv",
    r"gTurf_s15_t6.csv",
    r"gTurf_s15_t7.csv",
    r"gTurf_s15_t8.csv",
    r"gTurf_s15_t9.csv",
    r"gTurf_s15_t10.csv",
    r"gArcTile_s15_t3.csv",
    r"gArcTile_s15_t4.csv",
    r"gArcTile_s15_t5.csv",
    r"gArcTile_s15_t6.csv",
    r"gArcTile_s15_t7.csv",
    r"gArcTile_s15_t8.csv",
    r"gArcTile_s15_t9.csv",
    r"gArcTile_s15_t10.csv",
    r"gCarp_s15_t3.csv",
    r"gCarp_s15_t4.csv",
    r"gCarp_s15_t5.csv",
    r"gCarp_s15_t6.csv",
    r"gCarp_s15_t7.csv",
    r"gCarp_s15_t8.csv",
    r"gCarp_s15_t9.csv",
    r"gCarp_s15_t10.csv",
    r"gMitTile_s15_t3.csv",
    r"gMitTile_s15_t4.csv",
    r"gMitTile_s15_t5.csv",
    r"gMitTile_s15_t6.csv",
    r"gMitTile_s15_t7.csv"
]

savePath = "Data-32Series-Delta30-Squared.csv"

In [6]:
""" 
For each data file we:
adjust the index, 
interpolate NaN values,
drop remaining NaN values, 
drop some empty columns.

We then augment the feature space with delta columns, 
and with polynomial columns,
and label the columns with terrain, speed, and trial number
"""
for i in range(len(dataFiles)):
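    # get terrain/speed/trial from a file name such as gTurf_s15_t3.csv
    terrain = dataFiles[i].split('_')[0][1:]
    speed = dataFiles[i].split('_s')[1][:2]
    # NOTE: [0] keeps only the first character after '_t', so trial 10 is read as 1
    trial = dataFiles[i].split('_t')[1][0]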

    df = pd.read_csv(os.path.join(dataFolder, dataFiles[i]))
    df = df.rename(columns={'Unnamed: 0': 'Seq'})
    df = df.set_index('Seq')

    # fill gaps by interpolation (upsampling); method='polynomial' with order=1 is linear and requires SciPy
    df = df.interpolate(method='polynomial', order=1)

    # remove incomplete entries
    df = df.dropna()

    # reset data to be ordered based on Sequence
    df = df.reset_index().drop(columns=['Seq'])

    
    df = df.drop(columns=['OdomPosZ', 'OdomOrientX', 'OdomOrientY', 'OdomLinY', 'OdomLinZ', 'OdomAngX', 'OdomAngY'])
    # use the XY magnitude to remove unique run IDs
    df['OdomPosXY'] = np.sqrt(df.OdomPosX**2 + df.OdomPosY**2)
    df = df.drop(columns=['OdomPosY', 'OdomPosX'])
    
    # order is the highest exponent applied to the delta columns so that LR can find
    # higher-order polynomial patterns; larger values increase memory and computation time
    order = 2
    # dList gives the lags (how many data points to look back) used for each difference
    dList = range(1, 302, 20)
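    # df.columns.tolist() is evaluated once, so the new delta columns are not re-processed
    for col in df.columns.tolist():
        if col != 'Sensor':
            for d in dList:
                df[col+'Delta{}'.format(d)] = df[col].diff(d)
                # range(2, order+1) is empty when order is 1, so no extra check is needed
                for p in range(2, order+1):
                    df[col+'Delta{}Exp{}'.format(d, p)] = df[col+'Delta{}'.format(d)]**p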

    df = df.reset_index().drop(columns=['index'])
    df = df.drop(columns=['Sensor', 'Time'])
    df['Speed']=int(speed)
    df['Terrain']=terrain
    df['Trial']=int(trial)

    if i==0:
        mainDf = df.copy(deep=True)
    else:
        mainDf = pd.concat([mainDf, df], axis=0, sort=False)
print("Data series completed.")
print("MainDf is now size {}".format(mainDf.shape))


Added gMitTile_s15_t8.csv of size (3831, 18) to mainDf. 
Data series completed: 1/32
MainDf is now size (3831, 18)


Added gMitTile_s15_t9.csv of size (4331, 18) to mainDf. 
Data series completed: 2/32
MainDf is now size (8162, 18)


Added gMitTile_s15_t10.csv of size (4853, 18) to mainDf. 
Data series completed: 3/32
MainDf is now size (13015, 18)


Added gTurf_s15_t3.csv of size (6101, 18) to mainDf. 
Data series completed: 4/32
MainDf is now size (19116, 18)


Added gTurf_s15_t4.csv of size (5840, 18) to mainDf. 
Data series completed: 5/32
MainDf is now size (24956, 18)


Added gTurf_s15_t5.csv of size (3352, 18) to mainDf. 
Data series completed: 6/32
MainDf is now size (28308, 18)


Added gTurf_s15_t6.csv of size (5253, 18) to mainDf. 
Data series completed: 7/32
MainDf is now size (33561, 18)


Added gTurf_s15_t7.csv of size (5251, 18) to mainDf. 
Data series completed: 8/32
MainDf is now size (38812, 18)


Added gTurf_s15_t8.csv of size (6668, 18) to mainDf. 
Data series comple
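
The delta and polynomial augmentation performed in the loop above can be illustrated on a toy frame. This is a sketch only: the column values and the two lags are made up, while the column-naming scheme mirrors the one used in the notebook.

```python
import pandas as pd

# Toy signal standing in for a single sensor column.
toy = pd.DataFrame({"ImuAccelX": [0.0, 1.0, 3.0, 6.0, 10.0]})

order = 2       # highest exponent applied to each delta column
lags = (1, 2)   # how many samples to look back

for d in lags:
    delta_col = "ImuAccelXDelta{}".format(d)
    # diff(d) subtracts the value d rows earlier from the current value
    toy[delta_col] = toy["ImuAccelX"].diff(d)
    for p in range(2, order + 1):
        toy["{}Exp{}".format(delta_col, p)] = toy[delta_col] ** p

print(toy)
```

The first `d` rows of each `Delta{d}` column are NaN because there is no earlier sample to subtract, so such rows need to be handled before a model is fitted.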

In [7]:
# work on a copy so mainDf does not have to be rebuilt while experimenting;
# the copy doubles memory usage, so remove it if memory is tight
df = mainDf.copy(deep=True)

In [8]:
# sample output of the data
df.head(10)

Unnamed: 0,OdomOrientZ,OdomOrientW,OdomLinX,OdomAngZ,ImuOrientX,ImuOrientY,ImuOrientZ,ImuOrientW,ImuAngVelX,ImuAngVelY,ImuAngVelZ,ImuAccelX,ImuAccelY,ImuAccelZ,OdomPosXY,Speed,Terrain,Trial
0,0.093628,0.995607,0.149563,0.001384,-0.012936,0.003677,-0.102512,-0.994637,0.014367,-0.002661,0.006917,0.521337,0.494403,10.55394,18.812442,15,MitTile,8
1,0.09371,0.995599,0.149276,0.005066,-0.013177,0.003931,-0.10253,-0.994631,0.019156,0.017027,0.008514,0.788291,0.634463,9.111135,18.81238,15,MitTile,8
2,0.093793,0.995592,0.148988,0.008748,-0.013698,0.004072,-0.10257,-0.994619,0.02022,0.026605,0.014899,0.219069,0.453701,10.493187,18.812318,15,MitTile,8
3,0.093875,0.995584,0.1487,0.012429,-0.012621,0.002504,-0.102654,-0.994628,0.015431,0.03033,0.01277,-0.061052,0.196325,10.284293,18.812256,15,MitTile,8
4,0.093843,0.995587,0.148514,0.00875,-0.011543,0.000936,-0.102737,-0.994637,0.010642,0.034054,0.010642,-0.341174,-0.061052,10.075399,18.812234,15,MitTile,8
5,0.093811,0.99559,0.148328,0.00507,-0.011305,0.000417,-0.102711,-0.994643,-0.009578,0.015963,-0.008514,-0.915184,-0.056264,10.979809,18.812212,15,MitTile,8
6,0.093779,0.995593,0.148142,0.00139,-0.011305,0.000417,-0.102711,-0.994643,-0.009578,0.015963,-0.008514,-0.915184,-0.056264,10.979809,18.81219,15,MitTile,8
7,0.093746,0.995596,0.147956,-0.00229,-0.011117,-0.000264,-0.102601,-0.994656,-0.007449,-0.004257,-0.028733,-1.149217,0.062848,9.466673,18.812168,15,MitTile,8
8,0.093714,0.995599,0.14777,-0.00597,-0.011083,-0.000845,-0.102489,-0.994668,0.009578,-0.037247,-0.028733,-0.572214,0.187945,8.793902,18.812146,15,MitTile,8
9,0.093682,0.995602,0.147584,-0.00965,-0.010887,-4.6e-05,-0.102468,-0.994672,0.026605,-0.045761,-0.007449,0.708085,-0.201113,10.21127,18.812124,15,MitTile,8


In [75]:
# check that no NaN values remain and that the shape is as expected
print(df.isnull().sum().sum()) # check for NaN
df.shape

0


(145353, 1506)

### Feature engineering: elimination

In [76]:
# eliminate features that may ID individual runs improperly
# only retain delta data, speed, and terrain(the label)
columnsToDrop = [x for x in df.columns.tolist() if 'Time' in x or ('Delta' not in x and 'Speed' not in x and 'Terrain' not in x and 'Trial' not in x) ]
df = df.drop(columns=columnsToDrop)

In [77]:
# Merge Carp into Turf and ArcTile into MitTile so that similar terrains share a label,
# leaving two broader categories to classify
df['Terrain'] = df['Terrain'].replace({'Carp':'Turf', 'ArcTile':'MitTile'})

### Speed filtering

In [79]:
df = df[df['Speed']==15]

### Logistic Regression
Now that we have processed the data, we can apply logistic regression to classify.

In [80]:
# split by trial: train on Trial values below 9, test on Trial values of 9 and above
dfTrain = df[df['Trial']<9]
dfTest = df[df['Trial']>=9]

Y_train = dfTrain['Terrain']
Y_test = dfTest['Terrain']

X_train = dfTrain.drop(columns=['Terrain', 'Speed', 'Trial'])
X_test = dfTest.drop(columns=['Terrain', 'Speed', 'Trial'])

### Feature engineering: Kbest features

In [81]:
# keep only the k best features in an attempt to reduce computation time
featureCount = 30
test = SelectKBest(k=featureCount)  # default score function is f_classif (ANOVA F-value)
fit = test.fit(X_train, Y_train)

# print the name of the single highest-scoring feature
print(X_test.columns.tolist()[fit.scores_.argmax()])

# apply the selector fitted on the training data to both sets
X_train = fit.transform(X_train)
X_test = fit.transform(X_test)

ImuAngVelXDelta21Exp2
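
As a small illustration of how `SelectKBest` ranks features, the sketch below builds synthetic data with one informative column and two noise columns, then recovers the selected column names with `get_support()`. The data and column names are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Two balanced classes; only "informative" depends on the label.
y = np.repeat(["Turf", "MitTile"], 100)
X = pd.DataFrame({
    "informative": np.where(y == "Turf", 1.0, -1.0)
    + rng.normal(scale=0.5, size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})

# f_classif (ANOVA F-value) is the default score function of SelectKBest.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

# get_support() returns a boolean mask over the input columns.
selected = X.columns[selector.get_support()].tolist()
best = X.columns[selector.scores_.argmax()]

print("selected:", selected)
print("highest score:", best)
```

Reading the names through `get_support()` before calling `transform` is useful because `transform` returns a plain array without column labels.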


In [83]:
# check proper sizing
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(123978, 30)
(123978,)
(21375, 30)
(21375,)


## Model running and accuracy exploration
During our testing, the model reached an accuracy of approximately 73.37% when the data was split into two categories. Given that random guessing would result in ~50% accuracy, the model did succeed, but not well enough for the team to explore the additional applications that were initially intended for the data. Exploration with neural networks (NN) and other methods gave similar results.
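
The ~50% chance level holds when the two classes are balanced. One hedged way to check the reference point is a majority-class baseline, sketched here with `DummyClassifier` on made-up labels (the 60/40 class ratio is an assumption, not a property of this dataset).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with a 60/40 class ratio in both train and test.
y_train = np.array(["Turf"] * 60 + ["MitTile"] * 40)
y_test = np.array(["Turf"] * 30 + ["MitTile"] * 20)

# DummyClassifier ignores the features, so placeholders are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

print("majority-class baseline accuracy:", baseline_acc)
```

If the baseline on the real split were noticeably above 0.5, the margin achieved by the model would be correspondingly smaller than it first appears.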

In [84]:
model = LogisticRegression(solver = "lbfgs")
model.fit(X_train, Y_train)





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [85]:
print("Accuracy on test set is: {}".format(model.score(X_test, Y_test)))

Accuracy on test set is: 0.7336608187134503
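
A single accuracy figure does not show which terrain is being misclassified. As a sketch, `metrics.confusion_matrix` breaks the result down per class; the labels and predictions below are toy values, not output from this model.

```python
import numpy as np
from sklearn import metrics

# Toy ground truth and predictions for the two grouped terrain labels.
y_true = np.array(["Turf", "Turf", "Turf", "MitTile", "MitTile", "MitTile"])
y_pred = np.array(["Turf", "Turf", "MitTile", "MitTile", "MitTile", "Turf"])

labels = ["MitTile", "Turf"]
# Rows are true labels and columns are predicted labels, in `labels` order.
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
acc = metrics.accuracy_score(y_true, y_pred)

print(cm)
print("accuracy:", acc)
```

With the notebook's variables, the same call would take `Y_test` and `model.predict(X_test)` as its first two arguments.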
