# Classification Notes

Here I'll work through my reasoning from choosing features to training a model on the data.

Possible features:

- crowLength
- pathCrowRatio
- coveredArea
- windowArea
- areaPerUnitL
- areaPerUnitT
- hurst
- DFA
- angleDensS
- angleDensT
- timeSpent
- corrDim

Possible labels:

- transMode

Could try and use some of the features as labels instead?

Ideally I would put all of these into a table with each row representing a trajectory. This might be very large, though, so let's check.

In [2]:
%load_ext autoreload
%autoreload 2
import warnings; warnings.simplefilter('ignore')

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('../Metadata/Inventory.csv')

In [7]:
len(df)

13395

So we have roughly 13,400 trajectories, which is manageable. Now we need to write a script that will populate a CSV with the features and labels for each trajectory. This may take a while.
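A minimal sketch of what such a script could do for two of the simpler features. The definitions used here are assumptions (crowLength as the straight-line distance between the first and last point, pathCrowRatio as path length divided by that distance), and `basic_features`, the toy coordinates, and the output file name are hypothetical.

```python
import numpy as np
import pandas as pd

def basic_features(points, label):
    """Compute two simple features for one trajectory given as an (n, 2) array of planar coordinates."""
    points = np.asarray(points, dtype=float)
    # Length of each step between consecutive points, summed into the path length.
    step_lengths = np.linalg.norm(np.diff(points, axis=0), axis=1)
    path_length = step_lengths.sum()
    # Assumed definition: straight-line ("as the crow flies") distance from start to end.
    crow_length = np.linalg.norm(points[-1] - points[0])
    ratio = path_length / crow_length if crow_length > 0 else np.nan
    return {'crowLength': crow_length, 'pathCrowRatio': ratio, 'transMode': label}

# Toy trajectories, not real data.
toy_trajectories = [
    (np.array([[0, 0], [3, 0], [3, 4]]), 'walk'),
    (np.array([[0, 0], [10, 0], [20, 0]]), 'bus'),
]

# One row per trajectory, one column per feature or label.
table = pd.DataFrame([basic_features(pts, mode) for pts, mode in toy_trajectories])
print(table)
# table.to_csv('features.csv', index=False)  # hypothetical output file
```

Building a list of dicts and converting it once keeps the per-trajectory function independent of pandas, which makes it easy to add further feature columns later.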

I've discarded the DFA feature because it was too buggy and I'm not sure how much it really adds.

The featExtract script seems to be working nicely, giving me the CSV I need. I've had to filter (arbitrarily) to trajectories that have at least 20 points, last between 0.5 and 60 minutes, and are longer than 20 m. The lower bounds remove the noisy short trajectories, while the upper bound on duration reduces computation time on the longer trajectories (it could be relaxed with access to more computing power).
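The filter could be expressed as a boolean mask along these lines. This is only a sketch: it assumes the columns `Point Count`, `Duration`, and `Length` hold the point count, the duration in minutes, and the path length in metres, and the toy values are made up.

```python
import pandas as pd

# Toy stand-in for the per-trajectory table (values are made up).
toy = pd.DataFrame({
    'Point Count': [5, 25, 300, 40, 20],
    'Duration': [0.2, 3.0, 75.0, 10.0, 0.5],        # assumed to be in minutes
    'Length': [8.0, 150.0, 9000.0, 15.0, 500.0],    # assumed to be in metres
})

keep = (
    (toy['Point Count'] >= 20)   # no fewer than 20 points
    & (toy['Duration'] > 0.5)    # strictly longer than 0.5 min
    & (toy['Duration'] < 60)     # strictly shorter than 60 min
    & (toy['Length'] > 20)       # longer than 20 m
)
filtered = toy.loc[keep].reset_index(drop=True)
print(filtered)
```

Only the second toy row passes all four conditions; the last row fails because its duration sits exactly on the 0.5 minute boundary.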

I have decided to remove the correlation dimension feature as I was unsure that it was valid on this time-series data and it also created a large computational burden. Random Forest analysis on a small dataset showed it also to be a weak predictor of mode of transport. I've also increased the efficiency of the angle-density measures.

I have converted the trajectories into 32x32 histograms, which I intend to use with a CNN to see if I get any interesting results. My guess is that it will perform poorly, but it'll be interesting to see.
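These notes don't record exactly how the histograms were built, so the following is only a sketch of one way to rasterise a trajectory. It assumes the grid spans the trajectory's own bounding box and that counts are normalised to sum to one. The helper name `trajectory_to_histogram` and the random-walk input are hypothetical.

```python
import numpy as np

def trajectory_to_histogram(xs, ys, bins=32):
    """Bin a trajectory's points into a bins x bins grid spanning its bounding box."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Without an explicit range, histogram2d uses the min/max of the data on each axis.
    hist, _, _ = np.histogram2d(xs, ys, bins=bins)
    total = hist.sum()
    # Normalise so trajectories with different point counts are comparable.
    return hist / total if total > 0 else hist

rng = np.random.default_rng(0)
path = rng.normal(size=(200, 2)).cumsum(axis=0)  # synthetic random walk, not real data
img = trajectory_to_histogram(path[:, 0], path[:, 1])
print(img.shape, img.sum())
```

Because each trajectory is scaled to its own bounding box, absolute size is lost in the image; that information would have to come from features such as crowLength.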

The updated list of features now reads:

- crowLength
- pathCrowRatio
- coveredArea
- windowArea
- areaPerUnitL
- areaPerUnitT
- hurst
- angleDensS
- angleDensT
- timeSpent

## Data Preprocessing

In [4]:
%load_ext autoreload
%autoreload 2

In [9]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

seed = 20

In [10]:
df = pd.read_csv('../Metadata/trajFeatures.csv')
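# keep labelled trajectories only; copy so the assignment below acts on an independent frame
df = df.loc[df['Label-state'] != 'Unlabelled'].copy()
df.loc[df['Mode of Transport']=='taxi','Mode of Transport'] = 'car' # group taxis and cars

# drop the leftover index columns that pandas names 'Unnamed: ...'
unnamed = [column for column in df.columns if 'Unnamed' in column]
df = df.drop(columns=unnamed)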

In [11]:
modes = np.array(df['Mode of Transport'])

# Encoding modes of transport from here: bit.ly/2LdtVjV (see here also for inverse encoding)
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(modes)

In [13]:
print(modes)

['train' 'train' 'walk' ... 'bus' 'walk' 'subway']
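The inverse encoding mentioned in the comment above can be done with `LabelEncoder.inverse_transform`. A small sketch on toy labels (the array below is made up, not the real `modes`):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

toy_modes = np.array(['train', 'walk', 'bus', 'walk', 'subway'])  # toy labels

enc = LabelEncoder()
codes = enc.fit_transform(toy_modes)

# classes_ holds the class names in sorted order; a label's integer code is its index here.
print(enc.classes_)
print(codes)

# Map integer codes (e.g. model predictions) back to the original strings.
decoded = enc.inverse_transform(codes)
print(decoded)
```

Keeping the fitted encoder around is what makes it possible to turn predicted integers back into mode names later.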


# Models

In [7]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [39]:
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

## Random Forest

In [40]:
from sklearn.ensemble import RandomForestClassifier

Through experimentation I found the optimal features to include/remove.

In [41]:
feature_drop = ['Mode of Transport','Path','Label-state', 'Point Count','Duration','Path-Crow Ratio','Covered Area','Area/Length','Hurst Exponent','Length']
features = list(df.drop(feature_drop, axis=1).columns)

In [42]:
X = np.array(df.drop(feature_drop, axis=1))
Y = integer_encoded

In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, Y)

In [44]:
results = cross_val_score(clf, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
print(pd.DataFrame({'Feature Importance':clf.feature_importances_, 'Feature':features}).loc[:,('Feature','Feature Importance')])

Accuracy: 72.02% (1.75%)
                Feature  Feature Importance
0           Crow Length            0.240932
1           Window Area            0.139624
2             Area/Time            0.368523
3  Turning-angle/Length            0.168154
4    Turning-angle/Time            0.000000
5            Mean Speed            0.082766


## Scikit Learn Gradient Boost

In [45]:
from sklearn.ensemble import GradientBoostingClassifier

In [46]:
feature_drop = ['Mode of Transport','Path','Label-state', 'Point Count', 'Duration','Length', 'Turning-angle/Time','Hurst Exponent']
features = list(df.drop(feature_drop, axis=1).columns)

In [47]:
X = np.array(df.drop(feature_drop, axis=1))

In [None]:
clf = GradientBoostingClassifier(max_depth=2, random_state=0)
clf.fit(X, Y)

In [49]:
results = cross_val_score(clf, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
print(pd.DataFrame({'Feature Importance':clf.feature_importances_, 'Feature':features}).loc[:,('Feature','Feature Importance')])

Accuracy: 81.91% (1.36%)
                Feature  Feature Importance
0           Crow Length            0.196559
1       Path-Crow Ratio            0.136586
2          Covered Area            0.115823
3           Window Area            0.095270
4           Area/Length            0.103683
5             Area/Time            0.126515
6  Turning-angle/Length            0.097602
7            Mean Speed            0.082960


## XGBoost

In [55]:
from xgboost import XGBClassifier

In [56]:
feature_drop = ['Mode of Transport','Path','Label-state', 'Point Count', 'Duration','Length', 'Turning-angle/Time','Hurst Exponent']
features = list(df.drop(feature_drop, axis=1).columns)

In [58]:
X = np.array(df.drop(feature_drop, axis=1))

In [59]:
clf = XGBClassifier()
clf.fit(X, Y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [62]:
results = cross_val_score(clf, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
print(pd.DataFrame({'Feature Importance':clf.feature_importances_, 'Feature':features}).loc[:,('Feature','Feature Importance')])

Accuracy: 81.98% (1.44%)
                Feature  Feature Importance
0           Crow Length            0.130764
1       Path-Crow Ratio            0.153085
2          Covered Area            0.094336
3           Window Area            0.121289
4           Area/Length            0.142135
5             Area/Time            0.144241
6  Turning-angle/Length            0.130975
7            Mean Speed            0.083175


## CNN

### 32 x 32

In [13]:
# Modified from: bit.ly/2P2S9iZ

import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.optimizers import Adam
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.layers import Conv2D, MaxPooling2D, ZeroPadding2D, GlobalAveragePooling2D
from keras.layers.advanced_activations import LeakyReLU 
from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

Using TensorFlow backend.


In [26]:
X = np.load('../Metadata/labelled_traj_imgs_32x32_X.npy')
modes = np.load('../Metadata/labelled_traj_imgs_32x32_Y.npy')

In [27]:
modes[np.where(modes=='taxi')] = 'car'

In [28]:
integer_encoded = label_encoder.fit_transform(modes)
# One hot encoding
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
Y = onehot_encoder.fit_transform(integer_encoded)

In [29]:
# Reshape input data for TensorFlow
X = X.reshape(X.shape[0],32,32,1).astype('float32')

In [32]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=seed)

In [37]:
model = Sequential()

model.add(Conv2D(32, (3, 3), input_shape=(32,32,1)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(64,(3, 3)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Flatten())

# Fully connected layer
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))

model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

gen = ImageDataGenerator(rotation_range=8, width_shift_range=0.08, shear_range=0, height_shift_range=0.08, zoom_range=0.08)

test_gen = ImageDataGenerator()

train_generator = gen.flow(X_train, Y_train, batch_size=64)
test_generator = test_gen.flow(X_test, Y_test, batch_size=64)

In [38]:
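# one epoch = one pass over the training set; the validation steps cover the test split once
model.fit_generator(train_generator, steps_per_epoch=len(X_train)//64, epochs=5, validation_data=test_generator, validation_steps=len(X_test)//64)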

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x131d0ed30>

This looks like it has overfitted: the performance on the test data is very poor.

We could try increasing the resolution of the images, since 32x32 is fairly coarse for resolving the finer trajectory details. It also might not make sense to use data augmentation, since we are working with strictly planar trajectories (there is no perspective change).

What's more, taking a purely frequentist approach, the probability of being correct is:

$$p(correct) = p(Y=car\mid X=car)p(X=car)+p(Y=bus \mid X=bus)p(X=bus)+...$$

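Here $X$ is the true mode and $Y$ the predicted one. If we simply guess, drawing each prediction from the observed class frequencies, then $Y$ is independent of $X$ and $p(Y=c \mid X=c) = p(Y=c) = p(X=c)$, since we gain no information from $X$. Therefore the probability reduces to: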

$$p(correct) = \sum_{x \in X}p(x)^{2}$$

In [43]:
np.square(np.unique(modes, return_counts=True)[1]/len(Y)).sum()

0.25445312460855696

Guessing from the class frequencies alone would already be right about 25% of the time, so our CNN model is even poorer than it first seemed.
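As a sanity check on the formula, here is a short sketch with hypothetical class counts (not the real label distribution). It compares the sum of squared frequencies against a simulation of proportional guessing, and also against the simpler baseline of always predicting the most common class, which is never worse because $\sum_x p(x)^2 \le \max_x p(x)$.

```python
import numpy as np

# Hypothetical class counts, for illustration only.
counts = np.array([400, 300, 200, 100])
p = counts / counts.sum()

# Guessing each label independently with probability p(x): sum of p(x)^2.
proportional_guess = np.square(p).sum()

# Always predicting the most common class: max p(x).
majority_guess = p.max()

# Monte Carlo check of the proportional-guess formula.
rng = np.random.default_rng(0)
truth = rng.choice(len(p), size=200_000, p=p)
guess = rng.choice(len(p), size=200_000, p=p)
simulated = (truth == guess).mean()

print(proportional_guess, majority_guess, simulated)
```

With these toy counts the proportional guess is right 30% of the time and the majority guess 40% of the time, so the majority-class rate is the stricter baseline to beat.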

Could also try removing the point density as a feature and populate the pixels in a binary fashion purely based on occupancy.
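A sketch of the binary-occupancy idea, assuming the images are arrays of non-negative point counts or densities. The tiny 4x4 arrays below are made up and stand in for the real 32x32 histograms.

```python
import numpy as np

# Toy batch of two 4x4 "density" images (made-up values).
density = np.array([
    [[0, 3, 0, 0], [0, 1, 2, 0], [0, 0, 5, 0], [0, 0, 0, 1]],
    [[2, 0, 0, 0], [2, 0, 0, 0], [0, 7, 0, 0], [0, 0, 0, 0]],
], dtype='float32')

# 1.0 where at least one point fell in the cell, 0.0 elsewhere.
occupancy = (density > 0).astype('float32')
print(occupancy[0])
```

This keeps only the shape of the path and discards how long was spent in each cell, which would show whether the density information helps or hurts the CNN.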

### 64 x 64

Let's try with the higher resolution images. This is going to take longer.

In [14]:
X = np.load('../Metadata/labelled_traj_imgs_64x64_X.npy')
modes = np.load('../Metadata/labelled_traj_imgs_64x64_Y.npy')

modes[np.where(modes=='taxi')] = 'car'

integer_encoded = label_encoder.fit_transform(modes)
# One hot encoding
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
Y = onehot_encoder.fit_transform(integer_encoded)

In [16]:
# Reshape input data for TensorFlow
X = X.reshape(X.shape[0],64,64,1).astype('float32')

In [17]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=seed)

In [19]:
model = Sequential()

model.add(Conv2D(32, (3, 3), input_shape=(64,64,1)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(64,(3, 3)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Flatten())

# Fully connected layer
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))

model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

gen = ImageDataGenerator(rotation_range=8, width_shift_range=0.08, shear_range=0, height_shift_range=0.08, zoom_range=0.08)

test_gen = ImageDataGenerator()

train_generator = gen.flow(X_train, Y_train, batch_size=64)
test_generator = test_gen.flow(X_test, Y_test, batch_size=64)

In [20]:
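# one epoch = one pass over the training set; the validation steps cover the test split once
model.fit_generator(train_generator, steps_per_epoch=len(X_train)//64, epochs=5, validation_data=test_generator, validation_steps=len(X_test)//64)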

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x132860668>

Again, although the initial accuracy was better, there is obviously some overfitting. I will try increasing the dropout.

In [21]:
model = Sequential()

model.add(Conv2D(32, (3, 3), input_shape=(64,64,1)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(64,(3, 3)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(BatchNormalization(axis=-1))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Flatten())

# Fully connected layer
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))

model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

gen = ImageDataGenerator(rotation_range=8, width_shift_range=0.08, shear_range=0, height_shift_range=0.08, zoom_range=0.08)

test_gen = ImageDataGenerator()

train_generator = gen.flow(X_train, Y_train, batch_size=64)
test_generator = test_gen.flow(X_test, Y_test, batch_size=64)

In [22]:
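# one epoch = one pass over the training set; the validation steps cover the test split once
model.fit_generator(train_generator, steps_per_epoch=len(X_train)//64, epochs=5, validation_data=test_generator, validation_steps=len(X_test)//64)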

Epoch 1/5

KeyboardInterrupt: 