## ECE 579M ST: Machine Learning in Cybersecurity
### Project Four: Privacy Leakage Analysis

In this project, we are analyzing location data obtained from the mobile phone of a WPI employee by Google Maps data collection. This is of critical importance as it could be used to estimate future locations of an user from supposedly private information. 

The provided dataset is collected every 30 minutes over 5 weeks consisting of timestamps, latitude, longitude, accuracy (Google Maps confidence of location), and label (type of location such as home/restaurant etc.). The aim is to train a regression method on the first 4 weeks of data as a training set and use the last week (week 5) as a validation set. The final performance will be tested on an unseen week 6 test set. 

Since an accuracy metric together with ground-truth labels were provided, we could bring in a notion of **soft labeling** instead of one-hot labeling.

Additionally, while I really wanted to utilize a fancy algorithm or even something like Ensemble/XGBoost, maybe just utilizing a simplistic ML algorithm through a better understanding of the data is more apt. :D

```
Classification Accuracy: 100.0 % from 168 testing datapoints 
Mean-Squared Error amongst correct classifications: 1.196
Mean Predicted Probability is 73.998 % while mean given probability is 67.751 %
Predicted Mean probability is greater than the given mean probability. Good tuning.
```


 ---
## Step 0: Import required packages

In [1]:
## LIST OF ALL IMPORTS
import os
import glob
import math
import numpy as np
import scipy as sp
import time

from sklearn.externals import joblib
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeRegressor
from skmultilearn.problem_transform import BinaryRelevance
# from sklearn.svm import SVR ,NuSVR
# from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, BaggingRegressor, GradientBoostingRegressor

from utils import load_data,soft_labeling,sparse2array,argmax_custom,evaluate

---
## Step 1: Load Datasets & Basic Exploration of Dataset

In [2]:
## DATA PATHS
data_path='dataset/'

filenames=sorted(glob.glob(data_path+'week*',recursive=True))
print("Reading test and train data files.")

week1_features,week1_labels=load_data(filenames[0])
week2_features,week2_labels=load_data(filenames[1])
week3_features,week3_labels=load_data(filenames[2])
week4_features,week4_labels=load_data(filenames[3])
week5_features,week5_labels=load_data(filenames[4])

print("Loaded map data.")

Reading test and train data files.
Loaded map data.


In [3]:
print(week1_features)

[[   0.5       42.372    -71.903      0.94923]
 [   1.        42.372    -71.906      0.97863]
 [   1.5       42.373    -71.904      0.91212]
 ..., 
 [ 167.        43.678    -70.547      0.70833]
 [ 167.5       43.571    -70.125      0.87538]
 [ 168.        44.053    -70.582      0.75917]]


In [4]:
print("Training data consists of data from Weeks 1 through 4. Validation data is obtained from Week 5.\n")
print("Week One data has a shape of {} and been collected over {} hrs.".
      format(week1_features.shape,week1_features[-1,0]))
print("Week Two data has a shape of {} and been collected over {} hrs.".
      format(week2_features.shape,week2_features[-1,0]))
print("Week Three data has a shape of {} and been collected over {} hrs.".
      format(week3_features.shape,week3_features[-1,0]))
print("Week Four data has a shape of {} and been collected over {} hrs.".
      format(week4_features.shape,week4_features[-1,0]))
print("Week Five data has a shape of {} and been collected over {} hrs.".
      format(week5_features.shape,week5_features[-1,0]))

all_labels=np.vstack((week1_labels,week2_labels,week3_labels,week4_labels,week5_labels))
labels_unique,labels_count=np.unique(all_labels,return_counts=True)
print("\nThe labels are {} and with frequencies of {}.".format(labels_unique,labels_count))

Training data consists of data from Weeks 1 through 4. Validation data is obtained from Week 5.

Week One data has a shape of (336, 4) and been collected over 168.0 hrs.
Week Two data has a shape of (336, 4) and been collected over 168.0 hrs.
Week Three data has a shape of (336, 4) and been collected over 168.0 hrs.
Week Four data has a shape of (336, 4) and been collected over 168.0 hrs.
Week Five data has a shape of (336, 4) and been collected over 168.0 hrs.

The labels are [48 49 50 51 52] and with frequencies of [756  31 550 326  17].


This shows that the dataset is unbalanced as some occur relatively infrequently.

----

## Step 2: Dataset Modification

_**"scikit-multilearn: A scikit-based Python environment for performing multi-label classification"**_ is a paper and a library by Piotr Szymanski and Tomasz Kajdanowicz of the Wroclaw University of Technology and helps with multi-label classification. Since the features of the dataset consists of timestamps, longitude/latitude, accuracy, and labels [48,49,50,51,52], we can create soft-labels by associating the accuracy with the labels as follows:

>`Given-label--> Given Accuracy 
>Wrong-labels--> (1-Given Accuracy)/(n_classes-1)`


If a datapoint has a label of 48, and an accuracy of 94%, we can state the false labels (49,50,51,52) have a probability of (100-94)/4= 1.5% and means that this particular datapoint is of label 48 with accuracy 94%, label 49-52 with accuracy 1.5%. This soft-labeling approximation allows for multi-label classification using a multilearn wrapper around a scikit-learn classifier.

Additionally,if timesteps need to be used, an RNN network could be implemented as `RNN==sequential data`, but we could just ignore timesteps and have features of longitude/latitude only and predict based on accumulated data (but might be prone to overfitting). Furthermore, the dataset for each week will be shuffled and forced through k-fold cross-validation to bulk up the number of training-validation datapoints instead of just considering Week 5 alone to be a validation set.

A reason against to not use a neural network/RNN is the lack of features, so just simplistic classification will be used.

In [5]:
# [48 49 50 51 52]-->[0 1 2 3 4] for soft labels
n_classes=5
week1_hard_y,week1_soft_y=soft_labeling(week1_features,week1_labels,n_classes)
week2_hard_y,week2_soft_y=soft_labeling(week2_features,week2_labels,n_classes)
week3_hard_y,week3_soft_y=soft_labeling(week3_features,week3_labels,n_classes)
week4_hard_y,week4_soft_y=soft_labeling(week4_features,week4_labels,n_classes)
week5_hard_y,week5_soft_y=soft_labeling(week5_features,week5_labels,n_classes)

X_aggregate=np.vstack((week1_features[:,1:3],week2_features[:,1:3],week3_features[:,1:3],
                       week4_features[:,1:3],week5_features[0:168,1:3])) # Using just latitude and longitude data

Y_aggregate=np.vstack((week1_soft_y,week2_soft_y,week3_soft_y,week4_soft_y,week5_soft_y[168:])) # Aggregating soft labels

X_test=week5_features[168:,1:3]
Y_test_hard=week5_hard_y[168:]
Y_test_soft=week5_features[168:,-1] #week5_soft_y


X_aggregate_shuffle,Y_aggregate_shuffle=shuffle(X_aggregate,Y_aggregate,random_state=0) # Shuffling dataset
# X_test,Y_test=shuffle(X_test,Y_test)

print("Obtained final datasets.")

Obtained final datasets.


----

## Step 3: Classifier Creation

In [6]:
# Regression Decision Tree- MAE criterion 77.822 %
print("Regression Decision Tree- Mean Absolute Error criterion")
t0=time.clock()
base_classifier=DecisionTreeRegressor(criterion='mae',splitter='random', 
                                      max_depth=None,min_samples_split=2,
                                      random_state=0)

classifier = BinaryRelevance(classifier=base_classifier,
    require_dense = [False, True])
print(classifier,"\n")
classifier.fit(X_aggregate_shuffle,Y_aggregate_shuffle)
print("Fit Time {} s\n".format(round(time.clock()-t0,2)))
print("Saving loaded model.")
joblib.dump(classifier,'trained_DTR.pkl')

Regression Decision Tree- Mean Absolute Error criterion
BinaryRelevance(classifier=DecisionTreeRegressor(criterion='mae', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=0, splitter='random'),
        require_dense=[False, True]) 

Fit Time 0.14 s

Saving loaded model.


['trained_DTR.pkl']

In [7]:
test_predictions_proba={}
for ind in range(len(X_test)):
    predictions=classifier.predict(X_test[ind])
    dense_predictions=sparse2array(predictions)

    max_index,max_value=argmax_custom(dense_predictions)
    test_predictions_proba[ind]=[max_index+48,max_value]

In [8]:
correct_count,classifier_accuracy,mse_error,mean_pred_probs,mean_true_probs=evaluate(evaluated_dict=test_predictions_proba,
                                                     ground_truth_labels=Y_test_hard,
                                                     ground_truth_probability=Y_test_soft)

In [13]:
print("Classification Accuracy: {} % from {} datapoints ".format(round(100*classifier_accuracy,3),correct_count))
print("Mean-Squared Error amongst correct classifications: {}".format(round(100*mse_error,3)))
print("Mean Predicted Probability is {} % while mean given probability is {} %".
      format(round(100*mean_pred_probs,3),round(100*mean_true_probs,3)))
if mean_pred_probs>mean_true_probs:
    print("Predicted Mean probability is greater than the given mean probability. Good tuning.")
else:
    print("Predicted Mean probability is less than the given mean probability. More tuning/ different approach needed.")

Classification Accuracy: 100.0 % from 168 datapoints 
Mean-Squared Error amongst correct classifications: 1.196
Mean Predicted Probability is 73.998 % while mean given probability is 67.751 %
Predicted Mean probability is greater than the given mean probability. Good tuning.


In [14]:
from test import test_model
saved_model='trained_DTR.pkl'
test_features=week1_features
test_labels=week1_labels
test_predictions_proba=test_model(saved_model,test_features,test_labels)

Loaded trained Decision Tree Regression Model.
Obtained testing features and labels.
Obtained predictions
335 0.9970238095238095 [ 0.27715753  0.53660399  0.36536064  0.48266871  0.54364654] 0.781931193513 [ 0.40703019  0.0678765   0.29515987  0.16780831  0.06212513]
Classification Accuracy: 99.702 % from 336 datapoints 
Mean Predicted Probability is 78.193%


----

### Trial Classifiers

In [11]:
#Tried and failed
# Adaboost Regressor 75%
# print("AdaBoost Regressor")
# t0=time.clock()
# base_classifier=AdaBoostRegressor(base_estimator=None,n_estimators=100,
#                                   learning_rate=1.0,loss='linear',random_state=0)

# classifier = BinaryRelevance(classifier=base_classifier,
#     require_dense = [False, True])
# print(classifier,"\n")
# classifier.fit(X_aggregate_shuffle,Y_aggregate_shuffle)
# print("Fit Time {} s\n".format(round(time.clock()-t0,2)))

# ExtraTrees Regressor 76%
# print("ExtraTrees Regressor")
# t0=time.clock()
# base_classifier=ExtraTreesRegressor(n_estimators=25,criterion='mse',
#                                     max_depth=None,min_samples_split=2,
#                                     min_samples_leaf=1,min_weight_fraction_leaf=0.0,
#                                     max_features='auto',max_leaf_nodes=None,
#                                     min_impurity_decrease=0.0,min_impurity_split=None,
#                                     bootstrap=True,oob_score=False,
#                                     n_jobs=1,random_state=0,
#                                     verbose=1,warm_start=False)

# classifier = BinaryRelevance(classifier=base_classifier,
#     require_dense = [False, True])
# print(classifier,"\n")
# classifier.fit(X_aggregate_shuffle,Y_aggregate_shuffle)
# print("Fit Time {} s\n".format(round(time.clock()-t0,2)))

# Bagging Regressor 76.286
# print("Bagging Regressor")
# t0=time.clock()
# base_classifier=BaggingRegressor(base_estimator=None,n_estimators=30,
#                                     max_samples=1.0,max_features=1.0,
#                                     bootstrap=True,bootstrap_features=False,
#                                     oob_score=False,warm_start=False,
#                                     n_jobs=1,random_state=0,verbose=1)

# classifier = BinaryRelevance(classifier=base_classifier,
#     require_dense = [False, True])
# print(classifier,"\n")
# classifier.fit(X_aggregate_shuffle,Y_aggregate_shuffle)
# print("Fit Time {} s\n".format(round(time.clock()-t0,2)))

# Bagging Regressor 76.235
# print("Gradient Boosting Regressor")
# t0=time.clock()
# base_classifier=GradientBoostingRegressor(loss='ls',learning_rate=1.0,
#                                           n_estimators=20,subsample=1.0,
#                                           criterion='friedman_mse',min_samples_split=2,
#                                           min_samples_leaf=1,min_weight_fraction_leaf=0.0,
#                                           max_depth=3,min_impurity_decrease=0.0,
#                                           min_impurity_split=None,init=None,
#                                           random_state=0,max_features=None,
#                                           alpha=0.95,verbose=1,
#                                           max_leaf_nodes=None,warm_start=False,
#                                           presort='auto')

# classifier = BinaryRelevance(classifier=base_classifier,
#     require_dense = [False, True])
# print(classifier,"\n")
# classifier.fit(X_aggregate_shuffle,Y_aggregate_shuffle)
# print("Fit Time {} s\n".format(round(time.clock()-t0,2)))

# SVR 75.078 %
# print("SVR Regressor")
# t0=time.clock()
# base_classifier=SVR(kernel='rbf',degree=3,
#                     gamma='auto',coef0=0.0,
#                     tol=0.00001,C=1.0,
#                     epsilon=0.05,shrinking=True,
#                     cache_size=200,verbose=1,max_iter=-1)

# classifier = BinaryRelevance(classifier=base_classifier,
#     require_dense = [False, True])
# print(classifier,"\n")
# classifier.fit(X_aggregate_shuffle,Y_aggregate_shuffle)
# print("Fit Time {} s\n".format(round(time.clock()-t0,2)))

# NuSVR 76.005 %
# print("NuSVR Regressor")
# t0=time.clock()
# base_classifier=NuSVR(nu=0.95,kernel='rbf',degree=3,
#                     gamma='auto',coef0=0.0,
#                     tol=0.00001,C=1.0,
#                     shrinking=True,
#                     cache_size=200,verbose=1,max_iter=-1)

# classifier = BinaryRelevance(classifier=base_classifier,
#     require_dense = [False, True])
# print(classifier,"\n")
# classifier.fit(X_aggregate_shuffle,Y_aggregate_shuffle)
# print("Fit Time {} s\n".format(round(time.clock()-t0,2)))