# Advanced Data Science Capstone Project

The dataset I'm going to use for this project is the MotionSense dataset (`motionsense-dataset`).

## Data Set Information:

This dataset includes time-series data generated by accelerometer and gyroscope sensors (attitude, gravity, userAcceleration, and rotationRate). It is collected with an iPhone 6s kept in the participant's front pocket using SensingKit which collects information from Core Motion framework on iOS devices. A total of 24 participants in a range of gender, age, weight, and height performed 6 activities in 15 trials in the same environment and conditions: downstairs, upstairs, walking, jogging, sitting, and standing. With this dataset, we aim to look for personal attributes fingerprints in time-series of sensor data, i.e. attribute-specific patterns that can be used to infer gender or personality of the data subjects in addition to their activities.

## Use Case:

Given motion sensor data as input, a machine learning model can predict which activity is being performed: downstairs, upstairs, walking, jogging, sitting, or standing.

A neural network is a reasonable choice of deep learning model for this scenario.


# Download the data

When this command completes you will have a file "montionsense.all-data"

Data source:  https://www.kaggle.com/malekzadeh/motionsense-dataset


# Citation

@inproceedings{Malekzadeh:2019:MSD:3302505.3310068,
author = {Malekzadeh, Mohammad and Clegg, Richard G. and Cavallaro, Andrea and Haddadi, Hamed},
title = {Mobile Sensor Data Anonymization},
booktitle = {Proceedings of the International Conference on Internet of Things Design and Implementation},
series = {IoTDI '19},
year = {2019},
isbn = {978-1-4503-6283-2},
location = {Montreal, Quebec, Canada},
pages = {49--58},
numpages = {10},
url = {http://doi.acm.org/10.1145/3302505.3310068},
doi = {10.1145/3302505.3310068},
acmid = {3310068},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {adversarial training, deep learning, edge computing, sensor data privacy, time series analysis},
} 


https://github.com/IBM/coursera/blob/master/coursera_ml/a2_w1_s3_Pipelines_with_ScikitLearn.ipynb

In [1]:
!pip install ibm-cos-sdk



In [2]:
from ibm_botocore.client import Config
import ibm_boto3

credentials = {
    'IAM_SERVICE_ID': 'iam-ServiceId-35495bf4-1c7a-4958-9cad-bd19d3628db0',
    'IBM_API_KEY_ID': '',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.bluemix.net/oidc/token',
    'BUCKET': 'coursera-donotdelete-pr-l7reg9k0xefdz2',
    'FILE': 'sub_1.csv'
}

def get_cos_client(credentials):
    # Build an S3-compatible client for IBM Cloud Object Storage.
    return ibm_boto3.client(service_name='s3',
                            ibm_api_key_id=credentials['IBM_API_KEY_ID'],
                            ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
                            ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
                            config=Config(signature_version='oauth'),
                            endpoint_url=credentials['ENDPOINT'])

def upload_file_cos(credentials, local_file_name, key):
    # Upload a local file to the bucket under the given object key.
    cos = get_cos_client(credentials)
    try:
        cos.upload_file(Filename=local_file_name, Bucket=credentials['BUCKET'], Key=key)
    except Exception as e:
        print('Upload failed:', e)
    else:
        print('File Uploaded')

def download_file_cos(credentials, local_file_name, key):
    # Download the object stored under the given key to a local file.
    cos = get_cos_client(credentials)
    try:
        cos.download_file(Bucket=credentials['BUCKET'], Key=key, Filename=local_file_name)
    except Exception as e:
        print('Download failed:', e)
    else:
        print('File Downloaded')
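
The `IBM_API_KEY_ID` entry in the credentials above is left blank. One possible way to supply it without writing the key into the notebook is to read it from an environment variable. The sketch below assumes a variable named `IBM_API_KEY_ID`; both that name and the helper `with_api_key_from_env` are hypothetical and not part of the IBM SDK.

```python
import os

def with_api_key_from_env(credentials, env_var="IBM_API_KEY_ID"):
    """Return a copy of the credentials dict with the API key read from the environment.

    The environment variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var, "")
    if not api_key:
        raise RuntimeError("Set the " + env_var + " environment variable first")
    filled = dict(credentials)          # leave the original dict untouched
    filled["IBM_API_KEY_ID"] = api_key
    return filled
```

The returned dict could then be passed to `download_file_cos` in place of `credentials`.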
        

# Start loading data from here

In [3]:
!rm -rf sensor.tgz

In [4]:
download_file_cos(credentials,'sensor.tgz','sensor.tgz')

File Downloaded


In [5]:
!ls -l

total 35468
-rw-r----- 1 wsuser watsonstudio 36318084 Oct 15 23:38 sensor.tgz


In [6]:
!rm -rf sensor.parquet
!tar -zxvf sensor.tgz


sensor.parquet/
sensor.parquet/_SUCCESS
sensor.parquet/part-00000-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet
sensor.parquet/.part-00000-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet.crc
sensor.parquet/part-00001-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet
sensor.parquet/.part-00001-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet.crc
sensor.parquet/part-00002-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet
sensor.parquet/.part-00002-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet.crc
sensor.parquet/part-00003-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet
sensor.parquet/.part-00003-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet.crc
sensor.parquet/part-00004-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet
sensor.parquet/.part-00004-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet.crc
sensor.parquet/part-00005-37a01292-79d3-45a5-9722-4851cc83b43d-c000.snappy.parquet
sensor.parquet/.part-0

https://cloud.ibm.com/docs/services/cloud-object-storage/libraries?topic=cloud-object-storage-python#using-python

In [7]:
df = spark.read.parquet('sensor.parquet')
df.createOrReplaceTempView('sensor')
df.show()

+----+---------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+----------+-----+
|   a|        b|        c|       d|        e|       f|        g|        h|        i|        j|        k|        l|        m|    source|class|
+----+---------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+----------+-----+
| 0.0|-1.504388|-1.184453|1.373747|-0.375966|0.926293|-0.025004|-0.076703|-0.022654|-0.050195|-0.009517|-0.005028| 0.021846|sub_19.csv|std_6|
| 1.0|-1.507574| -1.18388|1.370343|-0.376574|0.926077| -0.02384|-0.055318| 0.007022|-0.030855|-0.007596|-0.006796| 0.024801|sub_19.csv|std_6|
| 2.0|-1.509548|-1.183511| 1.36786|-0.376962|0.925938|-0.023117|-0.041444|  0.01971|-0.019024|-0.010307|-0.006169| 0.020966|sub_19.csv|std_6|
| 3.0|-1.511309|-1.183154|1.365622|-0.377332|0.925803|-0.022473|-0.035052| 0.014374|-0.022178|-0.014651|-0.004599| 0.009991|sub_19.csv|std_6|
| 4.0|

In [8]:
import numpy as np
import pandas as pd

Convert from Spark dataframe to Pandas dataframe

In [9]:

df = df.toPandas()

# Convert Text Labels to Integers

You will create a Keras neural network to classify each record by its label, stored in the `class` column (for example `std_6`).

Although it is straightforward to keep text labels and still have working Keras code, the goal of this exercise is to save the model and then load it into DeepLearning4J, a Java framework. The Java code that imports the model has been prebuilt and precompiled and expects numeric labels. With that restriction in mind, convert the text labels to integers with the following commands.

In [10]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# .values replaces as_matrix(), which was removed in pandas 1.0
le.fit(df['class'].values)
encoded_classes = le.transform(df['class'].values)
encoded_classes

array([6, 6, 6, ..., 1, 1, 1])
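
As a rough illustration of what `LabelEncoder` does here, the sketch below sorts the distinct labels and assigns each one its position in that sorted order. The helper `fit_label_mapping` is hypothetical, and apart from `std_6` (which appears in the output above) the label strings are made up for the example.

```python
def fit_label_mapping(labels):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

# "std_6" comes from the notebook output; the other labels are hypothetical.
labels = ["std_6", "sit_5", "std_6", "wlk_7"]
mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]

print(mapping)   # {'sit_5': 0, 'std_6': 1, 'wlk_7': 2}
print(encoded)   # [1, 0, 1, 2]
```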

In [11]:
df_encoded = df.join(pd.DataFrame(encoded_classes,columns=['class_factorized']))

In [12]:
df=df_encoded

# Verify the contents of the file.

The DataFrame has 16 columns: the 13 numeric columns `a` to `m`, the `source` file name, the text label `class`, and the new integer label `class_factorized`.

The data is not shuffled, although shuffling is advised for best neural network training performance.

To verify that the above conversion succeeded, check the shape of the DataFrame.

In [13]:
df.shape

(838544, 16)
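
The remark above about shuffling can be sketched with NumPy as follows: one random permutation of the row indices is applied to both features and labels so that they stay aligned. The arrays are toy data, not the sensor data. Note that `df.sample(frac=0.1)`, used later in this notebook, already returns its rows in random order.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data: row i of X is [2*i, 2*i + 1], with one label per row.
X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 1, 2])

# One permutation of the row indices, applied to both arrays.
order = rng.permutation(len(X))
X_shuffled = X[order]
y_shuffled = y[order]
```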

# Build a Neural Network

Build a Keras neural network to process the data. By training a neural network, we feed it the features and ask it to predict which class each set of readings belongs to.

We will build a feed-forward neural network using the Keras Sequential model.

First, some imports:


In [14]:
import numpy
import pandas
from keras.models import Sequential

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from keras.utils import np_utils

import keras
from keras.layers import Dense, Dropout, Flatten
from keras.optimizers import RMSprop
from keras.layers import Conv2D, MaxPooling2D

Using TensorFlow backend.


# Set Random Seed


Neural networks start from randomly initialized weights.

For repeatable results, set a random seed so that a second run draws the same NumPy random numbers as the first. Other sources of randomness, such as the backend's own weight initialization, may need their own seeds.



In [15]:
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
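
A small check of what the seed does: re-seeding NumPy with the same value reproduces the same sequence of draws. This sketch covers NumPy only and says nothing about the Keras backend.

```python
import numpy as np

np.random.seed(7)
first_draw = np.random.rand(3)

np.random.seed(7)               # same seed again
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))  # True
```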

# Load the data into a pandas dataframe and Split into Features and Labels

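The cells below take a 10% random sample of the rows, split it roughly 80/20 into training and test sets, and then use the first 12 columns (`a` to `l`) as features and the last column (`class_factorized`) as the label.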


In [16]:
df = df.sample(frac=0.1)

msk = np.random.rand(len(df)) < 0.8

df_train = df[msk]

df_test = df[~msk]
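
The mask-based split above can be illustrated on plain row indices. Each row independently lands in the training set with probability 0.8, so the split is only approximately 80/20 (in this notebook, 67283 training rows and 16571 test rows out of 83854). The row count of 1000 below is arbitrary.

```python
import numpy as np

np.random.seed(7)
n_rows = 1000                       # arbitrary size for the illustration
rows = np.arange(n_rows)

# True for roughly 80% of the rows, False for the rest.
mask = np.random.rand(n_rows) < 0.8

train_rows = rows[mask]
test_rows = rows[~mask]

print(len(train_rows), len(test_rows))
```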
    


In [17]:
df_train.count()

a                   67283
b                   67283
c                   67283
d                   67283
e                   67283
f                   67283
g                   67283
h                   67283
i                   67283
j                   67283
k                   67283
l                   67283
m                   67283
source              67283
class               67283
class_factorized    67283
dtype: int64

In [18]:
df_test.count()

a                   16571
b                   16571
c                   16571
d                   16571
e                   16571
f                   16571
g                   16571
h                   16571
i                   16571
j                   16571
k                   16571
l                   16571
m                   16571
source              16571
class               16571
class_factorized    16571
dtype: int64

In [22]:
#df = df_train

dataset_train = df_train.values



# split into input (X) and output (Y) variables
X_train = dataset_train[:,0:12].astype(float)
Y_train = dataset_train[:,15]


dataset_test = df_test.values

# split into input (X) and output (Y) variables
X_test = dataset_test[:,0:12].astype(float)
Y_test = dataset_test[:,15]


In [23]:
df.shape

(83854, 16)

# Encode Labels

The following code converts the labels to integers. Because `class_factorized` already holds integers, this step is redundant here, but it would also work on the original text labels.

The code also converts the integers to one-hot, or dummy, encoding.

Given n labels, dummy encoding creates an array of length n.
The array has a "1" at the position corresponding to the label, and all other values are "0".

As a simplified illustration with 2 labels, dummy encoding makes the following conversion. The data used here has 7 classes, so each row becomes an array of length 7.

Original Data

```
0
1
0
```

Dummy Encoded

```
1,0
0,1
1,0
```

To verify, inspect the output of the following line, which the cell below runs for both the training and test labels.

```
print(dummy_y)
```
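
The dummy encoding described above can be reproduced with plain NumPy. This is a minimal sketch of the idea, not the Keras implementation; `one_hot` is a hypothetical helper name.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([0, 1, 0], num_classes=2)
print(example)
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]
```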

 



In [24]:
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y_train)
encoded_Y = encoder.transform(Y_train)

# convert integers to dummy variables (one-hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
print(dummy_y)

Y_train = dummy_y

# reuse the encoder fitted on the training labels so both sets share one mapping
encoded_Y = encoder.transform(Y_test)

# convert integers to dummy variables (one-hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y, num_classes=len(encoder.classes_))
print(dummy_y)

Y_test = dummy_y


[[ 0.  0.  0. ...,  1.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  1.]
 [ 0.  0.  1. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  1.  0.  0.]
 [ 0.  0.  1. ...,  0.  0.  0.]
 [ 0.  0.  1. ...,  0.  0.  0.]]
[[ 0.  0.  0. ...,  0.  0.  1.]
 [ 0.  0.  0. ...,  1.  0.  0.]
 [ 0.  0.  1. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  1. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  1.]
 [ 0.  0.  0. ...,  1.  0.  0.]]


# Build a model

Build a working neural network here.

You can review the Keras section for examples.

You are free to decide the depth and features of the neural network.

Note, however, that the first layer has to have input_dim = 12 to correspond to the number of features, and the last layer has to have 7 nodes to correspond to the number of labels.

How will you know you have a good model?

The run recorded below reports a test accuracy of about 0.91.



In [25]:
# create model
# ADD YOUR CODE HERE
model = Sequential()
model.add(Dense(12, input_dim=12, activation='relu'))
model.add(Dense(7, activation='sigmoid'))
# Compile model
#model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#model.fit(X,dummy_y,epochs=20,batch_size=5)
#model.save('iris_model.h5')

Try different parameters

In [73]:
batch_size = 5
num_classes = 7
epochs = 20

In [74]:
# Alternative model: two hidden layers with dropout and a softmax output
model = Sequential()
model.add(Dense(12, activation='relu', input_shape=(12,)))
model.add(Dropout(0.2))
model.add(Dense(12, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 12)                156       
_________________________________________________________________
dropout_1 (Dropout)          (None, 12)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 12)                156       
_________________________________________________________________
dropout_2 (Dropout)          (None, 12)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 7)                 91        
Total params: 403
Trainable params: 403
Non-trainable params: 0
_________________________________________________________________
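
The parameter counts in the summary above can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add no parameters. A short sketch of that arithmetic (`dense_params` is a hypothetical helper name):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a Dense layer."""
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers in the summary.
layer_sizes = [(12, 12), (12, 12), (12, 7)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(counts)

print(counts)  # [156, 156, 91]
print(total)   # 403
```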


# Compile the Model and Train

The following cells compile the model and set the number of epochs and the batch size.

Depending on your model, it may train in 20 epochs, it may take 40, or it may not train at all.

Here, epochs is set to 20 and batch_size to 5.

If your loss is not decreasing, your model is not training; modify your model and try again.
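
One simple way to check whether the loss is decreasing is to compare the first and last per-epoch values, which Keras records in `history.history['loss']` after `model.fit`. The helper below is a hypothetical sketch, and the example loss values are made up.

```python
def loss_is_decreasing(losses, min_drop=0.0):
    """Return True if the last loss is lower than the first by more than min_drop."""
    if len(losses) < 2:
        return False
    return (losses[0] - losses[-1]) > min_drop

# Hypothetical per-epoch losses, for illustration only.
example_losses = [0.41, 0.33, 0.27, 0.25]
print(loss_is_decreasing(example_losses))  # True
```

After training, it could be called as `loss_is_decreasing(history.history['loss'])`.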




In [26]:

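# Compile model
# Note: for one-hot labels over 7 classes, 'categorical_crossentropy' is the usual loss.
# With 'binary_crossentropy', Keras of this era reports a per-output binary accuracy,
# which can read higher than the share of correctly classified rows.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
#model.fit(X, dummy_y, epochs=20, batch_size=5)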

In [27]:
epochs=20
batch_size=5

In [29]:
#some learners constantly reported 502 errors in Watson Studio. 
#This is due to the limited resources in the free tier and the heavy resource consumption of Keras.
#This is a workaround to limit resource consumption

from keras import backend as K

K.set_session(K.tf.Session(config=K.tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)))

history = model.fit(X_train, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(X_test, Y_test))

score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 67283 samples, validate on 16571 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test loss: 0.189979918286
Test accuracy: 0.912670186923
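
When reading the accuracy above, it helps to separate two measures. The sketch below, on made-up predictions for 7 classes, compares top-1 accuracy (does the highest-scoring class match the label?) with a per-output binary accuracy (is each of the 7 outputs on the right side of 0.5?). To my understanding, Keras versions of this era use the latter for `metrics=['accuracy']` when the loss is `binary_crossentropy`; treat that as an assumption to verify against the installed version. Both helper names are hypothetical.

```python
import numpy as np

def top1_accuracy(y_true_one_hot, y_prob):
    """Share of rows whose highest-scoring class equals the true class."""
    true_class = np.argmax(y_true_one_hot, axis=1)
    pred_class = np.argmax(y_prob, axis=1)
    return float(np.mean(true_class == pred_class))

def binary_accuracy(y_true_one_hot, y_prob, threshold=0.5):
    """Share of individual outputs on the correct side of the threshold."""
    return float(np.mean((y_prob > threshold) == (y_true_one_hot > 0.5)))

# Four rows with true classes 0..3; every prediction weakly favours class 0.
y_true = np.eye(7)[[0, 1, 2, 3]]
y_prob = np.full((4, 7), 0.1)
y_prob[:, 0] = 0.2

print(top1_accuracy(y_true, y_prob))    # 0.25
print(binary_accuracy(y_true, y_prob))  # 6/7, about 0.857
```

On the real data, `top1_accuracy(Y_test, model.predict(X_test))` would give the share of correctly classified test rows.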


# Save your Model

Your model will be loaded into DL4J and run in a Spark context. A saved model includes the weights and the computation graph needed for either further training or inference. In this example, we will load the model into DL4J, pass it our data file, and evaluate the accuracy of the model in DL4J running in Spark.


In [30]:
model.save('my_modelx.h5')

# Verify your model has saved

The ls command should show your model in the local directory of this notebook. 

In [52]:
!ls



logs  my_modelx.h5  sensor.parquet  sensor.tgz	spark-events  user-libs


# DONE

Great Job !!!