# HW5

For this homework, we are going to work with the [*Indoor User Movement Prediction from RSS data*](https://archive.ics.uci.edu/ml/datasets/Indoor+User+Movement+Prediction+from+RSS+data) dataset from UCI. The homework is due Friday, December 21st, at midnight.

## Task 1

Download the dataset and unzip it into a subdirectory of your `data` folder named `rss_data`.

The files we are interested in are in the subfolder `dataset`. Each file whose name starts with `MovementAAL_RSS_` contains data collected from individuals, and each such file represents a single data point. There are 314 of these files, and hence you have 314 data points. Each file has 4 columns, but the number of rows changes from file to file.

There is also a file named `MovementAAL_target.csv` in that folder. This file tells us the class to which each of these measurements is assigned. Some of the measurements are labelled +1 and some are labelled -1.

## Task 2

Construct an SVM model that separates the data points labelled +1 from those labelled -1. You must first solve the problem that these data points do not have the same number of rows, even though they all have the same number of columns.

## Task 3

Using [Keras](https://keras.io/getting-started/sequential-model-guide/), write a neural network model that separates the data points labelled +1 from those labelled -1.

## Notes

1. You must document each step of your tasks: what you are doing, why you are doing it, what problems you encountered, and how you solved them. All of these must be explained and documented. Solutions without sufficient documentation will be penalized accordingly. 50% of your points will come from your code, while the other 50% will come from your explanations.

1. You can use MS Excel to inspect the files, but loading them into Python with pandas and inspecting them in Jupyter is easier.

1. Put the data in a separate subfolder of your `data` folder and name it `rss_data`. I'll take points off if the data is not saved in the correct place.

1. For both Task 2 and Task 3, you must split your data into a train set and a test set, and then evaluate the accuracy of your model on the test set.



# Task 1

First, we import the required libraries:

In [1]:
import os
import numpy as np
import pandas as pd
from pandas import read_csv
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adadelta

Using TensorFlow backend.


We will check the size of each data point as follows:

In [2]:
length = []

for i in range(314):
    filename = os.path.expanduser('~/MAT388E/data/rss_data/dataset/MovementAAL_RSS_'+str(i+1)+'.csv')
    df = read_csv(filename, header=0)
    length.append(len(df))
    del df
print('minimum size of the data is:',min(length))
print('maximum size of the data is:',max(length))
del length

minimum size of the data is: 19
maximum size of the data is: 129


Since the shortest file has 19 rows, every data point is truncated to its first 19 rows so that all data points have the same shape (4 columns × 19 rows).

In [3]:
sequences = []
for i in range(314):
    filename = os.path.expanduser('~/MAT388E/data/rss_data/dataset/MovementAAL_RSS_'+str(i+1)+'.csv')
    df = read_csv(filename)
    # Keep only the first 19 rows and store the sample column by column (4 lists of 19 values).
    values = [df.iloc[0:19, col].tolist() for col in range(4)]
    sequences.append(values)
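
The truncation step can be illustrated on synthetic data. The following is a minimal sketch, not the notebook's own code: `truncate_rows` and `toy_files` are hypothetical names, and the toy arrays stand in for the CSV files. It stores each sample column by column, as the cell above does. Note that truncation discards every row after the shortest length, which for the longest file here means 110 of its 129 rows.

```python
import numpy as np

def truncate_rows(arrays, n_rows=None):
    """Cut every 2-D array to its first n_rows rows and stack the results.

    If n_rows is None, the row count of the shortest array is used.
    Each sample is transposed so the result has shape
    (n_samples, n_columns, n_rows), i.e. one list of readings per column.
    """
    if n_rows is None:
        n_rows = min(a.shape[0] for a in arrays)
    return np.stack([a[:n_rows].T for a in arrays])

# Synthetic stand-ins for three files with 4 columns and 5, 3 and 7 rows.
toy_files = [np.arange(r * 4, dtype=float).reshape(r, 4) for r in (5, 3, 7)]
toy = truncate_rows(toy_files)
print(toy.shape)  # (3, 4, 3)
```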

Reading target data:

In [4]:
target = read_csv(os.path.expanduser('~/MAT388E/data/rss_data/dataset/MovementAAL_target.csv'))

To feed the data to the models, we first convert the list of sequences to a NumPy array of shape (314, 4, 19) and then flatten each data point into a single row of 4 × 19 = 76 features:

In [5]:
variables=np.array(sequences)
reshaped = variables.reshape((314,4*19))
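
To see what this reshape does to the feature order, here is a small sketch on a made-up array (the name `toy` and its sizes are illustrative only). With NumPy's default row-major order, the flattened row holds all readings of the first column, then all readings of the second column, and so on.

```python
import numpy as np

# 2 samples, 4 columns, 3 readings per column (instead of 314, 4, 19).
toy = np.arange(2 * 4 * 3).reshape(2, 4, 3)
flat = toy.reshape(2, 4 * 3)

print(flat.shape)        # (2, 12)
print(flat[0].tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```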

In [6]:
target.iloc[:,1].value_counts()

 1    158
-1    156
Name:  class_label, dtype: int64

The data is split into training (70%) and test (30%) sets:

In [61]:
X_train, X_test, y_train, y_test = train_test_split(reshaped, target.iloc[:,1], test_size=0.30)
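
The split above is not seeded, so the results change from run to run. As a sketch of one possible variation (the `random_state=0` seed and the use of `stratify` are assumptions, not what the notebook ran), the split can be made reproducible and class-balanced. The arrays below are synthetic placeholders with the same sizes as the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 314 samples with 76 features, 158 labelled +1 and 156 labelled -1.
X = np.zeros((314, 76))
y = np.array([1] * 158 + [-1] * 156)

# stratify=y keeps the class proportions similar in both parts;
# random_state fixes the shuffle so the split can be repeated.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
print(X_tr.shape, X_te.shape)  # (219, 76) (95, 76)
```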

# Task 2

The SVM model is built in this task. We first recode the labels as binary (1 for +1 and 0 for -1), and then create the model and fit it to the training data:

In [62]:
y_train_binary = y_train.apply(lambda x: 1 if x==1 else 0)
y_test_binary = y_test.apply(lambda x: 1 if x==1 else 0)
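
The same recoding can also be written as a vectorized comparison. This is a small sketch on made-up labels, shown with a NumPy array. `SVC` itself accepts the -1/+1 labels directly; the 0/1 encoding matters for the sigmoid output and binary cross-entropy loss used in Task 3.

```python
import numpy as np

labels = np.array([1, -1, -1, 1])       # made-up labels in the dataset's -1/+1 encoding
binary = (labels == 1).astype(int)      # True/False -> 1/0
print(binary.tolist())                  # [1, 0, 0, 1]
```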

In [88]:
svm_clf = SVC(kernel="rbf",gamma=0.05,C=1)
svm_clf.fit(X_train, y_train_binary)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
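
For reference, the RBF kernel used here measures similarity between two points as exp(-gamma · ‖x − z‖²). The following sketch computes that value directly; `rbf_kernel_value` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma=0.05):
    """Return exp(-gamma * squared Euclidean distance between x and z)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

print(rbf_kernel_value([0, 0], [0, 0]))  # 1.0 (identical points)
print(rbf_kernel_value([1, 0], [0, 0]) < rbf_kernel_value([0.5, 0], [0, 0]))  # True
```

A larger `gamma` makes the similarity decay faster with distance, so each training point influences a smaller neighbourhood.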

The model is evaluated as below:

In [90]:
from sklearn.metrics import confusion_matrix, accuracy_score
predicted=svm_clf.predict(X_test)

print('\nconfusion matrix of the test data:\n',confusion_matrix(predicted,y_test_binary))
print('accuracy score of the test data:\n',accuracy_score(predicted,y_test_binary))


confusion matrix of the test data:
 [[40  4]
 [18 33]]
accuracy score of the test data:
 0.7684210526315789


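The accuracy on the test set is about 77%. The confusion matrix is shown above; because the predictions are passed as the first argument to `confusion_matrix`, its rows correspond to predicted classes and its columns to true classes.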
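
The accuracy can be recomputed by hand from the confusion matrix printed above. The sketch below assumes the orientation produced by the call in the cell, with predictions passed first, so rows are predicted classes and columns are true classes. The precision and recall figures are extra derived quantities that the notebook does not print.

```python
# Confusion matrix from the SVM cell: rows = predicted class, columns = true class.
cm = [[40, 4],
      [18, 33]]

total = sum(sum(row) for row in cm)          # all test samples
correct = cm[0][0] + cm[1][1]                # diagonal entries
accuracy = correct / total

precision_1 = cm[1][1] / (cm[1][0] + cm[1][1])   # of those predicted 1, how many are truly 1
recall_1 = cm[1][1] / (cm[0][1] + cm[1][1])      # of those truly 1, how many were predicted 1

print(total, correct, round(accuracy, 4))          # 95 73 0.7684
print(round(precision_1, 4), round(recall_1, 4))   # 0.6471 0.8919
```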

# Task 3

A neural network model is built in this task. It reuses the binary targets created in Task 2 and consists of one hidden layer of 15 sigmoid units followed by a single sigmoid output unit:

In [70]:
nn_clf = Sequential()

nn_clf.add(Dense(15, activation='sigmoid', input_dim=76))
nn_clf.add(Dense(1, activation='sigmoid'))

nn_clf.compile(optimizer='RMSprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
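
As a quick sanity check on the model size, the number of trainable parameters of a fully connected layer is inputs × units (weights) plus units (biases). The helper `dense_params` below is a hypothetical name used only for this arithmetic sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(76, 15)   # 76 * 15 weights + 15 biases
output = dense_params(15, 1)    # 15 weights + 1 bias
print(hidden, output, hidden + output)  # 1155 16 1171
```

With 219 training samples and 1171 parameters, the network has more parameters than training examples, so overfitting may be a concern.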

In [72]:
nn_clf.fit(X_train, y_train_binary, epochs=100, batch_size=75, verbose=1, validation_data=(X_test, y_test_binary))

Train on 219 samples, validate on 95 samples
Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x1e9d25eeef0>

We predict the test set by thresholding the network's sigmoid output at 0.5:

In [74]:
y_pred = (nn_clf.predict(X_test) > 0.5)
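
The thresholding step can be tried on made-up probabilities. This is only a sketch: the values in `probs` are invented, and the array is shaped like the output of a one-unit layer, with one column per sample. A value of exactly 0.5 is not greater than the threshold, so it maps to class 0.

```python
import numpy as np

probs = np.array([[0.12], [0.50], [0.93]])        # invented sigmoid outputs, shape (3, 1)
labels = (probs > 0.5).astype(int).ravel()        # boolean -> 0/1, flattened to shape (3,)
print(labels.tolist())                            # [0, 0, 1]
```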

In [75]:
print('confusion matrix of the test data:\n',confusion_matrix(y_pred,y_test_binary))

confusion matrix of the test data:
 [[33  4]
 [25 33]]


In [76]:
print('accuracy score of the test data:\n',accuracy_score(y_pred,y_test_binary))


accuracy score of the test data:
 0.6947368421052632


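In this run, the neural network reaches about 69% accuracy on the test set, compared with about 77% for the SVM. Because the train/test split is not seeded, these figures will vary between runs.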