# EE514 assignment part 1 starter code

Using sensors to predict activity. This part of the assignment uses the [ExtraSensory dataset](http://extrasensory.ucsd.edu/). You can download the dataset from [here](http://extrasensory.ucsd.edu/data/primary_data_files/ExtraSensory.per_uuid_features_labels.zip). The starter code expects that this dataset has been unpacked in a folder called `data` that is in the same parent folder as this notebook. You can read more about the dataset in [this README file](http://extrasensory.ucsd.edu/data/primary_data_files/README.txt).

In [None]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

from google.colab import drive, files

## Location of the .csv.gz files

In [None]:
drive.mount("/content/gdrive", force_remount=True)
data_dir = '/content/gdrive/MyDrive/EE514(Data-Analysis-&-Machine-Learning)/data/ExtraSensory/ExtraSensory.per_uuid_features_labels'
users = os.listdir(data_dir)
print(users)

Mounted at /content/gdrive
['00EABED2-271D-49D8-B599-1D4A09240601.features_labels.csv.gz', '0A986513-7828-4D53-AA1F-E02D6DF9561B.features_labels.csv.gz', '098A72A5-E3E5-4F54-A152-BBDA0DF7B694.features_labels.csv.gz', '0BFC35E2-4817-4865-BFA7-764742302A2D.features_labels.csv.gz', '0E6184E1-90C0-48EE-B25A-F1ECB7B9714E.features_labels.csv.gz', '1155FF54-63D3-4AB2-9863-8385D0BD0A13.features_labels.csv.gz', '11B5EC4D-4133-4289-B475-4E737182A406.features_labels.csv.gz', '136562B6-95B2-483D-88DC-065F28409FD2.features_labels.csv.gz', '1538C99F-BA1E-4EFB-A949-6C7C47701B20.features_labels.csv.gz', '1DBB0F6F-1F81-4A50-9DF4-CD62ACFA4842.features_labels.csv.gz', '24E40C4C-A349-4F9F-93AB-01D00FB994AF.features_labels.csv.gz', '27E04243-B138-4F40-A164-F40B60165CF3.features_labels.csv.gz', '2C32C23E-E30C-498A-8DD2-0EFB9150A02E.features_labels.csv.gz', '33A85C34-CFE4-4732-9E73-0A7AC861B27A.features_labels.csv.gz', '3600D531-0C55-44A7-AE95-A7A38519464E.features_labels.csv.gz', '40E170A7-607B-4578-AF04-F0

## Some utility functions

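`get_user_id` strips the file suffix from a filename to recover the user UUID. `load_data_for_user` loads a pandas dataframe given a user UUID, and `load_data_for_users` concatenates the dataframes for a slice of users. `get_features_and_target` extracts the specified feature columns $X$ and target column $y$ from a dataframe, converts them to numpy, and drops the examples that have no label.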

In [None]:
def get_user_id(user):
    return user.split(".")[0]

def load_data_for_user(uuid):
    return pd.read_csv(data_dir + "/" + (uuid + '.features_labels.csv.gz'))

def load_data_for_users(start_index, end_index, appended):
  dataframes = []
  for uuid in appended[start_index:end_index]:
    dataframes.append(pd.read_csv(data_dir + "/" + (uuid + '.features_labels.csv.gz')))
  
  return pd.concat(dataframes)

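# `appended` was never defined; build it here as the list of UUIDs with the file suffix stripped
appended = [get_user_id(user) for user in users]
train_set = load_data_for_users(0, 30, appended)
val_set = load_data_for_users(30, 40, appended)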

def get_features_and_target(df, feature_names, target_name):
    
    # select out features and target columns and convert to numpy
    X = df[feature_names].to_numpy()
    y = df[target_name].to_numpy()
    
    # remove examples with no label
    has_label = ~np.isnan(y)
    X = X[has_label,:]
    y = y[has_label]
    return X, y
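
The label-filtering step in `get_features_and_target` can be seen on a toy array. This is a minimal sketch with made-up values (`X_toy` and `y_toy` are hypothetical, not taken from the dataset). It assumes, as the function above does, that a missing label is stored as NaN.

```python
import numpy as np

# Hypothetical toy data: 4 examples, 2 features; a NaN label means "unlabeled"
X_toy = np.array([[0.1, 1.0],
                  [0.2, 2.0],
                  [0.3, 3.0],
                  [0.4, 4.0]])
y_toy = np.array([1.0, np.nan, 0.0, np.nan])

# Keep only the rows whose label is present
has_label = ~np.isnan(y_toy)
X_kept = X_toy[has_label, :]
y_kept = y_toy[has_label]

print(X_kept.shape, y_kept)
```

Only rows with a missing label are dropped here. Missing values inside the feature columns are left in place and handled later by the imputer.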

## Load in some data 
Load in the data for a user and display the first few rows of the dataframe.

In [None]:
df = load_data_for_user('0A986513-7828-4D53-AA1F-E02D6DF9561B')
# df = load_data_for_user(get_user_id(users[-1]))
df.head()

Unnamed: 0,timestamp,raw_acc:magnitude_stats:mean,raw_acc:magnitude_stats:std,raw_acc:magnitude_stats:moment3,raw_acc:magnitude_stats:moment4,raw_acc:magnitude_stats:percentile25,raw_acc:magnitude_stats:percentile50,raw_acc:magnitude_stats:percentile75,raw_acc:magnitude_stats:value_entropy,raw_acc:magnitude_stats:time_entropy,...,label:STAIRS_-_GOING_DOWN,label:ELEVATOR,label:OR_standing,label:AT_SCHOOL,label:PHONE_IN_HAND,label:PHONE_IN_BAG,label:PHONE_ON_TABLE,label:WITH_CO-WORKERS,label:WITH_FRIENDS,label_source
0,1449601597,1.000371,0.007671,-0.016173,0.02786,0.998221,1.000739,1.003265,0.891038,6.684582,...,,,,,,,,,,-1
1,1449601657,1.000243,0.003782,-0.002713,0.007046,0.998463,1.000373,1.002088,1.647929,6.684605,...,,,,,,,,,,-1
2,1449601717,1.000811,0.002082,-0.001922,0.003575,0.999653,1.000928,1.002032,1.960286,6.68461,...,,,,,,,,,,-1
3,1449601777,1.001245,0.004715,-0.002895,0.008881,0.999188,1.001425,1.0035,1.614524,6.684601,...,,,,,,,,,,-1
4,1449601855,1.001354,0.065186,-0.09652,0.165298,1.000807,1.002259,1.003631,0.83779,6.682252,...,0.0,,0.0,1.0,,,,,0.0,2


## What columns are available?

In [None]:
print(df.columns.to_list())
print(len(df.columns.to_list()))

['timestamp', 'raw_acc:magnitude_stats:mean', 'raw_acc:magnitude_stats:std', 'raw_acc:magnitude_stats:moment3', 'raw_acc:magnitude_stats:moment4', 'raw_acc:magnitude_stats:percentile25', 'raw_acc:magnitude_stats:percentile50', 'raw_acc:magnitude_stats:percentile75', 'raw_acc:magnitude_stats:value_entropy', 'raw_acc:magnitude_stats:time_entropy', 'raw_acc:magnitude_spectrum:log_energy_band0', 'raw_acc:magnitude_spectrum:log_energy_band1', 'raw_acc:magnitude_spectrum:log_energy_band2', 'raw_acc:magnitude_spectrum:log_energy_band3', 'raw_acc:magnitude_spectrum:log_energy_band4', 'raw_acc:magnitude_spectrum:spectral_entropy', 'raw_acc:magnitude_autocorrelation:period', 'raw_acc:magnitude_autocorrelation:normalized_ac', 'raw_acc:3d:mean_x', 'raw_acc:3d:mean_y', 'raw_acc:3d:mean_z', 'raw_acc:3d:std_x', 'raw_acc:3d:std_y', 'raw_acc:3d:std_z', 'raw_acc:3d:ro_xy', 'raw_acc:3d:ro_xz', 'raw_acc:3d:ro_yz', 'proc_gyro:magnitude_stats:mean', 'proc_gyro:magnitude_stats:std', 'proc_gyro:magnitude_stat

## Feature selection

The columns that start with `label:` correspond to potential $y$ values. Let's look at using the accelerometer features. These start with `raw_acc:` and `watch_acceleration:`.

In [None]:
# Feature columns: phone and watch accelerometer features
acc_sensors = [s for s in df.columns if 
               s.startswith('raw_acc:') or 
               s.startswith('watch_acceleration:')]

# Target label column used by the cells below
target_column = 'label:FIX_walking'

## Extract our training data

In [None]:
X_train, y_train = get_features_and_target(df, acc_sensors, target_column)
print(f'{y_train.shape[0]} examples with {y_train.sum()} positives')

3879 examples with 1145.0 positives


## Preprocessing

We want to make the learning problem easier by making all columns have a mean of zero and a standard deviation of one. There are also lots of missing values in this dataset. We'll use mean imputation here to get rid of them. Since our data is scaled to have zero mean, this will just zero out missing values.

In [None]:
scaler = StandardScaler()
imputer = SimpleImputer(strategy='mean')

X_train = scaler.fit_transform(X_train)
X_train = imputer.fit_transform(X_train)
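
The claim that mean imputation after scaling "just zeros out missing values" can be checked on a small example. The following is a plain-numpy sketch of the two steps rather than the sklearn objects themselves. It assumes the scaler ignores NaNs when computing column statistics (mimicked here with `np.nanmean` and `np.nanstd`), and `X_toy` is a hypothetical array.

```python
import numpy as np

# Hypothetical toy feature matrix with missing values (NaN)
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [6.0, 50.0]])

# Standardize each column using only the observed (non-NaN) entries
col_mean = np.nanmean(X_toy, axis=0)
col_std = np.nanstd(X_toy, axis=0)
X_scaled = (X_toy - col_mean) / col_std

# Mean imputation: replace each NaN with the mean of its scaled column
scaled_mean = np.nanmean(X_scaled, axis=0)
X_imputed = np.where(np.isnan(X_scaled), scaled_mean, X_scaled)

print(X_imputed)
```

Because each scaled column already has mean zero over its observed entries, the values filled in at the missing positions are zero up to floating-point error.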

## Fitting a model
Let's fit a logistic regression model to this user. We can then test its predictive power on a different user.

In [None]:
clf = LogisticRegression(solver='liblinear', max_iter=1000, C=1.0)
clf.fit(X_train, y_train)

## Training accuracy

Let's see the accuracy on the training set. The score function can be used to do this:

In [None]:
print(f'Training accuracy: {clf.score(X_train, y_train):0.4f}')

Looks like the model can fit the training data reasonably well anyway. But this says nothing about how well it will generalize to new data. The dataset is also unbalanced, so this figure may be misleading. How accurate would we be if we just predicted zero each time?

In [None]:
1 - y_train.sum() / y_train.shape[0]

Oh wow. Our model may not be that great after all. Let's try to calculate balanced accuracy, which should better reflect how well the model does on the training data

In [None]:
y_pred = clf.predict(X_train)
print(f'Balanced accuracy (train): {metrics.balanced_accuracy_score(y_train, y_pred):0.4f}')
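
To see why balanced accuracy is more informative than plain accuracy on unbalanced labels, here is a small pure-Python sketch. It treats balanced accuracy for binary labels as the mean of the recall on each class; the helper names and the 90/10 label split are hypothetical choices for illustration.

```python
def accuracy(y_true, y_pred):
    # Fraction of examples predicted correctly
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def balanced_accuracy(y_true, y_pred):
    # Mean of the recall on each class (binary labels 0 and 1)
    recalls = []
    for cls in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical unbalanced labels: 90 negatives, 10 positives
y_true = [0] * 90 + [1] * 10
# A "model" that always predicts zero
y_all_zero = [0] * 100

acc = accuracy(y_true, y_all_zero)
bal_acc = balanced_accuracy(y_true, y_all_zero)
print(acc, bal_acc)
```

The always-zero predictor gets every negative right and every positive wrong, so its plain accuracy equals the fraction of negatives while its balanced accuracy is no better than chance.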

## Testing the model

Ok, it seems our model has fit the training data well. How well does it perform on unseen test data? Let's load the data in for a different user.

In [None]:
df_test = load_data_for_user('11B5EC4D-4133-4289-B475-4E737182A406')
X_test, y_test = get_features_and_target(df_test, acc_sensors, target_column)
print(f'{y_test.shape[0]} examples with {y_test.sum()} positives')

We also need to preprocess as before. **Note**: we are using the scaler and imputer fit to the training data here. It's very important that you do not call `fit` or `fit_transform` here! Think about why.

In [None]:
X_test = imputer.transform(scaler.transform(X_test))
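
One way to think about the note above: if the test data is scaled with its own statistics, any difference between the training and test distributions is erased before the model sees it. The sketch below uses made-up one-dimensional values and hypothetical `fit` and `transform` helpers built on the standard library, not the sklearn scaler.

```python
from statistics import mean, pstdev

# Hypothetical single feature: the test user has the same spread but is shifted by +10
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [11.0, 12.0, 13.0, 14.0, 15.0]

def fit(values):
    # Learn the scaling parameters (mean and population standard deviation)
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

mu_train, sigma_train = fit(train)
train_scaled = transform(train, mu_train, sigma_train)

# Correct: reuse the parameters learned from the training data
test_with_train_stats = transform(test, mu_train, sigma_train)
# Incorrect: refit on the test data, which hides the shift
test_refit = transform(test, *fit(test))

print(mean(test_with_train_stats), mean(test_refit))
```

With the training statistics the shifted test values stay far from zero, which is what the fitted model would actually encounter on new data. After refitting, the test values become indistinguishable from the scaled training values.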

## Test accuracy

In [None]:
print(f'Test accuracy: {clf.score(X_test, y_test):0.4f}')

In [None]:
y_pred = clf.predict(X_test)
print(f'Balanced accuracy (test): {metrics.balanced_accuracy_score(y_test, y_pred):0.4f}')

In [None]:
# Improving the test set

In [None]:
# Evaluate a fitted model on each test dataframe and return the lists of accuracies
def get_accuracies(model, test_dfs):
  accuracies = []
  bal_accuracies = []
  for df in test_dfs:
    X_test, y_test = get_features_and_target(df, acc_sensors, target_column)
    print(f'{y_test.shape[0]} examples with {y_test.sum()} positives')
    # Reuse the scaler and imputer fit on the training data
    X_test = imputer.transform(scaler.transform(X_test))
    print(f'Test accuracy: {model.score(X_test, y_test):0.4f}')
    y_pred = model.predict(X_test)
    print(f'Balanced accuracy (test): {metrics.balanced_accuracy_score(y_test, y_pred):0.4f}')
    accuracies.append(model.score(X_test, y_test))
    bal_accuracies.append(metrics.balanced_accuracy_score(y_test, y_pred))

  return accuracies, bal_accuracies

In [None]:
test_dfs = []
for user in users[:5]:
  test_dfs.append(load_data_for_user(get_user_id(user)))
accuracies, bal_accuracies = get_accuracies(clf, test_dfs)

# Show mean and variance of accuracies for 5 test sets with 1 user
print()
print("Evaluation of accuracies of 5x1 test sets")
print(f"Mean: {np.mean(accuracies):0.6f}")
print(f"Variance: {np.var(accuracies):0.6f}")
print("\nEvaluation of balanced accuracies of 5x1 test sets")
print(f"Mean: {np.mean(bal_accuracies):0.6f}")
print(f"Variance: {np.var(bal_accuracies):0.6f}")

In [None]:
def build_test_set(users, slice_start, slice_end):
  dfs = []
  for user in users[slice_start:slice_end]:
    df = load_data_for_user(user.split(".")[0])
    dfs.append(df)

  return pd.concat(dfs)

In [None]:
def test_model(model, test_set):
  X_test, y_test = get_features_and_target(test_set, acc_sensors, target_column)
  print(f'{y_test.shape[0]} examples with {y_test.sum()} positives')
  X_test = imputer.transform(scaler.transform(X_test))
  print(f'Test accuracy: {model.score(X_test, y_test):0.4f}')
  y_pred = model.predict(X_test)
  print(f'Balanced accuracy: {metrics.balanced_accuracy_score(y_test, y_pred):0.4f}')
  # print(f'F1 score: {metrics.f1_score(y_test, y_pred):0.4f}')

In [None]:
# Build 5 different test sets to demonstrate mean and variance of accuracy on larger sets
test_set1 = build_test_set(users, 5, 10)
test_set2 = build_test_set(users, 10, 15)
test_set3 = build_test_set(users, 15, 20)
test_set4 = build_test_set(users, 20, 25)
test_set5 = build_test_set(users, 25, 30)
dfs = [test_set1, test_set2, test_set3, test_set4, test_set5]

accuracies, bal_accuracies = get_accuracies(clf, dfs)

# Show mean and variance of accuracies for 5 test sets with 5 users
print()
print("Evaluation of accuracies of 5x5 test sets")
print(f"Mean: {np.mean(accuracies):0.6f}")
print(f"Variance: {np.var(accuracies):0.6f}")
print("\nEvaluation of balanced accuracies of 5x5 test sets")
print(f"Mean: {np.mean(bal_accuracies):0.6f}")
print(f"Variance: {np.var(bal_accuracies):0.6f}")
# test_model(clf, test_set1)

### Data Splitting

In [None]:
def split_data(users, train_split, val_split):
  n_users = len(users)
  n_train = round(n_users * train_split)
  n_val = round(n_users * val_split)
  n_test = n_users - n_train - n_val

  print(f"Total samples in dataset: {n_users}")
  print(f"Samples in train set: {n_train}")
  print(f"Samples in validation set: {n_val}")
  print(f"Samples in test set: {n_test}")

  i = 0
  train_dfs = []
  while i < n_train:
    df = load_data_for_user(users[i].split(".")[0])
    train_dfs.append(df)
    i+=1

  val_dfs = []
  while i < n_train + n_val:
    df = load_data_for_user(users[i].split(".")[0])
    val_dfs.append(df)
    i+=1

  test_dfs = []
  while i < n_train + n_val + n_test:
    df = load_data_for_user(users[i].split(".")[0])
    test_dfs.append(df)
    i+=1

  return pd.concat(train_dfs), pd.concat(val_dfs), pd.concat(test_dfs)

train_set, val_set, test_set = split_data(users, 0.6, 0.2)

Total samples in dataset: 60
Samples in train set: 36
Samples in validation set: 12
Samples in test set: 12
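
Note that `split_data` divides the list of users rather than individual rows, so every example from a given person lands in exactly one of the three sets. The following sketch isolates that index arithmetic using slices instead of `while` loops. The function name `split_users` and the placeholder IDs are hypothetical, and no data is loaded.

```python
def split_users(user_ids, train_split, val_split):
    # Same rounding as split_data; the test set takes whatever remains
    n_users = len(user_ids)
    n_train = round(n_users * train_split)
    n_val = round(n_users * val_split)
    train = user_ids[:n_train]
    val = user_ids[n_train:n_train + n_val]
    test = user_ids[n_train + n_val:]
    return train, val, test

# Hypothetical placeholder IDs standing in for the 60 user files
toy_users = [f"user-{i:02d}" for i in range(60)]
train_ids, val_ids, test_ids = split_users(toy_users, 0.6, 0.2)

print(len(train_ids), len(val_ids), len(test_ids))
```

With 60 users and a 0.6/0.2 split this gives the same 36/12/12 partition as the output above, and the three groups share no users.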


### Model Selection

In [None]:
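# Evaluate model with validation set
# Note: this uses the module-level `scaler` and `imputer` fit in the Preprocessing cell,
# not the local ones created inside the train_* functions.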
def evaluate_model(model, val_set, features, target_column):
  X_test, y_test = get_features_and_target(val_set, features, target_column)
  print(f'{y_test.shape[0]} examples with {y_test.sum()} positives')
  X_test = imputer.transform(scaler.transform(X_test))
  print(f'Validation accuracy: {model.score(X_test, y_test):0.4f}')
  y_pred = model.predict(X_test)
  print(f'Balanced accuracy: {metrics.balanced_accuracy_score(y_test, y_pred):0.4f}')

In [None]:
# Train logistic regression model
# It was found that reducing the C parameter increased the balanced accuracy and F1 score 
def train_lr_model(train_set, features, target_column, C_param):
  X_train, y_train = get_features_and_target(train_set, features, target_column)
  print(f'{y_train.shape[0]} examples with {y_train.sum()} positives')
  scaler = StandardScaler()
  imputer = SimpleImputer(strategy='mean')

  X_train = scaler.fit_transform(X_train)
  X_train = imputer.fit_transform(X_train)

  model = LogisticRegression(solver='liblinear', max_iter=1000, C=C_param)
  model.fit(X_train, y_train)
  print(f'Training accuracy: {model.score(X_train, y_train):0.4f}')

  return model
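
`train_lr_model` fits a local `scaler` and `imputer` that are discarded when the function returns, while `evaluate_model` transforms the validation data with the module-level pair. One possible way to keep the preprocessing and the classifier together is sklearn's `make_pipeline`. The sketch below is an alternative, not the notebook's method: `train_lr_pipeline` is a hypothetical helper, and the data is synthetic rather than ExtraSensory.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_lr_pipeline(X_train, y_train, C_param):
    # Hypothetical helper: preprocessing and classifier are fit together
    pipeline = make_pipeline(
        StandardScaler(),
        SimpleImputer(strategy='mean'),
        LogisticRegression(solver='liblinear', max_iter=1000, C=C_param),
    )
    pipeline.fit(X_train, y_train)
    return pipeline

# Synthetic stand-in data: 200 examples, 3 features, roughly 10% missing values
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(float)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = train_lr_pipeline(X[:150], y[:150], 0.01)
# The pipeline applies the train-fitted scaler and imputer before predicting
val_accuracy = pipe.score(X[150:], y[150:])
print(f'Validation accuracy: {val_accuracy:0.4f}')
```

Since the returned object carries its own fitted scaler and imputer, calling `score` or `predict` on new data always uses the statistics learned from the matching training set.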

In [None]:
# Train decision tree model
def train_dt_model(train_set, features, target_column, max_depth):
  X_train, y_train = get_features_and_target(train_set, features, target_column)
  print(f'{y_train.shape[0]} examples with {y_train.sum()} positives')
  scaler = StandardScaler()
  imputer = SimpleImputer(strategy='mean')

  X_train = scaler.fit_transform(X_train)
  X_train = imputer.fit_transform(X_train)

  model = tree.DecisionTreeClassifier(max_features="sqrt", max_depth=max_depth, min_samples_split=2)
  model.fit(X_train, y_train)
  print(f'Training accuracy: {model.score(X_train, y_train):0.4f}')

  return model

In [None]:
# Train random forest model
def train_rf_model(train_set, features, target_column, n_trees, max_depth):
  X_train, y_train = get_features_and_target(train_set, features, target_column)
  print(f'{y_train.shape[0]} examples with {y_train.sum()} positives')
  scaler = StandardScaler()
  imputer = SimpleImputer(strategy='mean')

  X_train = scaler.fit_transform(X_train)
  X_train = imputer.fit_transform(X_train)

  model = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt", max_depth=max_depth, min_samples_split=2, min_samples_leaf=2)
  model.fit(X_train, y_train)
  print(f'Training accuracy: {model.score(X_train, y_train):0.4f}')

  return model

In [None]:
# Train naive bayes model
def train_nb_model(train_set, features, target_column):
  X_train, y_train = get_features_and_target(train_set, features, target_column)
  print(f'{y_train.shape[0]} examples with {y_train.sum()} positives')
  scaler = StandardScaler()
  imputer = SimpleImputer(strategy='mean')

  X_train = scaler.fit_transform(X_train)
  X_train = imputer.fit_transform(X_train)

  model = GaussianNB()
  model.fit(X_train, y_train)
  print(f'Training accuracy: {model.score(X_train, y_train):0.4f}')

  return model

In [None]:
# Train and evaluate logistic regression model
features = [s for s in train_set.columns if 
               s.startswith('raw_acc:') or 
               s.startswith('watch_acceleration:')]

target_column = 'label:BICYCLING'

print("Logistic Regression")
print("Fitting model...")
model_lr = train_lr_model(train_set, features, target_column, 0.01)
print("\nEvaluating model...")
evaluate_model(model_lr, val_set, features, target_column)

Logistic Regression
Fitting model...
78801 examples with 3573.0 positives
Training accuracy: 0.9723

Evaluating model...
34978 examples with 953.0 positives
Validation accuracy: 0.9826
Balanced accuracy: 0.7432


In [None]:
print("Decision Tree")
print("Fitting model...")
model_dt = train_dt_model(train_set, features, target_column, 10)
print("\nEvaluating model...")
evaluate_model(model_dt, val_set, features, target_column)

Decision Tree
Fitting model...
194965 examples with 15091.0 positives
Training accuracy: 0.9420

Evaluating model...
66006 examples with 4259.0 positives
Validation accuracy: 0.9362
Balanced accuracy: 0.6387


In [None]:
print("Random Forest")
print("Fitting model...")
model_rf = train_rf_model(train_set, features, target_column, 50, 10)
print("\nEvaluating model...")
evaluate_model(model_rf, val_set, features, target_column)

Random Forest
Fitting model...
78801 examples with 3573.0 positives
Training accuracy: 0.9804

Evaluating model...
34978 examples with 953.0 positives
Validation accuracy: 0.9854
Balanced accuracy: 0.7442


In [None]:
print("Gaussian Naive Bayes")
print("Fitting model...")
model_nb = train_nb_model(train_set, features, target_column)
print("\nEvaluating model...")
evaluate_model(model_nb, val_set, features, target_column)

Gaussian Naive Bayes
Fitting model...
83125 examples with 608.0 positives
Training accuracy: 0.9207

Evaluating model...
42573 examples with 306.0 positives
Validation accuracy: 0.9919
Balanced accuracy: 0.5044


In [None]:
def get_model_perf(model, test_set, features, target_column):
  X_test, y_test = get_features_and_target(test_set, features, target_column)
  print(f'{y_test.shape[0]} examples with {y_test.sum()} positives')
  X_test = imputer.transform(scaler.transform(X_test))
  print(f'Test accuracy: {model.score(X_test, y_test):0.4f}')
  y_pred = model.predict(X_test)
  print(f'Recall: {metrics.recall_score(y_test, y_pred):0.4f}')
  print(f'Precision: {metrics.precision_score(y_test, y_pred):0.4f}')
  print(f'Balanced accuracy: {metrics.balanced_accuracy_score(y_test, y_pred):0.4f}')
  print(f'F1 score: {metrics.f1_score(y_test, y_pred):0.4f}')
  # print(metrics.confusion_matrix(y_test, y_pred))
  # print(metrics.classification_report(y_test, y_pred, digits=3))


In [None]:
print("Testing Random Forest Classifier...")
get_model_perf(model_rf, test_set, features, target_column)

Testing Random Forest Classifier...
3879 examples with 1145.0 positives
Test accuracy: 0.9905
Recall: 0.6174
Precision: 0.9561
Balanced accuracy: 0.8084
F1 score: 0.7503


In [None]:
print("Testing Naive Bayes Classifier...")
get_model_perf(model_nb, test_set, features, target_column)

Testing Naive Bayes Classifier...
3879 examples with 1145.0 positives
Test accuracy: 0.9324
Recall: 0.8182
Precision: 0.1265
Balanced accuracy: 0.8759
F1 score: 0.2192
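
The two test results illustrate how the metrics can disagree: Naive Bayes has the higher balanced accuracy, but its low precision pulls its F1 score far below that of the random forest. As a quick cross-check, the F1 scores can be recomputed from the printed precision and recall, since F1 is their harmonic mean. The helper `f1_from` is a hypothetical name, and the inputs are the rounded values from the outputs above, so agreement is only expected to about three decimal places.

```python
def f1_from(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed in the outputs above
f1_rf = f1_from(precision=0.9561, recall=0.6174)
f1_nb = f1_from(precision=0.1265, recall=0.8182)

print(f'Random forest F1: {f1_rf:0.4f}')
print(f'Naive Bayes F1: {f1_nb:0.4f}')
```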
