# Activity Recognition (ROAMM)
Data collected from a Samsung Gear S3 smartwatch are summarized into 7 features, computed over non-overlapping 15-second epochs of accelerometer data.

1. `MVM`: mean vector magnitude.
2. `SDVM`: standard deviation of vector magnitude.
3. `MANGLE`: mean angle between the vector magnitude and the horizontal line.
4. `SDANGLE`: standard deviation of angle between the vector magnitude and the horizontal line.
5. `P625`: fraction of power covered by frequencies in [0.6, 2.5] Hz (human movement frequency).
6. `DF`: dominant frequency.
7. `FPDF`: fraction of power covered by the dominant frequency.

The ROAMM application prompts the user for the activity type only four times a day. As a result, most data points are unlabeled and their activity types are unknown. In this notebook, we use smartwatch data collected from other participants who performed the following activities in the lab.

* Ironing (<font color='blue'>**Standing**</font>)
* Mopping (<font color='green'>**Walking**</font>)
* Trash Removal (<font color='green'>**Walking**</font>)
* Washing Windows (<font color='blue'>**Standing**</font>)
* Computer Work (<font color='red'>**Sitting**</font>)
* Replacing Bed Sheet (<font color='green'>**Walking**</font>)
* Heavy Weight Lifting (<font color='green'>**Walking**</font>)
* Home Maintenance (<font color='blue'>**Standing**</font>)
* Laundry (<font color='blue'>**Standing**</font>)
* Yoga (<font color='blue'>**Standing**</font>)
* Chest Press (<font color='red'>**Sitting**</font>)
* Leg Extension (<font color='red'>**Sitting**</font>)
* Leg Curl (<font color='red'>**Sitting**</font>)

We are interested in classifying each incoming 15-second epoch as <font color='red'>**Sitting**</font>, <font color='blue'>**Standing**</font>, or <font color='green'>**Walking**</font>. The training labels shown above were therefore assigned based on how closely each performed chore resembles one of these activities of interest.
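
The relabeling listed above can be written as a simple lookup table. The names `ACTIVITY_TO_LABEL` and `to_training_label` are hypothetical; only the activity-to-label pairs come from the list.

```python
from collections import Counter

# Lab activity -> label of interest, transcribed from the list above.
ACTIVITY_TO_LABEL = {
    "Ironing": "Standing",
    "Mopping": "Walking",
    "Trash Removal": "Walking",
    "Washing Windows": "Standing",
    "Computer Work": "Sitting",
    "Replacing Bed Sheet": "Walking",
    "Heavy Weight Lifting": "Walking",
    "Home Maintenance": "Standing",
    "Laundry": "Standing",
    "Yoga": "Standing",
    "Chest Press": "Sitting",
    "Leg Extension": "Sitting",
    "Leg Curl": "Sitting",
}

def to_training_label(activity):
    """Return the coarse training label for a lab activity name."""
    return ACTIVITY_TO_LABEL[activity]

# Number of lab activities mapped to each training label.
activities_per_label = Counter(ACTIVITY_TO_LABEL.values())
print(activities_per_label)
```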

In our training data set, data points belonging to each activity type are distributed as follows.
<table style="width:50%">
  <tr>
    <th>Activity Label</th>
    <th>Number of 15-second Epochs</th>
  </tr>
  <tr>
      <td><font color='red'>**Sitting**</font></td>
      <td>198</td>
  </tr>
  <tr>
      <td><font color='blue'>**Standing**</font></td>
      <td>639</td>
  </tr>
  <tr>
      <td><font color='green'>**Walking**</font></td>
      <td>757</td>
  </tr>
</table>
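
To make the class imbalance in the table explicit, the counts can be turned into shares and durations. This is plain arithmetic on the numbers in the table; the variable names are illustrative.

```python
# Epoch counts per label, copied from the table above.
epoch_counts = {"Sitting": 198, "Standing": 639, "Walking": 757}
total_epochs = sum(epoch_counts.values())

# Share of the training set and recorded time (15 seconds per epoch).
shares = {label: n / total_epochs for label, n in epoch_counts.items()}
minutes = {label: n * 15 / 60 for label, n in epoch_counts.items()}

for label in epoch_counts:
    print(f"{label}: {epoch_counts[label]} epochs, "
          f"{shares[label]:.1%} of {total_epochs}, {minutes[label]:.1f} min")
```

<font color='red'>**Sitting**</font> is the smallest class, at roughly one eighth of the training epochs.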

And here is how the training data points are scattered.
![Training Data](images/training_data.png)

One issue is apparent: although <font color='red'>**Sitting**</font> activities are quite distinct from the rest, <font color='blue'>**Standing**</font> and <font color='green'>**Walking**</font> activities are not well separated in our training data. This is partly due to the features being used and partly because the training data lack pure walking: the <font color='green'>**Walking**</font> label is assigned to chores that only resemble walking.

The figure below shows the importance of each feature in the trained classifier.

![Feature importance](images/feature_importance.png)
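
Importances like these can be read from a fitted random forest through its `feature_importances_` attribute. The sketch below assumes the classifier is a scikit-learn `RandomForestClassifier` with 100 trees, which the import and the footnote in this notebook suggest but do not confirm. Because the real training set is not public, it uses synthetic stand-in data, so the printed numbers say nothing about the real features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["mvm", "sdvm", "mangle", "sdangle", "p625", "df", "fpdf"]

# Synthetic stand-in data: labels depend only on the first and third columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(feature_names)))
y = np.where(X[:, 0] > 0.5, "Walking",
             np.where(X[:, 2] > 0, "Standing", "Sitting"))

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one value per feature, summing to 1.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```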

Since the dataset is not yet publicly available, the training set is not included in this repository. However, the trained classifier is located next to this notebook (`trained_classifier.pkl`).

Here is an example of how to use this classifier to obtain activity labels.

---
\[1\](_the best performance was obtained for `number-of-trees` = 100_)

### Example
In this example, we use the trained classifier to obtain the following columns:
1. `predicted_activity`: Activity label predicted by the trained model. (i.e., activity with the maximum probability)
2. `sitting_prob`: Probability of data point belonging to <font color='red'>**Sitting**</font> class.
3. `standing_prob`: Probability of data point belonging to <font color='blue'>**Standing**</font> class.
4. `walking_prob`: Probability of data point belonging to <font color='green'>**Walking**</font> class.

In [None]:
import pandas as pd
import joblib  # `sklearn.externals.joblib` was removed in scikit-learn 0.23
from sklearn.ensemble import RandomForestClassifier

# Loading the trained classifier
clf = joblib.load(r"trained_classifier.pkl")

# File which has unlabeled data.
filename = r"~/Desktop/test.csv"

# File which should contain the data with predicted labels.
output_filename = r"~/Desktop/test_labeled.csv"

# The input file should have the following columns.
# The `activity` column is the label provided by the user. It is not used in this script but is kept anyway.
selected_features = ["mvm", "sdvm", "mangle", "sdangle", "p625", "df", "fpdf", "activity"]

# loading data into memory
unlabeled_df = pd.read_csv(filename)
test_df = unlabeled_df.loc[:, selected_features]

# Handling 'None' strings and missing values: convert every feature column to float.
# Values that cannot be parsed as numbers (e.g., the string 'None') become NaN.
for feature in selected_features[:-1]:
    test_df[feature] = pd.to_numeric(test_df[feature], errors="coerce")

# test_Y (if provided) is the activity label that user provides. Not used.
test_X, test_Y = test_df.loc[:, selected_features[:-1]], test_df.loc[:, selected_features[-1]]

# For performance purposes, it is better to feed the classifier all the unlabeled data at once.
# We later remove the predicted labels for those with missing values.

# cells with missing values
nan_idx = test_X.isnull().values

# replace NaNs with 0; we can use any arbitrary number. We ignore samples with missing values later.
test_X[nan_idx] = 0

# Predicting the activity label for the unlabeled data points.
predicted_activity_labels = clf.predict(test_X)

# Predicting the probability of each activity type for every data point.
# The columns of the result follow the order of `clf.classes_`.
predicted_probabilities = clf.predict_proba(test_X)

# Forming a neat data frame from the outcome
predicted_df = pd.DataFrame(data={'Activity_Label':predicted_activity_labels,
                                  'Sitting_Prob': [None] * test_df.shape[0],
                                  'Standing_Prob': [None] * test_df.shape[0],
                                  'Walking_Prob': [None] * test_df.shape[0]})
predicted_df.loc[:, ["Sitting_Prob", "Standing_Prob", "Walking_Prob"]] = predicted_probabilities

# Now we remove the predictions for data points that had missing values.
for i in range(len(nan_idx)):
    if nan_idx[i].any():  # at least one feature was missing in this row
        predicted_df.loc[i, ["Activity_Label", "Sitting_Prob", "Standing_Prob", "Walking_Prob"]] = [None]*4

# Stitching the outcome back to the original data, so we can save the data points with their predicted labels.
unlabeled_df.loc[:, "predicted_activity"] = [None]*predicted_df.shape[0]
unlabeled_df.loc[:, "sitting_prob"] = [None]*predicted_df.shape[0]
unlabeled_df.loc[:, "standing_prob"] = [None]*predicted_df.shape[0]
unlabeled_df.loc[:, "walking_prob"] = [None]*predicted_df.shape[0]
unlabeled_df.loc[:, ["predicted_activity", "sitting_prob", "standing_prob", "walking_prob"]] = predicted_df.loc[:, ["Activity_Label", "Sitting_Prob", "Standing_Prob", "Walking_Prob"]].values

# Saving to file
unlabeled_df.to_csv(output_filename, index=False)
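
The script above assigns the probability columns in the order Sitting, Standing, Walking. A more defensive variant names the columns from `classes_` and blanks rows with missing features in one vectorized step. This is a sketch: since `trained_classifier.pkl` cannot be loaded here, `toy_clf` is a hypothetical stand-in fitted on random data, and the class labels are assumed to be the strings `Sitting`, `Standing`, and `Walking`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["mvm", "sdvm", "mangle", "sdangle", "p625", "df", "fpdf"]

# Hypothetical stand-in for the pickled classifier, fitted on random data.
rng = np.random.default_rng(0)
train_X = pd.DataFrame(rng.normal(size=(90, len(features))), columns=features)
train_y = np.repeat(["Sitting", "Standing", "Walking"], 30)
toy_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(train_X, train_y)

# Unlabeled data with one missing value in the second row.
new_X = pd.DataFrame(rng.normal(size=(5, len(features))), columns=features)
new_X.iloc[1, 3] = np.nan
has_nan = new_X.isnull().any(axis=1)  # rows with at least one missing feature

# Fill NaNs with an arbitrary value so all rows can be scored in one call.
filled = new_X.fillna(0)

# Name the probability columns from `classes_` instead of assuming an order.
result = pd.DataFrame(toy_clf.predict_proba(filled),
                      columns=[f"{c.lower()}_prob" for c in toy_clf.classes_],
                      index=new_X.index)
result.insert(0, "predicted_activity", toy_clf.predict(filled))

# Discard the predictions for rows that had missing features.
result.loc[has_nan, :] = np.nan
print(result)
```

With string labels, `classes_` is sorted alphabetically, which is why the fixed column order in the script above lines up; reading the names from the classifier removes that implicit dependency.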