# CS 6961 - A3

## 1.2 - Collection
### Data cleaning and train/test splitting
I recorded each activity for 6 minutes, which leaves room to discard the time the phone spent settling in my pocket at the start and being taken out of my pocket at the end.

Because of this, the first step in data processing is to drop every row with a timestamp below 30000 (0:30) or at or above 330000 (5:30).

_Note: All data was collected with relative timestamps measured in milliseconds_

Second, we'll need to split the data into train and test sets. The first 4 minutes of the remaining data will be training data, and the final minute will be test data.
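As a minimal sketch of the rule above (separate from the actual processing cell), the function below assigns a single timestamp to `drop`, `train`, or `test`. The name `assign_split` and the choice to represent each 100 ms bin by its midpoint are assumptions made only for this illustration.

```python
MIN_MS = 30_000                            # 0:30, start of the kept window
MAX_MS = 330_000                           # 5:30, end of the kept window
TRAIN_CUTOFF_MS = MIN_MS + 4 * 60 * 1000   # 4:30, last training timestamp


def assign_split(relative_time_ms):
    """Return 'drop', 'train', or 'test' for a relative timestamp in ms."""
    if relative_time_ms < MIN_MS or relative_time_ms >= MAX_MS:
        return "drop"
    return "train" if relative_time_ms <= TRAIN_CUTOFF_MS else "test"


# Assumption: a 6-minute recording with one sample per 100 ms bin,
# where each bin is represented by its midpoint.
midpoints = [start + 50 for start in range(0, 360_000, 100)]

counts = {"drop": 0, "train": 0, "test": 0}
for t in midpoints:
    counts[assign_split(t)] += 1

print(counts)
```

Under these assumptions the kept window holds 2400 training bins and 600 test bins, i.e. 4 minutes and 1 minute at 10 samples per second.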

In [4]:
#!pip3 install pandas
import pandas as pd
import os

data_dir = "data_raw"
train_dir = "train"
test_dir = "test"
min_sec = 30
max_sec = 330
train_duration = 4 * 60 * 1000 # 4 minutes in milliseconds
train_cutoff = (min_sec * 1000) + train_duration

# get csv filenames in the data directory
data_filenames = os.listdir("./{}".format(data_dir))

training_dfs = {}

for data_file in data_filenames:
    # read into a dataframe
    temp_df = pd.read_csv('./{}/{}'.format(data_dir, data_file))
    
    # keep only rows where min_sec <= timestamp < max_sec (compared in milliseconds)
    temp_df = temp_df[(min_sec * 1000) <= temp_df['relative_time']]
    temp_df = temp_df[temp_df['relative_time'] < (max_sec * 1000)]
    
    # add a column with datetimes to make resampling easier
    temp_df = temp_df.assign(abs_time=pd.to_datetime(temp_df["relative_time"], unit='ms'))
    temp_df = temp_df.resample("100ms", on="abs_time").mean()

    # now we need to split the data into train and test data
    # the first 4 minutes will be training data, the last minute will be test data
    # in terms of this data, that means 4 minutes plus the min_sec value where we cut off before (30 sec) is the new cutoff
    # anything with a timestamp less than 270000 (4:30) will be training data, anything greater is test data
    
    train_df = temp_df[temp_df['relative_time'] <= train_cutoff]
    test_df = temp_df[temp_df['relative_time'] > train_cutoff]

    train_status = u'\u2713' if 2399 <= len(train_df.index) <= 2401 else u'\u26A0'
    test_status = u'\u2713' if 599 <= len(test_df.index) <= 601 else u'\u26A0'
    print(data_file[:-4])
    print("Train: {} entries {} | Test: {} entries {}".format(len(train_df.index), train_status , len(test_df.index), test_status))
    
    training_dfs[data_file[:-4]] = train_df
    
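    # write data to CSVs in their respective directories, creating them if needed
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)
    train_df.to_csv("./{}/{}".format(train_dir, data_file), mode="w")
    test_df.to_csv("./{}/{}".format(test_dir, data_file), mode="w")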
    print("Data files written\n")
    

stairs
Train: 2400 entries ✓ | Test: 600 entries ✓
Data files written

jogging
Train: 2399 entries ✓ | Test: 600 entries ✓
Data files written

vehicle
Train: 2400 entries ✓ | Test: 600 entries ✓
Data files written

walking
Train: 2400 entries ✓ | Test: 599 entries ✓
Data files written

web_browsing
Train: 2400 entries ✓ | Test: 600 entries ✓
Data files written
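To see the resampling step from the cell above in isolation, here is a small sketch on a handful of hypothetical samples (the timestamps and `AccX` values are invented for illustration). It assumes the same column names as my data and shows how irregularly spaced readings get averaged into 100 ms bins.

```python
import pandas as pd

# Hypothetical, irregularly spaced samples (timestamps in ms)
toy_df = pd.DataFrame({
    "relative_time": [30000, 30040, 30080, 30120, 30160, 30210],
    "AccX": [1.0, 2.0, 3.0, 4.0, 6.0, 5.0],
})

# Same two steps as the processing cell: build a datetime column,
# then average everything that falls into the same 100 ms bin
toy_df = toy_df.assign(abs_time=pd.to_datetime(toy_df["relative_time"], unit="ms"))
resampled = toy_df.resample("100ms", on="abs_time").mean()

print(resampled)
```

The six readings collapse into three rows, one per 100 ms bin, and `abs_time` becomes the index of the result. That is why the plotting code calls `reset_index()` before using `abs_time` as a column.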



### Plot training data

In [5]:
#!pip3 install altair
import altair as alt

all_graphs = {}
fields = ["AccX", "AccY", "AccZ"]

for name, training_df in training_dfs.items():
    graphs = []
    for field in fields:
        temp_chart = alt.Chart(training_df.reset_index(), title=name
            ).mark_line(
            ).encode(
                x=alt.X('abs_time', axis=alt.Axis(format='%M:%S'), title="Time"),
                y=field
            )
        
        graphs.append(temp_chart)
    all_graphs[name] = graphs
           
(all_graphs["walking"][0] | all_graphs["walking"][1] | all_graphs["walking"][2]) &\
    (all_graphs["jogging"][0] | all_graphs["jogging"][1] | all_graphs["jogging"][2]) &\
    (all_graphs["stairs"][0] | all_graphs["stairs"][1] | all_graphs["stairs"][2]) &\
    (all_graphs["web_browsing"][0] | all_graphs["web_browsing"][1] | all_graphs["web_browsing"][2]) &\
    (all_graphs["vehicle"][0] | all_graphs["vehicle"][1] | all_graphs["vehicle"][2])


## 1.3 - Make Features
First, we'll review the charts above to identify useful features

Then, we'll write a function that extracts them

Finally, we'll divide the training and test data into 10 second segments that can be processed by the function

### Identifying useful features
Looking at the graphs, the clearest distinctions I see between the different activities are:
- how much they vary from a central value
- what the typical central value is
    - e.g. walking AccX hangs around 1.7ish and jumps around to consistently +/-1.5 from that value
    - each activity has pretty stable patterns like this, and they don't overlap much

I also noted that these stay fairly consistent across all three dimensions (X, Y, Z)

As a result, I chose to use the following features:
- means for X, Y, and Z
- standard deviations for X, Y, Z

### Conversion function
This function is pretty straightforward. It takes in a dataframe with 100 entries (10 seconds at 0.1 second intervals)

It uses built-in pandas functions `mean()` and `std()` to compute the mean and standard deviation for each column

It returns a dictionary with the following format:
```
{
    "label": the activity,
    "x_mean": the mean of the 100 AccX values,
    "y_mean": the mean of the 100 AccY values,
    "z_mean": the mean of the 100 AccZ values,
    "x_std": the standard deviation of the 100 AccX values,
    "y_std": the standard deviation of the 100 AccY values,
    "z_std": the standard deviation of the 100 AccZ values,
}
```
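As a sanity check on what these features measure, here is a standard-library sketch on a toy window. The values and the helper name `window_features` are hypothetical, and a real window would hold 100 rows rather than 4. One detail worth knowing: pandas' `std()` defaults to the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes too.

```python
import statistics

# Hypothetical toy window; a real window would hold 100 rows per axis
window = {
    "AccX": [1.0, 2.0, 3.0, 4.0],
    "AccY": [0.5, 0.5, 0.5, 0.5],
    "AccZ": [9.0, 10.0, 11.0, 10.0],
}


def window_features(window, label):
    """Build the same six features plus the label from plain lists."""
    features = {"label": label}
    for axis in ("X", "Y", "Z"):
        values = window["Acc" + axis]
        features[axis.lower() + "_mean"] = statistics.mean(values)
        # sample standard deviation (n - 1 in the denominator)
        features[axis.lower() + "_std"] = statistics.stdev(values)
    return features


toy_features = window_features(window, "walking")
print(toy_features)
```

A perfectly flat signal like the toy `AccY` gets a standard deviation of 0, while a signal that swings around its mean gets a larger one, which is the "how much they vary from a central value" distinction noted earlier.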

In [6]:
# features to include
    # X, Y, Z means
    # X, Y, Z standard deviations
# expects a dataframe with 100 entries
def data_to_features(input_data, label):
    features = {"label": label}
    # means
    features["x_mean"] = input_data["AccX"].mean()
    features["y_mean"] = input_data["AccY"].mean()
    features["z_mean"] = input_data["AccZ"].mean()

    # standard deviations
    features["x_std"] = input_data["AccX"].std()
    features["y_std"] = input_data["AccY"].std()
    features["z_std"] = input_data["AccZ"].std()

    return features

### Splitting into 10 second groups
The last step in prepping the training and test data is to create groups of 10 seconds for both.

To do this, we'll use `np.array_split()` to create 24 groups for the training data, then 6 for the test data
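Some of the files came out one row short (for example 2399 training rows instead of 2400), so it helps to see how `np.array_split()` handles a length that doesn't divide evenly. The sketch below uses a plain integer array as a stand-in for a 2399-row training file.

```python
import numpy as np

# Stand-in for a training file that ended up with 2399 rows
rows = np.arange(2399)

# Ask for 24 groups, one per 10-second window
groups = np.array_split(rows, 24)
sizes = [len(group) for group in groups]

print(len(groups))
print(sizes)
```

The first 23 groups hold 100 rows and the last holds 99, so every file still yields exactly 24 windows. `np.split()` would raise a `ValueError` here because it requires an equal division.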

In [7]:
import numpy as np

train_filenames = os.listdir("./{}".format(train_dir))

# collect one feature dictionary per 10-second window
training_rows = []

# for each training file, load dataframe
for train_file in train_filenames:
    activity_train_df = pd.read_csv("./{}/{}".format(train_dir, train_file))
    # then, break into 24 groups (1 for each 10 seconds)
    activity_10_sec_splits = np.array_split(activity_train_df, 24)
    for data_10_sec in activity_10_sec_splits:
        training_rows.append(data_to_features(data_10_sec, train_file[:-4]))

# DataFrame.append() was removed in pandas 2.0, so build the dataframe in one step
combined_training_df = pd.DataFrame(training_rows)

combined_training_df.to_csv("combined_training_data.csv")
csv_status = u'\u2713' if len(combined_training_df.index) == 120 else u'\u26A0'
print("Wrote {}/120 entries to 'combined_training_data.csv' {}".format(len(combined_training_df.index), csv_status))

Wrote 120/120 entries to 'combined_training_data.csv' ✓


In [8]:
# now we'll do a very similar process to make the combined test file
test_filenames = os.listdir("./{}".format(test_dir))

# collect one feature dictionary per 10-second window
test_rows = []

# for each test file, load dataframe
for test_file in test_filenames:
    activity_test_df = pd.read_csv("./{}/{}".format(test_dir, test_file))
    # then, break into 6 groups (1 for each 10 seconds)
    activity_10_sec_splits = np.array_split(activity_test_df, 6)
    for data_10_sec in activity_10_sec_splits:
        test_rows.append(data_to_features(data_10_sec, test_file[:-4]))

# DataFrame.append() was removed in pandas 2.0, so build the dataframe in one step
combined_test_df = pd.DataFrame(test_rows)

combined_test_df.to_csv("combined_test_data.csv")
csv_status = u'\u2713' if len(combined_test_df.index) == 30 else u'\u26A0'
print("Wrote {}/30 entries to 'combined_test_data.csv' {}".format(len(combined_test_df.index), csv_status))

Wrote 30/30 entries to 'combined_test_data.csv' ✓


## 1.4 - Classification
Now we use sklearn to train a few different models

I chose to use:
- Logistic Regression
- K-Nearest Neighbors
- Linear SVM

I used sklearn's shuffle to make sure the order of the entries wasn't affecting how the models learned

In [9]:
#!pip3 install -U scikit-learn
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.utils import shuffle

# set up models
models = {}
models["Logistic Regression"] = LogisticRegression(max_iter=400) # set to 400 to silence iteration warning from sklearn
models["KNN"] = KNeighborsClassifier()
models["SVM"] = SVC(kernel='linear')

# separate features from labels
x_train = combined_training_df.drop('label', axis=1)
y_train = combined_training_df[['label']].values.ravel()

# shuffle them randomly
x_train, y_train = shuffle(x_train, y_train, random_state=0)

# train the models
for model in models.values():
    model.fit(x_train, y_train)

# separate features from labels for test data
x_test = combined_test_df.drop('label', axis=1)
y_test = combined_test_df[['label']].values.ravel()

# shuffle them randomly
x_test, y_test = shuffle(x_test, y_test, random_state=0)

# test the models and get the accuracy
scores = {}
for name, model in models.items():
    scores[name] = model.score(x_test, y_test)

# print the scores
for name, score in scores.items():
    print("{}: {}%".format(name.rjust(19, " "), score * 100))



Logistic Regression: 100.0%
                KNN: 100.0%
                SVM: 100.0%


### Reflection
Each of the models produced 100% accuracy on the test set, which is rarely achievable in practice. My first suspicion was overfitting.

Looking back at the data more carefully, though, I think you could determine the type of activity pretty accurately from `x_mean` alone: the values are stable within each activity, and there's little to no overlap between activities.

I think these models would struggle much more if we were to combine multiple people's sensor data into a larger, more diverse dataset.
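One way to check the `x_mean` claim would be to compare the range of values each activity covers and list any pairs whose ranges overlap. The sketch below is a rough version of that idea. The helper name `overlapping_pairs` is hypothetical, and the numbers are placeholders for illustration only, not values from my recordings.

```python
def overlapping_pairs(values_by_label):
    """Return label pairs whose [min, max] ranges of a feature overlap."""
    ranges = {label: (min(v), max(v)) for label, v in values_by_label.items()}
    labels = sorted(ranges)
    pairs = []
    for i, first in enumerate(labels):
        for second in labels[i + 1:]:
            low_a, high_a = ranges[first]
            low_b, high_b = ranges[second]
            # two closed intervals overlap when each starts before the other ends
            if low_a <= high_b and low_b <= high_a:
                pairs.append((first, second))
    return pairs


# Placeholder x_mean values, invented for illustration only
toy_x_means = {
    "walking": [1.6, 1.7, 1.8],
    "jogging": [3.0, 3.2, 3.1],
    "vehicle": [0.1, 0.2, 0.15],
}
separable = overlapping_pairs(toy_x_means)

# Adding an activity whose range intersects walking's produces one overlapping pair
toy_x_means["stairs"] = [1.75, 2.0]
with_overlap = overlapping_pairs(toy_x_means)

print(separable)
print(with_overlap)
```

An empty list for the real `x_mean` column would support the idea that a single feature is enough for this dataset. Overlapping pairs would be the place to expect trouble once more people's data is mixed in.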
