In [None]:
# scratch notebook

In [1]:
# can any of these be deleted for the final notebook?

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression,\
LassoCV, RidgeCV, ElasticNetCV, LogisticRegression

from sklearn.model_selection import train_test_split, cross_validate, KFold, \
cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures, \
FunctionTransformer
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix, \
precision_score, recall_score, accuracy_score, f1_score, \
log_loss, roc_curve, roc_auc_score, classification_report 

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# from sklearn.impute import SimpleImputer

# from imblearn.over_sampling import SMOTE
# from imblearn.pipeline import Pipeline as ImPipeline

from pathlib import Path

In [None]:
# not imported yet: plot_confusion_matrix and plot_roc_curve were removed from sklearn.metrics in scikit-learn 1.2;
# ConfusionMatrixDisplay and RocCurveDisplay are the replacements

# Contents

In [None]:
# image

# Overview

# Business Understanding

I am interested in taking raw data from a body-worn sensor recorded during simulated activities, including falls, and building a model that accurately predicts whether a person has fallen. As a former physical therapist who worked in a variety of settings, I believe there is great value in real-time recognition of a fall event, so that staff or family receive immediate notice and the patient receives prompt medical attention.

This business challenge applies to the healthcare industry and is relevant across the continuum of care: acute care hospitals, subacute/rehab settings, long-term care/nursing homes, and elderly residents living alone in the community with family support. My target audience would ideally include administrators considering better real-time fall monitoring in their facilities, as well as family members of elderly residents living independently. Alongside the original research that provided this dataset, my analysis would be further evidence that this monitoring system can yield accurate and actionable information that improves patient safety and reduces costly medical complications related to falls.

My domain knowledge includes 15 years of experience as a physical therapist in acute care, rehab / long-term care, outpatient, and home care environments. I also recently wrote a __[blog post](https://medium.com/@jonmccaffrey524/deep-learning-and-human-activity-recognition-98cb43da229)__ on deep learning and human activity recognition, and read the __[paper](https://arcoresearch.com/2021/11/23/the-shapes-smart-mirror-approach-for-independent-living-healthy-and-active-ageing/)__ published by the ARCO research group related to their fall monitoring system.

In summary, I am motivated by the fact that falls in healthcare facilities and in the home can cause serious injury and complications for patients / residents, and are tremendously costly to our healthcare system. Though this project doesn't aim to prevent falls, it does aim to verify that a fall can be accurately detected from the sensor data provided, contributing to the goal of improved real-time monitoring and, hopefully, emergency medical management for someone who has sustained a fall.

# Data Understanding

The data I plan to explore and model comes from the ARCO research group in Spain. It consists of activity-monitoring recordings of 17 participants performing a variety of Activities of Daily Living (ADL) tasks as well as simulated falls. The data was obtained __[here](https://arcoresearch.com/2021/04/16/dataset-for-fall-detection/)__ and downloaded as a .zip file. The features include detailed sensor information for acceleration (in g), rotation (in deg/sec), and absolute orientation in Euler angles. There are 3 different folders of CSV files in total. Included in the data is a clear target (0 or 1) indicating whether a fall occurred during the recording of the activity. Previous work has been done by the authors, who reference a “machine learning algorithm” they created with “100% accuracy” in a “controlled environment”. Though the details of the algorithm are vague, I aim to replicate their findings while also building my own machine-learning understanding of human activity recognition tasks.

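| Subject | Age | Weight (kg) | Height (m) |
|---|---|---|---|
| 1 | 24 | 84 | 1.90 |
| 2 | 27 | 90 | 1.70 |
| 3 | 24 | 69 | 1.80 |
| 4 | 24 | 65 | 1.59 |
| 5 | 43 | 83 | 1.77 |
| 6 | 27 | 65 | 1.70 |
| 7 | 34 | 76 | 1.76 |
| 8 | 42 | 89 | 1.84 |
| 9 | 24 | 65 | 1.75 |
| 10 | 24 | 56.2 | 1.75 |
| 11 | 23 | 74.3 | 1.72 |
| 12 | 22 | 85 | 1.72 |
| 13 | 41 | 72 | 1.65 |
| 14 | 36 | 80 | 1.85 |
| 15 | 31 | 75 | 1.64 |
| 16 | 22 | 64.5 | 1.71 |
| 17 | 43 | 71 | 1.76 |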

fall-dataset-features: Each row of this dataset contains the features used in our study to filter raw data and describe a movement. Each row represents a complete exercise (Fall or ADL).

fall-dataset-raw: Raw data from a one-second window when the user performed the activity. A single row is not meaningful on its own because it only contains raw data at one instant in time. To get relevant information you must use all the rows that share the same value in the index column; together those rows make up one exercise over time.
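
To make the "use all rows with the same index" idea concrete, here is a minimal sketch on toy data. It assumes the index column is the one that appears as `Feature Line` in the raw CSVs, and the acceleration values are made up for illustration. The name `acc_mag` and the summary columns are hypothetical, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one subject's fall-dataset-raw file (hypothetical values).
raw = pd.DataFrame({
    "Feature Line": [1, 1, 2, 2],
    "Acc(X)": [0.0, 3.0, 1.0, 0.0],
    "Acc(Y)": [0.0, 4.0, 0.0, 0.0],
    "Acc(Z)": [1.0, 0.0, 0.0, 1.0],
    "Fall": [1, 1, 0, 0],
})

# Magnitude of the acceleration vector for every reading (in g).
raw["acc_mag"] = np.sqrt(
    raw["Acc(X)"] ** 2 + raw["Acc(Y)"] ** 2 + raw["Acc(Z)"] ** 2
)

# Collapse each exercise (all rows sharing a Feature Line) into one summary row.
per_exercise = raw.groupby("Feature Line").agg(
    n_readings=("acc_mag", "size"),
    peak_acc=("acc_mag", "max"),
    fall=("Fall", "max"),
)
print(per_exercise)
```

On a dataframe that combines several subjects, the grouping key would likely need to include the subject as well (for example `["File", "Feature Line"]`), since the index values presumably restart in each file.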

fall-dataset-all: These files hold all the data collected while the users performed the exercises. They could be useful if you need data outside the one-second window. This data is not labeled, but you can use fall-dataset-raw to find when a fall or an ADL occurred. Both fall-dataset-raw and fall-dataset-all include a timestamp to ease this task.
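
One possible way to carry labels from fall-dataset-raw over to fall-dataset-all using the timestamps is sketched below on toy data. It assumes each labeled window spans from its smallest to its largest `Timestamp`, and that rows outside every window should stay unlabeled. The frames `raw` and `all_rows` are hypothetical stand-ins for one subject's files.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for one subject's files.
raw = pd.DataFrame({
    "Feature Line": [1, 1, 1, 2, 2, 2],
    "Timestamp": [100, 110, 120, 300, 310, 320],
    "Fall": [0, 0, 0, 1, 1, 1],
})
all_rows = pd.DataFrame({"Timestamp": [90, 105, 120, 200, 305, 330]})

# One row per labeled window: its time span and its label.
windows = (
    raw.groupby("Feature Line")
    .agg(start=("Timestamp", "min"), end=("Timestamp", "max"), Fall=("Fall", "max"))
    .rename_axis("window")
    .reset_index()
)

# Start with every row unlabeled (nullable integer keeps missing values as <NA>).
all_rows["Fall"] = pd.Series(pd.NA, index=all_rows.index, dtype="Int64")

# Copy each window's label onto the rows whose timestamp falls inside it.
for w in windows.itertuples(index=False):
    inside = all_rows["Timestamp"].between(w.start, w.end)
    all_rows.loc[inside, "Fall"] = w.Fall

print(all_rows)
```

The loop is fine for a few dozen windows per subject; it would need to run per subject, since timestamps from different recordings may overlap.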

# Data Preparation

The .zip file contains 3 folders of CSV files.
- The 1st folder (fall-dataset-raw) contains 17 CSV files, one for each participant, with the raw data indexed by the task being performed (12 columns each, roughly 2,500 to 3,400 rows each).
- The 2nd folder (fall-dataset-features) contains 17 CSV files with the feature information (25 columns each, 45 rows each, one summary row for each of 45 tasks, including simulated falls).
- The 3rd folder (fall-dataset-all) contains 17 CSV files with all the raw data for each participant, NOT indexed by task (10 columns each, roughly 24K to 76K rows each).

The data types are all numeric (int or float), including the timestamp variable and the 0/1 indicator of fall occurrence. The data appears to be split about evenly between falls and non-fall tasks, so there is no significant class imbalance. The amount of data available for each of the 45 tasks and each of the 17 participants also appears roughly equal.

The libraries I intend to use include at least Pandas, NumPy, sklearn, matplotlib, and Seaborn. The data is very clean, with no nulls. Preprocessing may include clustering to see whether any groupings can be discerned, and may require scaling and dimensionality reduction as well. I will need more domain knowledge about the units of acceleration, rotation, and absolute orientation in Euler angles. At a minimum, I expect to use the 1st folder (raw data indexed by task), which contains roughly 2,500 to 3,400 rows per participant.

Visualizations could include confusion matrices and ROC curves for iterative modeling, then separate visuals to represent accuracy by task (stacked bar graph?). I could also do a bar graph for eventual feature importance.
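
The balance claims above can be checked directly once the data is loaded. The sketch below uses a toy frame with the same column names as the raw data (`File`, `Feature Line`, `Fall`), and the values are made up. It assumes one label per exercise, so balance is reported both per sensor reading and per exercise.

```python
import pandas as pd

# Toy stand-in for the combined raw dataframe (hypothetical values).
df = pd.DataFrame({
    "File": ["Subject1-raw"] * 6 + ["Subject2-raw"] * 4,
    "Feature Line": [1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
    "Fall": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
})

# Row-level balance: share of sensor readings that belong to a fall.
row_balance = df["Fall"].value_counts(normalize=True)

# Exercise-level balance: one label per (subject, exercise) pair.
per_exercise = df.groupby(["File", "Feature Line"])["Fall"].max()
exercise_balance = per_exercise.value_counts(normalize=True)

# Number of exercises recorded for each subject.
exercises_per_subject = per_exercise.groupby(level="File").size()

print(row_balance, exercise_balance, exercises_per_subject, sep="\n\n")
```

The exercise-level view is probably the more informative one, since exercises with more readings would otherwise weigh more heavily in the row-level proportions.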

## Unzipping folders in Jupyter Notebook

In [None]:
# files too large (~117MB) to save on GitHub unless they remain compressed (~30MB)

In [13]:
cd data

C:\Users\JonMc\Documents\Flatiron\Fall_Detection_Model\data


In [14]:
ls

 Volume in drive C is Windows
 Volume Serial Number is 62FE-3091

 Directory of C:\Users\JonMc\Documents\Flatiron\Fall_Detection_Model\data

12/12/2022  05:15 PM    <DIR>          .
12/12/2022  05:13 PM    <DIR>          ..
12/12/2022  09:31 AM        31,452,115 fall-dataset.zip
12/12/2022  09:31 AM           227,728 test_dataset.zip
               2 File(s)     31,679,843 bytes
               2 Dir(s)  385,159,757,824 bytes free


In [15]:
! unzip fall-dataset

Archive:  fall-dataset.zip
   creating: fall-dataset/
   creating: fall-dataset/fall-dataset-features/
  inflating: fall-dataset/fall-dataset-features/Subject10.csv  
  inflating: fall-dataset/fall-dataset-features/Subject11.csv  
  inflating: fall-dataset/fall-dataset-features/Subject12.csv  
  inflating: fall-dataset/fall-dataset-features/Subject13.csv  
  inflating: fall-dataset/fall-dataset-features/Subject14.csv  
  inflating: fall-dataset/fall-dataset-features/Subject15.csv  
  inflating: fall-dataset/fall-dataset-features/Subject16.csv  
  inflating: fall-dataset/fall-dataset-features/Subject17.csv  
  inflating: fall-dataset/fall-dataset-features/Subject1.csv  
  inflating: fall-dataset/fall-dataset-features/Subject2.csv  
  inflating: fall-dataset/fall-dataset-features/Subject3.csv  
  inflating: fall-dataset/fall-dataset-features/Subject4.csv  
  inflating: fall-dataset/fall-dataset-features/Subject5.csv  
  inflating: fall-dataset/fall-dataset-features/Subject6.csv  
  infla
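
If the `unzip` command is not available in the shell (it is not guaranteed on Windows), the standard-library `zipfile` module can do the same job from Python. This is a minimal sketch: the helper name `extract_zip` is hypothetical, and the demo runs on a throwaway archive rather than on `fall-dataset.zip`.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the archive's member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Demo on a small temporary archive so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo/Subject1.csv", "Acc(X),Fall\n0.1,0\n")

    members = extract_zip(demo_zip, Path(tmp) / "out")
    extracted_ok = (Path(tmp) / "out" / "demo" / "Subject1.csv").exists()

print(members, extracted_ok)
```

For this project the call would presumably look like `extract_zip("fall-dataset.zip", ".")` when run from inside the `data` folder.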

## Creating a dataframe for fall-dataset-raw

In [85]:
# can replace this pathname with the full path to the folder locally
path_raw = r'C:\Users\JonMc\Documents\Flatiron\Fall_Detection_Model\data\fall-dataset\fall-dataset-raw' 

# Get the files from the path provided
files_raw = Path(path_raw).glob('*.csv') 

In [86]:
# this for loop will create a separate column based on the filename, to separate subjects if needed

dfs_1 = []
for f in files_raw:
    data = pd.read_csv(f)
    # .stem is a pathlib property that gives the filename without the extension
    data['File'] = f.stem
    dfs_1.append(data)

In [87]:
# concatenating all 17 files into one dataframe
df_raw = pd.concat(dfs_1, ignore_index=True)
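
Because the same read-tag-concatenate steps are repeated for each of the three folders, they could be wrapped in one helper. This is a sketch only: the function name `load_subject_csvs` is hypothetical, and the demo writes two tiny CSVs to a temporary folder instead of reading the real dataset. Sorting the paths makes the row order reproducible, since `glob` does not guarantee any order.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_subject_csvs(folder):
    """Read every CSV in folder into one DataFrame, tagging rows with the file stem."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(csv_path)
        frame["File"] = csv_path.stem  # filename without the .csv extension
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no CSV files found in {folder}")
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Subject1-raw", "Subject2-raw"):
        pd.DataFrame({"Acc(X)": [0.1, 0.2], "Fall": [0, 1]}).to_csv(
            Path(tmp) / f"{name}.csv", index=False
        )
    demo = load_subject_csvs(tmp)

print(demo)
```

With a helper like this, each dataframe would be a single call, for example `df_raw = load_subject_csvs(path_raw)`.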

In [88]:
df_raw.head()

Unnamed: 0,Feature Line,Acc(X),Acc(Y),Acc(Z),Rot(X),Rot(Y),Rot(Z),Pitch,Roll,Yaw,Timestamp,Fall,File
0,1,3.191406,0.768555,8.799805,98.841469,-488.109772,-94.939026,8.554567,68.015976,354.055115,1612546353614,0,Subject1-raw
1,1,2.96582,0.224121,2.638672,-261.890259,-15.853659,-24.634148,7.382404,72.709183,353.782318,1612546353616,0,Subject1-raw
2,1,0.85498,0.5,0.548828,-337.865875,535.853699,49.817074,7.836745,72.958641,355.967834,1612546353657,0,Subject1-raw
3,1,-1.23877,-2.900391,-6.257324,-254.207321,460.792694,49.817074,10.936003,65.359154,0.66708,1612546353659,0,Subject1-raw
4,1,1.804688,2.567871,-0.529297,741.890259,-307.5,107.073174,26.398607,61.147324,2.398508,1612546353661,0,Subject1-raw


In [89]:
# value counts per subject
df_raw['File'].value_counts()

Subject15-raw    3370
Subject9-raw     3166
Subject2-raw     3150
Subject3-raw     3109
Subject8-raw     3019
Subject13-raw    2935
Subject10-raw    2924
Subject5-raw     2899
Subject14-raw    2897
Subject7-raw     2896
Subject11-raw    2872
Subject17-raw    2847
Subject6-raw     2837
Subject4-raw     2770
Subject1-raw     2755
Subject16-raw    2722
Subject12-raw    2545
Name: File, dtype: int64

In [81]:
# no nulls
df_raw.isna().sum()

Feature Line    0
Acc(X)          0
Acc(Y)          0
Acc(Z)          0
Rot(X)          0
Rot(Y)          0
Rot(Z)          0
Pitch           0
Roll            0
Yaw             0
Timestamp       0
Fall            0
File            0
dtype: int64

## Creating a dataframe for fall-dataset-all

In [90]:
# can replace this pathname with the full path to the folder locally
path_all = r'C:\Users\JonMc\Documents\Flatiron\Fall_Detection_Model\data\fall-dataset\fall-dataset-all' 

# Get the files from the path provided
files_all = Path(path_all).glob('*.csv')

In [91]:
# this for loop will create a separate column based on the filename, to separate subjects if needed

dfs_2 = []
for f in files_all:
    data = pd.read_csv(f)
    # .stem is a pathlib property that gives the filename without the extension
    data['File'] = f.stem
    dfs_2.append(data)

In [92]:
# concatenating all 17 files into one dataframe
df_all = pd.concat(dfs_2, ignore_index=True)

In [93]:
df_all.head()

Unnamed: 0,Acc(X),Acc(Y),Acc(Z),Rot(X),Rot(Y),Rot(Z),Pitch,Roll,Yaw,Timestamp,File
0,0.932617,-0.166504,0.411133,3.231707,-2.865854,3.536585,9.411585,64.421898,359.941193,1612546351138,Subject1-raw-all
1,0.93457,-0.166016,0.398926,3.109756,-1.280488,3.353659,9.430594,64.434891,359.882324,1612546351140,Subject1-raw-all
2,0.938477,-0.17041,0.387207,2.317073,-0.609756,2.987805,9.448231,64.434715,359.828003,1612546351141,Subject1-raw-all
3,0.937988,-0.17627,0.380371,2.195122,-0.731707,2.621951,9.465791,64.43203,359.783264,1612546351182,Subject1-raw-all
4,0.937012,-0.17334,0.384766,2.195122,-1.463415,2.256098,9.481668,64.436104,359.742218,1612546351184,Subject1-raw-all


In [94]:
df_all['File'].value_counts()

Subject15-raw-all    75650
Subject2-raw-all     57617
Subject8-raw-all     53069
Subject14-raw-all    47176
Subject9-raw-all     43937
Subject7-raw-all     43639
Subject10-raw-all    39043
Subject17-raw-all    38285
Subject12-raw-all    37134
Subject3-raw-all     34303
Subject4-raw-all     34194
Subject5-raw-all     32879
Subject13-raw-all    32634
Subject11-raw-all    29949
Subject1-raw-all     29551
Subject6-raw-all     25949
Subject16-raw-all    23706
Name: File, dtype: int64

## Creating a dataframe for fall-dataset-features

In [95]:
# can replace this pathname with the full path to the folder locally
# (this must point at fall-dataset-features, not fall-dataset-raw)
path_feat = r'C:\Users\JonMc\Documents\Flatiron\Fall_Detection_Model\data\fall-dataset\fall-dataset-features' 

# Get the files from the path provided
files_feat = Path(path_feat).glob('*.csv')

In [96]:
# this for loop will create a separate column based on the filename, to separate subjects if needed

dfs_3 = []
for f in files_feat:
    data = pd.read_csv(f)
    # .stem is a pathlib property that gives the filename without the extension
    data['File'] = f.stem
    dfs_3.append(data)

In [97]:
# concatenating all 17 files into one dataframe
df_feat = pd.concat(dfs_3, ignore_index=True)

In [98]:
df_feat.head()

In [99]:
df_feat['File'].value_counts()

# Modeling

Modeling will involve a train-test split and a baseline dummy classifier, followed by pipelines and cross-validation for logistic regression, kNN, decision trees / random forests, grid searches, and XGBoost. My target variable is the 0 or 1 label for whether a fall occurred in that segment of testing, which makes this a binary classification problem. My local computer should be sufficient for this, as it was in phase 3, but I am open to using Google Colab if I need additional computing power or if processing on my own computer is too slow. My data will be stored on my local machine.
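
As a rough illustration of the first modeling steps (train-test split, dummy baseline, then a scaled logistic regression pipeline scored with cross-validation), here is a sketch on synthetic data. `make_classification` stands in for the sensor features, and all parameter values are assumptions chosen only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, roughly balanced binary data as a placeholder for the real features.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=42,
)

# Hold out a test set that is not touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")

# First real model: scaling and logistic regression in one pipeline,
# so the scaler is refit inside each cross-validation fold.
logreg = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

baseline_acc = cross_val_score(baseline, X_train, y_train, cv=cv, scoring="accuracy")
logreg_acc = cross_val_score(logreg, X_train, y_train, cv=cv, scoring="accuracy")

print(f"baseline accuracy: {baseline_acc.mean():.3f}")
print(f"logistic regression accuracy: {logreg_acc.mean():.3f}")
```

One design point to consider with the real data: if readings from the same subject land in both the training and test sets, scores may look optimistic. Splitting by subject, for example with scikit-learn's `GroupKFold`, is one way to guard against that.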

# Evaluation

My chosen metrics for evaluation are accuracy and recall. Recall matters because I want to reduce false negatives and therefore catch all, or almost all, of the falls that occurred in the data. The MVP involves finding the model with the highest accuracy and recall possible in a reasonable timeframe with the computational resources I have available. The smaller project I hope to accomplish this first week is to get through Random Forest models and into grid searching to optimize those models. XGBoost will likely have to wait until next week. For a level up, I am considering deploying the best model with examples and explanations via Streamlit.
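
To show how the two metrics relate to false negatives, here is a small worked example with made-up labels: 4 actual falls and 4 non-falls, where the model misses one fall and raises no false alarms.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 1 = fall, 0 = no fall.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]  # one fall is missed (a false negative)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)  # (tp + tn) / all = 7 / 8
recall = recall_score(y_true, y_pred)      # tp / (tp + fn) = 3 / 4

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

Accuracy stays fairly high (0.875) while recall drops to 0.75, which is why recall is tracked separately: a single missed fall barely moves accuracy but is exactly the error this project most wants to avoid.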

# Summary