## Problem definition and data description

### Problem Definition
Our company (Spotify) would like to dynamically target advertising to non-premium members based on their physical activity while using Spotify services. For example, while a listener is enjoying a podcast and folding their laundry, they would receive an ad for laundry detergent. 

In addition, Spotify wishes to cater to our premium members by enhancing music recommendation/auto-play options based on a member's physical activity. For example, while a user is exercising, play up-tempo music; while a user is eating pasta, play Italian classics.

### Data Description

Accelerometer (measures proper acceleration) and Gyroscope (measures orientation and angular velocity) data was collected from 51 volunteer subjects. Each subject was asked to perform 18 tasks for 3 minutes each. The 18 tasks were a mix of physical activities that could be distinctly identified, such as walking, eating, laundry, etc. We (Spotify) tried to collect data for activities that our members might be doing while using our services. The tasks are listed below.

![image info](./images/Activity-Code-Table.png)

Each subject had a smartwatch placed on their dominant hand and a smartphone in their pocket. The smartphone and smartwatch both had an accelerometer and a gyroscope, yielding four total sensors (Phone - Gyroscope, Phone - Accelerometer, Watch - Gyroscope, Watch - Accelerometer).

![image info](./images/Human-With-Sensors.png)

To accommodate the four sensors, the data is split into 4 subdirectories, one for each device and sensor combination.

![image info](./images/Sensor-Subdirectories.png)

Each directory contains the sensor results for the 51 subjects' performance of the 18 activities. The results for each subject are stored in a comma-delimited text file. Since there are 51 subjects and 4 different sensors, there are a total of 204 text files. Each text file has the same six attributes: Subject-id, Activity Code, Timestamp, x, y, z.

![image info](./images/Raw-Data-Description.png)
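
The file layout described above can be sketched in plain Python. This is a minimal illustration, not the loading code used in this notebook: the file-name pattern and the subject id range (1600 to 1650) are taken from the conversion cell later in the notebook, the trailing `;` on each record matches the clean-up done there, and the sample record values are made up.

```python
# Sketch: enumerate the expected raw files and parse one raw record.
DEVICES = ["phone", "watch"]
DATA_TYPES = ["accel", "gyro"]
SUBJECT_IDS = range(1600, 1651)  # 51 subjects

def expected_files():
    """Relative paths of every raw text file (one per subject per sensor)."""
    return [
        f"{device}/{data_type}/data_{sid}_{data_type}_{device}.txt"
        for device in DEVICES
        for data_type in DATA_TYPES
        for sid in SUBJECT_IDS
    ]

def parse_record(line):
    """Split one comma-delimited record into its six typed attributes."""
    subject_id, code, timestamp, x, y, z = line.strip().rstrip(";").split(",")
    return {
        "subject_id": int(subject_id),
        "code": code,
        "timestamp": int(timestamp),
        "x": float(x),
        "y": float(y),
        "z": float(z),
    }

sample_line = "1600,A,252207666810782,-0.36,8.79,0.88;"  # hypothetical values
record = parse_record(sample_line)
print(len(expected_files()), record)
```

The count returned by `expected_files()` is a quick sanity check that 51 subjects and 4 sensors do give 204 files.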

## Data preparation process

Our data is fairly clean, so we do not need much preprocessing or data engineering. Most of the work in this project is on the machine learning side. We stuck with Dask because of its natural integration with Python and because its syntax is similar to pandas.

To clean and prepare our data and to train our models, we decided to go with Dask. Our reasoning was that, while our data is large (approx. 15 million records, ~1 GB), it is not large enough to warrant the use of Spark. The image below summarizes our thoughts on the choice between Dask and Spark.

![image info](./images/Pandas-Dask-Spark-Compare.png)
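
As a rough plausibility check on the record count quoted above, the nominal number of rows can be worked out from the study design. This is only a back-of-the-envelope sketch: the 50 ms sampling interval is the value assumed by the aggregation code later in this notebook, and real recordings will not be exactly 3 minutes long or perfectly regular.

```python
# Sketch: nominal row count implied by the study design.
SUBJECTS = 51
ACTIVITIES = 18
SECONDS_PER_ACTIVITY = 3 * 60
SENSORS = 4
SAMPLE_INTERVAL_MS = 50  # assumed nominal sensor interval (20 samples per second)

samples_per_second = 1000 // SAMPLE_INTERVAL_MS
nominal_rows = (
    SUBJECTS * ACTIVITIES * SECONDS_PER_ACTIVITY * samples_per_second * SENSORS
)
print(f"{nominal_rows:,} nominal rows")
```

The nominal figure is in the same order of magnitude as the roughly 15 million records we observed, which is comfortably within what a single machine running Dask can handle.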

### Importing the data

To sidestep the inconvenience of downloading and importing over 200 text files, we decided to host all the data on GitHub for easy access (https://github.com/gojandrooo/DSE-230/tree/main/data). To quickly pull the GitHub data into a Dask dataframe, we defined a function `collate_dask_df` that pulls in all data matching the parameters given.

Begin by importing all the necessary libraries.

When running within a *Docker* container, you will need to install libraries not already included in the image:
- uncomment the `%pip install` cell (below)
- run the cell, wait for the packages to install, and then restart the notebook kernel
- once the installs are complete, comment out the cell and run all

In [1]:
%pip install plotly
%pip install dask_distance

Note: you may need to restart the kernel to use updated packages.


In [2]:
#set a random state seed for replication
seed=23

In [3]:
# standard libraries
import os
import pandas as pd
import numpy as np
import itertools as it

# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# distributed libraries
import dask
import dask.dataframe as dd
import dask.array as da
from dask.distributed import Client
from dask import delayed
import joblib

# model processing libraries
import dask_distance
from dask_ml.model_selection import train_test_split
from dask_ml.preprocessing import StandardScaler
#from sklearn.model_selection import GridSearchCV
import dask_ml.model_selection as dcv

# models
# will need to update these with the models we use
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

import ssl
# needed to request files from GitHub when running within docker container
ssl._create_default_https_context = ssl._create_unverified_context

In [4]:
# Start and connect to a local dask.distributed client
client = Client(processes=True) # use all 4 cores
client.connection_args

{'ssl_context': None, 'require_encryption': False, 'extra_conn_args': {}}

Get data from github and prep files for analysis

In [5]:
# key for understanding which activity is being measured in a record
activity_key_url = r"https://raw.githubusercontent.com/gojandrooo/DSE-230/main/data/activity_key.txt"

#read the activity table from github
activity_key = pd.read_csv(activity_key_url, header=None)

#split the data into a proper table
activity_key = activity_key[0].str.replace(" ", "").str.split("=", expand=True)
activity_key.columns = ['activity', 'code']
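
The cell above turns lines of the form `activity = code` into a two-column table. The same parsing step can be sketched without pandas. The sample lines below are illustrative placeholders, not the actual contents of `activity_key.txt`, and `parse_activity_key` is a hypothetical helper name.

```python
# Sketch: parse "activity = code" lines into a lookup table.
sample_lines = ["walking = A", "jogging = B", "stairs = C"]  # illustrative only

def parse_activity_key(lines):
    """Strip spaces, split on '=', and return one row per activity."""
    rows = []
    for line in lines:
        activity, code = line.replace(" ", "").split("=")
        rows.append({"activity": activity, "code": code})
    return rows

activity_rows = parse_activity_key(sample_lines)
# map each activity code back to its readable name
code_to_activity = {row["code"]: row["activity"] for row in activity_rows}
print(code_to_activity)
```

A code-to-name mapping like this is handy when labelling plots, since the sensor files only carry the single-letter code.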

In [6]:
# #do not run unless you need cleaned parquet files locally
# #cleaned parquet files should already be located on github

# #below function takes the raw data on github and converts to parquet files
# #then stores the files on local machine
# '''
# def convert_raw_to_parquet():
#     #base URL where raw data can be easily grabbed
#     base_url = r"https://raw.githubusercontent.com/gojandrooo/DSE-230/main/data"

#     # TOGGLE FOR DEVICE
#     devices = ["phone", "watch"]

#     # TOGGLE FOR MEASUREMENT TYPE
#     data_types = ["accel", "gyro"]
    
    
#     # create list of local folders
#     for data_type in data_types:
#         for device in devices:
#             os.makedirs(r"data/parquet/" + "/" + device + "/" + data_type, exist_ok=True) 
    

#     locs = {}
#     for data_type in data_types:
#         for device in devices:
#             file_locs = []
#             for user_id in range(1600, 1651):
#                 url = base_url + "/" + device + "/" + data_type + f"/data_{user_id}_{data_type}_{device}.txt"
#                 df = pd.read_csv(url, header=None)
#                 df.columns = ['subject_id', 'code', 'timestamp', 'x', 'y', 'z']
#                 custom_dtypes = {"subject_id": "int16", "x": "float32", "y": "float32", "z": "float32"}
#                 df['z'] = df['z'].str.replace(";", "")
#                 #df = df.reset_index(drop = True)
#                 df = df.astype(custom_dtypes)
#                 df['index'] = df['subject_id'].astype('str') + df['code'] + df['timestamp'].astype('str')
#                 fname = r"data/parquet/" + "/" + device + "/" + data_type + f"/data_{user_id}_{data_type}_{device}.gzip"
#                 df.to_parquet(fname)
#                 file_locs.append(fname)
#             locs[device, data_type] = file_locs
# '''

In [13]:
# NOTE
# this reads all 51 files per sensor; for a quick development run, switch to the commented `loop_urls[:3]` line below

def collate_dask_df(device, data_type):

    '''
    returns a single dask dataframe from multiple parquet files hosted on github
    
    device: one of ["phone", "watch"]
    
    data_type: one of ["accel", "gyro"]
    ----------------------------
    '''
    
    #base_url = r"https://raw.githubusercontent.com/gojandrooo/DSE-230/main/data"
    base_url = r"https://github.com/garrett391/DSE-230/blob/main/data/parquet"
    #base_url = r"https://github.com/gojandrooo/DSE-230/blob/main/data/parquet"
    # `device` and `data_type` select the subdirectory and the file names built below
    
    # create list of all file names
    #file_names = [f"/data_{user_id}_{data_type}_{device}.txt" for user_id in range(1600, 1651)]
    file_names = [f"/data_{user_id}_{data_type}_{device}.gzip?raw=true" for user_id in range(1600, 1651)]

    # create urls of all files
    loop_urls = [base_url + "/" + device + "/" + data_type + file_name for file_name in file_names]
    
    #setting datatypes to save memory
    
    #dask_df = dd.read_parquet(loop_urls[:3], index = 'index') # for dev this is only the first three files
    dask_df = dd.read_parquet(loop_urls, index = 'index') # PRODUCTION, all of the files

    #dask_df = dd.multi.concat([pd.read_csv(url, header=None) for url in loop_urls[:3]]) # for dev this is only the first three files
    #dask_df = dd.multi.concat([pd.read_csv(url, header=None) for url in loop_urls]) # PRODUCTION, all of the files
    
    #dask_df.columns = ['subject_id', 'code', 'timestamp', 'x', 'y', 'z']
    #dask_df['z'] = dask_df['z'].str.replace(";", "").astype('float64')
    #dask_df = dask_df.reset_index(drop = True)
    
    return dask_df # dask df output
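
The URL construction inside `collate_dask_df` can be checked on its own, without touching the network. The sketch below reuses the base URL and file-name pattern from the function above. `build_urls` is a hypothetical helper, and the argument validation is an addition that the original function does not perform.

```python
# Sketch: build the list of parquet URLs for one device/sensor pair.
BASE_URL = "https://github.com/garrett391/DSE-230/blob/main/data/parquet"
VALID_DEVICES = ("phone", "watch")
VALID_DATA_TYPES = ("accel", "gyro")

def build_urls(device, data_type, subject_ids=range(1600, 1651)):
    """Return one URL per subject; reject unknown devices or sensor types."""
    if device not in VALID_DEVICES:
        raise ValueError(f"device must be one of {VALID_DEVICES}")
    if data_type not in VALID_DATA_TYPES:
        raise ValueError(f"data_type must be one of {VALID_DATA_TYPES}")
    return [
        f"{BASE_URL}/{device}/{data_type}/data_{sid}_{data_type}_{device}.gzip?raw=true"
        for sid in subject_ids
    ]

urls = build_urls("phone", "gyro")
print(len(urls), urls[0])
```

Failing early on a misspelled device or sensor name avoids a less obvious error later, when the parquet reader requests a URL that does not exist.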


In [14]:
client.restart()

- Connection method: Cluster object
- Cluster type: distributed.LocalCluster
- Dashboard: http://127.0.0.1:8787/status
- Scheduler comm: tcp://127.0.0.1:45433 (started 1 minute ago)
- Workers: 4, total threads: 8, total memory: 24.91 GiB
- Status: running, using processes: True

| Worker comm | Total threads | Memory | Dashboard | Nanny | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:42683 | 2 | 6.23 GiB | http://127.0.0.1:35393/status | tcp://127.0.0.1:39603 | /home/work/dask-worker-space/worker-dp6guaxb |
| tcp://127.0.0.1:45571 | 2 | 6.23 GiB | http://127.0.0.1:34283/status | tcp://127.0.0.1:43689 | /home/work/dask-worker-space/worker-orm52486 |
| tcp://127.0.0.1:39391 | 2 | 6.23 GiB | http://127.0.0.1:37231/status | tcp://127.0.0.1:40355 | /home/work/dask-worker-space/worker-hhq1i5ey |
| tcp://127.0.0.1:44851 | 2 | 6.23 GiB | http://127.0.0.1:33977/status | tcp://127.0.0.1:33855 | /home/work/dask-worker-space/worker-6xd_p4p8 |

In [16]:
%%time
# create the dask dataframes for each sensor
# all four sensors are needed by the cells below
phone_accel = collate_dask_df("phone", "accel")
phone_gyro = collate_dask_df("phone", "gyro")
watch_accel = collate_dask_df("watch", "accel")
watch_gyro = collate_dask_df("watch", "gyro")
phone_accel.head(3)

In [146]:
#phone_accel = phone_accel.map_partitions(lambda df: df.assign(xy=df.x * df.y))

In [154]:
phone_accel = phone_accel.assign(
    xy = phone_accel['x'] * phone_accel['y'],
    yz = phone_accel['y'] * phone_accel['z'],
    xz = phone_accel['x']*phone_accel['z']    
    )

In [122]:
#pairwise products for the phone sensors, squares for the watch sensors
phone_accel['xy'] = phone_accel['x']*phone_accel['y']
phone_accel['yz'] = phone_accel['y']*phone_accel['z']
phone_accel['xz'] = phone_accel['x']*phone_accel['z']

phone_gyro['xy'] = phone_gyro['x']*phone_gyro['y']
phone_gyro['yz'] = phone_gyro['y']*phone_gyro['z']
phone_gyro['xz'] = phone_gyro['x']*phone_gyro['z']

watch_accel['x^2'] = watch_accel['x']**2
watch_accel['y^2'] = watch_accel['y']**2
watch_accel['z^2'] = watch_accel['z']**2

watch_gyro['x^2'] = watch_gyro['x']**2
watch_gyro['y^2'] = watch_gyro['y']**2
watch_gyro['z^2'] = watch_gyro['z']**2


### Merge files based on index

In [155]:
feat_cols = ['x', 'y', 'z']
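def merge_dfs(df1, df2, suffixes):
    '''
    inner-join two sensor dataframes on their shared index

    df1: dask dataframe that keeps all of its columns
    df2: dask dataframe that contributes only `feat_cols`
    suffixes: tuple of suffixes for the overlapping x/y/z column names
    '''
    # keep the smaller partition count of the two inputs
    partitions = min(df1.npartitions, df2.npartitions)
    merged = dd.merge(df1, df2[feat_cols], how='inner', left_index=True, right_index=True,
                      suffixes=suffixes).reset_index(drop=True)
    # materialize the merged result in memory, then re-partition it as a dask dataframe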
    return dd.from_pandas(merged.compute(), npartitions = partitions)
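
The effect of the inner join in `merge_dfs` can be illustrated with plain dictionaries. This is a sketch of inner-join semantics only, not the Dask implementation. The key format (subject id, activity code and timestamp concatenated) follows the `index` column built in the parquet conversion cell, and the readings are made up.

```python
# Sketch: an inner join keeps only keys present in both sensors.
def make_key(subject_id, code, timestamp):
    """Composite row key: subject id + activity code + timestamp."""
    return f"{subject_id}{code}{timestamp}"

# hypothetical readings: (x, y, z) per key
accel = {
    make_key(1600, "A", 100): (0.1, 9.8, 0.3),
    make_key(1600, "A", 150): (0.2, 9.7, 0.4),
    make_key(1600, "A", 200): (0.3, 9.6, 0.5),
}
gyro = {
    make_key(1600, "A", 100): (1.1, 0.2, 0.0),
    make_key(1600, "A", 200): (1.0, 0.1, 0.1),
    make_key(1600, "A", 250): (0.9, 0.3, 0.2),
}

def inner_join(left, right):
    """Concatenate readings for keys found in both inputs; drop the rest."""
    return {key: left[key] + right[key] for key in left if key in right}

merged = inner_join(accel, gyro)
print(sorted(merged))
```

Rows whose timestamps appear in only one sensor are dropped, which is why the merged frame can be smaller than either input and why class proportions can shift slightly after merging.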

In [156]:
# attempt to merge phone data
phone_df = merge_dfs(phone_accel, phone_gyro[feat_cols], ('_phone_accel', '_phone_gyro'))
phone_df.head()

|  | subject_id | code | timestamp | x_phone_accel | y_phone_accel | z_phone_accel | xy | yz | xz | x_phone_gyro | y_phone_gyro | z_phone_gyro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1600 | A | 252208371766837 | -0.496979 | 18.676529 | 0.937378 | -9.281838 | 17.506966 | -0.465857 | -1.356628 | -0.435287 | -0.45517 |
| 0 | 1600 | A | 252209278138907 | -0.093048 | 11.105392 | 1.92598 | -1.033336 | 21.38876 | -0.179209 | -0.565445 | 0.700211 | 0.390747 |
| 0 | 1600 | A | 252211493715079 | -2.82991 | 8.635773 | -0.91246 | -24.438461 | -7.8798 | 2.582181 | 1.324387 | -0.051498 | -0.21669 |
| 0 | 1600 | A | 252209127076895 | -2.257034 | 2.684174 | -0.441772 | -6.058272 | -1.185794 | 0.997096 | -0.485886 | 1.538055 | 0.009125 |
| 0 | 1600 | A | 252208875306876 | 1.651108 | 13.003159 | -2.630463 | 21.469616 | -34.204323 | -4.343177 | 1.082047 | 0.425598 | 0.237305 |


In [114]:
# attempt to merge watch data
watch_df = merge_dfs(watch_accel, watch_gyro[feat_cols], ('_watch_accel', '_watch_gyro'))
watch_df.head()

|  | subject_id | code | timestamp | x_watch_accel | y_watch_accel | z_watch_accel | x_watch_gyro | y_watch_gyro | z_watch_gyro |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1600 | A | 90426856696641 | 2.801216 | -0.155922 | 5.997625 | 0.070999 | -0.20948 | -0.195978 |
| 0 | 1600 | A | 90426757696641 | 4.972757 | -0.158317 | 6.696732 | 0.314944 | -1.022277 | -0.309962 |
| 0 | 1600 | A | 90427005196641 | 6.145916 | 0.832883 | 11.003901 | -0.101574 | 1.082686 | -0.134193 |
| 1 | 1600 | A | 90427054696641 | 7.25922 | -0.79278 | 11.485135 | -0.677882 | 1.176429 | -0.211957 |
| 1 | 1600 | A | 90426955696641 | 4.661511 | 0.169689 | 9.684695 | 0.073129 | 0.719431 | -0.001035 |


In [115]:
client.cancel(phone_accel)
client.cancel(phone_gyro)
client.cancel(watch_accel)
client.cancel(watch_gyro)

### EDA
Below we compare the accelerometer results and the gyroscope results independently, because the two sensor types have different units. The accelerometer reports values in m/s^2, while the gyroscope reports values in radians/s.

In [None]:
# VISUALIZE DATA
# just a sample

# depending on your setup may need different renderer to display
# iframe should render on local implementation and docker image implementation

renderer = [
    'notebook', # local
    'notebook_connected', # local
    'kaggle', # local
    'azure', # local
    'browser', # local (opens plot in new browser tab)
    'iframe', # docker, local (saves plot in `iframe_figures` folder)
    'iframe_connected', # docker, local (saves plot in `iframe_figures` folder)
    'colab' # docker
]

# take a sample of the data
df = phone_accel.sample(frac=0.2, random_state=seed).sort_values(by='code').compute()
fig = px.scatter_3d(df, 
                    x='x', 
                    y='y', 
                    z='z',
                    color='code')
fig.show(renderer=renderer[-2]) # if plot does not render, try a different index, the last three are preferred

In [None]:
feat_cols = ['x', 'y', 'z']

In [None]:
phone_accel_stats, watch_accel_stats = dask.compute(
    phone_accel[feat_cols].describe(),
    watch_accel[feat_cols].describe()    
    )
accel_stats = phone_accel_stats.merge(watch_accel_stats, left_index=True, right_index=True, suffixes=('_phone_accel', '_watch_accel'))
accel_stats.reindex(sorted(accel_stats.columns), axis=1)

In [None]:
phone_gyro_stats, watch_gyro_stats = dask.compute(
    phone_gyro[feat_cols].describe(),
    watch_gyro[feat_cols].describe()    
    )
gyro_stats = phone_gyro_stats.merge(watch_gyro_stats, left_index=True, right_index=True, suffixes=('_phone_gyro', '_watch_gyro'))
gyro_stats.reindex(sorted(gyro_stats.columns), axis=1)

In [None]:
del phone_accel_stats
del watch_accel_stats
del phone_gyro_stats
del watch_gyro_stats
del accel_stats
del gyro_stats

# <font color='red'>Don't run this! With the full data it's killing the memory</font>

In [None]:
# PLOT HISTOGRAMS
# calculate the histograms using the dask dataframes
# since dask deals with large data we cant easily graph the dataframe
# need to get all the histograms by hand and plot

sensor_dfs = [phone_accel, watch_accel, phone_gyro,  watch_gyro]
sensor_labels = ['phone_accel', 'watch_accel', 'phone_gyro',  'watch_gyro']

def hist_subplot(dask_df, axis, n_bins, data_label, ax_row, ax_col):
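    '''
    helper function to plot histograms
    (draws on the module-level `axes` array created by `plt.subplots`)
    
    dask_df: underlying dataframe
    
    axis: one of ['x', 'y', 'z']
    
    n_bins: int
    
    data_label: legend label for this sensor
    
    ax_row: subplot location
    ax_col: subplot location
    '''  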
    h, bins = da.histogram(dask_df[axis], bins=n_bins, range=[dask_df[axis].min().compute(), dask_df[axis].max().compute()])
    bincenter = (bins[:-1] + bins[1:]) / 2
    axes[ax_row,ax_col].bar(bincenter, list(h.compute()), align='center', width=2, alpha=0.65, label = data_label)
    axes[ax_row,ax_col].legend(loc='best')
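
The bin-centre arithmetic used in `hist_subplot` can be illustrated with a small pure-Python histogram. This is a sketch with made-up values, not a replacement for `da.histogram`. It also returns the bin width, which may be a more natural bar width than the fixed `width=2` used above when the data range is small; that is a suggestion rather than something we have tested on the full data.

```python
# Sketch: equal-width histogram with bin centres, mirroring (bins[:-1] + bins[1:]) / 2.
def histogram(values, n_bins):
    """Return (counts, centres, bin_width) for equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must span a non-zero range")
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # the maximum value belongs to the last bin, not to a bin of its own
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    centres = [(left + right) / 2 for left, right in zip(edges[:-1], edges[1:])]
    return counts, centres, width

values = [0.0, 1.0, 1.5, 2.0, 3.5, 4.0]  # hypothetical readings
counts, centres, width = histogram(values, 4)
print(counts, centres, width)
```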

In [None]:
# now that we have the function, actually plot the data
fig, axes = plt.subplots(nrows=3, ncols=2, sharex=True, sharey=True, figsize=(14,12))
    
# one subplot row per axis: x (first row), y (second row), z (third row)
# accelerometers share the left column, gyroscopes share the right column
for ax_row, axis in enumerate(['x', 'y', 'z']):
    for i, sensor_label in enumerate(sensor_labels):
        ax_col = 0 if 'accel' in sensor_label else 1
        hist_subplot(sensor_dfs[i], axis, 20, sensor_label, ax_row, ax_col)

axes[0,0].set_title('Phone/Watch Accelerometer')
axes[0,1].set_title('Phone/Watch Gyroscope')

plt.setp(axes[0, :], ylabel='x-axis')
plt.setp(axes[1, :], ylabel='y-axis')
plt.setp(axes[2, :], ylabel='z-axis')

plt.show()

Before merging, the activities were roughly balanced with each other. The `watch` sensors are unaffected, but merging the accel/gyro sensors for the `phone` created a slight imbalance. However, it is not large enough to create problems for our classification modeling.

In [None]:
# inspect whether our classes are still balanced

activity_distr = pd.DataFrame(dask.compute(
    phone_df['code'].value_counts(normalize=True),
    watch_df['code'].value_counts(normalize=True)
    )).T
activity_distr.columns = ['phone', 'watch']
activity_distr.style.format("{:,.1%}")
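
The balance check above relies on normalized value counts. The underlying computation is simple enough to sketch with the standard library. The activity codes below are made up, and `class_shares` is a hypothetical helper name.

```python
# Sketch: share of records per activity code.
from collections import Counter

def class_shares(codes):
    """Return {code: fraction of all records}, sorted by code."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in sorted(counts.items())}

codes = ["A"] * 5 + ["B"] * 3 + ["C"] * 2  # hypothetical labels
shares = class_shares(codes)
print(shares)
```

With 18 activities, a perfectly balanced dataset would give each code a share of about 5.6%, so shares close to that value indicate the merge did not distort the classes much.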

### Scaling the data

In [None]:
#maybe we won't even do this. if we do, we need to define train_test_split up here
'''
code for scaling is at the ML stage if we want to move it up here
'''

### Aggregating the data

The raw data consists of time-series measurements. For our model, we want to aggregate the data into n-second intervals on which we can make predictions. We selected 3 seconds as the window over which a task is observed and a prediction is made.

- within each n_second aggregation, map the relationship between each pair of the x/y/z arrays
    - COSINE
        - xy
        - xz
        - yz
    - CORRELATION
        - xy
        - xz
        - yz
- Calculate AVERAGES
    - x-mean
    - y-mean
    - z-mean
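
The two pairwise metrics listed above can be written out directly. The sketch below computes them as distances (one minus the similarity), on the assumption that `dask_distance` follows the same convention as `scipy.spatial.distance`. The three short arrays are made-up window contents.

```python
# Sketch: cosine distance and correlation distance between two axis arrays.
import math

def cosine_distance(u, v):
    """1 - (u . v) / (|u| |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def correlation_distance(u, v):
    """1 - Pearson correlation: cosine distance of the mean-centred arrays."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centred_u = [a - mean_u for a in u]
    centred_v = [b - mean_v for b in v]
    return cosine_distance(centred_u, centred_v)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # same direction as x
z = [4.0, 3.0, 2.0, 1.0]  # perfectly anti-correlated with x

print(cosine_distance(x, y), correlation_distance(x, z))
```

Both metrics are scale-free, so they describe how two axes move together within a window regardless of how strong the motion is.
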

In [None]:
feat_cols

In [None]:
col_combos = list(it.combinations(feat_cols, 2))
col_combos

In [None]:
# create a function to calculate the pairwise metrics for our features
def cos_cor_combos(dask_df, cols):
    '''
    calculate the cosine distance and the correlation distance between two columns
    (assuming dask_distance follows the scipy.spatial.distance convention: distance = 1 - similarity)
    '''
    cos, cor = dask.compute(
        #dask_distance.chebyshev(dask_df[cols[0]], dask_df[cols[1]]),
        dask_distance.cosine(dask_df[cols[0]], dask_df[cols[1]]),
        dask_distance.correlation(dask_df[cols[0]], dask_df[cols[1]])
        )
    return cos, cor

In [None]:
'''
not actual implementation, just wireframe guide
'''
for combo in col_combos:
    print(combo, ":", cos_cor_combos(phone_accel.head(10), combo))

#### some helpful reference materials <font color='red'>delete later</font>

inefficient: loop over grouper
    for group in grouper:
        compute cosine
        
[df.apply from SO](https://stackoverflow.com/questions/45535892/calculate-cosine-similarity-for-two-columns-in-a-group-by-in-a-dataframe)

[dask delayed inside for loop](https://stackoverflow.com/questions/42550529/dask-how-would-i-parallelize-my-code-with-dask-delayed)

```python
from dask import compute, delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]

def compute_mse(file_name):
    df = pd.read_csv(file_name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    return mse(observed, prediction)

# DASK DELAYED EXAMPLE SYNTAX
delayed_results = [delayed(compute_mse)(file_name) for file_name in filenames]
mean_squared_errors = compute(*delayed_results, scheduler="processes")
```

In [None]:
# create combinations for each of our features

accel_feats = ['x_phone_accel', 'y_phone_accel', 'z_phone_accel']
gyro_feats = ['x_phone_gyro', 'y_phone_gyro', 'z_phone_gyro']

accel_combos = list(it.combinations(accel_feats, 2))
gyro_combos = list(it.combinations(gyro_feats, 2))

In [None]:
accel_combos

In [None]:
gyro_combos

In [None]:
def cos_cor_aggregation(df, num_seconds, partitions):
    '''
    function to compute the cosine and correlation for each of the feature pairs within each of the time subdivisions
    '''  
    n_rows = (num_seconds*1000)/50
    print('Grouped every', n_rows, 'rows')
    
    tempdf = df.reset_index()
    # rename of the index column
    tempdf = tempdf.rename(columns={'index': 'grouper'})

    # creates a variable to group within n_seconds
    tempdf['grouper'] = tempdf['grouper']//n_rows

    # # RELATIONSHIP FEATURES
    # (two approaches, Loop Method vs Apply Method)

    # LOOP METHOD - did not work at scale
    
    # ACCELEROMETER
    # for combo in accel_combos:

    #     delayed_results = [delayed(cos_cor_combos)(group, combo) for name, group in tempdf.groupby(['grouper', 'subject_id', 'code'])]
    #     output = dd.compute(*delayed_results, scheduler="processes")

    #     for name, group in tempdf.groupby(['grouper', 'subject_id', 'code']):
    #         print(cos_cor_combos(group, combo))

    # # GYROSCOPE
    # for combo in gyro_combos:
    #     for name, group in tempdf.groupby(['grouper', 'subject_id', 'code']):
    #         print(cos_cor_combos(group, combo))


    # APPLY METHOD - technically works but is very slow
    '''
    there is likely room for performance improvement with increased parallelization
    change the initial dataframes from pandas to dask
    '''
    # ACCELEROMETER
    # instantiate empty dataframe to build on
    accel_new_feat_df = pd.DataFrame()

    for combo in accel_combos:
        new_col_name = "-".join(combo) # for naming columns in the returned output  
        # calculate the metrics for our subdivisions of data
        # pass each group `g` (not the whole dataframe) so the metrics are computed per time window
        new_feat_temp = tempdf.groupby(['grouper', 'subject_id', 'code']).apply(lambda g: cos_cor_combos(g, combo))
        # create a dataframe with the new features
        accel_new_feat_df[['cos-'+new_col_name, 'cor-'+new_col_name]] = pd.DataFrame(new_feat_temp)[0].to_list()
        # create the dask df
        #n_partitions = accel_new_feat_df.npartitions
        accel_new_feat_dd = dd.from_pandas(accel_new_feat_df, npartitions=partitions)
    
    # GYROSCOPE
    # instantiate empty dataframe to build on
    gyro_new_feat_df = pd.DataFrame()

    for combo in gyro_combos:
        new_col_name = "-".join(combo) # for naming columns in the returned output
        # calculate the metrics for our subdivisions of data
        # compute per group for the gyroscope pair; otherwise the accelerometer result would be reused
        new_feat_temp = tempdf.groupby(['grouper', 'subject_id', 'code']).apply(lambda g: cos_cor_combos(g, combo))
        # create a dataframe with the new features
        gyro_new_feat_df[['cos-'+new_col_name, 'cor-'+new_col_name]] = pd.DataFrame(new_feat_temp)[0].to_list()
        # create the dask df
        #n_partitions = gyro_new_feat_df.npartitions
        '''
        I think we might be able to get away with just keeping it as pandas since the aggregate df will be much smaller
        '''
        gyro_new_feat_dd = dd.from_pandas(gyro_new_feat_df, npartitions=partitions)
    
    # MERGE SENSOR TYPES
    created_feats = dd.merge(
        accel_new_feat_dd, 
        gyro_new_feat_dd, 
        how='inner', 
        left_index = True, 
        right_index = True, 
            )
    return created_feats

In [None]:
%%time
# test the function
# note it's just executing on a sample of the data `.head(1000)`
synth_feats = cos_cor_aggregation(phone_df.head(1000), 3, 64) # experiment with larger number of partitions

print(dd.compute(synth_feats.shape))
synth_feats.head()

# <font color='red'>Proposed alt way of getting cos and cor</font>

In [None]:
#pairwise products and squares of the accel and gyro columns
phone_df['xy_phone_accel'] = phone_df['x_phone_accel']*phone_df['y_phone_accel']
phone_df['yz_phone_accel'] = phone_df['y_phone_accel']*phone_df['z_phone_accel']
phone_df['xz_phone_accel'] = phone_df['x_phone_accel']*phone_df['z_phone_accel']

phone_df['xy_phone_gyro'] = phone_df['x_phone_gyro']*phone_df['y_phone_gyro']
phone_df['yz_phone_gyro'] = phone_df['y_phone_gyro']*phone_df['z_phone_gyro']
phone_df['xz_phone_gyro'] = phone_df['x_phone_gyro']*phone_df['z_phone_gyro']

phone_df['x_phone_accel^2'] = phone_df['x_phone_accel']**2
phone_df['y_phone_accel^2'] = phone_df['y_phone_accel']**2
phone_df['z_phone_accel^2'] = phone_df['z_phone_accel']**2

phone_df['x_phone_gyro^2'] = phone_df['x_phone_gyro']**2
phone_df['y_phone_gyro^2'] = phone_df['y_phone_gyro']**2
phone_df['z_phone_gyro^2'] = phone_df['z_phone_gyro']**2
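
The reason the products and squares are useful is that, once they are summed within a window, the cosine similarity and the Pearson correlation follow from those sums alone: cos(x, y) = Σxy / sqrt(Σx² · Σy²) and r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²)). The sketch below checks this on made-up values. `pairwise_from_sums` is a hypothetical helper, and it returns similarities rather than distances.

```python
# Sketch: recover cosine similarity and Pearson correlation from per-window sums.
import math

def pairwise_from_sums(n, sum_x, sum_y, sum_xy, sum_xx, sum_yy):
    """Compute (cosine similarity, Pearson r) from aggregated sums only."""
    cos_sim = sum_xy / math.sqrt(sum_xx * sum_yy)
    cov_term = n * sum_xy - sum_x * sum_y
    var_x_term = n * sum_xx - sum_x ** 2
    var_y_term = n * sum_yy - sum_y ** 2
    corr = cov_term / math.sqrt(var_x_term * var_y_term)
    return cos_sim, corr

x = [1.0, 2.0, 3.0, 4.0]  # hypothetical window of x readings
y = [2.0, 1.0, 4.0, 3.0]  # hypothetical window of y readings

cos_sim, corr = pairwise_from_sums(
    n=len(x),
    sum_x=sum(x),
    sum_y=sum(y),
    sum_xy=sum(a * b for a, b in zip(x, y)),
    sum_xx=sum(a * a for a in x),
    sum_yy=sum(b * b for b in y),
)
print(cos_sim, corr)
```

Because sums are cheap group-by aggregations, this route avoids calling a distance function once per group, which is where the apply-based approach spends most of its time. Windows with zero variance on an axis would need separate handling, since the denominators become zero.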

In [None]:
# takes the combined sensor data and bins the data by taking the average depending on the seconds required

def group_into_seconds(df, num_seconds):
    # calculates the number of rows to average over by converting seconds to ms and dividing by 50 (sensor interval in ms)
    
    n_rows = (num_seconds*1000)/50
    print('Grouped every', n_rows, 'rows')
    
    tempdf = df.reset_index()
    # rename of the index column
    tempdf = tempdf.rename(columns= {'index': 'grouper'})
    
    # creates a variable to group within n_seconds
    tempdf['grouper'] = tempdf['grouper']//n_rows
    
    # aggregate to n_seconds
    tempdf = tempdf.groupby(by = ['grouper', 'code', 'subject_id']).agg(['mean', 'sum']).reset_index()
    # drop superfluous grouper column
    del tempdf['grouper']
    tempdf.columns = list(map(''.join, tempdf.columns.values))
    
    return tempdf
    # return df.groupby(np.arange(len(df))//n_rows).mean().compute()
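
The windowing logic in `group_into_seconds` comes down to integer-dividing the row position by the number of rows per window. The sketch below shows that step in isolation on made-up values. It ignores the additional grouping by `code` and `subject_id`, and it assumes rows are in time order at a nominal 50 ms spacing.

```python
# Sketch: assign rows to n-second windows and average each window.
SAMPLE_INTERVAL_MS = 50  # nominal sensor interval used in this notebook

def rows_per_window(num_seconds):
    """Number of consecutive rows that make up one window."""
    return (num_seconds * 1000) // SAMPLE_INTERVAL_MS

def window_means(values, num_seconds):
    """Mean of each consecutive block of rows_per_window(num_seconds) values."""
    n_rows = rows_per_window(num_seconds)
    groups = {}
    for position, value in enumerate(values):
        groups.setdefault(position // n_rows, []).append(value)
    return [sum(group) / len(group) for _, group in sorted(groups.items())]

values = [float(i) for i in range(120)]  # hypothetical: 6 seconds of readings
means = window_means(values, 3)
print(rows_per_window(3), means)
```

With a 3-second window, each aggregated record summarizes 60 raw readings, so the aggregated frame is roughly 60 times smaller than the merged sensor data.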

In [None]:
#just testing to make sure it returns the exact same data frame when grouping every 1 row
#group_into_seconds(phone_df.compute(),50/1000)

In [None]:
group_into_seconds(phone_df.compute(),2)

In [None]:
# pass this variable in to all our aggregation functions
# it is the number of seconds we are aggregating to
agg_time = 3

In [None]:
# calculate the averages within our time interval
grouped_phone_df, grouped_watch_df = dask.compute(
    group_into_seconds(phone_df, agg_time),
    group_into_seconds(watch_df, agg_time)
    )

# <font color='red'>warning - this is when it gets REALLY slow</font>

In [None]:
# calculate the created features within our time interval
synth_phone_df, synth_watch_df = dask.compute(
    cos_cor_aggregation(phone_df, agg_time, 64),
    cos_cor_aggregation(watch_df, agg_time, 64)
    )    

In [None]:
'''
merge grouped averages with synthetic features
they must be the same shapes
we also need to test for unexpected shuffling behavior
'''

# merge the new features from each sensor into one df
prepped_phone_df = dd.merge(
    grouped_phone_df, 
    synth_phone_df, 
    how='inner', 
    left_index = True, 
    right_index = True, 
        )

prepped_watch_df = dd.merge(
    grouped_watch_df, 
    synth_watch_df, 
    how='inner', 
    left_index = True, 
    right_index = True, 
        )

In [None]:
dd.compute(prepped_phone_df.shape, prepped_watch_df.shape)

In [None]:
prepped_phone_df.head()

In [None]:
prepped_watch_df.head()

### Create CSV files for faster recall

In [None]:
# write out file to csv
file_name = 'prepped_phone_df'
df = prepped_phone_df
# should output as .csv to retain data structure
df.to_csv(fr'./prepped-data/{file_name}.csv')

In [None]:
# write out file to csv
file_name = 'prepped_watch_df'
df = prepped_watch_df
# should output as .csv to retain data structure
df.to_csv(fr'./prepped-data/{file_name}.csv')

In [None]:
# # write out to excel (wireframe)
# file_name = 'file_name'
# writer = pd.ExcelWriter(f'{file_name}.xlsx', engine='xlsxwriter')
# df.to_excel(writer, sheet_name='sheet-name')
# writer.save()

In [None]:
# # output the to .tsv/csv (wireframe)
# file_name = 'file_name'
# df = df#.astype(str) #preserve dtype with str if not already
# # should output as .tsv to retain data structure
# df.to_csv(fr'{file_name}.tsv', sep='\t', index=False)

In [None]:
# # serialize file (wireframe)
# file_name = 'file_name'
# df = df
# df.to_pickle(f"./{file_name}.pkl")

In [None]:
# # read serialized file (wireframe)
# file_name = 'file_name'
# unpickled_df = pd.read_pickle(f"./{file_name}.pkl")

In [None]:
# # uncompress file and read in to dask (wireframe)
# file_name = 'file_name'
# unpickled_df = pd.read_pickle(f"./{file_name}.pkl")
# ddf = dd.from_pandas(unpickled_df, npartitions=8)

In [None]:
# # read in file as a dask dataframe
# phone_accel = dd.read_csv(f"prepped-data/{file_name}.csv")

**<font color='red'>I don't think we actually need hadoop. saving in case we do and/or syntax for running other  bash commands</font>**

In [None]:
# %%bash
# dir

**create hadoop directory**

In [None]:
# %%bash
# hadoop fs -mkdir /hdfs-data

**copy from local into hadoop**

In [None]:
# %%bash
# hadoop fs -copyFromLocal prepped-data/data_phone_accel.csv /hdfs-data

**make sure file is in hadoop**

In [None]:
# %%bash
# hadoop fs -ls /hdfs-data

- Use PySpark or Dask
- Include one classification or regression or cluster analysis task
- Describe problem
    - To include:  Explain why problem is interesting, what real-life application is being addressed
- Describe analysis task
    - To include:  type of task (e.g., classification), how does task related to business problem
- Describe data
    - To include:  data quality issues, characteristics of the dataset (summary statistics, correlation, outliers, etc.), plots
- Describe data preparation process
    - To include:  data cleaning steps, features used, train/validation/test datasets
- Describe analysis approaches
    - To include:  input, setup, and output of model(s)
- Describe challenges and solutions
    - To include:  challenges encountered, solutions to address challenges
- Describe analysis results and insights gained
    - To include:  discussion of results, insights gained from analysis
- Describe future work
    - To include:  lessons learned, next steps, what you would have done differently




Measures movement data over ten-second intervals while subjects perform the various tasks.

## Analysis approaches

### Model Selection

<font color='red'>this section is wildly incomplete</font>

[**sklearn - Decision Tree Regression with AdaBoost**](https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_regression.html)

In [None]:
phone_accel.head()

In [None]:
# TRAIN TEST SPLIT

# split off labels
feat_cols = ['x', 'y', 'z']
label_col = ['code']

feature_df = phone_accel[feat_cols]
label_df = phone_accel[label_col]

X_train, x_test, y_train, y_test = train_test_split(feature_df, label_df, test_size=0.8, shuffle=True, random_state=seed)

In [None]:
# SCALE DATA

# instantiate scaler
scaler = StandardScaler()
# fit the scaler
scalerModel = scaler.fit(X_train)
# scale the training data
X_train_scaled = scalerModel.transform(X_train)
# scale the test data
X_test_scaled = scalerModel.transform(x_test)
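
The key point of the scaling cell is that the scaler is fitted on the training split only and then applied unchanged to the test split, so no information from the test data leaks into the model. A minimal sketch of that idea follows. It uses the population standard deviation as an assumption, and the exact convention used by the library's `StandardScaler` may differ slightly. The values are made up.

```python
# Sketch: fit standardization on training data, apply it to both splits.
import math

def fit_standardizer(train_values):
    """Return (mean, std) estimated from the training values only."""
    n = len(train_values)
    mean = sum(train_values) / n
    variance = sum((v - mean) ** 2 for v in train_values) / n  # population variance (assumption)
    return mean, math.sqrt(variance)

def standardize(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]

train = [2.0, 4.0, 6.0, 8.0]  # hypothetical training feature
test = [5.0, 10.0]            # hypothetical test feature

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)
print(mean, std, test_scaled)
```

The scaled test values are not guaranteed to have zero mean or unit variance, which is expected: they are expressed relative to the training distribution.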

In [None]:
# set up grid search parameters
param_grid = {'max_depth'        : list(range(1, 10)), # play around with max depth
              'min_samples_split': list(range(2, 10)), # must start at 2+
              'criterion'        : ['gini','entropy'],
             }

In [None]:
# GRID SEARCH

# instantiate base model
dt_model = DecisionTreeClassifier(random_state=seed)

# instantiate grid search object
dt_model_grid_dask = dcv.GridSearchCV(dt_model, param_grid, cv=10)

# execute grid search
'''
does this need joblib backend if we are using native dask?
'''
with joblib.parallel_backend("dask"):
    dt_model_grid_dask.fit(X_train_scaled, y_train)
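
Before launching the grid search, it helps to know how much work it implies. The sketch below counts the candidate parameter combinations in the `param_grid` defined above and multiplies by the `cv=10` folds. It only enumerates the grid and does not fit any model.

```python
# Sketch: size of the grid search defined above.
import itertools

param_grid = {
    "max_depth": list(range(1, 10)),
    "min_samples_split": list(range(2, 10)),
    "criterion": ["gini", "entropy"],
}
cv_folds = 10

# every combination of one value per parameter
candidates = [
    dict(zip(param_grid, combo))
    for combo in itertools.product(*param_grid.values())
]
n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "model fits")
```

Narrowing `max_depth` or `min_samples_split` to a few representative values would cut the number of fits substantially if the search turns out to be too slow on the full data.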

In [None]:
best_params = dt_model_grid_dask.best_params_
print(best_params)

In [None]:
print(dt_model_grid_dask.best_score_)

In [None]:
# now that we've performed a grid search, use the parameters from our best model

# instantiate best model
best_dt_model = DecisionTreeClassifier(
    max_depth=best_params['max_depth'],
    min_samples_split=best_params['min_samples_split'],
    criterion=best_params['criterion'],
    random_state=seed
)

# fit model to training data
'''
does this need joblib backend if we are using native dask?
'''
with joblib.parallel_backend("dask"):
    best_dt_model.fit(X_train_scaled, y_train)

# check accuracy from this model on test data
best_dt_model.score(X_test_scaled, y_test)

## Analysis results

## Challenges & solutions

## Insights gained

## Future work

## References

1. Dask vs. Spark picture: https://medium.datadriveninvestor.com/pandas-dask-or-pyspark-what-should-you-choose-for-your-dataset-c0f67e1b1d36
2. Accelerometer information: https://en.wikipedia.org/wiki/Accelerometer
3. Gyroscope information: https://en.wikipedia.org/wiki/Gyroscope

In [17]:
# always close client connection at end of workflow
client.shutdown()