## **0. Project History**
---

##### **Maintaining a Change Log**

A change log is a file that records all changes made to a project, including bug fixes, new features, and improvements. It is a crucial part of any project's documentation, as it provides a historical record of the project's evolution.

##### **Why Maintain a Change Log?**

- **Tracking changes:** A change log lets you track every change made to your project. This is especially useful in a team, because everyone can see what was changed and why.

- **Documentation:** A change log serves as a form of documentation. It can be used to understand the history of a project, how it evolved, and what its current state is.

- **Rollback:** If a problem arises, the change log helps identify which earlier version of the project to roll back to.

##### **How to Maintain a Change Log**

- **Keep it up to date:** Whenever you make a change to your project, update the change log. This can be as simple as noting the date, the person making the change, and a brief description of the change.

- **Use a consistent format:** Use the same format for all entries, including the date format, the format of the person's name, and the format of the description.

- **Keep it organized:** If your project has a lot of changes, consider organizing the change log. For example, you could group changes by date, by person, or by type of change.

- **Use version control systems:** If you use a version control system like Git, much of this record is kept for you: the commit history shows what changes were made and, through the commit messages, why.
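
As a minimal sketch of the "consistent format" advice, the helper below renders one entry in the layout used by the change log in this notebook (ISO date, author, then one bullet per change). `format_changelog_entry` is a hypothetical name and the author and change text are placeholders; none of this is part of the project's code.

```python
from datetime import date


def format_changelog_entry(day, author, changes):
    """Return one Markdown change-log entry: a bold-italic header plus bullets."""
    header = f"***{day.isoformat()} - {author}***"
    # Skip empty strings so a stray blank item does not produce an empty bullet
    bullets = [f"- {change.strip()}" for change in changes if change.strip()]
    return "\n".join([header, "", *bullets])


# Placeholder values, for illustration only
entry = format_changelog_entry(
    date(2024, 1, 1),
    "A. Author",
    ["Fixed a typo in the README", ""],
)
print(entry)
```

Using `date.isoformat()` keeps every entry in `YYYY-MM-DD` form, so entries sort correctly as plain text.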

#### **Change Log**


<span style="color:lime"> ***2023-11-28 - Amin Rezaei***</span>


- Switched off the AutoCalibrate function
- Relocated redundant import modules to the file level 

    ***Imports should be placed at the top of the module, not inside functions, unless there is a specific reason to do so. This makes the code more readable, maintainable, and avoids potential problems with circular imports, name clashes, and performance issues.***

- Included a package version control 
- Implementation of input and output sanity checks
- Added a generate CSV result feature for the Step2 results (alg_activity_acti4_v1_1_3.py)


<span style="color:lime"> ***2023-01-16 - Amin Rezaei***</span>

- Muted Sampling Frequency in ChunkActi4Pre_v1_3_3.py - default set to SF=30
- Added functions: extract_sensor_id, process_pre_step_data, chunk_data_extended
- Added a simple PDF report of sensor ID, sampling frequency, and available chunks
- Added the ability to store the chunks in CSV format in a directory named after the sensor ID
- Added section 7 for validating results and storing non-matching events
- Directory structure added

<span style="color:orange"> ***2023-02-22 - Sebastian Hørlück***</span>

- Adjusted relative imports to fit structure of GitHub repo
- Added functionality: Combine pre-chunks within days
- Added functionality: Combine classification-chunks for comparison of entire period
- Muted pdf generator after step 1
- Muted plotting of results after step 2
- Converted timestamps from backend to datetime objects when comparing with ground truth
- Removed timezone from backend times and ground truth times
- Changed alignment of backend and ground truth to be based on `pd.merge_asof(direction="nearest")` using datetime objects from backend and ground truth
    - Backend times seem to be pushed one second forward, as matching is much better when one second is subtracted from the backend time
- Added comparison of backend classifications without chunking (do errors come from chunking or from backend-specific functions?)
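
The nearest-timestamp alignment described in this entry can be sketched as follows. This is an illustration on toy data, not the notebook's actual comparison code: `align_nearest`, the one-second tolerance, and the toy labels are assumptions, while the column names `out_ts` and `out_cat` follow the notebook's outputs.

```python
import pandas as pd


def align_nearest(backend, truth, shift_seconds=0, tolerance="1s"):
    """Pair each backend row with the ground-truth row closest in time."""
    backend = backend.copy()
    # Optionally pull backend times back, e.g. by the one second noted above
    backend["out_ts"] = backend["out_ts"] - pd.Timedelta(seconds=shift_seconds)
    merged = pd.merge_asof(
        backend.sort_values("out_ts"),  # merge_asof requires sorted keys
        truth.sort_values("out_ts"),
        on="out_ts",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
        suffixes=("", "_gt"),
    )
    merged["match"] = merged["out_cat"] == merged["out_cat_gt"]
    return merged


# Toy data: the backend reports every label one second late
truth = pd.DataFrame(
    {
        "out_ts": pd.to_datetime(
            ["2023-10-05 22:00:00", "2023-10-05 22:00:01", "2023-10-05 22:00:02"]
        ),
        "out_cat": [1, 2, 3],
    }
)
backend = truth.assign(out_ts=truth["out_ts"] + pd.Timedelta(seconds=1))

unshifted = align_nearest(backend, truth)
shifted = align_nearest(backend, truth, shift_seconds=1)
print("match rate without shift:", unshifted["match"].mean())
print("match rate with 1 s shift:", shifted["match"].mean())
```

Both frames must hold timezone-naive (or identically localized) datetimes, which is why the entry above also removes the timezone from both sources.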

<span style="color:orange"> ***2024-04-29 Sebastian Hørlück***</span>

- Added functionality for multiple sensor input in step 2 (temporary)

### **0.1 Project Structure** 

The workspace is organized as follows:

```plaintext
Workspace/
│
├── Acti4_Sandbox (Chunking added).ipynb
│
├── Extras/
│   ├── requirements.txt
│   └── sens logo black.png
│
├── Functions/
│   ├── Acti4.py
│   ├── alg_activity_acti4_v1_1_3.py
│   ├── alg_chunk_acti4pre_v1_1_3.py
│   ├── backendfunctions.py
│   ├── preprocessing.py
│   ├── tools.py
│   ├── __init__.py
│   └── __pycache__/
│
├── Sample Dataset/
│   ├── Activity_perSecond_2401.csv
│   └── export_73-5D.C3_acc-3ax-4g_2023-10-05T220000_2023-10-07T22.00.00.bin
│
└── results/
    └── <sensor_id>/
        ├── Activity_Acti4_V_1_1_3/
        │   └── activity_chunk_<i+1>.csv
        ├── Pre_Acti4_V_1_1_3/
        │   └── pre_chunk_<i+1>.csv
        ├── non_matching_activity.csv
        └── 73-5D.C3_report.pdf
```

## **1. Background**
---

#### ***About***

This Jupyter Notebook provides a sandbox environment for loading raw accelerometer data from the SENS Motion sensor and performing various data analysis tasks. With the help of this notebook, you can easily import and preprocess the data, visualize it, and extract meaningful insights.

*Disclaimer*:
This notebook is part of the sandbox environment developed specifically for SENS Innovation Aps and NFA for the integration of the Acti4 algorithm V 1_1_3 into the SENS backend.

---

#### ***Structure of the notebook***

The notebook contains two types of cells:  

**Text cells** provide information and can be modified by double-clicking the cell. You are currently reading a text cell. You can create a new text cell by clicking `+ Text`.

**Code cells** contain code, which can be modified by selecting the cell. To execute a cell, move your cursor over the `[ ]` mark on the left side of the cell (a play button appears) and click it. When execution finishes, the play button animation stops. You can create a new code cell by clicking `+ Code`.

---

#### ***Google Colab short intro and features***

On the top left side of the notebook you will find the tabs which contain from top to bottom:

***Table of contents*** = contains structure of the notebook. Click the content to move quickly between sections.

***Find and replace*** = lets you find and replace text or variables in the selected cells or across the entire Jupyter Notebook.

***Variables*** = This feature allows you to inspect and manage variables within your Colab notebook.

***Secrets*** = This feature allows you to safely store your private keys, such as your [SENS](https://ask.for.link.dk/r/login) API tokens, in Colab! Values stored in Secrets are private, visible only to you and the notebooks you select.

***Files*** = This feature allows you to manage and access files within your Colab environment. It provides an interface for uploading, downloading, and organizing files, which is useful for tasks such as data manipulation and storage.


---


**Remember that all uploaded files are purged after changing the runtime.** However, all files saved in Google Drive will remain. <u>You do not need to use the Mount Drive-button; your Google Drive can be connected if you run section 1.4.</u>

**Note:** If you wish to proceed with the pre-loaded "sample data", you do not need to upload anything in **Section 3. Load your dataset**!

## **2. Preparing the environment**
---

To get started, make sure you have the necessary dependencies installed and loaded.

**Tip:** For reproducible results and a smooth experience, run the cells in order from top to bottom.

### **2.1 Install key dependencies**
---
<font size = 4>

In [None]:
# ! pip install -r requirements.txt

### **2.2 Load key dependencies**
---
<font size = 4>

In [1]:
import os
import sys
import scipy
import platform
import warnings
import numpy as np
import numpy.matlib
import pandas as pd
import ipywidgets as widgets
import matplotlib.pyplot as plt
from prettytable import PrettyTable
from datetime import timedelta
from itertools import product
from tqdm.notebook import tqdm

from functions.preprocessing import read_bin
from alg_chunk_motus_pre_v2_0_0 import ChunkMotusPre_v2_0_0
from alg_activity_motus_v2_0_0_ErgoConnect import ActivityMotus_v2_0_0_ErgoConnect
from alg_activity_motus_v1_2_0 import ActivityMotus_v1_2_0
from functions.BackendExtras.tools import (
    sanity_check_pre,
    time_sanity,
    acti4pre_csv,
    activity_motus_csv,
    plot_activity_classification,
    display_dataframe_info,
    extract_sensor_id,
    generate_pdf,
    chunk_data_extended,
    motuspre_merge_outputs,
    process_pre_step_data,
    extract_id_placements,
    get_chunk_start_idx,
)

from bokeh.layouts import column
from bokeh.models import ColumnDataSource, RangeTool
from bokeh.plotting import figure, show, output_file, save

# Setup autoreload of modules, so kernel does not have to be restarted when import files are changed
%load_ext autoreload
%autoreload 2

# Define the list of required dependencies
dependencies = [
    (
        "Python",
        platform.python_version(),
        "Installed" if "python" in sys.modules else "Pre Installed",
    ),
    (
        "NumPy",
        np.__version__,
        "Installed" if "numpy" in sys.modules else "Not Installed",
    ),
    (
        "Pandas",
        pd.__version__,
        "Installed" if "pandas" in sys.modules else "Not Installed",
    ),
    (
        "SciPy",
        scipy.__version__,
        "Installed" if "scipy" in sys.modules else "Not Installed",
    ),
]

# Create a table to display package information
table = PrettyTable()
table.field_names = ["Library", "Version", "Installation Status"]

# Set the max width
table.max_width = 15

for dependency in dependencies:
    table.add_row(dependency)

# Print the table
print(table)

+---------+---------+---------------------+
| Library | Version | Installation Status |
+---------+---------+---------------------+
|  Python |  3.11.0 |    Pre Installed    |
|  NumPy  |  1.24.4 |      Installed      |
|  Pandas |  2.0.3  |      Installed      |
|  SciPy  |  1.10.1 |      Installed      |
+---------+---------+---------------------+
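
The status column in the table above is derived from `sys.modules`, which only reflects what has been imported in the current session. As a hedged alternative, the sketch below queries the installed distribution metadata through the standard library's `importlib.metadata`. The helper name `installed_versions` and the deliberately missing package name are illustrative assumptions.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment
            found[name] = None
    return found


# The last name is intentionally bogus to show the "not installed" path
report = installed_versions(["numpy", "pandas", "scipy", "surely-not-a-real-package"])
for name, ver in report.items():
    print(f"{name:<28}{ver or 'Not installed'}")
```

Because nothing is imported, this check also works before the dependencies themselves are loaded.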


## **3. Load your dataset**
---

### **3.1 Handling your data**
---

##### ***Downloading data***

Data export is handled through the [web application](https://app.sens.dk/r/login). We recommend using a browser like Mozilla Firefox on your computer.

Data can be exported in the formats listed below. To use this notebook, however, you need to download the **<u>raw data</u>** in <u>.bin</u> format.

* PDF files
* Raw data (in .bin or .hex format)
* CSV file from accelerometer data
* CSV format directly from the Patient Overview

If you need guidance on, or an introduction to, our **<u>Web App</u>**, please visit our [Support page](https://support.sens.dk/hc/en-us/articles/8001330518429-Data-export) for an in-depth explanation of data retrieval from the [Web App](https://app.sens.dk/r/login).

---

##### ***Setting the path to your data***


**`raw_acc_bin`:** `raw_acc_bin` should be provided with the path to your data. To find the path of the folder containing your dataset, go to Files on the left of the notebook, navigate to the folder containing your files, right-click the folder, choose **Copy path**, and paste the path into the appropriate box below.

***Note:*** If you do not wish to use files from your Google Drive, you can simply drag and drop your file to the temporary drive that has been allocated to this session. Any files uploaded to your temporary Colab drive will be purged once you close the notebook.

In [2]:
# Set folders
src_folder = "U:\\DI\\MOTUS\\Diverse kode tests\\BackendV2_0_0\\source"
gt_folder = (
    "U:\\DI\\MOTUS\\Diverse kode tests\\BackendV2_0_0\\source\\output\\activity_files"
)
res_folder = "U:\\DI\\MOTUS\\Diverse kode tests\\BackendV2_0_0\\results"

In [3]:
# Source files available
src_file_ids = list(
    set([extract_sensor_id(i) for i in os.listdir(src_folder) if ".bin" in i])
)
print(src_file_ids)

['2401', '007', '22004', '22003', '21005']


In [4]:
id_dropdown = widgets.Dropdown(
    options=[id for id in src_file_ids],
    description="Ids",
    layout=widgets.Layout(width="200px"),
)
placement_dropdown = widgets.SelectMultiple(
    description="Sensor placements",
    layout=widgets.Layout(width="400px"),
)


def update(*args):
    # Select ID and body placements
    selected_id = id_dropdown.value
    id_placements = extract_id_placements(selected_id, src_folder)
    placement_dropdown.options = list(id_placements.keys())


id_dropdown.observe(update)

display(id_dropdown)
display(placement_dropdown)

Dropdown(description='Ids', layout=Layout(width='200px'), options=('2401', '007', '22004', '22003', '21005'), …

SelectMultiple(description='Sensor placements', layout=Layout(width='400px'), options=(), value=())

In [5]:
# Get ID and sensor placements
selected_id = id_dropdown.value

selected_placements = placement_dropdown.value

selected_files = {
    key: os.path.join(src_folder,value)
    for key, value
    in extract_id_placements(selected_id, src_folder).items()
    if key in selected_placements
}

data_dict = {}

print(f"ID: {selected_id}")
print(f"Selected sensors: {', '.join(selected_placements)}\n")
# Allocate data to data dict
for plc, file in selected_files.items():
    print(f"\tReading {plc} ...")
    data_dict[plc] = read_bin(file)  # file already holds the full path (joined above)
    if data_dict[plc] is None:
        print(f"\tError in reading data for {plc}\n")
    else:
        print(f"\tData for {plc} succesfully read\n")
# name = os.path.join(src_folder, f"export_raw_{selected_id}.bin")
# gt_file = os.path.join(gt_folder, f"Activity_perSecond_{selected_id}.csv")

# # Read the bin file
# read_bin_data = read_bin(name)

# if read_bin_data is None:
#     print("Error: No data was returned from the read_bin function\n")
#     sys.exit()
# else:
#     print("Success: Data was returned from the read_bin function\n")
#     print("Sensor ID: ", extract_sensor_id(name), "\n")

ID: 2401
Selected sensors: thigh

	Reading thigh ...
	Data for thigh succesfully read



In [6]:
# Search all activity files in ground truth folder
possible_files = [
    file for file
    in os.listdir(gt_folder)
    if (
        selected_id in file and
        "activity_persecond" in file.lower()
    )
]

# If only one file matches name format and has ID, choose this
if len(possible_files) == 1:
    gt_file = os.path.join(gt_folder,possible_files[0])
    print(f'"Ground truth"-file: {possible_files[0]}')
# If more than one file matches activity file name-type and has ID, make user select
elif len(possible_files) > 1:
    file_list = "\n".join(possible_files)
    print(f'Multiple files could be "ground truth". Manually put the name of the right one: \n{file_list}')
    file_name = input("Copy file name here: ")
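    gt_file = os.path.join(gt_folder,file_name)
# If no file matches, stop here instead of leaving gt_file undefined
else:
    raise FileNotFoundError(f"No ground-truth file found for ID {selected_id} in {gt_folder}")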

"Ground truth"-file: Activity_perSecond_2401.csv


### **3.2 Loading provided dataset - bulk**
---

In [None]:
# # Create empty lists to store data and timestamps
# data_list = []
# ts_list = []

# # Loop over placement order and append data if any
# for key in key_order:
#     if key in data_dict:
#         data_list.append(data_dict[key][:, 1:4])
#         ts_list.append(data_dict[key][:, 0])
#     else:
#         data_list.append(None)
#         ts_list.append(None)

#### **3.2.1 Chunking the Dataset**
---

In [7]:
chunk_of_data_in_hours = 12  # chunk length, in hours
overlap_threshold_in_hours = 10  # NOTE: despite the name, this overlap is in minutes

# Define the order of data
key_order = ["thigh", "trunk", "arm", "calf"]

chunk_dict = {}

print(f"ID: {selected_id}")
print(f"Selected sensors: {', '.join(selected_placements)}\n")
for plc in key_order:
    if plc in data_dict.keys():
        chunk_dict[plc] = {}
        print(f"\tChunking data for {plc}...")
        # Process the data
        (
            chunk_dict[plc]["list_of_df"],
            chunk_dict[plc]["sampling_frequency"],
            chunk_dict[plc]["df"],
            chunk_dict[plc]["extended_chunks_original_format"],
        ) = chunk_data_extended(
            data_dict[plc],
            desired_chunking=chunk_of_data_in_hours,
            overlapping_threshold=overlap_threshold_in_hours,
        )
        print(
            f"\tTotal number of chunks: {len(chunk_dict[plc]['extended_chunks_original_format'])}"
        )
        print(
            f"\tStart of first chunk: {chunk_dict[plc]['list_of_df'][0].iloc[0,0].strftime('%Y-%m-%d %H:%M:%S')}"
        )
        print(
            f"\tStart of last chunk: {chunk_dict[plc]['list_of_df'][-1].iloc[0,0].strftime('%Y-%m-%d %H:%M:%S')}\n"
        )
# # Generate the PDF report
# generate_pdf(list_of_df, sampling_frequency, df, name)

# ts_list_chunk = []
# data_list_chunk = []

# out_pre_all_chunks = []
# acti4pre_df_all_chunks = []

# for i in range(len(extended_chunks_original_format)):
#     ts_list_chunk.append(extended_chunks_original_format[i][:, 0])
#     data_list_chunk.append(extended_chunks_original_format[i][:, 1:4])

# del read_bin_data, ts_list, data_list

ID: 2401
Selected sensors: thigh

	Chunking data for thigh...
	Total number of chunks: 5
	Start of first chunk: 2023-10-05 22:00:00
	Start of last chunk: 2023-10-07 11:50:00
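
The output above is consistent with chunks laid out on a fixed 12-hour grid, each padded by the 10-minute overlap and clipped to the recording. The sketch below reproduces that layout under this assumption. It is not the implementation of `chunk_data_extended`: `padded_windows` is a hypothetical helper, and the recording end time is taken from the sample `.bin` file name.

```python
import pandas as pd


def padded_windows(start, end, chunk_hours=12, overlap_minutes=10):
    """Return (window_start, window_end) pairs on a fixed grid, padded and clipped."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    step = pd.Timedelta(hours=chunk_hours)
    pad = pd.Timedelta(minutes=overlap_minutes)

    # Find the last grid edge (midnight + k * step) at or before the first sample
    edge = start.normalize()
    while edge + step <= start:
        edge += step

    windows = []
    while edge < end:
        # Pad each grid cell on both sides, but never reach outside the recording
        windows.append((max(edge - pad, start), min(edge + step + pad, end)))
        edge += step
    return windows


# Start and end assumed from the sample file name (2023-10-05T220000 to 2023-10-07T22.00.00)
windows = padded_windows("2023-10-05 22:00:00", "2023-10-07 22:00:00")
for begin, finish in windows:
    print(begin, "->", finish)
```

Under these assumptions the sketch yields five windows, the first starting at 22:00:00 on 2023-10-05 and the last at 11:50:00 on 2023-10-07, in line with the printed chunk summary.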



In [8]:
# align chunks on timestamp
chunk_dict, chunk_range = get_chunk_start_idx(chunk_dict, selected_placements)

## **4. alg_chunk_acti4pre_v1_2_0**
---

#### **4.1 Sanity check on the provided dataset and running the pre-step on each chunk**

In [9]:
pre_dict = {}
for plc in selected_placements:
    pre_dict[plc] = {
        "out_pre_all_chunks": [],
        "motuspre_df_all_chunks": [],
        "motuspre_df_all_chunks_without_threshold": [],
    }

In [10]:

iterprod = product(selected_placements,range(len(chunk_range)))

pbar = tqdm(
    list(iterprod),
    desc = "Running pre step",
)

for plc, i in pbar:
    if chunk_dict[plc]['list_of_df'][i] is not None:
        ts_list = chunk_dict[plc]["extended_chunks_original_format"][i][:, 0]
        data_list = chunk_dict[plc]["extended_chunks_original_format"][i][:, 1:4]

        ts_list = time_sanity(ts_list)

        # if len(ts_list) != len(data_list):
        #     print("ERROR: Data and timestamp lists are not equal in length!")
        # else:
        #     print(
        #         "Data and timestamp lists are equal in length, proceeding with the preprocessing..."
        #     )

        # perform sanity check on the inputs
        ts_list, data_list = sanity_check_pre(ts_list, data_list)

        # Reference the ChunkMotusPre_v2_0_0 class; analyse_data_list_new is called on the class itself, not on an instance
        chunk_acti4_pre = ChunkMotusPre_v2_0_0

        # Call the analyse_data_list_new method and pass the inputs
        out_ts, out_cat, out_val, out_ver = chunk_acti4_pre.analyse_data_list_new(
            ts_list, data_list
        )

        # combine the categorical and value outputs for step 2
        out_pre = np.column_stack((out_ts, out_cat, out_val))

        
        # save the categorical and value outputs in a list
        pre_dict[plc]["out_pre_all_chunks"].append(out_pre)

        # merge the timestamp and categorical/value outputs into one DataFrame (written to CSV in section 4.2)
        motuspre_df = motuspre_merge_outputs(
            out_ts, out_cat, out_val, i
        )

        pre_dict[plc]["motuspre_df_all_chunks"].append(motuspre_df)

    else:
        pre_dict[plc]["out_pre_all_chunks"].append(None)
        pre_dict[plc]["motuspre_df_all_chunks"].append(None)
    

# del motuspre_df, out_pre           
#     ts_list_chunk.append(extended_chunks_original_format[i][:, 0])
#     data_list_chunk.append(extended_chunks_original_format[i][:, 1:4])

Running pre step:   0%|          | 0/5 [00:00<?, ?it/s]

#### **4.2 Removing the overlapping minutes and storing chunks**
---

In [11]:
iterprod = product(selected_placements,range(len(chunk_range)))

pbar = tqdm(
    list(iterprod),
    desc = "Storing pre steps"
)

for plc, i in pbar:
    if chunk_dict[plc]['list_of_df'][i] is not None:
        # overlap is given in minutes; multiply by 60 assuming one output row per second
        rows_to_remove = overlap_threshold_in_hours * 60

        if i == 0:
            pre_dict[plc]["motuspre_df_all_chunks_without_threshold"].append(
                pre_dict[plc]["motuspre_df_all_chunks"][i][: -rows_to_remove + 2]
            )

        elif i == len(chunk_range)-1:
            pre_dict[plc]["motuspre_df_all_chunks_without_threshold"].append(
                pre_dict[plc]["motuspre_df_all_chunks"][i][rows_to_remove:]
            )

        else:
            pre_dict[plc]["motuspre_df_all_chunks_without_threshold"].append(
                pre_dict[plc]["motuspre_df_all_chunks"][i][rows_to_remove : -rows_to_remove + 2]
            )

        # Create the output folder if needed (exist_ok makes a prior existence check unnecessary)
        os.makedirs(
            os.path.join(res_folder, f"{selected_id}/prestep"),
            exist_ok=True,
        )

        pre_dict[plc]["motuspre_df_all_chunks_without_threshold"][i].to_csv(
            os.path.join(
                res_folder,
                f"{selected_id}/prestep",
                f"pre_chunk_{plc}_{i+1}.csv"
            ),
            index=False            
        )
    else:
        pre_dict[plc]["motuspre_df_all_chunks_without_threshold"].append(
            None
        )


Storing pre steps:   0%|          | 0/5 [00:00<?, ?it/s]
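
The trimming rule used in the cell above can be isolated into a small function, which makes the three cases (first, middle, last chunk) easier to check on toy data. This is a sketch that mirrors the slicing in the cell. `trim_overlap` and `tail_keep` are hypothetical names, and the reason the original keeps two extra rows at the tail is not stated in the notebook.

```python
import pandas as pd


def trim_overlap(df, index, n_chunks, rows_to_remove, tail_keep=2):
    """Drop the padded rows of one chunk, following the first/middle/last rule."""
    # Same end position as the slice [: -rows_to_remove + 2] in the cell above
    stop = len(df) - rows_to_remove + tail_keep
    if index == 0:
        return df.iloc[:stop]  # first chunk: trim the tail only
    if index == n_chunks - 1:
        return df.iloc[rows_to_remove:]  # last chunk: trim the head only
    return df.iloc[rows_to_remove:stop]  # middle chunk: trim both ends


# Toy chunk with 20 rows and a 5-row overlap
toy = pd.DataFrame({"x": range(20)})
parts = [trim_overlap(toy, i, n_chunks=3, rows_to_remove=5) for i in range(3)]
print([len(p) for p in parts])
```

Keeping the rule in one function also makes it straightforward to test what happens when a chunk is shorter than the overlap.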

##### **4.3 ChunkActi4Pre_v1_2_0 output sanity check**
---

In [12]:
iterprod = product(selected_placements,range(len(chunk_range)))

for plc, i in iterprod:
    # print(
    #     f"\Sanity checking of chunk {i+1} of {len(acti4pre_df_all_chunks_without_threshold)}\n"
    # )
    if pre_dict[plc]["motuspre_df_all_chunks_without_threshold"][i] is not None:
        display_dataframe_info(pre_dict[plc]["motuspre_df_all_chunks_without_threshold"][i])
    print("\n")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7201 entries, 0 to 7200
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ts_chunked      7201 non-null   int64 
 1   ts_chunked_iso  7201 non-null   object
 2   out_cat         7201 non-null   int32 
 3   Stdx            7201 non-null   int32 
 4   Stdy            7201 non-null   int32 
 5   Stdz            7201 non-null   int32 
 6   Meanx           7201 non-null   int32 
 7   Meany           7201 non-null   int32 
 8   Meanz           7201 non-null   int32 
 9   hlratio         7201 non-null   int32 
 10  Iws             7201 non-null   int32 
 11  Irun            7201 non-null   int32 
 12  NonWear         7201 non-null   int32 
 13  xsum            7201 non-null   int32 
 14  zsum            7201 non-null   int32 
 15  xSqsum          7201 non-null   int32 
 16  zSqsum          7201 non-null   int32 
 17  xzsum           7201 non-null   int32 
 18  SF12    

In [None]:
# for j, i in enumerate(acti4pre_df_all_chunks_without_threshold):
#     if j >= 1:
#         print(
#             f'Last chunk j-1: {acti4pre_df_all_chunks_without_threshold[j-1]["ts_chunked_iso"].iloc[-1]}'
#         )
#         print(f'First chunk j: {i["ts_chunked_iso"].iloc[0]}')
#         last_dt = pd.to_datetime(
#             acti4pre_df_all_chunks_without_threshold[j - 1]["ts_chunked_iso"].iloc[-20:]
#         )
#         first_dt = pd.to_datetime(i["ts_chunked_iso"].iloc[:20])
#         print(
#             f'Freq on last 20 (j-1): {np.diff(last_dt).astype("timedelta64[ms]").mean()}'
#         )
#         print(
#             f'Freq on first 20 (j): {np.diff(first_dt).astype("timedelta64[ms]").mean()}'
#         )
#         print(
#             f"Diff in collection point: {int(abs(first_dt.iloc[0]-last_dt.iloc[-1]).total_seconds()*1000)} miliseconds"
#         )
#         print("-" * 40)

## **5. alg_activity_acti4_v1_2_0.py**
---

##### **5.1 Step2 Preparation**
---

In [13]:
# Convert chunked data to arrays
for plc in selected_placements:
    pre_dict[plc]["out_pre_all_chunks_arrays"] = process_pre_step_data(pre_dict[plc]["out_pre_all_chunks"])

In [14]:
# Create df of starts and ends of chunks
start_end_df = pd.DataFrame(chunk_range,columns=["Start Timestamp"])
start_end_df["End Timestamp"] = start_end_df["Start Timestamp"] + pd.Timedelta(hours=12)
start_end_df["Chunk number"] = range(1,len(chunk_range)+1)
start_end_df["Date"] = start_end_df["Start Timestamp"].dt.date
start_end_df

Unnamed: 0,Start Timestamp,End Timestamp,Chunk number,Date
0,2023-10-05 12:00:00,2023-10-06 00:00:00,1,2023-10-05
1,2023-10-06 00:00:00,2023-10-06 12:00:00,2,2023-10-06
2,2023-10-06 12:00:00,2023-10-07 00:00:00,3,2023-10-06
3,2023-10-07 00:00:00,2023-10-07 12:00:00,4,2023-10-07
4,2023-10-07 12:00:00,2023-10-08 00:00:00,5,2023-10-07


In [15]:
daychunks = list(start_end_df.groupby('Date')['Chunk number'].apply(list).values)
days = list(start_end_df['Date'].unique())

##### **5.2 Run step2 on all chunks**
---

In [19]:
# Define the order of data
key_order = ["thigh", "trunk", "arm", "calf"]
# Find number of cols in data
for i in pre_dict[list(pre_dict.keys())[0]]['out_pre_all_chunks']:
    if i is not None:
        ncols = i.shape[1]
        break

list_of_dfs = []

iterlist = list(zip(daychunks,days))

pbar = tqdm(
    iterlist,
    desc="Step 2",
    total=len(iterlist),
)
warnings.filterwarnings("ignore") # deactivate warnings temporarily

# Loop over chunks related to each day
for chunks, day in pbar:
    data_list = [] # List to store data for step 2
    ts_list = [] # List to store timestamps for step 2
    # Add previous chunk (not sure if this should be done)
    # if 1 not in chunks:
    #     chunks = [min(chunks)-1]+chunks
    pbar.write(f"{day:%D} {chunks = }")
    for plcidx, plc in enumerate(key_order): # loop over placements in correct order
        list_of_arrays = [] # List to store data for each chunk
        if plc in pre_dict.keys(): # If placement is loaded
            for chunk in chunks: # append data for each chunk
                if pre_dict[plc]["out_pre_all_chunks"][chunk-1] is not None:
                    list_of_arrays.append(
                        pre_dict[plc]["out_pre_all_chunks_arrays"][chunk-1]
                    )
                else: # and if no data in chunk append None
                    list_of_arrays.append(
                        [[None]*ncols]
                    )
        else: # If placement is not selected append None
            for chunk in chunks:
                list_of_arrays.append(
                    [[None]*ncols]
                )
        conc_array = np.concatenate(list_of_arrays, axis=0) # Concatenate chunks
        conc_array = conc_array[(conc_array != None).all(axis=1)] # Remove Nones
        data_list.append(conc_array[:,1:] if conc_array.shape[0] != 0 else None)  # Append data for date chunks
        ts_list.append(conc_array[:,0])  # Append timestamps for date chunks

    
    
    # Run step 2
    activity_motus = ActivityMotus_v1_2_0
    parameters = {"TrunkRef": None}
    
    # Call the analyse_data_list_new method and pass the inputs
    (
        out_ts_step2,
        out_cat_step2,
        out_val_step2,
        out_ver_step2,
        _,
    ) = activity_motus.analyse_data_list_new(
        ts_list,
        data_list,
        parameters=parameters,
        debug_stream=None,
        debug_chunks=None,
    )

    out_ts_step2 = out_ts_step2.reshape(-1,1)
    out_cat_step2 = out_cat_step2.reshape(-1,1)
    
    # combine the categorical and value outputs for step 2
    activity_motus = activity_motus_csv(
        out_ts_step2, out_cat_step2, out_val_step2, day, selected_id, res_folder, ver="thighonly"
    )

    activity_motus.drop(["out_ts"], axis=1, inplace=True)

    # rename the out_ts_iso to out_ts
    activity_motus.rename(columns={"out_ts_iso": "out_ts"}, inplace=True)

    list_of_dfs.append(activity_motus)

warnings.resetwarnings() # reactivate warnings

Step 2:   0%|          | 0/3 [00:00<?, ?it/s]

10/05/23 chunks = [1]
10/06/23 chunks = [2, 3]
10/07/23 chunks = [4, 5]


In [23]:
activity_motus_all.columns[1:-1]

Index(['Steps'], dtype='object')

In [25]:
# Concatenate output dataframes
activity_motus_all = pd.concat(list_of_dfs).reset_index(drop=True)
# Remove timezone
activity_motus_all["out_ts"] = pd.to_datetime(
    activity_motus_all["out_ts"]
).dt.tz_localize(None)

# Capture value columns (that are multiplied by 1000)
val_cols = activity_motus_all.columns[1:-1]

# Capture nans (converted to large negative number) and convert
# Also divide other values by 1000
activity_motus_all.loc[:,val_cols] = activity_motus_all[val_cols].apply(lambda x: np.where(x<-21000000,np.nan,x/1000))
# activity_motus_all.loc[angle_nans:angle_cols]

In [26]:
activity_motus_all

Unnamed: 0,out_cat,Steps,out_ts
0,2,0.0,2023-10-05 22:00:00
1,2,0.0,2023-10-05 22:00:01
2,2,0.0,2023-10-05 22:00:02
3,2,0.0,2023-10-05 22:00:03
4,2,0.0,2023-10-05 22:00:04
...,...,...,...
172798,2,0.0,2023-10-07 21:59:53
172799,2,0.0,2023-10-07 21:59:54
172800,2,0.0,2023-10-07 21:59:55
172801,2,0.0,2023-10-07 21:59:56
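
The value decoding applied when `activity_motus_all` is built (large negative numbers become NaN, everything else is divided by 1000) can be sketched on a toy frame as below. `decode_values` is a hypothetical helper; the threshold is the one used in the notebook, while the concrete sentinel value in the toy data is an assumption chosen only to fall below it.

```python
import pandas as pd


def decode_values(df, cols, nan_limit=-21_000_000, scale=1000):
    """Turn sentinel values below nan_limit into NaN and rescale the rest."""
    out = df.copy()
    vals = out[cols].astype("float64")
    # where() keeps values that satisfy the condition and sets the others to NaN
    out[cols] = vals.where(vals >= nan_limit) / scale
    return out


# Toy data: 1500 stands for 1.5; the second value is an illustrative sentinel
raw = pd.DataFrame({"out_cat": [2, 5], "Steps": [1500, -2_147_483_648]})
decoded = decode_values(raw, ["Steps"])
print(decoded)
```

Working on a copy leaves the raw integer output untouched, which helps when the same frame is later compared against the backend values.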


##### **5.3 ActivityActi4_v1_3_3 output sanity check**
---

In [27]:
# analyse the output and data attributes
display_dataframe_info(activity_motus_all)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172803 entries, 0 to 172802
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype         
---  ------   --------------   -----         
 0   out_cat  172803 non-null  int64         
 1   Steps    172803 non-null  float64       
 2   out_ts   172803 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 4.0 MB


{'Dataframe Type': pandas.core.frame.DataFrame,
 'Dataframe Shape': (172803, 3),
 'Dataframe Columns': ['out_cat', 'Steps', 'out_ts'],
 'Dataframe Info': None}

## **6. Plot results** (muted for now)
---

In [None]:
# Insert code here , but comment out

## **7. Validating results**
---

#### **7.1. Loading ground truth data and trimming it to match the chunk beginning and end**

In [28]:
# Function to compute match stats
def match_stats(
    df_in, match_var="match", perday=True, date_var="out_ts", print_res=True
):
    """
    Analyze match statistics from a DataFrame.

    Parameters:
    - df_in (DataFrame): Input DataFrame containing match data.
    - match_var (str, optional): Column name in the DataFrame containing match information. Defaults to 'match'.
    - perday (bool, optional): Indicates whether to analyze match statistics per day. Defaults to True.
    - date_var (str, optional): Column name in the DataFrame containing dates. Defaults to 'out_ts'.
    - print_res (bool, optional): Indicates whether to print the analysis results. Defaults to True.

    Returns:
    - dict: A dictionary containing match statistics DataFrames.
        - 'collapsed': DataFrame summarizing match counts and proportions.
        - 'perday' (if perday is True): DataFrame summarizing match counts and proportions per day.

    Notes:
    - If perday is True and date_var is None, it prints a message and returns None.
    - If perday is True and date_var is not found in the DataFrame columns, it sets perday to False.
    """
    if perday and date_var is None:
        print("date_var must be defined")
        return None
    elif perday and date_var not in df_in.columns:
        print(f"{date_var} not in columns of the input dataframe")
        perday = False

    df_out = {}
    # Create a DataFrame with counts and proportions
    df_out["collapsed"] = pd.DataFrame(
        {
            "count": df_in[match_var].value_counts(),
            "proportion %": df_in[match_var].value_counts(normalize=True) * 100,
        }
    )

    msg = ""
    msg += "-" * 40
    msg += "\n"
    msg += "Matching results \n"
    msg += "-" * 40
    msg += "\n\n"
    msg += df_out["collapsed"].to_string()

    if perday:
        # Create a DataFrame with counts and proportions
        df_out["perday"] = pd.DataFrame(
            {
                "count": df_in.groupby(df_in[date_var].dt.date)[
                    match_var
                ].value_counts(),
                "proportion %": df_in.groupby(df_in[date_var].dt.date)[
                    match_var
                ].value_counts(normalize=True)
                * 100,
            }
        )

        msg += "\n\n"
        msg += "-" * 40
        msg += "\n\n"
        msg += df_out["perday"].to_string()

    if print_res:
        print(msg)

    return df_out
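
The two tables built by `match_stats` can be illustrated on a small example. This is a minimal sketch on made-up data (the `toy` DataFrame and its values are hypothetical); it repeats the same `value_counts` pattern as the function rather than calling it.

```python
import pandas as pd

# Toy data (hypothetical): five comparisons spread over two days
toy = pd.DataFrame(
    {
        "out_ts": pd.to_datetime(
            [
                "2023-10-05 22:00:00",
                "2023-10-05 22:00:01",
                "2023-10-06 08:00:00",
                "2023-10-06 08:00:01",
                "2023-10-06 08:00:02",
            ]
        ),
        "match": [True, False, True, True, False],
    }
)

# Overall counts and percentages, as in the 'collapsed' table
collapsed = pd.DataFrame(
    {
        "count": toy["match"].value_counts(),
        "proportion %": toy["match"].value_counts(normalize=True) * 100,
    }
)

# Per-day counts and percentages, as in the 'perday' table
by_day = toy.groupby(toy["out_ts"].dt.date)["match"]
perday = pd.DataFrame(
    {
        "count": by_day.value_counts(),
        "proportion %": by_day.value_counts(normalize=True) * 100,
    }
)

print(collapsed)
print(perday)
```

In the per-day table the percentages are normalised within each day, so they sum to 100 for every date rather than over the whole DataFrame.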

In [37]:
((df_gt['out_ts'] >= start_end_timestamps[0]) & 
 (df_gt['out_ts'] <= start_end_timestamps[1]))

0         False
1         False
2         False
3         False
4         False
          ...  
691194    False
691195    False
691196    False
691197    False
691198    False
Name: out_ts, Length: 691199, dtype: bool

In [38]:
"""
This section performs the following tasks:
1. Takes the first and last timestamps of the 'out_ts' column in 'activity_motus_all' and saves them in a list.
2. Loads the ground-truth CSV file ('gt_file') as a dataframe ('df_gt').
3. Renames the columns to match the output dataframe ('Activity' to 'out_cat' and 'Time' to 'out_ts').
4. Keeps only the columns needed for the comparison ('out_ts', 'out_cat', plus the trunk and arm angle columns when 'trunk' or 'arm' is in 'data_dict').
5. Maps the numerical activity codes in 'out_cat' to activity names in a new 'activity_name' column.
6. Parses the 'out_ts' column as datetimes and removes the timezone information, giving timezone-naive datetimes.
7. Filters 'df_gt' to include only rows with timestamps between the first and last backend timestamps.
8. Selects the same columns from 'activity_motus_all', sorts them by 'out_ts' and shifts the backend time by one second.
9. Renames all ground-truth columns with a '_gt' suffix to avoid column name conflicts.
10. Merges the ground-truth and backend rows on the nearest timestamp (within one second) to create a new dataframe ('new_df').
11. Compares the 'out_cat' and 'out_cat_gt' columns in 'new_df' and creates a new column called 'match' with True if the values match and False if they don't.
12. Calls 'match_stats' to compute and print counts and proportions of the 'match' column values, overall and per day.
"""

# take the first and last timestamps of the out_ts column in activity_motus_all and save them in a list
start_end_timestamps = []

# append the first and last timestamps to the list; they bound the comparison period
start_end_timestamps.append(activity_motus_all[["out_ts"]].iloc[0].values[0])
start_end_timestamps.append(activity_motus_all[["out_ts"]].iloc[-1].values[0])

print(
    f'Comparison for timeperiod: \n   {start_end_timestamps[0].astype("datetime64[D]")} - {start_end_timestamps[-1].astype("datetime64[D]")}'
)

# load csv file as a dataframe
df_gt = pd.read_csv(gt_file)

# rename the columns to match the output dataframe, Activity to out_cat and Time to out_ts
df_gt.rename(columns={"Activity": "out_cat", "Time": "out_ts"}, inplace=True)

# only use the columns that we need for the comparison, including time and activity
keep_cols = ['out_ts', 'out_cat']
if 'trunk' in data_dict.keys():
    keep_cols += ['TrunkInc', 'TrunkFB']
if 'arm' in data_dict.keys():
    keep_cols += ['ArmInc']
df_gt = df_gt[keep_cols]

# map the numerical activity codes in out_cat to activity names in a new column:
# 1 == 'lie', 2 == 'sit', 3 == 'stand', 4 == 'move', 5 == 'walk', 6 == 'run', 7 == 'stair', 8 == 'cycle', 9 == 'row'
df_gt["activity_name"] = df_gt["out_cat"].map(
    {
        1: "lie",
        2: "sit",
        3: "stand",
        4: "move",
        5: "walk",
        6: "run",
        7: "stair",
        8: "cycle",
        9: "row",
    }
)



# Parse 'out_ts' as datetimes (timezone-aware when the strings carry a UTC offset)
df_gt["out_ts"] = pd.to_datetime(df_gt["out_ts"])

# Convert to UTC and drop the timezone information, giving timezone-naive datetimes
# (tz_convert requires timezone-aware input)
df_gt["out_ts"] = df_gt["out_ts"].dt.tz_convert(None)

# Keep only the ground-truth rows that fall within the backend time range
df_gt = df_gt[(
    (df_gt['out_ts'] >= start_end_timestamps[0]) &
    (df_gt['out_ts'] <= start_end_timestamps[1]) 
)]

# Selecting columns from backend DataFrame
activity_motus_cols = activity_motus_all[keep_cols]
# Sort on time (necessary for pd.merge_asof())
activity_motus_cols = activity_motus_cols.sort_values("out_ts")
# Shift backend time (seems necessary for now)
time_shift = pd.Timedelta(seconds=1)
activity_motus_cols["out_ts"] += time_shift

# Add a "_gt" suffix to every ground-truth column to avoid name conflicts after the merge
df_gt_cols = df_gt.rename(columns={col: f"{col}_gt" for col in df_gt.columns})

# Merge ground truth classifications with backend classifications based on nearest timestamps
new_df = pd.merge_asof(
    left=df_gt_cols,
    right=activity_motus_cols,
    left_on="out_ts_gt",
    right_on="out_ts",
    direction="nearest",
    tolerance=pd.Timedelta(seconds=1),
)

# compare the out_cat and out_cat_gt columns and create a new column called 'match' which will be True if the values match and False if they don't
new_df["match"] = new_df["out_cat"] == new_df["out_cat_gt"]

match_res = match_stats(
    new_df, match_var="match", perday=True, date_var="out_ts_gt", print_res=True
)

Comparison for timeperiod: 
   2023-10-05 - 2023-10-07
----------------------------------------
Matching results 
----------------------------------------

        count  proportion %
match                      
True   164780     95.360452
False    8017      4.639548

----------------------------------------

                  count  proportion %
out_ts_gt  match                     
2023-10-05 True    4919     68.319444
           False   2281     31.680556
2023-10-06 True   84932     98.300926
           False   1468      1.699074
2023-10-07 True   74929     94.610907
           False   4268      5.389093
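
The timestamp normalisation in the cell above depends on the ground-truth timestamps being timezone-aware after parsing. The sketch below, on hypothetical timestamps with a `+02:00` offset (the real file's offset is not shown in this notebook), illustrates how `tz_convert(None)` differs from `tz_localize(None)` and why `tz_convert` fails on timezone-naive data.

```python
import pandas as pd

# Hypothetical timestamps carrying a UTC+02:00 offset
aware = pd.Series(
    pd.to_datetime(["2023-10-05T22:00:00+02:00", "2023-10-05T22:00:01+02:00"])
)

# tz_convert(None) converts to UTC first, then drops the timezone
as_utc_naive = aware.dt.tz_convert(None)

# tz_localize(None) drops the timezone but keeps the local wall-clock time
as_local_naive = aware.dt.tz_localize(None)

print(as_utc_naive.iloc[0])
print(as_local_naive.iloc[0])

# A timezone-naive series cannot be passed to tz_convert
naive = pd.Series(pd.to_datetime(["2023-10-05 22:00:00"]))
try:
    naive.dt.tz_convert(None)
except TypeError as err:
    print(f"tz_convert on naive data fails: {err}")
```

Which of the two is appropriate depends on whether the backend timestamps are expressed in UTC or in local time.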


In [49]:
new_df["out_ts_gt"].dt.date.astype(str) == "2023-10-05"

0          True
1          True
2          True
3          True
4          True
          ...  
172792    False
172793    False
172794    False
172795    False
172796    False
Name: out_ts_gt, Length: 172797, dtype: bool

In [55]:
new_df

| | out_ts_gt | out_cat_gt | activity_name_gt | out_ts | out_cat | match |
|---|---|---|---|---|---|---|
| 0 | 2023-10-05 22:00:00.027003136 | 1.0 | lie | 2023-10-05 22:00:01 | 2 | False |
| 1 | 2023-10-05 22:00:01.027007744 | 1.0 | lie | 2023-10-05 22:00:01 | 2 | False |
| 2 | 2023-10-05 22:00:02.027002368 | 1.0 | lie | 2023-10-05 22:00:02 | 2 | False |
| 3 | 2023-10-05 22:00:03.027006976 | 1.0 | lie | 2023-10-05 22:00:03 | 2 | False |
| 4 | 2023-10-05 22:00:04.027001600 | 1.0 | lie | 2023-10-05 22:00:04 | 2 | False |
| ... | ... | ... | ... | ... | ... | ... |
| 172792 | 2023-10-07 21:59:52.027006208 | 1.0 | lie | 2023-10-07 21:59:52 | 2 | False |
| 172793 | 2023-10-07 21:59:53.027000832 | 1.0 | lie | 2023-10-07 21:59:53 | 2 | False |
| 172794 | 2023-10-07 21:59:54.027005440 | 1.0 | lie | 2023-10-07 21:59:54 | 2 | False |
| 172795 | 2023-10-07 21:59:55.027000064 | 1.0 | lie | 2023-10-07 21:59:55 | 2 | False |
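
The pairing of `out_ts_gt` with `out_ts` in the table above comes from `pd.merge_asof` with `direction="nearest"` and a one-second tolerance. The following sketch reproduces that behaviour on a few made-up rows (the `gt` and `backend` frames and their values are hypothetical), including what happens when no backend sample lies within the tolerance.

```python
import pandas as pd

# Hypothetical ground-truth samples, slightly offset from whole seconds
gt = pd.DataFrame(
    {
        "out_ts_gt": pd.to_datetime(
            [
                "2023-10-05 22:00:00.027",
                "2023-10-05 22:00:01.027",
                "2023-10-05 22:00:05.027",
            ]
        ).astype("datetime64[ns]"),
        "out_cat_gt": [1, 1, 2],
    }
)

# Hypothetical backend samples on whole seconds
backend = pd.DataFrame(
    {
        "out_ts": pd.to_datetime(
            ["2023-10-05 22:00:01", "2023-10-05 22:00:02"]
        ).astype("datetime64[ns]"),
        "out_cat": [2, 2],
    }
)

# Both frames must be sorted on their time keys before merging
merged = pd.merge_asof(
    left=gt,
    right=backend,
    left_on="out_ts_gt",
    right_on="out_ts",
    direction="nearest",
    tolerance=pd.Timedelta(seconds=1),
)
merged["match"] = merged["out_cat"] == merged["out_cat_gt"]
print(merged)
```

The last ground-truth row has no backend sample within one second, so its `out_ts` and `out_cat` are missing and the comparison evaluates to False. Such rows therefore count as mismatches in the statistics unless they are dropped first.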


In [59]:
anglecols = ["TrunkInc", "TrunkFB", "TrunkLat", "ArmInc"]
anglecols = [i for i in anglecols if i in new_df.columns]

if len(anglecols) > 0:
    
    fig_w = 5*len(anglecols)
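    f, axs = plt.subplots(
        nrows=1,
        ncols=len(anglecols),
        figsize=(fig_w,5),
        sharex=True,
        sharey=True,
        squeeze=False,  # always return a 2D array of axes, even for a single column
    )
    
    f.suptitle("Histograms of differences")
    
    for i, angle in enumerate(anglecols):
        ax = axs[0, i]
        # Difference between the backend angle and the ground-truth angle
        diff = new_df[angle] - new_df[f"{angle}_gt"]
        diff.plot(kind="hist", bins=100, ax=ax)
        # Summary statistics without the leading 'count' entry
        stats = diff.describe().iloc[1:]
        msg = "Stats: \n"
        # Use separate loop names so the outer index i is not overwritten
        for stat_name, stat in stats.items():
            msg += f"{stat_name:4} {stat:>7.2f}\n"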
        ax.set_title(angle)
        ax.text(
            y=ax.get_ylim()[1]*0.95,
            x=ax.get_xlim()[0]+abs(ax.get_xlim()[0]*0.05),
            s=msg,
            family="monospace",
            va='top',
            ha='left',
        )
    
    
    f.tight_layout()
    f.savefig(os.path.join(res_folder,selected_id,"angle_diff_hist.png"))

In [24]:
# Write statistics of comparison to .txt-file
with open(
    os.path.join(res_folder, selected_id, "match_stats_bulk_V_2_0_0.txt"),
    "w",
) as f:
    f.write("-" * 40)
    f.write("\n")
    f.write(match_res["collapsed"].to_string())
    f.write("\n")
    f.write("-" * 40)
    f.write("\n")  # keep the separator on its own line
    f.write(match_res["perday"].to_string())

In [None]:
# Bokeh plot for comparison of ground truth and backend classifications

df_plot = new_df

output_file(
    filename=os.path.join(
        res_folder, f"{extract_sensor_id(name)}\\comparison_plot.html"
    )
)

dates = np.array(df_plot["out_ts"], dtype=np.datetime64)
source = ColumnDataSource(
    data=dict(date=dates, cat=df_plot["out_cat"], cat_gt=df_plot["out_cat_gt"])
)

p = figure(
    height=300,
    width=800,
    tools=["xpan", "xwheel_zoom"],
    toolbar_location="above",
    x_axis_type="datetime",
    x_axis_location="above",
    active_scroll="xwheel_zoom",
    background_fill_color="#efefef",
    x_range=(dates[1500], dates[2500]),
)

# p = figure(height=300, width=1000, tools=["xpan",'xwheel_zoom'], toolbar_location='above',
#                    x_axis_type="datetime", x_axis_location="above", active_scroll = "xwheel_zoom", x_range=(dates[0], dates[slider_end_index]),
#                    title=f'Activities and intensity measured on {tmpdag.Time.iloc[0].day_name()} {tmpdag.Time.iloc[0].day}/{tmpdag.Time.iloc[0].month} {tmpdag.Time.iloc[0].year} (scroll to zoom and drag to pan)')


p.line(
    "date", "cat", source=source, color="orange", legend_label="Backend classifications"
)
p.line("date", "cat_gt", source=source, legend_label="Offline classifications")
p.yaxis.axis_label = "Categories"

select = figure(
    title="Drag the middle and edges of the selection box to change the range above",
    height=130,
    width=800,
    y_range=p.y_range,
    x_axis_type="datetime",
    y_axis_type=None,
    tools="",
    toolbar_location=None,
    background_fill_color="#efefef",
)

range_tool = RangeTool(x_range=p.x_range)
range_tool.overlay.fill_color = "navy"
range_tool.overlay.fill_alpha = 0.2

select.line("date", "cat", source=source, color="orange")
select.line("date", "cat_gt", source=source)
select.ygrid.grid_line_color = None
select.add_tools(range_tool)

# show(column(p, select))
save(column(p, select))

#### **7.2. Saving the non-matching activities, those that differ from the GT data**

In [None]:
# identify the rows where the values don't match, i.e., where the match column is False, and store them as non_matching_activity
non_matching_activity = new_df[~new_df["match"]]

# build the output path once so that the saved file and the printed message agree
non_matching_path = os.path.join(
    res_folder, f"{extract_sensor_id(name)}\\non_matching_activity.csv"
)

# save the non_matching_activity dataframe as a csv file
non_matching_activity.to_csv(non_matching_path, index=False)

# print where the non_matching_activity dataframe is saved
print("The non_matching_activity dataframe is saved in", non_matching_path)

#### **7.3. Plotting side by side (muted for now)**

In [None]:
# plot_activity_classification(df_gt_cols, plot_title='Ground Truth Activity Classification')

In [None]:
# plot_activity_classification(activity_acti4_cols, plot_title='Activity Classification - Motus backend v1.2.0')

### **8. Backend in one go**

In [None]:
# Import file to run backend in one go
from functions.BackendFiles import backend

In [None]:
# Run backend classification and store outputs as dataframe
Akt_full, Fstep_full, out_val_full, out_ts_full = backend.classify(
    read_bin_data[:, 0], read_bin_data[:, 1:4]
)
backend_full_df = pd.DataFrame(
    {
        "out_cat_full": Akt_full,
        "out_ts_full": pd.to_datetime(out_ts_full, utc=False, unit="ms"),
    }
)
backend_full_df["out_ts_full"] += time_shift

In [None]:
# Compare with ground truth
compare_full = pd.merge_asof(
    left=backend_full_df, right=new_df, left_on="out_ts_full", right_on="out_ts_gt"
)

compare_full["match"] = compare_full["out_cat_full"] == compare_full["out_cat_gt"]

match_res_full = match_stats(compare_full, perday=True, date_var="out_ts_gt")
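
The merge in this cell uses the `pd.merge_asof` defaults (`direction="backward"`, no tolerance), whereas the earlier comparison used `direction="nearest"` with a one-second tolerance. Whether that difference is intended is not stated in the notebook. As a minimal sketch on hypothetical data, the two directions can pick different partners for the same timestamp:

```python
import pandas as pd

# Hypothetical sample taken 0.9 s after one reference row and 0.1 s before the next
left = pd.DataFrame(
    {"t": pd.to_datetime(["2023-10-05 22:00:00.900"]).astype("datetime64[ns]")}
)
right = pd.DataFrame(
    {
        "t_ref": pd.to_datetime(
            ["2023-10-05 22:00:00", "2023-10-05 22:00:01"]
        ).astype("datetime64[ns]"),
        "label": ["earlier", "later"],
    }
)

# Default: the last reference row at or before the left timestamp
backward = pd.merge_asof(left, right, left_on="t", right_on="t_ref")

# Nearest: the closest reference row in either direction
nearest = pd.merge_asof(left, right, left_on="t", right_on="t_ref", direction="nearest")

print(backward["label"].iloc[0], nearest["label"].iloc[0])
```

Without a `tolerance`, the backward merge also accepts partners that are arbitrarily far in the past, so gaps in the ground truth would not show up as missing values.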

In [None]:
# Store results in .txt-file
with open(
    os.path.join(
        res_folder, f"{extract_sensor_id(name)}\\match_stats_full_V_1_2_0.txt"
    ),
    "w",
) as f:
    f.write("-" * 40)
    f.write("\n")
    f.write(match_res_full["collapsed"].to_string())
    f.write("\n")
    f.write("-" * 40)
    f.write("\n")  # keep the separator on its own line
    f.write(match_res_full["perday"].to_string())