# HAR view standardizer

This notebook helps standardize the views of the HAR (human activity recognition) datasets.

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import hashlib

In [2]:
root_views_dir = Path("../data/old-views/")    # Where the original views are located (input)
output_dir = Path("../data/views")             # Where the standardized views will be written (output)

In [3]:
# Standardized activity codes
standartized_codes = {
    0: "sit",
    1: "stand",
    2: "walk",
    3: "stair up",
    4: "stair down",
    5: "run",
    6: "stair up and down"
}

## Description of the views to be processed

The variable `views` is a dictionary in which each key is the name of a dataset (the name of the dataset's root folder, which contains the views) and each value is a list of meta-information dictionaries. A dataset may have several of them; they are processed in order.

Each meta-information dictionary must contain the following entries:

* **view**: name of the folder containing the view
* **output**: name of the output folder (it will be created)
* **train**: name of the training CSV file
* **validation**: name of the validation CSV file (None if there is none)
* **test**: name of the test CSV file (None if there is none)
* **activity code**: dictionary mapping each original activity code to the name of the activity
* **select activities**: list of the activities to keep (given as keys of activity code)
* **standard activity code map**: mapping from the original activity code (key) to the standardized activity code (value)
* **brief**: summary used for the README.md
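
Because every entry is written by hand, a quick consistency check can catch typos before any file is touched. The following is only a minimal sketch: `REQUIRED_KEYS`, `check_view_entry`, `example_entry` and `example_standard_codes` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
# Hypothetical helper: the names below are illustrative assumptions,
# not something the notebook itself defines.
REQUIRED_KEYS = {
    "view", "output", "train", "validation", "test",
    "activity code", "select activities",
    "standard activity code map", "brief",
}


def check_view_entry(entry, standard_codes):
    """Return a list of problems found in one meta-information dictionary."""
    problems = []
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        # Without all keys the remaining checks cannot run
        return [f"missing keys: {sorted(missing)}"]

    selected = set(entry["select activities"])
    unknown = selected - set(entry["activity code"])
    if unknown:
        problems.append(f"selected codes without a name: {sorted(unknown)}")

    unmapped = selected - set(entry["standard activity code map"])
    if unmapped:
        problems.append(f"selected codes without a standard code: {sorted(unmapped)}")

    bad_targets = set(entry["standard activity code map"].values()) - set(standard_codes)
    if bad_targets:
        problems.append(f"unknown standard codes: {sorted(bad_targets)}")
    return problems


# Toy entry with made-up values, only to show the expected shape
example_standard_codes = {0: "sit", 1: "stand"}
example_entry = {
    "view": "some_view", "output": "some_output",
    "train": "train.csv", "validation": None, "test": "test.csv",
    "activity code": {0: "sitting", 1: "standing"},
    "select activities": [0, 1],
    "standard activity code map": {0: 0, 1: 1},
    "brief": "# Example\n",
}
print(check_view_entry(example_entry, example_standard_codes))
```

An empty list means the entry is internally consistent. The check says nothing about whether the files named in the entry actually exist.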


In [4]:
views = {
    # Views to be preprocessed
    "KuHar": [
        {
            "view": "balanced_motionsense_equivalent_resampled_view_20Hz",
            "output": "balanced_20Hz_motionsense_equivalent-v1",
            "train": "train.csv",
            "validation": "validation.csv",
            "test": "test.csv",
            "activity code": {
                0: "stair down",
                1: "stair up",
                2: "sit",
                3: "stand",
                4: "walk",
                5: "run"
            },
            "select activities": [
                0, 1, 2, 3, 4, 5
            ],
            "standard activity code map": {
                0: 4,
                1: 3,
                2: 0,
                3: 1,
                4: 2,
                5: 5
            },
            "brief": """# Balanced KuHar View Resampled to 20Hz

This is a view of [KuHar v5](https://data.mendeley.com/datasets/45f952y38r/5) that was split into 3s windows and resampled to 20Hz using the [FFT method](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html#scipy.signal.resample). 

The data was first split into three sets: train, validation and test, with the following proportions:
- Train: 70% of samples
- Validation: 10% of samples
- Test: 20% of samples

After the split, each subset was balanced with respect to the activity code column, that is, each subset has the same number of samples per activity.

**NOTE**: Each subset contains samples from distinct users, that is, the samples of a given user belong exclusively to one of the three subsets.

"""
        }
        # You can add other KuHar views here
    ],

    "CHARM": [
        {
            "view": "balanced_view_train_test-v1",
            "output": "balanced_20Hz_train_test-v1",
            "train": "train.csv",
            "validation": None,
            "test": "test.csv",
            "activity code": {
                0: "sitting on a chair",
                1: "sitting on a couch",
                2: "standing",
                3: "lying up",
                4: "lying by side",
                5: "device on surface",
                6: "walking",
                7: "running",
                8: "walking upstairs",
                9: "walking downstairs"
            },
            "select activities": [
                0, 1, 2, 6, 7, 8, 9
            ],
            "standard activity code map": {
                0: 0,
                1: 0,
                2: 1,
                6: 2,
                7: 5,
                8: 3,
                9: 4
            },
            "brief": """# Balanced CHARM View

This is a view of the [CHARM dataset](https://zenodo.org/record/4642560) that was split into 3s windows. The sample rate was 20Hz.

The data was first split into two sets: train and test, with the following proportions:
- Train: 70% of samples
- Test: 30% of samples

After the split, each subset was balanced with respect to the activity code column, that is, each subset has the same number of samples per activity.

**NOTE**: Each subset contains samples from distinct users, that is, the samples of a given user belong exclusively to one of the two subsets.

"""
        }
        # You can add other CHARM views here
    ],

    "ExtraSensory": [
        {
            "view": "unbalanced_train_only_resampled_20hz",
            "output": "unbalanced_20Hz_train-v1",
            "train": "train.csv",
            "validation": None,
            "test": None,
            "activity code": {
                0: "sitting",
                1: "or_standing",
                2: "fix_walking",
                3: "fix_running"
            },
            "select activities": [
                0, 1, 2, 3
            ],
            "standard activity code map": {
                0: 0,  # "sitting" maps to the standard code 0 ("sit")
                1: 1,
                2: 2,
                3: 5
            },
            "brief": """# Unbalanced ExtraSensory View Resampled to 20Hz

This is a view of the [ExtraSensory dataset](http://extrasensory.ucsd.edu/) that was split into 3s windows. The view contains only the train file, resampled to 20Hz and interpolated using the cubic spline method because of the unstable sampling rate. Gravity was already subtracted. 

"""
        }
        # You can add other ExtraSensory views here
    ],

    "MotionSense": [
        {
            "view": "resampled_view_20Hz",
            "output": "balanced_20Hz-v1",
            "train": "train.csv",
            "validation": "validation.csv",
            "test": "test.csv",
            "activity code": {
                0: "downstairs",
                1: "upstairs",
                2: "sitting",
                3: "standing",
                4: "walking",
                5: "jogging"
            },
            "select activities": [
                0, 1, 2, 3, 4, 5
            ],
            "standard activity code map": {
                0: 4,
                1: 3,
                2: 0,
                3: 1,
                4: 2,
                5: 5
            },
            "brief": """# Balanced MotionSense View Resampled to 20Hz

This is a view of the MotionSense dataset that was split into 3s windows and resampled to 20Hz using the [FFT method](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html#scipy.signal.resample). 

The data was first split into three sets: train, validation and test, with the following proportions:
- Train: 70% of samples
- Validation: 10% of samples
- Test: 20% of samples

After the split, each subset was balanced with respect to the activity code column, that is, each subset has the same number of samples per activity.

**NOTE**: Each subset contains samples from distinct users, that is, the samples of a given user belong exclusively to one of the three subsets.

"""
        },
        # You can add other MotionSense views here
    ],

    "UCI-HAR": [
        {
            "view": "unbalanced_view_train_test-resampled_20hz-v1",
            "output": "unbalanced_20Hz_train_test-v1",
            "train": "train.csv",
            "validation": None,
            "test": "test.csv",
            "activity code": {
                1: "walking",
                2: "walking upstairs",
                3: "walking downstairs",
                4: "sitting",
                5: "standing",
                6: "laying"
            },
            "select activities": [
                1, 2, 3, 4, 5
            ],
            "standard activity code map": {
                1: 2,
                2: 3,
                3: 4,
                4: 0,
                5: 1
            },
            "brief": """# Unbalanced UCI-HAR View Resampled to 20Hz

This view contains only the train and test files of the [UCI-HAR dataset](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones#) (70% of the samples for train and 30% for test). The data was split into 3s windows and resampled to 20Hz using the [FFT method](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html#scipy.signal.resample).

"""
            
        }
        # You can add other UCI-HAR views here
    ],

    "WISDM": [
        {
            "view": "interpolated_unbalanced_view_train_test-v1",
            "output": "unbalanced_20Hz_train_test-v1",
            "train": "train.csv",
            "validation": None,
            "test": "test.csv",
            "activity code": {
                0: "walking",
                1: "jogging",
                2: "stairs",
                3: "sitting",
                4: "standing"
            },
            "select activities": [
                0, 1, 2, 3, 4
            ],
            "standard activity code map": {
                0: 2,
                1: 5,
                2: 6,
                3: 0,
                4: 1
            },
            "brief": """# Unbalanced WISDM View Resampled to 20Hz

This view contains only the train and test files of the [WISDM dataset](https://archive.ics.uci.edu/ml/datasets/WISDM+Smartphone+and+Smartwatch+Activity+and+Biometrics+Dataset) (70% of the samples for train and 30% for test).
The dataset was sampled at 20Hz and interpolated using the cubic spline method because of the unstable sampling rate.

"""
        }
        # You can add other WISDM views here
    ]
}

The code below processes the views (standardizes them) and writes the results to the output folder.
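
The core of the per-file transformation is a filter followed by a remap of the activity codes. As a minimal sketch, here it is on a toy DataFrame with made-up values, using the CHARM selection and mapping from the `views` dictionary. It uses `Series.map` rather than `replace`; the two should agree here because every selected code has an entry in the mapping.

```python
import pandas as pd

# Toy data: the sensor values and the row order are made up for illustration
toy = pd.DataFrame({
    "accel-x-0": [0.1, 0.2, 0.3, 0.4, 0.5],
    "activity code": [0, 1, 3, 6, 7],
})

# Selection and mapping copied from the CHARM entry
select_activities = [0, 1, 2, 6, 7, 8, 9]
standard_map = {0: 0, 1: 0, 2: 1, 6: 2, 7: 5, 8: 3, 9: 4}

# 1. Keep only the selected activities (code 3, "lying up", is dropped)
kept = toy.loc[toy["activity code"].isin(select_activities)].copy()

# 2. Translate original codes into standardized ones; the original codes
#    0 and 1 (two kinds of sitting) both end up as the standard code 0
kept["standard activity code"] = kept["activity code"].map(standard_map)

print(kept)
```

The original `activity code` column is kept alongside the new one, so the README counts can be reported for both coding schemes.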

In [5]:
def format_counts(codes, column, split_counts):
    """Build one Markdown bullet per code found in the train split, with its count in every split."""
    lines = []
    for code, label in codes.items():
        if code not in split_counts["train"][column]:
            continue
        # A missing split, or a code absent from a split, counts as 0
        counts = ", ".join(
            f"{split_counts[split][column].get(code, 0) if split in split_counts else 0} {split}"
            for split in ("train", "validation", "test")
        )
        lines.append(f"- {code}: {label} ({counts})")
    return "\n".join(lines)


for dataset_name, values_list in views.items():
    root_output_path = output_dir / dataset_name
    for values in values_list:
        split_counts = {}  # reset for every view so counts never leak between views
        for split in ("train", "validation", "test"):
            if values[split] is None:
                continue

            path = root_views_dir / dataset_name / values["view"] / values[split]
            df = pd.read_csv(path)
            # Keep only the selected activities; copy so the new column is not written to a slice
            df = df.loc[df["activity code"].isin(values["select activities"])].copy()
            df["standard activity code"] = df["activity code"].replace(values["standard activity code map"])
            if "normalized activity code" in df.columns:
                df = df.drop(columns="normalized activity code")

            # Drop leftover index columns ("Unnamed: 0", ...) and incomplete rows
            df = df.loc[:, ~df.columns.str.contains("^Unnamed")]
            df = df.dropna()

            path = root_output_path / values["output"] / f"{split}.csv"
            path.parent.mkdir(exist_ok=True, parents=True)
            df.to_csv(path, index=False)

            # Store the MD5 checksum next to the CSV (e.g. train.csv.md5)
            md5sum = hashlib.md5(path.read_bytes()).hexdigest()
            md5_path = path.parent / (path.name + ".md5")
            md5_path.write_text(md5sum)

            split_counts[split] = {
                "standard activity code": df["standard activity code"].value_counts(),
                "activity code": df["activity code"].value_counts()
            }

        # README.md = brief + per-split sample counts for both coding schemes
        activity_lines = format_counts(values["activity code"], "activity code", split_counts)
        standard_lines = format_counts(standartized_codes, "standard activity code", split_counts)
        readme = values["brief"] + f"""## Activity codes
{activity_lines}

## Standardized activity codes
{standard_lines}
"""
        readme_path = root_output_path / values["output"] / "README.md"
        with readme_path.open("w") as f:
            f.write(readme)
        print(f"Processed {dataset_name}")

Processed KuHar
Processed CHARM
Processed ExtraSensory
Processed MotionSense
Processed UCI-HAR
Processed WISDM
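
Each CSV is written together with a `.md5` file, so a consumer of the views can check that a file is intact. The sketch below assumes the layout used above (`train.csv` next to `train.csv.md5`, the latter holding only the hex digest); `md5_matches` is a hypothetical helper name, and the demo runs on a throwaway file rather than on the real views.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_matches(csv_path):
    """Compare a CSV's MD5 digest with the one stored in '<name>.csv.md5'."""
    csv_path = Path(csv_path)
    md5_path = csv_path.parent / (csv_path.name + ".md5")
    expected = md5_path.read_text().strip()
    actual = hashlib.md5(csv_path.read_bytes()).hexdigest()
    return actual == expected


# Demo on a temporary file (made-up content)
with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "train.csv"
    csv_file.write_text("accel-x-0,activity code\n0.1,2\n")
    checksum_file = Path(tmp) / "train.csv.md5"
    checksum_file.write_text(hashlib.md5(csv_file.read_bytes()).hexdigest())

    ok_before = md5_matches(csv_file)   # file untouched since the checksum was written

    csv_file.write_text("accel-x-0,activity code\n0.1,3\n")
    ok_after = md5_matches(csv_file)    # file rewritten, checksum now stale

print(ok_before, ok_after)
```

A checksum is only valid for the exact bytes it was computed from, so this check is mostly useful after any step that rewrites the CSVs, such as the MotionSense column rename further down.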


Processing specific to MotionSense: renames the columns, changing `userAcceleration` to `accel` and `rotationRate` to `gyro`.

In [6]:
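for path in Path("data/views/MotionSense/balanced_20Hz-v1").glob("*.csv"):
    df = pd.read_csv(path)
    # Map e.g. "userAcceleration.x-0" -> "accel-x-0" for each of the
    # 60 samples in a window (3 s at 20 Hz)
    replace_map = {
        f"{c}-{i}": f"{new_c}-{i}"
        for c, new_c in [("userAcceleration.x", "accel-x"), 
                         ("userAcceleration.y", "accel-y"), 
                         ("userAcceleration.z", "accel-z"),
                         ("rotationRate.x", "gyro-x"), 
                         ("rotationRate.y", "gyro-y"),
                         ("rotationRate.z", "gyro-z")]
        for i in range(60)
    }
    df.rename(columns=replace_map, inplace=True)
    df.to_csv(path, index=False)
    # The file content changed, so refresh the checksum stored alongside it
    md5_path = path.parent / (path.name + ".md5")
    md5_path.write_text(hashlib.md5(path.read_bytes()).hexdigest())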

In [7]:
for path in Path("data/views/").rglob("*.csv"):
    df = pd.read_csv(path)
    print(path, list(df.columns)[:10])

data/views/UCI-HAR/unbalanced_20Hz_train_test-v1/test.csv ['accel-x-0', 'accel-x-1', 'accel-x-2', 'accel-x-3', 'accel-x-4', 'accel-x-5', 'accel-x-6', 'accel-x-7', 'accel-x-8', 'accel-x-9']
data/views/UCI-HAR/unbalanced_20Hz_train_test-v1/train.csv ['accel-x-0', 'accel-x-1', 'accel-x-2', 'accel-x-3', 'accel-x-4', 'accel-x-5', 'accel-x-6', 'accel-x-7', 'accel-x-8', 'accel-x-9']
data/views/MotionSense/balanced_20Hz-v1/test.csv ['accel-x-0', 'accel-x-1', 'accel-x-2', 'accel-x-3', 'accel-x-4', 'accel-x-5', 'accel-x-6', 'accel-x-7', 'accel-x-8', 'accel-x-9']
data/views/MotionSense/balanced_20Hz-v1/validation.csv ['accel-x-0', 'accel-x-1', 'accel-x-2', 'accel-x-3', 'accel-x-4', 'accel-x-5', 'accel-x-6', 'accel-x-7', 'accel-x-8', 'accel-x-9']
data/views/MotionSense/balanced_20Hz-v1/train.csv ['accel-x-0', 'accel-x-1', 'accel-x-2', 'accel-x-3', 'accel-x-4', 'accel-x-5', 'accel-x-6', 'accel-x-7', 'accel-x-8', 'accel-x-9']
data/views/KuHar/balanced_20Hz_motionsense_equivalent-v1/test.csv ['accel-

In [12]:
pd.DataFrame(standartized_codes.items(), columns=["standard code", "standard label"]).to_csv(output_dir / "standard_codes.csv", index=False)
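
The resulting `standard_codes.csv` lets downstream code turn standardized codes back into readable labels. A minimal sketch of that lookup follows, assuming the two column names written above. An in-memory string stands in for the real file, and the `samples` frame is made up.

```python
import io

import pandas as pd

# Stand-in for "standard_codes.csv" (only the first three codes are shown)
codes_csv = io.StringIO(
    "standard code,standard label\n"
    "0,sit\n"
    "1,stand\n"
    "2,walk\n"
)
codes = pd.read_csv(codes_csv)

# Build a code -> label lookup from the two columns
label_of = dict(zip(codes["standard code"], codes["standard label"]))

# Made-up samples carrying only the standardized code
samples = pd.DataFrame({"standard activity code": [2, 0, 1, 2]})
samples["standard label"] = samples["standard activity code"].map(label_of)

print(samples)
```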