# HAR view standardizer

This notebook helps standardize the views of HAR datasets.

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import hashlib

In [2]:
root_views_dir = Path("../../../data_2/raw_data/")    # Where the input views are located
output_dir = Path("../../../data_2/views")         # Where the standardized views will be written

In [3]:
# Standardized activity codes
standartized_codes = {
    0: "sit",
    1: "stand",
    2: "walk",
    3: "stair up",
    4: "stair down",
    5: "run",
    6: "stair up and down"
}

## Description of the views to be processed

The `views` variable is a dictionary in which each key is the name of a dataset (the name of the dataset's root folder, which contains its views) and each value is a list of meta-information dictionaries. A dataset may have several of them; they are processed in order.

Each meta-information dictionary must contain the following entries:

* **view**: name of the folder containing the view
* **output**: name of the output folder (it will be created)
* **train**: name of the training CSV file
* **validation**: name of the validation CSV file (`None` if there is none)
* **test**: name of the test CSV file (`None` if there is none)
* **activity code**: dictionary mapping the original activity codes to their activity names
* **select activities**: list of the activities to keep (given as codes from `activity code`)
* **standard activity code map**: mapping from the original activity codes (keys) to the standardized activity codes (values)
* **brief**: summary text for the README.md
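
Since every entry listed above is read by the processing loop, it can help to check a meta-information dictionary before running it. The following is a minimal sketch, not part of the original notebook: `REQUIRED_KEYS`, `check_view_entry`, and the example values are hypothetical names introduced here for illustration.

```python
# Hypothetical helper: sanity-check one meta-information dictionary.
REQUIRED_KEYS = {
    "view", "output", "train", "validation", "test",
    "activity code", "select activities",
    "standard activity code map", "brief",
}


def check_view_entry(entry, standard_codes):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems

    known = set(entry["activity code"])
    selected = set(entry["select activities"])
    code_map = entry["standard activity code map"]

    # Every selected activity must exist and must have a standard code.
    if not selected <= known:
        problems.append(f"selected but not in activity code: {sorted(selected - known)}")
    if not selected <= set(code_map):
        problems.append(f"selected but not mapped: {sorted(selected - set(code_map))}")

    # Every target of the map must be a known standardized code.
    unknown_targets = set(code_map.values()) - set(standard_codes)
    if unknown_targets:
        problems.append(f"unknown standard codes: {sorted(unknown_targets)}")
    return problems


# Illustrative values only (folder names are made up).
example_standard_codes = {0: "sit", 2: "walk"}
example_entry = {
    "view": "some_view_folder",
    "output": "some_output_folder",
    "train": "train.csv",
    "validation": None,
    "test": "test.csv",
    "activity code": {1: "walking", 4: "sitting"},
    "select activities": [1, 4],
    "standard activity code map": {1: 2, 4: 0},
    "brief": "# Example view\n",
}
print(check_view_entry(example_entry, example_standard_codes))  # prints []
```

Returning a list of problems instead of raising on the first one makes it easier to fix a long `views` dictionary in a single pass.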


In [4]:
views = {
        "UCI-HAR": [
        {
            "view": "unbalanced_view_filtered_acc_9.81_train_test-v1",
            "output": "unbalanced_20Hz_train_test_9.81_acc_filtered",
            "train": "train.csv",
            "validation": None,
            "test": "test.csv",
            "activity code": {
                1: "walking",
                2: "walking upstairs",
                3: "walking downstairs",
                4: "sitting",
                5: "standing",
                6: "laying"
            },
            "select activities": [
                1, 2, 3, 4, 5
            ],
            "standard activity code map": {
                1: 2,
                2: 3,
                3: 4,
                4: 0,
                5: 1
            },
            "brief": """# Unbalanced UCI-HAR View Resampled to 20Hz without gravity

The data used was the samples with gravity provided by the dataset authors.

The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with a 0.3 Hz cutoff frequency was used and the filtered signal was subtracted from the original signal.

This view contains only the train and test files for the [UCI-HAR dataset](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones#) (70% of the samples for training and 30% for testing). The data was split into 3 s windows and resampled to 20 Hz using the [FFT method](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html#scipy.signal.resample). The accelerometer measurements are in m/s² and without gravity.

"""
        },
        # You can add other UCI-HAR views here
        {
            "view": "unbalanced_view_gravity_acc_9.81_train_test-v1",
            "output": "unbalanced_20Hz_train_test_with_gravity_9.81_acc",
            "train": "train.csv",
            "validation": None,
            "test": "test.csv",
            "activity code": {
                1: "walking",
                2: "walking upstairs",
                3: "walking downstairs",
                4: "sitting",
                5: "standing",
                6: "laying"
            },
            "select activities": [
                1, 2, 3, 4, 5
            ],
            "standard activity code map": {
                1: 2,
                2: 3,
                3: 4,
                4: 0,
                5: 1
            },
            "brief": """# Unbalanced UCI-HAR View Resampled to 20Hz with gravity

The data used was the samples with gravity provided by the dataset authors.

This view contains only the train and test files for the [UCI-HAR dataset](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones#) (70% of the samples for training and 30% for testing). The data was split into 3 s windows and resampled to 20 Hz using the [FFT method](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html#scipy.signal.resample). The accelerometer measurements are in m/s² and with gravity.

"""
           
        },
            
        {
            "view": "unbalanced_view_without_gravity_acc_9.81_train_test-v1",
            "output": "unbalanced_20Hz_train_test_9.81_acc",
            "train": "train.csv",
            "validation": None,
            "test": "test.csv",
            "activity code": {
                1: "walking",
                2: "walking upstairs",
                3: "walking downstairs",
                4: "sitting",
                5: "standing",
                6: "laying"
            },
            "select activities": [
                1, 2, 3, 4, 5
            ],
            "standard activity code map": {
                1: 2,
                2: 3,
                3: 4,
                4: 0,
                5: 1
            },
            "brief": """# Unbalanced UCI-HAR View Resampled to 20Hz without gravity

The data used was the samples without gravity provided by the dataset authors.

This view contains only the train and test files for the [UCI-HAR dataset](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones#) (70% of the samples for training and 30% for testing). The data was split into 3 s windows and resampled to 20 Hz using the [FFT method](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html#scipy.signal.resample). The accelerometer measurements are in m/s² and without gravity.

"""
           
        }
    ] 
}
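
To make the relationship between the three activity-related entries concrete, the short sketch below resolves each selected UCI-HAR activity to its standardized label. It copies the values from the entries above; `resolved` is a name introduced here only for illustration, and the standardized labels are restricted to the ones these views actually use.

```python
# Values copied from the UCI-HAR entries and from the standardized codes.
standard_labels = {0: "sit", 1: "stand", 2: "walk", 3: "stair up", 4: "stair down"}
activity_code = {
    1: "walking",
    2: "walking upstairs",
    3: "walking downstairs",
    4: "sitting",
    5: "standing",
    6: "laying",
}
selected = [1, 2, 3, 4, 5]
code_map = {1: 2, 2: 3, 3: 4, 4: 0, 5: 1}

# original name -> standardized label, for the selected activities only
resolved = {activity_code[c]: standard_labels[code_map[c]] for c in selected}
for original, standard in resolved.items():
    print(f"{original} -> {standard}")
# walking -> walk
# walking upstairs -> stair up
# walking downstairs -> stair down
# sitting -> sit
# standing -> stand
```

Note that "laying" (code 6) is not in `select activities`, so it has no standardized counterpart and its samples are dropped.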

The code below processes the views (standardizing them) and writes the results to the output folder.
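
As a rough illustration of what happens to each split, the sketch below applies the same kind of steps (filter the selected activities, add the standardized code, drop leftover `Unnamed` index columns, drop incomplete rows) to a tiny in-memory DataFrame. The function name `standardize_split` and the column `accel-x-0` are assumptions made for this example, and it uses `Series.map` where the notebook uses `Series.replace`.

```python
import pandas as pd


def standardize_split(df, selected, code_map):
    """Filter, relabel and clean one split (illustrative sketch)."""
    # Keep only the selected activities; copy to avoid chained-assignment warnings.
    out = df.loc[df["activity code"].isin(selected)].copy()
    # Add the standardized code next to the original one.
    out["standard activity code"] = out["activity code"].map(code_map)
    # Drop index columns left over from earlier CSV exports.
    out = out.loc[:, ~out.columns.str.contains("^Unnamed")]
    # Drop rows with missing values.
    return out.dropna()


toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "accel-x-0": [0.1, 0.2, None, 0.4],   # hypothetical feature column
    "activity code": [1, 6, 4, 5],
})
result = standardize_split(
    toy,
    selected=[1, 2, 3, 4, 5],
    code_map={1: 2, 2: 3, 3: 4, 4: 0, 5: 1},
)
print(result)
```

In this toy input, the row with code 6 is removed by the activity filter and the row with a missing accelerometer value is removed by `dropna`, so two rows remain.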

In [5]:
# Newline helper: before Python 3.12, f-string expressions cannot contain a backslash
backslash = "\n"

for dataset_name, values_list in views.items():
    root_output_path = output_dir / dataset_name
    for values in values_list:
        # Per-view counts; resetting here keeps counts from leaking between views
        split_counts = {}
        for split in ("train", "validation", "test"):
            if values[split] is None:
                continue

            # Load the split and keep only the selected activities
            path = root_views_dir / dataset_name / values["view"] / values[split]
            df = pd.read_csv(path)
            df = df.loc[df["activity code"].isin(values["select activities"])].copy()
            df["standard activity code"] = df["activity code"].replace(values["standard activity code map"])
            if "normalized activity code" in df.columns:
                df = df.drop(columns="normalized activity code")

            # Drop leftover index columns and incomplete rows
            df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
            df = df.dropna()
            
            path = root_output_path / values["output"] / f"{split}.csv"
            path.parent.mkdir(exist_ok=True, parents=True)
            df.to_csv(path, index=False)
            
            # Store an MD5 checksum next to the CSV file (read_bytes closes the file)
            md5sum = hashlib.md5(path.read_bytes()).hexdigest()
            path = path.parent / (path.name + '.md5')
            with path.open("w") as f:
                f.write(md5sum)

            split_counts[split] = {
                "standard activity code": df["standard activity code"].value_counts(),
                "activity code": df["activity code"].value_counts()
            }

        readme = values["brief"]
        readme = readme + f"""## Activity codes
- {f"- ".join(f'{k}: {v} ({split_counts["train"]["activity code"][k]} train, {split_counts["validation"]["activity code"].get(k, 0) if "validation" in split_counts else 0} validation, {split_counts["test"]["activity code"].get(k, 0) if "test" in split_counts else 0} test) {backslash}' for k, v in values['activity code'].items() if k in split_counts["train"]["activity code"])}

## Standardized activity codes
- {f"- ".join(f'{k}: {v} ({split_counts["train"]["standard activity code"][k]} train, {split_counts["validation"]["standard activity code"].get(k, 0) if "validation" in split_counts else 0} validation, {split_counts["test"]["standard activity code"].get(k, 0) if "test" in split_counts else 0} test) {backslash}' for k, v in standartized_codes.items() if k in split_counts["train"]["standard activity code"])}


"""
        readme_path = root_output_path / values["output"] / "README.md"
        with readme_path.open("w") as f:
            f.write(readme)
        print(f"Processed {dataset_name}")

Processed UCI-HAR
Processed UCI-HAR
Processed UCI-HAR
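
The README bullet lists in the loop above are built with nested f-strings, which are hard to read and to change. One possible alternative, shown here only as a sketch, is a small helper that builds the same kind of lines; `count_lines` and `toy_counts` are hypothetical names, and the counts below are made up.

```python
import pandas as pd


def count_lines(labels, split_counts, column):
    """Build one markdown bullet per code, with per-split sample counts."""
    lines = []
    for code, label in labels.items():
        # Only list codes that actually occur in the training split.
        if code not in split_counts["train"][column]:
            continue
        parts = []
        for split in ("train", "validation", "test"):
            counts = split_counts.get(split)
            # A missing split or a code absent from a split counts as 0.
            n = int(counts[column].get(code, 0)) if counts is not None else 0
            parts.append(f"{n} {split}")
        lines.append(f"- {code}: {label} ({', '.join(parts)})")
    return "\n".join(lines)


# Made-up counts shaped like the output of value_counts().
toy_counts = {
    "train": {"activity code": pd.Series([3, 1], index=[1, 4])},
    "test": {"activity code": pd.Series([2], index=[1])},
}
text = count_lines({1: "walking", 4: "sitting", 6: "laying"}, toy_counts, "activity code")
print(text)
# - 1: walking (3 train, 0 validation, 2 test)
# - 4: sitting (1 train, 0 validation, 0 test)
```

Using `Series.get(code, 0)` means an activity that appears in the training split but not in another split is reported as 0 instead of raising a `KeyError`.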


In [6]:
# for path in Path("../../../data_2/views/UCI-HAR").rglob("*.csv"):
#     df = pd.read_csv(path)
#     print(path, list(df.columns)[:10])
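
Another quick sanity check on the generated files is to compare each CSV with the `.md5` file written next to it. The sketch below assumes the layout produced by the processing loop (`<name>.csv` plus `<name>.csv.md5` containing the hex digest); `md5_matches` is a hypothetical helper, and the demonstration runs on a temporary directory rather than on the real output folder.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_matches(csv_path):
    """Compare a file's MD5 digest with the content of its sibling .md5 file."""
    csv_path = Path(csv_path)
    md5_path = csv_path.parent / (csv_path.name + ".md5")
    expected = md5_path.read_text().strip()
    actual = hashlib.md5(csv_path.read_bytes()).hexdigest()
    return actual == expected


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "train.csv"
    csv_path.write_text("activity code,standard activity code\n1,2\n")
    digest = hashlib.md5(csv_path.read_bytes()).hexdigest()
    (Path(tmp) / "train.csv.md5").write_text(digest)

    ok_before = md5_matches(csv_path)   # file is untouched
    csv_path.write_text("tampered\n")
    ok_after = md5_matches(csv_path)    # file no longer matches its checksum

print(ok_before, ok_after)  # True False
```

On the real outputs, the same helper could be applied to every path returned by `Path(...).rglob("*.csv")`.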

In [7]:
pd.DataFrame(standartized_codes.items(), columns=["standard code", "standard label"]).to_csv(output_dir / "standard_codes.csv", index=False)