# Data


Source: [regretful-agent `tasks/R2R-pano`](https://github.com/chihyaoma/regretful-agent/tree/master/tasks/R2R-pano)

Each JSON Lines entry contains a guide annotation for a path in the environment.
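
Because each line of a JSON Lines file is an independent JSON document, such a file can be streamed entry by entry. The following is a minimal sketch of that idea, not the loader used in this notebook: `iter_jsonl` is a hypothetical helper name, and the two sample entries are made up and carry only a few of the fields.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed entry per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between entries
            yield json.loads(line)


# Made-up entries standing in for a real annotation file (assumption).
sample = io.StringIO(
    '{"instruction_id": 0, "scan": "scan_a", "language": "en-US"}\n'
    '{"instruction_id": 1, "scan": "scan_b", "language": "hi-IN"}\n'
)
entries = list(iter_jsonl(sample))
print(len(entries), entries[0]["scan"])
```

The `R2R_*.json` files loaded in the cells further down are read with `json.load`, which expects one JSON document per file rather than one per line.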

Data schema:

```python
{'split': str,
 'instruction_id': int,
 'annotator_id': int,
 'language': str,
 'path_id': int,
 'scan': str,
 'path': Sequence[str],
 'heading': float,
 'instruction': str,
 'timed_instruction': Sequence[Mapping[str, Union[str, float]]],
 'edit_distance': float}
```

Field descriptions:

*   `split`: The annotation split: `train`, `val_seen`, `val_unseen`,
    `test_standard`.
*   `instruction_id`: Uniquely identifies the guide annotation.
*   `annotator_id`: Uniquely identifies the guide annotator.
*   `language`: The IETF BCP 47 language tag: `en-IN`, `en-US`, `hi-IN`,
    `te-IN`.
*   `path_id`: Uniquely identifies a path sampled from the Matterport3D
    environment.
*   `scan`: Uniquely identifies a scan in the Matterport3D environment.
*   `path`: A sequence of panoramic viewpoints along the path.
*   `heading`: The initial heading in radians. Following R2R, the heading angle
    is zero facing the y-axis with z-up, and increases by turning right.
*   `instruction`: The navigation instruction.
*   `timed_instruction`: A sequence of time-aligned words in the instruction.
    Note that a small number of words are missing the `start_time` and
    `end_time` fields.
    *   `word`: The aligned utterance.
    *   `start_time`: The start of the time span, w.r.t. the recording.
    *   `end_time`: The end of the time span, w.r.t. the recording.
*   `edit_distance`: Edit distance between the manually transcribed instructions
    and the automatic transcript generated by Google Cloud
    [Text-to-Speech](https://cloud.google.com/text-to-speech) API.
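
Two of the fields above benefit from a concrete reading. The sketch below is illustrative only: `heading_to_direction` and `aligned_spans` are hypothetical helper names, the sample words are made up, and the direction formula is my interpretation of the stated convention (zero facing the y-axis, z-up, increasing to the right), assuming a right-handed coordinate frame.

```python
import math
from typing import Any, List, Mapping, Optional, Sequence, Tuple


def heading_to_direction(heading: float) -> Tuple[float, float]:
    """Map a heading in radians to an (x, y) unit vector.

    Assumes heading 0 faces +y and that turning right rotates toward +x.
    """
    return (math.sin(heading), math.cos(heading))


def aligned_spans(
    timed_instruction: Sequence[Mapping[str, Any]],
) -> List[Tuple[str, Optional[float], Optional[float]]]:
    """Return (word, start_time, end_time), using None for missing times."""
    return [
        (item["word"], item.get("start_time"), item.get("end_time"))
        for item in timed_instruction
    ]


# Made-up sample: the second word lacks its time span, as the note above warns.
sample = [
    {"word": "Walk", "start_time": 0.5, "end_time": 0.9},
    {"word": "forward"},
]
spans = aligned_spans(sample)
print(heading_to_direction(0.0), spans)
```

Using `.get` instead of indexing means a word without timestamps yields `None` values rather than raising `KeyError`.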

In [56]:
# The features for every point in the dataset take up 3.9 GB. That does not fit, so I have to take a sample.
# Fraction of the scans in each split to keep.
DATA_SIZE_FACTOR = 0.01

In [57]:
import json
import os
import numpy as np
import pandas as pd

In [58]:
DATADIR = './data/original'
DATAPATHS = {
    'train':        os.path.join(DATADIR, 'R2R_train.json'),
    'test':         os.path.join(DATADIR, 'R2R_test.json'), 
    'val seen':     os.path.join(DATADIR, 'R2R_val_seen.json'),
    'val unseen':   os.path.join(DATADIR, 'R2R_val_unseen.json'), 
}

In [59]:
with open(DATAPATHS['train']) as f:
    train = json.load(f)
    train_df = pd.DataFrame.from_records(train)

with open(DATAPATHS['test']) as f:
    test = json.load(f)
    test_df = pd.DataFrame.from_records(test)

with open(DATAPATHS['val seen']) as f:
    val_seen = json.load(f)
    val_seen_df = pd.DataFrame.from_records(val_seen)

with open(DATAPATHS['val unseen']) as f:
    val_unseen = json.load(f)
    val_unseen_df = pd.DataFrame.from_records(val_unseen)


In [60]:
def scan_ids(df):
    return df['scan'].unique()

def n_scenarios(df):
    return scan_ids(df).shape[0]

print(f"There are {len(train)} examples in the training set.")
print(f"There are {len(test)} examples in the test set.")
print(f"There are {len(val_seen)} examples in the val seen set.")
print(f"There are {len(val_unseen)} examples in the val unseen set.")
print("----------------------------------------------------")
print(f"There are {n_scenarios(train_df)} distinct scenarios in the training set.")
print(f"There are {n_scenarios(test_df)} distinct scenarios in the test set.")
print(f"There are {n_scenarios(val_seen_df)} distinct scenarios in the val seen set. (All taken from train)")
print(f"There are {n_scenarios(val_unseen_df)} distinct scenarios in the val unseen set.")
print(f"There are {n_scenarios(pd.concat([train_df, test_df, val_seen_df, val_unseen_df]))} distinct scenarios in total.")
print("----------------------------------------------------")



There are 4675 examples in the training set.
There are 1391 examples in the test set.
There are 340 examples in the val seen set.
There are 783 examples in the val unseen set.
----------------------------------------------------
There are 61 distinct scenarios in the training set.
There are 18 distinct scenarios in the test set.
There are 56 distinct scenarios in the val seen set. (All taken from train)
There are 11 distinct scenarios in the val unseen set.
There are 90 distinct scenarios in total.
----------------------------------------------------
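
The counts above are consistent with train, test and val unseen using disjoint scans (61 + 18 + 11 = 90) while val seen reuses training scans. One way to check that relationship directly is with set operations. This is a sketch on made-up toy records, with `scan_set` as a hypothetical helper; it does not touch the real splits.

```python
def scan_set(records):
    """Collect the distinct scan ids found in a list of annotation records."""
    return {record["scan"] for record in records}


# Toy records (assumption): only the 'scan' field matters for this check.
toy_train = [{"scan": "A"}, {"scan": "B"}, {"scan": "A"}]
toy_val_seen = [{"scan": "A"}]
toy_val_unseen = [{"scan": "C"}]
toy_test = [{"scan": "D"}]

# val seen should only reuse scans that also occur in train.
seen_within_train = scan_set(toy_val_seen) <= scan_set(toy_train)
# val unseen and test should share no scan with train or with each other.
unseen_disjoint = scan_set(toy_val_unseen).isdisjoint(scan_set(toy_train))
test_disjoint = scan_set(toy_test).isdisjoint(
    scan_set(toy_train) | scan_set(toy_val_unseen)
)
print(seen_within_train, unseen_disjoint, test_disjoint)
```

The same three expressions could be applied to the loaded `train`, `val_seen`, `val_unseen` and `test` lists.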


In [61]:
def sample_scan_ids(df):
    unique_scan_ids = scan_ids(df)
    # ceil guarantees that at least one scan is kept.
    n_keep = int(np.ceil(unique_scan_ids.shape[0] * DATA_SIZE_FACTOR))
    return np.random.choice(unique_scan_ids, size=n_keep, replace=False)


train_scan_ids      = sample_scan_ids(train_df)
test_scan_ids       = sample_scan_ids(test_df)
val_unseen_scan_ids = sample_scan_ids(val_unseen_df)
val_seen_scan_ids   = np.intersect1d(train_scan_ids, scan_ids(val_seen_df))


def filter_by_scan_id(split_dict, scans):
    # Keep only the records whose scan id is among the sampled scans.
    scans = set(scans)
    return [row for row in split_dict if row['scan'] in scans]


new_train      = filter_by_scan_id(train, train_scan_ids)
new_test       = filter_by_scan_id(test, test_scan_ids)
new_val_unseen = filter_by_scan_id(val_unseen, val_unseen_scan_ids)
new_val_seen   = filter_by_scan_id(val_seen, val_seen_scan_ids)

new_train_df      = pd.DataFrame.from_records(new_train)
new_test_df       = pd.DataFrame.from_records(new_test)
new_val_unseen_df = pd.DataFrame.from_records(new_val_unseen)
new_val_seen_df   = pd.DataFrame.from_records(new_val_seen)

# print(val_seen_scan_ids)

print(f"Distinct scenarios remaining in the training set: {n_scenarios(new_train_df)}")
print(f"Distinct scenarios remaining in the test set: {n_scenarios(new_test_df)}")
print(f"Distinct scenarios remaining in the val seen set (all taken from train): {n_scenarios(new_val_seen_df)}")
print(f"Distinct scenarios remaining in the val unseen set: {n_scenarios(new_val_unseen_df)}")
print(f"Distinct scenarios remaining in total: {n_scenarios(pd.concat([new_train_df, new_test_df, new_val_seen_df, new_val_unseen_df]))}")

Distinct scenarios remaining in the training set: 1
Distinct scenarios remaining in the test set: 1
Distinct scenarios remaining in the val seen set (all taken from train): 1
Distinct scenarios remaining in the val unseen set: 1
Distinct scenarios remaining in total: 3
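
The sampling cell draws with `np.random.choice` and no fixed seed, so each run can select different scans. If a repeatable subset is wanted, one option is a seeded generator. The sketch below is a variant under that assumption, not the notebook's own function: `sample_scan_ids_seeded`, its `seed` parameter and the toy ids are all hypothetical.

```python
import numpy as np


def sample_scan_ids_seeded(unique_scan_ids, factor, seed=0):
    """Sample a fraction of scan ids reproducibly, keeping at least one."""
    rng = np.random.default_rng(seed)  # same seed -> same selection
    n_keep = max(1, int(np.ceil(len(unique_scan_ids) * factor)))
    return rng.choice(unique_scan_ids, size=n_keep, replace=False)


# Toy scan ids (assumption) in place of the real Matterport3D scan names.
toy_ids = np.array(["A", "B", "C", "D", "E"])
first = sample_scan_ids_seeded(toy_ids, 0.01, seed=42)
second = sample_scan_ids_seeded(toy_ids, 0.01, seed=42)
print(first, second)
```

Separately, `val_seen_scan_ids` is an intersection with the sampled training scans, and only 56 of the 61 training scans appear in val seen. With a single sampled training scan the intersection can therefore come out empty, in which case `new_val_seen_df` would have no `scan` column and `n_scenarios` would raise a `KeyError`.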


In [62]:
# Save the new, reduced datasets

DESTINY_DIR = 'data'
DESTINY_PATHS = {
    'train':        os.path.join(DESTINY_DIR, 'R2R_train.json'),
    'test':         os.path.join(DESTINY_DIR, 'R2R_test.json'), 
    'val seen':     os.path.join(DESTINY_DIR, 'R2R_val_seen.json'),
    'val unseen':   os.path.join(DESTINY_DIR, 'R2R_val_unseen.json'), 
}

def save_json(obj, destiny_path):
    with open(destiny_path, 'w') as f:
        json.dump(obj, f)

save_json(new_train,      DESTINY_PATHS['train'])
save_json(new_test,       DESTINY_PATHS['test'])
save_json(new_val_seen,   DESTINY_PATHS['val seen'])
save_json(new_val_unseen, DESTINY_PATHS['val unseen'])

In [63]:
# Save the scan ids so the features can be filtered later (at runtime)
all_scan_ids = np.hstack([
    train_scan_ids,
    test_scan_ids,
    val_seen_scan_ids,
    val_unseen_scan_ids
])

save_json(list(np.unique(all_scan_ids)), 'data/scan_ids.json')
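
At runtime, the saved id list can be used to drop feature rows that belong to scans outside the sample. The notebook does not show the feature format, so the sketch below makes assumptions: each feature row is a dict carrying its scan id under a `scanId` key, and `load_scan_ids` and `keep_sampled` are hypothetical helpers. A temporary file stands in for `data/scan_ids.json`.

```python
import json
import os
import tempfile


def load_scan_ids(path):
    """Read the saved scan ids into a set for fast membership tests."""
    with open(path) as f:
        return set(json.load(f))


def keep_sampled(feature_rows, scan_ids, scan_key="scanId"):
    """Keep the rows whose scan id is in scan_ids (the key name is assumed)."""
    return [row for row in feature_rows if row[scan_key] in scan_ids]


# Toy demonstration with made-up rows and a temporary id file.
with tempfile.TemporaryDirectory() as tmp:
    ids_path = os.path.join(tmp, "scan_ids.json")
    with open(ids_path, "w") as f:
        json.dump(["A", "C"], f)
    rows = [{"scanId": "A"}, {"scanId": "B"}, {"scanId": "C"}]
    kept = keep_sampled(rows, load_scan_ids(ids_path))

print([row["scanId"] for row in kept])
```

Filtering while the features are being read, rather than after loading all of them, is what keeps the 3.9 GB file from having to fit in memory at once.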