# What we will do here

In this notebook, we prepare the predictions for submission to the competition site. Recall that our strategy was to choose, for each task (field to be extracted), the model with the best word F1 on the validation data. The selected models were then retrained on all available labelled data (training and validation), for the number of epochs reached at the checkpoint.
In the competition submission files, the answer for each field is given by the model chosen in the previous phase. The table below lists the models chosen for each task:


| Task | Initial finetuning experiment id / name | Final finetuning experiment id / name |
| :-: | :-: | :-: |
| extract_company | [FIN-27 / t5-base_all_tasks_concat_newlines_as_spaces](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-27/details) | [FIN-51 / FIN-27-t5-base_all_tasks_concat_newlines_as_spaces](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-51/details) |
| extract_total | [FIN-38 / t5-base_extract_total_newlines_as_spaces](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-38/details) | [FIN-53 / all_labelled_data-FIN-38-t5-base_extract_total_newlines_as_spaces](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-53/details) |
| extract_address | [FIN-41 / t5-base_extract_address_newlines_as_pipes](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-41/details) | [FIN-50 / t5-all_labelled_data-FIN-41-t5-base_extract_address_newlines_as_pipes](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-50/details) |
| extract_date | [FIN-28 / t5-base_all_tasks_concat_newlines_as_spaces](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-28/details) | [FIN-52 / all_labelled_data-FIN-28-t5-base_all_tasks_concat_newlines_as_spaces](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-52/details) |
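
The selection criterion above, word F1 on the validation data, is not defined in this notebook. A minimal sketch of one common definition follows, assuming a token-overlap F1 over whitespace-split words. The function name `word_f1` and the tokenisation are assumptions for illustration, not the project's implementation.

```python
from collections import Counter


def word_f1(prediction: str, target: str) -> float:
    """Token-overlap F1 between two strings, split on whitespace."""
    pred_words = prediction.split()
    target_words = target.split()
    if not pred_words or not target_words:
        # Both empty counts as a perfect match; only one empty as a miss.
        return float(pred_words == target_words)
    # Multiset intersection: a shared word counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_words) & Counter(target_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(target_words)
    return 2 * precision * recall / (precision + recall)
```

For instance, comparing the prediction `'CHA FOR TEA'` with the target `'CHA FOR'` gives precision 2/3 and recall 1, so the F1 is 0.8.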


In [204]:
import os
from src.data.sroie import get_all_keynames_from_dir, load_labels
from src.evaluation.sroie_eval_utils import extract_base_keyname
from src.evaluation.sroie_eval_utils import save_predictions_in_dir

In [64]:
map_field_to_initial_exp = {
    'company': 'FIN-27',
    'total': 'FIN-38',
    'address': 'FIN-41',
    'date': 'FIN-28'
}

map_initial_exp_to_final_exp = {
    'FIN-27': 'FIN-51',
    'FIN-38': 'FIN-53',
    'FIN-41': 'FIN-50',
    'FIN-28': 'FIN-52'
}

map_field_to_final_exp = {
    k: map_initial_exp_to_final_exp[v]
    for k, v in map_field_to_initial_exp.items()
}
map_field_to_final_exp

{'company': 'FIN-51', 'total': 'FIN-53', 'address': 'FIN-50', 'date': 'FIN-52'}

In [195]:
predictions_basedir = "/home/marcospiau/final_project_ia376j/data/sroie_receipt_dataset/predictions"

def get_keynames_from_experiment(
    experiment_id,
    partition,
    predictions_basedir=predictions_basedir
):
    keynames_path = os.path.join(predictions_basedir, experiment_id, partition)
    return get_all_keynames_from_dir(keynames_path)

def get_labels_dict(keynames):
    """Loads labels (or predictions) from each `<keyname>.txt` file."""
    return {extract_base_keyname(x): load_labels(x) for x in keynames}

def get_predictions_for_experiment(exp_id, field):
    """Loads an experiment's predictions, keeping only `field`."""
    # An empty partition reads directly from the experiment directory.
    all_preds = get_labels_dict(get_keynames_from_experiment(exp_id, ''))
    filter_preds = {k: {field: v[field]} for k, v in all_preds.items()}
    return filter_preds

def merge_predictions_for_all_experiments(map_field_to_exp):
    """Merges per-field predictions into {keyname: {field: prediction}}."""
    # One dict per field, each mapping keyname -> {field: prediction}.
    broadcast_pred = [
        get_predictions_for_experiment(exp_id, field)
        for field, exp_id in map_field_to_exp.items()
    ]
    # Keynames must be the same in all experiments.
    keynames_set_list = [set(x.keys()) for x in broadcast_pred]
    keynames = keynames_set_list[0]
    assert all(keynames == x for x in keynames_set_list[1:]), \
        'experiments have different keynames'

    final_preds = {k: {} for k in keynames}
    for field_pred in broadcast_pred:
        for keyname_pred in keynames:
            final_preds[keyname_pred].update(field_pred[keyname_pred])
    return final_preds
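
To make the merge step concrete without reading any prediction files, here is a minimal sketch of the same idea on in-memory dictionaries. The helper `merge_field_predictions` and the toy values are hypothetical. The notebook's own function loads the predictions from disk instead.

```python
def merge_field_predictions(per_field_preds):
    """Turn {field: {keyname: value}} into {keyname: {field: value}}.

    Hypothetical helper, for illustration only.
    """
    keyname_sets = [set(preds) for preds in per_field_preds.values()]
    keynames = keyname_sets[0]
    # Every experiment must cover exactly the same documents.
    if any(keynames != other for other in keyname_sets[1:]):
        raise ValueError("experiments predict different sets of keynames")

    merged = {keyname: {} for keyname in keynames}
    for field, preds in per_field_preds.items():
        for keyname, value in preds.items():
            merged[keyname][field] = value
    return merged


# Toy predictions (made up), one dictionary per field.
toy = {
    "company": {"doc_a": "ACME SDN BHD", "doc_b": "FOO TRADING"},
    "total": {"doc_a": "10.00", "doc_b": "5.50"},
}
merged = merge_field_predictions(toy)
```

Raising on mismatched keynames, rather than silently taking an intersection, keeps a missing prediction file from going unnoticed in the final submission.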
    
    

In [201]:
final_test_predictions = merge_predictions_for_all_experiments(
    map_field_to_final_exp
)
final_test_predictions

{'X51005677336': {'company': 'SYARIKAT PERNIAGAAN GIN KEE',
  'total': '22.26',
  'address': 'NO 290, JALAN AIR PANAS, SETAPAK, 53200, KUALA LUMPUR.',
  'date': '27/01/2018'},
 'X51005745244': {'company': 'URBAN IDEA SDN BHD',
  'total': 'RM11.23',
  'address': 'A-6-06, DATARAN GLOMAC, JALAN SS6/5A, PUSAT BANDAR KELANA JAYA, 47301 PETALING JAYA, SELANGOR, MALAYSIA',
  'date': '14/02/2018'},
 'X51006349083': {'company': 'CHA FOR TEA',
  'total': 'RM46.25',
  'address': '1-1, CENTRAL PARK , JALAN PJU5/13, DATARAN SUNWAY, 47810 KOTA DAMANSARA, SELANGOR',
  'date': '19/04/2018'},
 'X51006619569': {'company': 'MR. D.I.Y. (KUCHAI) SDN BHD',
  'total': '26.40',
  'address': 'LOT 1851-A & 1851-B, JALAN KPB 6, KAWASAN PERINDUSTRIAN BALAKONG, 43300 SERI KEMBANGAN, SELANGOR',
  'date': '01-02-16'},
 'X51007846400': {'company': 'KOREAN BBQ KOREAN DINE SDN BHD',
  'total': '152.90',
  'address': 'NO 4, JALAN PERMAS 10/5, BANDAR BARU PERMAS JAYA, 81750 JOHOR BAHRU, JOHOR.',
  'date': '19/06/2018'},


In [212]:
submissions_path = "/home/marcospiau/final_project_ia376j/data/sroie_receipt_dataset/submissions/v1_checkpoint_selection_by_task"

In [213]:
save_predictions_in_dir(final_test_predictions, submissions_path)

Saving predictions in /home/marcospiau/final_project_ia376j/data/sroie_receipt_dataset/submissions/v1_checkpoint_selection_by_task: 100%|██████████| 347/347 [00:00<00:00, 16216.78it/s]
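
The implementation of `save_predictions_in_dir` lives in `src.evaluation.sroie_eval_utils` and is not shown in this notebook. As a rough sketch of what such a step could look like, assuming one JSON-formatted `.txt` file per receipt, consider the following. The function `write_predictions` is a hypothetical stand-in, not the project's code.

```python
import json
import os
import tempfile


def write_predictions(predictions, out_dir):
    """Write one `<keyname>.txt` file per document, containing JSON."""
    os.makedirs(out_dir, exist_ok=True)
    for keyname, fields in predictions.items():
        path = os.path.join(out_dir, f"{keyname}.txt")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(fields, f, ensure_ascii=False, indent=4)


# Usage with a made-up prediction, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_preds = {"doc_a": {"company": "ACME SDN BHD", "total": "10.00"}}
write_predictions(demo_preds, demo_dir)
written = sorted(os.listdir(demo_dir))
```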


It is not recorded here, but, as a test, I will also try a submission using the best single model ([FIN-28 / t5-base_all_tasks_concat_newlines_as_spaces](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-28/details) | [FIN-52 / all_labelled_data-FIN-28-t5-base_all_tasks_concat_newlines_as_spaces](https://ui.neptune.ai/marcospiau/final-project-ia376j-1/e/FIN-52/details)), which happens to be the same model that was best for the `date` field.

In [217]:
os.listdir(f'{submissions_path}/..')

['v1_checkpoint_selection_by_task.zip',
 'v2_single_model_initial=FIN-28_final=FIN-52',
 'v1_checkpoint_selection_by_task',
 'v2_single_model_initial=FIN-28_final=FIN-52.zip']
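
The `.zip` files in the listing are presumably the archives uploaded to the competition site. The notebook does not show how they were created. One possible way to build such an archive from a submission directory uses the standard-library `shutil.make_archive`. The helper name `zip_submission` is hypothetical, and placing the files at the top level of the archive is an assumption.

```python
import os
import shutil
import tempfile
import zipfile


def zip_submission(submission_dir):
    """Create `<submission_dir>.zip` next to the directory; return its path."""
    submission_dir = os.path.abspath(submission_dir)
    # root_dir makes names inside the archive relative to the directory,
    # so the prediction files sit at the top level of the zip.
    return shutil.make_archive(submission_dir, "zip", root_dir=submission_dir)


# Usage on a throwaway directory holding a single made-up prediction file.
demo_root = tempfile.mkdtemp()
demo_submission = os.path.join(demo_root, "v1_demo")
os.makedirs(demo_submission)
with open(os.path.join(demo_submission, "doc_a.txt"), "w") as f:
    f.write("{}")

archive_path = zip_submission(demo_submission)
with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()
```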