# Parser construction example with a user-provided data dictionary

This file demonstrates the process of constructing a parser file using `animal_data_choices.csv` as a source dataset.

Before you start: `autoparser` requires an LLM API key to function, for either OpenAI or Gemini.
You should add yours to your environment, as described [here](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).
This example uses the OpenAI API; edit the `API_KEY` line below to match the name you gave yours.
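
Before running the cells below, it can help to confirm that the key is actually present in your environment. The following is a minimal sketch, assuming `API_KEY` holds the *name* of the environment variable (as the text above suggests); `key_is_set` is a hypothetical helper, not part of `autoparser`.

```python
import os


def key_is_set(env_name, environ=None):
    """Return True if the named environment variable holds a non-empty value."""
    # Allow a plain dict to be passed in so the check is easy to test.
    environ = os.environ if environ is None else environ
    return bool(environ.get(env_name, "").strip())


# "OPENAI_API_KEY" is the name used in this example; change it to match yours.
if not key_is_set("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set; export it before running the cells below.")
```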

If you would prefer to use Gemini, pass the `llm_provider` argument to any function that takes the API key, e.g.

`writer.generate_descriptions("fr", data_dict, key=API_KEY, llm_provider='gemini')`

You can also specify which model from either OpenAI or Gemini you wish to use, with the `llm_model` argument. Your model choice should support Structured Outputs (for [OpenAI](https://platform.openai.com/docs/guides/structured-outputs#supported-models)) or Controlled Generation (for [Gemini](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output)).
The model should be provided as a string recognised by the respective API, e.g. `llm_model = "gpt-4o-mini"` (the default model when OpenAI is selected as the provider).

In [1]:
import pandas as pd

import adtl.autoparser as autoparser

API_KEY = "OPENAI_API_KEY"



In [2]:
autoparser.setup_config(
    {
        "language": "fr",
        "llm_provider": "openai",
        "api_key": API_KEY,
        "max_common_count": 8,
        "schemas": {
            "animals": "../../../tests/test_autoparser/schemas/animals.schema.json"
        },
        "column_mappings": {
            "source_field": "Field Name",
            "source_description": "Description",
            "source_type": "Field Type",
            "choices": "Choices",
        },
    }
)

In [3]:
data = pd.read_csv("../../../tests/test_autoparser/sources/animal_data_choices.csv")
data.head()

Unnamed: 0,Identité,Province,DateNotification,Classicfication,Nom complet,Date de naissance,AgeAns,AgeMois,Sexe,StatusCas,DateDec,ContSoins,ContHumain Autre,ContexteContHumain,ContactAnimal,Micropucé,AnimalDeCompagnie,ConditionsPreexistantes
0,A001,Equateur,2024-01-01,4,Luna,15/03/2022,2,10,2,1,,1.0,2.0,2.0,1,1,1,"[arthrite, vomir]"
1,B002,Equateur,2024-15-02,1,Max,21/07/2021,3,4,1,2,2024-06-01,2.0,1.0,1.0,2,2,1,
2,C003,Equateur,2024-03-10,3,Coco,10/02/2023,1,11,2,1,,1.0,2.0,2.0,1,1,2,
3,D004,,2024-04-22,2,Bella,05/11/2020,4,5,1,1,,1.0,,3.0,2,2,2,
4,E005,,2024-05-30,5,Charlie,18/05/2019,5,3,2,2,2024-07-01,,,1.0,1,1,1,


You can see from the data above that many of the columns are encoded as numeric values rather than as strings (e.g. the 'Sexe' column contains 1s and 2s, not gender labels). The data dictionary is therefore needed to translate those values into meaningful data, so let's look at that next.
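
To make "translating" concrete, here is a minimal sketch of decoding one coded column by hand. The label mapping is copied from the `Choices` entry for `Sexe` in the data dictionary displayed in the next cell; `decode` is a hypothetical helper for illustration, not something `autoparser` provides.

```python
# Labels for the 'Sexe' column, as listed in the data dictionary.
SEXE_CHOICES = {"1": "mâle", "2": "femelle", "3": "inconnu"}


def decode(value, choices):
    """Return the label for a coded value, or the value unchanged if the code is unknown."""
    return choices.get(str(value), value)


# 'Sexe' values from the first five rows of the data shown above.
sexe_column = [2, 1, 2, 1, 2]
decoded = [decode(v, SEXE_CHOICES) for v in sexe_column]
print(decoded)
```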

In [4]:
data_dict = pd.read_csv("../../../tests/test_autoparser/sources/animals_dd_choices.csv")
data_dict

Unnamed: 0,Field Name,Description,Field Type,Choices
0,Identité,Identity,string,
1,Province,Province,string,
2,DateNotification,Notification Date,string,
3,Classicfication,Classification,string,"1=fish, 2=amphibie, 3=oiseau, 4=mammifère, 5=p..."
4,Nom complet,Full Name,string,
5,Date de naissance,Date of Birth,string,
6,AgeAns,Age in Years,number,
7,AgeMois,Age in Months,number,
8,Sexe,Gender,string,"1=mâle, 2=femelle, 3=inconnu"
9,StatusCas,Case Status,string,"1=Vivant, 2=Décédé"


Before we use this data dictionary to map our data, we should check that it can be converted and validated for use with AutoParser.

To do this, we can run the `format_dict` function. The column mapping it needs (which data dictionary column holds the field name, description, type and choices) comes from the `column_mappings` entry of the configuration passed to `setup_config` above.

In [5]:
formatted_data_dict = autoparser.format_dict(data_dict)
formatted_data_dict

Unnamed: 0,source_field,source_description,source_type,choices
0,Identité,Identity,string,
1,Province,Province,string,
2,DateNotification,Notification Date,string,
3,Classicfication,Classification,string,"{'1': 'fish', '2': 'amphibie', '3': 'oiseau', ..."
4,Nom complet,Full Name,string,
5,Date de naissance,Date of Birth,string,
6,AgeAns,Age in Years,number,
7,AgeMois,Age in Months,number,
8,Sexe,Gender,string,"{'1': 'mâle', '2': 'femelle', '3': 'inconnu'}"
9,StatusCas,Case Status,string,"{'1': 'Vivant', '2': 'Décédé'}"


We can see that the dictionary's headers have now been converted to a format recognised by autoparser, and the `choices` column contains dictionaries of values mapped to data, rather than being in the string format of the input dictionary. This data dictionary was successfully validated and is ready to be used for data mapping and parser generation.
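
As an illustration of the conversion applied to the `Choices` column, the sketch below turns a string such as `1=mâle, 2=femelle, 3=inconnu` into a dictionary. This is not autoparser's own implementation; `parse_choices` and its separator arguments are assumptions made for the example.

```python
def parse_choices(raw, item_sep=",", kv_sep="="):
    """Convert a 'code=label, code=label' string into a {code: label} dict."""
    result = {}
    for item in raw.split(item_sep):
        if not item.strip():
            continue  # skip empty fragments, e.g. from a trailing separator
        # Split on the first key/value separator only, so labels may contain '='.
        code, _, label = item.partition(kv_sep)
        result[code.strip()] = label.strip()
    return result


print(parse_choices("1=Vivant, 2=Décédé"))
```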

AutoParser requires that every field (meaning every row in the data dictionary) has a description, and that those descriptions are unique. The field descriptions are what is used to map the raw data to the new schema, so they must be present and distinguishable from one another. A data dictionary will fail validation if the required columns cannot be identified, if descriptions are duplicated or missing, or if the options in the `common_values` or `choices` columns cannot be converted to their expected formats (a list of strings or a string dictionary, respectively). You can find help for validation errors in the [troubleshooting](../getting_started/index.md#troubleshooting) section of the docs.
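
If you want to pre-check a data dictionary for the two description rules before handing it to AutoParser, something along these lines may help. It is a rough sketch of the rules as described above, not AutoParser's validation code; `check_descriptions` is a hypothetical helper that works on a plain list of description values.

```python
from collections import Counter


def _is_blank(value):
    """True for None, NaN (which is never equal to itself) and empty strings."""
    return value is None or value != value or not str(value).strip()


def check_descriptions(descriptions):
    """Return (row indexes with missing descriptions, sorted duplicated descriptions)."""
    missing = [i for i, d in enumerate(descriptions) if _is_blank(d)]
    counts = Counter(str(d).strip() for d in descriptions if not _is_blank(d))
    duplicated = sorted(d for d, n in counts.items() if n > 1)
    return missing, duplicated


missing, duplicated = check_descriptions(["Identity", "Province", "", "Province", None])
print(missing, duplicated)
```

With a pandas data dictionary, the list could be obtained from the description column, e.g. `data_dict["Description"].tolist()`.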

Now we've validated the data dictionary, we can proceed to create an intermediate mapping file:

In [6]:
mapper = autoparser.WideMapper(formatted_data_dict, "animals")
mapping_dict = mapper.create_mapping(file_name="example_mapping_choices.csv")

mapping_dict.head()



target_field,source_description,source_field,choices,target_values,value_mapping
identity,Identity,Identité,,,
name,Full Name,Nom complet,,,
loc_admin_1,Province,Province,,,
country_iso3,,,,,
notification_date,Notification Date,DateNotification,,,


At this point, you should inspect the mapping file, look for fields or values that have been mapped incorrectly, and edit them where necessary.
The mapping file has been written out to [example_mapping_choices.csv](example_mapping_choices.csv). A good example of something to check is the 'loc_admin_1' field: the LLM often maps the common values provided to 'None', because the schema denotes this as a free-text field. If that happens, delete the mapped values so that the parsed data contains the original free text.
Also note the warning above; the LLM should not have found fields to map to the 'country_iso3' or 'owner' fields. If the original data did contain an appropriate field for these, you should edit the mapping file accordingly.
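
One quick way to find target fields that were left unmapped is to scan the mapping file for rows with an empty source field. The sketch below assumes the mapping CSV has `target_field` and `source_field` columns, as in the table displayed above; `unmapped_targets` is a hypothetical helper, and the sample rows are copied from that table.

```python
import csv
import io


def unmapped_targets(csv_file):
    """List target fields whose 'source_field' cell is empty."""
    reader = csv.DictReader(csv_file)
    return [
        row["target_field"]
        for row in reader
        if not (row.get("source_field") or "").strip()
    ]


# A small in-memory stand-in for the mapping file.
sample = io.StringIO(
    "target_field,source_description,source_field\n"
    "identity,Identity,Identité\n"
    "country_iso3,,\n"
)
print(unmapped_targets(sample))  # prints ['country_iso3']
```

To run it against the real file, open it with `open("example_mapping_choices.csv", newline="", encoding="utf-8")` and pass the file object in.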

Once you have edited the mapping file to your satisfaction, we can go ahead and create the TOML parser file, `example_parser_with_choices.toml`:

In [7]:
writer = autoparser.ParserGenerator(
    "example_mapping_choices.csv", "", "example_choices"
)
writer.create_parser("example_parser_with_choices.toml")

Missing required field country_iso3 in animals schema. Adding empty field...


You can view and edit the created parser at [example_parser_with_choices.toml](example_parser_with_choices.toml), and use it with adtl.

In [8]:
import adtl

data = adtl.parse(
    "example_parser_with_choices.toml",
    "../../../tests/test_autoparser/sources/animal_data_choices.csv",
    "example_choices_output",
)
data["animals"].head()

[example_choices] parsing animal_data_choices.csv: 100%|██████████| 30/30 [00:00<00:00, 22541.94it/s]
[example_choices] validating animals table: 30it [00:00, 124460.06it/s]


Unnamed: 0,age_months,age_years,chipped,identity,loc_admin_1,name,notification_date,pet,underlying_conditions,country_iso3,case_status,classification,sex,adtl_valid,adtl_error,date_of_death
0,10,2,True,A001,Equateur,Luna,2024-01-01,True,"[arthrite, vomir]",,alive,mammal,female,False,data.underlying_conditions must be array or null,
1,4,3,False,B002,Equateur,Max,2024-15-02,True,,,dead,fish,male,True,,2024-06-01
2,11,1,True,C003,Equateur,Coco,2024-03-10,False,,,alive,bird,female,True,,
3,5,4,False,D004,,Bella,2024-04-22,False,,,alive,amphibian,male,True,,
4,3,5,True,E005,,Charlie,2024-05-30,True,,,dead,fish,female,True,,2024-07-01
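
The `adtl_valid` and `adtl_error` columns in the output above flag rows that failed schema validation. A minimal sketch for summarising those failures follows; it works on a list of row dictionaries, and the two sample rows are abbreviated from the table above. `summarise_errors` is a hypothetical helper, and the note about converting the parsed table assumes `data["animals"]` is a pandas DataFrame.

```python
from collections import Counter


def summarise_errors(records):
    """Count validation error messages across rows flagged as invalid."""
    return Counter(r["adtl_error"] for r in records if not r["adtl_valid"])


rows = [
    {
        "identity": "A001",
        "adtl_valid": False,
        "adtl_error": "data.underlying_conditions must be array or null",
    },
    {"identity": "B002", "adtl_valid": True, "adtl_error": None},
]
print(summarise_errors(rows).most_common())
```

If the parsed table is a DataFrame, `data["animals"].to_dict("records")` should produce a list in this shape.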
