# Parser construction example

This file demonstrates how to construct a parser file, using `animal_data.csv` as the source dataset.

Before you start: `autoparser` requires an LLM API key to function, for either OpenAI or Gemini.
You should add yours to your environment, as described [here](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).
This example uses the OpenAI API; edit the `API_KEY` line below to match the name you gave yours.
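Before running the cells below, it can help to confirm that the environment variable is actually set. The following is a minimal sketch using only the standard library; the helper name `read_api_key` is hypothetical and is not part of `autoparser`, and the default variable name simply mirrors the one used in this example.

```python
import os


def read_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the value of the environment variable `var_name`.

    Hypothetical helper for illustration only; raises a clear error
    if the variable is unset or empty.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set; "
            "export it before running this example."
        )
    return key
```

Calling `read_api_key()` at the top of a session fails early with a readable message, rather than part-way through an LLM call.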

If you would prefer to use Gemini, use the `llm_provider` argument in functions where the API key is used, e.g.

`writer.generate_descriptions("fr", data_dict, key=API_KEY, llm_provider='gemini')`

You can also specify which model from either OpenAI or Gemini you wish to use, with the `llm_model` argument. Your model choice should support Structured Outputs (for [OpenAI](https://platform.openai.com/docs/guides/structured-outputs#supported-models)) or Controlled Generation (for [Gemini](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output)).
The model should be provided as a string recognised by the respective API, e.g. `llm_model = "gpt-4o-mini"` (the default model when OpenAI is selected as the provider).

In [1]:
import pandas as pd

import adtl.autoparser as autoparser

API_KEY = "OPENAI_API_KEY"



In [2]:
autoparser.setup_config(
    {
        "language": "fr",
        "llm_provider": "openai",
        "api_key": API_KEY,
        "max_common_count": 8,
        "schemas": {
            "animals": "../../../tests/test_autoparser/schemas/animals.schema.json"
        },
    }
)

In [3]:
data = pd.read_csv("../../../tests/test_autoparser/sources/animal_data.csv")
data.head()

Unnamed: 0,Identité,Province,DateNotification,Classicfication,Nom complet,Date de naissance,AgeAns,AgeMois,Sexe,StatusCas,DateDec,ContSoins,ContHumain Autre,ContexteContHumain,ContactAnimal,Micropucé,AnimalDeCompagnie,ConditionsPreexistantes
0,A001,Equateur,2024-01-01,Mammifère,Luna,15/03/2022,2,10,f,Vivant,,Oui,Non,Non,Oui,Oui,Oui,"[arthrite, vomir]"
1,B002,Equateur,2024-15-02,FISH,Max,21/07/2021,3,4,m,Décédé,2024-06-01,Non,Oui,Voyage,Non,NON,Oui,
2,C003,Equateur,2024-03-10,oiseau,Coco,10/02/2023,1,11,F,Vivant,,Oui,Non,Non,Oui,Oui,Non,
3,D004,,2024-04-22,amphibie,Bella,05/11/2020,4,5,m,Vivant,,Oui,,Autres,Non,NON,Non,
4,E005,,2024-05-30,poisson,Charlie,18/05/2019,5,3,F,Décédé,2024-07-01,,,Voyage,Oui,Oui,Oui,


Let's generate a basic data dictionary from this data set. We want to use the configuration file set up for this dataset, located in the `tests` directory.

In [4]:
writer = autoparser.DictWriter()
data_dict = writer.create_dict(data)
data_dict.head()

Unnamed: 0,Field Name,Description,Field Type,Common Values
0,Identité,,string,
1,Province,,string,"Equateur, Orientale, Katanga"
2,DateNotification,,string,
3,Classicfication,,string,"FISH, amphibie, oiseau, Mammifère, poisson..."
4,Nom complet,,string,


The 'Common Values' column indicates fields where there are a limited number of unique values, suggesting mapping to a controlled terminology may have been done, or might be required in the parser. The list of common values is every unique value in the field.
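To make the idea concrete, here is a rough sketch of how such a column could be derived. It is not `autoparser`'s implementation: the function name `common_values` and the `max_unique` threshold are assumptions made for illustration, and the `identity` values are made up.

```python
def common_values(values, max_unique=8):
    """Return the distinct non-missing values of a column, in first-seen order.

    Returns None when there are more than `max_unique` distinct values,
    i.e. the column looks like free text or an identifier.
    `max_unique` is a hypothetical threshold for this sketch.
    """
    seen = []
    for v in values:
        if v is None or v == "":
            continue  # skip missing entries
        if v not in seen:
            seen.append(v)
    return seen if len(seen) <= max_unique else None


province = ["Equateur", "Equateur", "Equateur", "", ""]
identity = [f"ID{i:03d}" for i in range(30)]  # made-up unique identifiers

print(common_values(province))  # ['Equateur']
print(common_values(identity))  # None
```

A column with only a handful of distinct values is a candidate for value mapping, while one with many distinct values is left blank.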

Notice that the Description column is empty. The next step of the parser generation process, creating the mapping file that links source fields to schema fields, requires this column to be filled. You can either do this by hand (the descriptions MUST be in English), or use autoparser's LLM functionality to do it for you, demonstrated below.

In [5]:
dd_described = writer.generate_descriptions(data_dict)
dd_described.head()

Unnamed: 0,Field Name,Description,Field Type,Common Values
0,Identité,Identity,string,
1,Province,Province,string,"Equateur, Orientale, Katanga"
2,DateNotification,Notification Date,string,
3,Classicfication,Classification,string,"FISH, amphibie, oiseau, Mammifère, poisson..."
4,Nom complet,Full Name,string,
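If you would rather write the descriptions by hand, the idea can be sketched in plain Python as follows. The rows here are a stand-in list of dictionaries rather than the real data dictionary, and `fill_descriptions` is a hypothetical helper, not an `autoparser` function.

```python
# Hand-written English descriptions, keyed by source field name.
manual_descriptions = {
    "Identité": "Identity",
    "Province": "Province",
    "DateNotification": "Notification Date",
}

# Stand-in for the first rows of the data dictionary (not a DataFrame).
rows = [
    {"Field Name": "Identité", "Description": ""},
    {"Field Name": "Province", "Description": ""},
    {"Field Name": "DateNotification", "Description": ""},
    {"Field Name": "Nom complet", "Description": ""},
]


def fill_descriptions(rows, descriptions):
    """Return copies of `rows` with empty descriptions filled in,
    plus the names of fields that still have no description."""
    filled, missing = [], []
    for row in rows:
        new_row = dict(row)  # leave the input untouched
        if not new_row["Description"]:
            new_row["Description"] = descriptions.get(new_row["Field Name"], "")
        if not new_row["Description"]:
            missing.append(new_row["Field Name"])
        filled.append(new_row)
    return filled, missing


filled, missing = fill_descriptions(rows, manual_descriptions)
print(missing)  # ['Nom complet']
```

Listing the fields that are still undescribed makes it easy to confirm that every row is filled before moving on to the mapping step.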


Now that we have a data dictionary with descriptions added, we can proceed to creating an intermediate mapping file:

In [6]:
mapper = autoparser.WideMapper(dd_described, "animals")
mapping_dict = mapper.create_mapping(file_name="example_mapping.csv")

mapping_dict.head()



target_field,source_description,source_field,common_values,target_values,value_mapping
identity,Identity,Identité,,,
name,Full Name,Nom complet,,,
loc_admin_1,Province,Province,"equateur, katanga, orientale",,"equateur=None, katanga=None, orientale=None"
country_iso3,,,,,
notification_date,Notification Date,DateNotification,,,


At this point, inspect the mapping file, written out to [example_mapping.csv](example_mapping.csv), for fields or values that have been mapped incorrectly, and edit them where necessary.
A good example is the 'loc_admin_1' field: the LLM often maps the common values provided to 'None' because the schema denotes this as a free-text field. Delete these mapped values so that the parsed data contains the original free text.
Also note the warning above: the LLM is not expected to find source fields to map to the 'country_iso3' or 'owner' fields. If your data does contain an appropriate field for these, edit the mapping file accordingly.
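The 'None' clean-up described for 'loc_admin_1' can also be scripted. The sketch below assumes the `value_mapping` cell is a plain `source=target, source=target` string, as in the output shown earlier; both helper names are hypothetical, and the second example mapping (`f=female, ...`) is made up for illustration. It does not handle values that themselves contain commas or equals signs.

```python
def parse_value_mapping(cell: str) -> dict:
    """Parse a 'source=target, source=target' string into a dict."""
    mapping = {}
    for pair in cell.split(","):
        pair = pair.strip()
        if not pair:
            continue
        source, _, target = pair.partition("=")
        mapping[source.strip()] = target.strip()
    return mapping


def drop_none_targets(cell: str) -> str:
    """Remove entries whose target is the literal string 'None'."""
    kept = {s: t for s, t in parse_value_mapping(cell).items() if t != "None"}
    return ", ".join(f"{s}={t}" for s, t in kept.items())


# Every value mapped to None: the cell becomes empty.
print(repr(drop_none_targets("equateur=None, katanga=None, orientale=None")))  # ''
# Mixed case: only the None entry is dropped.
print(drop_none_targets("f=female, m=male, x=None"))  # f=female, m=male
```

Editing the CSV by hand works just as well for a file this small; a helper like this is only worth it when many fields need the same fix.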

Once you have edited the mapping file to your satisfaction, we can go ahead and create the TOML parser file, `example_parser.toml`:

In [7]:
writer = autoparser.ParserGenerator(
    "example_mapping.csv",
    "",
    "example",
)
writer.create_parser("example_parser.toml")

Missing required field country_iso3 in animals schema. Adding empty field...


You can view and edit the created parser at [example_parser.toml](example_parser.toml), and use it with adtl.

In [8]:
import adtl

data = adtl.parse(
    "example_parser.toml",
    "../../../tests/test_autoparser/sources/animal_data.csv",
    "example_output",
)
data["animals"].head()

[example] parsing animal_data.csv: 100%|██████████| 30/30 [00:00<00:00, 20078.05it/s]
[example] validating animals table: 30it [00:00, 119951.50it/s]


Unnamed: 0,age_months,age_years,chipped,identity,name,notification_date,pet,country_iso3,case_status,classification,sex,underlying_conditions,adtl_valid,date_of_death,loc_admin_1,adtl_error
0,10,2,True,A001,Luna,2024-01-01,True,,alive,mammal,female,"[arthritis, vomiting]",True,,,
1,4,3,False,B002,Max,2024-15-02,True,,dead,fish,male,,True,2024-06-01,,
2,11,1,True,C003,Coco,2024-03-10,False,,alive,bird,female,,True,,,
3,5,4,False,D004,Bella,2024-04-22,False,,alive,amphibian,male,,True,,,
4,3,5,True,E005,Charlie,2024-05-30,True,,dead,fish,female,,True,2024-07-01,,
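It is worth spot-checking the parsed values yourself. As one example, the sketch below flags `notification_date` strings that are not valid ISO calendar dates; the list is copied from the five rows shown above, and `is_iso_date` is a hypothetical helper that is not part of adtl.

```python
from datetime import date


def is_iso_date(text: str) -> bool:
    """Return True if `text` is a valid YYYY-MM-DD calendar date."""
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


# notification_date values from the first five parsed rows.
notification_dates = [
    "2024-01-01",
    "2024-15-02",
    "2024-03-10",
    "2024-04-22",
    "2024-05-30",
]

suspect = [d for d in notification_dates if not is_iso_date(d)]
print(suspect)  # ['2024-15-02']
```

Here the second row carries a day and month in swapped order from the source data, which is the kind of issue to correct in the source or handle in the parser.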
