# Creating the record set curation file

- Fields in the record sets are annotated manually by multiple curators.
- The curation is collected in a Google Sheets table.
- The table has columns that make annotation easier for curators, plus extra annotation used for Croissant ingestion.

## Columns

- `dataset_name` - name of the dataset, for curators.
- `field_id` - identifier of the field, used by Croissant.
- `column_name` - label of the field, for curators.
- `column_description` - description of the field, written by curators.
- `foreign_key` - the `field_id` of the field this column refers to, added by curators.
- `bioregistry_prefix` - Bioregistry prefix of the source database, filled in when the values in the column come from a database registered in Bioregistry.
- `example` - an example value that helps curators.
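
Because the table is edited by hand, it can be useful to check that the columns the export code reads are still present before processing. The following is a minimal sketch: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the list only covers the columns accessed by the code cells in this notebook.

```python
import pandas as pd

# Columns read by the export code in this notebook (assumed to be the minimum needed).
REQUIRED_COLUMNS = (
    "field_id",
    "column_description",
    "foreign_key",
    "bioregistry_prefix",
)


def missing_columns(table: pd.DataFrame) -> list[str]:
    """Return the required column names that are absent from the table.

    Hypothetical helper, not part of the original notebook.
    """
    return [name for name in REQUIRED_COLUMNS if name not in table.columns]


# Illustrative table that lacks two of the required columns.
sample = pd.DataFrame(
    {
        "field_id": ["disease/id"],
        "column_description": ["Disease identifier"],
    }
)
print(missing_columns(sample))
```

An empty list means all columns used downstream are available.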

## Process

1. Fetch the curation table from Google.
2. Compose the description of each field.
3. Iterate over the curated columns (rows of the table) and build the output.
4. Save the curation as JSON.
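
The steps above can be tried offline on a small in-memory table. This is a sketch under stated assumptions: `SAMPLE_TSV` is a hypothetical stand-in for the Google Sheets export, the `mouse_phenotype/uncurated` row is invented to show how rows without a description are skipped, and `build_records` is an illustrative name, not a function of the package.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for the TSV export of the curation sheet.
SAMPLE_TSV = (
    "field_id\tcolumn_description\tforeign_key\tbioregistry_prefix\n"
    "disease_phenotype/disease\tDisease identifier\tdisease/id\t\n"
    "mouse_phenotype/biologicalModels/id\t"
    "Unique identifier for the biological model.\t\tMGI\n"
    "mouse_phenotype/uncurated\t\t\t\n"
)


def build_records(table: pd.DataFrame) -> list[dict]:
    """Turn curated rows into records with id, description and optional foreign key."""
    records = []
    for row in table.to_dict(orient="records"):
        # Rows without a description have not been curated yet: skip them.
        if pd.isna(row["column_description"]):
            continue
        description = row["column_description"]
        # Append the lowercased Bioregistry prefix when one is given.
        if not pd.isna(row["bioregistry_prefix"]):
            prefix = row["bioregistry_prefix"].lower()
            description = f"{description} [bioregistry:{prefix}]"
        record = {"id": row["field_id"], "description": description}
        if not pd.isna(row["foreign_key"]):
            record["foreign_key"] = row["foreign_key"]
        records.append(record)
    return records


# Reading every column as a nullable string keeps empty cells as missing values.
table = pd.read_csv(io.StringIO(SAMPLE_TSV), sep="\t", dtype="string")
records = build_records(table)
print(json.dumps(records, indent=2))
```

Replacing `io.StringIO(SAMPLE_TSV)` with the export URL and writing the result with `json.dump` gives the same flow as the cells below.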

In [3]:
import pandas as pd
import json

# Curation of all columns from all OpenTargets output datasets:
curation = 'https://docs.google.com/spreadsheets/d/132SKHMoaJePu4nTlBnQwfaz3dhfJiKmJUujfYkzXMdI/export?format=tsv&gid=179018892'

# Folder to save the resulting curation file:
asset_folder = '../src/ot_croissant/assets/'

# Reading table:
curation_table = (
    pd.read_csv(curation, sep='\t')
    .astype(
        {
            'column_description': pd.StringDtype(),
            'foreign_key': pd.StringDtype(),
        }
    )
)
curation_table.head()

Unnamed: 0,dataset_name,field_id,column_name,column_description,foreign_key,bioregistry_prefix,Example
0,disease_phenotype,disease_phenotype/disease,disease,Disease identifier,disease/id,,MONDO_0800026
1,disease_phenotype,disease_phenotype/phenotype,phenotype,The phenotype linked to the disease.,disease/id,,
2,disease_phenotype,disease_phenotype/evidence,evidence,A container for all evidence-related attribute...,,,
3,disease_phenotype,disease_phenotype/evidence/aspect,aspect,The category of biological information being p...,,,C
4,disease_phenotype,disease_phenotype/evidence/bioCuration,bioCuration,Indicates whether the evidence has been manual...,,,HPO:probinson[2021-09-23];HPO:probinson[2021-0...


In [None]:
# Collection of curated fields:
curation_json = []

# Composing description:
def compose_description(row: pd.Series) -> str:
    """
    Composes the description of a column based on the bioregistry prefix and the column description.
    If the bioregistry prefix is not available, it returns the column description as is.

    Args:
        row (pd.Series): A row from the curation table.
    
    Returns:
        str: The composed description.
    """
    # If the bioregistry prefix is not available, return the column description as is:
    description = (
        row['column_description']
        if pd.isna(row['bioregistry_prefix'])
        else f"{row['column_description']} [bioregistry:{row['bioregistry_prefix'].lower()}]"
    )

    return description

# Iterating over the rows of the curation table:
for _, row in curation_table.iterrows():
    # If the column description is not available, skip the row:
    if pd.isna(row['column_description']):
        continue

    # Adding curation to the dictionary:
    data = {
        'id': row['field_id'],
        'description': compose_description(row)
    }

    # If the foreign key is available, add it to the dictionary:
    if not pd.isna(row['foreign_key']):
        data['foreign_key'] = row['foreign_key']

    # Adding the record to the collection:
    curation_json.append(data)

# Saving the curation to a JSON file:
with open(f'{asset_folder}/recordset.json', 'w') as f:
    json.dump(curation_json, f, indent=2)


dataset_name          object
field_id              object
column_name           object
column_description    object
foreign_key           object
bioregistry prefix    object
Example               object
dtype: object

In [43]:
curation_table.loc[curation_table.bioregistry_prefix.notna()]

Unnamed: 0,dataset_name,field_id,column_name,column_description,foreign_key,bioregistry_prefix,Example
14,disease_phenotype,disease_phenotype/evidence/references,references,References or citations supporting the evidence.,,pubmed,[PMID:14566559]
20,mouse_phenotype,mouse_phenotype/biologicalModels/id,id,Unique identifier for the biological model.,,MGI,MGI:6140117
21,mouse_phenotype,mouse_phenotype/biologicalModels/literature,literature,References related to the mouse model.,,pubmed,[30949703]
23,mouse_phenotype,mouse_phenotype/modelPhenotypeClasses/id,id,Unique identifier for the phenotype class.,,MP,MP:0005389
25,mouse_phenotype,mouse_phenotype/modelPhenotypeId,modelPhenotypeId,Identifier for the specific phenotype observed...,,MP,MP:0005343
26,mouse_phenotype,mouse_phenotype/modelPhenotypeLabel,modelPhenotypeLabel,Human-readable label describing the observed p...,,,increased circulating aspartate transaminase l...
29,mouse_phenotype,mouse_phenotype/targetInModelEnsemblId,targetInModelEnsemblId,Ensembl identifier for the target gene in the ...,,ENSEMBL,ENSMUSG00000087651
30,mouse_phenotype,mouse_phenotype/targetInModelMgiId,targetInModelMgiId,MGI (Mouse Genome Informatics) identifier for ...,,MGI,MGI:1917034
48,reactome,reactome/id,id,Unique identifier for the Reactome pathway,,reactome,
55,expression,expression/id,id,Ensembl human gene identifier for the expresse...,,ENSEMBL,ENSG00000071243
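
Row 26 in the output above passes the `notna()` filter although its `bioregistry_prefix` cell looks empty. One possible explanation, which is an assumption because the cell content is not visible here, is that the cell holds whitespace or an empty string. If so, the composed description would end with an empty `[bioregistry:]` tag. The sketch below shows one way to normalise prefixes before composing descriptions; `normalize_prefix` is a hypothetical helper.

```python
import pandas as pd


def normalize_prefix(value):
    """Return a lowercased, stripped prefix, or None when nothing usable is given.

    Hypothetical helper: treats missing values, empty strings and
    whitespace-only strings the same way.
    """
    if pd.isna(value):
        return None
    text = str(value).strip().lower()
    return text or None


# Mixed-case prefixes as they appear in the table, plus two blank variants.
for raw in ["MGI", "pubmed", "ENSEMBL", " ", "", None]:
    print(repr(raw), "->", repr(normalize_prefix(raw)))
```

Applying such a function to the `bioregistry_prefix` column before filtering would make blank cells behave like missing ones.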
