# Analysis for GARD ingest
- [Spreadsheet](https://docs.google.com/spreadsheets/d/1w5Xnzr5uNFcPrQqCT8mGBFHnGhwXDfQVzHKofw6kB7c/edit#gid=1282628523)

In [41]:
%reload_ext autoreload
%autoreload 2
from pathlib import Path
import os
import sys

# sys.path entries must be strings; Path objects are ignored by the import system
sys.path.insert(0, str(Path(os.getcwd()).parent))
from gard_owl_ingest.mondo_mapping_status import gard_mondo_mapping_status

proxy_df, sssom_curate_df = gard_mondo_mapping_status(return_type='analysis')
proxy_df_exact, sssom_curate_df_exact = gard_mondo_mapping_status(
    return_type='analysis', mondo_predicate_filter=['skos:exactMatch'], gard_predicate_filter=['skos:exactMatch'])

## 2023-06-29

Created a set of output artefacts that have `-exact` in the filename, e.g. `gard-mondo-exact.sssom.tsv`. For these artefacts, we have filtered out everything but `skos:exactMatch` for both the Mondo::proxy and GARD::proxy mappings.
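
As a rough illustration of the filter described above, the sketch below keeps only `skos:exactMatch` rows from a small made-up mapping table. The column names follow SSSOM, but the rows and the helper `keep_exact` are hypothetical. In this repo the filtering is requested through the `mondo_predicate_filter` and `gard_predicate_filter` arguments of `gard_mondo_mapping_status`, whose internals are not shown here.

```python
import pandas as pd

# Hypothetical toy mappings; not real GARD, Orphanet, or OMIM data.
toy = pd.DataFrame({
    'subject_id': ['GARD:1', 'GARD:1', 'GARD:2', 'GARD:3'],
    'predicate_id': ['skos:exactMatch', 'skos:broadMatch',
                     'skos:narrowMatch', 'skos:exactMatch'],
    'object_id': ['Orphanet:10', 'OMIM:20', 'Orphanet:30', 'OMIM:40'],
})


def keep_exact(df, allowed=('skos:exactMatch',)):
    """Return only the rows whose predicate is in `allowed` (hypothetical helper)."""
    return df[df['predicate_id'].isin(allowed)].reset_index(drop=True)


toy_exact = keep_exact(toy)
print(toy_exact)
```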

### 1. `gard-mondo-exact_curation.sssom.tsv`: no duplicates

FYI: `sssom_curate_df` is `gard-mondo_curation.sssom.tsv`, which is just `gard-mondo.sssom.tsv` but with additional columns.

#### Duplicate edges: 0

In [42]:
dup_edges = sssom_curate_df_exact[sssom_curate_df_exact.duplicated(subset=['subject_id', 'object_id'])]
len(dup_edges)

0

#### Duplicate Mondo IDs: 122

In [43]:
len(sssom_curate_df_exact) - len(sssom_curate_df_exact['object_id'].unique())

122

In [44]:
dup_mondo = sssom_curate_df_exact[sssom_curate_df_exact.duplicated(subset=['object_id'])].sort_values(['object_id'])
len(dup_mondo)

122
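
The two counts above match because `duplicated()` defaults to `keep='first'`: it flags every repeat of a value except its first occurrence, which is the same as the number of rows minus the number of distinct values. With `keep=False`, every row in a duplicate group is flagged, so that count is larger. A minimal sketch on made-up IDs (not real data):

```python
import pandas as pd

# Hypothetical toy mappings: MONDO:1 is shared by three GARD terms.
toy = pd.DataFrame({
    'subject_id': ['GARD:1', 'GARD:2', 'GARD:3', 'GARD:4'],
    'object_id': ['MONDO:1', 'MONDO:1', 'MONDO:1', 'MONDO:2'],
})

# Rows minus distinct object IDs.
extra_rows = len(toy) - len(toy['object_id'].unique())
# keep='first' (the default): flags repeats only, not the first occurrence.
flagged_first = int(toy.duplicated(subset=['object_id']).sum())
# keep=False: flags every row that belongs to a duplicate group.
flagged_all = int(toy.duplicated(subset=['object_id'], keep=False).sum())

print(extra_rows, flagged_first, flagged_all)
```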

In [46]:
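if len(dup_mondo) > 0:
    # keep=False retains every row of each duplicate group, so the export shows whole groups
    dup_mondo2 = sssom_curate_df_exact[sssom_curate_df_exact.duplicated(keep=False, subset=['object_id'])].sort_values(['object_id'])
    dup_mondo2.to_csv('~/Desktop/gard-mondo-exact_curate_mondo-duplicates.sssom.tsv', index=False, sep='\t')
    # A bare expression inside an `if` block is not rendered by the notebook, so display it explicitly
    display(dup_mondo.head())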

#### Duplicate GARD IDs

In [19]:
len(sssom_curate_df_exact) - len(sssom_curate_df_exact['subject_id'].unique())

In [20]:
dup_gard = sssom_curate_df_exact[sssom_curate_df_exact.duplicated(subset=['subject_id'])].sort_values(['subject_id'])
len(dup_gard)

In [33]:
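if len(dup_gard) > 0:
    dup_gard.to_csv('~/Desktop/gard-mondo-exact_curate_gard-duplicates.sssom.tsv', index=False, sep='\t')
    # A bare expression inside an `if` block is not rendered by the notebook, so display it explicitly
    display(dup_gard.head())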

### 2. Row difference when only using `skos:exactMatch`
There are 12,004 GARD terms. With only `skos:exactMatch` mappings, 10 of them are unmapped. This agrees with `gard_unmapped_terms-exact.txt`, whereas the unfiltered `gard_unmapped_terms.txt` has only 3 entries.

In [23]:
len(sssom_curate_df)

In [24]:
len(sssom_curate_df_exact)

## 2023-06-10

### 1. Why was [unmapped_proxy_or_direct](https://docs.google.com/spreadsheets/d/1w5Xnzr5uNFcPrQqCT8mGBFHnGhwXDfQVzHKofw6kB7c/edit#gid=1504531712) ~178 when it should have been just 3?

FYI: `proxy_df` is a subset of `gard-mondo.sssom-like.tsv`

I had a bug: I counted a GARD term as not mapped to Mondo if *any* proxy path _GARD -> proxy ontology -> Mondo_ was unmapped, when I should have counted it as unmapped only if there were *no* proxy paths reaching Mondo.
In the example below, `GARD:15370` has a proxy path _`GARD:15370` -> `Orphanet:888` -> `MONDO:0019508`_, so it should have been counted as mapped. But because there was no proxy path to Mondo for _`GARD:15370` -> `OMIM:604547`_, I had been erroneously counting it as `unmapped_proxy_or_direct` [in the spreadsheet](https://docs.google.com/spreadsheets/d/1w5Xnzr5uNFcPrQqCT8mGBFHnGhwXDfQVzHKofw6kB7c/edit#gid=701107044).

In [25]:
example = proxy_df[proxy_df['subject_id'] == 'GARD:15370']
example
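
The difference between the buggy and the corrected rule can be sketched on a toy table. This is only an illustration: the column names `proxy_id` and `mondo_id` are assumptions and may not match the columns of `proxy_df`, and `GARD:0000001` / `OMIM:000000` are made-up IDs added to show a term that is unmapped under both rules.

```python
import pandas as pd

# Hypothetical proxy paths, modelled on the GARD:15370 example.
paths = pd.DataFrame({
    'subject_id': ['GARD:15370', 'GARD:15370', 'GARD:0000001'],
    'proxy_id': ['Orphanet:888', 'OMIM:604547', 'OMIM:000000'],
    'mondo_id': ['MONDO:0019508', None, None],
})

has_mondo = paths['mondo_id'].notna()

# Buggy rule: a term is unmapped if ANY of its proxy paths lacks a Mondo term.
unmapped_buggy = set(paths.loc[~has_mondo, 'subject_id'])

# Corrected rule: a term is unmapped only if NONE of its proxy paths reaches Mondo.
mapped_per_term = has_mondo.groupby(paths['subject_id']).any()
unmapped_fixed = set(mapped_per_term[~mapped_per_term].index)

print(sorted(unmapped_buggy))
print(sorted(unmapped_fixed))
```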

### 2. `gard-mondo_curation.sssom.tsv` duplicates analysis

FYI: `sssom_curate_df` is `gard-mondo_curation.sssom.tsv`, which is just `gard-mondo.sssom.tsv` but with additional columns.

#### Duplicate subj, obj edges: Multiple possible predicates
This is where Joe needs help: it is possible to guess what to do, but not with enough confidence to choose some kinds of predicates over others, either in general or case by case.

_edit 2023/06/29: I think Nico helped here by determining that currently we're going to only consider exact matches: https://github.com/monarch-initiative/gard/pull/10#issuecomment-1588881615_

In [26]:
dups = sssom_curate_df[sssom_curate_df.duplicated(subset=['subject_id', 'object_id'], keep=False)]
len(dups)

In [27]:
dups.head()

#### Duplicate subj, pred, obj
These exist because the same GARD::Mondo mapping can be derived from multiple proxy terms.

In [28]:
dups_preds = sssom_curate_df[sssom_curate_df.duplicated(subset=['subject_id', 'predicate_id', 'object_id'], keep=False)]
len(dups_preds)

In [29]:
dups_preds

In [30]:
# Showing the same thing, but leaving only the first of the duplicate rows.
# - Why do the ones with no object_id appear multiple times?
dups_preds2 = sssom_curate_df[sssom_curate_df.duplicated(subset=['subject_id', 'predicate_id', 'object_id'])]
dups_preds2
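
One possible explanation for the question in the comment above, offered as an assumption rather than a confirmed diagnosis: pandas `duplicated()` treats missing values as equal to each other, so several unmapped rows for the same GARD term and predicate count as duplicates of one another. With three or more such rows, more than one is still flagged under the default `keep='first'`. A toy sketch with made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: GARD:1 has three rows with no object_id.
toy = pd.DataFrame({
    'subject_id': ['GARD:1', 'GARD:1', 'GARD:1', 'GARD:2'],
    'predicate_id': ['skos:exactMatch'] * 4,
    'object_id': [np.nan, np.nan, np.nan, 'MONDO:1'],
})
cols = ['subject_id', 'predicate_id', 'object_id']

# NaN == NaN for the purposes of duplicated(), so two of the three rows are flagged.
later_copies = toy[toy.duplicated(subset=cols)]

# Dropping unmapped rows first removes this effect.
mapped_only = toy.dropna(subset=['object_id'])
later_copies_mapped = mapped_only[mapped_only.duplicated(subset=cols)]

print(len(later_copies), len(later_copies_mapped))
```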

#### Duplicate subj, skos:exactMatch, obj
None exist!

In [31]:
dups_preds_exacts = dups_preds[dups_preds['predicate_id'] == 'skos:exactMatch']
len(dups_preds_exacts)

### 3. Algorithm for determining mapping predicate
A single Mondo term can get multiple mappings to a single GARD term via multiple proxies. We can either (a) manually curate the list to get 1:1 mappings, or (b) resolve them algorithmically.

I set up some pseudocode for a more complex approach:
```py
if preds == {'skos:narrowMatch', 'skos:exactMatch', 'skos:broadMatch'}:
    pass
elif preds == {'skos:narrowMatch', 'skos:broadMatch'}:
    pass
elif preds == {'skos:narrowMatch', 'skos:exactMatch'}:
    pass
elif preds == {'skos:exactMatch', 'skos:broadMatch'}:
    pass
```

But currently we are using this approach:
```py
pred = 'skos:exactMatch' if 'skos:exactMatch' in preds \
    else 'skos:narrowMatch' if 'skos:narrowMatch' in preds \
    else 'skos:broadMatch' if 'skos:broadMatch' in preds \
    else 'skos:relatedMatch'
```
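
To make the precedence concrete, here is a minimal sketch that applies the same exact > narrow > broad > related ordering to each (subject, object) pair of a toy table. The helper `choose_predicate`, the `PRECEDENCE` list, and the rows are hypothetical and are not taken from the repo's implementation.

```python
import pandas as pd

# Same ordering as the conditional expression above; anything else falls back to relatedMatch.
PRECEDENCE = ['skos:exactMatch', 'skos:narrowMatch', 'skos:broadMatch']


def choose_predicate(preds):
    """Pick one predicate from a set of candidates (hypothetical helper)."""
    for pred in PRECEDENCE:
        if pred in preds:
            return pred
    return 'skos:relatedMatch'


# Hypothetical toy mappings with several predicates per (subject, object) pair.
toy = pd.DataFrame({
    'subject_id': ['GARD:1', 'GARD:1', 'GARD:2', 'GARD:2', 'GARD:3'],
    'object_id': ['MONDO:1', 'MONDO:1', 'MONDO:2', 'MONDO:2', 'MONDO:3'],
    'predicate_id': ['skos:broadMatch', 'skos:exactMatch',
                     'skos:narrowMatch', 'skos:broadMatch', 'skos:closeMatch'],
})

# Collapse each pair to a single row with the chosen predicate.
chosen = (
    toy.groupby(['subject_id', 'object_id'])['predicate_id']
    .agg(lambda s: choose_predicate(set(s)))
    .reset_index()
)
print(chosen)
```

Keeping the ordering in a list means a change of precedence is a one-line edit rather than a rewrite of the nested conditional.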

Note that as of 2023/06/29, the choice of predicate is moot for output artefacts that have `-exact` in the filename, e.g. `gard-mondo-exact.sssom.tsv`, because for these artefacts we have filtered out everything but `skos:exactMatch` for both the Mondo::proxy and GARD::proxy mappings.