# Mapping CDA to NCIT data
I have implemented a simple mapping strategy in the OpDiagnosisMapper class. It uses the three fields
- primary_diagnosis
- primary_diagnosis_condition
- primary_diagnosis_site

to look up NCIT codes in a map. If no matching entry is found, it falls back to a general code for the organ given in primary_diagnosis_site.
The code also reports counts of the terms that have not yet been mapped, so that curation can be prioritized. This should be easily doable for ten datasets.
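
The lookup-with-fallback idea described above can be sketched roughly as follows. This is not the actual OpDiagnosisMapper code: the table contents, the `OntologyTerm` tuple, the `lookup_term` function, and the placeholder identifiers are all assumptions made for illustration, and the IDs are deliberately not real NCIT codes.

```python
from typing import NamedTuple, Optional


class OntologyTerm(NamedTuple):
    """Hypothetical stand-in for the term object returned by the mapper."""
    id: str
    label: str


# Hypothetical lookup tables. The IDs are placeholders, NOT real NCIT codes.
SPECIFIC_TERMS = {
    ("squamous cell carcinoma, keratinizing, nos",
     "squamous cell neoplasms",
     "cervix uteri"): OntologyTerm("NCIT:PLACEHOLDER_1", "specific diagnosis (placeholder)"),
}
ORGAN_FALLBACK = {
    "cervix uteri": OntologyTerm("NCIT:PLACEHOLDER_2", "general organ-level term (placeholder)"),
}


def _norm(value) -> str:
    """Normalize a field value so lookups ignore case and surrounding whitespace."""
    return "" if value is None else str(value).strip().lower()


def lookup_term(row) -> Optional[OntologyTerm]:
    """Try the exact three-field key first, then fall back to the organ-level term."""
    key = (
        _norm(row.get("primary_diagnosis")),
        _norm(row.get("primary_diagnosis_condition")),
        _norm(row.get("primary_diagnosis_site")),
    )
    term = SPECIFIC_TERMS.get(key)
    if term is not None:
        return term
    # No specific entry: use the general term for the organ, if one is known.
    return ORGAN_FALLBACK.get(key[2])
```

Because `lookup_term` only calls `.get`, it accepts a plain dict as well as a pandas row. A row whose site is not in the fallback table returns `None`, which is one possible way to flag entries that still need curation.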

For convenience, I am using a downloaded file, "merged_cervix_disease.tsv", here; it is identical to the merged tables we get from:
```
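# Import paths below are assumed; adjust them to your installed versions.
from cdapython import Q
from oncoexporter.cda import CdaTableImporter

cohort_name = "cervix cancer cohort"
# Select research subjects whose treatment site is the cervix
query = 'treatment_anatomic_site = "Cervix"'
Tsite = Q(query)
tableImporter = CdaTableImporter(cohort_name=cohort_name, query_obj=Tsite)
merged_df = tableImporter.get_merged_diagnosis_research_subject_df()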
```

In [6]:
import pandas as pd
from oncoexporter.cda.mapper import OpDiagnosisMapper
from collections import defaultdict

In [7]:
df = pd.read_csv("merged_cervix_disease.tsv", sep="\t")
df.head(2)

Unnamed: 0.1,Unnamed: 0,diagnosis_id,diagnosis_identifier,primary_diagnosis,age_at_diagnosis,morphology,stage,grade,method_of_diagnosis,subject_id_di,researchsubject_id,researchsubject_identifier,member_of_research_project,primary_diagnosis_condition,primary_diagnosis_site,subject_id_rs
0,0,CGCI-HTMCP-CC.HTMCP-03-06-02423.HTMCP-03-06-02...,"[{'system': 'GDC', 'field_name': 'case.diagnos...","Squamous cell carcinoma, nonkeratinizing, NOS",,8072/3,,G2,Biopsy,CGCI.HTMCP-03-06-02423,CGCI-HTMCP-CC.HTMCP-03-06-02423,"[{'system': 'GDC', 'field_name': 'case.case_id...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,CGCI.HTMCP-03-06-02423
1,1,CGCI-HTMCP-CC.HTMCP-03-06-02238.HTMCP-03-06-02...,"[{'system': 'GDC', 'field_name': 'case.diagnos...","Squamous cell carcinoma, keratinizing, NOS",14943.0,8071/3,,G2,Biopsy,CGCI.HTMCP-03-06-02238,CGCI-HTMCP-CC.HTMCP-03-06-02238,"[{'system': 'GDC', 'field_name': 'case.case_id...",CGCI-HTMCP-CC,Squamous Cell Neoplasms,Cervix uteri,CGCI.HTMCP-03-06-02238


# OpDiagnosisMapper


In [8]:
dxMapper = OpDiagnosisMapper()

In [9]:
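# Count how many rows map to each NCIT term and remember each term's label
term_count_d = defaultdict(int)
id_to_label = {}
for _, row in df.iterrows():
    ncit_term = dxMapper.get_ontology_term(row)
    term_count_d[ncit_term.id] += 1
    id_to_label[ncit_term.id] = ncit_term.label
# Print one summary line per mapped term
for term_id, count in term_count_d.items():
    label = id_to_label.get(term_id)
    print(f"{label} ({term_id}): n={count}")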

# The following prints a summary of diseases we have not mapped yet

In [10]:
dxMapper.get_error_df()

Unnamed: 0,primary_diagnosis,primary_diagnosis_condition,primary_diagnosis_site,count
0,"Adenocarcinoma, NOS",Adenomas and Adenocarcinomas,Cervix uteri,7
1,Basaloid squamous cell carcinoma,Squamous Cell Neoplasms,Cervix uteri,1
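
A summary like this one can be approximated outside the mapper. The sketch below is only an illustration of counting unmapped field combinations so that the most frequent ones are curated first; `count_unmapped` and the `is_mapped` callback are hypothetical names, not part of the OpDiagnosisMapper API.

```python
from collections import Counter

KEY_FIELDS = (
    "primary_diagnosis",
    "primary_diagnosis_condition",
    "primary_diagnosis_site",
)


def count_unmapped(rows, is_mapped):
    """Count each unmapped (diagnosis, condition, site) combination.

    rows      -- iterable of dict-like records, e.g. df.to_dict("records")
    is_mapped -- hypothetical callback returning True if a record has a specific mapping
    Returns a list of (key_tuple, count) pairs, most frequent first.
    """
    counts = Counter(
        tuple(row[field] for field in KEY_FIELDS)
        for row in rows
        if not is_mapped(row)
    )
    return counts.most_common()
```

With a DataFrame, one would pass `df.to_dict("records")` as `rows`; the combinations at the top of the returned list are the ones whose curation would cover the most records.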
