# Now, bringing it all together 

Now that we can load both the data (XPT) and the metadata (APIs), how do we link it all together?

First off, what questions are we looking to answer?
* Can I get the metadata for a column in my SDTM dataset?
* Can I work out what version of SDTM was used to build the SDTM dataset?
* Can I confirm all the coded terms in my dataset?
* Can I work out how to convert from the current version to a 'newer' version?


In [1]:
# First off - imports
import pandas as pd
from pandas import DataFrame

# Don't repeat yourself.....
from utils import load_cdiscpilot_dataset

In [2]:
# Can I get the metadata for a column in my SDTM dataset?

# first off, load the dataset

# Declare the type of the variable (optional)
dm: DataFrame
dm = load_cdiscpilot_dataset('DM')

# Get the columns
published_columns = list(dm.columns)

In [3]:
# Get the current set of columns
print(published_columns)

['STUDYID', 'DOMAIN', 'USUBJID', 'SUBJID', 'RFSTDTC', 'RFENDTC', 'RFXSTDTC', 'RFXENDTC', 'RFICDTC', 'RFPENDTC', 'DTHDTC', 'DTHFL', 'SITEID', 'AGE', 'AGEU', 'SEX', 'RACE', 'ETHNIC', 'ARMCD', 'ARM', 'ACTARMCD', 'ACTARM', 'COUNTRY', 'DMDTC', 'DMDY']


In [5]:
# now, to iterate over versions 
from dotenv import load_dotenv
load_dotenv()
import os
from client import LibraryClient

client = LibraryClient(os.getenv('CDISC_LIBRARY_API_TOKEN'))

# create a dataset for comparison
variables_by_version = {}

sdtm_ig_versions = client.get_sdtmig_versions()
for version in sdtm_ig_versions:
    # carve out the version
    version_id = version.get('href').split("/")[-1]
    if not version_id[0].isdigit():
        # strip out the associated persons, devices IGs
        continue
    # get the dataset
    dataset = client.get_ig_dataset_by_version(version_id, "DM")
    for variable in dataset.get('datasetVariables'):
        # setdefault is like upsert for a dictionary
        variables_by_version.setdefault(version_id, []).append(variable.get('name'))
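
The same `datasetVariables` payload can also go some way towards the first question (metadata for a column). What follows is a minimal sketch, assuming each entry is a dictionary carrying at least the `name` key used above. The `index_variables_by_name` helper is a hypothetical name, the `label` and `role` keys are assumptions about the response, and the sample payload is made up for illustration rather than fetched from the API.

```python
from typing import Any, Dict


def index_variables_by_name(dataset: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map each variable name to its metadata entry (hypothetical helper)."""
    index: Dict[str, Dict[str, Any]] = {}
    for variable in dataset.get("datasetVariables") or []:
        name = variable.get("name")
        if name:
            index[name] = variable
    return index


# Illustrative payload only; keys other than 'name' are assumptions
sample_dataset = {
    "datasetVariables": [
        {"name": "STUDYID", "label": "Study Identifier", "role": "Identifier"},
        {"name": "AGE", "label": "Age", "role": "Record Qualifier"},
    ]
}

metadata = index_variables_by_name(sample_dataset)
for column in ["STUDYID", "AGE", "DMDY"]:
    entry = metadata.get(column)
    if entry is None:
        # Column exists in our dataset but not in this (sample) version
        print(f"{column}: no metadata in this version")
    else:
        print(f"{column}: {entry.get('label')} ({entry.get('role')})")
```

In the notebook, the `dataset` returned by `client.get_ig_dataset_by_version(version_id, "DM")` would take the place of `sample_dataset`, and `published_columns` the place of the hard-coded column list.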



In [11]:
# now we can iterate over the versions and compare the dataset variables
published = set(published_columns)
for version, items in variables_by_version.items():
    expected = set(items)
    missing = expected - published      # in the IG, absent from our dataset
    unexpected = published - expected   # in our dataset, not in this IG version
    print(f"Checking {version}")
    if not missing and not unexpected:
        print(f"Version {version} is a candidate")
    else:
        print(f"Version {version} is not a candidate")
        if missing:
            print(f"Variables missing from Dataset: {missing}")
        if unexpected:
            print(f"Variables unexpected in Dataset: {unexpected}")

# we leave the equivalent for the SDTM model as an exercise for the reader.  There may even be a helper on the client!
        

Checking 3-1-2
Version 3-1-2 is not a candidate
Variables missing from Dataset: {'BRTHDTC', 'INVID', 'INVNAM'}
Variables unexpected in Dataset: {'RFXENDTC', 'RFXSTDTC', 'DTHFL', 'ACTARMCD', 'DTHDTC', 'RFPENDTC', 'RFICDTC', 'ACTARM'}
Checking 3-1-3
Version 3-1-3 is not a candidate
Variables missing from Dataset: {'BRTHDTC', 'INVID', 'INVNAM'}
Checking 3-2
Version 3-2 is not a candidate
Variables missing from Dataset: {'BRTHDTC', 'INVID', 'INVNAM'}
Checking 3-3
Version 3-3 is not a candidate
Variables missing from Dataset: {'ACTARMUD', 'BRTHDTC', 'ARMNRS', 'INVID', 'INVNAM'}
Checking 3-4
Version 3-4 is not a candidate
Variables missing from Dataset: {'ACTARMUD', 'BRTHDTC', 'ARMNRS', 'INVID', 'INVNAM'}


So, that's a little weird!  The Define.xml reports the SDTM version as 3-1-2, but the dataset contains columns (RFXSTDTC, DTHFL and ACTARM, for example) that only appear in more recent versions.  The missing BRTHDTC and INV-- columns (INVID, INVNAM) could plausibly be an artifact of the dataset anonymisation.  With those three set aside, 3-1-3 and 3-2 are the closest matches.
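
One way to make that reconciliation explicit is to score each version rather than insist on an exact match. The following is a minimal sketch, assuming inputs shaped like `published_columns` and `variables_by_version` above. `rank_versions` and its `ignore` parameter are hypothetical names, and the toy inputs are shortened stand-ins, not real IG content.

```python
from typing import Dict, Iterable, List, Tuple


def rank_versions(
    published: Iterable[str],
    variables_by_version: Dict[str, List[str]],
    ignore: Iterable[str] = (),
) -> List[Tuple[str, int, int]]:
    """Return (version, missing_count, unexpected_count), closest match first."""
    ignored = set(ignore)
    observed = set(published) - ignored
    scores = []
    for version, items in variables_by_version.items():
        expected = set(items) - ignored
        missing = len(expected - observed)      # expected by the IG, not in the data
        unexpected = len(observed - expected)   # in the data, not in this IG version
        scores.append((version, missing, unexpected))
    # Fewest total differences first; ties keep their original order
    return sorted(scores, key=lambda score: score[1] + score[2])


# Toy inputs for illustration only
toy_published = ["STUDYID", "USUBJID", "DTHFL"]
toy_versions = {
    "3-1-2": ["STUDYID", "USUBJID", "BRTHDTC"],
    "3-1-3": ["STUDYID", "USUBJID", "DTHFL", "BRTHDTC"],
}

ranking = rank_versions(toy_published, toy_versions, ignore={"BRTHDTC"})
for version, missing, unexpected in ranking:
    print(f"{version}: {missing} missing, {unexpected} unexpected")
```

With the real data, the columns believed to have been removed by anonymisation would go in `ignore`. Whether to weight missing and unexpected variables equally is a judgment call; a simple sum is used here only to keep the sketch short.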

# What's next
[Wrap Up](./05-Wrap-it-up.ipynb)

