<a href="https://colab.research.google.com/github/hmelberg/pinga/blob/master/notation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# default_exp notation


# Creating notation and functions to deal with medical codes

One key problem when dealing with medical codes is how to describe a group of codes that make up a disease in a short and efficient manner. For instance, there are many dozens of codes that belong under the general category of liver disease. Listing all codes every time we want to select a group of patients with a given condition becomes very cumbersome. Instead, we will develop a notation where you can use stars, hyphens and colons to describe sets of medical codes, and functions that expand the shorthand codes to the full set of codes.




# Notation

There are at least three types of symbols that can make it easier to express  code groups:
* Stars: C77* would include all codes that start with C77
* Hyphen: C77-C80 would be shorthand for C77, C78, C79 and C80
* Colon: C77.0:D80.9 would include all codes that are defined in a codebook to be between C77.0 and D80.9 (including these codes themselves).


In [None]:
#export
import pandas as pd


This kind of notation could simplify things, and make it easier to quickly specify different groups of codes. However, to make use of it we need functions that expand the codes that are written using the notation to the full list of proper codes that the computer can use. 


These functions are not too difficult, but there are some tricky corners around leading zeros and the precision of decimals. For instance, in order to make hyphen notation work, we could use the following logic to create a function:
* Take the string, split it on the hyphen (if it exists), and extract the numeric component of the start and the end code
* Make a list of all the numbers between start and stop (this requires us to convert the start and end to integers and remember the number of decimals and leading zeros, since we later have to convert them back to the original format)
* Convert the list of numbers back to string codes with the same format they had originally (the prefix, leading zeros, number of decimals/trailing zeros, star notation).

In code:


In [None]:
#export
# function to expand a string like 'K51.2-K53.8' to a list of codes 

# Need regex to extract the number component of the input string
import re
from functools import singledispatch

# The singledispatch decorator enables us to have the same name, but use 
# different functions depending on the datatype of the first argument.
#
# In our case we want one function to deal with a single string input, and
# another to handle a list of strings. It could all be handled in a single 
# function using nested if, but singledispatch makes it less messy and more fun!


# Here is the main function, it is just the name and an error message if the 
# argument does not fit any of the inputs that will be allowed

@singledispatch
def expand_hyphen(expr):
  """
  Expands codes expression(s) that have hyphens to list of all codes

  Args:
      expr (str, list or dict): String or list of strings to be expanded 
  
  Returns:
      List of strings or dict with list of strings
  
  Examples:
      expand_hyphen('C00*-C26*')
      expand_hyphen('b01.1*-b09.9*')
      expand_hyphen('n02.2-n02.7')
      expand_hyphen('c00*-c260')
      expand_hyphen('b01-b09')
      expand_hyphen('b001.1*-b009.9*')
      expand_hyphen(['b001.1*-b009.9*', 'c11-c15'])
  Note:
      Unequal number of decimals in start and end code is problematic.
      Example: C26.0-C27.11 will not work since the meaning is not obvious:
      Is the step size 0.01? In which case C27.1 will not be included, while 
      C27.10 will be (and trailing zeros can be important in codes)
  """
  raise ValueError('The argument must be a string, a list or a dict')

# register the function to be used if the input is a string
@expand_hyphen.register(str)
def _(expr):
    # return immediately if nothing to expand
    if '-' not in expr:
      return [expr]

    lower, upper = expr.split('-')
    
    lower=lower.strip()

    # identify the numeric component of the code
    # raw strings avoid invalid escape sequence warnings for \d and \.
    lower_str = re.search(r"\d*\.\d+|\d+", lower).group()
    upper_str = re.search(r"\d*\.\d+|\d+", upper).group()
    # note: what about european decimal notation?
    # also note: what if multiple groups K50.1J8.4-etc


    lower_num = int(lower_str.replace('.',''))
    upper_num = int(upper_str.replace('.','')) +1
    
    if upper_num<lower_num:
      raise ValueError('The start code cannot have a higher number than the end code')

    # remember length in case of leading zeros 
    length = len(lower_str)

    nums = range(lower_num, upper_num)

    # must use integers in a loop, not floats
    # which also means that we must multiply and divide to get decimal back
    # and take care of leading and trailing zeros that may disappear
    if '.' in lower_str:
      lower_decimals = len(lower_str.split('.')[1])
      upper_decimals = len(upper_str.split('.')[1])
      if lower_decimals==upper_decimals:
        multiplier = 10**lower_decimals
        codes = [lower.replace(lower_str, format(num /multiplier, f'.{lower_decimals}f').zfill(length)) for num in nums]
      # special case: allow k1.1-k1.123, but not k.1-k2.123; the last is ambiguous: should it list k2.0 or only 2.00?
      elif (lower_decimals<upper_decimals) and (upper_str.split('.')[0]==lower_str.split('.')[0]):
        from_decimal = int(lower_str.split('.')[1])
        to_decimal = int(upper_str.split('.')[1]) +1
        nums = range(from_decimal, to_decimal)
        decimal_str = '.'+lower.split('.')[1]
        codes = [lower.replace(decimal_str, '.'+str(num)) for num in nums]
      else:
        raise ValueError('The start code and the end code do not have the same number of decimals')
    else:
        codes = [lower.replace(lower_str, str(num).zfill(length)) for num in nums]
    return codes
 

# register the function to be used if the input is a list of strings
@expand_hyphen.register(list)
def _(exprs):
  extended = []
  for expr in exprs:
    extended.extend(expand_hyphen(expr))
  return extended

# register the function to be used if the input is a dict with list of strings
@expand_hyphen.register(dict)
def _(dikt):
  extended = {name: expand_hyphen(exprs) for name, exprs in dikt.items()}
  return extended
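
The dispatch mechanism itself can be seen in isolation in the minimal sketch below. The function name `describe` and its return values are hypothetical and only serve as an illustration; the pattern (a fallback that raises, plus one registered implementation per input type) is the one used above.

```python
from functools import singledispatch

# 'describe' is a hypothetical name used only for this illustration
@singledispatch
def describe(arg):
    # fallback: used for any type without a registered implementation
    raise ValueError('The argument must be a string or a list')

@describe.register(str)
def _(arg):
    # chosen when the first argument is a string
    return f'one code: {arg}'

@describe.register(list)
def _(arg):
    # chosen when the first argument is a list;
    # each element is delegated to the string implementation
    return [describe(a) for a in arg]

single = describe('K50')
many = describe(['K50', 'K51'])
```

Only the type of the first argument decides which implementation runs, which is why `expand_hyphen` can accept a string, a list or a dict under one name.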

And here are some examples:

In [None]:
# very standard expansion
expand_hyphen('K50-K54')

['K50', 'K51', 'K52', 'K53', 'K54']

In [None]:
# Leaving out the K in the end code also works
expand_hyphen('K50-54')

['K50', 'K51', 'K52', 'K53', 'K54']

In [None]:
# hyphen-expansion with stars keeps the stars
expand_hyphen('K50*-K54*')

['K50*', 'K51*', 'K52*', 'K53*', 'K54*']

In [None]:
# expansion with decimals also works
expand_hyphen('K50.9-K51.2')

['K50.9', 'K51.0', 'K51.1', 'K51.2']

In [None]:
# expansion with decimals and stars 
# (but usually unnecessary to have decimals here)
expand_hyphen('K50.*-K53.*')

['K50.*', 'K51.*', 'K52.*', 'K53.*']

In [None]:
# decimals and star combined are OK
expand_hyphen('K50.8*-K51.2*')

['K50.8*', 'K50.9*', 'K51.0*', 'K51.1*', 'K51.2*']

In [None]:
# leading zeros 
expand_hyphen('K09.8-K10.2')

['K09.8', 'K09.9', 'K10.0', 'K10.1', 'K10.2']

In [None]:
# double digit with leading decimal zero
expand_hyphen('K1.99-K2.02')

['K1.99', 'K2.00', 'K2.01', 'K2.02']

In [None]:
# unequal number of main digits with unequal leading zeros
# note: Included K010.1, but not K10.1 
# (the pattern in the start code always has priority)
expand_hyphen('K009.8-K10.1')

['K009.8', 'K009.9', 'K010.0', 'K010.1']

In [None]:
# special case that works: unequal number of decimals, as long as main digits are the same
expand_hyphen('K01.8-K01.12')

['K01.8', 'K01.9', 'K01.10', 'K01.11', 'K01.12']

In [None]:
# different number of leading digits works as expected
expand_hyphen('K99.8*-K100.2*')

['K99.8*', 'K99.9*', 'K100.0*', 'K100.1*', 'K100.2*']

The behaviour so far is intuitive, but there are some corner cases that need to be discussed. For instance:
 
    expand_hyphen('K99.1-K100.11')
    expand_hyphen('K50.*-K51.2')
    expand_hyphen('K50.1-K51.15')
    expand_hyphen('K009.8-K10.1')

There are several issues here: differences in the number of leading zeros, different lengths of the codes, mixed notation, and different precision of decimals in the start and end codes. Because of this, there is no way to reliably fill in the intermediate codes in an unambiguous manner. 

As an example, consider what codes should be generated if we write 'K50.1-K51.15'. Should the step size be 0.1 or 0.01? If it is 0.01, it would include K50.20 but not K50.2. There is no intuitive solution, and the choice makes a difference. Remember these are string codes where leading and trailing zeros could be significant for the result.
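
To make the ambiguity concrete, the sketch below writes out both readings of 'K50.1-K51.15' by hand. It is only an illustration and does not use `expand_hyphen`; the variable names are made up, and integers are used in the loops to avoid floating point rounding.

```python
# Two possible readings of 'K50.1-K51.15' (illustration only)

# Reading 1: step size 0.1, i.e. codes with one decimal
step_tenth = [f'K{n // 10}.{n % 10}' for n in range(501, 512)]

# Reading 2: step size 0.01, i.e. codes with two decimals
step_hundredth = [f'K{n // 100}.{n % 100:02d}' for n in range(5010, 5116)]

# the two readings disagree on codes such as K50.2 and K50.20
only_in_tenth = 'K50.2' in step_tenth and 'K50.2' not in step_hundredth
only_in_hundredth = 'K50.20' in step_hundredth and 'K50.20' not in step_tenth
```

Under the first reading the end code K51.15 cannot be reached at all, and under the second the start code K50.1 is replaced by K50.10, so neither reading reproduces both endpoints as written.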



# Dots and zeros

Many code systems are inconsistent when it comes to dots and zeros. Sometimes they are included, sometimes they are not. For instance, the ICD-10 codes officially use decimals (K50.1), but are very often stored in databases without decimals (K501). Another example is the DRG codes, which are sometimes used with leading zeros (002), but are also presented as numbers without leading zeros (2).

It helps little to curse the world for being inconsistent, so the best we can do is to create functions that make it easier to deal with. We can make the codes consistent with functions that take the dots away and delete the leading zeros (and optionally the trailing zeros too):

In [None]:
#export
def del_dot(code):
  if isinstance(code, str):
    return code.replace('.','')
  else:
    codes = [c.replace('.','') for c in code]
  return codes

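def del_zero(code, left=True, right=False):
  # work on a list internally, also when the input is a single string
  codes = [code] if isinstance(code, str) else list(code)
  if left:
    codes = [c.lstrip('0') for c in codes]
  if right:
    codes = [c.rstrip('0') for c in codes]
  # return a single string if a single string was given
  if isinstance(code, str):
    codes = codes[0]
  return codes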
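
As a small illustration of why this normalisation matters, the sketch below writes the same two transformations out inline on made-up example codes. The variable names are hypothetical and the snippet does not depend on the functions above.

```python
# Illustration only: the two transformations written out inline

# ICD-10 codes in the official form and in the form often found in databases
icd_official = ['K50.1', 'K51.0']
icd_stored = ['K501', 'K510']

# removing the dots makes the official form comparable to the stored form
icd_no_dots = [code.replace('.', '') for code in icd_official]

# DRG codes with leading zeros, and the same codes without them
drg_padded = ['002', '014', '100']
drg_plain = [code.lstrip('0') for code in drg_padded]

# stripping trailing zeros is riskier: it changes '100' into '1'
drg_right_stripped = '100'.rstrip('0')
```

The last line suggests why removing trailing zeros is best left as an opt-in choice: it can turn one valid code into a different one.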

# Star notation
Expanding codes using star notation is different from hyphens. Hyphens simply insert numbers in increasing order, but star notation requires an external codelist from which we can pick the codes that conform to the given string. For instance, 'K50*' should include all codes that start with *K50*, but we need the full set of codes to know which codes start with *K50*. We can get this list in two ways:
* Find all unique codes from the codes that exist in the dataframe
* Use list of codes from an external codebook

Note that there are some small differences between the two approaches. Repeatedly constructing a list of codes from information in columns might be time-consuming with large dataframes, and we should be careful to do it in the most efficient manner possible. Also, using the existing codes to define a list of codes might not identify all possible codes; it will only identify those codes that have been used in the data. Expansion using 'all codes in my data' as opposed to 'all codes in the official codebook' may produce different results. For many purposes we only want a list of those codes that actually exist in the data, and including irrelevant codes would just make things slower. But it is a difference one should be aware of.
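
The difference can be illustrated with a toy example. The `codebook` and `observed` lists and the helper `starts_with` below are made up for this sketch and are not part of the functions developed in this notebook.

```python
# Hypothetical toy data, made up for illustration
codebook = ['K50.0', 'K50.1', 'K50.8', 'K50.9', 'K51.0']   # official list
observed = ['K50.1', 'K51.0', 'K50.1', 'K50.9']            # codes found in data

def starts_with(prefix, codelist):
    # keep the unique codes that begin with the prefix
    return sorted({code for code in codelist if code.startswith(prefix)})

from_codebook = starts_with('K50', codebook)
from_data = starts_with('K50', observed)

# codes that exist officially, but never occur in the data
never_used = sorted(set(from_codebook) - set(from_data))
```

Here the codebook gives four codes for the prefix K50 while the data gives two, so the expansion of the same expression depends on which list is used.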

As soon as we have a list of all possible codes, picking those that start with a given string, is very easy. The hard part is creating that list. 

We might, however, make the notation slightly more flexible. Instead of only allowing the user to search for everything that starts with a given pattern, one might allow searching for all codes that end with a given string, for instance everything that ends with B3 (written `*B3`). Even more advanced, one could allow stars in the middle: searching for codes that start with something and end with something. If more advanced searches are desired, we could allow expansion based on a regex pattern, i.e. include those codes in a codelist that conform to a pattern described by a regular expression. Lastly, there might be codes that actually have stars and hyphens as part of the code itself, in which case we have to be careful not to expand the codes as if the hyphen or star were notational symbols (for instance by introducing a *raw=True* argument to indicate that the codes should not be expanded).
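
One possible way to support stars at the start, the end or in the middle is to translate the star pattern into a regular expression. The sketch below is only an illustration under that assumption: the function `match_star`, its `raw` argument and the example codes are hypothetical, and it is not the implementation used later in this notebook.

```python
import re

def match_star(pattern, codelist, raw=False):
    # raw=True: the star is a literal character of the code, so only
    # an exact match counts and nothing is expanded
    if raw:
        return [code for code in codelist if code == pattern]
    # escape every special character, then let each star stand for
    # any run of characters (including none)
    regex = re.compile(re.escape(pattern).replace(r'\*', '.*'))
    return [code for code in codelist if regex.fullmatch(code)]

# made-up codes; the last one contains a literal star
codes = ['K50.1', 'K50.B3', 'L20.B3', 'K51.9', 'K5*']

starts = match_star('K50*', codes)             # starts with K50
ends = match_star('*B3', codes)                # ends with B3
middle = match_star('K*B3', codes)             # starts with K, ends with B3
literal = match_star('K5*', codes, raw=True)   # the star is part of the code
```

Using `re.escape` first means that dots in the codes are matched literally, so only the star carries a special meaning.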

These complications will introduce themselves as we progress. For now, we need a function that creates a list of all possible codes by identifying the unique values in one or more columns:



In [None]:
#export
# A function to identify all unique values in one or more columns 
# with one or multiple codes in each cell


def get_unique(df, cols=None, sep=None, all_str=False):
  # if no column(s) are specified, find unique values in whole dataframe
  if cols is None:
    cols = list(df.columns)
  # allow a single column name given as a string
  if isinstance(cols, str):
    cols = [cols]
  # multiple values with separator in cells
  if sep:
    all_unique = set()

    for col in cols:
      # join all cells in the column, then split on the same separator
      new_unique = set(df[col].str.cat(sep=sep).split(sep))
      all_unique.update(new_unique)
    values = list(all_unique)
  # single valued cells
  else:
    values = pd.unique(df[cols].values.ravel('K'))
  
  # if need to make sure all elements are strings without surrounding spaces
  if all_str:
    values=[str(value).strip() for value in values]

  return values
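
The idea behind the multi-valued case can be shown without pandas. The sketch below is a plain-Python illustration with made-up cell contents; it is not a replacement for `get_unique`.

```python
# Hypothetical column with several codes per cell, separated by commas
cells = ['K50.1,K51.0', 'K50.1', None, 'L20.0, K51.0']

unique_codes = set()
for cell in cells:
    # skip missing values
    if cell is None:
        continue
    # split the cell on the separator and remove surrounding spaces
    unique_codes.update(code.strip() for code in cell.split(','))

unique_codes = sorted(unique_codes)
```

Stripping the spaces matters here: without it, ' K51.0' and 'K51.0' would be counted as two different codes.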

As long as we have a function to create the list of codes, expanding a code with star notation is a matter of iterating over the full codelist to find codes that start or end (or both) with the specified code string. Once again we create one function for a single code, and another function for when we want a list of codes to be expanded:

In [None]:
#export
# A function to expand a string with star notation (K50*) 
# to list of all codes starting with K50

@singledispatch
def expand_star(code, codelist=None):
  """
  Expand expressions with star notation to a list of all values with the specified pattern
  
  Args:
    code (str, list or dict): Expression (or list of expressions) to be expanded
    codelist (list) : A list of all codes

  Examples:
    expand_star('K50*', codelist=icd9)
    expand_star('K*5', codelist=icd9)
    expand_star('*5', codelist=icd9)

  """
  raise ValueError('The argument must be a string or a list')

@expand_star.register(str)
def _(code, codelist=None): 
  # return immediately if there is nothing to expand
  if '*' not in code:
    return [code]
 
  # only one star per expression is supported
  start_str, end_str = code.split('*')

  # apply both conditions at once; an empty string matches every code,
  # so 'K50*', '*5' and 'K*5' are all handled by the same test
  codes = {c for c in codelist if c.startswith(start_str) and c.endswith(end_str)}

  return sorted(codes)

@expand_star.register(list)
def _(code, codelist=None):
  
  expanded=[]
  for star_code in code:
    new_codes = expand_star(star_code, codelist=codelist)
    expanded.extend(new_codes)
  
  # uniqify in case some overlap
  expanded = list(set(expanded))

  return sorted(expanded)

# register the function to be used if the input is a dict with list of strings
@expand_star.register(dict)
def _(dikt, codelist=None):
  extended = {name: expand_star(exprs, codelist=codelist) for name, exprs in dikt.items()}
  return extended



Before we test this, we might as well create the other functions since they follow the same pattern. First, a function to expand a code using colons (from a given code to a given code in a codelist, including all codes in between). Second, a function to include only those codes in a codelist that follow a specified regex pattern. Lastly, we create a function that will handle everything, i.e. do all the required expansions regardless of type, and which also works when (relevant) notations are combined. Hyphen and star can be combined; regex and colon notation cannot be combined with anything else:

## Colon notation

Hyphen and star notation work fine if the codes are in the same main category (```tumor = 'C77*-C80*'```), but what if you want all codes from, say, K40 to L52? In this case we cannot use hyphen or star, and the solution is colon notation. This notation includes all codes (as specified in an input list) between two codes, including the two codes themselves (both must exist in the list). Here is the code:

In [None]:
#export
# function to get all codes in a list between the specified start and end code 
# Example: Get all codes between K40:L52

@singledispatch
def expand_colon(code, codelist=None):
  raise ValueError('The argument must be a string or a list')

@expand_colon.register(str)
def _(code, codelist=None):
  """
  Expand expressions with colon notation to a list of complete code names
  code (str or list): Expression (or list of expressions) to be expanded
  codelist (list or array) : The list to slice from

  Examples
    K50:K52
    K50.5:K52.19
    A3.0:A9.3

  Note: This is different from hyphen and star notation because it can handle 
  different code lengths and different number of decimals 

  """
  if ':' not in code:
    return [code]
  
  startstr, endstr = code.split(':')
  
  # remove spaces
  startstr = startstr.strip()
  endstr =endstr.strip()

  # find start and end position (both codes must exist in the codelist)
  codelist = list(codelist)
  startpos = codelist.index(startstr)
  endpos = codelist.index(endstr)
  
  # slice list, including the end code itself
  expanded = codelist[startpos:endpos+1]

  return expanded


@expand_colon.register(list)
def _(code, codelist=None, regex=False): 
  expanded=[]

  for cod in code:
    new_codes = expand_colon(cod, codelist=codelist)
    expanded.extend(new_codes)
  
  return expanded
# register the function to be used if the input is a dict with list of strings
@expand_colon.register(dict)
def _(dikt, codelist=None):
  extended = {name: expand_colon(exprs, codelist=codelist) for name, exprs in dikt.items()}
  return extended
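
The core of colon notation is an inclusive slice of an ordered codelist. The sketch below shows that idea on its own; the `codebook` list and the helper name `between` are made up for the illustration.

```python
# Hypothetical ordered codebook, made up for illustration
codebook = ['K40', 'K41', 'K90', 'L00', 'L52', 'L53']

def between(start, end, codelist):
    # positions of the two end points; list.index raises ValueError
    # if a code is not in the codelist
    first = codelist.index(start)
    last = codelist.index(end)
    # +1 because a slice excludes its upper bound
    return codelist[first:last + 1]

selected = between('K40', 'L52', codebook)
```

Note that the result depends entirely on the order of the codelist, so the list should be sorted in the order the codebook defines before slicing.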

## Expansion of codes based on regex

Notation using hyphen, star, and colon will often be enough to express codes efficiently, but sometimes it may also be useful to have the option of using more complex code expansion. For this purpose we could have a function that picks out all the codes from a codelist based on whether it fits the regex pattern you specify. This would allow almost all kinds of code expansions.

In [None]:
#export

# Return all elements in a list that fits a regex pattern

@singledispatch
def expand_regex(code, codelist):
  raise ValueError('The argument must be a string or a list of strings')

@expand_regex.register(str)
def _(code, codelist=None):
  code_regex = re.compile(code)
  # keep the unique codes that match the pattern from the start of the code
  expanded = list({c for c in codelist if code_regex.match(c)})
  return expanded

@expand_regex.register(list)
def _(code, codelist):  
  expanded=[]

  for cod in code:
    new_codes = expand_regex(cod, codelist=codelist)
    expanded.extend(new_codes)
  
  # uniqify in case some overlap
  expanded = sorted(list(set(expanded)))

  return expanded
# register the function to be used if the input is a dict with list of strings
@expand_regex.register(dict)
def _(dikt, codelist=None):
  extended = {name: expand_regex(exprs, codelist=codelist) for name, exprs in dikt.items()}
  return extended
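
One detail worth keeping in mind is that `re.match` only anchors the pattern at the start of the string, while `re.search` finds it anywhere. The sketch below illustrates the difference on a made-up codelist; the names are hypothetical.

```python
import re

# Hypothetical codelist, made up for illustration
codebook = ['K50.0', 'K50.1', 'K51.0', 'L50.1', 'XK50.1']

# K or L, followed by 50.1 (the dot is escaped to match a literal dot)
pattern = re.compile(r'[KL]50\.1')

# match: the pattern must fit from the first character of the code
anchored_at_start = [c for c in codebook if pattern.match(c)]

# search: the pattern may occur anywhere in the code
anywhere = [c for c in codebook if pattern.search(c)]

# match does not anchor the end: 'K50' also matches 'K50.0' and 'K50.1'
prefix_hits = [c for c in codebook if re.match(r'K50', c)]
```

If an exact match of the whole code is wanted, `re.fullmatch` or an explicit `$` at the end of the pattern would be needed.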

## A single function that does all the expansion (star, hyphen, colon, regex) and formatting (delete dots and zeros)

A list of codes may use a combination of several notations and instead of asking the user to apply all the different functions (hyphen, star, colon), we should have one function that expands and formats the codes regardless of the type of notation, but with an option of ignoring some symbols if you want (in case the star, colon or the hyphen is part of the actual code and not a notational symbol!):

In [None]:
#export
@singledispatch
def expand_code(code, codelist=None, 
                hyphen=True, star=True, colon=True, regex=False, 
                drop_dot=False, drop_leading_zero=False,
                sort_unique=True):
  raise ValueError('The argument must be a string or a list of strings')

@expand_code.register(str)
def _(code, codelist=None, 
      hyphen=True, star=True, colon=True, regex=False, 
      drop_dot=False, drop_leading_zero=False,
      sort_unique=True):
  #validating input
  if (not regex) and (':' in code) and (('-' in code) or ('*' in code)):
    raise ValueError('Notation using colon must start from and end in specific codes, not codes using star or hyphen')

  if regex:
    codes = expand_regex(code, codelist=codelist)
    return codes
  
  if drop_dot:
    code = del_dot(code)
  
  codes=[code]

  if hyphen:
    codes=expand_hyphen(code)
  if star:
    codes=expand_star(codes, codelist=codelist)
  if colon:
    codes=expand_colon(codes, codelist=codelist)

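  # drop leading zeros from the expanded codes if requested
  if drop_leading_zero:
    codes = [c.lstrip('0') for c in codes]

  if sort_unique:
    codes = sorted(list(set(codes)))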

  return codes

@expand_code.register(list)
def _(code, codelist=None, hyphen=True, star=True, colon=True, regex=False, 
      drop_dot=False, drop_leading_zero=False,
      sort_unique=True):
  
  expanded=[]

  for cod in code:
    new_codes = expand_code(cod, codelist=codelist, hyphen=hyphen, star=star, colon=colon, regex=regex, drop_dot=drop_dot, drop_leading_zero=drop_leading_zero)
    expanded.extend(new_codes)
  
  # uniqify in case some overlap
  expanded = list(set(expanded))

  return sorted(expanded)

@expand_code.register(dict)
def _(code, codelist=None, hyphen=True, star=True, colon=True, regex=False, 
      drop_dot=False, drop_leading_zero=False,
      sort_unique=True):
  # expand each named group of codes separately
  expanded = {name: expand_code(cod, codelist=codelist, 
                                hyphen=hyphen, 
                                star=star, 
                                colon=colon, 
                                regex=regex, 
                                drop_dot=drop_dot, 
                                drop_leading_zero=drop_leading_zero)
              for name, cod in code.items()}  
  return expanded
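
The order of the steps matters when notations are combined: a hyphen expression is first turned into several star expressions, and each of those is then matched against the codelist. The sketch below shows that order in a heavily simplified form. It assumes a one-letter prefix, integer codes without leading zeros and stars only at the end; the helper names and the codebook are made up.

```python
# Hypothetical codebook, made up for illustration
codebook = ['K50.0', 'K50.1', 'K51.0', 'K52.9', 'K53.0']

def expand_range(expr):
    # step 1: 'K50*-K52*' -> ['K50*', 'K51*', 'K52*']
    # (simplified: one-letter prefix, integers, no leading zeros)
    start, end = (part.strip().rstrip('*') for part in expr.split('-'))
    prefix = start[0]
    numbers = range(int(start[1:]), int(end[1:]) + 1)
    return [f'{prefix}{n}*' for n in numbers]

def expand_prefixes(patterns, codelist):
    # step 2: keep the codes that start with any of the prefixes
    prefixes = tuple(p.rstrip('*') for p in patterns)
    return sorted(c for c in codelist if c.startswith(prefixes))

step1 = expand_range('K50*-K52*')
step2 = expand_prefixes(step1, codebook)
```

Doing the star expansion first would not work, since the hyphen expression as a whole is not a prefix of any code.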


In [None]:
#export
# mark rows that contain certain codes in one or more columns
def get_rows(df, codes, cols=None, sep=None, pid='pid'):
  """
  Make a boolean series that is true for all rows that contain the codes
  
  Args
    df (dataframe or series): The dataframe with codes
    codes (str, list, set, dict): codes to search for
    cols (str or list): column or list of columns to search in
    sep (str): The symbol that separates the codes if there are multiple codes in a cell
    pid (str): The name of the column with the personal identifier

  """
    
  # string as input for single codes is allowed
  # but then must make it a list
  if isinstance(codes, str):
    codes = [codes]
  
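  # same for cols
  # must be a list since we may loop over it
  # (if no columns are specified, search in all columns)
  if cols is None:
    cols = list(df.columns)
  elif isinstance(cols, str):
    cols = [cols]
  else:
    cols = list(cols)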
  
  # approach depends on whether we have multi-value cells or not
  # if sep exist, then have multi-value cells
  if sep:
    # have multi-valued cells
    # escape the codes so that dots and stars are matched literally
    codes = [rf'\b{re.escape(code)}\b' for code in codes]
    codes_regex = '|'.join(codes)
    
    # starting point: no codes have been found
    # needed since otherwise the function might return None if no codes exist
    rows = pd.Series(False, index=df.index)

   # loop over all columns and mark when a code exist  
    for col in cols:
      rows=rows | df[col].str.contains(codes_regex, na=False)
  
  # if not multi valued cells
  else:
    mask = df[cols].isin(codes)
    rows = mask.any(axis=1)
  return rows
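
The word boundaries (`\b`) in the multi-valued branch are there to avoid partial matches inside longer codes. The sketch below illustrates the effect with the standard `re` module on made-up cell contents; it does not call `get_rows`.

```python
import re

# Hypothetical cells with one or more codes each
cells = ['K50, K51', 'K500', 'XK50', 'K51;K50']

# without boundaries, K50 is also found inside K500 and XK50
plain = re.compile('K50')
plain_hits = [bool(plain.search(cell)) for cell in cells]

# with boundaries, K50 must stand on its own
bounded = re.compile(r'\b' + re.escape('K50') + r'\b')
bounded_hits = [bool(bounded.search(cell)) for cell in cells]

# caveat: a dot is not a word character, so K50 is still found in K50.1
dotted_hit = bool(bounded.search('K50.1'))
```

The caveat in the last line means that searching for a code without decimals will also mark cells containing its dotted subcodes, which may or may not be what is wanted.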