In [15]:
import pandas as pd
import scispacy
import spacy
nlp = spacy.load("en_core_sci_md")

In [5]:
data = pd.read_csv("../data/processed/drug_table_clean.csv", nrows=1000, low_memory=False)

In [6]:
data.head()

Unnamed: 0,primaryid,caseid,drug_seq,role_cod,drugname,val_vbm,route,nda_num,origin,prod_ai
0,34483284,3448328,1,PS,SUSTIVA,1,TRANSPLACENTAL,20972.0,drug12q4.txt,
1,34483284,3448328,2,SS,NEVIRAPINE,1,TRANSPLACENTAL,,drug12q4.txt,
2,34483284,3448328,3,SS,VIRACEPT,1,TRANSPLACENTAL,,drug12q4.txt,
3,34483284,3448328,4,SS,COMBIVIR,1,TRANSPLACENTAL,,drug12q4.txt,
4,34483284,3448328,5,SS,RETROVIR,1,TRANSPLACENTAL,,drug12q4.txt,


In [34]:
doc = nlp("humira200-300-600 mg capsule REMICADE release")
# print("TEXT", "START", "END", "ENTITY TYPE")
# for ent in doc.ents:
#     print(ent.text, ent.start_char, ent.end_char, ent.label_)
for token in doc:
    print(token)

humira200
-
300
-
600
mg
capsule
REMICADE
release


In [35]:
# load in a list of drugnames to be mapped

# START NORMALIZATION PIPELINE
# tokenize (+lemmatize)
# lowercase everything
# expand all abbreviations
# numeric formatting: split numbers that are directly attached to letters (2mg -> 2 mg) *token-splitting
# remove salt forms if and only if exactly one salt is detected
# sort tokens in ascending order (. 0-9 a-z)
# END NORMALIZATION PIPELINE
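
# A minimal sketch of the normalization steps listed above, using only the standard library.
# Assumptions: the function name, the regex and both lookup tables are hypothetical placeholders,
# lemmatization is left out, and only ASCII letters are handled.
```python
import re

# Hypothetical, deliberately tiny lookup tables (not from the notebook);
# real ones would come from a curated abbreviation list and salt dictionary.
ABBREVIATIONS = {"tab": "tablet", "cap": "capsule", "inj": "injection"}
SALT_FORMS = {"hydrochloride", "sulfate", "acetate", "maleate"}

# Runs of letters or numbers (with optional decimals); matching them separately
# also splits fused tokens, e.g. "2mg" -> "2", "mg".
TOKEN_PATTERN = re.compile(r"[a-z]+|\d+(?:\.\d+)?")


def normalize_drugname(raw: str) -> str:
    """Lowercase, token-split, expand abbreviations, drop a lone salt, sort."""
    tokens = TOKEN_PATTERN.findall(raw.lower())
    tokens = [ABBREVIATIONS.get(token, token) for token in tokens]
    salts = [token for token in tokens if token in SALT_FORMS]
    if len(salts) == 1:  # remove the salt form only when exactly one is detected
        tokens.remove(salts[0])
    return " ".join(sorted(tokens))  # code-point order: digits before letters


print(normalize_drugname("humira200-300-600 mg capsule REMICADE release"))
```
# Sorting makes the result independent of word order, which helps when comparing free-text entries,
# but it also discards the order information that a name such as "200-300-600" may carry.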

# START DRUG NER PIPELINE


# extra note:
# Remember: for aggressive spelling correction we can try the ING algorithm that compares a list to itself and uses matrix
# operations to compute within-list similarity efficiently. One aggressive approach would be to spell-correct a drug name
# to its strongest match, provided the score is above a certain threshold.
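
# A rough illustration of the "strongest match above a threshold" idea, not the ING implementation:
# it scores character-trigram cosine similarity pairwise in plain Python. An efficient version
# would replace the double loop with one sparse matrix product. The function names and the 0.6
# threshold are assumptions chosen for this sketch.
```python
from collections import Counter
from math import sqrt


def trigram_vector(name: str) -> Counter:
    """Count character trigrams of a padded, lowercased name."""
    padded = f"  {name.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def strongest_matches(names, threshold=0.6):
    """Map each name to its best-scoring other name, if the score reaches the threshold."""
    vectors = [trigram_vector(name) for name in names]
    best = {}
    for i, name in enumerate(names):
        scored = [
            (cosine(vectors[i], vectors[j]), names[j])
            for j in range(len(names))
            if j != i
        ]
        if not scored:
            continue
        score, match = max(scored)
        if score >= threshold:
            best[name] = (match, score)
    return best


matches = strongest_matches(["nevirapine", "nevirapin", "sustiva", "retrovir"])
for name, (match, score) in matches.items():
    print(f"{name} -> {match} ({score:.2f})")
```
# The sketch only pairs names up; deciding which member of a pair is the correct spelling
# still needs a reference dictionary or a frequency count.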

# extra note:
# In RxNorm, matching clinical drug names seems to depend on several features:
# the route, the dose form and the strength in particular.
# We could concatenate information from multiple columns to supply these, but I do not know whether this would still
# help once non-RxNorm dictionaries are incorporated. International drug names could be an issue.
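
# A sketch of the column concatenation idea, assuming one table row is available as a dict.
# dose_amt, dose_unit and dose_form are column names from the notes further down and are not in
# the table loaded above. The helper names and the field order are assumptions, and the endpoint
# and its parameters should be checked against the RxNav documentation. No request is sent here.
```python
from urllib.parse import urlencode

# RxNav approximate-match endpoint (verify against the current RxNav docs).
RXNAV_APPROX_URL = "https://rxnav.nlm.nih.gov/REST/approximateTerm.json"

# Assumed field order: name, strength, unit, dose form, route.
TERM_FIELDS = ("drugname", "dose_amt", "dose_unit", "dose_form", "route")


def build_term(row: dict) -> str:
    """Join the non-empty fields of one row into a single search term."""
    parts = []
    for field in TERM_FIELDS:
        value = row.get(field)
        text = "" if value is None else str(value).strip()
        if text and text.lower() != "nan":  # skip missing values, including pandas NaN
            parts.append(text)
    return " ".join(parts)


def build_request_url(row: dict, max_entries: int = 5) -> str:
    """Build the approximate-match request URL without sending it."""
    query = urlencode({"term": build_term(row), "maxEntries": max_entries})
    return f"{RXNAV_APPROX_URL}?{query}"


example_row = {"drugname": "SUSTIVA", "dose_amt": 600, "dose_unit": "MG",
               "dose_form": "TABLET", "route": None}
print(build_request_url(example_row))
```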

# In scientific drug literature, generic drug names are apparently written in full lowercase, while brand (trade) names
# start with a capital letter (e.g. Lipitor (trade) vs atorvastatin (generic)). Exceptions to this exist, of course.

# Alright, final decision time:
# I will approach this project by transforming the FAERS table data into the form that best matches
# the input expected by RxNav's approximate matching API call.
# This data transformation will be the crux of my research.
# Subsequently, we can take examples from RxNav and compare the scores they were given with
# new scoring methods as an additional exploration step.
# Anything beyond that, including the ideas I left out, is deferred to future research.
# The end.
# Some ideas for this:
# We can do pattern-based matching for drugname entries that already satisfy the format that approximate match expects.
# We can use spaCy to detect entities that describe quantity and strength, or mode of administration.
# This has been demonstrated in the example linked below, which uses pretrained clinical NER models.
# Here is the URL of the example: https://gbnegrini.com/post/biomedical-text-nlp-scispacy-named-entity-recognition-medical-records/#named-entity-recognition
# Based on this information, we can now use the following columns: ["drugname", "prod_ai", "nda_num", "route",
# "val_vbm(?)", "dose_vbm(?)", "dose_amt", "dose_unit", "dose_form", "origin(for analysis purposes)"]
# Doing this for every row is tedious, considering there are almost 100m rows...
# Instead, we will again rank drugname values by their number of entries to get our unique samples.
# We then attempt matching only on those samples; otherwise it will get too slow, I think.
# A full run can be left for future research with the code I leave behind.
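
# A sketch of the "match unique names only" plan, using collections.Counter so that it runs
# without the table. On the DataFrame, data["drugname"].value_counts() would give a comparable
# ranking. The function names, the upper-casing step and the matcher callable are assumptions.
```python
from collections import Counter


def clean(name) -> str:
    """Strip and upper-case a raw drugname; missing values become an empty string."""
    return str(name).strip().upper() if name is not None else ""


def rank_unique_names(drugnames, top_n=None):
    """Return (name, count) pairs, most frequent first."""
    counts = Counter(clean(name) for name in drugnames if clean(name))
    return counts.most_common(top_n)


def match_unique_then_broadcast(drugnames, matcher):
    """Call the (possibly slow) matcher once per unique name, then map results back to rows."""
    lookup = {name: matcher(name) for name, _ in rank_unique_names(drugnames)}
    return [lookup.get(clean(name)) for name in drugnames]


names = ["SUSTIVA", "NEVIRAPINE", "sustiva ", "VIRACEPT", "SUSTIVA"]
print(rank_unique_names(names))
# str.title stands in for a real matcher, such as an approximate-match lookup.
print(match_unique_then_broadcast(names, matcher=str.title))
```
# Passing top_n limits the run to the most frequent names, which keeps an exploratory pass fast
# while a full run is left for later.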