## Identify Most Common Drugs

This notebook computes the maximum match percentage between a user's psychotropic drug combination and the psychotropic drug combinations of participants in the NHANES Survey, then selects the participants who achieve that best match. The user picks the medication categories to consider (up to 18). For each selected category, the notebook reports the medications in that category that are most common among participants taking (almost) the same psychotropic drug combination.

In [13]:
import os
import pandas as pd
import operator

Read a test case from the <code>test\_cases</code> directory.

In [2]:
root_dir = "/Users/gogrean/Documents/Insight_Fellowship/Research/Mental_Health/NHANES_Survey/"
test_case_dir = root_dir + "test_cases/"
os.chdir(test_case_dir)

test_case = "medcombo_tc3.txt"
case_name = test_case[:-4]

The user can choose to further restrict the match to include only people within 5 years of the user's age or only people of the same gender. 

A dictionary is created to store the user's data (age, gender, meds taken, categories of medications they are interested in, match preferences).

In [3]:
user = {}
user[case_name] = {'age': -1, 'gender': -1, 'meds': []}
with open(test_case) as f:
    user[case_name]['age'] = float(f.readline().strip())
    user[case_name]['gender'] = int(f.readline().strip())
    user[case_name]['meds'] = [m.strip().lower() for m in f.readlines()]
    user[case_name]['match by age'] = True
    user[case_name]['match by gender'] = False
    user[case_name]['drug categories'] = [x.lower() 
                                          for x in ['CENTRAL NERVOUS SYSTEM AGENTS', 'ANTI-INFECTIVES',
                                                    'Psychotherapeutic Agents']]

In [4]:
# good to know what the data looks like, eh?
user

{'medcombo_tc3': {'age': 51.0,
  'drug categories': ['central nervous system agents',
   'anti-infectives',
   'psychotherapeutic agents'],
  'gender': 1,
  'match by age': True,
  'match by gender': False,
  'meds': ['tranylcypromine', 'sertraline', 'alprazolam']}}
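
The parsing cell implies a simple layout for each test case file: the age on the first line, the gender code on the second, and one drug name on each remaining line. A minimal sketch of the same parsing as a reusable function follows. `parse_test_case` is a hypothetical helper, and the sample text is reconstructed from the dictionary printed above rather than copied from the actual file.

```python
import io


def parse_test_case(handle):
    """Parse a test case: age, gender code, then one drug name per line."""
    age = float(handle.readline().strip())
    gender = int(handle.readline().strip())
    # Skip blank lines so a trailing newline never becomes an empty drug name.
    meds = [line.strip().lower() for line in handle if line.strip()]
    return {"age": age, "gender": gender, "meds": meds}


# Sample contents inferred from the printed dictionary (an assumption).
sample = io.StringIO("51\n1\nTranylcypromine\nSertraline\nAlprazolam\n\n")
parsed = parse_test_case(sample)
print(parsed)
```

Filtering out blank lines matters because an empty entry in `meds` would increase `len(meds)` and lower the match percentage computed later.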

Go through the NHANES Survey and calculate the maximum psychotropic drug combo match percentage between the user and the survey participants. Store the SEQNs of the participants who best match with the user.

In [5]:
results_dir = root_dir + "results/"
os.chdir(results_dir)

m_df = pd.read_csv("filtered_NHANES_data.csv")

In [6]:
similar_seqn = []
max_n_meds_in_common = 0

for seqn in m_df["SEQN"].unique():
    # select rows in the dataframe corresponding to the survey
    # participant with the current SEQN
    seqn_data = m_df[m_df["SEQN"] == seqn]
    # get the age and the gender of the survey participant
    # TODO: This assumes the age/gender is not incorrectly entered 
    # in the database, so it should be the same in all the cells
    # in which the SEQN value is the same. Could figure out a 
    # way to deal with incorrect entries, e.g. choose the value that
    # appears most often, assuming there are more than two medications
    # and the correct values dominate?
    seqn_age = float(seqn_data["AGE"].values[0])
    seqn_gender = float(seqn_data["GENDER"].values[0])
    
    # if match_by_age is True and the survey participant with
    # the current SEQN is too old or too young compared to the 
    # user, then move on to the next survey participant
    if user[case_name]['match by age']:
        if (seqn_age > user[case_name]['age'] + 5) or (seqn_age < user[case_name]['age'] - 5):
            continue
            
    # if 'match by gender' is True and the gender of the survey
    # participant with the current SEQN does not match the 
    # gender of the user, move on to the next survey participant
    if user[case_name]['match by gender']:
        if seqn_gender != user[case_name]['gender']:
            continue
    
    # get the list of drugs taken by the survey participant with
    # the current SEQN (drug names in lowercase for easier match)
    seqn_meds = set(seqn_data["RXDDRUG"].str.lower())
    
    # Get the list of common medications taken by the survey participant
    # and by the user. Only selects psychotropic medications in common, 
    # as the user only inputs psychotropic drugs.
    meds_in_common = seqn_meds.intersection(set(user[case_name]['meds']))
    
    # Count the psychotropic meds in common and track the maximum.
    # A new maximum resets the list of best-matching participants to
    # the current SEQN; a tie with the current maximum appends the
    # current SEQN to that list.
    n_meds_in_common = len(meds_in_common)
    if n_meds_in_common > max_n_meds_in_common:
        max_n_meds_in_common = n_meds_in_common
        similar_seqn = [seqn]
    elif n_meds_in_common == max_n_meds_in_common:
        similar_seqn.append(seqn)

# multiply by the float first so the division is never integer division
max_match_percentage = 100. * max_n_meds_in_common / len(user[case_name]['meds'])
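
The loop above filters the full dataframe once per SEQN, which can be slow on a large survey file. As a sketch of an alternative, the same matching can be expressed with a single `groupby`. The toy rows below are invented for illustration; only the column names (`SEQN`, `AGE`, `RXDDRUG`) come from the notebook, and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for filtered_NHANES_data.csv (invented rows).
toy = pd.DataFrame({
    "SEQN":    [1, 1, 2, 2, 3, 4],
    "AGE":     [50, 50, 53, 53, 30, 49],
    "RXDDRUG": ["SERTRALINE", "ALPRAZOLAM", "SERTRALINE", "DOXEPIN",
                "SERTRALINE", "ALPRAZOLAM"],
})
user_age = 51.0
user_meds = {"tranylcypromine", "sertraline", "alprazolam"}

# Keep only participants within 5 years of the user's age.
in_range = toy[(toy["AGE"] - user_age).abs() <= 5].copy()
in_range["drug"] = in_range["RXDDRUG"].str.lower()

# Count the distinct user drugs each participant takes; participants
# with no shared drug get an explicit count of 0.
shared = in_range[in_range["drug"].isin(user_meds)]
n_shared = (shared.groupby("SEQN")["drug"].nunique()
            .reindex(in_range["SEQN"].unique(), fill_value=0))

max_shared = int(n_shared.max())
best_seqn = sorted(n_shared[n_shared == max_shared].index)
match_pct = 100.0 * max_shared / len(user_meds)
print(best_seqn, max_shared, round(match_pct, 1))
```

Like the loop, this keeps every participant tied at the maximum, including the case where the maximum is zero. It assumes at least one participant passes the age filter.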

Read the drug information database to determine the categories of all the drugs taken by survey respondents included in <code>similar_seqn</code>.

In [7]:
data_dir = root_dir + "data/"
os.chdir(data_dir)

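cols = ["RXDDRGID", "RXDDRUG", "RXDDCN1A", "RXDDCN1B", "RXDDCN1C"]
# Skip malformed rows silently. pandas >= 1.3 uses on_bad_lines; older
# versions need error_bad_lines=False, warn_bad_lines=False instead.
med_info = pd.read_csv("RXQ_DRUG.csv", usecols=cols, on_bad_lines="skip")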

Drug frequency for the survey participants having the best match percentage with the user will be stored in a dictionary (called <code>meds</code>) indexed by the name of the drug category. Only categories that the user is interested in will be used. 

For each survey participant who (almost) matches the user, the code identifies all the medications taken by the participant based on the RXDDRGID identifier. Using this identifier, the drug category name is read from the drug information database. For each drug, if its category is among the categories that the user is interested in, the drug name and the number of times it occurs are stored in the <code>meds</code> dictionary.

In [8]:
meds = {}
for category in user[case_name]['drug categories']:
    meds[category] = {}
for seqn in similar_seqn:
    seqn_data = m_df[m_df["SEQN"] == seqn]
    # unique (drug ID, drug name) pairs for this participant
    seqn_med_ids = set(zip(seqn_data["RXDDRGID"], seqn_data["RXDDRUG"]))
    for med_id, med_name in seqn_med_ids:
        med_id_cat = med_info[med_info["RXDDRGID"] == med_id]["RXDDCN1A"].str.lower().values[0]
        if med_id_cat in user[case_name]['drug categories']:
            meds[med_id_cat].setdefault(med_name, 0)
            meds[med_id_cat][med_name] += 1
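
The per-drug lookup above calls `.values[0]` on a filtered frame, which raises an `IndexError` if a drug ID is missing from the drug information table. One possible alternative, sketched here on invented toy data, is to join the two tables once and count with `groupby`. The drug IDs and frame names are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Invented rows for two best-matching participants.
rows = pd.DataFrame({
    "SEQN":     [1, 1, 2, 2, 2],
    "RXDDRGID": ["d001", "d002", "d001", "d003", "d999"],
    "RXDDRUG":  ["SERTRALINE", "ALPRAZOLAM", "SERTRALINE", "DOXYCYCLINE",
                 "UNKNOWN DRUG"],
})
# Invented stand-in for RXQ_DRUG.csv; "d999" is deliberately absent.
drug_info = pd.DataFrame({
    "RXDDRGID": ["d001", "d002", "d003"],
    "RXDDCN1A": ["PSYCHOTHERAPEUTIC AGENTS",
                 "CENTRAL NERVOUS SYSTEM AGENTS",
                 "ANTI-INFECTIVES"],
})
wanted = ["central nervous system agents", "psychotherapeutic agents"]

# A left join keeps drugs with no category (NaN) instead of failing.
merged = rows.drop_duplicates().merge(drug_info, on="RXDDRGID", how="left")
merged["category"] = merged["RXDDCN1A"].str.lower()
kept = merged[merged["category"].isin(wanted)]

# Number of participants taking each drug, per category of interest.
counts = kept.groupby(["category", "RXDDRUG"]).size()
print(counts)
```

Drugs without a matching category simply drop out at the `isin` filter, so an incomplete lookup table no longer stops the run.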

The <code>meds</code> dictionary stores raw counts, which depend on how many survey participants share the best match percentage with the user. The code therefore converts each count to a percentage of those participants.

In [9]:
for category in user[case_name]['drug categories']:
    for med in meds[category]:
        meds[category][med] /= len(similar_seqn) / 100.
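
Dividing by `len(similar_seqn) / 100.` is the same as multiplying the count by 100 and dividing by the number of participants. A small worked sketch makes the arithmetic explicit. `to_percentages` is a hypothetical helper, and the counts are inferred from the printed results (5 participants, with drugs at 100%, 40%, and 20%).

```python
def to_percentages(counts, n_participants):
    """Convert per-drug counts to percentages of the matched participants."""
    return {name: 100.0 * count / n_participants
            for name, count in counts.items()}


# Counts inferred from the printed results (an assumption).
example = to_percentages({"alprazolam": 5, "carisoprodol": 2, "baclofen": 1}, 5)
print(example)
```

Returning a new dictionary rather than updating in place also means the cell can be re-run without dividing the same values twice.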

Print the results. 

TODO: Super temporary way of doing this, until I figure out the app...

In [35]:
if len(similar_seqn) <= 5:
    print("WARNING: Only 5 or fewer survey participants use a similar combination of psychotropic drugs!\n")
    
print("%i participants in the NHANES survey were found to use %i of the %i psychotropic drugs selected.\n\n" 
      %(len(similar_seqn), max_n_meds_in_common, len(user[case_name]['meds'])))

for category in meds:
    print(category.upper())
    sorted_meds_alphabetical = sorted(meds[category].items(), 
                             key=operator.itemgetter(0))
    sorted_meds = sorted(sorted_meds_alphabetical, 
                         key=operator.itemgetter(1), 
                         reverse=True)
    for med_name, p in sorted_meds:
        print('     %s     %.1f' %(med_name.upper(), p))
    print("\n")


5 participants in the NHANES survey were found to use 2 of the 3 psychotropic drugs selected.


PSYCHOTHERAPEUTIC AGENTS
     SERTRALINE     100.0
     ZIPRASIDONE     20.0


ANTI-INFECTIVES
     DOXYCYCLINE     20.0


CENTRAL NERVOUS SYSTEM AGENTS
     ALPRAZOLAM     100.0
     ACETAMINOPHEN; HYDROCODONE     40.0
     CARISOPRODOL     40.0
     ACETAMINOPHEN; PROPOXYPHENE     20.0
     BACLOFEN     20.0
     CYCLOBENZAPRINE     20.0
     DICLOFENAC     20.0
     DICLOFENAC; MISOPROSTOL     20.0
     DOXEPIN     20.0
     ESZOPICLONE     20.0
     PRAMIPEXOLE     20.0
     PROPOXYPHENE     20.0
     TRAMADOL     20.0


