## Filter Psychotropic Drug Users

The project uses [NHANES Survey](https://www.cdc.gov/nchs/nhanes/index.htm) data collected between 2001 and 2014. The information is split across multiple files: for example, the demographic data is in one file, while the data about prescription medication taken by respondents is in another. This notebook gathers into a single file the prescription medication records of respondents who take at least one psychotropic medication. Doing this once avoids the time-consuming step of filtering the dataframes each time the analysis is run.

The file containing the filtered data has the following fields:
* SEQN: Respondent ID
* RXDDRUG: Generic drug name
* RXDDRGID: Generic drug code
* RXDDCN1A - RXDDCN1C: Multum drug therapeutic category names
* GENDER: Gender (1 = male, 2 = female, NaN = missing)
* AGE: Age in years, between 0 and 80 (respondents older than 80 have their age listed as 80)

In [1]:
import glob
import os
import warnings
import numpy as np
import pandas as pd

# Silences the pandas SettingWithCopyWarning that is triggered
# when values are assigned into the filtered dataframes below.
pd.options.mode.chained_assignment = None  # default='warn'

The <code>data</code> directory contains subdirectories for data taken between 2001-2002, 2003-2004, 2005-2006, 2007-2008, 2009-2010, 2011-2012, and 2013-2014. Each of these subdirectories contains the following <code>.csv</code> files used in this notebook:
* <code>RXQ\_RX_*</code>: information on prescription medication taken by respondents;
* <code>DEMO\_*</code>: demographic information.

The <code>data</code> directory also contains a file named <code>RXQ\_DRUG.csv</code>, which has information about the individual drugs and the drug categories they belong to.

In [2]:
root_dir = "/Users/gogrean/Documents/Insight_Fellowship/Research/Mental_Health/NHANES_Survey/"
data_dir = root_dir + "data/"
os.chdir(data_dir)

med_files = zip(sorted(glob.glob("????-????/RXQ_RX_*.csv")),
                sorted(glob.glob("????-????/DEMO_*.csv")))
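
Zipping two sorted glob results pairs the files correctly only if every survey cycle has exactly one file of each kind; a missing file would silently shift the pairing. A minimal sketch of a stricter pairing, assuming the directory layout described above (the helper name `pair_cycle_files` is hypothetical, not part of the original notebook):

```python
import glob
import os


def pair_cycle_files(data_dir="."):
    """Return (prescription_file, demographic_file) pairs, one per cycle.

    Raises ValueError if a cycle directory does not contain exactly
    one RXQ_RX_* file and exactly one DEMO_* file.
    """
    pairs = []
    # Cycle directories are named like "2001-2002".
    for cycle_dir in sorted(glob.glob(os.path.join(data_dir, "????-????"))):
        rx = sorted(glob.glob(os.path.join(cycle_dir, "RXQ_RX_*.csv")))
        demo = sorted(glob.glob(os.path.join(cycle_dir, "DEMO_*.csv")))
        if len(rx) != 1 or len(demo) != 1:
            raise ValueError(
                "expected one RXQ_RX_* and one DEMO_* file in %s" % cycle_dir
            )
        pairs.append((rx[0], demo[0]))
    return pairs
```

Calling `pair_cycle_files(data_dir)` would return a list that can be iterated more than once, unlike a `zip` object in Python 3.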

The information from the different files is read in, avoiding columns that are irrelevant for the analysis. The information about prescription drugs (in <code>RXQ\_DRUG.csv</code>) is standard and does not change from year to year. The data in the other seven pairs of files is concatenated to create two large dataframes: one containing all the demographic information and the other containing all the information on prescription medication taken by the participants.

In [3]:
# Only these columns are used in the analysis.
cols_m = ["SEQN", "RXDDRUG", "RXDDRGID"]
cols_demo = ["SEQN", "RIAGENDR", "RIDAGEYR", "RIDAGEMN"]
cols_m_info = ["RXDDRGID", "RXDDCN1A", "RXDDCN1B", "RXDDCN1C"]

# Passing on_bad_lines="skip" avoids some errors and warnings caused
# by entries written in the incorrect fields (e.g. the dataframe has
# 14 columns, but in some files data is entered in a 15th or a 16th
# column by mistake). Only the expected columns are read, which
# requires some later hacking to deal with incorrect entries.
# NOTE: on_bad_lines requires pandas >= 1.3; older versions used
# error_bad_lines=False, warn_bad_lines=False for the same purpose.
m_info_df = pd.read_csv("RXQ_DRUG.csv", usecols=cols_m_info,
                        on_bad_lines="skip")

m_dfs = []
demo_dfs = []
for m, demo in med_files:
    new_m_df = pd.read_csv(m, on_bad_lines="skip", usecols=cols_m,
                           dtype={"SEQN": int, "RXDDRUG": str, "RXDDRGID": str})
    new_demo_df = pd.read_csv(demo, usecols=cols_demo)
    m_dfs.append(new_m_df)
    demo_dfs.append(new_demo_df)
    
m_df = pd.concat(m_dfs, ignore_index=True)
m_df.dropna(inplace=True)
demo_df = pd.concat(demo_dfs, ignore_index=True)

Next I get the psychotropic medications that appear in the Survey. I choose to use all drugs classified as PSYCHOTHERAPEUTIC AGENTS as well as anti-anxiety drugs (classified as CENTRAL NERVOUS SYSTEM AGENTS -> ANXIOLYTICS). Each drug has a unique RXDDRGID identifier, which links the dataframe containing the prescription medication to the dataframe containing the drug information. Therefore, to find respondents using psychotropic medication, I create a set of the unique codes of these drugs.

In [4]:
filtered_m_info_df = m_info_df[(m_info_df["RXDDCN1A"] == "PSYCHOTHERAPEUTIC AGENTS") |
                               (m_info_df["RXDDCN1B"] == "ANXIOLYTICS")]
unique_psychotropic_rxddrgid = set(filtered_m_info_df["RXDDRGID"])
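
As a toy illustration of the category filter, here is a minimal sketch on a made-up drug table. The drug codes (`x00001`, ...) and rows are invented for the example, and `psychotropic_ids` is a hypothetical helper name. The comparison is an exact string match, so the category labels must be spelled exactly as they appear in <code>RXQ\_DRUG.csv</code>.

```python
import pandas as pd

# Toy drug-information table; the codes and rows are made up.
toy_info = pd.DataFrame({
    "RXDDRGID": ["x00001", "x00002", "x00003"],
    "RXDDCN1A": ["PSYCHOTHERAPEUTIC AGENTS",
                 "CENTRAL NERVOUS SYSTEM AGENTS",
                 "CARDIOVASCULAR AGENTS"],
    "RXDDCN1B": ["ANTIDEPRESSANTS", "ANXIOLYTICS", None],
})


def psychotropic_ids(info_df):
    """Return the set of drug codes in the two categories of interest."""
    is_psychotherapeutic = info_df["RXDDCN1A"] == "PSYCHOTHERAPEUTIC AGENTS"
    is_anxiolytic = info_df["RXDDCN1B"] == "ANXIOLYTICS"
    # Missing categories compare as False, so they never match.
    return set(info_df.loc[is_psychotherapeutic | is_anxiolytic, "RXDDRGID"])


toy_ids = psychotropic_ids(toy_info)
```

Checking the size of the resulting set against the drug table is a quick way to confirm that both labels matched at least one row.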

With the codes of the psychotropic drugs in hand, I can find the users who take at least one of these drugs. The filtered list of SEQN identifiers will be used to gather data for the relevant users into a single dataframe that will later be saved to a new <code>.csv</code> file.

In [5]:
# This is the hack that deals with information that was
# entered in the incorrect fields (by being mistakenly
# shifted by 1-2 columns). This mistake affects only the
# RXDDRGID field. Normally, drug codes consist of a letter
# followed by a numerical sequence. However, in rows where
# information was shifted, drug codes are purely numeric
# (because data from the previous columns is of type int).
# RXDDRGID is read as a string, so to identify incorrect
# entries, I check whether the string parses as an integer;
# if it does, then the entry is incorrect, otherwise correct.
def is_wrong_col(rx_id):
    try:
        int(rx_id)
    except ValueError:
        return False
    return True

filtered_seqn = []
for s in m_df["SEQN"].unique():
    ignore_this_seqn = False
    # Find the codes of all the drugs taken by a participant
    # with a certain SEQN.
    rxd_ids = set(m_df[m_df["SEQN"] == s]["RXDDRGID"])
    # Go through the drug codes and check that none of them
    # parses as an integer (see hacky function above).
    for rx_id in rxd_ids:
        if is_wrong_col(rx_id):
            ignore_this_seqn = True
            break
    # If any of the drugs taken by the participant with this
    # SEQN has an incorrect code, the data for this participant
    # is ignored. Otherwise, the SEQN is added to the list of
    # SEQNs considered in the analysis, provided the participant
    # takes at least one psychotropic drug.
    if not ignore_this_seqn:
        if rxd_ids.intersection(unique_psychotropic_rxddrgid):
            filtered_seqn.append(s)
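
The loop above rescans `m_df` once per SEQN, which can be slow on large tables. A possible vectorized equivalent is sketched below, assuming the same column names. `looks_shifted` and `psychotropic_seqns` are hypothetical names, and the toy codes are made up. Unlike the loop, it returns the SEQNs in sorted order rather than in order of first appearance.

```python
import pandas as pd


def looks_shifted(rx_id):
    """True when a drug code parses as an integer (a shifted-column entry)."""
    try:
        int(rx_id)
    except ValueError:
        return False
    return True


def psychotropic_seqns(m_df, psychotropic_ids):
    """SEQNs with at least one psychotropic drug and no shifted codes."""
    flags = pd.DataFrame({
        "SEQN": m_df["SEQN"],
        "bad": m_df["RXDDRGID"].map(looks_shifted),
        "psy": m_df["RXDDRGID"].isin(psychotropic_ids),
    })
    # One row per respondent: does any of their records carry the flag?
    per_seqn = flags.groupby("SEQN").any()
    keep = per_seqn[per_seqn["psy"] & ~per_seqn["bad"]]
    return keep.index.tolist()


# Toy prescription table; the drug codes are made up.
toy_m_df = pd.DataFrame({
    "SEQN": [1, 1, 2, 3, 3, 4],
    "RXDDRGID": ["x00001", "x00002", "x00003", "x00001", "12345", "x00004"],
})
toy_seqns = psychotropic_seqns(toy_m_df, {"x00001"})
```

In the toy table, respondent 1 is kept, respondent 3 is dropped because one of their codes is purely numeric, and respondents 2 and 4 take no psychotropic drug.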

In [6]:
# First step filtering to exclude respondents whose SEQN is 
# not in the list of filtered SEQNs.
filtered_m_df = m_df[m_df["SEQN"].isin(filtered_seqn)]

# For each row in the dataframe filtered above...
for index, row in filtered_m_df.iterrows():
    # (1) find the RXDDRGID of the medication in the row;
    # (2) determine the medication categories (RXDDCN1A-RXDDCN1C) 
    # that the RXDDRGID corresponds to using the drug info file
    # (3) set the medication categories in the filtered dataframe;
    for keyword in ["RXDDCN1A", "RXDDCN1B", "RXDDCN1C"]:
        filtered_m_df.loc[index, keyword] = m_info_df[m_info_df["RXDDRGID"] == row["RXDDRGID"]][keyword].values[0]
    # (4) set the gender using data from the demographic file, which 
    # is linked to the prescription drug data via the SEQN keyword;
    filtered_m_df.loc[index, "GENDER"] = demo_df[demo_df["SEQN"] == row["SEQN"]]["RIAGENDR"].values[0]
    # (5) set the age of the respondent using data from the 
    # demographic file
    # NOTE: The column RIDAGEMN has the age in months, which is more accurate 
    # than the age in years (RIDAGEYR) as RIDAGEYR is an integer. However, 
    # RIDAGEMN is sometimes empty, so it can't be used to calculate the age; 
    # in these cases, the age in years is used from the RIDAGEYR field.
    try:
        age = int(demo_df[demo_df["SEQN"] == row["SEQN"]]["RIDAGEMN"].values[0]) / 12.
    except ValueError:
        age = int(demo_df[demo_df["SEQN"] == row["SEQN"]]["RIDAGEYR"].values[0])
    # In all the datasets, participants older than 80-85 years have their age 
    # set to 80 or 85 (you know you are old when it doesn't matter exactly how 
    # old you are other than REALLY OLD ;-) ). The upper cap depends on the year 
    # of the survey: 85 years for earlier surveys and 80 years for more recent 
    # ones. To easily handle this difference between datasets, all respondents
    # older than 80 have their age set to 80.
    if age > 80:
        age = 80
    filtered_m_df.loc[index, "AGE"] = age
# Make the values in the GENDER column integers. Probably not necessary, but
# prettier to have it that way.
filtered_m_df["GENDER"] = filtered_m_df["GENDER"].astype(int, copy=False)
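
The row-by-row loop above can also be expressed as two joins. The following is a minimal sketch under the assumption that each RXDDRGID and each SEQN should contribute a single row (the first match, as with `.values[0]` above). The function name and the toy data are hypothetical. One behavioral difference: a left join leaves NaN where no match exists, whereas the loop would raise an `IndexError`.

```python
import pandas as pd


def attach_categories_and_demographics(rx_df, info_df, demo_df, age_cap=80):
    """Add RXDDCN1A-C, GENDER and AGE columns to the prescription rows."""
    demo = demo_df.drop_duplicates("SEQN").copy()
    # Prefer the age in months; fall back to the age in years when missing.
    age = (demo["RIDAGEMN"] / 12.0).fillna(demo["RIDAGEYR"])
    demo["AGE"] = age.clip(upper=age_cap)
    demo = demo.rename(columns={"RIAGENDR": "GENDER"})[["SEQN", "GENDER", "AGE"]]
    info = info_df.drop_duplicates("RXDDRGID")
    out = rx_df.merge(info, on="RXDDRGID", how="left")
    return out.merge(demo, on="SEQN", how="left")


# Toy inputs; all codes and values are made up.
toy_rx = pd.DataFrame({"SEQN": [1, 2],
                       "RXDDRUG": ["DRUG A", "DRUG B"],
                       "RXDDRGID": ["x00001", "x00002"]})
toy_drug_info = pd.DataFrame({"RXDDRGID": ["x00001", "x00002"],
                              "RXDDCN1A": ["PSYCHOTHERAPEUTIC AGENTS",
                                           "CENTRAL NERVOUS SYSTEM AGENTS"],
                              "RXDDCN1B": ["ANTIDEPRESSANTS", "ANXIOLYTICS"],
                              "RXDDCN1C": [None, None]})
toy_demo = pd.DataFrame({"SEQN": [1, 2],
                         "RIAGENDR": [1, 2],
                         "RIDAGEYR": [30, 85],
                         "RIDAGEMN": [366.0, float("nan")]})
toy_out = attach_categories_and_demographics(toy_rx, toy_drug_info, toy_demo)
```

In the toy output, the first respondent's age comes from RIDAGEMN (366 months, i.e. 30.5 years), and the second falls back to RIDAGEYR and is capped at 80.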

Save the filtered dataset in a <code>.csv</code> file, so that the filtering does not need to be repeated until new data is added to the NHANES survey. (This could be done more elegantly by appending rows when data is added, which would save significant time, but that won't be necessary for this project.)

In [8]:
filtered_m_df.to_csv(root_dir + "results/filtered_NHANES_data.csv", index=False)
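
One way an analysis notebook might reuse the saved file is sketched below: load the cached `.csv` if it exists, otherwise run the filtering and save the result. `load_or_build` and its `build` argument are hypothetical names, not part of the original project.

```python
import os

import pandas as pd


def load_or_build(cache_path, build):
    """Read cache_path if present; otherwise call build() and cache it.

    build is any zero-argument function that returns a dataframe,
    e.g. a function wrapping the filtering steps of this notebook.
    """
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = build()
    df.to_csv(cache_path, index=False)
    return df
```

Deleting the cached file forces the filtering to run again, for instance after a new survey cycle is added to the <code>data</code> directory.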