<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Remove-durg-interference-(Remove-Drug-comfoundings)" data-toc-modified-id="Remove-durg-interference-(Remove-Drug-comfoundings)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Remove drug interference (remove drug confounding)</a></span></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Drug-AE-pairs-significantly-associated-with-pandemic" data-toc-modified-id="Drug-AE-pairs-significantly-associated-with-pandemic-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Drug-AE pairs significantly associated with pandemic</a></span></li><li><span><a href="#Drug-significantly-associated-with-AE-during-pandemic" data-toc-modified-id="Drug-significantly-associated-with-AE-during-pandemic-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Drug significantly associated with AE during pandemic</a></span></li><li><span><a href="#AEs-that-satisfy-both-significance-check" data-toc-modified-id="AEs-that-satisfy-both-significance-check-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>AEs that satisfy both significance checks</a></span></li></ul></div>

# Remove drug interference (remove drug confounding)


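For each top-N SE, we find the drugs reported together with it, then count the occurrences and non-occurrences of each drug-SE pair in 2019 and 2020. From these counts we compute the reporting odds ratio (ROR) and its p-value with Fisher's exact test, then apply a multiple-testing correction to the p-values (Bonferroni in the code below) to decide which pairs are significant.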
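
As a minimal sketch of this procedure, the snippet below runs Fisher's exact test on a few 2x2 tables and applies a Bonferroni correction by hand. The counts and pair names are made up for illustration and are not taken from the dataset. The manual correction (multiply each p-value by the number of tests and cap at 1) is assumed here to match what `multipletests(..., method='bonferroni')` does in the notebook.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical counts per pair: (A, B, C, D) =
# (pair reports 2020, other reports 2020, pair reports 2019, other reports 2019)
toy_counts = {
    "pair_1": (40, 960, 10, 990),
    "pair_2": (12, 988, 11, 989),
    "pair_3": (5, 995, 20, 980),
}

rors, p_values = [], []
for a, b, c, d in toy_counts.values():
    # fisher_exact returns the sample odds ratio (a*d)/(b*c) and a two-sided p-value
    ror, p = stats.fisher_exact([[a, b], [c, d]])
    rors.append(ror)
    p_values.append(p)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_values = np.array(p_values)
p_corrected = np.minimum(p_values * len(p_values), 1.0)
sig = p_corrected <= 0.05

for name, ror, p_adj, keep in zip(toy_counts, rors, p_corrected, sig):
    print(name, round(ror, 3), p_adj, keep)
```

With these toy counts, `pair_1` has an ROR of 4.125 and stays significant after correction, while `pair_2` does not.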

We check two aspects. First, the adverse event (such as hallucination) should be significantly associated with treatment by at least one drug (such as Pimavanserin). Second, the resulting drug-adverse event pair (such as Pimavanserin-hallucination) should be significantly associated with the pandemic.
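
A minimal sketch of how the two checks combine, assuming each check yields a table of significant (SE, drug) pairs: an inner join on both keys keeps only the pairs that pass both, and the surviving SEs are read off the result. The identifiers below are placeholders, not real codes.

```python
import pandas as pd

# Hypothetical outputs of the two checks (IDs are placeholders)
pandemic_sig = pd.DataFrame({"SE": ["se_1", "se_1", "se_2"],
                             "drug": ["drug_a", "drug_b", "drug_c"]})
drug_sig = pd.DataFrame({"SE": ["se_1", "se_3"],
                         "drug": ["drug_a", "drug_c"]})

# Keep only the pairs present in both tables
both = pd.merge(pandemic_sig, drug_sig, how="inner", on=["SE", "drug"])

# An SE is kept if at least one of its pairs passes both checks
kept_se = sorted(set(both["SE"]))
print(kept_se)
```

Because the join is symmetric, the order in which the two checks are run does not affect which pairs survive.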

**Exchanging the order of step 3 and step 4 does not change the results.**

**This section is computationally expensive because it contains nested loops.**

# Load data

In [5]:
import itertools
import pickle
import warnings
from collections import Counter

import faiss
import numpy as np
import pandas as pd
import scipy.stats as stats
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler
from statsmodels.stats.multitest import multipletests
from tqdm import tqdm

# %matplotlib notebook
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

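# Calculate the 95% confidence interval of the ROR
def weird_division(n, d):
    # Return 0 instead of raising when the denominator is 0
    return n / d if d else 0

def CI(ROR, A, B, C, D):
    # The standard error of log(ROR) is sqrt(1/A + 1/B + 1/C + 1/D)
    ror = np.log(ROR)
    sq = 1.96*np.sqrt(weird_division(1, A) + weird_division(1, B) + weird_division(1, C) + weird_division(1, D))
    CI_up = np.exp(ror + sq)
    CI_down = np.exp(ror - sq)
    return CI_up, CI_down

def format_tex(float_number):
    # Format a number as a LaTeX upper bound of the form "$< m\times10^{e}$"
    if float_number == 0:  # log10(0) is undefined, so handle zero first
        return r"$< 0 \times10^{0}$"
    exponent = np.floor(np.log10(float_number))
    mantissa = float_number/10**exponent
    mantissa_format = str(mantissa)[0:3]
    # Raw string, so that the "\t" in "\times" is not read as a tab character
    return r"$< {0}\times10^{{{1}}}$".format(mantissa_format, str(int(exponent)))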
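
For reference, the interval computed by `CI` is the usual Wald interval on the log scale, exp(ln(ROR) ± 1.96·sqrt(1/A + 1/B + 1/C + 1/D)). The following is a small self-contained sketch with made-up counts. `ror_ci` is a hypothetical helper, not part of the notebook, and it assumes all four cells are nonzero (the notebook's `weird_division` instead substitutes 0 for an undefined 1/0 term).

```python
import numpy as np

def ror_ci(a, b, c, d, z=1.96):
    """Return (ROR, lower, upper) for a 2x2 table whose cells are all > 0."""
    ror = (a * d) / (b * c)
    se_log = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(ROR)
    lower = np.exp(np.log(ror) - z * se_log)
    upper = np.exp(np.log(ror) + z * se_log)
    return ror, lower, upper

# Made-up counts: 40 vs 960 in one period, 10 vs 990 in the other
ror, lower, upper = ror_ci(40, 960, 10, 990)
print(round(ror, 3), round(lower, 3), round(upper, 3))
```

The interval is symmetric around ln(ROR) on the log scale, so it is asymmetric around the ROR itself. Note that `CI` returns the bounds in the order (upper, lower).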

In [6]:
all_pd_US_pro = pickle.load(open('../Data/pandemic/all_pd_US_pro.pk', 'rb'))
SE_code_dic = pickle.load(open('../Data/parsed/SE_code_dic.pk', 'rb'))
drug_code_dic = pickle.load(open('../Data/parsed/drug_code_dic.pk', 'rb'))

In [8]:
# In this MedDRA dictionary, each key is a PT_name string and each value is a list:
# [PT, PT_name, HLT,HLT_name,HLGT,HLGT_name,SOC,SOC_name,SOC_abbr]
# medra_se_disease_dic = pickle.load(open('../Data/curated/AE_dic.pk', 'rb'))
MedDRA_dic_all = pickle.load(open('../Data/curated/AE_mapping.pk', 'rb'))
drug_dic = pickle.load(open('../Data/curated/drug_dic.pk', 'rb'))
SE_dic = pickle.load(open('../Data/curated/AE_mapping.pk', 'rb'))

drug_dic_pd = pd.DataFrame(drug_dic, index=['Drugbank_ID', 'code'])
drug_dic_pd = drug_dic_pd.T

# Drug-AE pairs significantly associated with pandemic
Some SEs from the top-AE list are missing from the final dataframe because they have no significantly associated drug.
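
As a rough illustration of the counts that feed each Fisher test in this section, the sketch below builds the 2x2 table for one drug-SE pair from toy reports. The codes, the `year` column, and the `pair_counts` helper are hypothetical; the notebook derives the year from `receipt_date` instead.

```python
import pandas as pd

# Toy reports: each row lists the SE and drug codes on one report (placeholders)
reports = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "SE": [["se_1"], ["se_2"], ["se_1", "se_2"],
           ["se_1"], ["se_1"], ["se_2"], ["se_1"]],
    "drugs": [["drug_a"], ["drug_a"], ["drug_b"],
              ["drug_a"], ["drug_a", "drug_b"], ["drug_a"], ["drug_b"]],
})

def pair_counts(df, se, drug):
    """Return (reports with the pair, reports without the pair) for df."""
    has_pair = [se in s and drug in d for s, d in zip(df["SE"], df["drugs"])]
    n_pair = sum(has_pair)
    return n_pair, len(df) - n_pair

a, b = pair_counts(reports[reports["year"] == 2020], "se_1", "drug_a")
c, d = pair_counts(reports[reports["year"] == 2019], "se_1", "drug_a")
table = [[a, b], [c, d]]
print(table)
```

Each such table yields one ROR and one p-value. An SE whose pairs all fail the corrected test contributes no rows to the saved dataframe, which is why it disappears from the final list.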

In [15]:
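## Contingency table:
# A: 2020_A, the number of reports with this drug-SE pair in 2020
# B: 2020_B, the number of reports without this drug-SE pair in 2020
# C: 2019_A, the number of reports with this drug-SE pair in 2019
# D: 2019_B, the number of reports without this drug-SE pair in 2019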

condition_list = ['SE_uncondition_2019_sig_over', 'SE_uncondition_2019_sig_under', 'SE_male_2019_sig_over', 'SE_male_2019_sig_under',
                 'SE_female_2019_sig_over', 'SE_female_2019_sig_under', 
                 'SE_young_2019_sig_over', 'SE_young_2019_sig_under', 'SE_adult_2019_sig_over', 'SE_adult_2019_sig_under',
                 'SE_elderly_2019_sig_over', 'SE_elderly_2019_sig_under']

for condition in condition_list:
    
    locals()[condition] = pickle.load(open('../Data/pandemic/results/'+condition+'_step2.pk', 'rb'))   
    pop = condition.split('_')[1]
    
    print('load condition', condition, 'pop', pop)
    df = locals()[condition]
    if len(df) ==0:
        print('there is no significant SE in this situation. ')
        continue
    
    df = df.sort_values('p_corrected', ascending=True)
    top_SE_list = list(df.SE)  # all the SEs, sorted by corrected p-value


    # for top_SE in top_SE_list:
    SE_list = top_SE_list
    yr_list = [2019, 2020]

    """Find the corresponding Drugs of each Top SE"""
    ind = ['2019-03-11'<i<'2019-09-31' or '2020-03-11'<i<'2020-09-31' for i in all_pd_US_pro['receipt_date']]
    reports_19_20= all_pd_US_pro[ind]

    top_drug_dic = {}
    for i in range(len(SE_list)):
        se = SE_list[i]
        indx_SE = [se in j for j in reports_19_20.SE]
        drug_list = reports_19_20[indx_SE].drugs  # find all the drugs that co-occurred with the SE
        drug_list = list(itertools.chain(*list(drug_list)))
        drug_list = list(set(drug_list))  # remove duplicated drugs
        top_drug_dic[i] = drug_list

    """Initialize the data frame and count the occurrences of each SE-drug pair"""
    se_matrix = pd.DataFrame() #{'SE': SE_list, 'name':list(SE_dic_df.index), 'medra_ID': list(SE_dic_df['medra_ID'])}

    for yr in yr_list: # for a certain year
        st = str(yr)+'-03-10'
        end = str(yr)+'-09-31'
        ind = [st<i<end for i in all_pd_US_pro['receipt_date']]
        input_df = all_pd_US_pro[ind].copy()  # copy so the column assignments below do not modify a view
        n_report = len(input_df)
        ### split age_group
        input_df['age'] = pd.to_numeric(input_df['age'], errors='coerce')
        bins = [1, 20, 65, max(input_df.age)+1]
        age_labels = ['Young', 'Adult','Elderly']
        input_df['age_group'] = pd.cut(input_df.age, bins, right = False, labels= age_labels)


        """Limit the reports into specific population!!!!!!!!!!!"""
        if pop =='male':
            locals()['all_pd_US_pro_'+str(yr)] = input_df[input_df.gender=='1']
        elif pop =='female':
            locals()['all_pd_US_pro_'+str(yr)] = input_df[input_df.gender=='2']
        elif pop =='young':
            locals()['all_pd_US_pro_'+str(yr)] = input_df[input_df.age_group=='Young']
        elif pop =='adult':
            locals()['all_pd_US_pro_'+str(yr)] = input_df[input_df.age_group=='Adult']
        elif pop =='elderly':
            locals()['all_pd_US_pro_'+str(yr)] = input_df[input_df.age_group=='Elderly']
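        else:
            locals()['all_pd_US_pro_'+str(yr)] = input_df
        # Recount after restricting the population, so that B = n_report - A uses the same set of reports as A
        n_report = len(locals()['all_pd_US_pro_'+str(yr)])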


        se_, drug_, rank = [], [], []
        A =[]
        for i in tqdm(range(len(SE_list))): # dive into each SE
            se = SE_list[i]    
            name = locals()['all_pd_US_pro_'+str(yr)]
            indx_SE = [se in j for j in name.SE]
            drug_list = top_drug_dic[i]

            #  dive into each SE-drug pair
            SE_reports = name[indx_SE]
            for drug in drug_list: # check each drug that co-occurred with this SE
                index_SE_drug = [drug in j for j in SE_reports.drugs]        

                se_.append(se) # record the se
                drug_.append(drug) # record the drug
                n_A = sum(index_SE_drug)      
                A.append(n_A)  # record A
                rank.append(i+1)
        B = [n_report - i for i in A]
        se_matrix['SE_rank'] = rank
        se_matrix['SE'] = se_
        se_matrix['drug'] = drug_
        se_matrix[str(yr)+'_A'] = A
        se_matrix[str(yr)+'_B'] = B
        
        

    # insert the drug name and SE name into the dataframe
    se_matrix['SE_name'] = se_matrix.apply(lambda row: list(MedDRA_dic_all[MedDRA_dic_all.PT==row.SE]['PT_name'])[0], axis=1)
    se_matrix['drug_name'] = se_matrix.apply(lambda row: list(drug_dic_pd[drug_dic_pd['Drugbank_ID']==row.drug].index)[0], axis=1)
#     se_matrix['SE_name'] = se_matrix.apply(lambda row: SE_code_dic[row.SE][0], axis=1)    
#     se_matrix['drug_name'] = se_matrix.apply(lambda row: drug_code_dic[row.drug][0], axis=1)
    
    se_matrix['2019_Delta'] = (se_matrix['2020_A'] - se_matrix['2019_A'])/se_matrix['2019_A']
    # se_matrix

    """Calculate the Fisher's test, and correct the p-value"""
    # Fisher's test
    se_matrix_2019 = se_matrix
    se_matrix_2019['2019_ROR'] = se_matrix_2019.apply(lambda row: stats.fisher_exact([[row['2020_A'], row['2020_B']], [row['2019_A'], row['2019_B']]])[0], axis = 1)
    se_matrix_2019['2019_ROR_CI_upper'] = se_matrix_2019.apply(lambda row: CI(row['2019_ROR'],row['2020_A'], row['2020_B'],  row['2019_A'], row['2019_B'])[0], axis = 1)
    se_matrix_2019['2019_ROR_CI_lower'] = se_matrix_2019.apply(lambda row: CI(row['2019_ROR'], row['2020_A'], row['2020_B'], row['2019_A'], row['2019_B'])[1], axis = 1)
    
    se_matrix_2019['p_value'] = se_matrix_2019.apply(lambda row: stats.fisher_exact([[row['2020_A'], row['2020_B']], [row['2019_A'], row['2019_B']]])[1], axis = 1)

    # multipletests
    se_matrix_2019['sig'], se_matrix_2019['p_corrected']  = multipletests(pvals=se_matrix_2019['p_value'], alpha=0.05, method='bonferroni')[0:2]
    se_matrix_2019_sig = se_matrix_2019[se_matrix_2019['sig']==True]
    # sort_values returns a new frame, so assign the result back
    se_matrix_2019_sig = se_matrix_2019_sig.sort_values(by=['SE_rank','p_corrected'], ascending=[True, True])
        
    
    ###### Find the SEs that have at least one significant drug:
    sig_SE = set(list(se_matrix_2019_sig.SE))

    ### save the Data Frame
    mark = condition[3:]
    pickle.dump(se_matrix_2019_sig,  open('../Data/pandemic/top_SE_drug_' + mark+'.pk', 'wb'))
    
    
    ## Save the list of SEs that have at least one significant drug-SE pair.
    pickle.dump(list(set(sig_SE)),  open('../Data/pandemic/sig_drugSE_' + mark+'.pk', 'wb'))
    print(mark, 'kept SE that has sig drug-SE pair:', len(sig_SE))    
    
    # save the SEs removed by the drug-confounding check
    insig_SE = list(set(SE_list) - sig_SE)
    pickle.dump(insig_SE,  open('../Data/pandemic/removed_drugSE_' + mark+'.pk', 'wb'))
    print(mark, 'removed SE due to non drug-SE pair:', len(insig_SE))

    print('data saved', condition)

load condition SE_uncondition_2019_sig_over pop uncondition
data saved SE_uncondition_2019_sig_over
load condition SE_uncondition_2019_sig_under pop uncondition
data saved SE_uncondition_2019_sig_under
load condition SE_male_2019_sig_over pop male
data saved SE_male_2019_sig_over
load condition SE_male_2019_sig_under pop male
data saved SE_male_2019_sig_under
load condition SE_female_2019_sig_over pop female
data saved SE_female_2019_sig_over
load condition SE_female_2019_sig_under pop female
data saved SE_female_2019_sig_under
load condition SE_young_2019_sig_over pop young
data saved SE_young_2019_sig_over
load condition SE_young_2019_sig_under pop young
data saved SE_young_2019_sig_under
load condition SE_adult_2019_sig_over pop adult
data saved SE_adult_2019_sig_over
load condition SE_adult_2019_sig_under pop adult
data saved SE_adult_2019_sig_under
load condition SE_elderly_2019_sig_over pop elderly
data saved SE_elderly_2019_sig_over
load condition SE_elderly_2019_sig_under pop e

# Drug significantly associated with AE during pandemic


In [14]:
# A: drug and SE co-occur in 2020
# B: drug appears but SE does not, B = n_test - A
# C: drug absent (matched control reports) but SE occurs
# D: drug absent and SE absent, D = n_control - C


condition_list = ['SE_uncondition_2019_sig_over',  'SE_uncondition_2019_sig_under', 'SE_male_2019_sig_over', 'SE_male_2019_sig_under',
                 'SE_female_2019_sig_over', 'SE_female_2019_sig_under', 
                 'SE_young_2019_sig_over', 'SE_young_2019_sig_under', 'SE_adult_2019_sig_over', 'SE_adult_2019_sig_under',
                 'SE_elderly_2019_sig_over', 'SE_elderly_2019_sig_under']

# for condition in condition_list: 
for condition in condition_list:
    pop = condition.split('_')[1]      
#     locals()[condition] = pickle.load(open('../Data/pandemic/results/'+condition+'.pk', 'rb'))   
    
    
    # load the data that we want to check
    df = pickle.load(open('../Data/pandemic/top_SE_drug_' + condition[3:]+'.pk', 'rb'))
    print('load data of',condition)
    
    df = df.sort_values('p_corrected', ascending=True)
    SE_list = list(df.SE)  # the SE of each significant pair, sorted by corrected p-value
    drug_list = list(df.drug) 


    # for top_SE in top_SE_list:
    yr =  2020 #[2019, 2020]

    """Select the reports received in this year's March-September window"""
    start, end = str(yr) + '-03-11', str(yr) + '-09-31'
    ind = [start<i<end for i in all_pd_US_pro['receipt_date']]
    input_df = all_pd_US_pro[ind].copy()  # copy so the column assignments below do not modify a view
    n_report = len(input_df)  
    
    ### split age_group
    input_df['age'] = pd.to_numeric(input_df['age'], errors='coerce')
    bins = [1, 20, 65, max(input_df.age)+1]
    age_labels = ['Young', 'Adult','Elderly']
    input_df['age_group'] = pd.cut(input_df.age, bins, right = False, labels= age_labels)


    """Limit the reports into specific population!!!!!!!!!!!"""
    if pop =='male':
        input_df = input_df[input_df.gender=='1']
    elif pop =='female':
        input_df = input_df[input_df.gender=='2']
    elif pop =='young':
        input_df = input_df[input_df.age_group=='Young']
    elif pop =='adult':
        input_df = input_df[input_df.age_group=='Adult']
    elif pop =='elderly':
        input_df = input_df[input_df.age_group=='Elderly']

    

    """Initialize the data frame and count the occurrences of each SE-drug pair"""
    se_matrix = pd.DataFrame() #{'SE': SE_list, 'name':list(SE_dic_df.index), 'medra_ID': list(SE_dic_df['medra_ID'])}


                    
    name = input_df
    A, B, C, D =[], [], [],[]
    SE_non, drug_non = [], []
    for i in tqdm(range(len(SE_list))): # dive into each SE
        # Pick the drug-SE pair to test
        drug = drug_list[i]
        se = SE_list[i]           
        
        # Positive reports mention the drug; negative reports do not
        indx_drug = [drug in j for j in name.drugs]  
        positive_reports = name[indx_drug]  # the reports in which the drug occurred
        if len(positive_reports)==0:  # skip the pair if this drug does not occur in 2020
            continue
        indx_drug_not = [not h for h in indx_drug]
        negative_reports = name[indx_drug_not]  
        
        # Select the control group
        # Extract the independent variables in positive/negative set
#         pp =positive_reports.loc[:, 'qualify':'weight']  # features
# 'qualify', 'serious', 'receivedate','receiptdate','age','gender','weight','lastingdays'
        pp =positive_reports.loc[:, [ 'qualify', 'serious', 'receivedate','receiptdate','age','gender','weight','lastingdays']] 

        positive_fea = pp.to_numpy()

        nn =negative_reports.loc[:, [ 'qualify', 'serious', 'receivedate','receiptdate','age','gender','weight','lastingdays']]
        negative_fea = nn.to_numpy()
    
        scaler = Normalizer()  # row-wise L2 normalization, so the inner product below equals cosine similarity (alternatives tried: MinMaxScaler(), StandardScaler())
        positive_fea = scaler.fit_transform(positive_fea)
        negative_fea = scaler.fit_transform(negative_fea)
        
        
        """Trying Faiss"""
        X = positive_fea.astype('float32')
        Y = negative_fea.astype('float32')

        index = faiss.IndexFlatIP(X.shape[-1])  # Cosine similarity
#         Y = Y.copy(order='C')
        index.add(np.ascontiguousarray(Y))                  # add negative features;  np.ascontiguousarray(Y)
        k = 10                        # we want to see 10 nearest neighbors
        _, I = index.search(np.ascontiguousarray(X), k)     # actual search
        index_control =  I.reshape([1, -1])
        control_group = negative_reports.iloc[index_control[0]]
        test_group = positive_reports  
    
        
        # Calculate A, B, C,D for a certain pair
        
        a,b,c,d =0, 0,0,0
        n_test = len(test_group)
        n_control = len(control_group)
        
        
        se_ind_test = [se in j for j in test_group.SE]  
        se_ind_control = [se in j for j in control_group.SE]  
        a = sum(se_ind_test)
        c = sum(se_ind_control)
        b = n_test -a
        d = n_control - c
        A.append(a)
        B.append(b)
        C.append(c)
        D.append(d)
        
        SE_non.append(se)
        drug_non.append(drug)

    ## len(drug_non) can be smaller than len(drug_list), because some drug pairs were dropped for being unseen in 2020.

    se_matrix['SE'] = SE_non
    se_matrix['drug'] = drug_non
    se_matrix[str(yr)+'_A'] = A
    se_matrix[str(yr)+'_B'] = B
    se_matrix[str(yr)+'_C'] = C
    se_matrix[str(yr)+'_D'] = D

    # insert the drug name and SE name into the dataframe
    se_matrix['SE_name'] = se_matrix.apply(lambda row: list(MedDRA_dic_all[MedDRA_dic_all.PT==row.SE]['PT_name'])[0], axis=1)
    se_matrix['drug_name'] = se_matrix.apply(lambda row: list(drug_dic_pd[drug_dic_pd['Drugbank_ID']==row.drug].index)[0], axis=1)
#     se_matrix['SE_name'] = se_matrix.apply(lambda row: SE_code_dic[row.SE][0], axis=1)
#     se_matrix['drug_name'] = se_matrix.apply(lambda row: drug_code_dic[row.drug][0], axis=1)

    """Calculate the Fisher's test, and correct the p-value"""
    # Fisher's test
#     se_matrix_2019 = se_matrix
    se_matrix[str(yr)+'_ROR'] = se_matrix.apply(lambda row: stats.fisher_exact([[row[str(yr)+'_A'], row[str(yr)+'_B']], [row[str(yr)+'_C'], row[str(yr)+'_D']]])[0], axis = 1)
    se_matrix[str(yr)+'_ROR_CI_upper'] = se_matrix.apply(lambda row: CI(row[str(yr)+'_ROR'], row[str(yr)+'_A'], row[str(yr)+'_B'],row[str(yr)+'_C'], row[str(yr)+'_D'])[0], axis = 1)
    se_matrix[str(yr)+'_ROR_CI_lower'] = se_matrix.apply(lambda row: CI(row[str(yr)+'_ROR'], row[str(yr)+'_A'], row[str(yr)+'_B'],row[str(yr)+'_C'], row[str(yr)+'_D'])[1], axis = 1)
    
    se_matrix['p_value'] = se_matrix.apply(lambda row: stats.fisher_exact([[row[str(yr)+'_A'], row[str(yr)+'_B']], [row[str(yr)+'_C'], row[str(yr)+'_D']]])[1], axis = 1)

    # multipletests
    se_matrix['sig'], se_matrix['p_corrected']  = multipletests(pvals=se_matrix['p_value'], alpha=0.05, method='bonferroni')[0:2]
    
    se_matrix_2019_sig = se_matrix[se_matrix['sig']==True]
    
    
    ### save the Data Frame
    mark = condition[3:]
    pickle.dump(se_matrix_2019_sig,  open('../Data/pandemic/sig_SE_drug_pair' + mark+ str(yr)+'.pk', 'wb'))
    print('data saved', mark)

load data of SE_uncondition_2019_sig_over
data saved uncondition_2019_sig_over
load data of SE_uncondition_2019_sig_under
data saved uncondition_2019_sig_under
load data of SE_male_2019_sig_over
data saved male_2019_sig_over
load data of SE_male_2019_sig_under
data saved male_2019_sig_under
load data of SE_female_2019_sig_over
data saved female_2019_sig_over
load data of SE_female_2019_sig_under
data saved female_2019_sig_under
load data of SE_young_2019_sig_over
data saved young_2019_sig_over
load data of SE_young_2019_sig_under
data saved young_2019_sig_under
load data of SE_adult_2019_sig_over
data saved adult_2019_sig_over
load data of SE_adult_2019_sig_under
data saved adult_2019_sig_under
load data of SE_elderly_2019_sig_over
data saved elderly_2019_sig_over
load data of SE_elderly_2019_sig_under
data saved elderly_2019_sig_under
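
The control-group selection in the cell above pairs each report that mentions the drug with its 10 most similar reports that do not, using the inner product of L2-normalized feature rows (cosine similarity). The sketch below reproduces that idea with plain NumPy in place of `faiss.IndexFlatIP`. The feature values are invented, `cosine_top_k` is a hypothetical helper name, and all rows are assumed to have a nonzero norm.

```python
import numpy as np

def cosine_top_k(queries, candidates, k):
    """Indices of the k most cosine-similar candidate rows for each query row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T  # inner product of unit vectors equals cosine similarity
    return np.argsort(-sims, axis=1)[:, :k]

# Hypothetical feature rows for exposed and unexposed reports
exposed = np.array([[1.0, 0.0], [0.0, 1.0]])
unexposed = np.array([[2.0, 0.1], [0.1, 3.0], [1.0, 1.0], [5.0, 0.0]])

idx = cosine_top_k(exposed, unexposed, k=2)
control_rows = idx.reshape(-1)  # flattened row positions; duplicates are possible
print(idx)
```

Flattening the index array, as the notebook does with `I.reshape([1, -1])`, means an unexposed report that is a neighbor of several exposed reports appears in the control group more than once.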


# AEs that satisfy both significance checks

In [18]:
condition_list = ['SE_uncondition_2019_sig_over',
'SE_uncondition_2019_sig_under', 'SE_male_2019_sig_over', 'SE_male_2019_sig_under',
                 'SE_female_2019_sig_over', 'SE_female_2019_sig_under', 
                 'SE_young_2019_sig_over', 'SE_young_2019_sig_under', 'SE_adult_2019_sig_over', 'SE_adult_2019_sig_under',
                 'SE_elderly_2019_sig_over', 'SE_elderly_2019_sig_under']

for condition in condition_list: 
    mark = condition[3:]
    yr = 2020    
    yr_pair = pickle.load(open('../Data/pandemic/sig_SE_drug_pair' + mark+ str(yr)+'.pk', 'rb'))
    print('load drug-AE pairs during pandemic: {}'.format(mark))
    
    ### Load the pandemic-sig pairs
    pandemic_pair = pickle.load(open('../Data/pandemic/top_SE_drug_' + mark+'.pk', 'rb'))    
    print('load pairs significantly associated with pandemic : {}'.format(mark))

    
    ## Merge them  to get both_sig_pair
    both_sig_pair = pd.merge(pandemic_pair, yr_pair, how='inner', on=['SE','drug'])
    
    ## Save the list of SEs that have at least one drug-SE pair passing both checks.
    sig_SE = list(both_sig_pair.SE)
    pickle.dump(list(set(sig_SE)),  open('../Data/pandemic/sig_drugSE_' + mark+'.pk', 'wb'))
    print( 'kept SE that satisfy both significance check:', mark)   

load drug-AE pairs during pandemic: uncondition_2019_sig_over
load pairs significantly associated with pandemic : uncondition_2019_sig_over
kept SE that satisfy both significance check: uncondition_2019_sig_over
load drug-AE pairs during pandemic: uncondition_2019_sig_under
load pairs significantly associated with pandemic : uncondition_2019_sig_under
kept SE that satisfy both significance check: uncondition_2019_sig_under
load drug-AE pairs during pandemic: male_2019_sig_over
load pairs significantly associated with pandemic : male_2019_sig_over
kept SE that satisfy both significance check: male_2019_sig_over
load drug-AE pairs during pandemic: male_2019_sig_under
load pairs significantly associated with pandemic : male_2019_sig_under
kept SE that satisfy both significance check: male_2019_sig_under
load drug-AE pairs during pandemic: female_2019_sig_over
load pairs significantly associated with pandemic : female_2019_sig_over
kept SE that satisfy both significance check: female_2019_