# Sensitivity to badly converged estimates

## Running this notebook for yourself  
If you want to run this notebook, you will need to install `vigipy` and have access to the CAERS dataset.  
The dataset is currently stored on Azure in `/dbfs/mnt/data/caers/` if you would like a copy. It is open-source and fully anonymised.  
Please see the `README.md` for instructions on installing `vigipy`. I **strongly** recommend installing it in a virtual environment so you can experiment with it freely.  
***IMPORTANT***: Make sure you are using the `local_qol_changes` branch, because that is the branch I used to produce these results.

In [1]:
from vigipy import * 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import time

In [2]:
caers_dataset = "C:/Users/damlteam/Documents/vigipy_devops/vigipy/example_notebooks/test_datasets/caers_dataset.csv" # put your own path to the dataset

# read in the CAERS dataset and give the columns the names used below
df = pd.read_csv(caers_dataset, header=0)
df = df.rename(columns={'var1': 'name', 'var2': 'AE'})
df['count'] = 1  # every row counts as one event

# drop repeated (id, name, AE) rows, keeping the first occurrence
df = df.drop_duplicates(subset=['id', 'name', 'AE'], keep='first')

In [3]:
df

Unnamed: 0,id,name,AE,strat1,count
0,147289,PREVAGEN,BRAIN NEOPLASM,Female,1
1,147289,PREVAGEN,CEREBROVASCULAR ACCIDENT,Female,1
2,147289,PREVAGEN,RENAL DISORDER,Female,1
3,147289,PREVAGEN,GOUT,Female,1
4,147289,PREVAGEN,HYPERTENSION,Female,1
...,...,...,...,...,...
20151,160591,"CENTRUM SILVER WOMEN'S 50 PLUS (MULTIMINERALS,...",CHOKING,Female,1
20152,160592,"CENTRUM SILVER WOMEN'S 50+ (MULTIMINERALS, MUL...",CHOKING,Female,1
20153,160592,"CENTRUM SILVER WOMEN'S 50+ (MULTIMINERALS, MUL...",PALPITATIONS,Female,1
20154,160592,"CENTRUM SILVER WOMEN'S 50+ (MULTIMINERALS, MUL...",DYSPHAGIA,Female,1


In [4]:
time1 = time.time()
vigipy_data_2 = convert(df, count_unique_ids=True) # converting in the openEBGM way (always optimised)
time2 = time.time()
print("TIME TO CONVERT WITH OPENEBGM METHOD = ", time2-time1)

TIME TO CONVERT WITH OPENEBGM METHOD =  0.21534013748168945
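
The same `time.time()` pattern is repeated for every timed cell in this notebook. As a minimal sketch, the timing could be wrapped in a small context manager built on `time.perf_counter()`, which the standard library provides for measuring short durations. The helper name `timed` is hypothetical and is not part of `vigipy`; the workload below is a stand-in for `convert` or `gps`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label} = {elapsed:.3f} s")


# Stand-in workload; in the notebook this would be the convert(...) or gps(...) call
with timed("TIME TO SUM"):
    total = sum(range(1_000_000))
```

Using `try`/`finally` means the elapsed time is still printed if the timed call raises.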


In [5]:
data_comp2 = vigipy_data_2.data.sort_values(by='events')

In [6]:
data_comp2 # processed table using openEBGM method of counts

Unnamed: 0,events,product_aes,count_across_brands,AE,name
17184,1,1,70,PALPITATIONS,ZINC LOZENGE
16410,1,2,37,SYNCOPE,VITAMIN WORLD NIACIN 500 MG COATED CAPLETS
16411,1,2,56,TREMOR,VITAMIN WORLD NIACIN 500 MG COATED CAPLETS
17155,1,9,5,FACE OEDEMA,ZINC
16412,1,2,2,VASCULAR INJURY,VITAMIN WORLD NIACIN 500 MG COATED CAPLETS
...,...,...,...,...,...
2358,27,30,842,CHOKING,CENTRUM SILVER ULTRA WOMEN'S MULTIMINERALS MUL...
2614,28,64,265,FOREIGN BODY TRAUMA,"CENTRUM SILVER WOMEN'S 50+ (MULTIMINERALS, MUL..."
2652,42,45,842,CHOKING,CENTRUM SILVER WOMEN'S 50+ MULTIMINERALS MULTI...
2430,53,57,842,CHOKING,CENTRUM SILVER ULTRA WOMENS MULTIMINERALS MULT...


## Optimising Hyperparameters and Building a Report

In [7]:
help(gps) # help string for the function is very detailed

Help on function gps in module vigipy.GPS.GPS:

gps(container, relative_risk=1, min_events=1, decision_metric='rank', decision_thres=0.05, ranking_statistic='log2', truncate=False, truncate_thres=1, prior_init={'alpha1': 0.2041, 'beta1': 0.05816, 'alpha2': 1.415, 'beta2': 1.838, 'w': 0.0969}, prior_param=None, expected_method='mantel-haentzel', method_alpha=1, minimization_method='CG', minimization_bounds=((np.float32(1.1920929e-07), 20), (np.float32(1.1920929e-07), 10), (np.float32(1.1920929e-07), 20), (np.float32(1.1920929e-07), 10), (0, 1)), minimization_options=None, message=False, opt_likelihood=False, number_of_iterations=1000, tol_value=0.0001, sim_anneal=False, product_label='name', ae_label='AE')
    Perform disproportionality analysis using the Multi-Item Gamma Poisson Shrinker (GPS) algorithm.
    
    This function implements a gamma-poisson shrinker algorithm for analyzing adverse event 
    data and detecting disproportionality signals. It optimizes hyperparameters, calcu

In [8]:
# MHRA EBGM cut-off is 2.5
log2_ebgm = np.log2(2.5)

# lower bound for the minimisation: the hyperparameters must stay strictly positive
EPS = np.finfo(np.float64).eps
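
Because log2 is strictly increasing, a cut-off of 2.5 on the EBGM score and a cut-off of log2(2.5) on the log2 score select the same rows. The sketch below checks this on a few made-up EBGM values. It uses `math.log2`, which agrees with `np.log2` for scalars. Using `>=` as the comparison is an assumption for illustration; how `gps` compares against `decision_thres` is not shown here.

```python
import math

ebgm_cutoff = 2.5
log2_cutoff = math.log2(ebgm_cutoff)  # same value as np.log2(2.5)


def is_signal_raw(ebgm):
    """Threshold applied on the EBGM scale."""
    return ebgm >= ebgm_cutoff


def is_signal_log2(ebgm):
    """Threshold applied on the log2(EBGM) scale."""
    return math.log2(ebgm) >= log2_cutoff


# Made-up EBGM values, for illustration only
examples = [1.0, 2.0, 3.0, 10.0]
agree = all(is_signal_raw(e) == is_signal_log2(e) for e in examples)
print(round(log2_cutoff, 4), agree)
```

For reference, log2(2.5) is approximately 1.3219.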

In [9]:
# Not all of these parameters need to be passed to gps(): everything except the container has a default
time1 = time.time()
results = gps(
    vigipy_data_2, # container that was processed with convert
    relative_risk=1, # relative risk parameter
    min_events=3, # minimum number of events for something to be considered a signal 
    decision_metric='rank', # what should we rank signals by? 'rank' means by the ranking_statistic later 
    decision_thres=log2_ebgm, # minimum value of the ranking statistic for something to be considered a signal
    ranking_statistic='log2', # which ranking statistic to use, here log2 means the log2 of the EBGM score
    truncate=True, # whether to truncate or not
    truncate_thres=1,  # threshold for truncation
    prior_init={"alpha1": 2.0, "beta1": 1.0, "alpha2": 2.0, "beta2": 1.0, "w": 0.3333}, # initial guesses for priors
    prior_param=None, # feed in an array if you want to just use those priors and do no optimisation
    expected_method="mantel-haentzel", # method for calculating expected counts
    method_alpha=1, # parameter for other methods of calculating expected counts
    minimization_method="COBYQA", # which minimisation algorithm from scipy to use!
    minimization_bounds=((EPS, 20), (EPS, 10), (EPS, 20), (EPS, 10), (EPS, 1)), # bounds on minimisation: will only be applied to certain algorithms
    minimization_options=None, # any supplementary options for the minimiser
    message=True, # whether to be verbose and print messages
    opt_likelihood=True, # whether to use the optimised versions of likelihood functions (really should always be true)
    number_of_iterations=1000, # number of iterations of the optimiser
    tol_value=1.0e-6, # tolerance for the optimiser
    sim_anneal=False, # whether to use simulated annealing optimisation: very experimental
    product_label='name', # the name of the column in the processed data that contains the product names
    ae_label='AE' # the name of the column in the processed data that contains the AE names
)
time2 = time.time()
print("TIME TAKEN TO PRODUCE DATA = ", time2-time1)

BEGINNING HYPERPARAMETER OPTIMISATION
OPTIMISED PRIORS REACHED:  [3.25578065 0.39997857 2.02376934 1.9061568  0.06530997]
OPTIMISED FUNCTION VALUE =  4162.455710382609
The lower bound for the trust-region radius has been reached
CALCULATING EBGM SCORES
CALCULATING QUANTILES
GENERATING REPORT
TIME TAKEN TO PRODUCE DATA =  14.653939962387085


In [25]:
# Second run for comparison: the default prior_init and the CG minimiser
# (again, not all of these parameters need to be passed to gps())
time1 = time.time()
results2 = gps(
    vigipy_data_2, # container that was processed with convert
    relative_risk=1, # relative risk parameter
    min_events=3, # minimum number of events for something to be considered a signal 
    decision_metric='rank', # what should we rank signals by? 'rank' means by the ranking_statistic later 
    decision_thres=log2_ebgm, # minimum value of the ranking statistic for something to be considered a signal
    ranking_statistic='log2', # which ranking statistic to use, here log2 means the log2 of the EBGM score
    truncate=True, # whether to truncate or not
    truncate_thres=1,  # threshold for truncation
    prior_init={"alpha1": 0.2041, "beta1": 0.05816, "alpha2": 1.415, "beta2": 1.838, "w": 0.0969}, # initial guesses for priors
    prior_param=None, # feed in an array if you want to just use those priors and do no optimisation
    expected_method="mantel-haentzel", # method for calculating expected counts
    method_alpha=1, # parameter for other methods of calculating expected counts
    minimization_method="CG", # which minimisation algorithm from scipy to use!
    minimization_bounds=((EPS, 20), (EPS, 10), (EPS, 20), (EPS, 10), (EPS, 1)), # bounds on minimisation: will only be applied to certain algorithms
    minimization_options=None, # any supplementary options for the minimiser
    message=True, # whether to be verbose and print messages
    opt_likelihood=True, # whether to use the optimised versions of likelihood functions (really should always be true)
    number_of_iterations=1000, # number of iterations of the optimiser
    tol_value=1.0e-6, # tolerance for the optimiser
    sim_anneal=False, # whether to use simulated annealing optimisation: very experimental
    product_label='name', # the name of the column in the processed data that contains the product names
    ae_label='AE' # the name of the column in the processed data that contains the AE names
)
time2 = time.time()
print("TIME TAKEN TO PRODUCE DATA = ", time2-time1)

BEGINNING HYPERPARAMETER OPTIMISATION
OPTIMISED PRIORS REACHED:  [3.17234996 0.39510069 2.03040533 1.91653781 0.06652814]
OPTIMISED FUNCTION VALUE =  4162.456429306268
Desired error not necessarily achieved due to precision loss.
CALCULATING EBGM SCORES
CALCULATING QUANTILES
GENERATING REPORT
TIME TAKEN TO PRODUCE DATA =  44.595242500305176


#### Observations
- The likelihood and prior values are consistent with those from openEBGM for the same dataset.
- Both runs reach essentially the same minimum (function value ≈ 4162.456) with similar priors, even though they use different minimisation algorithms and different initial priors. The main difference is run time (about 15 s with COBYQA versus about 45 s with CG).
- The object returned by `gps` stores various parameters, including all input parameters, the optimised priors and the likelihood at the minimum.
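
To put numbers on how close the two runs are, the sketch below compares the priors and function values copied from the two optimiser logs above. The parameter order (`alpha1`, `beta1`, `alpha2`, `beta2`, `w`) is assumed to match the order of the keys in `prior_init`.

```python
# Values copied from the two "OPTIMISED PRIORS REACHED" / "OPTIMISED FUNCTION VALUE" logs
names = ["alpha1", "beta1", "alpha2", "beta2", "w"]  # assumed order, as in prior_init
priors_cobyqa = [3.25578065, 0.39997857, 2.02376934, 1.9061568, 0.06530997]
priors_cg = [3.17234996, 0.39510069, 2.03040533, 1.91653781, 0.06652814]
f_cobyqa = 4162.455710382609
f_cg = 4162.456429306268

# Relative difference of each hyperparameter, using the COBYQA run as the reference
rel_diff = {
    name: abs(a - b) / abs(a)
    for name, a, b in zip(names, priors_cobyqa, priors_cg)
}

for name, diff in rel_diff.items():
    print(f"{name}: {100 * diff:.2f}% relative difference")
print(f"difference in function value: {abs(f_cobyqa - f_cg):.2e}")
```

The hyperparameters differ by at most a few percent while the function values differ by less than 1e-3, which suggests that the likelihood is fairly flat around the minimum.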

In [12]:
results.signals # cut off at the decision threshold

Unnamed: 0,Product,Adverse Event,Count,Expected Count,log2,count/expected,product margin,event margin,fdr,FNR,Se,Sp,LowerBound,p_value
0,REUMOFAN PLUS,WEIGHT INCREASED,16.0,0.406436,4.539840,39.366569,44.0,31.0,0.079960,0.904488,0.879821,0.161140,15.686438,1.943896e-17
1,REUMOFAN PLUS,IMMOBILE,6.0,0.078665,4.191970,76.272727,44.0,6.0,0.080117,0.909704,0.861200,0.171486,10.161142,9.009935e-07
2,HYDROXYCUT REGULAR RAPID RELEASE CAPLETS,EMOTIONAL DISTRESS,19.0,0.896901,4.068400,21.184053,70.0,43.0,0.080021,0.900297,0.639618,0.422631,11.646773,2.541142e-17
3,"EMERGEN-C (ASCORBIC ACID, B-COMPLEX, ELECTROLY...",COUGH,6.0,0.144815,4.002402,41.432099,6.0,81.0,0.089050,0.925122,0.329246,0.568923,8.882635,2.815762e-06
4,HYDROXYCUT HARDCORE CAPSULES,MULTIPLE INJURIES,5.0,0.092372,3.966693,54.129032,31.0,10.0,0.086889,0.899664,0.572963,0.491910,8.271694,2.278548e-05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248,"CENTRUM SILVER ULTRA WOMEN'S (MULTIMINERALS, M...",CHOKING,22.0,7.275924,1.391504,3.023671,29.0,842.0,0.115993,0.922704,0.124253,0.754106,1.824771,2.930861e-05
249,"EMERGEN-C (ASCORBIC ACID, B-COMPLEX, ELECTROLY...",CHOKING,6.0,1.505364,1.348381,3.985748,6.0,842.0,0.089673,0.924079,0.326736,0.578372,1.227613,1.849353e-02
250,CENTRUM SILVER ULTRA WOMENS MULTIMINERALS MULT...,CHOKING,6.0,1.505364,1.348381,3.985748,6.0,842.0,0.099235,0.921572,0.160521,0.735797,1.227613,1.849353e-02
251,CENTRUM SILVER WOMENS 50 PLUS MULTIMINERALS MU...,DYSPHAGIA,4.0,0.849225,1.348119,4.710175,10.0,285.0,0.086598,0.925929,0.261354,0.617815,1.024587,4.563272e-02


The signals and values are identical to those from openEBGM, but the analysis now runs more quickly.