## European Medicines Agency (EMA)

The European Medicines Agency (EMA) is a decentralized agency of the European Union (EU) responsible for the scientific evaluation, supervision, and safety monitoring of medicines in the EU. EMA plays a crucial role in the public health domain by ensuring that all medicines available on the EU market are safe, effective, and of high quality.

## Summary of product characteristics (SmPC)

The summary of product characteristics (SmPC) is a document describing the properties and the officially approved conditions of use of a medicine. It forms the basis of information for healthcare professionals on how to use the medicine safely and effectively.

<br/>

### Structure of an SmPC
The structure of the SmPC is defined by European pharmaceutical legislation. The information in the SmPC should be product-specific and can be cross-referenced to avoid redundancy. It should be written in clear, unambiguous language. The SmPC is divided into 6 major sections:

1. Name of the product
2. Composition
3. Pharmaceutical Form
4. Clinical particulars – Includes therapeutic indications, dosage recommendations and safety information
5. Pharmacological properties – Takes into account the therapeutic indications of the clinical elements and their potential adverse drug reactions
6. Pharmaceutical particulars – Includes regulatory information related to the drug

<br/>

<img src="./images/structureSmPC.png" width="600px;">

<br/>
<br/>

### Prescribing medicines for special populations
When prescribing medication to specific groups like children, the elderly, pregnant and breastfeeding women, and individuals with kidney or liver issues, it's crucial to exercise caution. The public documentation for medicines, especially the Summary of Product Characteristics (SmPC) and the package leaflets (PLs), often includes many of these essential precautions.
<br/>

<img src="./images/SmPC_authorization_process.png" width="800px;">

When seeking marketing authorization, a company submits an application dossier that contains instructions for healthcare professionals on the safe and effective use of the medicine. In Europe, this is referred to as the Summary of Product Characteristics (SmPC). The SmPC must be regularly updated over the medicine's lifecycle to reflect new data on efficacy or safety. The figure above illustrates the connection between a medicine's development, the regulatory dossier with the initial proposed SmPC, the SmPC that gets approved, and subsequent updates to the SmPC.


[1]: https://www.ema.europa.eu/en/glossary/summary-product-characteristics
[2]: https://www.freyrsolutions.com/what-is-an-smpc


### Project Overview

Using natural language processing (NLP) techniques to automatically extract adverse drug reactions from such unstructured text helps clinical experts use this information effectively and efficiently in daily practice. Such techniques have been developed for Structured Product Labels from the Food and Drug Administration (FDA), but there is no research focusing on extracting them from the Summary of Product Characteristics.


In this work, we built a natural language processing pipeline that automatically scrapes the summaries of product characteristics online and then extracts adverse drug reactions from them.

<img src="./images/Project_Overview.png" width="300px;">

## Step 1: Download the medicine data and import it into a database
The European Medicines Agency (EMA) publishes its medicine-related data on its website as Excel tables and updates these tables once a day. The following two files need to be downloaded:

<br/>

* The file listing all medicines, **including the withdrawn and not authorised ones**, named medicines_output_european_public_assessment_reports_en.xlsx, found [here](https://www.ema.europa.eu/sites/default/files/Medicines_output_european_public_assessment_reports.xlsx)

and

* The file listing **all the URLs for the SmPCs**, named product-information-urls-member-states_en.xls, found [here](https://www.ema.europa.eu/documents/product-information/product-information-urls-member-states_en.xls)


##### Disclaimer
The database is used purely to join the two tables. The same join can also be done directly on the pandas DataFrames, without a database.

[3]: https://toolbox.eupati.eu/resources/prescribing-medicines-for-special-populations/?print=print
[4]: https://www.ema.europa.eu/en/homepage


In [6]:
import os
import pandas as pd
import re
import sqlite3
import camelot

# Folder where the downloaded Excel files are stored
PATH = "./data"

MEDICINE_LIST = "medicines_output_european_public_assessment_reports_en.xlsx"
PRODUCT_URLS_LIST = "product-information-urls-member-states_en.xls"

In [None]:
medicinesOutputData = pd.read_excel(os.path.join(PATH, MEDICINE_LIST))
medicinesOutputData.head()

In [None]:
productInformationURLs = pd.read_excel(os.path.join(PATH, PRODUCT_URLS_LIST))
productInformationURLs.head()

In [None]:
conn = sqlite3.connect(os.path.join(PATH, "SmPCs.db"))

medicinesOutputData.to_sql('emaMedicinesOP', conn, if_exists='replace', index=False)
productInformationURLs.to_sql('emaProductInformationURLs', conn, if_exists='replace', index=False)

## Step 2: Create csv with SmPCs URLs

The SmPC documents are downloaded from the URLs stored in joinedHumanAuthorisedProductURLsONLY.csv, which contains the product-information URLs of the medicines authorised for human use.

In [None]:
joinedHumanAuthorisedProductURLsONLY = pd.read_sql("""
    SELECT [URL(currentwebsite)]
    FROM emaMedicinesOP
    LEFT JOIN emaProductInformationURLs 
    ON emaMedicinesOP.[Medicine name]=emaProductInformationURLs.[ProductName]
    WHERE  emaMedicinesOP.Category='Human'
      AND emaMedicinesOP.[Authorisation Status] ='Authorised';
""", conn)

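# index=False keeps the DataFrame index out of the CSV, so the file holds only the URL column
joinedHumanAuthorisedProductURLsONLY.to_csv(os.path.join(PATH, "joinedHumanAuthorisedProductURLsONLY.csv"), index=False)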
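
As the disclaimer in Step 1 notes, the join does not strictly need a database. The following is a minimal sketch of a pandas-only equivalent of the SQL query above. The column names are taken from that query; the function name `join_authorised_human_urls` and the toy data (including the `example.org` URLs) are hypothetical and only illustrate the idea.

```python
import pandas as pd


def join_authorised_human_urls(medicines: pd.DataFrame, urls: pd.DataFrame) -> pd.DataFrame:
    """Left-join authorised human medicines with their product-information URLs."""
    # Same filter as the WHERE clause of the SQL query
    authorised = medicines[
        (medicines["Category"] == "Human")
        & (medicines["Authorisation Status"] == "Authorised")
    ]
    # Same join condition as the ON clause of the SQL query
    joined = authorised.merge(
        urls, how="left", left_on="Medicine name", right_on="ProductName"
    )
    return joined[["URL(currentwebsite)"]]


# Toy data standing in for the two EMA tables (hypothetical values)
medicines = pd.DataFrame({
    "Medicine name": ["A", "B", "C", "D"],
    "Category": ["Human", "Human", "Veterinary", "Human"],
    "Authorisation Status": ["Authorised", "Withdrawn", "Authorised", "Authorised"],
})
urls = pd.DataFrame({
    "ProductName": ["A", "B", "C"],
    "URL(currentwebsite)": [
        "https://example.org/a.pdf",
        "https://example.org/b.pdf",
        "https://example.org/c.pdf",
    ],
})

result = join_authorised_human_urls(medicines, urls)
print(result)
```

As with the SQL `LEFT JOIN`, a medicine without a matching URL is kept with a missing value, which is why Step 3 drops missing values before downloading.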

## Step 3: Download the SmPC PDFs


In [None]:
import requests

In [None]:
urlsDF = pd.read_csv(os.path.join(PATH, "joinedHumanAuthorisedProductURLsONLY.csv"))
urlsDF.dropna(inplace = True)
urls = urlsDF['URL(currentwebsite)'].tolist()

In [None]:
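# Make sure the target folder exists before writing into it
os.makedirs(os.path.join(PATH, "pdfs"), exist_ok=True)

for index, url in enumerate(urls):
    file_name = url.split('/')[-1]
    print("Download ", index, " with name: ", file_name)

    # A timeout and an exception handler keep one bad URL from stopping the whole loop
    try:
        response = requests.get(url, timeout=60)
    except requests.RequestException as error:
        print(f"Failed to download {file_name}: {error}")
        continue

    if response.status_code == 200:
        with open(os.path.join(PATH, "pdfs", file_name), 'wb') as pdf_file:
            pdf_file.write(response.content)
    else:
        print(f"Failed to download {file_name} (HTTP status {response.status_code})")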

## Step 4: Extract the tabulated summary of adverse reactions from section 4.8

Section 4.8 of the SmPC should include all adverse reactions (ADRs) for which, after thorough assessment, a causal relationship between the medicinal product and the adverse event is at least a reasonable possibility. This covers all reactions from:
* Clinical trials
* Post-authorisation safety studies
* Spontaneous reporting

##### The tabulated list of adverse reactions in section 4.8:
* Introduce the table with a short paragraph stating the source of the safety database
* Separate tables are acceptable in exceptional cases where the adverse reaction profiles markedly differ depending on the use of the product.

##### Structure of the table: general considerations

* The table should follow the MedDRA system organ classification (SOC)
* A pragmatic approach to the location of terms should be taken in order to make the identification of adverse reactions simpler and clinically appropriate for the reader
* Within each SOC, adverse reactions should be ranked under headings of frequency, most frequent reactions first
* Within each frequency grouping, adverse reactions should be presented in order of decreasing seriousness

##### Frequency Grouping:
* Very common (≥1/10)
* Common (≥1/100 to <1/10)
* Uncommon (≥1/1,000 to <1/100)
* Rare (≥1/10,000 to <1/1,000)
* Very rare (<1/10,000)
* Frequency not known (cannot be estimated from the available data)
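
The frequency groupings above can be expressed as a small lookup, which is handy when comparing or sorting extracted reactions. This is only a sketch: the names `FREQUENCY_GROUPS`, `frequency_group` and `frequency_rank` and the sample reactions are hypothetical and not part of the pipeline. The thresholds are the ones listed above.

```python
# Frequency groupings, most frequent first.
# Each entry is (label, lower bound of the incidence rate).
FREQUENCY_GROUPS = [
    ("Very common", 1 / 10),
    ("Common", 1 / 100),
    ("Uncommon", 1 / 1_000),
    ("Rare", 1 / 10_000),
    ("Very rare", 0.0),
]
NOT_KNOWN = "Frequency not known"


def frequency_group(rate):
    """Return the frequency grouping for an incidence rate between 0 and 1."""
    if rate is None:
        # No estimate available from the data
        return NOT_KNOWN
    for label, lower_bound in FREQUENCY_GROUPS:
        if rate >= lower_bound:
            return label
    # Negative rates are not meaningful; treat them as unknown
    return NOT_KNOWN


def frequency_rank(label):
    """Sort key: 0 for the most frequent grouping, unknown labels last."""
    labels = [name for name, _ in FREQUENCY_GROUPS]
    return labels.index(label) if label in labels else len(labels)


# Hypothetical reactions with made-up incidence rates
reactions = [("Headache", 0.004), ("Nausea", 0.2), ("Rash", None), ("Dizziness", 0.03)]
grouped = [(name, frequency_group(rate)) for name, rate in reactions]

# Most frequent reactions first, as required within each SOC
grouped.sort(key=lambda item: frequency_rank(item[1]))
print(grouped)
```

The same `frequency_rank` key could be applied to the `FREQ` column of the extracted tables, assuming the extracted labels match the wording of the list above.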

#### SmPCs Examples

<img src="./images/SmPCs_example.png" width="500px;">


## MedDRA
MedDRA (Medical Dictionary for Regulatory Activities) is a multilingual terminology, which allows most users to work in their native language.

<br />

<img src="./images/MedDRA_Hierarchy.png" width="500px;">

<br />

The most important reason to “code” data into a standardised terminology is to analyse it. A key benefit of MedDRA is in its support of straightforward as well as sophisticated analyses. MedDRA can be used to analyse individual medical events (e.g., “Influenza”) or issues involving a system, organ or aetiology (e.g., infections) using its hierarchical structure. MedDRA can be used for signal detection and monitoring of clinical syndromes whose symptoms encompass numerous systems or organs using its multiaxial hierarchy or through the special feature of Standardised MedDRA Queries.

<br />

### MedDRA Hierarchy

MedDRA has a five-level hierarchy, arranged from very specific to very general: Lowest Level Term (LLT), Preferred Term (PT), High Level Term (HLT), High Level Group Term (HLGT) and System Organ Class (SOC).

<br />

Download [MedDRA codes](https://www.ema.europa.eu/en/documents/other/meddra-important-medical-event-terms-list-version-261_en.xlsx) and extract the unique values from the SOC Name column

In [9]:
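medraSOC = pd.read_excel(os.path.join(PATH, "MedDRA", "meddra-important-medical-event-terms-list-version-261_en.xlsx"))
socList = medraSOC['SOC Name'].unique()
# Build one alternation that matches any SOC name. The names are joined as-is,
# so regex metacharacters in them, such as the parentheses in
# "(incl cysts and polyps)", are read as regex syntax rather than literal text.
SOCListRegex = '(?:%s)' % '|'.join(socList)
regex = re.compile(SOCListRegex)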

In [10]:
regex

re.compile(r'(?:Blood and lymphatic system disorders|Cardiac disorders|Congenital, familial and genetic disorders|Endocrine disorders|Eye disorders|Gastrointestinal disorders|Hepatobiliary disorders|Immune system disorders|Infections and infestations|Injury, poisoning and procedural complications|Investigations|Musculoskeletal and connective tissue disorders|Neoplasms benign, malignant and unspecified (incl cysts and polyps)|Nervous system disorders|Pregnancy, puerperium and perinatal conditions|Renal and urinary disorders|Reproductive system and breast disorders|Respiratory, thoracic and mediastinal disorders|Surgical and medical procedures|Vascular disorders|Ear and labyrinth disorders|General disorders and administration site conditions|Metabolism and nutrition disorders|Product issues|Psychiatric disorders|Skin and subcutaneous tissue disorders|Social circumstances)',
           re.UNICODE)
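
One caveat with the pattern shown above: the SOC name "Neoplasms benign, malignant and unspecified (incl cysts and polyps)" contains parentheses, which the regex engine treats as a group rather than as literal characters. The sketch below, using a hypothetical two-name subset of the SOC list, compares the unescaped pattern with a variant built with `re.escape`. It is an illustration of the difference, not the pattern the notebook was run with.

```python
import re

# Hypothetical subset of the SOC names, enough to show the effect
soc_names = [
    "Cardiac disorders",
    "Neoplasms benign, malignant and unspecified (incl cysts and polyps)",
]

# Names joined as-is: "(" and ")" become a regex group
unescaped = re.compile("(?:%s)" % "|".join(soc_names))

# Names escaped first: every character is matched literally
escaped = re.compile("|".join(re.escape(name) for name in soc_names))

cell = "Neoplasms benign, malignant and unspecified (incl cysts and polyps)"
print("unescaped matches:", bool(unescaped.search(cell)))
print("escaped matches:", bool(escaped.search(cell)))
```

Because the escaped pattern contains no groups, it should also avoid the pandas warning about match groups when passed to `str.contains`.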

In [16]:
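def extract_table_rows(df):
    """Flatten a three-column section 4.8 table into [SOC, ADR, frequency] rows.

    The first column holds the SOC. The header row tells whether the
    adverse reactions are in the second or in the third column.
    """
    rows = []

    for row in df.itertuples():
        # row[0] is the DataFrame index; row[1], row[2], row[3] are the table columns
        if row[0] == 0:
            # The first row contains the header: use it to find the column order
            if "reaction" in row[2].lower():
                AD_index, F_index = 2, 3
            else:
                AD_index, F_index = 3, 2

        else:
            soc = row[1].replace('\n', '')
            ad_reaction = row[AD_index].split('\n')
            freq = row[F_index].split('\n')

            # A cell can list several reactions, one per line; each one is paired
            # with the frequency on the same line of the frequency cell
            for index, ad in enumerate(ad_reaction):
                rows.append([soc, ad.strip(), freq[index].strip()])
    return rows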

In [25]:
# Keep track of which PDFs yielded a usable table and which did not
pdfs_processed = []
pdfs_not_processed = []

# Make sure the output folder exists before writing the CSV files
os.makedirs(os.path.join(PATH, "extracted_tables"), exist_ok=True)

for file in os.listdir(os.path.join(PATH, 'pdfs')):
    
    if file.lower().endswith('.pdf'):
        # Use the part of the file name before the first hyphen as the output name
        filename = file.split('-')[0]
        fullpath = os.path.join(PATH, 'pdfs', file)
        
        print("Start Processing Filename: ", fullpath)
        
        tables = camelot.read_pdf(fullpath, pages='all', flavor='lattice', line_scale=80, shift_text=['', 'l'], process_background=False)

        rows = []
        for i, table in enumerate(tables):
            # Convert table to a pandas DataFrame
            df = table.df
    
            # Keep only tables whose first column contains a MedDRA SOC name
            contains_regex = df[0].str.contains(regex, regex=True).any()

            if contains_regex:
                if len(df.columns) == 3:
                    # Case A: Dataframe consists of 3 columns
                    print("The file ", filename, " does have three columns table")
                    extracted_rows = extract_table_rows(df)
                    print(extracted_rows)
                    rows.extend(extracted_rows)
                    print(rows)

                    pdfs_processed.append(fullpath)
                else:
                    print("The file ", filename, " does NOT have three columns table")
                    pdfs_not_processed.append(fullpath)
        print(rows)
        if rows:
            # One semicolon-separated CSV per PDF, with one row per adverse reaction
            pd.DataFrame(rows, columns=["SOC", "ADR", "FREQ"]).to_csv(os.path.join(PATH, "extracted_tables", filename + ".csv"), sep=';', encoding='utf-8', index=False)
            
# Create report

Start Processing Filename:  ./data/pdfs/abecma-epar-product-information_en.pdf
The file  abecma  does have three columns table
[['Infections and infestationsa', 'Infections – bacterial', 'Very common'], ['Infections and infestationsa', 'Infections – viral', 'Very common'], ['Infections and infestationsa', 'Infections – pathogen unspecified', 'Very common'], ['Infections and infestationsa', 'Infections – fungal', 'Common'], ['Blood and lymphatic system disorders', 'Neutropenia', 'Very common'], ['Blood and lymphatic system disorders', 'Leucopenia', 'Very common'], ['Blood and lymphatic system disorders', 'Thrombocytopenia', 'Very common'], ['Blood and lymphatic system disorders', 'Febrile neutropenia', 'Very common'], ['Blood and lymphatic system disorders', 'Lymphopenia', 'Very common'], ['Blood and lymphatic system disorders', 'Anaemia', 'Very common'], ['Blood and lymphatic system disorders', 'Disseminated intravascular coagulation', 'Common'], ['Immune system disorders', 'Cytokine r



In [26]:
print(camelot.__version__)

0.9.0


In [27]:
!pip list

Package                   Version
------------------------- ------------
anyio                     4.2.0
appnope                   0.1.3
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.1
async-lru                 2.0.4
attrs                     23.1.0
Babel                     2.14.0
beautifulsoup4            4.12.2
bleach                    6.1.0
camelot-py                0.9.0
certifi                   2023.11.17
cffi                      1.16.0
chardet                   5.2.0
charset-normalizer        3.3.2
click                     8.1.7
comm                      0.2.0
cryptography              41.0.7
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
et-xmlfile                1.1.0
executing                 2.0.1
fastjsonschema            2.19.0
fqdn                      1.5.1
idna                      3.6
ipykernel              