# EQA results scraper

UKNEQAS provide external quality assessment (EQA) services for UK clinical laboratories. 

Laboratories submit results that are compared against target values, and a PDF report is provided.

Sometimes, a laboratory may wish to use the results and target values for other purposes (for example, if the EQA samples are analysed as part of a validation of a new method). However, it is not easy to quickly get this information from a PDF file.

The EQA results scraper opens UKNEQAS reports, extracts the distribution, results and target values, and returns them in a pandas DataFrame.

The scraper has been tested with the following Birmingham Quality UKNEQAS schemes:
* Newborn British Isles
* Clinical Chemistry
* Immunosuppressants
* Urinary Catecholamines & Metabolites

In [1]:
import PyPDF2
import pandas as pd
import tabula

## Extract distribution summary from EQA report

Opens a UKNEQAS report and extracts the results and target values as a pandas DataFrame.

Please note: the function assumes that there are three samples per distribution, because it copies each analyte name from the first row of a group into the two rows that follow it.
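The three-sample assumption could be relaxed if the blank analyte cells arrive as missing values (NaN). That is an assumption about how tabula reads the table and has not been checked against every scheme. A minimal sketch on a hand-made table, where `fill_analyte_names` is a hypothetical helper name and the rows are toy data, not real EQA results:

```python
import numpy as np
import pandas as pd


def fill_analyte_names(df, column="Analyte"):
    """Copy each analyte name down into the blank rows that follow it."""
    out = df.copy()
    # Forward-fill: every missing cell takes the last non-missing value above it
    out[column] = out[column].ffill()
    return out


# Toy table: one analyte with three samples, another with only two
toy = pd.DataFrame({
    "Analyte": ["Analyte X", np.nan, np.nan, "Analyte Y", np.nan],
    "Specimen": ["1A", "1B", "1C", "1A", "1B"],
})
filled = fill_analyte_names(toy)
print(filled["Analyte"].tolist())
```

Because the fill follows the blanks and not a fixed stride of three, it should also cope with distributions that carry a different number of samples, provided the assumption about NaN cells holds.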

In [2]:
def EQADistSum(file):
    
    # Open and read the PDF file, and create an empty pandas DataFrame
    pdfFileObj = open(file, 'rb')
    # PdfReader, .pages and .extract_text() replace PdfFileReader, .numPages,
    # .getPage() and .extractText(), which PyPDF2 3.0 removed
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    df = pd.DataFrame()

    # Extract the text of each page, starting from the second page (index 1)
    for page in range(1, len(pdfReader.pages)):
        pageObj = pdfReader.pages[page]
        text = pageObj.extract_text()

        # If the page contains the distribution summary...
        if 'Distribution Summary' in text:
            # ... read the distribution summary table
            dist_sum = tabula.read_pdf(
                file, stream=True, pages=(page + 1),
                area=(19, 0, 75, 40), relative_area=True,
            )
            # ... read and add the scheme name
            scheme_read = tabula.read_pdf(
                file, stream=True, pages=(page + 1),
                area=(0, 24, 3, 77), relative_area=True,
                pandas_options={'header': None},
            )
            scheme_name = scheme_read[0].iloc[0, 0]

            # Drop a leading "for " from the scheme name
            if scheme_name.startswith("for "):
                scheme_name = scheme_name[4:]
            
            dist_sum[0]['Scheme name'] = scheme_name
            
            #... and add to the dataframe
            df = pd.concat([df,dist_sum[0]],ignore_index=True)

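    # Close the PDF file now that every page has been read
    pdfFileObj.close()

    # Rename the first column
    df = df.rename(columns={"Unnamed: 0": "Analyte"})

    # Copy each analyte name from the first row of a group to the 2nd and 3rd rows
    # (ASSUMES THAT THERE ARE THREE SAMPLES PER DISTRIBUTION).
    # Positional .iloc assignment avoids the chained indexing that triggers
    # pandas' SettingWithCopyWarning.
    analyte_col = df.columns.get_loc('Analyte')
    analyte_names = df.iloc[0::3, analyte_col].to_numpy()
    df.iloc[1::3, analyte_col] = analyte_names
    df.iloc[2::3, analyte_col] = analyte_names

    return df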

In [3]:
EQADistSum("4000 UKNEQAS Newborn British Isles 314.pdf")


|   | Analyte | Specimen | Pool | Result | Targ | Scheme name |
|---|---|---|---|---|---|---|
| 0 | Initial Phe | 314A | 659 | 475 | 445 | Birmingham Quality ~ Newborn British Isles |
| 1 | Initial Phe | 314B | 660 | 637 | 537 | Birmingham Quality ~ Newborn British Isles |
| 2 | Initial Phe | 314C | 661 | 652 | 640 | Birmingham Quality ~ Newborn British Isles |
| 3 | Initial Phe interpretation | 314A | 659 | F | F | Birmingham Quality ~ Newborn British Isles |
| 4 | Initial Phe interpretation | 314B | 660 | F | F | Birmingham Quality ~ Newborn British Isles |
| 5 | Initial Phe interpretation | 314C | 661 | F | F | Birmingham Quality ~ Newborn British Isles |
| 6 | Final Phe | 314A | 659 | 491 | 440 | Birmingham Quality ~ Newborn British Isles |
| 7 | Final Phe | 314B | 660 | 618 | 528 | Birmingham Quality ~ Newborn British Isles |
| 8 | Final Phe | 314C | 661 | 670 | 626 | Birmingham Quality ~ Newborn British Isles |
| 9 | Tyr | 314A | 659 | 147 | 129 | Birmingham Quality ~ Newborn British Isles |
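As one illustration of reusing the extracted values, for instance during a method validation, the sketch below computes the percentage difference between each result and its target. The first four rows of the example output above are typed in by hand, the choice of metric and the `bias_pct` column name are assumptions made for this example, and storing the values as text is also assumed, since the scraper's column types are not shown.

```python
import pandas as pd

# First rows of the example output above, typed in by hand
summary = pd.DataFrame({
    "Analyte": ["Initial Phe", "Initial Phe", "Initial Phe",
                "Initial Phe interpretation"],
    "Specimen": ["314A", "314B", "314C", "314A"],
    "Result": ["475", "637", "652", "F"],
    "Targ": ["445", "537", "640", "F"],
})

# Convert to numbers; non-numeric entries such as "F" become NaN
result = pd.to_numeric(summary["Result"], errors="coerce")
target = pd.to_numeric(summary["Targ"], errors="coerce")

# Percentage difference of the laboratory result from the target value
bias = summary.assign(
    Result=result,
    Targ=target,
    bias_pct=(result - target) / target * 100,
)

# Keep only the rows where a numeric comparison was possible
bias = bias.dropna(subset=["bias_pct"])
print(bias[["Analyte", "Specimen", "Result", "Targ", "bias_pct"]].round(1))
```

Interpretation rows such as "Initial Phe interpretation" hold codes instead of numbers, so they drop out of the numeric comparison here and would need a separate check.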


## Multi report extractor

If we have many EQA reports to extract, then we can place them all in a subfolder, and extract results from all reports in that folder.

The outputs are combined and exported as a .csv file.
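The folder-wide extraction can be sketched as a small helper that takes the extractor as an argument. The names `combine_reports` and `extract`, and the extra `Source file` column, are hypothetical additions for illustration and not part of the scraper.

```python
from pathlib import Path

import pandas as pd


def combine_reports(folder, extract):
    """Run `extract` on every PDF in `folder` and stack the results."""
    frames = []
    # Path.glob yields full paths, so the extractor can open each file directly
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        frame = extract(str(pdf_path))
        # Record which report each row came from (illustrative extra column)
        frame = frame.assign(**{"Source file": pdf_path.name})
        frames.append(frame)
    # Concatenate once at the end; an empty folder gives an empty DataFrame
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

With the function defined earlier, a call such as `combine_reports("EQA reports", EQADistSum).to_csv("EQAresults.csv", index=False)` would be expected to produce the combined file. The `*.pdf` pattern skips other files in the folder, and on case-sensitive file systems it will not match an upper-case `.PDF` extension.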

In [4]:
import os

In [5]:
# Build the path with os.path.join so it works on any operating system
folder_location = os.path.join('.', 'EQA reports')
os.makedirs(folder_location, exist_ok=True)

In [6]:
combined = pd.DataFrame()

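for file in os.listdir(folder_location):
    # Skip anything in the folder that is not a PDF report
    if not file.lower().endswith('.pdf'):
        continue
    # os.listdir returns bare file names, so prepend the folder
    individual_results = EQADistSum(os.path.join(folder_location, file))

    combined = pd.concat([combined, individual_results], ignore_index=True)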

combined.to_csv('EQAresults.csv', index=False)

