## Data linkage bias analysis tutorial 

This file is a tutorial on how data linkage error can create bias in statistics.

Differential error rates mean that error can be amplified for subgroups within the data that are better or worse represented.

In this example, matched records are records where MPS (or any other data linkage solution) has managed to return an NHS number for a patient. Unmatched records are records which failed to trace, perhaps due to poor data quality, so either a one time use ID is returned or the records are excluded from the analysis.

This tutorial shows you how a decision to remove unmatched records can have unintended effects on downstream statistics and analysis. 


##### Data linkage glossary
MPS - Master Person Service - the system that enables data linkage between data sets in NHS England through Person_ID by tracing and verifying key identifiers against the Personal Demographics Service (PDS).

Matched - MPS has successfully returned an NHS number as a Person_ID for the record.

Unmatched - MPS was unable to return an NHS number for the record. This could be due to poor data quality in the record, or several records looking like close matches. In this instance a 'One time use ID' could be returned instead.

Linkage error - Bias introduced to publications and data assets from missed matches, as unmatched records are not distributed equitably across the entire population.

In [220]:
import pandas as pd
import numpy as np

%run tutorial_data.ipynb

data_linkage_df = test_df_1
realistic_data_linkage_df = test_df_2


##### Data profiling
Data profiling is the process of examining the data and collecting statistics and information about that data to ensure accurate linkage results.

In this example, we count how many patients in the original data are matched and unmatched, and how many were admitted and not admitted to hospital.


In [221]:
# Count the number of 'True' values in the 'Matched' column
matched_count = data_linkage_df['Matched'].sum()
not_matched_count = (~data_linkage_df['Matched']).sum()

# Count the number of 'True' values in the 'Admitted' column
admitted_count = data_linkage_df['Admitted'].sum()
not_admitted_count = (~data_linkage_df['Admitted']).sum()

print('Matched count:', matched_count)
print('Not matched count:', not_matched_count)
print('Admitted count:', admitted_count)
print('Not admitted count:', not_admitted_count)

Matched count: 9691
Not matched count: 309
Admitted count: 6166
Not admitted count: 3834


### Calculating the bias accrued from data linkage error when you cannot view unmatched records

If we want to calculate the proportion of people admitted to hospital, and we only have access to matched records, it is as if we cannot see the bottom row of this table.

Therefore, we can only calculate it as the number of records matched and admitted over the total number of matched records, or $A / (A+B)$

|       | Admitted | Not admitted |
| ----------- | ----------- |-----------|
| Matched      | A       |B|
| Unmatched   | C        |D|
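
The table above can also be built directly from a DataFrame. The following is a minimal sketch, assuming boolean `Matched` and `Admitted` columns as used in this tutorial. The six-row `toy_df` is made up purely for illustration and is not the tutorial data.

```python
import pandas as pd

# Hypothetical toy data (not the tutorial data): six records with the same
# boolean column names used in this tutorial.
toy_df = pd.DataFrame({
    "Matched":  [True, True, True, True, False, False],
    "Admitted": [True, True, True, False, True, False],
})

# Cross-tabulate matched status against admission status, then order the
# rows and columns so they line up with the A/B/C/D table (True first).
table = (
    pd.crosstab(toy_df["Matched"], toy_df["Admitted"])
    .reindex(index=[True, False], columns=[True, False], fill_value=0)
)

# Unpack the four cells: first row is matched (A, B), second is unmatched (C, D).
(a, b), (c, d) = table.to_numpy()

matched_only_rate = a / (a + b) * 100            # A / (A+B)
all_records_rate = (a + c) / (a + b + c + d) * 100  # (A+C) / (A+B+C+D)

print(table)
print("Matched-only admission rate:", round(matched_only_rate, 2), "%")
print("All-records admission rate:", round(all_records_rate, 2), "%")
```

With only six records, losing the two unmatched rows visibly changes the rate, which is the effect the rest of this tutorial measures on larger data.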

In [222]:
matched_and_admitted_count = len(data_linkage_df[(data_linkage_df['Matched'] == True) & (data_linkage_df['Admitted'] == True)]) # A

matched_and_not_admitted_count = len(data_linkage_df[(data_linkage_df['Matched'] == True) & (data_linkage_df['Admitted'] == False)]) # B

matched_admitted_proportion = matched_and_admitted_count/(matched_and_not_admitted_count+matched_and_admitted_count)*100

print(matched_and_not_admitted_count)
print(matched_and_admitted_count)
print(matched_admitted_proportion)

3711
5980
61.70673821071097


##### Accounting for linkage error

Ideally, we want to use all four cells of the table to calculate the admission rate.

This would be both the matched and unmatched admitted records, over the total records, $(A+C) / (A+B+C+D)$

|       | Admitted | Not admitted |
| ----------- | ----------- |-----------|
| Matched      | A       |B|
| Unmatched   | C        |D|

In [223]:
matched_and_admitted_count = len(data_linkage_df[(data_linkage_df['Matched'] == True) & (data_linkage_df['Admitted'] == True)]) # A
matched_and_not_admitted_count = len(data_linkage_df[(data_linkage_df['Matched'] == True) & (data_linkage_df['Admitted'] == False)]) # B

not_matched_and_admitted_count = len(data_linkage_df[(data_linkage_df['Matched'] == False) & (data_linkage_df['Admitted'] == True)]) # C
not_matched_and_not_admitted_count = len(data_linkage_df[(data_linkage_df['Matched'] == False) & (data_linkage_df['Admitted'] == False)]) # D

proportion_accounting_for_linkage_error = (
    (matched_and_admitted_count+not_matched_and_admitted_count)/
    (not_matched_and_admitted_count+matched_and_admitted_count+not_matched_and_not_admitted_count+matched_and_not_admitted_count)
    *100
    ) # (A+C) / (A+B+C+D)


In [224]:
print('Original proportion of admitted, given only matched records are used:', round(matched_admitted_proportion,2), '%')

print('Proportion of admitted, given matched and unmatched records are used:', round(proportion_accounting_for_linkage_error,2), '%')

print('Difference in rates:', round(matched_admitted_proportion-proportion_accounting_for_linkage_error, 2), '%')

Original proportion of admitted, given only matched records are used: 61.71 %
Proportion of admitted, given matched and unmatched records are used: 61.66 %
Difference in rates: 0.05 %


### Accounting for linkage error in more 'realistic' data

As can be seen in the first example, including the unmatched records does not have a huge effect because the proportion of admitted patients in both groups is very similar.
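
One way to see why the gap is small: the all-records rate is a weighted average of the matched and unmatched rates, so the gap between the matched-only rate and the all-records rate equals the unmatched share multiplied by the difference between the matched and unmatched rates. This identity is an addition to the tutorial rather than part of the original notebook. The sketch below checks it using only the counts printed above, with C derived as total admitted minus A.

```python
# Counts taken from the outputs printed earlier in this tutorial.
matched_total = 9691      # A + B
unmatched_total = 309     # C + D
admitted_total = 6166     # A + C
matched_admitted = 5980   # A

# C is not printed directly, so derive it from the totals.
unmatched_admitted = admitted_total - matched_admitted  # C

total_records = matched_total + unmatched_total

matched_rate = matched_admitted / matched_total * 100
unmatched_rate = unmatched_admitted / unmatched_total * 100
all_records_rate = admitted_total / total_records * 100

# Share of records that failed to match (as a fraction, not a percentage).
unmatched_share = unmatched_total / total_records

# The gap computed directly, and the same gap from the decomposition.
direct_gap = matched_rate - all_records_rate
decomposed_gap = unmatched_share * (matched_rate - unmatched_rate)

print("Matched rate:", round(matched_rate, 2), "%")
print("Unmatched rate:", round(unmatched_rate, 2), "%")
print("Gap computed directly:", round(direct_gap, 2))
print("Gap from the decomposition:", round(decomposed_gap, 2))
```

The gap grows when either factor grows: a larger share of unmatched records, or a larger difference in admission rates between the matched and unmatched groups.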

However, data linkage error is not distributed proportionately across the population. Systematic reviews of studies comparing the characteristics of linked and unlinked records have found that more vulnerable or hard-to-reach populations are often missed, with the probability of a missed match being associated with a range of characteristics including gender, age, ethnicity, deprivation and health status.

This can have a large effect on downstream statistics for the specific segments of the population affected.

#### Data profiling a more 'realistic' sample

In this more 'realistic' example, patients in the unmatched records are younger on average than those in the matched records. This is often the case in our data sets because records for younger people tend to have poorer data quality, for example because they move house more frequently.


In [225]:
matched_records_realistic_data_linkage_df =  realistic_data_linkage_df[(realistic_data_linkage_df['Matched'] == True)]
unmatched_records_realistic_data_linkage_df =  realistic_data_linkage_df[(realistic_data_linkage_df['Matched'] == False)]

matched_records_realistic_data_linkage_count = len(matched_records_realistic_data_linkage_df)
unmatched_records_realistic_data_linkage_count = len(unmatched_records_realistic_data_linkage_df)

proportion_of_matched_records = (matched_records_realistic_data_linkage_count/(matched_records_realistic_data_linkage_count+unmatched_records_realistic_data_linkage_count))*100

print('Proportion of matched records:', proportion_of_matched_records, '%')

Proportion of matched records: 97.0 %


In [226]:
average_age_matched = round(matched_records_realistic_data_linkage_df['Age'].mean(),2)
average_age_unmatched = round(unmatched_records_realistic_data_linkage_df['Age'].mean(),2)

print('Average age for matched records:', average_age_matched)
print('Average age for unmatched records:', average_age_unmatched)


Average age for matched records: 44.74
Average age for unmatched records: 29.56
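
The same comparison can be written more compactly with a `groupby`. This is a minimal sketch, assuming a boolean `Matched` column and a numeric `Age` column as in this tutorial. The five-row `toy_df` and its values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the tutorial data).
toy_df = pd.DataFrame({
    "Matched": [True, True, True, False, False],
    "Age":     [60, 45, 30, 25, 21],
})

# Turn the boolean flag into readable labels to use as the grouping key.
status = toy_df["Matched"].map({True: "Matched", False: "Unmatched"})

# Record count and average age for each match status in one step.
age_by_match_status = toy_df.groupby(status)["Age"].agg(["count", "mean"])

print(age_by_match_status)
```

Profiling other characteristics this way (for example deprivation or ethnicity, where those columns exist) would show which groups are over-represented among unmatched records.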


#### How this can affect downstream statistics

Younger patients would be expected to have different hospital admission behaviour, as they are less likely to be admitted. This means that bias can creep into our calculations: if the unmatched, younger records are left out, the admission rate calculated from matched records alone could be overstated.

In [227]:
realistic_matched_and_admitted_count = len(realistic_data_linkage_df[(realistic_data_linkage_df['Matched'] == True) & (realistic_data_linkage_df['Admitted'] == True)]) # A
realistic_matched_and_not_admitted_count = len(realistic_data_linkage_df[(realistic_data_linkage_df['Matched'] == True) & (realistic_data_linkage_df['Admitted'] == False)]) # B

realistic_matched_admitted_proportion = realistic_matched_and_admitted_count/(realistic_matched_and_not_admitted_count+realistic_matched_and_admitted_count)*100

print('Proportion of records admitted in matched patients:', round(realistic_matched_admitted_proportion,2), '%')

Proportion of records admitted in matched patients: 61.56 %


In [237]:
realistic_not_matched_and_admitted_count = len(realistic_data_linkage_df[(realistic_data_linkage_df['Matched'] == False) & (realistic_data_linkage_df['Admitted'] == True)]) # C
realistic_not_matched_and_not_admitted_count = len(realistic_data_linkage_df[(realistic_data_linkage_df['Matched'] == False) & (realistic_data_linkage_df['Admitted'] == False)]) # D

realistic_proportion_admitted_unmatched = (
    realistic_not_matched_and_admitted_count/
    (realistic_not_matched_and_admitted_count+realistic_not_matched_and_not_admitted_count)
    *100
    ) # C / (C+D)

realistic_proportion_accounting_for_linkage_error = (
    (realistic_matched_and_admitted_count+realistic_not_matched_and_admitted_count)/
    (realistic_not_matched_and_admitted_count+realistic_matched_and_admitted_count+realistic_not_matched_and_not_admitted_count+realistic_matched_and_not_admitted_count)
    *100
    ) # (A+C) / (A+B+C+D)

realistic_difference_in_rates =  round((realistic_matched_admitted_proportion-realistic_proportion_accounting_for_linkage_error),2)


In [238]:
print('Original proportion of admitted, given only matched records are used:', round(realistic_matched_admitted_proportion, 2), '%')
print('Proportion admitted in unmatched records:', round(realistic_proportion_admitted_unmatched,2), '%')
print('Proportion of admitted, given matched and unmatched records are used:', round(realistic_proportion_accounting_for_linkage_error, 2), '%')
print('Difference in rates:', realistic_difference_in_rates, '%')

Original proportion of admitted, given only matched records are used: 61.56 %
Proportion admitted in unmatched records: 34.67 %
Proportion of admitted, given matched and unmatched records are used: 60.75 %
Difference in rates: 0.81 %


There is a noticeably larger difference here: the proportion of patients admitted when unmatched records are included is 60.75%, compared to 61.56% when only matched records are used. This is a difference of 0.81 percentage points, compared to 0.05 in the previous dataframe.

This effect is amplified further when filtering down to subgroups which are worse represented with regard to data linkage.

When we filter to ages 20-30, we can see that the proportion of matched records decreases.

In [235]:
realistic_data_linkage_ages_20_30 = (realistic_data_linkage_df[(realistic_data_linkage_df['Age']>=20)&(realistic_data_linkage_df['Age']<=30)])

matched_records_realistic_data_linkage_20_30 =  realistic_data_linkage_ages_20_30[(realistic_data_linkage_ages_20_30['Matched'] == True)]
unmatched_records_realistic_data_linkage_20_30 =  realistic_data_linkage_ages_20_30[(realistic_data_linkage_ages_20_30['Matched'] == False)]

matched_records_realistic_data_linkage_count_20_30 = len(matched_records_realistic_data_linkage_20_30)
unmatched_records_realistic_data_linkage_count_20_30 = len(unmatched_records_realistic_data_linkage_20_30)

proportion_of_matched_records_20_30 = (matched_records_realistic_data_linkage_count_20_30/(matched_records_realistic_data_linkage_count_20_30+unmatched_records_realistic_data_linkage_count_20_30))*100

print('Proportion of matched records:', round(proportion_of_matched_records_20_30, 2), '%')

Proportion of matched records: 95.7 %


In [230]:

      
realistic_matched_and_admitted_count_20_30 = len(realistic_data_linkage_ages_20_30[(realistic_data_linkage_ages_20_30['Matched'] == True) & (realistic_data_linkage_ages_20_30['Admitted'] == True)]) # A
realistic_matched_and_not_admitted_count_20_30 = len(realistic_data_linkage_ages_20_30[(realistic_data_linkage_ages_20_30['Matched'] == True) & (realistic_data_linkage_ages_20_30['Admitted'] == False)]) # B

realistic_matched_admitted_proportion_20_30 = realistic_matched_and_admitted_count_20_30/(realistic_matched_and_not_admitted_count_20_30+realistic_matched_and_admitted_count_20_30)*100

print('Proportion of records admitted in matched patients:', round(realistic_matched_admitted_proportion_20_30,2), '%')

Proportion of records admitted in matched patients: 62.37 %


In [231]:
realistic_not_matched_and_admitted_count_20_30 = len(realistic_data_linkage_ages_20_30[(realistic_data_linkage_ages_20_30['Matched'] == False) & (realistic_data_linkage_ages_20_30['Admitted'] == True)]) # C
realistic_not_matched_and_not_admitted_count_20_30 = len(realistic_data_linkage_ages_20_30[(realistic_data_linkage_ages_20_30['Matched'] == False) & (realistic_data_linkage_ages_20_30['Admitted'] == False)]) # D

realistic_proportion_admitted_unmatched_20_30 = (
    realistic_not_matched_and_admitted_count_20_30/
    (realistic_not_matched_and_admitted_count_20_30+realistic_not_matched_and_not_admitted_count_20_30)
    *100
    ) # C / (C+D)

realistic_proportion_accounting_for_linkage_error_20_30 = (
    (realistic_matched_and_admitted_count_20_30+realistic_not_matched_and_admitted_count_20_30)/
    (realistic_not_matched_and_admitted_count_20_30+realistic_matched_and_admitted_count_20_30+realistic_not_matched_and_not_admitted_count_20_30+realistic_matched_and_not_admitted_count_20_30)
    *100
    ) # (A+C) / (A+B+C+D)

In [233]:
print('Original proportion of admitted, given only matched records are used:', round(realistic_matched_admitted_proportion_20_30, 2), '%')
print('Proportion admitted in unmatched records:', round(realistic_proportion_admitted_unmatched_20_30,2), '%')
print('Proportion of admitted, given matched and unmatched records are used:', round(realistic_proportion_accounting_for_linkage_error_20_30, 2), '%')
print('Difference in rates:', round((realistic_matched_admitted_proportion_20_30-realistic_proportion_accounting_for_linkage_error_20_30),2), '%')

Original proportion of admitted, given only matched records are used: 62.37 %
Proportion admitted in unmatched records: 39.62 %
Proportion of admitted, given matched and unmatched records are used: 61.39 %
Difference in rates: 0.98 %
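
The whole-sample and age 20-30 calculations above repeat the same steps, so they could be wrapped in one helper and applied to any subgroup. The following is a sketch only: `linkage_bias_summary` is a hypothetical name introduced here, and the eight-row `toy_df` is made-up data that deliberately exaggerates the swing. It assumes boolean `Matched` and `Admitted` columns and a numeric `Age` column.

```python
import pandas as pd


def linkage_bias_summary(df):
    """Return admission rates (%) with and without unmatched records.

    Hypothetical helper for illustration; expects boolean 'Matched' and
    'Admitted' columns.
    """
    matched = df[df["Matched"]]
    # The mean of a boolean column is the proportion of True values.
    matched_only_rate = matched["Admitted"].mean() * 100   # A / (A+B)
    all_records_rate = df["Admitted"].mean() * 100         # (A+C) / (A+B+C+D)
    return pd.Series({
        "records": len(df),
        "match_rate": df["Matched"].mean() * 100,
        "matched_only_rate": matched_only_rate,
        "all_records_rate": all_records_rate,
        "difference": matched_only_rate - all_records_rate,
    })


# Hypothetical toy data (not the tutorial data).
toy_df = pd.DataFrame({
    "Age":      [22, 24, 27, 29, 41, 55, 63, 70],
    "Matched":  [True, True, False, False, True, True, True, False],
    "Admitted": [True, True, False, False, True, False, True, True],
})

whole_sample = linkage_bias_summary(toy_df)
# between() is inclusive at both ends, matching the >= 20 and <= 30 filter above.
ages_20_30 = linkage_bias_summary(toy_df[toy_df["Age"].between(20, 30)])

summary = pd.DataFrame({"All ages": whole_sample, "Ages 20-30": ages_20_30}).round(2)
print(summary)
```

In this toy data, two of the four people aged 20-30 are unmatched and neither was admitted, so dropping them moves the subgroup's admission rate from 50% to 100%.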


### Calculating the bias accrued from data linkage error when your records are pseudonymised

If we want to calculate the proportion of people admitted to hospital, and we only have access to matched records, it is as if we cannot see the bottom row of this table.

Therefore, we can only calculate it as the number of records matched and admitted over the total number of matched records, or $A / (A+B)$

|       | Admitted | Not admitted |
| ----------- | ----------- |-----------|
| Matched      | A       |B|
| Unmatched   | C        |D|

### Review notes



who is the audience?  - engage the audience, and bear in mind different learning styles / reading comprehension

code is verbose  - not a barrier

parts are a bit wordy - level of language needs to be brought down

might need some scenarios - what does ‘unmatched’ mean

^ young people moving house more, numbers are abstract so try to tell the story

work with a smaller cohort (?) - easier to say 5 people are excluded 

tables are good as a matrix - anything you can show pictorially is a bonus

venn diagrams for exclusion? 

something more small scale - like a scenario - so you can ‘exaggerate’ a big swing

X number of people left out due to data linkage issues

try not to split out all three DFs   

audience - don’t need to ‘sell it’ - people will have to seek this out in some form