# Identifying Ethnicity in OpenSAFELY-TPP
This short report describes how ethnicity can be identified in the OpenSAFELY-TPP database, and the strengths and weaknesses of the methods. Ethnicity is known to be an important determinant of health outcomes, particularly during the COVID-19 outbreak where a complex interplay of social and biological factors resulted in increased exposure, reduced protection, and increased severity of illness. The recording of patients’ ethnic group in primary care can support efforts to achieve equity in service provision and outcomes. This is a living document that will be updated to reflect changes to the OpenSAFELY-TPP database and the patient records within.

## OpenSAFELY
OpenSAFELY is an analytics platform for conducting analyses on Electronic Health Records inside the secure environment where the records are held. This has multiple benefits: 

* We don't transport large volumes of potentially disclosive pseudonymised patient data outside of the secure environments for analysis
* Analyses can run in near real-time as records are ready for analysis as soon as they appear in the secure environment
* All infrastructure and analysis code is stored in GitHub repositories, which are open for security review, scientific review, and re-use

A key feature of OpenSAFELY is the use of study definitions, which are formal specifications of the datasets to be generated from the OpenSAFELY database. This takes care of much of the complex EHR data wrangling required to create a dataset in an analysis-ready format. It also creates a library of standardised and validated variable definitions that can be deployed consistently across multiple projects. 

The purpose of this report is to describe the main variables that relate to ethnicity, and their relative strengths and weaknesses.

## Available Records
OpenSAFELY-TPP runs inside TPP’s data centre which contains the primary care records for all patients registered at practices using TPP’s SystmOne Clinical Information System. This data centre also imports external datasets from other sources, including A&E attendances and hospital admissions from NHS Digital’s Secondary Use Service, and death registrations from the ONS. More information on available data sources can be found within the [OpenSAFELY documentation](https://docs.opensafely.org/data-sources/intro/). 

## Methods

In OpenSAFELY-TPP, there is no categorical “ethnicity” variable to record this information. Rather, ethnicity is recorded using clinical codes, like any other clinical or administrative event, with specific codes relating to specific ethnic groups.

We define three codelists to capture primary care ethnicity in OpenSAFELY-TPP: "[2020-CTV3](https://www.opencodelists.org/codelist/opensafely/ethnicity/2020-04-27)", "[2022-SNOMED](https://www.opencodelists.org/codelist/opensafely/ethnicity-snomed-0removed/2e641f61/)" and "[2021-PRIMIS](https://www.opencodelists.org/codelist/primis-covid19-vacc-uptake/eth2001/v1/)".
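As a minimal sketch of how code-based recording turns into an ethnicity category, the snippet below looks up the most recent matching code for one patient. The codes (`CODE_A`, `CODE_B`, `CODE_C`) and the function name are hypothetical placeholders, not entries from any of the three codelists; only the five group labels come from this report.

```python
# The five high-level groups used throughout this report.
GROUPS_5 = {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"}

# Placeholder codes (NOT real CTV3/SNOMED codes), each assigned a category.
EXAMPLE_CODELIST = {"CODE_A": 1, "CODE_B": 3, "CODE_C": 5}


def ethnicity_from_events(events, codelist=EXAMPLE_CODELIST):
    """Return the 5-group label of the most recent matching event, or None.

    `events` is an iterable of (iso_date_string, code) tuples.
    """
    matching = [(date, code) for date, code in events if code in codelist]
    if not matching:
        return None  # no ethnicity code recorded for this patient
    _, latest_code = max(matching)  # ISO dates sort chronologically
    return GROUPS_5[codelist[latest_code]]


patient_events = [
    ("2015-03-01", "CODE_A"),
    ("2019-07-12", "UNRELATED_CODE"),
    ("2021-01-20", "CODE_B"),
]
print(ethnicity_from_events(patient_events))
```

Events whose code is not in the codelist are ignored, so a patient with no matching code has no ethnicity under that codelist.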


To evaluate how well each of these codelists is populated, we count the number of patients with at least one record from each codelist, as well as the number of patients in each ethnicity group.
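The counting step can be sketched on toy data as follows. The patient IDs and records are invented, and treating "all filled" as "a record in every one of the three codelists" is an assumption about how that column is defined.

```python
CODELISTS = ("2020-CTV3", "2022-SNOMED", "2021-PRIMIS")

# Toy records: (patient_id, codelist_name). All values are invented.
records = [
    (1, "2020-CTV3"), (1, "2022-SNOMED"), (1, "2021-PRIMIS"),
    (2, "2020-CTV3"), (2, "2020-CTV3"),
    (3, "2022-SNOMED"), (3, "2021-PRIMIS"),
]
registered = {1, 2, 3, 4}  # patient 4 has no ethnicity record at all


def coverage(records, registered, codelists=CODELISTS):
    """Count distinct registered patients with at least one record per codelist."""
    patients = {name: set() for name in codelists}
    for patient_id, codelist in records:
        if patient_id in registered and codelist in patients:
            patients[codelist].add(patient_id)
    counts = {name: len(ids) for name, ids in patients.items()}
    # Assumed meaning of "all filled": a record in every codelist.
    counts["all_filled"] = len(set.intersection(*patients.values()))
    counts["population"] = len(registered)
    return counts


print(coverage(records, registered))
```

Using sets means a patient with several records in the same codelist (patient 2 here) is counted once.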

We examine trends across the whole population and by each of the following demographic and clinical subgroups to detect any inequalities.

Demographic covariates:

* Age band
* Sex
* Ethnicity
* Region
* IMD

Clinical covariates:

* Dementia
* Diabetes
* Learning disability

Any counts below 6 were redacted, and all other values were rounded to the nearest 5.
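The disclosure-control rule above can be sketched as a small helper. The function name and the use of `None` to mark a redacted value are illustrative choices, and treating a count of zero as redacted is an assumption, since the report does not say how zeros are handled.

```python
def redact_and_round(count, threshold=6, base=5):
    """Return None for counts below `threshold`, else round to the nearest `base`."""
    if count < threshold:
        return None  # redacted (assumption: zero counts are redacted too)
    # Integer arithmetic avoids floating-point surprises when rounding.
    return base * ((count + base // 2) // base)


for raw in (3, 6, 8, 273):
    print(raw, "->", redact_and_round(raw))
```

Because the smallest unredacted count is 6, the smallest value that can appear after rounding is 5.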

In [348]:
import sys

In [349]:
import os
import pandas as pd
import numpy as np
from itertools import product
from IPython.display import display, Markdown, Image

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
pd.options.mode.chained_assignment = None 
pd.options.display.float_format = '{:,.0f}'.format


In [350]:

def local_patient_counts(
    definitions, output_path, code_dict="", categories=False, missing=False,
):
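    """Display patient counts, with percentages, for each ethnicity definition.

    Reads the pre-computed count tables under ``output/{output_path}``. With
    ``categories=True`` the counts are broken down by the categories in
    ``code_dict``; with ``missing=True`` patients without a record are
    counted instead of patients with one.
    """
    # Column suffix and overlap column depend on whether we report
    # patients with a record ("filled") or without one ("missing").
    suffix = "_filled"
    overlap = "all_filled"
    if missing:
        suffix = "_missing"
        overlap = "all_missing"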
    if categories:
        df_population = pd.read_csv(
            f"output/{output_path}/simple_patient_counts_registered.csv"
        ).set_index(["group", "subgroup"])
        

        df_append = pd.read_csv(
            f"output/{output_path}/simple_patient_counts_categories_registered.csv"
        ).set_index(["group", "subgroup"])
        
        if output_path == output_path_5:
            print("Daisy1")
            global df_append_cat_5
            df_append_cat_5 = df_append

        if output_path == output_path_16:
            print("Daisy2")
            global df_append_cat_16
            df_append_cat_16 = df_append

        df_append.drop("population", inplace=True, axis=1)
        df_append["population"] = df_population[definitions[0]+"_filled"]
        # ensure definitions[n] in code_dict[definitions[n]] below refers to one of the definitions of interest
        definitions = [
            f"{category}_{definition}"
            for category, definition in product(
                code_dict[definitions[1]].values(), definitions
            )
        ]
    else:
        df_append = pd.read_csv(
            f"output/{output_path}/simple_patient_counts_registered.csv"
        ).set_index(["group", "subgroup"])
        global total
        # First row is the whole population; use positional access explicitly
        total = df_append["all_filled"].iloc[0]
    for definition in definitions:
        if missing:
            df_append[definition + suffix] = (
                df_append["population"] - df_append[definition + "_filled"]
            )    
        df_append[definition + "_pct"] = round(
            (df_append[definition + suffix].div(df_append["population"])) * 100, 1
        )
        df_append[overlap + "_pct"] = round(
            (df_append[overlap].div(df_append["population"])) * 100, 1
        )

        # Combine count and percentage columns
        df_append[definition] = (
            df_append[definition + suffix].apply(lambda x: "{:,.0f}".format(x))
            + " ("
            + df_append[definition + "_pct"].astype(str)
            + ")"
        )
        df_append = df_append.drop(columns=[definition + suffix, definition + "_pct"])
    df_append[overlap] = (
        df_append[overlap].apply(lambda x: "{:,.0f}".format(x))
        + " ("
        + df_append[overlap + "_pct"].astype(str)
        + ")"
    )
    df_append = df_append.drop(columns=[overlap + "_pct"])
    df_patient_counts = df_append[definitions + [overlap] + ["population"]]
    # Final redaction step
    df_patient_counts = df_patient_counts.replace(np.nan, "-")
    df_patient_counts = df_patient_counts.replace("nan (nan)", "- (-)")
    df_patient_counts.columns = df_patient_counts.columns.str.replace("_", " ")
    
    display(df_patient_counts)
    
    if categories:
        df_patient_counts.to_csv(
                f"output/{output_path}/local_patient_counts_categories_registered.csv"
            )
    
    

In [351]:
### CONFIGURE ###
definitions_5 = ['ethnicity_5', 'ethnicity_new_5', 'ethnicity_primis_5']
definitions_16 = ['ethnicity_16', 'ethnicity_new_16', 'ethnicity_primis_16']
covariates = ['_age_band','_sex','_region','_imd','_dementia','_diabetes','_hypertension','_learning_disability']
output_path_5 = 'simplified_output/5_group/tables'
output_path_16 = 'simplified_output/16_group/tables'
suffixes = ['','_missing']
suffix = ''
code_dict_5 = {
    "imd": {
        0: "Unknown",
        1: "1 Most deprived",
        2: "2",
        3: "3",
        4: "4",
        5: "5 Least deprived",
    },
    "ethnicity_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
    "ethnicity_new_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
    "ethnicity_primis_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
}

# Code dictionary
code_dict_16 = {
    "imd": {
        0: "Unknown",
        1: "1 Most deprived",
        2: "2",
        3: "3",
        4: "4",
        5: "5 Least deprived",
    },
    "ethnicity_16": {
        1: "White_British",
        2: "White_Irish",
        3: "Other_White",
        4: "White_and_Black_Caribbean",
        5: "White_and_Black_African",
        6: "White_and_Asian",
        7: "Other_Mixed",
        8: "Indian",
        9: "Pakistani",
        10: "Bangladeshi",
        11: "Other_Asian",
        12: "Caribbean",
        13: "African",
        14: "Other_Black",
        15: "Chinese",
        16: "Any_other_ethnic_group",
    },
    "ethnicity_new_16": {
        1: "White_British",
        2: "White_Irish",
        3: "Other_White",
        4: "White_and_Black_Caribbean",
        5: "White_and_Black_African",
        6: "White_and_Asian",
        7: "Other_Mixed",
        8: "Indian",
        9: "Pakistani",
        10: "Bangladeshi",
        11: "Other_Asian",
        12: "Caribbean",
        13: "African",
        14: "Other_Black",
        15: "Chinese",
        16: "Any_other_ethnic_group",
    },
    "ethnicity_primis_16": {
        1: "White_British",
        2: "White_Irish",
        3: "Other_White",
        4: "White_and_Black_Caribbean",
        5: "White_and_Black_African",
        6: "White_and_Asian",
        7: "Other_Mixed",
        8: "Indian",
        9: "Pakistani",
        10: "Bangladeshi",
        11: "Other_Asian",
        12: "Caribbean",
        13: "African",
        14: "Other_Black",
        15: "Chinese",
        16: "Any_other_ethnic_group",
    },
}


## Results

### Count of Patients

In [352]:
local_patient_counts(
         definitions_5,  output_path_5
    )


group,subgroup,ethnicity 5,ethnicity new 5,ethnicity primis 5,all filled,population
all,with records,495 (76.2),480 (73.8),485 (74.6),275 (42.3),650
age_band,0-19,70 (77.8),65 (72.2),65 (72.2),40 (44.4),90
age_band,20-29,65 (81.2),60 (75.0),65 (81.2),40 (50.0),80
age_band,30-39,65 (81.2),70 (87.5),60 (75.0),40 (50.0),80
age_band,40-49,60 (70.6),60 (70.6),60 (70.6),30 (35.3),85
age_band,50-59,55 (68.8),55 (68.8),55 (68.8),30 (37.5),80
age_band,60-69,55 (68.8),65 (81.2),65 (81.2),35 (43.8),80
age_band,70-79,65 (76.5),60 (70.6),65 (76.5),40 (47.1),85
age_band,80+,55 (73.3),50 (66.7),50 (66.7),25 (33.3),75
sex,F,250 (78.1),240 (75.0),235 (73.4),145 (45.3),320


In [353]:
display(Markdown(f"""
Around 14.8 million patients registered in OpenSAFELY-TPP have a record in each of the three codelists. 2020-CTV3 is the most well-populated, with {float('%.2g' % total)/1000000} million patients having at least one 2020-CTV3 recording of ethnicity. 
"""))


Around 14.8 million patients registered in OpenSAFELY-TPP have a record in each of the three codelists. 2020-CTV3 is the most well-populated, with 0.00028 million patients having at least one 2020-CTV3 recording of ethnicity.


### Count by Category

#### 5 Group

The 2022-SNOMED codelist is the most well-populated for White (15.9 million), Mixed (360,000), Asian (1.7 million) and Black (570,000) ethnicities. The 2020-CTV3 codelist classifies more people as Other than the 2022-SNOMED codelist (550,000 and 470,000 respectively). However, the 2020-CTV3 codelist included some codes relating to religion rather than ethnicity (e.g. “XaJSe: Muslim - ethnic category 2001 census”), which were excluded from the 2022-SNOMED codelist.

In [355]:
local_patient_counts(
         definitions_5,  output_path_5,code_dict_5, categories=True,missing=False
    )

Daisy1


group,subgroup,White ethnicity 5,White ethnicity new 5,White ethnicity primis 5,Mixed ethnicity 5,Mixed ethnicity new 5,Mixed ethnicity primis 5,Asian ethnicity 5,Asian ethnicity new 5,Asian ethnicity primis 5,Black ethnicity 5,Black ethnicity new 5,Black ethnicity primis 5,Other ethnicity 5,Other ethnicity new 5,Other ethnicity primis 5,all filled,population
all,with records,90 (18.2),285 (57.6),90 (18.2),100 (20.2),45 (9.1),95 (19.2),85 (17.2),45 (9.1),100 (20.2),120 (24.2),60 (12.1),90 (18.2),95 (19.2),50 (10.1),105 (21.2),275 (55.6),495
age_band,0-19,10 (14.3),35 (50.0),10 (14.3),15 (21.4),- (-),10 (14.3),15 (21.4),10 (14.3),15 (21.4),20 (28.6),- (-),15 (21.4),15 (21.4),- (-),15 (21.4),40 (57.1),70
age_band,20-29,15 (23.1),40 (61.5),10 (15.4),10 (15.4),- (-),15 (23.1),10 (15.4),- (-),15 (23.1),15 (23.1),10 (15.4),10 (15.4),10 (15.4),- (-),15 (23.1),40 (61.5),65
age_band,30-39,15 (23.1),35 (53.8),15 (23.1),10 (15.4),- (-),10 (15.4),10 (15.4),10 (15.4),10 (15.4),15 (23.1),- (-),10 (15.4),15 (23.1),10 (15.4),15 (23.1),40 (61.5),65
age_band,40-49,15 (25.0),35 (58.3),- (-),10 (16.7),- (-),15 (25.0),- (-),- (-),15 (25.0),10 (16.7),- (-),10 (16.7),15 (25.0),10 (16.7),15 (25.0),30 (50.0),60
age_band,50-59,10 (18.2),40 (72.7),10 (18.2),20 (36.4),- (-),15 (27.3),- (-),- (-),10 (18.2),15 (27.3),- (-),- (-),- (-),- (-),10 (18.2),30 (54.5),55
age_band,60-69,10 (18.2),35 (63.6),15 (27.3),20 (36.4),10 (18.2),10 (18.2),10 (18.2),- (-),15 (27.3),15 (27.3),- (-),10 (18.2),- (-),- (-),15 (27.3),35 (63.6),55
age_band,70-79,15 (23.1),40 (61.5),15 (23.1),10 (15.4),- (-),15 (23.1),10 (15.4),- (-),10 (15.4),15 (23.1),10 (15.4),15 (23.1),15 (23.1),- (-),15 (23.1),40 (61.5),65
age_band,80+,- (-),25 (45.5),10 (18.2),10 (18.2),- (-),10 (18.2),15 (27.3),- (-),15 (27.3),15 (27.3),- (-),- (-),10 (18.2),- (-),10 (18.2),25 (45.5),55
sex,F,50 (20.0),145 (58.0),45 (18.0),50 (20.0),15 (6.0),50 (20.0),40 (16.0),30 (12.0),45 (18.0),60 (24.0),35 (14.0),45 (18.0),50 (20.0),20 (8.0),45 (18.0),145 (58.0),250


In [354]:
display(Markdown(f"""
The 2022-SNOMED codelist is the most well-populated for White ({float('%.2g' % df_append_cat_5["White_ethnicity_new_5_filled"].iloc[0])/1000000} million), Mixed ({float('%.2g' % df_append_cat_5["Mixed_ethnicity_new_5_filled"].iloc[0])/1000000} million), Asian ({float('%.2g' % df_append_cat_5["Asian_ethnicity_new_5_filled"].iloc[0])/1000000} million) and Black ({float('%.2g' % df_append_cat_5["Black_ethnicity_new_5_filled"].iloc[0])/1000000} million) ethnicities. The 2020-CTV3 codelist classifies more people as Other than the 2022-SNOMED codelist ({float('%.0g' % df_append_cat_5["Other_ethnicity_5_filled"].iloc[0])} and {float('%.0g' % df_append_cat_5["Other_ethnicity_new_5_filled"].iloc[0])} respectively). However, the 2020-CTV3 codelist included some codes relating to religion rather than ethnicity (e.g. “XaJSe: Muslim - ethnic category 2001 census”), which were excluded from the 2022-SNOMED codelist.
"""))



The 2022-SNOMED codelist is the most well-populated for White (0.00028 million), Mixed (4.5e-05 million), Asian (4.5e-05 million) and Black (6e-05 million) ethnicities. The 2020-CTV3 codelist classifies more people as Other than the 2022-SNOMED codelist (100.0 and 50.0 respectively). However, the 2020-CTV3 codelist included some codes relating to religion rather than ethnicity (e.g. “XaJSe: Muslim - ethnic category 2001 census”), which were excluded from the 2022-SNOMED codelist.


#### 16 group

Expanding to the 16 groups, the table below gives the same counts for each codelist, broken down by the 16 ethnicity categories.

In [356]:
local_patient_counts(
         definitions_16,  output_path_16,code_dict_16, categories=True,missing=False
    )

Daisy2


group,subgroup,White British ethnicity 16,White British ethnicity new 16,White British ethnicity primis 16,White Irish ethnicity 16,White Irish ethnicity new 16,White Irish ethnicity primis 16,Other White ethnicity 16,Other White ethnicity new 16,Other White ethnicity primis 16,White and Black Caribbean ethnicity 16,White and Black Caribbean ethnicity new 16,White and Black Caribbean ethnicity primis 16,White and Black African ethnicity 16,White and Black African ethnicity new 16,White and Black African ethnicity primis 16,White and Asian ethnicity 16,White and Asian ethnicity new 16,White and Asian ethnicity primis 16,Other Mixed ethnicity 16,Other Mixed ethnicity new 16,Other Mixed ethnicity primis 16,Indian ethnicity 16,Indian ethnicity new 16,Indian ethnicity primis 16,Pakistani ethnicity 16,Pakistani ethnicity new 16,Pakistani ethnicity primis 16,Bangladeshi ethnicity 16,Bangladeshi ethnicity new 16,Bangladeshi ethnicity primis 16,Other Asian ethnicity 16,Other Asian ethnicity new 16,Other Asian ethnicity primis 16,Caribbean ethnicity 16,Caribbean ethnicity new 16,Caribbean ethnicity primis 16,African ethnicity 16,African ethnicity new 16,African ethnicity primis 16,Other Black ethnicity 16,Other Black ethnicity new 16,Other Black ethnicity primis 16,Chinese ethnicity 16,Chinese ethnicity new 16,Chinese ethnicity primis 16,Any other ethnic group ethnicity 16,Any other ethnic group ethnicity new 16,Any other ethnic group ethnicity primis 16,all filled,population
all,with records,30 (6.1),25 (5.1),30 (6.1),20 (4.1),45 (9.2),40 (8.2),35 (7.1),40 (8.2),35 (7.1),30 (6.1),30 (6.1),30 (6.1),35 (7.1),20 (4.1),30 (6.1),35 (7.1),35 (7.1),25 (5.1),30 (6.1),25 (5.1),35 (7.1),35 (7.1),30 (6.1),25 (5.1),40 (8.2),30 (6.1),35 (7.1),30 (6.1),30 (6.1),25 (5.1),30 (6.1),35 (7.1),35 (7.1),35 (7.1),30 (6.1),30 (6.1),15 (3.1),25 (5.1),25 (5.1),30 (6.1),35 (7.1),30 (6.1),30 (6.1),30 (6.1),30 (6.1),35 (7.1),20 (4.1),25 (5.1),280 (57.1),490
age_band,0-19,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (14.3),- (-),- (-),- (-),- (-),- (-),40 (57.1),70
age_band,20-29,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (18.2),- (-),- (-),- (-),- (-),25 (45.5),55
age_band,30-39,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (16.7),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (16.7),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),35 (58.3),60
age_band,40-49,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (15.4),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (15.4),- (-),10 (15.4),- (-),- (-),- (-),- (-),- (-),10 (15.4),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),45 (69.2),65
age_band,50-59,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),30 (50.0),60
age_band,60-69,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (15.4),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (15.4),45 (69.2),65
age_band,70-79,- (-),- (-),- (-),- (-),10 (16.7),- (-),- (-),- (-),- (-),- (-),10 (16.7),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),30 (50.0),60
age_band,80+,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (16.7),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),30 (50.0),60
sex,F,20 (8.3),10 (4.2),20 (8.3),10 (4.2),25 (10.4),20 (8.3),20 (8.3),20 (8.3),15 (6.2),15 (6.2),15 (6.2),15 (6.2),15 (6.2),10 (4.2),20 (8.3),15 (6.2),15 (6.2),15 (6.2),15 (6.2),15 (6.2),20 (8.3),- (-),15 (6.2),10 (4.2),20 (8.3),10 (4.2),15 (6.2),15 (6.2),15 (6.2),15 (6.2),15 (6.2),15 (6.2),10 (4.2),20 (8.3),15 (6.2),20 (8.3),10 (4.2),15 (6.2),10 (4.2),15 (6.2),20 (8.3),10 (4.2),20 (8.3),15 (6.2),10 (4.2),20 (8.3),10 (4.2),10 (4.2),135 (56.2),240


### Latest vs. Most Common

#### 5 Group

Overall, 98% of the latest 6 group ethnicity records matched the most frequent 6 group ethnicity for all codelists. In 2022-SNOMED, 99.2% of those whose most recent ethnicity was ‘White’ also had ‘White’ as their most frequent ethnicity. ‘Mixed’ was the least concordant group in 2022-SNOMED: 77.0% of those whose most recent ethnicity was ‘Mixed’ also had ‘Mixed’ as their most frequent ethnicity. 3.1% of those with latest ethnicity ‘Black’ had the most frequent ethnicity ‘White’.
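To make the comparison concrete, the sketch below derives the latest and the most frequent category for each patient and cross-tabulates them. The histories are invented, the function names are illustrative, and breaking ties in favour of the most recently recorded category is an assumption rather than the rule used to produce the tables in this report.

```python
from collections import Counter


def latest_and_most_common(history):
    """Return (latest, most_common) category for one patient.

    `history` is a non-empty list of (iso_date_string, category) tuples.
    """
    ordered = sorted(history)  # ISO dates sort chronologically
    latest = ordered[-1][1]
    counts = Counter(category for _, category in ordered)
    top = max(counts.values())
    # Assumed tie-break: the most recently recorded of the tied categories.
    most_common = next(c for _, c in reversed(ordered) if counts[c] == top)
    return latest, most_common


def concordance(histories):
    """Cross-tabulate latest vs most common and return (table, percent matching)."""
    table = Counter(latest_and_most_common(h) for h in histories)
    total = sum(table.values())
    if total == 0:
        return table, float("nan")
    matching = sum(n for (latest, common), n in table.items() if latest == common)
    return table, 100 * matching / total


histories = [
    [("2010-01-01", "White"), ("2015-01-01", "White"), ("2020-01-01", "Mixed")],
    [("2012-05-01", "Asian")],
    [("2011-01-01", "Black"), ("2018-01-01", "Black")],
    [("2011-01-01", "White"), ("2019-01-01", "Mixed")],
]
table, pct_matching = concordance(histories)
print(dict(table), pct_matching)
```

Each key of `table` is a (latest, most common) pair, so the diagonal entries are the concordant patients and the off-diagonal entries show where a later recording disagrees with the bulk of the earlier ones.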


In [357]:
for definition in definitions_5:
        df_sum = pd.read_csv(f'output/{output_path_5}/simple_latest_common_{definition}{suffix}_registered.csv').set_index(definition)
        # sort rows by category index
        df_sum.columns = df_sum.columns.str.replace(definition + "_", "")
        df_sum.columns = df_sum.columns.str.lower()
        df_sum = df_sum.reindex(list(code_dict_5[definition].values()))
        
        df_counts = pd.DataFrame(
            np.diagonal(df_sum),
            index=df_sum.index,
        #   columns=[f"matching (n={np.diagonal(df_sum).sum()})"],
        )

        df_sum2 = df_sum.copy(deep=True)
        np.fill_diagonal(df_sum2.values, 0)
        df_diag = pd.DataFrame(
            df_sum2.sum(axis=1),
        )
        df_out = df_counts.merge(df_diag, right_index=True, left_index=True)
        # Share of patients on and off the diagonal, as percentages
        columns = round(df_out.sum() / df_out.sum(axis=1).sum() * 100, 1)
        df_out.columns = [f"matching ({columns.iloc[0]}%)", f"not matching ({columns.iloc[1]}%)"]
        display(df_out)
        
        # Order columns to match the category order in the code dictionary
        lowerlist_5 = [x.lower() for x in code_dict_5[definition].values()]
        df_sum = df_sum[lowerlist_5]

        # Combine count and percentage columns
        df_sum["population"]=df_sum.sum(axis = 1)
        for item in lowerlist_5:
            df_sum[item + "_pct"]= round(
                    (df_sum[item].div(df_sum["population"])) * 100, 1
                )
        
            df_sum[item] = (
                    df_sum[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_sum[item + "_pct"].astype(str)
                    + ")"
                )
        df_sum = df_sum[lowerlist_5]

        display(df_sum)
    # df_expanded = pd.read_csv(f'../output/{output_path}/tables/latest_common_expanded_{definition}.csv').set_index(definition)
    
    # display(df_expanded)


ethnicity_5,matching (6.2%),not matching (93.8%)
White,,30
Mixed,,30
Asian,,20
Black,,40
Other,10.0,30


ethnicity_5,white,mixed,asian,black,other
White,nan (nan),10 (33.3),nan (nan),10 (33.3),10 (33.3)
Mixed,10 (33.3),nan (nan),nan (nan),10 (33.3),10 (33.3)
Asian,nan (nan),10 (50.0),nan (nan),nan (nan),10 (50.0)
Black,10 (25.0),10 (25.0),10 (25.0),nan (nan),10 (25.0)
Other,10 (25.0),10 (25.0),nan (nan),10 (25.0),10 (25.0)


ethnicity_new_5,matching (22.2%),not matching (77.8%)
White,20.0,105
Mixed,,0
Asian,,0
Black,10.0,0
Other,,0


ethnicity_new_5,white,mixed,asian,black,other
White,20 (16.0),25 (20.0),25 (20.0),30 (24.0),25 (20.0)
Mixed,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Asian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Black,nan (nan),nan (nan),nan (nan),10 (100.0),nan (nan)
Other,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


ethnicity_primis_5,matching (7.7%),not matching (92.3%)
White,,25
Mixed,,40
Asian,,25
Black,10.0,20
Other,,10


ethnicity_primis_5,white,mixed,asian,black,other
White,nan (nan),nan (nan),10 (40.0),15 (60.0),nan (nan)
Mixed,10 (25.0),nan (nan),10 (25.0),10 (25.0),10 (25.0)
Asian,nan (nan),nan (nan),nan (nan),10 (40.0),15 (60.0)
Black,10 (33.3),nan (nan),nan (nan),10 (33.3),10 (33.3)
Other,nan (nan),10 (100.0),nan (nan),nan (nan),nan (nan)


#### 16 Group

In [358]:
for definition in definitions_16:
        df_sum = pd.read_csv(f'output/{output_path_16}/simple_latest_common_{definition}{suffix}_registered.csv').set_index(definition)
        # sort rows by category index
        df_sum.columns = df_sum.columns.str.replace(definition + "_", "")
        df_sum.columns = df_sum.columns.str.lower()
        df_sum = df_sum.reindex(list(code_dict_16[definition].values()))
        
        df_counts = pd.DataFrame(
            np.diagonal(df_sum),
            index=df_sum.index,
        #   columns=[f"matching (n={np.diagonal(df_sum).sum()})"],
        )

        df_sum2 = df_sum.copy(deep=True)
        np.fill_diagonal(df_sum2.values, 0)
        df_diag = pd.DataFrame(
            df_sum2.sum(axis=1),
        )
        df_out = df_counts.merge(df_diag, right_index=True, left_index=True)
        # Share of patients on and off the diagonal, as percentages
        columns = round(df_out.sum() / df_out.sum(axis=1).sum() * 100, 1)
        df_out.columns = [f"matching ({columns.iloc[0]}%)", f"not matching ({columns.iloc[1]}%)"]
        display(df_out)
        
        # Order columns to match the category order in the code dictionary
        lowerlist_16 = [x.lower() for x in code_dict_16[definition].values()]
        df_sum = df_sum[lowerlist_16]

        # Combine count and percentage columns
        df_sum["population"]=df_sum.sum(axis = 1)
        for item in lowerlist_16:
            df_sum[item + "_pct"]= round(
                    (df_sum[item].div(df_sum["population"])) * 100, 1
                )
        
            df_sum[item] = (
                    df_sum[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_sum[item + "_pct"].astype(str)
                    + ")"
                )
        df_sum = df_sum[lowerlist_16]

        display(df_sum)
    # df_expanded = pd.read_csv(f'../output/{output_path}/tables/latest_common_expanded_{definition}.csv').set_index(definition)
    
    # display(df_expanded)

ethnicity_16,matching (nan%),not matching (nan%)
White_British,,0
White_Irish,,0
Other_White,,0
White_and_Black_Caribbean,,0
White_and_Black_African,,0
White_and_Asian,,0
Other_Mixed,,0
Indian,,0
Pakistani,,0
Bangladeshi,,0


ethnicity_16,white_british,white_irish,other_white,white_and_black_caribbean,white_and_black_african,white_and_asian,other_mixed,indian,pakistani,bangladeshi,other_asian,caribbean,african,other_black,chinese,any_other_ethnic_group
White_British,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_Irish,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_White,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_Caribbean,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_African,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Asian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_Mixed,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Indian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Pakistani,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Bangladeshi,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


ethnicity_new_16,matching (nan%),not matching (nan%)
White_British,,0
White_Irish,,0
Other_White,,0
White_and_Black_Caribbean,,0
White_and_Black_African,,0
White_and_Asian,,0
Other_Mixed,,0
Indian,,0
Pakistani,,0
Bangladeshi,,0


ethnicity_new_16,white_british,white_irish,other_white,white_and_black_caribbean,white_and_black_african,white_and_asian,other_mixed,indian,pakistani,bangladeshi,other_asian,caribbean,african,other_black,chinese,any_other_ethnic_group
White_British,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_Irish,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_White,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_Caribbean,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_African,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Asian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_Mixed,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Indian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Pakistani,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Bangladeshi,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


ethnicity_primis_16,matching (nan%),not matching (nan%)
White_British,,0
White_Irish,,0
Other_White,,0
White_and_Black_Caribbean,,0
White_and_Black_African,,0
White_and_Asian,,0
Other_Mixed,,0
Indian,,0
Pakistani,,0
Bangladeshi,,0


ethnicity_primis_16,white_british,white_irish,other_white,white_and_black_caribbean,white_and_black_african,white_and_asian,other_mixed,indian,pakistani,bangladeshi,other_asian,caribbean,african,other_black,chinese,any_other_ethnic_group
White_British,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_Irish,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_White,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_Caribbean,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_African,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Asian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_Mixed,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Indian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Pakistani,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Bangladeshi,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


### State Change

#### 5 Group

Patients whose latest recorded ethnicity was categorised as Mixed were most likely to have a discordant ethnicity recording (32%), with 24.5% of the 31,040 patients whose latest recording was Mixed also having a recording of White ethnicity. Surprisingly, 5.5% of those whose latest recorded ethnicity was categorised as Black also had a recorded ethnicity of White.
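
As a rough illustration of how a state-change table of this kind could be derived, the sketch below builds one from a small, invented set of ethnicity recordings. The `records` frame, its column names and its values are hypothetical, and the approach shown is an assumption rather than the pipeline that produced the CSV files read in the next cell.

```python
import pandas as pd

# Hypothetical toy records: one row per ethnicity recording (not real data).
records = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 2, 2, 3, 4, 4],
        "date": pd.to_datetime(
            [
                "2010-01-01", "2015-06-01",
                "2012-03-01", "2016-02-01", "2019-09-01",
                "2011-05-01",
                "2013-07-01", "2018-08-01",
            ]
        ),
        "ethnicity": [
            "White", "Mixed",
            "Black", "White", "Black",
            "Asian",
            "White", "Mixed",
        ],
    }
)

# Latest recorded ethnicity per patient.
latest = (
    records.sort_values("date")
    .groupby("patient_id")["ethnicity"]
    .last()
    .rename("latest")
)

# One boolean per (patient, group): was this group ever recorded?
ever = pd.crosstab(records["patient_id"], records["ethnicity"]).gt(0)

# For each latest group, count the patients who ever had each group recorded.
state_change = ever.groupby(latest).sum()

# Group totals and row percentages, as in the tables in this section.
n = latest.value_counts().rename("n")
pct = state_change.div(n, axis=0).mul(100).round(1)
```

In this toy data both patients whose latest recording is Mixed also have a White recording, so `pct.loc["Mixed", "White"]` is 100.0. The diagonal of `state_change` simply repeats the group total, because every patient has their own latest group recorded at least once.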

In [359]:
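for definition in definitions_5:
    # Load the state-change counts for this definition, indexed by latest ethnicity
    df_state_change = pd.read_csv(
        f"output/{output_path_5}/simple_state_change_{definition}{suffix}_registered.csv"
    ).set_index(definition)
    # Strip the "<definition>_" prefix from the column names (literal match, not a regex)
    df_state_change.columns = df_state_change.columns.str.replace(definition + "_", "", regex=False)
    # Re-sort rows into the order given by the code dictionary
    df_state_change = df_state_change.reindex(list(code_dict_5[definition].values()))
    df_state_change = df_state_change.reset_index()

    # Append the group total to each row label, e.g. "White: 90"
    df_state_change[definition] = (
        df_state_change[definition]
        + ": "
        + df_state_change["n"].apply(lambda x: "{:,.0f}".format(x))
    )
    df_state_change = df_state_change.set_index(definition)
    for item in lowerlist_5:
        # Percentage of the row's patients who also have a recording of `item`
        df_state_change[item + "_pct"] = round(
            (df_state_change[item].div(df_state_change["n"])) * 100, 1
        )
        # Format each cell as "count (percentage)"
        df_state_change[item] = (
            df_state_change[item].apply(lambda x: "{:,.0f}".format(x))
            + " ("
            + df_state_change[item + "_pct"].astype(str)
            + ")"
        )
    # Keep only the formatted group columns, in display order
    df_state_change = df_state_change[lowerlist_5]
    display(df_state_change)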

ethnicity_5,white,mixed,asian,black,other
White: 90,nan (nan),10 (11.1),10 (11.1),10 (11.1),10 (11.1)
Mixed: 100,10 (10.0),10 (10.0),10 (10.0),15 (15.0),10 (10.0)
Asian: 85,10 (11.8),10 (11.8),nan (nan),nan (nan),10 (11.8)
Black: 120,10 (8.3),10 (8.3),15 (12.5),10 (8.3),15 (12.5)
Other: 95,10 (10.5),15 (15.8),10 (10.5),10 (10.5),15 (15.8)


ethnicity_new_5,white,mixed,asian,black,other
White: 285,30 (10.5),25 (8.8),25 (8.8),35 (12.3),30 (10.5)
Mixed: 45,nan (nan),nan (nan),10 (22.2),nan (nan),nan (nan)
Asian: 45,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Black: 60,nan (nan),nan (nan),nan (nan),10 (16.7),nan (nan)
Other: 50,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


ethnicity_primis_5,white,mixed,asian,black,other
White: 90,nan (nan),nan (nan),10 (11.1),15 (16.7),10 (11.1)
Mixed: 95,10 (10.5),10 (10.5),10 (10.5),10 (10.5),10 (10.5)
Asian: 100,nan (nan),nan (nan),10 (10.0),10 (10.0),15 (15.0)
Black: 90,10 (11.1),10 (11.1),nan (nan),10 (11.1),10 (11.1)
Other: 105,10 (9.5),10 (9.5),nan (nan),nan (nan),nan (nan)


#### 16 Group

In [360]:
for definition in definitions_16:
    # Load the state-change counts for this definition, indexed by latest ethnicity
    df_state_change = pd.read_csv(
        f"output/{output_path_16}/simple_state_change_{definition}{suffix}_registered.csv"
    ).set_index(definition)
    # Strip the "<definition>_" prefix (literal match, not a regex), then lower-case the names
    df_state_change.columns = df_state_change.columns.str.replace(definition + "_", "", regex=False)
    df_state_change.columns = df_state_change.columns.str.lower()
    # Re-sort rows into the order given by the code dictionary
    df_state_change = df_state_change.reindex(list(code_dict_16[definition].values()))
    df_state_change = df_state_change.reset_index()

    # Append the group total to each row label, e.g. "Indian: 35"
    df_state_change[definition] = (
        df_state_change[definition]
        + ": "
        + df_state_change["n"].apply(lambda x: "{:,.0f}".format(x))
    )
    df_state_change = df_state_change.set_index(definition)

    for item in lowerlist_16:
        # Percentage of the row's patients who also have a recording of `item`
        df_state_change[item + "_pct"] = round(
            (df_state_change[item].div(df_state_change["n"])) * 100, 1
        )
        # Format each cell as "count (percentage)"
        df_state_change[item] = (
            df_state_change[item].apply(lambda x: "{:,.0f}".format(x))
            + " ("
            + df_state_change[item + "_pct"].astype(str)
            + ")"
        )
    # Keep only the formatted group columns, in display order
    df_state_change = df_state_change[lowerlist_16]
    display(df_state_change)

ethnicity_16,white_british,white_irish,other_white,white_and_black_caribbean,white_and_black_african,white_and_asian,other_mixed,indian,pakistani,bangladeshi,other_asian,caribbean,african,other_black,chinese,any_other_ethnic_group
White_British: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_Irish: 20,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_White: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_Caribbean: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_African: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),10 (28.6),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Asian: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_Mixed: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Indian: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),10 (28.6),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Pakistani: 40,nan (nan),nan (nan),nan (nan),nan (nan),10 (25.0),nan (nan),nan (nan),nan (nan),10 (25.0),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Bangladeshi: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


ethnicity_new_16,white_british,white_irish,other_white,white_and_black_caribbean,white_and_black_african,white_and_asian,other_mixed,indian,pakistani,bangladeshi,other_asian,caribbean,african,other_black,chinese,any_other_ethnic_group
White_British: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_Irish: 45,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),10 (22.2)
Other_White: 40,10 (25.0),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),10 (25.0),nan (nan),nan (nan),nan (nan)
White_and_Black_Caribbean: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_African: 20,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Asian: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_Mixed: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Indian: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Pakistani: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Bangladeshi: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


ethnicity_primis_16,white_british,white_irish,other_white,white_and_black_caribbean,white_and_black_african,white_and_asian,other_mixed,indian,pakistani,bangladeshi,other_asian,caribbean,african,other_black,chinese,any_other_ethnic_group
White_British: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_Irish: 40,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),10 (25.0),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_White: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_Caribbean: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_African: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Asian: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_Mixed: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Indian: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Pakistani: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Bangladeshi: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


### Discussion

This study has shown that primary care ethnicity data made available via OpenSAFELY are complete for around three quarters of all patients. However, recording ethnicity is not straightforward. Indeed, despite often being used as a key variable to describe health, the idea of “ethnicity” has been disputed [16]. Self-identified ethnicity is not a fixed concept, and evolving socio-cultural trends could contribute to changes in a person’s self-identified ethnic group, particularly for those with mixed heritage. It is therefore perhaps not surprising to see lower levels of concordance between latest ethnicity and most common ethnicity in those whose latest ethnicity is coded as ‘mixed’.
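
To make the comparison between latest and most common ethnicity concrete, the following is a minimal sketch on invented data. The `records` frame and the tie-breaking rule (first value returned by `mode()`, i.e. alphabetical order) are assumptions for illustration, not the method used to produce the results in this report.

```python
import pandas as pd

# Hypothetical toy records: one row per ethnicity recording (not real data).
records = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2, 2, 3, 4, 4, 4],
        "date": pd.to_datetime(
            [
                "2010-01-01", "2012-01-01", "2018-01-01",
                "2011-01-01", "2019-01-01",
                "2015-01-01",
                "2009-01-01", "2013-01-01", "2020-01-01",
            ]
        ),
        "ethnicity": [
            "White", "White", "Mixed",
            "Mixed", "Mixed",
            "Asian",
            "White", "White", "White",
        ],
    }
)


def most_common(values: pd.Series) -> str:
    # Assumption: ties are broken by taking the first (alphabetically lowest) mode.
    return values.mode().iloc[0]


# Latest and most common recorded ethnicity for each patient.
per_patient = (
    records.sort_values("date")
    .groupby("patient_id")["ethnicity"]
    .agg(latest="last", most_common=most_common)
)
per_patient["concordant"] = per_patient["latest"] == per_patient["most_common"]

# Percentage of patients in each latest group whose latest and most common agree.
concordance = per_patient.groupby("latest")["concordant"].mean().mul(100).round(1)
```

Here patient 1 has a latest recording of Mixed but a most common recording of White, so concordance in the Mixed group is 50.0% while the Asian and White groups are fully concordant.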

The common practice of supplementing 2020-CTV3 coded ethnicity with either SUS data or the PRIMIS codelists could lead to inconsistent classification, because both SUS data and the PRIMIS codelists follow the 2001 census categories.

We believe that the 2022-SNOMED codelist provides a more consistent representation of ethnicity as defined by the 2001 census categories and should be the preferred codelist for primary care ethnicity. 

### Limitations

It is common for OpenSAFELY studies to supplement the primary care recorded ethnicity, where missing, with ethnicity data from the Secondary Uses Service (SUS). This study has focussed solely on the primary care recorded ethnicity. Due to the way that non-native data, such as GP2GP data and historical data, are imported into TPP, the date on which an ethnicity was recorded is not always available; the chronology of ethnicity records is therefore unreliable.
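
The supplementation step mentioned above can be sketched as a simple fallback: use the primary care value where present, otherwise the SUS value. The series names and values below are hypothetical, and this sketch is not taken from any specific OpenSAFELY study definition.

```python
import pandas as pd

# Hypothetical per-patient ethnicity from each source (None = not recorded).
primary_care = pd.Series(
    ["White", None, "Asian", None], index=[1, 2, 3, 4], name="ethnicity_primary_care"
)
sus = pd.Series(
    ["White", "Black", "Mixed", None], index=[1, 2, 3, 4], name="ethnicity_sus"
)

# Prefer primary care; fall back to SUS only where primary care is missing.
combined = primary_care.combine_first(sus).rename("ethnicity_combined")

# Track which source supplied each value, so supplemented rows can be audited.
source = (
    pd.Series("missing", index=combined.index, name="ethnicity_source")
    .mask(sus.notna(), "sus")
    .mask(primary_care.notna(), "primary_care")
)
```

Keeping a `source` column makes it possible to check how many patients were classified from SUS alone. Patient 3 shows why this matters: the two sources disagree, and the fallback rule silently keeps the primary care value.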

### Conclusion

This report describes existing methods to derive primary care ethnicity in OpenSAFELY-TPP and suggests the adoption of the 2022-SNOMED codelist as the new standard method. It is a living document that can be periodically re-run to evaluate the most current best practices for research. If you have improvements or forks, please contact the OpenSAFELY data team.

In [361]:
from datetime import date
# get data extraction date
extract_date = pd.to_datetime(os.path.getmtime(f"output/{output_path_16}/simple_patient_counts_registered.csv"), unit='s')
# get notebook run date
run_date = date.today()

display(Markdown(f"""
## Technical details

This notebook was run on {run_date.strftime('%Y-%m-%d')}. The information below is based on data extracted from the OpenSAFELY-TPP database on {extract_date.strftime('%Y-%m-%d')}.

If a clinical code appears in the primary care record on multiple dates, the earliest date is used. 


Only patients registered at their practice on January 1 2022 are included.

"""))


## Technical details

This notebook was run on 2022-10-19. The information below is based on data extracted from the OpenSAFELY-TPP database on 2022-10-17.

If a clinical code appears in the primary care record on multiple dates, the earliest date is used. 


Only patients registered at their practice on January 1 2022 are included.
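
The earliest-date rule stated above can be illustrated with a short sketch. The `events` frame and its column names are invented for the example and do not reflect the OpenSAFELY-TPP schema.

```python
import pandas as pd

# Hypothetical clinical events: the same code can appear on several dates.
events = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2],
        "code": ["A", "A", "B", "A"],
        "date": pd.to_datetime(
            ["2018-05-01", "2015-01-01", "2020-03-01", "2019-07-01"]
        ),
    }
)

# Keep one row per (patient, code), using the earliest date on which it appears.
earliest = events.groupby(["patient_id", "code"], as_index=False)["date"].min()
```

For patient 1, code `A` appears in 2018 and 2015, and only the 2015 date is retained.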

