# Identifying Ethnicity in OpenSAFELY-TPP

This short report describes how ethnicity can be identified in the OpenSAFELY-TPP database, and the strengths and weaknesses of the methods. Ethnicity is known to be an important determinant of health outcomes, particularly during the COVID-19 outbreak where a complex interplay of social and biological factors resulted in increased exposure, reduced protection, and increased severity of illness. The recording of patients’ ethnic group in primary care can support efforts to achieve equity in service provision and outcomes. 

The [NHS Data Model and Dictionary](https://www.datadictionary.nhs.uk/data_elements/ethnic_category.html?hl=ethnicity) states that the ethnic groups defined in the [2001 census](https://www.ethnicity-facts-figures.service.gov.uk/style-guide/ethnic-groups#2001-census) are the national mandatory standard for the collection and analysis of ethnicity.

In OpenSAFELY-TPP, there is no categorical “ethnicity” variable to record this information. Rather, ethnicity is recorded using clinical codes, like any other clinical or administrative event, with specific codes relating to specific ethnic groups. 

We define three codelists to capture primary care ethnicity in OpenSAFELY-TPP: "[2020-CTV3](https://www.opencodelists.org/codelist/opensafely/ethnicity/2020-04-27)", "[2022-SNOMED](https://www.opencodelists.org/codelist/opensafely/ethnicity-snomed-0removed/2e641f61/)" and "[2021-PRIMIS](https://www.opencodelists.org/codelist/primis-covid19-vacc-uptake/eth2001/v1/)".


This is a living document that will be updated to reflect changes to the OpenSAFELY-TPP database and the patient records within.

## OpenSAFELY
OpenSAFELY is an analytics platform for conducting analyses on Electronic Health Records inside the secure environment where the records are held. This has multiple benefits: 

* We don't transport large volumes of potentially disclosive pseudonymised patient data outside of the secure environments for analysis
* Analyses can run in near real-time as records are ready for analysis as soon as they appear in the secure environment
* All infrastructure and analysis code is stored in GitHub repositories, which are open for security review, scientific review, and re-use

A key feature of OpenSAFELY is the use of study definitions, which are formal specifications of the datasets to be generated from the OpenSAFELY database. This takes care of much of the complex EHR data wrangling required to create a dataset in an analysis-ready format. It also creates a library of standardised and validated variable definitions that can be deployed consistently across multiple projects. 

The purpose of this report is to describe the main variables that relate to ethnicity, and their relative strengths and weaknesses.

## Available Records
OpenSAFELY-TPP runs inside TPP’s data centre which contains the primary care records for all patients registered at practices using TPP’s SystmOne Clinical Information System. This data centre also imports external datasets from other sources, including A&E attendances and hospital admissions from NHS Digital’s Secondary Use Service, and death registrations from the ONS. More information on available data sources can be found within the [OpenSAFELY documentation](https://docs.opensafely.org/data-sources/intro/). 

# Methods

We define three codelists to capture primary care ethnicity in OpenSAFELY-TPP: "[2020-CTV3](https://www.opencodelists.org/codelist/opensafely/ethnicity/2020-04-27)", "[2022-SNOMED](https://www.opencodelists.org/codelist/opensafely/ethnicity-snomed-0removed/2e641f61/)" and "[2021-PRIMIS](https://www.opencodelists.org/codelist/primis-covid19-vacc-uptake/eth2001/v1/)".



### Completeness of ethnicity data
To evaluate how well each of these codelists is populated, the proportion of patients with ethnicity recorded (that is, with any code from the codelist present in the patient record) was calculated for patients registered as of 1 January 2022.
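
As a rough illustration of this calculation, the sketch below computes the percentage of patients with a value under each codelist, and with a value under all three. The column names follow the variable names used later in this notebook (`ethnicity_5`, `ethnicity_new_5`, `ethnicity_primis_5`), but the patient-level layout and the toy values are assumptions made for the example only.

```python
import pandas as pd

# Hypothetical patient-level extract: one row per registered patient,
# one column per codelist, missing where no ethnicity code was found
patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "ethnicity_5": [1, None, 3, None],
        "ethnicity_new_5": [1, 2, 3, None],
        "ethnicity_primis_5": [1, 2, None, None],
    }
)
codelists = ["ethnicity_5", "ethnicity_new_5", "ethnicity_primis_5"]

# True where the patient has any code from the codelist
filled = patients[codelists].notna()

# Percentage of patients with ethnicity recorded, per codelist
completeness_pct = (filled.mean() * 100).round(1)

# Percentage of patients with ethnicity recorded in all three codelists
all_filled_pct = round(filled.all(axis=1).mean() * 100, 1)
```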

We examine trends across the whole population and by each of the following demographic and clinical subgroups to detect any inequalities.

Demographic covariates:

- Age band
- Sex
- Ethnicity
- Region
- IMD


Clinical covariates:

- Dementia
- Diabetes
- Hypertension
- Learning disability


### Ethnicity by group

The codes were grouped according to two classification schemes, both based on the 2001 Census groups:

5-level group: 
- Asian or Asian British
- Black or Black British 
- Mixed 
- White 
- Chinese or other ethnic group 



16-level group: 
- Asian or Asian British
    - Indian
    - Pakistani
    - Bangladeshi
    - Any other Asian background
- Black or Black British 
    - Caribbean
    - African
    - Any other Black background
- Mixed 
    - White and Black Caribbean
    - White and Black African
    - White and Asian
    - Any other Mixed background
- White 
    - British
    - Irish
    - Any other White background
- Chinese or other ethnic group 
    - Chinese
    - Any other
    

For patients with multiple ethnicity records, the most recent record was chosen (even if it is later than the cohort date). The proportion of patients in each ethnicity group was calculated within each clinical and demographic subgroup.
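
The selection of the most recent record can be sketched as follows. This is a minimal example on made-up event-level data; the column names `patient_id`, `date` and `ethnicity_group` are hypothetical and do not reflect the actual OpenSAFELY-TPP schema.

```python
import pandas as pd

# Hypothetical event-level data: one row per coded ethnicity record
records = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3, 3, 3],
        "date": pd.to_datetime(
            [
                "2015-03-01",
                "2021-06-10",
                "2019-01-20",
                "2010-05-05",
                "2018-07-07",
                "2016-02-02",
            ]
        ),
        "ethnicity_group": ["White", "Mixed", "Asian", "Black", "Black", "Other"],
    }
)

# Sort chronologically, then keep the last (most recent) record per patient
latest = (
    records.sort_values("date")
    .groupby("patient_id")["ethnicity_group"]
    .last()
)

# Proportion of patients falling in each ethnicity group
proportions = latest.value_counts(normalize=True)
```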


### Changes in coded ethnicity groups
To investigate the extent of discrepancies within individual patients’ recorded grouped ethnicity, the proportion of patients with any recorded grouped ethnicity that does not match their ‘latest’ recorded grouped ethnicity was calculated for each of the five ethnic groups.
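
One possible way to compute this measure is sketched below, again on made-up data with hypothetical column names. Each record is compared with the patient's latest group, and a patient counts as discordant if any of their records differs.

```python
import pandas as pd

# Hypothetical event-level data: one row per coded ethnicity record
records = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3, 3, 3, 4],
        "date": pd.to_datetime(
            [
                "2015-03-01",
                "2021-06-10",
                "2019-01-20",
                "2010-05-05",
                "2016-02-02",
                "2018-07-07",
                "2020-09-09",
            ]
        ),
        "ethnicity_group": [
            "White", "Mixed", "Asian", "Black", "Other", "Black", "Black",
        ],
    }
)

# Latest recorded group per patient
latest = (
    records.sort_values("date")
    .groupby("patient_id")["ethnicity_group"]
    .last()
    .rename("latest")
)

# Flag every record that disagrees with the patient's latest group
merged = records.join(latest, on="patient_id")
merged["discordant"] = merged["ethnicity_group"] != merged["latest"]

# A patient is discordant if any one of their records disagrees
any_discordant = merged.groupby("patient_id")["discordant"].any()

# Percentage of discordant patients within each latest ethnicity group
pct_by_latest = any_discordant.groupby(latest).mean() * 100
```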

### Comparison of ‘Latest’ and ‘Most Frequent’ coded ethnicity

The proportion of patients with a recorded latest ethnicity whose most frequently recorded ethnicity does not match their latest recorded ethnicity was calculated for each of the five ethnic groups.
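
A minimal sketch of this comparison is given below. The data and column names are hypothetical, and ties for the most frequent group are broken by taking the first mode in sorted order, which is an assumption of this example rather than a documented rule of the study.

```python
import pandas as pd

# Hypothetical event-level data: one row per coded ethnicity record
records = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2, 2, 3, 3, 3, 4],
        "date": pd.to_datetime(
            [
                "2012-01-01", "2014-01-01", "2021-01-01",
                "2016-01-01", "2019-01-01",
                "2010-01-01", "2013-01-01", "2018-01-01",
                "2020-01-01",
            ]
        ),
        "ethnicity_group": [
            "White", "White", "Mixed",
            "Asian", "Asian",
            "Black", "Other", "Black",
            "Mixed",
        ],
    }
)


def most_frequent(values):
    # mode() returns all modes in sorted order; take the first (assumed tie-break)
    return values.mode().iloc[0]


by_patient = records.sort_values("date").groupby("patient_id")["ethnicity_group"]
summary = pd.DataFrame(
    {"latest": by_patient.last(), "most_frequent": by_patient.agg(most_frequent)}
)

# Rows: latest group; columns: most frequent group
crosstab = pd.crosstab(summary["latest"], summary["most_frequent"])

# Percentage of patients whose most frequent group differs from their latest group
mismatch = summary["latest"] != summary["most_frequent"]
mismatch_pct = mismatch.groupby(summary["latest"]).mean() * 100
```

The diagonal of `crosstab` holds the matching patients, which mirrors how the matching and non-matching counts are separated in the results cells of this notebook.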

### Comparison with the 2011 UK census population
The UK Census collects individual and household-level demographic data every 10 years for the whole UK population. Data on ethnicity were obtained from the 2011 UK Census for England. The most recent census across the UK was undertaken on 27 March 2011.

Any counts below 6 were redacted, and all other values were rounded to the nearest 5.
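
This disclosure-control step can be sketched as below. The helper name `redact_and_round` and its parameters are illustrative and not taken from the study code; the sketch assumes counts are held in a pandas Series and represents redacted values as missing.

```python
import numpy as np
import pandas as pd


def redact_and_round(counts, threshold=6, base=5):
    """Mask counts below `threshold`, then round the rest to the nearest `base`."""
    counts = pd.Series(counts, dtype="float64")
    # Counts below the threshold are suppressed (set to NaN)
    redacted = counts.where(counts >= threshold)
    # Remaining counts are rounded to the nearest multiple of `base`
    return (redacted / base).round() * base


example = redact_and_round([3, 6, 12, 483])
```

Here the count of 3 is suppressed, while 6, 12 and 483 become 5, 10 and 485 respectively.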


In [37]:
import sys

In [38]:
import os
import pandas as pd
import numpy as np
from itertools import product
from IPython.display import display, Markdown, Image
from datetime import date, timedelta

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
pd.options.mode.chained_assignment = None 
pd.options.display.float_format = '{:,.0f}'.format


In [39]:

def local_patient_counts(
    definitions, output_path, code_dict="", categories=False, missing=False,
):
    # Column suffix and overlap column depend on whether we report
    # recorded ("filled") or unrecorded ("missing") ethnicity
    suffix = "_filled"
    overlap = "all_filled"
    if missing:
        suffix = "_missing"
        overlap = "all_missing"
    if categories:
        df_population = pd.read_csv(
            f"../output/{output_path}/simple_patient_counts_registered.csv"
        ).set_index(["group", "subgroup"])
        

        df_append = pd.read_csv(
            f"../output/{output_path}/simple_patient_counts_categories_registered.csv"
        ).set_index(["group", "subgroup"])
        
        if output_path == output_path_5:
            global df_append_cat_5
            df_append_cat_5 = df_append

        if output_path == output_path_16:
            global df_append_cat_16
            df_append_cat_16 = df_append

        df_append.drop("population", inplace=True, axis=1)
        df_append["population"] = df_population[definitions[0]+"_filled"]
        # ensure definitions[n] in code_dict[definitions[n]] below refers to one of the definitions of interest
        definitions = [
            f"{category}_{definition}"
            for category, definition in product(
                code_dict[definitions[1]].values(), definitions
            )
        ]
    else:
        df_append = pd.read_csv(
            f"../output/{output_path}/simple_patient_counts_registered.csv"
        ).set_index(["group", "subgroup"])
        global total
        total =  df_append
    for definition in definitions:
        if missing:
            df_append[definition + suffix] = (
                df_append["population"] - df_append[definition + "_filled"]
            )    
        df_append[definition + "_pct"] = round(
            (df_append[definition + suffix].div(df_append["population"])) * 100, 1
        )
        df_append[overlap + "_pct"] = round(
            (df_append[overlap].div(df_append["population"])) * 100, 1
        )

        # Combine count and percentage columns
        df_append[definition] = (
            df_append[definition + suffix].apply(lambda x: "{:,.0f}".format(x))
            + " ("
            + df_append[definition + "_pct"].astype(str)
            + ")"
        )
        df_append = df_append.drop(columns=[definition + suffix, definition + "_pct"])
    df_append[overlap] = (
        df_append[overlap].apply(lambda x: "{:,.0f}".format(x))
        + " ("
        + df_append[overlap + "_pct"].astype(str)
        + ")"
    )
    df_append = df_append.drop(columns=[overlap + "_pct"])
    df_patient_counts = df_append[definitions + [overlap] + ["population"]]
    # Final redaction step
    df_patient_counts = df_patient_counts.replace(np.nan, "-")
    df_patient_counts = df_patient_counts.replace("nan (nan)", "- (-)")
    for k, v in definition_dict.items():
        df_patient_counts.columns = df_patient_counts.columns.str.replace(k,v) 
    df_patient_counts.columns = df_patient_counts.columns.str.replace("_", " ")
    display(df_patient_counts)
    
    if categories:
        df_patient_counts.to_csv(
                f"../output/{output_path}/local_patient_counts_categories_registered.csv"
            )
    
    

In [40]:
### CONFIGURE ###
definitions_5 = ['ethnicity_5', 'ethnicity_new_5', 'ethnicity_primis_5']
definitions_16 = ['ethnicity_16', 'ethnicity_new_16', 'ethnicity_primis_16']
covariates = ['_age_band','_sex','_region','_imd','_dementia','_diabetes','_hypertension','_learning_disability']
output_path_5 = 'simplified_output/5_group/tables'
output_path_16 = 'simplified_output/16_group/tables'
suffixes = ['','_missing']
suffix = ''
code_dict_5 = {
    "imd": {
        0: "Unknown",
        1: "1 Most deprived",
        2: "2",
        3: "3",
        4: "4",
        5: "5 Least deprived",
    },
    "ethnicity_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
    "ethnicity_new_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
    "ethnicity_primis_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
}

# Code dictionary
code_dict_16 = {
    "imd": {
        0: "Unknown",
        1: "1 Most deprived",
        2: "2",
        3: "3",
        4: "4",
        5: "5 Least deprived",
    },
    "ethnicity_16": {
        1: "White_British",
        2: "White_Irish",
        3: "Other_White",
        4: "White_and_Black_Caribbean",
        5: "White_and_Black_African",
        6: "White_and_Asian",
        7: "Other_Mixed",
        8: "Indian",
        9: "Pakistani",
        10: "Bangladeshi",
        11: "Other_Asian",
        12: "Caribbean",
        13: "African",
        14: "Other_Black",
        15: "Chinese",
        16: "Any_other_ethnic_group",
    },
    "ethnicity_new_16": {
        1: "White_British",
        2: "White_Irish",
        3: "Other_White",
        4: "White_and_Black_Caribbean",
        5: "White_and_Black_African",
        6: "White_and_Asian",
        7: "Other_Mixed",
        8: "Indian",
        9: "Pakistani",
        10: "Bangladeshi",
        11: "Other_Asian",
        12: "Caribbean",
        13: "African",
        14: "Other_Black",
        15: "Chinese",
        16: "Any_other_ethnic_group",
    },
    "ethnicity_primis_16": {
        1: "White_British",
        2: "White_Irish",
        3: "Other_White",
        4: "White_and_Black_Caribbean",
        5: "White_and_Black_African",
        6: "White_and_Asian",
        7: "Other_Mixed",
        8: "Indian",
        9: "Pakistani",
        10: "Bangladeshi",
        11: "Other_Asian",
        12: "Caribbean",
        13: "African",
        14: "Other_Black",
        15: "Chinese",
        16: "Any_other_ethnic_group",
    },
}

definition_dict = {
        "ethnicity_new_5": "5 SNOMED:2022",
        "ethnicity_primis_5": "5 PRIMIS:2021",
        "ethnicity_5": "5 CTV3:2020",
        "ethnicity_new_16": "16 SNOMED:2022",
        "ethnicity_primis_16": "16 PRIMIS:2021",
        "ethnicity_16": "16 CTV3:2020",
}


In [41]:
# get data extraction date
extract_date = pd.to_datetime(os.path.getmtime(f"../output/{output_path_5}/simple_patient_counts_registered.csv"), unit='s')
# get notebook run date
run_date = date.today()

## Results

### Completeness of ethnicity data

In [42]:
local_patient_counts(
         definitions_5,  output_path_5
    )


| group | subgroup | 5 CTV3:2020 | 5 SNOMED:2022 | 5 PRIMIS:2021 | all filled | population |
|---|---|---|---|---|---|---|
| all | with records | 480 (73.8) | 495 (76.2) | 485 (74.6) | 280 (43.1) | 650 |
| age_band | 0-19 | 50 (71.4) | 60 (85.7) | 55 (78.6) | 35 (50.0) | 70 |
| age_band | 20-29 | 70 (77.8) | 65 (72.2) | 70 (77.8) | 35 (38.9) | 90 |
| age_band | 30-39 | 55 (68.8) | 65 (81.2) | 60 (75.0) | 35 (43.8) | 80 |
| age_band | 40-49 | 50 (66.7) | 55 (73.3) | 55 (73.3) | 30 (40.0) | 75 |
| age_band | 50-59 | 50 (71.4) | 55 (78.6) | 55 (78.6) | 25 (35.7) | 70 |
| age_band | 60-69 | 70 (73.7) | 70 (73.7) | 60 (63.2) | 40 (42.1) | 95 |
| age_band | 70-79 | 65 (81.2) | 60 (75.0) | 65 (81.2) | 40 (50.0) | 80 |
| age_band | 80+ | 65 (68.4) | 70 (73.7) | 65 (68.4) | 35 (36.8) | 95 |
| sex | F | 250 (74.6) | 255 (76.1) | 240 (71.6) | 145 (43.3) | 335 |


In [43]:
display(Markdown(f"""
Around {float('%.3g' % total["all_filled"].iloc[0])/1000000} million patients registered in OpenSAFELY-TPP have an ethnicity record in all three codelists. `2022-SNOMED` is the most well-populated, with {float('%.3g' % total["ethnicity_new_5_filled"].iloc[0])/1000000} million patients having at least one `2022-SNOMED` recording of ethnicity. 
"""))


Around 0.00028 million patients registered in OpenSAFELY-TPP have an ethnicity record in all three codelists. `2022-SNOMED` is the most well-populated, with 0.000495 million patients having at least one `2022-SNOMED` recording of ethnicity. 


### Ethnicity by group

#### 5 Group

In [44]:
local_patient_counts(
         definitions_5,  output_path_5,code_dict_5, categories=True,missing=False
    )

group,subgroup,White 5 CTV3:2020,White 5 SNOMED:2022,White 5 PRIMIS:2021,Mixed 5 CTV3:2020,Mixed 5 SNOMED:2022,Mixed 5 PRIMIS:2021,Asian 5 CTV3:2020,Asian 5 SNOMED:2022,Asian 5 PRIMIS:2021,Black 5 CTV3:2020,Black 5 SNOMED:2022,Black 5 PRIMIS:2021,Other 5 CTV3:2020,Other 5 SNOMED:2022,Other 5 PRIMIS:2021,all filled,population
all,with records,100 (20.8),300 (62.5),90 (18.8),95 (19.8),45 (9.4),80 (16.7),95 (19.8),55 (11.5),110 (22.9),90 (18.8),50 (10.4),110 (22.9),100 (20.8),45 (9.4),95 (19.8),280 (58.3),480
age_band,0-19,10 (20.0),35 (70.0),15 (30.0),10 (20.0),- (-),10 (20.0),10 (20.0),- (-),15 (30.0),10 (20.0),- (-),- (-),10 (20.0),- (-),10 (20.0),35 (70.0),50
age_band,20-29,10 (14.3),45 (64.3),10 (14.3),20 (28.6),- (-),10 (14.3),20 (28.6),- (-),20 (28.6),10 (14.3),- (-),10 (14.3),15 (21.4),10 (14.3),20 (28.6),35 (50.0),70
age_band,30-39,- (-),45 (81.8),15 (27.3),10 (18.2),- (-),10 (18.2),10 (18.2),- (-),10 (18.2),15 (27.3),- (-),15 (27.3),15 (27.3),- (-),10 (18.2),35 (63.6),55
age_band,40-49,15 (30.0),35 (70.0),15 (30.0),- (-),- (-),10 (20.0),- (-),- (-),10 (20.0),15 (30.0),- (-),10 (20.0),- (-),- (-),15 (30.0),30 (60.0),50
age_band,50-59,15 (30.0),35 (70.0),- (-),10 (20.0),- (-),- (-),10 (20.0),- (-),15 (30.0),- (-),- (-),15 (30.0),10 (20.0),- (-),- (-),25 (50.0),50
age_band,60-69,20 (28.6),35 (50.0),10 (14.3),10 (14.3),10 (14.3),10 (14.3),20 (28.6),10 (14.3),15 (21.4),10 (14.3),10 (14.3),20 (28.6),15 (21.4),- (-),10 (14.3),40 (57.1),70
age_band,70-79,15 (23.1),35 (53.8),15 (23.1),15 (23.1),- (-),10 (15.4),10 (15.4),- (-),15 (23.1),10 (15.4),10 (15.4),15 (23.1),15 (23.1),- (-),10 (15.4),40 (61.5),65
age_band,80+,15 (23.1),40 (61.5),10 (15.4),15 (23.1),10 (15.4),15 (23.1),10 (15.4),- (-),15 (23.1),15 (23.1),10 (15.4),15 (23.1),10 (15.4),- (-),15 (23.1),35 (53.8),65
sex,F,55 (22.0),150 (60.0),50 (20.0),40 (16.0),25 (10.0),45 (18.0),55 (22.0),25 (10.0),50 (20.0),50 (20.0),30 (12.0),50 (20.0),50 (20.0),20 (8.0),45 (18.0),145 (58.0),250


In [45]:
display(Markdown(f"""
The `2022-SNOMED` codelist is most well-populated for `White` ({'{:,.0f}'.format(float('{:,.3g}'.format(df_append_cat_5["White_ethnicity_new_5_filled"].iloc[0])))}), `Mixed` ({'{:,.0f}'.format(float('{:,.3g}'.format(df_append_cat_5["Mixed_ethnicity_new_5_filled"].iloc[0])))}), `Asian` ({'{:,.0f}'.format(float('{:,.3g}'.format(df_append_cat_5["Asian_ethnicity_new_5_filled"].iloc[0])))}) and `Black` ({'{:,.0f}'.format(float('{:,.3g}'.format(df_append_cat_5["Black_ethnicity_new_5_filled"].iloc[0])))}) ethnicities. The `2020-CTV3` codelist classifies more people as `Other` than the `2022-SNOMED` codelist ({'{:,.0f}'.format(float('{:,.3g}'.format(df_append_cat_5["Other_ethnicity_5_filled"].iloc[0])))} and {'{:,.0f}'.format(float('{:,.3g}'.format(df_append_cat_5["Other_ethnicity_new_5_filled"].iloc[0])))} respectively). However, the `2020-CTV3` codelist includes some codes relating to religion rather than ethnicity (e.g. “XaJSe: Muslim - ethnic category 2001 census”), which were excluded from the `2022-SNOMED` codelist.
"""))



The `2022-SNOMED` codelist is most well-populated for `White` (300), `Mixed` (45), `Asian` (55) and `Black` (50) ethnicities. The `2020-CTV3` codelist classifies more people as `Other` than the `2022-SNOMED` codelist (100 and 45 respectively). However, the `2020-CTV3` codelist includes some codes relating to religion rather than ethnicity (e.g. “XaJSe: Muslim - ethnic category 2001 census”), which were excluded from the `2022-SNOMED` codelist.


#### 16 group

In [46]:
local_patient_counts(
         definitions_16,  output_path_16,code_dict_16, categories=True,missing=False
    )

group,subgroup,White British 16 CTV3:2020,White British 16 SNOMED:2022,White British 16 PRIMIS:2021,White Irish 16 CTV3:2020,White Irish 16 SNOMED:2022,White Irish 16 PRIMIS:2021,Other White 16 CTV3:2020,Other White 16 SNOMED:2022,Other White 16 PRIMIS:2021,White and Black Caribbean 16 CTV3:2020,White and Black Caribbean 16 SNOMED:2022,White and Black Caribbean 16 PRIMIS:2021,White and Black African 16 CTV3:2020,White and Black African 16 SNOMED:2022,White and Black African 16 PRIMIS:2021,White and Asian 16 CTV3:2020,White and Asian 16 SNOMED:2022,White and Asian 16 PRIMIS:2021,Other Mixed 16 CTV3:2020,Other Mixed 16 SNOMED:2022,Other Mixed 16 PRIMIS:2021,Indian 16 CTV3:2020,Indian 16 SNOMED:2022,Indian 16 PRIMIS:2021,Pakistani 16 CTV3:2020,Pakistani 16 SNOMED:2022,Pakistani 16 PRIMIS:2021,Bangladeshi 16 CTV3:2020,Bangladeshi 16 SNOMED:2022,Bangladeshi 16 PRIMIS:2021,Other Asian 16 CTV3:2020,Other Asian 16 SNOMED:2022,Other Asian 16 PRIMIS:2021,Caribbean 16 CTV3:2020,Caribbean 16 SNOMED:2022,Caribbean 16 PRIMIS:2021,African 16 CTV3:2020,African 16 SNOMED:2022,African 16 PRIMIS:2021,Other Black 16 CTV3:2020,Other Black 16 SNOMED:2022,Other Black 16 PRIMIS:2021,Chinese 16 CTV3:2020,Chinese 16 SNOMED:2022,Chinese 16 PRIMIS:2021,Any other ethnic group 16 CTV3:2020,Any other ethnic group 16 SNOMED:2022,Any other ethnic group 16 PRIMIS:2021,all filled,population
all,with records,25 (5.0),30 (6.0),35 (7.0),35 (7.0),35 (7.0),30 (6.0),25 (5.0),25 (5.0),30 (6.0),30 (6.0),25 (5.0),25 (5.0),30 (6.0),35 (7.0),30 (6.0),25 (5.0),30 (6.0),25 (5.0),40 (8.0),20 (4.0),30 (6.0),35 (7.0),25 (5.0),30 (6.0),35 (7.0),40 (8.0),35 (7.0),25 (5.0),25 (5.0),35 (7.0),30 (6.0),30 (6.0),30 (6.0),30 (6.0),35 (7.0),35 (7.0),35 (7.0),35 (7.0),40 (8.0),30 (6.0),30 (6.0),25 (5.0),30 (6.0),35 (7.0),35 (7.0),30 (6.0),35 (7.0),25 (5.0),285 (57.0),500
age_band,0-19,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),35 (58.3),60
age_band,20-29,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (14.3),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (14.3),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),35 (50.0),70
age_band,30-39,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (16.7),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),30 (50.0),60
age_band,40-49,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),30 (54.5),55
age_band,50-59,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (16.7),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),35 (58.3),60
age_band,60-69,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (14.3),- (-),- (-),10 (14.3),- (-),- (-),- (-),- (-),10 (14.3),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),45 (64.3),70
age_band,70-79,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),30 (50.0),60
age_band,80+,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),15 (21.4),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (14.3),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),10 (14.3),- (-),- (-),- (-),- (-),10 (14.3),- (-),- (-),- (-),- (-),45 (64.3),70
sex,F,10 (3.8),15 (5.8),20 (7.7),20 (7.7),20 (7.7),20 (7.7),15 (5.8),10 (3.8),15 (5.8),20 (7.7),10 (3.8),10 (3.8),20 (7.7),25 (9.6),15 (5.8),15 (5.8),15 (5.8),15 (5.8),20 (7.7),10 (3.8),15 (5.8),15 (5.8),10 (3.8),20 (7.7),20 (7.7),15 (5.8),15 (5.8),10 (3.8),10 (3.8),10 (3.8),15 (5.8),15 (5.8),20 (7.7),15 (5.8),20 (7.7),15 (5.8),20 (7.7),15 (5.8),15 (5.8),15 (5.8),15 (5.8),15 (5.8),15 (5.8),20 (7.7),15 (5.8),15 (5.8),20 (7.7),20 (7.7),150 (57.7),260


In [47]:
display(Markdown(f"""
In the `16 group` ethnicity, the `Other` ethnic group is split into `Chinese` and `Any other ethnic group`. For `Chinese`, the `2022-SNOMED` codelist is most well-populated ({'{:,.0f}'.format(float('{:,.3g}'.format(df_append_cat_16["Chinese_ethnicity_new_16_filled"].iloc[0])))}), and for `Any other ethnic group` the `2020-CTV3` codelist is most well-populated ({'{:,.0f}'.format(float('{:,.3g}'.format(df_append_cat_16["Any_other_ethnic_group_ethnicity_16_filled"].iloc[0])))}).
"""))


In the `16 group` ethnicity, the `Other` ethnic group is split into `Chinese` and `Any other ethnic group`. For `Chinese`, the `2022-SNOMED` codelist is most well-populated (35), and for `Any other ethnic group` the `2020-CTV3` codelist is most well-populated (30).


### Comparison of ‘Latest’ and ‘Most Frequent’ coded ethnicity

#### 5 Group

In [48]:
for definition in definitions_5:
        df_sum = pd.read_csv(f'../output/{output_path_5}/simple_latest_common_{definition}{suffix}_registered.csv').set_index(definition)
        # sort rows by category index
        df_sum.columns = df_sum.columns.str.replace(definition + "_", "")
        df_sum.columns = df_sum.columns.str.lower()
        df_sum = df_sum.reindex(list(code_dict_5[definition].values()))
        
        df_counts = pd.DataFrame(
            np.diagonal(df_sum),
            index=df_sum.index,
        #   columns=[f"matching (n={np.diagonal(df_sum).sum()})"],
        )

        df_sum2 = df_sum.copy(deep=True)
        np.fill_diagonal(df_sum2.values, 0)
        df_diag = pd.DataFrame(
            df_sum2.sum(axis=1),
        )
        df_out = df_counts.merge(df_diag, right_index=True, left_index=True)
        columns=round(df_out.sum()/df_out.sum(axis=1).sum()*100,1)
        globals()[f'df_col_{definition}'] = columns
        df_out.columns=[f"matching ({columns[0]}%)",f"not matching ({columns[1]}%)"]
        df_out = df_out.reset_index()
        df_out = df_out.rename(definition_dict, axis='columns')
        df_out = df_out.set_index(definition_dict[definition])
        display(df_out)
        
        # Order columns to match the category order in code_dict_5
        lowerlist_5 = [x.lower() for x in code_dict_5[definition].values()]
        df_sum = df_sum[lowerlist_5]

        # Combine count and percentage columns
        df_sum["population"]=df_sum.sum(axis = 1)
        globals()[f'df_sum_pct_{definition}'] = df_sum
        for item in lowerlist_5:
            df_sum[item + "_pct"]= round(
                    (df_sum[item].div(df_sum["population"])) * 100, 1
                )
            df_sum[item] = (
                    df_sum[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_sum[item + "_pct"].astype(str)
                    + ")"
                )
        df_sum = df_sum[lowerlist_5]
        df_sum = df_sum.reset_index()
        df_sum = df_sum.rename(definition_dict, axis='columns')
        df_sum = df_sum.set_index(definition_dict[definition])
        display(df_sum)
    # df_expanded = pd.read_csv(f'../output/{output_path}/tables/latest_common_expanded_{definition}.csv').set_index(definition)
    
    # display(df_expanded)


| 5 CTV3:2020 | matching (22.2%) | not matching (77.8%) |
|---|---|---|
| White | 10.0 | 30 |
| Mixed |  | 20 |
| Asian |  | 30 |
| Black | 10.0 | 10 |
| Other | 10.0 | 15 |


| 5 CTV3:2020 | white | mixed | asian | black | other |
|---|---|---|---|---|---|
| White | 10 (25.0) | nan (nan) | 10 (25.0) | 10 (25.0) | 10 (25.0) |
| Mixed | 10 (50.0) | nan (nan) | 10 (50.0) | nan (nan) | nan (nan) |
| Asian | 10 (33.3) | 10 (33.3) | nan (nan) | 10 (33.3) | nan (nan) |
| Black | nan (nan) | nan (nan) | 10 (50.0) | 10 (50.0) | nan (nan) |
| Other | nan (nan) | nan (nan) | 15 (60.0) | nan (nan) | 10 (40.0) |


| 5 SNOMED:2022 | matching (16.7%) | not matching (83.3%) |
|---|---|---|
| White | 25.0 | 105 |
| Mixed |  | 0 |
| Asian |  | 10 |
| Black |  | 10 |
| Other |  | 0 |


| 5 SNOMED:2022 | white | mixed | asian | black | other |
|---|---|---|---|---|---|
| White | 25 (19.2) | 25 (19.2) | 25 (19.2) | 30 (23.1) | 25 (19.2) |
| Mixed | nan (nan) | nan (nan) | nan (nan) | nan (nan) | nan (nan) |
| Asian | nan (nan) | 10 (100.0) | nan (nan) | nan (nan) | nan (nan) |
| Black | nan (nan) | 10 (100.0) | nan (nan) | nan (nan) | nan (nan) |
| Other | nan (nan) | nan (nan) | nan (nan) | nan (nan) | nan (nan) |


| 5 PRIMIS:2021 | matching (9.1%) | not matching (90.9%) |
|---|---|---|
| White |  | 15 |
| Mixed |  | 0 |
| Asian |  | 45 |
| Black | 10.0 | 30 |
| Other |  | 10 |


| 5 PRIMIS:2021 | white | mixed | asian | black | other |
|---|---|---|---|---|---|
| White | nan (nan) | nan (nan) | nan (nan) | nan (nan) | 15 (100.0) |
| Mixed | nan (nan) | nan (nan) | nan (nan) | nan (nan) | nan (nan) |
| Asian | 10 (22.2) | 10 (22.2) | nan (nan) | 10 (22.2) | 15 (33.3) |
| Black | nan (nan) | 10 (25.0) | 10 (25.0) | 10 (25.0) | 10 (25.0) |
| Other | nan (nan) | nan (nan) | nan (nan) | 10 (100.0) | nan (nan) |


In [49]:
df_sum_pct_ethnicity_new_5

| ethnicity_new_5 | white | mixed | asian | black | other | population | white_pct | mixed_pct | asian_pct | black_pct | other_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| White | 25 (19.2) | 25 (19.2) | 25 (19.2) | 30 (23.1) | 25 (19.2) | 130 | 19.0 | 19.0 | 19.0 | 23.0 | 19.0 |
| Mixed | nan (nan) | nan (nan) | nan (nan) | nan (nan) | nan (nan) | 0 |  |  |  |  |  |
| Asian | nan (nan) | 10 (100.0) | nan (nan) | nan (nan) | nan (nan) | 10 |  | 100.0 |  |  |  |
| Black | nan (nan) | 10 (100.0) | nan (nan) | nan (nan) | nan (nan) | 10 |  | 100.0 |  |  |  |
| Other | nan (nan) | nan (nan) | nan (nan) | nan (nan) | nan (nan) | 0 |  |  |  |  |  |


In [50]:
display(Markdown(f"""
Overall {'{:,.0f}'.format(df_col_ethnicity_5.iloc[0])}% of the latest 5 group ethnicity matched the most frequent 5 group ethnicity for all codelists. {'{:,.0f}'.format(df_sum_pct_ethnicity_new_5["white_pct"].iloc[0])}% of those with the most recent ethnicity classified as ‘White’ also had the most frequent ethnicity ‘White’ for all three codelists. ‘Mixed’ was the least concordant for all three codelists, with {'{:,.0f}'.format(df_sum_pct_ethnicity_new_5["mixed_pct"].iloc[1])}% (SNOMED:2022 and CTV3:2020) and {'{:,.0f}'.format(df_sum_pct_ethnicity_primis_5["mixed_pct"].iloc[1])}% (PRIMIS:2021) of those with the most recent ethnicity ‘Mixed’ also having the most frequent ethnicity ‘Mixed’. Of those with latest ethnicity ‘Black’, {'{:,.0f}'.format(df_sum_pct_ethnicity_new_5["white_pct"].iloc[3])}% also had the most frequent ethnicity ‘White’.
"""))


Overall 22% of the latest 5 group ethnicity matched the most frequent 5 group ethnicity for all codelists. 19% of those with the most recent ethnicity classified as ‘White’ also had the most frequent ethnicity ‘White’ for all three codelists. ‘Mixed’ was the least concordant for all three codelists, with nan% (SNOMED:2022 and CTV3:2020) and nan% (PRIMIS:2021) of those with the most recent ethnicity ‘Mixed’ also having the most frequent ethnicity ‘Mixed’. Of those with latest ethnicity ‘Black’, nan% also had the most frequent ethnicity ‘White’.


#### 16 Group

In [51]:
for definition in definitions_16:
        df_sum = pd.read_csv(f'../output/{output_path_16}/simple_latest_common_{definition}{suffix}_registered.csv').set_index(definition)    
        # sort rows by category index
        df_sum.columns = df_sum.columns.str.replace(definition + "_", "")
        df_sum.columns = df_sum.columns.str.lower()
        df_sum = df_sum.reindex(list(code_dict_16[definition].values()))
        
        df_counts = pd.DataFrame(
            np.diagonal(df_sum),
            index=df_sum.index,
        #   columns=[f"matching (n={np.diagonal(df_sum).sum()})"],
        )

        df_sum2 = df_sum.copy(deep=True)
        np.fill_diagonal(df_sum2.values, 0)
        df_diag = pd.DataFrame(
            df_sum2.sum(axis=1),
        )
        df_out = df_counts.merge(df_diag, right_index=True, left_index=True)
        columns=round(df_out.sum()/df_out.sum(axis=1).sum()*100,1)
        globals()[f'df_col_{definition}'] = columns
        df_out.columns=[f"matching ({columns[0]}%)",f"not matching ({columns[1]}%)"]
        df_out = df_out.reset_index()
        df_out = df_out.rename(definition_dict, axis='columns')
        df_out = df_out.set_index(definition_dict[definition])
        display(df_out)
        
        if code_dict_16 != "":
            lowerlist_16 = [x.lower() for x in (list(code_dict_16[definition].values()))]
            df_sum = df_sum[lowerlist_16]
        else:
            df_sum = df_sum.reindex(sorted(df_sum.columns), axis=1)

        # Combine count and percentage columns
        df_sum["population"]=df_sum.sum(axis = 1)
        globals()[f'df_sum_pct_{definition}'] = df_sum
        for item in lowerlist_16:
            df_sum[item + "_pct"]= round(
                    (df_sum[item].div(df_sum["population"])) * 100, 1
                )
        
            df_sum[item] = (
                    df_sum[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_sum[item + "_pct"].astype(str)
                    + ")"
                )
        df_sum = df_sum[lowerlist_16]
        df_sum = df_sum.reset_index()
        df_sum = df_sum.rename(definition_dict, axis='columns')
        df_sum = df_sum.set_index(definition_dict[definition])
        df_sum.columns = df_sum.columns.str.replace("_", " ")
        
        
        display(df_sum)
    # df_expanded = pd.read_csv(f'../output/{output_path}/tables/latest_common_expanded_{definition}.csv').set_index(definition)
    
    # display(df_expanded)

16 CTV3:2020,matching (nan%),not matching (nan%)
White_British,,0
White_Irish,,0
Other_White,,0
White_and_Black_Caribbean,,0
White_and_Black_African,,0
White_and_Asian,,0
Other_Mixed,,0
Indian,,0
Pakistani,,0
Bangladeshi,,0


16 CTV3:2020,white british,white irish,other white,white and black caribbean,white and black african,white and asian,other mixed,indian,pakistani,bangladeshi,other asian,caribbean,african,other black,chinese,any other ethnic group
White_British,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_Irish,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_White,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_Caribbean,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_African,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Asian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_Mixed,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Indian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Pakistani,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Bangladeshi,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


16 SNOMED:2022,matching (nan%),not matching (nan%)
White_British,,0
White_Irish,,0
Other_White,,0
White_and_Black_Caribbean,,0
White_and_Black_African,,0
White_and_Asian,,0
Other_Mixed,,0
Indian,,0
Pakistani,,0
Bangladeshi,,0


16 SNOMED:2022,white british,white irish,other white,white and black caribbean,white and black african,white and asian,other mixed,indian,pakistani,bangladeshi,other asian,caribbean,african,other black,chinese,any other ethnic group
White_British,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_Irish,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_White,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_Caribbean,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_African,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Asian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_Mixed,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Indian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Pakistani,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Bangladeshi,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


16 PRIMIS:2021,matching (nan%),not matching (nan%)
White_British,,0
White_Irish,,0
Other_White,,0
White_and_Black_Caribbean,,0
White_and_Black_African,,0
White_and_Asian,,0
Other_Mixed,,0
Indian,,0
Pakistani,,0
Bangladeshi,,0


16 PRIMIS:2021,white british,white irish,other white,white and black caribbean,white and black african,white and asian,other mixed,indian,pakistani,bangladeshi,other asian,caribbean,african,other black,chinese,any other ethnic group
White_British,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_Irish,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_White,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_Caribbean,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Black_African,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
White_and_Asian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Other_Mixed,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Indian,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Pakistani,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)
Bangladeshi,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan)


In [52]:
display(Markdown(f"""
Expanding to the 16 group, the percentage of latest ethnicity that matched the most frequent ethnicity falls to {'{:,.0f}'.format(df_col_ethnicity_new_16[0])}% for both SNOMED:2022 and CTV3:2020 and {'{:,.0f}'.format(df_col_ethnicity_primis_16[0])}% for PRIMIS:2021. 'White British' was the most concordant for both SNOMED:2022 and CTV3:2020: {'{:,.0f}'.format(df_sum_pct_ethnicity_new_16["white_british_pct"][0])}% and {'{:,.0f}'.format(df_sum_pct_ethnicity_16["white_british_pct"][0])}%, respectively, of those with the most recent ethnicity classified as ‘White British’ also had the most frequent ethnicity ‘White British’. For both SNOMED:2022 and CTV3:2020, 'Other Black' was the least concordant: {'{:,.0f}'.format(df_sum_pct_ethnicity_new_16["other_black_pct"][13])}% and {'{:,.0f}'.format(df_sum_pct_ethnicity_16["other_black_pct"][13])}% of those with the most recent ethnicity 'Other Black' also had the most frequent ethnicity 'Other Black'.
"""))


Expanding to the 16 group, the percentage of latest ethnicity that matched the most frequent ethnicity falls to nan% for both SNOMED:2022 and CTV3:2020 and nan% for PRIMIS:2021. 'White British' was the most concordant for both SNOMED:2022 and CTV3:2020: nan% and nan%, respectively, of those with the most recent ethnicity classified as ‘White British’ also had the most frequent ethnicity ‘White British’. For both SNOMED:2022 and CTV3:2020, 'Other Black' was the least concordant: nan% and nan% of those with the most recent ethnicity 'Other Black' also had the most frequent ethnicity 'Other Black'.


### Changes in coded ethnicity groups

#### 5 Group

In [53]:
for definition in definitions_5:
        df_state_change = pd.read_csv(f'../output/{output_path_5}/simple_state_change_{definition}{suffix}_registered.csv').set_index(definition)
        df_state_change.columns = df_state_change.columns.str.replace(definition + "_", "")
        #resort rows
        df_state_change = df_state_change.reindex(list(code_dict_5[definition].values()))
        df_state_change = df_state_change.reset_index()
        
        df_state_change[definition]=df_state_change[definition]+": " +df_state_change["n"].apply(lambda x: "{:,.0f}".format(x))
        df_state_change = df_state_change.set_index(definition)
        globals()[f'df_sc_pct_{definition}'] = df_state_change
        for item in lowerlist_5 + list(["any"]):
            df_state_change[item + "_pct"]= round(
                    (df_state_change[item].div(df_state_change["n"])) * 100, 1
                )
        
            df_state_change[item] = (
                    df_state_change[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_state_change[item + "_pct"].astype(str)
                    + ")"
                )
        df_state_change=df_state_change[lowerlist_5 + list(["any"])]
        df_state_change = df_state_change.reset_index()
        df_state_change = df_state_change.rename(definition_dict, axis='columns')
        df_state_change = df_state_change.set_index(definition_dict[definition])
        display(df_state_change)

5 CTV3:2020,white,mixed,asian,black,other,any
White: 100,10 (10.0),nan (nan),10 (10.0),10 (10.0),10 (10.0),70 (70.0)
Mixed: 95,10 (10.5),nan (nan),10 (10.5),nan (nan),nan (nan),70 (73.7)
Asian: 95,10 (10.5),10 (10.5),nan (nan),10 (10.5),nan (nan),70 (73.7)
Black: 90,nan (nan),nan (nan),10 (11.1),10 (11.1),nan (nan),60 (66.7)
Other: 100,10 (10.0),10 (10.0),15 (15.0),10 (10.0),10 (10.0),70 (70.0)


5 SNOMED:2022,white,mixed,asian,black,other,any
White: 300,35 (11.7),30 (10.0),30 (10.0),35 (11.7),30 (10.0),210 (70.0)
Mixed: 45,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),35 (77.8)
Asian: 55,nan (nan),10 (18.2),10 (18.2),nan (nan),nan (nan),35 (63.6)
Black: 50,nan (nan),10 (20.0),nan (nan),nan (nan),nan (nan),25 (50.0)
Other: 45,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),35 (77.8)


5 PRIMIS:2021,white,mixed,asian,black,other,any
White: 90,10 (11.1),nan (nan),nan (nan),10 (11.1),15 (16.7),70 (77.8)
Mixed: 80,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),60 (75.0)
Asian: 110,10 (9.1),15 (13.6),nan (nan),10 (9.1),15 (13.6),70 (63.6)
Black: 110,nan (nan),10 (9.1),15 (13.6),10 (9.1),10 (9.1),80 (72.7)
Other: 95,nan (nan),10 (10.5),nan (nan),10 (10.5),10 (10.5),65 (68.4)


In [54]:
display(Markdown(f"""
Patients whose latest recorded ethnicity was grouped as Mixed were the most likely to have a discordant ethnicity recording (32%), with 24.5% of the 31,040 patients with a latest recording of Mixed ethnicity also having a recording of White ethnicity. Surprisingly, 5.5% of those with the latest recorded ethnicity grouped as Black also had a recorded ethnicity of White.
"""))


Patients whose latest recorded ethnicity was grouped as Mixed were the most likely to have a discordant ethnicity recording (32%), with 24.5% of the 31,040 patients with a latest recording of Mixed ethnicity also having a recording of White ethnicity. Surprisingly, 5.5% of those with the latest recorded ethnicity grouped as Black also had a recorded ethnicity of White.
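
To make the idea of a discordant recording concrete, the sketch below shows one way such a table could be derived from event-level data: take each patient's latest group, flag every group they have ever had recorded, and express the flags as a percentage of patients per latest group. The `records` table, its column names and its values are hypothetical and are not the study's actual inputs.

```python
import pandas as pd

# Hypothetical long-format records: one row per coded ethnicity event.
records = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3, 3, 3, 4],
        "date": pd.to_datetime(
            ["2015-01-01", "2020-06-01", "2018-03-01",
             "2010-01-01", "2016-01-01", "2021-01-01", "2019-05-05"]
        ),
        "group": ["White", "Mixed", "Mixed", "Black", "White", "Black", "White"],
    }
)

# Latest recorded group per patient.
latest = (
    records.sort_values("date")
    .groupby("patient_id")["group"]
    .last()
    .rename("latest")
)

# One indicator column per group: has the patient ever had it recorded?
ever = pd.crosstab(records["patient_id"], records["group"]).gt(0)

# Count, for each latest group, the patients ever recorded in each group.
state_change = ever.join(latest).groupby("latest").sum()

# Express the counts as a percentage of patients with that latest group.
n = latest.value_counts().rename("n")
state_change_pct = state_change.div(n, axis=0).mul(100).round(1)
print(state_change_pct)
```

In this toy example, one of the two patients whose latest group is Mixed also has a White recording, so the Mixed row shows 50% in the White column.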


#### 16 Group

In [55]:
for definition in definitions_16:
        df_state_change = pd.read_csv(f'../output/{output_path_16}/simple_state_change_{definition}{suffix}_registered.csv').set_index(definition)
        df_state_change.columns = df_state_change.columns.str.replace(definition + "_", "")
        df_state_change.columns = df_state_change.columns.str.lower()
        #resort rows
        df_state_change = df_state_change.reindex(list(code_dict_16[definition].values()))
        df_state_change = df_state_change.reset_index()
        
        df_state_change[definition]=df_state_change[definition]+": " +df_state_change["n"].apply(lambda x: "{:,.0f}".format(x))
        df_state_change = df_state_change.set_index(definition)
        globals()[f'df_sc_pct_{definition}'] = df_state_change
        for item in lowerlist_16 + list(["any"]):
            df_state_change[item + "_pct"]= round(
                    (df_state_change[item].div(df_state_change["n"])) * 100, 1
                )
        
            df_state_change[item] = (
                    df_state_change[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_state_change[item + "_pct"].astype(str)
                    + ")"
                )
        df_state_change=df_state_change[lowerlist_16+ list(["any"])]
        df_state_change = df_state_change.reset_index()
        df_state_change = df_state_change.rename(definition_dict, axis='columns')
        df_state_change = df_state_change.set_index(definition_dict[definition])
        df_state_change.columns = df_state_change.columns.str.replace("_", " ")
        display(df_state_change)

16 CTV3:2020,white british,white irish,other white,white and black caribbean,white and black african,white and asian,other mixed,indian,pakistani,bangladeshi,other asian,caribbean,african,other black,chinese,any other ethnic group,any
White_British: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (80.0)
White_Irish: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),30 (85.7)
Other_White: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (60.0)
White_and_Black_Caribbean: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),25 (83.3)
White_and_Black_African: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (66.7)
White_and_Asian: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (60.0)
Other_Mixed: 40,nan (nan),nan (nan),nan (nan),10 (25.0),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),10 (25.0),nan (nan),nan (nan),nan (nan),nan (nan),25 (62.5)
Indian: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),25 (71.4)
Pakistani: 35,nan (nan),nan (nan),10 (28.6),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),25 (71.4)
Bangladeshi: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (60.0)


16 SNOMED:2022,white british,white irish,other white,white and black caribbean,white and black african,white and asian,other mixed,indian,pakistani,bangladeshi,other asian,caribbean,african,other black,chinese,any other ethnic group,any
White_British: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (66.7)
White_Irish: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),25 (71.4)
Other_White: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (80.0)
White_and_Black_Caribbean: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (60.0)
White_and_Black_African: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (57.1)
White_and_Asian: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (50.0)
Other_Mixed: 20,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (75.0)
Indian: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (60.0)
Pakistani: 40,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),25 (62.5)
Bangladeshi: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (60.0)


16 PRIMIS:2021,white british,white irish,other white,white and black caribbean,white and black african,white and asian,other mixed,indian,pakistani,bangladeshi,other asian,caribbean,african,other black,chinese,any other ethnic group,any
White_British: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),25 (71.4)
White_Irish: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),25 (83.3)
Other_White: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (66.7)
White_and_Black_Caribbean: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (80.0)
White_and_Black_African: 30,10 (33.3),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (66.7)
White_and_Asian: 25,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (60.0)
Other_Mixed: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),15 (50.0)
Indian: 30,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (66.7)
Pakistani: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),20 (57.1)
Bangladeshi: 35,nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),nan (nan),25 (71.4)


### Discussion

In [56]:
display(Markdown(f"""
This study has shown that primary care ethnicity data made available via OpenSAFELY is complete for around three quarters of all patients. However, recording ethnicity is not straightforward. Indeed, despite often being used as a key variable to describe health, the idea of “ethnicity” has been disputed. Self-identified ethnicity is not a fixed concept and evolving socio-cultural trends could contribute to changes in a person’s self-identified ethnic group, particularly for those with mixed heritage. It is therefore perhaps not surprising to see lower levels of concordance between latest ethnicity and most common ethnicity in those with latest ethnicity coded as ‘mixed’. 

In OpenSAFELY, ethnicity in primary care was originally grouped using the 2020-CTV3 codelist. However, this codelist does not strictly follow the grouping of the 2001 census, which is the NHS standard for ethnicity. The common practice of supplementing 2020-CTV3 coded ethnicity with either SUS data or the PRIMIS codelists could lead to inconsistent classification, as both SUS data and PRIMIS codelists follow the 2001 census groups.

We believe that the 2022-SNOMED codelist provides a more consistent representation of ethnicity as defined by the 2001 census groups and should be the preferred codelist for primary care ethnicity. 
"""))


This study has shown that primary care ethnicity data made available via OpenSAFELY is complete for around three quarters of all patients. However, recording ethnicity is not straightforward. Indeed, despite often being used as a key variable to describe health, the idea of “ethnicity” has been disputed. Self-identified ethnicity is not a fixed concept and evolving socio-cultural trends could contribute to changes in a person’s self-identified ethnic group, particularly for those with mixed heritage. It is therefore perhaps not surprising to see lower levels of concordance between latest ethnicity and most common ethnicity in those with latest ethnicity coded as ‘mixed’. 

In OpenSAFELY, ethnicity in primary care was originally grouped using the 2020-CTV3 codelist. However, this codelist does not strictly follow the grouping of the 2001 census, which is the NHS standard for ethnicity. The common practice of supplementing 2020-CTV3 coded ethnicity with either SUS data or the PRIMIS codelists could lead to inconsistent classification, as both SUS data and PRIMIS codelists follow the 2001 census groups.

We believe that the 2022-SNOMED codelist provides a more consistent representation of ethnicity as defined by the 2001 census groups and should be the preferred codelist for primary care ethnicity. 


### Limitations

In [57]:
display(Markdown(f"""
It is common for OpenSAFELY studies to supplement the primary care recorded ethnicity, where missing, with ethnicity data from the Secondary Uses Service (SUS). This study has focussed solely on the primary care recorded ethnicity. Because of the way that non-native data, such as GP2GP data and historical data, are imported into TPP, the date on which ethnicity was recorded is not always available; the chronology of ethnicity records is therefore unreliable.
"""))


It is common for OpenSAFELY studies to supplement the primary care recorded ethnicity, where missing, with ethnicity data from the Secondary Uses Service (SUS). This study has focussed solely on the primary care recorded ethnicity. Because of the way that non-native data, such as GP2GP data and historical data, are imported into TPP, the date on which ethnicity was recorded is not always available; the chronology of ethnicity records is therefore unreliable.
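
For readers unfamiliar with the supplementing step mentioned above, the following is a minimal sketch of how a primary care value might be combined with a SUS value where the former is missing. It is not the code used by any particular OpenSAFELY study; the `patients` table and its column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-patient table; the column names are illustrative only.
patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "ethnicity_primary_care": ["Asian", None, None],
        "ethnicity_sus": ["Asian", "Black", None],
    }
)

# Prefer the primary care record; fall back to SUS only where it is missing.
patients["ethnicity_combined"] = patients["ethnicity_primary_care"].fillna(
    patients["ethnicity_sus"]
)

# Record where each combined value came from, so the two sources
# can still be analysed separately later on.
patients["source"] = "missing"
patients.loc[patients["ethnicity_sus"].notna(), "source"] = "sus"
patients.loc[
    patients["ethnicity_primary_care"].notna(), "source"
] = "primary_care"

print(patients[["patient_id", "ethnicity_combined", "source"]])
```

Keeping a `source` column alongside the combined value is a design choice that makes it possible to check whether results differ between patients classified from primary care and those classified from SUS.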


### Conclusion

In [58]:
display(Markdown(f"""
This report describes existing methods to derive primary care ethnicity in OpenSAFELY-TPP and suggests the adoption of the 2022-SNOMED codelist as the new standard method. It is a living document that can be periodically re-run to evaluate the most current best practices for research. If you have improvements or forks, please contact the OpenSAFELY data team.
"""))


This report describes existing methods to derive primary care ethnicity in OpenSAFELY-TPP and suggests the adoption of the 2022-SNOMED codelist as the new standard method. It is a living document that can be periodically re-run to evaluate the most current best practices for research. If you have improvements or forks, please contact the OpenSAFELY data team.


In [59]:
from datetime import date, timedelta
# get data extraction date
extract_date = pd.to_datetime(os.path.getmtime(f"../output/{output_path_16}/simple_patient_counts_registered.csv"), unit='s')
# get notebook run date
run_date = date.today()

display(Markdown(f"""
## Technical details

This notebook was run on {run_date.strftime('%Y-%m-%d')}. The information below is based on data extracted from the OpenSAFELY-TPP database on {extract_date.strftime('%Y-%m-%d')}.

If a clinical code appears in the primary care record on multiple dates, the latest date is used. 


Only patients registered at their practice on January 1 2022 are included.

"""))


## Technical details

This notebook was run on 2022-10-26. The information below is based on data extracted from the OpenSAFELY-TPP database on 2022-10-24.

If a clinical code appears in the primary care record on multiple dates, the latest date is used. 


Only patients registered at their practice on January 1 2022 are included.
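
The rule that the latest date is used when a code appears more than once can be illustrated with a short pandas sketch. The `events` table, its column names and the placeholder codes are hypothetical; the real extraction is performed inside the OpenSAFELY-TPP environment and may be implemented differently.

```python
import pandas as pd

# Hypothetical coded events; "A" and "B" are placeholder codes.
events = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2],
        "code": ["A", "A", "B", "A"],
        "date": pd.to_datetime(
            ["2019-01-01", "2021-07-01", "2020-02-02", "2018-05-05"]
        ),
    }
)

# Keep one row per patient and code, carrying the latest date it appeared.
latest_per_code = events.groupby(["patient_id", "code"], as_index=False)[
    "date"
].max()

print(latest_per_code)
```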



In [60]:
print("state change")
for definition in definitions_5:
    print(definition)
    percs=globals()[f"df_sc_pct_{definition}"]
    percs=percs.loc[:, percs.columns.str.endswith('pct')]
    percs = percs.drop( columns='any_pct')
    diags=np.diagonal(percs)
    display(diags)


    print(f"minimum is {min(diags)} in {percs.columns.values[diags==min(diags)]}")
    print(f"maximum is {max(diags)} in {percs.columns.values[diags==max(diags)]}")

for definition in definitions_16:
    print(definition)
    percs=globals()[f"df_sc_pct_{definition}"]
    percs=percs.loc[:, percs.columns.str.endswith('pct')]
    percs = percs.drop( columns='any_pct')
    diags=np.diagonal(percs)

    print(f"minimum is {min(diags)} in {percs.columns.values[diags==min(diags)]}")
    print(f"maximum is {max(diags)} in {percs.columns.values[diags==max(diags)]}")



state change
ethnicity_5


array([10. ,  nan,  nan, 11.1, 10. ])

minimum is 10.0 in ['white_pct' 'other_pct']
maximum is 11.1 in ['black_pct']
ethnicity_new_5


array([11.7,  nan, 18.2,  nan,  nan])

minimum is 11.7 in ['white_pct']
maximum is 18.2 in ['asian_pct']
ethnicity_primis_5


array([11.1,  nan,  nan,  9.1, 10.5])

minimum is 9.1 in ['black_pct']
maximum is 11.1 in ['white_pct']
ethnicity_16
minimum is nan in []
maximum is nan in []
ethnicity_new_16
minimum is nan in []
maximum is nan in []
ethnicity_primis_16
minimum is nan in []
maximum is nan in []
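
Python's built-in `min` and `max` do not skip NaN, so their result depends on where the NaN values sit in the array; this is why the 16 group summaries above report `nan` with an empty label list. A possible NaN-aware alternative is sketched below. The helper name `diagonal_range` is hypothetical, and the example values are the `ethnicity_5` diagonal printed above.

```python
import numpy as np


def diagonal_range(diags, labels):
    """Return (min, min_labels, max, max_labels), ignoring NaN entries.

    Returns None when every entry is NaN, so callers can report that
    explicitly instead of printing "nan in []".
    """
    diags = np.asarray(diags, dtype=float)
    labels = np.asarray(labels)
    valid = ~np.isnan(diags)
    if not valid.any():
        return None
    lo = diags[valid].min()
    hi = diags[valid].max()
    # NaN never compares equal, so these masks only select valid entries.
    return lo, labels[diags == lo].tolist(), hi, labels[diags == hi].tolist()


result = diagonal_range(
    [10.0, np.nan, np.nan, 11.1, 10.0],
    ["white_pct", "mixed_pct", "asian_pct", "black_pct", "other_pct"],
)
print(result)

# An all-NaN diagonal is reported as None rather than NaN.
print(diagonal_range([np.nan, np.nan], ["a_pct", "b_pct"]))
```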


In [61]:
print("latest / most frequent")
for definition in definitions_5:
    print(definition)
    percs=globals()[f"df_sum_pct_{definition}"]
    percs=percs.loc[:, percs.columns.str.endswith('pct')]
    np.fill_diagonal(percs.values, np.nan)
    print("Minimums")
    display(percs.min(axis=1))
    print("Maximums")
    display(percs.max(axis=1))
    print("")

for definition in definitions_16:
    print(definition)
    percs=globals()[f"df_sum_pct_{definition}"]
    percs=percs.loc[:, percs.columns.str.endswith('pct')]
    np.fill_diagonal(percs.values, np.nan)
    print("Minimums")
    display(percs.min(axis=1))
    display(percs.min(axis=1).min())
    print("Maximums")
    display(percs.max(axis=1))
    percs.max(axis=1).max()
    print("")



latest / most frequent
ethnicity_5
Minimums


ethnicity_5
White   25
Mixed   50
Asian   33
Black   50
Other   60
dtype: float64

Maximums


ethnicity_5
White   25
Mixed   50
Asian   33
Black   50
Other   60
dtype: float64


ethnicity_new_5
Minimums


ethnicity_new_5
White    19
Mixed   NaN
Asian   100
Black   100
Other   NaN
dtype: float64

Maximums


ethnicity_new_5
White    23
Mixed   NaN
Asian   100
Black   100
Other   NaN
dtype: float64


ethnicity_primis_5
Minimums


ethnicity_primis_5
White   100
Mixed   NaN
Asian    22
Black    25
Other   100
dtype: float64

Maximums


ethnicity_primis_5
White   100
Mixed   NaN
Asian    33
Black    25
Other   100
dtype: float64


ethnicity_16
Minimums


ethnicity_16
White_British               NaN
White_Irish                 NaN
Other_White                 NaN
White_and_Black_Caribbean   NaN
White_and_Black_African     NaN
White_and_Asian             NaN
Other_Mixed                 NaN
Indian                      NaN
Pakistani                   NaN
Bangladeshi                 NaN
Other_Asian                 NaN
Caribbean                   NaN
African                     NaN
Other_Black                 NaN
Chinese                     NaN
Any_other_ethnic_group      NaN
dtype: float64

nan

Maximums


ethnicity_16
White_British               NaN
White_Irish                 NaN
Other_White                 NaN
White_and_Black_Caribbean   NaN
White_and_Black_African     NaN
White_and_Asian             NaN
Other_Mixed                 NaN
Indian                      NaN
Pakistani                   NaN
Bangladeshi                 NaN
Other_Asian                 NaN
Caribbean                   NaN
African                     NaN
Other_Black                 NaN
Chinese                     NaN
Any_other_ethnic_group      NaN
dtype: float64


ethnicity_new_16
Minimums


ethnicity_new_16
White_British               NaN
White_Irish                 NaN
Other_White                 NaN
White_and_Black_Caribbean   NaN
White_and_Black_African     NaN
White_and_Asian             NaN
Other_Mixed                 NaN
Indian                      NaN
Pakistani                   NaN
Bangladeshi                 NaN
Other_Asian                 NaN
Caribbean                   NaN
African                     NaN
Other_Black                 NaN
Chinese                     NaN
Any_other_ethnic_group      NaN
dtype: float64

nan

Maximums


ethnicity_new_16
White_British               NaN
White_Irish                 NaN
Other_White                 NaN
White_and_Black_Caribbean   NaN
White_and_Black_African     NaN
White_and_Asian             NaN
Other_Mixed                 NaN
Indian                      NaN
Pakistani                   NaN
Bangladeshi                 NaN
Other_Asian                 NaN
Caribbean                   NaN
African                     NaN
Other_Black                 NaN
Chinese                     NaN
Any_other_ethnic_group      NaN
dtype: float64


ethnicity_primis_16
Minimums


ethnicity_primis_16
White_British               NaN
White_Irish                 NaN
Other_White                 NaN
White_and_Black_Caribbean   NaN
White_and_Black_African     NaN
White_and_Asian             NaN
Other_Mixed                 NaN
Indian                      NaN
Pakistani                   NaN
Bangladeshi                 NaN
Other_Asian                 NaN
Caribbean                   NaN
African                     NaN
Other_Black                 NaN
Chinese                     NaN
Any_other_ethnic_group      NaN
dtype: float64

nan

Maximums


ethnicity_primis_16
White_British               NaN
White_Irish                 NaN
Other_White                 NaN
White_and_Black_Caribbean   NaN
White_and_Black_African     NaN
White_and_Asian             NaN
Other_Mixed                 NaN
Indian                      NaN
Pakistani                   NaN
Bangladeshi                 NaN
Other_Asian                 NaN
Caribbean                   NaN
African                     NaN
Other_Black                 NaN
Chinese                     NaN
Any_other_ethnic_group      NaN
dtype: float64
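
The per-row minimums and maximums above are taken over off-diagonal cells only. A minimal sketch of that summary on a small table, assuming rows and `*_pct` columns are listed in the same group order; the `off_diagonal_range` helper and the `toy` values are hypothetical and are not study data:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 percentage table (illustrative values only, not study data).
groups = ["White", "Asian", "Black"]
toy = pd.DataFrame(
    [[90.0, 6.0, 4.0],
     [22.0, 70.0, 8.0],
     [25.0, 5.0, 70.0]],
    index=groups,
    columns=[f"{g.lower()}_pct" for g in groups],
)


def off_diagonal_range(pct_table):
    """Per-row minimum and maximum over off-diagonal cells; the input is not modified."""
    on_diagonal = np.eye(*pct_table.shape, dtype=bool)
    off_diag = pct_table.mask(on_diagonal)  # diagonal cells become NaN in a copy
    return pd.DataFrame({"min": off_diag.min(axis=1), "max": off_diag.max(axis=1)})


summary = off_diagonal_range(toy)
print(summary)
```

Because `DataFrame.mask` returns a new frame, the source table keeps its diagonal values, which is useful if the same table is reused in later cells.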


