# Identifying Ethnicity in OpenSAFELY-TPP
This short report describes how ethnicity can be identified in the OpenSAFELY-TPP database, and the strengths and weaknesses of the available methods. Ethnicity is known to be an important determinant of health outcomes, particularly during the COVID-19 pandemic, when a complex interplay of social and biological factors resulted in increased exposure, reduced protection, and increased severity of illness. The recording of patients’ ethnic group in primary care can support efforts to achieve equity in service provision and outcomes. This is a living document that will be updated to reflect changes to the OpenSAFELY-TPP database and the patient records within.

## OpenSAFELY
OpenSAFELY is an analytics platform for conducting analyses on Electronic Health Records inside the secure environment where the records are held. This has multiple benefits: 

* We don't transport large volumes of potentially disclosive pseudonymised patient data outside of the secure environments for analysis
* Analyses can run in near real-time as records are ready for analysis as soon as they appear in the secure environment
* All infrastructure and analysis code is stored in GitHub repositories, which are open for security review, scientific review, and re-use

A key feature of OpenSAFELY is the use of study definitions, which are formal specifications of the datasets to be generated from the OpenSAFELY database. This takes care of much of the complex EHR data wrangling required to create a dataset in an analysis-ready format. It also creates a library of standardised and validated variable definitions that can be deployed consistently across multiple projects. 

The purpose of this report is to describe the main variables that relate to ethnicity, and their relative strengths and weaknesses.

## Available Records
OpenSAFELY-TPP runs inside TPP’s data centre, which contains the primary care records for all patients registered at practices using TPP’s SystmOne Clinical Information System. This data centre also imports external datasets from other sources, including A&E attendances and hospital admissions from NHS Digital’s Secondary Use Service, and death registrations from the ONS. More information on available data sources can be found within the [OpenSAFELY documentation](https://docs.opensafely.org/data-sources/intro/).

## Methods

In OpenSAFELY-TPP, there is no single categorical “ethnicity” variable. Rather, ethnicity is recorded using clinical codes, like any other clinical or administrative event, with specific codes relating to specific ethnic groups.

We define three codelists to capture primary care ethnicity in OpenSAFELY-TPP: "[2020-CTV3](https://www.opencodelists.org/codelist/opensafely/ethnicity/2020-04-27)", "[2022-SNOMED](https://www.opencodelists.org/codelist/opensafely/ethnicity-snomed-0removed/2e641f61/)" and "[2021-PRIMIS](https://www.opencodelists.org/codelist/primis-covid19-vacc-uptake/eth2001/v1/)".


To evaluate how well each of these codelists is populated, we count the number of patients with at least one recorded code from each codelist, as well as the number of patients in each ethnicity group.
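
The counting step can be sketched roughly as follows. This is a minimal illustration on a hypothetical toy table (the `patients` frame and its values are invented here), assuming one column per codelist that holds the assigned group, or nothing where no code was found. It is not the pipeline that produced the results in this report.

```python
import pandas as pd

# Hypothetical toy data: one row per patient, one column per codelist,
# holding the assigned 5-group ethnicity (None where no code was found).
patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "ethnicity_5": ["White", "Asian", None, "Black"],
        "ethnicity_new_5": ["White", "Asian", None, None],
        "ethnicity_primis_5": ["White", None, None, None],
    }
)
definitions = ["ethnicity_5", "ethnicity_new_5", "ethnicity_primis_5"]

# A patient counts as "filled" for a codelist if any group was assigned.
filled = patients[definitions].notna()

# Patients with and without a record, per codelist.
counts = filled.sum().rename("filled").to_frame()
counts["missing"] = len(patients) - counts["filled"]

# Patients recorded in every codelist, and in none of them.
all_filled = int(filled.all(axis=1).sum())
all_missing = int((~filled).all(axis=1).sum())

print(counts)
print(f"all filled: {all_filled}, all missing: {all_missing}")
```

The same boolean table can be grouped by any covariate (for example age band) before summing to obtain subgroup counts.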

We examine trends across the whole population and by each of the following demographic and clinical subgroups to detect any inequalities.

Demographic covariates:

* Age band
* Sex
* Ethnicity
* Region
* IMD

Clinical covariates:

* Dementia
* Diabetes
* Learning disability

Any counts below 6 were redacted, and all other values were rounded to the nearest 5.
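
The disclosure-control rule above can be illustrated with a small helper. This is a sketch only: the function name `redact_and_round` and its parameters are hypothetical, and it assumes that zero counts are redacted along with other counts below 6, which the report does not state explicitly.

```python
import numpy as np
import pandas as pd


def redact_and_round(counts: pd.Series, threshold: int = 6, base: int = 5) -> pd.Series:
    """Suppress counts below `threshold` and round the rest to the nearest `base`."""
    counts = counts.astype(float)
    rounded = (counts / base).round() * base
    # Keep the rounded value only where the raw count is at or above the threshold.
    return rounded.where(counts >= threshold, np.nan)


raw = pd.Series([3, 6, 12, 13, 1000], index=["a", "b", "c", "d", "e"])
safe = redact_and_round(raw)
print(safe)
```

Redaction is applied to the raw count before rounding, so a count of 6 is kept (and reported as 5) while a count of 5 is suppressed.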

In [34]:
import sys

In [26]:
import os
import pandas as pd
import numpy as np
from itertools import product
from IPython.display import display, Markdown, Image

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
pd.options.mode.chained_assignment = None 
pd.options.display.float_format = '{:,.0f}'.format

In [27]:

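def local_patient_counts(
    definitions, output_path, code_dict=None, categories=False, missing=False,
):
    """Display counts and percentages of patients for each codelist definition.

    With missing=True, patients without a record are reported instead.
    With categories=True, counts are broken down by ethnicity category,
    which requires code_dict.
    """
    suffix = "_filled"
    overlap = "all_filled"
    if missing:
        suffix = "_missing"
        overlap = "all_missing"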
    if categories:
        df_population = pd.read_csv(
            f"../output/{output_path}/simple_patient_counts_fullset.csv"
        ).set_index(["group", "subgroup"])
        

        df_append = pd.read_csv(
            f"../output/{output_path}/simple_patient_counts_categories_fullset.csv"
        ).set_index(["group", "subgroup"])
        df_append.drop("population", inplace=True, axis=1)
        df_append["population"] = df_population["ethnicity_new_5_filled"]
        # ensure definitions[n] in code_dict[definitions[n]] below refers to one of the definitions of interest
        definitions = [
            f"{category}_{definition}"
            for category, definition in product(
                code_dict[definitions[1]].values(), definitions
            )
        ]

    else:
        df_append = pd.read_csv(
            f"../output/{output_path}/simple_patient_counts_fullset.csv"
        ).set_index(["group", "subgroup"])

    for definition in definitions:
        if missing:
            # Missing count = population minus patients with a record
            df_append[definition + suffix] = (
                df_append["population"] - df_append[definition + "_filled"]
            )
        df_append[definition + "_pct"] = round(
            (df_append[definition + suffix].div(df_append["population"])) * 100, 1
        )

        # Combine count and percentage columns
        df_append[definition] = (
            df_append[definition + suffix].apply(lambda x: "{:,.0f}".format(x))
            + " ("
            + df_append[definition + "_pct"].astype(str)
            + ")"
        )
        df_append = df_append.drop(columns=[definition + suffix, definition + "_pct"])

    # Percentage of patients recorded (or missing) across all definitions
    df_append[overlap + "_pct"] = round(
        (df_append[overlap].div(df_append["population"])) * 100, 1
    )
    df_append[overlap] = (
        df_append[overlap].apply(lambda x: "{:,.0f}".format(x))
        + " ("
        + df_append[overlap + "_pct"].astype(str)
        + ")"
    )
    df_append = df_append.drop(columns=[overlap + "_pct"])
    df_patient_counts = df_append[definitions + [overlap] + ["population"]]
    # Final redaction step
    df_patient_counts = df_patient_counts.replace(np.nan, "-")
    df_patient_counts = df_patient_counts.replace("nan (nan)", "- (-)")
    df_patient_counts.columns = df_patient_counts.columns.str.replace("_", " ")
    
    display(df_patient_counts)
    if categories:
        df_patient_counts.to_csv(
                f"../output/{output_path}/local_patient_counts_categories_fullset.csv"
            )


In [28]:
### CONFIGURE ###
definitions = ['ethnicity_5', 'ethnicity_new_5', 'ethnicity_primis_5']
covariates = ['_age_band','_sex','_region','_imd','_dementia','_diabetes','_hypertension','_learning_disability']
output_path = 'from_jobserver/release_2022_07_28/output/simplified_output/5_group/tables'
suffixes = ['','_missing']

code_dict = {
    "imd": {
        0: "Unknown",
        1: "1 Most deprived",
        2: "2",
        3: "3",
        4: "4",
        5: "5 Least deprived",
    },
    "ethnicity_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
    "ethnicity_new_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
    "ethnicity_primis_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
}



## Results

### Count of Patients

Around 19.4 million patients who have been registered in OpenSAFELY-TPP (58.6%) have a recorded ethnicity in all three codelists. 2020-CTV3 is the most well-populated, with around 24.8 million patients (75.2%) having at least one 2020-CTV3 recording of ethnicity.

In [29]:
local_patient_counts(
         definitions,  output_path
    )

| group | subgroup | ethnicity 5 | ethnicity new 5 | ethnicity primis 5 | all filled | population |
| --- | --- | --- | --- | --- | --- | --- |
| all | with records | 24,817,085 (75.2) | 24,762,715 (75.0) | 19,352,090 (58.6) | 19,352,090 (58.6) | 33018910 |
| age_band | 0-19 | 4,006,955 (63.0) | 3,993,025 (62.8) | 3,025,820 (47.6) | 3,025,820 (47.6) | 6356925 |
| age_band | 20-29 | 3,065,155 (72.7) | 3,056,200 (72.5) | 2,439,855 (57.9) | 2,439,855 (57.9) | 4214945 |
| age_band | 30-39 | 4,014,055 (81.8) | 4,002,260 (81.6) | 3,222,795 (65.7) | 3,222,795 (65.7) | 4905765 |
| age_band | 40-49 | 3,282,495 (82.0) | 3,273,850 (81.8) | 2,607,295 (65.1) | 2,607,295 (65.1) | 4003700 |
| age_band | 50-59 | 3,215,505 (80.3) | 3,210,730 (80.2) | 2,527,070 (63.1) | 2,527,070 (63.1) | 4003210 |
| age_band | 60-69 | 2,617,490 (80.4) | 2,614,625 (80.3) | 2,043,475 (62.8) | 2,043,475 (62.8) | 3254410 |
| age_band | 70-79 | 2,258,985 (79.7) | 2,257,210 (79.6) | 1,747,990 (61.7) | 1,747,990 (61.7) | 2834130 |
| age_band | 80+ | 2,329,560 (69.3) | 2,328,110 (69.3) | 1,723,985 (51.3) | 1,723,985 (51.3) | 3360495 |
| age_band | missing | 26,885 (31.5) | 26,710 (31.3) | 13,805 (16.2) | 13,805 (16.2) | 85330 |


### Count of Missings

Around 8.2 million patients who have been registered in OpenSAFELY-TPP (24.8%) have no recorded ethnicity in any of the three codelists. 2021-PRIMIS is the least well-populated, with around 13.7 million patients (41.4%) having no 2021-PRIMIS recording of ethnicity.

In [30]:
local_patient_counts(
         definitions,  output_path, missing= True
    )

| group | subgroup | ethnicity 5 | ethnicity new 5 | ethnicity primis 5 | all missing | population |
| --- | --- | --- | --- | --- | --- | --- |
| all | with records | 8,201,825 (24.8) | 8,256,195 (25.0) | 13,666,820 (41.4) | 8,201,825 (24.8) | 33018910 |
| age_band | 0-19 | 2,349,970 (37.0) | 2,363,900 (37.2) | 3,331,105 (52.4) | 2,349,965 (37.0) | 6356925 |
| age_band | 20-29 | 1,149,790 (27.3) | 1,158,745 (27.5) | 1,775,090 (42.1) | 1,149,785 (27.3) | 4214945 |
| age_band | 30-39 | 891,710 (18.2) | 903,505 (18.4) | 1,682,970 (34.3) | 891,715 (18.2) | 4905765 |
| age_band | 40-49 | 721,205 (18.0) | 729,850 (18.2) | 1,396,405 (34.9) | 721,205 (18.0) | 4003700 |
| age_band | 50-59 | 787,705 (19.7) | 792,480 (19.8) | 1,476,140 (36.9) | 787,705 (19.7) | 4003210 |
| age_band | 60-69 | 636,920 (19.6) | 639,785 (19.7) | 1,210,935 (37.2) | 636,920 (19.6) | 3254410 |
| age_band | 70-79 | 575,145 (20.3) | 576,920 (20.4) | 1,086,140 (38.3) | 575,145 (20.3) | 2834130 |
| age_band | 80+ | 1,030,935 (30.7) | 1,032,385 (30.7) | 1,636,510 (48.7) | 1,030,940 (30.7) | 3360495 |
| age_band | missing | 58,445 (68.5) | 58,620 (68.7) | 71,525 (83.8) | 58,450 (68.5) | 85330 |


### Count by Category

The 2022-SNOMED codelist is the most well-populated for the White (20.5 million), Mixed (480,000), Asian (2.2 million) and Black (790,000) groups. The 2020-CTV3 codelist classifies more people as Other than the 2022-SNOMED codelist (850,000 and 730,000 respectively); however, the 2020-CTV3 codelist included some codes relating to religion rather than ethnicity (e.g. “XaJSe: Muslim - ethnic category 2001 census”), which were excluded from the 2022-SNOMED codelist.

In [31]:
local_patient_counts(
         definitions,  output_path,code_dict, categories=True,missing=False
    )

| group | subgroup | White ethnicity 5 | White ethnicity new 5 | White ethnicity primis 5 | Mixed ethnicity 5 | Mixed ethnicity new 5 | Mixed ethnicity primis 5 | Asian ethnicity 5 | Asian ethnicity new 5 | Asian ethnicity primis 5 | Black ethnicity 5 | Black ethnicity new 5 | Black ethnicity primis 5 | Other ethnicity 5 | Other ethnicity new 5 | Other ethnicity primis 5 | all filled | population |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| all | with records | 20,501,205 (82.8) | 20,514,355 (82.8) | 15,715,345 (63.5) | 479,120 (1.9) | 483,350 (2.0) | 433,665 (1.8) | 2,194,055 (8.9) | 2,239,570 (9.0) | 1,898,245 (7.7) | 792,185 (3.2) | 794,645 (3.2) | 616,020 (2.5) | 850,520 (3.4) | 730,800 (3.0) | 688,810 (2.8) | 19,352,090 (78.2) | 24762715 |
| age_band | 0-19 | 3,070,020 (76.9) | 3,071,530 (76.9) | 2,253,310 (56.4) | 161,700 (4.0) | 162,645 (4.1) | 138,820 (3.5) | 454,930 (11.4) | 461,805 (11.6) | 388,275 (9.7) | 173,080 (4.3) | 173,490 (4.3) | 129,310 (3.2) | 147,225 (3.7) | 123,555 (3.1) | 116,110 (2.9) | 3,025,820 (75.8) | 3993025 |
| age_band | 20-29 | 2,308,045 (75.5) | 2,310,130 (75.6) | 1,799,635 (58.9) | 89,720 (2.9) | 90,590 (3.0) | 81,190 (2.7) | 338,065 (11.1) | 345,210 (11.3) | 287,910 (9.4) | 129,275 (4.2) | 129,720 (4.2) | 101,105 (3.3) | 200,055 (6.5) | 180,550 (5.9) | 170,015 (5.6) | 2,439,855 (79.8) | 3056200 |
| age_band | 30-39 | 3,049,690 (76.2) | 3,054,410 (76.3) | 2,404,230 (60.1) | 91,930 (2.3) | 93,015 (2.3) | 85,260 (2.1) | 508,695 (12.7) | 519,050 (13.0) | 439,200 (11.0) | 151,425 (3.8) | 152,010 (3.8) | 119,035 (3.0) | 212,315 (5.3) | 183,780 (4.6) | 175,070 (4.4) | 3,222,795 (80.5) | 4002260 |
| age_band | 40-49 | 2,541,775 (77.6) | 2,544,920 (77.7) | 1,976,715 (60.4) | 61,165 (1.9) | 61,870 (1.9) | 57,810 (1.8) | 401,000 (12.2) | 410,945 (12.6) | 351,925 (10.7) | 139,290 (4.3) | 139,745 (4.3) | 109,965 (3.4) | 139,265 (4.3) | 116,365 (3.6) | 110,875 (3.4) | 2,607,295 (79.6) | 3273850 |
| age_band | 50-59 | 2,785,445 (86.8) | 2,786,690 (86.8) | 2,161,370 (67.3) | 38,985 (1.2) | 39,380 (1.2) | 36,870 (1.1) | 210,645 (6.6) | 216,705 (6.7) | 186,905 (5.8) | 104,640 (3.3) | 104,955 (3.3) | 83,045 (2.6) | 75,785 (2.4) | 62,995 (2.0) | 58,885 (1.8) | 2,527,070 (78.7) | 3210730 |
| age_band | 60-69 | 2,369,120 (90.6) | 2,369,565 (90.6) | 1,832,900 (70.1) | 18,950 (0.7) | 19,095 (0.7) | 18,105 (0.7) | 138,500 (5.3) | 141,265 (5.4) | 121,370 (4.6) | 49,975 (1.9) | 50,125 (1.9) | 39,425 (1.5) | 40,950 (1.6) | 34,570 (1.3) | 31,675 (1.2) | 2,043,475 (78.2) | 2614625 |
| age_band | 70-79 | 2,132,730 (94.5) | 2,132,735 (94.5) | 1,641,310 (72.7) | 8,415 (0.4) | 8,470 (0.4) | 7,975 (0.4) | 76,945 (3.4) | 78,530 (3.5) | 67,250 (3.0) | 20,410 (0.9) | 20,465 (0.9) | 15,970 (0.7) | 20,490 (0.9) | 17,010 (0.8) | 15,485 (0.7) | 1,747,990 (77.4) | 2257210 |
| age_band | 80+ | 2,223,270 (95.5) | 2,223,270 (95.5) | 1,635,770 (70.3) | 6,865 (0.3) | 6,895 (0.3) | 6,650 (0.3) | 62,680 (2.7) | 63,425 (2.7) | 53,690 (2.3) | 23,130 (1.0) | 23,170 (1.0) | 17,690 (0.8) | 13,615 (0.6) | 11,355 (0.5) | 10,190 (0.4) | 1,723,985 (74.1) | 2328110 |
| age_band | missing | 21,110 (79.0) | 21,100 (79.0) | 10,100 (37.8) | 1,390 (5.2) | 1,390 (5.2) | 985 (3.7) | 2,600 (9.7) | 2,630 (9.8) | 1,720 (6.4) | 965 (3.6) | 965 (3.6) | 480 (1.8) | 820 (3.1) | 620 (2.3) | 515 (1.9) | 13,805 (51.7) | 26710 |


### Latest vs. Most Common

Overall, around 98% of patients had a latest 5-group ethnicity that matched their most frequent 5-group ethnicity. 99.3% (2022-SNOMED) of those with the most recent ethnicity ‘White’ also had the most frequent ethnicity ‘White’. Mixed was the least concordant, with 77.8% of those with the most recent ethnicity ‘Mixed’ also having the most frequent ethnicity ‘Mixed’. 2.8% of those with the latest ethnicity ‘Black’ had the most frequent ethnicity ‘White’.
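
The comparison between latest and most frequent ethnicity can be sketched on a toy event log. The `events` frame below is hypothetical, and the tie-breaking rule for the most frequent category (alphabetical order) is an arbitrary assumption made for illustration; the report does not say how ties were handled.

```python
import pandas as pd

# Hypothetical toy event log: one row per ethnicity code recorded for a patient.
events = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2, 2, 3],
        "date": pd.to_datetime(
            [
                "2010-01-01", "2015-06-01", "2020-03-01",
                "2012-02-01", "2019-09-01", "2018-05-01",
            ]
        ),
        "ethnicity": ["White", "White", "Mixed", "Asian", "Asian", "Black"],
    }
)
events = events.sort_values(["patient_id", "date"])

# Latest recorded category per patient.
latest = events.groupby("patient_id")["ethnicity"].last()

# Most frequently recorded category per patient.
# Ties go to the alphabetically first category (an arbitrary choice).
most_common = events.groupby("patient_id")["ethnicity"].agg(
    lambda s: s.value_counts().sort_index().idxmax()
)

# Rows: latest category; columns: most frequent category.
table = pd.crosstab(latest, most_common, rownames=["latest"], colnames=["most_common"])
match_pct = round((latest == most_common).mean() * 100, 1)

print(table)
print(f"matching: {match_pct}%")
```

Passing `normalize="index"` to `pd.crosstab` gives row proportions, which correspond to the percentages shown in brackets in the tables that follow.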


In [32]:
for definition in definitions:
    for suffix in suffixes:
        df_sum = pd.read_csv(f'../output/{output_path}/simple_latest_common_{definition}{suffix}_fullset.csv').set_index(definition)
        # sort rows by category index
        df_sum.columns = df_sum.columns.str.replace(definition + "_", "")
        df_sum.columns = df_sum.columns.str.lower()
        df_sum = df_sum.reindex(list(code_dict[definition].values()))
        
        df_counts = pd.DataFrame(
            np.diagonal(df_sum),
            index=df_sum.index,
        #   columns=[f"matching (n={np.diagonal(df_sum).sum()})"],
        )

        df_sum2 = df_sum.copy(deep=True)
        np.fill_diagonal(df_sum2.values, 0)
        df_diag = pd.DataFrame(
            df_sum2.sum(axis=1),
        )
        df_out = df_counts.merge(df_diag, right_index=True, left_index=True)
        # Overall share of patients whose latest and most common categories match
        pct = round(df_out.sum() / df_out.sum(axis=1).sum() * 100, 1)
        df_out.columns = [f"matching ({pct.iloc[0]}%)", f"not matching ({pct.iloc[1]}%)"]
        display(df_out)

        # Order the columns to match the category order used for the rows
        lowerlist = [x.lower() for x in code_dict[definition].values()]
        df_sum = df_sum[lowerlist]

        # Combine count and percentage columns
        df_sum["population"]=df_sum.sum(axis = 1)
        for item in lowerlist:
            df_sum[item + "_pct"]= round(
                    (df_sum[item].div(df_sum["population"])) * 100, 1
                )
        
            df_sum[item] = (
                    df_sum[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_sum[item + "_pct"].astype(str)
                    + ")"
                )
        df_sum = df_sum[lowerlist]

        display(df_sum)
    # df_expanded = pd.read_csv(f'../output/{output_path}/tables/latest_common_expanded_{definition}.csv').set_index(definition)
    
    # display(df_expanded)


| ethnicity_5 | matching (97.7%) | not matching (2.3%) |
| --- | --- | --- |
| White | 20463590 | 157970 |
| Mixed | 438935 | 127000 |
| Asian | 2174830 | 88910 |
| Black | 777160 | 65380 |
| Other | 803000 | 150885 |


| ethnicity_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White | 20,463,590 (99.2) | 43,385 (0.2) | 28,555 (0.1) | 21,435 (0.1) | 64,595 (0.3) |
| Mixed | 57,240 (10.1) | 438,935 (77.6) | 18,085 (3.2) | 33,480 (5.9) | 18,195 (3.2) |
| Asian | 25,760 (1.1) | 15,350 (0.7) | 2,174,830 (96.1) | 5,585 (0.2) | 42,215 (1.9) |
| Black | 23,870 (2.8) | 25,825 (3.1) | 5,405 (0.6) | 777,160 (92.2) | 10,280 (1.2) |
| Other | 80,740 (8.5) | 16,990 (1.8) | 42,195 (4.4) | 10,960 (1.1) | 803,000 (84.2) |


| ethnicity_5 | matching (97.0%) | not matching (3.0%) |
| --- | --- | --- |
| White | 1322775 | 14285 |
| Mixed | 23485 | 10930 |
| Asian | 171805 | 7490 |
| Black | 53870 | 5805 |
| Other | 36030 | 11890 |


| ethnicity_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White | 1,322,775 (98.9) | 3,975 (0.3) | 2,720 (0.2) | 1,810 (0.1) | 5,780 (0.4) |
| Mixed | 5,205 (15.1) | 23,485 (68.2) | 1,685 (4.9) | 2,840 (8.3) | 1,200 (3.5) |
| Asian | 2,850 (1.6) | 1,055 (0.6) | 171,805 (95.8) | 655 (0.4) | 2,930 (1.6) |
| Black | 2,590 (4.3) | 1,705 (2.9) | 620 (1.0) | 53,870 (90.3) | 890 (1.5) |
| Other | 5,910 (12.3) | 1,165 (2.4) | 3,990 (8.3) | 825 (1.7) | 36,030 (75.2) |


| ethnicity_new_5 | matching (97.9%) | not matching (2.1%) |
| --- | --- | --- |
| White | 20482790 | 145785 |
| Mixed | 443835 | 126780 |
| Asian | 2224215 | 80785 |
| Black | 781205 | 63610 |
| Other | 694955 | 118290 |


| ethnicity_new_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White | 20,482,790 (99.3) | 43,480 (0.2) | 29,300 (0.1) | 21,645 (0.1) | 51,360 (0.2) |
| Mixed | 58,535 (10.3) | 443,835 (77.8) | 19,120 (3.4) | 34,170 (6.0) | 14,955 (2.6) |
| Asian | 26,575 (1.2) | 15,920 (0.7) | 2,224,215 (96.5) | 5,720 (0.2) | 32,570 (1.4) |
| Black | 24,050 (2.8) | 25,475 (3.0) | 5,540 (0.7) | 781,205 (92.5) | 8,545 (1.0) |
| Other | 65,195 (8.0) | 13,535 (1.7) | 30,520 (3.8) | 9,040 (1.1) | 694,955 (85.5) |


| ethnicity_new_5 | matching (97.6%) | not matching (2.4%) |
| --- | --- | --- |
| White | 1319570 | 8985 |
| Mixed | 23245 | 10140 |
| Asian | 172145 | 5500 |
| Black | 53580 | 4945 |
| Other | 21045 | 9385 |


| ethnicity_new_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White | 1,319,570 (99.3) | 3,875 (0.3) | 2,710 (0.2) | 1,805 (0.1) | 595 (0.0) |
| Mixed | 5,240 (15.7) | 23,245 (69.6) | 1,705 (5.1) | 2,850 (8.5) | 345 (1.0) |
| Asian | 2,890 (1.6) | 1,055 (0.6) | 172,145 (96.9) | 655 (0.4) | 900 (0.5) |
| Black | 2,590 (4.4) | 1,630 (2.8) | 620 (1.1) | 53,580 (91.6) | 105 (0.2) |
| Other | 4,875 (16.0) | 910 (3.0) | 2,945 (9.7) | 655 (2.2) | 21,045 (69.2) |


| ethnicity_primis_5 | matching (98.0%) | not matching (2.0%) |
| --- | --- | --- |
| White | 15694085 | 111210 |
| Mixed | 408965 | 91330 |
| Asian | 1888365 | 56565 |
| Black | 606890 | 46270 |
| Other | 664325 | 89120 |


| ethnicity_primis_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White | 15,694,085 (99.3) | 32,390 (0.2) | 21,965 (0.1) | 14,980 (0.1) | 41,875 (0.3) |
| Mixed | 41,995 (8.4) | 408,965 (81.7) | 13,965 (2.8) | 23,215 (4.6) | 12,155 (2.4) |
| Asian | 20,335 (1.0) | 11,315 (0.6) | 1,888,365 (97.1) | 3,360 (0.2) | 21,555 (1.1) |
| Black | 17,160 (2.6) | 19,460 (3.0) | 3,395 (0.5) | 606,890 (92.9) | 6,255 (1.0) |
| Other | 46,635 (6.2) | 11,600 (1.5) | 24,065 (3.2) | 6,820 (0.9) | 664,325 (88.2) |


| ethnicity_primis_5 | matching (97.8%) | not matching (2.2%) |
| --- | --- | --- |
| White | 1194405 | 7625 |
| Mixed | 24370 | 8270 |
| Asian | 168905 | 4260 |
| Black | 49850 | 3755 |
| Other | 4380 | 8860 |


| ethnicity_primis_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White | 1,194,405 (99.4) | 3,330 (0.3) | 2,395 (0.2) | 1,570 (0.1) | 330 (0.0) |
| Mixed | 4,220 (12.9) | 24,370 (74.7) | 1,385 (4.2) | 2,520 (7.7) | 145 (0.4) |
| Asian | 2,525 (1.5) | 890 (0.5) | 168,905 (97.5) | 560 (0.3) | 285 (0.2) |
| Black | 1,985 (3.7) | 1,250 (2.3) | 440 (0.8) | 49,850 (93.0) | 80 (0.1) |
| Other | 4,630 (35.0) | 860 (6.5) | 2,690 (20.3) | 680 (5.1) | 4,380 (33.1) |


### State Change

Patients whose latest recorded ethnicity was categorised as Mixed were the most likely to have a discordant ethnicity recording (32%), with 15.8% of the 483,350 patients (2022-SNOMED) whose latest recording was Mixed also having a recording of White ethnicity. Surprisingly, 5.0% of those whose latest recorded ethnicity was categorised as Black also had a recorded ethnicity of White.
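
One way to build this kind of table is sketched below on a hypothetical toy event log. It assumes that each cell counts patients with a given latest category who have ever had a code from the column's category, which is an interpretation of the tables in this section rather than a documented definition.

```python
import pandas as pd

# Hypothetical toy event log: one row per ethnicity code recorded for a patient.
events = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2, 2, 3, 4, 4],
        "date": pd.to_datetime(
            [
                "2010-01-01", "2015-06-01", "2020-03-01",
                "2012-02-01", "2019-09-01",
                "2018-05-01",
                "2011-04-01", "2021-01-01",
            ]
        ),
        "ethnicity": [
            "White", "White", "Mixed", "Asian", "Asian", "Black", "White", "Black",
        ],
    }
)
events = events.sort_values(["patient_id", "date"])

# Latest recorded category per patient.
latest = events.groupby("patient_id")["ethnicity"].last().rename("latest")

# One boolean column per category: was it ever recorded for the patient?
ever = pd.crosstab(events["patient_id"], events["ethnicity"]) > 0

# For each latest category, count patients ever recorded in each category.
state_change = ever.groupby(latest).sum()

# Express each count as a percentage of patients with that latest category.
n = latest.value_counts().rename("n")
pct = state_change.div(n, axis=0).mul(100).round(1)

print(state_change)
print(pct)
```

Under this definition the diagonal is always 100%, because every patient has, by construction, a record of their own latest category.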

In [33]:
for definition in definitions:
    for suffix in suffixes:
        df_state_change = pd.read_csv(f'../output/{output_path}/simple_state_change_{definition}{suffix}_fullset.csv').set_index(definition)
        df_state_change.columns = df_state_change.columns.str.replace(definition + "_", "")
        #resort rows
        df_state_change = df_state_change.reindex(list(code_dict[definition].values()))
        df_state_change = df_state_change.reset_index()
        
        df_state_change[definition]=df_state_change[definition]+": " +df_state_change["n"].apply(lambda x: "{:,.0f}".format(x))
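        df_state_change = df_state_change.set_index(definition)
        # Category columns in code_dict order, defined here rather than reused from an earlier cell
        lowerlist = [x.lower() for x in code_dict[definition].values()]
        for item in lowerlist: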
            df_state_change[item + "_pct"]= round(
                    (df_state_change[item].div(df_state_change["n"])) * 100, 1
                )
        
            df_state_change[item] = (
                    df_state_change[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_state_change[item + "_pct"].astype(str)
                    + ")"
                )
        df_state_change=df_state_change[lowerlist]
        display(df_state_change)

| ethnicity_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White: 20,501,205 | 20,501,205 (100.0) | 75,805 (0.4) | 38,020 (0.2) | 31,425 (0.2) | 127,420 (0.6) |
| Mixed: 479,120 | 75,745 (15.8) | 479,120 (100.0) | 22,315 (4.7) | 42,155 (8.8) | 25,030 (5.2) |
| Asian: 2,194,055 | 49,810 (2.3) | 30,605 (1.4) | 2,194,055 (100.0) | 10,490 (0.5) | 70,045 (3.2) |
| Black: 792,185 | 39,705 (5.0) | 50,615 (6.4) | 8,250 (1.0) | 792,185 (100.0) | 17,915 (2.3) |
| Other: 850,520 | 96,315 (11.3) | 26,725 (3.1) | 57,595 (6.8) | 14,680 (1.7) | 850,520 (100.0) |


| ethnicity_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White: 1,331,970 | 1,331,970 (100.0) | 8,720 (0.7) | 4,635 (0.3) | 3,685 (0.3) | 15,540 (1.2) |
| Mixed: 31,705 | 7,735 (24.4) | 31,705 (100.0) | 2,360 (7.4) | 4,095 (12.9) | 2,335 (7.4) |
| Asian: 176,690 | 6,430 (3.6) | 3,610 (2.0) | 176,690 (100.0) | 1,655 (0.9) | 7,535 (4.3) |
| Black: 57,740 | 4,700 (8.1) | 4,950 (8.6) | 1,155 (2.0) | 57,740 (100.0) | 2,070 (3.6) |
| Other: 45,400 | 8,145 (17.9) | 2,350 (5.2) | 5,695 (12.5) | 1,420 (3.1) | 45,400 (100.0) |


| ethnicity_new_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White: 20,514,355 | 20,514,355 (100.0) | 76,245 (0.4) | 38,795 (0.2) | 31,500 (0.2) | 102,725 (0.5) |
| Mixed: 483,350 | 76,310 (15.8) | 483,350 (100.0) | 23,150 (4.8) | 42,300 (8.8) | 20,250 (4.2) |
| Asian: 2,239,570 | 51,035 (2.3) | 31,840 (1.4) | 2,239,570 (100.0) | 10,655 (0.5) | 52,180 (2.3) |
| Black: 794,645 | 39,780 (5.0) | 50,760 (6.4) | 8,365 (1.1) | 794,645 (100.0) | 14,650 (1.8) |
| Other: 730,800 | 75,770 (10.4) | 20,675 (2.8) | 41,295 (5.7) | 11,835 (1.6) | 730,800 (100.0) |


| ethnicity_new_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White: 1,325,605 | 1,325,605 (100.0) | 8,560 (0.6) | 4,575 (0.3) | 3,605 (0.3) | 7,105 (0.5) |
| Mixed: 31,040 | 7,610 (24.5) | 31,040 (100.0) | 2,340 (7.5) | 4,040 (13.0) | 1,190 (3.8) |
| Asian: 175,665 | 6,385 (3.6) | 3,575 (2.0) | 175,665 (100.0) | 1,625 (0.9) | 4,125 (2.3) |
| Black: 56,900 | 4,645 (8.2) | 4,875 (8.6) | 1,125 (2.0) | 56,900 (100.0) | 885 (1.6) |
| Other: 28,600 | 5,605 (19.6) | 1,570 (5.5) | 3,820 (13.4) | 900 (3.1) | 28,600 (100.0) |


| ethnicity_primis_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White: 15,715,345 | 15,715,345 (100.0) | 50,385 (0.3) | 27,765 (0.2) | 19,970 (0.1) | 72,300 (0.5) |
| Mixed: 433,665 | 53,200 (12.3) | 433,665 (100.0) | 16,460 (3.8) | 27,850 (6.4) | 15,420 (3.6) |
| Asian: 1,898,245 | 35,540 (1.9) | 20,000 (1.1) | 1,898,245 (100.0) | 5,435 (0.3) | 32,365 (1.7) |
| Black: 616,020 | 25,215 (4.1) | 32,715 (5.3) | 4,895 (0.8) | 616,020 (100.0) | 9,825 (1.6) |
| Other: 688,810 | 54,270 (7.9) | 16,450 (2.4) | 31,025 (4.5) | 8,700 (1.3) | 688,810 (100.0) |


| ethnicity_primis_5 | white | mixed | asian | black | other |
| --- | --- | --- | --- | --- | --- |
| White: 1,199,425 | 1,199,425 (100.0) | 6,275 (0.5) | 3,730 (0.3) | 2,605 (0.2) | 5,325 (0.4) |
| Mixed: 30,370 | 5,980 (19.7) | 30,370 (100.0) | 1,845 (6.1) | 3,210 (10.6) | 710 (2.3) |
| Asian: 171,540 | 4,965 (2.9) | 2,505 (1.5) | 171,540 (100.0) | 1,070 (0.6) | 2,265 (1.3) |
| Black: 52,420 | 3,275 (6.2) | 3,365 (6.4) | 770 (1.5) | 52,420 (100.0) | 620 (1.2) |
| Other: 10,865 | 5,060 (46.6) | 1,230 (11.3) | 3,085 (28.4) | 850 (7.8) | 10,865 (100.0) |
