# Identifying Ethnicity in OpenSAFELY-TPP
This short report describes how ethnicity can be identified in the OpenSAFELY-TPP database, and the strengths and weaknesses of the methods. Ethnicity is known to be an important determinant of health outcomes, particularly during the COVID-19 outbreak where a complex interplay of social and biological factors resulted in increased exposure, reduced protection, and increased severity of illness. The recording of patients’ ethnic group in primary care can support efforts to achieve equity in service provision and outcomes. This is a living document that will be updated to reflect changes to the OpenSAFELY-TPP database and the patient records within.

## OpenSAFELY
OpenSAFELY is an analytics platform for conducting analyses on Electronic Health Records inside the secure environment where the records are held. This has multiple benefits: 

* We don't transport large volumes of potentially disclosive pseudonymised patient data outside of the secure environments for analysis
* Analyses can run in near real-time as records are ready for analysis as soon as they appear in the secure environment
* All infrastructure and analysis code is stored in GitHub repositories, which are open for security review, scientific review, and re-use

A key feature of OpenSAFELY is the use of study definitions, which are formal specifications of the datasets to be generated from the OpenSAFELY database. Study definitions take care of much of the complex EHR data wrangling required to create a dataset in an analysis-ready format. They also create a library of standardised and validated variable definitions that can be deployed consistently across multiple projects.

The purpose of this report is to describe all such variables that relate to ethnicity, their relative strengths and weaknesses, and the scenarios in which they are best deployed. It will also describe potential future definitions that have not yet been implemented.

## Available Records
OpenSAFELY-TPP runs inside TPP’s data centre which contains the primary care records for all patients registered at practices using TPP’s SystmOne Clinical Information System. This data centre also imports external datasets from other sources, including A&E attendances and hospital admissions from NHS Digital’s Secondary Use Service, and death registrations from the ONS. More information on available data sources can be found within the [OpenSAFELY documentation](https://docs.opensafely.org/data-sources/intro/). 

## Methods

In OpenSAFELY-TPP, there is no categorical “ethnicity” variable to record this information. Rather, ethnicity is recorded using clinical codes, like any other clinical or administrative event, with specific codes relating to specific ethnic groups.

We define three codelists to capture primary care ethnicity in OpenSAFELY-TPP: "[CTV3](https://www.opencodelists.org/codelist/opensafely/ethnicity/2020-04-27)", "[SNOMED](https://www.opencodelists.org/codelist/opensafely/ethnicity-snomed-0removed/2e641f61/)" and "[PRIMIS](https://www.opencodelists.org/codelist/primis-covid19-vacc-uptake/eth2001/v1/)".


To evaluate how well each of these codelists is populated, we count the number of patients with at least one code from each codelist, as well as the number of patients in each ethnic group.

We examine trends across the whole population and by each of the following demographic and clinical subgroups to detect any inequalities.

Demographic covariates:

* Age band
* Sex
* Ethnicity
* Region
* IMD

Clinical covariates:

* Dementia
* Diabetes
* Learning disability

Any counts below 6 were redacted, and all other values were rounded to the nearest 5.
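
The disclosure control rule above can be sketched as follows. This is a minimal illustration rather than the project's actual implementation: the function name `redact_and_round` and its parameters are hypothetical, and redacted values are represented here as `NaN`.

```python
import numpy as np
import pandas as pd


def redact_and_round(counts, threshold=6, base=5):
    """Set counts below `threshold` to NaN and round the rest to the nearest `base`.

    Hypothetical helper for illustration only.
    """
    counts = pd.Series(counts, dtype="float")
    # Counts below the threshold are redacted (replaced by NaN)
    redacted = counts.where(counts >= threshold)
    # Remaining counts are rounded to the nearest multiple of `base`
    return (redacted / base).round() * base


example = redact_and_round([3, 6, 12, 1238])
print(example.tolist())
```

With these inputs, the first count is redacted and the remaining three are rounded to multiples of 5.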

In [2]:
import os
import pandas as pd
import numpy as np
from itertools import product
from IPython.display import display, Markdown, Image

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)
pd.options.mode.chained_assignment = None 
pd.options.display.float_format = '{:,.0f}'.format

In [3]:

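def local_patient_counts(
    definitions, output_path, code_dict="", categories=False, missing=False,
):
    """Display patient counts (with percentages) for each ethnicity definition.

    With missing=True, report patients without a record instead; with
    categories=True, break the counts down by ethnicity category.
    """
    # Column suffix and overlap column depend on whether filled or missing counts are reported
    suffix = "_filled"
    overlap = "all_filled"
    if missing:
        suffix = "_missing"
        overlap = "all_missing"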
    if categories:
        df_population = pd.read_csv(
            f"../output/{output_path}/simple_patient_counts_registered.csv"
        ).set_index(["group", "subgroup"])
        

        df_append = pd.read_csv(
            f"../output/{output_path}/simple_patient_counts_categories_registered.csv"
        ).set_index(["group", "subgroup"])
        df_append.drop("population", inplace=True, axis=1)
        df_append["population"] = df_population["ethnicity_new_5_filled"]
        # ensure definitions[n] in code_dict[definitions[n]] below refers to one of the definitions of interest
        definitions = [
            f"{category}_{definition}"
            for category, definition in product(
                code_dict[definitions[1]].values(), definitions
            )
        ]

    else:
        df_append = pd.read_csv(
            f"../output/{output_path}/simple_patient_counts_registered.csv"
        ).set_index(["group", "subgroup"])

    for definition in definitions:
        if missing:
            df_append[definition + suffix] = (
                df_append["population"] - df_append[definition + "_filled"]
            )    
        df_append[definition + "_pct"] = round(
            (df_append[definition + suffix].div(df_append["population"])) * 100, 1
        )
        df_append[overlap + "_pct"] = round(
            (df_append[overlap].div(df_append["population"])) * 100, 1
        )

        # Combine count and percentage columns
        df_append[definition] = (
            df_append[definition + suffix].apply(lambda x: "{:,.0f}".format(x))
            + " ("
            + df_append[definition + "_pct"].astype(str)
            + ")"
        )
        df_append = df_append.drop(columns=[definition + suffix, definition + "_pct"])
    df_append[overlap] = (
        df_append[overlap].apply(lambda x: "{:,.0f}".format(x))
        + " ("
        + df_append[overlap + "_pct"].astype(str)
        + ")"
    )
    df_append = df_append.drop(columns=[overlap + "_pct"])
    df_patient_counts = df_append[definitions + [overlap] + ["population"]]
    # Final redaction step
    df_patient_counts = df_patient_counts.replace(np.nan, "-")
    df_patient_counts = df_patient_counts.replace("nan (nan)", "- (-)")
    df_patient_counts.columns = df_patient_counts.columns.str.replace("_", " ")
    
    display(df_patient_counts)
    if categories:
        df_patient_counts.to_csv(
                f"../output/{output_path}/local_patient_counts_categories_registered.csv"
            )


In [4]:
### CONFIGURE ###
definitions = ['ethnicity_5', 'ethnicity_new_5', 'ethnicity_primis_5']
covariates = ['_age_band','_sex','_region','_imd','_dementia','_diabetes','_hypertension','_learning_disability']
output_path = 'released/output'
suffixes = ['','_missing']

code_dict = {
    "imd": {
        0: "Unknown",
        1: "1 Most deprived",
        2: "2",
        3: "3",
        4: "4",
        5: "5 Least deprived",
    },
    "ethnicity_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
    "ethnicity_new_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
    "ethnicity_primis_5": {1: "White", 2: "Mixed", 3: "Asian", 4: "Black", 5: "Other"},
}



## Results

### Count of Patients

Around 14.8 million patients (59.7% of the registered population) have a recorded ethnicity in all three codelists.

In [5]:
local_patient_counts(
         definitions,  output_path
    )

group,subgroup,ethnicity 5,ethnicity new 5,ethnicity primis 5,all filled,population
all,with records,"19,004,920 (76.6)","18,968,795 (76.5)","14,813,175 (59.7)","14,813,175 (59.7)",24802285
age_band,0-19,"3,411,965 (62.6)","3,401,310 (62.4)","2,560,135 (47.0)","2,560,135 (47.0)",5450160
age_band,20-29,"2,221,440 (72.1)","2,215,600 (72.0)","1,750,305 (56.8)","1,750,305 (56.8)",3079220
age_band,30-39,"2,919,065 (82.0)","2,912,280 (81.8)","2,331,885 (65.5)","2,331,885 (65.5)",3560915
age_band,40-49,"2,660,880 (83.3)","2,655,330 (83.1)","2,115,075 (66.2)","2,115,075 (66.2)",3195005
age_band,50-59,"2,789,055 (82.0)","2,785,625 (81.9)","2,192,655 (64.5)","2,192,655 (64.5)",3399825
age_band,60-69,"2,237,240 (82.7)","2,235,200 (82.6)","1,747,410 (64.6)","1,747,410 (64.6)",2706520
age_band,70-79,"1,801,995 (83.0)","1,800,755 (82.9)","1,398,390 (64.4)","1,398,390 (64.4)",2172055
age_band,80+,"963,275 (77.8)","962,695 (77.7)","717,315 (57.9)","717,315 (57.9)",1238580
age_band,missing,- (-),- (-),- (-),- (-),10


### Count of Missings

In [6]:
local_patient_counts(
         definitions,  output_path, missing= True
    )

group,subgroup,ethnicity 5,ethnicity new 5,ethnicity primis 5,all missing,population
all,with records,"5,797,365 (23.4)","5,833,490 (23.5)","9,989,110 (40.3)","5,797,365 (23.4)",24802285
age_band,0-19,"2,038,195 (37.4)","2,048,850 (37.6)","2,890,025 (53.0)","2,038,195 (37.4)",5450160
age_band,20-29,"857,780 (27.9)","863,620 (28.0)","1,328,915 (43.2)","857,775 (27.9)",3079220
age_band,30-39,"641,850 (18.0)","648,635 (18.2)","1,229,030 (34.5)","641,850 (18.0)",3560915
age_band,40-49,"534,125 (16.7)","539,675 (16.9)","1,079,930 (33.8)","534,125 (16.7)",3195005
age_band,50-59,"610,770 (18.0)","614,200 (18.1)","1,207,170 (35.5)","610,770 (18.0)",3399825
age_band,60-69,"469,280 (17.3)","471,320 (17.4)","959,110 (35.4)","469,275 (17.3)",2706520
age_band,70-79,"370,060 (17.0)","371,300 (17.1)","773,665 (35.6)","370,060 (17.0)",2172055
age_band,80+,"275,305 (22.2)","275,885 (22.3)","521,265 (42.1)","275,305 (22.2)",1238580
age_band,missing,- (-),- (-),- (-),- (-),10


### Count by Category

In [7]:
local_patient_counts(
         definitions,  output_path,code_dict, categories=True,missing=False
    )

group,subgroup,White ethnicity 5,White ethnicity new 5,White ethnicity primis 5,Mixed ethnicity 5,Mixed ethnicity new 5,Mixed ethnicity primis 5,Asian ethnicity 5,Asian ethnicity new 5,Asian ethnicity primis 5,Black ethnicity 5,Black ethnicity new 5,Black ethnicity primis 5,Other ethnicity 5,Other ethnicity new 5,Other ethnicity primis 5,all filled,population
all,with records,"15,879,885 (83.7)","15,889,355 (83.8)","12,171,010 (64.2)","354,880 (1.9)","358,055 (1.9)","319,815 (1.7)","1,646,140 (8.7)","1,678,115 (8.8)","1,430,140 (7.5)","570,645 (3.0)","572,485 (3.0)","446,005 (2.4)","553,365 (2.9)","470,790 (2.5)","446,205 (2.4)","14,813,175 (78.1)",18968795
age_band,0-19,"2,665,795 (78.4)","2,666,920 (78.4)","1,945,295 (57.2)","131,990 (3.9)","132,770 (3.9)","112,645 (3.3)","370,895 (10.9)","375,890 (11.1)","316,505 (9.3)","135,180 (4.0)","135,515 (4.0)","100,825 (3.0)","108,100 (3.2)","90,215 (2.7)","84,865 (2.5)","2,560,135 (75.3)",3401310
age_band,20-29,"1,707,510 (77.1)","1,709,125 (77.1)","1,316,450 (59.4)","60,350 (2.7)","60,960 (2.8)","54,280 (2.4)","248,365 (11.2)","253,010 (11.4)","211,350 (9.5)","84,495 (3.8)","84,815 (3.8)","66,085 (3.0)","120,725 (5.4)","107,695 (4.9)","102,140 (4.6)","1,750,305 (79.0)",2215600
age_band,30-39,"2,286,100 (78.5)","2,289,385 (78.6)","1,793,805 (61.6)","62,110 (2.1)","62,875 (2.2)","57,390 (2.0)","351,055 (12.1)","357,815 (12.3)","303,715 (10.4)","100,195 (3.4)","100,600 (3.5)","78,870 (2.7)","119,605 (4.1)","101,600 (3.5)","98,100 (3.4)","2,331,885 (80.1)",2912280
age_band,40-49,"2,112,890 (79.6)","2,115,260 (79.7)","1,644,065 (61.9)","45,270 (1.7)","45,795 (1.7)","42,925 (1.6)","304,090 (11.5)","311,200 (11.7)","268,945 (10.1)","102,640 (3.9)","102,990 (3.9)","82,220 (3.1)","95,985 (3.6)","80,080 (3.0)","76,920 (2.9)","2,115,075 (79.7)",2655330
age_band,50-59,"2,442,685 (87.7)","2,443,630 (87.7)","1,895,970 (68.1)","31,085 (1.1)","31,405 (1.1)","29,490 (1.1)","174,655 (6.3)","179,350 (6.4)","155,925 (5.6)","83,605 (3.0)","83,860 (3.0)","66,935 (2.4)","57,030 (2.0)","47,385 (1.7)","44,335 (1.6)","2,192,655 (78.7)",2785625
age_band,60-69,"2,039,030 (91.2)","2,039,290 (91.2)","1,578,080 (70.6)","14,910 (0.7)","15,030 (0.7)","14,305 (0.6)","112,925 (5.1)","115,085 (5.1)","99,715 (4.5)","39,355 (1.8)","39,470 (1.8)","31,290 (1.4)","31,020 (1.4)","26,320 (1.2)","24,025 (1.1)","1,747,410 (78.2)",2235200
age_band,70-79,"1,710,380 (95.0)","1,710,315 (95.0)","1,320,275 (73.3)","6,100 (0.3)","6,135 (0.3)","5,805 (0.3)","56,250 (3.1)","57,470 (3.2)","49,660 (2.8)","14,435 (0.8)","14,475 (0.8)","11,430 (0.6)","14,830 (0.8)","12,365 (0.7)","11,220 (0.6)","1,398,390 (77.7)",1800755
age_band,80+,"915,490 (95.1)","915,425 (95.1)","677,065 (70.3)","3,070 (0.3)","3,085 (0.3)","2,970 (0.3)","27,910 (2.9)","28,295 (2.9)","24,325 (2.5)","10,740 (1.1)","10,760 (1.1)","8,345 (0.9)","6,070 (0.6)","5,130 (0.5)","4,605 (0.5)","717,315 (74.5)",962695
age_band,missing,- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),- (-),-


### Overlapping Definitions
Idea: Use an upset plot
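
As a starting point for such a plot, the counts it would display (the number of patients in each combination of populated codelists) can be tabulated with pandas alone. The sketch below uses a made-up patient-level table: the column names mirror the `_filled` naming used in this notebook, but the data and the boolean patient-level layout are assumptions.

```python
import pandas as pd

# Hypothetical patient-level flags: True if the codelist has at least one code for the patient
df_flags = pd.DataFrame(
    {
        "ethnicity_5_filled": [True, True, True, False, True, False],
        "ethnicity_new_5_filled": [True, True, True, False, False, False],
        "ethnicity_primis_5_filled": [True, False, True, False, False, False],
    }
)

# One row per observed combination of codelists, with the number of patients in it
combination_counts = (
    df_flags.groupby(list(df_flags.columns)).size().rename("patients").reset_index()
)
print(combination_counts)
```

Each row of `combination_counts` corresponds to one bar of an UpSet plot; combinations that never occur in the data do not appear.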

In [8]:
#display(Image(f"../output/{output_path}/../figures/heatmap.png"))

### Latest vs. Most Common
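
This section appears to compare, for each patient, the most recently recorded ethnicity with the most frequently recorded one. How the two values are derived is not shown in this notebook; the sketch below illustrates one possible approach on a made-up table of coded events. The column names (`patient_id`, `date`, `ethnicity`) and the tie-breaking rule are assumptions.

```python
import pandas as pd

# Hypothetical coded events: one row per recorded ethnicity code
events = pd.DataFrame(
    {
        "patient_id": [1, 1, 1, 2, 2, 3],
        "date": pd.to_datetime(
            ["2015-01-01", "2018-06-01", "2020-03-01", "2016-02-01", "2019-09-01", "2021-01-01"]
        ),
        "ethnicity": ["White", "White", "Mixed", "Asian", "Asian", "Black"],
    }
)

# Latest: the category on the most recent event for each patient
latest = events.sort_values("date").groupby("patient_id")["ethnicity"].last()

# Most common: the modal category per patient
# (ties are broken alphabetically here, which is an arbitrary choice)
most_common = events.groupby("patient_id")["ethnicity"].agg(lambda s: s.mode().iloc[0])

# Cross-tabulate the two; off-diagonal cells are patients whose two values disagree
comparison = pd.crosstab(latest, most_common, rownames=["latest"], colnames=["most_common"])
print(comparison)
```

In this toy data, patient 1 has "Mixed" as the latest value but "White" as the most common one, so that patient falls off the diagonal of the cross-tabulation.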

In [9]:
for definition in definitions:
    for suffix in suffixes:
        df_sum = pd.read_csv(f'../output/{output_path}/simple_latest_common_{definition}{suffix}.csv').set_index(definition)
        # sort rows by category index
        df_sum.columns = df_sum.columns.str.replace(definition + "_", "")
        df_sum.columns = df_sum.columns.str.lower()
        df_sum = df_sum.reindex(list(code_dict[definition].values()))
        
        df_counts = pd.DataFrame(
            np.diagonal(df_sum),
            index=df_sum.index,
        #   columns=[f"matching (n={np.diagonal(df_sum).sum()})"],
        )

        df_sum2 = df_sum.copy(deep=True)
        np.fill_diagonal(df_sum2.values, 0)
        df_diag = pd.DataFrame(
            df_sum2.sum(axis=1),
        )
        df_out = df_counts.merge(df_diag, right_index=True, left_index=True)
        # Share of patients on / off the diagonal; use .iloc for explicit positional access
        columns=round(df_out.sum()/df_out.sum(axis=1).sum()*100,1)
        df_out.columns=[f"matching ({columns.iloc[0]}%)",f"not matching ({columns.iloc[1]}%)"]
        display(df_out)
        
        # Order columns by category; lowerlist must be defined in both branches as it is used below
        if code_dict != "":
            lowerlist = [x.lower() for x in (list(code_dict[definition].values()))]
        else:
            lowerlist = sorted(df_sum.columns)
        df_sum = df_sum[lowerlist]

        # Combine count and percentage columns
        df_sum["population"]=df_sum.sum(axis = 1)
        for item in lowerlist:
            df_sum[item + "_pct"]= round(
                    (df_sum[item].div(df_sum["population"])) * 100, 1
                )
        
            df_sum[item] = (
                    df_sum[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_sum[item + "_pct"].astype(str)
                    + ")"
                )
        df_sum = df_sum[lowerlist]

        display(df_sum)
    # df_expanded = pd.read_csv(f'../output/{output_path}/tables/latest_common_expanded_{definition}.csv').set_index(definition)
    
    # display(df_expanded)

ethnicity_5,matching (97.7%),not matching (2.3%)
White,15715175,120825
Mixed,321470,97355
Asian,1624315,67770
Black,555610,50225
Other,512885,113930


ethnicity_5,white,mixed,asian,black,other
White,"15,715,175 (99.2)","33,505 (0.2)","22,835 (0.1)","16,475 (0.1)","48,010 (0.3)"
Mixed,"44,600 (10.6)","321,470 (76.8)","13,805 (3.3)","25,375 (6.1)","13,575 (3.2)"
Asian,"20,355 (1.2)","11,710 (0.7)","1,624,315 (96.0)","4,315 (0.3)","31,390 (1.9)"
Black,"18,820 (3.1)","19,625 (3.2)","4,135 (0.7)","555,610 (91.7)","7,645 (1.3)"
Other,"61,085 (9.7)","12,765 (2.0)","31,875 (5.1)","8,205 (1.3)","512,885 (81.8)"
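
The "matching" and "not matching" percentages in the summary table are the shares of patients on and off the diagonal of this cross-tabulation. As a check, the sketch below recomputes them from the counts printed above; the variable names are illustrative only.

```python
import numpy as np

# Counts copied from the ethnicity_5 cross-tabulation above
# (rows and columns ordered White, Mixed, Asian, Black, Other)
crosstab = np.array(
    [
        [15_715_175, 33_505, 22_835, 16_475, 48_010],
        [44_600, 321_470, 13_805, 25_375, 13_575],
        [20_355, 11_710, 1_624_315, 4_315, 31_390],
        [18_820, 19_625, 4_135, 555_610, 7_645],
        [61_085, 12_765, 31_875, 8_205, 512_885],
    ]
)

total = crosstab.sum()
matching = np.trace(crosstab)  # patients on the diagonal
not_matching = total - matching  # patients off the diagonal

matching_pct = round(matching / total * 100, 1)
not_matching_pct = round(not_matching / total * 100, 1)
print(matching_pct, not_matching_pct)  # 97.7 2.3
```

The result agrees with the column headers of the summary table (97.7% matching, 2.3% not matching).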


ethnicity_5,matching (96.8%),not matching (3.2%)
White,970885,10660
Mixed,16145,8425
Asian,123900,5740
Black,38935,4555
Other,19730,9160


ethnicity_5,white,mixed,asian,black,other
White,"970,885 (98.9)","3,040 (0.3)","2,155 (0.2)","1,400 (0.1)","4,065 (0.4)"
Mixed,"4,090 (16.6)","16,145 (65.7)","1,270 (5.2)","2,185 (8.9)",880 (3.6)
Asian,"2,290 (1.8)",805 (0.6),"123,900 (95.6)",510 (0.4),"2,135 (1.6)"
Black,"2,090 (4.8)","1,310 (3.0)",495 (1.1),"38,935 (89.5)",660 (1.5)
Other,"4,580 (15.9)",910 (3.1),"3,025 (10.5)",645 (2.2),"19,730 (68.3)"


ethnicity_new_5,matching (97.9%),not matching (2.1%)
White,15557865,109605
Mixed,317765,95425
Asian,1635455,60285
Black,550055,48125
Other,426920,86965


ethnicity_new_5,white,mixed,asian,black,other
White,"15,557,865 (99.3)","32,935 (0.2)","23,020 (0.1)","16,375 (0.1)","37,275 (0.2)"
Mixed,"44,800 (10.8)","317,765 (76.9)","14,285 (3.5)","25,565 (6.2)","10,775 (2.6)"
Asian,"20,695 (1.2)","11,915 (0.7)","1,635,455 (96.4)","4,330 (0.3)","23,345 (1.4)"
Black,"18,710 (3.1)","19,100 (3.2)","4,155 (0.7)","550,055 (92.0)","6,160 (1.0)"
Other,"48,475 (9.4)","9,825 (1.9)","22,085 (4.3)","6,580 (1.3)","426,920 (83.1)"


ethnicity_new_5,matching (97.1%),not matching (2.9%)
White,962340,8935
Mixed,16000,8250
Asian,123845,5210
Black,38755,4290
Other,17965,8260


ethnicity_new_5,white,mixed,asian,black,other
White,"962,340 (99.1)","2,985 (0.3)","2,185 (0.2)","1,410 (0.1)","2,355 (0.2)"
Mixed,"4,145 (17.1)","16,000 (66.0)","1,300 (5.4)","2,195 (9.1)",610 (2.5)
Asian,"2,340 (1.8)",835 (0.6),"123,845 (96.0)",520 (0.4),"1,515 (1.2)"
Black,"2,085 (4.8)","1,265 (2.9)",500 (1.2),"38,755 (90.0)",440 (1.0)
Other,"4,310 (16.4)",825 (3.1),"2,485 (9.5)",640 (2.4),"17,965 (68.5)"


ethnicity_primis_5,matching (98.0%),not matching (2.0%)
White,12098680,86070
Mixed,299750,70385
Asian,1419295,43705
Black,437190,36055
Other,425400,67605


ethnicity_primis_5,white,mixed,asian,black,other
White,"12,098,680 (99.3)","25,215 (0.2)","17,645 (0.1)","11,535 (0.1)","31,675 (0.3)"
Mixed,"32,760 (8.9)","299,750 (81.0)","10,765 (2.9)","17,695 (4.8)","9,165 (2.5)"
Asian,"16,245 (1.1)","8,785 (0.6)","1,419,295 (97.0)","2,560 (0.2)","16,115 (1.1)"
Black,"13,630 (2.9)","15,065 (3.2)","2,630 (0.6)","437,190 (92.4)","4,730 (1.0)"
Other,"35,640 (7.2)","8,775 (1.8)","18,040 (3.7)","5,150 (1.0)","425,400 (86.3)"


ethnicity_primis_5,matching (97.6%),not matching (2.4%)
White,875420,5930
Mixed,16705,6345
Asian,121565,3320
Black,36050,2985
Other,3165,6945


ethnicity_primis_5,white,mixed,asian,black,other
White,"875,420 (99.3)","2,575 (0.3)","1,900 (0.2)","1,230 (0.1)",225 (0.0)
Mixed,"3,295 (14.3)","16,705 (72.5)","1,045 (4.5)","1,905 (8.3)",100 (0.4)
Asian,"2,020 (1.6)",690 (0.6),"121,565 (97.3)",415 (0.3),195 (0.2)
Black,"1,605 (4.1)",975 (2.5),350 (0.9),"36,050 (92.4)",55 (0.1)
Other,"3,650 (36.1)",680 (6.7),"2,080 (20.6)",535 (5.3),"3,165 (31.3)"


### State Change

In [10]:
for definition in definitions:
    for suffix in suffixes:
        df_state_change = pd.read_csv(f'../output/{output_path}/local_state_change_{definition}{suffix}_registered.csv').set_index(definition)
        df_state_change.columns = df_state_change.columns.str.replace(definition + "_", "")
        #resort rows
        df_state_change = df_state_change.reindex(list(code_dict[definition].values()))
        df_state_change = df_state_change.reset_index()
        
        df_state_change[definition]=df_state_change[definition]+": " +df_state_change["n"].apply(lambda x: "{:,.0f}".format(x))
        df_state_change = df_state_change.set_index(definition)
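        # Defined here so this cell does not rely on a variable left over from the previous cell
        lowerlist = [x.lower() for x in code_dict[definition].values()]
        for item in lowerlist: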
            df_state_change[item + "_pct"]= round(
                    (df_state_change[item].div(df_state_change["n"])) * 100, 1
                )
        
            df_state_change[item] = (
                    df_state_change[item].apply(lambda x: "{:,.0f}".format(x))
                    + " ("
                    + df_state_change[item + "_pct"].astype(str)
                    + ")"
                )
        df_state_change=df_state_change[lowerlist]
        display(df_state_change)

FileNotFoundError: [Errno 2] No such file or directory: '../output/released/output/local_state_change_ethnicity_5_registered.csv'