# Identifying Ethnicity in OpenSAFELY-TPP
This short report describes how ethnicity can be identified in the OpenSAFELY-TPP database, and the strengths and weaknesses of the methods. This is a living document that will be updated to reflect changes to the OpenSAFELY-TPP database and the patient records within.

## OpenSAFELY
OpenSAFELY is an analytics platform for conducting analyses on Electronic Health Records inside the secure environment where the records are held. This has multiple benefits: 

* We don't transport large volumes of potentially disclosive pseudonymised patient data outside of the secure environments for analysis
* Analyses can run in near real-time as records are ready for analysis as soon as they appear in the secure environment
* All infrastructure and analysis code is stored in GitHub repositories, which are open for security review, scientific review, and re-use

A key feature of OpenSAFELY is the use of study definitions, which are formal specifications of the datasets to be generated from the OpenSAFELY database. This takes care of much of the complex EHR data wrangling required to create a dataset in an analysis-ready format. It also creates a library of standardised and validated variable definitions that can be deployed consistently across multiple projects. 

The purpose of this report is to describe all such variables that relate to BMI, their relative strengths and weaknesses, in what scenarios they are best deployed. It will also describe potential future definitions that have not yet been implemented.

## Available Records
OpenSAFELY-TPP runs inside TPP’s data centre which contains the primary care records for all patients registered at practices using TPP’s SystmOne Clinical Information System. This data centre also imports external datasets from other sources, including A&E attendances and hospital admissions from NHS Digital’s Secondary Use Service, and death registrations from the ONS. More information on available data sources can be found within the OpenSAFELY documentation. 

In [1]:
from IPython.display import display, Markdown
from lib import *

pd.set_option('display.max_rows', 500)
pd.options.mode.chained_assignment = None 

In [2]:
### CONFIGURE OPTIONS HERE ###

# Import file
input_path = '../output/data/input.feather'

# Definitions
definitions = ['ethnicity_snomed_5', 'ethnicity_5']

other_vars = []

# Dates
date_min = '2019-01-01'
date_max = '2019-12-31'
time_delta = 'M'

# Null value – 0 or NA
null = "0"

# Covariates
demographic_covariates = ['age_band', 'sex', 'region', 'imd']
clinical_covariates = ['dementia', 'diabetes', 'hypertension', 'learning_disability']

In [3]:
df_clean = import_clean(input_path, definitions, other_vars, demographic_covariates, clinical_covariates, null, time_delta, dates=False)

## Results

### Count of Patients

In [4]:
patient_counts(df_clean, definitions, demographic_covariates, clinical_covariates)

Unnamed: 0_level_0,Unnamed: 1_level_0,ethnicity_snomed_5_filled,ethnicity_snomed_5_pct,ethnicity_5_filled,ethnicity_5_pct
group,subgroup,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
population,N,1000,100,1000,100
population,with records,680,68,750,75
age_band,0-19,95,9.5,95,9.5
age_band,20-29,90,9,100,10
age_band,30-39,80,8,85,8.5
age_band,40-49,75,7.5,80,8
age_band,50-59,80,8,95,9.5
age_band,60-69,90,9,105,10.5
age_band,70-79,95,9.5,100,10
age_band,80+,75,7.5,85,8.5


### Count of Missings

In [5]:
patient_counts(df_clean, definitions, demographic_covariates, clinical_covariates, missing=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,ethnicity_snomed_5_missing,ethnicity_snomed_5_pct,ethnicity_5_missing,ethnicity_5_pct
group,subgroup,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
population,N,1000,100,1000,100
population,missing records,320,32,250,25
age_band,0-19,35,3.5,30,3
age_band,20-29,40,4,30,3
age_band,30-39,40,4,30,3
age_band,40-49,45,4.5,40,4
age_band,50-59,40,4,25,2.5
age_band,60-69,40,4,25,2.5
age_band,70-79,40,4,35,3.5
age_band,80+,40,4,30,3


### Overlapping Definitions

In [11]:
print(df_clean)
display(df_clean['ethnicity_snomed_5','white_count'].groupby(definitions[0]))

display_heatmap(df_clean, definitions)

     patient_id ethnicity_snomed_5 ethnicity_5 age_band sex  \
0            24                  2         NaN     0-19   M   
1            44                  3           1    30-39   F   
2            59                NaN           4    60-69   F   
3            86                  2         NaN    20-29   M   
4            88                  4           3    30-39   M   
..          ...                ...         ...      ...  ..   
995        9937                  4           5    60-69   F   
996        9952                  1           1    50-59   F   
997        9964                  5           4    20-29   M   
998        9986                  2         NaN    50-59   M   
999        9995                  2           3    70-79   M   

                       region  imd  dementia  diabetes  hypertension  \
0                         NaN  100     False     False         False   
1                  North East  200     False     False         False   
2                      Lond

KeyError: ('ethnicity_snomed_5', 'white_count')