**Loads baseline Showcase data and processes into format that is suitable for input into XGBoost.**

Baseline data is exported in *scripts/extract_ukb_baseline.py*

Note: field is defined as a high-level field, such as BMI. Columns are defined as all the columns provided in the dataset that relate to that field (i.e. the multiple instances/arrays for each field) - e.g. BP Measure array 0, BP Measure array 1 etc... There may be (and indeed often are) multiple columns for each field, but not multiple fields for each column.

In [1]:
%load_ext autoreload
%autoreload 2
    
import pandas as pd
import numpy as np
import sys
sys.path.append("../src")
from processing_utils import display_included_cats, remove_cat_cols, get_cat_and_downstream, display_arrayed_fields, remove_fields

In [2]:
data = pd.read_feather("../data/processed/all_showcase_baseline.feather")

# Select categories
Not all categories captured by baseline extraction are relevant. Review these and remove irrelevant.

## View all included categories
These should all be baseline categories, as selected from the master showcase dataset in *scripts/extract_ukb_baseline.py*

In [3]:
cat_tree, pretty_df = display_included_cats(data.columns[1:])
pd.set_option('display.max_rows', None)
pretty_df

Level 1,Level 2,Level 3,Level 4,Level 5,Level 6
Assessment centre (100000),Recruitment (100021),Reception (100024) ✅,,,
Assessment centre (100000),Recruitment (100021),Consent (100023) ✅,,,
Assessment centre (100000),Recruitment (100021),Conclusion (100022),,,
Assessment centre (100000),Recruitment (100021),Consent for imaging (100119),,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Household (100066) ✅,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Employment (100064) ✅,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Education (100063) ✅,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Ethnicity (100065) ✅,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Other sociodemographic factors (100067) ✅,,
Assessment centre (100000),Touchscreen (100025),Lifestyle and environment (100050),Physical activity (100054) ✅,MET Scores (54) ✅,


## Remove undesired categories

In [4]:
print(f"Before pruning: {len(data.columns[1:])} columns \n")

# Remove health-related outcomes. To be processed separately.
pruned_cols = remove_cat_cols(columns=data.columns[1:], categories=get_cat_and_downstream(100091, cat_tree))

# Remove kidney-derived measures from imaging: kidney fusion field (21164) wrongly labelled as instance 0.
pruned_cols = remove_cat_cols(columns=pruned_cols, categories=[159])

# Remove procedural metrics related to assessment centre
pruned_cols = remove_cat_cols(columns=pruned_cols, categories=get_cat_and_downstream(100004, cat_tree))

# Remove biological samples processing
pruned_cols = remove_cat_cols(columns=pruned_cols, categories=[9081, 18518, 1307, 221, 222, 148])

# Remove biological sample inventory
pruned_cols = remove_cat_cols(columns=pruned_cols, categories=get_cat_and_downstream(100084, cat_tree))

# Remove additional physical activity measurements (not baseline)
pruned_cols = remove_cat_cols(columns=pruned_cols, categories=get_cat_and_downstream(1008, cat_tree))

# Remove any online follow-up (not baseline)
pruned_cols = remove_cat_cols(columns=pruned_cols, categories=get_cat_and_downstream(100089, cat_tree))

# Remove genotyping: indicators of data availability (genotypes provided as separate bulk files)
pruned_cols = remove_cat_cols(columns=pruned_cols, categories=get_cat_and_downstream(263, cat_tree))
pruned_cols = remove_cat_cols(columns=pruned_cols, categories=[100319])

print(f"\n= {len(pruned_cols)} columns after pruning")

data = data[['eid'] + pruned_cols]

Before pruning: 12879 columns 

Removed 4497 columns
Removed 1 columns
Removed 36 columns
Removed 544 columns
Removed 13 columns
Removed 194 columns
Removed 3539 columns
Removed 85 columns
Removed 2 columns

= 3968 columns after pruning


# Arrayed fields
Arrayed fields are unsuitable for feeding straight into XGBoost. Some of these are too granular so can be removed altogether (e.g. ECG), while others can be engineered into a suitable format (e.g. one-hot-encoding of multi-choice fields).

## Identify all arrayed fields.

In [5]:
arrayed_fields = display_arrayed_fields(pruned_cols)
arrayed_fields

Unnamed: 0_level_0,Unnamed: 1_level_0,Field ID,Field type,Number of arrays
Primary category,Field title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Blood pressure (100011),"Systolic blood pressure, manual reading",93,Integer,2
Blood pressure (100011),"Diastolic blood pressure, manual reading",94,Integer,2
Blood pressure (100011),Pulse rate (during blood-pressure measurement),95,Integer,2
Blood pressure (100011),Time since interview start at which blood pressure screen(s) shown,96,Integer,2
Blood pressure (100011),"Pulse rate, automated reading",102,Integer,2
Blood pressure (100011),"Diastolic blood pressure, automated reading",4079,Integer,2
Blood pressure (100011),"Systolic blood pressure, automated reading",4080,Integer,2
Blood pressure (100011),Method of measuring blood pressure,4081,Single choice,2
Blood sample collection (100002),Time blood sample collected,3166,Datetime,7
Blood sample collection (100002),"Blood sample #, note contents",20049,Single choice,7


## Remove highly granular fields with large arrays.
Some of these (e.g. those relating to medical history) need to be processed separately.

In [6]:
# Select field IDs of the following arrayed categories and remove. Note that this is not removing the 
# whole category (as done previously) but just the arrayed fields within each category selected for removal, as listed above.

arrayed_categories = ['ECG during exercise (100012)', 'Hearing test (100049)', 'Medical conditions (100074)', 
                      'Medications (100075)', 'Numeric memory (100029)', 'Operations (100076)', 'Pairs matching (100030)',
                      'Polygenic Risk Scores (300)', 'Reaction time (100032)', 'Refractometer 1 (100014)',
                      'Retinal optical coherence tomography (100016)', 'Visual acuity (100017)', 'Word production (100077)']

for category in arrayed_categories:
    # Use remove_fields() with field IDs for arrayed fields from table above as input for removal.
    pruned_cols = remove_fields(columns=pruned_cols, fields_to_remove=arrayed_fields.loc[(category), 'Field ID'].tolist())

print(f"\n= {len(pruned_cols)} columns after pruning")

arrayed_fields = display_arrayed_fields(pruned_cols)

data = data[['eid'] + pruned_cols]

Removed 691 columns
Removed 300 columns
Removed 200 columns
Removed 48 columns
Removed 176 columns
Removed 160 columns
Removed 29 columns
Removed 4 columns
Removed 128 columns
Removed 388 columns
Removed 18 columns
Removed 160 columns
Removed 51 columns

= 1615 columns after pruning


## One-hot-encode multiple choice fields
There is no direct 1:1 relationship between a field's arrayed column and a specific multiple-choice option for that field: order in array columns seems to be order in which user picked options. As the same option may be present in different columns for each user, these must be aggregated across all columns and used to create a one-hot-encoding.

In [7]:
arrayed_fields = display_arrayed_fields(pruned_cols)
multi_choice = arrayed_fields[arrayed_fields['Field type'] == 'Multi choice']

In [8]:
multi_choice

Unnamed: 0_level_0,Unnamed: 1_level_0,Field ID,Field type,Number of arrays
Primary category,Field title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Diet (100052),"Never eat eggs, dairy, wheat, sugar",6144,Multi choice,4
Diet (100052),"Never eat eggs, dairy, wheat, sugar (pilot)",10855,Multi choice,4
Education (100063),Qualifications,6138,Multi choice,6
Education (100063),Qualifications (pilot),10722,Multi choice,5
Employment (100064),Current employment status,6142,Multi choice,7
Employment (100064),Transport type for commuting to job workplace,6143,Multi choice,4
Eyesight (100041),Reason for glasses/contact lenses,6147,Multi choice,6
Eyesight (100041),Eye problems/disorders,6148,Multi choice,5
Family history (100034),Illnesses of father,20107,Multi choice,10
Family history (100034),Illnesses of mother,20110,Multi choice,11


In [9]:
for index, row in multi_choice.iterrows():
    print(index[1])
    array_cols = [col for col in pruned_cols if int(col.split('.')[0]) == row['Field ID']]
    array_df = data[['eid'] + array_cols]
    # Create missing category and assign Na to this as well as -1 (Do not know) -3 (Prefer not to answer), -7 (None of the above)
    # Only consider first arrayed column as if missing in first will be missing for all
    if "Missing" not in array_df[array_df.columns[1]].cat.categories:
        array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
    array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].fillna("Missing")
    array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})

    # Stack all array cols
    onehot_col_name_root = f"{row['Field ID']}.0"
    stacked = array_df.melt(id_vars='eid', value_name=onehot_col_name_root).drop(columns=['variable'])
    # One hot encode stacked df. Use _ separator (rather than usual '.') between field name and choice to enable identification/filtering of these OHE'd multi-choice columns if required
    onehot = pd.get_dummies(stacked, prefix_sep='_')
    # Aggregate back to eid level
    onehot = onehot.groupby('eid').sum().reset_index()
    # Join onto main dataframe and remove old array cols
    data = data.drop(columns=array_cols)
    data = pd.merge(data, onehot, on='eid', how='left')

Never eat eggs, dairy, wheat, sugar


Length: 502180
Categories (7, object): ['2', '4', '5', '1', '3', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (6, object): ['2', '4', '5', '1', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Never eat eggs, dairy, wheat, sugar (pilot)


Length: 502180
Categories (7, object): ['5', '4', '2', '3', '1', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (6, object): ['5', '4', '2', '3', '1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Qualifications


Length: 502180
Categories (9, object): ['-7', '1', '2', '3', ..., '5', '6', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['1', '2', '3', '4', '5', '6', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Qualifications (pilot)


Length: 502180
Categories (8, object): ['3', '2', '1', '5', '-7', '4', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (6, object): ['3', '2', '1', '5', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Current employment status


Length: 502180
Categories (10, object): ['1', '2', '4', '-7', ..., '6', '-3', '7', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (8, object): ['1', '2', '4', '3', '5', '6', '7', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Transport type for commuting to job workplace


Length: 502180
Categories (7, object): ['1', '3', '2', '4', '-7', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (5, object): ['1', '3', '2', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Reason for glasses/contact lenses


Length: 502180
Categories (10, object): ['1', '3', '7', '-1', ..., '5', '6', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (8, object): ['1', '3', '7', '2', '4', '5', '6', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Eye problems/disorders


Length: 502180
Categories (10, object): ['-7', '6', '-1', '2', ..., '5', '3', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['6', '2', '4', '1', '5', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Illnesses of father


Length: 502180
Categories (15, object): ['-17', '1', '10', '13', ..., '12', '11', '-13', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")


Illnesses of mother


Length: 502180
Categories (15, object): ['-17', '1', '10', '11', ..., '12', '2', '-13', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")


Illnesses of siblings


Length: 502180
Categories (16, object): ['-11', '-17', '12', '5', ..., '4', '13', '-13', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")


Illnesses of adopted father


Length: 502180
Categories (15, object): ['2', '-11', '-17', '8', ..., '12', '10', '-13', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")


Illnesses of adopted mother


Length: 502180
Categories (15, object): ['12', '-17', '1', '8', ..., '3', '6', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")


Illnesses of adopted siblings


Length: 502180
Categories (16, object): ['-11', '12', '-17', '9', ..., '-13', '10', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")


Gas or solid-fuel cooking/heating


Length: 502180
Categories (7, object): ['-7', '1', '2', '3', '-3', '-1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (4, object): ['1', '2', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Heating type(s) in home


Length: 502180
Categories (10, object): ['1', '3', '2', '-7', ..., '-1', '5', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['1', '3', '2', '6', '5', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


How are people in household related to participant


Length: 502180
Categories (10, object): ['1', '2', '4', '3', ..., '7', '6', '5', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (9, object): ['1', '2', '4', '3', ..., '7', '6', '5', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Gas or solid-fuel cooking/heating (pilot)


Length: 502180
Categories (7, object): ['1', '2', '-7', '3', '-1', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (4, object): ['1', '2', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Vascular/heart problems diagnosed by doctor


Length: 502180
Categories (7, object): ['-7', '3', '4', '2', '1', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (5, object): ['3', '4', '2', '1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Fractured bone site(s)


Length: 502180
Categories (10, object): ['1', '7', '5', '2', ..., '3', '-1', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (8, object): ['1', '7', '5', '2', '6', '4', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Blood clot, DVT, bronchitis, emphysema, asthma, rhinitis, eczema, allergy diagnosed by doctor


Length: 502180
Categories (8, object): ['-7', '5', '8', '9', '6', '7', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (6, object): ['5', '8', '9', '6', '7', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Medication for cholesterol, blood pressure, diabetes, or take exogenous hormones


Length: 502180
Categories (9, object): ['-7', '2', '1', '4', ..., '-3', '3', '-1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (6, object): ['2', '1', '4', '5', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Medication for pain relief, constipation, heartburn


Length: 502180
Categories (10, object): ['-7', '1', '2', '3', ..., '6', '4', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['1', '2', '3', '5', '6', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Vitamin and mineral supplements


Length: 502180
Categories (10, object): ['-3', '-7', '3', '6', ..., '2', '1', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (8, object): ['3', '6', '7', '5', '2', '1', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Medication for cholesterol, blood pressure or diabetes


Length: 502180
Categories (7, object): ['-7', '1', '2', '-1', '-3', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (4, object): ['1', '2', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Mineral and other dietary supplements


Length: 502180
Categories (9, object): ['-7', '1', '5', '2', ..., '4', '-3', '6', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['1', '5', '2', '3', '4', '6', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Medication for pain relief, constipation, heartburn (pilot)


Length: 502180
Categories (9, object): ['-7', '3', '1', '5', ..., '4', '-3', '-1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (6, object): ['3', '1', '5', '2', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Medication for smoking cessation, constipation, heartburn, allergies (pilot)


Length: 502180
Categories (8, object): ['-7', '4', '-1', '2', '3', '1', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (5, object): ['4', '2', '3', '1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Vitamin and mineral supplements (pilot)


Length: 502180
Categories (9, object): ['-7', '5', '2', '1', ..., '6', '4', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['5', '2', '1', '3', '6', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Vitamin supplements (pilot)


Length: 502180
Categories (9, object): ['-7', '1', '3', '6', ..., '2', '-3', '5', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['1', '3', '6', '4', '2', '5', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Other dietary supplements (pilot)


Length: 502180
Categories (9, object): ['-7', '3', '6', '2', ..., '5', '4', '1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['3', '6', '2', '5', '4', '1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Illness, injury, bereavement, stress in last 2 years


Length: 502180
Categories (9, object): ['-7', '1', '2', '3', ..., '4', '6', '5', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['1', '2', '3', '4', '6', '5', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Manic/hyper symptoms


Length: 502180
Categories (7, object): ['12', '14', '-7', '13', '11', '15', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (6, object): ['12', '14', '13', '11', '15', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Illness, injury, bereavement, stress in last 2 years (pilot)


Length: 502180
Categories (10, object): ['6', '2', '5', '-7', ..., '4', '-3', '-1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['6', '2', '5', '3', '1', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Mouth/teeth dental problems


Length: 502180
Categories (9, object): ['-7', '1', '3', '6', ..., '4', '-3', '5', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (7, object): ['1', '3', '6', '2', '4', '5', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Mouth/teeth dental problems (pilot)


Length: 502180
Categories (7, object): ['3', '-7', '1', '2', '4', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (5, object): ['3', '1', '2', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Attendance/disability/mobility allowance


Length: 502180
Categories (7, object): ['-7', '1', '2', '3', '-1', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (4, object): ['1', '2', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Pain type(s) experienced in last month


Length: 502180
Categories (11, object): ['-3', '-7', '1', '3', ..., '8', '6', '2', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (9, object): ['1', '3', '4', '5', ..., '8', '6', '2', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Types of transport used (excluding work)


Length: 502180
Categories (7, object): ['1', '2', '3', '4', '-7', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (5, object): ['1', '2', '3', '4', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Types of physical activity in last 4 weeks


Length: 502180
Categories (8, object): ['-7', '1', '4', '5', '2', '3', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (6, object): ['1', '4', '5', '2', '3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Why stopped smoking


Length: 502180
Categories (8, object): ['3', '4', '-7', '1', '2', '-1', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (5, object): ['3', '4', '1', '2', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Why reduced smoking


Length: 502180
Categories (8, object): ['4', '3', '1', '2', '-7', '-3', '-1', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (5, object): ['4', '3', '1', '2', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


Leisure/social activities


Length: 502180
Categories (8, object): ['-7', '1', '2', '3', '4', '5', '-3', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].cat.add_categories("Missing")
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})
Length: 502180
Categories (6, object): ['1', '2', '3', '4', '5', 'Missing']' has dtype incompatible with category, please explicitly cast to a compatible dtype first.
  array_df.loc[:, array_df.columns[1]] = array_df[array_df.columns[1]].replace({'-1': 'Missing', '-3': 'Missing', '-7': 'Missing'})


## Pick first array for all other field types
Not a perfect option, as some fields have measurements that should probably be averaged - but will do for now!

In [10]:
arrayed_fields = display_arrayed_fields(pruned_cols)
other_dtypes = arrayed_fields[arrayed_fields['Field type'] != 'Multi choice']
other_dtypes

Unnamed: 0_level_0,Unnamed: 1_level_0,Field ID,Field type,Number of arrays
Primary category,Field title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Blood pressure (100011),"Systolic blood pressure, manual reading",93,Integer,2
Blood pressure (100011),"Diastolic blood pressure, manual reading",94,Integer,2
Blood pressure (100011),Pulse rate (during blood-pressure measurement),95,Integer,2
Blood pressure (100011),Time since interview start at which blood pressure screen(s) shown,96,Integer,2
Blood pressure (100011),"Pulse rate, automated reading",102,Integer,2
Blood pressure (100011),"Diastolic blood pressure, automated reading",4079,Integer,2
Blood pressure (100011),"Systolic blood pressure, automated reading",4080,Integer,2
Blood pressure (100011),Method of measuring blood pressure,4081,Single choice,2
Blood sample collection (100002),Time blood sample collected,3166,Datetime,7
Blood sample collection (100002),"Blood sample #, note contents",20049,Single choice,7


In [11]:
# Get all columns for other data types
other_dtypes_cols = [col for col in pruned_cols if int(col.split('.')[0]) in other_dtypes['Field ID'].tolist()]
# Get all non-first arrays and remove from main dataframe 
not_first_array = [col for col in other_dtypes_cols if int(col.split('.')[2]) > 0]
data = data.drop(columns=not_first_array)

In [12]:
# Check that no arrayed columns remain (excluding one-hot-encoded fields that contain '_')
display_arrayed_fields([col for col in data.columns[1:] if '_' not in col])

Unnamed: 0_level_0,Unnamed: 1_level_0,Field ID,Field type,Number of arrays
Primary category,Field title,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1


# Final processing

## Remove inappropriate data types
String and datetime

In [13]:
fields_metadata = pd.read_csv("../data/raw/ukb_metadata/field.txt", sep="\t") # Schema 1

fields_to_remove = []

# Strings
fields_to_remove.extend(fields_metadata.loc[fields_metadata['value_type'] == 41, 'field_id'].tolist())

# Date
fields_to_remove.extend(fields_metadata.loc[fields_metadata['value_type'] == 51, 'field_id'].tolist())

# Time
fields_to_remove.extend(fields_metadata.loc[fields_metadata['value_type'] == 61, 'field_id'].tolist())

data = data[['eid'] + [col for col in data.columns[1:] if int(col.split('.')[0]) not in fields_to_remove]]

## Handle missing values
-1: Do not know
-3: Prefer not to answer
-7: None of the above

Make these all NaN

In [14]:
missing_values = ['-1', '-3', '-7']

data[data.select_dtypes(['category']).columns] = data.select_dtypes(['category']).apply(
    lambda col: col.map(lambda x: np.nan if x in missing_values else x).astype('category')
)

# View final included categories

In [15]:
pd.set_option('display.max_rows', None)
cat_tree, pretty_df = display_included_cats(data.columns[1:])
pretty_df.to_excel('../output/Final included categories.xlsx')
pretty_df

Level 1,Level 2,Level 3,Level 4,Level 5,Level 6
Assessment centre (100000),Recruitment (100021),Reception (100024) ✅,,,
Assessment centre (100000),Recruitment (100021),Consent (100023) ✅,,,
Assessment centre (100000),Recruitment (100021),Conclusion (100022),,,
Assessment centre (100000),Recruitment (100021),Consent for imaging (100119),,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Household (100066) ✅,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Employment (100064) ✅,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Education (100063) ✅,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Ethnicity (100065) ✅,,
Assessment centre (100000),Touchscreen (100025),Sociodemographics (100062),Other sociodemographic factors (100067) ✅,,
Assessment centre (100000),Touchscreen (100025),Lifestyle and environment (100050),Physical activity (100054) ✅,MET Scores (54) ✅,


# Save

In [16]:
data.to_feather("../data/processed/features_processed.feather")