# Preprocess Demographics - DS-Connect

The "dsc_demographics.tsv" demographics file must be transformed before it can be used with the LinkML data ingest pipeline, because the pipeline expects race in a single column. This notebook adds one "race" column: when exactly one race column has a value of 1, the race value is that column's header; when more than one race column has a value of 1, the race value is "more than one race". The existing "race_" columns are then removed so that they are not included in the implicit, study-specific model.

The race columns are: race_american_indian, race_asian, race_black, race_pacific_islander, race_white, race_unknown.
For each row (participant), if exactly one of these columns has a value of 1, that column's header becomes the race value. If more than one has a value of 1, the race value is "more than one race". If none has a value of 1, the race value is left empty.
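
To make the rule concrete, here is a minimal sketch on a hypothetical toy table (the rows and values are invented for illustration and are not taken from the DS-Connect data). It uses a vectorized formulation of the same rule, not the row-wise function this notebook applies, so treat it as a cross-check under those assumptions.

```python
import pandas as pd

# Hypothetical toy data: three participants, three of the race_ flag columns.
toy = pd.DataFrame({
    "race_asian": [1, 0, 0],
    "race_black": [0, 1, 0],
    "race_white": [0, 1, 0],
})

flags = toy.eq(1)                     # True where a race column is flagged with 1
n_flagged = flags.sum(axis=1)         # number of flagged race columns per row
first_flagged = flags.idxmax(axis=1)  # header of the first flagged column per row

# Exactly one flag: keep the header; more than one: fixed label; none: empty string.
toy["race"] = (
    first_flagged
    .where(n_flagged == 1, "")
    .mask(n_flagged > 1, "more than one race")
)

print(toy["race"].tolist())
```

The first row has a single flag and keeps its header, the second row has two flags and is labeled "more than one race", and the third row has no flags and is left empty.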

In [1]:
# Imports
import pandas as pd

In [2]:
# Read in data file

file_path = "../data/DS-Connect-STUDY/raw_data/TSV/dsc_demographics.tsv"

df = pd.read_csv(file_path, sep='\t')

In [3]:
# Identify all race_ columns
race_cols = [col for col in df.columns if col.startswith("race_")]

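def determine_race(row):
    """Collapse the race_ flag columns of one row into a single race value."""
    # Collect the race columns flagged with 1 for this participant
    races = [col for col in race_cols if row[col] == 1]
    if len(races) == 1:
        return races[0]   # exactly one flag: keep the full header name
    elif len(races) > 1:
        return "more than one race"
    else:
        return ""         # no race column flagged: leave the value empty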

# create the new race column
df["race"] = df.apply(determine_race, axis=1)

In [4]:
# Remove the original race_ columns, e.g. race_american_indian, race_asian, race_black, race_pacific_islander, race_white, race_unknown
# race_cols was collected in the previous cell, before the new "race" column was added

df = df.drop(columns=race_cols)

In [5]:
# Save transformed file
df.to_csv('../data/DS-Connect-STUDY/raw_data/TSV_Preprocessed/dsc_demographics_preprocessed.tsv', index=False, sep='\t')
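
A quick sanity check before or after saving can confirm that the transformation left the expected columns. The following is only a sketch: `check_preprocessed` is a hypothetical helper that is not part of this notebook or the ingest pipeline, and the example frame is invented.

```python
import pandas as pd

def check_preprocessed(frame: pd.DataFrame) -> None:
    """Hypothetical helper: verify the race_ columns are gone and 'race' exists."""
    leftover = [col for col in frame.columns if col.startswith("race_")]
    assert not leftover, f"unexpected race_ columns remain: {leftover}"
    assert "race" in frame.columns, "missing consolidated 'race' column"

# Invented stand-in for the transformed demographics table.
example = pd.DataFrame({"participant_id": ["P1"], "race": ["race_asian"]})
check_preprocessed(example)
```

In this notebook the call would be `check_preprocessed(df)`. If the saved file is read back with `pd.read_csv(..., sep='\t')`, note that empty race values are read as NaN by default rather than as empty strings.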