My name is Matt and I'm an intermediate Python programmer with a focus on data cleaning and harmonisation. My role is to *harmonise* data across different follow-ups: questions captured at different time-points should share the same variable name, label, and field values, while questions that differ (semantically, or in the answer options offered) should be named and labelled differently. The aim is consistency across all follow-ups over time.

I have a fairly solid understanding of the basic foundations of programming and data cleaning/analysis. I like using Polars, and have written a simple library `banksia` which is a wrapper around `pyreadstat` to format SPSS files in a way that's more manageable for my workflow. I like to use a more functional style of programming, and prefer concise, simple code. I want to learn about new data structures, algorithms, libraries (standard and third-party) and other tips and tricks that help to improve my processes.

In [None]:
import banksia as bk
import polars as pl
from pathlib import Path
from fastcore.utils import *
import fastcore.all as fc, numpy as np, matplotlib.pyplot as plt
import re, math, itertools, functools, types, typing, dataclasses, collections, regex, time, asyncio

In [None]:
INPUT = Path("../data/input")
OUTPUT = Path("../data/output")

In [None]:
vl = pl.read_excel("../value_labels.xlsx", columns=["SPSS .SAV file (03.12.25)", "Actual Variable Name"]).rename({"SPSS .SAV file (03.12.25)": "file", "Actual Variable Name": "variable"})
vl

file,variable
str,str
"""G0G1_Q.sav""","""G0G1_BRC_D1"""
"""G0G1_Q.sav""","""G0G1_BRC_D2"""
"""G0G1_Q.sav""","""G0G1_BRC_D3"""
"""G0G1_Q.sav""","""G0G1_BRC_D4"""
"""G0G1_Q.sav""","""G0G1_BRC_D5"""
…,…
"""G228_MainQandRQ.sav""","""G228_SIS1_OC_AGE"""
"""G228_MainQandRQ.sav""","""G228_SIS2_BRC_AGE"""
"""G228_MainQandRQ.sav""","""G228_SIS2_OC_AGE"""
"""G228_MainQandRQ.sav""","""G228_SIS3_BRC_AGE"""


So these are all the variables we've filtered and selected in the "value labels" spreadsheet.  
Let's quickly trawl through the SPSS files, pick up the relevant groups of variables, and compare to make sure we've captured everything.  
We'll also investigate the documentation/pro-formas to ensure nothing has slipped through the gaps.

## G227

In [None]:
vars_g227_str = """G227_BROC
G227_BROC_COM
G227_MO_BRC
G227_MO_OC
G227_MO_AGE
G227_SIS1_BRC
G227_SIS1_OC
G227_SIS1_AGE
G227_SIS2_BRC
G227_SIS2_OC
G227_SIS2_AGE
G227_SIS3_BRC
G227_SIS3_OC
G227_SIS3_AGE
G227_MA1_BRC
G227_MA1_OC
G227_MA1_AGE
G227_MA2_BRC
G227_MA2_OC
G227_MA2_AGE
G227_PA1_BRC
G227_PA1_OC
G227_PA1_AGE
G227_PA2_BRC
G227_PA2_OC
G227_PA2_AGE
G227_MG_BRC
G227_MG_OC
G227_MG_AGE
G227_PG_BRC
G227_PG_OC
G227_PG_AGE"""

vars_g227 = set(vars_g227_str.splitlines())
vars_g227

{'G227_BROC',
 'G227_BROC_COM',
 'G227_MA1_AGE',
 'G227_MA1_BRC',
 'G227_MA1_OC',
 'G227_MA2_AGE',
 'G227_MA2_BRC',
 'G227_MA2_OC',
 'G227_MG_AGE',
 'G227_MG_BRC',
 'G227_MG_OC',
 'G227_MO_AGE',
 'G227_MO_BRC',
 'G227_MO_OC',
 'G227_PA1_AGE',
 'G227_PA1_BRC',
 'G227_PA1_OC',
 'G227_PA2_AGE',
 'G227_PA2_BRC',
 'G227_PA2_OC',
 'G227_PG_AGE',
 'G227_PG_BRC',
 'G227_PG_OC',
 'G227_SIS1_AGE',
 'G227_SIS1_BRC',
 'G227_SIS1_OC',
 'G227_SIS2_AGE',
 'G227_SIS2_BRC',
 'G227_SIS2_OC',
 'G227_SIS3_AGE',
 'G227_SIS3_BRC',
 'G227_SIS3_OC'}

In [None]:
vl_g227 = set(vl.filter(pl.col("file").eq("G227_PA.sav")).get_column("variable").to_list())
vl_g227

{'G227_BROC',
 'G227_BROC_COM',
 'G227_MA1_AGE',
 'G227_MA1_BRC',
 'G227_MA1_OC',
 'G227_MA2_AGE',
 'G227_MA2_BRC',
 'G227_MA2_OC',
 'G227_MG_AGE',
 'G227_MG_BRC',
 'G227_MG_OC',
 'G227_MO_AGE',
 'G227_MO_BRC',
 'G227_MO_OC',
 'G227_PA1_AGE',
 'G227_PA1_BRC',
 'G227_PA1_OC',
 'G227_PA2_AGE',
 'G227_PA2_BRC',
 'G227_PA2_OC',
 'G227_PG_AGE',
 'G227_PG_BRC',
 'G227_PG_OC',
 'G227_SIS1_AGE',
 'G227_SIS1_BRC',
 'G227_SIS1_OC',
 'G227_SIS2_AGE',
 'G227_SIS2_BRC',
 'G227_SIS2_OC',
 'G227_SIS3_AGE',
 'G227_SIS3_BRC',
 'G227_SIS3_OC'}

In [None]:
vars_g227 - vl_g227, vl_g227 - vars_g227

(set(), set())
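
The two-way difference above is repeated for each cohort, so it can be wrapped in a small helper. This is only a sketch: `SetDiff` and `diff_sets` are hypothetical names (not part of `banksia` or Polars), and the toy sets stand in for `vars_g227` and `vl_g227`.

```python
from typing import NamedTuple


class SetDiff(NamedTuple):
    """Names found on only one side of a comparison."""
    only_left: set[str]
    only_right: set[str]


def diff_sets(left: set[str], right: set[str]) -> SetDiff:
    """Return the names unique to each side."""
    return SetDiff(left - right, right - left)


# Toy data standing in for the SPSS list and the spreadsheet list.
from_spss = {"G227_BROC", "G227_MO_BRC", "G227_MO_OC"}
from_sheet = {"G227_BROC", "G227_MO_BRC", "G227_MO_AGE"}

result = diff_sets(from_spss, from_sheet)
print(result.only_left)   # names in the SPSS list but not the spreadsheet
print(result.only_right)  # names in the spreadsheet but not the SPSS list
```

Naming the two sides makes it harder to misread which direction a missing variable came from.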

## G228

In [None]:
vars_g228_str = """G228_BROC
G228_MO_BRC
G228_SIS1_BRC
G228_SIS2_BRC
G228_SIS3_BRC
G228_MA1_BRC
G228_MA2_BRC
G228_PA1_BRC
G228_PA2_BRC
G228_MG_BRC
G228_PG_BRC
G228_OR1_BRC
G228_OR2_BRC
G228_OR1_BRC_OTH
G228_OR2_BRC_OTH
G228_MO_BRC_AGE
G228_SIS1_BRC_AGE
G228_SIS2_BRC_AGE
G228_SIS3_BRC_AGE
G228_MA1_BRC_AGE
G228_MA2_BRC_AGE
G228_PA1_BRC_AGE
G228_PA2_BRC_AGE
G228_MG_BRC_AGE
G228_PG_BRC_AGE
G228_OR1_BRC_AGE
G228_OR2_BRC_AGE
G228_MO_OC
G228_SIS1_OC
G228_SIS2_OC
G228_SIS3_OC
G228_MA1_OC
G228_MA2_OC
G228_PA1_OC
G228_PA2_OC
G228_MG_OC
G228_PG_OC
G228_OR1_OC
G228_OR1_OC_OTH
G228_MO_OC_AGE
G228_SIS1_OC_AGE
G228_SIS2_OC_AGE
G228_SIS3_OC_AGE
G228_MA1_OC_AGE
G228_MA2_OC_AGE
G228_PA1_OC_AGE
G228_PA2_OC_AGE
G228_MG_OC_AGE
G228_PG_OC_AGE
G228_OR1_OC_AGE"""

vars_g228 = set(vars_g228_str.splitlines())
vars_g228

{'G228_BROC',
 'G228_MA1_BRC',
 'G228_MA1_BRC_AGE',
 'G228_MA1_OC',
 'G228_MA1_OC_AGE',
 'G228_MA2_BRC',
 'G228_MA2_BRC_AGE',
 'G228_MA2_OC',
 'G228_MA2_OC_AGE',
 'G228_MG_BRC',
 'G228_MG_BRC_AGE',
 'G228_MG_OC',
 'G228_MG_OC_AGE',
 'G228_MO_BRC',
 'G228_MO_BRC_AGE',
 'G228_MO_OC',
 'G228_MO_OC_AGE',
 'G228_OR1_BRC',
 'G228_OR1_BRC_AGE',
 'G228_OR1_BRC_OTH',
 'G228_OR1_OC',
 'G228_OR1_OC_AGE',
 'G228_OR1_OC_OTH',
 'G228_OR2_BRC',
 'G228_OR2_BRC_AGE',
 'G228_OR2_BRC_OTH',
 'G228_PA1_BRC',
 'G228_PA1_BRC_AGE',
 'G228_PA1_OC',
 'G228_PA1_OC_AGE',
 'G228_PA2_BRC',
 'G228_PA2_BRC_AGE',
 'G228_PA2_OC',
 'G228_PA2_OC_AGE',
 'G228_PG_BRC',
 'G228_PG_BRC_AGE',
 'G228_PG_OC',
 'G228_PG_OC_AGE',
 'G228_SIS1_BRC',
 'G228_SIS1_BRC_AGE',
 'G228_SIS1_OC',
 'G228_SIS1_OC_AGE',
 'G228_SIS2_BRC',
 'G228_SIS2_BRC_AGE',
 'G228_SIS2_OC',
 'G228_SIS2_OC_AGE',
 'G228_SIS3_BRC',
 'G228_SIS3_BRC_AGE',
 'G228_SIS3_OC',
 'G228_SIS3_OC_AGE'}

In [None]:
vl_g228 = set(vl.filter(pl.col("file").eq("G228_MainQandRQ.sav")).get_column("variable").to_list())

In [None]:
vars_g228 - vl_g228, vl_g228 - vars_g228

(set(), set())

## G0G1

In [None]:
vars_g0g1_str = """G0G1_FH_BROV
G0G1_BRC_MO
G0G1_OVC_MO
G0G1_BRCA_MO
G0G1_OVCA_MO
G0G1_REL_SIS
G0G1_BRC_S1
G0G1_BRC_S2
G0G1_BRC_S3
G0G1_BRC_S4
G0G1_BRC_S5
G0G1_BRCA_S1
G0G1_BRCA_S2
G0G1_BRCA_S3
G0G1_BRCA_S4
G0G1_BRCA_S5
G0G1_OVC_S1
G0G1_OVC_S2
G0G1_OVC_S3
G0G1_OVC_S4
G0G1_OVC_S5
G0G1_OVCA_S1
G0G1_OVCA_S2
G0G1_OVCA_S3
G0G1_OVCA_S4
G0G1_OVCA_S5
G0G1_REL_DAU
G0G1_BRC_D1
G0G1_BRC_D2
G0G1_BRC_D3
G0G1_BRC_D4
G0G1_BRC_D5
G0G1_BRCA_D1
G0G1_BRCA_D2
G0G1_BRCA_D3
G0G1_BRCA_D4
G0G1_BRCA_D5
G0G1_OVC_D1
G0G1_OVC_D2
G0G1_OVC_D3
G0G1_OVC_D4
G0G1_OVC_D5
G0G1_OVCA_D1
G0G1_OVCA_D2
G0G1_OVCA_D3
G0G1_OVCA_D4
G0G1_OVCA_D5
G0G1_REL_PA
G0G1_BRC_PA1
G0G1_BRC_PA2
G0G1_BRC_PA3
G0G1_BRC_PA4
G0G1_BRC_PA5
G0G1_BRCA_PA1
G0G1_BRCA_PA2
G0G1_BRCA_PA3
G0G1_BRCA_PA4
G0G1_BRCA_PA5
G0G1_OVC_PA1
G0G1_OVC_PA2
G0G1_OVC_PA3
G0G1_OVC_PA4
G0G1_OVC_PA5
G0G1_OVCA_PA1
G0G1_OVCA_PA2
G0G1_OVCA_PA3
G0G1_OVCA_PA4
G0G1_OVCA_PA5
G0G1_REL_MA
G0G1_BRC_MA1
G0G1_BRC_MA2
G0G1_BRC_MA3
G0G1_BRC_MA4
G0G1_BRC_MA5
G0G1_BRCA_MA1
G0G1_BRCA_MA2
G0G1_BRCA_MA3
G0G1_BRCA_MA4
G0G1_BRCA_MA5
G0G1_OVC_MA1
G0G1_OVC_MA2
G0G1_OVC_MA3
G0G1_OVC_MA4
G0G1_OVC_MA5
G0G1_OVCA_MA1
G0G1_OVCA_MA2
G0G1_OVCA_MA3
G0G1_OVCA_MA4
G0G1_OVCA_MA5
G0G1_BRC_MG
G0G1_OVC_MG
G0G1_BRC_PG
G0G1_OVC_PG
G0G1_BRCA_MG
G0G1_OVCA_MG
G0G1_BRCA_PG
G0G1_OVCA_PG"""

vars_g0g1 = set(vars_g0g1_str.splitlines())
vars_g0g1

{'G0G1_BRCA_D1',
 'G0G1_BRCA_D2',
 'G0G1_BRCA_D3',
 'G0G1_BRCA_D4',
 'G0G1_BRCA_D5',
 'G0G1_BRCA_MA1',
 'G0G1_BRCA_MA2',
 'G0G1_BRCA_MA3',
 'G0G1_BRCA_MA4',
 'G0G1_BRCA_MA5',
 'G0G1_BRCA_MG',
 'G0G1_BRCA_MO',
 'G0G1_BRCA_PA1',
 'G0G1_BRCA_PA2',
 'G0G1_BRCA_PA3',
 'G0G1_BRCA_PA4',
 'G0G1_BRCA_PA5',
 'G0G1_BRCA_PG',
 'G0G1_BRCA_S1',
 'G0G1_BRCA_S2',
 'G0G1_BRCA_S3',
 'G0G1_BRCA_S4',
 'G0G1_BRCA_S5',
 'G0G1_BRC_D1',
 'G0G1_BRC_D2',
 'G0G1_BRC_D3',
 'G0G1_BRC_D4',
 'G0G1_BRC_D5',
 'G0G1_BRC_MA1',
 'G0G1_BRC_MA2',
 'G0G1_BRC_MA3',
 'G0G1_BRC_MA4',
 'G0G1_BRC_MA5',
 'G0G1_BRC_MG',
 'G0G1_BRC_MO',
 'G0G1_BRC_PA1',
 'G0G1_BRC_PA2',
 'G0G1_BRC_PA3',
 'G0G1_BRC_PA4',
 'G0G1_BRC_PA5',
 'G0G1_BRC_PG',
 'G0G1_BRC_S1',
 'G0G1_BRC_S2',
 'G0G1_BRC_S3',
 'G0G1_BRC_S4',
 'G0G1_BRC_S5',
 'G0G1_FH_BROV',
 'G0G1_OVCA_D1',
 'G0G1_OVCA_D2',
 'G0G1_OVCA_D3',
 'G0G1_OVCA_D4',
 'G0G1_OVCA_D5',
 'G0G1_OVCA_MA1',
 'G0G1_OVCA_MA2',
 'G0G1_OVCA_MA3',
 'G0G1_OVCA_MA4',
 'G0G1_OVCA_MA5',
 'G0G1_OVCA_MG',
 'G0G1_OVCA_

In [None]:
vl_g0g1 = set(vl.filter(pl.col("file").eq("G0G1_Q.sav")).get_column("variable").to_list())

In [None]:
vars_g0g1 - vl_g0g1, vl_g0g1 - vars_g0g1

({'G0G1_REL_DAU', 'G0G1_REL_MA', 'G0G1_REL_PA', 'G0G1_REL_SIS'}, set())

## Checking we have all the variables we want

Let's take the latest version of our cleaned value labels spreadsheet as input, combine all the SPSS variables we found, and assert that the two sets of names are identical.
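
One way to phrase that check as an actual assertion is sketched here. `assert_same_names` is a hypothetical helper, and the toy sets are illustrative only; the point is that the failure message lists the offending names instead of just saying the sets differ.

```python
def assert_same_names(expected: set[str], found: set[str]) -> None:
    """Fail with the unmatched names if the two sets differ."""
    # Symmetric difference: names present on exactly one side.
    mismatched = expected ^ found
    assert not mismatched, (
        f"{len(mismatched)} unmatched name(s): {sorted(mismatched)}"
    )


# Identical sets pass silently.
assert_same_names({"G227_BROC", "G228_BROC"}, {"G228_BROC", "G227_BROC"})

# Differing sets raise an AssertionError that names the culprits.
try:
    assert_same_names({"G227_BROC"}, {"G227_BROC", "G228_BROC"})
except AssertionError as err:
    message = str(err)
    print(message)
```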

In [None]:
value_labels = """G0G1_Q.sav	G0G1	G0G1_BRCA_D1
G0G1_Q.sav	G0G1	G0G1_BRCA_D2
G0G1_Q.sav	G0G1	G0G1_BRCA_D3
G0G1_Q.sav	G0G1	G0G1_BRCA_D4
G0G1_Q.sav	G0G1	G0G1_BRCA_D5
G0G1_Q.sav	G0G1	G0G1_BRCA_MA1
G0G1_Q.sav	G0G1	G0G1_BRCA_MA2
G0G1_Q.sav	G0G1	G0G1_BRCA_MA3
G0G1_Q.sav	G0G1	G0G1_BRCA_MA4
G228_MainQandRQ.sav	G228	G228_AREL
G227_PA.sav	G227	G227_AreL
G0G1_PA.sav	G0G1	G0G1_AREL
G228_MainQandRQ.sav	G228	G228_ARER
G227_PA.sav	G227	G227_AreR
G0G1_PA.sav	G0G1	G0G1_ARER
G228_MainQandRQ.sav	G228	G228_BR_COL
G227_PA.sav	G227	G227_BR_Col
G0G1_PA.sav	G0G1	G0G1_BR_COL
G228_MainQandRQ.sav	G228	G228_BR1
G227_PA.sav	G227	G227_BR1
G228_MainQandRQ.sav	G228	G228_BR1_AGE
G227_PA.sav	G227	G227_BR1_AGE
G228_MainQandRQ.sav	G228	G228_BR2
G227_PA.sav	G227	G227_BR2
G228_MainQandRQ.sav	G228	G228_BR2_AGE
G227_PA.sav	G227	G227_BR2_AGE
G228_MainQandRQ.sav	G228	G228_BR3
G227_PA.sav	G227	G227_BR3
G228_MainQandRQ.sav	G228	G228_BR3_AGE
G227_PA.sav	G227	G227_BR3_AGE
G228_MainQandRQ.sav	G228	G228_BR4
G227_PA.sav	G227	G227_BR4
G228_MainQandRQ.sav	G228	G228_BR4_AGE
G227_PA.sav	G227	G227_BR4_AGE
G228_MainQandRQ.sav	G228	G228_BR4_SD
G227_PA.sav	G227	G227_BR4_SD
G228_MainQandRQ.sav	G228	G228_BR5
G227_PA.sav	G227	G227_BR5
G228_MainQandRQ.sav	G228	G228_BR5_AGE
G227_PA.sav	G227	G227_BR5_AGE
G228_MainQandRQ.sav	G228	G228_BR5_SD
G227_PA.sav	G227	G227_BR5_SD
G228_MainQandRQ.sav	G228	G228_BR6
G227_PA.sav	G227	G227_BR6
G228_MainQandRQ.sav	G228	G228_BR6_AGE
G227_PA.sav	G227	G227_BR6_AGE
G228_MainQandRQ.sav	G228	G228_BR6_SD
G227_PA.sav	G227	G227_BR6_SD
G0G1_Q.sav	G0G1	G0G1_BRC_D1
G0G1_Q.sav	G0G1	G0G1_BRC_D2
G0G1_Q.sav	G0G1	G0G1_BRC_D3
G0G1_Q.sav	G0G1	G0G1_BRC_D4
G0G1_Q.sav	G0G1	G0G1_BRC_D5
G0G1_Q.sav	G0G1	G0G1_BRC_MA1
G0G1_Q.sav	G0G1	G0G1_BRC_MA2
G0G1_Q.sav	G0G1	G0G1_BRC_MA3
G0G1_Q.sav	G0G1	G0G1_BRC_MA4
G0G1_Q.sav	G0G1	G0G1_BRC_MA5
G0G1_Q.sav	G0G1	G0G1_BRC_MG
G0G1_Q.sav	G0G1	G0G1_BRC_MO
G0G1_Q.sav	G0G1	G0G1_BRC_PA1
G0G1_Q.sav	G0G1	G0G1_BRC_PA2
G0G1_Q.sav	G0G1	G0G1_BRC_PA3
G0G1_Q.sav	G0G1	G0G1_BRC_PA4
G0G1_Q.sav	G0G1	G0G1_BRC_PA5
G0G1_Q.sav	G0G1	G0G1_BRC_PG
G0G1_Q.sav	G0G1	G0G1_BRC_S1
G0G1_Q.sav	G0G1	G0G1_BRC_S2
G0G1_Q.sav	G0G1	G0G1_BRC_S3
G0G1_Q.sav	G0G1	G0G1_BRC_S4
G0G1_Q.sav	G0G1	G0G1_BRC_S5
G0G1_Q.sav	G0G1	G0G1_BRCA_MA5
G0G1_Q.sav	G0G1	G0G1_BRCA_MG
G0G1_Q.sav	G0G1	G0G1_BRCA_MO
G0G1_Q.sav	G0G1	G0G1_BRCA_PA1
G0G1_Q.sav	G0G1	G0G1_BRCA_PA2
G0G1_Q.sav	G0G1	G0G1_BRCA_PA3
G0G1_Q.sav	G0G1	G0G1_BRCA_PA4
G0G1_Q.sav	G0G1	G0G1_BRCA_PA5
G0G1_Q.sav	G0G1	G0G1_BRCA_PG
G0G1_Q.sav	G0G1	G0G1_BRCA_S1
G0G1_Q.sav	G0G1	G0G1_BRCA_S2
G0G1_Q.sav	G0G1	G0G1_BRCA_S3
G0G1_Q.sav	G0G1	G0G1_BRCA_S4
G0G1_Q.sav	G0G1	G0G1_BRCA_S5
G0G1_Q.sav	G0G1	G0G1_BRS3
G0G1_Q.sav	G0G1	G0G1_BRS5
G0G1_Q.sav	G0G1	G0G1_BRS6
G0G1_Q.sav	G0G1	G0G1_BRS7
G0G1_Q.sav	G0G1	G0G1_BRSA5
G0G1_Q.sav	G0G1	G0G1_BRSA6_1
G0G1_Q.sav	G0G1	G0G1_BRSA6_2
G0G1_Q.sav	G0G1	G0G1_BRSA7
G0G1_Q.sav	G0G1	G0G1_BRSS5
G228_MainQandRQ.sav	G228	G228_BROC
G227_PA.sav	G227	G227_BROC
G227_PA.sav	G227	G227_BROC_COM
G0G1_Q.sav	G0G1	G0G1_BRS1
G0G1_Q.sav	G0G1	G0G1_BRS2
G0G1_Q.sav	G0G1	G0G1_BRSS7
G0G1_Q.sav	G0G1	G0G1_BRS4
G0G1_Q.sav	G0G1	G0G1_BRSA3
G0G1_Q.sav	G0G1	G0G1_BRSA4_1
G0G1_Q.sav	G0G1	G0G1_BRSA4_2
G0G1_Q.sav	G0G1	G0G1_BRSA4_3
G0G1_Q.sav	G0G1	G0G1_BRSS4_1
G0G1_Q.sav	G0G1	G0G1_BRSS4_2
G0G1_Q.sav	G0G1	G0G1_BRSS4_3
G0G1_Q.sav	G0G1	G0G1_BRSS6_1
G0G1_Q.sav	G0G1	G0G1_BRSS6_2
G0G1_Q.sav	G0G1	G0G1_OVCA_D1
G0G1_Q.sav	G0G1	G0G1_OVCA_D2
G0G1_Q.sav	G0G1	G0G1_OVCA_D3
G0G1_Q.sav	G0G1	G0G1_OVCA_D4
G0G1_Q.sav	G0G1	G0G1_OVCA_D5
G0G1_Q.sav	G0G1	G0G1_OVCA_MA1
G0G1_Q.sav	G0G1	G0G1_OVCA_MA2
G0G1_Q.sav	G0G1	G0G1_OVCA_MA3
G0G1_Q.sav	G0G1	G0G1_OVCA_MA4
G0G1_Q.sav	G0G1	G0G1_OVCA_MA5
G0G1_Q.sav	G0G1	G0G1_OVCA_MG
G0G1_Q.sav	G0G1	G0G1_OVCA_MO
G0G1_Q.sav	G0G1	G0G1_OVCA_PA1
G0G1_Q.sav	G0G1	G0G1_OVCA_PA2
G0G1_Q.sav	G0G1	G0G1_OVCA_PA3
G0G1_Q.sav	G0G1	G0G1_OVCA_PA4
G0G1_Q.sav	G0G1	G0G1_OVCA_PA5
G0G1_Q.sav	G0G1	G0G1_OVCA_PG
G0G1_Q.sav	G0G1	G0G1_OVCA_S1
G0G1_Q.sav	G0G1	G0G1_OVCA_S2
G0G1_Q.sav	G0G1	G0G1_OVCA_S3
G0G1_Q.sav	G0G1	G0G1_OVCA_S4
G0G1_Q.sav	G0G1	G0G1_OVCA_S5
G0G1_Q.sav	G0G1	G0G1_FH_BROV
G227_PA.sav	G227	G227_MA1_AGE
G228_MainQandRQ.sav	G228	G228_MA1_BRC
G227_PA.sav	G227	G227_MA1_BRC
G228_MainQandRQ.sav	G228	G228_MA1_OC
G227_PA.sav	G227	G227_MA1_OC
G227_PA.sav	G227	G227_MA2_AGE
G228_MainQandRQ.sav	G228	G228_MA2_BRC
G227_PA.sav	G227	G227_MA2_BRC
G228_MainQandRQ.sav	G228	G228_MA2_OC
G227_PA.sav	G227	G227_MA2_OC
G227_PA.sav	G227	G227_MG_AGE
G228_MainQandRQ.sav	G228	G228_MG_BRC
G227_PA.sav	G227	G227_MG_BRC
G228_MainQandRQ.sav	G228	G228_MG_OC
G227_PA.sav	G227	G227_MG_OC
G227_PA.sav	G227	G227_MO_AGE
G228_MainQandRQ.sav	G228	G228_MO_BRC
G227_PA.sav	G227	G227_MO_BRC
G228_MainQandRQ.sav	G228	G228_MO_OC
G227_PA.sav	G227	G227_MO_OC
G228_MainQandRQ.sav	G228	G228_OR1_BRC
G228_MainQandRQ.sav	G228	G228_OR1_BRC_OTH
G228_MainQandRQ.sav	G228	G228_OR1_OC
G228_MainQandRQ.sav	G228	G228_OR1_OC_OTH
G228_MainQandRQ.sav	G228	G228_OR2_BRC
G228_MainQandRQ.sav	G228	G228_OR2_BRC_OTH
G0G1_Q.sav	G0G1	G0G1_OVC_D1
G0G1_Q.sav	G0G1	G0G1_OVC_D2
G0G1_Q.sav	G0G1	G0G1_OVC_D3
G0G1_Q.sav	G0G1	G0G1_OVC_D4
G0G1_Q.sav	G0G1	G0G1_OVC_D5
G0G1_Q.sav	G0G1	G0G1_OVC_MA1
G0G1_Q.sav	G0G1	G0G1_OVC_MA2
G0G1_Q.sav	G0G1	G0G1_OVC_MA3
G0G1_Q.sav	G0G1	G0G1_OVC_MA4
G0G1_Q.sav	G0G1	G0G1_OVC_MA5
G0G1_Q.sav	G0G1	G0G1_OVC_MG
G0G1_Q.sav	G0G1	G0G1_OVC_MO
G0G1_Q.sav	G0G1	G0G1_OVC_PA1
G0G1_Q.sav	G0G1	G0G1_OVC_PA2
G0G1_Q.sav	G0G1	G0G1_OVC_PA3
G0G1_Q.sav	G0G1	G0G1_OVC_PA4
G0G1_Q.sav	G0G1	G0G1_OVC_PA5
G0G1_Q.sav	G0G1	G0G1_OVC_PG
G0G1_Q.sav	G0G1	G0G1_OVC_S1
G0G1_Q.sav	G0G1	G0G1_OVC_S2
G0G1_Q.sav	G0G1	G0G1_OVC_S3
G0G1_Q.sav	G0G1	G0G1_OVC_S4
G0G1_Q.sav	G0G1	G0G1_OVC_S5
G227_PA.sav	G227	G227_PA1_AGE
G228_MainQandRQ.sav	G228	G228_PA1_BRC
G227_PA.sav	G227	G227_PA1_BRC
G228_MainQandRQ.sav	G228	G228_PA1_OC
G227_PA.sav	G227	G227_PA1_OC
G227_PA.sav	G227	G227_PA2_AGE
G228_MainQandRQ.sav	G228	G228_PA2_BRC
G227_PA.sav	G227	G227_PA2_BRC
G228_MainQandRQ.sav	G228	G228_PA2_OC
G227_PA.sav	G227	G227_PA2_OC
G227_PA.sav	G227	G227_PG_AGE
G228_MainQandRQ.sav	G228	G228_PG_BRC
G227_PA.sav	G227	G227_PG_BRC
G228_MainQandRQ.sav	G228	G228_PG_OC
G227_PA.sav	G227	G227_PG_OC
G228_MainQandRQ.sav	G228	G228_PIER
G0G1_PA.sav	G0G1	G0G1_PIER
G228_MainQandRQ.sav	G228	G228_PIERL
G227_PA.sav	G227	G227_PIERL
G0G1_PA.sav	G0G1	G0G1_PIERL
G228_MainQandRQ.sav	G228	G228_PIERR
G227_PA.sav	G227	G227_PIERR
G0G1_PA.sav	G0G1	G0G1_PIERR
G0G1_Q.sav	G0G1	G0G1_REL_DAU
G0G1_Q.sav	G0G1	G0G1_REL_MA
G0G1_Q.sav	G0G1	G0G1_REL_PA
G0G1_Q.sav	G0G1	G0G1_REL_SIS
G228_MainQandRQ.sav	G228	G228_SCAR_LC
G0G1_PA.sav	G0G1	G0G1_SCAR_LC
G228_MainQandRQ.sav	G228	G228_SCAR_LLIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LLIQ
G228_MainQandRQ.sav	G228	G228_SCAR_LLOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LLOQ
G228_MainQandRQ.sav	G228	G228_SCAR_LUIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LUIQ
G228_MainQandRQ.sav	G228	G228_SCAR_LUOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LUOQ
G228_MainQandRQ.sav	G228	G228_SCAR_RC
G0G1_PA.sav	G0G1	G0G1_SCAR_RC
G228_MainQandRQ.sav	G228	G228_SCAR_RLIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RLIQ
G228_MainQandRQ.sav	G228	G228_SCAR_RLOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RLOQ
G228_MainQandRQ.sav	G228	G228_SCAR_RUIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RUIQ
G228_MainQandRQ.sav	G228	G228_SCAR_RUOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RUOQ
G228_MainQandRQ.sav	G228	G228_SCARL
G227_PA.sav	G227	G227_SCARL
G0G1_PA.sav	G0G1	G0G1_SCARL
G228_MainQandRQ.sav	G228	G228_SCARS
G227_PA.sav	G227	G227_SCARS
G0G1_PA.sav	G0G1	G0G1_SCARS
G228_MainQandRQ.sav	G228	G228_SCARW
G227_PA.sav	G227	G227_SCARW
G0G1_PA.sav	G0G1	G0G1_SCARW
G228_MainQandRQ.sav	G228	G228_MA1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MA1_OC_AGE
G228_MainQandRQ.sav	G228	G228_MA2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MA2_OC_AGE
G228_MainQandRQ.sav	G228	G228_MG_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MG_OC_AGE
G228_MainQandRQ.sav	G228	G228_MO_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MO_OC_AGE
G228_MainQandRQ.sav	G228	G228_OR1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_OR1_OC_AGE
G228_MainQandRQ.sav	G228	G228_OR2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PA1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PA1_OC_AGE
G228_MainQandRQ.sav	G228	G228_PA2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PA2_OC_AGE
G228_MainQandRQ.sav	G228	G228_PG_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PG_OC_AGE
G227_PA.sav	G227	G227_SIS1_AGE
G228_MainQandRQ.sav	G228	G228_SIS1_BRC
G227_PA.sav	G227	G227_SIS1_BRC
G228_MainQandRQ.sav	G228	G228_SIS1_OC
G227_PA.sav	G227	G227_SIS1_OC
G227_PA.sav	G227	G227_SIS2_AGE
G228_MainQandRQ.sav	G228	G228_SIS2_BRC
G227_PA.sav	G227	G227_SIS2_BRC
G228_MainQandRQ.sav	G228	G228_SIS2_OC
G227_PA.sav	G227	G227_SIS2_OC
G227_PA.sav	G227	G227_SIS3_AGE
G228_MainQandRQ.sav	G228	G228_SIS3_BRC
G227_PA.sav	G227	G227_SIS3_BRC
G228_MainQandRQ.sav	G228	G228_SIS3_OC
G227_PA.sav	G227	G227_SIS3_OC
G228_MainQandRQ.sav	G228	G228_SIS1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_SIS1_OC_AGE
G228_MainQandRQ.sav	G228	G228_SIS2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_SIS2_OC_AGE
G228_MainQandRQ.sav	G228	G228_SIS3_BRC_AGE
G228_MainQandRQ.sav	G228	G228_SIS3_OC_AGE
G228_MainQandRQ.sav	G228	G228_TATT
G227_PA.sav	G227	G227_TATT
G0G1_PA.sav	G0G1	G0G1_TATT
G228_MainQandRQ.sav	G228	G228_TATT_LC
G0G1_PA.sav	G0G1	G0G1_TATT_LC
G228_MainQandRQ.sav	G228	G228_TATT_LLIQ
G0G1_PA.sav	G0G1	G0G1_TATT_LLIQ
G228_MainQandRQ.sav	G228	G228_TATT_LLOQ
G0G1_PA.sav	G0G1	G0G1_TATT_LLOQ
G228_MainQandRQ.sav	G228	G228_TATT_LUIQ
G0G1_PA.sav	G0G1	G0G1_TATT_LUIQ
G228_MainQandRQ.sav	G228	G228_TATT_LUOQ
G0G1_PA.sav	G0G1	G0G1_TATT_LUOQ
G228_MainQandRQ.sav	G228	G228_TATT_RC
G0G1_PA.sav	G0G1	G0G1_TATT_RC
G228_MainQandRQ.sav	G228	G228_TATT_RLIQ
G0G1_PA.sav	G0G1	G0G1_TATT_RLIQ
G228_MainQandRQ.sav	G228	G228_TATT_RLOQ
G0G1_PA.sav	G0G1	G0G1_TATT_RLOQ
G228_MainQandRQ.sav	G228	G228_TATT_RUIQ
G0G1_PA.sav	G0G1	G0G1_TATT_RUIQ
G228_MainQandRQ.sav	G228	G228_TATT_RUOQ
G0G1_PA.sav	G0G1	G0G1_TATT_RUOQ
G228_MainQandRQ.sav	G228	G228_TATTL
G227_PA.sav	G227	G227_TATTL
G0G1_PA.sav	G0G1	G0G1_TATTL
G228_MainQandRQ.sav	G228	G228_TATTW
G227_PA.sav	G227	G227_TATTW
G0G1_PA.sav	G0G1	G0G1_TATTW
G227_PA.sav	G227	G227_TiBs_COM
G0G1_PA.sav	G0G1	G0G1_TIBS_COM"""

In [None]:
vl = parse_variables(value_labels)
vl.height, vl.head()

(308,
 shape: (5, 3)
 ┌────────────┬─────────┬──────────────┐
 │ File       ┆ Dataset ┆ Variable     │
 │ ---        ┆ ---     ┆ ---          │
 │ str        ┆ str     ┆ str          │
 ╞════════════╪═════════╪══════════════╡
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D1 │
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D2 │
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D3 │
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D4 │
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D5 │
 └────────────┴─────────┴──────────────┘)
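
`parse_variables` is defined in a cell that isn't shown here. A minimal sketch of the parsing step it presumably performs follows, assuming each line holds three tab-separated fields in the order File, Dataset, Variable. `parse_rows` and `FIELDS` are hypothetical names, and the sketch returns plain dicts, not a Polars frame.

```python
FIELDS = ("File", "Dataset", "Variable")  # assumed column order


def parse_rows(text: str) -> list[dict[str, str]]:
    """Split tab-separated lines into one dict per row, skipping blank lines."""
    rows = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != len(FIELDS):
            raise ValueError(
                f"line {number}: expected {len(FIELDS)} fields, got {len(parts)}"
            )
        rows.append(dict(zip(FIELDS, parts)))
    return rows


sample = "G0G1_Q.sav\tG0G1\tG0G1_BRCA_D1\nG227_PA.sav\tG227\tG227_AreL"
rows = parse_rows(sample)
print(rows[0])
```

Passing a list of dicts like this to `pl.DataFrame` should give the three string columns seen in the output above. Raising on a wrong field count catches stray tabs or truncated pastes early.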

In [None]:
total_columns = vars_g227_str + vars_g228_str + vars_g0g1_pa_str + vars_g0g1_str
total_columns

'G227_BR1\nG227_BR1_AGE\nG227_BR2\nG227_BR2_AGE\nG227_BR3\nG227_BR3_AGE\nG227_BR4\nG227_BR4_AGE\nG227_BR4_SD\nG227_BR5\nG227_BR5_AGE\nG227_BR5_SD\nG227_BR6\nG227_BR6_AGE\nG227_BR6_SD\nG227_BROC\nG227_BROC_COM\nG227_MO_BRC\nG227_MO_OC\nG227_MO_AGE\nG227_SIS1_BRC\nG227_SIS1_OC\nG227_SIS1_AGE\nG227_SIS2_BRC\nG227_SIS2_OC\nG227_SIS2_AGE\nG227_SIS3_BRC\nG227_SIS3_OC\nG227_SIS3_AGE\nG227_MA1_BRC\nG227_MA1_OC\nG227_MA1_AGE\nG227_MA2_BRC\nG227_MA2_OC\nG227_MA2_AGE\nG227_PA1_BRC\nG227_PA1_OC\nG227_PA1_AGE\nG227_PA2_BRC\nG227_PA2_OC\nG227_PA2_AGE\nG227_MG_BRC\nG227_MG_OC\nG227_MG_AGE\nG227_PG_BRC\nG227_PG_OC\nG227_PG_AGE\nG227_AreR\nG227_AreL\nG227_SCARS\nG227_SCARW\nG227_SCARL\nG227_TATT\nG227_TATTW\nG227_TATTL\nG227_PIERR\nG227_PIERL\nG227_BR_Col\nG227_TiBs_COMG228_BR1\nG228_BR1_AGE\nG228_BR2\nG228_BR2_AGE\nG228_BR3\nG228_BR3_AGE\nG228_BR4\nG228_BR4_AGE\nG228_BR4_SD\nG228_BR5\nG228_BR5_AGE\nG228_BR5_SD\nG228_BR6\nG228_BR6_AGE\nG228_BR6_SD\nG228_BROC\nG228_MO_BRC\nG228_SIS1_BRC\nG228_SIS2_BRC\n

In [None]:
cols = total_columns.splitlines()
cols[:5]

['G227_BR1', 'G227_BR1_AGE', 'G227_BR2', 'G227_BR2_AGE', 'G227_BR3']

In [None]:
len(cols)

305

It seems there are still a few columns missing: we have 305 names against the 308 rows in the spreadsheet. Let's check.

In [None]:
vl_set = set(vl.get_column("Variable").to_list())
col_set = set(cols)

In [None]:
vl_set - col_set

{'G0G1_BRS1',
 'G0G1_BR_COL',
 'G0G1_TIBS_COM',
 'G227_TiBs_COM',
 'G228_BR1',
 'G228_BR_COL'}

In [None]:
col_set - vl_set

{'G0G1_TIBS_COMG0G1_BRS1', 'G227_TiBs_COMG228_BR1', 'G228_BR_COLG0G1_BR_COL'}

Looks like this is an error from concatenating the strings: the last name of one block has run straight into the first name of the next. How can we fix this?

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠

You're right! None of the strings ends with a newline, so when you concatenated them with `+`, the last line of one string was joined directly to the first line of the next.

Instead of:
```python
total_columns = vars_g227_str + vars_g228_str + vars_g0g1_pa_str + vars_g0g1_str
```

Try:
```python
total_columns = "\n".join([vars_g227_str, vars_g228_str, vars_g0g1_pa_str, vars_g0g1_str])
```
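
To see the difference on a small scale, here is a toy sketch with two made-up blocks (`block_a` and `block_b` are illustrative, not your real strings). It also shows an alternative that splits each block first and then chains the pieces, which avoids building one big string at all.

```python
from itertools import chain

block_a = "G227_BROC\nG227_BROC_COM"
block_b = "G228_BROC\nG228_MO_BRC"

# Plain concatenation: no newline at the seam, so two names fuse into one.
fused = (block_a + block_b).splitlines()

# Joining with "\n" restores the missing separator.
joined = "\n".join([block_a, block_b]).splitlines()

# Alternative: split each block on its own, then chain the resulting lists.
chained = list(chain.from_iterable(b.splitlines() for b in (block_a, block_b)))

print(fused)    # ['G227_BROC', 'G227_BROC_COMG228_BROC', 'G228_MO_BRC']
print(joined)   # ['G227_BROC', 'G227_BROC_COM', 'G228_BROC', 'G228_MO_BRC']
print(chained)  # same four names as joined
```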

In [None]:
total_columns = "\n".join([vars_g227_str, vars_g228_str, vars_g0g1_pa_str, vars_g0g1_str])
cols = total_columns.splitlines()
col_set = set(cols)

In [None]:
vl_set - col_set, col_set - vl_set

(set(), set())

Perfect, we've captured all the variables we're after!
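
A cheap guard against this kind of fusion in future is to check that every name contains exactly one cohort prefix. The sketch below assumes the only prefixes are `G0G1_`, `G227_`, and `G228_`, and that no legitimate name repeats one; `PREFIX` and `fused_names` are hypothetical names.

```python
import re

# Assumption: every variable name carries exactly one of these cohort prefixes.
PREFIX = re.compile(r"G0G1_|G227_|G228_")


def fused_names(names: list[str]) -> list[str]:
    """Return names that do not contain exactly one cohort prefix."""
    return [name for name in names if len(PREFIX.findall(name)) != 1]


cols = [
    "G227_BROC",
    "G227_TiBs_COMG228_BR1",   # two names fused at a seam
    "G0G1_TIBS_COMG0G1_BRS1",  # fused, same prefix twice
    "G228_BR_COL",
]
suspects = fused_names(cols)
print(suspects)
```

Running this right after `splitlines()` would have flagged the three fused names before the set comparison.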