I'm planning to harmonise the breast- and ovarian-cancer questions for the TiBs assessment, which was carried out across G227, G228 and G0G1.  
John has already harmonised G227 and G228 but not G0G1, so hopefully I only need to make sure G0G1 can be harmonised with the existing variables.  

The purpose of this notebook is to cross-check and verify that I've selected all the relevant variables and haven't missed anything that currently exists in core.

In [1]:
import banksia as bk
import polars as pl
from pathlib import Path
from fastcore.utils import *
import fastcore.all as fc, numpy as np, matplotlib.pyplot as plt
# import re, math, itertools, functools, types, typing, dataclasses, collections, regex, time, asyncio

In [None]:
INPUT = Path("../data/input")
OUTPUT = Path("../data/output")

In [None]:
variables = """G0G1_Q.sav	G0G1	G0G1_BRCA_D1
G0G1_Q.sav	G0G1	G0G1_BRCA_D2
G0G1_Q.sav	G0G1	G0G1_BRCA_D3
G0G1_Q.sav	G0G1	G0G1_BRCA_D4
G0G1_Q.sav	G0G1	G0G1_BRCA_D5
G0G1_Q.sav	G0G1	G0G1_BRCA_MA1
G0G1_Q.sav	G0G1	G0G1_BRCA_MA2
G0G1_Q.sav	G0G1	G0G1_BRCA_MA3
G0G1_Q.sav	G0G1	G0G1_BRCA_MA4
G228_MainQandRQ.sav	G228	G228_BR_COL
G227_PA.sav	G227	G227_BR_Col
G0G1_PA.sav	G0G1	G0G1_BR_COL
G228_MainQandRQ.sav	G228	G228_BR1
G227_PA.sav	G227	G227_BR1
G228_MainQandRQ.sav	G228	G228_BR1_AGE
G227_PA.sav	G227	G227_BR1_AGE
G228_MainQandRQ.sav	G228	G228_BR2
G227_PA.sav	G227	G227_BR2
G228_MainQandRQ.sav	G228	G228_BR2_AGE
G227_PA.sav	G227	G227_BR2_AGE
G228_MainQandRQ.sav	G228	G228_BR3
G227_PA.sav	G227	G227_BR3
G228_MainQandRQ.sav	G228	G228_BR3_AGE
G227_PA.sav	G227	G227_BR3_AGE
G228_MainQandRQ.sav	G228	G228_BR4
G227_PA.sav	G227	G227_BR4
G228_MainQandRQ.sav	G228	G228_BR4_AGE
G227_PA.sav	G227	G227_BR4_AGE
G228_MainQandRQ.sav	G228	G228_BR4_SD
G227_PA.sav	G227	G227_BR4_SD
G228_MainQandRQ.sav	G228	G228_BR5
G227_PA.sav	G227	G227_BR5
G228_MainQandRQ.sav	G228	G228_BR5_AGE
G227_PA.sav	G227	G227_BR5_AGE
G228_MainQandRQ.sav	G228	G228_BR5_SD
G227_PA.sav	G227	G227_BR5_SD
G228_MainQandRQ.sav	G228	G228_BR6
G227_PA.sav	G227	G227_BR6
G228_MainQandRQ.sav	G228	G228_BR6_AGE
G227_PA.sav	G227	G227_BR6_AGE
G228_MainQandRQ.sav	G228	G228_BR6_SD
G227_PA.sav	G227	G227_BR6_SD
G0G1_Q.sav	G0G1	G0G1_BRC_D1
G0G1_Q.sav	G0G1	G0G1_BRC_D2
G0G1_Q.sav	G0G1	G0G1_BRC_D3
G0G1_Q.sav	G0G1	G0G1_BRC_D4
G0G1_Q.sav	G0G1	G0G1_BRC_D5
G0G1_Q.sav	G0G1	G0G1_BRC_MA1
G0G1_Q.sav	G0G1	G0G1_BRC_MA2
G0G1_Q.sav	G0G1	G0G1_BRC_MA3
G0G1_Q.sav	G0G1	G0G1_BRC_MA4
G0G1_Q.sav	G0G1	G0G1_BRC_MA5
G0G1_Q.sav	G0G1	G0G1_BRC_MG
G0G1_Q.sav	G0G1	G0G1_BRC_MO
G0G1_Q.sav	G0G1	G0G1_BRC_PA1
G0G1_Q.sav	G0G1	G0G1_BRC_PA2
G0G1_Q.sav	G0G1	G0G1_BRC_PA3
G0G1_Q.sav	G0G1	G0G1_BRC_PA4
G0G1_Q.sav	G0G1	G0G1_BRC_PA5
G0G1_Q.sav	G0G1	G0G1_BRC_PG
G0G1_Q.sav	G0G1	G0G1_BRC_S1
G0G1_Q.sav	G0G1	G0G1_BRC_S2
G0G1_Q.sav	G0G1	G0G1_BRC_S3
G0G1_Q.sav	G0G1	G0G1_BRC_S4
G0G1_Q.sav	G0G1	G0G1_BRC_S5
G0G1_Q.sav	G0G1	G0G1_BRCA_MA5
G0G1_Q.sav	G0G1	G0G1_BRCA_MG
G0G1_Q.sav	G0G1	G0G1_BRCA_MO
G0G1_Q.sav	G0G1	G0G1_BRCA_PA1
G0G1_Q.sav	G0G1	G0G1_BRCA_PA2
G0G1_Q.sav	G0G1	G0G1_BRCA_PA3
G0G1_Q.sav	G0G1	G0G1_BRCA_PA4
G0G1_Q.sav	G0G1	G0G1_BRCA_PA5
G0G1_Q.sav	G0G1	G0G1_BRCA_PG
G0G1_Q.sav	G0G1	G0G1_BRCA_S1
G0G1_Q.sav	G0G1	G0G1_BRCA_S2
G0G1_Q.sav	G0G1	G0G1_BRCA_S3
G0G1_Q.sav	G0G1	G0G1_BRCA_S4
G0G1_Q.sav	G0G1	G0G1_BRCA_S5
G0G1_Q.sav	G0G1	G0G1_BRS3
G0G1_Q.sav	G0G1	G0G1_BRS5
G0G1_Q.sav	G0G1	G0G1_BRS6
G0G1_Q.sav	G0G1	G0G1_BRS7
G0G1_Q.sav	G0G1	G0G1_BRSA5
G0G1_Q.sav	G0G1	G0G1_BRSA6_1
G0G1_Q.sav	G0G1	G0G1_BRSA6_2
G0G1_Q.sav	G0G1	G0G1_BRSA7
G0G1_Q.sav	G0G1	G0G1_BRSS5
G0G1_PA.sav	G0G1	G0G1_BREXA
G228_MainQandRQ.sav	G228	G228_BROC
G227_PA.sav	G227	G227_BROC
G227_PA.sav	G227	G227_BROC_COM
G0G1_Q.sav	G0G1	G0G1_BRS1
G0G1_Q.sav	G0G1	G0G1_BRS2
G0G1_Q.sav	G0G1	G0G1_BRSS7
G0G1_Q.sav	G0G1	G0G1_BRS4
G0G1_Q.sav	G0G1	G0G1_BRSA3
G0G1_Q.sav	G0G1	G0G1_BRSA4_1
G0G1_Q.sav	G0G1	G0G1_BRSA4_2
G0G1_Q.sav	G0G1	G0G1_BRSA4_3
G0G1_Q.sav	G0G1	G0G1_BRSS4_1
G0G1_Q.sav	G0G1	G0G1_BRSS4_2
G0G1_Q.sav	G0G1	G0G1_BRSS4_3
G0G1_Q.sav	G0G1	G0G1_BRSS6_1
G0G1_Q.sav	G0G1	G0G1_BRSS6_2
G0G1_Q.sav	G0G1	G0G1_OVCA_D1
G0G1_Q.sav	G0G1	G0G1_OVCA_D2
G0G1_Q.sav	G0G1	G0G1_OVCA_D3
G0G1_Q.sav	G0G1	G0G1_OVCA_D4
G0G1_Q.sav	G0G1	G0G1_OVCA_D5
G0G1_Q.sav	G0G1	G0G1_OVCA_MA1
G0G1_Q.sav	G0G1	G0G1_OVCA_MA2
G0G1_Q.sav	G0G1	G0G1_OVCA_MA3
G0G1_Q.sav	G0G1	G0G1_OVCA_MA4
G0G1_Q.sav	G0G1	G0G1_OVCA_MA5
G0G1_Q.sav	G0G1	G0G1_OVCA_MG
G0G1_Q.sav	G0G1	G0G1_OVCA_MO
G0G1_Q.sav	G0G1	G0G1_OVCA_PA1
G0G1_Q.sav	G0G1	G0G1_OVCA_PA2
G0G1_Q.sav	G0G1	G0G1_OVCA_PA3
G0G1_Q.sav	G0G1	G0G1_OVCA_PA4
G0G1_Q.sav	G0G1	G0G1_OVCA_PA5
G0G1_Q.sav	G0G1	G0G1_OVCA_PG
G0G1_Q.sav	G0G1	G0G1_OVCA_S1
G0G1_Q.sav	G0G1	G0G1_OVCA_S2
G0G1_Q.sav	G0G1	G0G1_OVCA_S3
G0G1_Q.sav	G0G1	G0G1_OVCA_S4
G0G1_Q.sav	G0G1	G0G1_OVCA_S5
G0G1_Q.sav	G0G1	G0G1_FH_BROV
G227_PA.sav	G227	G227_MA1_AGE
G228_MainQandRQ.sav	G228	G228_MA1_BRC
G227_PA.sav	G227	G227_MA1_BRC
G228_MainQandRQ.sav	G228	G228_MA1_OC
G227_PA.sav	G227	G227_MA1_OC
G227_PA.sav	G227	G227_MA2_AGE
G228_MainQandRQ.sav	G228	G228_MA2_BRC
G227_PA.sav	G227	G227_MA2_BRC
G228_MainQandRQ.sav	G228	G228_MA2_OC
G227_PA.sav	G227	G227_MA2_OC
G0G1_Q.sav	G0G1	G0G1_MENS_R8
G228_MainQandRQ.sav	G228	G228_MENS8
G227_PA.sav	G227	G227_MENS8
G227_PA.sav	G227	G227_MG_AGE
G228_MainQandRQ.sav	G228	G228_MG_BRC
G227_PA.sav	G227	G227_MG_BRC
G228_MainQandRQ.sav	G228	G228_MG_OC
G227_PA.sav	G227	G227_MG_OC
G227_PA.sav	G227	G227_MO_AGE
G228_MainQandRQ.sav	G228	G228_MO_BRC
G227_PA.sav	G227	G227_MO_BRC
G228_MainQandRQ.sav	G228	G228_MO_OC
G227_PA.sav	G227	G227_MO_OC
G228_MainQandRQ.sav	G228	G228_OR1_BRC
G228_MainQandRQ.sav	G228	G228_OR1_BRC_OTH
G228_MainQandRQ.sav	G228	G228_OR1_OC
G228_MainQandRQ.sav	G228	G228_OR1_OC_OTH
G228_MainQandRQ.sav	G228	G228_OR2_BRC
G228_MainQandRQ.sav	G228	G228_OR2_BRC_OTH
G0G1_Q.sav	G0G1	G0G1_OVC_D1
G0G1_Q.sav	G0G1	G0G1_OVC_D2
G0G1_Q.sav	G0G1	G0G1_OVC_D3
G0G1_Q.sav	G0G1	G0G1_OVC_D4
G0G1_Q.sav	G0G1	G0G1_OVC_D5
G0G1_Q.sav	G0G1	G0G1_OVC_MA1
G0G1_Q.sav	G0G1	G0G1_OVC_MA2
G0G1_Q.sav	G0G1	G0G1_OVC_MA3
G0G1_Q.sav	G0G1	G0G1_OVC_MA4
G0G1_Q.sav	G0G1	G0G1_OVC_MA5
G0G1_Q.sav	G0G1	G0G1_OVC_MG
G0G1_Q.sav	G0G1	G0G1_OVC_MO
G0G1_Q.sav	G0G1	G0G1_OVC_PA1
G0G1_Q.sav	G0G1	G0G1_OVC_PA2
G0G1_Q.sav	G0G1	G0G1_OVC_PA3
G0G1_Q.sav	G0G1	G0G1_OVC_PA4
G0G1_Q.sav	G0G1	G0G1_OVC_PA5
G0G1_Q.sav	G0G1	G0G1_OVC_PG
G0G1_Q.sav	G0G1	G0G1_OVC_S1
G0G1_Q.sav	G0G1	G0G1_OVC_S2
G0G1_Q.sav	G0G1	G0G1_OVC_S3
G0G1_Q.sav	G0G1	G0G1_OVC_S4
G0G1_Q.sav	G0G1	G0G1_OVC_S5
G227_PA.sav	G227	G227_PA1_AGE
G228_MainQandRQ.sav	G228	G228_PA1_BRC
G227_PA.sav	G227	G227_PA1_BRC
G228_MainQandRQ.sav	G228	G228_PA1_OC
G227_PA.sav	G227	G227_PA1_OC
G227_PA.sav	G227	G227_PA2_AGE
G228_MainQandRQ.sav	G228	G228_PA2_BRC
G227_PA.sav	G227	G227_PA2_BRC
G228_MainQandRQ.sav	G228	G228_PA2_OC
G227_PA.sav	G227	G227_PA2_OC
G227_PA.sav	G227	G227_PG_AGE
G228_MainQandRQ.sav	G228	G228_PG_BRC
G227_PA.sav	G227	G227_PG_BRC
G228_MainQandRQ.sav	G228	G228_PG_CBF
G227_PA.sav	G227	G227_PG_CBF
G0G1_Q.sav	G0G1	G0G1_PG_CBF
G228_MainQandRQ.sav	G228	G228_PG_OC
G227_PA.sav	G227	G227_PG_OC
G228_MainQandRQ.sav	G228	G228_PG1_BF
G227_PA.sav	G227	G227_PG1_BF
G0G1_Q.sav	G0G1	G0G1_PG1_BF
G228_MainQandRQ.sav	G228	G228_PG1_BF_MON
G0G1_Q.sav	G0G1	G0G1_PG1_BF_MON
G227_PA.sav	G227	G227_PG1_BF_WK
G228_MainQandRQ.sav	G228	G228_PG2_BF
G227_PA.sav	G227	G227_PG2_BF
G0G1_Q.sav	G0G1	G0G1_PG2_BF
G228_MainQandRQ.sav	G228	G228_PG2_BF_MON
G0G1_Q.sav	G0G1	G0G1_PG2_BF_MON
G227_PA.sav	G227	G227_PG2_BF_WK
G228_MainQandRQ.sav	G228	G228_PG3_BF
G227_PA.sav	G227	G227_PG3_BF
G0G1_Q.sav	G0G1	G0G1_PG3_BF
G228_MainQandRQ.sav	G228	G228_PG3_BF_MON
G0G1_Q.sav	G0G1	G0G1_PG3_BF_MON
G227_PA.sav	G227	G227_PG3_BF_WK
G228_MainQandRQ.sav	G228	G228_PG4_BF
G227_PA.sav	G227	G227_PG4_BF
G0G1_Q.sav	G0G1	G0G1_PG4_BF
G228_MainQandRQ.sav	G228	G228_PG4_BF_MON
G0G1_Q.sav	G0G1	G0G1_PG4_BF_MON
G227_PA.sav	G227	G227_PG4_BF_WK
G228_MainQandRQ.sav	G228	G228_PG5_BF
G0G1_Q.sav	G0G1	G0G1_PG5_BF
G228_MainQandRQ.sav	G228	G228_PG5_BF_MON
G0G1_Q.sav	G0G1	G0G1_PG5_BF_MON
G0G1_Q.sav	G0G1	G0G1_PG6_BF
G0G1_Q.sav	G0G1	G0G1_PG6_BF_MON
G228_MainQandRQ.sav	G228	G228_PIERL
G227_PA.sav	G227	G227_PIERL
G0G1_PA.sav	G0G1	G0G1_PIERL
G228_MainQandRQ.sav	G228	G228_PIERR
G227_PA.sav	G227	G227_PIERR
G0G1_PA.sav	G0G1	G0G1_PIERR
G228_MainQandRQ.sav	G228	G228_SCAR_LC
G0G1_PA.sav	G0G1	G0G1_SCAR_LC
G228_MainQandRQ.sav	G228	G228_SCAR_LLIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LLIQ
G228_MainQandRQ.sav	G228	G228_SCAR_LLOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LLOQ
G228_MainQandRQ.sav	G228	G228_SCAR_LUIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LUIQ
G228_MainQandRQ.sav	G228	G228_SCAR_LUOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LUOQ
G228_MainQandRQ.sav	G228	G228_SCAR_RC
G0G1_PA.sav	G0G1	G0G1_SCAR_RC
G228_MainQandRQ.sav	G228	G228_SCAR_RLIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RLIQ
G228_MainQandRQ.sav	G228	G228_SCAR_RLOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RLOQ
G228_MainQandRQ.sav	G228	G228_SCAR_RUIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RUIQ
G228_MainQandRQ.sav	G228	G228_SCAR_RUOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RUOQ
G228_MainQandRQ.sav	G228	G228_SCARS
G227_PA.sav	G227	G227_SCARS
G0G1_PA.sav	G0G1	G0G1_SCARS
G228_MainQandRQ.sav	G228	G228_MA1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MA1_OC_AGE
G228_MainQandRQ.sav	G228	G228_MA2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MA2_OC_AGE
G228_MainQandRQ.sav	G228	G228_MG_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MG_OC_AGE
G228_MainQandRQ.sav	G228	G228_MO_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MO_OC_AGE
G228_MainQandRQ.sav	G228	G228_OR1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_OR1_OC_AGE
G228_MainQandRQ.sav	G228	G228_OR2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PA1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PA1_OC_AGE
G228_MainQandRQ.sav	G228	G228_PA2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PA2_OC_AGE
G228_MainQandRQ.sav	G228	G228_PG_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PG_OC_AGE
G227_PA.sav	G227	G227_SIS1_AGE
G228_MainQandRQ.sav	G228	G228_SIS1_BRC
G227_PA.sav	G227	G227_SIS1_BRC
G228_MainQandRQ.sav	G228	G228_SIS1_OC
G227_PA.sav	G227	G227_SIS1_OC
G227_PA.sav	G227	G227_SIS2_AGE
G228_MainQandRQ.sav	G228	G228_SIS2_BRC
G227_PA.sav	G227	G227_SIS2_BRC
G228_MainQandRQ.sav	G228	G228_SIS2_OC
G227_PA.sav	G227	G227_SIS2_OC
G227_PA.sav	G227	G227_SIS3_AGE
G228_MainQandRQ.sav	G228	G228_SIS3_BRC
G227_PA.sav	G227	G227_SIS3_BRC
G228_MainQandRQ.sav	G228	G228_SIS3_OC
G227_PA.sav	G227	G227_SIS3_OC
G228_MainQandRQ.sav	G228	G228_SIS1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_SIS1_OC_AGE
G228_MainQandRQ.sav	G228	G228_SIS2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_SIS2_OC_AGE
G228_MainQandRQ.sav	G228	G228_SIS3_BRC_AGE
G228_MainQandRQ.sav	G228	G228_SIS3_OC_AGE
G228_MainQandRQ.sav	G228	G228_TATT
G227_PA.sav	G227	G227_TATT
G0G1_PA.sav	G0G1	G0G1_TATT
G228_MainQandRQ.sav	G228	G228_TATT_LC
G0G1_PA.sav	G0G1	G0G1_TATT_LC
G228_MainQandRQ.sav	G228	G228_TATT_LLIQ
G0G1_PA.sav	G0G1	G0G1_TATT_LLIQ
G228_MainQandRQ.sav	G228	G228_TATT_LLOQ
G0G1_PA.sav	G0G1	G0G1_TATT_LLOQ
G228_MainQandRQ.sav	G228	G228_TATT_LUIQ
G0G1_PA.sav	G0G1	G0G1_TATT_LUIQ
G228_MainQandRQ.sav	G228	G228_TATT_LUOQ
G0G1_PA.sav	G0G1	G0G1_TATT_LUOQ
G228_MainQandRQ.sav	G228	G228_TATT_RC
G0G1_PA.sav	G0G1	G0G1_TATT_RC
G228_MainQandRQ.sav	G228	G228_TATT_RLIQ
G0G1_PA.sav	G0G1	G0G1_TATT_RLIQ
G228_MainQandRQ.sav	G228	G228_TATT_RLOQ
G0G1_PA.sav	G0G1	G0G1_TATT_RLOQ
G228_MainQandRQ.sav	G228	G228_TATT_RUIQ
G0G1_PA.sav	G0G1	G0G1_TATT_RUIQ
G228_MainQandRQ.sav	G228	G228_TATT_RUOQ
G0G1_PA.sav	G0G1	G0G1_TATT_RUOQ"""

In [None]:
variables.splitlines()[:5]

['G0G1_Q.sav\tG0G1\tG0G1_BRCA_D1',
 'G0G1_Q.sav\tG0G1\tG0G1_BRCA_D2',
 'G0G1_Q.sav\tG0G1\tG0G1_BRCA_D3',
 'G0G1_Q.sav\tG0G1\tG0G1_BRCA_D4',
 'G0G1_Q.sav\tG0G1\tG0G1_BRCA_D5']

How can I convert this into a Polars DataFrame by splitting each line on "\t"? I want three columns: the first value in each line ("File"), the dataset ("Dataset") and the variable ("Variable").

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†

You're on the right track with `splitlines()`! Now you need to:

1. Split each line on `"\t"` to get the three values
2. Create a Polars DataFrame from that data

A couple of hints:
- You can use a list comprehension to split each line: `[line.split("\t") for line in ...]`
- `pl.DataFrame` can take a list of lists, but you'll need to name the columns with the `schema` parameter and pass `orient="row"` so that each inner list is treated as a row

Would you like to try building on your `variables.splitlines()` output? Let me know if you get stuck on any part!

In [None]:
lines = [line.split("\t") for line in variables.splitlines()]
lines[:5]

[['G0G1_Q.sav', 'G0G1', 'G0G1_BRCA_D1'],
 ['G0G1_Q.sav', 'G0G1', 'G0G1_BRCA_D2'],
 ['G0G1_Q.sav', 'G0G1', 'G0G1_BRCA_D3'],
 ['G0G1_Q.sav', 'G0G1', 'G0G1_BRCA_D4'],
 ['G0G1_Q.sav', 'G0G1', 'G0G1_BRCA_D5']]

In [None]:
vars = pl.DataFrame(lines, schema={"File": str, "Dataset": str, "Variable": str}, orient="row")
vars.head()

File,Dataset,Variable
str,str,str
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D1"""
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D2"""
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D3"""
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D4"""
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D5"""


Write me a function that takes a string input (like `variables`) and returns a DataFrame like `vars`, using the pieces we wrote above.

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†

```python
def parse_variables(text):
    lines = [line.split("\t") for line in text.splitlines()]
    return pl.DataFrame(lines, schema={"File": str, "Dataset": str, "Variable": str}, orient="row")
```

You can test it with:
```python
vars = parse_variables(variables)
vars.head()
```

In [None]:
def parse_variables(text):
    lines = [line.split("\t") for line in text.splitlines()]
    return pl.DataFrame(lines, schema={"File": str, "Dataset": str, "Variable": str}, orient="row")

In [None]:
vars = parse_variables(variables)
vars.head()

File,Dataset,Variable
str,str,str
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D1"""
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D2"""
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D3"""
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D4"""
"""G0G1_Q.sav""","""G0G1""","""G0G1_BRCA_D5"""


So these are all the variables we've filtered and selected in the "value labels" spreadsheet.  
Let's quickly trawl through the SPSS files, pick up the relevant groups of variables, and compare to make sure we've captured everything.  
We'll also investigate the documentation/pro-formas to ensure nothing has slipped through the gaps.

## G227

In [None]:
vars_g227_str = """G227_BR1
G227_BR1_AGE
G227_BR2
G227_BR2_AGE
G227_BR3
G227_BR3_AGE
G227_BR4
G227_BR4_AGE
G227_BR4_SD
G227_BR5
G227_BR5_AGE
G227_BR5_SD
G227_BR6
G227_BR6_AGE
G227_BR6_SD
G227_BROC
G227_BROC_COM
G227_MO_BRC
G227_MO_OC
G227_MO_AGE
G227_SIS1_BRC
G227_SIS1_OC
G227_SIS1_AGE
G227_SIS2_BRC
G227_SIS2_OC
G227_SIS2_AGE
G227_SIS3_BRC
G227_SIS3_OC
G227_SIS3_AGE
G227_MA1_BRC
G227_MA1_OC
G227_MA1_AGE
G227_MA2_BRC
G227_MA2_OC
G227_MA2_AGE
G227_PA1_BRC
G227_PA1_OC
G227_PA1_AGE
G227_PA2_BRC
G227_PA2_OC
G227_PA2_AGE
G227_MG_BRC
G227_MG_OC
G227_MG_AGE
G227_PG_BRC
G227_PG_OC
G227_PG_AGE
G227_AreR
G227_AreL
G227_SCARS
G227_SCARW
G227_SCARL
G227_TATT
G227_TATTW
G227_TATTL
G227_PIERR
G227_PIERL
G227_BR_Col
G227_TiBs_COM"""

vars_g227 = set(vars_g227_str.splitlines())
vars_g227

{'G227_AreL',
 'G227_AreR',
 'G227_BR1',
 'G227_BR1_AGE',
 'G227_BR2',
 'G227_BR2_AGE',
 'G227_BR3',
 'G227_BR3_AGE',
 'G227_BR4',
 'G227_BR4_AGE',
 'G227_BR4_SD',
 'G227_BR5',
 'G227_BR5_AGE',
 'G227_BR5_SD',
 'G227_BR6',
 'G227_BR6_AGE',
 'G227_BR6_SD',
 'G227_BROC',
 'G227_BROC_COM',
 'G227_BR_Col',
 'G227_MA1_AGE',
 'G227_MA1_BRC',
 'G227_MA1_OC',
 'G227_MA2_AGE',
 'G227_MA2_BRC',
 'G227_MA2_OC',
 'G227_MG_AGE',
 'G227_MG_BRC',
 'G227_MG_OC',
 'G227_MO_AGE',
 'G227_MO_BRC',
 'G227_MO_OC',
 'G227_PA1_AGE',
 'G227_PA1_BRC',
 'G227_PA1_OC',
 'G227_PA2_AGE',
 'G227_PA2_BRC',
 'G227_PA2_OC',
 'G227_PG_AGE',
 'G227_PG_BRC',
 'G227_PG_OC',
 'G227_PIERL',
 'G227_PIERR',
 'G227_SCARL',
 'G227_SCARS',
 'G227_SCARW',
 'G227_SIS1_AGE',
 'G227_SIS1_BRC',
 'G227_SIS1_OC',
 'G227_SIS2_AGE',
 'G227_SIS2_BRC',
 'G227_SIS2_OC',
 'G227_SIS3_AGE',
 'G227_SIS3_BRC',
 'G227_SIS3_OC',
 'G227_TATT',
 'G227_TATTL',
 'G227_TATTW',
 'G227_TiBs_COM'}

In [None]:
len(vars_g227)

59

In [None]:
vl_g227 = set(vars.filter(pl.col("File").eq("G227_PA.sav")).get_column("Variable").to_list())
vl_g227

{'G227_BR1',
 'G227_BR1_AGE',
 'G227_BR2',
 'G227_BR2_AGE',
 'G227_BR3',
 'G227_BR3_AGE',
 'G227_BR4',
 'G227_BR4_AGE',
 'G227_BR4_SD',
 'G227_BR5',
 'G227_BR5_AGE',
 'G227_BR5_SD',
 'G227_BR6',
 'G227_BR6_AGE',
 'G227_BR6_SD',
 'G227_BROC',
 'G227_BROC_COM',
 'G227_BR_Col',
 'G227_MA1_AGE',
 'G227_MA1_BRC',
 'G227_MA1_OC',
 'G227_MA2_AGE',
 'G227_MA2_BRC',
 'G227_MA2_OC',
 'G227_MENS8',
 'G227_MG_AGE',
 'G227_MG_BRC',
 'G227_MG_OC',
 'G227_MO_AGE',
 'G227_MO_BRC',
 'G227_MO_OC',
 'G227_PA1_AGE',
 'G227_PA1_BRC',
 'G227_PA1_OC',
 'G227_PA2_AGE',
 'G227_PA2_BRC',
 'G227_PA2_OC',
 'G227_PG1_BF',
 'G227_PG1_BF_WK',
 'G227_PG2_BF',
 'G227_PG2_BF_WK',
 'G227_PG3_BF',
 'G227_PG3_BF_WK',
 'G227_PG4_BF',
 'G227_PG4_BF_WK',
 'G227_PG_AGE',
 'G227_PG_BRC',
 'G227_PG_CBF',
 'G227_PG_OC',
 'G227_PIERL',
 'G227_PIERR',
 'G227_SCARS',
 'G227_SIS1_AGE',
 'G227_SIS1_BRC',
 'G227_SIS1_OC',
 'G227_SIS2_AGE',
 'G227_SIS2_BRC',
 'G227_SIS2_OC',
 'G227_SIS3_AGE',
 'G227_SIS3_BRC',
 'G227_SIS3_OC',
 'G227_T

What's the most concise and robust way to show the variables that exist in `vars_g227` but not in `vars.filter(pl.col("File").eq("G227_PA.sav")).get_column("Variable").to_list()`, and vice versa?

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†

Using Python sets is the cleanest approach:

```python
set_g227 = set(vars_g227)
set_vars = set(vars.filter(pl.col("File").eq("G227_PA.sav")).get_column("Variable").to_list())

# In vars_g227 but not in vars
set_g227 - set_vars

# In vars but not in vars_g227
set_vars - set_g227
```

Or if you want both at once, you can use symmetric difference: `set_g227 ^ set_vars` (shows all items that are in exactly one of the sets).
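
To make the three set operations concrete, here is a small sketch on toy data. The names below are a hand-picked subset for illustration only, not the full variable lists:

```python
# Toy subsets standing in for the two real sets (illustrative only).
spss_vars = {"G227_BR1", "G227_SCARW", "G227_TATT"}
label_vars = {"G227_BR1", "G227_MENS8", "G227_TATT"}

only_in_spss = spss_vars - label_vars      # difference: left minus right
only_in_labels = label_vars - spss_vars    # difference: right minus left
in_exactly_one = spss_vars ^ label_vars    # symmetric difference

# The symmetric difference is the union of the two one-sided differences.
assert in_exactly_one == only_in_spss | only_in_labels

print(sorted(only_in_spss))    # ['G227_SCARW']
print(sorted(only_in_labels))  # ['G227_MENS8']
print(sorted(in_exactly_one))  # ['G227_MENS8', 'G227_SCARW']
```

Note that the symmetric difference does not say which side each name came from, so the two one-sided differences are usually more useful when you need to decide what to fix.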

In [None]:
# From SPSS that HAVE NOT been captured in value labels
vars_g227 - vl_g227

{'G227_AreL',
 'G227_AreR',
 'G227_SCARL',
 'G227_SCARW',
 'G227_TATTL',
 'G227_TATTW',
 'G227_TiBs_COM'}

In [None]:
# From value labels that HAVE NOT been captured in SPSS
vl_g227 - vars_g227

{'G227_MENS8',
 'G227_PG1_BF',
 'G227_PG1_BF_WK',
 'G227_PG2_BF',
 'G227_PG2_BF_WK',
 'G227_PG3_BF',
 'G227_PG3_BF_WK',
 'G227_PG4_BF',
 'G227_PG4_BF_WK',
 'G227_PG_CBF'}

Can you take these parts and turn them into a simple function? I'd input a string of variables and specify the file to filter on, and it would output two sets, each labelled with where the variables are missing from.

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

üß†üß†üß†üß†üß†üß†üß†üß†üß†üß†

```python
def compare_vars(var_string, file_filter, vars_df=vars):
    """Compare variables from a string against those in the filtered DataFrame."""
    from_string = set(var_string.splitlines())
    from_df = set(vars_df.filter(pl.col("File").eq(file_filter)).get_column("Variable").to_list())
    
    # Plain string keys: there are no placeholders, so no f-string prefix is needed
    return {
        "In SPSS but not in value labels": from_string - from_df,
        "In value labels but not in SPSS": from_df - from_string,
    }
```

Usage:

```python
compare_vars(vars_g227_str, "G227_PA.sav")
```

Note: You'll need to keep the original string (before converting to a set) to pass to this function, or adjust the function to accept a set directly. Would you like me to modify it to handle either input type?
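
As a rough sketch of the "either input type" idea, the comparison could be split into a small normalising helper and a pure set comparison. The names `as_var_set` and `diff_vars` are hypothetical, and the sketch assumes the value-label side is passed in as a plain list or set (for example the result of `.get_column("Variable").to_list()`), so it does not depend on Polars:

```python
def as_var_set(variables):
    """Return a set of names from a newline-separated string or any iterable of names."""
    if isinstance(variables, str):
        # Strip whitespace and skip blank lines so stray newlines don't become empty names.
        return {line.strip() for line in variables.splitlines() if line.strip()}
    return set(variables)


def diff_vars(spss_vars, label_vars):
    """Compare two collections of variable names and label where each name is missing."""
    spss, labels = as_var_set(spss_vars), as_var_set(label_vars)
    return {
        "In SPSS but not in value labels": sorted(spss - labels),
        "In value labels but not in SPSS": sorted(labels - spss),
    }


# One side given as a string, the other as a set (toy names for illustration).
result = diff_vars("G227_BR1\nG227_SCARW\n", {"G227_BR1", "G227_MENS8"})
print(result)
```

Returning sorted lists rather than sets keeps the output stable between runs, which makes it easier to eyeball changes after each round of fixes.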

In [None]:
def compare_vars(var_string, file_filter, vars_df=vars):
    """Compare variables from a string against those in the filtered DataFrame."""
    from_string = set(var_string.splitlines())
    from_df = set(vars_df.filter(pl.col("File").eq(file_filter)).get_column("Variable").to_list())
    
    print("In SPSS but not in value labels", sorted(from_string - from_df), "\nIn value labels but not in SPSS", sorted(from_df - from_string))

In [None]:
compare_vars(vars_g227_str, "G227_PA.sav")

In SPSS but not in value labels ['G227_AreL', 'G227_AreR', 'G227_SCARL', 'G227_SCARW', 'G227_TATTL', 'G227_TATTW', 'G227_TiBs_COM'] 
In value labels but not in SPSS ['G227_MENS8', 'G227_PG1_BF', 'G227_PG1_BF_WK', 'G227_PG2_BF', 'G227_PG2_BF_WK', 'G227_PG3_BF', 'G227_PG3_BF_WK', 'G227_PG4_BF', 'G227_PG4_BF_WK', 'G227_PG_CBF']


Great, I've now amended these differences. The value-label variables not found in SPSS were breastfeeding questions rather than the TiBs variables I'm investigating, so that was useful to flag!  
I'll fix up the differences in the other datasets first, and then re-run these checks to ensure I've captured everything I want, and nothing extra.
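
One thing to watch when re-running: `compare_vars` takes `vars_df=vars` as a default argument, and Python evaluates defaults once, when the `def` runs. Rebinding `vars` to a freshly parsed DataFrame later will not change what the function compares against unless the function is re-defined or the new DataFrame is passed explicitly. A minimal illustration with toy names (`table` and `check` are hypothetical stand-ins):

```python
table = ["old"]

def check(data=table):   # the default is evaluated once, when `def` runs
    return data

table = ["new"]           # rebinding the name does not touch the stored default

print(check())            # ['old']
print(check(table))       # ['new']
```

Passing the DataFrame explicitly, as in `compare_vars(vars_g227_str, "G227_PA.sav", vars_df=vars)`, avoids the stale default.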

## G228

In [None]:
vars_g228_str = """G228_BR1
G228_BR1_AGE
G228_BR2
G228_BR2_AGE
G228_BR3
G228_BR3_AGE
G228_BR4
G228_BR4_AGE
G228_BR4_SD
G228_BR5
G228_BR5_AGE
G228_BR5_SD
G228_BR6
G228_BR6_AGE
G228_BR6_SD
G228_BROC
G228_MO_BRC
G228_SIS1_BRC
G228_SIS2_BRC
G228_SIS3_BRC
G228_MA1_BRC
G228_MA2_BRC
G228_PA1_BRC
G228_PA2_BRC
G228_MG_BRC
G228_PG_BRC
G228_OR1_BRC
G228_OR2_BRC
G228_OR1_BRC_OTH
G228_OR2_BRC_OTH
G228_MO_BRC_AGE
G228_SIS1_BRC_AGE
G228_SIS2_BRC_AGE
G228_SIS3_BRC_AGE
G228_MA1_BRC_AGE
G228_MA2_BRC_AGE
G228_PA1_BRC_AGE
G228_PA2_BRC_AGE
G228_MG_BRC_AGE
G228_PG_BRC_AGE
G228_OR1_BRC_AGE
G228_OR2_BRC_AGE
G228_MO_OC
G228_SIS1_OC
G228_SIS2_OC
G228_SIS3_OC
G228_MA1_OC
G228_MA2_OC
G228_PA1_OC
G228_PA2_OC
G228_MG_OC
G228_PG_OC
G228_OR1_OC
G228_OR1_OC_OTH
G228_MO_OC_AGE
G228_SIS1_OC_AGE
G228_SIS2_OC_AGE
G228_SIS3_OC_AGE
G228_MA1_OC_AGE
G228_MA2_OC_AGE
G228_PA1_OC_AGE
G228_PA2_OC_AGE
G228_MG_OC_AGE
G228_PG_OC_AGE
G228_OR1_OC_AGE
G228_ARER
G228_AREL
G228_SCARS
G228_TATT
G228_SCARW
G228_SCARL
G228_TATTW
G228_TATTL
G228_SCAR_RUOQ
G228_SCAR_RUIQ
G228_SCAR_RLIQ
G228_SCAR_RLOQ
G228_SCAR_RC
G228_SCAR_LUIQ
G228_SCAR_LUOQ
G228_SCAR_LLOQ
G228_SCAR_LLIQ
G228_SCAR_LC
G228_TATT_RUOQ
G228_TATT_RUIQ
G228_TATT_RLIQ
G228_TATT_RLOQ
G228_TATT_RC
G228_TATT_LUIQ
G228_TATT_LUOQ
G228_TATT_LLOQ
G228_TATT_LLIQ
G228_TATT_LC
G228_PIER
G228_PIERR
G228_PIERL
G228_BR_COL"""

In [None]:
compare_vars(vars_g228_str, "G228_MainQandRQ.sav")

In SPSS but not in value labels ['G228_AREL', 'G228_ARER', 'G228_PIER', 'G228_SCARL', 'G228_SCARW', 'G228_TATTL', 'G228_TATTW'] 
In value labels but not in SPSS ['G228_MENS8', 'G228_PG1_BF', 'G228_PG1_BF_MON', 'G228_PG2_BF', 'G228_PG2_BF_MON', 'G228_PG3_BF', 'G228_PG3_BF_MON', 'G228_PG4_BF', 'G228_PG4_BF_MON', 'G228_PG5_BF', 'G228_PG5_BF_MON', 'G228_PG_CBF']


## G0G1

In [None]:
vars_g0g1_pa_str = """G0G1_BR_COL
G0G1_AREL
G0G1_ARER
G0G1_PIER
G0G1_PIERL
G0G1_PIERR
G0G1_SCARS
G0G1_SCARL
G0G1_SCARW
G0G1_SCAR_LC
G0G1_SCAR_LUIQ
G0G1_SCAR_LUOQ
G0G1_SCAR_LLIQ
G0G1_SCAR_LLOQ
G0G1_SCAR_RC
G0G1_SCAR_RUOQ
G0G1_SCAR_RUIQ
G0G1_SCAR_RLOQ
G0G1_SCAR_RLIQ
G0G1_TATT
G0G1_TATTL
G0G1_TATTW
G0G1_TATT_LC
G0G1_TATT_LUIQ
G0G1_TATT_LUOQ
G0G1_TATT_LLIQ
G0G1_TATT_LLOQ
G0G1_TATT_RC
G0G1_TATT_RUOQ
G0G1_TATT_RUIQ
G0G1_TATT_RLOQ
G0G1_TATT_RLIQ
G0G1_TIBS_COM"""

In [None]:
compare_vars(vars_g0g1_pa_str, "G0G1_PA.sav")

In SPSS but not in value labels ['G0G1_AREL', 'G0G1_ARER', 'G0G1_PIER', 'G0G1_SCARL', 'G0G1_SCARW', 'G0G1_TATTL', 'G0G1_TATTW', 'G0G1_TIBS_COM'] 
In value labels but not in SPSS ['G0G1_BREXA']


In [None]:
vars_g0g1_str = """G0G1_BRS1
G0G1_BRS2
G0G1_BRS3
G0G1_BRSA3
G0G1_BRS4
G0G1_BRSA4_1
G0G1_BRSA4_2
G0G1_BRSA4_3
G0G1_BRSS4_1
G0G1_BRSS4_2
G0G1_BRSS4_3
G0G1_BRS5
G0G1_BRSA5
G0G1_BRSS5
G0G1_BRS7
G0G1_BRSA7
G0G1_BRSS7
G0G1_BRS6
G0G1_BRSA6_1
G0G1_BRSA6_2
G0G1_BRSS6_1
G0G1_BRSS6_2
G0G1_FH_BROV
G0G1_BRC_MO
G0G1_OVC_MO
G0G1_BRCA_MO
G0G1_OVCA_MO
G0G1_REL_SIS
G0G1_BRC_S1
G0G1_BRC_S2
G0G1_BRC_S3
G0G1_BRC_S4
G0G1_BRC_S5
G0G1_BRCA_S1
G0G1_BRCA_S2
G0G1_BRCA_S3
G0G1_BRCA_S4
G0G1_BRCA_S5
G0G1_OVC_S1
G0G1_OVC_S2
G0G1_OVC_S3
G0G1_OVC_S4
G0G1_OVC_S5
G0G1_OVCA_S1
G0G1_OVCA_S2
G0G1_OVCA_S3
G0G1_OVCA_S4
G0G1_OVCA_S5
G0G1_REL_DAU
G0G1_BRC_D1
G0G1_BRC_D2
G0G1_BRC_D3
G0G1_BRC_D4
G0G1_BRC_D5
G0G1_BRCA_D1
G0G1_BRCA_D2
G0G1_BRCA_D3
G0G1_BRCA_D4
G0G1_BRCA_D5
G0G1_OVC_D1
G0G1_OVC_D2
G0G1_OVC_D3
G0G1_OVC_D4
G0G1_OVC_D5
G0G1_OVCA_D1
G0G1_OVCA_D2
G0G1_OVCA_D3
G0G1_OVCA_D4
G0G1_OVCA_D5
G0G1_REL_PA
G0G1_BRC_PA1
G0G1_BRC_PA2
G0G1_BRC_PA3
G0G1_BRC_PA4
G0G1_BRC_PA5
G0G1_BRCA_PA1
G0G1_BRCA_PA2
G0G1_BRCA_PA3
G0G1_BRCA_PA4
G0G1_BRCA_PA5
G0G1_OVC_PA1
G0G1_OVC_PA2
G0G1_OVC_PA3
G0G1_OVC_PA4
G0G1_OVC_PA5
G0G1_OVCA_PA1
G0G1_OVCA_PA2
G0G1_OVCA_PA3
G0G1_OVCA_PA4
G0G1_OVCA_PA5
G0G1_REL_MA
G0G1_BRC_MA1
G0G1_BRC_MA2
G0G1_BRC_MA3
G0G1_BRC_MA4
G0G1_BRC_MA5
G0G1_BRCA_MA1
G0G1_BRCA_MA2
G0G1_BRCA_MA3
G0G1_BRCA_MA4
G0G1_BRCA_MA5
G0G1_OVC_MA1
G0G1_OVC_MA2
G0G1_OVC_MA3
G0G1_OVC_MA4
G0G1_OVC_MA5
G0G1_OVCA_MA1
G0G1_OVCA_MA2
G0G1_OVCA_MA3
G0G1_OVCA_MA4
G0G1_OVCA_MA5
G0G1_BRC_MG
G0G1_OVC_MG
G0G1_BRC_PG
G0G1_OVC_PG
G0G1_BRCA_MG
G0G1_OVCA_MG
G0G1_BRCA_PG
G0G1_OVCA_PG"""

In [None]:
compare_vars(vars_g0g1_str, "G0G1_Q.sav")

In SPSS but not in value labels ['G0G1_REL_DAU', 'G0G1_REL_MA', 'G0G1_REL_PA', 'G0G1_REL_SIS'] 
In value labels but not in SPSS ['G0G1_MENS_R8', 'G0G1_PG1_BF', 'G0G1_PG1_BF_MON', 'G0G1_PG2_BF', 'G0G1_PG2_BF_MON', 'G0G1_PG3_BF', 'G0G1_PG3_BF_MON', 'G0G1_PG4_BF', 'G0G1_PG4_BF_MON', 'G0G1_PG5_BF', 'G0G1_PG5_BF_MON', 'G0G1_PG6_BF', 'G0G1_PG6_BF_MON', 'G0G1_PG_CBF']


## Checking we have all the variables we want

Let's take the latest version of our now-cleaned value labels spreadsheet as input, combine all the SPSS variables we found, and assert that the two sets are identical.
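
A minimal sketch of what that final check could look like, using plain Python sets. The helper names `parse_label_vars` and `combine_spss_vars` are hypothetical, and the toy inputs stand in for the real strings (`value_labels`, `vars_g227_str`, `vars_g228_str`, `vars_g0g1_pa_str` and `vars_g0g1_str`):

```python
def parse_label_vars(text):
    """Collect the third tab-separated field (the variable name) from each non-blank line."""
    return {line.split("\t")[2].strip() for line in text.splitlines() if line.strip()}


def combine_spss_vars(*var_strings):
    """Union the newline-separated variable lists gathered from each SPSS file."""
    combined = set()
    for var_string in var_strings:
        combined |= {line.strip() for line in var_string.splitlines() if line.strip()}
    return combined


# Toy inputs for illustration; in the notebook these would be the real strings.
toy_labels = "G227_PA.sav\tG227\tG227_BR1\nG0G1_PA.sav\tG0G1\tG0G1_PIER"
toy_g227 = "G227_BR1"
toy_g0g1_pa = "G0G1_PIER"

spss_all = combine_spss_vars(toy_g227, toy_g0g1_pa)
labels_all = parse_label_vars(toy_labels)

# Report anything present on only one side, so a failure says what to fix.
mismatch = spss_all ^ labels_all
assert not mismatch, f"Variables present on only one side: {sorted(mismatch)}"
```

Putting the symmetric difference in the assertion message means a failed check names the offending variables directly instead of just reporting that the sets differ.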

In [None]:
value_labels = """G0G1_Q.sav	G0G1	G0G1_BRCA_D1
G0G1_Q.sav	G0G1	G0G1_BRCA_D2
G0G1_Q.sav	G0G1	G0G1_BRCA_D3
G0G1_Q.sav	G0G1	G0G1_BRCA_D4
G0G1_Q.sav	G0G1	G0G1_BRCA_D5
G0G1_Q.sav	G0G1	G0G1_BRCA_MA1
G0G1_Q.sav	G0G1	G0G1_BRCA_MA2
G0G1_Q.sav	G0G1	G0G1_BRCA_MA3
G0G1_Q.sav	G0G1	G0G1_BRCA_MA4
G228_MainQandRQ.sav	G228	G228_AREL
G227_PA.sav	G227	G227_AreL
G0G1_PA.sav	G0G1	G0G1_AREL
G228_MainQandRQ.sav	G228	G228_ARER
G227_PA.sav	G227	G227_AreR
G0G1_PA.sav	G0G1	G0G1_ARER
G228_MainQandRQ.sav	G228	G228_BR_COL
G227_PA.sav	G227	G227_BR_Col
G0G1_PA.sav	G0G1	G0G1_BR_COL
G228_MainQandRQ.sav	G228	G228_BR1
G227_PA.sav	G227	G227_BR1
G228_MainQandRQ.sav	G228	G228_BR1_AGE
G227_PA.sav	G227	G227_BR1_AGE
G228_MainQandRQ.sav	G228	G228_BR2
G227_PA.sav	G227	G227_BR2
G228_MainQandRQ.sav	G228	G228_BR2_AGE
G227_PA.sav	G227	G227_BR2_AGE
G228_MainQandRQ.sav	G228	G228_BR3
G227_PA.sav	G227	G227_BR3
G228_MainQandRQ.sav	G228	G228_BR3_AGE
G227_PA.sav	G227	G227_BR3_AGE
G228_MainQandRQ.sav	G228	G228_BR4
G227_PA.sav	G227	G227_BR4
G228_MainQandRQ.sav	G228	G228_BR4_AGE
G227_PA.sav	G227	G227_BR4_AGE
G228_MainQandRQ.sav	G228	G228_BR4_SD
G227_PA.sav	G227	G227_BR4_SD
G228_MainQandRQ.sav	G228	G228_BR5
G227_PA.sav	G227	G227_BR5
G228_MainQandRQ.sav	G228	G228_BR5_AGE
G227_PA.sav	G227	G227_BR5_AGE
G228_MainQandRQ.sav	G228	G228_BR5_SD
G227_PA.sav	G227	G227_BR5_SD
G228_MainQandRQ.sav	G228	G228_BR6
G227_PA.sav	G227	G227_BR6
G228_MainQandRQ.sav	G228	G228_BR6_AGE
G227_PA.sav	G227	G227_BR6_AGE
G228_MainQandRQ.sav	G228	G228_BR6_SD
G227_PA.sav	G227	G227_BR6_SD
G0G1_Q.sav	G0G1	G0G1_BRC_D1
G0G1_Q.sav	G0G1	G0G1_BRC_D2
G0G1_Q.sav	G0G1	G0G1_BRC_D3
G0G1_Q.sav	G0G1	G0G1_BRC_D4
G0G1_Q.sav	G0G1	G0G1_BRC_D5
G0G1_Q.sav	G0G1	G0G1_BRC_MA1
G0G1_Q.sav	G0G1	G0G1_BRC_MA2
G0G1_Q.sav	G0G1	G0G1_BRC_MA3
G0G1_Q.sav	G0G1	G0G1_BRC_MA4
G0G1_Q.sav	G0G1	G0G1_BRC_MA5
G0G1_Q.sav	G0G1	G0G1_BRC_MG
G0G1_Q.sav	G0G1	G0G1_BRC_MO
G0G1_Q.sav	G0G1	G0G1_BRC_PA1
G0G1_Q.sav	G0G1	G0G1_BRC_PA2
G0G1_Q.sav	G0G1	G0G1_BRC_PA3
G0G1_Q.sav	G0G1	G0G1_BRC_PA4
G0G1_Q.sav	G0G1	G0G1_BRC_PA5
G0G1_Q.sav	G0G1	G0G1_BRC_PG
G0G1_Q.sav	G0G1	G0G1_BRC_S1
G0G1_Q.sav	G0G1	G0G1_BRC_S2
G0G1_Q.sav	G0G1	G0G1_BRC_S3
G0G1_Q.sav	G0G1	G0G1_BRC_S4
G0G1_Q.sav	G0G1	G0G1_BRC_S5
G0G1_Q.sav	G0G1	G0G1_BRCA_MA5
G0G1_Q.sav	G0G1	G0G1_BRCA_MG
G0G1_Q.sav	G0G1	G0G1_BRCA_MO
G0G1_Q.sav	G0G1	G0G1_BRCA_PA1
G0G1_Q.sav	G0G1	G0G1_BRCA_PA2
G0G1_Q.sav	G0G1	G0G1_BRCA_PA3
G0G1_Q.sav	G0G1	G0G1_BRCA_PA4
G0G1_Q.sav	G0G1	G0G1_BRCA_PA5
G0G1_Q.sav	G0G1	G0G1_BRCA_PG
G0G1_Q.sav	G0G1	G0G1_BRCA_S1
G0G1_Q.sav	G0G1	G0G1_BRCA_S2
G0G1_Q.sav	G0G1	G0G1_BRCA_S3
G0G1_Q.sav	G0G1	G0G1_BRCA_S4
G0G1_Q.sav	G0G1	G0G1_BRCA_S5
G0G1_Q.sav	G0G1	G0G1_BRS3
G0G1_Q.sav	G0G1	G0G1_BRS5
G0G1_Q.sav	G0G1	G0G1_BRS6
G0G1_Q.sav	G0G1	G0G1_BRS7
G0G1_Q.sav	G0G1	G0G1_BRSA5
G0G1_Q.sav	G0G1	G0G1_BRSA6_1
G0G1_Q.sav	G0G1	G0G1_BRSA6_2
G0G1_Q.sav	G0G1	G0G1_BRSA7
G0G1_Q.sav	G0G1	G0G1_BRSS5
G228_MainQandRQ.sav	G228	G228_BROC
G227_PA.sav	G227	G227_BROC
G227_PA.sav	G227	G227_BROC_COM
G0G1_Q.sav	G0G1	G0G1_BRS1
G0G1_Q.sav	G0G1	G0G1_BRS2
G0G1_Q.sav	G0G1	G0G1_BRSS7
G0G1_Q.sav	G0G1	G0G1_BRS4
G0G1_Q.sav	G0G1	G0G1_BRSA3
G0G1_Q.sav	G0G1	G0G1_BRSA4_1
G0G1_Q.sav	G0G1	G0G1_BRSA4_2
G0G1_Q.sav	G0G1	G0G1_BRSA4_3
G0G1_Q.sav	G0G1	G0G1_BRSS4_1
G0G1_Q.sav	G0G1	G0G1_BRSS4_2
G0G1_Q.sav	G0G1	G0G1_BRSS4_3
G0G1_Q.sav	G0G1	G0G1_BRSS6_1
G0G1_Q.sav	G0G1	G0G1_BRSS6_2
G0G1_Q.sav	G0G1	G0G1_OVCA_D1
G0G1_Q.sav	G0G1	G0G1_OVCA_D2
G0G1_Q.sav	G0G1	G0G1_OVCA_D3
G0G1_Q.sav	G0G1	G0G1_OVCA_D4
G0G1_Q.sav	G0G1	G0G1_OVCA_D5
G0G1_Q.sav	G0G1	G0G1_OVCA_MA1
G0G1_Q.sav	G0G1	G0G1_OVCA_MA2
G0G1_Q.sav	G0G1	G0G1_OVCA_MA3
G0G1_Q.sav	G0G1	G0G1_OVCA_MA4
G0G1_Q.sav	G0G1	G0G1_OVCA_MA5
G0G1_Q.sav	G0G1	G0G1_OVCA_MG
G0G1_Q.sav	G0G1	G0G1_OVCA_MO
G0G1_Q.sav	G0G1	G0G1_OVCA_PA1
G0G1_Q.sav	G0G1	G0G1_OVCA_PA2
G0G1_Q.sav	G0G1	G0G1_OVCA_PA3
G0G1_Q.sav	G0G1	G0G1_OVCA_PA4
G0G1_Q.sav	G0G1	G0G1_OVCA_PA5
G0G1_Q.sav	G0G1	G0G1_OVCA_PG
G0G1_Q.sav	G0G1	G0G1_OVCA_S1
G0G1_Q.sav	G0G1	G0G1_OVCA_S2
G0G1_Q.sav	G0G1	G0G1_OVCA_S3
G0G1_Q.sav	G0G1	G0G1_OVCA_S4
G0G1_Q.sav	G0G1	G0G1_OVCA_S5
G0G1_Q.sav	G0G1	G0G1_FH_BROV
G227_PA.sav	G227	G227_MA1_AGE
G228_MainQandRQ.sav	G228	G228_MA1_BRC
G227_PA.sav	G227	G227_MA1_BRC
G228_MainQandRQ.sav	G228	G228_MA1_OC
G227_PA.sav	G227	G227_MA1_OC
G227_PA.sav	G227	G227_MA2_AGE
G228_MainQandRQ.sav	G228	G228_MA2_BRC
G227_PA.sav	G227	G227_MA2_BRC
G228_MainQandRQ.sav	G228	G228_MA2_OC
G227_PA.sav	G227	G227_MA2_OC
G227_PA.sav	G227	G227_MG_AGE
G228_MainQandRQ.sav	G228	G228_MG_BRC
G227_PA.sav	G227	G227_MG_BRC
G228_MainQandRQ.sav	G228	G228_MG_OC
G227_PA.sav	G227	G227_MG_OC
G227_PA.sav	G227	G227_MO_AGE
G228_MainQandRQ.sav	G228	G228_MO_BRC
G227_PA.sav	G227	G227_MO_BRC
G228_MainQandRQ.sav	G228	G228_MO_OC
G227_PA.sav	G227	G227_MO_OC
G228_MainQandRQ.sav	G228	G228_OR1_BRC
G228_MainQandRQ.sav	G228	G228_OR1_BRC_OTH
G228_MainQandRQ.sav	G228	G228_OR1_OC
G228_MainQandRQ.sav	G228	G228_OR1_OC_OTH
G228_MainQandRQ.sav	G228	G228_OR2_BRC
G228_MainQandRQ.sav	G228	G228_OR2_BRC_OTH
G0G1_Q.sav	G0G1	G0G1_OVC_D1
G0G1_Q.sav	G0G1	G0G1_OVC_D2
G0G1_Q.sav	G0G1	G0G1_OVC_D3
G0G1_Q.sav	G0G1	G0G1_OVC_D4
G0G1_Q.sav	G0G1	G0G1_OVC_D5
G0G1_Q.sav	G0G1	G0G1_OVC_MA1
G0G1_Q.sav	G0G1	G0G1_OVC_MA2
G0G1_Q.sav	G0G1	G0G1_OVC_MA3
G0G1_Q.sav	G0G1	G0G1_OVC_MA4
G0G1_Q.sav	G0G1	G0G1_OVC_MA5
G0G1_Q.sav	G0G1	G0G1_OVC_MG
G0G1_Q.sav	G0G1	G0G1_OVC_MO
G0G1_Q.sav	G0G1	G0G1_OVC_PA1
G0G1_Q.sav	G0G1	G0G1_OVC_PA2
G0G1_Q.sav	G0G1	G0G1_OVC_PA3
G0G1_Q.sav	G0G1	G0G1_OVC_PA4
G0G1_Q.sav	G0G1	G0G1_OVC_PA5
G0G1_Q.sav	G0G1	G0G1_OVC_PG
G0G1_Q.sav	G0G1	G0G1_OVC_S1
G0G1_Q.sav	G0G1	G0G1_OVC_S2
G0G1_Q.sav	G0G1	G0G1_OVC_S3
G0G1_Q.sav	G0G1	G0G1_OVC_S4
G0G1_Q.sav	G0G1	G0G1_OVC_S5
G227_PA.sav	G227	G227_PA1_AGE
G228_MainQandRQ.sav	G228	G228_PA1_BRC
G227_PA.sav	G227	G227_PA1_BRC
G228_MainQandRQ.sav	G228	G228_PA1_OC
G227_PA.sav	G227	G227_PA1_OC
G227_PA.sav	G227	G227_PA2_AGE
G228_MainQandRQ.sav	G228	G228_PA2_BRC
G227_PA.sav	G227	G227_PA2_BRC
G228_MainQandRQ.sav	G228	G228_PA2_OC
G227_PA.sav	G227	G227_PA2_OC
G227_PA.sav	G227	G227_PG_AGE
G228_MainQandRQ.sav	G228	G228_PG_BRC
G227_PA.sav	G227	G227_PG_BRC
G228_MainQandRQ.sav	G228	G228_PG_OC
G227_PA.sav	G227	G227_PG_OC
G228_MainQandRQ.sav	G228	G228_PIER
G0G1_PA.sav	G0G1	G0G1_PIER
G228_MainQandRQ.sav	G228	G228_PIERL
G227_PA.sav	G227	G227_PIERL
G0G1_PA.sav	G0G1	G0G1_PIERL
G228_MainQandRQ.sav	G228	G228_PIERR
G227_PA.sav	G227	G227_PIERR
G0G1_PA.sav	G0G1	G0G1_PIERR
G0G1_Q.sav	G0G1	G0G1_REL_DAU
G0G1_Q.sav	G0G1	G0G1_REL_MA
G0G1_Q.sav	G0G1	G0G1_REL_PA
G0G1_Q.sav	G0G1	G0G1_REL_SIS
G228_MainQandRQ.sav	G228	G228_SCAR_LC
G0G1_PA.sav	G0G1	G0G1_SCAR_LC
G228_MainQandRQ.sav	G228	G228_SCAR_LLIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LLIQ
G228_MainQandRQ.sav	G228	G228_SCAR_LLOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LLOQ
G228_MainQandRQ.sav	G228	G228_SCAR_LUIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LUIQ
G228_MainQandRQ.sav	G228	G228_SCAR_LUOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_LUOQ
G228_MainQandRQ.sav	G228	G228_SCAR_RC
G0G1_PA.sav	G0G1	G0G1_SCAR_RC
G228_MainQandRQ.sav	G228	G228_SCAR_RLIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RLIQ
G228_MainQandRQ.sav	G228	G228_SCAR_RLOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RLOQ
G228_MainQandRQ.sav	G228	G228_SCAR_RUIQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RUIQ
G228_MainQandRQ.sav	G228	G228_SCAR_RUOQ
G0G1_PA.sav	G0G1	G0G1_SCAR_RUOQ
G228_MainQandRQ.sav	G228	G228_SCARL
G227_PA.sav	G227	G227_SCARL
G0G1_PA.sav	G0G1	G0G1_SCARL
G228_MainQandRQ.sav	G228	G228_SCARS
G227_PA.sav	G227	G227_SCARS
G0G1_PA.sav	G0G1	G0G1_SCARS
G228_MainQandRQ.sav	G228	G228_SCARW
G227_PA.sav	G227	G227_SCARW
G0G1_PA.sav	G0G1	G0G1_SCARW
G228_MainQandRQ.sav	G228	G228_MA1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MA1_OC_AGE
G228_MainQandRQ.sav	G228	G228_MA2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MA2_OC_AGE
G228_MainQandRQ.sav	G228	G228_MG_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MG_OC_AGE
G228_MainQandRQ.sav	G228	G228_MO_BRC_AGE
G228_MainQandRQ.sav	G228	G228_MO_OC_AGE
G228_MainQandRQ.sav	G228	G228_OR1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_OR1_OC_AGE
G228_MainQandRQ.sav	G228	G228_OR2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PA1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PA1_OC_AGE
G228_MainQandRQ.sav	G228	G228_PA2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PA2_OC_AGE
G228_MainQandRQ.sav	G228	G228_PG_BRC_AGE
G228_MainQandRQ.sav	G228	G228_PG_OC_AGE
G227_PA.sav	G227	G227_SIS1_AGE
G228_MainQandRQ.sav	G228	G228_SIS1_BRC
G227_PA.sav	G227	G227_SIS1_BRC
G228_MainQandRQ.sav	G228	G228_SIS1_OC
G227_PA.sav	G227	G227_SIS1_OC
G227_PA.sav	G227	G227_SIS2_AGE
G228_MainQandRQ.sav	G228	G228_SIS2_BRC
G227_PA.sav	G227	G227_SIS2_BRC
G228_MainQandRQ.sav	G228	G228_SIS2_OC
G227_PA.sav	G227	G227_SIS2_OC
G227_PA.sav	G227	G227_SIS3_AGE
G228_MainQandRQ.sav	G228	G228_SIS3_BRC
G227_PA.sav	G227	G227_SIS3_BRC
G228_MainQandRQ.sav	G228	G228_SIS3_OC
G227_PA.sav	G227	G227_SIS3_OC
G228_MainQandRQ.sav	G228	G228_SIS1_BRC_AGE
G228_MainQandRQ.sav	G228	G228_SIS1_OC_AGE
G228_MainQandRQ.sav	G228	G228_SIS2_BRC_AGE
G228_MainQandRQ.sav	G228	G228_SIS2_OC_AGE
G228_MainQandRQ.sav	G228	G228_SIS3_BRC_AGE
G228_MainQandRQ.sav	G228	G228_SIS3_OC_AGE
G228_MainQandRQ.sav	G228	G228_TATT
G227_PA.sav	G227	G227_TATT
G0G1_PA.sav	G0G1	G0G1_TATT
G228_MainQandRQ.sav	G228	G228_TATT_LC
G0G1_PA.sav	G0G1	G0G1_TATT_LC
G228_MainQandRQ.sav	G228	G228_TATT_LLIQ
G0G1_PA.sav	G0G1	G0G1_TATT_LLIQ
G228_MainQandRQ.sav	G228	G228_TATT_LLOQ
G0G1_PA.sav	G0G1	G0G1_TATT_LLOQ
G228_MainQandRQ.sav	G228	G228_TATT_LUIQ
G0G1_PA.sav	G0G1	G0G1_TATT_LUIQ
G228_MainQandRQ.sav	G228	G228_TATT_LUOQ
G0G1_PA.sav	G0G1	G0G1_TATT_LUOQ
G228_MainQandRQ.sav	G228	G228_TATT_RC
G0G1_PA.sav	G0G1	G0G1_TATT_RC
G228_MainQandRQ.sav	G228	G228_TATT_RLIQ
G0G1_PA.sav	G0G1	G0G1_TATT_RLIQ
G228_MainQandRQ.sav	G228	G228_TATT_RLOQ
G0G1_PA.sav	G0G1	G0G1_TATT_RLOQ
G228_MainQandRQ.sav	G228	G228_TATT_RUIQ
G0G1_PA.sav	G0G1	G0G1_TATT_RUIQ
G228_MainQandRQ.sav	G228	G228_TATT_RUOQ
G0G1_PA.sav	G0G1	G0G1_TATT_RUOQ
G228_MainQandRQ.sav	G228	G228_TATTL
G227_PA.sav	G227	G227_TATTL
G0G1_PA.sav	G0G1	G0G1_TATTL
G228_MainQandRQ.sav	G228	G228_TATTW
G227_PA.sav	G227	G227_TATTW
G0G1_PA.sav	G0G1	G0G1_TATTW
G227_PA.sav	G227	G227_TiBs_COM
G0G1_PA.sav	G0G1	G0G1_TIBS_COM"""

In [None]:
vl = parse_variables(value_labels)
vl.height, vl.head()

(308,
 shape: (5, 3)
 ┌────────────┬─────────┬──────────────┐
 │ File       ┆ Dataset ┆ Variable     │
 │ ---        ┆ ---     ┆ ---          │
 │ str        ┆ str     ┆ str          │
 ╞════════════╪═════════╪══════════════╡
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D1 │
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D2 │
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D3 │
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D4 │
 │ G0G1_Q.sav ┆ G0G1    ┆ G0G1_BRCA_D5 │
 └────────────┴─────────┴──────────────┘)

In [None]:
total_columns = vars_g227_str + vars_g228_str + vars_g0g1_pa_str + vars_g0g1_str
total_columns

'G227_BR1\nG227_BR1_AGE\nG227_BR2\nG227_BR2_AGE\nG227_BR3\nG227_BR3_AGE\nG227_BR4\nG227_BR4_AGE\nG227_BR4_SD\nG227_BR5\nG227_BR5_AGE\nG227_BR5_SD\nG227_BR6\nG227_BR6_AGE\nG227_BR6_SD\nG227_BROC\nG227_BROC_COM\nG227_MO_BRC\nG227_MO_OC\nG227_MO_AGE\nG227_SIS1_BRC\nG227_SIS1_OC\nG227_SIS1_AGE\nG227_SIS2_BRC\nG227_SIS2_OC\nG227_SIS2_AGE\nG227_SIS3_BRC\nG227_SIS3_OC\nG227_SIS3_AGE\nG227_MA1_BRC\nG227_MA1_OC\nG227_MA1_AGE\nG227_MA2_BRC\nG227_MA2_OC\nG227_MA2_AGE\nG227_PA1_BRC\nG227_PA1_OC\nG227_PA1_AGE\nG227_PA2_BRC\nG227_PA2_OC\nG227_PA2_AGE\nG227_MG_BRC\nG227_MG_OC\nG227_MG_AGE\nG227_PG_BRC\nG227_PG_OC\nG227_PG_AGE\nG227_AreR\nG227_AreL\nG227_SCARS\nG227_SCARW\nG227_SCARL\nG227_TATT\nG227_TATTW\nG227_TATTL\nG227_PIERR\nG227_PIERL\nG227_BR_Col\nG227_TiBs_COMG228_BR1\nG228_BR1_AGE\nG228_BR2\nG228_BR2_AGE\nG228_BR3\nG228_BR3_AGE\nG228_BR4\nG228_BR4_AGE\nG228_BR4_SD\nG228_BR5\nG228_BR5_AGE\nG228_BR5_SD\nG228_BR6\nG228_BR6_AGE\nG228_BR6_SD\nG228_BROC\nG228_MO_BRC\nG228_SIS1_BRC\nG228_SIS2_BRC\n

In [None]:
cols = total_columns.splitlines()
cols[:5]

['G227_BR1', 'G227_BR1_AGE', 'G227_BR2', 'G227_BR2_AGE', 'G227_BR3']

In [None]:
len(cols)

305

The parsed table has 308 variables but `cols` only has 305, so a few columns still seem to be missing. Let's check.

In [None]:
vl_set = set(vl.get_column("Variable").to_list())
col_set = set(cols)

In [None]:
vl_set - col_set

{'G0G1_BRS1',
 'G0G1_BR_COL',
 'G0G1_TIBS_COM',
 'G227_TiBs_COM',
 'G228_BR1',
 'G228_BR_COL'}

In [None]:
col_set - vl_set

{'G0G1_TIBS_COMG0G1_BRS1', 'G227_TiBs_COMG228_BR1', 'G228_BR_COLG0G1_BR_COL'}

Looks like this is an error due to concatenating the strings. How can we fix this?

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠

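You're right! When you concatenated the strings with `+`, the last line of one string was joined directly to the first line of the next, because there is no newline between them. Each of the three joins fused two names into one bogus entry, which is why `cols` has 305 entries instead of 308.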

Instead of:
```python
total_columns = vars_g227_str + vars_g228_str + vars_g0g1_pa_str + vars_g0g1_str
```

Try:
```python
total_columns = "\n".join([vars_g227_str, vars_g228_str, vars_g0g1_pa_str, vars_g0g1_str])
```
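
To see the failure mode in isolation, here is a minimal sketch with made-up strings (`a` and `b` are illustrative stand-ins, not the notebook's variables). It assumes, as the output above suggests, that none of the strings ends with a newline.

```python
# Toy stand-ins for two variable lists; neither ends with a newline.
a = "X_1\nX_2"
b = "Y_1\nY_2"

# Plain concatenation fuses the last name of `a` with the first name of `b`.
merged = (a + b).splitlines()
print(merged)   # ['X_1', 'X_2Y_1', 'Y_2']

# Joining on "\n" keeps every name on its own line.
joined = "\n".join([a, b]).splitlines()
print(joined)   # ['X_1', 'X_2', 'Y_1', 'Y_2']
```

If any of the strings did end with a newline, `"\n".join` would leave a blank line behind, which `splitlines()` returns as an empty string, so dropping empty entries afterwards would be a sensible extra step.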

In [None]:
total_columns = "\n".join([vars_g227_str, vars_g228_str, vars_g0g1_pa_str, vars_g0g1_str])
cols = total_columns.splitlines()
col_set = set(cols)

In [None]:
vl_set - col_set, col_set - vl_set

(set(), set())

Perfect, we've captured all the variables we're after!
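
One caveat: comparing sets hides duplicates, so two matching sets don't guarantee that the list itself has exactly the expected entries. Below is a minimal sketch of an extra check, using a hypothetical helper `check_names` and toy lists rather than the notebook's `vl` and `cols`.

```python
from collections import Counter

def check_names(expected, actual):
    """Report names missing from, unexpected in, or duplicated in `actual`."""
    counts = Counter(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "duplicated": sorted(name for name, n in counts.items() if n > 1),
    }

# Toy data: "B" appears twice, "C" is absent, "D" was never expected.
report = check_names(expected=["A", "B", "C"], actual=["A", "B", "B", "D"])
print(report)   # {'missing': ['C'], 'unexpected': ['D'], 'duplicated': ['B']}
```

Here the call would be something like `check_names(vl.get_column("Variable").to_list(), cols)`, and all three lists should come back empty if the selection is complete.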