My name is Matt and I'm an intermediate Python programmer with a focus on data cleaning and harmonisation. My role is to *harmonise* data across different follow-ups: the same question captured at different time-points should have the same variable name, label, and field values, while different questions (semantically different, or with different answer options) should be named and labelled differently. The aim is consistency across all follow-ups over time.

I have a fairly solid understanding of the basic foundations of programming and data cleaning/analysis. I like using Polars, and have written a simple library, `banksia`, which wraps `pyreadstat` to format SPSS files in a way that's more manageable for my workflow. I like to use a more functional style of programming, and prefer concise, simple code. I want to learn about new data structures, algorithms, libraries (standard and third-party) and other tips and tricks that help to improve my processes.

In [None]:
import banksia as bk
import polars as pl
from pathlib import Path
from fastcore.utils import *
import fastcore.all as fc, numpy as np, matplotlib.pyplot as plt
import re, math, itertools, functools, types, typing, dataclasses, collections, regex, time, asyncio

In [None]:
INPUT = Path("../data/input")
OUTPUT = Path("../data/output")

In [None]:
vl = pl.read_excel("../changes.xlsx")
vl

Could not determine dtype for column 2, falling back to string


Could not determine dtype for column 4, falling back to string


Could not determine dtype for column 6, falling back to string


Could not determine dtype for column 7, falling back to string


| file | old_var_name | new_var_name | old_var_label | new_var_label | old_field_values | new_field_values | recode | status |
|---|---|---|---|---|---|---|---|---|
| str | str | str | str | str | str | str | str | str |
| "G0G1_PA.sav" | "G0G1_AREL" | | "Areola Size (Diameter): Right … | | "88="Not applicable";99="Not st… | | | "in progress" |
| "G0G1_PA.sav" | "G0G1_ARER" | | "Areola Size (Diameter): Left (… | | "88="Not applicable";99="Not st… | | | "in progress" |
| "G0G1_PA.sav" | "G0G1_BR_COL" | | "Breast skin colour" | | "1="Light";2="Light/Medium";3="… | | | "in progress" |
| "G0G1_PA.sav" | "G0G1_SCAR_LC" | | "Breast Scar location: Left Cen… | | "-99="Missing";-88="N/A";0="No"… | | | "in progress" |
| "G0G1_PA.sav" | "G0G1_SCAR_LLIQ" | | "Breast Scar location: Left Low… | | "-99="Missing";-88="N/A";0="No"… | | | "in progress" |
| … | … | … | … | … | … | … | … | … |
| "G228_MainQandRQ.sav" | "G228_TATTL" | | "Tattoo size: Length (mm)" | | "-99="Missing";-88="N/A"" | | | "in progress" |
| "G228_MainQandRQ.sav" | "G228_TATTW" | | "Tattoo size: Width (mm)" | | "-99="Missing";-88="N/A"" | | | "in progress" |
| "G228_MainQandRQ.sav" | "G228_PIER" | | "Any nipple piercings" | | "-99="Missing";-88="N/A";0="No"… | | | "in progress" |
| "G228_MainQandRQ.sav" | "G228_PIERL" | | "Any nipple piercings - Left br… | | "-99="Missing";-88="N/A";0="No"… | | | "in progress" |


So these are all the variables we've filtered and selected in the "value labels" spreadsheet.  
Let's quickly trawl through the SPSS files, pick up the relevant groups of variables, and compare to make sure we've captured everything.  
We'll also investigate the documentation/pro-formas to ensure nothing has slipped through the gaps.
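
The comparison described above comes down to two set differences per file: names found in the SPSS file but missing from the spreadsheet, and names in the spreadsheet that the SPSS file does not contain. Here is a minimal sketch of that idea using only the standard library. The helper `diff_vars` and the `VarDiff` tuple are hypothetical names for illustration, not part of `banksia` or Polars, and the demo values are a made-up subset rather than real results.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class VarDiff(NamedTuple):
    missing: set[str]  # expected from the SPSS file, absent from the spreadsheet
    extra: set[str]    # present in the spreadsheet, not expected from the SPSS file


def diff_vars(expected: str | Iterable[str], tracked: Iterable[str]) -> VarDiff:
    """Compare expected variable names against those tracked in the spreadsheet."""
    if isinstance(expected, str):
        # Accept a pasted multi-line block; drop blank lines and stray whitespace
        expected = (ln.strip() for ln in expected.splitlines() if ln.strip())
    exp, trk = set(expected), set(tracked)
    return VarDiff(missing=exp - trk, extra=trk - exp)


# Illustrative input only (hypothetical subset, not the real lists)
demo = diff_vars(
    """G227_AreR
G227_AreL
G227_SCARS
""",
    ["G227_AreR", "G227_AreL", "G227_TATT"],
)
print(demo)  # VarDiff(missing={'G227_SCARS'}, extra={'G227_TATT'})
```

Naming the two sides makes the result easier to read than a bare `(set(), set())` tuple, and stripping whitespace guards against trailing spaces picked up when copying names out of SPSS.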

## G227

In [None]:
vars_g227_str = """G227_AreR
G227_AreL
G227_SCARS
G227_SCARW
G227_SCARL
G227_TATT
G227_TATTW
G227_TATTL
G227_PIERR
G227_PIERL
G227_BR_Col
G227_TiBs_COM"""

vars_g227 = set(vars_g227_str.splitlines())
vars_g227

{'G227_AreL',
 'G227_AreR',
 'G227_BR_Col',
 'G227_PIERL',
 'G227_PIERR',
 'G227_SCARL',
 'G227_SCARS',
 'G227_SCARW',
 'G227_TATT',
 'G227_TATTL',
 'G227_TATTW',
 'G227_TiBs_COM'}

In [None]:
vl_g227 = set(vl.filter(pl.col("file").eq("G227_PA.sav")).get_column("old_var_name").to_list())
vl_g227

{'G227_AreL',
 'G227_AreR',
 'G227_BR_Col',
 'G227_PIERL',
 'G227_PIERR',
 'G227_SCARL',
 'G227_SCARS',
 'G227_SCARW',
 'G227_TATT',
 'G227_TATTL',
 'G227_TATTW',
 'G227_TiBs_COM'}

In [None]:
vars_g227 - vl_g227, vl_g227 - vars_g227

(set(), set())

## G228

In [None]:
vars_g228_str = """G228_ARER
G228_AREL
G228_SCARS
G228_TATT
G228_SCARW
G228_SCARL
G228_TATTW
G228_TATTL
G228_SCAR_RUOQ
G228_SCAR_RUIQ
G228_SCAR_RLIQ
G228_SCAR_RLOQ
G228_SCAR_RC
G228_SCAR_LUIQ
G228_SCAR_LUOQ
G228_SCAR_LLOQ
G228_SCAR_LLIQ
G228_SCAR_LC
G228_TATT_RUOQ
G228_TATT_RUIQ
G228_TATT_RLIQ
G228_TATT_RLOQ
G228_TATT_RC
G228_TATT_LUIQ
G228_TATT_LUOQ
G228_TATT_LLOQ
G228_TATT_LLIQ
G228_TATT_LC
G228_PIER
G228_PIERR
G228_PIERL
G228_BR_COL"""

vars_g228 = set(vars_g228_str.splitlines())
vars_g228

{'G228_AREL',
 'G228_ARER',
 'G228_BR_COL',
 'G228_PIER',
 'G228_PIERL',
 'G228_PIERR',
 'G228_SCARL',
 'G228_SCARS',
 'G228_SCARW',
 'G228_SCAR_LC',
 'G228_SCAR_LLIQ',
 'G228_SCAR_LLOQ',
 'G228_SCAR_LUIQ',
 'G228_SCAR_LUOQ',
 'G228_SCAR_RC',
 'G228_SCAR_RLIQ',
 'G228_SCAR_RLOQ',
 'G228_SCAR_RUIQ',
 'G228_SCAR_RUOQ',
 'G228_TATT',
 'G228_TATTL',
 'G228_TATTW',
 'G228_TATT_LC',
 'G228_TATT_LLIQ',
 'G228_TATT_LLOQ',
 'G228_TATT_LUIQ',
 'G228_TATT_LUOQ',
 'G228_TATT_RC',
 'G228_TATT_RLIQ',
 'G228_TATT_RLOQ',
 'G228_TATT_RUIQ',
 'G228_TATT_RUOQ'}

In [None]:
vl_g228 = set(vl.filter(pl.col("file").eq("G228_MainQandRQ.sav")).get_column("old_var_name").to_list())

In [None]:
vars_g228 - vl_g228, vl_g228 - vars_g228

(set(), set())

## G0G1

In [None]:
vars_g0g1_str = """G0G1_BR_COL
G0G1_AREL
G0G1_ARER
G0G1_PIER
G0G1_PIERL
G0G1_PIERR
G0G1_SCARS
G0G1_SCARL
G0G1_SCARW
G0G1_SCAR_LC
G0G1_SCAR_LUIQ
G0G1_SCAR_LUOQ
G0G1_SCAR_LLIQ
G0G1_SCAR_LLOQ
G0G1_SCAR_RC
G0G1_SCAR_RUOQ
G0G1_SCAR_RUIQ
G0G1_SCAR_RLOQ
G0G1_SCAR_RLIQ
G0G1_TATT
G0G1_TATTL
G0G1_TATTW
G0G1_TATT_LC
G0G1_TATT_LUIQ
G0G1_TATT_LUOQ
G0G1_TATT_LLIQ
G0G1_TATT_LLOQ
G0G1_TATT_RC
G0G1_TATT_RUOQ
G0G1_TATT_RUIQ
G0G1_TATT_RLOQ
G0G1_TATT_RLIQ
G0G1_TIBS_COM"""

vars_g0g1 = set(vars_g0g1_str.splitlines())
vars_g0g1

{'G0G1_AREL',
 'G0G1_ARER',
 'G0G1_BR_COL',
 'G0G1_PIER',
 'G0G1_PIERL',
 'G0G1_PIERR',
 'G0G1_SCARL',
 'G0G1_SCARS',
 'G0G1_SCARW',
 'G0G1_SCAR_LC',
 'G0G1_SCAR_LLIQ',
 'G0G1_SCAR_LLOQ',
 'G0G1_SCAR_LUIQ',
 'G0G1_SCAR_LUOQ',
 'G0G1_SCAR_RC',
 'G0G1_SCAR_RLIQ',
 'G0G1_SCAR_RLOQ',
 'G0G1_SCAR_RUIQ',
 'G0G1_SCAR_RUOQ',
 'G0G1_TATT',
 'G0G1_TATTL',
 'G0G1_TATTW',
 'G0G1_TATT_LC',
 'G0G1_TATT_LLIQ',
 'G0G1_TATT_LLOQ',
 'G0G1_TATT_LUIQ',
 'G0G1_TATT_LUOQ',
 'G0G1_TATT_RC',
 'G0G1_TATT_RLIQ',
 'G0G1_TATT_RLOQ',
 'G0G1_TATT_RUIQ',
 'G0G1_TATT_RUOQ',
 'G0G1_TIBS_COM'}

In [None]:
vl_g0g1 = set(vl.filter(pl.col("file").eq("G0G1_PA.sav")).get_column("old_var_name").to_list())

In [None]:
vars_g0g1 - vl_g0g1, vl_g0g1 - vars_g0g1

(set(), set())
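
The three checks above follow the same pattern, so they could be run in a single pass over all files. The sketch below assumes the spreadsheet rows are available as `(file, old_var_name)` pairs (for example via `vl.select("file", "old_var_name").rows()`) and that the expected names are kept in a dict keyed by file name. The function names `tracked_by_file` and `check_all` are hypothetical, and the sample data is invented to show a mismatch.

```python
from collections import defaultdict


def tracked_by_file(rows):
    """Group (file, var_name) pairs into {file: set_of_var_names}."""
    out = defaultdict(set)
    for file, var in rows:
        out[file].add(var)
    return dict(out)


def check_all(expected, tracked):
    """Return {file: (missing, extra)} for every file whose two sets disagree."""
    problems = {}
    for file in expected.keys() | tracked.keys():
        exp = expected.get(file, set())
        trk = tracked.get(file, set())
        if exp != trk:
            problems[file] = (exp - trk, trk - exp)
    return problems


# Hypothetical stand-in for vl.select("file", "old_var_name").rows()
rows = [
    ("G227_PA.sav", "G227_AreR"),
    ("G227_PA.sav", "G227_AreL"),
    ("G228_MainQandRQ.sav", "G228_ARER"),
]

# Hypothetical expected names, e.g. built from the pasted blocks with splitlines()
expected = {
    "G227_PA.sav": {"G227_AreR", "G227_AreL"},
    "G228_MainQandRQ.sav": {"G228_ARER", "G228_AREL"},
}

problems = check_all(expected, tracked_by_file(rows))
print(problems)
```

An empty `problems` dict would correspond to every per-file check returning `(set(), set())`. Files that appear on only one side are also reported, because the loop iterates over the union of both key sets.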