## Health Data Reusability Project

This notebook is an informal investigation into the technologies needed to extract the data contained in "open data" publications from the UK Department of Health. Doing so will allow researchers to automate computations and respond more quickly to changes.

Some of this is ugly, some of it will doubtless be unnecessary, but it shows at least some of the preliminary work that goes into getting one's thinking straightened out about a particular program or set of programs.

In [None]:
import openpyxl as xl

Note that this software cannot read ".xls" files. `wb = xl.load_workbook("data/gpearnextime.xls")` raises an exception, so I took the quick route and converted it to a ".xlsx" file with Word before further processing.

It might be worth investigating the older `xlrd` module, which can read ".xls" files (though sadly there appears to be no easy way to write them out as ".xlsx" files, which I had hoped `xlwt` might provide). I suspect that there will be an easy fix for this, but I'll need to speak to Chris Withers.

In [None]:
wb = xl.load_workbook("data/gpearnextime.xlsx")

In [None]:
wb.sheetnames

In [None]:
ws = wb['1a. GPMS Cash Terms ']  # get_sheet_by_name() is deprecated; note the trailing space in the sheet name

In [None]:
ws["B7"].value

Many of the spreadsheets express an extra dimensionality in the data using a merged cell intended to apply to all rows next to it. Since the cell value is only given once (it can be extracted from the first cell in the merged range, the rest having no value) we need some way to replicate the values as we progress down the sheet.

Maybe a generator function taking the worksheet and the start position as an argument?

In [None]:
for i in range(1, 200):
    print(ws["B{}".format(i)].value, ws["C{}".format(i)].value)

Note that the date values and the footnote numbers run together to give a single string value.
That means some parsing has to be applied to separate it into a `(date, footnote)` pair, whose
second member will be `None` if no notes apply.
From an openness point of view it would be much better to have a separate column for the footnotes that should be applied to the row.
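Here is a rough sketch of that parsing. It assumes (and this is only my assumption about the cell format) that each cell holds a seven-character year label such as "2004/05", optionally followed by comma-separated footnote numbers. The name `split_year_notes` is made up for illustration.

```python
import re

# Assumed cell format: a year label like "2004/05", optionally followed
# directly by comma-separated footnote numbers, e.g. "2004/052,3".
YEAR_NOTES = re.compile(r"^(\d{4}/\d{2})(\d+(?:,\d+)*)?$")

def split_year_notes(text):
    """Return (year, notes); notes is a list of ints, or None if absent."""
    match = YEAR_NOTES.match(text.strip())
    if match is None:
        raise ValueError("Unrecognised year cell: {!r}".format(text))
    year, notes = match.groups()
    if notes is None:
        return year, None
    return year, [int(n) for n in notes.split(",")]

print(split_year_notes("2004/05"))     # ('2004/05', None)
print(split_year_notes("2004/052,3"))  # ('2004/05', [2, 3])
```

Raising on anything that doesn't fit the pattern is deliberate: it flags cells where the assumed format is wrong rather than silently mangling them.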

In [None]:
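def repeat_merge(ws, row, col):
    """Yield the value in column `col` for each row from `row` down,
    repeating the last non-empty value across blank (merged) cells."""
    last_value = ws.cell(row=row, column=col).value
    if last_value is None:
        raise ValueError("Cell sequence must start with a non-empty cell")
    # Stop at the copyright footer, or at the last row if there isn't one.
    # str() guards against non-string cell values such as numbers.
    while row <= ws.max_row and not str(last_value).startswith("Copyright ©"):
        this_cell = ws.cell(row=row, column=col)
        if this_cell.value is not None:
            last_value = this_cell.value
        yield last_value
        row += 1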

titles = repeat_merge(ws, 1, 2)

for i, t in enumerate(titles):
    print(i, t)

Turns out that may not be as useful as I thought. It would probably be easier to maintain the column values as part of the processing logic.

(This was borne out when I wrote a non-terminating loop when experimenting with the code below.)
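As a sketch of what "part of the processing logic" might look like, the loop below carries the last-seen label forward while walking a finite sequence of row tuples, so it cannot run off the end of the sheet. The name `fill_down` and the sample rows are invented for illustration. With a real worksheet, the rows could probably come from `ws.iter_rows(values_only=True)`, assuming a reasonably recent openpyxl.

```python
def fill_down(rows, col=0):
    """Yield copies of each row tuple with None in position `col`
    replaced by the most recent non-empty value above it."""
    last = None
    for row in rows:
        if row[col] is not None:
            last = row[col]
        yield row[:col] + (last,) + row[col + 1:]

# Made-up rows: the label appears once per merged block.
sample = [("GPMS", "2004/05"), (None, "2005/06"), ("PMS", "2004/05"), (None, "2005/06")]
for row in fill_down(sample):
    print(row)
```

Because the function iterates over whatever rows it is handed, termination is the caller's concern and no sentinel text is needed.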

In [None]:
ws["B3"].value

In [None]:
def year_refs(s):
    """Separate the year string into the year plus the list of references"""
    return s[:7], s[7:].split(",")

In [None]:
years = repeat_merge(ws, 6, 3)
for i in range(100):
    print(year_refs(next(years)))

In [None]:
ws["d81"].value

In [None]:
def num_val(val):
    return 0 if val == "-" else val

In [None]:
num_val(32.456)

In [None]:
num_val("-")

In [None]:
3 == "banana"

In [None]:
cell = ws["B3"]

Probably a good idea to look at how we can find the relevant areas in a worksheet, then analyze the content of those areas (which will vary in size, increasing as the years go by).
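One possible way to find those areas, assuming the tables are separated by at least one completely blank row (I haven't verified that this holds for every sheet): collect the numbers of the non-empty rows and group them into consecutive runs. The helper name `row_blocks` is hypothetical.

```python
def row_blocks(row_numbers):
    """Group row numbers into (first, last) runs of consecutive rows."""
    blocks = []
    for n in sorted(row_numbers):
        if blocks and n == blocks[-1][1] + 1:
            # Extend the current run by one row.
            blocks[-1] = (blocks[-1][0], n)
        else:
            # A gap (or the first row) starts a new run.
            blocks.append((n, n))
    return blocks

print(row_blocks([3, 5, 6, 7, 10, 11]))  # [(3, 3), (5, 7), (10, 11)]
```

Since each run is recomputed from the data, a table that grows by a row each year just produces a longer run, with nothing hard-coded.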

In [None]:
ws["B3"].value # Sheet heading

In [None]:
ws["B5"].value # Table heading

In [None]:
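# get_cell_collection() is gone from current openpyxl; flatten iter_rows() instead
cells = [cell for row in ws.iter_rows() for cell in row]

In [None]:
from collections import defaultdict

cols_in_row = defaultdict(list)

for cell in cells:
    if cell.value is not None:
        # column_letter is "A", "B", ...; cell.column is an integer in current openpyxl
        cols_in_row[cell.row].append(cell.column_letter)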

In [None]:
max_row = max(cols_in_row)
max_row

Note that cell J43 has a spurious value that should really be ignored. Wonder how long that's been there and who knows it is ...

In [None]:
cols_in_row[43].remove('J')
cols_in_row[43]

In [None]:
pixels = [] # straight list of pixel values for graphic
matrix = []
#print("  ".join(list("ABCDEFG"))) # Column headings
for row_num in range(max_row):
    cols = cols_in_row[row_num]
    row_string = []
    row_matrix = []
    for col_name in "ABCDEFG":
        row_string.append("*" if col_name in cols else " ")
        row_matrix.append(col_name in cols)
    #print("  ".join(row_string))
    matrix.append(row_matrix)
    pixels += [1-p for p in row_matrix]

In [None]:
str_sizes = "".join(str(sum(row)) for row in matrix)  # one digit per row: count of non-empty cells in A-G

In [None]:
import re
for m in re.finditer("1665", str_sizes):
    print(m.start())

In [None]:
from PIL import Image

im = Image.new("1", (7, 198))

In [None]:
im.putdata(pixels)

In [None]:
im.resize((245, 198*4))

In [None]:
im = ImageColour

In [None]:
im = Image.Image

In [None]:
help(Image)

The crummy visualization shows that, to analyze the cells more fully, we need to focus on column B.

In [None]:
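def next_non_empty(r):
    """Return the first row at or after r with a value in column B, or None."""
    # Bounded by ws.max_row so the search cannot run forever.
    for r in range(r, ws.max_row + 1):
        if ws["B{}".format(r)].value is not None:
            return r
    return None

In [None]:
r = next_non_empty(1)
while r is not None:
    print(r)
    r = next_non_empty(r + 1)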

In [None]:
2