## Welcome to the data cleaning phase of this project!
We currently have a folder `table-data` with a set of uncleaned JSON files. The goal of this notebook is to illustrate the data cleaning process by successively transforming these raw JSON tables into cohesive dataframes with tidy data, ready to be analyzed!

There are 6 different tables for every printing log:
* Overview
* Bioinks
* Material Settings
* Printer Setup
* Hardware Setup
* Printing Log

First, let's import some libraries.

In [154]:
import os
import json
import pandas as pd

Now, let's iterate over the json files and collect them in one large dataframe.

Let's start with cleaning the "overview" table.

The following code creates a dataframe, which is just a big table out of every json file that fits in the category "overview".

# Overview

In [155]:
type = "overview"

data = []
folder_path = 'table-data-cleaned'
# Iterate through each subfolder
for index, subfolder in enumerate(os.listdir(folder_path)):
    subfolder_path = os.path.join(folder_path, subfolder)
    if os.path.isdir(subfolder_path):
        # Check whether "<type>.json" (here "overview.json") exists in the subfolder
        jsonPath = os.path.join(subfolder_path, type + ".json")
        if os.path.exists(jsonPath):
            # Read the content of the JSON file and append its records to the list
            with open(jsonPath, 'r') as f:
                content = json.load(f)
                data.extend(content)
                
# Convert the list to a pandas DataFrame
overviewDf = pd.DataFrame(data)

overviewDf.to_csv("data-frames-raw/" + type + '.csv', index=False)
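
The same loading loop is repeated for every table in this notebook. As a minimal sketch, it could be wrapped in a helper. The name `load_table` is hypothetical and not part of the original notebook, and the subfolders are sorted here for reproducibility, which `os.listdir` alone does not guarantee.

```python
import json
import os

import pandas as pd


def load_table(folder_path, table_name):
    """Collect `<table_name>.json` from every subfolder into one DataFrame."""
    records = []
    for subfolder in sorted(os.listdir(folder_path)):
        json_path = os.path.join(folder_path, subfolder, table_name + ".json")
        # Skip subfolders (or stray files) that do not contain this table
        if os.path.isfile(json_path):
            with open(json_path, "r") as f:
                records.extend(json.load(f))
    return pd.DataFrame(records)
```

With such a helper, the cell above would reduce to something like `overviewDf = load_table('table-data-cleaned', 'overview')`.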

First, let's drop all the rows without content. Also, the column "Log no." does not provide any additional information, so let's drop it.

In [156]:
# Select all columns except 'id'
columns_to_check = overviewDf.columns.drop('id')

# Remove rows where all of the checked columns contain an empty string
overviewDf = overviewDf[~overviewDf[columns_to_check].map(lambda x: x == '').all(axis=1)]

overviewDf = overviewDf.drop(columns=['Log no.'])

# Save the cleaned DataFrame back to a CSV file
overviewDf.to_csv('data-frames-cleaned/' + type + '.csv', index=False)
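
The difference between `.all(axis=1)` and `.any(axis=1)` decides which rows count as "empty". The following sketch uses a made-up three-row frame (not taken from the real data) to show both variants side by side.

```python
import pandas as pd

# Toy data: row 1 is filled, row 2 is partly filled, row 3 is completely empty
df = pd.DataFrame({
    "id": [1, 2, 3],
    "Title": ["Print A", "", ""],
    "Operator": ["JV", "TL", ""],
})

is_empty = df.drop(columns="id").eq("")

all_empty = is_empty.all(axis=1)  # True only when every checked cell is ''
any_empty = is_empty.any(axis=1)  # True as soon as one checked cell is ''

kept = df[~all_empty]  # keeps partly filled rows, drops only fully empty ones
```

Using `.all` keeps rows that still carry some information, which matches the intent of the cell above.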


The column "folderName" contains valuable information, for example a date, but it is not normalized. Let's extract the date from that column and put it in a new column.

In [157]:
# Adjust the function to handle additional rules
import re
from datetime import datetime

def extract_date(log):
    # Extract the numeric part of the log string
    numeric_part = re.search(r'\d+', log)
    if numeric_part:
        numeric_part = numeric_part.group()
        # If the numeric part ends with '01', '02', '03', '04', '05', '06', '07', which might not be part of the date, remove it
        if numeric_part.endswith(('01', '02', '03', '04', '05', '06', '07')) and len(numeric_part) > 6:
            numeric_part = numeric_part[:-2]
        # Handle different date formats
        if len(numeric_part) == 6:
            # Assume the format is YYMMDD
            date_format = '%y%m%d'
        elif len(numeric_part) == 8:
            # Assume the format is YYYYMMDD
            date_format = '%Y%m%d'
        else:
            return None  # Return None if the format is unrecognized
        # Parse the date
        try:
            return datetime.strptime(numeric_part, date_format).date()
        except ValueError:
            return None  # Return None if the date parsing fails
    return None  # Return None if no numeric part is found

# Apply the adjusted date extraction function
overviewDf['Date'] = overviewDf['folderName'].apply(extract_date)

# Save the cleaned DataFrame back to a CSV file
overviewDf.to_csv('data-frames-cleaned/' + type + '.csv', index=False)
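
To make the parsing rule easier to check in isolation, here is a condensed sketch of the same idea. `parse_folder_date` is a hypothetical name, and this version deliberately leaves out the suffix trimming of `extract_date`; the folder names in the usage note are illustrative.

```python
import re
from datetime import datetime


def parse_folder_date(folder_name):
    """Return the date encoded in the first digit run of a folder name, or None."""
    match = re.search(r"\d+", folder_name)
    if not match:
        return None
    digits = match.group()
    # 6 digits are read as YYMMDD, 8 digits as YYYYMMDD
    date_format = {6: "%y%m%d", 8: "%Y%m%d"}.get(len(digits))
    if date_format is None:
        return None
    try:
        return datetime.strptime(digits, date_format).date()
    except ValueError:
        return None  # e.g. month 13
```

For example, `parse_folder_date("Log230719-HAMA")` yields the date 2023-07-19, while a name without digits or with an impossible date yields `None`.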


This worked well. Some rows look like this: `281,Title,Operator,Description,Log230719-HAMA-30-2%-Thiol-ene`
This indicates that the operator did not fill out the table, so the cells still hold the template placeholders.
Let's enrich the "Title" column by using the folderName as the title, and replace the placeholder values in "Operator" and "Description" with empty strings.
After that, we can drop the folderName column, since we have extracted all useful information from it.

In [158]:
def enrich_title(row):
    # A literal 'Title' means the row still holds the unfilled template placeholders
    if row['Title'] == 'Title':
        row['Title'] = row['folderName']
        row['Operator'] = ''
        row['Description'] = ''
    elif row['Title'] == '':
        row['Title'] = row['folderName']
    return row

# Apply the function to each row
overviewDf = overviewDf.apply(enrich_title, axis=1)

overviewDf = overviewDf.drop(columns=['folderName'], axis=1)

# Save the cleaned DataFrame back to a CSV file
overviewDf.to_csv('data-frames-cleaned/' + type + '.csv', index=False)

Now, let's look at the column "Operator". This column currently includes ambiguous values. For example, "JV/TL" and "JV,TL" mean the same thing.
The following cell cleans up those ambiguities and puts the operators in an array of strings, which can be easily processed later on.

In [159]:
import re

def split_operators(operator):
    operator = operator.upper()
    # Remove spaces around delimiters and parentheses
    operator = re.sub(r'\s*([,/&+()])\s*', r'\1', operator)
    # Split on the delimiters
    parts = re.split(r'[,/&+() ]', operator)
    # Remove empty strings and strip whitespace
    parts = [part.strip() for part in parts if part.strip()]
    # The input was already upper-cased and split on spaces above, so no part can
    # contain a space and this initials branch (e.g., 'John Doe' -> 'JD') never runs.
    if len(parts) == 1 and ' ' in parts[0]:
        parts = [''.join([name[0] for name in parts[0].split() if name])]
    return parts

overviewDf["operator_array"] = overviewDf["Operator"].apply(split_operators)

# Save the cleaned DataFrame back to a CSV file
overviewDf.to_csv('data-frames-cleaned/' + type + '.csv', index=False)
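
As a quick sanity check of the idea, this condensed sketch splits on the same delimiter set and confirms that differently written operator lists normalize to the same array. `normalize_operators` is a hypothetical stand-in, not the function used above.

```python
import re


def normalize_operators(raw):
    """Upper-case the input and split it on , / & + ( ) and spaces."""
    tokens = re.split(r"[,/&+() ]+", raw.upper())
    # Drop the empty strings that appear at the edges of the split
    return [token for token in tokens if token]


same = normalize_operators("JV/TL") == normalize_operators("jv, tl")
```

Both spellings end up as `['JV', 'TL']`, which is what makes the later per-operator analysis possible.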

That looks good. Now let's drop the "Operator" column.

In [160]:
overviewDf = overviewDf.drop(columns=['Operator'], axis=1)

# Save the cleaned DataFrame back to a CSV file
overviewDf.to_csv('data-frames-cleaned/' + type + '.csv', index=False)

The overview table looks good now.

# Bioinks

Let's clean the "bioInks" table next.

In [161]:
type = "bioInks"

data = []
folder_path = 'table-data-cleaned'
# Iterate through each subfolder
for index, subfolder in enumerate(os.listdir(folder_path)):
    subfolder_path = os.path.join(folder_path, subfolder)
    if os.path.isdir(subfolder_path):
        # Check whether "<type>.json" (here "bioInks.json") exists in the subfolder
        jsonPath = os.path.join(subfolder_path, type + ".json")
        if os.path.exists(jsonPath):
            # Read the content of the JSON file and append its records to the list
            with open(jsonPath, 'r') as f:
                content = json.load(f)
                data.extend(content)
                
# Convert the list to a pandas DataFrame
bioInksDf = pd.DataFrame(data)
bioInksDf.to_csv("data-frames-raw/" + type + '.csv', index=False)

The dataframe has several columns which are mostly empty. Before removing columns, let's look at them. Maybe we can map some of them to similar columns.

In [162]:
for column in bioInksDf.columns:
    print(column)

Name
Polymer
cpolymer [v%]
cLAP [wt%]
CTartrazine [mM]
Solvent
logId
spheres
CIodixanol [v/v]
Gelma FIC A29
other
cLAP [v%]
CDTT [v%]
Color [mg]
additives
Additives
Filter [µM]
Add
Fluorospheres [dil]
Comment
Amount of color(mg/ml)
Beads Mio/ml
cDTT [v%]
Peptide (g/L)


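cLAP [v%] and cLAP [wt%] describe the same quantity but use different units, and the same goes for CDTT [v%] and CTartrazine [mM].

The next cell fills missing CTartrazine [mM] values from CDTT [v%] and, where a cell contains a value of the form "(3mM)", keeps only the number in mM. cLAP [v%] is not merged here; it is dropped together with the other sparse columns in the following step.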

In [163]:
bioInksDf['CTartrazine [mM]'] = bioInksDf['CTartrazine [mM]'].combine_first(bioInksDf['CDTT [v%]'])
# only get the values in mM

def extract_mM(value):
    # Check if value is a string
    if isinstance(value, str):
        # Use regular expression to find the pattern
        match = re.search(r'\((\d+)mM\)', value)
        if match:
            # Extract and return the number
            return int(match.group(1))
    # Otherwise return the value unchanged (pattern not found or value is not a string)
    return value

# Apply the function to the desired column
bioInksDf['CTartrazine [mM]'] = bioInksDf['CTartrazine [mM]'].apply(extract_mM)

# Write to csv
bioInksDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
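
The two steps of this cell, filling gaps with `combine_first` and then pulling the mM number out of the text, can be tried on a small made-up frame. The cell values below are invented for illustration and `mm_from_text` is a hypothetical name.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "CTartrazine [mM]": ["2", None, None],
    "CDTT [v%]": ["9", "0.16% (3mM)", None],
})

# Keep the left value where present, otherwise fall back to the right column
merged = df["CTartrazine [mM]"].combine_first(df["CDTT [v%]"])


def mm_from_text(value):
    """Return the integer inside '(<n>mM)' if present, else the value unchanged."""
    if isinstance(value, str):
        match = re.search(r"\((\d+)mM\)", value)
        if match:
            return int(match.group(1))
    return value


result = merged.apply(mm_from_text)
```

Note that only cells matching the `(<n>mM)` pattern are converted; everything else, including plain strings, passes through untouched and is cleaned later.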

In the following step we will remove the superfluous columns.

In [164]:
rmCols = [
    "spheres" ,
    "CIodixanol [v/v]",
    "Gelma FIC A29",
    "other",
    "cLAP [v%]",
    "CDTT [v%]",
    "Color [mg]",
    "additives",
    "Additives",
    "Filter [µM]",
    "Add",
    "Fluorospheres [dil]",
    "Comment",
    "Amount of color(mg/ml)",
    "Beads Mio/ml",
    "cDTT [v%]",
    "Peptide (g/L)"
]

# Remove the specified columns from the DataFrame
bioInksDf = bioInksDf.drop(columns=rmCols, errors='ignore')
print(f"Columns {rmCols} removed for type {type}.")

# Write to csv
bioInksDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)

Columns ['spheres', 'CIodixanol [v/v]', 'Gelma FIC A29', 'other', 'cLAP [v%]', 'CDTT [v%]', 'Color [mg]', 'additives', 'Additives', 'Filter [µM]', 'Add', 'Fluorospheres [dil]', 'Comment', 'Amount of color(mg/ml)', 'Beads Mio/ml', 'cDTT [v%]', 'Peptide (g/L)'] removed for type bioInks.


Now, let's remove empty rows. First, characters indicating an empty cell are replaced with an empty string to help streamline the process.

In [165]:
# remove empty-indicating characters
rmCharacters = ['-', '•', '/', '%']
bioInksDf = bioInksDf.replace(rmCharacters, '', regex=True)

# Select all columns except 'logId'
columns_to_check = bioInksDf.columns.drop('logId')

# Remove rows where all of the checked columns contain an empty string
bioInksDf = bioInksDf[~bioInksDf[columns_to_check].map(lambda x: x == '').all(axis=1)]

# Write to csv
bioInksDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
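
One thing to keep in mind: with `regex=True`, each listed character is removed wherever it occurs, also inside otherwise meaningful values. The sketch below contrasts this with an anchored pattern that only blanks cells consisting entirely of placeholder characters. The anchored variant is a possible alternative, not what the notebook does, and the sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["GelMA-5%", "-", "•"],
    "Solvent": ["PBS", "/", "DPBS"],
})

# Behaviour of the cell above: the characters are stripped everywhere
stripped = df.replace(["-", "•", "/", "%"], "", regex=True)

# Alternative: only cells made up purely of placeholder characters become ''
anchored = df.replace(r"^[-•/%\s]+$", "", regex=True)
```

In `stripped`, "GelMA-5%" becomes "GelMA5", whereas `anchored` leaves it intact. Which behaviour is preferable depends on whether the later steps rely on those characters being gone.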

Let's look at the columns 'Polymer' and 'Solvent'. We want to map similar bioinks and solvents to the same value.

In [166]:
def map_polymer(polymer):
    polymer = polymer.strip()  # Remove any leading/trailing whitespace
    if 'gelma' in polymer or 'glema' in polymer:
        return 'gelma'
    elif 'hama' in polymer or ' hyaluronic' in polymer or 'hyaloron' in polymer or 'acid' in polymer:
        return 'hama'
    elif 'pegda' in polymer or 'pegda700' in polymer:
        return 'pegda'
    elif 'elma' in polymer:
        return 'elma'
    elif 'cmcma' in polymer or 'carboxymethyl' in polymer or 'chitosan' in polymer:
        return 'cmcma'
    elif 'dextran' in polymer:
        return 'dextran'
    elif 'handb' in polymer or 'ha_nb' in polymer or ('ha' in polymer and 'methacrylate' not in polymer and 'shares' not in polymer):
        return 'hanb'
    else:
        return 'unknown'  # For polymers that don't match
    
def map_solvent(solvent):
    solvent = solvent.strip()  # Remove any leading/trailing whitespace
    if 'dpbs' in solvent:
        return 'dpbs'
    elif 'pbs' in solvent:
        return 'pbs'
    elif 'rpmi' in solvent:
        return 'rpmi'
    elif 'ddh2o' in solvent or 'ddh20' in solvent:
        return 'ddh2o'
    elif 'williams' in solvent:
        return 'williams e'
    else:
        return 'unknown'  # For solvents that don't match

def removeChars(row):
    polymerString = row['Polymer']
    solventString = row['Solvent']
    polymerString = ''.join(filter(lambda x: not x.isdigit(), polymerString))
    splitStrings = polymerString.split()
    splitSolventStrings = solventString.split()
    
    # Map each potential ink and remove 'unknown' values
    mapped_inks = [map_polymer(potentialInk) for potentialInk in splitStrings]
    splitStrings = [ink for ink in mapped_inks if ink != 'unknown']
    row['Polymer'] = splitStrings
    
    # Map solvents
    mapped_solvents = [map_solvent(potentialSolvent) for potentialSolvent in splitSolventStrings]
    splitSolventStrings = [solvent for solvent in mapped_solvents if solvent != 'unknown']
    row['Solvent'] = splitSolventStrings
             
    return row
    
# column to lowercase
bioInksDf['Polymer'] = bioInksDf['Polymer'].str.lower()
bioInksDf['Solvent'] = bioInksDf['Solvent'].str.lower()

bioInksDf = bioInksDf.apply(removeChars, axis=1)

# Write to csv
bioInksDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)

Next, we want to split every row that lists more than one ink so that each ink gets its own row.

In [167]:
def scanForTwoInks(row, new_rows):
    polymerArray = row['Polymer']
    cPolymers = row['cpolymer [v%]'].split()
    solventsArray = row['Solvent']
    inkName = row['Name']
    if len(polymerArray) > 1 and len(cPolymers) > 1:
        # Ensure that polymerArray and cPolymers have the same length
        # This step is crucial as it ensures that there's a 1:1 mapping between polymers and concentrations
        min_length = min(len(polymerArray), len(cPolymers))
        polymerArray = polymerArray[:min_length]
        cPolymers = cPolymers[:min_length]
        
        for i in range(min_length):
            new_row = row.copy()
            new_row['Polymer'] = [polymerArray[i]]
            new_row['cpolymer [v%]'] = cPolymers[i]  # keep a plain string so the later .str cleaning applies
            if len(solventsArray) == min_length:
                new_row['Solvents'] = solventsArray[i]
            new_row['Name'] = inkName
            new_rows.append(new_row)
            # Mark the index of the row to be dropped since it has been expanded into new rows
            index_to_drop.append(row.name)
    elif len(polymerArray) == 2 and len(cPolymers) == 1:
        new_row = row.copy()
        new_row['Polymer'] = [polymerArray[0]]
        new_row['cpolymer [v%]'] = cPolymers[0]  # keep a plain string so the later .str cleaning applies
        new_row['Name'] = inkName
        new_rows.append(new_row)
        # Mark the index of the row to be dropped since it has been expanded into new rows
        index_to_drop.append(row.name)
    else:
        new_rows.append(row)  # If there's only one polymer, keep the row as is
        
new_rows = []  # Initialize an empty list to collect new rows
index_to_drop = []

# Iterate over each row in the DataFrame
for index, row in bioInksDf.iterrows():
    scanForTwoInks(row, new_rows)
    
# Drop the original rows that have been expanded into multiple rows
bioInksDf = bioInksDf.drop(index_to_drop)
    
new_bioInksDf = pd.DataFrame(new_rows)

bioInksDf = pd.concat([bioInksDf, new_bioInksDf], ignore_index=True)

# Write to csv
bioInksDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
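
For the simple case where the polymer list and the concentration list have the same length, pandas can do this expansion directly with `DataFrame.explode`. This is a sketch on made-up data, assuming pandas 1.3 or newer (multi-column explode); it does not cover the special cases handled by `scanForTwoInks`, such as two polymers with a single concentration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ink A", "ink B"],
    "Polymer": [["gelma", "hama"], ["pegda"]],
    "cpolymer": [["5", "2"], ["10"]],
})

# Each list element gets its own row; the lists in one row must have equal length
exploded = df.explode(["Polymer", "cpolymer"], ignore_index=True)
```

The row for "ink A" turns into two rows, one per polymer, each paired with its own concentration.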

Let's clean up the table some more: unwrap the single-element lists and reduce the concentration columns to plain numbers.

In [168]:
def clean_dataframe(df):
    # Polymer: take the first entry of each list, which removes the brackets and quotes
    df['Polymer'] = df['Polymer'].str[0]

    # cpolymer [v%]: Keep only numbers and replace commas with points
    df['cpolymer [v%]'] = df['cpolymer [v%]'].str.replace(r'[^0-9,\.]', '', regex=True)
    df['cpolymer [v%]'] = df['cpolymer [v%]'].str.replace(',', '.')

    # cLAP [wt%]: Remove non-numeric entries and entries with ">"
    df['cLAP [wt%]'] = df['cLAP [wt%]'].str.replace(r'[^0-9,\.]', '', regex=True)
    df['cLAP [wt%]'] = df['cLAP [wt%]'].str.replace(',', '.')
    df['cLAP [wt%]'] = df['cLAP [wt%]'].replace(r'.*?>.*', '', regex=True)

    # CTartrazine [mM]: Same cleaning as cLAP
    df['CTartrazine [mM]'] = df['CTartrazine [mM]'].str.replace(r'[^0-9,\.]', '', regex=True)
    df['CTartrazine [mM]'] = df['CTartrazine [mM]'].str.replace(',', '.')
    df['CTartrazine [mM]'] = df['CTartrazine [mM]'].replace(r'.*?>.*', '', regex=True)

    # Solvent: Take the first entry and remove brackets and quotes
    df['Solvent'] = df['Solvent'].str[0]

    return df

# Apply the cleaning function to the DataFrame
bioInksDf = clean_dataframe(bioInksDf)

bioInksDf = bioInksDf.drop(columns=['Solvents'])

# Write to csv
bioInksDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)

Next, fill missing numbers with zeros and convert the columns to floats.

In [169]:
def fill_missing_with_zeros(row):
    for column in ['cpolymer [v%]', 'cLAP [wt%]', 'CTartrazine [mM]']:
        if isinstance(row[column], str):
            if row[column] == '':
                row[column] = 0
        try:
            row[column] = float(row[column])
        except ValueError:
            row[column] = 0

    return row


# Apply the function to the DataFrame
bioInksDf = bioInksDf.apply(fill_missing_with_zeros, axis=1)

# Write to csv
bioInksDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
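
The same "unparseable becomes 0" rule can also be expressed without a row-wise `apply`, using `pd.to_numeric` with `errors="coerce"`. A small sketch on made-up values:

```python
import pandas as pd

raw = pd.Series(["5", "", "2.5", "n/a", None])

# Anything that is not a number becomes NaN, which is then replaced by 0
filled = pd.to_numeric(raw, errors="coerce").fillna(0.0)
```

This is usually faster on large frames, and the result is a proper float column.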

There are some duplicate rows which need to be removed. Also, the column names contain spaces, brackets and '%' characters. Replace the spaces with a '_' character and remove the rest.

In [170]:
bioInksDf = bioInksDf.drop_duplicates()
bioInksDf.columns = bioInksDf.columns.str.replace(' ', '_').str.replace(r'[\[\]]', '', regex=True).str.replace('%', '')

# Write to csv
bioInksDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
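
To see what the renaming does, here is the same idea applied to three of the column names from above, with the bracket and percent removal folded into one regular expression:

```python
import pandas as pd

columns = pd.Index(["cpolymer [v%]", "cLAP [wt%]", "CTartrazine [mM]"])

clean = (
    columns.str.replace(" ", "_", regex=False)        # spaces -> underscores
    .str.replace(r"[\[\]%]", "", regex=True)          # drop brackets and '%'
)
```

The resulting names are valid SQL identifiers without quoting, which is convenient for the SQLite export at the end of the notebook.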

# Hardware

Let's clean the hardware table.

In [171]:
type = "hardwareSetup"

data = []
folder_path = 'table-data-cleaned'
# Iterate through each subfolder
for index, subfolder in enumerate(os.listdir(folder_path)):
    subfolder_path = os.path.join(folder_path, subfolder)
    if os.path.isdir(subfolder_path):
        # Check whether "<type>.json" (here "hardwareSetup.json") exists in the subfolder
        jsonPath = os.path.join(subfolder_path, type + ".json")
        if os.path.exists(jsonPath):
            # Read the content of the JSON file and append its records to the list
            with open(jsonPath, 'r') as f:
                content = json.load(f)
                data.extend(content)
                
# Convert the list to a pandas DataFrame
hardwareSetupDf = pd.DataFrame(data)

hardwareSetupDf.to_csv("data-frames-raw/" + type + '.csv', index=False)

Now, let's remove incomplete rows. First, characters indicating an empty cell are replaced with an empty string to help streamline the process. Unlike before, a row is dropped here as soon as any of the checked columns is empty.

In [172]:
# Replace specific characters with empty strings
rmCharacters = ['-', '•', '/', '%']
hardwareSetupDf = hardwareSetupDf.replace(rmCharacters, '', regex=True)

# Select all columns except 'logId' and 'Position'
columns_to_check = hardwareSetupDf.columns.drop(['logId', 'Position'])

# Remove rows where any of the specified columns have an empty string
hardwareSetupDf = hardwareSetupDf[~hardwareSetupDf[columns_to_check].apply(lambda x: x == '').any(axis=1)]

# Write to csv
hardwareSetupDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)

Now let's map "Inkreservoir/Drying", "Ink reservoir/drying" and "Ink reservoir/Drying" to the same column.

In [173]:
rows_to_concat = ['Inkreservoir/Drying', 'Ink reservoir/drying', 'Ink reservoir/Drying']

hardwareSetupDf['inkreservoir/drying'] = hardwareSetupDf.apply(
    lambda row: row['Inkreservoir/Drying'] if pd.notnull(row['Inkreservoir/Drying'])
    else (row['Ink reservoir/drying'] if pd.notnull(row['Ink reservoir/drying'])
    else row['Ink reservoir/Drying']), axis=1)

hardwareSetupDf = hardwareSetupDf.drop(['Inkreservoir/Drying', 'Ink reservoir/drying', 'Ink reservoir/Drying', 'Printhead: CB5'], axis=1)

# Write to csv
hardwareSetupDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
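
The nested lambda above picks the first non-null value among the three spelling variants. The same "coalesce" can be sketched by chaining `combine_first`, which scales to any number of columns. The frame below is made up for illustration.

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Inkreservoir/Drying": ["al_001", np.nan, np.nan],
    "Ink reservoir/drying": [np.nan, "al_002", np.nan],
    "Ink reservoir/Drying": [np.nan, np.nan, "wash"],
})
variants = list(df.columns)

# Take the first column and fill its gaps from each following column in turn
coalesced = reduce(lambda left, right: left.combine_first(right),
                   (df[name] for name in variants))
```

The order of `variants` defines the priority, just like the order of the conditions in the lambda.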


Next, map similar values.

In [174]:
# dish mappings

# pol_001, black
# al_001, black metal
# al_002
# al_003
# al_004, hepabrick

# al_007
# al_008, hepabrick
# al_009, blue
# al_009, implant silver torpedo
# al_010, blue
# al_011, blue silver metal
# al_012, tartrazino, silver
# al_013, clear ink

def map_values(row):
    # map printhead col
    if 'd4001' in row['Printhead'] or 'd40_01' in row['Printhead'] or 'br1' in row['Printhead']:
        row['Printhead'] = 'd4001'
    elif 'd4002' in row['Printhead'] or 'd40_02' in row['Printhead']:
        row['Printhead'] = 'd4002'
    elif 'd4003' in row['Printhead'] or 'd40-03' in row['Printhead'] or 'cr5' in row['Printhead']:
        row['Printhead'] = 'd4003'
    elif 'd4004' in row['Printhead'] or 'd40_04' in row['Printhead'] or 'gen4' in row['Printhead']:
        row['Printhead'] = 'd4004'
    elif 'small' in row['Printhead'] or 'Small' in row['Printhead']:
        row['Printhead'] = 'small'
    elif 'big' in row['Printhead'] or 'large' in row['Printhead'] or 'black metal' in row['Printhead'] or 'metal black' in row['Printhead'] or '4 cm' in row['Printhead'] or 'aluminium black' in row['Printhead'] or 'alu' in row['Printhead'] or 'black' in row['Printhead']:
        row['Printhead'] = 'large'
    else:
        row['Printhead'] = 'other'
    
    # Define a pattern to match the codes, considering 'ai_' and 'al_' prefixes and optional underscores or spaces
    pattern = r"(ai_|al_)?_?(al_)?0?(\d{2,3})"
    
    # Search for the pattern in the 'inkreservoir/drying' column
    match = re.search(pattern, row['inkreservoir/drying'])
    if match:
        # Construct the replacement string using the matched group which represents the number part
        row['inkreservoir/drying'] = f"al_{int(match.group(3)):03}"
    
    if 'wash' in row['inkreservoir/drying']:
        row['inkreservoir/drying'] = 'wash'
    elif 'silver metal' in row['inkreservoir/drying']:
        row['inkreservoir/drying'] = 'al_009'
    elif 'al_060' in row['inkreservoir/drying']:
        row['inkreservoir/drying'] = 'al_006'
    elif 'al_220' in row['inkreservoir/drying'] or 'al_200' in row['inkreservoir/drying']:
        row['inkreservoir/drying'] = 'al_002'
    elif 'tartrazino' in row['inkreservoir/drying']:
        row['inkreservoir/drying'] = 'al_012'
    elif 'blue metal' in row['inkreservoir/drying'] or 'aluminium blue' in row['inkreservoir/drying'] or 'metal blue' in row['inkreservoir/drying'] or 'aluminum blue' in row['inkreservoir/drying']:
        row['inkreservoir/drying'] = 'al_011'
    elif 'big metal black' in row['inkreservoir/drying'] or 'black metal' in row['inkreservoir/drying'] or 'black alu vat' in row['inkreservoir/drying'] or 'metal black' in row['inkreservoir/drying']:
        row['inkreservoir/drying'] = 'al_001'
    elif 'hepabrick' in row['inkreservoir/drying']:
        row['inkreservoir/drying'] = 'al_008'
    elif 'dry' in row['inkreservoir/drying'] or 'drying' in row['inkreservoir/drying']:
        row['inkreservoir/drying'] = 'dry'
    else:
        if not match:
            row['inkreservoir/drying'] = 'other'
 
    return row
    
# column to lowercase
hardwareSetupDf['Printhead'] = hardwareSetupDf['Printhead'].str.lower()
hardwareSetupDf['inkreservoir/drying'] = hardwareSetupDf['inkreservoir/drying'].str.lower()

# Apply the function to the DataFrame
hardwareSetupDf = hardwareSetupDf.apply(map_values, axis=1)

# Write to csv
hardwareSetupDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
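
The regular expression that normalizes the reservoir codes is the least obvious part of this cell, so here it is in isolation. `normalize_code` is a hypothetical helper that applies only the pattern; the keyword overrides from `map_values` (for example 'wash' or 'hepabrick') are not included.

```python
import re

# Same pattern as in map_values: optional 'ai_'/'al_' prefix, optional leading 0,
# then two or three digits that form the code number
CODE_PATTERN = re.compile(r"(ai_|al_)?_?(al_)?0?(\d{2,3})")


def normalize_code(text):
    """Return the code as 'al_NNN', or None if the text contains no code."""
    match = CODE_PATTERN.search(text)
    if match is None:
        return None
    return f"al_{int(match.group(3)):03}"
```

For instance, the mistyped prefix in "ai_011" and the short form "al_09" both end up in the canonical three-digit form.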

# Printer Setup

Next up is the printer setup table.

In [175]:
type = "printerSetup"

data = []
folder_path = 'table-data-cleaned'
# Iterate through each subfolder
for index, subfolder in enumerate(os.listdir(folder_path)):
    subfolder_path = os.path.join(folder_path, subfolder)
    if os.path.isdir(subfolder_path):
        # Check whether "<type>.json" (here "printerSetup.json") exists in the subfolder
        jsonPath = os.path.join(subfolder_path, type + ".json")
        if os.path.exists(jsonPath):
            # Read the content of the JSON file and append its records to the list
            with open(jsonPath, 'r') as f:
                content = json.load(f)
                data.extend(content)
                
# Convert the list to a pandas DataFrame
printerSetupDf = pd.DataFrame(data)

printerSetupDf.to_csv("data-frames-raw/" + type + '.csv', index=False)

The printerSetup table only contains filenames and has a lot of missing values. Thus, it is considered irrelevant for this analysis and can be discarded.

# Material Settings

Next is the material settings table.

In [176]:
type = "materialSettings"

data = []
folder_path = 'table-data-cleaned'
# Iterate through each subfolder
for index, subfolder in enumerate(os.listdir(folder_path)):
    subfolder_path = os.path.join(folder_path, subfolder)
    if os.path.isdir(subfolder_path):
        # Check whether "<type>.json" (here "materialSettings.json") exists in the subfolder
        jsonPath = os.path.join(subfolder_path, type + ".json")
        if os.path.exists(jsonPath):
            # Read the content of the JSON file and append its records to the list
            with open(jsonPath, 'r') as f:
                content = json.load(f)
                data.extend(content)
                
# Convert the list to a pandas DataFrame
materialSettingDf = pd.DataFrame(data)

materialSettingDf.to_csv("data-frames-raw/" + type + '.csv', index=False)

Delete the empty lines.

In [177]:
# Replace specific characters with empty strings
rmCharacters = ['-', '•', '/', '%']
materialSettingDf = materialSettingDf.replace(rmCharacters, '', regex=True)

# Select all columns except 'logId' and 'Position'
columns_to_check = materialSettingDf.columns.drop(['logId', 'Position'])

# Remove rows where any of the specified columns have an empty string
materialSettingDf = materialSettingDf[~materialSettingDf[columns_to_check].apply(lambda x: x == '').any(axis=1)]

# Write to csv
materialSettingDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)

Next, merge the column .JSON with column "Material parameters".

In [178]:
materialSettingDf.loc[materialSettingDf['Material parameters'].isna(), 'Material parameters'] = materialSettingDf['.JSON']

materialSettingDf = materialSettingDf.drop(['.JSON', 'Attempt', 'Material'], axis=1)


# Write to csv
materialSettingDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)

The column "Material parameters" has json values which need their own columns.

In [179]:
import re

columns_to_initialize = [
    'temperature', 'temperatureTolerance', 'brightness', 'exposureTime', 'zHop',
    'zHopSpeed', 'washTime', 'dabCount', 'washCount', 'projectionDelay'
]
for column in columns_to_initialize:
    materialSettingDf[column] = ""

def processJson(row):
    string = row['Material parameters']
    # Check for substrings that should cause the function to abort and return the string as is
    if "http" in string or "Slice" in string:
        return row
    # Normalize the string by removing any leading/trailing braces and whitespace
    if "{" in string or '"' in string:   
        normalized_str = re.sub(r'^\s*{\s*|\s*}\s*$', '', string)
        pairs = re.findall(r'(?:"(\w+)":\s*([^,}]+))', normalized_str)
    else:
        normalized_str = string
        pairs = re.findall(r'(?:(\w+):\s*([^,}]+))', normalized_str)

    # Process each pair and convert numerical values from string to their appropriate type
    for key, value in pairs:
        # Remove any non-numeric trailing characters from the value
        value = re.sub(r'[^0-9.]+$', '', value.strip())
        # Try converting to integer, if fail, then to float, if fail, keep as string
        try:
            value = int(value)
        except ValueError:
            try:
                value = float(value)
            except ValueError:
                pass  # keep the value as string if it's neither int nor float
        # Assign the value to the row if the key is a column in the DataFrame
        if key in materialSettingDf.columns:
            row[key] = value
            
    return row

materialSettingDf = materialSettingDf.apply(processJson, axis=1)

# Drop useless columns
materialSettingDf = materialSettingDf.drop(["Material parameters", "File name", "File name of material .json", "File name of material .Json"], axis=1)

# Write to csv
materialSettingDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
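
For cells that contain well-formed JSON, the standard `json` module can do the parsing, with a regex fallback for the loosely formatted ones. The following is a sketch of that idea under the assumption that values are numeric; `parse_params` and `to_number` are hypothetical names and the sample strings in the usage note are invented.

```python
import json
import re


def to_number(text):
    """Strip trailing units and convert to int or float where possible."""
    cleaned = re.sub(r"[^0-9.]+$", "", text.strip())
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    return cleaned


def parse_params(raw):
    """Parse a parameter string into a dict, tolerating sloppy formatting."""
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    # Fallback: collect key/value pairs with or without quotes around the key
    pairs = re.findall(r'"?(\w+)"?\s*:\s*([^,}]+)', raw)
    return {key: to_number(value) for key, value in pairs}
```

Both `parse_params('{"temperature": 37, "zHop": 1.5}')` and `parse_params('temperature: 37, zHop: 1.5mm')` then give a dict with `temperature` 37 and `zHop` 1.5.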

Now, a few rows are still missing values. Let's fill those with the mean value of the corresponding column.

In [180]:
import numpy as np

columns_to_fill = ['temperature', 'temperatureTolerance', 'brightness', 'exposureTime', 'zHop', 'zHopSpeed', 'washTime']

materialSettingDf['exposureTime'] = materialSettingDf['exposureTime'].apply(lambda x: 0 if len(str(x)) > 3 else x)

for column in columns_to_fill:
    materialSettingDf[column] = pd.to_numeric(materialSettingDf[column], errors='coerce')
    

column_means = materialSettingDf[columns_to_fill].mean()

materialSettingDf.fillna(column_means, inplace=True)

for column in columns_to_fill:
    materialSettingDf[column] = materialSettingDf[column].replace(0, column_means[column])

# Write to csv
materialSettingDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
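
In the cell above, zeros are treated as missing but still count towards the column mean. A possible variant, shown here on made-up numbers, first turns the zeros into NaN so that the mean is computed from real measurements only. This is an alternative to consider, not what the notebook currently does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"exposureTime": [10.0, 0.0, np.nan, 20.0]})

# Treat 0 as missing, then fill every gap with the mean of the remaining values
column = df["exposureTime"].replace(0, np.nan)
df["exposureTime"] = column.fillna(column.mean())
```

Here the mean is 15 (from 10 and 20), whereas including the zero would give 10.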

# Printing Log

In [181]:
type = "printingLog"

data = []
folder_path = 'table-data-cleaned'
# Iterate through each subfolder
for index, subfolder in enumerate(os.listdir(folder_path)):
    subfolder_path = os.path.join(folder_path, subfolder)
    if os.path.isdir(subfolder_path):
        # Check whether "<type>.json" (here "printingLog.json") exists in the subfolder
        jsonPath = os.path.join(subfolder_path, type + ".json")
        if os.path.exists(jsonPath):
            # Read the content of the JSON file and append its records to the list
            with open(jsonPath, 'r') as f:
                content = json.load(f)
                data.extend(content)
                
# Convert the list to a pandas DataFrame
printingLogDf = pd.DataFrame(data)

printingLogDf.to_csv("data-frames-raw/" + type + '.csv', index=False)

Let's start by looking at the columns and figuring out which columns describe the same values and which columns can be dropped.

In [182]:
print(printingLogDf.columns.tolist())

['Attempt No.', 'Ink', 'layer height [µm]', 'no. bottom layers', 'texp.bottom\xa0 [s]', 'texposure_pos1 [s]', 'texposure_pos2 [s]', 'texposure_pos3 [s]', 'texposure_pos4 [s]', 'zSpeed', 'status', 'comment', 'next steps', 'pictures', 'logId', 'texposure_sec[s]', 'texposure_last [s]', 'Washing/drying process', 'Ink1', 'Ink2', 'washing', 'drying', 'WASH/ DRY technique', 'GLOBAL OFFSET', 'texp.bottom\u202f [s]', 'Texposure_pos1 [s]', 'Texposure_pos2 [s]', 'Texposure_pos3 [s]', 'Texposure_pos4 [s]', 'texp.layer [s]', 'zHop [mm]', 'LAP [%]', 'texp.layer 1 [s]', 'texp.layer 2 [s]', 'texp.layer 3 [s]', 'texp.layer 4 [s]', 'Post-curing time [min]', 'Tart', 'layer height [mm]', 'Base height', 'OFFSET', 'VARIABLE CHANGED', 'texposure [s]', 'Tart [mM]', 'zHop', 'Comment\xa010 s nachbelichtet']


In [183]:
# Drop
printingLogDf = printingLogDf.drop(columns=['OFFSET', 'Tart', 'Tart [mM]', 'Base height', 'zHop', 'Post-curing time [min]', 'VARIABLE CHANGED', 
                                            'Comment 10 s nachbelichtet', 'LAP [%]', 'GLOBAL OFFSET', 'texposure [s]', 'Ink1', 'Ink2', 
                                            'Washing/drying process', 'washing', 'drying', 'WASH/ DRY technique', 'texposure_sec[s]', 
                                            'texposure_last [s]', 'texp.bottom  [s]', 'Texposure_pos2 [s]', 'Texposure_pos3 [s]',
                                            'Texposure_pos4 [s]', 'zHop [mm]', 'layer height [mm]', 'texp.layer 2 [s]', 'texp.layer 3 [s]', 
                                            'texp.layer 4 [s]', 'texp.layer [s]', 'pictures'], axis=1)

# Map texposure_pos1 [s]
printingLogDf['texposure1_s'] = printingLogDf.apply(
    lambda row: row['texposure_pos1 [s]'] if pd.notnull(row['texposure_pos1 [s]'])
    else (row['Texposure_pos1 [s]'] if pd.notnull(row['Texposure_pos1 [s]'])
    else row['texp.layer 1 [s]']), axis=1)

printingLogDf = printingLogDf.drop(['texposure_pos1 [s]', 'Texposure_pos1 [s]', 'texp.layer 1 [s]'], axis=1)
# ink <-
# layerHeight <- 'layer height [µm]', layer height [mm]

# Save to csv
printingLogDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)

Next, remove empty rows.

In [184]:
# Remove rows where the 'status' column is NaN
printingLogDf = printingLogDf.dropna(subset=['status'])

# Also remove rows where 'status' is an empty string
printingLogDf = printingLogDf[printingLogDf['status'].astype(bool)]

# Save to csv
printingLogDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)

Next, map the 'status' column

In [185]:
def map_values(row):
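    # Map the status column to a small set of labels.
    # Note: the branches are checked in order, so any status containing 'success'
    # (including 'partial success') is already caught by the first branch.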
    statusRow = row['status']
    
    if 'success' in statusRow or 'Success' in statusRow or 'succes' in statusRow or 'succesfull' in statusRow or 'successfull' in statusRow or 'succsess' in statusRow or 'sucess' in statusRow or 'complete' in statusRow or 'ok' in statusRow or 'succe' in statusRow: 
        statusRow = 'success'
    elif 'failed' in statusRow or 'fail' in statusRow:
        statusRow = 'failed'
    elif 'partial success' in statusRow or 'partial sucess' in statusRow or 'semi' in statusRow or 'almost good' in statusRow:
        statusRow = 'partial success'
    elif 'aborted' in statusRow or 'cancelled' in statusRow:
        statusRow = 'aborted'
    else:
        statusRow = statusRow
        
    row['status'] = statusRow
 
    return row
    
# column to lowercase
printingLogDf['status'] = printingLogDf['status'].str.lower()

# Apply the function to the DataFrame
printingLogDf = printingLogDf.apply(map_values, axis=1)

# Save to csv
printingLogDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)
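
Because substring checks are order-sensitive, one way to make the priorities explicit is an ordered rule table in which the more specific labels come first. The keyword lists below are a hypothetical condensation of the ones above, and this ordering would classify "partial success" differently from the cell above, so the reported percentages would change if it were adopted.

```python
# Rules are evaluated top to bottom; the first matching keyword wins
STATUS_RULES = [
    ("partial success", ("partial", "semi", "almost good")),
    ("aborted", ("aborted", "cancelled")),
    ("failed", ("fail",)),
    ("success", ("succ", "sucess", "complete", "ok")),
]


def classify_status(raw):
    """Map a free-text status to a label, or return it lower-cased if unknown."""
    text = raw.lower()
    for label, keywords in STATUS_RULES:
        if any(keyword in text for keyword in keywords):
            return label
    return text
```

Short keywords such as "ok" still match inside longer words, so spot-checking the unmapped and mapped values remains a good idea.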


Let's remove some annoying chars.

In [186]:
# remove empty-indicating characters
rmCharacters = ['-', '•', '/']
printingLogDf = printingLogDf.replace(rmCharacters, '', regex=True)

# Write to csv
printingLogDf.to_csv("data-frames-cleaned/" + type + '.csv', index=False)

# Database

In [187]:
import sqlite3

# Create a new SQLite database
conn = sqlite3.connect('../analysis/bioprinting.db')

# Expanding the operator_array to a new dataframe
operator_data = [(row['id'], operator) for index, row in overviewDf.iterrows() for operator in row['operator_array']]
operatorsDf = pd.DataFrame(operator_data, columns=['id', 'operator'])

# Creating the main dataframe without the operator_array column
overviewDf = overviewDf.drop('operator_array', axis=1)

# Merge all slot related tables
slotSettingsDf = pd.merge(materialSettingDf, hardwareSetupDf, on=['logId', 'Position'])

# save the changed dataframes again
overviewDf.to_csv("data-frames-cleaned/" + "overview" + '.csv', index=False)
operatorsDf.to_csv("data-frames-cleaned/" + "operators" + '.csv', index=False)
slotSettingsDf.to_csv("data-frames-cleaned/" + "slotSettings" + '.csv', index=False)

# create sqlite tables
overviewDf.to_sql('overview', conn, if_exists='replace', index=False)
operatorsDf.to_sql('operators', conn, if_exists='replace', index=False)
slotSettingsDf.to_sql('slotSettings', conn, if_exists='replace', index=False)
bioInksDf.to_sql('bioInks', conn, if_exists='replace', index=False)
printingLogDf.to_sql('printingLog', conn, if_exists='replace', index=False)

908
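
SQLite has no list column type, which is why the operator arrays are moved into their own table with one row per operator. The sketch below reproduces that step with `DataFrame.explode` on made-up data and uses an in-memory database, so it does not touch `bioprinting.db`.

```python
import sqlite3

import pandas as pd

overview = pd.DataFrame({
    "id": [1, 2],
    "operator_array": [["JV", "TL"], ["JV"]],
})

# One row per (id, operator) pair
operators = (
    overview.explode("operator_array")
    .rename(columns={"operator_array": "operator"})[["id", "operator"]]
    .reset_index(drop=True)
)

conn_demo = sqlite3.connect(":memory:")
operators.to_sql("operators", conn_demo, if_exists="replace", index=False)
counts = pd.read_sql_query(
    "SELECT operator, COUNT(*) AS n FROM operators GROUP BY operator ORDER BY operator",
    conn_demo,
)
conn_demo.close()
```

Reading the table back with `pd.read_sql_query` is also a quick way to verify that the export worked.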

# Test SQL queries

In [757]:
# SQL query to calculate the percentages
query = """
SELECT
  (SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS success_percentage,
  (SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS failed_percentage,
  (SUM(CASE WHEN status NOT IN ('success', 'failed') THEN 1 ELSE 0 END) * 100.0 / COUNT(*)) AS other_percentage
FROM
  printingLog;
"""

# Execute the SQL query
cursor = conn.cursor()
cursor.execute(query)

# Fetch the results
percentages = cursor.fetchone()

# Assign the percentages to variables
success_percentage, failed_percentage, other_percentage = percentages

# Print the results
print(f"Success percentage: {success_percentage}%")
print(f"Failed percentage: {failed_percentage}%")
print(f"Other statuses percentage: {other_percentage}%")

# Close the cursor and the connection
cursor.close()
conn.close()

Success percentage: 52.97356828193833%
Failed percentage: 43.392070484581495%
Other statuses percentage: 3.6343612334801763%
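
As a cross-check for the SQL result, the same shares can be computed directly in pandas with `value_counts(normalize=True)`. The status values below are made up; on the real data one would use `printingLogDf['status']` instead.

```python
import pandas as pd

status = pd.Series(["success", "failed", "success", "aborted"])

# Relative frequency of each status, expressed in percent
shares = status.value_counts(normalize=True).mul(100)

success_share = shares.get("success", 0.0)
failed_share = shares.get("failed", 0.0)
other_share = 100 - success_share - failed_share
```

If the pandas numbers and the SQL numbers disagree, the export step is the first place to look.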
