
## Introduction
Inflammatory Bowel Disease (IBD), which includes Crohn's Disease (CD) and Ulcerative Colitis (UC), involves chronic inflammation of the digestive tract. Although the precise cause of IBD is unclear, serological markers are crucial for diagnosis, monitoring, and predicting therapeutic response. Serology provides insight into the immune response and supports a tailored therapeutic approach. Antibodies such as ASCA and those against Omp-C, cBir1, and I2 help distinguish IBD from non-IBD, and CD from UC. Studies suggest serological markers can predict IBD years before diagnosis. However, missing data in serological studies, arising from patient non-compliance, logistical issues, and loss to follow-up, can compromise their reliability. Traditional methods like listwise deletion or mean imputation often introduce bias. More sophisticated imputation techniques, including machine learning (ML) models such as Random Forest (RF), Neural Networks (NN), and Deep Learning (DL), can offer better solutions by detecting patterns in the data. No single imputation method is best for all scenarios. This study explores and compares ML-based imputation methods for serological data in IBD, simulating various missingness scenarios and evaluating how well the imputed datasets perform in downstream statistical and ML analyses.



## Missingness in IBD
Understanding the different types of missing data is crucial because the type determines how the missing data should be handled. According to Rubin's classification, missing data fall into three categories: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). MCAR represents the highest level of randomness: the pattern of missing values is entirely random, so the likelihood of a value being missing is independent of both observed and unobserved data. An example of MCAR is patient records that are missing because of random administrative errors. MAR occurs when the probability of missingness is related to the observed data in the dataset but not to the unobserved data, for example when blood pressure measurements are more likely to be missing for patients with a higher recorded heart rate. MNAR occurs when the probability of missingness depends on the value of the missing data itself, for example when patients with severe mental health symptoms do not report their condition accurately because of stigma. It is generally impossible to formally test the assumed type of missingness from the observed data alone, although some tests, such as Little's test for MCAR, can provide insight. Distinguishing between MAR and MNAR is particularly challenging and often relies on clinical knowledge. In the current study, we systematically generated all three types of missingness within our datasets.
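
To make the three mechanisms concrete, the following minimal sketch masks one column of a synthetic two-column array under each mechanism. The helper name `make_missing`, the column roles, and the logistic form of the missingness probability are illustrative assumptions, not the procedure used to build the datasets in this study.

```python
import numpy as np

def make_missing(X, mechanism, rate=0.3, seed=0):
    """Return a copy of X with NaNs injected into column 1.

    Hypothetical helper for illustration only:
    - MCAR: every entry of column 1 has the same probability of being missing.
    - MAR:  the probability depends on the (always observed) column 0.
    - MNAR: the probability depends on the value of column 1 itself.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    n = X.shape[0]
    if mechanism == "MCAR":
        prob = np.full(n, rate)
    elif mechanism in ("MAR", "MNAR"):
        driver = X[:, 0] if mechanism == "MAR" else X[:, 1]
        z = (driver - driver.mean()) / driver.std()
        # Logistic link: a row with an average driver value has probability
        # `rate`; larger driver values are more likely to be missing.
        prob = 1.0 / (1.0 + np.exp(-(z + np.log(rate / (1.0 - rate)))))
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    mask = rng.random(n) < prob
    X[mask, 1] = np.nan
    return X

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 2))
for mech in ("MCAR", "MAR", "MNAR"):
    masked = make_missing(data, mech)
    print(mech, "fraction missing:", np.isnan(masked[:, 1]).mean().round(2))
```

Under MNAR in this sketch, the observed values of column 1 are systematically lower than the full column, which is exactly the kind of distortion that cannot be detected from the observed data alone.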

## Imputation methods

In the current study, we focus on multiple imputation using different ML models. We use Multiple Imputation by Chained Equations (MICE) in R, the Iterative Imputer (II) from scikit-learn in Python, and Autoencoders (AEs) implemented with PyTorch in Python. MICE uses chained equations, a series of regression models, to impute missing values: for each variable with missing data, a regression model is fitted using the other variables as predictors, and the procedure cycles through the variables iteratively until convergence. Similarly, the II models each feature with missing values as a function of the other features in a round-robin fashion, training a regression model on the observed values and predicting the missing ones, until convergence or until the maximum number of iterations is reached. II can be used as a multiple imputation technique by applying it repeatedly to the same dataset with different random seeds. Both MICE and II can be combined with different ML models. Additionally, we employ AEs and variational autoencoders (VAEs) as neural network (NN) models for data imputation. AEs, a type of NN for unsupervised learning, aim to learn efficient data representations for dimensionality reduction or feature extraction. They consist of an encoder that maps the input data to a latent space and a decoder that reconstructs the input from this latent representation. VAEs extend this concept by treating the latent space probabilistically: inputs are mapped to distributions, and new data samples can be generated by sampling from these distributions. This makes VAEs particularly useful for generative tasks and for learning more meaningful, continuous latent spaces. For multiple imputation, we generate five imputed datasets for each missingness scenario: MICE produces them directly (m = 5), while the IIs, AEs, and VAEs are each run five times with different random seeds.
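
The round-robin procedure shared by MICE and II can be sketched in a few lines of NumPy. This is a deliberately simplified, deterministic version that uses ordinary least squares for every column and omits the random draws that make MICE a multiple-imputation method; `chained_imputation` and its parameters are hypothetical names for illustration and are not part of `mice` or scikit-learn.

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Round-robin regression imputation (simplified, single imputation)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: start from column means so every regression has complete predictors.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Step 2: regress column j on all other columns (plus an intercept),
            # fitting only on rows where column j was actually observed.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            # Step 3: overwrite the missing entries of column j with predictions.
            filled[rows, j] = design[rows] @ coef
    return filled

# Synthetic check: the third column is a linear function of the first two.
rng = np.random.default_rng(0)
full = rng.normal(size=(200, 2))
full = np.column_stack([full, 2.0 * full[:, 0] - full[:, 1]])
holed = full.copy()
holed[rng.random(200) < 0.2, 2] = np.nan
imputed = chained_imputation(holed)
print("max abs error on imputed cells:", np.abs(imputed - full).max())
```

In the notebook itself, the variation between imputations comes from `mice` with `m = 5` and from re-running `IterativeImputer` with `random_state` set to 1 through 5.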


In [None]:
#Load the necessary libraries

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
from sklearn.model_selection import StratifiedKFold
import copy
import os
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
import torch
from torch import nn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
import torch.nn.functional as F

In [None]:
#Define the necessary functions

def impute_file(input_file_path, output_dir, estimator, method, n_imputations=5):
    data = pd.read_csv(input_file_path)
    # Automatically detect the name of the first column
    id_column_name = data.columns[0]
    id_data = data[id_column_name]
    # Exclude the first column by its detected name for imputation
    data_numeric = data.drop(columns=[id_column_name])
    #print(data.head())

    for i in range(1, n_imputations + 1):
        imputer = IterativeImputer(estimator=estimator,
                                   max_iter=10, random_state=i, imputation_order='random',tol=0.5)
        imputed_data_numeric = imputer.fit_transform(data_numeric)
        imputed_df = pd.DataFrame(imputed_data_numeric, columns=data_numeric.columns)
        # Reinsert the identifier column using its original name
        imputed_df.insert(0, id_column_name, id_data)

        base_filename = os.path.splitext(os.path.basename(input_file_path))[0]
        output_file_name = f"{base_filename}_Imputed_{method}_{i}.csv"
        output_file_path = os.path.join(output_dir, output_file_name)

        imputed_df.to_csv(output_file_path, index=False)
        print(f"Imputed dataset {i} saved to: {output_file_path}")


def impute_all_files(input_dir, output_dir,estimator,method,n_imputations=5):
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    file_paths = [os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith('.csv')]
    # Impute each file
    for file_path in file_paths:
        impute_file(file_path, output_dir, estimator,method,n_imputations)

def impute_file_AES(input_file_path, output_dir, estimator_factory, method, n_imputations=5):
    data = pd.read_csv(input_file_path)
    id_column_name = data.columns[0]
    id_data = data[id_column_name]
    data_numeric = data.drop(columns=[id_column_name])

    for i in range(1, n_imputations + 1):
        # Create a fresh instance of the estimator using the factory function with i as the seed
        imputer = estimator_factory(seed=i)
        imputed_data_numeric = imputer.fit_transform(data_numeric)
        imputed_df = pd.DataFrame(imputed_data_numeric, columns=data_numeric.columns)

        imputed_df.insert(0, id_column_name, id_data)



        base_filename = os.path.splitext(os.path.basename(input_file_path))[0]
        output_file_name = f"{base_filename}_Imputed_{method}_{i}.csv"
        output_file_path = os.path.join(output_dir, output_file_name)

        imputed_df.to_csv(output_file_path, index=False)
        print(f"Imputed dataset {i} saved to: {output_file_path}")

def impute_all_files_AES(input_dir, output_dir, estimator_factory, method, n_imputations=5):
    os.makedirs(output_dir, exist_ok=True)
    file_paths = [os.path.join(input_dir, f) for f in os.listdir(input_dir) if f.endswith('.csv')]

    for file_path in file_paths:
        impute_file_AES(file_path, output_dir, estimator_factory, method, n_imputations)



In [None]:
import torch
from torch import nn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
import numpy as np
import random

class StackedNumericalAutoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dims):
        super(StackedNumericalAutoencoder, self).__init__()
        self.encoding_dims = encoding_dims
        self.encoders = nn.ModuleList()
        self.decoders = nn.ModuleList()

        for encoding_dim in self.encoding_dims:
            self.encoders.append(nn.Linear(input_dim, encoding_dim))
            self.decoders.append(nn.Linear(encoding_dim, input_dim))
            input_dim = encoding_dim

    def forward(self, x):
        for encoder in self.encoders:
            x = encoder(x).relu()
        for decoder in reversed(self.decoders):
            x = decoder(x).relu()
        return x

class StackedMultiLayerAutoencoderImputer(BaseEstimator, TransformerMixin):
    def __init__(self, encoding_dims=[32], n_layers=1, epochs=50, batch_size=256, seed=42):
        self.encoding_dims = encoding_dims
        self.n_layers = n_layers
        self.epochs = epochs
        self.batch_size = batch_size
        self.autoencoders = []
        self.scaler = StandardScaler()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.seed = seed
        self._set_seed()

    def _set_seed(self):
        # seed=None (the factory default) leaves the RNG state untouched;
        # torch.manual_seed() does not accept None.
        if self.seed is None:
            return
        torch.manual_seed(self.seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(self.seed)
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
        np.random.seed(self.seed)
        random.seed(self.seed)

    def _create_autoencoders(self, input_dim):
        autoencoders = []
        for _ in range(self.n_layers):
            autoencoder = StackedNumericalAutoencoder(input_dim, self.encoding_dims)
            autoencoders.append(autoencoder.to(self.device))
        return autoencoders

    def fit(self, X, y=None):
        X_imputed = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
        X_scaled = self.scaler.fit_transform(X_imputed)
        input_dim = X_scaled.shape[1]
        self.autoencoders = self._create_autoencoders(input_dim)

        X_transformed = torch.FloatTensor(X_scaled).to(self.device)

        for autoencoder in self.autoencoders:
            criterion = nn.MSELoss()
            optimizer = torch.optim.Adam(autoencoder.parameters())
            X_transformed = X_transformed.detach()  # Detach from the computation graph

            for epoch in range(self.epochs):
                autoencoder.train()
                permutation = torch.randperm(X_transformed.size()[0])
                for i in range(0, X_transformed.size()[0], self.batch_size):
                    indices = permutation[i:i+self.batch_size]
                    batch_x = X_transformed[indices]

                    optimizer.zero_grad()
                    outputs = autoencoder(batch_x)
                    loss = criterion(outputs, batch_x)
                    loss.backward()
                    optimizer.step()

            # After all epochs, pass the full dataset through the trained
            # autoencoder so the next one in the stack trains on its output.
            autoencoder.eval()
            with torch.no_grad():
                X_transformed = autoencoder(X_transformed)

        return self

    def transform(self, X, y=None):
        X_imputed = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
        X_scaled = self.scaler.transform(X_imputed)
        X_scaled_tensor = torch.FloatTensor(X_scaled).to(self.device)

        # Each autoencoder's forward pass already encodes and decodes,
        # so this loop reconstructs the data once per autoencoder.
        for autoencoder in self.autoencoders:
            autoencoder.eval()
            with torch.no_grad():
                X_scaled_tensor = autoencoder(X_scaled_tensor)

        # Second reconstruction pass through the same autoencoders, in reverse order
        for autoencoder in reversed(self.autoencoders):
            with torch.no_grad():
                X_scaled_tensor = autoencoder(X_scaled_tensor)

        X_imputed_scaled = X_scaled_tensor.cpu().numpy()
        X_imputed = self.scaler.inverse_transform(X_imputed_scaled)
        return np.where(np.isnan(X), X_imputed, X)

def create_sautoencoder_estimator(encoding_dims=[32], n_layers=3, epochs=50, batch_size=256, seed=None):
    return StackedMultiLayerAutoencoderImputer(encoding_dims=encoding_dims, n_layers=n_layers, epochs=epochs, batch_size=batch_size, seed=seed)


class VariationalAutoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dims):
        super(VariationalAutoencoder, self).__init__()
        self.encoding_dims = encoding_dims
        self.encoders = nn.ModuleList()
        self.decoders = nn.ModuleList()

        # Adding layers for mean and log variance
        self.mu_layers = nn.ModuleList()
        self.logvar_layers = nn.ModuleList()

        for encoding_dim in self.encoding_dims:
            self.encoders.append(nn.Linear(input_dim, encoding_dim))
            self.mu_layers.append(nn.Linear(encoding_dim, encoding_dim))
            self.logvar_layers.append(nn.Linear(encoding_dim, encoding_dim))
            self.decoders.append(nn.Linear(encoding_dim, input_dim))
            input_dim = encoding_dim

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, logvar = x, x
        for encoder, mu_layer, logvar_layer in zip(self.encoders, self.mu_layers, self.logvar_layers):
            x = F.relu(encoder(x))
            mu = mu_layer(x)
            logvar = logvar_layer(x)

        z = self.reparameterize(mu, logvar)

        for decoder in reversed(self.decoders):
            z = F.relu(decoder(z))

        return z, mu, logvar

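    def loss_function(self, recon_x, x, mu, logvar):
        # Reconstruction term: summed squared error (not binary cross-entropy)
        recon_loss = F.mse_loss(recon_x, x, reduction='sum')
        # KL divergence between N(mu, sigma^2) and the standard normal prior
        kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kld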


class StackedVAEImputer(BaseEstimator, TransformerMixin):
    def __init__(self, encoding_dims=[32], n_layers=1, epochs=50, batch_size=256, seed=42):
        self.encoding_dims = encoding_dims
        self.n_layers = n_layers
        self.epochs = epochs
        self.batch_size = batch_size
        self.vaes = []
        self.scaler = StandardScaler()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.seed = seed
        self._set_seed()

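    def _set_seed(self):
        # seed=None (the factory default) leaves the RNG state untouched;
        # torch.manual_seed() does not accept None.
        if self.seed is None:
            return
        torch.manual_seed(self.seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(self.seed)
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
        np.random.seed(self.seed)
        random.seed(self.seed)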

    def _create_vaes(self, input_dim):
        vaes = []
        for _ in range(self.n_layers):
            vae = VariationalAutoencoder(input_dim, self.encoding_dims)
            vaes.append(vae.to(self.device))
        return vaes

    def fit(self, X, y=None):
        X_imputed = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
        X_scaled = self.scaler.fit_transform(X_imputed)
        input_dim = X_scaled.shape[1]
        self.vaes = self._create_vaes(input_dim)

        X_transformed = torch.FloatTensor(X_scaled).to(self.device)

        for vae in self.vaes:
            criterion = vae.loss_function
            optimizer = torch.optim.Adam(vae.parameters())
            X_transformed = X_transformed.detach()  # Detach from the computation graph

            for epoch in range(self.epochs):
                vae.train()
                permutation = torch.randperm(X_transformed.size()[0])
                for i in range(0, X_transformed.size()[0], self.batch_size):
                    indices = permutation[i:i+self.batch_size]
                    batch_x = X_transformed[indices]

                    optimizer.zero_grad()
                    outputs, mu, logvar = vae(batch_x)
                    loss = criterion(outputs, batch_x, mu, logvar)
                    loss.backward()
                    optimizer.step()

            # After all epochs, pass the full dataset through the trained VAE
            # so the next one in the stack trains on its reconstruction.
            vae.eval()
            with torch.no_grad():
                outputs, _, _ = vae(X_transformed)
                X_transformed = outputs

        return self

    def transform(self, X, y=None):
        X_imputed = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
        X_scaled = self.scaler.transform(X_imputed)
        X_scaled_tensor = torch.FloatTensor(X_scaled).to(self.device)

        # Pass the data through each trained VAE in turn; every forward pass
        # encodes, samples from the latent distribution, and decodes.
        for vae in self.vaes:
            vae.eval()
            with torch.no_grad():
                outputs, _, _ = vae(X_scaled_tensor)
                X_scaled_tensor = outputs

        # Map the reconstruction back to the original scale
        X_imputed_scaled = X_scaled_tensor.cpu().numpy()
        X_imputed = self.scaler.inverse_transform(X_imputed_scaled)
        return np.where(np.isnan(X), X_imputed, X)


def create_svautoencoder_estimator(encoding_dims=[32], n_layers=3, epochs=50, batch_size=256, seed=None):
    return StackedVAEImputer(encoding_dims=encoding_dims, n_layers=n_layers, epochs=epochs, batch_size=batch_size, seed=seed)



# Function to create output directories
def create_output_dirs(method, num_folders, imputed_data_root, missing_type):
    return [f'{imputed_data_root}{method}_{missing_type}_{i+1}/' for i in range(num_folders)]
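
Both imputer classes above follow the same pattern: fill the gaps with column means, standardise, reconstruct, undo the scaling, and copy reconstructed values only into the cells that were originally missing. The sketch below isolates that pattern in NumPy. The names `impute_with_reconstruction` and `rank_one` are hypothetical, and a rank-1 SVD approximation stands in for the trained network purely as an assumption for illustration.

```python
import numpy as np

def impute_with_reconstruction(X, reconstruct):
    """Mean-fill, standardise, reconstruct, and keep observed values.

    `reconstruct` is any callable mapping a scaled array to an array of the
    same shape; in the notebook this role is played by the trained AE or VAE.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    mean, std = filled.mean(axis=0), filled.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    recon = reconstruct((filled - mean) / std) * std + mean
    # Only the originally missing cells take the reconstructed values.
    return np.where(missing, recon, X)

def rank_one(Z):
    """Stand-in reconstruction (assumption): best rank-1 approximation via SVD."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])
print(impute_with_reconstruction(X, rank_one))
```

Because of the final `np.where`, observed measurements are never altered, whatever reconstruction model is plugged in.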

## Reading the data

In [None]:


num_folders = 3  # Number of datasets to impute for each missingness type
input_dir_root='/content/drive/MyDrive/Data_imputaion/For_GitHub/Data_with_missingness/'
imputed_data_root='/content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/'

input_dirs_MAR=[]
missing_type='MAR'
for i in range(num_folders):
    input_dirs_MAR.append(input_dir_root+'Numerical_data_'+missing_type+f'_{i+1}/data/')

input_dirs_MCAR=[]
missing_type= 'MCAR'
for i in range(num_folders):
    input_dirs_MCAR.append(input_dir_root+'Numerical_data_'+missing_type+f'_{i+1}/data/')

input_dirs_MNAR=[]
missing_type= 'MNAR'
for i in range(num_folders):
    input_dirs_MNAR.append(input_dir_root+'Numerical_data_'+missing_type+f'_{i+1}/data/')



# MCAR

In [None]:
input_dirs=input_dirs_MCAR.copy()
missing_type='MCAR'
# Define the imputation methods and their corresponding estimators
imputation_methods = [
    ('II (RF)', RandomForestRegressor(
        n_estimators=10, max_depth=8, min_samples_leaf=5,
        max_features='sqrt', n_jobs=-1, random_state=42)),
    ('II (BR)', BayesianRidge()),
    ('II (KNN)', KNeighborsRegressor(n_neighbors=50))
]



# Impute files for each method
for method, estimator in imputation_methods:
    output_dirs = create_output_dirs(method, num_folders, imputed_data_root, missing_type)
    for input_dir, output_dir in zip(input_dirs, output_dirs):
        print(input_dir)
        impute_all_files(input_dir, output_dir, estimator, method)

# Autoencoder and Variational Autoencoder methods
ae_vae_methods = [
    ('AE', create_sautoencoder_estimator),
    ('VAE', create_svautoencoder_estimator)
]

for method, create_estimator in ae_vae_methods:
    output_dirs = create_output_dirs(method, num_folders, imputed_data_root, missing_type)
    for input_dir, output_dir in zip(input_dirs, output_dirs):
        impute_all_files_AES(
            input_dir, output_dir,
            lambda seed=None, create_estimator=create_estimator: create_estimator(encoding_dims=[32, 16], n_layers=1, epochs=100, seed=seed),
            method, n_imputations=5
        )


# MAR

In [None]:
input_dirs=input_dirs_MAR.copy()
# Define the imputation methods and their corresponding estimators
missing_type='MAR'
imputation_methods = [
    ('II (RF)', RandomForestRegressor(
        n_estimators=10, max_depth=8, min_samples_leaf=5,
        max_features='sqrt', n_jobs=-1, random_state=42)),
    ('II (BR)', BayesianRidge()),
    ('II (KNN)', KNeighborsRegressor(n_neighbors=50))
]



# Impute files for each method
for method, estimator in imputation_methods:
    output_dirs = create_output_dirs(method, num_folders, imputed_data_root, missing_type)
    for input_dir, output_dir in zip(input_dirs, output_dirs):
        print(input_dir)
        impute_all_files(input_dir, output_dir, estimator, method)

# Autoencoder and Variational Autoencoder methods
ae_vae_methods = [
    ('AE', create_sautoencoder_estimator),
    ('VAE', create_svautoencoder_estimator)
]

for method, create_estimator in ae_vae_methods:
    output_dirs = create_output_dirs(method, num_folders, imputed_data_root, missing_type)
    for input_dir, output_dir in zip(input_dirs, output_dirs):
        impute_all_files_AES(
            input_dir, output_dir,
            lambda seed=None, create_estimator=create_estimator: create_estimator(encoding_dims=[32, 16], n_layers=1, epochs=100, seed=seed),
            method, n_imputations=5
        )


/content/drive/MyDrive/Data_imputaion/For_GitHub/Data_with_missingness/Numerical_data_MAR_1/data/
Imputed dataset 1 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MAR_1/nan_added_40_Imputed_II (RF)_1.csv
Imputed dataset 2 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MAR_1/nan_added_40_Imputed_II (RF)_2.csv
Imputed dataset 3 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MAR_1/nan_added_40_Imputed_II (RF)_3.csv
Imputed dataset 4 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MAR_1/nan_added_40_Imputed_II (RF)_4.csv
Imputed dataset 5 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MAR_1/nan_added_40_Imputed_II (RF)_5.csv
Imputed dataset 1 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MAR_1/nan_added_35_Imputed_II (RF)_1.csv
Imputed dataset 2 saved to: /content/drive/M

# MNAR

In [None]:
input_dirs=input_dirs_MNAR.copy()
missing_type='MNAR'
# Define the imputation methods and their corresponding estimators
imputation_methods = [
    ('II (RF)', RandomForestRegressor(
        n_estimators=10, max_depth=8, min_samples_leaf=5,
        max_features='sqrt', n_jobs=-1, random_state=42)),
    ('II (BR)', BayesianRidge()),
    ('II (KNN)', KNeighborsRegressor(n_neighbors=50))
]



# Impute files for each method
for method, estimator in imputation_methods:
    output_dirs = create_output_dirs(method, num_folders, imputed_data_root, missing_type)
    for input_dir, output_dir in zip(input_dirs, output_dirs):
        print(input_dir)
        impute_all_files(input_dir, output_dir, estimator, method)

# Autoencoder and Variational Autoencoder methods
ae_vae_methods = [
    ('AE', create_sautoencoder_estimator),
    ('VAE', create_svautoencoder_estimator)
]

for method, create_estimator in ae_vae_methods:
    output_dirs = create_output_dirs(method, num_folders, imputed_data_root, missing_type)
    for input_dir, output_dir in zip(input_dirs, output_dirs):
        impute_all_files_AES(
            input_dir, output_dir,
            lambda seed=None, create_estimator=create_estimator: create_estimator(encoding_dims=[32, 16], n_layers=1, epochs=100, seed=seed),
            method, n_imputations=5
        )


/content/drive/MyDrive/Data_imputaion/For_GitHub/Data_with_missingness/Numerical_data_MNAR_1/data/
Imputed dataset 1 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MNAR_1/nan_added_5_Imputed_II (RF)_1.csv
Imputed dataset 2 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MNAR_1/nan_added_5_Imputed_II (RF)_2.csv
Imputed dataset 3 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MNAR_1/nan_added_5_Imputed_II (RF)_3.csv
Imputed dataset 4 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MNAR_1/nan_added_5_Imputed_II (RF)_4.csv
Imputed dataset 5 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MNAR_1/nan_added_5_Imputed_II (RF)_5.csv
Imputed dataset 1 saved to: /content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/II (RF)_MNAR_1/nan_added_40_Imputed_II (RF)_1.csv
Imputed dataset 2 saved to: /content/drive

# Using R

In [None]:
#for R magic
%load_ext rpy2.ipython

In [None]:
num_folders = 3  # Number of datasets to impute for each missingness type
input_dir_root='/content/drive/MyDrive/Data_imputaion/For_GitHub/Data_with_missingness/'
imputed_data_root='/content/drive/MyDrive/Data_imputaion/For_GitHub/Imputed_datasets/'

input_dirs_MAR=[]
missing_type='MAR'
for i in range(num_folders):
    input_dirs_MAR.append(input_dir_root+'Numerical_data_'+missing_type+f'_{i+1}/data/')

input_dirs_MCAR=[]
missing_type= 'MCAR'
for i in range(num_folders):
    input_dirs_MCAR.append(input_dir_root+'Numerical_data_'+missing_type+f'_{i+1}/data/')

input_dirs_MNAR=[]
missing_type= 'MNAR'
for i in range(num_folders):
    input_dirs_MNAR.append(input_dir_root+'Numerical_data_'+missing_type+f'_{i+1}/data/')


%R -i input_dirs_MAR
%R -i input_dirs_MCAR
%R -i input_dirs_MNAR
%R -i imputed_data_root
%R -i num_folders

# MCAR

In [None]:
%%R
# Install necessary packages if not already installed
if (!requireNamespace("mice", quietly = TRUE)) install.packages("mice")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")

library(mice)
library(dplyr)

# Function to impute all files in a directory
impute_all_files <- function(input_dir, output_dir, method, method_n) {
  # Ensure the output directory exists, create if it does not
  if (!dir.exists(output_dir)) {
    dir.create(output_dir, recursive = TRUE)
  }

  # `pattern` is a regular expression, not a glob: match names ending in ".csv"
  file_paths <- list.files(path = input_dir, full.names = TRUE, pattern = "\\.csv$")

  impute_file <- function(file_path, output_dir, method, method_n) {
    data <- read.csv(file_path)
    id_column_name <- names(data)[1] # Dynamically get the name of the first column
    data_numeric <- data %>% select(-!!sym(id_column_name))

    imputed_data <- mice(data_numeric, m = 5, method = method, maxit = 10, seed = 500)

    for(i in 1:imputed_data$m) {
      complete_data <- complete(imputed_data, action = i)
      # Directly add the ID column back to the complete_data DataFrame
      complete_data <- mutate(complete_data, !!id_column_name := data[[id_column_name]])
      base_filename <- tools::file_path_sans_ext(basename(file_path))
      modified_filename <- paste0(base_filename, "_Imputed_", method_n, '_', i, ".csv")
      output_file_path <- file.path(output_dir, modified_filename)
      write.csv(complete_data, output_file_path, row.names = FALSE)
    }

    return(base_filename)
  }

  # Apply the imputation function to each file
  lapply(file_paths, impute_file, output_dir = output_dir, method = method, method_n = method_n)
}

# Define missing type
missing_type <- "MCAR"

# Directories setup
num_folders <- length(input_dirs_MCAR)
output_dirs <- vector("list", num_folders)

imputation_methods <- list(
  list(method = 'pmm', method_n = 'MICE (PMM)'),
  list(method = 'cart', method_n = 'MICE (CART)'),
  list(method = 'rf', method_n = 'MICE (RF)'),
  list(method = 'norm', method_n = 'MICE (NORM)'),
  list(method = 'norm.boot', method_n = 'MICE (NORM.BOOT)')
)

for (method_info in imputation_methods) {
  method <- method_info$method
  method_n <- method_info$method_n

  input_dirs <- input_dirs_MCAR
  for (i in 1:num_folders) {
    output_dirs[[i]] <- sprintf("%s/%s_%s_%d/", imputed_data_root, method_n, missing_type, i)
  }

  # Loop through each pair of input and output directories and apply the imputation
  for(i in 1:num_folders) {
    cat("Imputing for directory:", i, "using method:", method_n, "\n")
    impute_all_files(input_dirs[[i]], output_dirs[[i]], method, method_n)
  }
}


Imputing for directory: 1 using method: MICE (PMM) 

 iter imp variable
  1   1  anca  cbir  ompc  iga.asca  igg.asca  i2
  1   2  anca  cbir  ompc  iga.asca  igg.asca  i2
  1   3  anca  cbir  ompc  iga.asca  igg.asca  i2
  1   4  anca  cbir  ompc  iga.asca  igg.asca  i2
  1   5  anca  cbir  ompc  iga.asca  igg.asca  i2
  2   1  anca  cbir  ompc  iga.asca  igg.asca  i2
  2   2  anca  cbir  ompc  iga.asca  igg.asca  i2
  2   3  anca  cbir  ompc  iga.asca  igg.asca  i2
  2   4  anca  cbir  ompc  iga.asca  igg.asca  i2
  2   5  anca  cbir  ompc  iga.asca  igg.asca  i2
  3   1  anca  cbir  ompc  iga.asca  igg.asca  i2
  3   2  anca  cbir  ompc  iga.asca  igg.asca  i2
  3   3  anca  cbir  ompc  iga.asca  igg.asca  i2
  3   4  anca  cbir  ompc  iga.asca  igg.asca  i2
  3   5  anca  cbir  ompc  iga.asca  igg.asca  i2
  4   1  anca  cbir  ompc  iga.asca  igg.asca  i2
  4   2  anca  cbir  ompc  iga.asca  igg.asca  i2
  4   3  anca  cbir  ompc  iga.asca  igg.asca  i2
  4   4  anca  cbir  ompc  i


# MAR

In [None]:
%%R
# Install necessary packages if not already installed
if (!requireNamespace("mice", quietly = TRUE)) install.packages("mice")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")

library(mice)
library(dplyr)

# Function to impute all files in a directory
impute_all_files <- function(input_dir, output_dir, method, method_n) {
  # Ensure the output directory exists, create if it does not
  if (!dir.exists(output_dir)) {
    dir.create(output_dir, recursive = TRUE)
  }

  # `pattern` is a regular expression, not a glob: match names ending in ".csv"
  file_paths <- list.files(path = input_dir, full.names = TRUE, pattern = "\\.csv$")

  impute_file <- function(file_path, output_dir, method, method_n) {
    data <- read.csv(file_path)
    id_column_name <- names(data)[1] # Dynamically get the name of the first column
    data_numeric <- data %>% select(-!!sym(id_column_name))

    imputed_data <- mice(data_numeric, m = 5, method = method, maxit = 10, seed = 500)

    for(i in 1:imputed_data$m) {
      complete_data <- complete(imputed_data, action = i)
      # Directly add the ID column back to the complete_data DataFrame
      complete_data <- mutate(complete_data, !!id_column_name := data[[id_column_name]])
      base_filename <- tools::file_path_sans_ext(basename(file_path))
      modified_filename <- paste0(base_filename, "_Imputed_", method_n, '_', i, ".csv")
      output_file_path <- file.path(output_dir, modified_filename)
      write.csv(complete_data, output_file_path, row.names = FALSE)
    }

    return(base_filename)
  }

  # Apply the imputation function to each file
  lapply(file_paths, impute_file, output_dir = output_dir, method = method, method_n = method_n)
}

# Define missing type
missing_type <- "MAR"

# Directories setup
num_folders <- length(input_dirs_MAR)
output_dirs <- vector("list", num_folders)

imputation_methods <- list(
  list(method = 'pmm', method_n = 'MICE (PMM)'),
  list(method = 'cart', method_n = 'MICE (CART)'),
  list(method = 'rf', method_n = 'MICE (RF)'),
  list(method = 'norm', method_n = 'MICE (NORM)'),
  list(method = 'norm.boot', method_n = 'MICE (NORM.BOOT)')
)

for (method_info in imputation_methods) {
  method <- method_info$method
  method_n <- method_info$method_n

  input_dirs <- input_dirs_MAR
  # seq_len() yields an empty sequence when there are no folders, unlike 1:num_folders
  for (i in seq_len(num_folders)) {
    output_dirs[[i]] <- sprintf("%s/%s_%s_%d/", imputed_data_root, method_n, missing_type, i)
  }

  # Loop through each pair of input and output directories and apply the imputation
  for (i in seq_len(num_folders)) {
    cat("Imputing for directory:", i, "using method:", method_n, "\n")
    impute_all_files(input_dirs[[i]], output_dirs[[i]], method, method_n)
  }
}



#MNAR

In [None]:
%%R
# Install necessary packages if not already installed
if (!requireNamespace("mice", quietly = TRUE)) install.packages("mice")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")

library(mice)
library(dplyr)

# Function to impute all files in a directory
impute_all_files <- function(input_dir, output_dir, method, method_n) {
  # Ensure the output directory exists, create if it does not
  if (!dir.exists(output_dir)) {
    dir.create(output_dir, recursive = TRUE)
  }

  # list.files() expects a regular expression, not a glob, so anchor on the ".csv" extension
  file_paths <- list.files(path = input_dir, full.names = TRUE, pattern = "\\.csv$")

  impute_file <- function(file_path, output_dir, method, method_n) {
    data <- read.csv(file_path)
    id_column_name <- names(data)[1] # Dynamically get the name of the first column
    data_numeric <- data %>% select(-!!sym(id_column_name))
    base_filename <- tools::file_path_sans_ext(basename(file_path))

    # Run MICE with m = 5 imputations; the fixed seed keeps the results reproducible
    imputed_data <- mice(data_numeric, m = 5, method = method, maxit = 10, seed = 500)

    # Write each of the m completed datasets to its own CSV file
    for (i in seq_len(imputed_data$m)) {
      complete_data <- complete(imputed_data, action = i)
      # Add the ID column back to the completed data frame (appended as the last column)
      complete_data <- mutate(complete_data, !!id_column_name := data[[id_column_name]])
      modified_filename <- paste0(base_filename, "_Imputed_", method_n, "_", i, ".csv")
      output_file_path <- file.path(output_dir, modified_filename)
      write.csv(complete_data, output_file_path, row.names = FALSE)
    }

    return(base_filename)
  }

  # Apply the imputation function to each file
  lapply(file_paths, impute_file, output_dir = output_dir, method = method, method_n = method_n)
}

# Define missing type
missing_type <- "MNAR"

# Directories setup
num_folders <- length(input_dirs_MNAR)
output_dirs <- vector("list", num_folders)

imputation_methods <- list(
  list(method = 'pmm', method_n = 'MICE (PMM)'),
  list(method = 'cart', method_n = 'MICE (CART)'),
  list(method = 'rf', method_n = 'MICE (RF)'),
  list(method = 'norm', method_n = 'MICE (NORM)'),
  list(method = 'norm.boot', method_n = 'MICE (NORM.BOOT)')
)

for (method_info in imputation_methods) {
  method <- method_info$method
  method_n <- method_info$method_n

  input_dirs <- input_dirs_MNAR
  # seq_len() yields an empty sequence when there are no folders, unlike 1:num_folders
  for (i in seq_len(num_folders)) {
    output_dirs[[i]] <- sprintf("%s/%s_%s_%d/", imputed_data_root, method_n, missing_type, i)
  }

  # Loop through each pair of input and output directories and apply the imputation
  for (i in seq_len(num_folders)) {
    cat("Imputing for directory:", i, "using method:", method_n, "\n")
    impute_all_files(input_dirs[[i]], output_dirs[[i]], method, method_n)
  }
}


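
The imputation cells write one CSV per completed dataset, so a quick sanity check is to confirm that no missing cells remain in those files. The following is a minimal sketch in Python (the notebook's host language), not part of the original pipeline. It assumes the files were produced by `write.csv` with its default settings, which encode missing values as `NA`. The helper names `expected_output_name`, `count_missing_cells`, and `check_imputed_dir` are hypothetical and introduced here only for illustration.

```python
import csv
from pathlib import Path


def expected_output_name(input_file, method_n, i):
    """Mirror the naming used in impute_file: <base>_Imputed_<method_n>_<i>.csv."""
    base = Path(input_file).stem
    return f"{base}_Imputed_{method_n}_{i}.csv"


def count_missing_cells(csv_path, na_tokens=("NA", "")):
    """Count cells that still hold an NA token in one CSV file (header row skipped)."""
    missing = 0
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        for row in reader:
            missing += sum(1 for cell in row if cell.strip() in na_tokens)
    return missing


def check_imputed_dir(output_dir):
    """Return {file name: number of missing cells} for every CSV in output_dir."""
    return {
        path.name: count_missing_cells(path)
        for path in sorted(Path(output_dir).glob("*.csv"))
    }
```

Calling `check_imputed_dir` on one of the output directories built above returns a dictionary in which every value should be 0 if the imputation filled all cells. A non-zero count would suggest that `mice` skipped a variable, which is worth inspecting before the imputed datasets are used downstream.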