# IA04: The Impact of Missing Data in Machine Learning

Michał Dawid Kowalski (up202401554)

A. Develop an empirical analysis on the impact of missing data in machine learning algorithms. You may start by collecting several datasets without missing values and implement some solutions to create different missing mechanisms (MCAR, MAR, and MNAR). Then, select among several approaches to handle missing data (e.g., classifying with missing values, performing data imputation) and compare the obtained classification performance across different approaches. Do the results vary depending on the missing mechanism? Is the top performer in terms of classification metrics also the top performer if the focus shifts towards Predictive or Distributional Accuracy?

In [4]:
# Environment setting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import arff
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer
from sklearn.metrics import classification_report, mean_absolute_error
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# 1. **Collecting and Preparing Datasets**

First, I chose and preprocessed several datasets without missing values (or, where a dataset had some, dropped the affected rows) for the subsequent analysis of the impact of missing data on machine learning classification.

## 1.1 **Autism Screening Adult Dataset**

This dataset focuses on **autism screening** for adults, containing 20 features. It includes ten behavioral traits (AQ-10-Adult) and ten personal characteristics. The goal is to identify influential traits and improve ASD classification. It supports research on efficient and accessible ASD screening methods.

In [8]:
# Open the .arff file
data, meta = arff.loadarff('Autism-Adult-Data.arff')

# Convert to DataFrame format
df_a = pd.DataFrame(data)

# Decode byte strings
for col in df_a.select_dtypes([object]):
    df_a[col] = df_a[col].str.decode('utf-8')

In [10]:
df_a.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 704 entries, 0 to 703
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   A1_Score         704 non-null    object 
 1   A2_Score         704 non-null    object 
 2   A3_Score         704 non-null    object 
 3   A4_Score         704 non-null    object 
 4   A5_Score         704 non-null    object 
 5   A6_Score         704 non-null    object 
 6   A7_Score         704 non-null    object 
 7   A8_Score         704 non-null    object 
 8   A9_Score         704 non-null    object 
 9   A10_Score        704 non-null    object 
 10  age              702 non-null    float64
 11  gender           704 non-null    object 
 12  ethnicity        704 non-null    object 
 13  jundice          704 non-null    object 
 14  austim           704 non-null    object 
 15  contry_of_res    704 non-null    object 
 16  used_app_before  704 non-null    object 
 17  result          

In [12]:
df_a.head(5)

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,...,gender,ethnicity,jundice,austim,contry_of_res,used_app_before,result,age_desc,relation,Class/ASD
0,1,1,1,1,0,0,1,1,0,0,...,f,White-European,no,no,United States,no,6.0,18 and more,Self,NO
1,1,1,0,1,0,0,0,1,0,1,...,m,Latino,no,yes,Brazil,no,5.0,18 and more,Self,NO
2,1,1,0,1,1,0,1,1,1,1,...,m,Latino,yes,yes,Spain,no,8.0,18 and more,Parent,YES
3,1,1,0,1,0,0,1,1,0,1,...,f,White-European,no,yes,United States,no,6.0,18 and more,Self,NO
4,1,0,0,0,0,0,0,1,0,0,...,f,?,no,no,Egypt,no,2.0,18 and more,?,NO


In [14]:
# Classes distribution
df_a['Class/ASD'].value_counts()

Class/ASD
NO     515
YES    189
Name: count, dtype: int64

In [16]:
# Check missing values
df_a.replace("?", np.nan, inplace=True) # Replace '?' to NaN
df_a.isnull().sum()

A1_Score            0
A2_Score            0
A3_Score            0
A4_Score            0
A5_Score            0
A6_Score            0
A7_Score            0
A8_Score            0
A9_Score            0
A10_Score           0
age                 2
gender              0
ethnicity          95
jundice             0
austim              0
contry_of_res       0
used_app_before     0
result              0
age_desc            0
relation           95
Class/ASD           0
dtype: int64

### **Data Preprocessing**

In [19]:
# Drop rows with missing values
df_a.dropna(inplace=True)

In [21]:
df_a.shape # Shape after dropping rows

(609, 21)

- After dropping rows with missing values, the dataset lost 95 records (704 to 609), but it still seems sufficient for the experiment.

In [24]:
# Change Ax_Score types from object to int
score_cols = ['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 
              'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score']

df_a[score_cols] = df_a[score_cols].apply(pd.to_numeric)

In [26]:
# Drop irrelevant feature 'age_desc'
df_a.age_desc.value_counts()
df_a = df_a.drop('age_desc', axis=1)

In [28]:
# Drop feature 'result' which leaks the output
df_a = df_a.drop('result', axis=1)

In [30]:
# Encode binary features to 0/1
label_encoder = LabelEncoder()
df_a['gender'] = label_encoder.fit_transform(df_a['gender'])
df_a['jundice'] = label_encoder.fit_transform(df_a['jundice'])
df_a['austim'] = label_encoder.fit_transform(df_a['austim'])
df_a['used_app_before'] = label_encoder.fit_transform(df_a['used_app_before'])
df_a['Class/ASD'] = label_encoder.fit_transform(df_a['Class/ASD'])

In [32]:
# One-hot Encoding for categorical features
df_a = pd.get_dummies(df_a, columns=['ethnicity', 'contry_of_res', 'relation'], drop_first=True)
df_a[df_a.select_dtypes('bool').columns] = df_a.select_dtypes('bool').astype(int)

In [34]:
# Statistical Characteristics
df_a.describe()

Unnamed: 0,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,...,contry_of_res_Ukraine,contry_of_res_United Arab Emirates,contry_of_res_United Kingdom,contry_of_res_United States,contry_of_res_Uruguay,contry_of_res_Viet Nam,relation_Others,relation_Parent,relation_Relative,relation_Self
count,609.0,609.0,609.0,609.0,609.0,609.0,609.0,609.0,609.0,609.0,...,609.0,609.0,609.0,609.0,609.0,609.0,609.0,609.0,609.0,609.0
mean,0.740558,0.469622,0.481117,0.520525,0.525452,0.307061,0.428571,0.665025,0.341544,0.597701,...,0.001642,0.110016,0.124795,0.183908,0.001642,0.00821,0.00821,0.082102,0.045977,0.857143
std,0.438689,0.499487,0.500054,0.499989,0.499762,0.461654,0.495278,0.47237,0.474617,0.490765,...,0.040522,0.313167,0.330758,0.387728,0.040522,0.090311,0.090311,0.274745,0.209607,0.350215
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [35]:
X_a = df_a.drop('Class/ASD', axis=1)
y_a = df_a['Class/ASD']

# Train Test Split 70/30, because dataset is relatively small with just 609 instances
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X_a, y_a, test_size=0.3, random_state=42)

In [38]:
# Scaling features using MinMaxScaler
scaler = MinMaxScaler().fit(X_train_a)
X_train_a = pd.DataFrame(scaler.transform(X_train_a), columns=X_train_a.columns)
X_test_a = pd.DataFrame(scaler.transform(X_test_a), columns=X_test_a.columns)

## 1.2 **Mushroom Dataset**

This dataset contains cleaned and preprocessed information about mushrooms, including features like cap diameter, shape, and stem color, for binary classification of edibility. The target class indicates whether the mushroom is edible or poisonous.

In [41]:
# Load dataset from the file
df_m = pd.read_csv('mushroom_cleaned.csv')

In [43]:
# Dataset info
df_m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54035 entries, 0 to 54034
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   cap-diameter     54035 non-null  int64  
 1   cap-shape        54035 non-null  int64  
 2   gill-attachment  54035 non-null  int64  
 3   gill-color       54035 non-null  int64  
 4   stem-height      54035 non-null  float64
 5   stem-width       54035 non-null  int64  
 6   stem-color       54035 non-null  int64  
 7   season           54035 non-null  float64
 8   class            54035 non-null  int64  
dtypes: float64(2), int64(7)
memory usage: 3.7 MB


In [45]:
# Display Dataframe
df_m.head(5)

Unnamed: 0,cap-diameter,cap-shape,gill-attachment,gill-color,stem-height,stem-width,stem-color,season,class
0,1372,2,2,10,3.807467,1545,11,1.804273,1
1,1461,2,2,10,3.807467,1557,11,1.804273,1
2,1371,2,2,10,3.612496,1566,11,1.804273,1
3,1261,6,2,10,3.787572,1566,11,1.804273,1
4,1305,6,2,10,3.711971,1464,11,0.943195,1


In [47]:
# Classes Distribution
df_m['class'].value_counts()

class
1    29675
0    24360
Name: count, dtype: int64

In [49]:
# Check missing values
df_m.isnull().sum()

cap-diameter       0
cap-shape          0
gill-attachment    0
gill-color         0
stem-height        0
stem-width         0
stem-color         0
season             0
class              0
dtype: int64

In [51]:
# Statistical Characteristics
df_m.describe()

Unnamed: 0,cap-diameter,cap-shape,gill-attachment,gill-color,stem-height,stem-width,stem-color,season,class
count,54035.0,54035.0,54035.0,54035.0,54035.0,54035.0,54035.0,54035.0,54035.0
mean,567.257204,4.000315,2.142056,7.329509,0.75911,1051.081299,8.418062,0.952163,0.549181
std,359.883763,2.160505,2.228821,3.200266,0.650969,782.056076,3.262078,0.305594,0.49758
min,0.0,0.0,0.0,0.0,0.000426,0.0,0.0,0.027372,0.0
25%,289.0,2.0,0.0,5.0,0.270997,421.0,6.0,0.88845,0.0
50%,525.0,5.0,1.0,8.0,0.593295,923.0,11.0,0.943195,1.0
75%,781.0,6.0,4.0,10.0,1.054858,1523.0,11.0,0.943195,1.0
max,1891.0,6.0,6.0,11.0,3.83532,3569.0,12.0,1.804273,1.0


In [53]:
X_m = df_m.drop('class', axis=1)
y_m = df_m['class']

# Train Test Split 80/20, because dataset is relatively big
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_m, y_m, test_size=0.2, random_state=42)

In [55]:
# Scaling features using MinMaxScaler
scaler = MinMaxScaler().fit(X_train_m)
X_train_m = pd.DataFrame(scaler.transform(X_train_m), columns=X_train_m.columns)
X_test_m = pd.DataFrame(scaler.transform(X_test_m), columns=X_test_m.columns)

## 1.3 **Occupancy Detection Dataset**

This dataset supports occupancy detection for an office room based on light, temperature, humidity, and CO2 measurements. It contains 20,560 instances and 6 features, used for binary classification to predict room occupancy.


Info: The dataset is already split into training and test files.

### 1.3.1 Training Dataset

In [59]:
# Read the training dataset
df_o = pd.read_csv('datatraining.txt', header=0, quotechar='"')

In [61]:
# Display dataset
df_o.head(5)

Unnamed: 0,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
1,2015-02-04 17:51:00,23.18,27.272,426.0,721.25,0.004793,1
2,2015-02-04 17:51:59,23.15,27.2675,429.5,714.0,0.004783,1
3,2015-02-04 17:53:00,23.15,27.245,426.0,713.5,0.004779,1
4,2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1
5,2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1


In [63]:
# Dataset Shape
print(f'Training dataset shape: {df_o.shape}')

Training dataset shape: (8143, 7)


In [65]:
# Dataset info
df_o.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8143 entries, 1 to 8143
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           8143 non-null   object 
 1   Temperature    8143 non-null   float64
 2   Humidity       8143 non-null   float64
 3   Light          8143 non-null   float64
 4   CO2            8143 non-null   float64
 5   HumidityRatio  8143 non-null   float64
 6   Occupancy      8143 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 508.9+ KB


In [67]:
# Classes distribution
df_o['Occupancy'].value_counts()

Occupancy
0    6414
1    1729
Name: count, dtype: int64

In [69]:
# Check missing values
df_o.isnull().sum()

date             0
Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupancy        0
dtype: int64

In [71]:
# Extract relevant features from the datetime
def features_datetime(df):
    df['date'] = pd.to_datetime(df['date']) # Convert into datetime format
    df['hour'] = df['date'].dt.hour # Extract hour
    df['day_of_week'] = df['date'].dt.dayofweek # Extract day of the week where: Monday=0 ...  Sunday=6
    df.drop('date', axis=1, inplace=True)

features_datetime(df_o)

In [73]:
# Display dataset after features extraction
df_o.head(5)

Unnamed: 0,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy,hour,day_of_week
1,23.18,27.272,426.0,721.25,0.004793,1,17,2
2,23.15,27.2675,429.5,714.0,0.004783,1,17,2
3,23.15,27.245,426.0,713.5,0.004779,1,17,2
4,23.15,27.2,426.0,708.25,0.004772,1,17,2
5,23.1,27.2,426.0,704.5,0.004757,1,17,2


In [75]:
# Statistical Characteristics
print('Statistical Characteristics:\n')
df_o.describe()

Statistical Characteristics:



Unnamed: 0,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy,hour,day_of_week
count,8143.0,8143.0,8143.0,8143.0,8143.0,8143.0,8143.0,8143.0
mean,20.619084,25.731507,119.519375,606.546243,0.003863,0.21233,11.390642,3.344222
std,1.016916,5.531211,194.755805,314.320877,0.000852,0.408982,7.092195,2.067996
min,19.0,16.745,0.0,412.75,0.002674,0.0,0.0,0.0
25%,19.7,20.2,0.0,439.0,0.003078,0.0,5.0,2.0
50%,20.39,26.2225,0.0,453.5,0.003801,0.0,11.0,4.0
75%,21.39,30.533333,256.375,638.833333,0.004352,0.0,18.0,5.0
max,23.18,39.1175,1546.333333,2028.5,0.006476,1.0,23.0,6.0


In [77]:
# Split into Input and Labels
X_train_o = df_o.drop('Occupancy', axis=1)
y_train_o = df_o['Occupancy']

In [79]:
# Scaling features using MinMaxScaler
scaler = MinMaxScaler().fit(X_train_o)
X_train_o = pd.DataFrame(scaler.transform(X_train_o), columns=X_train_o.columns)

### 1.3.2 Test Dataset

In [82]:
# Read the test dataset
df_o_test = pd.read_csv('datatest.txt', header=0, quotechar='"')

In [84]:
# Dataset Shape
print(f'Test dataset shape: {df_o_test.shape}')

Test dataset shape: (2665, 7)


In [86]:
# Dataset info
df_o_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2665 entries, 140 to 2804
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           2665 non-null   object 
 1   Temperature    2665 non-null   float64
 2   Humidity       2665 non-null   float64
 3   Light          2665 non-null   float64
 4   CO2            2665 non-null   float64
 5   HumidityRatio  2665 non-null   float64
 6   Occupancy      2665 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 166.6+ KB


In [88]:
# Check missing values
df_o_test.isnull().sum()

date             0
Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupancy        0
dtype: int64

In [90]:
# Features extraction
features_datetime(df_o_test)

In [92]:
# Split into Input and Labels
X_test_o = df_o_test.drop('Occupancy', axis=1)
y_test_o = df_o_test['Occupancy']

In [94]:
# Scaling features using MinMaxScaler
X_test_o = pd.DataFrame(scaler.transform(X_test_o), columns=X_test_o.columns)

# 2. **Simulating Missing Values**

Three mechanisms can lead to the introduction of missing values: MCAR, MAR, and MNAR. A short description of each follows:

- **MCAR** (Missing Completely At Random): the missingness is unrelated both to the other measured variables and to the missing values themselves
- **MAR** (Missing At Random): the missingness depends on other measured (observed) variables, but not on the missing values themselves
- **MNAR** (Missing Not At Random): the missingness depends on the missing values themselves, so it cannot be explained by the observed variables alone
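
To make the three definitions concrete, here is a minimal NumPy sketch on synthetic data. It is an illustration only and not how mdatagen works internally; the names `x_obs`, `x_miss`, `rate`, and `apply_mask`, as well as the "mask the largest values" rule, are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x_obs = rng.normal(size=n)    # fully observed feature (hypothetical)
x_miss = rng.normal(size=n)   # feature that will receive missing values (hypothetical)
rate = 0.3                    # fraction of x_miss to blank out
k = int(rate * n)

# MCAR: positions are chosen uniformly at random
mcar_mask = np.zeros(n, dtype=bool)
mcar_mask[rng.choice(n, size=k, replace=False)] = True

# MAR: positions depend on another, observed feature (largest x_obs)
mar_mask = np.zeros(n, dtype=bool)
mar_mask[np.argsort(x_obs)[-k:]] = True

# MNAR: positions depend on the values that go missing (largest x_miss)
mnar_mask = np.zeros(n, dtype=bool)
mnar_mask[np.argsort(x_miss)[-k:]] = True

def apply_mask(values, mask):
    """Return a copy of values with NaN wherever mask is True."""
    out = values.copy()
    out[mask] = np.nan
    return out

x_mcar = apply_mask(x_miss, mcar_mask)
x_mar = apply_mask(x_miss, mar_mask)
x_mnar = apply_mask(x_miss, mnar_mask)

# Compare the mean of the values that remain observed
print(np.nanmean(x_miss), np.nanmean(x_mcar), np.nanmean(x_mar), np.nanmean(x_mnar))
```

In this sketch the MNAR variant removes exactly the largest values, so the mean of what remains is pulled downward, while the MCAR variant removes a random subset and tends to leave the mean roughly unchanged.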

## 2.1 **Mechanisms Functions**

In this step I applied the mechanisms described above, generating three variants of each dataset. These variants form the basis for the analysis in this assignment. For this purpose I used the **mdatagen** Python library, a tool for generating missing data artificially and in a controlled way consistent with MCAR, MAR, and MNAR.

I focused my experiments exclusively on **univariate** missing value generators, where only one feature contains missing values. The selected feature is the one with the highest correlation to the target. The missing rate is set when the experiment is executed.
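
As a rough illustration of "the feature with the highest correlation to the target", the following pandas sketch ranks columns by absolute Pearson correlation. Whether mdatagen uses exactly this criterion is an assumption; the helper `most_correlated_feature` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def most_correlated_feature(X: pd.DataFrame, y) -> str:
    """Return the column of X with the largest absolute Pearson correlation with y."""
    target = pd.Series(np.asarray(y), index=X.index)
    corr = X.corrwith(target).abs()   # one correlation value per column
    return corr.idxmax()

# Hypothetical demo data: 'signal' follows the target closely, 'noise' does not
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = pd.DataFrame({
    "noise": rng.normal(size=200),
    "signal": y_demo + 0.1 * rng.normal(size=200),
})

selected = most_correlated_feature(X_demo, y_demo)
print(selected)
```

A check like this can help confirm which column a univariate generator is expected to pick before the missing values are introduced.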

In [91]:
from mdatagen.univariate.uMAR import uMAR
from mdatagen.univariate.uMNAR import uMNAR
from mdatagen.univariate.uMCAR import uMCAR

### 2.1.1 MCAR Function
- method="correlated" chooses the feature most correlated with the target
- randomly sets missing values in the feature

In [94]:
# Univariate MCAR Mechanism
def MCAR(X, y, missing_rate):
    X = X.reset_index(drop=True)
    y = np.array(y)
    # Set the generator
    generator = uMCAR(X=X,
                      y=y,
                      missing_rate=missing_rate,
                      method="correlated")
    # Generate the missing data under MCAR mechanism
    generate_data = generator.random()
    # Return new dataset with missing values
    return generate_data

### 2.1.2 MAR Function
- the two features most correlated with the target become x_miss (receives missing values) and x_obs (observed feature)
- mix() generates missing values in the feature using the N/2 lowest values and N/2 highest values from an observed feature
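
The "N/2 lowest and N/2 highest" idea described above can be sketched in plain NumPy as follows. This is not mdatagen's implementation; the function `mar_mix_mask` and its handling of odd counts are assumptions made for illustration.

```python
import numpy as np

def mar_mix_mask(x_obs, missing_rate):
    """Boolean mask of rows where x_miss would be blanked, driven only by x_obs.

    missing_rate is a percentage (e.g. 40 means 40% of the rows).
    """
    x_obs = np.asarray(x_obs)
    n_rows = len(x_obs)
    n_missing = int(round(n_rows * missing_rate / 100))
    n_low = n_missing // 2            # half taken from the lowest x_obs values
    n_high = n_missing - n_low        # the rest from the highest x_obs values

    order = np.argsort(x_obs)         # row indices sorted by x_obs
    mask = np.zeros(n_rows, dtype=bool)
    mask[order[:n_low]] = True
    mask[order[n_rows - n_high:]] = True
    return mask

# Hypothetical observed feature with values 0..9
demo_mask = mar_mix_mask(np.arange(10.0), missing_rate=40)
print(demo_mask)
```

Because the mask depends only on the observed feature, the missingness in x_miss is explained by observed data, which is what characterises MAR.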

In [97]:
# Univariate MAR Mechanism
def MAR(X, y, missing_rate):
    X = X.reset_index(drop=True)
    y = np.array(y)
    # Set the generator
    generator = uMAR(X=X,
                     y=y,
                     missing_rate=missing_rate)
    # Generate the missing data under MAR mechanism
    generate_data = generator.mix()
    # Return new dataset with missing values
    return generate_data

### 2.1.3 MNAR Function

- the feature most correlated with the target receives missing values
- threshold=0.8 means that missing values will be introduced into the observations with the highest 80% of values in the chosen feature

In [100]:
# Univariate MNAR Mechanism 
def MNAR(X, y, missing_rate):
    X = X.reset_index(drop=True)
    y = np.array(y)
    # Set the generator
    generator = uMNAR(X=X,
                      y=y,
                      missing_rate=missing_rate,
                      threshold = 0.8)
    # Generate the missing data under MNAR mechanism
    generate_data = generator.run()
    # Return new dataset with missing values
    return generate_data

## 2.2 **Generating Datasets with Missing Values**

Next, the MCAR, MAR, and MNAR functions are used to generate datasets with missing values.

In [102]:
missing_rate=50

### 2.2.1 Autism Screening Adult Datasets

In [140]:
# Generating datasets with missing values but different mechanisms
X_MCAR_a = MCAR(X_train_a, y_train_a, missing_rate) # MCAR
X_MCAR_a = X_MCAR_a.drop('target', axis=1)
X_MAR_a = MAR(X_train_a, y_train_a, missing_rate) # MAR
X_MAR_a = X_MAR_a.drop('target', axis=1)
X_MNAR_a = MNAR(X_train_a, y_train_a, missing_rate) # MNAR
X_MNAR_a = X_MNAR_a.drop('target', axis=1)

missing_percentage = X_MCAR_a.isnull().mean() * 100
print("Missing values percentage:")
print(missing_percentage[missing_percentage > 0])

Missing values percentage:
A9_Score    50.0
dtype: float64


### 2.2.2 Mushroom Datasets

In [104]:
# Generating datasets with missing values but different mechanisms
X_MCAR_m = MCAR(X_train_m, y_train_m, missing_rate) # MCAR
X_MCAR_m = X_MCAR_m.drop('target', axis=1)
X_MAR_m = MAR(X_train_m, y_train_m, missing_rate) # MAR
X_MAR_m = X_MAR_m.drop('target', axis=1)
X_MNAR_m = MNAR(X_train_m, y_train_m, missing_rate) # MNAR
X_MNAR_m = X_MNAR_m.drop('target', axis=1)

missing_percentage = X_MCAR_m.isnull().mean() * 100
print("Missing values percentage:")
print(missing_percentage[missing_percentage > 0])

Missing values percentage:
stem-height    50.0
dtype: float64


### 2.2.3 Occupancy Detection Datasets

In [190]:
# Generating datasets with missing values but different mechanisms
X_MCAR_o = MCAR(X_train_o, y_train_o, missing_rate) # MCAR
X_MCAR_o = X_MCAR_o.drop('target', axis=1)
X_MAR_o = MAR(X_train_o, y_train_o, missing_rate) # MAR
X_MAR_o = X_MAR_o.drop('target', axis=1)
X_MNAR_o = MNAR(X_train_o, y_train_o, missing_rate) # MNAR
X_MNAR_o = X_MNAR_o.drop('target', axis=1)

missing_percentage = X_MCAR_o.isnull().mean() * 100
print("Missing values percentage:")
print(missing_percentage[missing_percentage > 0])

Missing values percentage:
Light    50.00614
dtype: float64


# 3. **Classifying with Missing Values using Random Forest**

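To evaluate the impact of missing data on classification, the **Random Forest** classifier was chosen for its ability to handle missing values during splits (scikit-learn's `RandomForestClassifier` accepts NaN inputs natively from version 1.4). Performance was measured using precision, recall, F1-score, and accuracy, both for the complete datasets and for the datasets generated under each missing mechanism. Missing values are present only in the training data; the test sets remain complete.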

In [96]:
# Random Forest Classifier
def run_random_forest(X_train, X_test, y_train, y_test):

    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_classifier.fit(X_train, y_train)
    y_pred = rf_classifier.predict(X_test)

    report = classification_report(y_test, y_pred)
    return y_pred, report

### 3.1 Autism Screening Adult Results

In [100]:
# Dataset before generating missing values 
y_pred, report = run_random_forest(X_train_a, X_test_a, y_train_a, y_test_a)
print('Results:\n')
print(report)

Results:

              precision    recall  f1-score   support

           0       0.92      0.97      0.95       126
           1       0.92      0.82      0.87        57

    accuracy                           0.92       183
   macro avg       0.92      0.90      0.91       183
weighted avg       0.92      0.92      0.92       183



In [1736]:
# MCAR
y_pred, report = run_random_forest(X_MCAR_a, X_test_a, y_train_a, y_test_a)
print('MCAR Results:\n')
print(report)
# MAR
y_pred, report = run_random_forest(X_MAR_a, X_test_a, y_train_a, y_test_a)
print('MAR Results:\n')
print(report)
# MNAR
y_pred, report = run_random_forest(X_MNAR_a, X_test_a, y_train_a, y_test_a)
print('MNAR Results:\n')
print(report)

MCAR Results:

              precision    recall  f1-score   support

           0       0.96      0.97      0.96       146
           1       0.92      0.91      0.91        65

    accuracy                           0.95       211
   macro avg       0.94      0.94      0.94       211
weighted avg       0.95      0.95      0.95       211

MAR Results:

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       146
           1       0.97      0.88      0.92        65

    accuracy                           0.95       211
   macro avg       0.96      0.93      0.94       211
weighted avg       0.95      0.95      0.95       211

MNAR Results:

              precision    recall  f1-score   support

           0       0.96      0.97      0.96       146
           1       0.92      0.91      0.91        65

    accuracy                           0.95       211
   macro avg       0.94      0.94      0.94       211
weighted avg       0.95      0

### 3.2 Mushroom Results

In [108]:
# Dataset before generating missing values 
y_pred, report = run_random_forest(X_train_m, X_test_m, y_train_m, y_test_m)
print('Results:\n')
print(report)

Results:

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4909
           1       0.99      0.99      0.99      5898

    accuracy                           0.99     10807
   macro avg       0.99      0.99      0.99     10807
weighted avg       0.99      0.99      0.99     10807



In [110]:
# MCAR
y_pred, report = run_random_forest(X_MCAR_m, X_test_m, y_train_m, y_test_m)
print('MCAR Results:\n')
print(report)
# MAR
y_pred, report = run_random_forest(X_MAR_m, X_test_m, y_train_m, y_test_m)
print('MAR Results:\n')
print(report)
# MNAR
y_pred, report = run_random_forest(X_MNAR_m, X_test_m, y_train_m, y_test_m)
print('MNAR Results:\n')
print(report)

MCAR Results:

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4909
           1       0.99      0.99      0.99      5898

    accuracy                           0.99     10807
   macro avg       0.99      0.99      0.99     10807
weighted avg       0.99      0.99      0.99     10807

MAR Results:

              precision    recall  f1-score   support

           0       0.98      0.98      0.98      4909
           1       0.99      0.98      0.99      5898

    accuracy                           0.98     10807
   macro avg       0.98      0.98      0.98     10807
weighted avg       0.98      0.98      0.98     10807

MNAR Results:

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      4909
           1       0.99      0.97      0.98      5898

    accuracy                           0.98     10807
   macro avg       0.98      0.98      0.98     10807
weighted avg       0.98      0

### 3.3 Occupancy Detection Results

In [110]:
# Dataset before generating missing values 
y_pred, report = run_random_forest(X_train_o, X_test_o, y_train_o, y_test_o)
print('Results:\n')
print(report)

Results:

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1693
           1       0.95      0.94      0.94       972

    accuracy                           0.96      2665
   macro avg       0.96      0.96      0.96      2665
weighted avg       0.96      0.96      0.96      2665



In [1682]:
# MCAR
y_pred, report = run_random_forest(X_MCAR_o, X_test_o, y_train_o, y_test_o)
print('MCAR Results:\n')
print(report)
# MAR
y_pred, report = run_random_forest(X_MAR_o, X_test_o, y_train_o, y_test_o)
print('MAR Results:\n')
print(report)
# MNAR
y_pred, report = run_random_forest(X_MNAR_o, X_test_o, y_train_o, y_test_o)
print('MNAR Results:\n')
print(report)

MCAR Results:

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1693
           1       0.95      0.94      0.95       972

    accuracy                           0.96      2665
   macro avg       0.96      0.96      0.96      2665
weighted avg       0.96      0.96      0.96      2665

MAR Results:

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1693
           1       0.95      0.94      0.94       972

    accuracy                           0.96      2665
   macro avg       0.96      0.96      0.96      2665
weighted avg       0.96      0.96      0.96      2665

MNAR Results:

              precision    recall  f1-score   support

           0       0.95      0.96      0.95      1693
           1       0.93      0.92      0.92       972

    accuracy                           0.94      2665
   macro avg       0.94      0.94      0.94      2665
weighted avg       0.94      0

# 4. **Handling Missing Values with Data Imputation**

In this part I present various approaches to handling and recovering missing values by applying imputation strategies to the datasets. The **SVM** classifier with an RBF kernel was evaluated with the following imputation strategies: **Statistical Imputation, KNN Imputation, and Multiple Imputation (MICE)**, all using sklearn tools. In addition, **Predictive Accuracy** was evaluated by computing the **Mean Absolute Error** (MAE) to assess how accurately each imputer replaces the missing data compared to the actual values.

In [375]:
# SVM Classifier
def run_svm(X_miss, X_train, X_test, y_train, y_test, imputer):
    X_miss = pd.DataFrame(imputer.fit_transform(X_miss), columns=X_miss.columns)
    
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')
    clf.fit(X_miss, y_train)
    y_pred = clf.predict(X_test)

    report = classification_report(y_test, y_pred, zero_division=1)   

    mae = mean_absolute_error(X_train, X_miss)
    
    return y_pred, report, mae
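
The helper above computes MAE over the whole feature matrix, so cells that were never missing contribute zero error and dilute the score. As a hedged alternative, the sketch below restricts MAE to the cells that were actually imputed. The function `masked_mae` and the toy arrays are hypothetical and not part of the original experiment.

```python
import numpy as np

def masked_mae(X_true, X_missing, X_imputed):
    """MAE computed only over the cells that were NaN in X_missing."""
    X_true = np.asarray(X_true, dtype=float)
    X_imputed = np.asarray(X_imputed, dtype=float)
    mask = np.isnan(np.asarray(X_missing, dtype=float))
    if not mask.any():
        return 0.0  # nothing was imputed
    return float(np.mean(np.abs(X_true[mask] - X_imputed[mask])))

# Toy data: two values of the first column are removed, then mean-imputed
X_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_missing = X_true.copy()
X_missing[[1, 3], 0] = np.nan

X_imputed = X_missing.copy()
col_mean = np.nanmean(X_missing[:, 0])             # mean of the observed values
X_imputed[np.isnan(X_imputed[:, 0]), 0] = col_mean

full_mae = float(np.mean(np.abs(X_true - X_imputed)))  # averaged over all cells
cell_mae = masked_mae(X_true, X_missing, X_imputed)     # averaged over imputed cells
print(full_mae, cell_mae)
```

The two numbers differ by the share of imputed cells in the matrix, which is worth keeping in mind when comparing MAE values across datasets with different numbers of features.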

## 4.1 **Statistical Imputation**
Statistical imputation replaces missing values with statistical measures, in this case the mean and median of the respective feature, providing a simple approach to handle missing data. 
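
A small NumPy sketch of the difference between the two statistics, assuming a single toy column with one outlier. It mirrors, per column, what the "mean" and "median" strategies fill in, but the data and variable names are made up for illustration.

```python
import numpy as np

col = np.array([1.0, 2.0, 3.0, np.nan, 100.0])  # one missing value, one outlier

mean_fill = np.nanmean(col)      # mean of the observed values
median_fill = np.nanmedian(col)  # median of the observed values

col_mean_imputed = np.where(np.isnan(col), mean_fill, col)
col_median_imputed = np.where(np.isnan(col), median_fill, col)

print(mean_fill, median_fill)
```

With skewed data the mean is pulled towards the outlier, while the median stays near the bulk of the values, which is one reason the two strategies can lead to different imputation errors.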

### 4.1.1 Autism Screening Adult Results

**- MEAN IMPUTATION**

In [371]:
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
#MCAR
y_pred, report, mae = run_svm(X_MCAR_a, X_train_a, X_test_a, y_train_a, y_test_a, mean_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_a, X_train_a, X_test_a, y_train_a, y_test_a, mean_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_a, X_train_a, X_test_a, y_train_a, y_test_a, mean_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.97      0.98      0.97       126
           1       0.95      0.93      0.94        57

    accuracy                           0.96       183
   macro avg       0.96      0.95      0.96       183
weighted avg       0.96      0.96      0.96       183

Mean Absolute Error (MAE) of MCAR Imputation: 0.0025

MAR Results:

              precision    recall  f1-score   support

           0       0.97      0.98      0.97       126
           1       0.95      0.93      0.94        57

    accuracy                           0.96       183
   macro avg       0.96      0.95      0.96       183
weighted avg       0.96      0.96      0.96       183

Mean Absolute Error (MAE) of MAR Imputation: 0.0026

MNAR Results:

              precision    recall  f1-score   support

           0       0.93      0.98      0.96       126
           1       0.96      0.84      0.90        57

    accuracy                   

**- MEDIAN IMPUTATION**

In [381]:
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
#MCAR
y_pred, report, mae = run_svm(X_MCAR_a, X_train_a, X_test_a, y_train_a, y_test_a, median_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_a, X_train_a, X_test_a, y_train_a, y_test_a, median_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_a, X_train_a, X_test_a, y_train_a, y_test_a, median_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.99      0.97      0.98       126
           1       0.93      0.98      0.96        57

    accuracy                           0.97       183
   macro avg       0.96      0.98      0.97       183
weighted avg       0.97      0.97      0.97       183

Mean Absolute Error (MAE) of MCAR Imputation: 0.0020

MAR Results:

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       126
           1       0.96      0.95      0.96        57

    accuracy                           0.97       183
   macro avg       0.97      0.97      0.97       183
weighted avg       0.97      0.97      0.97       183

Mean Absolute Error (MAE) of MAR Imputation: 0.0025

MNAR Results:

              precision    recall  f1-score   support

           0       0.93      0.98      0.96       126
           1       0.96      0.84      0.90        57

    accuracy                   

### 4.1.2 Mushroom Results

**- MEAN IMPUTATION**

In [388]:
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
#MCAR
y_pred, report, mae = run_svm(X_MCAR_m, X_train_m, X_test_m, y_train_m, y_test_m, mean_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_m, X_train_m, X_test_m, y_train_m, y_test_m, mean_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_m, X_train_m, X_test_m, y_train_m, y_test_m, mean_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.85      0.85      0.85      4909
           1       0.87      0.87      0.87      5898

    accuracy                           0.86     10807
   macro avg       0.86      0.86      0.86     10807
weighted avg       0.86      0.86      0.86     10807

Mean Absolute Error (MAE) of MCAR Imputation: 0.0081

MAR Results:

              precision    recall  f1-score   support

           0       0.85      0.81      0.83      4909
           1       0.84      0.88      0.86      5898

    accuracy                           0.85     10807
   macro avg       0.85      0.84      0.84     10807
weighted avg       0.85      0.85      0.85     10807

Mean Absolute Error (MAE) of MAR Imputation: 0.0079

MNAR Results:

              precision    recall  f1-score   support

           0       0.85      0.72      0.78      4909
           1       0.79      0.89      0.84      5898

    accuracy                   

**- MEDIAN IMPUTATION**

In [394]:
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
#MCAR
y_pred, report, mae = run_svm(X_MCAR_m, X_train_m, X_test_m, y_train_m, y_test_m, median_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_m, X_train_m, X_test_m, y_train_m, y_test_m, median_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_m, X_train_m, X_test_m, y_train_m, y_test_m, median_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.85      0.83      0.84      4909
           1       0.86      0.88      0.87      5898

    accuracy                           0.86     10807
   macro avg       0.86      0.86      0.86     10807
weighted avg       0.86      0.86      0.86     10807

Mean Absolute Error (MAE) of MCAR Imputation: 0.0078

MAR Results:

              precision    recall  f1-score   support

           0       0.85      0.76      0.80      4909
           1       0.81      0.89      0.85      5898

    accuracy                           0.83     10807
   macro avg       0.83      0.82      0.83     10807
weighted avg       0.83      0.83      0.83     10807

Mean Absolute Error (MAE) of MAR Imputation: 0.0087

MNAR Results:

              precision    recall  f1-score   support

           0       0.85      0.72      0.78      4909
           1       0.79      0.89      0.84      5898

    accuracy                   

### 4.1.3 Occupancy Detection Results

**- MEAN IMPUTATION**

In [395]:
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
#MCAR
y_pred, report, mae = run_svm(X_MCAR_o, X_train_o, X_test_o, y_train_o, y_test_o, mean_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_o, X_train_o, X_test_o, y_train_o, y_test_o, mean_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_o, X_train_o, X_test_o, y_train_o, y_test_o, mean_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.96      0.95      0.96      1693
           1       0.92      0.94      0.93       972

    accuracy                           0.95      2665
   macro avg       0.94      0.94      0.94      2665
weighted avg       0.95      0.95      0.95      2665

Mean Absolute Error (MAE) of MCAR Imputation: 0.0077

MAR Results:

              precision    recall  f1-score   support

           0       0.97      0.95      0.96      1693
           1       0.91      0.95      0.93       972

    accuracy                           0.95      2665
   macro avg       0.94      0.95      0.95      2665
weighted avg       0.95      0.95      0.95      2665

Mean Absolute Error (MAE) of MAR Imputation: 0.0090

MNAR Results:

              precision    recall  f1-score   support

           0       0.81      0.97      0.88      1693
           1       0.92      0.60      0.72       972

    accuracy                   

**- MEDIAN IMPUTATION**

In [397]:
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
#MCAR
y_pred, report, mae = run_svm(X_MCAR_o, X_train_o, X_test_o, y_train_o, y_test_o, median_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_o, X_train_o, X_test_o, y_train_o, y_test_o, median_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_o, X_train_o, X_test_o, y_train_o, y_test_o, median_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.97      0.95      0.96      1693
           1       0.91      0.95      0.93       972

    accuracy                           0.95      2665
   macro avg       0.94      0.95      0.95      2665
weighted avg       0.95      0.95      0.95      2665

Mean Absolute Error (MAE) of MCAR Imputation: 0.0056

MAR Results:

              precision    recall  f1-score   support

           0       0.98      0.94      0.96      1693
           1       0.91      0.96      0.93       972

    accuracy                           0.95      2665
   macro avg       0.94      0.95      0.95      2665
weighted avg       0.95      0.95      0.95      2665

Mean Absolute Error (MAE) of MAR Imputation: 0.0091

MNAR Results:

              precision    recall  f1-score   support

           0       0.81      0.97      0.88      1693
           1       0.92      0.60      0.72       972

    accuracy                   

## 4.2 **K-Nearest Neighbors Imputation**
A different approach, KNN imputation, fills each missing value using the values of the nearest neighbors in the feature space. Here the imputer uses n_neighbors=5.
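To make the neighbor-based filling concrete, the following is a minimal sketch on a toy matrix, not part of the experiments. The array `X_toy` and the name `toy_imputer` are hypothetical, and `n_neighbors=2` is used only because the toy data has four rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: row 2 is missing its second feature.
X_toy = np.array([
    [1.0, 2.0],
    [1.1, 2.2],
    [0.9, np.nan],
    [8.0, 9.0],
])

# Distances are computed on the observed features only, so rows 0 and 1
# are the two nearest neighbors of row 2.
toy_imputer = KNNImputer(n_neighbors=2)
X_filled = toy_imputer.fit_transform(X_toy)

# The missing entry becomes the (uniformly weighted) mean of the
# neighbors' values for that feature: (2.0 + 2.2) / 2.
print(X_filled[2])
```

The distant row `[8.0, 9.0]` does not contribute to the filled value, which is the main difference from mean or median imputation, where every row influences the fill.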

### 4.2.1 Autism Screening Adult Results

In [400]:
knn_imputer = KNNImputer(n_neighbors=5)
#MCAR
y_pred, report, mae = run_svm(X_MCAR_a, X_train_a, X_test_a, y_train_a, y_test_a, knn_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_a, X_train_a, X_test_a, y_train_a, y_test_a, knn_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_a, X_train_a, X_test_a, y_train_a, y_test_a, knn_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.95      0.98      0.97       126
           1       0.96      0.89      0.93        57

    accuracy                           0.96       183
   macro avg       0.96      0.94      0.95       183
weighted avg       0.96      0.96      0.96       183

Mean Absolute Error (MAE) of MCAR Imputation: 0.0017

MAR Results:

              precision    recall  f1-score   support

           0       0.98      0.97      0.97       126
           1       0.93      0.95      0.94        57

    accuracy                           0.96       183
   macro avg       0.95      0.96      0.96       183
weighted avg       0.96      0.96      0.96       183

Mean Absolute Error (MAE) of MAR Imputation: 0.0020

MNAR Results:

              precision    recall  f1-score   support

           0       0.93      0.98      0.96       126
           1       0.96      0.84      0.90        57

    accuracy                   

### 4.2.2 Mushroom Results

In [402]:
knn_imputer = KNNImputer(n_neighbors=5)
#MCAR
y_pred, report, mae = run_svm(X_MCAR_m, X_train_m, X_test_m, y_train_m, y_test_m, knn_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_m, X_train_m, X_test_m, y_train_m, y_test_m, knn_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_m, X_train_m, X_test_m, y_train_m, y_test_m, knn_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.85      0.86      0.86      4909
           1       0.89      0.88      0.88      5898

    accuracy                           0.87     10807
   macro avg       0.87      0.87      0.87     10807
weighted avg       0.87      0.87      0.87     10807

Mean Absolute Error (MAE) of MCAR Imputation: 0.0031

MAR Results:

              precision    recall  f1-score   support

           0       0.84      0.83      0.83      4909
           1       0.86      0.87      0.86      5898

    accuracy                           0.85     10807
   macro avg       0.85      0.85      0.85     10807
weighted avg       0.85      0.85      0.85     10807

Mean Absolute Error (MAE) of MAR Imputation: 0.0069

MNAR Results:

              precision    recall  f1-score   support

           0       0.84      0.71      0.77      4909
           1       0.79      0.89      0.83      5898

    accuracy                   

### 4.2.3 Occupancy Detection Results

In [404]:
knn_imputer = KNNImputer(n_neighbors=5)
#MCAR
y_pred, report, mae = run_svm(X_MCAR_o, X_train_o, X_test_o, y_train_o, y_test_o, knn_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_o, X_train_o, X_test_o, y_train_o, y_test_o, knn_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_o, X_train_o, X_test_o, y_train_o, y_test_o, knn_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1693
           1       0.95      1.00      0.97       972

    accuracy                           0.98      2665
   macro avg       0.97      0.98      0.98      2665
weighted avg       0.98      0.98      0.98      2665

Mean Absolute Error (MAE) of MCAR Imputation: 0.0004

MAR Results:

              precision    recall  f1-score   support

           0       0.94      0.97      0.96      1693
           1       0.95      0.89      0.92       972

    accuracy                           0.94      2665
   macro avg       0.95      0.93      0.94      2665
weighted avg       0.94      0.94      0.94      2665

Mean Absolute Error (MAE) of MAR Imputation: 0.0036

MNAR Results:

              precision    recall  f1-score   support

           0       0.81      0.97      0.88      1693
           1       0.92      0.60      0.72       972

    accuracy                   

## 4.3 **Multiple Imputation (MICE)**
Multiple Imputation by Chained Equations (MICE) models each feature with missing data conditionally on the other features. The imputation rounds are repeated up to max_iter = 100 times, stopping earlier if the imputed values converge, which improves imputation stability.
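As an illustration of the conditional modeling, here is a small sketch on synthetic data, separate from the experiments. The data, the 20% missing rate, and the names `X_full`, `X_miss`, and `toy_mice` are assumptions made for this example. The second feature is almost a linear function of the first, so the imputer can recover it from the observed column.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical synthetic data: feature 1 is roughly 2 * feature 0.
rng = np.random.default_rng(0)
x0 = rng.uniform(0, 1, size=200)
X_full = np.column_stack([x0, 2 * x0 + rng.normal(0, 0.01, size=200)])

# Remove about 20% of feature 1 completely at random.
X_miss = X_full.copy()
mask = rng.random(200) < 0.2
X_miss[mask, 1] = np.nan

# max_iter is an upper bound on the number of imputation rounds.
toy_mice = IterativeImputer(max_iter=100, random_state=0)
X_imp = toy_mice.fit_transform(X_miss)

# Compare imputed values with the known ground truth.
toy_mae = np.abs(X_imp[mask, 1] - X_full[mask, 1]).mean()
print(f"rounds run: {toy_mice.n_iter_}, MAE on imputed entries: {toy_mae:.4f}")
```

The `n_iter_` attribute reports how many rounds were actually performed, which can be far fewer than `max_iter` when the imputations converge early.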

### 4.3.1 Autism Screening Adult Results

In [407]:
mice_imputer = IterativeImputer(max_iter=100)
#MCAR
y_pred, report, mae = run_svm(X_MCAR_a, X_train_a, X_test_a, y_train_a, y_test_a, mice_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_a, X_train_a, X_test_a, y_train_a, y_test_a, mice_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_a, X_train_a, X_test_a, y_train_a, y_test_a, mice_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.96      0.98      0.97       126
           1       0.96      0.91      0.94        57

    accuracy                           0.96       183
   macro avg       0.96      0.95      0.95       183
weighted avg       0.96      0.96      0.96       183

Mean Absolute Error (MAE) of MCAR Imputation: 0.0018

MAR Results:

              precision    recall  f1-score   support

           0       0.97      0.97      0.97       126
           1       0.93      0.93      0.93        57

    accuracy                           0.96       183
   macro avg       0.95      0.95      0.95       183
weighted avg       0.96      0.96      0.96       183

Mean Absolute Error (MAE) of MAR Imputation: 0.0020

MNAR Results:

              precision    recall  f1-score   support

           0       0.93      0.98      0.96       126
           1       0.96      0.84      0.90        57

    accuracy                   

### 4.3.2 Mushroom Results

In [409]:
mice_imputer = IterativeImputer(max_iter=100)
#MCAR
y_pred, report, mae = run_svm(X_MCAR_m, X_train_m, X_test_m, y_train_m, y_test_m, mice_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_m, X_train_m, X_test_m, y_train_m, y_test_m, mice_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_m, X_train_m, X_test_m, y_train_m, y_test_m, mice_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.85      0.85      0.85      4909
           1       0.87      0.87      0.87      5898

    accuracy                           0.86     10807
   macro avg       0.86      0.86      0.86     10807
weighted avg       0.86      0.86      0.86     10807

Mean Absolute Error (MAE) of MCAR Imputation: 0.0081

MAR Results:

              precision    recall  f1-score   support

           0       0.84      0.80      0.82      4909
           1       0.84      0.87      0.86      5898

    accuracy                           0.84     10807
   macro avg       0.84      0.84      0.84     10807
weighted avg       0.84      0.84      0.84     10807

Mean Absolute Error (MAE) of MAR Imputation: 0.0085

MNAR Results:

              precision    recall  f1-score   support

           0       0.85      0.72      0.78      4909
           1       0.79      0.89      0.84      5898

    accuracy                   

### 4.3.3 Occupancy Detection Results

In [411]:
mice_imputer = IterativeImputer(max_iter=100)
#MCAR
y_pred, report, mae = run_svm(X_MCAR_o, X_train_o, X_test_o, y_train_o, y_test_o, mice_imputer)
print("MCAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MCAR Imputation: {mae:.4f}\n")
#MAR
y_pred, report, mae = run_svm(X_MAR_o, X_train_o, X_test_o, y_train_o, y_test_o, mice_imputer)
print("MAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MAR Imputation: {mae:.4f}\n")
#MNAR
y_pred, report, mae = run_svm(X_MNAR_o, X_train_o, X_test_o, y_train_o, y_test_o, mice_imputer)
print("MNAR Results:\n")
print(report)
print(f"Mean Absolute Error (MAE) of MNAR Imputation: {mae:.4f}\n")

MCAR Results:

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1693
           1       0.94      0.95      0.95       972

    accuracy                           0.96      2665
   macro avg       0.96      0.96      0.96      2665
weighted avg       0.96      0.96      0.96      2665

Mean Absolute Error (MAE) of MCAR Imputation: 0.0036

MAR Results:

              precision    recall  f1-score   support

           0       0.95      0.98      0.97      1693
           1       0.97      0.92      0.94       972

    accuracy                           0.96      2665
   macro avg       0.96      0.95      0.95      2665
weighted avg       0.96      0.96      0.96      2665

Mean Absolute Error (MAE) of MAR Imputation: 0.0097

MNAR Results:

              precision    recall  f1-score   support

           0       0.81      0.97      0.88      1693
           1       0.92      0.60      0.72       972

    accuracy                   