# Assignment 6: Imputation via Regression for Missing Data

In this assignment, I explored the challenge of handling missing data in the **UCI Credit Card Default Clients** dataset, focusing on how different **imputation strategies** influence downstream classification performance. After artificially introducing **Missing At Random (MAR)** values in two numerical attributes, AGE and BILL_AMT1, I implemented three imputation approaches (**median imputation, linear regression imputation, and non-linear regression imputation** with KNN) alongside listwise deletion. Each cleaned dataset was used to train a Logistic Regression classifier, and the resulting performance metrics (Accuracy, Precision, Recall, F1-score) were compared to assess imputation efficacy.

## Part A: Data Preprocessing and Imputation

#### Importing the necessary libraries
We start by importing the required libraries.

In [1]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Scikit-learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, make_scorer, mean_squared_error

In [2]:
# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

### 1. Load and Prepare Data

We load the data into a pandas DataFrame for further analysis.

In [4]:
df = pd.read_csv('UCI_Credit_Card.csv')

In [5]:
df.head()

,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


In [6]:
target_col = 'default.payment.next.month'

In [7]:
print('Shape:', df.shape)

Shape: (30000, 25)


In [8]:
df.isnull().sum()

ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64

There are no null values in the dataset, so we will now introduce some to simulate a real-world scenario with a substantial missing-data problem.

**Introduce MAR:**

We'll introduce missing values in two numerical columns: AGE and BILL_AMT1. We will make the missingness Missing At Random (MAR) by making the missing probability depend on LIMIT_BAL.

In [9]:
candidates = ['AGE', 'BILL_AMT1']

def introduce_mar(df, cols, frac=0.07, condition_col='LIMIT_BAL', random_state=RANDOM_STATE):
    """Introduce MAR by making missingness depend on a condition column.
    frac is the *approximate* overall fraction of values to be missing in each column.
    """
    rng = np.random.RandomState(random_state)
    df = df.copy()
    cond_med = df[condition_col].median()
    for col in cols:
        # probability depends on whether condition column is above median
        p_high = min(0.02 + frac*1.5, 0.5)
        p_low = max(0.005, frac*0.4)
        probs = np.where(df[condition_col] > cond_med, p_high, p_low)
        mask = rng.rand(len(df)) < probs
        df.loc[mask, col] = np.nan
        print(f'Introduced {mask.sum()} missing values in {col} (target ~{frac*100:.1f}%)')
    return df

# Make a copy to preserve original
df_missing = introduce_mar(df, candidates, frac=0.07)

# Quick missing summary
missing_summary = df_missing[candidates].isnull().mean().rename('missing_fraction')
missing_summary

Introduced 2238 missing values in AGE (target ~7.0%)
Introduced 2282 missing values in BILL_AMT1 (target ~7.0%)


AGE          0.074600
BILL_AMT1    0.076067
Name: missing_fraction, dtype: float64

We have introduced missing values in the AGE and BILL_AMT1 columns (about 7.5% of each), so we can proceed.
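One way to confirm that the induced missingness really depends on `LIMIT_BAL` is to compare missing rates above and below its median. The sketch below is a minimal illustration on a synthetic frame, not the actual credit data: the column values are made up, and the probabilities 0.125 and 0.028 are simply what `p_high` and `p_low` evaluate to for `frac=0.07`.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

# Synthetic stand-in for the credit data (values are illustrative only)
n = 20_000
demo = pd.DataFrame({
    "LIMIT_BAL": rng.randint(10_000, 500_000, size=n),
    "AGE": rng.randint(21, 80, size=n).astype(float),
})

# Same MAR mechanism as introduce_mar with frac=0.07:
# p_high = 0.02 + 0.07 * 1.5 = 0.125, p_low = 0.07 * 0.4 = 0.028
above_median = (demo["LIMIT_BAL"] > demo["LIMIT_BAL"].median()).to_numpy()
probs = np.where(above_median, 0.125, 0.028)
demo.loc[rng.rand(n) < probs, "AGE"] = np.nan

# Missing rate of AGE within each LIMIT_BAL group
group = np.where(above_median, "high", "low")
missing_rate_by_group = demo["AGE"].isnull().groupby(group).mean()
print(missing_rate_by_group)
```

A clearly higher rate in the "high" group than in the "low" group is what a MAR mechanism conditioned on `LIMIT_BAL` should produce; equal rates would instead point to missingness that ignores the condition column.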

In [10]:
df.describe()

,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,...,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,15000.5,167484.322667,1.603733,1.853133,1.551867,35.4855,-0.0167,-0.133767,-0.1662,-0.220667,...,43262.948967,40311.400967,38871.7604,5663.5805,5921.163,5225.6815,4826.076867,4799.387633,5215.502567,0.2212
std,8660.398374,129747.661567,0.489129,0.790349,0.52197,9.217904,1.123802,1.197186,1.196868,1.169139,...,64332.856134,60797.15577,59554.107537,16563.280354,23040.87,17606.96147,15666.159744,15278.305679,17777.465775,0.415062
min,1.0,10000.0,1.0,0.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,...,-170000.0,-81334.0,-339603.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7500.75,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,...,2326.75,1763.0,1256.0,1000.0,833.0,390.0,296.0,252.5,117.75,0.0
50%,15000.5,140000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,...,19052.0,18104.5,17071.0,2100.0,2009.0,1800.0,1500.0,1500.0,1500.0,0.0
75%,22500.25,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,...,54506.0,50190.5,49198.25,5006.0,5000.0,4505.0,4013.25,4031.5,4000.0,0.0
max,30000.0,1000000.0,2.0,6.0,3.0,79.0,8.0,8.0,8.0,8.0,...,891586.0,927171.0,961664.0,873552.0,1684259.0,896040.0,621000.0,426529.0,528666.0,1.0


### 2. Imputation Strategy 1: Simple Imputation (Baseline)

The median is preferred over the mean because it is robust to outliers and skewed distributions. When a feature's distribution is non-normal or contains extreme values, the **median provides a central location less influenced by those extremes** than the mean.
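The effect of a single extreme value on each statistic can be seen with a small worked example. The bill amounts below are hypothetical and chosen only to make the contrast visible.

```python
import numpy as np

# Hypothetical bill amounts: mostly moderate, with one extreme value
bills = np.array([1200.0, 1500.0, 1800.0, 2100.0, 2500.0, 950000.0])

mean_fill = bills.mean()
median_fill = np.median(bills)

print(f"mean   = {mean_fill:.1f}")    # mean   = 159850.0
print(f"median = {median_fill:.1f}")  # median = 1950.0
```

Filling missing entries with the mean here would assign a value far above five of the six observed amounts, while the median stays within the bulk of the data.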

In [11]:
df_A = df_missing.copy()

for col in candidates:
    median = df_A[col].median()
    df_A[col] = df_A[col].fillna(median)
    print(f'Filled {col} missing with median={median}')

Filled AGE missing with median=34.0
Filled BILL_AMT1 missing with median=22614.5


### 3. Imputation Strategy 2: Regression Imputation (Linear)

**Create a dataset with one missing column**

We start from `df_missing` and restore the BILL_AMT1 column to its original values, so that **AGE** is the only column containing missing values.

In [12]:
df_one_missing = df_missing.copy()
df_one_missing['BILL_AMT1'] = df['BILL_AMT1']

We will impute one column with a `linear regression` model using the other features as predictors. We are picking `AGE` as the column to impute with regression.

**Assumption**: Regression imputation assumes Missing At Random (MAR): that the probability of missingness depends on observed data (here `LIMIT_BAL`) but not on the unobserved (missing) values themselves.

In [13]:
col_to_impute = 'AGE'
exclude = [col_to_impute, target_col]

features = [c for c in df_one_missing.columns if c not in exclude]
print('Regression features count:', len(features))

# Separate rows
not_missing_mask = df_one_missing[col_to_impute].notnull()
X_train = df_one_missing.loc[not_missing_mask, features]
y_train = df_one_missing.loc[not_missing_mask, col_to_impute]
X_pred = df_one_missing.loc[~not_missing_mask, features]

# Fit linear regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predict
preds_lin = lin_reg.predict(X_pred)

# Clip unrealistic values (ages < 21 are unlikely in credit dataset)
preds_lin = np.clip(preds_lin, a_min=21, a_max=None)

# Create Dataset B
df_B = df_one_missing.copy()
df_B.loc[~not_missing_mask, col_to_impute] = preds_lin

print('Linear regression imputation done. Imputed rows:', len(preds_lin))

Regression features count: 23
Linear regression imputation done. Imputed rows: 2238
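The regression-imputation steps above (split rows by missingness, fit on observed rows, predict the missing ones, clip, write back) can be wrapped in one reusable function. The following is a minimal sketch on toy data; `regression_impute` and its `lower` parameter are hypothetical names introduced here for illustration, not part of the notebook or of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(frame, target, estimator, lower=None):
    """Fill NaNs in `target` with predictions from `estimator`,
    trained on the rows where `target` is observed (hypothetical helper)."""
    out = frame.copy()
    predictors = [c for c in out.columns if c != target]
    observed = out[target].notnull()
    if observed.all():
        return out  # nothing to impute
    estimator.fit(out.loc[observed, predictors], out.loc[observed, target])
    preds = estimator.predict(out.loc[~observed, predictors])
    if lower is not None:
        preds = np.clip(preds, a_min=lower, a_max=None)
    out.loc[~observed, target] = preds
    return out

# Toy data: y depends linearly on x, with roughly 10% of y removed
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=200)
toy = pd.DataFrame({"x": x, "y": 3.0 * x + 25.0 + rng.normal(0, 1, size=200)})
toy.loc[rng.rand(200) < 0.1, "y"] = np.nan

filled = regression_impute(toy, "y", LinearRegression(), lower=21)
print(filled["y"].isnull().sum())
```

Because the estimator is passed in, the same helper would accept a `KNeighborsRegressor` in place of `LinearRegression`, assuming all predictor columns are complete.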


### 4. Imputation Strategy 3: Regression Imputation (Non-Linear)

We'll use `KNeighborsRegressor` (a non-linear method) to predict the same column (`AGE`). KNN can model local non-linear relationships and often works well when similar records exist.

**Finding the best parameters for KNN**

In [14]:
param_grid = {
    'n_neighbors': range(1, 31, 2),
    'weights': ['uniform', 'distance'],
    'metric': ['minkowski']  
}

knn = KNeighborsRegressor()
grid = GridSearchCV(
    knn,
    param_grid,
    scoring=make_scorer(mean_squared_error, greater_is_better=False),
    cv=5,
    n_jobs=-1
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV MSE:", -grid.best_score_)

Best parameters: {'metric': 'minkowski', 'n_neighbors': 29, 'weights': 'uniform'}
Best CV MSE: 85.53912030992535


**Refit using the best parameters**

In [15]:
best_knn = grid.best_estimator_

# Predict missing AGE
preds_knn = best_knn.predict(X_pred)

# Clip unrealistic values (ages < 21 are unlikely in credit dataset)
preds_knn = np.clip(preds_knn, a_min=21, a_max=None)

# Create Dataset C
df_C = df_one_missing.copy()
df_C.loc[~not_missing_mask, col_to_impute] = preds_knn

print('Optimized KNN regression imputation done. Imputed rows:', len(preds_knn))

Optimized KNN regression imputation done. Imputed rows: 2238
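KNN relies on distances between rows, so features with large numeric ranges (such as credit limits or bill amounts) can dominate the neighbor search when the inputs are unscaled, as they are above. One possible variation, sketched here on synthetic data, is to put a `StandardScaler` and the regressor in a `Pipeline` and tune that. The feature construction and the small grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)

# Synthetic features on very different scales (illustrative only)
X_demo = np.column_stack([
    rng.uniform(10_000, 500_000, size=300),  # large-scale feature
    rng.uniform(0, 1, size=300),             # small-scale feature
])
# The target depends only on the small-scale feature
y_demo = 30 + 20 * X_demo[:, 1] + rng.normal(0, 1, size=300)

knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])
grid_demo = GridSearchCV(
    knn_pipe,
    {"knn__n_neighbors": [3, 7, 15], "knn__weights": ["uniform", "distance"]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid_demo.fit(X_demo, y_demo)
print(grid_demo.best_params_, -grid_demo.best_score_)
```

With the scaler inside the pipeline, each cross-validation fold is standardized using only its own training portion. Whether scaling would change the imputed AGE values enough to matter for the classifier is not something this sketch establishes.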


## Part B: Model Training and Performance Assessment

**Create Dataset D (Listwise Deletion)**

In [16]:
df_D = df_missing.dropna().copy()
print('Dataset D (listwise deletion) shape:', df_D.shape)

Dataset D (listwise deletion) shape: (25705, 25)
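Before training on a listwise-deleted dataset, it can help to check how many rows survive and whether the share of positive labels has shifted. The sketch below uses synthetic data with hypothetical values; for simplicity its missingness is completely random rather than MAR, so it illustrates the bookkeeping only.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
n = 10_000

# Synthetic stand-in with two partially missing columns and a binary target
toy = pd.DataFrame({
    "AGE": rng.randint(21, 80, size=n).astype(float),
    "BILL_AMT1": rng.normal(50_000, 20_000, size=n),
    "target": rng.binomial(1, 0.22, size=n),
})
for col in ["AGE", "BILL_AMT1"]:
    toy.loc[rng.rand(n) < 0.075, col] = np.nan

# Listwise deletion: keep only fully observed rows
kept = toy.dropna()

summary = pd.DataFrame(
    {
        "rows": [len(toy), len(kept)],
        "positive_rate": [toy["target"].mean(), kept["target"].mean()],
    },
    index=["before", "after dropna"],
)
print(summary)
```

Applied to the real data, the same two columns (`rows`, `positive_rate`) would show directly whether deletion changed the class balance.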


### 1. Data Split, 2. Classifier Setup and 3. Model Evaluation

**Common function to prepare X,y, split, scale, fit logistic regression, evaluate**

In [17]:
warnings.filterwarnings('ignore', category=ConvergenceWarning)

def train_and_evaluate(df_in, dataset_name, features_exclude=('ID',)):
    """
    Train and evaluate a tuned Logistic Regression model using GridSearchCV.
    Columns in features_exclude (and the target) are dropped from the features.
    """
    df_local = df_in.copy()

    # Define X, y
    drop_cols = [c for c in [*features_exclude, target_col] if c in df_local.columns]
    X = df_local.drop(columns=drop_cols)
    y = df_local[target_col].astype(int)

    # Train-test split (stratify if both classes present)
    stratify_arg = y if len(np.unique(y)) > 1 else None
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=stratify_arg
    )

    # Standardize
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Logistic Regression with GridSearchCV
    param_grid = {
        'C': [0.001, 0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga'],
        'class_weight': [None, 'balanced']
    }

    base_clf = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)

    grid_search = GridSearchCV(
        estimator=base_clf,
        param_grid=param_grid,
        scoring='f1_weighted',
        cv=5,
        n_jobs=-1,
        verbose=0
    )

    grid_search.fit(X_train_scaled, y_train)

    best_params = grid_search.best_params_
    best_clf = grid_search.best_estimator_

    # Evaluate best model
    y_pred = best_clf.predict(X_test_scaled)

    report = classification_report(y_test, y_pred, output_dict=True)
    report_str = classification_report(y_test, y_pred)

    metrics = {
        'Dataset': dataset_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision_pos': precision_score(y_test, y_pred, zero_division=0),
        'Recall_pos': recall_score(y_test, y_pred, zero_division=0),
        'F1_pos': f1_score(y_test, y_pred, zero_division=0),
        'F1_macro': report['macro avg']['f1-score'] if 'macro avg' in report else np.nan,
        'F1_weighted': report['weighted avg']['f1-score'] if 'weighted avg' in report else np.nan,
        'n_train': X_train.shape[0],
        'n_test': X_test.shape[0],
    }

    print(f"\n {dataset_name}- Best Parameters: {best_params}")

    return metrics, report_str
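In the function above the scaler is fitted on the whole training split before `GridSearchCV` runs, so each validation fold has already influenced the scaling statistics. A common alternative, sketched here on synthetic data, is to place the scaler and classifier in a `Pipeline` so that scaling is refitted inside every fold. The data from `make_classification` and the reduced grid are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic classification problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(
    pipe,
    {"clf__C": [0.01, 0.1, 1, 10], "clf__class_weight": [None, "balanced"]},
    scoring="f1_weighted",
    cv=5,
)
search.fit(X_tr, y_tr)

# score() applies the same f1_weighted scorer to the held-out split
test_score = search.score(X_te, y_te)
print(search.best_params_, round(test_score, 3))
```

The practical difference is usually small for standardization, but the pipeline form keeps the cross-validated scores free of any information from the validation folds.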

**Evaluate models on datasets A, B, C, D**

In [18]:
results = []
for df_use, name in [(df_A, 'A-Baseline'), (df_B, 'B-Linear'), (df_C, 'C-Non-Linear'), (df_D, 'D-Listwise Deletion')]:
    print('Training on dataset', name)
    metrics, rep = train_and_evaluate(df_use, name)
    print(rep)
    results.append(metrics)

Training on dataset A-Baseline

 A-Baseline- Best Parameters: {'C': 0.001, 'class_weight': 'balanced', 'penalty': 'l1', 'solver': 'saga'}
              precision    recall  f1-score   support

           0       0.86      0.85      0.85      4673
           1       0.49      0.52      0.50      1327

    accuracy                           0.77      6000
   macro avg       0.68      0.68      0.68      6000
weighted avg       0.78      0.77      0.78      6000

Training on dataset B-Linear

 B-Linear- Best Parameters: {'C': 0.001, 'class_weight': 'balanced', 'penalty': 'l1', 'solver': 'saga'}
              precision    recall  f1-score   support

           0       0.86      0.85      0.85      4673
           1       0.49      0.52      0.50      1327

    accuracy                           0.77      6000
   macro avg       0.68      0.68      0.68      6000
weighted avg       0.78      0.77      0.78      6000

Training on dataset C-Non-Linear

 C-Non-Linear- Best Parameters: {'C': 0.

Performance on datasets A, B, and C is identical, whereas dataset D performs noticeably worse.

## Part C: Comparative Analysis

### 1. Results Comparison

In [19]:
results_df = pd.DataFrame(results).set_index('Dataset')
results_df

Dataset,Accuracy,Precision_pos,Recall_pos,F1_pos,F1_macro,F1_weighted,n_train,n_test
A-Baseline,0.774,0.48965,0.516956,0.502933,0.678343,0.776163,24000,6000
B-Linear,0.774,0.48965,0.516956,0.502933,0.678343,0.776163,24000,6000
C-Non-Linear,0.774,0.48965,0.516956,0.502933,0.678343,0.776163,24000,6000
D-Listwise Deletion,0.562926,0.308902,0.733447,0.434717,0.539223,0.595836,20564,5141


All three imputation methods (A–C) achieved **identical performance**, while listwise deletion (D) showed a clear drop in every metric except recall.

### 2. Efficacy Discussion

**a) Trade-off: Listwise Deletion vs Imputation:**
- **Listwise Deletion**: This method discards every record that contains a missing value, shrinking the training set from 24,000 to 20,564 samples (and the test set from 6,000 to 5,141).
Despite a higher recall (0.73), its **F1 Weighted (0.596)** and accuracy (0.56) are far lower, and precision falls to 0.31. This pattern is consistent with a model that overpredicts the positive class. Possible contributors are the **loss of data diversity** and a shift in class balance, since rows are removed non-uniformly under MAR. Note also that D is scored on its own, smaller test split, so its numbers are not strictly comparable with those of A to C.

- **Imputation methods** (A, B, C) retain the full dataset, which should support better generalization through a larger sample size and preserved variability.

Hence, **Model D** performs substantially worse because it trains on fewer, less representative samples.

**b) Linear vs Non-linear Regression Imputation:**

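- Both **B (Linear)** and **C (Non-linear)** achieve identical F1-scores (0.503 for the positive class, 0.776 weighted).
- This suggests that the choice of imputation model for AGE has no measurable effect on the classifier here. It is consistent with AGE having only **weak or roughly linear dependence** on the other features, and possibly also with the strongly regularized classifier (L1 penalty, `C=0.001` for A and B) giving AGE little weight.
- The non-linear KNN did not capture additional structure, possibly because other predictors (like LIMIT_BAL and the BILL_AMT columns) have only moderate correlations with AGE, and KNN regression adds noise when relationships are weak.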
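One way to probe why the imputation choice makes so little difference is to look at the coefficient the classifier assigns to the imputed column: if a strongly regularized L1 model shrinks it to zero or close to it, the imputed values cannot move the predictions much. The sketch below runs on synthetic data with made-up feature names and is only an illustration of the check, not a result from the credit data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
n = 2_000

# Synthetic stand-ins: one informative feature and one pure-noise feature
informative = rng.normal(size=n)
noise_feature = rng.normal(size=n)  # plays the role of a weakly relevant column
y_demo = (informative + 0.5 * rng.normal(size=n) > 0.8).astype(int)

X_demo = StandardScaler().fit_transform(
    np.column_stack([informative, noise_feature])
)

# Small C means strong regularization; L1 can set coefficients exactly to zero
clf = LogisticRegression(penalty="l1", C=0.01, solver="liblinear", random_state=42)
clf.fit(X_demo, y_demo)

for name, coef in zip(["informative", "noise_feature"], clf.coef_[0]):
    print(f"{name:>14}: {coef:+.4f}")
```

On the real data, the equivalent check would be to read `best_clf.coef_` for the AGE and BILL_AMT1 positions after fitting; that step is not part of the notebook as written.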

**c) Recommendation:**
- **Median or Linear Regression Imputation:**
    - Both achieve the **highest Weighted F1 (0.776)** and accuracy (0.774).
    - They are efficient, stable, and conceptually appropriate for MAR (Missing At Random) data.
    - Linear imputation may be favored when features have a clear linear structure; otherwise, median imputation is a safe and interpretable baseline.
- **Non-linear Regression Imputation:**
    - Adds complexity without gain here. It may be worth considering for cases with stronger non-linear dependencies or higher missingness.
- **Listwise Deletion:**
    - Not recommended as it causes substantial data loss and lower model performance.

#### Conclusion:
Considering both classification metrics (Weighted F1 and Accuracy) and conceptual reasoning, **Median or Linear Regression Imputation** offers the best trade-off between simplicity, robustness, and predictive performance.
Non-linear imputation provides no benefit for this dataset, while listwise deletion should be avoided due to its severe information loss.