In [1]:
using CSV, DataFrames, CategoricalArrays, Plots, Statistics, Random, StatsPlots, Gurobi, JuMP

In [2]:
data = CSV.read("data_diabetes.csv", DataFrame)
print(size(data))
data

(20, 4)

Row,Age,DMRxAge,BMI_mean,HbA1c
,Float64,Float64,Float64,Float64
1,67.7974,9.30869,40.4315,7.2
2,71.2909,3.95346,38.0979,8.4
3,51.3101,5.36619,30.3333,11.6
4,64.4901,0.821355,48.5597,8.1
5,69.859,8.95003,42.3325,12.6
6,82.82,1.05133,32.305,9.2
7,51.5072,7.78919,26.8044,7.7
8,37.9822,2.73785,31.0588,10.7
9,38.4203,1.67283,27.7456,9.6
10,53.18,3.24162,31.1516,7.5


In [3]:
# Preprocessing: standardize HbA1c to zero mean and unit sample standard deviation
mu = mean(data.HbA1c)
sigma = std(data.HbA1c)
data.HbA1c = (data.HbA1c .- mu) ./ sigma

# BMI_mean is not used below; the table above shows it on its raw scale, so it is left as is.

20-element Vector{Float64}:
 -1.2167756723645808
 -0.48853088092443325
  1.4534552295826264
 -0.6705920787844706
  2.0603258891160827
 -0.003034353297668834
 -0.9133403425978528
  0.9072716360025156
  0.2397139105157139
 -1.034714474504544
 -0.18509555115770507
 -0.003034353297668834
  0.17902684456236848
  1.635516427442664
  0.30040097646905933
  0.30040097646905933
 -1.6415851340380003
  0.7858975040958248
 -0.8526532766445073
 -0.8526532766445073

### (a)

**Objective:** Split 20 numbers into two groups of 10 each to minimize discrepancies in centered first and second moments.

**Methods to implement:**
1. **Randomization:** Shuffle all numbers, split first half to group 1, rest to group 2
2. **Re-randomization:** Do randomization 10,000 times, choose the one with lowest sum of absolute differences of first and second moments
3. **Pair Matching:** Rank all numbers, for every consecutive pair, randomly split them into groups
4. **Optimization:** Use MIO formulation from lecture notes with ρ = 0.5
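
For reference, the formulation behind item 4 can be read off the JuMP model in the optimization cell further down; this is a reconstruction from that code, and the lecture notes may state it with different symbols. Writing \(y_i\) for the standardized HbA1c values, \(x_{ip}\) for the binary assignment of subject \(i\) to group \(p\), and \(k = n/2\):

$$
\begin{aligned}
\min_{x,\,d}\quad & d \\
\text{s.t.}\quad
& d \ge \lvert \mu_p - \mu_q \rvert + \rho\,\lvert \sigma_p^2 - \sigma_q^2 \rvert, && p < q,\\
& \mu_p = \frac{1}{k}\sum_{i=1}^n y_i\,x_{ip}, \qquad
  \sigma_p^2 = \frac{1}{k}\sum_{i=1}^n y_i^2\,x_{ip} - \mu_p^2, && p = 1,2,\\
& \sum_{p} x_{ip} = 1 \ \ (\forall i), \qquad
  \sum_{i=1}^n x_{ip} = k \ \ (\forall p), \qquad x_{ip} \in \{0,1\}.
\end{aligned}
$$

The first constraint expands into four inequalities, one per sign choice of the two absolute values, each linear in \(\mu\) and \(\sigma^2\). The \(\mu_p^2\) term in the variance definition is what makes the model quadratic.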

**Metrics to report:**
- Total discrepancy: |μ_p(x) - μ_q(x)| + |σ²_p(x) - σ²_q(x)|
- Mean difference: |μ_p(x) - μ_q(x)|
- Variance difference: |σ²_p(x) - σ²_q(x)|

**Random seed:** 15095
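
The three metrics above are recomputed by hand for every method in the cells that follow. A minimal sketch of a shared helper, assuming population variances (`corrected=false`) as in those cells; the name `balance_metrics` is illustrative and does not appear in the notebook:

```julia
using Statistics

# Hypothetical helper: mean difference, variance difference and their sum
# for two groups of numbers, using population variance (divide by n).
function balance_metrics(g1::AbstractVector{<:Real}, g2::AbstractVector{<:Real})
    mean_diff = abs(mean(g1) - mean(g2))
    var_diff = abs(var(g1; corrected=false) - var(g2; corrected=false))
    return (mean_diff = mean_diff, var_diff = var_diff, total = mean_diff + var_diff)
end

# Toy input: means 2 and 3, population variances 2/3 and 2.
metrics = balance_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
println(metrics)
```

Returning a named tuple keeps the three numbers together, so each method could report them with a single call.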

In [4]:
# Save original data for re-randomization
data_original = copy(data)
n = nrow(data)
n_half = div(n, 2)
seed = 15095

# 1. Randomization: Shuffle all the n samples.
# Split the resulting data into two groups of size n/2.
Random.seed!(seed)
data_shuffled = data_original[shuffle(1:nrow(data_original)), :]
group_1_rand = data_shuffled[1:n_half, :]
group_2_rand = data_shuffled[n_half+1:end, :]

# Calculate discrepancies for randomization
# Use population variance (divide by n) for consistency with optimization formulation
mu_1_rand = mean(group_1_rand.HbA1c)
mu_2_rand = mean(group_2_rand.HbA1c)
var_1_rand = var(group_1_rand.HbA1c, corrected=false)
var_2_rand = var(group_2_rand.HbA1c, corrected=false)
diff_mean_rand = abs(mu_1_rand - mu_2_rand)
diff_var_rand = abs(var_1_rand - var_2_rand)
total_diff_rand = diff_mean_rand + diff_var_rand

println("=== 1. Randomization ===")
println("Mean difference: ", diff_mean_rand)
println("Variance difference: ", diff_var_rand)
println("Total discrepancy: ", total_diff_rand)
println()

# 2. Re-randomization: repeat the random split 10,000 times and keep the split with the
# lowest total discrepancy. Each draw is reseeded with seed + i so it can be reproduced on its own.
min_diff = Inf
best_group_1 = nothing
best_group_2 = nothing

for i in 1:10000
    Random.seed!(seed + i)
    data_shuffled = data_original[shuffle(1:nrow(data_original)), :]
    group_1 = data_shuffled[1:n_half, :]
    group_2 = data_shuffled[n_half+1:end, :]
    
    mu_1 = mean(group_1.HbA1c)
    mu_2 = mean(group_2.HbA1c)
    # Use population variance (divide by n) for consistency
    var_1 = var(group_1.HbA1c, corrected=false)
    var_2 = var(group_2.HbA1c, corrected=false)
    
    diff_mean = abs(mu_1 - mu_2)
    diff_var = abs(var_1 - var_2)
    diff = diff_mean + diff_var
    
    if diff < min_diff
        min_diff = diff
        best_group_1 = copy(group_1)
        best_group_2 = copy(group_2)
    end
end

println("=== 2. Re-randomization (best of 10000) ===")
mu_1_rerand = mean(best_group_1.HbA1c)
mu_2_rerand = mean(best_group_2.HbA1c)
# Use population variance (divide by n) for consistency
var_1_rerand = var(best_group_1.HbA1c, corrected=false)
var_2_rerand = var(best_group_2.HbA1c, corrected=false)
diff_mean_rerand = abs(mu_1_rerand - mu_2_rerand)
diff_var_rerand = abs(var_1_rerand - var_2_rerand)
println("Mean difference: ", diff_mean_rerand)
println("Variance difference: ", diff_var_rerand)
println("Total discrepancy: ", min_diff)
println()

=== 1. Randomization ===
Mean difference: 0.34591627593407004
Variance difference: 0.268595353705696
Total discrepancy: 0.614511629639766

=== 2. Re-randomization (best of 10000) ===
Mean difference: 0.006068706595334564
Variance difference: 0.001068046792466859
Total discrepancy: 0.007136753387801423
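
The loop above reseeds the global RNG on every draw. An alternative is to thread one explicit RNG through all draws. The sketch below assumes that design; `rerandomize` and `total_discrepancy` are illustrative names, and because the random stream differs it will not reproduce the numbers printed above.

```julia
using Random, Statistics

# Unweighted total discrepancy used for reporting in part (a).
total_discrepancy(a, b) =
    abs(mean(a) - mean(b)) + abs(var(a; corrected=false) - var(b; corrected=false))

# Hypothetical helper: best of `draws` random balanced splits of y.
# Returns the index sets of both groups and the best total discrepancy.
function rerandomize(y::AbstractVector{<:Real}, draws::Integer;
                     rng::AbstractRNG=Random.default_rng())
    n = length(y)
    k = n ÷ 2
    best_total = Inf
    best_perm = collect(1:n)
    for _ in 1:draws
        perm = shuffle(rng, 1:n)          # one random ordering of the indices
        total = total_discrepancy(y[perm[1:k]], y[perm[k+1:end]])
        if total < best_total
            best_total = total
            best_perm = perm
        end
    end
    return best_perm[1:k], best_perm[k+1:end], best_total
end

# Small illustrative vector (not the HbA1c data).
y_demo = [-1.2, -0.5, 1.5, -0.7, 2.1, 0.0, -0.9, 0.9]
g1_idx, g2_idx, best_total = rerandomize(y_demo, 1000; rng=MersenneTwister(15095))
println("best total discrepancy: ", best_total)
```

Storing the permutation instead of copying the data frames keeps the loop free of row copies; the groups can be materialized once at the end.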



In [5]:
# 3. Pair Matching: Rank all numbers. For every two consecutive numbers, randomly split them
# 
# WHY PAIR MATCHING?
# Pair matching is a common technique in experimental design and causal inference:
# 1. By pairing similar observations (consecutive ranked values), we reduce variance within pairs
# 2. Random assignment within pairs maintains randomization (important for causal inference)
# 3. This heuristic often achieves better balance than pure randomization while still being random
# 4. It's computationally simple compared to optimization but often performs well
# 
# The idea: If two observations are similar (consecutive in rank), splitting them randomly
# ensures one goes to each group, maintaining balance while preserving randomness.
Random.seed!(seed)
data_sorted = sort(data_original, :HbA1c)
group_1_indices = Int[]
group_2_indices = Int[]

for i in 1:2:n-1
    # For each pair of consecutive numbers, randomly assign to groups
    if rand() < 0.5
        push!(group_1_indices, i)
        push!(group_2_indices, i+1)
    else
        push!(group_1_indices, i+1)
        push!(group_2_indices, i)
    end
end

# If n is odd, assign the last one randomly
if n % 2 == 1
    if rand() < 0.5
        push!(group_1_indices, n)
    else
        push!(group_2_indices, n)
    end
end

group_1_pair = data_sorted[group_1_indices, :]
group_2_pair = data_sorted[group_2_indices, :]

# Calculate discrepancies for Pair Matching
mu_1_pair = mean(group_1_pair.HbA1c)
mu_2_pair = mean(group_2_pair.HbA1c)
# Use population variance (divide by n) for consistency with optimization formulation
var_1_pair = var(group_1_pair.HbA1c, corrected=false)
var_2_pair = var(group_2_pair.HbA1c, corrected=false)
diff_mean_pair = abs(mu_1_pair - mu_2_pair)
diff_var_pair = abs(var_1_pair - var_2_pair)
total_diff_pair = diff_mean_pair + diff_var_pair

println("=== 3. Pair Matching ===")
println("Mean difference: ", diff_mean_pair)
println("Variance difference: ", diff_var_pair)
println("Total discrepancy: ", total_diff_pair)
println()

=== 3. Pair Matching ===
Mean difference: 0.07889318573934927
Variance difference: 0.06397231994882646
Total discrepancy: 0.14286550568817574
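
A short derivation, added here as an aside and not taken from the lecture notes, shows why pair matching keeps the mean difference small. Let \(y_{(1)} \le \dots \le y_{(n)}\) be the sorted values with \(n = 2k\), let pair \(i\) be \((y_{(2i-1)}, y_{(2i)})\), and let \(s_i \in \{-1, +1\}\) record which way the coin flip sends that pair. Then

$$
\lvert \mu_1 - \mu_2 \rvert
= \frac{1}{k}\left\lvert \sum_{i=1}^{k} s_i \bigl(y_{(2i)} - y_{(2i-1)}\bigr) \right\rvert
\le \frac{1}{k}\sum_{i=1}^{k} \bigl(y_{(2i)} - y_{(2i-1)}\bigr)
\le \frac{y_{(n)} - y_{(1)}}{k},
$$

where the last step holds because the within-pair gaps are disjoint pieces of the interval \([y_{(1)}, y_{(n)}]\). For the standardized HbA1c values printed earlier (maximum 2.0603, minimum -1.6416, \(k = 10\)) the bound is about \(3.7019 / 10 \approx 0.370\), and the realized mean difference of 0.0789 sits well inside it. Plain randomization has no such bound: in the worst case it can put the ten smallest values in one group.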



In [6]:
# 4. Optimization: solve the mixed-integer formulation from the lecture notes with ρ = 0.5.
model = Model(Gurobi.Optimizer)
set_optimizer_attribute(model, "OutputFlag", 0)

m = 2  # number of groups
k = n_half  # size of each group (n/2)
rho = 0.5

# Get the HbA1c values
y = data_original.HbA1c

# Variables
# x[i,p] = 1 if item i is assigned to group p, 0 otherwise
@variable(model, x[i in 1:n, p in 1:m], Bin)
# d is the maximum discrepancy
@variable(model, d >= 0)
# Auxiliary variables for means: μ_p = (1/k) * Σ_i x[i,p] * y[i]
@variable(model, mu[p in 1:m])
# Auxiliary variables for sum of squares: sum_sq_p = Σ_i x[i,p] * y[i]^2
@variable(model, sum_sq[p in 1:m] >= 0)
# Auxiliary variables for variances: σ_p^2 = (1/k) * sum_sq_p - μ_p^2
@variable(model, var_p[p in 1:m])

# Objective: minimize the maximum discrepancy
@objective(model, Min, d)

# Constraints for means: μ_p = (1/k) * Σ_i x[i,p] * y[i]
for p in 1:m
    @constraint(model, mu[p] == (1/k) * sum(x[i,p] * y[i] for i in 1:n))
end

# Constraints for sum of squares: sum_sq_p = Σ_i x[i,p] * y[i]^2
for p in 1:m
    @constraint(model, sum_sq[p] == sum(x[i,p] * y[i]^2 for i in 1:n))
end

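# Constraints for variances: σ_p^2 = (1/k) * sum_sq_p - μ_p^2
# Note: the μ_p^2 term makes this a nonconvex quadratic equality (an MIQCP, not a plain MIQP).
# Recent Gurobi releases accept it by default; older ones may need the NonConvex parameter set to 2.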
for p in 1:m
    @constraint(model, var_p[p] == (1/k) * sum_sq[p] - mu[p]^2)
end

# Assignment constraints: each item assigned to exactly one group
for i in 1:n
    @constraint(model, sum(x[i,p] for p in 1:m) == 1)
end

# Group size constraints: each group has exactly k items
for p in 1:m
    @constraint(model, sum(x[i,p] for i in 1:n) == k)
end

# Discrepancy constraints for all pairs p < q
# We need to cover all combinations of absolute values:
# d ≥ μ_p - μ_q + ρ(σ_p^2 - σ_q^2)
# d ≥ μ_p - μ_q + ρ(σ_q^2 - σ_p^2)
# d ≥ μ_q - μ_p + ρ(σ_p^2 - σ_q^2)
# d ≥ μ_q - μ_p + ρ(σ_q^2 - σ_p^2)
for p in 1:m-1
    for q in (p+1):m
        @constraint(model, d >= mu[p] - mu[q] + rho * (var_p[p] - var_p[q]))
        @constraint(model, d >= mu[p] - mu[q] + rho * (var_p[q] - var_p[p]))
        @constraint(model, d >= mu[q] - mu[p] + rho * (var_p[p] - var_p[q]))
        @constraint(model, d >= mu[q] - mu[p] + rho * (var_p[q] - var_p[p]))
    end
end

# Solve the model
optimize!(model)

# Results
x_opt = value.(x)
d_opt = value(d)
mu_opt = [value(mu[p]) for p in 1:m]
var_opt = [value(var_p[p]) for p in 1:m]

# Determine group assignments
group_1_opt_indices = [i for i in 1:n if x_opt[i,1] > 0.5]
group_2_opt_indices = [i for i in 1:n if x_opt[i,2] > 0.5]

group_1_opt = data_original[group_1_opt_indices, :]
group_2_opt = data_original[group_2_opt_indices, :]

# Calculate discrepancy for reporting (unweighted for comparison with other methods)
opt_mean_diff = abs(mu_opt[1] - mu_opt[2])
opt_var_diff = abs(var_opt[1] - var_opt[2])
opt_total_diff = opt_mean_diff + opt_var_diff

println("=== 4. Optimization (MIP) ===")
println("Mean difference: ", opt_mean_diff)
println("Variance difference: ", opt_var_diff)
println("Total discrepancy: ", opt_total_diff)
println()

println("=== Final Summary ===")
println("Randomization total discrepancy: ", total_diff_rand)
println("Re-randomization total discrepancy: ", min_diff)
println("Pair Matching total discrepancy: ", total_diff_pair)
println("Optimization total discrepancy: ", opt_total_diff)


Set parameter Username
Academic license - for non-commercial use only - expires 2026-08-20
=== 4. Optimization (MIP) ===
Mean difference: 0.006068706595331991
Variance difference: 0.0004051211971438651
Total discrepancy: 0.006473827792475856

=== Final Summary ===
Randomization total discrepancy: 0.614511629639766
Re-randomization total discrepancy: 0.007136753387801423
Pair Matching total discrepancy: 0.14286550568817574
Optimization total discrepancy: 0.006473827792475856
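
With \(n = 20\) there are only \(\binom{20}{10}/2 = 92{,}378\) distinct balanced splits, so the solver's answer can be cross-checked by exhaustive enumeration. The sketch below assumes the weighted objective \(\lvert\Delta\mu\rvert + \rho\,\lvert\Delta\sigma^2\rvert\) that the JuMP model encodes; the function names are illustrative, and the demo runs on a small made-up vector, not on the HbA1c column.

```julia
using Statistics

# Weighted objective encoded by the JuMP model: |Δmean| + rho * |Δvariance|,
# with population variances (corrected=false).
function weighted_discrepancy(g1, g2, rho)
    return abs(mean(g1) - mean(g2)) +
           rho * abs(var(g1; corrected=false) - var(g2; corrected=false))
end

# Hypothetical helper: try every balanced split of y and keep the best one.
# Item 1 is pinned to group 1 (odd bitmasks only) because swapping the two
# group labels gives the same split.
function brute_force_split(y::AbstractVector{<:Real}, rho::Real)
    n = length(y)
    iseven(n) || throw(ArgumentError("expected an even number of items"))
    k = n ÷ 2
    best_obj = Inf
    best_in1 = fill(false, n)
    for mask in 1:2:(2^n - 1)
        count_ones(mask) == k || continue       # keep balanced splits only
        in1 = [((mask >> (i - 1)) & 1) == 1 for i in 1:n]
        obj = weighted_discrepancy(y[in1], y[.!in1], rho)
        if obj < best_obj
            best_obj = obj
            best_in1 = in1
        end
    end
    return findall(best_in1), findall(.!best_in1), best_obj
end

# Small illustrative vector (not the HbA1c data).
y_demo = [-1.2, -0.5, 1.5, -0.7, 2.1, 0.0, -0.9, 0.9]
g1, g2, obj = brute_force_split(y_demo, 0.5)
println("group 1: ", g1, "  group 2: ", g2, "  objective: ", obj)
```

Calling `brute_force_split(data_original.HbA1c, rho)` in the notebook would give a solver-independent value to compare against `d_opt`; agreement up to the solver's tolerances is what one would expect if the model was solved to optimality.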


### (b)

**Note:** This is a DIFFERENT problem from part (a). Part (a) compared four methods using mean/variance discrepancy. Part (b) is a separate optimisation problem with a different objective function.

**Objective:** Minimise the sum of pairwise absolute differences between numbers in group 1 and group 2.

We want to solve

$$
\min \sum_{i \in \text{group 1}} \sum_{j \in \text{group 2}}
  \lvert w'_i - w'_j \rvert,
$$

where \(w'_i\) are the (standardised) HbA1c values. The \(w'_i\) are data; the only decision is how to split the subjects into two groups of equal size.

---

#### Modelling as a MILP

We introduce binary variables

$$
z_i =
\begin{cases}
1 & \text{if subject } i \text{ is in group 1},\\[4pt]
0 & \text{if subject } i \text{ is in group 2},
\end{cases}
\qquad i = 1,\dots,n.
$$

The group–size constraint is

$$
\sum_{i=1}^n z_i = \frac{n}{2}.
$$

For each unordered pair \((i,j)\) with \(i<j\), we precompute the constant

$$
c_{ij} = \lvert w'_i - w'_j \rvert.
$$

We then introduce binary variables \(d_{ij}\) for \(1 \le i < j \le n\) with the intended meaning:

- \(d_{ij} = 1\) if subjects \(i\) and \(j\) are in **different** groups (that is, \(z_i \ne z_j\)),
- \(d_{ij} = 0\) if they are in the **same** group.

The logical relation \(d_{ij} = \lvert z_i - z_j \rvert\) is enforced with the following linear constraints, for all \(1 \le i < j \le n\):

$$
d_{ij} \ge z_i - z_j,
$$

$$
d_{ij} \ge z_j - z_i,
$$

$$
d_{ij} \le z_i + z_j,
$$

$$
d_{ij} \le 2 - (z_i + z_j).
$$

Finally, the objective stated at the start of this part becomes

$$
\min \sum_{1 \le i < j \le n} c_{ij}\, d_{ij},
$$

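because \(d_{ij} = 1\) exactly when \(i\) and \(j\) are split across the two groups, and then we pay the cost \(c_{ij} = \lvert w'_i - w'_j \rvert\). Since every \(c_{ij} \ge 0\) and we minimise, the two lower bounds alone would already give the same optimal value; the two upper bounds make \(d_{ij} = \lvert z_i - z_j \rvert\) hold at every feasible point, not only at an optimum.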

Solving this MILP gives the partition of subjects that minimises the sum of cross-group absolute differences in HbA1c.
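
The step from the double sum over groups to the sum over pairs \(i < j\) can be checked numerically for any fixed assignment. A minimal sketch with made-up values; the function names and the toy data are assumptions for illustration only:

```julia
# Cost as a double sum over (group 1) x (group 2).
function cross_cost_groups(w, z)
    g1 = w[z .== 1]
    g2 = w[z .== 0]
    return sum(abs(a - b) for a in g1 for b in g2)
end

# Same cost as a sum over unordered pairs i < j, weighted by d_ij = |z_i - z_j|.
function cross_cost_pairs(w, z)
    n = length(w)
    return sum(abs(w[i] - w[j]) * abs(z[i] - z[j]) for i in 1:n for j in i+1:n)
end

w_demo = [0.1, -0.4, 1.2, 0.7]   # toy values, not the HbA1c data
z_demo = [1, 0, 0, 1]            # subjects 1 and 4 in group 1

cost_groups = cross_cost_groups(w_demo, z_demo)
cost_pairs = cross_cost_pairs(w_demo, z_demo)
println(cost_groups, " vs ", cost_pairs)
```

Both functions count each cross-group pair exactly once, which is why the MILP objective over \(i < j\) needs no factor of two.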

In [8]:
# Part (b): Minimize sum of pairwise absolute differences
# Objective: min Σ_{i in group1, j in group2} |w'_i - w'_j|

model_pair = Model(Gurobi.Optimizer)
set_optimizer_attribute(model_pair, "OutputFlag", 0)

# Get the HbA1c values (w'_i) as a plain vector
w = data_original.HbA1c
n = length(w)
n_half = n ÷ 2

# Precompute pairwise absolute differences c[i,j] = |w[i] - w[j]| for i < j
c = Dict{Tuple{Int,Int},Float64}()
for i in 1:n
    for j in i+1:n
        c[(i,j)] = abs(w[i] - w[j])
    end
end

# Binary variables: z[i] = 1 if item i is in group 1, 0 if in group 2
@variable(model_pair, z[1:n], Bin)

# Binary variables: d[i,j] = 1 if i and j are in different groups, 0 otherwise
# Only need them for i < j
@variable(model_pair, d[i in 1:n, j in i+1:n], Bin)

# Objective: sum of pairwise costs over pairs that are split across groups
@objective(model_pair, Min,
    sum(c[(i,j)] * d[i,j] for i in 1:n for j in i+1:n)
)

# Group size constraint: each group has exactly n_half items
@constraint(model_pair, sum(z[i] for i in 1:n) == n_half)

# "Different groups" constraints: d[i,j] = |z[i] - z[j]|
for i in 1:n
    for j in i+1:n
        @constraint(model_pair, d[i,j] >= z[i] - z[j])
        @constraint(model_pair, d[i,j] >= z[j] - z[i])
        @constraint(model_pair, d[i,j] <= z[i] + z[j])
        @constraint(model_pair, d[i,j] <= 2 - (z[i] + z[j]))
    end
end

# Solve the model
optimize!(model_pair)

# Extract results
z_pair_opt = value.(z)
obj_pair_opt = objective_value(model_pair)

# Determine group assignments
group_1_pair_indices = [i for i in 1:n if z_pair_opt[i] > 0.5]
group_2_pair_indices = [i for i in 1:n if z_pair_opt[i] < 0.5]

group_1_pair_opt = data_original[group_1_pair_indices, :]
group_2_pair_opt = data_original[group_2_pair_indices, :]

println("=== Part (b): Minimize Pairwise Differences (MILP) ===")
println("Optimal objective value: ", obj_pair_opt)
println("Group 1 size: ", length(group_1_pair_indices))
println("Group 2 size: ", length(group_2_pair_indices))
println()

# Calculate objective for randomization approach from part (a)
rand_obj = 0.0
for i in 1:n_half
    for j in 1:n_half
        rand_obj += abs(group_1_rand.HbA1c[i] - group_2_rand.HbA1c[j])
    end
end

println("=== Comparison ===")
println("Randomization approach (from part a.1) objective: ", rand_obj)
println("Optimization approach (MILP) objective: ", obj_pair_opt)
println("Improvement: ", rand_obj - obj_pair_opt, " (",
        round(100 * (rand_obj - obj_pair_opt) / rand_obj, digits=2),
        "% reduction)")
println()

=== Part (b): Minimize Pairwise Differences (MILP) ===
Optimal objective value: 111.17870482652917
Group 1 size: 10
Group 2 size: 10

=== Comparison ===
Randomization approach (from part a.1) objective: 115.79092183898342
Optimization approach (MILP) objective: 111.17870482652917
Improvement: 4.612217012454252 (3.98% reduction)



**Discussion:**

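The MILP returns the split that minimizes the sum of cross-group absolute differences: 111.18, against 115.79 for the single random split from part (a.1), a 3.98% reduction. The same pattern holds in part (a), where the optimized split reaches a total discrepancy of 0.0065, against 0.6145 for plain randomization, 0.1429 for pair matching, and 0.0071 for the best of 10,000 re-randomizations. In both scenarios (balancing means and variances, and minimizing pairwise differences) optimization therefore did at least as well as every randomized method we tried.

Two caveats apply. The randomization baseline is a single draw with seed 15095, so the size of the gap depends on that draw. And the guarantee is with respect to each model's own objective: in part (b) no balanced split can beat the MILP optimum, while in part (a) the model minimizes the ρ-weighted discrepancy and the unweighted total is only reported for comparison. With that in mind, when the goal is to make two groups as similar as possible before a statistical comparison or an experiment, optimization gives a systematic improvement over relying on one random assignment.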