# Lotto XGBoost with 5-Number Buckets (Named Ranges)

This notebook:

1. Loads `Lotto5.xlsx` (NZ Lotto draw history).
2. Builds **5-number bucket features** with meaningful names, e.g. `bucket_1_5_count`, `bucket_6_10_count`.
3. Adds simple date-part features and converts the existing helper columns (`Odd`, `Even`, etc.) to numeric.
4. Saves the expanded dataset (with new features) as `Lotto5_Imputed.xlsx`.
5. Trains an **XGBoost multi-output regressor** (one model per winning number).
6. Evaluates it with 5-fold cross-validation.
7. Demonstrates generating candidate draws and predicting with the best model.

Place this notebook in the same folder as `Lotto5.xlsx` and run all cells top to bottom.


In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.multioutput import MultiOutputRegressor

from xgboost import XGBRegressor

np.random.seed(42)
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 100)

print('Libraries imported.')

Libraries imported.


In [2]:
# Path to your Excel file. Ensure Lotto5.xlsx is in the same directory as this notebook.
excel_path = "Lotto5.xlsx"

df = pd.read_excel(excel_path)
print("Data shape:", df.shape)
display(df.head())
print("\nColumns:", list(df.columns))

Data shape: (1185, 31)


Unnamed: 0,Draw,Date,Winning Number 1,Winning Number 2,Winning Number 3,Winning Number 4,Winning Number 5,Winning Number 6,Bonus Number,From Last,Same As Day,Odd,Even,1-10,11-20,21-30,31-40,Division 1 Winners,Division 1 Prize,Division 2 Winners,Division 2 Prize,Division 3 Winners,Division 3 Prize,Division 4 Winners,Division 4 Prize,Division 5 Winners,Division 5 Prize,Division 6 Winners,Division 6 Prize,Division 7 Winners,Division 7 Prize
0,2258,2023-03-25,4,5,8,19,28,31,38,,,3,3,3,1,1,1,1.0,1000000.0,12.0,19645.0,476.0,491.0,1036.0,52.0,17884.0,26.0,20477.0,22.0,237356.0,1.5
1,2257,2023-03-22,7,16,29,32,33,38,9,38.0,,3,3,1,1,1,3,1.0,1000000.0,5.0,38340.0,249.0,764.0,702.0,62.0,11439.0,33.0,15574.0,24.0,161136.0,1.5
2,2256,2023-03-18,4,11,17,27,37,38,26,4.0,,4,2,1,2,1,2,2.0,500000.0,20.0,24497.0,867.0,560.0,1923.0,58.0,32665.0,30.0,41864.0,23.0,442812.0,1.5
3,2255,2023-03-15,3,4,5,24,33,39,16,39.0,,4,2,3,0,1,2,0.0,0.0,13.0,18925.0,385.0,634.0,935.0,60.0,15447.0,32.0,20815.0,23.0,219791.0,1.5
4,2254,2023-03-11,6,19,26,28,37,39,27,39.0,,3,3,1,1,2,2,3.0,333333.0,7.0,41972.0,448.0,650.0,1165.0,57.0,17799.0,33.0,24135.0,24.0,251402.0,1.5



Columns: ['Draw', 'Date', 'Winning Number 1', 'Winning Number 2', 'Winning Number 3', 'Winning Number 4', 'Winning Number 5', 'Winning Number 6', 'Bonus Number', 'From Last', 'Same As Day', 'Odd', 'Even', '1-10', '11-20', '21-30', '31-40', 'Division 1 Winners', 'Division 1 Prize', 'Division 2 Winners', 'Division 2 Prize', 'Division 3 Winners', 'Division 3 Prize', 'Division 4 Winners', 'Division 4 Prize', 'Division 5 Winners', 'Division 5 Prize', 'Division 6 Winners', 'Division 6 Prize', 'Division 7 Winners', 'Division 7 Prize']


## Basic cleaning and helper columns

We:

- Ensure the winning number columns are numeric.
- Convert helper columns like `From Last`, `Same As Day`, `Odd`, `Even` to numeric (if present).
- Parse `Date` and add `Year`, `Month`, and `DayOfWeek` columns.
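
The day-of-week feature is easier to read once the encoding is known: pandas' `dt.dayofweek` counts from Monday = 0 to Sunday = 6, the same convention as `date.weekday()` in the standard library. The sketch below checks this on two draw dates taken from the table above; the `day_names` list is only an illustrative helper, not part of the notebook.

```python
from datetime import date

# Monday = 0 ... Sunday = 6, matching pandas' Series.dt.dayofweek
day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # illustrative helper

draw_dates = [date(2023, 3, 25), date(2023, 3, 22)]
encoded = [d.weekday() for d in draw_dates]

for d, code in zip(draw_dates, encoded):
    print(d.isoformat(), code, day_names[code])
```

Because the cleaning cell fills missing date parts with 0, a day-of-week value of 0 can mean either Monday or a date that failed to parse.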


In [3]:
# Winning number columns (targets)
number_cols = [
    "Winning Number 1",
    "Winning Number 2",
    "Winning Number 3",
    "Winning Number 4",
    "Winning Number 5",
    "Winning Number 6",
]

# Ensure numeric for winning numbers
for col in number_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    else:
        raise ValueError(f"Expected column '{col}' not found in dataframe.")

# Helper columns that may exist
helper_cols = ["From Last", "Same As Day", "Odd", "Even"]
for col in helper_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

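# Date handling: unparseable dates become NaT and their parts are encoded as 0.
# Note that DayOfWeek 0 also means Monday, so the two cases are not distinguishable.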
if "Date" in df.columns:
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
    df["Year"] = df["Date"].dt.year.fillna(0).astype(int)
    df["Month"] = df["Date"].dt.month.fillna(0).astype(int)
    df["DayOfWeek"] = df["Date"].dt.dayofweek.fillna(0).astype(int)
else:
    df["Year"] = 0
    df["Month"] = 0
    df["DayOfWeek"] = 0

print("After basic cleaning:")
display(df.head())

After basic cleaning:


Unnamed: 0,Draw,Date,Winning Number 1,Winning Number 2,Winning Number 3,Winning Number 4,Winning Number 5,Winning Number 6,Bonus Number,From Last,Same As Day,Odd,Even,1-10,11-20,21-30,31-40,Division 1 Winners,Division 1 Prize,Division 2 Winners,Division 2 Prize,Division 3 Winners,Division 3 Prize,Division 4 Winners,Division 4 Prize,Division 5 Winners,Division 5 Prize,Division 6 Winners,Division 6 Prize,Division 7 Winners,Division 7 Prize,Year,Month,DayOfWeek
0,2258,2023-03-25,4,5,8,19,28,31,38,,,3,3,3,1,1,1,1.0,1000000.0,12.0,19645.0,476.0,491.0,1036.0,52.0,17884.0,26.0,20477.0,22.0,237356.0,1.5,2023,3,5
1,2257,2023-03-22,7,16,29,32,33,38,9,38.0,,3,3,1,1,1,3,1.0,1000000.0,5.0,38340.0,249.0,764.0,702.0,62.0,11439.0,33.0,15574.0,24.0,161136.0,1.5,2023,3,2
2,2256,2023-03-18,4,11,17,27,37,38,26,4.0,,4,2,1,2,1,2,2.0,500000.0,20.0,24497.0,867.0,560.0,1923.0,58.0,32665.0,30.0,41864.0,23.0,442812.0,1.5,2023,3,5
3,2255,2023-03-15,3,4,5,24,33,39,16,39.0,,4,2,3,0,1,2,0.0,0.0,13.0,18925.0,385.0,634.0,935.0,60.0,15447.0,32.0,20815.0,23.0,219791.0,1.5,2023,3,2
4,2254,2023-03-11,6,19,26,28,37,39,27,39.0,,3,3,1,1,2,2,3.0,333333.0,7.0,41972.0,448.0,650.0,1165.0,57.0,17799.0,33.0,24135.0,24.0,251402.0,1.5,2023,3,5


## 5-number bucket features with named ranges

We map numbers as follows (for NZ Lotto 1–40):

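- 1–5 → bucket index 0 → `bucket_1_5_count`, `bucket_1_5_present`
- 6–10 → bucket index 1 → `bucket_6_10_count`, `bucket_6_10_present`
- 11–15 → bucket index 2 → `bucket_11_15_count`, `bucket_11_15_present`
- 16–20 → bucket index 3 → `bucket_16_20_count`, `bucket_16_20_present`
- 21–25 → bucket index 4 → `bucket_21_25_count`, `bucket_21_25_present`
- 26–30 → bucket index 5 → `bucket_26_30_count`, `bucket_26_30_present`
- 31–35 → bucket index 6 → `bucket_31_35_count`, `bucket_31_35_present`
- 36–40 → bucket index 7 → `bucket_36_40_count`, `bucket_36_40_present`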

So you can immediately see which 5-number range each feature refers to.
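
The mapping above comes down to one integer division. A minimal, dependency-free sketch follows; the names `bucket_index` and `bucket_label` are hypothetical and chosen for illustration only.

```python
BUCKET_SIZE = 5  # same bucket width as the notebook

def bucket_index(number: int, bucket_size: int = BUCKET_SIZE) -> int:
    """0-based bucket index: 1-5 -> 0, 6-10 -> 1, ..., 36-40 -> 7."""
    return (number - 1) // bucket_size

def bucket_label(index: int, bucket_size: int = BUCKET_SIZE) -> str:
    """Human-readable range name for a bucket index, e.g. 0 -> 'bucket_1_5'."""
    low = index * bucket_size + 1
    high = (index + 1) * bucket_size
    return f"bucket_{low}_{high}"

# Boundary numbers are the easiest place to get an off-by-one error.
for n in (1, 5, 6, 10, 36, 40):
    print(n, "->", bucket_index(n), bucket_label(bucket_index(n)))
```

Subtracting 1 before dividing is what keeps 5 in the first bucket and 40 in the last one.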


In [4]:
def num_to_bucket(num: float, bucket_size: int = 5) -> float:
    """Map a lotto number to a 0-based bucket index of size `bucket_size`.
    Returns NaN for missing values.
    """
    if pd.isna(num):
        return np.nan
    return int((int(num) - 1) // bucket_size)

# Define the bucket width once and use it everywhere below
bucket_size = 5

# Create per-number bucket columns (indices)
for col in number_cols:
    df[f"{col}_bucket"] = df[col].apply(num_to_bucket, bucket_size=bucket_size)

bucket_cols = [f"{col}_bucket" for col in number_cols]
# Highest bucket index observed in the data (7 when numbers go up to 40)
max_bucket = int(df[bucket_cols].max().max())

# Infer max actual number from data (e.g. 40)
max_number = int(df[number_cols].max().max())

# Build mapping from bucket index -> human-readable column names
bucket_index_to_count_col = {}
bucket_index_to_present_col = {}
bucket_count_cols = []
bucket_present_cols = []

for i in range(max_bucket + 1):
    low = i * bucket_size + 1
    high = min((i + 1) * bucket_size, max_number)
    count_name = f"bucket_{low}_{high}_count"
    present_name = f"bucket_{low}_{high}_present"
    bucket_index_to_count_col[i] = count_name
    bucket_index_to_present_col[i] = present_name
    bucket_count_cols.append(count_name)
    bucket_present_cols.append(present_name)

print("Bucket index to feature names:")
for i in range(max_bucket + 1):
    print(i, "->", bucket_index_to_count_col[i], ",", bucket_index_to_present_col[i])

display(df[bucket_cols].head())

Bucket index to feature names:
0 -> bucket_1_5_count , bucket_1_5_present
1 -> bucket_6_10_count , bucket_6_10_present
2 -> bucket_11_15_count , bucket_11_15_present
3 -> bucket_16_20_count , bucket_16_20_present
4 -> bucket_21_25_count , bucket_21_25_present
5 -> bucket_26_30_count , bucket_26_30_present
6 -> bucket_31_35_count , bucket_31_35_present
7 -> bucket_36_40_count , bucket_36_40_present


Unnamed: 0,Winning Number 1_bucket,Winning Number 2_bucket,Winning Number 3_bucket,Winning Number 4_bucket,Winning Number 5_bucket,Winning Number 6_bucket
0,0,0,1,3,5,6
1,1,3,5,6,6,7
2,0,2,3,5,7,7
3,0,0,0,4,6,7
4,1,3,5,5,7,7


In [5]:
# Compute bucket counts per draw using the named columns
def bucket_count_row(row):
    counts = np.zeros(max_bucket + 1, dtype=int)
    buckets = row[bucket_cols].values
    for b in buckets:
        if not pd.isna(b):
            b_int = int(b)
            if 0 <= b_int <= max_bucket:
                counts[b_int] += 1
    # Map counts into named columns
    data = {}
    for i in range(max_bucket + 1):
        data[bucket_index_to_count_col[i]] = counts[i]
    return pd.Series(data, index=bucket_count_cols)

df_bucket_counts = df.apply(bucket_count_row, axis=1)
df = pd.concat([df, df_bucket_counts], axis=1)

# Presence flags using named columns
for i in range(max_bucket + 1):
    count_col = bucket_index_to_count_col[i]
    present_col = bucket_index_to_present_col[i]
    df[present_col] = (df[count_col] > 0).astype(int)

# Bucket energy (weighted sum of bucket indices by count)
df["bucket_energy"] = 0
for i in range(max_bucket + 1):
    count_col = bucket_index_to_count_col[i]
    df["bucket_energy"] += i * df[count_col]

print("Bucket features created (named ranges):")
display(df.head())

Bucket features created (named ranges):


Unnamed: 0,Draw,Date,Winning Number 1,Winning Number 2,Winning Number 3,Winning Number 4,Winning Number 5,Winning Number 6,Bonus Number,From Last,Same As Day,Odd,Even,1-10,11-20,21-30,31-40,Division 1 Winners,Division 1 Prize,Division 2 Winners,Division 2 Prize,Division 3 Winners,Division 3 Prize,Division 4 Winners,Division 4 Prize,Division 5 Winners,Division 5 Prize,Division 6 Winners,Division 6 Prize,Division 7 Winners,Division 7 Prize,Year,Month,DayOfWeek,Winning Number 1_bucket,Winning Number 2_bucket,Winning Number 3_bucket,Winning Number 4_bucket,Winning Number 5_bucket,Winning Number 6_bucket,bucket_1_5_count,bucket_6_10_count,bucket_11_15_count,bucket_16_20_count,bucket_21_25_count,bucket_26_30_count,bucket_31_35_count,bucket_36_40_count,bucket_1_5_present,bucket_6_10_present,bucket_11_15_present,bucket_16_20_present,bucket_21_25_present,bucket_26_30_present,bucket_31_35_present,bucket_36_40_present,bucket_energy
0,2258,2023-03-25,4,5,8,19,28,31,38,,,3,3,3,1,1,1,1.0,1000000.0,12.0,19645.0,476.0,491.0,1036.0,52.0,17884.0,26.0,20477.0,22.0,237356.0,1.5,2023,3,5,0,0,1,3,5,6,2,1,0,1,0,1,1,0,1,1,0,1,0,1,1,0,15
1,2257,2023-03-22,7,16,29,32,33,38,9,38.0,,3,3,1,1,1,3,1.0,1000000.0,5.0,38340.0,249.0,764.0,702.0,62.0,11439.0,33.0,15574.0,24.0,161136.0,1.5,2023,3,2,1,3,5,6,6,7,0,1,0,1,0,1,2,1,0,1,0,1,0,1,1,1,28
2,2256,2023-03-18,4,11,17,27,37,38,26,4.0,,4,2,1,2,1,2,2.0,500000.0,20.0,24497.0,867.0,560.0,1923.0,58.0,32665.0,30.0,41864.0,23.0,442812.0,1.5,2023,3,5,0,2,3,5,7,7,1,0,1,1,0,1,0,2,1,0,1,1,0,1,0,1,24
3,2255,2023-03-15,3,4,5,24,33,39,16,39.0,,4,2,3,0,1,2,0.0,0.0,13.0,18925.0,385.0,634.0,935.0,60.0,15447.0,32.0,20815.0,23.0,219791.0,1.5,2023,3,2,0,0,0,4,6,7,3,0,0,0,1,0,1,1,1,0,0,0,1,0,1,1,17
4,2254,2023-03-11,6,19,26,28,37,39,27,39.0,,3,3,1,1,2,2,3.0,333333.0,7.0,41972.0,448.0,650.0,1165.0,57.0,17799.0,33.0,24135.0,24.0,251402.0,1.5,2023,3,5,1,3,5,5,7,7,0,1,0,1,0,2,0,2,0,1,0,1,0,1,0,1,28
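
As a check on the bucket features, the values for the first row of the table above (winning numbers 4, 5, 8, 19, 28, 31) can be reproduced by hand. This is a small standalone sketch, not the notebook's own implementation; it assumes 8 buckets of width 5.

```python
from collections import Counter

draw = [4, 5, 8, 19, 28, 31]      # first row of the table above
bucket_size, n_buckets = 5, 8     # assumption: 8 buckets cover 1-40

# Per-number bucket index, then a tally per bucket
buckets = [(n - 1) // bucket_size for n in draw]
tally = Counter(buckets)

counts = [tally.get(i, 0) for i in range(n_buckets)]
present = [int(c > 0) for c in counts]
energy = sum(i * c for i, c in enumerate(counts))

print("buckets:", buckets)
print("counts: ", counts)
print("present:", present)
print("energy: ", energy)
```

Since each number contributes its own bucket index once, `bucket_energy` equals the sum of the six per-number bucket indices.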


In [6]:
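# Save expanded dataset with new features.
# Despite the file name, nothing is imputed at this point: missing values
# are only imputed later, inside the model pipeline.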
output_excel_path = "Lotto5_Imputed.xlsx"
df.to_excel(output_excel_path, index=False)
print(f"Saved dataset with new features to {output_excel_path}")

Saved dataset with new features to Lotto5_Imputed.xlsx


## Build feature matrix and target matrix

- **Targets**: the six winning numbers as a 6D regression target.
- **Features**: named bucket counts/presence, bucket energy, helper columns, and date parts.


In [7]:
target_cols = number_cols.copy()

# bucket_count_cols and bucket_present_cols already defined with meaningful names
candidate_feature_cols = (
    bucket_count_cols
    + bucket_present_cols
    + ["bucket_energy", "From Last", "Same As Day", "Odd", "Even", "Year", "Month", "DayOfWeek"]
)

# Keep only columns that exist in df (in case some helper cols are missing)
feature_cols = [c for c in candidate_feature_cols if c in df.columns]

print("Using feature columns:")
print(feature_cols)

X = df[feature_cols].values
y = df[target_cols].values

print("Feature matrix shape:", X.shape)
print("Target matrix shape:", y.shape)

Using feature columns:
['bucket_1_5_count', 'bucket_6_10_count', 'bucket_11_15_count', 'bucket_16_20_count', 'bucket_21_25_count', 'bucket_26_30_count', 'bucket_31_35_count', 'bucket_36_40_count', 'bucket_1_5_present', 'bucket_6_10_present', 'bucket_11_15_present', 'bucket_16_20_present', 'bucket_21_25_present', 'bucket_26_30_present', 'bucket_31_35_present', 'bucket_36_40_present', 'bucket_energy', 'From Last', 'Same As Day', 'Odd', 'Even', 'Year', 'Month', 'DayOfWeek']
Feature matrix shape: (1185, 24)
Target matrix shape: (1185, 6)


## XGBoost model with cross-validation

We use:

- `SimpleImputer` to handle any missing feature values.
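- `MultiOutputRegressor(XGBRegressor)` to predict all 6 numbers, fitting one independent XGBoost model per target.
- 5-fold cross-validation with **negative MSE** scoring (scikit-learn treats higher scores as better, so the MSE is negated).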
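
To make the scoring concrete: with a 2-D target, `mean_squared_error` by default averages the squared errors uniformly over the outputs, and a scorer built with `greater_is_better=False` reports the negated value. The dependency-free sketch below reproduces that arithmetic on two draws from the table above paired with hypothetical predictions (the `y_pred` values are invented for illustration).

```python
# Two draws as targets, with hypothetical (made-up) predictions.
y_true = [[4, 5, 8, 19, 28, 31],
          [7, 16, 29, 32, 33, 38]]
y_pred = [[3, 5, 9, 18, 28, 33],
          [7, 15, 29, 32, 35, 38]]

n_samples = len(y_true)
n_outputs = len(y_true[0])

# MSE per output column (one value per winning-number position)
per_output_mse = [
    sum((y_true[r][c] - y_pred[r][c]) ** 2 for r in range(n_samples)) / n_samples
    for c in range(n_outputs)
]

# Uniform average across the six outputs gives the single MSE value
mse = sum(per_output_mse) / n_outputs

# The scorer reports the negated value, so "higher is better" holds
score = -mse
print(per_output_mse, mse, score)
```

scikit-learn also accepts the built-in string `scoring="neg_mean_squared_error"`, which follows the same sign convention.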


In [8]:
# Define base XGBoost regressor
xgb_reg = XGBRegressor(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="reg:squarederror",
    tree_method="hist",  # for GPU training, XGBoost 2.0+ adds device="cuda"; older versions used tree_method="gpu_hist"
    random_state=42,
)

model = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("regressor", MultiOutputRegressor(xgb_reg)),
    ]
)

scoring = make_scorer(mean_squared_error, greater_is_better=False)

cv_results = cross_validate(
    model,
    X,
    y,
    cv=5,
    scoring=scoring,
    return_estimator=True,
    n_jobs=-1,
)

test_scores = -cv_results["test_score"]  # convert back to positive MSE
print("Cross-validation MSE scores:", test_scores)
print("Mean CV MSE:", np.mean(test_scores))

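# "Best" here means the fold estimator with the lowest held-out MSE;
# each fold estimator was trained on 4/5 of the data, not the full dataset.
best_model_index = np.argmin(test_scores)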
best_model = cv_results["estimator"][best_model_index]
print("Best model index:", best_model_index)

Cross-validation MSE scores: [2.41783362 2.53590048 2.59473129 2.38549503 2.53137193]
Mean CV MSE: 2.4930664672852467
Best model index: 3
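
The MSE values above are in squared number units, so taking the square root gives a typical error in lottery-number units. For comparison, the sketch also computes a simple reference point: if a number were uniformly distributed within a known 5-number bucket, guessing the bucket midpoint would give an MSE of 2.0. That uniformity is an assumption made for illustration; the notebook does not verify it.

```python
import math

# Fold scores copied from the cross-validation output above
cv_mse = [2.41783362, 2.53590048, 2.59473129, 2.38549503, 2.53137193]

mean_mse = sum(cv_mse) / len(cv_mse)
rmse = math.sqrt(mean_mse)          # typical error in number units

# Reference point (assumption: a number is uniform within its 5-wide bucket).
# Offsets from the bucket midpoint are -2, -1, 0, 1, 2.
offsets = [-2, -1, 0, 1, 2]
midpoint_mse = sum(o ** 2 for o in offsets) / len(offsets)

print(f"mean CV MSE:  {mean_mse:.3f}")
print(f"RMSE:         {rmse:.3f}")
print(f"midpoint MSE: {midpoint_mse:.1f}")
```

A cross-validated MSE in the neighbourhood of this reference is consistent with the model mostly recovering each number's bucket from the bucket features, which are themselves computed from the target numbers.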


## Predicting from candidate draws

To keep things simple, we:

1. Generate a **candidate draw** (6 random numbers 1–40, no replacement).
2. Build the same **bucket-based features** (with named ranges) for that draw.
3. Use the best cross-validated model to predict a 6D output.
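4. Round the predictions and wrap them into the 1–40 range with modulo.

This is an exploration of "pattern resonance" rather than real prediction: the bucket features are computed from the candidate draw itself, so the output mostly echoes the candidate's own numbers.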
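
The wrap in the last step can be tried in isolation. This is a minimal sketch with a hypothetical helper name (`wrap_to_range`) and made-up raw outputs; it uses Python's built-in `round`, which, like `np.round`, rounds exact halves to the nearest even integer.

```python
def wrap_to_range(value: float, max_number: int = 40) -> int:
    """Round to the nearest integer, then wrap into 1..max_number."""
    return (round(value) - 1) % max_number + 1

raw = [0.4, 1.0, 39.6, 40.7, 41.2, -1.0]   # hypothetical raw model outputs
wrapped = [wrap_to_range(v) for v in raw]
print(wrapped)
```

Wrapping can map distinct raw values to the same number (both 40 and 1 appear more than once here), so a predicted set is not guaranteed to contain six distinct numbers.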


In [9]:
def build_features_from_draw(draw_numbers, feature_columns, max_bucket_local=None, bucket_size_local=5):
    """Build a one-row feature DataFrame for a candidate draw using the same
    bucket logic and feature columns as the training data.
    """
    draw_numbers = np.array(draw_numbers, dtype=int)
    if max_bucket_local is None:
        max_bucket_local = max_bucket

    # bucket counts
    counts = np.zeros(max_bucket_local + 1, dtype=int)
    for n in draw_numbers:
        b = num_to_bucket(n, bucket_size=bucket_size_local)
        if 0 <= b <= max_bucket_local:
            counts[b] += 1

    row = {}

    # bucket counts and presence using named columns
    for i in range(max_bucket_local + 1):
        count_col = bucket_index_to_count_col[i]
        present_col = bucket_index_to_present_col[i]
        if count_col in feature_columns:
            row[count_col] = counts[i]
        if present_col in feature_columns:
            row[present_col] = int(counts[i] > 0)

    # bucket_energy
    if "bucket_energy" in feature_columns:
        row["bucket_energy"] = sum(i * counts[i] for i in range(max_bucket_local + 1))

    # helper + date features: filled with placeholder zeros (not derived from the draw)
    defaults = {
        "From Last": 0,
        "Same As Day": 0,
        "Odd": 0,
        "Even": 0,
        "Year": 0,
        "Month": 0,
        "DayOfWeek": 0,
    }
    for col, val in defaults.items():
        if col in feature_columns and col not in row:
            row[col] = val

    # ensure all feature_columns exist
    for col in feature_columns:
        if col not in row:
            row[col] = 0

    return pd.DataFrame([row], columns=feature_columns)

# Example: generate a few candidate draws and predict
num_predictions = 5
for i in range(num_predictions):
    candidate_numbers = np.sort(np.random.choice(np.arange(1, 41), size=6, replace=False))
    input_df = build_features_from_draw(candidate_numbers, feature_cols)
    pred = best_model.predict(input_df.values)[0]  # shape (6,)

    # Round predictions, then wrap them into the 1–40 range
    predicted_numbers = ((np.round(pred).astype(int) - 1) % 40) + 1
    predicted_numbers = np.sort(predicted_numbers)

    print(f"Prediction set {i+1}:")
    print("  Candidate base numbers:", candidate_numbers)
    print("  Model predicted numbers:", predicted_numbers)
    print("-" * 60)

Prediction set 1:
  Candidate base numbers: [ 5 13 16 17 20 27]
  Model predicted numbers: [ 2 13 16 17 20 29]
------------------------------------------------------------
Prediction set 2:
  Candidate base numbers: [ 1  5 19 23 27 33]
  Model predicted numbers: [ 1  3 16 22 27 33]
------------------------------------------------------------
Prediction set 3:
  Candidate base numbers: [ 4  6 10 19 36 38]
  Model predicted numbers: [ 2  7 10 16 37 38]
------------------------------------------------------------
Prediction set 4:
  Candidate base numbers: [ 2  8 25 28 31 36]
  Model predicted numbers: [ 1  9 23 27 34 38]
------------------------------------------------------------
Prediction set 5:
  Candidate base numbers: [ 2 14 15 17 18 21]
  Model predicted numbers: [ 1 12 13 14 18 23]
------------------------------------------------------------
