How to train a model using very huge initial datasets : methodological advices needed #1006

tramora · 2026-04-23T09:34:51Z

tramora
Apr 23, 2026
Collaborator

Description

Users who have a lot of data (hundreds of giga bytes) often struggle when training a model.

They need a how to in order to sample the input data to avoid using the whole dataset at the first stages.

Questions/Ideas

A tutorial is welcome to guide step by step the user.

An important demand is to respect the initial data distribution especially if it is highly unbalanced.

Answered by marcboulle

Apr 24, 2026

Training a Model with a Huge Database

Context of Big Data

When dealing with a real data mining problem, you may have a huge database, potentially containing
terabytes of data. This is likely to happen in the case of a multi-table database, where your main
analyzed instances are described with detailed records stored in secondary tables.

For example, if your entity is a customer, you may have many logs per customer:

Purchase details in marketing
Call detail records in telecommunications
Web navigation logs
...

At Orange telecommunications, there are tens of millions of customers in France, each with
thousands of logs in secondary tables, amounting to tens of billions of logs at terabyte …

View full answer

marcboulle · 2026-04-24T09:58:40Z

marcboulle
Apr 24, 2026
Maintainer

Training a Model with a Huge Database

Context of Big Data

When dealing with a real data mining problem, you may have a huge database, potentially containing
terabytes of data. This is likely to happen in the case of a multi-table database, where your main
analyzed instances are described with detailed records stored in secondary tables.

For example, if your entity is a customer, you may have many logs per customer:

Purchase details in marketing
Call detail records in telecommunications
Web navigation logs
...

At Orange telecommunications, there are tens of millions of customers in France, each with
thousands of logs in secondary tables, amounting to tens of billions of logs at terabyte scale.

Objective

The objective is to train a model efficiently.

You need to build a flat analysis dataset, with instances in rows and variables in columns, to train
your model. Given the multi-table database, you have to construct features per instance from the
secondary records to obtain your flat analysis dataset.

Processing the whole database while constructing many features is not an option, given the
prohibitive computational resources required.

Note

For example, if you have a database of 10 million customers, with 100 logs per customer, and want
to construct 100,000 features per customer, you need:

8 TB of disk space to store your analysis dataset: 10 million × 100,000 features = 1,000 billion
values, at approximately 8 bytes per value
100,000 billion read and computation operations from log values, as each constructed value is
computed from 100 logs on average

This type of analysis is possible with Khiops, but it will require very large computational resources (disk, RAM, CPU).

Methodology for Training and Deploying a Model

You need to extract an analysis dataset from the whole database to perform your data mining study
efficiently.

Three steps are necessary:

Extract an analysis dataset from the whole database
Perform your data mining study on the analysis dataset, to evaluate different modeling scenarios
and choose the best one
Deploy the best model on the whole database, to get predictions for all instances

The first and last steps are data-intensive, as they involve scanning the whole database, while the
second step is computationally intensive, as it involves many tests and trials of modeling scenarios.

Analysis Dataset

The analysis dataset needs to be:

Representative of the data under study
Large enough to reduce the variance of the results
A reasonable size, to enable fast analysis times

Representative

Your dataset must fall within the same perimeter as your study. For example, if the business
objective is a marketing campaign targeting young customers in big cities, your analysis dataset must
be within the same perimeter.

Within this perimeter, your instances must be chosen at random. For example, if you need an analysis
dataset containing 100,000 instances, do not take the first 100,000 instances from your database, or
those with the first 100,000 identifiers. You really need a random sample.

Large Enough

Your analysis dataset must be large enough to provide reliable results, in order to determine whether
the business objectives are fulfilled with good confidence.

As a rule of thumb, the standard deviation of an evaluation criterion (e.g. accuracy) is around
$\frac{1}{\sqrt{n}}$ for a sample of size $n$.

For example, in a political poll involving tens of millions of voters, samples are usually taken of
size $n = 1{,}000$, implying a margin of error of around 3% ($\frac{1}{\sqrt{1000}} \approx 0.03$).

The following table indicates error margins for several sample sizes:

Sample size	Error margin
100	10%
1,000	3%
10,000	1%
100,000	0.3%
1,000,000	0.1%

Choosing a sample size of between 1,000 and 10,000 looks like a good trade-off, as it provides good
confidence in the evaluation criteria, with a small error margin, while remaining fast to process
during the data mining study.

Note

If your dataset involves categorical variables with many values (e.g. first name or city zip code),
a larger sample may be necessary to capture enough occurrences of each value and potentially
extract more information from these variables.

Case of Unbalanced Datasets

The rule of thumb applies to balanced datasets: see Q&A https://github.com/orgs/KhiopsML/discussions/474.
In the case of unbalanced datasets with a rare minority class (e.g. churner), you should ensure the requested sample size is met
for the rare class. For example, in a churn study with around 1% churners in your database, to have between 1,000 and 10,000 churners in your sample, as recommended, you need a sample of size between 100,000 and 1,000,000.

In this case, it might be a good idea to extract an enriched analysis dataset from the analysis
dataset, containing all instances of the rare class and a subset of the frequent ones. As a rule of
thumb, a sample of approximately 10 times the number of rare instances is a good trade-off. The
quality of your model will be roughly the same, while being far faster to train.

Note

Using an enriched analysis dataset can speed up training, but the evaluation criteria will be
biased, as the proportion of churners will differ from that in the whole database. To obtain a
correct evaluation, you can use the whole analysis dataset, with the correct proportion of
churners, to evaluate the model trained on the enriched analysis dataset.

Note

For example, in a churn study involving an analysis dataset of size 100,000 with 1,000 churners,
you can extract an enriched analysis sub-dataset containing all 1,000 churners and around 10,000
non-churners. You obtain a sub-dataset about 10 times smaller, which is far faster to process
than the whole analysis dataset.

Reasonable Size

A data mining study involves many tests and trials using your analysis dataset.

You must first divide your analysis dataset into:

Train dataset: to train the model
Test dataset: to obtain a fair evaluation of the model's performance on the whole database

The training phase involves several scans of the data, for data preparation, modeling, and
evaluation. This process must be repeated to evaluate different modeling scenarios, with varying
numbers of constructed features in the multi-table setting, varying numbers of trees, etc. To be
able to evaluate many modeling scenarios rapidly, you should adopt a fail-fast strategy and allow
train-test round trips to complete as quickly as possible.

Note that the requirements in time, energy, and cloud costs are at least proportional to the sample
size.

Note

For example, with a training database of 10 million instances, the training time with a sample of
size 100,000 will be at least 100 times smaller than with the whole database. You need to extract
the analysis dataset once and for all, to avoid scanning the whole database each time you evaluate
a new modeling scenario.

Using Khiops for the Whole Process

Extraction of the Analysis Dataset

When you extract an analysis dataset from the whole database, you need to extract a representative
sample of your analyzed instances, together with their secondary records for each secondary table of
the multi-table schema.

With Khiops, you simply perform a Deploy model applied to the whole database to extract your
analysis dataset:

In the data dictionary (.kdic), use all native variables (not derived) and do not use any
derived variables, in order to obtain tables with exactly the same structure
Define the perimeter of your study using a selection variable that describes the instances to
collect
- e.g. to select customers under 40 years old, use a derivation rule such as:
```
Unused Numerical InPerimeter = LE(age, 40);
```
Use a sampling percentage to obtain an analysis dataset of the desired size

In the case of a multi-table database, you simply specify one file per secondary table to obtain the
secondary records for the selected instances.

Training a Model on the Analysis Dataset

With your analysis dataset, you can perform your data mining study, evaluate different modeling
scenarios, and choose the best one.

With Khiops, use Train model with the default 70% train-test split, and evaluate training
scenarios by increasing complexity, starting with few constructed features (e.g. 1,000 constructed
features and no trees), up to at most 100,000 features and 1,000 trees.

In the case of an unbalanced dataset, you can extract an enriched analysis dataset to speed up
training. For example, to extract all churners and 10% of the other instances, use a selection
variable such as:

// Return 1 if the class is "Churner", 1 otherwise if Random() is less than or equal to 0.1, 0 otherwise
Unused Numerical InEnrichedSample = If(EQc(Class, "Churner"), 1, LE(Random(), 0.1));

In the case of a huge dataset with highly unbalanced classes, even the analysis dataset may be too large to process efficiently.
In this case, you can directly extract the enriched analysis dataset by combining a selection variable and a sampling percentage within the same Deploy model run.

Choose the best modeling scenario with the best evaluation criteria, and verify that the business
objectives are fulfilled.

Tip

Do not spend too much time refining the trained model if you already obtain a good enough
evaluation criterion. Improving test accuracy by 0.2% may not be worthwhile if it is well below
the error margin related to the size of your dataset. Do not forget all the business objectives of
your study, such as interpretability, deployment time, and costs, which may be more important than
a small improvement in accuracy.

Note

Remember that the scoring process is just one step within a project, which may involve subsequent
steps with large variance. For example, the effectiveness of a marketing campaign is likely to vary
considerably depending on the quality of the messages sent to customers.

Deploying the Model on the Whole Database

With Khiops, you simply perform a Deploy model applied to the whole database to build a score
for all customers within your perimeter.

Tip

A simpler model will be faster and cheaper to deploy on the whole database, and may be sufficient
for your business objectives.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Khiops

How to train a model using very huge initial datasets : methodological advices needed #1006

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Khiops

How to train a model using very huge initial datasets : methodological advices needed #1006

Uh oh!

tramora Apr 23, 2026 Collaborator

Description

Questions/Ideas

Training a Model with a Huge Database

Context of Big Data

Replies: 1 comment

Uh oh!

Uh oh!

marcboulle Apr 24, 2026 Maintainer

Training a Model with a Huge Database

Context of Big Data

Objective

Methodology for Training and Deploying a Model

Analysis Dataset

Representative

Large Enough

Case of Unbalanced Datasets

Reasonable Size

Using Khiops for the Whole Process

Extraction of the Analysis Dataset

Training a Model on the Analysis Dataset

Deploying the Model on the Whole Database

tramora
Apr 23, 2026
Collaborator

marcboulle
Apr 24, 2026
Maintainer