How to train a model using very huge initial datasets : methodological advices needed #1006
-
DescriptionUsers who have a lot of data (hundreds of giga bytes) often struggle when training a model. They need a how to in order to sample the input data to avoid using the whole dataset at the first stages. Questions/IdeasA tutorial is welcome to guide step by step the user. An important demand is to respect the initial data distribution especially if it is highly unbalanced. |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment
-
Training a Model with a Huge DatabaseContext of Big DataWhen dealing with a real data mining problem, you may have a huge database, potentially containing For example, if your entity is a customer, you may have many logs per customer:
At Orange telecommunications, there are tens of millions of customers in France, each with ObjectiveThe objective is to train a model efficiently. You need to build a flat analysis dataset, with instances in rows and variables in columns, to train Processing the whole database while constructing many features is not an option, given the Note For example, if you have a database of 10 million customers, with 100 logs per customer, and want
This type of analysis is possible with Khiops, but it will require very large computational resources (disk, RAM, CPU). Methodology for Training and Deploying a ModelYou need to extract an analysis dataset from the whole database to perform your data mining study Three steps are necessary:
The first and last steps are data-intensive, as they involve scanning the whole database, while the Analysis DatasetThe analysis dataset needs to be:
RepresentativeYour dataset must fall within the same perimeter as your study. For example, if the business Within this perimeter, your instances must be chosen at random. For example, if you need an analysis Large EnoughYour analysis dataset must be large enough to provide reliable results, in order to determine whether As a rule of thumb, the standard deviation of an evaluation criterion (e.g. accuracy) is around For example, in a political poll involving tens of millions of voters, samples are usually taken of The following table indicates error margins for several sample sizes:
Choosing a sample size of between 1,000 and 10,000 looks like a good trade-off, as it provides good Note If your dataset involves categorical variables with many values (e.g. first name or city zip code), Case of Unbalanced DatasetsThe rule of thumb applies to balanced datasets: see Q&A https://github.com/orgs/KhiopsML/discussions/474. In this case, it might be a good idea to extract an enriched analysis dataset from the analysis Note Using an enriched analysis dataset can speed up training, but the evaluation criteria will be Note For example, in a churn study involving an analysis dataset of size 100,000 with 1,000 churners, Reasonable SizeA data mining study involves many tests and trials using your analysis dataset. You must first divide your analysis dataset into:
The training phase involves several scans of the data, for data preparation, modeling, and Note that the requirements in time, energy, and cloud costs are at least proportional to the sample Note For example, with a training database of 10 million instances, the training time with a sample of Using Khiops for the Whole ProcessExtraction of the Analysis DatasetWhen you extract an analysis dataset from the whole database, you need to extract a representative With Khiops, you simply perform a Deploy model applied to the whole database to extract your
In the case of a multi-table database, you simply specify one file per secondary table to obtain the Training a Model on the Analysis DatasetWith your analysis dataset, you can perform your data mining study, evaluate different modeling With Khiops, use Train model with the default 70% train-test split, and evaluate training In the case of an unbalanced dataset, you can extract an enriched analysis dataset to speed up In the case of a huge dataset with highly unbalanced classes, even the analysis dataset may be too large to process efficiently. Choose the best modeling scenario with the best evaluation criteria, and verify that the business Tip Do not spend too much time refining the trained model if you already obtain a good enough Note Remember that the scoring process is just one step within a project, which may involve subsequent Deploying the Model on the Whole DatabaseWith Khiops, you simply perform a Deploy model applied to the whole database to build a score Tip A simpler model will be faster and cheaper to deploy on the whole database, and may be sufficient |
Beta Was this translation helpful? Give feedback.
Training a Model with a Huge Database
Context of Big Data
When dealing with a real data mining problem, you may have a huge database, potentially containing
terabytes of data. This is likely to happen in the case of a multi-table database, where your main
analyzed instances are described with detailed records stored in secondary tables.
For example, if your entity is a customer, you may have many logs per customer:
At Orange telecommunications, there are tens of millions of customers in France, each with
thousands of logs in secondary tables, amounting to tens of billions of logs at terabyte …