More info on model training #1

@nick-youngblut

Description

From the associated manuscript:

> In order to predict the SNPs number of strains at their SNP saturation state, the sequencing coverage, sequencing depth, relative abundance, genome length, SNP number, SNP density and saturated SNP number of strains in the aforementioned subsamples and ultra-deep samples were used to construct a data set. Only saturated strains in our data were used here. The data set is divided into training set and test set according to the ratio of 4:1, and the python package scikit-learn (Pedregosa et al., 2011) was used to train a linear regression model and a random forest regression model, respectively.

From this description, it appears that multiple subsamples of the same metagenomes were used for both training and testing. These subsamples are not independent, since they derive from the same underlying metagenome. A random 4:1 split can therefore leak information between the train and test subsets, which tends to produce optimistically biased test accuracies and masks overfitting.

Can you please provide more information about how the ML train/test split was conducted, especially regarding the potential for data leakage?

If data leakage has occurred, maybe you could re-train the model while blocking by metagenome sample (e.g., 2 of the 3 metagenomes are used for training, while the 3rd is used for the independent test)?
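To make the suggestion concrete, here is a minimal sketch of such a grouped split using scikit-learn's `GroupShuffleSplit`. The data below are synthetic placeholders (the feature matrix, targets, and metagenome IDs are assumptions, not the manuscript's actual data); in practice `X` would hold the strain features listed in the quoted methods (coverage, depth, relative abundance, genome length, SNP number, SNP density), `y` the saturated SNP number, and `groups` the source metagenome of each subsample.

```python
# Sketch: block the train/test split by metagenome, so subsamples from the
# same metagenome never land on both sides of the split.
# All data here are synthetic placeholders, not the manuscript's data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 90
X = rng.normal(size=(n, 6))        # 6 placeholder strain features
y = rng.normal(size=n)             # placeholder saturated-SNP targets
groups = np.repeat([0, 1, 2], 30)  # metagenome ID of each subsample

# Hold out ~1/3 of the *groups* (here, one whole metagenome) as the
# independent test set; GroupShuffleSplit never splits a group.
gss = GroupShuffleSplit(n_splits=1, test_size=1 / 3, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# Sanity check: no metagenome contributes rows to both subsets.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])

model = RandomForestRegressor(random_state=0)
model.fit(X[train_idx], y[train_idx])
print(model.score(X[test_idx], y[test_idx]))
```

With only three metagenomes, `LeaveOneGroupOut` would give a leave-one-metagenome-out estimate over all three held-out choices, which is probably the more robust report here.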
