Implemented progress bar in dataset download #1336

megansu13 · 2024-04-23T20:27:42Z

Modified datasets/dataset.py to add progress bar using tqdm
Added new script run_openml.py for example usage of OpenML with progress bar

Reference Issue

What does this PR implement/fix? Explain your changes.

Implemented progress bar using tqdm. Within datasets/dataset.py, I imported requests and tqdm,
modified _download_data() and _parse_data_from_arff() methods, as well as added run_openml.py.

Here is what run_openml.py does:

Fetching Dataset: It loads a dataset from OpenML with a specific ID (in your example, it's ID 1471, which is often the Iris dataset). The get_dataset function has the ability to download the dataset, its associated qualities, and feature metadata if not already present.
Extracting Data: It retrieves the features X and the target variable y from the dataset. The target is specified by the dataset's default target attribute.
Data Exploration: It prints out the first few entries of the feature set X using X.head(), gives a statistical summary of X with X.describe(), and shows the distribution of different classes/targets in y with y.value_counts().
Data Splitting: It splits the data into training and testing sets with an 80-20 ratio using train_test_split. The random_state parameter ensures that the split is reproducible.
Model Training: It initializes a Random Forest classifier, trains it on the training set, and then predicts on the testing set.
Evaluation: It evaluates the predictions by computing the accuracy score, comparing the predicted target values y_pred against the true values y_test.
Progress Bar: It demonstrates a simple progress bar using tqdm, which iterates through a range of 10, pausing for 0.5 seconds at each step.

How should this PR be tested?

This PR should be tested by running:
cd openml
python3 run_openml.py

Any other comments?

Let me know if there is anything else I should provide, I am happy to help!

- Modified datasets/dataset.py to add progress bar using tqdm - Added new script run_openml.py for example usage of OpenML with progress bar

for more information, see https://pre-commit.ci

PGijsbers · 2024-04-24T13:51:18Z

Hi! 👋 First off, thanks so much for taking the effort to contribute to OpenML! It's always a big step to make a first contribution, so we really appreciate it.

Unfortunately, it seems that our efforts got crossed a little bit. A progress bar for MinIO downloads has been added in #1335. However, this PR does also create a progress bar when downloading ARFF files, which is nice (at least for the larger arff files).

There are also a few things about the PR that would need changing regardless (I don't advise you take this step until we see how we handle the multiple PRs, but I wanted to include it for any future work):

The PR contains commented out code - we don't want that in our code base. Comments are used to explain why certain code exists and, in rare cases, to explain what code does. Commented out code just makes the source code harder to navigate.
The behavior of the _download_data function has been changed. self.data_file and self.parquet_file should be given a value in this function that correspond to the paths on disk. It is likely other parts of the API break if this is not done.
The functions _get_dataset_arff and _get_dataset_parquet don't just download the files (like your new function to replace them does), they also take care of caching and calculating checksums to verify the downloads were successful. These functionalities are crucial for our users, so we cannot simply replace them by this new function.
The new function which loads arff does not take into account the metadata when deciding the types of data in the dataframe. This will almost certainly lead to datatypes being inferred incorrectly (or at least, inconsistently).
A usage example to demonstrate the PR can be provided directly on Github as a code listing. Or, if appropriate, should be added to the documentation. We do not want extra example scripts in other parts of our project structure.

To avoid this from happening again, if you want to contribute, the best way to go about this is to either comment on an open issue to let us know you are interested in it, or open a new issue indicating you want to set up a PR for this. That way, we can first check whether we are (still) interested in the feature, ensure that no one else has been working on it, and provide some pointers on how we think it should best be implemented.

I will go ahead and close this PR, as it would need major rewrites. This does not mean we don't appreciate the work and are not open to work in this direction! It is just for our own organisation, so we can see the state of PRs.

We would be open to having a progress bar for (large) arff file downloads to accompany #1335, but the changes should be much less invasive. I think a good place to start would be to look at this function and the functions it calls to actually download the file. If you are still interested in this, I would propose that you have a look at that, and maybe share your findings and make a suggestion in the related issue #1333 before working on a new PR. Looking forward to hearing from you :)

megansu13 and others added 2 commits April 23, 2024 16:05

Implemented progress bar in dataset download

92c83ee

- Modified datasets/dataset.py to add progress bar using tqdm - Added new script run_openml.py for example usage of OpenML with progress bar

[pre-commit.ci] auto fixes from pre-commit.com hooks

9530d5f

for more information, see https://pre-commit.ci

PGijsbers closed this Apr 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implemented progress bar in dataset download #1336

Implemented progress bar in dataset download #1336

megansu13 commented Apr 23, 2024

PGijsbers commented Apr 24, 2024 •

edited

Loading

Implemented progress bar in dataset download #1336

Implemented progress bar in dataset download #1336

Conversation

megansu13 commented Apr 23, 2024

Reference Issue

What does this PR implement/fix? Explain your changes.

How should this PR be tested?

Any other comments?

PGijsbers commented Apr 24, 2024 • edited Loading

PGijsbers commented Apr 24, 2024 •

edited

Loading