The datasets.map() implementation modifies the datatype of os.environ object #2115

Closed
leleamol opened this issue Mar 25, 2021 · 0 comments · Fixed by #2119

In our testing, we noticed that the datasets.map() implementation changes the type of Python's os.environ object from '_Environ' to 'dict'.

This causes subsequent calls that pass a keyword argument to os.environ.get() to fail, for example:

x = os.environ.get("TEST_ENV_VARIABLE_AFTER_dataset_map", default=None)
TypeError: get() takes no keyword arguments

It looks like the following line in the datasets.map() implementation introduced this behavior:

os.environ = prev_env

Here is the test script to reproduce this error.

from datasets import load_dataset
from transformers import AutoTokenizer
import os


def test_train():
    model_checkpoint = "distilgpt2"
    datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token

    def tokenize_function(examples):
        y = tokenizer(examples['text'], truncation=True, max_length=64)
        return y

    # Before map(): os.environ is an _Environ, so get() accepts default= as a keyword.
    x = os.environ.get("TEST_ENV_VARIABLE_BEFORE_dataset_map", default=None)
    print(f"Testing environment variable: TEST_ENV_VARIABLE_BEFORE_dataset_map {x}")
    print(f"Data type of os.environ before datasets.map = {os.environ.__class__.__name__}")
    # num_proc=2 is the multiprocessing setting used when the type change was observed.
    datasets.map(tokenize_function, batched=True, num_proc=2, remove_columns=["text"])
    print(f"Data type of os.environ after datasets.map = {os.environ.__class__.__name__}")
    # After map(): os.environ is a plain dict, so this call raises TypeError.
    x = os.environ.get("TEST_ENV_VARIABLE_AFTER_dataset_map", default=None)
    print(f"Testing environment variable: TEST_ENV_VARIABLE_AFTER_dataset_map {x}")


if __name__ == "__main__":
    test_train()

