# Digging into Dataset Download
- The module path and the data path exist separately.

## Step 1. Getting the Data Preprocessing Script
- There is a step that copies the script file to `module_path` with `shutil`.
- Internally, the path is usually created under `{USER_PATH}\.cache\huggingface\modules\datasets_modules\datasets\`.

In [1]:
from datasets import prepare_module

In [2]:
module_path, hash_, resolved_file_path = prepare_module(
    path="./wlda/wikitext.py",
    cache_dir="data",
    return_resolved_file_path=True,
)

In [3]:
module_path

'datasets_modules.datasets.wikitext.8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af.wikitext'

In [4]:
hash_

'8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af'

In [5]:
resolved_file_path

'./wlda/wikitext.py'

A `wikitext.json` file like the one below is also generated, and the script is copied as well.

In [6]:
{
    "original file path": "./wlda/wikitext.py", 
    "local file path": "C:\\Users\\jinma\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\wikitext\\24181ecf0c80364527ddc5234d7e27fd0266aa176c88772db341f191e86e6e14\\wikitext.py"
}

{'original file path': './wlda/wikitext.py',
 'local file path': 'C:\\Users\\jinma\\.cache\\huggingface\\modules\\datasets_modules\\datasets\\wikitext\\24181ecf0c80364527ddc5234d7e27fd0266aa176c88772db341f191e86e6e14\\wikitext.py'}
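
As a rough sketch of the copy step described in Step 1 (not the library's actual implementation), the snippet below copies a script into a hash-named folder with `shutil` and writes a small JSON file beside it. The helper name `cache_script` and the choice to hash the file contents are assumptions made for illustration; only the JSON keys mirror the `wikitext.json` shown above.

```python
import hashlib
import json
import os
import shutil
import tempfile


def cache_script(script_path, modules_root):
    """Copy a script into <root>/<name>/<hash>/ and record its origin (hypothetical helper)."""
    name = os.path.splitext(os.path.basename(script_path))[0]
    with open(script_path, "rb") as f:
        # Assumption: the hash is computed over the file contents.
        script_hash = hashlib.sha256(f.read()).hexdigest()
    target_dir = os.path.join(modules_root, name, script_hash)
    os.makedirs(target_dir, exist_ok=True)
    local_path = os.path.join(target_dir, name + ".py")
    shutil.copyfile(script_path, local_path)
    meta = {"original file path": script_path, "local file path": local_path}
    meta_path = os.path.join(target_dir, name + ".json")
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(meta, f)
    return local_path, script_hash, meta_path


# Demo with a throwaway script in a temporary directory.
workdir = tempfile.mkdtemp()
script = os.path.join(workdir, "wikitext.py")
with open(script, "w", encoding="utf-8") as f:
    f.write("# dataset script placeholder\n")

local_path, script_hash, meta_path = cache_script(
    script, os.path.join(workdir, "modules")
)
print(local_path)
```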

## Step 2. Importing the Main Class from the Processing Script
- I have run through this logic several times, so I trust that I already know it.
- Consider writing a tutorial later.

In [7]:
from datasets import import_main_class

In [8]:
builder_cls = import_main_class(module_path, dataset=True)

In [9]:
builder_cls

datasets_modules.datasets.wikitext.8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af.wikitext.Wikitext
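
For reference, the general idea behind picking a "main class" out of a module can be sketched as follows. This is an illustration under assumptions, not the library's code: `BaseBuilder`, `find_main_class`, and the in-memory module are hypothetical stand-ins, and the rule used here is "first concrete subclass of the base class found in the module".

```python
import inspect
import types
from abc import ABC, abstractmethod


class BaseBuilder(ABC):
    """Hypothetical stand-in for a builder base class."""

    @abstractmethod
    def build(self):
        ...


class Wikitext(BaseBuilder):
    """Hypothetical concrete builder, as a dataset script would define it."""

    def build(self):
        return "built"


def find_main_class(module, base_cls):
    """Return the first concrete subclass of base_cls found in module, or None."""
    for obj in vars(module).values():
        if (
            inspect.isclass(obj)
            and issubclass(obj, base_cls)
            and not inspect.isabstract(obj)
        ):
            return obj
    return None


# Build a throwaway module in memory instead of importing a real script.
fake_module = types.ModuleType("fake_wikitext")
fake_module.BaseBuilder = BaseBuilder  # abstract base, must be skipped
fake_module.Wikitext = Wikitext

main_cls = find_main_class(fake_module, BaseBuilder)
print(main_cls.__name__)
```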

## Step 3. Instantiating the DatasetBuilder

In [10]:
config_kwargs = {}

In [11]:
builder_instance = builder_cls(
    cache_dir="data",
    name="wikitext-103-v1",
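    data_dir=None, # What is this? Is it skipped if the data already exists?
    data_files=None, # What is this one? Curious
    hash=hash_, # Can probably be dropped while working locally
    features=None, # If omitted, the features defined in the script are probably used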
    **config_kwargs,
)

- As soon as the object was created, `cache_dir` was created as well.

In [12]:
builder_instance

<datasets_modules.datasets.wikitext.8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af.wikitext.Wikitext at 0x15b71914b38>

In [13]:
builder_instance.__dict__

{'name': 'wikitext',
 'hash': '8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af',
 'config': WikitextConfig(name='wikitext-103-v1', version=1.0.0, data_dir=None, data_files=None, description='raw level dataset. The raw tokens before the addition of <unk> tokens. They should only be used for character level work or for creating newly derived datasets.'),
 'config_id': 'wikitext-103-v1',
 'info': DatasetInfo(description=' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\n', citation='@InProceedings{wikitext,\n    author={Stephen, Merity and Caiming ,Xiong and James, Bradbury and Richard Socher}\n    year={2016}\n}\n', homepage='https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/', license='', features={'text': Value(dtype='string', id=None)}, post

In [14]:
builder_cls.__mro__

(datasets_modules.datasets.wikitext.8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af.wikitext.Wikitext,
 datasets.builder.GeneratorBasedBuilder,
 datasets.builder.DatasetBuilder,
 object)

What happens in `__init__`? First, `GeneratorBasedBuilder` hands off directly to the `__init__` of its parent `DatasetBuilder`. After that, it assigns `_writer_batch_size`.

In [15]:
builder_instance._writer_batch_size is None

True

In [16]:
builder_instance.DEFAULT_WRITER_BATCH_SIZE is None

True

Let's look at the `DatasetBuilder` class. The Builder class is said to be built around the following three key methods.
1. `datasets.DatasetBuilder.info`
2. `datasets.DatasetBuilder.download_and_prepare`
3. `datasets.DatasetBuilder.as_dataset`

Of these, 2 and 3 correspond to Steps 4 and 5; for now, let's look only at object creation.

First, the object's name is assigned by converting the original class name to snake case via `camelcase_to_snakecase`.

In [17]:
builder_instance.__class__.__name__

'Wikitext'

In [18]:
builder_instance.name

'wikitext'
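
A minimal sketch of such a CamelCase to snake_case conversion is shown below. The function name `camel_to_snake` and the single regular expression are my own simplification and may not match how `camelcase_to_snakecase` is actually written.

```python
import re


def camel_to_snake(name):
    """Insert '_' at each lowercase/digit-to-uppercase boundary, then lowercase."""
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()


print(camel_to_snake("Wikitext"))
print(camel_to_snake("GeneratorBasedBuilder"))
```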

The hash is assigned right away as well.

In [19]:
builder_instance.hash

'8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af'

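Next, the config is set up: the signature of `~.BUILDER_CONFIG_CLASS.__init__` is inspected, and if it has a `features` parameter, `features` is put into `config_kwargs`.
- I am not sure whether it compares against `WikitextConfig`.
- Judging from the output below, it appears to compare against `BUILDER_CONFIG_CLASS` only.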

In [20]:
import inspect

In [21]:
builder_instance.BUILDER_CONFIG_CLASS

datasets.builder.BuilderConfig

In [22]:
inspect.signature(
    builder_instance.BUILDER_CONFIG_CLASS.__init__).parameters

mappingproxy({'self': <Parameter "self">,
              'name': <Parameter "name:str='default'">,
              'version': <Parameter "version:Union[str, datasets.utils.version.Version, NoneType]='0.0.0'">,
              'data_dir': <Parameter "data_dir:str=None">,
              'data_files': <Parameter "data_files:Union[Dict, List]=None">,
              'description': <Parameter "description:str=None">})

In [23]:
"features" in inspect.signature(
    builder_instance.BUILDER_CONFIG_CLASS.__init__).parameters and None

False

In our case, features is None. Here, it seems the class's `BUILDER_CONFIGS: List` is inspected to assign the config.

In [24]:
builder_instance.config

WikitextConfig(name='wikitext-103-v1', version=1.0.0, data_dir=None, data_files=None, description='raw level dataset. The raw tokens before the addition of <unk> tokens. They should only be used for character level work or for creating newly derived datasets.')
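
The two ideas above (forwarding `features` only when the config class accepts it, then looking the config up by name in a predefined list) can be sketched roughly like this. Everything here is a hypothetical stand-in: `SimpleConfig`, `SimpleBuilder`, `CONFIG_CLASS`, `CONFIGS`, and `select_config` are illustrative names, and the real library does more (for example, applying overrides to a named config).

```python
import inspect
from dataclasses import dataclass


@dataclass
class SimpleConfig:
    """Hypothetical config class without a `features` parameter."""
    name: str = "default"
    version: str = "0.0.0"


class SimpleBuilder:
    """Hypothetical builder with a list of predefined configs."""
    CONFIG_CLASS = SimpleConfig
    CONFIGS = [
        SimpleConfig("wikitext-2-v1", "1.0.0"),
        SimpleConfig("wikitext-103-v1", "1.0.0"),
    ]


def select_config(builder_cls, name, features=None, **config_kwargs):
    """Pick a predefined config by name, else build a new one from kwargs."""
    params = inspect.signature(builder_cls.CONFIG_CLASS.__init__).parameters
    # Forward `features` only if the config class can accept it.
    if "features" in params and features is not None:
        config_kwargs["features"] = features
    for cfg in builder_cls.CONFIGS:
        if cfg.name == name:
            return cfg
    return builder_cls.CONFIG_CLASS(name=name, **config_kwargs)


named = select_config(SimpleBuilder, "wikitext-103-v1")
custom = select_config(SimpleBuilder, "custom", features={"text": "string"})
print(named)
print(custom)
```

Because `SimpleConfig.__init__` has no `features` parameter, the value passed for `custom` is silently dropped instead of raising a `TypeError`.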

For info, a base `DatasetInfo` object is obtained through the `get_exported_dataset_info` method below.

In [25]:
info = builder_instance.get_exported_dataset_info()
info

DatasetInfo(description=' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified \n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\n', citation='@InProceedings{wikitext,\n    author={Stephen, Merity and Caiming ,Xiong and James, Bradbury and Richard Socher}\n    year=2016\n}\n', homepage='https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/', license='', features={'text': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, builder_name='wikitext', config_name='wikitext-103-v1', version=1.0.0, splits={'test': SplitInfo(name='test', num_bytes=1272551, num_examples=62, dataset_name='wikitext'), 'train': SplitInfo(name='train', num_bytes=535694801, num_examples=29444, dataset_name='wikitext'), 'validation': SplitInfo(name='validation', num_bytes=1134973, num_examples=60, dataset_name='wikitext')}

Next, info is updated using the `_info` method defined on the object.

```python
class Wikitext(datasets.GeneratorBasedBuilder):
    ...
    def _info(self):
        features = datasets.Features(
            {
                "text": datasets.Value("string")
            }
        )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            supervised_keys=None,
            homepage=_URL,
            citation=_CITATION,
        )
```

In [26]:
info.update(builder_instance._info())

In [27]:
info

DatasetInfo(description=' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\n', citation='@InProceedings{wikitext,\n    author={Stephen, Merity and Caiming ,Xiong and James, Bradbury and Richard Socher}\n    year={2016}\n}\n', homepage='https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/', license='', features={'text': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, builder_name='wikitext', config_name='wikitext-103-v1', version=1.0.0, splits={'test': SplitInfo(name='test', num_bytes=1272551, num_examples=62, dataset_name='wikitext'), 'train': SplitInfo(name='train', num_bytes=535694801, num_examples=29444, dataset_name='wikitext'), 'validation': SplitInfo(name='validation', num_bytes=1134973, num_examples=60, dataset_name='wikitext')

In [28]:
info.features

{'text': Value(dtype='string', id=None)}

After setting a few more fields on it, if features has not been defined, the `features` passed as an argument can be used instead.

In [29]:
info.builder_name = builder_instance.name
info.config_name = builder_instance.config.name
info.version = builder_instance.config.version

Next, `data_dir` is prepared!
- If `cache_dir` is given, use it
- Otherwise, use `datasets.config.HF_DATASETS_CACHE`
- Expand `{USER_PATH}` (the user's home directory) to determine data_dir

In [30]:
info

DatasetInfo(description=' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\n', citation='@InProceedings{wikitext,\n    author={Stephen, Merity and Caiming ,Xiong and James, Bradbury and Richard Socher}\n    year={2016}\n}\n', homepage='https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/', license='', features={'text': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, builder_name='wikitext', config_name='wikitext-103-v1', version=1.0.0, splits={'test': SplitInfo(name='test', num_bytes=1272551, num_examples=62, dataset_name='wikitext'), 'train': SplitInfo(name='train', num_bytes=535694801, num_examples=29444, dataset_name='wikitext'), 'validation': SplitInfo(name='validation', num_bytes=1134973, num_examples=60, dataset_name='wikitext')

In [31]:
builder_instance.info

DatasetInfo(description=' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\n', citation='@InProceedings{wikitext,\n    author={Stephen, Merity and Caiming ,Xiong and James, Bradbury and Richard Socher}\n    year={2016}\n}\n', homepage='https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/', license='', features={'text': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, builder_name='wikitext', config_name='wikitext-103-v1', version=1.0.0, splits={'test': SplitInfo(name='test', num_bytes=1272551, num_examples=62, dataset_name='wikitext'), 'train': SplitInfo(name='train', num_bytes=535694801, num_examples=29444, dataset_name='wikitext'), 'validation': SplitInfo(name='validation', num_bytes=1134973, num_examples=60, dataset_name='wikitext')

In [32]:
import os
from datasets import config

config.HF_DATASETS_CACHE

WindowsPath('C:/Users/jinma/.cache/huggingface/datasets')

In [33]:
_cache_dir_root = os.path.expanduser(config.HF_DATASETS_CACHE)
_cache_dir_root

'C:\\Users\\jinma\\.cache\\huggingface\\datasets'

In [34]:
_cache_dir_root = os.path.expanduser("data")
_cache_dir_root

'data'

The root is just a plain root as shown above; the actual storage location is built by appending the dataset name, config name, version, and hash.

In [35]:
_cache_dir = builder_instance._build_cache_dir()
_cache_dir

'data\\wikitext\\wikitext-103-v1\\1.0.0\\8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af'
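
Judging only from the output above, the directory layout looks like `root/name/config_id/version/hash`. A sketch that reproduces that layout is shown below; `build_cache_dir` here is a hypothetical free function, not the builder's `_build_cache_dir`, and the hash value is a placeholder.

```python
import os


def build_cache_dir(root, builder_name, config_id, version, hash_):
    """Join the pieces in the order seen in the notebook output (illustrative)."""
    return os.path.join(root, builder_name, config_id, version, hash_)


cache_dir = build_cache_dir(
    "data", "wikitext", "wikitext-103-v1", "1.0.0", "placeholderhash"
)
print(cache_dir)
```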

When the object is created, only the folder corresponding to `_cache_dir_root` is created.

In [36]:
os.makedirs(_cache_dir_root, exist_ok=True)

And if the file exists, the info file is read from that directory.

## Step 4. Data Download
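- The second of the Builder class's key methods! We will use `download_and_prepare`.
- To summarize the process:
    - If a cache exists, reuse the dataset
    - Otherwise, download from the web (hub or url)

In more detail, the process and results are:
- Check the cache; if it does not exist, continue with the steps below; if it does, return right away
- Download the raw dataset (hub or url)
- Run the `_split_generators` method
- For each split, feed samples line by line from `_generate_examples` into the arrow dataset

The results are:
- LICENSE
- arrow file objects
- the generated dataset_info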

In [37]:
from datasets.utils.file_utils import url_or_path_parent

In [38]:
resolved_file_path

'./wlda/wikitext.py'

In [39]:
url_or_path_parent(resolved_file_path)

'./wlda'

In [40]:
from datasets import DownloadConfig, DownloadManager, GenerateMode

download_mode = GenerateMode(None or GenerateMode.REUSE_DATASET_IF_EXISTS)
download_mode

<GenerateMode.REUSE_DATASET_IF_EXISTS: 'reuse_dataset_if_exists'>

In [41]:
verify_infos = True

In [42]:
download_config = DownloadConfig(    
    cache_dir=os.path.join(builder_instance._cache_dir_root, "downloads"),
    force_download=False,
    use_etag=False,
    use_auth_token=False,
)  # We don't use etag for data files to speed up the process

download_config.cache_dir

'data\\downloads'

In [43]:
builder_instance.config.data_dir

In [44]:
dl_manager = DownloadManager(
    dataset_name=builder_instance.name,
    download_config=download_config,
    data_dir=builder_instance.config.data_dir, # None, not set
    base_path=url_or_path_parent(resolved_file_path), # ./wlda
)

In [45]:
dl_manager.manual_dir

In [46]:
dl_manager.downloaded_size

0

In [47]:
# Prevent parallel disk operations
lock_path = os.path.join(
    builder_instance._cache_dir_root, 
    builder_instance._cache_dir.replace(os.sep, "_") + ".lock"
)
lock_path

'data\\data_wikitext_wikitext-103-v1_1.0.0_8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af.lock'

In [48]:
from datasets.utils.filelock import FileLock

In [49]:
import shutil
import contextlib
from datasets import utils
from datasets.utils.file_utils import is_remote_url

### 1. When there is no cache file or it is not reused

In [50]:
from datasets.arrow_reader import (
    ArrowReader, HF_GCP_BASE_URL, DatasetNotOnHfGcs, MissingFilesOnHfGcs
)
from datasets.info import DatasetInfo
from datasets.splits import SplitDict
from datasets.utils.info_utils import verify_checksums

In [51]:
filelock = FileLock(lock_path)
filelock.acquire() # ENTER

<datasets.utils.filelock._Acquire_ReturnProxy at 0x15b781d3588>

In [52]:
data_exists = os.path.exists(builder_instance._cache_dir)
print(f"data_exists: {data_exists}")
assert not data_exists, "Please remove the files for this experiment"

data_exists: True


AssertionError: Please remove the files for this experiment

In [None]:
not is_remote_url(builder_instance._cache_dir_root)

In [None]:
# Is there enough disk space?
not utils.has_sufficient_disk_space(
    builder_instance.info.size_in_bytes or 0,
    # 'C:\\Users\\jinma\\.cache\\huggingface\\datasets'
    directory=builder_instance._cache_dir_root
) # no IOError raised

In [None]:
from datasets import utils

In [None]:
# Print is intentional: we want this to always go to stdout so user has
# information needed to cancel download/preparation if needed.
# This comes right before the progress bar.
print(
    f"Downloading and preparing dataset {builder_instance.info.builder_name}/{builder_instance.info.config_name} "
    f"(download: {utils.size_str(builder_instance.info.download_size)}, generated: {utils.size_str(builder_instance.info.dataset_size)}, "
    f"post-processed: {utils.size_str(builder_instance.info.post_processing_size)}, "
    f"total: {utils.size_str(builder_instance.info.size_in_bytes)}) to {builder_instance._cache_dir}..."
)

In [53]:
from dataclasses import asdict

In [54]:
asdict(builder_instance.info)

{'description': ' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\n',
 'citation': '@InProceedings{wikitext,\n    author={Stephen, Merity and Caiming ,Xiong and James, Bradbury and Richard Socher}\n    year={2016}\n}\n',
 'homepage': 'https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/',
 'license': '',
 'features': {'text': {'dtype': 'string', 'id': None, '_type': 'Value'}},
 'post_processed': None,
 'supervised_keys': None,
 'builder_name': 'wikitext',
 'config_name': 'wikitext-103-v1',
 'version': {'version_str': '1.0.0',
  'description': None,
  'major': 1,
  'minor': 0,
  'patch': 0},
 'splits': {'test': {'name': 'test',
   'num_bytes': 1272551,
   'num_examples': 62,
   'dataset_name': 'wikitext'},
  'train': {'name': 'train',
   'num_bytes': 535694801,


- The `incomplete_dir` context manager!

```python
@contextlib.contextmanager
def incomplete_dir(dirname):
    """Create temporary dir for dirname and rename on exit."""
    if is_remote_url(dirname):
        yield dirname
    else:
        tmp_dir = dirname + ".incomplete"
        os.makedirs(tmp_dir, exist_ok=True)
        try:
            yield tmp_dir
            if os.path.isdir(dirname):
                shutil.rmtree(dirname)
            os.rename(tmp_dir, dirname)
        finally:
            if os.path.exists(tmp_dir):
                shutil.rmtree(tmp_dir)
```
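
To see what this context manager buys us, here is a small self-contained demo. It reuses the same logic with the remote-URL branch left out (my simplification), and the directory names are arbitrary: on success the `.incomplete` directory is renamed to the final name, and on failure it is removed so no half-written dataset is left behind.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def incomplete_dir(dirname):
    """Local-only variant of the context manager above (remote branch omitted)."""
    tmp_dir = dirname + ".incomplete"
    os.makedirs(tmp_dir, exist_ok=True)
    try:
        yield tmp_dir
        if os.path.isdir(dirname):
            shutil.rmtree(dirname)
        os.rename(tmp_dir, dirname)
    finally:
        if os.path.exists(tmp_dir):
            shutil.rmtree(tmp_dir)


root = tempfile.mkdtemp()
ok_dir = os.path.join(root, "ok")
bad_dir = os.path.join(root, "bad")

# Success: the temporary directory is renamed to the final name on exit.
with incomplete_dir(ok_dir) as tmp:
    with open(os.path.join(tmp, "data.txt"), "w") as f:
        f.write("done")

# Failure: the exception propagates and the temporary directory is removed.
try:
    with incomplete_dir(bad_dir) as tmp:
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(os.path.exists(ok_dir), os.path.exists(bad_dir))
```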

- The `utils.py_utils.temporary_assignment` context manager!

```python
@contextlib.contextmanager
def temporary_assignment(obj, attr, value):
    """Temporarily assign obj.attr to value."""
    original = getattr(obj, attr, None)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        setattr(obj, attr, original)
```
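
A short usage demo of the same pattern follows. The `Builder` class and the paths are hypothetical placeholders; the point is only that the attribute is restored even when the body raises.

```python
import contextlib


@contextlib.contextmanager
def temporary_assignment(obj, attr, value):
    """Temporarily assign obj.attr to value (same logic as the excerpt above)."""
    original = getattr(obj, attr, None)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        setattr(obj, attr, original)


class Builder:
    """Hypothetical stand-in for the builder instance."""

    def __init__(self):
        self._cache_dir = "data/final"


builder = Builder()
seen = []
try:
    with temporary_assignment(builder, "_cache_dir", "data/final.incomplete"):
        seen.append(builder._cache_dir)  # value inside the block
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
seen.append(builder._cache_dir)  # value after the block
print(seen)
```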

In [55]:
dirname = builder_instance._cache_dir
dirname

'data\\wikitext\\wikitext-103-v1\\1.0.0\\8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af'

In [56]:
is_remote_url(dirname)

False

In [57]:
tmp_dir = dirname + ".incomplete"
os.makedirs(tmp_dir, exist_ok=True)

In [58]:
# Enter @ incomplete_dir
tmp_data_dir = tmp_dir
tmp_data_dir

'data\\wikitext\\wikitext-103-v1\\1.0.0\\8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af.incomplete'

In [59]:
# >> Enter @ temporary_assignment
original = getattr(builder_instance, "_cache_dir", None)
setattr(builder_instance, "_cache_dir", tmp_data_dir)

print(f"""
original: {original}
temp    : {builder_instance._cache_dir}
""".strip())

original: data\wikitext\wikitext-103-v1\1.0.0\8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af
temp    : data\wikitext\wikitext-103-v1\1.0.0\8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af.incomplete


In [60]:
from datasets.arrow_reader import DatasetNotOnHfGcs

In [61]:
try:
    builder_instance._download_prepared_from_hf_gcs(
        dl_manager._download_config)
except DatasetNotOnHfGcs as e:
    print("Error occurs!")

Error occurs!


In [62]:
# self._download_and_prepare
split_dict = SplitDict(dataset_name=builder_instance.name)
split_dict, type(split_dict)

({}, datasets.splits.SplitDict)

In [63]:
split_generators_kwargs = builder_instance._make_split_generators_kwargs({})
split_generators_kwargs, type(split_generators_kwargs)

({}, dict)

This is where the download happens! It uses dl_manager's `download_and_extract` method.
- The file is downloaded into downloads!
- What if the extract location is changed too...?

In [64]:
split_generators = builder_instance._split_generators(dl_manager)
split_generators

[SplitGenerator(name='test', gen_kwargs={'data_file': 'data\\downloads\\extracted\\b23f382c6349d2566715ce27126d0fdda673c06ff84c853c01187a45c59ca52c\\wikitext-103\\wiki.test.tokens', 'split': 'test'}, split_info=SplitInfo(name='test', num_bytes=0, num_examples=0, dataset_name=None)),
 SplitGenerator(name='train', gen_kwargs={'data_file': 'data\\downloads\\extracted\\b23f382c6349d2566715ce27126d0fdda673c06ff84c853c01187a45c59ca52c\\wikitext-103\\wiki.train.tokens', 'split': 'train'}, split_info=SplitInfo(name='train', num_bytes=0, num_examples=0, dataset_name=None)),
 SplitGenerator(name='validation', gen_kwargs={'data_file': 'data\\downloads\\extracted\\b23f382c6349d2566715ce27126d0fdda673c06ff84c853c01187a45c59ca52c\\wikitext-103\\wiki.valid.tokens', 'split': 'valid'}, split_info=SplitInfo(name='validation', num_bytes=0, num_examples=0, dataset_name=None))]

In [65]:
builder_instance.config.data_url

'https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip'

In [66]:
builder_instance.info.download_checksums

{'https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip': {'num_bytes': 190229076,
  'checksum': '242ba0f20b329cfdf1ccc61e9e9e5b59becf189db7f7a81cd2a0e2fc31539590'}}

In [67]:
dl_manager.get_recorded_sizes_checksums()

{'https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip': {'num_bytes': 190229076,
  'checksum': '242ba0f20b329cfdf1ccc61e9e9e5b59becf189db7f7a81cd2a0e2fc31539590'}}

In [68]:
# Checksums verification
verify_checksums(
    builder_instance.info.download_checksums,
    dl_manager.get_recorded_sizes_checksums(),
    "dataset source files"
)
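
The size/checksum dictionaries above suggest a simple comparison of expected versus recorded values. A stdlib-only sketch of that idea is given below, assuming SHA-256 over the file contents; `record_size_checksum` and `check_recorded` are hypothetical helpers, the URL is a placeholder key, and a plain `ValueError` stands in for whatever exception the library raises.

```python
import hashlib
import os
import tempfile


def record_size_checksum(path):
    """Return {'num_bytes', 'checksum'} for a local file (illustrative)."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    return {"num_bytes": os.path.getsize(path), "checksum": sha.hexdigest()}


def check_recorded(expected, recorded):
    """Raise ValueError if the two mappings disagree on keys or values."""
    if expected.keys() != recorded.keys():
        raise ValueError("different set of source files")
    bad = [url for url in expected if expected[url] != recorded[url]]
    if bad:
        raise ValueError(f"size or checksum mismatch for: {bad}")


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "archive.bin")
with open(path, "wb") as f:
    f.write(b"hello")

url = "https://example.com/archive.bin"  # placeholder key
recorded = {url: record_size_checksum(path)}
expected = {url: dict(recorded[url])}
check_recorded(expected, recorded)  # matches, so nothing is raised

expected[url]["num_bytes"] += 1  # simulate a stale expectation
try:
    check_recorded(expected, recorded)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```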

In [69]:
type(builder_instance.info.features) # a subclass of dict

datasets.features.Features

In [70]:
isinstance(builder_instance.info.features, dict)

True

In [71]:
from datasets.arrow_writer import ArrowWriter

In [72]:
# Build Split
for split_generator in split_generators:
    if str(split_generator.split_info.name).lower() == "all":
        raise ValueError(
            "`all` is a special split keyword corresponding to the "
            "union of all splits, so cannot be used as key in "
            "._split_generator()."
        )
    split_dict.add(split_generator.split_info)
    # The method below lives in GeneratorBasedBuilder
    # **************** _prepare_split ******************
    split_info = split_generator.split_info
    fname = "{}-{}.arrow".format(builder_instance.name, split_generator.name)
    fpath = os.path.join(builder_instance._cache_dir, fname) # incomplete
    
    # >> So this is where the top-level class's `_generate_examples` method gets used
    generator = builder_instance._generate_examples(**split_generator.gen_kwargs)
    not_verbose = False
#     with ArrowWriter(features=builder_instance.info.features,
#                      path=fpath,
#                      writer_batch_size=builder_instance._writer_batch_size) as writer:
    writer = ArrowWriter(features=builder_instance.info.features,
                         path=fpath,
                         writer_batch_size=builder_instance._writer_batch_size)
    try:
        for key, record in utils.tqdm(
            generator, unit=" examples", total=split_info.num_examples,
            leave=False, disable=not_verbose
        ):
            # The method below does two things:
            # (1) For the types below, convert to a Python object and return; otherwise pass through
            #     ndarray, torch.Tensor, tf.Tensor, pd.Series, pd.DataFrame
            #     dict, list, tuple
            # (2) Walk down through every wrapped object, checking each one!
            #     dict, list, tuple, Sequence, str, ClassLabel, ...
            #     For these objects, keep checking recursively
            #     Return once a value is reached
            example = builder_instance.info.features.encode_example(record)
            writer.write(example)
    finally:
        num_examples, num_bytes = writer.finalize()
        
    split_generator.split_info.num_examples = num_examples
    split_generator.split_info.num_bytes = num_bytes
    # **************** _prepare_split ******************

HBox(children=(FloatProgress(value=1.0, bar_style='info', layout=Layout(width='20px'), max=1.0), HTML(value=''…

HBox(children=(FloatProgress(value=1.0, bar_style='info', layout=Layout(width='20px'), max=1.0), HTML(value=''…

HBox(children=(FloatProgress(value=1.0, bar_style='info', layout=Layout(width='20px'), max=1.0), HTML(value=''…
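
The loop above boils down to "pull `(key, record)` pairs from a generator, write each one, then ask the writer for totals". The toy version below shows just that shape without pyarrow. Both `generate_examples` and `CountingWriter` are hypothetical: the real `_generate_examples` lives in the dataset script, and the real `ArrowWriter` reports Arrow sizes, not raw UTF-8 byte counts.

```python
import os
import tempfile


def generate_examples(data_file):
    """Yield (key, record) pairs, one per line of a text file (illustrative)."""
    with open(data_file, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            yield idx, {"text": line}


class CountingWriter:
    """Toy stand-in for a writer: counts examples and UTF-8 bytes only."""

    def __init__(self):
        self.num_examples = 0
        self.num_bytes = 0

    def write(self, example):
        self.num_examples += 1
        self.num_bytes += len(example["text"].encode("utf-8"))

    def finalize(self):
        return self.num_examples, self.num_bytes


workdir = tempfile.mkdtemp()
data_file = os.path.join(workdir, "wiki.test.tokens")
with open(data_file, "w", encoding="utf-8") as f:
    f.write("a\nbb\n")

writer = CountingWriter()
try:
    for key, record in generate_examples(data_file):
        writer.write(record)
finally:
    num_examples, num_bytes = writer.finalize()
print(num_examples, num_bytes)
```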

In [73]:
from datasets.utils.info_utils import verify_splits

In [74]:
if verify_infos:
    verify_splits(builder_instance.info.splits, split_dict)

In [75]:
builder_instance.info.splits = split_dict
builder_instance.info.download_size = dl_manager.downloaded_size

In [76]:
# Sync info
builder_instance.info.dataset_size = sum(
    split.num_bytes for split in builder_instance.info.splits.values())
builder_instance.info.download_checksums = dl_manager.get_recorded_sizes_checksums()
builder_instance.info.size_in_bytes = builder_instance.info.dataset_size + \
                                      builder_instance.info.download_size

In [77]:
os.path.join(
    builder_instance._cache_dir_root, 
    builder_instance._cache_dir.replace(os.sep, "_") + ".lock"
)

'data\\data_wikitext_wikitext-103-v1_1.0.0_8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af.incomplete.lock'

In [78]:
builder_instance._save_info() # saves LICENSE and dataset_info.json!

In [79]:
# >> Exit @ temporary_assignment
setattr(builder_instance, "_cache_dir", original)

In [80]:
# Exit @ incomplete_dir
if os.path.isdir(dirname):
    print("DELETE")
    shutil.rmtree(dirname)
os.rename(tmp_dir, dirname)
if os.path.exists(tmp_dir):
    shutil.rmtree(tmp_dir)

DELETE


In [81]:
# download post processing resources
builder_instance.download_post_processing_resources(dl_manager)
# default implementation is {}, no mapping

In [82]:
print(
    f"Dataset {builder_instance.name} downloaded and prepared to "
    f"{builder_instance._cache_dir}. Subsequent calls will reuse this data."
)

Dataset wikitext downloaded and prepared to data\wikitext\wikitext-103-v1\1.0.0\8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af. Subsequent calls will reuse this data.


### 2. When a cache file exists and is reused

In [83]:
builder_instance.download_post_processing_resources(dl_manager) # does nothing

## Step 5. Dataset Build

In [84]:
builder_instance.__dict__

{'name': 'wikitext',
 'hash': '8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af',
 'config': WikitextConfig(name='wikitext-103-v1', version=1.0.0, data_dir=None, data_files=None, description='raw level dataset. The raw tokens before the addition of <unk> tokens. They should only be used for character level work or for creating newly derived datasets.'),
 'config_id': 'wikitext-103-v1',
 'info': DatasetInfo(description=' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\n', citation='@InProceedings{wikitext,\n    author={Stephen, Merity and Caiming ,Xiong and James, Bradbury and Richard Socher}\n    year={2016}\n}\n', homepage='https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/', license='', features={'text': Value(dtype='string', id=None)}, post

In [85]:
# Build dataset for splits
keep_in_memory = False # is_small_dataset

In [86]:
split = {s: s for s in builder_instance.info.splits}
split

{'test': 'test', 'train': 'train', 'validation': 'validation'}

In [87]:
types = tuple([tuple])
types

(tuple,)

In [88]:
iterable = list(split.values()) if isinstance(split, dict) else split
iterable

['test', 'train', 'validation']

In [89]:
from functools import partial

function = partial(
    builder_instance._build_single_dataset,
    run_post_process=True,
    ignore_verifications=False,
    in_memory=False,
)

In [90]:
split_kwds = []  # We organize the splits ourselves (contiguous splits)
for index in range(1):
    div = len(iterable) // 1
    mod = len(iterable) % 1
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    split_kwds.append((function, iterable[start:end], types, index, False))
split_kwds

[(functools.partial(<bound method DatasetBuilder._build_single_dataset of <datasets_modules.datasets.wikitext.8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af.wikitext.Wikitext object at 0x0000015B71914B38>>, run_post_process=True, ignore_verifications=False, in_memory=False),
  ['test', 'train', 'validation'],
  (tuple,),
  0,
  False)]
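
The cell above hard-codes a single process (`range(1)`), which hides what the `div`/`mod` arithmetic is for. Generalizing the same formula to `num_proc` workers gives the sketch below; `contiguous_chunks` is a hypothetical helper name, and the arithmetic is copied from the cell.

```python
def contiguous_chunks(items, num_proc):
    """Split items into num_proc contiguous slices, spreading the remainder
    over the first `mod` slices (same arithmetic as the cell above)."""
    chunks = []
    div, mod = divmod(len(items), num_proc)
    for index in range(num_proc):
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        chunks.append(items[start:end])
    return chunks


print(contiguous_chunks(["test", "train", "validation"], 1))
print(contiguous_chunks(list(range(5)), 2))
```

With one process, the whole list lands in a single chunk, which matches the single `split_kwds` entry shown above.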

In [91]:
# _single_map_nested
function, data_struct, types, rank, disable_tqdm = split_kwds[0]

In [92]:
not isinstance(data_struct, dict) and not isinstance(data_struct, types)

True

In [93]:
split = data_struct
split

['test', 'train', 'validation']

In [94]:
verify_infos = True

In [95]:
isinstance(split, str)

False

In [96]:
from datasets.splits import Split

In [101]:
# _as_dataset
dataset_kwargs = ArrowReader(builder_instance._cache_dir, builder_instance.info).read(
    name=builder_instance.name,
    instructions=Split("test"),
    split_infos=builder_instance.info.splits.values(),
    in_memory=False,
)
dataset_kwargs

{'arrow_table': pyarrow.Table
 text: string,
 'data_files': [{'filename': 'data\\wikitext\\wikitext-103-v1\\1.0.0\\8ae2a41908b3b12285d41e5b92b82eb1837e7053db277a34d471f19c5e0888af\\wikitext-test.arrow',
   'skip': 0,
   'take': 62}],
 'info': DatasetInfo(description=' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.\n', citation='@InProceedings{wikitext,\n    author={Stephen, Merity and Caiming ,Xiong and James, Bradbury and Richard Socher}\n    year={2016}\n}\n', homepage='https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/', license='', features={'text': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, builder_name='wikitext', config_name='wikitext-103-v1', version=1.0.0, splits={'test': SplitInfo(name='test', num_bytes=1272551, num

In [107]:
dataset_kwargs.keys()

dict_keys(['arrow_table', 'data_files', 'info', 'split'])

In [106]:
type(dataset_kwargs)

dict

In [103]:
from datasets.arrow_dataset import Dataset

In [105]:
Dataset(**dataset_kwargs)

Dataset({
    features: ['text'],
    num_rows: 62
})

In [97]:
# Build base dataset
ds = builder_instance._as_dataset(
    split=Split(split[0]),
    in_memory=False
)

In [98]:
# run_post_process
print(builder_instance._post_processing_resources(split))
resources_paths = {}
builder_instance._post_process(ds, resources_paths) is not None

{}


False

Dataset({
    features: ['text'],
    num_rows: 62
})