TypeError: Couldn't build proto file into descriptor pool! Invalid proto descriptor for file "sentencepiece_model.proto": sentencepiece_model.proto: A file with this name is already in the pool. #12882

Closed
KawaiiNotHawaii opened this issue May 23, 2023 · 1 comment

KawaiiNotHawaii commented May 23, 2023

What version of protobuf and what language are you using?
Version: v3.8.0
Language: Python

What operating system (Linux, Windows, ...) and version?
Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-142-generic x86_64)

What runtime / compiler are you using (e.g., python version or gcc version)?
Python 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:11:38)
[GCC 7.3.0] :: Anaconda, Inc. on linux

What did you do?
Steps to reproduce the behavior:

  1. Load a LanguageModelingTransformer from lightning_transformers
  2. Load a dataset from BigBench with load_dataset(), imported from datasets
  3. See the error (the stack trace differs, but the same TypeError occurs even if the order is swapped, i.e., the dataset is loaded first and the model after it); a minimal sketch follows
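Roughly, the repro looks like this (a minimal sketch; the "gpt2" checkpoint is illustrative rather than copied from my actual run, and the truncated split argument from my notebook is omitted):

```python
# Minimal repro sketch; "gpt2" is an illustrative checkpoint, not necessarily
# the one from my actual run.
from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer
from datasets import load_dataset

# Step 1: load a language-modeling transformer.
model = LanguageModelingTransformer(pretrained_model_name_or_path="gpt2")

# Step 2: loading BigBench imports its dataset script, which pulls in
# t5 -> seqio -> sentencepiece and re-registers sentencepiece_model.proto.
dataset = load_dataset("bigbench", "modified_arithmetic", cache_dir="data")
# Step 3: the TypeError below is raised during the import chain above.
```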

What did you expect to see?
Dataset loaded successfully

What did you see instead?
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 2>:2 │
│ │
│ 1 # dataset = ["What the result of 1+3", "Calculate 4*253"] │
│ ❱ 2 dataset = load_dataset("bigbench", 'modified_arithmetic', cache_dir='data', split='valid │
│ 3 # dataset = dataset['validation']['inputs'][:] │
│ 4 │
│ 5 # Create a DataLoader │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/datasets/load.py:1773 in │
│ load_dataset │
│ │
│ 1770 │ ) │
│ 1771 │ │
│ 1772 │ # Create a dataset builder │
│ ❱ 1773 │ builder_instance = load_dataset_builder( │
│ 1774 │ │ path=path, │
│ 1775 │ │ name=name, │
│ 1776 │ │ data_dir=data_dir, │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/datasets/load.py:1512 in │
│ load_dataset_builder │
│ │
│ 1509 │ ) │
│ 1510 │ │
│ 1511 │ # Get dataset builder class from the processing script │
│ ❱ 1512 │ builder_cls = import_main_class(dataset_module.module_path) │
│ 1513 │ builder_kwargs = dataset_module.builder_kwargs │
│ 1514 │ data_files = builder_kwargs.pop("data_files", data_files) │
│ 1515 │ config_name = builder_kwargs.pop("config_name", name) │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/datasets/load.py:115 in │
│ import_main_class │
│ │
│ 112 │ - a DatasetBuilder if dataset is True │
│ 113 │ - a Metric if dataset is False │
│ 114 │ """ │
│ ❱ 115 │ module = importlib.import_module(module_path) │
│ 116 │ │
│ 117 │ if dataset: │
│ 118 │ │ main_cls_type = DatasetBuilder │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/importlib/__init__.py:127 in import_module │
│ │
│ 124 │ │ │ if character != '.': │
│ 125 │ │ │ │ break │
│ 126 │ │ │ level += 1 │
│ ❱ 127 │ return _bootstrap._gcd_import(name[level:], package, level) │
│ 128 │
│ 129 │
│ 130 _RELOADING = {} │
│ in _gcd_import:1014 │
│ in _find_and_load:991 │
│ in _find_and_load_unlocked:975 │
│ in _load_unlocked:671 │
│ in exec_module:783 │
│ in _call_with_frames_removed:219 │
│ │
│ /home/cxsun/.cache/huggingface/modules/datasets_modules/datasets/bigbench/d2757373c3fb6b35a846ee │
│ 951265c3f8fbf0124fb650b12cef5678cf902914d2/bigbench.py:22 in │
│ │
│ 19 │
│ 20 from typing import Optional │
│ 21 │
│ ❱ 22 import bigbench.api.util as bb_utils # From: "bigbench @ https://storage.googleapis.com
│ 23 import bigbench.bbseqio.bigbench_bridge as bbb │
│ 24 from bigbench.api import json_task │
│ 25 from bigbench.bbseqio import bigbench_json_paths as bb_json_paths │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/bigbench/api/util.py:25 in │
│ │
│ │
│ 22 import json │
│ 23 import os │
│ 24 import bigbench.api.task as task_api │
│ ❱ 25 import bigbench.api.json_task as json_task │
│ 26 import bigbench.api.model as model_api │
│ 27 import bigbench.api.results as results_api │
│ 28 import bigbench.api.task_metrics as task_metrics │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/bigbench/api/json_task.py:26 in │
│ │
│ │
│ 23 │
│ 24 from bigbench.api import json_utils │
│ 25 import bigbench.api.task as task │
│ ❱ 26 import bigbench.api.task_metrics as metrics │
│ 27 import bigbench.api.results as results_api │
│ 28 import numpy as np │
│ 29 from scipy.special import logsumexp │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/bigbench/api/task_metrics.py:24 │
│ in │
│ │
│ 21 │
│ 22 from datasets import load_metric │
│ 23 from scipy.special import logsumexp │
│ ❱ 24 from t5.evaluation import metrics │
│ 25 from sklearn.metrics import f1_score │
│ 26 │
│ 27 │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/t5/__init__.py:17 in │
│ │
│ 14 │
│ 15 """Import API modules.""" │
│ 16 │
│ ❱ 17 import t5.data │
│ 18 import t5.evaluation │
│ 19 │
│ 20 # Version number. │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/t5/data/__init__.py:17 in │
│ │
│ │
│ 14 │
│ 15 """Import data modules.""" │
│ 16 # pylint:disable=wildcard-import,g-bad-import-order │
│ ❱ 17 from t5.data.dataset_providers import * │
│ 18 from t5.data.glue_utils import * │
│ 19 import t5.data.postprocessors │
│ 20 import t5.data.preprocessors │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/t5/data/dataset_providers.py:28 │
│ in │
│ │
│ 25 from collections.abc import Mapping │
│ 26 import re │
│ 27 │
│ ❱ 28 import seqio │
│ 29 from t5.data import utils │
│ 30 import tensorflow.compat.v2 as tf │
│ 31 │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/seqio/__init__.py:18 in │
│ │
│ 15 """Import to top-level API.""" │
│ 16 # pylint:disable=wildcard-import,g-bad-import-order │
│ 17 │
│ ❱ 18 from seqio.dataset_providers import * │
│ 19 from seqio import evaluation │
│ 20 from seqio import experimental │
│ 21 from seqio.evaluation import Evaluator │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/seqio/dataset_providers.py:38 in │
│ │
│ │
│ 35 import numpy as np │
│ 36 from packaging import version as version_lib │
│ 37 import pyglove as pg │
│ ❱ 38 from seqio import metrics as metrics_lib │
│ 39 from seqio import preprocessors as seqio_preprocessors │
│ 40 from seqio import task_registry_provenance_tracking │
│ 41 from seqio import utils │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/seqio/metrics.py:25 in │
│ │
│ 22 import clu.metrics │
│ 23 import flax │
│ 24 import numpy as np │
│ ❱ 25 from seqio import utils │
│ 26 import tensorflow.compat.v2 as tf │
│ 27 │
│ 28 │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/seqio/utils.py:29 in │
│ │
│ 26 │
│ 27 from absl import logging │
│ 28 import numpy as np │
│ ❱ 29 from seqio.vocabularies import Vocabulary │
│ 30 import tensorflow.compat.v2 as tf │
│ 31 import tensorflow_datasets as tfds │
│ 32 │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/seqio/vocabularies.py:25 in │
│ │
│ │
│ 22 import tensorflow.compat.v2 as tf │
│ 23 import tensorflow_text as tf_text │
│ 24 │
│ ❱ 25 from sentencepiece import sentencepiece_model_pb2 │
│ 26 import sentencepiece as sentencepiece_processor │
│ 27 │
│ 28 PAD_ID = 0 │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/sentencepiece/sentencepiece_mode │
│ l_pb2.py:16 in │
│ │
│ 13 │
│ 14 │
│ 15 │
│ ❱ 16 DESCRIPTOR = _descriptor.FileDescriptor( │
│ 17 name='sentencepiece_model.proto', │
│ 18 package='sentencepiece', │
│ 19 syntax='proto2', │
│ │
│ /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages/google/protobuf/descriptor.py:1024 │
│ in __new__ │
│ │
│ 1021 │ │ except KeyError: │
│ 1022 │ │ raise RuntimeError('Please link in cpp generated lib for %s' % (name)) │
│ 1023 │ elif serialized_pb: │
│ ❱ 1024 │ │ return _message.default_pool.AddSerializedFile(serialized_pb) │
│ 1025 │ else: │
│ 1026 │ │ return super(FileDescriptor, cls).__new__(cls) │
│ 1027 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "sentencepiece_model.proto":
sentencepiece_model.proto: A file with this name is already in the pool.
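For context, the message means two different descriptors were registered under the same proto file name in a single descriptor pool; here, the sentencepiece wheel's generated module collides with a copy that some earlier import already registered. A standalone sketch of the same conflict, independent of sentencepiece (assuming the default C++/upb protobuf backend; the pure-Python backend raises a different conflict error):

```python
# Sketch of the underlying conflict: registering two *different* descriptors
# under one file name in the same pool fails.
from google.protobuf import descriptor_pb2, descriptor_pool

pool = descriptor_pool.DescriptorPool()
pool.Add(descriptor_pb2.FileDescriptorProto(
    name="sentencepiece_model.proto", package="sentencepiece"))
pool.Add(descriptor_pb2.FileDescriptorProto(
    name="sentencepiece_model.proto", package="other"))
# -> TypeError under the C++/upb backend ("... already in the pool");
#    the pure-Python backend raises a conflicting-definition error instead.
```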


Anything else we should know about your project / environment?
Dependencies mentioned:

Name: lightning-transformers
Version: 0.2.5
Summary: Lightning Transformers.
Home-page: https://github.com/Lightning-AI/lightning-transformers
Author: Lightning AI et al.
Author-email: pytorch@lightning.ai
License: Apache-2.0
Location: /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages
Requires: datasets, Pillow, pytorch-lightning, sentencepiece, torchmetrics, transformers
Required-by:

Name: datasets
Version: 2.12.0
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages
Requires: aiohttp, dill, fsspec, huggingface-hub, multiprocess, numpy, packaging, pandas, pyarrow, pyyaml, requests, responses, tqdm, xxhash
Required-by: bigbench, lightning-transformers
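
protobuf itself is not in the list above; a quick way to confirm which runtime version and backend the environment actually loads (api_implementation is an internal helper, but it exists in these versions):

```python
# Report the protobuf runtime version and which backend is active.
import google.protobuf
from google.protobuf.internal import api_implementation

print(google.protobuf.__version__)   # e.g. "3.8.0"
print(api_implementation.Type())     # "cpp" or "python" (or "upb" in 4.x)
```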

KawaiiNotHawaii added the untriaged label (auto-added to all new issues) on May 23, 2023
KawaiiNotHawaii (Author) commented:

I fixed this bug following the instructions here.
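
For anyone who lands here after the link rots: I can't vouch that this is exactly the linked fix, but a commonly cited workaround for this class of error is to force the pure-Python protobuf backend before anything imports a *_pb2 module (or to pin protobuf to a version compatible with all the installed *_pb2 modules):

```python
# Commonly cited workaround (assumption, not confirmed as the linked fix):
# select the pure-Python protobuf backend before any *_pb2 import happens.
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Only import proto-registering libraries after the variable is set:
# from datasets import load_dataset
```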

This issue was closed.