UnicodeEncodeError #131

Closed
ningding97 opened this issue Apr 9, 2023 · 1 comment

Comments

@ningding97

I'm trying to construct a text map using atlas. It works when I sample a small subset of the data, but when I use all of the data I get this error:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
Cell In[8], line 17
     13 print(data[0])
     15 max_documents = 500000
---> 17 project = atlas.map_text(data=data,
     18                           indexed_field='dialogue',
     19                           name='UltraChat',
     20                           id_field='id',
     21                           description='Large-scale, high-quality, and diverse muli-round dialogue data.',
     22                           )

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/nomic/atlas.py:226, in map_text(data, indexed_field, id_field, name, description, build_topic_model, multilingual, is_public, colorable_fields, num_workers, organization_name, reset_project_if_exists, add_datums_if_exists, shard_size, projection_n_neighbors, projection_epochs, projection_spread)
    224         logger.info(f"{project.name}: Deleting project due to failure in initial upload.")
    225         project.delete()
--> 226     raise e
    228 logger.info("Text upload succeeded.")
    230 # make a new index if there were no datums in the project before

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/nomic/atlas.py:218, in map_text(data, indexed_field, id_field, name, description, build_topic_model, multilingual, is_public, colorable_fields, num_workers, organization_name, reset_project_if_exists, add_datums_if_exists, shard_size, projection_n_neighbors, projection_epochs, projection_spread)
    216     logger.warning("Passing 'num_workers' is deprecated and will be removed in a future release.")
    217 try:
--> 218     project.add_text(
    219         data,
    220         shard_size=None,
    221     )
    222 except BaseException as e:
    223     if number_of_datums_before_upload == 0:

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/nomic/project.py:1387, in AtlasProject.add_text(self, data, pbar, shard_size, num_workers)
   1385     data = pa.Table.from_pandas(data)
   1386 elif isinstance(data, list):
-> 1387     data = pa.Table.from_pylist(data)
   1388 elif not isinstance(data, pa.Table):
   1389     raise ValueError("Data must be a pandas DataFrame, list of dictionaries, or a pyarrow Table.")

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:3705, in pyarrow.lib.Table.from_pylist()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:5226, in pyarrow.lib._from_pylist()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:3580, in pyarrow.lib.Table.from_arrays()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:1391, in pyarrow.lib._sanitize_arrays()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/table.pxi:1372, in pyarrow.lib._schema_from_arrays()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/array.pxi:317, in pyarrow.lib.array()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()

File ~/opt/anaconda3/envs/dn/lib/python3.8/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 2350-2351: surrogates not allowed

I have removed all the non-ASCII data but still get the error. Any ideas?

@ningding97
Author

Fixed. There were still some weird characters in the data.
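
For anyone who lands on the same traceback: the message "surrogates not allowed" points at lone UTF-16 surrogate code points (U+D800 to U+DFFF), which a Python `str` can hold but UTF-8 cannot encode. The thread does not say how the data was actually cleaned, so the following is only a minimal sketch of one possible approach. The helper names (`find_surrogate_rows`, `strip_surrogates`, `clean_records`) and the sample records are hypothetical, and replacing the offending characters with U+FFFD is an assumption about what is acceptable for the dataset.

```python
import re

# Surrogate code points (U+D800..U+DFFF) are legal inside a Python str
# but cannot be encoded as UTF-8, which is what triggers the error above.
_SURROGATES = re.compile(r"[\ud800-\udfff]")


def find_surrogate_rows(records, field):
    """Return indices of records whose `field` contains a surrogate code point."""
    return [
        i
        for i, rec in enumerate(records)
        if isinstance(rec.get(field), str) and _SURROGATES.search(rec[field])
    ]


def strip_surrogates(text):
    """Replace each surrogate code point with U+FFFD so the text encodes as UTF-8."""
    return _SURROGATES.sub("\ufffd", text)


def clean_records(records, fields):
    """Return copies of the records with surrogates replaced in the given fields."""
    cleaned = []
    for rec in records:
        rec = dict(rec)  # copy so the caller's data is left untouched
        for field in fields:
            if isinstance(rec.get(field), str):
                rec[field] = strip_surrogates(rec[field])
        cleaned.append(rec)
    return cleaned


# Hypothetical sample shaped like the list of dicts passed to atlas.map_text.
sample = [
    {"id": "0", "dialogue": "plain ASCII text"},
    {"id": "1", "dialogue": "broken \ud83d emoji half"},
]

bad = find_surrogate_rows(sample, "dialogue")
cleaned = clean_records(sample, ["dialogue"])

# The cleaned text can now be encoded as UTF-8 without raising.
encoded = cleaned[1]["dialogue"].encode("utf-8")
```

Running `find_surrogate_rows` first shows which records are affected before anything is changed. The cleaned list can then be passed as `data` in place of the original one.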
