feat: allow arrow table for upload in map_text #214

Merged
merged 4 commits into from Oct 5, 2023

Conversation

zanussbaum
Collaborator

Previously, uploading a pyarrow table through map_text failed, because iterating over a pyarrow table yields its columns rather than its rows. For example,

>>> import pyarrow as pa

>>> schema = pa.schema([
...     ('text', pa.string()),
...     ('url', pa.string())
... ])
>>> 
>>> # Sample data
>>> data = {
...     'text': ['sample1', 'sample2', 'sample3', 'sample4', 'sample5'],
...     'url': ['http://example1.com', 'http://example2.com', 'http://example3.com', 'http://example4.com', 'http://example5.com']
... }
>>> 
>>> tb = pa.Table.from_pydict(data)
>>> tb
pyarrow.Table
text: string
url: string
----
text: [["sample1","sample2","sample3","sample4","sample5"]]
url: [["http://example1.com","http://example2.com","http://example3.com","http://example4.com","http://example5.com"]]
>>> for x in tb:
...     break
... 
>>> x
<pyarrow.lib.ChunkedArray object at 0x7fc8c663e7c0>
[
  [
    "sample1",
    "sample2",
    "sample3",
    "sample4",
    "sample5"
  ]
]
>>> 

whereas we want to iterate over the rows instead. Feel free to critique my pyarrow code; I just hacked around on this.

Collaborator

bmschmidt left a comment


LGTM. We may want to add a note at some point that it would be faster and safer to insert arrow record batches directly rather than casting to list.

bmschmidt merged commit a3a91c2 into main on Oct 5, 2023
2 checks passed