Concurrent reading using multiprocessing #332

The example from the docs (https://cloud.google.com/spanner/docs/reads#read_data_in_parallel) works well with multithreading. That said, in Python I found the performance to be about the same as without multithreading (probably because of the GIL).

I tried to use multiprocessing instead, but I can't make it work, and I am not even sure it is possible.

- I first tried replacing the `ThreadPoolExecutor` in the doc example with a `ProcessPoolExecutor`. It hangs forever, so I guess the `snapshot` object can't be shared between processes.
- I also tried recreating the session in each worker, but a partition is only valid within the session that created it, according to the error: `details = "Partitioned request was created for a different session."`
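
The guess in the first bullet can be probed with a minimal sketch like the one below. `ProcessPoolExecutor` has to pickle whatever it sends to a worker, so an object that owns process-local state such as a lock cannot cross the process boundary. `FakeSnapshot` and the sample `batch` dict are hypothetical stand-ins, not the real Spanner objects, and the real snapshot may fail or hang for other reasons.

```python
import pickle
import threading


class FakeSnapshot:
    """Hypothetical stand-in for an object that owns live, process-local state."""

    def __init__(self):
        # Locks (like sockets and open channels) cannot be pickled.
        self._lock = threading.Lock()


def is_picklable(obj):
    """Return True if obj can be serialized for transfer to a worker process."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


# Hypothetical batch: a plain dict of bytes and strings, as an assumption
# about what a partition descriptor roughly looks like.
batch = {"partition": b"opaque-token", "read": {"table": "xxxx"}}

print(is_picklable(batch))
print(is_picklable(FakeSnapshot()))
```

Running the same check on the real `snapshot` and on one element of `batches` would show which of the two can safely be passed to `executor.submit`.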


Here is the code of the latter approach (recreating the session):

```python
import concurrent.futures
import logging
import time

from google.cloud import spanner
from tqdm import tqdm

logger = logging.getLogger(__name__)

# project_id, instance_id and database_id are assumed to be defined elsewhere.
spanner_client = spanner.Client(project=project_id)
instance = spanner_client.instance(instance_id)
database = instance.database(database_id)

table = "xxxx"
columns = ("col1", "col2")


def process(batch):
    # Each worker process builds its own client, instance and database.
    spanner_client = spanner.Client(project=project_id)
    instance = spanner_client.instance(instance_id)
    database = instance.database(database_id)

    logger.info(f"Partition: {batch['partition'][:32]}")

    # A new batch snapshot means a new session, which is what triggers
    # "Partitioned request was created for a different session."
    snapshot = database.batch_snapshot()

    row_ct = 0
    for row in tqdm(snapshot.process_read_batch(batch)):
        row_ct += 1

    snapshot.close()

    return time.time(), row_ct


# The partitions are generated in the parent process, in the parent's session.
snapshot = database.batch_snapshot()

keyset = spanner.KeySet(all_=True)
batches = snapshot.generate_read_batches(table=table, columns=columns, keyset=keyset)
batches = list(batches)

logger.info(f"{len(batches)} batches detected")

start = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    # One task per partition; each batch is sent to a worker process.
    futures = [executor.submit(process, batch) for batch in batches]

    pbar = tqdm(concurrent.futures.as_completed(futures, timeout=3600), total=len(batches))
    for future in pbar:
        finish, row_ct = future.result()
        elapsed = finish - start
        print(f"Completed {row_ct} rows in {elapsed} seconds")

snapshot.close()
```

Is there a way to dispatch the batches to multiple processes in Python for high-performance reading? I guess each partition token would need to be usable outside of the `snapshot` object that created it on the Spanner side. I am not sure this is possible.
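
One direction that may be worth testing, as an unverified sketch: pass the parent's session and transaction identifiers to each worker instead of opening a new session there. This assumes `BatchSnapshot.to_dict()` returns those identifiers as a picklable dict and that `BatchSnapshot.from_dict(database, mapping)` rebuilds a snapshot bound to the same session. The function names `read_partition` and `count_rows` are hypothetical, and the `spawn` start method is used on the assumption that forking after a gRPC client has been created is what causes the hang.

```python
import concurrent.futures
import multiprocessing


def read_partition(project_id, instance_id, database_id, snapshot_state, batch):
    """Count the rows of one partition inside a worker process."""
    # Imported here so that each worker builds its own client and channel.
    from google.cloud import spanner
    from google.cloud.spanner_v1.database import BatchSnapshot

    client = spanner.Client(project=project_id)
    database = client.instance(instance_id).database(database_id)

    # Re-attach to the parent's session and transaction (assumed behavior
    # of from_dict) rather than creating a new session.
    snapshot = BatchSnapshot.from_dict(database, snapshot_state)

    # No snapshot.close() here: closing is assumed to delete the session,
    # which is shared with the parent and the other workers.
    return sum(1 for _ in snapshot.process_read_batch(batch))


def count_rows(project_id, instance_id, database_id, table, columns):
    """Partition a full-table read and count rows across worker processes."""
    from google.cloud import spanner

    client = spanner.Client(project=project_id)
    database = client.instance(instance_id).database(database_id)
    snapshot = database.batch_snapshot()
    try:
        batches = list(
            snapshot.generate_read_batches(
                table=table, columns=columns, keyset=spanner.KeySet(all_=True)
            )
        )
        state = snapshot.to_dict()  # session and transaction identifiers

        # "spawn" starts workers from a fresh interpreter instead of forking.
        ctx = multiprocessing.get_context("spawn")
        with concurrent.futures.ProcessPoolExecutor(mp_context=ctx) as executor:
            futures = [
                executor.submit(
                    read_partition, project_id, instance_id, database_id, state, batch
                )
                for batch in batches
            ]
            return sum(
                f.result() for f in concurrent.futures.as_completed(futures)
            )
    finally:
        # The parent owns the session, so only the parent closes it.
        snapshot.close()
```

With `spawn`, `count_rows(...)` has to be called from under an `if __name__ == "__main__":` guard in an importable module so that workers can import `read_partition`.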
