<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/ingestion/parallel_execution_ingestion_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parallelizing a LlamaIndex RAG Pipeline

## 0. Prerequisites


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%pip install llama-index-cli
%pip install llama-index-embeddings-openai
%pip install llama-index-readers-file
%pip install llama-index-embeddings-huggingface


In [3]:
import nest_asyncio

nest_asyncio.apply()

In [4]:
import cProfile, pstats
from pstats import SortKey
import time
import asyncio

### Download data


For this notebook, we'll load the `PatronusAIFinanceBenchDataset` llama-dataset from [llamahub](https://llamahub.ai).

In [5]:
!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data

100% 32/32 [00:17<00:00,  1.79it/s]
Successfully downloaded PatronusAIFinanceBenchDataset to ./data


## 1. Load data

### 1.0 Defining the reader

**The PatronusAIFinanceBenchDataset data contains 32 PDFs of roughly a hundred pages each.**

In [6]:
from llama_index.core import SimpleDirectoryReader

# define our reader with the directory containing the 32 pdf files

reader = SimpleDirectoryReader(
    input_dir="./data/source_files",  # "./data/source_files" "/content/drive/MyDrive/test_data"
    #required_exts=[".pdf"],
    recursive=True,
    )

### 1.1 Sequential load

In [7]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = reader.load_data(show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('stats_sequential_load')
p = pstats.Stats("stats_sequential_load")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Loading files: 100%|██████████| 32/32 [24:07<00:00, 45.23s/file]


Created 4306 documents in 1447.2268562316895s.
Sat Feb  8 23:33:21 2025    stats_sequential_load

         1875772025 function calls (1872315215 primitive calls) in 1447.226 seconds

   Ordered by: cumulative time
   List reduced from 1261 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000 1447.226  723.613 interactiveshell.py:3512(run_code)
    122/2    0.016    0.000 1447.226  723.613 {built-in method builtins.exec}
        1    0.000    0.000 1447.226 1447.226 <ipython-input-7-89bd0949f457>:1(<cell line: 0>)
        1    0.001    0.001 1447.226 1447.226 base.py:664(load_data)
       32    0.001    0.000 1447.121   45.223 base.py:493(load_file)
       32    0.003    0.000 1447.032   45.220 __init__.py:328(wrapped_f)
       32    0.001    0.000 1447.027   45.220 __init__.py:465(__call__)
       32    0.166    0.005 1447.021   45.219 base.py:36(load_data)
     4306    4.549    0.001 1429.744    0.332 _




<pstats.Stats at 0x7a01a1017750>

In [8]:
print("Mean loader execution time over 7 runs:")
%timeit reader.load_data()

Mean loader execution time over 7 runs:
5min 52s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
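
The profiling cells in this notebook all repeat the same enable, disable, dump, and print sequence. A minimal sketch of a helper that wraps that pattern for synchronous calls is shown below. `profile_call` and its parameters are hypothetical names introduced here for illustration; they are not part of LlamaIndex or of the original notebook.

```python
import cProfile
import pstats
import time
from pstats import SortKey


def profile_call(func, *args, stats_path="stats.prof", top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile and print the costliest calls.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns a (result, elapsed_seconds) tuple.
    """
    profiler = cProfile.Profile()
    tic = time.perf_counter()
    # runcall enables the profiler, calls func, and disables it afterwards.
    result = profiler.runcall(func, *args, **kwargs)
    elapsed = time.perf_counter() - tic

    # Save the raw stats, then print the `top` entries by cumulative time.
    profiler.dump_stats(stats_path)
    stats = pstats.Stats(stats_path)
    stats.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(top)
    return result, elapsed
```

With the reader defined earlier, a call such as `profile_call(reader.load_data, show_progress=True, stats_path="stats_sequential_load")` would cover the sequential case. The cells that use `await` would need an async variant.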


### 1.2 Parallel load

In [9]:
import multiprocessing

num_cpus = multiprocessing.cpu_count()
print(f"Number of CPUs: {num_cpus}")

Number of CPUs: 2


In [10]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = reader.load_data(num_workers=2, show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('stats_parallel_load_worker2')
p = pstats.Stats("stats_parallel_load_worker2")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)


Created 4306 documents in 446.45228362083435s.
Sun Feb  9 00:27:56 2025    stats_parallel_load_worker2

         11440 function calls (11412 primitive calls) in 446.452 seconds

   Ordered by: cumulative time
   List reduced from 348 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000  446.452  223.226 interactiveshell.py:3512(run_code)
      6/2    0.000    0.000  446.452  223.226 {built-in method builtins.exec}
        1    0.000    0.000  446.452  446.452 <ipython-input-10-fef6e291062a>:1(<cell line: 0>)
        1    0.000    0.000  446.451  446.451 base.py:664(load_data)
        8    0.000    0.000  446.368   55.796 threading.py:611(wait)
        6    0.000    0.000  446.368   74.395 threading.py:295(wait)
       39  446.368   11.445  446.368   11.445 {method 'acquire' of '_thread.lock' objects}
        1    0.000    0.000  446.350  446.350 pool.py:369(starmap)
        1    0.000    0.000  446.350  

<pstats.Stats at 0x7a019aa96110>

In [11]:
print("Mean loader execution time over 7 runs:")
%timeit reader.load_data(num_workers=2)

Mean loader execution time over 7 runs:
7min 27s ± 2.12 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 1.3 Async Load

In [12]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = await reader.aload_data(show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('stats_async_load')
p = pstats.Stats("stats_async_load")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

100%|██████████| 32/32 [23:52<00:00, 44.78s/it]    


Created 4306 documents in 1432.9723327159882s.
Sun Feb  9 01:51:27 2025    stats_async_load

         1875688631 function calls (1872233520 primitive calls) in 1432.972 seconds

   Ordered by: cumulative time
   List reduced from 668 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      180    0.000    0.000 1432.962    7.961 events.py:82(_run)
      180    0.000    0.000 1432.961    7.961 {method 'run' of '_contextvars.Context' objects}
        3    0.001    0.000 1432.961  477.654 nest_asyncio.py:100(_run_once)
       33    0.001    0.000 1432.951   43.423 tasks.py:260(__step)
       33    0.000    0.000 1432.949   43.423 {method 'send' of 'coroutine' objects}
       32    0.000    0.000 1432.934   44.779 asyncio.py:75(wrap_awaitable)
       32    0.001    0.000 1432.934   44.779 base.py:594(aload_file)
       32    0.000    0.000 1432.925   44.779 base.py:38(aload_data)
       32    0.001    0.000 1432.924   44.779 __init__




<pstats.Stats at 0x7a019999dc90>

In [13]:
loop = asyncio.get_event_loop()
print("Mean loader execution time over 7 runs:")
%timeit loop.run_until_complete(reader.aload_data())

Mean loader execution time over 7 runs:
5min 51s ± 777 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 1.4 Async Parallel Load

In [14]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = await reader.aload_data(num_workers=2, show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('stats_parallel_async_load_worker2')
p = pstats.Stats("stats_parallel_async_load_worker2")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

100%|██████████| 32/32 [23:59<00:00, 44.99s/it]    


Created 4306 documents in 1439.6170403957367s.
Sun Feb  9 03:02:22 2025    stats_parallel_async_load_worker2

         1875681750 function calls (1872226639 primitive calls) in 1439.616 seconds

   Ordered by: cumulative time
   List reduced from 639 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3    0.000    0.000 1439.609  479.870 nest_asyncio.py:100(_run_once)
       81    0.000    0.000 1439.609   17.773 events.py:82(_run)
       81    0.000    0.000 1439.609   17.773 {method 'run' of '_contextvars.Context' objects}
       33    0.001    0.000 1439.604   43.624 tasks.py:260(__step)
       33    0.000    0.000 1439.602   43.624 {method 'send' of 'coroutine' objects}
       34    0.001    0.000 1439.599   42.341 dispatcher.py:349(async_wrapper)
       32    0.000    0.000 1439.585   44.987 asyncio.py:75(wrap_awaitable)
       32    0.001    0.000 1439.577   44.987 async_utils.py:136(worker)
       32    0.001    0.




<pstats.Stats at 0x7a0197099fd0>

In [15]:
loop = asyncio.get_event_loop()
print("Mean loader execution time over 7 runs:")
%timeit loop.run_until_complete(reader.aload_data(num_workers=2))

Mean loader execution time over 7 runs:
5min 51s ± 744 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 1.5 TODO: Conclusion

## 2. IngestionPipeline

### 2.0 Defining the pipeline

In [16]:
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),
        HuggingFaceEmbedding("BAAI/bge-small-en-v1.5"),
    ]
)

# since we'll be testing performance, using timeit and cProfile
# we're going to disable cache
pipeline.disable_cache = True


### 2.1 Sequential Execution

By default, `num_workers` is `None`, which runs the pipeline sequentially.
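
Conceptually, a sequential run applies each transformation to the full output of the previous one, on a single process. The sketch below illustrates that chaining with two toy stand-ins. `split_into_chunks`, `fake_embed`, and `run_sequential` are hypothetical names used only for illustration; the real `SentenceSplitter` and embedding model behave differently.

```python
from typing import Callable, List


def split_into_chunks(texts: List[str], chunk_size: int = 5) -> List[str]:
    """Toy splitter: cut each text into chunks of `chunk_size` words."""
    chunks = []
    for text in texts:
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks


def fake_embed(chunks: List[str]) -> List[dict]:
    """Toy embedding: attach a one-number 'vector' (the chunk length)."""
    return [{"text": c, "embedding": [float(len(c))]} for c in chunks]


def run_sequential(items, transformations: List[Callable]):
    """Apply each transformation in order, feeding its output to the next."""
    for transform in transformations:
        items = transform(items)
    return items


nodes = run_sequential(
    ["one two three four five six seven", "eight nine"],
    [split_into_chunks, fake_embed],
)
print(len(nodes))  # 3 chunks: 5 words, 2 words, 2 words
```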

In [17]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
nodes = pipeline.run(documents=documents, show_progress=True)
profiler.disable()
print(f"\nCreated {len(nodes)} nodes in {time.time()-tic:.2f}s.")

profiler.dump_stats('stats_sequential_ingestion')
p = pstats.Stats("stats_sequential_ingestion")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Parsing nodes:   0%|          | 0/4306 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/9209 [00:00<?, ?it/s]


Created 9209 nodes in 5298.04s.
Sun Feb  9 05:18:15 2025    stats_sequential_ingestion

         39107640 function calls (37432824 primitive calls) in 5298.025 seconds

   Ordered by: cumulative time
   List reduced from 782 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000 5298.037 2649.018 interactiveshell.py:3512(run_code)
        2    0.000    0.000 5298.037 2649.018 {built-in method builtins.exec}
   4311/1    0.142    0.000 5298.037 5298.037 dispatcher.py:253(wrapper)
        1    0.000    0.000 5298.036 5298.036 pipeline.py:451(run)
        1    0.000    0.000 5298.036 5298.036 pipeline.py:69(run_transformations)
        1    0.011    0.011 5279.764 5279.764 base.py:442(__call__)
        1    0.081    0.081 5275.832 5275.832 base.py:305(get_text_embedding_batch)
      921    0.005    0.000 5273.457    5.726 base.py:308(_get_text_embeddings)
      921    0.007    0.000 5273.452    5.7

<pstats.Stats at 0x7a0195ec6950>

In [None]:
print("Mean pipeline execution time over 7 runs:")
%timeit pipeline.run(documents=documents)

Mean pipeline execution time over 7 runs:


### 2.2 Parallel Execution

A single run. Setting `num_workers` to a value greater than 1 will invoke parallel execution.
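
One plausible way to picture what `num_workers=2` does is batch-level parallelism: the documents are split into batches and each batch is handed to a worker process. The sketch below assumes that model and uses only the standard library. `make_batches` and `process_batch` are hypothetical names, not LlamaIndex functions.

```python
import multiprocessing
from typing import List


def make_batches(items: List[str], num_batches: int) -> List[List[str]]:
    """Split `items` into at most `num_batches` contiguous batches."""
    batch_size = max(1, -(-len(items) // num_batches))  # ceiling division
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch: List[str]) -> List[int]:
    """Stand-in for CPU-bound work: count the words in each document."""
    return [len(doc.split()) for doc in batch]


if __name__ == "__main__":
    docs = ["a b c", "d e", "f", "g h i j", "k"]
    batches = make_batches(docs, num_batches=2)
    # Each worker process receives one batch; results come back in order.
    with multiprocessing.Pool(processes=2) as pool:
        per_batch = pool.map(process_batch, batches)
    word_counts = [n for batch in per_batch for n in batch]
    print(word_counts)
```

Each batch has to be serialized and sent to a worker, so this kind of parallelism tends to pay off only when the work per batch outweighs that overhead.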

In [None]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
nodes = pipeline.run(documents=documents, num_workers=2, show_progress=True)
profiler.disable()
print(f"\nCreated {len(nodes)} nodes in {time.time()-tic:.2f}s.")

profiler.dump_stats('stats_parallel_ingestion_worker2')
p = pstats.Stats("stats_parallel_ingestion_worker2")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

In [None]:
print("Mean pipeline execution time over 7 runs:")
%timeit pipeline.run(documents=documents, num_workers=2)

### 2.3 Async on Main Processor

As in the sync case, `num_workers` defaults to `None`, which leads to single-batch execution of the async tasks.
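
Async execution on a single process overlaps waiting, not computing. The sketch below uses `asyncio.sleep` as a stand-in for an I/O-bound call, such as a request to a remote embedding service, and shows several tasks finishing in roughly the time of one wait. `fake_io_task` and `run_all` are hypothetical names. For CPU-bound work, such as running a local embedding model, the same pattern would be expected to give little or no speedup.

```python
import asyncio
import time


async def fake_io_task(i: int, delay: float) -> int:
    """Stand-in for an I/O-bound call (e.g. a network request)."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return i


async def run_all(n: int, delay: float):
    """Launch n tasks concurrently and return (results, elapsed seconds)."""
    tic = time.perf_counter()
    results = await asyncio.gather(*(fake_io_task(i, delay) for i in range(n)))
    return list(results), time.perf_counter() - tic


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_all(n=5, delay=0.1))
    # The five waits overlap, so elapsed is close to 0.1 s rather than 0.5 s.
    print(results, f"{elapsed:.2f}s")
```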

In [None]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
nodes = await pipeline.arun(documents=documents, show_progress=True)
profiler.disable()
print(f"\nCreated {len(nodes)} nodes in {time.time()-tic}s.")

profiler.dump_stats('stats_async_ingestion')
p = pstats.Stats("stats_async_ingestion")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

In [None]:
loop = asyncio.get_event_loop()
print("Mean pipeline execution time over 7 runs:")
%timeit loop.run_until_complete(pipeline.arun(documents=documents))

### 2.4 Async Parallel Execution

Here, `ProcessPoolExecutor` from `concurrent.futures` is used to run the work in separate processes asynchronously. The tasks being processed are blocking, but they run asynchronously on the individual processes.
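
The combination described above can be sketched with the standard library alone: each batch is submitted to a `ProcessPoolExecutor` through `loop.run_in_executor`, and the event loop awaits all batches together. This is an illustrative sketch under that assumption, not LlamaIndex's implementation. `embed_batch` and `arun_batches` are hypothetical names.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import List


def embed_batch(batch: List[str]) -> List[int]:
    """Blocking stand-in for CPU-bound work done inside one worker process."""
    return [len(text.split()) for text in batch]


async def arun_batches(batches: List[List[str]], num_workers: int) -> List[int]:
    """Run each batch in a worker process and await all of them together."""
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [loop.run_in_executor(pool, embed_batch, b) for b in batches]
        per_batch = await asyncio.gather(*futures)
    return [n for batch in per_batch for n in batch]


if __name__ == "__main__":
    batches = [["a b c", "d e"], ["f", "g h i j"]]
    print(asyncio.run(arun_batches(batches, num_workers=2)))
```

The event loop stays free while the worker processes compute, which is why the blocking function does not need to be rewritten as a coroutine.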

In [None]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
nodes = await pipeline.arun(documents=documents, num_workers=2, show_progress=True)
profiler.disable()
print(f"\nCreated {len(nodes)} nodes in {time.time()-tic}s.")

profiler.dump_stats('stats_parallel_async_ingestion')
p = pstats.Stats("stats_parallel_async_ingestion")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

In [None]:
loop = asyncio.get_event_loop()
print("Mean pipeline execution time over 7 runs:")
%timeit loop.run_until_complete(pipeline.arun(documents=documents, num_workers=2))

### 2.5 TODO: Conclusion

The results of the experiments above are summarized below, with the strategies listed from fastest to slowest for this example dataset and pipeline.

1. (Async, Parallel Processing): 20.3s
2. (Async, No Parallel Processing): 20.5s
3. (Sync, Parallel Processing): 29s
4. (Sync, No Parallel Processing): 1min 11s

We can see that both cases that use Parallel Processing outperform the Sync, No Parallel Processing case (i.e., `.run(num_workers=None)`). For Async tasks, at least in this case, Parallel Processing brings little additional gain. For larger workloads and IngestionPipelines, combining Async with Parallel Processing may lead to larger gains.
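
To make the comparison concrete, the listed timings can be converted to seconds and expressed as speedups over the Sync, No Parallel Processing baseline. The snippet below only does that arithmetic on the four figures quoted above; `to_seconds` is a hypothetical helper introduced for this purpose.

```python
import re

# Timings copied from the list above.
timings = {
    "Async, Parallel Processing": "20.3s",
    "Async, No Parallel Processing": "20.5s",
    "Sync, Parallel Processing": "29s",
    "Sync, No Parallel Processing": "1min 11s",
}


def to_seconds(text: str) -> float:
    """Convert strings such as '1min 11s' or '20.3s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)min\s*)?(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"Unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


baseline = to_seconds(timings["Sync, No Parallel Processing"])
speedups = {name: baseline / to_seconds(t) for name, t in timings.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```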