<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/ingestion/parallel_execution_ingestion_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parallelizing a LlamaIndex RAG Pipeline

## 0. Prerequisites


In [1]:
#%pip install llama-index-cli
#%pip install llama-index-embeddings-openai
#%pip install llama-index-readers-file
#%pip install llama-index-embeddings-huggingface
#%pip install ipywidgets

In [2]:
import nest_asyncio

nest_asyncio.apply()

In [3]:
import cProfile, pstats
from pstats import SortKey
import time
import asyncio
import timeit
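
Every benchmark in this notebook wraps a call in the same pattern: start a timer, enable `cProfile`, run the call, dump the stats, print the top entries. A small context manager could factor that out. This is only a sketch: the name `profiled` and the temporary output path are assumptions, not part of the notebook. It also creates the output directory first, since `dump_stats` does not create missing directories.

```python
import cProfile
import pstats
import tempfile
import time
from contextlib import contextmanager
from pathlib import Path
from pstats import SortKey


@contextmanager
def profiled(stats_path, top=15):
    """Profile the enclosed block, dump stats to `stats_path`, print the top entries."""
    stats_path = Path(stats_path)
    stats_path.parent.mkdir(parents=True, exist_ok=True)
    profiler = cProfile.Profile()
    timing = {}
    start = time.perf_counter()
    profiler.enable()
    try:
        yield timing
    finally:
        profiler.disable()
        timing["seconds"] = time.perf_counter() - start
        profiler.dump_stats(str(stats_path))
        stats = pstats.Stats(str(stats_path))
        stats.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(top)


# Hypothetical output location, used here only for the demo.
stats_file = Path(tempfile.gettempdir()) / "profiling_demo" / "demo_stats"

with profiled(stats_file) as timing:
    total = sum(i * i for i in range(100_000))

print(f"Elapsed: {timing['seconds']:.3f}s")
```

With such a helper, a benchmark cell would reduce to a `with profiled(...)` block around the `load_data` or `pipeline.run` call.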

### Downloading the data


For this notebook, we'll load the `PatronusAIFinanceBenchDataset` llama-dataset from [llamahub](https://llamahub.ai).

In [4]:
# !llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data

In [5]:
!ls ./data

rag_dataset.json  source_files	test_sample
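
Before running the loaders, it can help to confirm that the download produced the layout listed above (`rag_dataset.json` and a `source_files` directory). The helper below is a minimal sketch. `summarize_dataset_dir` is a hypothetical name, and the demo runs against a throwaway directory with made-up file names rather than the real dataset.

```python
import tempfile
from pathlib import Path


def summarize_dataset_dir(root):
    """Report whether the expected entries exist and how many PDFs are present."""
    root = Path(root)
    source_dir = root / "source_files"
    pdfs = sorted(source_dir.rglob("*.pdf")) if source_dir.is_dir() else []
    return {
        "has_rag_dataset": (root / "rag_dataset.json").is_file(),
        "num_pdfs": len(pdfs),
    }


# Demo on a temporary directory that mimics the layout (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "source_files").mkdir()
    (demo_root / "rag_dataset.json").write_text("{}")
    for k in range(3):
        (demo_root / "source_files" / f"report_{k}.pdf").write_bytes(b"%PDF-1.4")
    summary = summarize_dataset_dir(demo_root)

print(summary)
```

On the real download, `summarize_dataset_dir("./data")` would be the call to try.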


## 1. Data loading pipeline

**The PatronusAIFinanceBenchDataset data contains 32 PDFs of roughly a hundred pages each.**

Defining the reader:

In [6]:
from llama_index.core import SimpleDirectoryReader

# define our reader with the directory containing the 32 pdf files

input_dir = "./data/source_files"

reader = SimpleDirectoryReader(
    input_dir=input_dir,
    # required_exts=[".pdf"],
    recursive=True,
)

### 1.1 Sequential loading

In [7]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = reader.load_data(show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('./profiling/stats_sequential_load')
p = pstats.Stats("./profiling/stats_sequential_load")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Loading files:  94%|█████████████████████████████████████████████▉   | 30/32 [14:46<00:25, 12.77s/file]

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f514d445d30 state=finished raised DependencyError>]. Skipping...


Loading files: 100%|█████████████████████████████████████████████████| 32/32 [15:10<00:00, 28.45s/file]


Created 4207 documents in 910.300127029419s.
Sun Feb  9 20:13:42 2025    ./profiling/stats_sequential_load

         1820747227 function calls (1817825827 primitive calls) in 910.301 seconds

   Ordered by: cumulative time
   List reduced from 1353 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   130/95    0.199    0.002  956.069   10.064 threading.py:637(wait)
   130/95    0.660    0.005  941.228    9.908 threading.py:323(wait)
       32    0.001    0.000  910.283   28.446 base.py:493(load_file)
       32    0.000    0.000  909.020   28.407 __init__.py:328(wrapped_f)
     4207    2.803    0.001  851.330    0.202 _page.py:2266(extract_text)
4345/4207   14.098    0.003  836.333    0.199 _page.py:1822(_extract_text)
       32    0.000    0.000  801.757   25.055 __init__.py:465(__call__)
       34    0.068    0.002  789.449   23.219 base.py:36(load_data)
     4345    0.017    0.000  594.638    0.137 _data_structures.py:1418(ope




<pstats.Stats at 0x7f515756ccb0>

In [8]:
print("Mean loader execution time over 7 runs:")
%timeit reader.load_data()

Mean loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5141976e10 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f514236b110 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5143a735c0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f514363f530 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

### 1.2 Parallel loading

In [9]:
import multiprocessing

num_cpus = multiprocessing.cpu_count()
print(f"Number of CPUs: {num_cpus}")

Number of CPUs: 8


#### a) Num_workers=4

In [10]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = reader.load_data(num_workers=4, show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('./profiling/stats_parallel_load_worker4')
p = pstats.Stats("./profiling/stats_parallel_load_worker4")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7efcb0eca1e0 state=finished raised DependencyError>]. Skipping...

Created 4207 documents in 111.52120780944824s.
Sun Feb  9 20:45:18 2025    ./profiling/stats_parallel_load_worker4

         108709 function calls (108330 primitive calls) in 111.754 seconds

   Ordered by: cumulative time
   List reduced from 587 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      375    0.003    0.000  167.432    0.446 pool.py:333(_maintain_pool)
      3/2    0.000    0.000  111.518   55.759 interactiveshell.py:3543(run_code)
      7/2    0.003    0.000  111.518   55.759 {built-in method builtins.exec}
      2/1    0.197    0.099  111.152  111.152 1138091397.py:1(<module>)
        1    0.000    0.000  110.891  110.891 base.py:664(load_data)
       11    0.000    0.000  110.886   10.081 util.py:208(__call__)

<pstats.Stats at 0x7f5144436ea0>

In [11]:
print("Mean loader execution time over 7 runs:")
%timeit reader.load_data(num_workers=4)

Mean loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f2986e0c4a0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7fbb2945fb60 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f637d544080 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7fe0d0824560 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

#### b) Num_workers=8

In [12]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = reader.load_data(num_workers=8, show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('./profiling/stats_parallel_load_worker8')
p = pstats.Stats("./profiling/stats_parallel_load_worker8")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f8909cb14c0 state=finished raised DependencyError>]. Skipping...

Created 4207 documents in 94.99244618415833s.
Sun Feb  9 21:00:22 2025    ./profiling/stats_parallel_load_worker8

         333520 function calls (333208 primitive calls) in 95.439 seconds

   Ordered by: cumulative time
   List reduced from 464 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       15    0.000    0.000   94.945    6.330 util.py:208(__call__)
        1    0.000    0.000   94.945   94.945 pool.py:654(terminate)
        1    0.000    0.000   94.945   94.945 pool.py:680(_terminate_pool)
       40    0.000    0.000   94.853    2.371 connection.py:202(send)
        1    0.000    0.000   94.852   94.852 pool.py:671(_help_stuff_finish)
       10    0.000    0.000   94.852    9.485 {method 'acquire' of '_multiprocessin

<pstats.Stats at 0x7f514d455880>

In [13]:
print("Mean loader execution time over 7 runs:")
%timeit reader.load_data(num_workers=8)

Mean loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f19013d6db0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f2936a6c080 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f82a86fe6f0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7fd0b40c09e0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

### 1.3 Asynchronous loading

In [14]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = await reader.aload_data(show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('./profiling/stats_async_load')
p = pstats.Stats("./profiling/stats_async_load")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

  0%|                                                                           | 0/32 [00:00<?, ?it/s]

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f514d447b90 state=finished raised DependencyError>]. Skipping...

100%|██████████████████████████████████████████████████████████████████| 32/32 [13:51<00:00, 25.97s/it]

Created 4207 documents in 831.3364579677582s.
Sun Feb  9 21:27:10 2025    ./profiling/stats_async_load

         1820599888 function calls (1817681631 primitive calls) in 831.374 seconds

   Ordered by: cumulative time
   List reduced from 746 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  191/190    0.204    0.001  868.185    4.569 nest_asyncio.py:100(_run_once)
       32    0.000    0.000  831.054   25.970 __init__.py:328(wrapped_f)
       32    0.001    0.000  831.053   25.970 __init__.py:465(__call__)
   258/83    0.045    0.000  826.017    9.952 events.py:86(_run)
     4207    2.685    0.001  819.466    0.195 _page.py:2266(extract_text)
4345/4207   13.630    0.003  808.466    0.192 _page.py:1822(_extract_text)
       34    0.066    0.002  792.692   23.314 base.py:36(load_data)
     4345    0.016    0.000  580.748    0.134 _data_structures.py:1418(operations)
     4345   88.023    0.020  568.086    0.131 _data_structure

<pstats.Stats at 0x7f514ea45910>

In [15]:
loop = asyncio.get_event_loop()
print("Mean loader execution time over 7 runs:")
%timeit loop.run_until_complete(reader.aload_data())

Mean loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51421a36b0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51405ef4a0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5142a22ed0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51431b6d20 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at
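
The async profile in this section is still dominated by `extract_text`, just like the sequential run, which is consistent with CPU-bound work running on the event loop thread. The toy example below illustrates the general mechanism only, not LlamaIndex internals: coroutines that do heavy computation without reaching an `await` point run one after another even under `asyncio.gather`. The names `cpu_bound` and `load_one` are hypothetical stand-ins.

```python
import asyncio


def cpu_bound(n):
    """Stand-in for a CPU-heavy step such as PDF text extraction."""
    return sum(i * i for i in range(n))


async def load_one(name, n, events):
    # There is no `await` around the heavy part, so the event loop
    # cannot switch to another task until this coroutine returns.
    events.append(f"start {name}")
    result = cpu_bound(n)
    events.append(f"end {name}")
    return result


async def main():
    events = []
    results = await asyncio.gather(
        load_one("a", 50_000, events),
        load_one("b", 50_000, events),
        load_one("c", 50_000, events),
    )
    return events, results


events, results = asyncio.run(main())
print(events)
# ['start a', 'end a', 'start b', 'end b', 'start c', 'end c']
```

Each task finishes before the next one starts, so the total time matches a sequential loop. Inside a notebook with a running event loop, `await main()` would replace `asyncio.run(main())`.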

### 1.4 Asynchronous and parallel loading

#### a) Num_workers=4

In [16]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = await reader.aload_data(num_workers=4, show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('./profiling/stats_parallel_async_load_worker4')
p = pstats.Stats("./profiling/stats_parallel_async_load_worker4")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

  0%|                                                                           | 0/32 [00:00<?, ?it/s]

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f514d934c20 state=finished raised DependencyError>]. Skipping...

100%|██████████████████████████████████████████████████████████████████| 32/32 [15:09<00:00, 28.42s/it]

Created 4207 documents in 910.0696785449982s.
Sun Feb  9 22:13:07 2025    ./profiling/stats_parallel_async_load_worker4

         1820604637 function calls (1817686458 primitive calls) in 910.324 seconds

   Ordered by: cumulative time
   List reduced from 700 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       33    0.000    0.000  910.065   27.578 tasks.py:308(__step_run_and_handle_result)
       33    0.000    0.000  910.064   27.578 {method 'send' of 'coroutine' objects}
       32    0.112    0.003  909.751   28.430 asyncio.py:75(wrap_awaitable)
       34    0.003    0.000  909.648   26.754 dispatcher.py:349(async_wrapper)
       32    0.003    0.000  909.626   28.426 async_utils.py:136(worker)
       32    0.001    0.000  909.616   28.426 base.py:594(aload_file)
       32    0.000    0.000  909.328   28.416 base.py:38(aload_data)
       32    0.000    0.000  909.328   28.416 __init__.py:328(wrapped_f)
   274/78    0.20

<pstats.Stats at 0x7f514e938290>

In [17]:
loop = asyncio.get_event_loop()
print("Mean loader execution time over 7 runs:")
%timeit loop.run_until_complete(reader.aload_data(num_workers=4))

Mean loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5144af8e00 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51437ff980 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51418dd880 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5143d289b0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

#### b) Num_workers=8

In [18]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
documents = await reader.aload_data(num_workers=8, show_progress=True)
profiler.disable()
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")

profiler.dump_stats('./profiling/stats_parallel_async_load_worker8')
p = pstats.Stats("./profiling/stats_parallel_async_load_worker8")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

  0%|                                                                           | 0/32 [00:00<?, ?it/s]

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f514e980b30 state=finished raised DependencyError>]. Skipping...

100%|██████████████████████████████████████████████████████████████████| 32/32 [14:09<00:00, 26.55s/it]

Created 4207 documents in 849.7532970905304s.
Sun Feb  9 22:58:37 2025    ./profiling/stats_parallel_async_load_worker8

         1820603793 function calls (1817684805 primitive calls) in 849.751 seconds

   Ordered by: cumulative time
   List reduced from 708 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       32    0.001    0.000  849.482   26.546 base.py:594(aload_file)
       32    0.000    0.000  849.405   26.544 base.py:38(aload_data)
       32    0.000    0.000  849.405   26.544 __init__.py:328(wrapped_f)
       32    0.001    0.000  848.776   26.524 __init__.py:465(__call__)
     4207    2.931    0.001  836.551    0.199 _page.py:2266(extract_text)
4345/4207   14.022    0.003  820.634    0.195 _page.py:1822(_extract_text)
       34    0.073    0.002  775.702   22.815 base.py:36(load_data)
     4345    0.036    0.000  589.351    0.136 _data_structures.py:1418(operations)
     4345   84.023    0.019  576.355    0.133 _

<pstats.Stats at 0x7f51463b7650>

In [19]:
loop = asyncio.get_event_loop()
print("Mean loader execution time over 7 runs:")
%timeit loop.run_until_complete(reader.aload_data(num_workers=8))

Mean loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f514325e000 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51425d6150 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51427420c0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5145d51cd0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

### 1.5 Asynchronous loading with batched files

In [20]:
from pathlib import Path

def lister_fichiers(dossier):
    return [str(fichier) for fichier in Path(dossier).rglob('*') if fichier.is_file()]

fichiers = lister_fichiers(input_dir)

In [21]:
def divide_filenames_into_splits(filenames, num_jobs):
    tic = time.time()
    filenames_splits = [filenames[i::num_jobs] for i in range(num_jobs)]
    print(f"Split {len(filenames)} files into {num_jobs} lists in {round(time.time()-tic, 2)}s, with sizes {[len(job) for job in filenames_splits]}")
    return filenames_splits
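
The stride slice `filenames[i::num_jobs]` deals files out round-robin: split `i` receives items `i`, `i + num_jobs`, `i + 2 * num_jobs`, and so on, so split sizes differ by at most one. A small illustration, using a hypothetical helper name and made-up file names:

```python
def round_robin_split(items, num_jobs):
    """Split `items` into `num_jobs` lists using stride slicing."""
    return [items[i::num_jobs] for i in range(num_jobs)]


names = [f"file_{k}.pdf" for k in range(10)]  # made-up file names
splits = round_robin_split(names, 4)

for index, split in enumerate(splits):
    print(index, split)

sizes = [len(split) for split in splits]
print(sizes)  # [3, 3, 2, 2]
```

Every file lands in exactly one split. Round-robin dealing balances file counts but not file sizes, so one split can still take longer than the others.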

#### a) 4 splits

In [22]:
filenames_splits = divide_filenames_into_splits(filenames=fichiers, num_jobs=4)

Split 32 files into 4 lists in 0.0s, with sizes [8, 8, 8, 8]


In [23]:
jobs = [SimpleDirectoryReader(input_files=split).aload_data(show_progress=True) for split in filenames_splits]

profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
results = await asyncio.gather(*jobs)
# Flatten the per-split results (these are documents, despite the variable name)
nodes = []
for result in results:
    nodes.extend(result)
profiler.disable()
# Count what this cell actually loaded, not the `documents` left over from an earlier cell
print(f"\nCreated {len(nodes)} documents in {time.time()-tic}s.")

profiler.dump_stats('./profiling/stats_parallel_async_load_with_4_split_jobs')
p = pstats.Stats("./profiling/stats_parallel_async_load_with_4_split_jobs")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51444376e0 state=finished raised DependencyError>]. Skipping...

100%|███████████████████████████████████████████████████████████████████| 8/8 [13:37<00:00, 102.24s/it]
100%|███████████████████████████████████████████████████████████████████| 8/8 [13:37<00:00, 102.24s/it]
100%|███████████████████████████████████████████████████████████████████| 8/8 [13:37<00:00, 102.24s/it]
100%|███████████████████████████████████████████████████████████████████| 8/8 [13:37<00:00, 102.24s/it]

Created 4207 documents in 817.9594793319702s.
Sun Feb  9 23:41:10 2025    ./profiling/stats_parallel_async_load_with_4_split_jobs

         1820628370 function calls (1817709540 primitive calls) in 817.967 seconds

   Ordered by: cumulative time
   List reduced from 680 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  271/269    0.141    0.001  825.651    3.069 nest_asyncio.py:100(_run_once)
    42/41    0.000    0.000  817.965   19.950 tasks.py:291(__step)
    42/41    0.000    0.000  817.964   19.950 tasks.py:308(__step_run_and_handle_result)
    42/41    0.000    0.000  817.963   19.950 {method 'send' of 'coroutine' objects}
   356/69    0.001    0.000  817.958   11.854 events.py:86(_run)
   355/69    0.001    0.000  817.958   11.854 {method 'run' of '_contextvars.Context' objects}
       32    0.000    0.000  817.937   25.561 asyncio.py:75(wrap_awaitable)
       32    0.001    0.000  817.915   25.560 base.py:594(aload_fil




<pstats.Stats at 0x7f514cc74410>

In [24]:
# Define the async function to be timed
async def run_pipeline(filenames_splits):
    jobs = [SimpleDirectoryReader(input_files=split).aload_data() for split in filenames_splits]
    await asyncio.gather(*jobs)

# Use timeit to measure the execution time
time_taken = timeit.timeit(lambda: asyncio.run(run_pipeline(filenames_splits)), number=7)
print(f"Mean pipeline execution time over 7 runs: {time_taken / 7:.4f} seconds")

Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5142c42600 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f512d95f9e0 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5126f41220 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5120bd5700 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5119e2e4e0 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51135d3680 state=finished raised DependencyError>]. Skipping...
Failed to load f

#### b) 8 splits

In [25]:
filenames_splits = divide_filenames_into_splits(filenames=fichiers, num_jobs=8)

Split 32 files into 8 lists in 0.0s, with sizes [4, 4, 4, 4, 4, 4, 4, 4]


In [26]:
jobs = [SimpleDirectoryReader(input_files=split).aload_data(show_progress=True) for split in filenames_splits]

profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
results = await asyncio.gather(*jobs)
# Flatten the per-split results (these are documents, despite the variable name)
nodes = []
for result in results:
    nodes.extend(result)
profiler.disable()
# Count what this cell actually loaded, not the `documents` left over from an earlier cell
print(f"\nCreated {len(nodes)} documents in {time.time()-tic}s.")

profiler.dump_stats('./profiling/stats_parallel_async_load_with_8_split_jobs')
p = pstats.Stats("./profiling/stats_parallel_async_load_with_8_split_jobs")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f514cb190d0 state=finished raised DependencyError>]. Skipping...

100%|███████████████████████████████████████████████████████████████████| 4/4 [13:44<00:00, 206.19s/it]
100%|███████████████████████████████████████████████████████████████████| 4/4 [13:44<00:00, 206.19s/it]
100%|███████████████████████████████████████████████████████████████████| 4/4 [13:44<00:00, 206.19s/it]
100%|███████████████████████████████████████████████████████████████████| 4/4 [13:44<00:00, 206.19s/it]
100%|███████████████████████████████████████████████████████████████████| 4/4 [13:44<00:00, 206.19s/it]
100%|███████████████████████████████████████████████████████████████████| 4/4 [13:44<00:00, 206.19s/it]
100%|███████████████████████████████████████████████████████████████████| 4/4 [13:44<00:00, 206.19s/it]
100%|███████████████████████████████████████████████████████████████████| 4/4 [13:44<00:00, 206.19s/it]

Created 4207 documents in 824.8368690013885s.
Mon Feb 10 00:19:07 2025    ./profiling/stats_parallel_async_load_with_8_split_jobs

         1820670062 function calls (1817750127 primitive calls) in 824.822 seconds

   Ordered by: cumulative time
   List reduced from 669 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  437/435    0.125    0.000  889.988    2.046 nest_asyncio.py:100(_run_once)
       49    0.000    0.000  824.892   16.835 tasks.py:291(__step)
       49    0.000    0.000  824.891   16.835 tasks.py:308(__step_run_and_handle_result)
       49    0.001    0.000  824.887   16.834 {method 'send' of 'coroutine' objects}
   527/67    0.001    0.000  824.836   12.311 events.py:86(_run)
   526/67    0.002    0.000  824.836   12.311 {method 'run' of '_contextvars.Context' objects}
       32    0.000    0.000  824.776   25.774 asyncio.py:75(wrap_awaitable)
       32    0.001    0.000  824.728   25.773 base.py:594(aload_fil




<pstats.Stats at 0x7f5143bfd130>

In [27]:
# Define the async function to be timed
async def run_pipeline(filenames_splits):
    jobs = [SimpleDirectoryReader(input_files=split).aload_data() for split in filenames_splits]
    await asyncio.gather(*jobs)

# Use timeit to measure the execution time
time_taken = timeit.timeit(lambda: asyncio.run(run_pipeline(filenames_splits)), number=7)
print(f"Mean pipeline execution time over 7 runs: {time_taken / 7:.4f} seconds")

Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5141ea03e0 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f511ff28c80 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f512e5f7e00 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f51265d2810 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f511e06fb00 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f5115928590 state=finished raised DependencyError>]. Skipping...
Failed to load f

### Conclusion: data loading

| Method                      | Workers / jobs | Mean time     |
|-----------------------------|----------------|---------------|
| Sequential                  | 1              | 3min43s ±11s  |
| Parallel                    | 4 workers      | 1min41s ±6s   |
| Parallel                    | 8 workers      | 1min37s ±3s   |
| Asynchronous                | 1              | 3min52s       |
| Asynchronous + parallel     | 4 workers      | 3min51s ±29s  |
| Asynchronous + parallel     | 8 workers      | 3min34s ±6s   |
| Asynchronous, multiple jobs | 4 jobs         | 3min27s       |
| Asynchronous, multiple jobs | 8 jobs         | 3min26s       |
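
To make the table easier to compare, the mean times can be converted to seconds and expressed as a speedup over the sequential baseline. The sketch below copies the mean values from the table and ignores the ± spreads. The helper `to_seconds` and the English row labels are introduced here for illustration.

```python
import re


def to_seconds(text):
    """Convert a string such as '3min43s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)min(\d+)s", text)
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


# Mean times copied from the table above.
mean_times = {
    "Sequential": "3min43s",
    "Parallel, 4 workers": "1min41s",
    "Parallel, 8 workers": "1min37s",
    "Async": "3min52s",
    "Async + parallel, 4 workers": "3min51s",
    "Async + parallel, 8 workers": "3min34s",
    "Async, 4 jobs": "3min27s",
    "Async, 8 jobs": "3min26s",
}

baseline = to_seconds(mean_times["Sequential"])
speedups = {name: baseline / to_seconds(t) for name, t in mean_times.items()}

for name, speedup in speedups.items():
    print(f"{name:<30} {speedup:.2f}x")
```

Only the process-based `num_workers` runs give a clear gain (about 2.2x to 2.3x). Every asynchronous variant stays within roughly 10% of the sequential time.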


## 2. Data processing pipeline

#### Defining the ingestion pipeline:

In [28]:
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),
        HuggingFaceEmbedding("BAAI/bge-small-en-v1.5"),
    ]
)

# since we'll be testing performance, using timeit and cProfile
# we're going to disable cache
pipeline.disable_cache = True
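
The `chunk_size=512` and `chunk_overlap=20` settings mean that consecutive chunks share a small overlap. The sketch below shows that idea with a plain fixed-size sliding window over a token list. It is a deliberate simplification and not `SentenceSplitter`'s actual algorithm, which, as its name suggests, also tries to respect sentence boundaries. The function name is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Fixed-size sliding window: each chunk repeats the last `chunk_overlap` tokens of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


tokens = list(range(10))  # stand-in for a tokenized document
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

A larger overlap produces more chunks for the same input, and therefore more embeddings to compute.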

### 2.1 Sequential execution

By default, `num_workers` is `None`, which runs the pipeline sequentially.

In [29]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
nodes = pipeline.run(documents=documents, show_progress=True)
profiler.disable()
# Single run: report the full elapsed time (the original divided it by 5 by mistake)
print(f"\nCreated {len(nodes)} nodes in {time.time()-tic}s.")

profiler.dump_stats('./profiling/stats_sequential_ingestion')
p = pstats.Stats("./profiling/stats_sequential_ingestion")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Parsing nodes:   0%|          | 0/4207 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/9216 [00:00<?, ?it/s]


Created 9216 nodes in 211.43266320228575s.
Mon Feb 10 00:49:14 2025    ./profiling/stats_sequential_ingestion

         39007488 function calls (37398798 primitive calls) in 211.433 seconds

   Ordered by: cumulative time
   List reduced from 865 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/2    0.000    0.000  211.432  105.716 interactiveshell.py:3543(run_code)
      3/2    0.000    0.000  211.432  105.716 {built-in method builtins.exec}
      2/1    0.001    0.000  211.432  211.432 3387206633.py:1(<module>)
   4213/1    0.098    0.000  211.431  211.431 dispatcher.py:253(wrapper)
        1    0.000    0.000  211.431  211.431 pipeline.py:451(run)
        1    0.000    0.000  211.431  211.431 pipeline.py:69(run_transformations)
     2243    0.023    0.000  205.799    0.092 nest_asyncio.py:100(_run_once)
        1    0.041    0.041  200.549  200.549 base.py:305(get_text_embedding_batch)
      922    0.002    0.000  199

<pstats.Stats at 0x7f5126fdccb0>

In [30]:
print("Mean pipeline execution time over 7 runs:")
%timeit pipeline.run(documents=documents)

Mean pipeline execution time over 7 runs:
3min 10s ± 208 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 2.2 Parallel execution

A single run. Setting `num_workers` to a value greater than 1 will invoke parallel execution.

In [31]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
nodes = pipeline.run(documents=documents, num_workers=4, show_progress=True)
profiler.disable()
# NOTE: the elapsed time is divided by 5 below, so the printed value is
# one fifth of the wall-clock total reported by the profiler.
print(f"\nCreated {len(nodes)} nodes in {(time.time()-tic)/5}s.")

profiler.dump_stats('./profiling/stats_parallel_ingestion_worker4')
p = pstats.Stats("./profiling/stats_parallel_ingestion_worker4")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)


Created 9216 nodes in 67.03065361976624s.
Mon Feb 10 01:20:12 2025    ./profiling/stats_parallel_ingestion_worker4

         666884 function calls (664767 primitive calls) in 335.153 seconds

   Ordered by: cumulative time
   List reduced from 712 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      332  293.430    0.884  628.581    1.893 {built-in method time.sleep}
     1704    0.007    0.000  613.721    0.360 pool.py:500(_wait_for_updates)
     3411    0.017    0.000  340.250    0.100 connection.py:1122(wait)
     3411    0.014    0.000  340.167    0.100 selectors.py:402(select)
     3411    1.338    0.000  338.610    0.099 {method 'poll' of 'select.poll' objects}
      3/2    0.000    0.000  335.151  167.576 {built-in method builtins.exec}
      2/1    0.120    0.060  335.151  335.151 2641836050.py:1(<module>)
      2/1    0.000    0.000  335.031  335.031 dispatcher.py:253(wrapper)
        1    0.000    0.000  335.030  33

<pstats.Stats at 0x7f514c329880>

In [32]:
# Slowest method, so it is commented out to save time
# print("Mean pipeline execution time over 7 runs:")
# %timeit pipeline.run(documents=documents, num_workers=4)

### 2.3 Asynchronous execution on a single processor

As in the sync case, `num_workers` defaults to `None`, which leads to single-batch execution of the async tasks.

In [33]:
profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
nodes = await pipeline.arun(documents=documents, show_progress=True)
profiler.disable()
# NOTE: the elapsed time is divided by 5 below, so the printed value is
# one fifth of the wall-clock total reported by the profiler.
print(f"\nCreated {len(nodes)} nodes in {(time.time()-tic)/5}s.")

profiler.dump_stats('./profiling/stats_async_ingestion')
p = pstats.Stats("./profiling/stats_async_ingestion")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Parsing nodes:   0%|          | 0/4207 [00:00<?, ?it/s]

Generating embeddings: 100%|█████████████████████████████████████████| 922/922 [03:56<00:00,  3.90it/s]



Created 9216 nodes in 49.40059442520142s.
Mon Feb 10 01:24:20 2025    ./profiling/stats_async_ingestion

         163838182 function calls (150670398 primitive calls) in 246.998 seconds

   Ordered by: cumulative time
   List reduced from 850 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
13424/9217    0.250    0.000  244.788    0.027 dispatcher.py:253(wrapper)
21522/21462    0.022    0.000  236.541    0.011 events.py:86(_run)
21522/21462    0.019    0.000  236.517    0.011 {method 'run' of '_contextvars.Context' objects}
    11061    0.045    0.000  236.319    0.021 tasks.py:291(__step)
    11061    0.059    0.000  236.250    0.021 tasks.py:308(__step_run_and_handle_result)
    11060    0.024    0.000  235.736    0.021 {method 'send' of 'coroutine' objects}
     9216    0.099    0.000  235.214    0.026 base.py:284(_aget_text_embedding)
     9216    0.024    0.000  233.932    0.025 base.py:296(_get_text_embedding)
     9216  

<pstats.Stats at 0x7f511dbf6f00>

In [34]:
loop = asyncio.get_event_loop()
print("Mean pipeline execution time over 7 runs:")
%timeit loop.run_until_complete(pipeline.arun(documents=documents))

Mean pipeline execution time over 7 runs:
3min 20s ± 1.22 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 2.4 Asynchronous execution on multiple processors

Here the `ProcessPoolExecutor` from `concurrent.futures` is used to execute processes asynchronously. The tasks being processed are blocking, but they are also run asynchronously on the individual processes.

In [35]:
# Commented out: this run fails with the CUDA error shown below.
# profiler = cProfile.Profile()

# tic = time.time()
# profiler.enable()
# nodes = await pipeline.arun(documents=documents, num_workers=4, show_progress=True)
# profiler.disable()
# print(f"\nCreated {len(nodes)} nodes in {(time.time()-tic)/5}s.")

# profiler.dump_stats('./profiling/stats_parallel_async_ingestion_worker4')
# p = pstats.Stats("./profiling/stats_parallel_async_ingestion_worker4")
# p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

...

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

...

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
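
The `RuntimeError` above asks for the 'spawn' start method. As a minimal sketch of what that means in the standard library alone, the helpers below build a `ProcessPoolExecutor` whose workers are spawned rather than forked. The names `get_spawn_context` and `make_spawn_executor` are hypothetical, and this does not change how `pipeline.arun(num_workers=...)` creates its own pool; whether the pipeline can be given such an executor is not shown in this notebook.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def get_spawn_context():
    """Return a multiprocessing context that starts workers with 'spawn'."""
    return multiprocessing.get_context("spawn")


def make_spawn_executor(max_workers):
    """Build a process pool whose workers start from a fresh interpreter.

    Spawned workers do not inherit the parent's state through fork, which is
    the condition the CUDA error message asks for.
    """
    return ProcessPoolExecutor(
        max_workers=max_workers,
        mp_context=get_spawn_context(),
    )
```

With 'spawn', the functions submitted to the pool and their arguments must be picklable, and in a script the code that submits work should sit under an `if __name__ == "__main__":` guard.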

In [36]:
# loop = asyncio.get_event_loop()
# print("Mean pipeline execution time over 7 runs:")
# %timeit loop.run_until_complete(pipeline.arun(documents=documents, num_workers=4))

### 2.5 Asynchronous execution with batched documents

In [37]:
def divide_documents_into_splits(documents, num_jobs):
    """Deal documents round-robin into num_jobs lists and report their sizes."""
    tic = time.time()
    documents_splits = [documents[i::num_jobs] for i in range(num_jobs)]
    print(f"Split {len(documents)} documents into {num_jobs} lists in {round(time.time()-tic, 2)}s, sizes {[len(job) for job in documents_splits]}")
    return documents_splits

#### a) 4 splits

In [38]:
documents_splits = divide_documents_into_splits(documents=documents, num_jobs=4)

Split 4207 documents into 4 lists in 0.0s, sizes [1052, 1052, 1052, 1051]
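
The stride slice `documents[i::num_jobs]` in `divide_documents_into_splits` deals the documents out round-robin, which is why the split sizes differ by at most one. A minimal standalone sketch on a toy list (the name `round_robin_split` is hypothetical; it mirrors the function above without the timing and printing):

```python
def round_robin_split(items, num_jobs):
    """Deal items into num_jobs lists: item k goes to list k % num_jobs."""
    return [items[i::num_jobs] for i in range(num_jobs)]


toy_documents = list(range(10))
toy_splits = round_robin_split(toy_documents, 4)
print(toy_splits)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print([len(split) for split in toy_splits])  # [3, 3, 2, 2]
```

A side effect of this scheme is that neighbouring documents end up in different splits, so the order of the merged nodes no longer follows the original document order.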


In [39]:
jobs = [pipeline.arun(documents=split, show_progress=True) for split in documents_splits]

profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
results = await asyncio.gather(*jobs)
nodes = []
for result in results:
    nodes.extend(result)
profiler.disable()
# NOTE: the elapsed time is divided by 5 below, so the printed value is
# one fifth of the wall-clock total reported by the profiler.
print(f"\nCreated {len(nodes)} nodes in {(time.time()-tic)/5}s.")

profiler.dump_stats('./profiling/stats_parallel_async_ingestion_with_4_split_jobs')
p = pstats.Stats("./profiling/stats_parallel_async_ingestion_with_4_split_jobs")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Parsing nodes:   0%|          | 0/1052 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/232 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/1052 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/230 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/1052 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/230 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/1051 [00:00<?, ?it/s]

Generating embeddings: 100%|█████████████████████████████████████████| 232/232 [04:04<00:00,  1.06s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 230/230 [04:02<00:00,  1.06s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 230/230 [04:00<00:00,  1.05s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 231/231 [03:58<00:00,  1.03s/it]


Created 9216 nodes in 49.52480459213257s.
Mon Feb 10 01:55:12 2025    ./profiling/stats_parallel_async_ingestion_with_4_split_jobs

         164002373 function calls (150827547 primitive calls) in 247.571 seconds

   Ordered by: cumulative time
   List reduced from 851 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  481/478    0.051    0.000  257.725    0.539 nest_asyncio.py:100(_run_once)
21707/21181    0.032    0.000  247.553    0.012 events.py:86(_run)
13427/9220    0.254    0.000  245.390    0.027 dispatcher.py:253(wrapper)
21707/21181    0.019    0.000  238.576    0.011 {method 'run' of '_contextvars.Context' objects}
    11071    0.042    0.000  238.566    0.022 tasks.py:291(__step)
    11071    0.058    0.000  238.499    0.022 tasks.py:308(__step_run_and_handle_result)
    11070    0.022    0.000  238.234    0.022 {method 'send' of 'coroutine' objects}
     9216    0.100    0.000  237.350    0.026 base.py:284(_aget_te

<pstats.Stats at 0x7f514640a8d0>

In [40]:
# Define the async function to be timed
async def run_pipeline(documents_splits):
    jobs = [pipeline.arun(documents=split) for split in documents_splits]
    await asyncio.gather(*jobs)

# Use timeit to measure the execution time
time_taken = timeit.timeit(lambda: asyncio.run(run_pipeline(documents_splits)), number=7)
print(f"Mean pipeline execution time over 7 runs: {time_taken / 7:.4f} seconds")

Mean pipeline execution time over 7 runs: 202.3494 seconds


#### b) 8 splits

In [41]:
documents_splits = divide_documents_into_splits(documents=documents, num_jobs=8)

Split 4207 documents into 8 lists in 0.0s, sizes [526, 526, 526, 526, 526, 526, 526, 525]


In [42]:
jobs = [pipeline.arun(documents=split, show_progress=True) for split in documents_splits]

profiler = cProfile.Profile()

tic = time.time()
profiler.enable()
results = await asyncio.gather(*jobs)
nodes = []
for result in results:
    nodes.extend(result)
profiler.disable()
# NOTE: the elapsed time is divided by 5 below, so the printed value is
# one fifth of the wall-clock total reported by the profiler.
print(f"\nCreated {len(nodes)} nodes in {(time.time()-tic)/5}s.")

profiler.dump_stats('./profiling/stats_parallel_async_ingestion_with_8_split_jobs')
p = pstats.Stats("./profiling/stats_parallel_async_ingestion_with_8_split_jobs")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Parsing nodes:   0%|          | 0/526 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/117 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/526 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/115 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/526 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/113 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/526 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/117 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/526 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/115 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/526 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/116 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/526 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                   | 0/118 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/525 [00:00<?, ?it/s]

Generating embeddings: 100%|█████████████████████████████████████████| 117/117 [04:03<00:00,  2.08s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 115/115 [04:02<00:00,  2.11s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 113/113 [04:01<00:00,  2.14s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 117/117 [03:59<00:00,  2.05s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 115/115 [03:58<00:00,  2.08s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 116/116 [03:57<00:00,  2.05s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 118/118 [03:56<00:00,  2.00s/it]

Generating embeddings: 100%|█████████████████████████████████████████| 114/114 [03:55<00:00,  2.07s/it]


Created 9216 nodes in 48.99229469299316s.
Mon Feb 10 02:22:54 2025    ./profiling/stats_parallel_async_ingestion_with_8_split_jobs

         164226142 function calls (151043239 primitive calls) in 244.924 seconds

   Ordered by: cumulative time
   List reduced from 847 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  707/705    0.055    0.000  254.292    0.361 nest_asyncio.py:100(_run_once)
11084/11083    0.044    0.000  246.089    0.022 tasks.py:291(__step)
    11083    0.061    0.000  244.896    0.022 tasks.py:308(__step_run_and_handle_result)
21953/21199    0.036    0.000  244.889    0.012 events.py:86(_run)
21953/21199    0.023    0.000  244.866    0.012 {method 'run' of '_contextvars.Context' objects}
    11083    0.023    0.000  244.708    0.022 {method 'send' of 'coroutine' objects}
13431/9224    0.229    0.000  242.197    0.026 dispatcher.py:253(wrapper)
     9216    0.101    0.000  234.602    0.025 base.py:284(_aget_

<pstats.Stats at 0x7f51261968d0>

In [43]:
# Define the async function to be timed
async def run_pipeline(documents_splits):
    jobs = [pipeline.arun(documents=split) for split in documents_splits]
    await asyncio.gather(*jobs)

# Use timeit to measure the execution time
time_taken = timeit.timeit(lambda: asyncio.run(run_pipeline(documents_splits)), number=7)
print(f"Mean pipeline execution time over 7 runs: {time_taken / 7:.4f} seconds")

Mean pipeline execution time over 7 runs: 199.2023 seconds


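### Conclusion: ingestion pipeline

I'm inclined to remove multi-processing from the ingestion pipeline. It creates more issues than it solves: with 4 workers the single run took 5min35s against 3min10s sequentially, and combining it with async failed with a CUDA fork error. Async is enough (and a lot safer).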


| Method | Workers / jobs | Mean time |
|---|---|---|
| Sequential | 1 | 3min10s ± 0.2s |
| Parallel | 4 workers | 5min35s (single run) |
| Async | 1 | 3min20s ± 1s |
| Async + parallel | 4 workers | Error |
| Async, multiple jobs | 4 jobs | 3min22s |
| Async, multiple jobs | 8 jobs | 3min19s |
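
One possible reading of these timings: in the async profiles above, `_aget_text_embedding` appears to spend almost all of its time inside the synchronous `_get_text_embedding`, so the coroutines would not yield while embedding and `asyncio.gather` would have nothing to overlap. The toy sketch below illustrates that effect only; `time.sleep` stands in for a blocking embedding call and `asyncio.sleep` for a call that yields to the event loop. All names are illustrative and none of them come from llama_index.

```python
import asyncio
import time


async def blocking_job(delay):
    # Blocks the event loop, like a synchronous call wrapped in a coroutine.
    time.sleep(delay)


async def cooperative_job(delay):
    # Yields to the event loop while waiting, so other jobs can progress.
    await asyncio.sleep(delay)


async def timed_gather(job, num_jobs, delay):
    """Run num_jobs copies of job concurrently and return the elapsed time."""
    start = time.perf_counter()
    await asyncio.gather(*(job(delay) for _ in range(num_jobs)))
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(timed_gather(blocking_job, 4, 0.05))
cooperative_elapsed = asyncio.run(timed_gather(cooperative_job, 4, 0.05))
print(f"blocking: {blocking_elapsed:.2f}s, cooperative: {cooperative_elapsed:.2f}s")
```

The blocking jobs run one after another, so their total is about four times the single delay, while the cooperative jobs finish in roughly one delay. That pattern is consistent with the 4-job and 8-job rows staying close to the single async run.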
