<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/ingestion/parallel_execution_ingestion_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parallelizing a LlamaIndex RAG Pipeline

## 0. Prerequisites


In [1]:
#%pip install llama-index-cli
#%pip install llama-index-embeddings-openai
#%pip install llama-index-readers-file
#%pip install llama-index-embeddings-huggingface
#%pip install ipywidgets

In [2]:
import nest_asyncio

nest_asyncio.apply()

In [3]:
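import cProfile, pstats
from pstats import SortKey
import os
import time
import asyncio
import timeit

# The profiling cells below dump their stats into ./profiling, so make sure it exists
os.makedirs("./profiling", exist_ok=True)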

### Downloading the data


For this notebook, we'll load the `PatronusAIFinanceBenchDataset` llama-dataset from [llamahub](https://llamahub.ai).

In [4]:
# !llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data

In [5]:
!ls ./data

rag_dataset.json  source_files	test_sample


## 1. Data loading pipeline

**The PatronusAIFinanceBenchDataset data contains 32 PDFs of roughly a hundred pages each.**

Defining the reader:

In [6]:
from llama_index.core import SimpleDirectoryReader

# define our reader with the directory containing the 32 pdf files

input_dir = "./data/source_files"  # "./data/source_files" "./data/test_sample"

reader = SimpleDirectoryReader(
    input_dir=input_dir,  
    #required_exts=[".pdf"],
    recursive=True,
    )
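Before timing anything, it can help to confirm what the reader is likely to pick up. The sketch below is an illustration rather than part of the original notebook: `count_files_by_extension` is a hypothetical helper that walks a directory recursively (mirroring `recursive=True` above) and tallies files by suffix. It does not replicate any filtering the reader itself applies.

```python
from collections import Counter
from pathlib import Path


def count_files_by_extension(directory):
    """Recursively count the files under `directory`, grouped by lowercase suffix."""
    return Counter(
        path.suffix.lower() or "<none>"  # files without a suffix are grouped together
        for path in Path(directory).rglob("*")
        if path.is_file()
    )


# Example usage (assumes the dataset was already downloaded to ./data):
# print(count_files_by_extension("./data/source_files"))
```

If the count for `.pdf` does not match the expected 32, the timings that follow would not be comparable with the ones shown here.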

### 1.1 Sequential loading

In [7]:
with cProfile.Profile() as profiler:
    tic = time.time()
    documents = reader.load_data(show_progress=True)
    profiler.dump_stats('./profiling/stats_sequential_load')

print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")
p = pstats.Stats("./profiling/stats_sequential_load")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Loading files:  94%|█████████████████████████████████████████████████████████████▉    | 30/32 [13:02<00:24, 12.12s/file]

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6bc424a70 state=finished raised DependencyError>]. Skipping...


Loading files: 100%|██████████████████████████████████████████████████████████████████| 32/32 [13:23<00:00, 25.12s/file]


Created 4207 documents in 804.0599935054779s.
Wed Feb 12 08:31:02 2025    ./profiling/stats_sequential_load

         1820747926 function calls (1817826685 primitive calls) in 822.543 seconds

   Ordered by: cumulative time
   List reduced from 1324 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      344    0.222    0.001 1014.681    2.950 nest_asyncio.py:100(_run_once)
       32    0.000    0.000  822.504   25.703 base.py:493(load_file)
       32    0.000    0.000  821.071   25.658 __init__.py:328(wrapped_f)
     4207    2.418    0.001  795.196    0.189 _page.py:2266(extract_text)
4345/4207   12.929    0.003  783.965    0.186 _page.py:1822(_extract_text)
       32    0.000    0.000  728.161   22.755 __init__.py:465(__call__)
       34    0.055    0.002  579.357   17.040 base.py:36(load_data)
     4345    0.014    0.000  558.427    0.129 _data_structures.py:1418(operations)
     4345   80.612    0.019  548.878    0.126 _data




<pstats.Stats at 0x7ff6bcbd40b0>
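The profile-then-print pattern used in the cell above repeats throughout this notebook. As a sketch only (the helper name `profile_call` and the `busy` stand-in workload are made up for illustration), it could be wrapped in a small function that keeps the stats in memory instead of writing them to disk. In the report, `tottime` is the time spent in a function itself, while `cumtime` also includes the functions it calls.

```python
import cProfile
import io
import pstats
import time
from pstats import SortKey


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, elapsed_seconds, report), where report is the text that
    print_stats(top) would otherwise write to stdout.
    """
    profiler = cProfile.Profile()
    tic = time.perf_counter()
    result = profiler.runcall(func, *args, **kwargs)
    elapsed = time.perf_counter() - tic

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(top)
    return result, elapsed, stream.getvalue()


def busy(n):
    # Stand-in workload so the sketch runs without the dataset
    return sum(i * i for i in range(n))


result, elapsed, report = profile_call(busy, 10_000)
print(f"result={result}, elapsed={elapsed:.4f}s")
print(report)
```

With the reader defined earlier, a call such as `profile_call(reader.load_data, show_progress=True)` would be the synchronous equivalent of the cell above; the helper as written does not handle coroutines.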

In [8]:
print("Average loader execution time over 7 runs:")
%timeit reader.load_data()

Average loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b100a2d0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b25668a0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b2fe6ae0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b2bcaab0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

### 1.2 Parallel loading

In [9]:
import multiprocessing

num_cpus = multiprocessing.cpu_count()
print(f"Number of CPUs: {num_cpus}")

Number of CPUs: 8
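The runs below try 4 and 8 workers on this 8-CPU machine. As a rule of thumb only (this is an assumption for illustration, not guidance from LlamaIndex), a worker count can be capped by both the number of files and the number of usable CPUs. `pick_num_workers` and its `reserve` parameter are hypothetical names.

```python
import os


def pick_num_workers(num_files, cpu_count=None, reserve=0):
    """Heuristic worker count: never more workers than files or usable CPUs."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    usable = max(1, cpu_count - reserve)  # optionally keep some CPUs free
    return max(1, min(num_files, usable))


print(pick_num_workers(num_files=32, cpu_count=8))             # → 8
print(pick_num_workers(num_files=3, cpu_count=8))              # → 3
print(pick_num_workers(num_files=32, cpu_count=8, reserve=2))  # → 6
```

The result could then be passed as `num_workers` to `reader.load_data`, as in the cells that follow.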


#### a) num_workers=4

In [10]:
with cProfile.Profile() as profiler:
    tic = time.time()
    documents = reader.load_data(num_workers=4, show_progress=True)
    profiler.dump_stats('./profiling/stats_parallel_load_worker4')
    
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")
p = pstats.Stats("./profiling/stats_parallel_load_worker4")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f3724ba8a40 state=finished raised DependencyError>]. Skipping...

Created 4207 documents in 96.4300651550293s.
Wed Feb 12 08:59:44 2025    ./profiling/stats_parallel_load_worker4

         117933 function calls (117566 primitive calls) in 99.513 seconds

   Ordered by: cumulative time
   List reduced from 596 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       17    0.000    0.000   99.368    5.845 connection.py:246(recv)
      2/1    0.000    0.000   99.258   99.258 base.py:664(load_data)
       11    0.000    0.000   99.255    9.023 util.py:208(__call__)
        1    0.000    0.000   99.254   99.254 pool.py:738(__exit__)
        1    0.000    0.000   99.254   99.254 pool.py:654(terminate)
        1    0.000    0.000   99.254   99.254 pool.py:680(_terminate_pool)
    19/17    0.000    0.0

<pstats.Stats at 0x7ff6feebfa10>

In [11]:
print("Average loader execution time over 7 runs:")
%timeit reader.load_data(num_workers=4)

Average loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f2d894d2150 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff4766f3e60 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7fd181ed3e90 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7fc9eca37d10 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

#### b) num_workers=8

In [12]:
with cProfile.Profile() as profiler:
    tic = time.time()
    documents = reader.load_data(num_workers=8, show_progress=True)
    profiler.dump_stats('./profiling/stats_parallel_load_worker8')

print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")
p = pstats.Stats("./profiling/stats_parallel_load_worker8")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f3eeb0d3c50 state=finished raised DependencyError>]. Skipping...

Created 4207 documents in 93.67148470878601s.
Wed Feb 12 09:14:20 2025    ./profiling/stats_parallel_load_worker8

         419654 function calls (419167 primitive calls) in 96.877 seconds

   Ordered by: cumulative time
   List reduced from 503 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      2/1    0.000    0.000   96.704   96.704 base.py:664(load_data)
        1    0.000    0.000   96.700   96.700 pool.py:738(__exit__)
        1    0.000    0.000   96.697   96.697 pool.py:654(terminate)
       15    0.000    0.000   96.671    6.445 util.py:208(__call__)
        1    0.000    0.000   96.671   96.671 pool.py:680(_terminate_pool)
        1    0.000    0.000   96.574   96.574 pool.py:671(_help_stuff_finish)
       11    0.0

<pstats.Stats at 0x7ff6bca22450>

In [13]:
print("Average loader execution time over 7 runs:")
%timeit reader.load_data(num_workers=8)

Average loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f2ed93b3c80 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f56f35044d0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7fa4c80d3a10 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7f3e05703c20 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

### 1.3 Asynchronous loading

In [14]:
with cProfile.Profile() as profiler:
    tic = time.time()
    documents = await reader.aload_data(show_progress=True)
    profiler.dump_stats('./profiling/stats_async_load')
    
print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")
p = pstats.Stats("./profiling/stats_async_load")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

  0%|                                                                                            | 0/32 [00:00<?, ?it/s]

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6bd3d3920 state=finished raised DependencyError>]. Skipping...


100%|███████████████████████████████████████████████████████████████████████████████████| 32/32 [13:19<00:00, 24.97s/it]


Created 4207 documents in 799.2840440273285s.
Wed Feb 12 09:40:13 2025    ./profiling/stats_async_load

         1820605497 function calls (1817687005 primitive calls) in 827.844 seconds

   Ordered by: cumulative time
   List reduced from 741 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  360/177    0.004    0.000  827.690    4.676 events.py:86(_run)
       33    0.000    0.000  827.682   25.081 tasks.py:291(__step)
       33    0.000    0.000  827.682   25.081 tasks.py:308(__step_run_and_handle_result)
       33    0.000    0.000  827.681   25.081 {method 'send' of 'coroutine' objects}
       32    0.000    0.000  827.647   25.864 asyncio.py:75(wrap_awaitable)
       32    0.000    0.000  827.645   25.864 base.py:594(aload_file)
       32    0.000    0.000  827.568   25.861 base.py:38(aload_data)
       32    0.000    0.000  827.565   25.861 __init__.py:328(wrapped_f)
       32    0.001    0.000  827.564   25.861 __init__




<pstats.Stats at 0x7ff6be0bb020>
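The asynchronous run takes about as long as the sequential one. One plausible reading, offered here as an assumption rather than something the profile proves, is that PDF text extraction is CPU-bound Python code that never yields to the event loop, so coroutines scheduled together still execute one after another. The toy example below is not LlamaIndex code (`cpu_bound_job` is a made-up stand-in); it records when each coroutine starts and ends to show that behaviour.

```python
import asyncio

events = []  # (what, job_id) tuples, in the order they happen


async def cpu_bound_job(job_id, n=200_000):
    events.append(("start", job_id))
    total = sum(i * i for i in range(n))  # pure-Python work, never awaits
    events.append(("end", job_id))
    return total


async def main():
    # gather schedules the three coroutines "concurrently" on one event loop
    return await asyncio.gather(*(cpu_bound_job(i) for i in range(3)))


results = asyncio.run(main())
print(events)
# → [('start', 0), ('end', 0), ('start', 1), ('end', 1), ('start', 2), ('end', 2)]
```

Each job finishes before the next one starts, so nothing overlaps. `asyncio` helps when tasks spend their time waiting on I/O; for CPU-bound work, separate processes are the usual way to get real parallelism.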

In [15]:
print("Average loader execution time over 7 runs:")
%timeit asyncio.run(reader.aload_data())

Average loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b1c0e3f0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b2c17ad0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b22a7110 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b3e2f140 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

### 1.4 Asynchronous and parallel loading

#### a) num_workers=4

In [16]:
with cProfile.Profile() as profiler:
    tic = time.time()
    documents = await reader.aload_data(num_workers=4, show_progress=True)
    profiler.dump_stats('./profiling/stats_parallel_async_load_worker4')

print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")
p = pstats.Stats("./profiling/stats_parallel_async_load_worker4")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

  0%|                                                                                            | 0/32 [00:00<?, ?it/s]

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b57f30e0 state=finished raised DependencyError>]. Skipping...


100%|███████████████████████████████████████████████████████████████████████████████████| 32/32 [13:19<00:00, 24.98s/it]


Created 4207 documents in 799.4066798686981s.
Wed Feb 12 10:20:31 2025    ./profiling/stats_parallel_async_load_worker4

         1820603263 function calls (1817684729 primitive calls) in 825.804 seconds

   Ordered by: cumulative time
   List reduced from 701 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       33    0.000    0.000  825.793   25.024 tasks.py:291(__step)
       33    0.000    0.000  825.792   25.024 tasks.py:308(__step_run_and_handle_result)
       33    0.000    0.000  825.792   25.024 {method 'send' of 'coroutine' objects}
       34    0.001    0.000  825.677   24.285 dispatcher.py:349(async_wrapper)
       32    0.000    0.000  825.673   25.802 asyncio.py:75(wrap_awaitable)
       32    0.000    0.000  825.669   25.802 async_utils.py:136(worker)
       32    0.001    0.000  825.668   25.802 base.py:594(aload_file)
       32    0.000    0.000  825.589   25.800 base.py:38(aload_data)
       32    0.000    0




<pstats.Stats at 0x7ff6b20fbc80>

In [17]:
print("Average loader execution time over 7 runs:")
%timeit asyncio.run(reader.aload_data(num_workers=4))

Average loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6aad12630 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b3c521e0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b2e01b20 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6aaaddb50 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

#### b) num_workers=8

In [18]:
with cProfile.Profile() as profiler:
    tic = time.time()
    documents = await reader.aload_data(num_workers=8, show_progress=True)
    profiler.dump_stats('./profiling/stats_parallel_async_load_worker8')

print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")
p = pstats.Stats("./profiling/stats_parallel_async_load_worker8")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

  0%|                                                                                            | 0/32 [00:00<?, ?it/s]

Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6bd3b5010 state=finished raised DependencyError>]. Skipping...


100%|███████████████████████████████████████████████████████████████████████████████████| 32/32 [13:27<00:00, 25.25s/it]


Created 4207 documents in 808.0082750320435s.
Wed Feb 12 11:01:04 2025    ./profiling/stats_parallel_async_load_worker8

         1820603805 function calls (1817685171 primitive calls) in 820.979 seconds

   Ordered by: cumulative time
   List reduced from 733 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       33    0.000    0.000  820.974   24.878 tasks.py:291(__step)
       33    0.000    0.000  820.974   24.878 tasks.py:308(__step_run_and_handle_result)
       33    0.000    0.000  820.973   24.878 {method 'send' of 'coroutine' objects}
       34    0.001    0.000  820.914   24.145 dispatcher.py:349(async_wrapper)
       32    0.000    0.000  820.910   25.653 asyncio.py:75(wrap_awaitable)
       32    0.000    0.000  820.906   25.653 async_utils.py:136(worker)
       32    0.000    0.000  820.906   25.653 base.py:594(aload_file)
       32    0.000    0.000  820.817   25.651 base.py:38(aload_data)
       32    0.000    0




<pstats.Stats at 0x7ff6bd653980>

In [19]:
print("Average loader execution time over 7 runs:")
%timeit asyncio.run(reader.aload_data(num_workers=8))

Average loader execution time over 7 runs:
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6aa451fa0 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b21e6750 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b472a450 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b543e420 state=finished raised DependencyError>]. Skipping...
Failed to load file /mnt/d/PROJET/notebooks/notebooks_test/data/source_files/8255010981931466649.pdf with error: RetryError[<Future at

### 1.5 Asynchronous loading with multiple tasks

In [20]:
from pathlib import Path

def lister_fichiers(dossier):
    return [str(fichier) for fichier in Path(dossier).rglob('*') if fichier.is_file()]

fichiers = lister_fichiers(input_dir)

In [21]:
def divide_filenames_into_splits(filenames, num_jobs):
    tic = time.time()
    filenames_splits = [filenames[i::num_jobs] for i in range(num_jobs)]
    print(f"Split {len(filenames)} files into {num_jobs} lists in {round(time.time()-tic, 2)}s, sizes {[len(job) for job in filenames_splits]}")
    return filenames_splits
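The slice `filenames[i::num_jobs]` deals files out round-robin: split `i` receives items `i`, `i + num_jobs`, `i + 2 * num_jobs`, and so on. The sketch below restates that logic without the timing print (`round_robin_splits` is a hypothetical name) and shows what happens when the file count is not a multiple of the number of jobs.

```python
def round_robin_splits(items, num_jobs):
    """Same striding as divide_filenames_into_splits, minus the timing print."""
    return [items[i::num_jobs] for i in range(num_jobs)]


files = [f"file_{i}.pdf" for i in range(10)]  # made-up file names
splits = round_robin_splits(files, 4)

print([len(s) for s in splits])  # → [3, 3, 2, 2]
print(splits[0])                 # → ['file_0.pdf', 'file_4.pdf', 'file_8.pdf']

# With more jobs than files, some splits come out empty; dropping them
# avoids building a reader from an empty file list.
non_empty = [s for s in round_robin_splits(files[:3], 4) if s]
print(len(non_empty))            # → 3
```

Split sizes differ by at most one, so the 32 files here divide evenly into 4 or 8 splits. The splits are balanced by file count only, not by file size or page count.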

#### a) 4 splits

In [22]:
filenames_splits = divide_filenames_into_splits(filenames=fichiers, num_jobs=4)

Split 32 files into 4 lists in 0.0s, sizes [8, 8, 8, 8]


In [23]:
jobs = [SimpleDirectoryReader(input_files=split).aload_data() for split in filenames_splits]

with cProfile.Profile() as profiler:
    tic = time.time()
    results = await asyncio.gather(*jobs)
    profiler.dump_stats('./profiling/stats_parallel_async_load_with_4_split_jobs')

# Flatten the per-split results into a single list
nodes = []
for result in results:
    nodes.extend(result)
print(f"\nCreated {len(nodes)} documents in {time.time()-tic}s.")

p = pstats.Stats("./profiling/stats_parallel_async_load_with_4_split_jobs")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6bd32a6c0 state=finished raised DependencyError>]. Skipping...

Created 4207 documents in 802.3483023643494s.
Wed Feb 12 11:41:32 2025    ./profiling/stats_parallel_async_load_with_4_split_jobs

         1820594467 function calls (1817676443 primitive calls) in 802.343 seconds

   Ordered by: cumulative time
   List reduced from 586 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  173/172    0.239    0.001  812.413    4.723 nest_asyncio.py:100(_run_once)
   250/82    0.001    0.000  802.342    9.785 events.py:86(_run)
   249/82    0.001    0.000  802.342    9.785 {method 'run' of '_contextvars.Context' objects}
       41    0.000    0.000  802.342   19.569 tasks.py:291(__step)
       41    0.000    0.000  802.342   19.569 tasks.py:308(__step_run_and_handle_result)
       41    0.000    0.000  802.340   19.569 {method 'send' of 

<pstats.Stats at 0x7ff6b1f938f0>

In [24]:
# Define the async function to be timed
async def run_pipeline(filenames_splits):
    jobs = [SimpleDirectoryReader(input_files=split).aload_data() for split in filenames_splits]
    await asyncio.gather(*jobs)

print("Average loader execution time over 7 runs:")
%timeit asyncio.run(run_pipeline(filenames_splits))

Average loader execution time over 7 runs:
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b2e30e90 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b56f4f50 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b5337aa0 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6aabde3c0 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b2293110 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b20b1040 state=finished 

#### b) 8 splits

In [25]:
filenames_splits = divide_filenames_into_splits(filenames=fichiers, num_jobs=8)

Split 32 files into 8 lists in 0.0s, sizes [4, 4, 4, 4, 4, 4, 4, 4]


In [26]:
jobs = [SimpleDirectoryReader(input_files=split).aload_data() for split in filenames_splits]

with cProfile.Profile() as profiler:
    tic = time.time()
    results = await asyncio.gather(*jobs)
    profiler.dump_stats('./profiling/stats_parallel_async_load_with_8_split_jobs')
    
# Flatten the per-split results into a single list
nodes = []
for result in results:
    nodes.extend(result)
print(f"\nCreated {len(nodes)} documents in {time.time()-tic}s.")

p = pstats.Stats("./profiling/stats_parallel_async_load_with_8_split_jobs")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b4832360 state=finished raised DependencyError>]. Skipping...

Created 4207 documents in 816.000452041626s.
Wed Feb 12 12:22:10 2025    ./profiling/stats_parallel_async_load_with_8_split_jobs

         1820594130 function calls (1817676055 primitive calls) in 815.985 seconds

   Ordered by: cumulative time
   List reduced from 581 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  176/175    0.187    0.001  824.071    4.709 nest_asyncio.py:100(_run_once)
   263/92    0.001    0.000  815.985    8.869 events.py:86(_run)
   262/92    0.001    0.000  815.984    8.869 {method 'run' of '_contextvars.Context' objects}
       49    0.000    0.000  815.984   16.653 tasks.py:291(__step)
       49    0.000    0.000  815.984   16.653 tasks.py:308(__step_run_and_handle_result)
       49    0.000    0.000  815.982   16.653 {method 'send' of '

<pstats.Stats at 0x7ff6be1434a0>

In [27]:
# Define the async function to be timed
async def run_pipeline(filenames_splits):
    jobs = [SimpleDirectoryReader(input_files=split).aload_data() for split in filenames_splits]
    await asyncio.gather(*jobs)

print("Average loader execution time over 7 runs:")
%timeit asyncio.run(run_pipeline(filenames_splits))

Average loader execution time over 7 runs:
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b0f30140 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6aaf37e90 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b22bc4d0 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b56c3fe0 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b5e77860 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b124bb90 state=finished 

#### c) 1 split per file

In [28]:
jobs = [SimpleDirectoryReader(input_files=[file]).aload_data() for file in fichiers]

with cProfile.Profile() as profiler:
    tic = time.time()
    results = await asyncio.gather(*jobs)
    profiler.dump_stats('./profiling/stats_async_load_with_1job_per_file')
    
# Flatten the per-file results into a single list
nodes = []
for result in results:
    nodes.extend(result)
print(f"\nCreated {len(nodes)} documents in {time.time()-tic}s.")

p = pstats.Stats("./profiling/stats_async_load_with_1job_per_file")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6bc2441a0 state=finished raised DependencyError>]. Skipping...

Created 4207 documents in 807.4125363826752s.
Wed Feb 12 13:02:42 2025    ./profiling/stats_async_load_with_1job_per_file

         1820597012 function calls (1817678976 primitive calls) in 807.401 seconds

   Ordered by: cumulative time
   List reduced from 581 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  175/174    0.157    0.001  882.249    5.070 nest_asyncio.py:100(_run_once)
  333/163    0.001    0.000  807.400    4.953 events.py:86(_run)
  332/163    0.001    0.000  807.400    4.953 {method 'run' of '_contextvars.Context' objects}
       97    0.000    0.000  807.399    8.324 tasks.py:291(__step)
       97    0.000    0.000  807.399    8.324 tasks.py:308(__step_run_and_handle_result)
       97    0.000    0.000  807.396    8.324 {method 'send' of 'corouti

<pstats.Stats at 0x7ff6bd6a4140>

In [29]:
# Define the async function to be timed
async def run_pipeline(fichiers):
    jobs = [SimpleDirectoryReader(input_files=[file]).aload_data() for file in fichiers]
    await asyncio.gather(*jobs)

print("Average loader execution time over 7 runs:")
%timeit asyncio.run(run_pipeline(fichiers))

Average loader execution time over 7 runs:
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b2497020 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b2fc3b30 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b50375f0 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b12676e0 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b5b56d80 state=finished raised DependencyError>]. Skipping...
Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b17f6ab0 state=finished 

### 1.6 Parallel loading without LlamaIndex's num_workers

#### 8 processes

In [30]:
from concurrent.futures import ProcessPoolExecutor

# Load one split of files
def load_data_segment(file_split):
    # Each worker process builds its own reader for the files in its split
    return SimpleDirectoryReader(input_files=file_split).load_data(show_progress=True)

file_split = divide_filenames_into_splits(filenames=fichiers, num_jobs=8)

with cProfile.Profile() as profiler:
    tic = time.time()
    # Use a process pool to load the splits in parallel
    with ProcessPoolExecutor() as executor:
        res = list(executor.map(load_data_segment, file_split))
    profiler.dump_stats('./profiling/stats_custom_parallel_load')
documents = [doc for res_proc in res for doc in res_proc]

print(f"\nCreated {len(documents)} documents in {time.time()-tic}s.")
p = pstats.Stats("./profiling/stats_custom_parallel_load")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)


Split 32 files into 8 lists in 0.0s, sizes [4, 4, 4, 4, 4, 4, 4, 4]


Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [01:35<00:00, 23.81s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [02:03<00:00, 30.89s/file]
Loading files:  75%|███████████████████████████████████████████████████                 | 3/4 [02:20<00:47, 47.71s/file]

Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6b6f16c00 state=finished raised DependencyError>]. Skipping...


Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [02:20<00:00, 35.08s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [02:58<00:00, 44.55s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [03:03<00:00, 46.00s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [03:15<00:00, 48.91s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [03:34<00:00, 53.74s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [03:57<00:00, 59.45s/file]



Created 4207 documents in 238.05858945846558s.
Wed Feb 12 13:33:42 2025    ./profiling/stats_custom_parallel_load

         84873 function calls (83603 primitive calls) in 238.023 seconds

   Ordered by: cumulative time
   List reduced from 480 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  201/200    0.000    0.000  238.019    1.190 threading.py:1153(_wait_for_tstate_lock)
        1    0.000    0.000  238.018  238.018 _base.py:646(__exit__)
        1    0.000    0.000  238.018  238.018 process.py:864(shutdown)
      2/1    0.000    0.000  238.018  238.018 threading.py:1115(join)
      2/1    0.000    0.000  237.914  237.914 threading.py:1016(_bootstrap)
      2/1    0.000    0.000  237.913  237.913 threading.py:1056(_bootstrap_inner)
        1    0.000    0.000  237.913  237.913 process.py:340(run)
        1    0.000    0.000  237.913  237.913 process.py:574(join_executor_internals)
        1    0.000    0.000  237.913  23

<pstats.Stats at 0x7ff6b13556a0>

In [31]:
from concurrent.futures import ProcessPoolExecutor

# Load one split of files with its own reader
def load_data_segment(file_split):
    # A fresh SimpleDirectoryReader is restricted to the files of this split
    return SimpleDirectoryReader(input_files=file_split).load_data(show_progress=True)

file_splits = divide_filenames_into_splits(filenames=fichiers, num_jobs=8)

def run_pipeline(file_splits):
    # Use a process pool to load the splits in parallel
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(load_data_segment, file_splits))
    # Flatten the per-split lists into a single list of documents
    return [doc for docs in results for doc in docs]

print("Average loader execution time over 7 runs:")
%timeit run_pipeline(file_splits)

Split 32 files into 8 lists in 0.0s with sizes [4, 4, 4, 4, 4, 4, 4, 4]
Average loader execution time over 7 runs:


Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:28<00:00,  7.16s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:37<00:00,  9.33s/file]
Loading files:  75%|███████████████████████████████████████████████████                 | 3/4 [00:42<00:14, 14.61s/file]

Failed to load file data/source_files/8255010981931466649.pdf with error: RetryError[<Future at 0x7ff6bc8ea510 state=finished raised DependencyError>]. Skipping...


Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:42<00:00, 10.72s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:54<00:00, 13.74s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:55<00:00, 13.86s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:57<00:00, 14.40s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [01:04<00:00, 16.01s/file]
Loading files: 100%|████████████████████████████████████████████████████████████████████| 4/4 [01:09<00:00, 17.37s/file]


1min 7s ± 145 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Conclusion: data loading

| Method | Num_proc | Mean time |
|---------------|----------|-------------|
| Sequential | 1 | 3min34s ±1s |
| Parallel | 4 workers | 1min41s ±1.4s |
| Parallel | 8 workers | 1min37s ±1s |
| Asynchronous | 1 | 3min29s ±0.4s |
| Asynchronous/Parallel | 4 workers | 3min27s ±3s |
| Asynchronous/Parallel | 8 workers | 3min22s ±1s |
| Asynchronous/Multiple jobs | 4 jobs | 3min22s ±0.2s |
| Asynchronous/Multiple jobs | 8 jobs | 3min22s ±0.1s |
| Asynchronous/Multiple jobs | 1 job per file | 3min22s ±0.2s |
| Parallel without num_workers | 8 processes | 1min7s ±0.1s |
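
To read the table as speedups, each mean time can be compared with the sequential baseline. The short sketch below does that arithmetic for a few rows. The `to_seconds` helper and the `mean_times` dictionary are illustrative names, and the values are copied from the table above with the error bars dropped.

```python
import re

def to_seconds(text: str) -> int:
    """Convert a duration such as '3min34s' into seconds."""
    match = re.fullmatch(r"(\d+)min(\d+)s", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)

# Mean times copied from the table above (error bars dropped)
mean_times = {
    "Sequential": "3min34s",
    "Parallel, 4 workers": "1min41s",
    "Parallel, 8 workers": "1min37s",
    "Asynchronous": "3min29s",
    "Parallel without num_workers, 8 processes": "1min7s",
}

baseline = to_seconds(mean_times["Sequential"])
# Speedup = sequential time / method time, rounded to two decimals
speedups = {name: round(baseline / to_seconds(t), 2) for name, t in mean_times.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup}x")
```

On these numbers, process-based loading is roughly two to three times faster than the sequential baseline, while the asynchronous loader stays close to 1x.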

## 2. Data processing pipeline

#### Defining the ingestion pipeline:

In [33]:
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),
        HuggingFaceEmbedding("BAAI/bge-small-en-v1.5"),
    ]
)

# since we'll be testing performance, using timeit and cProfile
# we're going to disable cache
pipeline.disable_cache = True


### 2.1 Sequential execution

By default `num_workers` is set to `None` and this will invoke sequential execution.

In [34]:
with cProfile.Profile() as profiler:
    tic = time.time()
    nodes = pipeline.run(documents=documents, show_progress=True)
    profiler.dump_stats('./profiling/stats_sequential_ingestion')

# Report the wall-clock time of the single run
print(f"\nCreated {len(nodes)} nodes in {time.time() - tic:.1f}s.")
p = pstats.Stats("./profiling/stats_sequential_ingestion")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Parsing nodes:   0%|          | 0/4207 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/8974 [00:00<?, ?it/s]


Created 8974 nodes in 203.0s.
Wed Feb 12 21:13:25 2025    ./profiling/stats_sequential_ingestion

         38123766 function calls (36550446 primitive calls) in 202.969 seconds

   Ordered by: cumulative time
   List reduced from 848 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   4212/1    0.078    0.000  202.966  202.966 dispatcher.py:253(wrapper)
        1    0.000    0.000  202.966  202.966 pipeline.py:451(run)
        1    0.000    0.000  202.966  202.966 pipeline.py:69(run_transformations)
     2196    0.029    0.000  198.914    0.091 nest_asyncio.py:100(_run_once)
        1    0.038    0.038  193.299  193.299 base.py:305(get_text_embedding_batch)
      898    0.002    0.000  192.298    0.214 base.py:308(_get_text_embeddings)
      898    0.003    0.000  192.296    0.214 base.py:239(_embed)
      898    0.007    0.000  192.293    0.214 __init__.py:328(wrapped_f)
      898    0.010    0.000  192.266    0.21

<pstats.Stats at 0x7ff5d5f89850>

In [35]:
print("Average pipeline execution time over 7 runs:")
%timeit pipeline.run(documents=documents)

Average pipeline execution time over 7 runs:
3min 7s ± 1.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 2.2 Parallel execution

A single run. Setting `num_workers` to a value greater than 1 will invoke parallel execution.

In [36]:
with cProfile.Profile() as profiler:
    tic = time.time()
    nodes = pipeline.run(documents=documents, num_workers=4, show_progress=True)
    profiler.dump_stats('./profiling/stats_parallel_ingestion_worker4')

# Report the wall-clock time of the single run
print(f"\nCreated {len(nodes)} nodes in {time.time() - tic:.1f}s.")
p = pstats.Stats("./profiling/stats_parallel_ingestion_worker4")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)




Created 8974 nodes in 349.0s.
Wed Feb 12 21:44:10 2025    ./profiling/stats_parallel_ingestion_worker4

         639667 function calls (637856 primitive calls) in 349.008 seconds

   Ordered by: cumulative time
   List reduced from 667 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9    0.001    0.000  669.616   74.402 connection.py:202(send)
      2/1    0.059    0.030  348.943  348.943 dispatcher.py:253(wrapper)
        1    0.000    0.000  348.942  348.942 pipeline.py:451(run)
        1    0.000    0.000  348.942  348.942 pool.py:738(__exit__)
        1    0.000    0.000  348.942  348.942 pool.py:654(terminate)
       14    0.000    0.000  348.934   24.924 connection.py:406(_send_bytes)
       19    0.008    0.000  348.933   18.365 connection.py:381(_send)
       11    0.000    0.000  348.924   31.720 util.py:208(__call__)
        1    0.000    0.000  348.924  348.924 pool.py:680(_terminate_pool)
       

<pstats.Stats at 0x7ff5a246fdd0>

In [37]:
# Slowest method; commented out to save time
# print("Average pipeline execution time over 7 runs:")
# %timeit pipeline.run(documents=documents, num_workers=4)

### 2.3 Asynchronous execution on a single process

As in the sync case, `num_workers` defaults to `None`, which leads to single-batch execution of the async tasks.

In [38]:
with cProfile.Profile() as profiler:
    tic = time.time()
    nodes = await pipeline.arun(documents=documents, show_progress=True)
    profiler.dump_stats('./profiling/stats_async_ingestion')

# Report the wall-clock time of the single run
print(f"\nCreated {len(nodes)} nodes in {time.time() - tic:.1f}s.")
p = pstats.Stats("./profiling/stats_async_ingestion")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Parsing nodes:   0%|          | 0/4207 [00:00<?, ?it/s]

Generating embeddings: 100%|██████████████████████████████████████████████████████████| 898/898 [03:56<00:00,  3.79it/s]



Created 8974 nodes in 246.7s.
Wed Feb 12 21:48:17 2025    ./profiling/stats_async_ingestion

         159672490 function calls (146825357 primitive calls) in 246.740 seconds

   Ordered by: cumulative time
   List reduced from 854 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
20981/20921    0.081    0.000  246.888    0.012 events.py:86(_run)
13182/8975    0.279    0.000  244.394    0.027 dispatcher.py:253(wrapper)
    10771    0.039    0.000  237.273    0.022 tasks.py:291(__step)
    10771    0.055    0.000  237.210    0.022 tasks.py:308(__step_run_and_handle_result)
    10771    0.025    0.000  237.042    0.022 {method 'send' of 'coroutine' objects}
     8974    0.099    0.000  235.990    0.026 base.py:284(_aget_text_embedding)
     8974    0.032    0.000  234.635    0.026 base.py:296(_get_text_embedding)
     8974    0.022    0.000  234.603    0.026 base.py:239(_embed)
     8974    0.066    0.000  234.581    0

<pstats.Stats at 0x7ff5a907fa40>

In [39]:
print("Average pipeline execution time over 7 runs:")
%timeit asyncio.run(pipeline.arun(documents=documents))

Average pipeline execution time over 7 runs:
3min 13s ± 2.07 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 2.4 Asynchronous execution across multiple processes

Here the `ProcessPoolExecutor` from `concurrent.futures` is used to run the work in several processes asynchronously. The tasks themselves are blocking, but each process runs its share of them asynchronously.

In [40]:
# profiler = cProfile.Profile()

# tic = time.time()
# profiler.enable()
# nodes = await pipeline.arun(documents=documents, num_workers=4, show_progress=True)
# profiler.disable()
# print(f"\nCreated {len(nodes)} nodes in {time.time() - tic:.1f}s.")

# profiler.dump_stats('./profiling/stats_parallel_async_ingestion_worker4')
# p = pstats.Stats("./profiling/stats_parallel_async_ingestion_worker4")
# p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

...

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

...

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
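
The error message above asks for the 'spawn' start method. In the standard library, a start method can be chosen per executor through a multiprocessing context, as sketched below. `make_spawn_executor` is a hypothetical helper, and whether `IngestionPipeline` accepts such an executor is not shown in this notebook, so this only illustrates the standard-library mechanism.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def make_spawn_executor(max_workers: int = 4) -> ProcessPoolExecutor:
    """Build a process pool whose workers are started with 'spawn' instead of 'fork'."""
    # A spawned worker starts from a fresh interpreter, so it does not inherit
    # CUDA state that the parent process has already initialised.
    ctx = mp.get_context("spawn")
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)

# Inspect the start method of a spawn context without launching any worker
spawn_ctx = mp.get_context("spawn")
print(spawn_ctx.get_start_method())
```

With 'spawn', functions submitted to the pool must be importable by the worker processes, which in a notebook usually means defining them in a separate module rather than in a cell.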

In [41]:
# loop = asyncio.get_event_loop()
# print("Average pipeline execution time over 7 runs:")
# %timeit loop.run_until_complete(pipeline.arun(documents=documents, num_workers=4))

### 2.5 Asynchronous execution with batched documents

In [42]:
def divide_documents_into_splits(documents, num_jobs):
    tic = time.time()
    # Round-robin slicing: split i takes documents i, i + num_jobs, i + 2 * num_jobs, ...
    documents_splits = [documents[i::num_jobs] for i in range(num_jobs)]
    print(f"Split {len(documents)} documents into {num_jobs} lists in {round(time.time()-tic, 2)}s with sizes {[len(job) for job in documents_splits]}")
    return documents_splits
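
The slice `documents[i::num_jobs]` deals documents out round-robin, so split sizes differ by at most one. A small stand-alone check, using integers in place of documents (the `round_robin_splits` name is illustrative):

```python
def round_robin_splits(items, num_jobs):
    """Split i receives items i, i + num_jobs, i + 2 * num_jobs, ..."""
    return [items[i::num_jobs] for i in range(num_jobs)]

items = list(range(10))
splits = round_robin_splits(items, num_jobs=4)
print(splits)                    # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print([len(s) for s in splits])  # [3, 3, 2, 2]
```

Because neighbouring items land in different splits, concatenating the per-split results does not restore the original order.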

#### a) 4 splits

In [43]:
documents_splits = divide_documents_into_splits(documents=documents, num_jobs=4)

Split 4207 documents into 4 lists in 0.0s with sizes [1052, 1052, 1052, 1051]


In [44]:
jobs = [pipeline.arun(documents=split, show_progress=True) for split in documents_splits]

with cProfile.Profile() as profiler:
    tic = time.time()
    results = await asyncio.gather(*jobs)
    profiler.dump_stats('./profiling/stats_parallel_async_ingestion_with_4_split_jobs')

# Flatten the per-split results into a single list of nodes
nodes = []
for result in results:
    nodes.extend(result)
print(f"\nCreated {len(nodes)} nodes in {time.time() - tic:.1f}s.")

p = pstats.Stats("./profiling/stats_parallel_async_ingestion_with_4_split_jobs")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

Parsing nodes:   0%|          | 0/1052 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                                    | 0/223 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/1052 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                                    | 0/226 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/1052 [00:00<?, ?it/s]

Generating embeddings:   0%|                                                                    | 0/226 [00:00<?, ?it/s]

Parsing nodes:   0%|          | 0/1051 [00:00<?, ?it/s]

Generating embeddings: 100%|██████████████████████████████████████████████████████████| 223/223 [03:52<00:00,  1.04s/it]

Generating embeddings: 100%|██████████████████████████████████████████████████████████| 226/226 [03:50<00:00,  1.02s/it]

Generating embeddings: 100%|██████████████████████████████████████████████████████████| 226/226 [03:47<00:00,  1.01s/it]

Generating embeddings: 100%|██████████████████████████████████████████████████████████| 224/224 [03:45<00:00,  1.01s/it]


Created 8974 nodes in 234.8s.
Wed Feb 12 22:18:06 2025    ./profiling/stats_parallel_async_ingestion_with_4_split_jobs

         159827767 function calls (146994407 primitive calls) in 234.726 seconds

   Ordered by: cumulative time
   List reduced from 846 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  475/473    0.045    0.000  236.054    0.499 nest_asyncio.py:100(_run_once)
21141/20622    0.032    0.000  234.667    0.011 events.py:86(_run)
21141/20622    0.017    0.000  234.646    0.011 {method 'run' of '_contextvars.Context' objects}
    10781    0.036    0.000  234.630    0.022 tasks.py:291(__step)
    10781    0.049    0.000  234.571    0.022 tasks.py:308(__step_run_and_handle_result)
    10781    0.021    0.000  234.411    0.022 {method 'send' of 'coroutine' objects}
13185/8978    0.239    0.000  232.407    0.026 dispatcher.py:253(wrapper)
     8974    0.088    0.000  224.776    0.025 base.py:284(_aget_t

<pstats.Stats at 0x7ff5d584c680>

In [45]:
# Define the async function to be timed
async def run_pipeline(documents_splits):
    jobs = [pipeline.arun(documents=split) for split in documents_splits]
    await asyncio.gather(*jobs)

# Use timeit to measure the execution time
print("Average pipeline execution time over 7 runs:")
%timeit asyncio.run(run_pipeline(documents_splits))

Average pipeline execution time over 7 runs:
3min 11s ± 178 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### b) 8 splits

In [46]:
documents_splits = divide_documents_into_splits(documents=documents, num_jobs=8)

Split 4207 documents into 8 lists in 0.0s with sizes [526, 526, 526, 526, 526, 526, 526, 525]


In [47]:
jobs = [pipeline.arun(documents=split) for split in documents_splits]

with cProfile.Profile() as profiler:
    tic = time.time()
    results = await asyncio.gather(*jobs)
    profiler.dump_stats('./profiling/stats_parallel_async_ingestion_with_8_split_jobs')

# Flatten the per-split results into a single list of nodes
nodes = []
for result in results:
    nodes.extend(result)
print(f"\nCreated {len(nodes)} nodes in {time.time() - tic:.1f}s.")

p = pstats.Stats("./profiling/stats_parallel_async_ingestion_with_8_split_jobs")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)


Created 8974 nodes in 234.5s.
Wed Feb 12 22:47:31 2025    ./profiling/stats_parallel_async_ingestion_with_8_split_jobs

         159488162 function calls (146664153 primitive calls) in 234.371 seconds

   Ordered by: cumulative time
   List reduced from 588 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
20732/20683    0.020    0.000  234.295    0.011 events.py:86(_run)
20732/20683    0.017    0.000  234.275    0.011 {method 'run' of '_contextvars.Context' objects}
10796/10795    0.038    0.000  234.228    0.022 tasks.py:291(__step)
    10795    0.051    0.000  234.167    0.022 tasks.py:308(__step_run_and_handle_result)
    10795    0.021    0.000  234.002    0.022 {method 'send' of 'coroutine' objects}
13189/8982    0.237    0.000  232.838    0.026 dispatcher.py:253(wrapper)
    58/57    0.061    0.001  229.499    4.026 nest_asyncio.py:100(_run_once)
     8974    0.089    0.000  225.429    0.025 base.py:284(_aget_

<pstats.Stats at 0x7ff62216d7f0>

In [48]:
# Define the async function to be timed
async def run_pipeline(documents_splits):
    jobs = [pipeline.arun(documents=split) for split in documents_splits]
    await asyncio.gather(*jobs)

print("Average pipeline execution time over 7 runs:")
%timeit asyncio.run(run_pipeline(documents_splits))

Average pipeline execution time over 7 runs:
3min 11s ± 619 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### c) 1 document per job

In [49]:
async def process_documents(documents):
    jobs = [pipeline.arun(documents=[doc]) for doc in documents]
    return await asyncio.gather(*jobs)

with cProfile.Profile() as profiler:
    tic = time.time()
    # Run the async function
    results = asyncio.run(process_documents(documents))
    profiler.dump_stats('./profiling/stats_async_ingestion_with_1job_per_doc')

# Flatten the per-document results into a single list of nodes
nodes = []
for result in results:
    nodes.extend(result)
print(f"\nCreated {len(nodes)} nodes in {time.time() - tic:.1f}s.")

p = pstats.Stats("./profiling/stats_async_ingestion_with_1job_per_doc")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)


Created 8974 nodes in 237.8s.
Wed Feb 12 23:17:01 2025    ./profiling/stats_async_ingestion_with_1job_per_doc

         162577911 function calls (149689667 primitive calls) in 237.683 seconds

   Ordered by: cumulative time
   List reduced from 655 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
43252/43198    0.031    0.000  237.624    0.006 {method 'run' of '_contextvars.Context' objects}
    25804    0.063    0.000  237.382    0.009 tasks.py:291(__step)
    25804    0.078    0.000  237.287    0.009 tasks.py:308(__step_run_and_handle_result)
    25804    0.039    0.000  237.056    0.009 {method 'send' of 'coroutine' objects}
        1    0.007    0.007  236.893  236.893 nest_asyncio.py:86(run_until_complete)
17388/13181    0.268    0.000  233.636    0.018 dispatcher.py:253(wrapper)
     8974    0.090    0.000  225.374    0.025 base.py:284(_aget_text_embedding)
     8974    0.026    0.000  224.178    0.025 base.p

<pstats.Stats at 0x7ff6bc49e4e0>

In [50]:
# Define the async function to be timed
async def run_pipeline(documents):
    jobs = [pipeline.arun(documents=[doc]) for doc in documents]
    await asyncio.gather(*jobs)

print("Average pipeline execution time over 7 runs:")
%timeit asyncio.run(run_pipeline(documents))

Average pipeline execution time over 7 runs:
3min 12s ± 354 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Conclusion: ingestion pipeline

I'm inclined to remove multi-processing from the ingestion pipeline. It creates more issues than it solves; async is enough (and a lot safer).


| Method | Num_proc | Mean time |
|---------------|----------|-------------|
| Sequential | 1 | 3min10s ± 2s |
| Parallel | 4 workers | 5min35s |
| Asynchronous | 1 | 3min20s ± 1s |
| Asynchronous/Parallel | 4 workers | Error |
| Asynchronous/Multiple jobs | 4 jobs | 3min22s |
| Asynchronous/Multiple jobs | 8 jobs | 3min19s |
| Asynchronous/Multiple jobs | 1 job per doc | 3min12s ± 0.4s |
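
One possible reading of these timings: the async profiles show `_aget_text_embedding` calling the synchronous `_get_text_embedding`, so each embedding call would hold the event loop and `asyncio.gather` could not overlap them. The sketch below reproduces that effect with `time.sleep` as a stand-in for a blocking embedding call. All names are illustrative and nothing here uses LlamaIndex.

```python
import asyncio
import time

async def blocking_job(delay: float) -> None:
    # Stand-in for a synchronous call made inside a coroutine: it holds the event loop
    time.sleep(delay)

async def cooperative_job(delay: float) -> None:
    # Stand-in for a truly asynchronous call: it yields to the event loop while waiting
    await asyncio.sleep(delay)

async def timed_gather(job, n_jobs: int, delay: float) -> float:
    # Run n_jobs copies of `job` concurrently and return the elapsed wall-clock time
    start = time.perf_counter()
    await asyncio.gather(*(job(delay) for _ in range(n_jobs)))
    return time.perf_counter() - start

blocking_elapsed = asyncio.run(timed_gather(blocking_job, n_jobs=4, delay=0.05))
cooperative_elapsed = asyncio.run(timed_gather(cooperative_job, n_jobs=4, delay=0.05))

print(f"blocking:    {blocking_elapsed:.2f}s")
print(f"cooperative: {cooperative_elapsed:.2f}s")
```

The blocking jobs run one after another, so their total is about the sum of the delays, whereas the cooperative jobs overlap. If the embedding step behaves like the blocking case, splitting documents into more async jobs would not be expected to shorten the run.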
