Add parallel processing to `map` step of map reduce summarisation #656

andy-symonds · 2024-06-25T14:58:14Z

Context

Before this change, the map step of map reduce summarisation, which is used to summarise large documents, did not run in parallel. The code needed updating so that the chunks of a large document are summarised into smaller summaries in parallel using LCEL batch.

batch only takes a list as input, so I split the question prompt into two parts: map_question_prompt, which is populated with the question and chat_history before being passed to the map_operation, and map_document_prompt, which is populated when running batch.
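
A rough sketch of the two-stage prompt filling described above, using only the Python standard library rather than LangChain. The template text and the names fill_question, fake_summarise and map_documents are hypothetical, ThreadPoolExecutor stands in for LCEL batch, and fake_summarise stands in for the LLM call. It is not the code in this PR.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Hypothetical template; the real prompts in the repository will differ.
MAP_PROMPT = "Question: {question}\nHistory: {chat_history}\nDocument: {document}"


def fill_question(question, chat_history):
    """Stage 1: bind question and chat_history, leaving the document slot open."""
    return partial(MAP_PROMPT.format, question=question, chat_history=chat_history)


def fake_summarise(prompt):
    """Stand-in for the LLM call: returns the start of the prompt's last line."""
    return prompt.splitlines()[-1][:40]


def map_documents(documents, question, chat_history, max_concurrency=4):
    """Stage 2: fill the document slot once per document, then map in parallel."""
    render = fill_question(question, chat_history)
    prompts = [render(document=doc) for doc in documents]
    # pool.map returns results in input order, like a batch call over a list.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fake_summarise, prompts))


summaries = map_documents(["first chunk", "second chunk"], "What changed?", "")
```

The point of the split is that the per-request values are bound once, so the parallel step only has to vary over the list of documents.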

Changes proposed in this pull request

Updated build_map_reduce_summary_chain to allow for batch processing
Made Chunk model iterable
Renamed max_tokens in env var AI Settings to summarisation_chunk_max_tokens to better describe what it does. max_tokens is already an overloaded term.
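
For illustration, the renamed setting might be read along these lines. The shape of AISettings, the SUMMARISATION_CHUNK_MAX_TOKENS env var name and the default value are all assumptions made for this sketch, not taken from the repository.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AISettings:
    # Hypothetical default; the name says which token limit this is,
    # unlike the overloaded max_tokens.
    summarisation_chunk_max_tokens: int = 20_000

    @classmethod
    def from_env(cls, env=None):
        """Build settings from a mapping (os.environ unless one is passed in)."""
        env = os.environ if env is None else env
        raw = env.get("SUMMARISATION_CHUNK_MAX_TOKENS")  # assumed env var name
        if raw is None:
            return cls()
        return cls(summarisation_chunk_max_tokens=int(raw))


settings = AISettings.from_env({"SUMMARISATION_CHUNK_MAX_TOKENS": "500"})
```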

Guidance to review

Check you are happy with how I have structured build_map_reduce_summary_chain.

Relevant links

Things to check

I have added any new ENV vars in all deployed environments
I have tested any code added or changed
I have run integration tests

jamesrichards4

Looks good. It is probably worth making max_concurrency a setting, though, so we can change it if we run into issues.
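
A minimal sketch of what a configurable concurrency cap could look like. In LangChain, batch accepts a config with a max_concurrency key; here a ThreadPoolExecutor stands in for it so the example runs with the standard library alone. The setting name map_max_concurrency, its default, ConcurrencyProbe and run_map_step are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class AISettings:
    map_max_concurrency: int = 2  # hypothetical setting name and default


class ConcurrencyProbe:
    """Fake summariser that records how many calls were in flight at once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def summarise(self, chunk):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # simulate a slow model call
        with self._lock:
            self._active -= 1
        return chunk.upper()


def run_map_step(chunks, settings, summarise):
    """Run the map step with at most settings.map_max_concurrency workers."""
    with ThreadPoolExecutor(max_workers=settings.map_max_concurrency) as pool:
        return list(pool.map(summarise, chunks))


probe = ConcurrencyProbe()
results = run_map_step(["a", "b", "c", "d"], AISettings(map_max_concurrency=2), probe.summarise)
```

Keeping the cap in settings means it can be lowered without a code change if the model endpoint starts rate limiting.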

andy-symonds added 5 commits June 25, 2024 15:45

[REDBOX-409] | AS | Updated Chunk model to be iterable

ef01395

[REDBOX-409] | AS | Split map prompts, so question can be inserted before documents, as LCEL batch operation can only take a list of documents

e388e0b

[REDBOX-409] | Updated make_map_reduce_summary_chain so that it can perform the map operation of summarising into multiple summaries in parallel using LCEL batch

36f6585

[REDBOX-409] | AS | Renamed max_tokens for summarisation chunk size to summarisation_chunk_max_tokens for clear naming

80be257

[REDBOX-409] | AS | Ruff formating fixes

3e780af

andy-symonds requested review from gecBurton and jamesrichards4 June 25, 2024 14:58

[REDBOX-409] | AS | Removed __iter__ from Chunk model, as it is not needed and I added it earlier in this PR work

5d509b1

jamesrichards4 approved these changes Jun 26, 2024

andy-symonds changed the title ~~Add parallele processing to map step of map reduce summarisation~~ Add parallel processing to map step of map reduce summarisation Jun 26, 2024

[REDBOX-409] | AS | Made an AI Settings

dffb34f

andy-symonds merged commit 1593999 into main Jun 26, 2024
3 checks passed

andy-symonds deleted the feat/map-reduce-batch branch June 26, 2024 09:06
