# Document Summaries with Small Language Models (SLIMs Cookbook)

### The need for document summaries

Smaller models often have a relatively small context window. If a document exceeds that window, we cannot pass it to the model in one piece, so we need a way to summarize its contents. Let's look at an example of how to do this with LLMWare.

### For Google Colab users

If you are using the free tier of Colab, we highly recommend activating the T4 GPU hardware accelerator. Our models are designed to run with at least 16GB of RAM, and activating T4 grants the notebook 16GB of GDDR6 RAM, as opposed to the ~13GB Colab provides by default.

To activate T4:
1. click on the "Runtime" tab
2. click on "Change runtime type"
3. select T4 GPU under Hardware Accelerator

NOTE: free use of the T4 is subject to a weekly usage limit.
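To confirm that the accelerator is actually attached after changing the runtime type, a quick check like the one below can help. This is a minimal sketch that assumes the `nvidia-smi` command-line tool is on the `PATH` whenever an NVIDIA GPU is available; the helper name `gpu_visible` is hypothetical and is not part of LLMWare or Colab.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is installed and lists at least one GPU.

    Hypothetical helper for illustration; assumes an NVIDIA driver setup
    where `nvidia-smi -L` prints one line per GPU.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed, so no NVIDIA GPU is exposed to this runtime.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False` on Colab, the runtime is most likely still set to CPU.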

### Installing and importing dependencies

In [1]:
%pip install llmware

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os

from llmware.prompts import Prompt
from llmware.setup import Setup

### Worker function

The process for summarizing the document is as follows:
- The document gets split up into batches (each batch can be a few pages long)
- The SLIM Summary tool summarizes each batch into a list of points
- The points from each batch are aggregated at the end and printed, and the output is also available as a Python list
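The three steps above can be sketched in plain Python. This is only an illustration of the split, summarize, and aggregate pattern, not LLMWare's actual implementation: the function names are hypothetical, and `first_sentence_summarizer` is a trivial stand-in for the SLIM Summary tool so that the example runs without a model.

```python
from typing import Callable, List


def split_into_batches(chunks: List[str], batch_size: int) -> List[List[str]]:
    """Group consecutive text chunks into batches of at most batch_size chunks."""
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]


def summarize_in_batches(
    chunks: List[str],
    summarize_batch: Callable[[str], List[str]],
    batch_size: int = 3,
    max_batch_cap: int = 15,
) -> List[str]:
    """Summarize each batch separately and aggregate the points into one list."""
    # Cap the number of batches that are sent to the summarizer.
    batches = split_into_batches(chunks, batch_size)[:max_batch_cap]
    points: List[str] = []
    for batch in batches:
        points.extend(summarize_batch(" ".join(batch)))
    return points


def first_sentence_summarizer(text: str) -> List[str]:
    """Stand-in for the model (assumption): returns the first sentence as one point."""
    first = text.split(".")[0].strip()
    return [first] if first else []


chunks = [
    "Alpha is first. More text.",
    "Beta follows.",
    "Gamma is third. Extra.",
    "Delta ends.",
]
points = summarize_in_batches(chunks, first_sentence_summarizer, batch_size=2)
for i, point in enumerate(points):
    print(i, point)
```

Because each batch is summarized independently, no single model call has to fit the whole document into its context window.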

The LLMWare library provides several sample files for running this example. To use your own documents instead, substitute your own file path and file name. In either case, you can pass the following optional parameters to the summarization:
- Topic: all the points in the summary will be about this topic
- Query: filters which parts of the document are considered for the summarization (only text chunks matching the query are considered)
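To make the effect of the query parameter concrete, here is a minimal sketch of query-based filtering. It assumes a simple case-insensitive substring match; how LLMWare matches text chunks internally may differ, and `filter_chunks_by_query` is a hypothetical name used only for illustration.

```python
from typing import List, Optional


def filter_chunks_by_query(chunks: List[str], query: Optional[str]) -> List[str]:
    """Keep only the chunks that contain the query text.

    Assumption: case-insensitive substring matching. With no query,
    every chunk is kept, mirroring query=None in the examples.
    """
    if not query:
        return list(chunks)
    needle = query.lower()
    return [chunk for chunk in chunks if needle in chunk.lower()]


chunks = [
    "The Base Salary is paid monthly.",
    "Vacation accrues annually.",
    "Bonus is a percentage of base salary.",
]
selected = filter_chunks_by_query(chunks, "base salary")
print(len(selected), "of", len(chunks), "chunks selected")
```

Only the selected chunks would then be batched and summarized, which is why a query can sharply reduce the number of batches sent to the model.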

In [3]:
def test_summarize_document(example="jd salinger"):

    # pull a sample document (or substitute a file_path and file_name of your own)
    sample_files_path = Setup().load_sample_files(over_write=False)

    topic = None
    query = None
    fp = None
    fn = None

    if example not in ["jd salinger", "employment terms", "just the comp", "un resolutions"]:
        print("example not found")
        return []

    if example == "jd salinger":
        fp = os.path.join(sample_files_path, "SmallLibrary")
        fn = "Jd-Salinger-Biography.docx"
        topic = "jd salinger"
        query = None

    if example == "employment terms":
        fp = os.path.join(sample_files_path, "Agreements")
        fn = "Athena EXECUTIVE EMPLOYMENT AGREEMENT.pdf"
        topic = "executive compensation terms"
        query = None

    if example == "just the comp":
        fp = os.path.join(sample_files_path, "Agreements")
        fn = "Athena EXECUTIVE EMPLOYMENT AGREEMENT.pdf"
        topic = "executive compensation terms"
        query = "base salary"

    if example == "un resolutions":
        fp = os.path.join(sample_files_path, "SmallLibrary")
        fn = "N2126108.pdf"
        # fn = "N2137825.pdf"
        topic = "key points"
        query = None

    # optional parameters:  'query' - will select among blocks with the query term
    #                       'topic' - will pass a topic/issue as the parameter to the model to 'focus' the summary
    #                       'max_batch_cap' - caps the number of batches sent to the model
    #                       'text_only' - returns just the summary text aggregated

    kp = Prompt().summarize_document_fc(fp, fn, topic=topic, query=query, text_only=True, max_batch_cap=15)

    print(f"\nDocument summary completed - {len(kp)} Points")
    for i, points in enumerate(kp):
        print(i, points)

    # return the aggregated summary points so the caller can use them as a Python list
    return kp

### Main function

Here, we call the worker function defined above.

In [4]:
if __name__ == "__main__":

    print("\nExample: Summarize Documents\n")

    #   4 examples - ["jd salinger", "employment terms", "just the comp", "un resolutions"]
    #   -- "jd salinger" - summarizes key points about jd salinger from short biography document
    #   -- "employment terms" - summarizes the executive compensation terms across 15 page document
    #   -- "just the comp" - queries to find subset of document and then summarizes the key terms
    #   -- "un resolutions" - summarizes the un resolutions document

    summary_direct = test_summarize_document(example="employment terms")


Example: Summarize Documents

update: Prompt - summarize_document_fc - document - Athena EXECUTIVE EMPLOYMENT AGREEMENT.pdf
update: Prompt - summarize_document_fc - number of source batches -  14
update: iterating through source batches - 0 - ['Accrued Compensation - all compensation reimbursements and other amounts earned by payable to or accrued and vested for Executive through and including Executives Date of Termination including but not limited to (i) Base Salary; (ii) Executives Incentive Bonus for the fiscal year that ended immediately prior to Executives Date of Termination to the extent such Incentive Bonus was accrued and earned by but not yet paid to Executive as of Executives Date of Termination; (iii) pay for accrued but unused vacation; and (iv) reimbursable business expenses incurred by Executive on behalf of Employer']
update: iterating through source batches - 1 - ['Not Found']
update: iterating through source batches - 2 - ['Not to exceed 5% of base salary', 'Materia

We're given an output of 23 points summarizing the `employment terms` sample document!

Note that several batches returned `Not Found`, indicating that they contained no points relevant to the topic `executive compensation terms`, which we passed to the `summarize_document_fc()` call in our worker function. The model returned `Not Found` rather than producing unrelated points that have nothing to do with our topic!
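If you collect per-batch results yourself and want to drop these placeholders before further processing, a small filter is enough. This is a sketch under the assumption that the placeholder is the exact string `Not Found`, as it appears in the log above; `drop_not_found` is a hypothetical helper, not an LLMWare function.

```python
from typing import List


def drop_not_found(points: List[str], marker: str = "Not Found") -> List[str]:
    """Remove placeholder entries, ignoring case and surrounding whitespace."""
    target = marker.strip().lower()
    return [p for p in points if p.strip().lower() != target]


# Sample values modeled on the log output above.
batch_points = ["Not to exceed 5% of base salary", "Not Found", " not found "]
print(drop_not_found(batch_points))
```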