## recursive summarization pipeline

One of the most practical ways to achieve a short, abstract, but representative summary of a long document is to apply summarization *recursively*.  This concept was discussed in our introduction to the [`summarize` module](modules/summarize.md#recursive-summarization).  There we applied a single `summarize` module pipeline several times to create terser and terser summary representations of an input text.

In this document we reproduce the same result via a pipeline consisting of multiple [`summarize`](modules/summarize.md) modules in immediate succession.  Processing files through this pipeline applies `summarize` recursively with a single pipeline invocation.

---

To follow along with this demonstration be sure to initialize your krixik session with your api key and url as shown below. 

We illustrate loading these required secrets via [python-dotenv](https://pypi.org/project/python-dotenv/), storing them in a `.env` file.  This is good practice for storing and loading secrets (e.g., it reduces the chance you inadvertently push secrets to a repo).

In [1]:
import sys 
sys.path.append('../../../')
from docs.utilities.reset import reset_pipeline

In [2]:
# load secrets from a .env file using python-dotenv
from dotenv import load_dotenv
import os
load_dotenv("../../.env")
MY_API_KEY = os.getenv('MY_API_KEY')
MY_API_URL = os.getenv('MY_API_URL')

# import krixik and initialize it with your personal secrets
from krixik import krixik
krixik.init(api_key = MY_API_KEY, 
            api_url = MY_API_URL)

SUCCESS: You are now authenticated.
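
If either secret is missing from the `.env` file, `os.getenv` returns `None`, and the problem may only surface later. A minimal sketch of a guard that fails early with a clear message is shown here; `require_env` is a hypothetical helper for illustration and is not part of krixik or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration only; not part of krixik or python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"environment variable {name!r} is not set; check your .env file"
        )
    return value


# example usage, after load_dotenv(...) has run:
# MY_API_KEY = require_env("MY_API_KEY")
# MY_API_URL = require_env("MY_API_URL")
```
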


This small function prints dictionaries very nicely in notebooks / markdown.

In [3]:
# print dictionaries / json nicely in notebooks / markdown
import json
def json_print(data):
    print(json.dumps(data, indent=2))

A table of contents for the remainder of this document is shown below.


- [pipeline setup](#pipeline-setup)
- [processing an input file](#processing-an-input-file)


## pipeline setup

Below we set up a pipeline consisting of three `summarize` modules in sequence.

In [4]:
# create a pipeline with three summarize modules in sequence
pipeline = krixik.create_pipeline(name="my-recursive-summarize-pipeline",
                                  module_chain=["summarize", "summarize", "summarize"])

In [5]:
# delete all processed datapoints belonging to this pipeline
reset_pipeline(pipeline)

In [6]:
pipeline.pipeline_ordered_modules

['summarize', 'summarize', 'summarize']
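
Because the module chain is a plain Python list, the depth of recursion can be treated as a parameter. A small sketch, assuming a hypothetical helper `recursive_chain` (not part of krixik), that builds a list of the kind passed as `module_chain` above:

```python
def recursive_chain(module: str, depth: int) -> list:
    """Build a module chain that repeats `module` `depth` times.

    Hypothetical helper for illustration only; not part of krixik.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return [module] * depth


chain = recursive_chain("summarize", 3)
print(chain)  # ['summarize', 'summarize', 'summarize']
```

The resulting list could then be supplied as the `module_chain` argument of `krixik.create_pipeline`, as in the pipeline definition above.
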

## processing an input file

We first define a path to a local input file.

In [7]:
# define path to an input file from examples directory
test_file = "../../input_data/1984_short.txt"

Let's take a quick look at this file before processing.

In [8]:
# examine contents of input file
with open(test_file, "r") as file:
    print(file.read())
    

It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.

The hallway smelt of boiled cabbage and old rag mats. At one end of it a
coloured poster, too large for indoor display, had been tacked to the wall.
It depicted simply an enormous face, more than a metre wide: the face of a
man of about forty-five, with a heavy black moustache and ruggedly handsome
features. Winston made for the stairs. It was no use trying the lift. Even
at the best of times it was seldom working, and at present the electric
current was cut off during daylight hours. It was part of the economy drive
in preparation for Hate Week. The flat was seven flights up, and Winston,
who was thirty-nine and had a varicose ulcer above his right ankle, went
slowly, resting se

When introducing the [`summarize` module](modules/summarize.md) we applied a single-module `summarize` pipeline to this document.  This produced a summary that was about half the length of the original text.

In the [recursive summarization](modules/summarize.md#recursive-summarization) section of that introduction we then applied the same single-module pipeline two more times to produce a one-paragraph summary of the text above.

Here we will produce the same one-paragraph summary by applying the recursive `summarize` pipeline defined above a single time to the input text.
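
The equivalence relied on here is function composition: feeding a text through three summarizers in sequence gives the same result as calling one summarizer three times on its own output. The sketch below illustrates this with a toy stand-in, `toy_summarize`, which merely keeps the first half of the sentences. Both `toy_summarize` and `run_chain` are hypothetical placeholders; neither is the model used by the `summarize` module nor part of krixik.

```python
from functools import reduce


def toy_summarize(text: str) -> str:
    """Toy stand-in for a summarizer: keep the first half of the sentences.

    Hypothetical placeholder; NOT the model behind the `summarize` module.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return ""
    keep = max(1, len(sentences) // 2)  # always keep at least one sentence
    return ". ".join(sentences[:keep]) + "."


def run_chain(text: str, chain: list) -> str:
    """Apply each function in `chain` to the output of the previous one."""
    return reduce(lambda current, step: step(current), chain, text)


text = "One. Two. Three. Four. Five. Six. Seven. Eight."

# three separate invocations, each fed the previous output
manual = toy_summarize(toy_summarize(toy_summarize(text)))

# a single invocation of a three-step chain
chained = run_chain(text, [toy_summarize] * 3)

assert manual == chained
print(chained)  # One.
```
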

Below we [process](system/process.md) the input through our pipeline.  Here we use the default model for [`summarize`](modules/summarize.md) for each of the three instances of the module.

In [9]:
# define path to an input file from examples directory
test_file = "../../input_data/1984_short.txt"

# process a file through the pipeline
process_output = pipeline.process(local_file_path = test_file,
                                  local_save_directory=".", # save output in current directory
                                  expire_time=60*5,         # set all process data to expire in 5 minutes
                                  wait_for_process=True,    # wait for process to complete before returning control
                                  verbose=False)            # set verbosity to False

The output of this process is printed below.  Here the `process_output` key is `null`; the summary itself has been saved to the text file listed under the `process_output_files` key.  The `file_id` of the processed file is used as the filename prefix for this output file.

In [10]:
# nicely print the output of this process
json_print(process_output)

{
  "status_code": 200,
  "pipeline": "my-recursive-summarize-pipeline",
  "request_id": "3e9e54ef-5f66-4434-98bf-672c3dd7ed6f",
  "file_id": "affbb8e6-4289-4231-80f8-894d4f868506",
  "message": "SUCCESS - output fetched for file_id affbb8e6-4289-4231-80f8-894d4f868506.Output saved to location(s) listed in process_output_files.",
  "process_output": null,
  "process_output_files": [
    "./affbb8e6-4289-4231-80f8-894d4f868506.txt"
  ]
}


We load in the text file output from `process_output_files` below. 

In [11]:
# load in process output from file
with open(process_output['process_output_files'][0], "r") as file:
    print(file.read())  

Winston Smith walked through the glass doors of Victory Mansions. The hallway
smelled of boiled cabbage and old rag mats. A kilometre away the
Ministry of Truth, his place of work, towered vast.
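
To quantify how much shorter a summary is than its source, one option is to compare word counts. A minimal sketch, assuming hypothetical helpers `word_count` and `compression_ratio` (not part of krixik); the strings used here are short stand-ins rather than the actual input and output files:

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words in `text`."""
    return len(text.split())


def compression_ratio(original: str, summary: str) -> float:
    """Return summary length divided by original length, measured in words.

    Hypothetical helper for illustration only; not part of krixik.
    """
    original_words = word_count(original)
    if original_words == 0:
        raise ValueError("original text contains no words")
    return word_count(summary) / original_words


# short stand-in strings, not the real files from this walkthrough
original = "It was a bright cold day in April, and the clocks were striking thirteen."
summary = "It was a cold April day."

print(round(compression_ratio(original, summary), 2))  # 0.43
```

With the files from this walkthrough, `original` would be the contents of `test_file` and `summary` the contents of `process_output['process_output_files'][0]`.
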
