# Document Summarization

This notebook demonstrates an innovative application of long document summarization techniques to automatically generate documentation for Python code. By treating a codebase as a "long document," we leverage AI-powered language models to comprehend, distill, and explain complex code structures.

Key concepts:

1. Document preprocessing: We fetch a document from a URL and format code from a GitHub repository, similar to how one might prepare a long text document for summarization.
2. Chunking and tokenization: We analyze the token count of our code "document" to ensure it fits within the model's context window, a crucial step in long document processing.
3. Prompt engineering: We craft a specialized prompt that guides the AI to focus on key aspects of the code, much like how summarization prompts direct models to capture essential information.
4. AI-powered analysis: Using the Replicate API, we access a large language model capable of understanding code semantics and generating human-readable explanations.
5. Structured output: We instruct the model to produce documentation in a consistent format, analogous to generating structured summaries from lengthy texts.


## Install Dependencies

Before we begin, we need to install the required Python packages. We'll be using:

- `replicate`: To interact with the Replicate API for accessing AI models
- `transformers`: For tokenization and working with language models

These packages will be installed using pip, Python's package installer. If you're running this notebook in a fresh environment, make sure you have pip installed and updated (if you are in Colab, this is done for you).

In [None]:
!pip install replicate transformers

## Set Replicate Token

To use the Replicate API, we need to authenticate our requests. This is done using an API token.

For security reasons, it's best to store this token as an environment variable rather than hardcoding it into our script. In this notebook, we're using Google Colab's `userdata` feature to securely store and retrieve the token. If you are not using Colab, change this cell to:

```
os.environ['REPLICATE_API_TOKEN'] = "your-token"
```

Remember to never share your API tokens publicly or commit them to version control systems.
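
Outside Colab, one way to avoid hardcoding the token is to prompt for it at run time. The following is a minimal sketch rather than part of the original notebook: `ensure_replicate_token` is a hypothetical helper name, and it assumes an interactive prompt through the standard-library `getpass` module is acceptable in your environment.

```python
import os
from getpass import getpass


def ensure_replicate_token(env=os.environ, ask=getpass):
    """Return the Replicate token, prompting for it only if it is not already set.

    `env` and `ask` are parameters only so the helper can be exercised without
    a real environment or terminal (a hypothetical design, not a Replicate API).
    """
    token = env.get("REPLICATE_API_TOKEN")
    if not token:
        # Nothing stored yet: ask the user without echoing the input.
        token = ask("Replicate API token: ").strip()
        if not token:
            raise ValueError("No Replicate API token provided")
        env["REPLICATE_API_TOKEN"] = token
    return token
```

Calling `ensure_replicate_token()` at the top of the notebook keeps the token value out of the saved cells.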

In [None]:
import os

if os.environ.get('REPLICATE_API_TOKEN') is None:
    # Replicate API token not set, so we're probably in Colab. Try to fetch it from Colab's stored secrets.
    from google.colab import userdata
    os.environ['REPLICATE_API_TOKEN'] = userdata.get('REPLICATE_API_TOKEN')

## Download a book

In [None]:
import requests

# The following URL contains a text version of H.D. Thoreau's "Walden"
url = "https://www.gutenberg.org/cache/epub/205/pg205.txt"

response = requests.get(url)
response.raise_for_status()

contents = response.text

## Count the tokens

Before sending our code to the AI model, it's crucial to understand how much of the model's capacity we're using. Language models typically have a limit on the number of tokens they can process in a single request.

Key points:
- We're using the `granite-8B-Code-instruct-128k` model, which has a context window of 128,000 tokens
- The context window includes both the input (our code) and the output (the generated documentation)
- Tokenization can vary between models, so we use the specific tokenizer for our chosen model
- If our input is too large, we may need to split it into smaller chunks or summarize it
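
The relationship between input size, output size, and the context window comes down to simple arithmetic. The following is a minimal sketch: `fits_in_context` and `remaining_input_budget` are hypothetical helpers, and the default of 128,000 tokens is the context window stated above.

```python
def fits_in_context(input_tokens, max_output_tokens, context_window=128_000):
    """Return True if the prompt plus the requested output fits in the window."""
    if input_tokens < 0 or max_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return input_tokens + max_output_tokens <= context_window


def remaining_input_budget(max_output_tokens, context_window=128_000):
    """Number of tokens left for the prompt after reserving room for the output."""
    return max(context_window - max_output_tokens, 0)
```

For instance, reserving 10,000 tokens for the output leaves 118,000 tokens for the prompt.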

Understanding token count helps us optimize our prompts and ensure we're using the model efficiently.
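
When the input does not fit, splitting at the token level is one option. The sketch below is an illustration rather than this notebook's method: `chunk_token_ids` is a hypothetical helper that works on a plain list of token ids, and the overlap between neighboring chunks is an assumed design choice meant to carry some context across chunk boundaries.

```python
def chunk_token_ids(token_ids, chunk_size, overlap=0):
    """Split a list of token ids into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks
```

With a Hugging Face tokenizer, the ids would typically come from `tokenizer(contents)["input_ids"]`, and each chunk can be converted back to text with `tokenizer.decode(chunk)`.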

In [None]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-8B-Code-instruct-128k"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Count tokens from the plain list of ids, so PyTorch tensors are not required
token_count = len(tokenizer(contents)["input_ids"])
print(f"Your document has {token_count} tokens")

## Create our prompt and call the model in Replicate

This is where we construct our final prompt and send it to the AI model for processing.

Our approach involves:
1. Combining the code we fetched with specific instructions for documentation
2. Using a template to guide the model's output format
3. Calling the Replicate API with our constructed prompt and additional parameters
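
The first two steps can be sketched as a small helper that places the fetched text ahead of the instructions and the response template. `build_prompt` and its parameter names are hypothetical, and the `Document:` label is an assumption made here to separate the source text from the instructions.

```python
def build_prompt(document, instructions, response_template=""):
    """Assemble a prompt: document first, then instructions, then an optional template."""
    parts = [
        "Document:",
        document.strip(),
        "",
        instructions.strip(),
    ]
    if response_template.strip():
        # Only add the template section when a template was supplied.
        parts += ["", "Response Template:", response_template.strip()]
    return "\n".join(parts)
```

Putting the document first means that instructions such as "provided above" have something concrete to refer to.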

Key considerations:
- The prompt includes both the code and instructions for how to document it
- We use a response template to ensure consistent formatting across functions
- Parameters like `max_tokens`, `temperature`, and `system_prompt` can be adjusted to fine-tune the model's behavior
- The output is streamed, allowing for real-time display of the generated documentation

This step is where the magic happens - transforming our code into human-readable documentation.
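
Streaming amounts to consuming the model's output piece by piece instead of waiting for the complete text. The following is a minimal sketch, assuming the output is an iterable of text fragments (as the `"".join(output)` call in this notebook suggests); `print_stream` is a hypothetical helper, not part of the `replicate` package.

```python
import sys


def print_stream(chunks, out=None):
    """Write each fragment as soon as it arrives and return the full text."""
    out = sys.stdout if out is None else out
    collected = []
    for chunk in chunks:
        text = str(chunk)
        out.write(text)
        out.flush()  # show the fragment immediately rather than buffering it
        collected.append(text)
    return "".join(collected)
```

Under that assumption, `print_stream(output)` could take the place of `print("".join(output))`, displaying text as it is generated while still returning the complete result.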

In [None]:
import replicate

# Prepend the fetched text so that "provided above" in the instructions refers to it
prompt = contents + """

Provide detailed developer documentation for each function provided above.

Response Template:
## `function_name`

* _param1_: (type) description

Synopsis of the function

_**returns**_:
"""

output = replicate.run(
    "ibm-granite/granite-8b-code-instruct-128k",
    input={
        "prompt": prompt,
        "max_tokens": 10000,
        "min_tokens": 0,
        "temperature": 0.75,
        "system_prompt": "You are a helpful assistant.",
        "presence_penalty": 0,
        "frequency_penalty": 0
    })


print("".join(output))
