# Auto-generating Documentation: A Long Document Summarization Approach

This notebook demonstrates an innovative application of long document summarization techniques to automatically generate documentation for Python code. By treating a codebase as a "long document," we leverage AI-powered language models to comprehend, distill, and explain complex code structures.

Key concepts:
1. Document preprocessing: We fetch and format code from a GitHub repository, similar to how one might prepare a long text document for summarization.
2. Chunking and tokenization: We analyze the token count of our code "document" to ensure it fits within the model's context window, a crucial step in long document processing.
3. Prompt engineering: We craft a specialized prompt that guides the AI to focus on key aspects of the code, much like how summarization prompts direct models to capture essential information.
4. AI-powered analysis: Using the Replicate API, we access a large language model capable of understanding code semantics and generating human-readable explanations.
5. Structured output: We instruct the model to produce documentation in a consistent format, analogous to generating structured summaries from lengthy texts.

This approach demonstrates how techniques traditionally used for summarizing long articles, reports, or books can be adapted for technical documentation tasks. It showcases the versatility of large language models in processing and synthesizing complex information, whether it's natural language or programming code.

By the end of this notebook, you'll see how principles of long document summarization can be applied to streamline and enhance the software documentation process, potentially saving developers significant time and effort.

## Install Dependencies

Before we begin, we need to install the required Python packages. We'll be using:

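- `replicate`: To interact with the Replicate API for accessing AI models
- `transformers`: For tokenization and working with language models
- `ibm-granite-community/utils` (installed from GitHub): Provides the `get_env_var` helper we use to load the API token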

These packages will be installed using pip, Python's package installer. If you're running this notebook in a fresh environment, make sure you have pip installed and updated (if you are in Colab, this is done for you).

In [1]:
!pip install git+https://github.com/ibm-granite-community/utils.git replicate transformers

Collecting git+https://github.com/ibm-granite-community/utils.git
  Cloning https://github.com/ibm-granite-community/utils.git to /private/var/folders/8t/m9m188_d0tb8szvfqlc20hfr0000gn/T/pip-req-build-mxkj_zhg
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils.git /private/var/folders/8t/m9m188_d0tb8szvfqlc20hfr0000gn/T/pip-req-build-mxkj_zhg
  Resolved https://github.com/ibm-granite-community/utils.git to commit a5965f40db3950dd2a41f3ca62a2c34adcdc20d7
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers
  Downloading transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Collecting filelock (from transformers)
  Downloading filelock-3.16.1-py3-none-any.whl.metadata (2.9 kB)
Collecting huggingface-hub<1.0,>=0.24.0 (from transformers)
  Downloading huggingface_hub-0.27.0-py3-none-any.whl.metadata (13 kB

## Set Replicate Token

To use the Replicate API, we need to authenticate our requests. This is done using an API token.

For security reasons, it's best to store this token as an environment variable rather than hardcoding it into our script. If we are using Google Colab, the `get_env_var` function uses the `userdata` feature to retrieve the token and sets it as an environment variable.

Remember to never share your API tokens publicly or commit them to version control systems.

In [2]:
from ibm_granite_community.notebook_utils import get_env_var

get_env_var("REPLICATE_API_TOKEN")
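Outside Colab, the same idea can be sketched with the standard library alone. The helper below is a hypothetical stand-in, not the implementation of `get_env_var`: the name `load_token` and its `prompt_if_missing` parameter are assumptions made for illustration. It reads the variable from the environment and, if it is missing, optionally asks for it without echoing the input.

```python
import os
from getpass import getpass


def load_token(name: str, prompt_if_missing: bool = True) -> str:
    """Return the secret stored in the environment variable `name`.

    Hypothetical helper for illustration only; it is not part of
    ibm-granite-community/utils.
    """
    value = os.environ.get(name)
    if not value and prompt_if_missing:
        # getpass hides the typed value so it does not end up in the notebook output.
        value = getpass(f"Enter {name}: ")
        os.environ[name] = value
    if not value:
        raise RuntimeError(f"{name} is not set")
    return value
```

Calling `load_token("REPLICATE_API_TOKEN")` would then leave the token in the environment, where client libraries that look for that variable can find it.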

## Define a function for downloading a repository

We'll create a function to fetch code from a GitHub repository. This allows us to easily obtain the code we want to document.

Key points about this function:
- It uses the GitHub API to retrieve repository contents
- It can handle both files and directories recursively
- The function formats the code with appropriate language tags for better display
- An optional GitHub token can be provided for increased API rate limits and access to private repositories

Note on GitHub tokens:
A GitHub token is not required for public repositories, but it can be beneficial. With a token, you can:
1. Access private repositories
2. Have a higher rate limit for API requests
3. Fetch more detailed information about the repository

To create a GitHub token, go to your GitHub account settings, select "Developer settings", then "Personal access tokens". Find more information [here](https://docs.github.com/en/rest/authentication/authenticating-to-the-rest-api?apiVersion=2022-11-28).
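As a small illustration of how an optional token changes a request, the sketch below builds the HTTP headers for a GitHub REST call. The function name `build_github_headers` is hypothetical. It uses the `Bearer` scheme and the `application/vnd.github+json` media type described in the GitHub REST documentation; the function defined in this notebook uses the equivalent `token` scheme instead.

```python
from typing import Dict, Optional


def build_github_headers(github_token: Optional[str] = None) -> Dict[str, str]:
    """Build request headers for the GitHub REST API (illustrative helper)."""
    headers = {"Accept": "application/vnd.github+json"}
    if github_token:
        # Authenticated requests get a higher rate limit and can reach private repositories.
        headers["Authorization"] = f"Bearer {github_token}"
    return headers
```

Without a token the function returns only the `Accept` header, which is enough for public repositories.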

In [3]:
import requests
from time import sleep

def get_github_repo_contents(repo, directory_path, github_token=None):
    """Return every file under `directory_path` in `repo` as a fenced code block.

    `repo` has the form "owner/name". Subdirectories are fetched recursively.
    """
    api_url = f"https://api.github.com/repos/{repo}/contents/{directory_path}"
    # The token is optional for public repositories; it raises the API rate limit.
    headers = {"Authorization": f"token {github_token}"} if github_token is not None else {}
    response = requests.get(api_url, headers=headers)
    response.raise_for_status()

    contents = response.json()
    result = []

    for item in contents:
        if item["type"] == "file":
            file_response = requests.get(item["download_url"])
            file_response.raise_for_status()
            file_content = file_response.text
            # Use the file extension as the code-fence language tag.
            language = item["name"].split(".")[-1]
            if language == "py":
                language = "python"
            elif language == "js":
                language = "javascript"
            result.append(f"{item['path']}\n```{language}\n{file_content}\n```")
        elif item["type"] == "dir":
            # Recursively go through subdirectories
            subdirectory_contents = get_github_repo_contents(repo, item["path"], github_token)
            result.append(subdirectory_contents)
        # Pause briefly between items to stay well within the API rate limit.
        sleep(0.1)

    return "\n\n".join(result)
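The language-tag step in the function above handles two extensions inline. If more file types need tags, one possible variation is a lookup table, sketched below. `FENCE_LANGUAGES` and `fence_language` are hypothetical names, and the mapping is only an example. Unlike the inline version, this sketch returns an empty tag for files without an extension, such as `Makefile`.

```python
from pathlib import PurePosixPath

# Example mapping from file suffix to code-fence language tag (extend as needed).
FENCE_LANGUAGES = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".md": "markdown",
    ".sh": "bash",
}


def fence_language(filename: str) -> str:
    """Return a code-fence language tag for `filename` (illustrative helper)."""
    suffix = PurePosixPath(filename).suffix.lower()
    # Fall back to the bare extension (or an empty tag) for unknown file types.
    return FENCE_LANGUAGES.get(suffix, suffix.lstrip("."))
```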


## Get code from `ibm-granite-community/utils`

In this example, we're focusing on the `ibm-granite-community/utils` repository, specifically the `src` directory. This directory contains various utility functions that we want to document.

By specifying this directory, we ensure that we're only fetching the relevant code and not unnecessary files or directories. This helps to keep our input focused and reduces the likelihood of exceeding token limits in our AI model.

In [4]:
prompt = get_github_repo_contents("ibm-granite-community/utils", "src")

## Count the tokens

Before sending our code to the AI model, it's crucial to understand how much of the model's capacity we're using. Language models typically have a limit on the number of tokens they can process in a single request.

Key points:
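- We're using the `ibm-granite/granite-8b-code-instruct-128k` model, which has a context window of 128,000 tokens
- The context window includes both the input (our code) and the output (the generated documentation)
- Tokenization varies between models, so a count is most accurate with the tokenizer of the model being called. The cell below loads the `ibm-granite/granite-3.1-8b-instruct` tokenizer, so treat its count as an estimate for the code model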
- If our input is too large, we may need to split it into smaller chunks or summarize it

Understanding token count helps us optimize our prompts and ensure we're using the model efficiently.
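The second key point above amounts to simple arithmetic: the input tokens plus the tokens reserved for the response must not exceed the context window. A minimal sketch follows. The function names are hypothetical, and the 128,000 figure is the context window quoted above.

```python
CONTEXT_WINDOW = 128_000  # context window quoted above for the 128k model


def fits_context(input_tokens: int, max_output_tokens: int,
                 context_window: int = CONTEXT_WINDOW) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def remaining_for_output(input_tokens: int,
                         context_window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the response once the prompt has been counted."""
    return max(context_window - input_tokens, 0)
```

For example, a 120,000-token prompt with 10,000 tokens reserved for output does not fit, while a 100,000-token prompt with the same reservation does.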

In [5]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-3.1-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)

print(f"Your git repo load has {len(tokenizer.tokenize(prompt))} tokens")

Your git repo load has 673 tokens
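This repository is small, but a larger one could exceed the budget. One way to split the input, sketched here under assumptions, is to pack whole sections (for example, one per file) greedily into chunks that stay under a token limit. `pack_sections` and `whitespace_count` are hypothetical names. The whitespace counter is a stand-in so the sketch runs without downloading a tokenizer, and the separator added between sections is not counted, so the limit is approximate.

```python
from typing import Callable, List


def pack_sections(sections: List[str], max_tokens: int,
                  count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily group sections into chunks of at most `max_tokens` tokens.

    A single section larger than `max_tokens` still becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for section in sections:
        n = count_tokens(section)
        # Start a new chunk when adding this section would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def whitespace_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())
```

With the tokenizer loaded in this notebook, `count_tokens` could be `lambda s: len(tokenizer.tokenize(s))`. Each chunk would then be documented in a separate model call.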


## Create our prompt and call the model in Replicate

This is where we construct our final prompt and send it to the AI model for processing.

Our approach involves:
1. Combining the code we fetched with specific instructions for documentation
2. Using a template to guide the model's output format
3. Calling the Replicate API with our constructed prompt and additional parameters

Key considerations:
- The prompt includes both the code and instructions for how to document it
- We use a response template to ensure consistent formatting across functions
- Parameters like `max_tokens`, `temperature`, and `system_prompt` can be adjusted to fine-tune the model's behavior
- The model's output arrives as a sequence of text chunks, which the cell below joins into a single string before printing

This step is where the magic happens: transforming our code into human-readable documentation.
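If you would rather see text as it is generated, the chunks can be written out one at a time instead of being joined first. The sketch below assumes only that the model output can be iterated to yield text chunks, which is what the `"".join(output)` call in the next cell relies on. The name `print_as_it_arrives` is hypothetical.

```python
import sys
from typing import Iterable


def print_as_it_arrives(chunks: Iterable, out=sys.stdout) -> str:
    """Write each chunk as soon as it is available and return the full text."""
    parts = []
    for chunk in chunks:
        text = str(chunk)
        out.write(text)
        out.flush()  # push partial output to the display immediately
        parts.append(text)
    out.write("\n")
    return "".join(parts)
```

Passing the model output to this function in place of the final `print` call would display the documentation incrementally while still returning the complete string.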

In [6]:
import replicate

full_prompt = prompt + """

Provide detailed developer documentation for each function provided above.

Response Template:
## `function_name`

* _param1_: (type) description

Synopsis of the function

_**returns**_:
"""

output = replicate.run(
    "ibm-granite/granite-8b-code-instruct-128k",
    input={
        "prompt": full_prompt,
        "max_tokens": 10000,
        "min_tokens": 0,
        "temperature": 0.75,
        "system_prompt": "You are a helpful assistant.",
        "presence_penalty": 0,
        "frequency_penalty": 0,
    },
)


print("".join(output))


## `is_colab()`

* `_return_`: (bool) Returns `True` if the code is running in Google Colab, `False` otherwise.

This function checks if the code is running in Google Colab by using the `importlib.util.find_spec` function to check if the `google.colab` module is available.

**returns**:


- `True` if the code is running in Google Colab, `False` otherwise.

## `get_env_var(var_name, default_value)`

* `_param1_`: (str) The name of the environment variable to retrieve.
* `_param2_`: (str | None, optional) The default value to return if the environment variable is not found. Default is `None`.
* `_return_`: (str | None) The value of the environment variable if found, or the default value if not found.

This function retrieves the value of an environment variable by checking the `os.environ` dictionary for the specified variable name. If the variable is not found, it checks if the code is running in Google Colab and, if so, attempts to retrieve the value from a secret using the `google.col