<a href="https://colab.research.google.com/github/run-llama/llama-hub/blob/main/llama_hub/docstring_walker/docstringwalker_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

This notebook shows how to use DocstringWalker from Llama Hub together with Llama Index and an LLM of your choice.

# Lib install for Colab

In [None]:
!pip install llama_index

In [None]:
!pip install llama_hub

For this exercise we will use the **PyTorch Geometric (PyG)** module to inspect multi-module docstrings.

In [None]:
!pip install torch_geometric

# Lib imports

In [None]:
import os

from pprint import pprint

# ServiceContext was imported here originally but is never used in this notebook
from llama_index import (
    VectorStoreIndex,
    SummaryIndex,
)

import llama_hub.docstring_walker as docstring_walker

# Example 1 - reading Docstring Walker's own docstrings

Let's start by using it... on itself :) We will see what information gets extracted from the module.


In [None]:
# Step 1 - create DocstringWalker object
walker = docstring_walker.DocstringWalker()

# Step 2 - prepare path to module
path_to_docstring_walker = os.path.dirname(docstring_walker.__file__)

# Step 3 - load documents from docstrings
example1_docs = walker.load_data(path_to_docstring_walker)

In [None]:
print(example1_docs[0].text)

Module name: base 
 Docstring: Main module for DocstringWalker loader for Llama Hub 

 Class name: DocstringWalker, In: base 
 Docstring: A loader for docstring extraction and building structured documents from them.
Recursively walks a directory and extracts docstrings from each Python
module - starting from the module itself, then classes, then functions.
Builds a graph of dependencies between the extracted docstrings.

 Function name: load_data, In: DocstringWalker 
 Docstring: Load data from the specified code directory.
Additionally, after loading the data, build a dependency graph between the loaded documents.
The graph is stored as an attribute of the class.


Parameters
----------
code_dir : str
    The directory path to the code files.
skip_initpy : bool
    Whether to skip the __init__.py files. Defaults to True.
fail_on_malformed_files : bool
    Whether to fail on malformed files. Defaults to False - in this case,

Returns
-------
List[Document]
    A list of loaded documen
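
The docstring above describes a recursive directory walk controlled by `skip_initpy` and `fail_on_malformed_files`. The sketch below illustrates that idea with the standard library only. It is not DocstringWalker's actual implementation: the function name `collect_module_docstrings` is hypothetical, and silently skipping malformed files when the flag is `False` is an assumption, since the printed docstring is cut off at that point.

```python
import ast
import os
import tempfile


def collect_module_docstrings(code_dir, skip_initpy=True, fail_on_malformed_files=False):
    """Walk code_dir and return {module name: module docstring}."""
    docstrings = {}
    for root, _dirs, files in os.walk(code_dir):
        for file_name in sorted(files):
            if not file_name.endswith(".py"):
                continue
            if skip_initpy and file_name == "__init__.py":
                continue
            path = os.path.join(root, file_name)
            with open(path, encoding="utf-8") as handle:
                source = handle.read()
            try:
                tree = ast.parse(source)
            except SyntaxError:
                if fail_on_malformed_files:
                    raise
                continue  # assumption: malformed files are skipped
            module_name = os.path.splitext(file_name)[0]
            docstrings[module_name] = ast.get_docstring(tree) or ""
    return docstrings


# Demo on a throwaway directory with one valid, one __init__ and one broken file
with tempfile.TemporaryDirectory() as demo_dir:
    demo_files = {
        "base.py": '"""Main module."""\n',
        "__init__.py": '"""Package."""\n',
        "broken.py": "def oops(:\n",
    }
    for name, content in demo_files.items():
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
            handle.write(content)
    demo_result = collect_module_docstrings(demo_dir)

print(demo_result)
```

With the default flags, only `base.py` contributes an entry: `__init__.py` is skipped by name and `broken.py` fails to parse.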

Now we can use the documents to build a Llama Index index and query it with an LLM.

In [None]:
# Step 1 - create vector store index
example1_index = VectorStoreIndex(example1_docs)

# Step 2 - turn the index into a query engine
example1_query_engine = example1_index.as_query_engine()

In [None]:
pprint(
    example1_query_engine.query("What is the main purpose of DocstringWalker?").response
)

('The main purpose of DocstringWalker is to extract docstrings from Python '
 'modules, classes, and functions, and build structured documents from them. '
 'It also constructs a graph of dependencies between the extracted docstrings '
 'while recursively walking a directory.')


In [None]:
print(
    example1_query_engine.query(
        "What are the main functions used in DocstringWalker? Use a numbered list and briefly describe each function."
    ).response
)

1. load_data: Loads data from a specified code directory and builds a dependency graph between the loaded documents.
2. process_directory: Processes a directory and extracts information from Python files.
3. read_module_text: Reads the text of a Python module given its path.
4. parse_module: Parses a single Python module and returns a Document object with extracted information.
5. process_class: Processes a class node in the AST and adds relevant information to the graph, returning a string representation of the processed class node and its sub-elements.
6. process_function: Processes a function node in the AST, adds it to the graph, and returns a string representation of the processed function node with its sub-elements.
7. process_elem: Processes an element in the AST, delegates execution to more specific functions based on the element type, and returns the result of processing the element.
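
The list above describes an AST traversal that dispatches on element type and records a dependency graph. A minimal sketch of that pattern with Python's `ast` module follows. The names mirror the ones in the answer for readability, but the bodies are illustrative assumptions rather than the loader's real code, and the graph here is a plain dict of parent to children.

```python
import ast

SAMPLE_SOURCE = '''
"""Toy module."""

class Walker:
    """Walks things."""

    def load(self):
        """Load data."""

def helper():
    """Free function."""
'''


def process_elem(node, parent, graph):
    """Dispatch on node type; return text lines and record parent -> child edges."""
    if isinstance(node, ast.ClassDef):
        kind = "Class"
    elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        kind = "Function"
    else:
        return []  # other statements carry no docstring of interest
    graph.setdefault(parent, []).append(node.name)
    lines = [
        f"{kind} name: {node.name}, In: {parent}",
        f"Docstring: {ast.get_docstring(node) or ''}",
    ]
    # Recurse so that methods are attached to their class
    for child in node.body:
        lines.extend(process_elem(child, node.name, graph))
    return lines


def parse_module(source, module_name):
    """Return (text summary, dependency graph) for one module's source."""
    tree = ast.parse(source)
    graph = {}
    lines = [
        f"Module name: {module_name}",
        f"Docstring: {ast.get_docstring(tree) or ''}",
    ]
    for node in tree.body:
        lines.extend(process_elem(node, module_name, graph))
    return "\n".join(lines), graph


toy_text, toy_graph = parse_module(SAMPLE_SOURCE, "toy")
print(toy_text)
print(toy_graph)
```

The text output follows the same "name, In: parent" layout seen in the loaded document earlier, which is what lets the LLM answer questions about where each function lives.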


# Example 2 - checking multi-module project

Now we can use the same approach to check a multi-module project. Let's use the **PyTorch Geometric (PyG) Knowledge Graph Embedding (KGE)** module for this exercise.

In [None]:
import torch_geometric.nn.kge as kge

path_to_module = os.path.dirname(kge.__file__)
example2_docs = walker.load_data(path_to_module)

In [None]:
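# Build a SummaryIndex over the documents and turn it into a query engine.
# Note: the second line rebinds example2_docs from the document list to the query engine.
example2_index = SummaryIndex(example2_docs)
example2_docs = example2_index.as_query_engine()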

In [None]:
print(
    example2_docs.query(
        "What classes are available and what is their main purpose? Use a nested numbered list to describe the class name, a short summary of its purpose, and papers or literature references for each one."
    ).response
)

1. DistMult
   - Purpose: Models relations as diagonal matrices, simplifying the bi-linear interaction between head and tail entities.
   - Paper: "Embedding Entities and Relations for Learning and Inference in Knowledge Bases" (https://arxiv.org/abs/1412.6575)

2. RotatE
   - Purpose: Models relations as a rotation in complex space from head to tail entities.
   - Paper: "RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space" (https://arxiv.org/abs/1902.10197)

3. TransE
   - Purpose: Models relations as a translation from head to tail entities.
   - Paper: "Translating Embeddings for Modeling Multi-Relational Data" (https://proceedings.neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf)

4. KGEModel
   - Purpose: An abstract base class for implementing custom KGE models.

5. ComplEx
   - Purpose: Models relations as complex-valued bilinear mappings between head and tail entities using the Hermetian dot product.
   - Paper: "Complex Embeddings fo

In [None]:
print(
    example2_docs.query("What are the parameters required by the TransE class?").response
)

The parameters required by the TransE class are:

1. num_nodes (int): The number of nodes/entities in the graph.
2. num_relations (int): The number of relations in the graph.
3. hidden_channels (int): The hidden embedding size.
4. margin (int, optional): The margin of the ranking loss (default: 1.0).
5. p_norm (int, optional): The order embedding and distance normalization (default: 1.0).
6. sparse (bool, optional): If set to True, gradients w.r.t. the embedding matrices will be sparse (default: False).
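
Every query in this notebook follows the same pattern: call `.query(...)` and read `.response`. A small helper can batch several questions. This is only a sketch: `ask_all` and `EchoEngine` are hypothetical names, and `EchoEngine` is a stand-in so the example runs without an LLM.

```python
from types import SimpleNamespace


def ask_all(query_engine, questions):
    """Run each question through query_engine and collect the response text."""
    answers = {}
    for question in questions:
        answers[question] = query_engine.query(question).response
    return answers


class EchoEngine:
    """Hypothetical stand-in for a real query engine, used only for this demo."""

    def query(self, question):
        return SimpleNamespace(response=f"(answer to: {question})")


demo_answers = ask_all(EchoEngine(), ["What classes are available?"])
print(demo_answers)
```

In the notebook itself, the same helper could be called with a real engine, for example `ask_all(example1_query_engine, [...])`.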
