## ⚠️ WARNING

This notebook is not used directly by the API; it is intended only for the development process of the project.

Please remove the following packages to ensure a clean production build:

```bash
uv remove python-dotenv ipykernel ollama uvicorn tqdm
```

## Thought process

The directive is to build an LLM-based application that can automatically answer any of the previously answered graded assignments and return the corresponding answer.

It won't be as simple as returning a stored answer: given the instructions, a request will most likely carry new inputs and values, and the matching script has to be run against them.

Sample request:

```bash
curl -X POST "https://your-app.vercel.app/api/" \
  -H "Content-Type: multipart/form-data" \
  -F "question=Download and unzip file abcd.zip which has a single extract.csv file inside. What is the value in the \"answer\" column of the CSV file?" \
  -F "file=@abcd.zip"
```



### Project Structure

The project needs to be organized so that it is easy to manage. The structure therefore needs a folder for all the modules, an app to serve as the API, a folder for all downloaded content, and a folder for the vector collections.

```bash
├───chromadb
│   └───vector collections...
├───helpers
│   ├───helper functions...
│   └───task.py
├───data
│   └───downloaded files...
├───submissions
│   ├───consists of graded assignment modules...
│   └───task.py
├───.gitignore
├───app.py
├───data_preparations.ipynb
├───pyproject.toml
├───README.md
└───uv.lock
```


### Function Calling vs Embeddings

Function calling might be the closest fit. **However**, given the scope of the project, giving each task its own function definition would inevitably hit the token limit. A better-suited approach is Retrieval-Augmented Generation (RAG), which not only references authoritative knowledge but also keeps the project within the model's capabilities. A refusal option can be added for cases where the model cannot make sense of the user's request.
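
The retrieval step with a refusal cutoff can be sketched as follows. This is a minimal illustration, not the project's implementation: the three-dimensional vectors in `TOY_INDEX`, the helper names, and the `max_distance` value are all made up for the example, and real vectors would come from an embedding model.

```python
import math
from typing import Dict, List, Optional

# Hypothetical, hand-made vectors standing in for real embeddings.
TOY_INDEX: Dict[str, List[float]] = {
    "counting_days": [1.0, 0.0, 0.0],
    "json_cleanup": [0.0, 1.0, 0.0],
}

def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def pick_function(query_vec: List[float], max_distance: float = 0.5) -> Optional[str]:
    """Return the closest function name, or None (refuse) if nothing is close enough."""
    name, distance = min(
        ((n, cosine_distance(query_vec, v)) for n, v in TOY_INDEX.items()),
        key=lambda pair: pair[1],
    )
    return name if distance <= max_distance else None
```

Only the selected function would then be described to the model, which keeps the prompt small. The cutoff needs tuning against real embeddings: the correct match for the date-range question in this notebook comes back with a cosine distance of about 0.65, which a cutoff of 0.5 would reject.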


## Parsing Python Functions

In [1]:
from typing import List, Dict
import random
import ast
import os

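def parse_functions(directory: str) -> List[Dict]:
    """Collect the name, docstring, and parameter names of every documented function in a directory."""
    functions = []
    for filename in os.listdir(directory):
        # Only Python modules are parsed; everything else is skipped
        if not filename.endswith(".py"):
            continue
        path = os.path.join(directory, filename)
        with open(path, "r", encoding="utf-8") as file:
            tree = ast.parse(file.read())
        # Walk the whole syntax tree so nested functions are found as well
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                docstring = ast.get_docstring(node)
                params = [arg.arg for arg in node.args.args]
                # Functions without a docstring are skipped, since the docstring is what gets embedded
                if docstring:
                    functions.append({
                        "name": node.name,
                        "docstring": docstring,
                        "params": params
                    })
    return functions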


In [2]:
functions = parse_functions("./submissions/")
random.choice(functions)

{'name': 'json_cleanup',
 'docstring': 'Cleans up a JSON file by removing escaped quotes and returns the cleaned output as a string.\n\nArgs:\n    path (str): The path to the JSON file to be cleaned\n\nReturns:\n    str: The cleaned JSON data as a base64 encoded string\n\nRaises:\n    FileNotFoundError: If the specified path does not exist\n    json.JSONDecodeError: If the file is not a valid JSON\n\nExample:\n    >>> json_cleanup("path/to/file.json")\n    \'{"Hello": "World", ...}\'',
 'params': ['path']}
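
To see what the parser keeps and what it skips, the same `ast` calls can be run on an in-memory source string. The two sample functions below are invented for the illustration, and `parse_source` is a hypothetical helper; the undocumented function is dropped, just as `parse_functions` drops functions without a docstring.

```python
import ast

# Invented sample module: one documented function, one without a docstring.
SAMPLE_SOURCE = '''
def documented(path, limit):
    """Reads a file."""
    return path

def undocumented(x):
    return x
'''

def parse_source(source: str) -> list:
    """Collect name, docstring and parameter names of documented functions."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            docstring = ast.get_docstring(node)
            if docstring:
                found.append({
                    "name": node.name,
                    "docstring": docstring,
                    "params": [arg.arg for arg in node.args.args],
                })
    return found

parsed = parse_source(SAMPLE_SOURCE)
print(parsed)
# [{'name': 'documented', 'docstring': 'Reads a file.', 'params': ['path', 'limit']}]
```

Note that `node.args.args` lists only the regular positional-or-keyword parameters; keyword-only parameters live in `node.args.kwonlyargs` and would need to be collected separately.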

### OpenAI Requests

In [3]:
%load_ext dotenv
%dotenv

## Generating Vector Embeddings

In [4]:
from helpers.authentication import generate_embeddings
from typing import List, Dict
from tqdm import tqdm
import chromadb

client = chromadb.PersistentClient(path="chromadb")
collection = client.get_or_create_collection(name="functions", metadata={ "hnsw:space": "cosine" })

def create_embeddings(functions: List[Dict]) -> List[Dict]:
    for func in tqdm(functions):
        # Single quotes inside the f-string keep this line valid on Python versions before 3.12
        text = f"Function {func['name']}: {func['docstring']} Parameters: {' '.join(func['params'])}"
        func["embedding"] = generate_embeddings(text)
    return functions

def append_to_collection(functions: List[Dict]) -> None:
    documents = []
    metadatas = []
    embeddings = []
    ids = []
    
    for idx, func in enumerate(functions):
        documents.append(func["name"])
        metadatas.append({ "docstring": func["docstring"] })
        embeddings.append(func["embedding"])
        ids.append(str(idx))

    collection.add(
        documents = documents,
        metadatas = metadatas,
        embeddings = embeddings,
        ids = ids
    )


In [5]:
embedded_functions = create_embeddings(functions)
# embedded_functions[random.randint(0, len(embedded_functions) - 1)]

100%|██████████| 47/47 [00:29<00:00,  1.58it/s]


In [6]:
append_to_collection(embedded_functions)

## Allowing the LLM to understand which method to call based on a question

In [7]:
client = chromadb.PersistentClient(path="chromadb")
collection = client.get_collection(name="functions")

def query_collection(query: str) -> Dict:
    result = collection.query(
        query_embeddings = generate_embeddings(query),
        n_results = 1
    )
    return result


In [8]:
query_collection("How many Wednesdays are there in the date range 1986-11-29 to 2008-08-14?")

{'ids': [['33']],
 'embeddings': None,
 'documents': [['counting_days']],
 'uris': None,
 'data': None,
 'metadatas': [[{'docstring': "Calculates the total number of day of the week between two dates.\n\nArgs:\n    day (str): The day of the week to calculate for Mondays, Tues, Wed, THURS, fri, Saturday, or Sunday\n    start (str): The start date in 'year-month-day' format\n    end (str): The end date in 'year-month-day' format\n\nReturns:\n    str: The total number of of day of the week between the two dates\n\nRaises:\n    ValueError: If the day of the week is not one of 'mon', 'tue', 'wed', 'thu', 'fri', 'sat', or 'sun'\n\nExample:\n    >>> counting_days('wed', '2022-01-01', '2022-12-31')\n    '13'"}]],
 'distances': [[0.6521538543993015]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}
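
Chroma returns one inner list per query, so the result above is nested one level deeper than it first appears. A small helper can flatten it. The sketch below works on a plain dict shaped like the output above (with the docstring shortened), and the helper name `flatten_result` is hypothetical.

```python
from typing import Dict, List

def flatten_result(result: Dict, query_index: int = 0) -> List[Dict]:
    """Turn the nested lists for one query into a list of flat match records."""
    return [
        {"id": id_, "name": name, "docstring": meta["docstring"], "distance": dist}
        for id_, name, meta, dist in zip(
            result["ids"][query_index],
            result["documents"][query_index],
            result["metadatas"][query_index],
            result["distances"][query_index],
        )
    ]

# Same shape as the query output, trimmed to the keys the helper reads.
sample = {
    "ids": [["33"]],
    "documents": [["counting_days"]],
    "metadatas": [[{"docstring": "Calculates the total number of day of the week between two dates."}]],
    "distances": [[0.6521538543993015]],
}

matches = flatten_result(sample)
print(matches[0]["name"], round(matches[0]["distance"], 3))  # counting_days 0.652
```

With `n_results` raised above 1, the same helper would return several candidates ordered as Chroma returned them, which makes it easier to inspect near misses.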

## Docstring to function call

The OpenAI API describes function parameters with JSON Schema, so the arguments of a function call are limited to JSON data types:

- string
- number
- boolean
- null/empty
- object
- array

Source: https://community.openai.com/t/function-calling-parameter-types/268564/8
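
As a rough sketch of that conversion, a parsed function record can be mapped onto the tool format used by OpenAI's Chat Completions API. The `Args:` parsing below assumes the Google-style docstrings seen in this notebook; the `PY_TO_JSON` mapping and the helper name `to_tool_schema` are illustrative choices, and unknown or undocumented types fall back to `string`.

```python
import re
from typing import Dict

# Assumed mapping from Python type names in docstrings to the JSON types listed above.
PY_TO_JSON = {"str": "string", "int": "number", "float": "number",
              "bool": "boolean", "list": "array", "dict": "object"}

# Matches docstring lines such as "    path (str): The path to the file"
ARG_LINE = re.compile(r"^\s*(\w+)\s*\((\w+)\):\s*(.+)$")

def to_tool_schema(func: Dict) -> Dict:
    """Build a tool definition from a {'name', 'docstring', 'params'} record."""
    properties = {}
    for line in func["docstring"].splitlines():
        match = ARG_LINE.match(line)
        if match and match.group(1) in func["params"]:
            name, py_type, description = match.groups()
            properties[name] = {
                "type": PY_TO_JSON.get(py_type, "string"),
                "description": description.strip(),
            }
    # Parameters without a documented type are treated as plain strings.
    for name in func["params"]:
        properties.setdefault(name, {"type": "string"})
    return {
        "type": "function",
        "function": {
            "name": func["name"],
            # The first line of the docstring serves as the description.
            "description": func["docstring"].splitlines()[0],
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(func["params"]),
            },
        },
    }
```

Marking every parameter as required is a simplification; functions with default values would need their optional parameters left out of `required`.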

### Dealing with Dict

Dictionaries can be tricky because the number of keys is not known in advance. Treat dictionary parameters as strings in the function definition, then use `json.loads()` to retrieve the necessary data.
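
A minimal sketch of that workaround: the dictionary parameter is declared as a `string` in the schema, and the called function decodes it with `json.loads()`. The function `count_keys`, its `mapping` parameter, and the `MAPPING_PARAM` fragment are hypothetical stand-ins for a real task.

```python
import json

# Schema fragment: the dictionary is advertised to the model as a JSON string.
MAPPING_PARAM = {
    "type": "string",
    "description": "A JSON-encoded object, e.g. '{\"a\": 1, \"b\": 2}'",
}

def count_keys(mapping: str) -> str:
    """Decode the JSON string argument and return the number of keys as a string."""
    data = json.loads(mapping)
    if not isinstance(data, dict):
        raise ValueError("mapping must decode to a JSON object")
    return str(len(data))

print(count_keys('{"a": 1, "b": 2}'))  # prints 2
```

The explicit type check matters because `json.loads()` happily accepts any valid JSON value, including lists and bare numbers.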