## ⚠️ WARNING

This notebook is not used by the API directly; it is intended only for the development phase of the project.

Remove the following development dependencies to ensure a clean production build:

```bash
uv remove sentence_transformers python-dotenv ipykernel ollama uvicorn tqdm
```

## Thought process

The directive is to build an LLM-based application that can automatically answer any question from the previously graded assignments and return the corresponding answer.

Returning a stored correct answer will not be enough: given the instructions, requests will most likely carry new inputs and values, so the underlying scripts have to be run against them.

Sample request:

```bash
curl -X POST "https://your-app.vercel.app/api/" \
  -H "Content-Type: multipart/form-data" \
  -F "question=Download and unzip file abcd.zip which has a single extract.csv file inside. What is the value in the \"answer\" column of the CSV file?" \
  -F "file=@abcd.zip"
```
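To make the "run the script on new inputs" point concrete, here is a minimal sketch of what a task function for the sample question could look like. The name `read_answer_from_zip` and the in-memory test archive are assumptions for illustration; the actual modules in `submissions` may be structured differently.

```python
import csv
import io
import zipfile


def read_answer_from_zip(zip_bytes: bytes, column: str = "answer") -> str:
    """Return the first value of `column` from the single CSV inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if len(csv_names) != 1:
            raise ValueError(f"expected exactly one CSV file, found {len(csv_names)}")
        with archive.open(csv_names[0]) as raw:
            # Wrap the binary stream so csv can read it as text.
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            first_row = next(reader, None)
    if first_row is None or column not in first_row:
        raise ValueError(f"column {column!r} not found")
    return first_row[column]


# Build a tiny archive in memory to exercise the helper (illustrative data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("extract.csv", "id,answer\n1,hello\n")
print(read_answer_from_zip(buffer.getvalue()))
```

Because the function takes the uploaded bytes rather than a fixed path, the same code can answer the question for any archive that follows the same layout.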



### Project Structure

The project needs to be organized so that it stays easily manageable. The structure therefore requires a folder for all the modules, an app that serves the API, a folder for all downloaded content, and a folder for the vector collections.

```text
├───chromadb
│   └───vector collections...
├───helpers
│   ├───helper functions...
│   └───task.py
├───data
│   └───downloaded files...
├───submissions
│   ├───consists of graded assignment modules...
│   └───task.py
├───.gitignore
├───app.py
├───data_preparations.ipynb
├───pyproject.toml
├───README.md
└───uv.lock
```


### Function Calling vs Embeddings

Function calling might be the closest approach. **However**, given the scope of the project, it would inevitably hit the token limit if every task were given its own function definition. A better-suited approach is Retrieval-Augmented Generation (RAG), which not only references authoritative knowledge but also keeps the project within the model's capabilities. A refusal parameter can be added for cases where the model cannot comprehend the user's request.
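The retrieval idea can be sketched without any external service: embed each function once, compare the question's embedding against them, and forward only the best match to the model. The toy three-dimensional vectors, the `count_weekdays` name, and the `min_similarity` threshold below are assumptions for illustration, not values used by the project.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_function(
    query_vec: List[float],
    index: Dict[str, List[float]],
    min_similarity: float = 0.5,
) -> Optional[str]:
    """Return the best-matching function name, or None as a refusal signal."""
    best_name, best_score = None, -1.0
    for name, vec in index.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_similarity else None


# Toy index: real embeddings have far more dimensions.
toy_index = {"count_weekdays": [1.0, 0.0, 0.0], "yt_transcribe": [0.0, 1.0, 0.0]}
print(pick_function([0.9, 0.1, 0.0], toy_index))  # close to count_weekdays
print(pick_function([0.0, 0.0, 1.0], toy_index))  # matches nothing, so None
```

Only the selected function has to be described in the prompt, so the token cost stays roughly constant as more tasks are added.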


## Parsing Python Functions

In [1]:
from typing import List, Dict
import random
import ast
import os

def parse_functions(directory: str) -> List[Dict]:
    folder = os.listdir(directory)
    functions = []
    for function in folder:
        if not function.endswith(".py"):
            continue
        func = os.path.join(directory, function)
        with open(func, "r", encoding="utf-8") as file:
            tree = ast.parse(file.read())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                docstring = ast.get_docstring(node)
                params = [arg.arg for arg in node.args.args]
                if docstring:
                    functions.append({
                        "name": node.name,
                        "docstring": docstring,
                        "params": params
                    })
    return functions


In [2]:
functions = parse_functions("./submissions/")
functions[random.randint(0, len(functions) - 1)]

{'name': 'wikipedia_outline',
 'docstring': 'Fetches the Wikipedia outline for a given country and returns it as a JSON string.\n\nArgs:\n    country (str): The name of the country to fetch the outline for.\n\nReturns:\n    str: A JSON string representing the Wikipedia outline for the given country.\n\nRaises:\n    httpx.HTTPError: If the HTTP request to Wikipedia fails.\n    json.JSONDecodeError: If the response from Wikipedia is not valid JSON.\n\nExample:\n    >>> wikipedia_outline("France")\n    \'{"h1": "France", "h2": "History of France", "h3": "French Revolution"}\'',
 'params': ['country']}
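Two properties of the parser above are easy to miss: functions without a docstring are dropped, and `async def` functions are skipped because only `ast.FunctionDef` is matched. The sketch below works on an in-memory source string and also accepts `ast.AsyncFunctionDef`; `parse_source` and the sample functions are hypothetical and only illustrate the behaviour.

```python
import ast
from typing import Dict, List

SOURCE = '''
def add(a, b):
    """Add two numbers."""
    return a + b

async def fetch(url):
    """Fetch a URL."""

def undocumented(x):
    return x
'''


def parse_source(source: str) -> List[Dict]:
    """Collect name, docstring, and positional parameters of documented functions."""
    found = []
    for node in ast.walk(ast.parse(source)):
        # Accept both regular and async function definitions.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            if docstring:
                found.append({
                    "name": node.name,
                    "docstring": docstring,
                    "params": [arg.arg for arg in node.args.args],
                })
    return found


parsed = parse_source(SOURCE)
print(sorted(func["name"] for func in parsed))  # ['add', 'fetch']
```

If none of the submission modules define async functions, the original check is sufficient; the wider `isinstance` test only matters once they do.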

### OpenAI Requests

In [3]:
%load_ext dotenv
%dotenv

## Generating Vector Embeddings

In [4]:
from helpers.authentication import generate_embeddings
from typing import List, Dict, Any
from tqdm import tqdm
import chromadb

client = chromadb.PersistentClient(path="chromadb")
collection = client.get_or_create_collection(name="functions", metadata={ "hnsw:space": "cosine" })

def create_embeddings(functions: List[Dict]) -> List[Dict]:
    for func in tqdm(functions):
        # Single quotes inside the f-string keep this valid on Python versions before 3.12.
        text = f"Function {func['name']}: {func['docstring']} Parameters: {' '.join(func['params'])}"
        func["embedding"] = generate_embeddings(text)
    return functions

def append_to_collection(functions: List[Dict]) -> None:
    documents = []
    metadatas = []
    embeddings = []
    ids = []
    
    for idx, func in enumerate(functions):
        documents.append(func["name"])
        metadatas.append({ "docstring": func["docstring"] })
        embeddings.append(func["embedding"])
        ids.append(str(idx))

    collection.add(
        documents = documents,
        metadatas = metadatas,
        embeddings = embeddings,
        ids = ids
    )


In [None]:
embedded_functions = create_embeddings(functions)
# embedded_functions[random.randint(0, len(embedded_functions) - 1)]

100%|██████████| 43/43 [00:26<00:00,  1.65it/s]


{'name': 'yt_transcribe',
 'docstring': 'Transcribes a YouTube video using the provided URL.\n\nArgs:\n    url (str): The URL of the YouTube video to transcribe.\n\nReturns:\n    int: The number of words (punctuations included) in the transcription.\n\nExample:\n    >>> yt_transcribe("https://www.youtube.com/watch?v=dQw4w9WgXcQ")\n    \'152\'',
 'params': ['url'],
 'embedding': [-0.027584347873926163,
  0.013848315924406052,
  -0.03941155597567558,
  -0.019668351858854294,
  0.016917401924729347,
  0.02507667988538742,
  -0.02881946787238121,
  0.03218797594308853,
  -0.032693251967430115,
  0.025937519967556,
  0.0026386654935777187,
  -0.011686855927109718,
  0.007298436481505632,
  -0.02090347185730934,
  -0.008687946945428848,
  -0.011321933940052986,
  -0.008163955993950367,
  -0.02371056191623211,
  -0.015607425943017006,
  0.05767636373639107,
  0.00569839496165514,
  -0.04955451190471649,
  0.012753549963235855,
  0.03450850397348404,
  0.03851328790187836,
  -0.008056350983679

In [6]:
append_to_collection(embedded_functions)
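One thing to watch in `append_to_collection` is that the ids come from list positions, so id `"3"` can refer to a different function once a module is added or the file order changes. A possible alternative, sketched here under the assumption that function names are unique across `submissions`, is to key each entry by its name. `build_payload` is a hypothetical helper and the two-dimensional embeddings are made up.

```python
from typing import Any, Dict, List


def build_payload(functions: List[Dict[str, Any]]) -> Dict[str, list]:
    """Collect ids, documents, metadatas, and embeddings, using the function name as id."""
    payload: Dict[str, list] = {"ids": [], "documents": [], "metadatas": [], "embeddings": []}
    seen = set()
    for func in functions:
        name = func["name"]
        if name in seen:
            # Name-based ids only work if every name is unique.
            raise ValueError(f"duplicate function name: {name}")
        seen.add(name)
        payload["ids"].append(name)
        payload["documents"].append(name)
        payload["metadatas"].append({"docstring": func["docstring"]})
        payload["embeddings"].append(func["embedding"])
    return payload


# Illustrative entries with made-up two-dimensional embeddings.
demo = [
    {"name": "wikipedia_outline", "docstring": "Fetches the outline.", "embedding": [0.1, 0.2]},
    {"name": "yt_transcribe", "docstring": "Transcribes a video.", "embedding": [0.3, 0.4]},
]
payload = build_payload(demo)
print(payload["ids"])
```

Assuming the installed `chromadb` version provides `Collection.upsert`, calling `collection.upsert(**payload)` instead of `collection.add(...)` would let the cell be re-run without creating stale entries.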

## Allowing the LLM to understand which method to call based on a question

In [None]:
client = chromadb.PersistentClient(path="chromadb")
collection = client.get_collection(name="functions")

def query_collection(query: str) -> Dict:
    result = collection.query(
        query_embeddings = generate_embeddings(query),
        n_results = 1
    )
    return result


In [None]:
query_collection("How many Wednesdays are there in the date range 1986-11-29 to 2008-08-14?")
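The query returns a dictionary whose values are nested lists, one inner list per query embedding. The sketch below unpacks the top hit from a result of that shape. `top_match` is a hypothetical helper, and `sample_result` is hand-written to mirror the structure (including the hypothetical `count_wednesdays` name); it is not output captured from this notebook.

```python
from typing import Any, Dict, Optional


def top_match(result: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Flatten the first hit of a single-query result, or return None if empty."""
    if not result.get("ids") or not result["ids"][0]:
        return None
    return {
        "name": result["documents"][0][0],
        "docstring": result["metadatas"][0][0]["docstring"],
        "distance": result["distances"][0][0],
    }


# Hand-written stand-in for a query result; all values are illustrative.
sample_result = {
    "ids": [["7"]],
    "documents": [["count_wednesdays"]],
    "metadatas": [[{"docstring": "Counts Wednesdays in a date range."}]],
    "distances": [[0.21]],
}
match = top_match(sample_result)
print(match["name"], match["distance"])
```

With the cosine space configured on the collection, a smaller distance means a closer match, so the `distance` field is the natural place to apply a refusal cut-off.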

## Docstring to function call

The OpenAI API only accepts function parameters that are valid JSON data types, so a function call can only contain the following types:

- string
- number
- boolean
- null/empty
- object
- array

Source: https://community.openai.com/t/function-calling-parameter-types/268564/8
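Given the parsed entries (name, docstring, params), a function definition for the model can be assembled mechanically. The sketch below follows the tool format used by the OpenAI Chat Completions API as I understand it, and types every parameter as `string` because the parser above does not record annotations; both points are assumptions to verify against the current API reference. `to_tool_schema` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def to_tool_schema(func: Dict[str, Any]) -> Dict[str, Any]:
    """Build a tool definition from a parsed function entry."""
    # Assumption: every parameter is passed as a string and converted later.
    properties = {param: {"type": "string"} for param in func["params"]}
    return {
        "type": "function",
        "function": {
            "name": func["name"],
            "description": func["docstring"],
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(func["params"]),
            },
        },
    }


entry = {
    "name": "wikipedia_outline",
    "docstring": "Fetches the Wikipedia outline for a given country.",
    "params": ["country"],
}
schema = to_tool_schema(entry)
print(json.dumps(schema, indent=2))
```

Only the function retrieved from the vector collection needs to be converted this way, which keeps the request small.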

### Dealing with Dict

Dictionaries can be tricky because the number of keys they hold is not known in advance. Instruct the program to treat dictionaries as strings, then use the `json.loads()` function to retrieve the necessary data.
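A minimal sketch of that conversion is shown below. `parse_dict_argument` is a hypothetical helper, and the sample string stands in for whatever the model returns for a dictionary-valued parameter.

```python
import json
from typing import Any, Dict


def parse_dict_argument(raw: str) -> Dict[str, Any]:
    """Decode a JSON string argument and make sure it is an object."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"argument is not valid JSON: {exc}") from exc
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object")
    return value


headers = parse_dict_argument('{"Accept": "application/json", "retries": 3}')
print(headers["retries"])
```

Checking the decoded type up front gives a clear error when the model sends a list or a bare string instead of an object.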