## ⚠️ WARNING

This notebook isn't used directly by the API; it is intended only for the development process of the project.

Please remove the following packages to ensure a clean production build:

```bash
uv remove sentence_transformers openai python-dotenv
```

## Thought process

The goal is to build an LLM-based application that automatically answers any question from the previously graded assignments and returns the corresponding answer.

Returning a stored answer is not enough: going by the instructions, requests will most likely carry new inputs and values, so the matching script has to be run against them.

Sample request:

```bash
curl -X POST "https://your-app.vercel.app/api/" \
  -H "Content-Type: multipart/form-data" \
  -F "question=Download and unzip file abcd.zip which has a single extract.csv file inside. What is the value in the \"answer\" column of the CSV file?" \
  -F "file=@abcd.zip"
```



### Project Structure

The project needs to be organized so that it is easy to manage. The structure therefore has a folder for the modules, an app that serves the API, a folder for downloaded content, and a folder for the vector collections.

```bash
├───chromadb
│   └───vector collections...
├───data
│   └───downloaded files...
├───submissions
│   ├───consists of graded assignment modules...
│   └───task.py
├───.gitignore
├───app.py
├───data_preparations.ipynb
├───pyproject.toml
├───README.md
└───uv.lock
```


### Function Calling vs Embeddings

Function calling might be the closest fit. **However**, given the scope of the project, giving every task its own function call would inevitably hit the token limit. A better-suited approach is Retrieval-Augmented Generation (RAG), which not only references authoritative knowledge but also keeps the scope of the project within the model's capabilities. A refusal parameter can be added for cases where the model cannot make sense of the user's request.
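
The retrieval step with a refusal threshold can be sketched in plain Python. This is a minimal illustration, not the project's implementation: `cosine`, `pick_function`, the `min_score` value, and the toy three-dimensional vectors are all assumptions made for the example; real embeddings would come from an embedding model.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_function(
    query: List[float],
    candidates: Dict[str, List[float]],
    min_score: float = 0.5,  # hypothetical refusal threshold
) -> Optional[str]:
    """Return the best-matching function name, or None to signal a refusal."""
    best_name, best_score = None, -1.0
    for name, vector in candidates.items():
        score = cosine(query, vector)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_score else None


# Toy vectors standing in for real embeddings (assumption for illustration).
candidates = {
    "hash_files": [0.9, 0.1, 0.0],
    "unzip_csv": [0.1, 0.9, 0.0],
}
print(pick_function([0.8, 0.2, 0.0], candidates))  # closest to hash_files
print(pick_function([0.0, 0.0, 1.0], candidates))  # orthogonal to both, so None
```

Only the function names and their vectors are compared here, so the prompt size stays constant no matter how many tasks are registered.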


## Parsing Python Functions

In [None]:
from typing import List, Dict
import ast
import os

def parse_functions(directory: str) -> List[Dict]:
    """Collect name, docstring, and parameters of every function in a folder's .py files."""
    functions = []
    # Sort the listing so the order (and later the collection ids) is stable across runs.
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".py"):
            continue
        filepath = os.path.join(directory, filename)
        with open(filepath, "r", encoding="utf-8") as file:
            tree = ast.parse(file.read())
        # ast.walk also visits nested functions and methods, not only top-level ones.
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                docstring = ast.get_docstring(node) or ""
                params = [arg.arg for arg in node.args.args]
                functions.append({
                    "name": node.name,
                    "docstring": docstring,
                    "params": params,
                    "filepath": filepath
                })
    return functions


In [None]:
functions = parse_functions("./submissions/")
functions[0]
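
To see what a single record looks like without touching the `submissions` folder, the same `ast` calls can be run on an in-memory source string. The `sha256_of_files` function below is a made-up example, not one of the project's modules.

```python
import ast

# Hypothetical module source, used only for illustration.
source = '''
def sha256_of_files(paths, chunk_size):
    """Return the SHA256 of the combined content of the given files."""
    ...
'''

tree = ast.parse(source)
records = [
    {
        "name": node.name,
        "docstring": ast.get_docstring(node) or "",
        "params": [arg.arg for arg in node.args.args],
    }
    for node in ast.walk(tree)
    if isinstance(node, ast.FunctionDef)
]
print(records[0])
```

The docstring is what carries most of the meaning into the embedding later on, so functions without one will be harder to match to a question.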

### OpenAI Requests

In [None]:
%load_ext dotenv
%dotenv

In [None]:
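import httpx
import os

URL = "https://llmfoundry.straive.com/openai/v1/embeddings"
KEY = os.environ["AIPROXY_TOKEN"]  # must be set, e.g. through the .env file loaded above

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {KEY}",
}

def get_embeddings(text):
    """Return the embedding vector for a single string."""
    data = {
        "model": "text-embedding-3-small",
        "input": text,
    }
    response = httpx.post(URL, headers=headers, json=data)
    # Fail loudly on HTTP errors instead of hitting a KeyError on the error body.
    response.raise_for_status()
    return response.json()["data"][0]["embedding"]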

## Generating Vector Embeddings

In [None]:
from typing import List, Dict
import chromadb

client = chromadb.PersistentClient(path="chromadb")
collection = client.get_or_create_collection(name="functions")

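def generate_embeddings(functions: List[Dict]) -> List[Dict]:
    """Attach an "embedding" key to each function record (modifies the records in place)."""
    for func in functions:
        # Single quotes inside the f-string keep this valid on Python versions before 3.12.
        text = f"Function {func['name']}: {func['docstring']} Parameters: {' '.join(func['params'])}"
        func["embedding"] = get_embeddings(text)
    return functions

def append_to_collection(functions: List[Dict]) -> None:
    """Store each function name with its embedding, using the list index as the id."""
    for idx, func in enumerate(functions):
        collection.add(
            documents=[func["name"]],
            embeddings=[func["embedding"]],
            ids=[str(idx)]
        )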


In [None]:
embedded_functions = generate_embeddings(functions)
embedded_functions[0]

In [None]:
append_to_collection(embedded_functions)

## Allowing the LLM to understand which method to call based on a question

In [None]:
import submissions as sub

client = chromadb.PersistentClient(path="chromadb")
collection = client.get_collection(name="functions")

print(sub.cosine_similarity(collection, "Get all these files and give me the SHA256 of all their content combined?"))
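
`sub.cosine_similarity` lives in the `submissions` package and is not shown in this notebook. As a rough idea of what such a lookup could involve, the sketch below embeds the question, queries the collection, and returns the stored function names with their distances. The name `find_functions` and the injected `embed` callable are assumptions made for illustration; `collection.query` is Chroma's query method, whose result holds one inner list per query.

```python
from typing import Callable, List, Tuple


def find_functions(
    collection,
    question: str,
    embed: Callable[[str], List[float]],
    n_results: int = 1,
) -> List[Tuple[str, float]]:
    """Return (function name, distance) pairs, closest match first."""
    result = collection.query(
        query_embeddings=[embed(question)],
        n_results=n_results,
    )
    # One query was sent, so take the first (and only) inner list.
    names = result["documents"][0]
    distances = result["distances"][0]
    return list(zip(names, distances))
```

With the objects defined earlier, this could be called as `find_functions(collection, question, get_embeddings)`. Chroma reports distances, so smaller values mean closer matches; the metric itself depends on how the collection was configured.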
