## Thought process

The directive is to build an LLM-based application that automatically answers any question from the previously graded assignments and returns the corresponding answer.

It won't be as simple as returning a stored answer: per the instructions, each request will most likely carry new inputs and values, so the scripts have to be run against them.

Sample request

```bash
curl -X POST "https://your-app.vercel.app/api/" \
  -H "Content-Type: multipart/form-data" \
  -F "question=Download and unzip file abcd.zip which has a single extract.csv file inside. What is the value in the \"answer\" column of the CSV file?" \
  -F "file=@abcd.zip"
```
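The sample question above can be handled with the standard library alone. The following is a minimal sketch of that one task, not the project's actual module: the function name `answer_from_zip` and the in-memory test archive are hypothetical, and it assumes the upload arrives as raw bytes and that the first data row holds the wanted value.

```python
import csv
import io
import zipfile


def answer_from_zip(zip_bytes: bytes, column: str = "answer") -> str:
    """Return the first value of `column` from the single CSV inside a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # The question promises exactly one CSV file; fail loudly otherwise.
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if len(csv_names) != 1:
            raise ValueError(f"expected one CSV file, found {len(csv_names)}")
        with archive.open(csv_names[0]) as raw:
            # utf-8-sig also strips a byte order mark if the file has one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            first_row = next(csv.DictReader(text), None)
    if first_row is None or column not in first_row:
        raise KeyError(f"column {column!r} not found")
    return first_row[column]


# Hypothetical stand-in for abcd.zip, built in memory for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("extract.csv", "id,answer\n1,a1b2c3\n")

print(answer_from_zip(buffer.getvalue()))  # a1b2c3
```

Because the function works on bytes, the same code can serve an uploaded file or a file read from the `data` folder.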

This project will rely on a [Waterfall Model](https://www.geeksforgeeks.org/waterfall-model/). The project first needs to identify the following issues before it can move on to the next objective.

To ensure completion, the following objectives must be addressed:

1. Parsing the Python files to get the methods
2. Allowing the LLM to understand which method to call based on the `question`
3. Setting up the API


### Project Structure

The project needs to be organized so that it is easy to manage. The structure therefore requires a folder to store all the modules, an app to serve as the API, a folder for all downloaded content, and a folder for all vector collections.

```bash
├───chromadb
│   └───vector collections...
├───data
│   └───downloaded files...
├───docs
│   └───documentation purposes...
├───submissions
│   ├───consists of graded assignment modules...
│   └───task.py
├───.gitignore
├───app.py
├───pyproject.toml
├───README.md
└───uv.lock
```


### Function Calling vs Embeddings

Function Calling might be the closest approach, **HOWEVER**, given the scope of the project, it would inevitably hit the token limit if each task were given its own function definition. A better-suited approach is Retrieval-Augmented Generation (RAG), which not only references authoritative knowledge but also keeps the project within its capabilities. A refusal parameter can be added for cases where the model cannot comprehend the user's request.
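The retrieval-plus-refusal idea can be sketched without any model at all. The example below is an illustration under loud assumptions: word counts stand in for real embeddings, and the `retrieve` function, its `min_score` threshold of 0.25, and the two task descriptions are hypothetical choices made only to show the mechanism.

```python
import math
import re
from collections import Counter
from typing import Dict, Optional


def bag_of_words(text: str) -> Counter:
    """Stand-in for a real embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, tasks: Dict[str, str], min_score: float = 0.25) -> Optional[str]:
    """Return the best-matching task name, or None (a refusal) below min_score."""
    query = bag_of_words(question)
    scores = {name: cosine(query, bag_of_words(desc)) for name, desc in tasks.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < min_score:
        return None
    return best


tasks = {
    "compare_files": "Compare two files line by line and count the number of differing lines",
    "vscode_info": "Returns the process usage and diagnostic information with vscode",
}

print(retrieve("How many lines differ between the two files a.txt and b.txt?", tasks))  # compare_files
print(retrieve("Translate this poem into French.", tasks))  # None
```

Only the single retrieved task has to be described to the LLM, which is what keeps the prompt small compared with sending one function definition per task.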


## Graded Assignment Breakdowns

This section lists each task of each graded assignment and its associated module within the project. It also serves as a checklist to identify which tasks are possible with a Pythonic approach. __Minimize this section if necessary__.

⚠️ Refers to a task that might rely on external control

💣 Refers to a task that cannot be accomplished with just Python

🚫 Refers to a task that cannot be accomplished

- Graded Assignment 1
  - [VS Code Version](../submissions/vscode_info.py) 
  - [Make HTTP Requests with uv](../submissions/uv_requests.py)
  - [Run command with npx](../submissions/npx_prettier.py) ⚠️
  - [Use Google Sheets](../submissions/google_sheets.py) 💣
  - [Use Excel](../submissions/excel.py) 💣
  - [Use DevTools](../submissions/chrome_devtools.py) 💣
  - [Count Wednesdays](../submissions/counting_days.py)
  - [Extract CSV from a ZIP](../submissions/zipfile_extract.py)
  - [Use JSON](../submissions/sorting_students.py)
  - [Multi-cursor edits](../submissions/json_cleanup.py)
  - [CSS Selectors](../submissions/css_selectors.py)
  - [Process files with different encoding](../submissions/process_encoding.py)
  - [Use GitHub](../submissions/github_email.py) ⚠️
  - [Replace across files](../submissions/replace_across.py) 
  - [List files and attributes](../submissions/sort_attributes.py)
  - [Move and rename files](../submissions/move_rename.py)
  - [Compare files](../submissions/compare_files.py) 
  - [SQL: Ticket Sales](../submissions/ticket_sales.py)
- Graded Assignment 2
  - [Write documentation in Markdown](../submissions/create_markdown.py)
  - [Compress an image](../submissions/compress_image.py)
  - [Use an Image Library in Google Colab](../submissions/image_colab.py) 💣
  - [Deploy a Python API to Vercel](../submissions/vercel_api.py) ⚠️
  - [Create a GitHub Action](../submissions/github_action.py) ⚠️
  - [Push an image to Docker Hub](../submissions/docker_hub.py) 💣
  - [Write a FastAPI server to serve data](../submissions/fastapi_server) 
  - [Run a local LLM with Llamafile and ngrok](../submissions/local_lllm) ⚠️
- Graded Assignment 3
  - [LLM Sentiment Analysis](../submissions/sentiment_analysis.py)
  - [LLM Token Costs](../submissions/token_costs.py)
  - [Generate addresses with LLM](../submissions/generate_addresses.py)
  - [Base64 Encoding](../submissions/base64_encoding.py)
  - [LLM Embeddings](../submissions/llm_embeddings.py)
  - [Embedding Similarity](../submissions/cosine_similarity.py)
  - [Vector Databases](../submissions/vector_databses.py)
  - [Function Calling](../submissions/function_calling.py)
  - [Prompt Engineering](../submissions/prompt_engineering.py) 🚫
- Graded Assignment 4
  - [Import HTML to Google Sheets](../submissions/html_google.py) 💣
  - [Scrape IMDb Movies](../submissions/imdb_movies.py)
  - [Wikipedia Outline](../submissions/wikiepedia_outline.py)
  - [Scrape the BBC Weather API](../submissions/bbc_weather.py)
  - [Find the bounding box of a city](../submissions/bounding_box.py)
  - [Search Hacker News](../submissions/hacker_news.py)
  - [Find newest GitHub User](../submissions/newest_user.py)
  - [Scheduled GitHub Action](../submissions/github_actions.py)
  - [Extract tables from PDF](../submissions/extract_tables.py)
  - [Convert PDF to Markdown](../submissions/pdf_to_markdown.py)
- Graded Assignment 5
  - [Clean up Excel sales data](../submissions/clean_sales.py) 💣
  - [Clean up student marks](../submissions/clean_student_marks.py)
  - [Apache log requests](../submissions/log_requests.py) 💣
  - [Apache log downloads](../submissions/log_request_downloads.py)
  - [Parsing JSON](../submissions/parse_json.py)
  - [Extract nested JSON keys](../submissions/extract_keys.py)
  - [DuckDB: Social Media Interactions](../submissions/duckdb_interactions.py) ⚠️
  - [Transcribe a Youtube video](../submissions/yt_transcribe.py) ⚠️
  - [Reconstruct an image](../submissions/jigsaw_image.py) ⚠️


## Parsing Python Functions

```python
from typing import List, Dict
import ast
import os

def parse_functions(directory: str) -> List[Dict]:
    """Collect the name, docstring, parameters and file path of every function in a folder."""
    functions = []
    for filename in os.listdir(directory):
        # Only Python modules are parsed
        if not filename.endswith(".py"):
            continue
        filepath = os.path.join(directory, filename)
        with open(filepath, "r", encoding="utf-8") as file:
            tree = ast.parse(file.read())
        # ast.walk visits every node, so nested functions and methods are included too
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                docstring = ast.get_docstring(node) or ""
                params = [arg.arg for arg in node.args.args]
                functions.append({
                    "name": node.name,
                    "docstring": docstring,
                    "params": params,
                    "filepath": filepath
                })
    return functions

functions = parse_functions("../submissions/")
```

```python
[{'name': 'compare_files',
  'docstring': 'Compare two files line by line and count the number of differing lines.\n\nArgs:\n    path1 (str): The file path of the first file to compare.\n    path2 (str): The file path of the second file to compare.\n\nReturns:\n    int: The number of lines that differ between the two files.\n\nRaises:\n    FileNotFoundError: If the specified directory does not exist.\n    \nExample:\n    >>> compare_files("a.txt", "b.txt")\n    38',
  'params': ['path1', 'path2'],
  'filepath': '../submissions/compare_files.py'},
 {'name': 'replace_across',
  'docstring': '',
  'params': ['path', 'replace_from', 'replace_to'],
  'filepath': '../submissions/replace_across.py'},
 {'name': 'vscode_info',
  'docstring': 'Returns the process usage and diagnostic information with vscode through the `code -s` command.\n\nReturns:\n    str: The result from running `code -s` in the terminal.',
  'params': [],
  'filepath': '../submissions/vscode_info.py'}]
```
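Each parsed entry carries enough information to load and run the function it describes. The sketch below shows one possible way to do that with `importlib`; the helper `call_parsed_function` and the temporary `add_numbers.py` module are hypothetical and exist only for this illustration. It assumes the target is a top-level function, so it would not work for the nested functions or methods that `ast.walk` also reports.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Any, Dict


def call_parsed_function(entry: Dict, **kwargs: Any) -> Any:
    """Import the module at entry["filepath"] and call entry["name"] with kwargs."""
    # Reject arguments that the parser did not see in the signature.
    unknown = set(kwargs) - set(entry["params"])
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    module_name = Path(entry["filepath"]).stem
    spec = importlib.util.spec_from_file_location(module_name, entry["filepath"])
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {entry['filepath']}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, entry["name"])(**kwargs)


# Hypothetical module written to a temporary folder for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "add_numbers.py"
    path.write_text("def add_numbers(a, b):\n    return a + b\n", encoding="utf-8")
    entry = {"name": "add_numbers", "docstring": "", "params": ["a", "b"], "filepath": str(path)}
    result = call_parsed_function(entry, a=2, b=3)

print(result)  # 5
```

Checking the keyword arguments against `params` before importing gives an early, readable error when the LLM proposes an argument the function does not accept.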

## Generating Vector Embeddings

```python
from sentence_transformers import SentenceTransformer
from typing import List, Dict
import chromadb

# Persistent vector store kept in the project's chromadb folder
client = chromadb.PersistentClient(path="chromadb")
collection = client.get_or_create_collection(name="functions")
model = SentenceTransformer('all-distilroberta-v1')

def generate_embeddings(functions: List[Dict]) -> List[Dict]:
    for func in functions:
        # Single quotes inside the f-string keep it valid on Python versions before 3.12
        text = f"Function {func['name']}: {func['docstring']} Parameters: {' '.join(func['params'])}"
        func["embedding"] = model.encode(text)
    return functions
```
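The embeddings still have to be written to the collection and looked up at query time. The following is a sketch of one way to do that, assuming `collection` is a ChromaDB collection such as the one created above; the helper names `store_functions`, `closest_function` and `_as_list`, the id scheme, and the metadata layout are hypothetical. The block only defines functions and does not import `chromadb` itself.

```python
from typing import Any, Dict, List


def _as_list(embedding: Any) -> List[float]:
    """Convert a NumPy array (as returned by model.encode) or any sequence to a list."""
    return embedding.tolist() if hasattr(embedding, "tolist") else list(embedding)


def store_functions(collection: Any, functions: List[Dict]) -> None:
    """Write every embedded function to the collection, replacing older copies."""
    collection.upsert(
        # File path plus function name keeps ids unique across modules.
        ids=[f"{func['filepath']}::{func['name']}" for func in functions],
        embeddings=[_as_list(func["embedding"]) for func in functions],
        # Metadata values are kept scalar, so the parameter list is joined into one string.
        metadatas=[
            {
                "name": func["name"],
                "filepath": func["filepath"],
                "params": ",".join(func["params"]),
            }
            for func in functions
        ],
    )


def closest_function(collection: Any, query_embedding: Any) -> Dict:
    """Return the metadata of the stored function nearest to the query embedding."""
    result = collection.query(query_embeddings=[_as_list(query_embedding)], n_results=1)
    # Results are nested per query: the first list belongs to our single query.
    return result["metadatas"][0][0]
```

With the objects defined above, `store_functions(collection, generate_embeddings(functions))` would fill the collection once, and `closest_function(collection, model.encode(question))` would return the `name` and `filepath` of the best candidate for a new `question`. Using `upsert` rather than `add` is a design choice here so that re-running the notebook does not fail on ids that already exist.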