Point it at a repo, get back "this is an e-commerce app that does X".
Pattern-based application functional-category inference from routes, data models, and README. Zero runtime dependencies (pure stdlib). Optional LLM polish β bring your own provider.
Onboarding to a new repo, every engineer asks the same questions: "What does this thing do? Is it a CRUD app or a queue worker? What database? What ports does it need?" The README is usually wrong or stale. Codeowners are unavailable. You end up grepping for clues.
app-classifier answers those questions in under a second for any repo, on disk, with no network calls:
- What kind of app is this? β e-commerce, blog, social network, admin panel, REST API, auth/SSO, file management, scheduling, or messaging (9 categories, weighted-pattern matching, confidence-scored)
- What does it do? β a 2-3 sentence functional description, deterministically composed, optionally LLM-polished
- How does it deploy? β runtime + version, framework, web server, databases, caches, ports, env vars, container base image, runtime CVEs
pip install app-classifier
app-classifier ./my-repo=== my-repo ===
Category: e-commerce (78% confidence)
Runtime: python 3.11
Framework: FastAPI
Deploys as: ASGI server (uvicorn / hypercorn / daphne)
Databases: PostgreSQL, SQLAlchemy ORM
Cache/Queue: Redis, Celery
Features: online shopping, messaging
π Summary: my-repo Β· python 3.11 Β· FastAPI Β· 23 HTTP route(s) Β· 5 data model(s) Β· DB: PostgreSQL, SQLAlchemy ORM
π What it does:
my-repo is a e-commerce application. Primary functionality: online shopping, messaging.
It models entities like Cart, Order, Product, User serving authenticated users.
π HTTP Routes (23 found):
GET /products β list_products
POST /cart/add β add_to_cart
POST /checkout β checkout
...
from app_classifier import classify
result = classify("./my-repo")
print(result.app_category) # 'e-commerce'
print(result.app_category_confidence) # 0.78
print(result.detected_features) # ['online shopping', 'messaging']
print(result.functional_description) # "my-repo is a e-commerce application. ..."
# Full structured access
for route in result.routes:
print(route.method, route.path, route.handler)
for model in result.data_models:
print(model.name, model.framework, model.fields_hint)
# JSON-serializable
import json
print(json.dumps(result.to_dict(), indent=2))Skip the classifier, use hosting directly:
from app_classifier import analyze_hosting_requirements
report = analyze_hosting_requirements("./my-repo")
print(report.runtime) # {'language': 'python', 'version': '3.11'}
print(report.web_server) # {'framework': 'FastAPI', 'deployment_target': '...'}
print(report.databases) # [{'name': 'PostgreSQL', ...}, ...]
print(report.ports) # [{'port': 8000, 'source': 'Dockerfile', ...}]
print(report.web_server_vulnerabilities) # CVEs on the container base imageclassify_async accepts ANY async callable as the LLM provider β no SDK pinned. If the LLM gives a useful response, the deterministic functional_description is replaced with the polished version; on any failure (timeout / parse error / hallucination guard / no provider) the deterministic version is kept.
# OpenAI shim
async def my_openai_provider(prompt, max_tokens=400, temperature=0.2):
import openai
client = openai.AsyncOpenAI()
resp = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens, temperature=temperature,
)
return resp.choices[0].message.content
# Anthropic shim
async def my_anthropic_provider(prompt, max_tokens=400, temperature=0.2):
import anthropic
client = anthropic.AsyncAnthropic()
resp = await client.messages.create(
model="claude-haiku-4-5",
max_tokens=max_tokens, temperature=temperature,
messages=[{"role": "user", "content": prompt}],
)
return resp.content[0].text
# Local llama.cpp / Ollama shim
async def my_ollama_provider(prompt, max_tokens=400, temperature=0.2):
import httpx
async with httpx.AsyncClient() as client:
r = await client.post("http://localhost:11434/api/generate", json={
"model": "llama3", "prompt": prompt, "stream": False,
"options": {"num_predict": max_tokens, "temperature": temperature},
})
return r.json().get("response")
# Use any of the above
import asyncio
from app_classifier import classify_async
result = asyncio.run(classify_async("./my-repo", llm_provider=my_openai_provider))
print(result.functional_description)Python, Java (JDK 8+), Node.js, Go, Ruby, PHP, Rust β detected from manifest files, Dockerfiles, version files (.nvmrc, .python-version, .ruby-version).
| Language | Frameworks |
|---|---|
| Python | Flask, FastAPI, Django |
| Java | Spring Boot, Struts 2 (struts.xml), classic Spring |
| Node | Express, Fastify, NestJS, Next.js |
| ORM | Detected from |
|---|---|
| JPA / Hibernate | @Entity, @Table annotations |
| SQLAlchemy | class X(Base) |
| Django ORM | class X(models.Model) |
PostgreSQL, MySQL, MongoDB, H2, Oracle, SQL Server, MariaDB, Redis, RabbitMQ, Kafka, Elasticsearch, Celery.
Dockerfile (FROM, EXPOSE, ENV), docker-compose, Kubernetes manifests, Helm charts, k8s deployment YAML, Heroku Procfile, Vercel / Netlify configs.
Curated CVE manifest for nginx, Apache HTTPD, Tomcat, OpenJDK / Eclipse Temurin / Amazon Corretto. ~30 high-impact CVEs covered out of the box. PRs welcome.
e-commerce, blog/content, social network, admin panel/dashboard, REST API service, authentication/SSO, file/document management, scheduling/booking, messaging/notification. Each is matched by a weighted regex pattern against routes + model names + README.
- Walk every manifest/config file in the repo (capped at 800 files for speed)
- Each file extracts language-specific signals (Maven artifact IDs, npm package names, Python deps, Dockerfile FROM, k8s containerPort, etc.) β
HostingReport - Walk source files to extract HTTP routes + data models per framework
- Pattern-match routes + model names + README purpose against 9 category fingerprints (weighted regex)
- Compose the 2-3 sentence functional description deterministically
- (Optional) Hand the structured signals to your LLM for a polished rewrite
Time budget: under 1 second on a 5K-file repo. Bounded scan caps file count + per-file read size.
- No network. Every signal comes from on-disk content. Bundled CVE manifest, no live API calls.
- No SDK pin. The LLM step is provider-agnostic β bring your own callable. We never
import openai. - No surprises. Failures on individual files don't kill the pass. Confidence is always reported; the consumer decides whether to trust it.
- Pure read. We never modify the target repo.
PRs welcome on three axes:
- More category fingerprints β
_CATEGORY_FINGERPRINTSinclassifier.py. Each is{ name, feature_label, signals: [(regex, weight), ...] }. - More CVE entries β
data/web_server_cves.json. Schema is documented in the file header. - More framework extractors β route + model extraction for Ruby on Rails, Phoenix, ASP.NET Core, Gin, Rocket, etc. would all be welcome.
pip install -e ".[test]"
pytestTests use fixture directories under tests/fixtures/ β point the classifier at each, assert the expected category + features.
- Not a security scanner. It surfaces runtime CVEs on the container base image, but the rest of the code is for understanding, not vulnerability detection.
- Not a deployment tool. It tells you what the deployment looks like; it doesn't deploy anything.
- Not a replacement for a README. It generates a structural sketch; humans still write the narrative.
If you want a full security analysis + fix pipeline that uses this internally, see Codefixer (closed-source).
MIT β see LICENSE. Use it however you want. Attribution appreciated but not required.
Extracted from Codefixer's hosting_requirements + app_description analyzers. The category-fingerprint approach was inspired by Sourcegraph's "what is this repo?" tooling and the way Backstage classifies services.