FynCut is an AI-powered, serverless video editing platform that automatically converts long-form video content (like podcasts, interviews, and talk shows) into highly engaging, viral, vertical (9:16) clips ready for TikTok, YouTube Shorts, and Instagram Reels.
The system leverages Google Gemini for identifying key moments, WhisperX for fast word-level audio transcription, and intelligent face tracking/active speaker detection (Columbia ASD) to automatically crop and focus the camera on whoever is speaking.
FynCut was originally built and launched almost a year ago as a commercial SaaS product designed to automate vertical clip editing for creators. However, running it solo, which meant managing serverless GPU infrastructure, API costs, and marketing while dealing with founder burnout, proved challenging.
Instead of letting the code sit idle, I have open-sourced the entire project for the developer community to study, self-host, and extend!
- Read the full story & post-mortem blog post: Why I'm Open-Sourcing My AI Video SaaS (FynCut)
- Intelligent Reframing (Face Tracking): Automatically crops horizontal video (16:9) to vertical (9:16) by dynamically panning/tracking the active speaker using a face-detection pipeline.
- AI Moment Identification: Analyzes transcriptions with Google Gemini API to detect high-impact moments (viral hooks, stories, emotional segments, key questions) and gives them viral scores.
- GPU-Accelerated Transcription: Runs WhisperX on serverless GPUs to produce word-level timestamps quickly.
- TikTok-Style Captions: Generates synchronized, word-by-word highlighted captions.
- Custom Caption Burner: Allows users to customize caption fonts (Poppins, Anton, Montserrat, etc.), sizes, layout positions, and burn them directly into the video.
- Scale-to-Zero Serverless Backend: Powered by Modal, which runs GPU-heavy workloads only when needed to minimize infrastructure costs.
- Robust Background Queues: Built with Next.js and Inngest to handle long-running video processing asynchronously with automatic retries and exponential backoff.
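The reframing feature listed above comes down to choosing a 9:16 window inside each 16:9 frame and keeping it centered on the active speaker. The following is a minimal sketch of that geometry only, not FynCut's actual implementation: the names `CropWindow`, `vertical_crop`, and `smooth_centers` are hypothetical, and the exponential smoothing used to reduce panning jitter is an assumption about how such a pipeline might behave.

```python
from dataclasses import dataclass


@dataclass
class CropWindow:
    x: int       # left edge of the crop, in source pixels
    y: int       # top edge of the crop, in source pixels
    width: int
    height: int


def vertical_crop(frame_w: int, frame_h: int, face_cx: float) -> CropWindow:
    """Return a full-height 9:16 window centered on face_cx, clamped to the frame."""
    crop_h = frame_h
    # Floor to an even width, since many video encoders expect even dimensions.
    crop_w = int(crop_h * 9 / 16) // 2 * 2
    crop_w = min(crop_w, frame_w)
    x = int(round(face_cx - crop_w / 2))
    # Clamp so the window never extends past the left or right frame edge.
    x = max(0, min(x, frame_w - crop_w))
    return CropWindow(x=x, y=0, width=crop_w, height=crop_h)


def smooth_centers(centers: list[float], alpha: float = 0.2) -> list[float]:
    """Exponentially smooth per-frame face centers (assumed jitter-reduction step)."""
    smoothed: list[float] = []
    for cx in centers:
        prev = smoothed[-1] if smoothed else cx
        smoothed.append(prev + alpha * (cx - prev))
    return smoothed
```

For a 1920x1080 source this yields a 606x1080 window. A speaker near the frame edge pins the window to that edge instead of producing an out-of-bounds crop.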
FynCut uses a decoupled, full-stack architecture divided into a Next.js client-facing web application and a serverless Modal Python backend.
graph TD
User(["User Browser"]) <--> |"HTTP / WebSockets"| NextJS["Next.js Frontend"]
NextJS <--> |"Prisma ORM"| PostgreSQL[("PostgreSQL Database")]
NextJS <--> |"Presigned Upload / Download"| S3[("AWS S3 Storage")]
NextJS <--> |"Trigger Workflows"| Inngest["Inngest Event Engine"]
subgraph SB ["Serverless Backend (Modal Cloud)"]
Inngest <--> |"POST Request + Bearer Auth"| ModalGPU["Modal GPU: FynCutAi"]
NextJS <--> |"POST Request + Bearer Auth"| ModalCPU["Modal CPU: CaptionBurner"]
ModalGPU <--> |"Download Video / Upload Clips"| S3
ModalCPU <--> |"Download Clips / Upload Captioned Clips"| S3
ModalGPU --> |"moment analysis"| Gemini["Google Gemini AI"]
end
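The "POST Request + Bearer Auth" edges in the diagram can be illustrated with a small sketch. In FynCut the callers are the Next.js app and Inngest, so this Python version is for illustration only. The endpoint URL, the token, and the JSON payload shape (a single `s3_key` field) are assumptions, and `build_processing_request` is a hypothetical helper. The request is built but not sent.

```python
import json
import urllib.request


def build_processing_request(endpoint_url: str, token: str, s3_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST for a processing endpoint."""
    # Assumed payload shape: the backend only needs to know which S3 object to fetch.
    body = json.dumps({"s3_key": s3_key}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values; a real deployment would read these from environment variables.
req = build_processing_request(
    "https://example.invalid/process", "secret-token", "uploads/demo.mp4"
)
# urllib.request.urlopen(req) would send the request; it is left out of this sketch.
```

Sending only the S3 key keeps the request small: the backend downloads the video from S3 itself, so the caller never has to stream it.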
The following sequence diagram details what happens when a user uploads a video for processing:
sequenceDiagram
autonumber
actor User as "User Browser"
participant FE as "Next.js Web App"
participant DB as "PostgreSQL"
participant S3 as "AWS S3 Bucket"
participant IG as "Inngest Runner"
participant M_GPU as "Modal GPU (FynCutAi)"
participant Gemini as "Gemini AI API"
User->>FE: Selects video & clicks "Upload"
FE->>DB: Create upload record (status: pending)
FE->>S3: Upload video directly via S3 Presigned URL
S3-->>FE: Upload complete
FE->>IG: Dispatch "process-video-events" Event
FE-->>User: Redirect to dashboard (status: queued)
Note over IG: Inngest worker picks up the job
IG->>DB: Check user credits & set status to "processing"
IG->>M_GPU: Trigger Video Processing (s3_key)
activate M_GPU
M_GPU->>S3: Download input video
M_GPU->>M_GPU: Generate video thumbnail
M_GPU->>S3: Upload thumbnail
Note over M_GPU: Run WhisperX model (GPU)
M_GPU->>M_GPU: Extract audio & transcribe (word-level timestamps)
M_GPU->>Gemini: Send transcript (Get interesting moments & metadata)
Gemini-->>M_GPU: Return moments JSON (title, start, end, viral_score, keywords)
Note over M_GPU: Parallel Clip Generation
loop For each detected moment
M_GPU->>M_GPU: Track active speaker faces (Columbia ASD + OpenCV)
M_GPU->>M_GPU: Reframe & crop video to 9:16 centered on speaker
M_GPU->>M_GPU: Generate subtitle files (SRT, VTT, TXT)
M_GPU->>S3: Upload raw cropped clip & subtitles to S3
end
M_GPU-->>IG: Return completion response (clip keys, metadata)
deactivate M_GPU
IG->>DB: Save clip URLs and metadata, deduct user credits
IG->>DB: Set video status to "processed"
IG->>FE: Send email notification to user via Resend API
FE-->>User: Update dashboard UI (clips ready)
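The subtitle step in the loop above can be sketched as follows: take the word-level timestamps from transcription, keep the words that fall inside one detected moment, and emit SRT cues timed relative to the start of the clip. This is a rough sketch, not the project's code. It assumes each word is a dict with `word`, `start`, and `end` keys (times in seconds), and the helper names and the `words_per_cue` grouping are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = max(0, int(round(seconds * 1000)))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def clip_srt(words: list[dict], clip_start: float, clip_end: float,
             words_per_cue: int = 3) -> str:
    """Build SRT text for one clip from word-level timestamps."""
    # Keep only words that lie fully inside the clip boundaries.
    in_clip = [w for w in words if w["start"] >= clip_start and w["end"] <= clip_end]
    cues = []
    for i in range(0, len(in_clip), words_per_cue):
        group = in_clip[i:i + words_per_cue]
        # Shift times so the clip itself starts at 00:00:00,000.
        start = srt_timestamp(group[0]["start"] - clip_start)
        end = srt_timestamp(group[-1]["end"] - clip_start)
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Shifting every timestamp by the clip's start time is what keeps the captions aligned once the moment has been cut out of the original video.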
The project is split into two folders:
FynCut/
├── frontend/ # Next.js App Router (T3 Stack)
│ ├── src/
│ │ ├── app/ # Next.js Pages and API Routes
│ │ ├── actions/ # Next.js Server Actions (Auth, Captions, S3)
│ │ ├── components/ # React Components (Dashboard, Player, Editor)
│ │ ├── inngest/ # Background event workflow definitions
│ │ └── env.js # Environment validation schema (zod)
│ ├── prisma/ # Database Schema (PostgreSQL)
│ └── .env.example # Frontend environment template
│
└── server/ # Serverless Python Backend (Modal)
├── main.py # Modal entrypoint & GPU pipeline endpoints
├── requirements.txt # Python package dependencies
├── Makefile # Setup, test, and deployment automation
└── .env.example # Backend environment template
Follow the dedicated setup and deployment guides inside each folder to run the project locally or deploy it to production:
- Backend Setup: Go to the Backend README to set up Python, configure secrets, and deploy endpoints to Modal.
- Frontend Setup: Go to the Frontend README to launch the Next.js app, configure database migrations, and connect the background worker.
Contributions are welcome! If you'd like to improve FynCut, optimize face-tracking performance, or add support for new caption styles:
- Fork the Repository.
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`).
- Commit your changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the Branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.
This project is open-source and licensed under the MIT License.


