This project provides a production-ready FastAPI application that serves a Hugging Face Shakespeare language model with:
- ✅ Text generation endpoint (/generate)
- ✅ Inference metrics tracking (/metrics)
- ✅ Health check endpoint (/health)
- ✅ Automatic API documentation (/docs)
- ✅ Docker containerization
- ✅ Google Cloud deployment configuration

Project files:
- main.py - FastAPI application with model serving and metrics
- requirements.txt - Python dependencies
- Dockerfile - Container configuration
- app.yaml - Google Cloud App Engine configuration
- deploy.sh - Automated deployment script
- DEPLOYMENT_GUIDE.md - Detailed deployment instructions

Run locally:

- Install dependencies:

      pip install -r requirements.txt

- Run the app:

      python main.py

  The API will be available at http://localhost:8080

- View documentation: open your browser to http://localhost:8080/docs

- Test endpoints:

      # Health check
      curl http://localhost:8080/health

      # Generate text
      curl -X POST http://localhost:8080/generate \
        -H "Content-Type: application/json" \
        -d '{"prompt": "To be", "max_length": 50}'

      # Get metrics
      curl http://localhost:8080/metrics
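
The same checks can be scripted. The sketch below uses only the Python standard library and mirrors the curl call to /generate. The helper names `build_generate_request` and `call_json` are illustrative and not part of this project, and the base URL assumes the app is running locally as described above.

```python
import json
import urllib.request

# Assumption: the app is running locally on the port shown above.
BASE_URL = "http://localhost:8080"


def build_generate_request(prompt, max_length=50, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /generate endpoint."""
    body = json.dumps({"prompt": prompt, "max_length": max_length}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_json(request, timeout=30):
    """Send a request (or plain URL) and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call_json(build_generate_request("To be"))` sends the generation request, and `call_json(f"{BASE_URL}/health")` or `call_json(f"{BASE_URL}/metrics")` covers the two GET endpoints.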

Deploy to Google Cloud:

- Set your project ID:

      export PROJECT_ID="your-gcp-project-id"

- Run deployment script:

      chmod +x deploy.sh
      ./deploy.sh

- Choose deployment method:
  - Option 1: App Engine (simple, auto-scaling, no containers needed)
  - Option 2: Cloud Run (containerized, fine-grained control)
  - Option 3: Both

API endpoints:

POST /generate: Generate text based on a prompt.

Request:

    {
      "prompt": "To be or not to be",
      "max_length": 100,
      "temperature": 0.7,
      "top_p": 0.9
    }

Response:

    {
      "prompt": "To be or not to be",
      "generated_text": "To be or not to be, that is the question...",
      "tokens_generated": 45
    }

GET /metrics: Get inference statistics.

Response:

    {
      "total_inferences": 42,
      "service_status": "running"
    }

GET /health: Health check endpoint.

Response:

    {
      "status": "healthy",
      "model_loaded": true
    }

An additional endpoint returns service information.
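
The /generate request and response shapes documented above can be written down as typed structures, which is handy when building a client or tests. This is a standard-library sketch, not the code in main.py (which is not shown here and, as a FastAPI app, would more typically use Pydantic models). The default values and validation limits are assumptions borrowed from the example request, not confirmed service behavior.

```python
from dataclasses import dataclass, asdict


@dataclass
class GenerateRequest:
    """Documented request fields; defaults are assumptions, not main.py's values."""
    prompt: str
    max_length: int = 100      # assumed default (value from the example request)
    temperature: float = 0.7   # assumed default
    top_p: float = 0.9         # assumed default

    def __post_init__(self):
        # Assumed sanity checks; the real service may enforce different limits.
        if not self.prompt:
            raise ValueError("prompt must be non-empty")
        if self.max_length <= 0:
            raise ValueError("max_length must be positive")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")


@dataclass
class GenerateResponse:
    """The three fields shown in the documented response."""
    prompt: str
    generated_text: str
    tokens_generated: int


def parse_generate_response(payload: dict) -> GenerateResponse:
    """Pick the documented fields out of a decoded JSON response."""
    return GenerateResponse(
        prompt=payload["prompt"],
        generated_text=payload["generated_text"],
        tokens_generated=int(payload["tokens_generated"]),
    )
```

`asdict(GenerateRequest(prompt="To be"))` yields a dictionary that can be serialized as the JSON request body.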

Configuration:

Edit main.py line 33:

    model_name = "/path/to/your/model"  # Update this path

Or use any Hugging Face model:

    model_name = "gpt2"  # or any other HF model

In app.yaml:

    automatic_scaling:
      min_instances: 1
      max_instances: 10
    resources:
      memory_gb: 4
      cpu: 2

Add environment variables to app.yaml:

    env_variables:
      MY_VAR: "value"
      PYTHONUNBUFFERED: "true"

Monitoring:

View recent logs:

    gcloud app logs read --limit 100

- Visit Cloud Console > App Engine > Metrics
- Check requests, errors, and latency

Stream logs:

    gcloud app logs tail

Troubleshooting:

The first deployment may take time as the model is downloaded. Check logs:

    gcloud app logs read

If the service runs out of memory, increase memory in app.yaml:

    resources:
      memory_gb: 8  # Increase from 4

If responses are slow:
- Increase number of worker processes in Dockerfile
- Increase min_instances in app.yaml
- Consider using a lighter model

Estimated costs:

App Engine:
- Free tier: 28 instance-hours/day
- Pay-as-you-go: ~$0.05/instance-hour
- Example: 1 instance 24/7 ≈ $36/month

Cloud Run:
- Free tier: 180,000 vCPU-seconds/month
- Pay-as-you-go: ~$0.00002400/vCPU-second
- Example: 1M requests/month ≈ $20-50/month
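
The arithmetic behind these examples can be reproduced with a short script. The rates are copied from the estimates above and may not match current Google Cloud pricing. The function names are illustrative, and the Cloud Run figure is CPU-only under an assumed per-request duration, leaving out memory and per-request charges.

```python
# Rates copied from the estimates above; check current pricing before relying on them.
APP_ENGINE_RATE_PER_INSTANCE_HOUR = 0.05
CLOUD_RUN_RATE_PER_VCPU_SECOND = 0.000024
CLOUD_RUN_FREE_VCPU_SECONDS = 180_000


def app_engine_monthly_cost(instances=1, hours_per_day=24, days=30):
    """Instance-hours times the hourly rate, ignoring the free tier."""
    return instances * hours_per_day * days * APP_ENGINE_RATE_PER_INSTANCE_HOUR


def cloud_run_monthly_cpu_cost(requests, seconds_per_request, vcpus=1):
    """CPU-only cost after the free tier; memory and request fees are excluded."""
    vcpu_seconds = requests * seconds_per_request * vcpus
    billable = max(0, vcpu_seconds - CLOUD_RUN_FREE_VCPU_SECONDS)
    return billable * CLOUD_RUN_RATE_PER_VCPU_SECOND
```

One instance running 24 hours a day for 30 days gives 720 instance-hours, or about $36. For Cloud Run, 1M requests at an assumed 1 second each on 1 vCPU comes to roughly $20 of CPU time, and at 2 seconds each roughly $44, which brackets the range quoted above.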

Next steps:
- Update the model path to your specific Shakespeare model
- Add authentication if needed (API keys, OAuth2)
- Set up monitoring and alerts
- Configure auto-scaling based on your traffic
- Add request logging for analytics
- Set up CI/CD pipeline for automated deployments
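
For the first item in the list above, one option is to read the model path from an environment variable instead of editing main.py for each deployment. The sketch below assumes a hypothetical `MODEL_NAME` variable (it could be set under `env_variables` in app.yaml) and falls back to `gpt2`. Neither the variable nor the helper exists in this project as documented.

```python
import os

# Fallback taken from the "any Hugging Face model" example in this README.
DEFAULT_MODEL_NAME = "gpt2"


def resolve_model_name(environ=os.environ):
    """Return a model path or Hub ID, preferring the hypothetical MODEL_NAME variable."""
    value = environ.get("MODEL_NAME", "").strip()
    # An unset or blank variable falls back to the default model.
    return value or DEFAULT_MODEL_NAME
```

In main.py, the hard-coded assignment could then become `model_name = resolve_model_name()`.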