# 🎙️ Cantonese Voice Transcription - GitHub Automated

This notebook can be triggered automatically from GitHub Actions or run manually.

## 🚀 Two Modes:

### Mode 1: Manual Upload
Run the cells in order and upload your audio files when prompted.

### Mode 2: GitHub Integration
Loads audio files automatically from a clone of your GitHub repository.

---

## Step 1: Install Dependencies

In [None]:
!pip install -q openai-whisper gitpython PyGithub
!apt-get -qq install -y ffmpeg
print("✓ Dependencies installed successfully!")

## Step 2: Configuration

Set your GitHub repository details if you want to load audio from the repository or auto-commit results.

In [None]:
# GitHub Configuration (optional)
GITHUB_REPO = "kevinzjpeng/voice-record"  # Change to your repo
GITHUB_TOKEN = ""  # Add your GitHub token for auto-commit (optional)
AUTO_COMMIT = False  # Set to True to automatically push results

# Mode selection
USE_MANUAL_UPLOAD = True  # Set to False if loading from cloned repo

print("✓ Configuration set")
print(f"  Repository: {GITHUB_REPO}")
print(f"  Auto-commit: {AUTO_COMMIT}")
print(f"  Manual upload: {USE_MANUAL_UPLOAD}")
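
Pasting a token into a notebook cell makes it easy to leak when the notebook is shared or committed. As a minimal sketch, assuming the token has been exported in an environment variable named `GITHUB_TOKEN` (the helper names `load_github_token` and `describe_token` are hypothetical, not part of this notebook), the value could be read at runtime and masked before printing:

```python
import os

def load_github_token(env=os.environ, var_name="GITHUB_TOKEN"):
    """Return the token stored in `var_name`, or "" if it is unset."""
    return env.get(var_name, "").strip()

def describe_token(token):
    """Describe a token for logging without printing the secret itself."""
    return "set" if token else "not set"

# Assumption: the variable was exported before the notebook started.
GITHUB_TOKEN = load_github_token()
print(f"  GitHub token: {describe_token(GITHUB_TOKEN)}")
```

An empty string keeps the notebook's existing behavior, since the commit step only runs when `GITHUB_TOKEN` is non-empty.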

## Step 3: Clone Repository (if using GitHub mode)

In [None]:
import os
from pathlib import Path

if not USE_MANUAL_UPLOAD:
    print("Cloning repository...")
    !git clone https://github.com/{GITHUB_REPO}.git repo
    os.chdir('repo')
    print(f"✓ Repository cloned to: {os.getcwd()}")
else:
    print("Using manual upload mode")

## Step 4: Import Libraries & Load Model

In [None]:
import whisper
import torch
from google.colab import files
import git

print("✓ Libraries imported")
print(f"✓ GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  GPU: {torch.cuda.get_device_name(0)}")

print("\nLoading Whisper Large model...")
model = whisper.load_model("large")
print("✓ Model loaded successfully!")

## Step 5: Get Audio Files

In [None]:
audio_files = []

if USE_MANUAL_UPLOAD:
    print("📤 Please upload your audio files...")
    uploaded = files.upload()
    audio_files = list(uploaded.keys())
else:
    # Find audio files in voice-record directory
    voice_dir = Path('voice-record')
    if voice_dir.exists():
        for ext in ['.mp3', '.wav', '.m4a', '.flac', '.ogg']:
            audio_files.extend([str(f) for f in voice_dir.rglob(f'*{ext}')])

print(f"\n✓ Found {len(audio_files)} audio file(s):")
for f in audio_files:
    print(f"  - {f}")
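
The search above matches lowercase extensions only, so a file such as `memo.MP3` would be skipped. A minimal sketch of a case-insensitive variant follows; the function name `find_audio_files` and the sorted return order are illustrative choices, not part of the original notebook:

```python
from pathlib import Path

# Same extensions as the cell above, compared in lowercase
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}

def find_audio_files(root):
    """Return sorted paths of audio files under `root`, ignoring extension case."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )

# Usage in GitHub mode: audio_files = find_audio_files("voice-record")
```

Sorting makes the processing order reproducible between runs, which helps when comparing logs.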

## Step 6: Transcribe to Cantonese

In [None]:
def format_timestamp(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def transcribe_audio(audio_path, model):
    print(f"\n{'='*60}")
    print(f"Transcribing: {audio_path}")
    print(f"{'='*60}")
    
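    # 'zh' is Whisper's general Chinese language code. Recent openai-whisper
    # releases also list 'yue' (Cantonese), which may be worth trying if the
    # output comes back as Mandarin-style written Chinese.
    result = model.transcribe(
        str(audio_path),
        language='zh',
        task='transcribe',
        verbose=True
    )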
    
    # Determine output path (always a str, so both modes return the same type)
    audio_path_obj = Path(audio_path)
    if USE_MANUAL_UPLOAD:
        transcript_path = audio_path_obj.stem + '_transcript.txt'
    else:
        transcript_path = str(audio_path_obj.with_suffix('.txt'))
    
    # Write transcript
    with open(transcript_path, 'w', encoding='utf-8') as f:
        f.write(f"Transcript of: {audio_path_obj.name}\n")
        f.write("Language: Cantonese/Chinese\n")
        f.write(f"{'='*60}\n\n")
        f.write(result['text'].strip())
        f.write("\n\n")
        f.write(f"{'='*60}\n")
        f.write("Detailed segments:\n\n")
        
        for segment in result['segments']:
            start = format_timestamp(segment['start'])
            end = format_timestamp(segment['end'])
            text = segment['text'].strip()
            f.write(f"[{start} -> {end}] {text}\n")
    
    print(f"\n✓ Transcript saved: {transcript_path}")
    print(f"\n📝 Preview:\n{'-'*60}")
    print(result['text'][:500])
    print(f"{'-'*60}")
    
    return transcript_path

# Transcribe all files
transcript_files = []
for audio_file in audio_files:
    try:
        transcript = transcribe_audio(audio_file, model)
        transcript_files.append(transcript)
    except Exception as e:
        print(f"\n✗ Error transcribing {audio_file}: {e}")

print(f"\n{'='*60}")
print(f"✓ Complete! Transcribed {len(transcript_files)}/{len(audio_files)} files")
print(f"{'='*60}")
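
The transcript layout written by `transcribe_audio` can be checked without loading the model if formatting is separated from file I/O. The sketch below assumes a result dict with the same `text` and `segments` keys used above; `render_transcript` and the `sample` data are hypothetical and only illustrate the layout:

```python
def format_timestamp(seconds):
    """Format a duration in seconds as HH:MM:SS (fractions are truncated)."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def render_transcript(result, audio_name):
    """Build the transcript text from a Whisper-style result dict."""
    rule = "=" * 60
    lines = [
        f"Transcript of: {audio_name}",
        "Language: Cantonese/Chinese",
        rule,
        "",
        result["text"].strip(),
        "",
        rule,
        "Detailed segments:",
        "",
    ]
    for segment in result["segments"]:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} -> {end}] {segment['text'].strip()}")
    return "\n".join(lines) + "\n"

# Made-up result, shaped like the dict returned in the cell above
sample = {
    "text": " 你好 ",
    "segments": [{"start": 0.0, "end": 3661.7, "text": " 你好 "}],
}
print(render_transcript(sample, "demo.mp3"))
```

Note that `format_timestamp` truncates fractional seconds, so a segment ending at 3661.7 s is shown as `01:01:01`.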

## Step 7: Download or Commit Results

In [None]:
if USE_MANUAL_UPLOAD:
    # Download transcripts
    print("📥 Downloading transcripts...\n")
    for transcript in transcript_files:
        files.download(transcript)
    print("✓ Downloads complete!")
    
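elif AUTO_COMMIT and GITHUB_TOKEN:
    # Commit to GitHub
    print("Committing to GitHub...")
    try:
        repo = git.Repo('.')
        # Stage exactly the transcripts written in Step 6
        repo.git.add(*[str(t) for t in transcript_files])
        repo.index.commit('Add Cantonese transcripts from Colab')
        
        origin = repo.remote('origin')
        # The clone URL carries no credentials, so attach the token before pushing
        origin.set_url(f"https://x-access-token:{GITHUB_TOKEN}@github.com/{GITHUB_REPO}.git")
        # push() does not raise on rejection by itself; raise_if_error() does
        origin.push().raise_if_error()
        print("✓ Transcripts committed and pushed!")
    except Exception as e:
        print(f"✗ Commit failed: {e}")
        print("Downloading instead...")
        for transcript in transcript_files:
            files.download(str(transcript))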
else:
    print("Transcripts saved locally.")
    print("To download, run the next cell.")
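
The branching in this cell can be summarized as a small pure function, which makes one easily missed case visible: with `AUTO_COMMIT = True` but an empty `GITHUB_TOKEN`, nothing is pushed and the transcripts simply stay on disk. This is a sketch only; `choose_output_action` and its return labels are hypothetical names:

```python
def choose_output_action(use_manual_upload, auto_commit, github_token):
    """Mirror the Step 7 branches and return which action would run."""
    if use_manual_upload:
        return "download"
    if auto_commit and github_token:
        return "commit"
    return "keep_local"

# Auto-commit requested without a token: falls through to local files
print(choose_output_action(False, True, ""))
```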

## Optional: Manual Download

In [None]:
# Run this cell to download transcripts manually
for transcript in transcript_files:
    if os.path.exists(transcript):
        files.download(transcript)

---

## 🎉 Done!

### Next Steps:

**Manual Mode**: Your transcripts have been downloaded.

**GitHub Mode**:
- Transcripts are saved next to the audio files in the cloned repository
- Commit them manually, or set `AUTO_COMMIT = True` and provide a `GITHUB_TOKEN` to push them automatically