# **Automated Video Transcription and Preservation Workflow**

This script automates the transcription of video files and their preservation using [WhisperX](https://github.com/m-bain/whisperX) and [pyPreservica](https://github.com/carj/pyPreservica), integrating Google Drive for file management.

## **Workflow Overview**

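1. **Install Tools**: The script installs WhisperX and its pinned dependencies (PyTorch, CTranslate2, faster-whisper, cuDNN) for speech-to-text transcription. WhisperX relies on `ffmpeg` for multimedia handling, so `ffmpeg` must be available on the runtime.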

2. **Google Drive Integration**: It mounts Google Drive to access and manage media files.

3. **Directory Setup**: The script creates three folders on Drive: one for videos awaiting processing, one for processed outputs, and one for files that have been processed and ingested.

4. **Video Transcription**: Videos are transcribed using WhisperX, which writes an `.srt` subtitle file for each video into the processed folder. Each video is then moved to the processed folder as well.

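5. **Preservica Upload**: After processing, the script uploads the videos and subtitles to Preservica for secure digital preservation, then moves them to the processed-and-ingested folder. The upload cells are commented out by default and must be uncommented to run.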

## Mount Google Drive

In [None]:
# Mount Google Drive so the notebook can read and write the media folders. This cell only works in Google Colab.
from google.colab import drive
drive.mount('/content/drive')

## T4 (GPU Access)

In [None]:
# Set to True when connected to a T4 (GPU) runtime; set to False when connected to a CPU runtime.
gpu_access = True
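As an alternative to setting the flag by hand, the sketch below checks whether the `nvidia-smi` tool is present and runs successfully. The helper name `detect_gpu` is hypothetical, and the check assumes a GPU runtime exposes `nvidia-smi` on the `PATH`; treat the result as a hint rather than a guarantee that WhisperX can use the GPU.

```python
import shutil
import subprocess


def detect_gpu() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully (hypothetical helper)."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# Overrides the manual setting with the detected value.
gpu_access = detect_gpu()
print(f"GPU available: {gpu_access}")
```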

# WhisperX

## Install necessary packages

In [None]:
# THIS ENTIRE CELL EXAMPLE IS FROM HERE: https://github.com/m-bain/whisperX/issues/1087

!pip uninstall torch torchvision torchaudio -y

# Workaround from: https://github.com/m-bain/whisperX/issues/1027#issuecomment-2627525081
!pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# WhisperX-related packages:
!pip install ctranslate2==4.4.0
!pip install faster-whisper==1.1.0
# !pip install git+https://github.com/m-bain/whisperx.git
!pip install whisperx==3.3.1

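!apt-get update
# -y answers the confirmation prompt automatically; a notebook cell cannot respond to it interactively.
!apt-get install -y libcudnn8=8.9.2.26-1+cuda12.1
!apt-get install -y libcudnn8-dev=8.9.2.26-1+cuda12.1

# Note: this runs in a separate Python process, so the TF32 settings do not carry over
# to the notebook kernel or to later whisperx commands. It mainly confirms that torch imports.
!python -c "import torch; torch.backends.cuda.matmul.allow_tf32 = True; torch.backends.cudnn.allow_tf32 = True"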

print('WhisperX installation complete!')

## Import required libraries

In [None]:
import os
import shutil

## Define paths for the directories

In [None]:
# Define paths for the directories
base_folder = '/content/drive/My Drive/Media for transcription'
to_process_folder = os.path.join(base_folder, 'To be processed')
processed_folder = os.path.join(base_folder, 'Processed')
processed_and_ingested_folder = os.path.join(base_folder, 'Processed and ingested')

## Create the necessary directories if they don't exist

In [None]:
# Create the necessary directories if they don't exist
os.makedirs(to_process_folder, exist_ok=True)
os.makedirs(processed_folder, exist_ok=True)
os.makedirs(processed_and_ingested_folder, exist_ok=True)
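If Drive is not mounted, `os.makedirs` still succeeds and creates the folders on the local runtime disk instead of on Drive. A minimal sketch of a guard against that, assuming a hypothetical helper `ensure_folders` and that the Drive root is `/content/drive/My Drive` as in the paths above:

```python
import os


def ensure_folders(base_folder, names, required_root):
    """Create base_folder/<name> for each name, but only if required_root already exists.

    Hypothetical helper: required_root is the directory that proves Drive is mounted.
    """
    if not os.path.isdir(required_root):
        raise RuntimeError(
            f"{required_root} not found; mount Google Drive before creating folders."
        )
    paths = [os.path.join(base_folder, name) for name in names]
    for path in paths:
        os.makedirs(path, exist_ok=True)
    return paths


# Example call, using the folder names from this notebook:
# ensure_folders(
#     '/content/drive/My Drive/Media for transcription',
#     ['To be processed', 'Processed', 'Processed and ingested'],
#     required_root='/content/drive/My Drive',
# )
```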

## Loop through all files in the 'To be processed' folder

In [None]:
# This loop will go through each video file in the 'To be processed' folder, process it using WhisperX,
# and then move the video and its subtitles to the 'Processed' folder.
for filename in os.listdir(to_process_folder):
    # Compare in lowercase so files such as 'clip.MP4' are not skipped.
    if filename.lower().endswith(('.mp4', '.mkv', '.avi', '.mov', '.flv', '.wmv', '.m4v')):  # Add more extensions as needed
        input_path = os.path.join(to_process_folder, filename)

        print(f"Processing {filename}...")

        # WhisperX writes the .srt subtitle file directly to the 'Processed' folder.
        output_folder = processed_folder
        output_srt_path = os.path.join(output_folder, f"{os.path.splitext(filename)[0]}.srt")

        # Run WhisperX transcription and alignment via CLI
        # This command transcribes the video using WhisperX, specifying a large model for transcription,
        # and aligns the output to generate subtitles in the .srt format.

        if gpu_access:
            !whisperx "{input_path}" --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --chunk_size 4 --language 'en' --output_format 'srt' --output_dir "{output_folder}"
        else:
            !whisperx "{input_path}" --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --chunk_size 4 --language 'en' --output_format 'srt' --compute_type int8 --output_dir "{output_folder}"

        # Check if the subtitles file exists
        if os.path.exists(output_srt_path):
            print(f"Subtitles saved to {output_srt_path}")
        else:
            raise FileNotFoundError(f"Failed to save subtitles at {output_srt_path}. Halting the script.")

        # After transcription, move the video file to the 'Processed' folder.
        processed_path = os.path.join(processed_folder, filename)
        shutil.move(input_path, processed_path)

print("Processing complete.")
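The `!whisperx` lines in the loop above pass the path through a shell, so a filename containing a double quote or similar character could break the command. The sketch below builds the same arguments as a list and runs them with `subprocess.run`, which avoids shell quoting. The function names are hypothetical; the flags are the ones already used in the loop.

```python
import subprocess


def build_whisperx_command(input_path, output_dir, gpu_access):
    """Return the whisperx argument list used in the loop above (hypothetical helper)."""
    command = [
        "whisperx", input_path,
        "--model", "large-v2",
        "--align_model", "WAV2VEC2_ASR_LARGE_LV60K_960H",
        "--chunk_size", "4",
        "--language", "en",
        "--output_format", "srt",
        "--output_dir", output_dir,
    ]
    if not gpu_access:
        # Same CPU fallback as the notebook: int8 compute type.
        command += ["--compute_type", "int8"]
    return command


def transcribe(input_path, output_dir, gpu_access):
    """Run whisperx without a shell; raises CalledProcessError on a non-zero exit."""
    subprocess.run(
        build_whisperx_command(input_path, output_dir, gpu_access), check=True
    )
```

Inside the loop, `transcribe(input_path, output_folder, gpu_access)` would take the place of the two `!whisperx` branches. With `check=True`, a failed transcription stops the loop before the video is moved.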

# pyPreservica

## Install pyPreservica

In [None]:
# # pyPreservica is a Python library for interacting with the Preservica digital preservation platform.
# !pip install pypreservica

## Collect credentials for Preservica

In [None]:
# # User is prompted to enter their Preservica credentials, which are necessary for uploading files.
# from getpass import getpass

# USERNAME = input("Enter your USERNAME: ")
# PASSWORD = getpass("Enter your PASSWORD: ")
# TENANT = input("Enter your TENANT: ") # icaew
# SERVER = input("Enter your SERVER: ") # eu.preservica.com
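For unattended runs, prompting for each value can be inconvenient. The sketch below reads a value from an environment variable first and only prompts when it is missing. The helper `get_setting` and the environment variable names are chosen for illustration and are not taken from pyPreservica.

```python
import os
from getpass import getpass


def get_setting(env_name, prompt, secret=False):
    """Return os.environ[env_name] if set and non-empty, otherwise prompt the user."""
    value = os.environ.get(env_name)
    if value:
        return value
    # getpass hides the typed value; input echoes it.
    return getpass(prompt) if secret else input(prompt)


# Example calls (illustrative variable names):
# USERNAME = get_setting("EXAMPLE_PRESERVICA_USERNAME", "Enter your USERNAME: ")
# PASSWORD = get_setting("EXAMPLE_PRESERVICA_PASSWORD", "Enter your PASSWORD: ", secret=True)
```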

## Define the Preservica folder ID where the files will be uploaded


In [None]:
# preservica_folder_id = "7421d1a0-af87-47e7-9163-78401da161dc"

## Import necessary classes from pyPreservica

In [None]:
# from pyPreservica import UploadAPI, complex_asset_package
# import os
# import shutil  # needed to move files after upload

## Define the path to the 'Processed' folder

In [None]:
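# This folder contains the processed videos and their corresponding subtitle files.
# The paths are repeated here so this section can be run without rerunning the WhisperX cells.
processed_folder = '/content/drive/My Drive/Media for transcription/Processed'
processed_and_ingested_folder = '/content/drive/My Drive/Media for transcription/Processed and ingested'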

## Initialize Preservica client

In [None]:
# # This creates an instance of the Preservica client, which will be used to upload files.
# client = UploadAPI(username=USERNAME, password=PASSWORD,
#                    tenant=TENANT, server=SERVER)

## Upload processed files to Preservica

In [None]:
# for filename in os.listdir(processed_folder):
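#     # Use the same extensions as the transcription loop so '.wmv' and '.m4v' files are not left behind.
#     if filename.lower().endswith(('.mp4', '.mkv', '.avi', '.mov', '.flv', '.wmv', '.m4v')):  # Add more extensions as needed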
#         video_path = os.path.join(processed_folder, filename)
#         srt_path = os.path.join(processed_folder, f"{os.path.splitext(filename)[0]}.srt")

#         # Check if the corresponding subtitle file exists
#         if os.path.exists(srt_path):
#             files_to_upload = [video_path, srt_path]
#         else:
#             files_to_upload = [video_path]

#         print(f"Uploading files for {filename} to Preservica...")

#         # Create a complex asset package and upload it
#         package = complex_asset_package(files_to_upload, parent_folder=preservica_folder_id, Description="")
#         client.upload_zip_package(package)

#         print(f"Uploaded {filename} and corresponding subtitles to Preservica.")

#         # Move the files to the "Processed and Ingested" folder after upload
#         ingested_video_path = os.path.join(processed_and_ingested_folder, filename)
#         shutil.move(video_path, ingested_video_path)

#         if os.path.exists(srt_path):
#             ingested_srt_path = os.path.join(processed_and_ingested_folder, os.path.basename(srt_path))
#             shutil.move(srt_path, ingested_srt_path)

# print("All files uploaded to Preservica and moved to the 'Processed and Ingested' folder.")
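Before uncommenting the upload loop, it can help to preview which files would be packaged together. The sketch below pairs each video with its `.srt` file, if one exists, without contacting Preservica. The function `plan_uploads` is a hypothetical helper, not part of pyPreservica.

```python
import os

# Same extensions as the transcription loop.
VIDEO_EXTENSIONS = ('.mp4', '.mkv', '.avi', '.mov', '.flv', '.wmv', '.m4v')


def plan_uploads(folder):
    """Return a list of (video_path, files_to_upload) pairs for the videos in folder."""
    plan = []
    for filename in sorted(os.listdir(folder)):
        if not filename.lower().endswith(VIDEO_EXTENSIONS):
            continue
        video_path = os.path.join(folder, filename)
        srt_path = os.path.splitext(video_path)[0] + ".srt"
        files = [video_path]
        if os.path.exists(srt_path):
            files.append(srt_path)
        plan.append((video_path, files))
    return plan


# Example: list what would be uploaded from the 'Processed' folder.
# for video, files in plan_uploads(processed_folder):
#     print(os.path.basename(video), "->", [os.path.basename(f) for f in files])
```

Videos listed with only one file have no matching subtitles, which may indicate a transcription that needs to be rerun before ingest.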