<a href="https://colab.research.google.com/github/jimwhite/commentator_ai/blob/main/Transcript_to_Animated_Video.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visualizer.TV: "MP3 to MTV" Demo

 * Concept: Robert Sloan (https://www.linkedin.com/in/sloanrobert/)
 * Code: Jim White (https://www.linkedin.com/in/jamespaulwhite/)
 * Demo Slides: [Google Slides in Drive link](https://docs.google.com/presentation/d/14mSrc1GOQhetNkOHjFkGLNNe-ZrbhPHIviC9Tj6VuHE/edit?usp=sharing)
 * License: GPL v3 (https://github.com/jimwhite/commentator_ai/blob/main/LICENSE)

## Notes
Despite the title, this implementation doesn't take MP3s yet.  Instead, you provide the URL of a YouTube video that has a transcript (i.e. closed captions).

This code should work using "Run All" once you set a suitable YouTube URL below.  There will be an auth prompt for access to your Google Drive (used for temporary files and the final video) and getpass prompts for your OpenAI and Stability API keys.  See the inline notes for pointers on getting keys; both services include some free credits for new accounts.

## TODO
 * Add animation (e.g. pan & zoom) effects.
 * Add transition effects.  One approach is using Stability init_image to generate frame-by-frame animations that morph from one scene to the next.
 * Add review mode for image generation for easy way to accept/reject candidates.  Image generation is pretty hit or miss (with plenty of misses).  Right now if you want to replace an image you can remove the file from the folder in your drive and rerun the image generation step (because it doesn't regenerate images that already have a file with the matching description/name).
 * Support actually using MP3s and running STT on them to get the timed lyrics.  Although OpenAI Whisper API has timestamps the code we had didn't deliver them.  Google Speech and other APIs would work fine too.
 * Try other image generators.  This SDXL beta makes some funny stuff.
 * Use the upscaler API to make higher resolution videos.
 * Prompt refinements/experiments.
 * Buff up the error handling for:
  * ChatGPT responses with bad CSV formatting.  LangChain is used here and I wanted to use tools and OpenAI Functions but that wound up not fitting in the time for the hackathon.
  * Image descriptions that fail Stability filters.  Even seemingly innocuous words/phrases will fail and the current logic is to just skip those.  Retrying by asking ChatGPT to rephrase those descriptions is one possible solution.
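
For the "bad CSV formatting" item above, one possible starting point is to validate the model's reply before saving it. The sketch below uses only the standard library. The function name `parse_scene_csv` and the specific checks are assumptions made for illustration; they are not part of this notebook.

```python
import csv
import io

# Column names taken from the prompt used later in this notebook.
EXPECTED_COLUMNS = ['start', 'duration', 'text', 'description']

def parse_scene_csv(csv_text):
    """Parse a CSV reply into a list of scene dicts, or raise ValueError."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    if not rows:
        raise ValueError('empty response')
    header = [name.strip() for name in rows[0]]
    if header != EXPECTED_COLUMNS:
        raise ValueError(f'unexpected header: {header}')
    scenes = []
    for line_number, fields in enumerate(rows[1:], start=2):
        if not fields:
            continue  # tolerate blank lines between rows
        if len(fields) != len(EXPECTED_COLUMNS):
            raise ValueError(f'row {line_number}: expected 4 fields, got {len(fields)}')
        start, duration, text, description = fields
        try:
            scene = {'start': float(start), 'duration': float(duration),
                     'text': text, 'description': description.strip()}
        except ValueError as e:
            raise ValueError(f'row {line_number}: non-numeric start or duration') from e
        if not scene['description']:
            raise ValueError(f'row {line_number}: empty description')
        scenes.append(scene)
    return scenes

sample = '"start","duration","text","description"\n0,5.48,"","A sunrise over a battlefield"\n'
print(parse_scene_csv(sample))
```

A caller could catch the `ValueError` and ask the model again, possibly including the error message in the follow-up request.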

In [None]:
#@title Install dependencies
%pip install -q stability-sdk youtube-transcript-api langchain openai opencv-python yt-dlp ffmpeg-python


In [None]:
#@title AI Lost Media's Text to Video Colab Workspace https://youtube.com/@ailostmedia
#huge thanks to Camenduru https://twitter.com/camenduru and Cerspense https://twitter.com/cerspense for putting these models together.
#tutorial: https://www.ailostmedia.com/post/the-ai-lost-media-text-to-video-colab-workspace
%cd /content
!pip install -q torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 torchtext==0.14.1 torchdata==0.5.1 --extra-index-url https://download.pytorch.org/whl/cu116 -U
!pip install git+https://github.com/huggingface/diffusers transformers accelerate imageio[ffmpeg] -U einops omegaconf decord xformers==0.0.16 safetensors
!git clone -b dev https://github.com/camenduru/Text-To-Video-Finetuning
!git clone https://github.com/ailostmedia/Potat1ALM
!mv /content/Potat1ALM/inference.py /content/Text-To-Video-Finetuning/

/content
Collecting git+https://github.com/huggingface/diffusers
  Cloning https://github.com/huggingface/diffusers to /tmp/pip-req-build-buj_7bp_
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/diffusers /tmp/pip-req-build-buj_7bp_
  Resolved https://github.com/huggingface/diffusers to commit 174dcd697faf88370f1e7b2eeabb059dd8f1b2f4
  Installing build de

Cloning into 'Text-To-Video-Finetuning'...
remote: Enumerating objects: 973, done.
remote: Counting objects: 100% (322/322), done.
remote: Compressing objects: 100% (99/99), done.
remote: Total 973 (delta 250), reused 234 (delta 223), pack-reused 651
Receiving objects: 100% (973/973), 1.77 MiB | 11.70 MiB/s, done.
Resolving deltas: 100% (571/571), done.
Cloning into 'Potat1ALM'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 11 (delta 2), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (11/11), 5.05 KiB | 1.68 MiB/s, done.


In [None]:
#@title Install Potat1
#default 1024 x 576 - try 800 x 448 for colab
%cd /content/
!git clone https://huggingface.co/camenduru/potat1

/content
Cloning into 'potat1'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 88 (delta 5), reused 0 (delta 0), pack-reused 73
Unpacking objects: 100% (88/88), 522.95 KiB | 2.05 MiB/s, done.
Filtering content: 100% (3/3), 4.05 GiB | 149.79 MiB/s, done.


In [None]:
#@title Set up Google Drive for file storage
try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    outputs_path = "/content/gdrive/MyDrive/Commentator_AI/Animated_Video"
    !mkdir -p "$outputs_path"
except Exception:  # e.g. ImportError when not running in Colab
    outputs_path = "."
print(f"Files will be saved to {outputs_path}")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Files will be saved to /content/gdrive/MyDrive/Commentator_AI/Animated_Video


In [None]:
%rm -rf /content/output

In [None]:
#@title Set YouTube URL

YOUTUBE_URL = "https://www.youtube.com/watch?v=DADmZdbQ9x8" #@param {type:"string"}

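import os
import re
from typing import Optional

def ytIdFromURL(url: str) -> Optional[str]:
    """Return the 11-character YouTube video ID found in url, or None."""
    data = re.findall(r"(?:v=|\/)([0-9A-Za-z_-]{11}).*", url)
    if data:
        return data[0]
    return None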

video_id = ytIdFromURL(YOUTUBE_URL)

if not video_id:
  raise ValueError("video_id isn't set")

print(f'YouTube ID: {video_id}')

out_dir = os.path.join(outputs_path, video_id)
os.makedirs(out_dir, exist_ok=True)
!ln -s "$out_dir" /content/output

print(f'/content/output linked to {out_dir}')

YouTube ID: DADmZdbQ9x8
/content/output linked to /content/gdrive/MyDrive/Commentator_AI/Animated_Video/DADmZdbQ9x8


In [None]:
#@title Get the Audio
import os

audio_file_path = os.path.join(out_dir, 'audio.m4a')

if os.path.exists(audio_file_path):
  print('Audio already downloaded')
else:
  !yt-dlp -f "bestaudio[ext=m4a]"  -o "{audio_file_path}" "{YOUTUBE_URL}"

[youtube] Extracting URL: https://www.youtube.com/watch?v=DADmZdbQ9x8
[youtube] DADmZdbQ9x8: Downloading webpage
[youtube] DADmZdbQ9x8: Downloading ios player API JSON
[youtube] DADmZdbQ9x8: Downloading android player API JSON
[youtube] DADmZdbQ9x8: Downloading m3u8 information
[info] DADmZdbQ9x8: Downloading 1 format(s): 140
[download] Destination: /content/gdrive/MyDrive/Commentator_AI/Animated_Video/DADmZdbQ9x8/audio.m4a
[download] 100% of    1.61MiB in 00:00:00 at 17.85MiB/s
[FixupM4a] Correcting container of "/content/gdrive/MyDrive/Commentator_AI/Animated_Video/DADmZdbQ9x8/audio.m4a"


In [None]:
#@title Get Video Transcript (CSV)

# Using CSV files is a convenient way to integrate with LangChain (and LangFlow).
# Also is much more efficient in token usage so longer transcripts will work for
# any given LLM context token limit.

import csv
from youtube_transcript_api import YouTubeTranscriptApi

transcript = []

transcript_file_path = os.path.join(out_dir, 'transcript.csv')
fieldnames = ['start', 'duration', 'text']
if os.path.exists(transcript_file_path):
  with open(transcript_file_path, 'r') as csv_file:
    reader = csv.DictReader(csv_file, fieldnames=fieldnames, quoting=csv.QUOTE_NONNUMERIC)
    next(reader)  # skip header
    for row in reader:
      transcript.append(row)
    print(f'Read transcript from file: {transcript_file_path}')

if not transcript:
  transcript = YouTubeTranscriptApi.get_transcript(video_id)
  print('Got transcript from YouTube API')
  with open(transcript_file_path, 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames, quoting=csv.QUOTE_NONNUMERIC)
    writer.writeheader()
    for row in transcript:
      writer.writerow(row)
    print(f'Saved transcript to file: {transcript_file_path}')

transcript

Got transcript from YouTube API
Saved transcript to file: /content/gdrive/MyDrive/Commentator_AI/Animated_Video/DADmZdbQ9x8/transcript.csv


[{'text': 'Oh, say can you see', 'start': 5.48, 'duration': 4.18},
 {'text': "By the dawn's early light", 'start': 9.66, 'duration': 4.1},
 {'text': 'What so proudly we hailed', 'start': 13.76, 'duration': 4.64},
 {'text': "At the twilight's last gleaming?", 'start': 18.4, 'duration': 4.16},
 {'text': 'Whose broad stripes and bright stars',
  'start': 22.56,
  'duration': 4.64},
 {'text': 'Through the perilous fight', 'start': 27.2, 'duration': 4.24},
 {'text': "O'er the ramparts we watched", 'start': 31.44, 'duration': 4.8},
 {'text': 'Were so gallantly streaming?', 'start': 36.24, 'duration': 4.72},
 {'text': "And the rockets' red glare!", 'start': 40.96, 'duration': 4.8},
 {'text': 'The bombs bursting in air!', 'start': 45.76, 'duration': 4.72},
 {'text': 'Gave proof through the night', 'start': 50.48, 'duration': 4.8},
 {'text': 'That our flag was still there', 'start': 55.28, 'duration': 5.04},
 {'text': 'Oh, say does that star spangled banner yet wave',
  'start': 60.32,
  'durat

**Instructions for getting an OpenAI API key:** [https://platform.openai.com/](https://platform.openai.com/)

The key is only stored in the kernel running this notebook for you and used in the calls to OpenAI's service endpoint.

In [None]:
#@title Get OpenAI API key
from getpass import getpass

if 'OPENAI_API_KEY' not in os.environ:
  key = getpass('Enter your OpenAI API key: ')
  if key:
    os.environ['OPENAI_API_KEY'] = key


Enter your OpenAI API key: ··········


This is the ChatGPT prompt used to select which lyrics to illustrate and to write the image description for each.  It also asks for an initial image (aka scene), starting at 0 seconds, that represents the theme of the whole song.
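
Because each row carries a `start` time and the first row starts at 0, the rows can be read as a timeline in which a scene stays on screen until the next one begins. The sketch below computes that on-screen time per scene. `scene_display_times` and its `total_length` parameter are hypothetical names introduced here, and this calculation is not something the later cells perform.

```python
def scene_display_times(rows, total_length):
    """Return (description, seconds_on_screen) for each scene row.

    rows: dicts with 'start' and 'description' keys, sorted by start time.
    total_length: length of the audio in seconds (assumed known by the caller).
    """
    times = []
    for i, row in enumerate(rows):
        # A scene lasts until the next scene starts, or until the audio ends.
        next_start = rows[i + 1]['start'] if i + 1 < len(rows) else total_length
        times.append((row['description'], round(next_start - row['start'], 3)))
    return times

# Start times borrowed from the sample transcript; descriptions are shortened.
rows = [
    {'start': 0.0, 'description': 'battlefield at sunrise'},
    {'start': 5.48, 'description': "soldier's face"},
    {'start': 9.66, 'description': 'horizon at dawn'},
]
for description, seconds in scene_display_times(rows, total_length=13.76):
    print(f'{seconds:5.2f}s  {description}')
```

Using the gap to the next `start` rather than the row's own `duration` avoids holes in the timeline when the model skips some lyric lines.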

In [None]:
%%writefile prompt.txt
You're a visual musical artist. Given the following lyrics choose the phrases that should be
animated or filmed to make a timed music video for this song.

Respond in CSV format with the columns
'start', 'duration', 'text' (for the transcription text), 'description' (for the scene description).

Choose an artistic style that fits the mood of the song and include that in the individual video
segment descriptions so that the visual appearance of the whole video will be connected.
Of course there are creative cases where using different artistic styles/moods in one work is good
but that is pretty rare.

For the first row start at time 0 and make an image description that reflects the song's theme.
The theme should include specifics such as the time and place that is being described, at least in
some general way.  Otherwise generic terms such as events, activities, and objects of many kinds
which naturally would be illustrated differently in different places and
times will get rendered in an inconsistent manner.  For example, don't just say "battle", "person",
or "sunset".  There should be sufficient details so that the time (or era), people, or place are
described for the animator.  So applying the song's theme and meaning to
each video scene description will need to include those particular choices you make in interpretation.
Keep in mind that each image description will be rendered separately so don't use any references
between them.  Also the image generator has limited input length and inputs are always image
descriptions so omit superfluous words like "Generate a video..." or "Animation of ...".
Also because the image rendering is done in isolation for each description please be sure to include
enough thematic keys in them so the videos are holistically related to the song's theme (and that goes
for the artistic style too).
Finally, but very importantly, the scene descriptions should have enough details so they are
specific to the ideas intended by the song as a whole including its theme.  Again, the video
generator only sees each description separately so you have the job to do the translation from
ideas that the songwriter and listeners get from the song and making an isolated image description.
For example, in the Star Spangled Banner the phrase "Home of the Brave" refers to the whole country
and all its people (all of whom are brave, not that there are just some who are brave).
Keep in mind this is the 2020s and good images are those that are inclusive of all people,
in all the variety of their ethnicities, genders, traditions, religions, languages, political views,
philosophies, and identities.
=== lyrics ===

Writing prompt.txt


If your API key doesn't have access to GPT-4 (you'll get an error message to that effect if it doesn't), change CHAT_MODEL to `gpt-3.5-turbo-0613`.  You might want to use 3.5 in any case because it is a lot cheaper (pricing is per token).
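
One way to handle the no-GPT-4 case without editing the cell by hand is to try models in order of preference. The sketch below only illustrates the control flow: `first_working_model` is a hypothetical helper, and `ask` stands in for whatever function actually calls the chat API (it is assumed to raise an exception when a model is unavailable).

```python
def first_working_model(ask, prompt, models=('gpt-4-0613', 'gpt-3.5-turbo-0613')):
    """Try each model name in order; return (model, reply) for the first that works."""
    errors = []
    for model in models:
        try:
            return model, ask(model, prompt)
        except Exception as e:  # the real exception type depends on the client library
            errors.append(f'{model}: {e}')
    raise RuntimeError('no model succeeded: ' + '; '.join(errors))

# Stand-in for a real API call, used only to demonstrate the fallback.
def fake_ask(model, prompt):
    if model.startswith('gpt-4'):
        raise PermissionError('model not available for this key')
    return f'reply from {model}'

model, reply = first_working_model(fake_ask, 'lyrics...')
print(model, reply)
```

In a real run, `ask` would wrap the chat call made in the next cell. Catching a narrower exception type than `Exception` would be preferable once the client's error classes are known.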

In [None]:
#@title ChatGPT selects lyrics to illustrate and generates image descriptions

CHAT_MODEL = 'gpt-4-0613'  #@param {type:"string"}
TEMPERATURE = 0.7  #@param {type:"number"}

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

image_description_csv_text = None
out_dir = os.path.join(outputs_path, video_id)
image_description_file_path = os.path.join(out_dir, 'image_descriptions.csv')
fieldnames = ['start', 'duration', 'text', 'description']
if os.path.exists(image_description_file_path):
  with open(image_description_file_path, 'r') as csv_file:
    image_description_csv_text = csv_file.read()
    print(f'Read image descriptions from file: {image_description_file_path}')
    print(image_description_csv_text)

if not image_description_csv_text:
  chat = ChatOpenAI(temperature=TEMPERATURE, model=CHAT_MODEL)
  print('ChatGPT working...')
  prompt_text = ''
  with open('prompt.txt', 'r') as f:
    prompt_text = f.read()
  # Save a copy of prompt for future reference.
  with open(os.path.join(out_dir, 'prompt.txt'), 'w') as f:
    f.write(prompt_text)
  with open(transcript_file_path, 'r') as csv_file:
    prompt_text = '\n\n'.join([prompt_text, csv_file.read()])
  response = chat([HumanMessage(content=prompt_text)])
  print('Got image descriptions from ChatGPT')
  image_description_csv_text = response.content
  print(image_description_csv_text)
  os.makedirs(out_dir, exist_ok=True)
  with open(image_description_file_path, 'w', newline='') as csv_file:
    csv_file.write(response.content)
    print(f'Saved image descriptions to file: {image_description_file_path}')


ChatGPT working...
Got image descriptions from ChatGPT
"start","duration","text","description"
0,5.48,"","A sunrise illuminating an 1812 era American battlefield, rendered in a dramatic, historic painting style."
5.48,4.18,"Oh, say can you see","A close-up of a soldier's face, awestruck as he looks towards the horizon, depicted in a realistic, historic painting style."
9.66,4.1,"By the dawn's early light","The soldier's perspective of the horizon, where the first light of dawn is breaking, painted in a romantic, historic style."
13.76,4.64,"What so proudly we hailed","A fading flashback of the soldiers hailing the American flag, illustrated in a dramatic, historic painting style."
18.4,4.16,"At the twilight's last gleaming?","The horizon scene transitions from dawn back to the previous night's twilight, captured in a historic, moody and atmospheric style."
22.56,4.64,"Whose broad stripes and bright stars","Close-up of the American flag at twilight, with its stripes and stars dramatical

In [None]:
%cd /content/Text-To-Video-Finetuning
import torch
import random
import numpy as np
"""
torch.use_deterministic_algorithms(True)

torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic=True
random.seed(2)
np.random.seed(2)
torch.manual_seed(2)
torch.cuda.manual_seed(2)
torch.cuda.manual_seed_all(2)
torch.manual_seed(0)
"""
#print("seed is " + str(torch.seed()))

#seeding = "Random"
#thisSeed = 123;


#preset = "Manual"
# while True:
#@markdown ### Configuration for Text to Video generation
model = "potat1" #@param ["potat1", "zeroscope_v2_dark_30x448x256", "zeroscope_v2_576w", "zeroscope_v2_XL"]
negative = "text, watermark, copyright, blurry, nsfw, noise, quick motion, bad quality, flicker, dirty, ugly, fast motion, quick cuts, fast editing, cuts" #@param {type:"string"}
negative = f"\"{negative}\""
num_steps = 25 #@param {type:"raw"}
guidance_scale = 23 #@param {type:"raw"}
width = 800 #@param {type:"raw"}
height = 448 #@param {type:"raw"}
fps = 10 #@param {type:"raw"}
num_frames = 30 #@param {type:"raw"}
seedManual = "Random"
seeding = "Random" #@param ["Random", "Manual"]

inputSeed = 7106521602475165645 #@param {type:"raw"}
if seeding == "Random":
  thisSeed = random.randint(0, ((1<<63)-1))
  print("seed is " + str(thisSeed))
else:
  thisSeed = inputSeed

thisHeight = int(round(height/8.0)*8.0)
thisWidth = int(round(width/8.0)*8.0)

thisModel="/content/"+model


/content/Text-To-Video-Finetuning
seed is 5641571670149848983


In [65]:
#@title Generate Animated Scene Videos

scene_mp4_name = 'scene_f30.mp4'

def get_scene_video(video_out_path:str, name:str):
  for fp in os.listdir(video_out_path):
    if fp.endswith('mp4'):
      if fp != scene_mp4_name:
        os.rename(os.path.join(video_out_path, fp), os.path.join(video_out_path, scene_mp4_name))
      return os.path.join(video_out_path, scene_mp4_name)
  return None

def generate_video(name:str, prompt: str):
  """Return the path to the video for the scene, or None if no video was produced."""

  video_out_path = os.path.join('/content/output/videos', name)
  os.makedirs(video_out_path, exist_ok=True)

  scene_video = get_scene_video(video_out_path, name)
  if scene_video:
    print(f'Already have video for: {prompt}')
    return scene_video

  print(f'Generating video for: {prompt}')
  prompt = f"\"{prompt}\""
  !python inference.py -m {thisModel} -p {prompt} -n {negative} -W {thisWidth} -H {thisHeight} -o {video_out_path} -d cuda -x -s {num_steps} -g {guidance_scale} -f {fps} -T {num_frames} -seed {thisSeed}

  return get_scene_video(video_out_path, name)

videos_dir = os.path.join(out_dir, 'videos')
os.makedirs(videos_dir, exist_ok=True)

def description_to_name(description:str):
  return re.sub(r'[^\w\d-]','_', description).lower()

def description_to_videopath(description:str):
  video_out_path = os.path.join('/content/output/videos', description_to_name(description))
  return os.path.join(video_out_path, scene_mp4_name)

with open(image_description_file_path, 'r') as csv_file:
  reader = csv.DictReader(csv_file, quoting=csv.QUOTE_NONNUMERIC)
  for row in reader:
    print(row)
    description = row['description']
    name = description_to_name(description)
    video_path = generate_video(name, description)
    if video_path:
      print(video_path)
    else:
      print(f"No video for: {description}")

print('done')


{'start': 0.0, 'duration': 5.48, 'text': '', 'description': 'A sunrise illuminating an 1812 era American battlefield, rendered in a dramatic, historic painting style.'}
Already have video for: A sunrise illuminating an 1812 era American battlefield, rendered in a dramatic, historic painting style.
/content/output/videos/a_sunrise_illuminating_an_1812_era_american_battlefield__rendered_in_a_dramatic__historic_painting_style_/scene_f30.mp4
{'start': 5.48, 'duration': 4.18, 'text': 'Oh, say can you see', 'description': "A close-up of a soldier's face, awestruck as he looks towards the horizon, depicted in a realistic, historic painting style."}
Already have video for: A close-up of a soldier's face, awestruck as he looks towards the horizon, depicted in a realistic, historic painting style.
/content/output/videos/a_close-up_of_a_soldier_s_face__awestruck_as_he_looks_towards_the_horizon__depicted_in_a_realistic__historic_painting_style_/scene_f30.mp4
{'start': 9.66, 'duration': 4.1, 'text'

In [67]:
#@title Concatenate scene videos into single video

import ffmpeg

video_path = os.path.join(out_dir, 'video.mp4')

video_inputs = []
video_input_paths = []
with open(image_description_file_path, 'r') as csv_file:
  reader = csv.DictReader(csv_file, quoting=csv.QUOTE_NONNUMERIC)
  for row in reader:
    print(row)
    description = row['description']
    scene_videopath = description_to_videopath(description)
    print(scene_videopath)
    if not os.path.exists(scene_videopath):
      print(f"Missing scene file: {scene_videopath}")
      continue
    video_in = ffmpeg.input(scene_videopath)
    video_inputs.append(video_in)
    video_input_paths.append(scene_videopath)

print(f'len(video_inputs): {len(video_inputs)}')
print(video_inputs)
with open(os.path.join(out_dir, 'video_input_paths.txt'), 'w') as f:
  for video_input_path in video_input_paths:
    f.write(f"file '{video_input_path}'\n")

# ffmpeg_args = ffmpeg.concat(*video_inputs, v=1, a=1).output(video_with_audio_path).overwrite_output()
ffmpeg_args = ffmpeg.concat(*video_inputs, v=1, a=0).output(video_path).overwrite_output()
# ffmpeg_args = ffmpeg.filter_('concat', *video_inputs, v=1, a=1, n=len(video_inputs)).output(video_path).overwrite_output()
# print(' -i '.join(video_input_paths))
# print(' '.join(ffmpeg_args.compile()))
print('concatenating with ffmpeg')
result = ffmpeg_args.run()
print('Done merging')



{'start': 0.0, 'duration': 5.48, 'text': '', 'description': 'A sunrise illuminating an 1812 era American battlefield, rendered in a dramatic, historic painting style.'}
/content/output/videos/a_sunrise_illuminating_an_1812_era_american_battlefield__rendered_in_a_dramatic__historic_painting_style_/scene_f30.mp4
{'start': 5.48, 'duration': 4.18, 'text': 'Oh, say can you see', 'description': "A close-up of a soldier's face, awestruck as he looks towards the horizon, depicted in a realistic, historic painting style."}
/content/output/videos/a_close-up_of_a_soldier_s_face__awestruck_as_he_looks_towards_the_horizon__depicted_in_a_realistic__historic_painting_style_/scene_f30.mp4
{'start': 9.66, 'duration': 4.1, 'text': "By the dawn's early light", 'description': "The soldier's perspective of the horizon, where the first light of dawn is breaking, painted in a romantic, historic style."}
/content/output/videos/the_soldier_s_perspective_of_the_horizon__where_the_first_light_of_dawn_is_breaking

In [None]:
#@title Merge Audio with Video using ffmpeg-python
import ffmpeg

video_with_audio_path = os.path.join(out_dir, 'video_with_audio.mp4')
destfile_time = None
if os.path.exists(video_with_audio_path):
  destfile_time = os.path.getmtime(video_with_audio_path)

if (destfile_time is not None) and (destfile_time > os.path.getmtime(video_path)):
  print('video with audio already exists: ', video_with_audio_path)
else:
  video_in = ffmpeg.input(video_path)
  audio_in = ffmpeg.input(audio_file_path)
  try:
    ffmpeg_args = ffmpeg.concat(video_in, audio_in, v=1, a=1).output(video_with_audio_path).overwrite_output()
    print('concatenating with ffmpeg')
    result = ffmpeg_args.run()
    print('Done merging')
  except Exception as e:
    # Report the cause instead of silently swallowing it.
    print(f'Merge failed! {e}')


concatenating with ffmpeg
Done merging


In [None]:
video_with_audio_path

'/content/gdrive/MyDrive/Commentator_AI/Transcript_to_Video/DADmZdbQ9x8/video_with_audio.mp4'