# Story Teller

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17C284MOUDQMxV6bRbgVRH4GXsb87iADW?usp=sharing)

In this notebook, we demonstrate how to use Story Teller to generate video stories with voice narrations.

First, install the required packages. This includes `storyteller-core` as well as the other packages needed to run Story Teller on Colab's default runtime.

In [None]:
%%capture
# Upgrade the packaging tools first so the later installs resolve cleanly.
%pip install -U pip wheel
%pip install storyteller-core accelerate moviepy
%pip install -U protobuf
# Pin numpy to avoid a conflict with numba.
%pip install numpy==1.22.4



Below is a sample generation snippet. We specify the model configuration in `StoryTellerConfig` and then load the models. In this case, we load:

* GPT-2
* Stable Diffusion
* Glow-TTS + MelGAN

We run the writer and the painter on CUDA in half precision (`torch.float16`) to reduce memory usage and latency.
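If the runtime might not have a GPU attached, the device and dtype values can be chosen at run time instead of being hard-coded. The following is a minimal sketch, not part of Story Teller itself: `pick_runtime_settings` and `cuda_is_available` are hypothetical helpers, and the sketch assumes that `StoryTellerConfig` also accepts `"cpu"` and `"float32"` for the fields used in this notebook.

```python
from typing import Dict


def pick_runtime_settings(cuda_available: bool) -> Dict[str, str]:
    """Return keyword arguments for StoryTellerConfig (hypothetical helper)."""
    # Assumption: half precision is only used on a GPU; on CPU we fall back to float32.
    device = "cuda" if cuda_available else "cpu"
    dtype = "float16" if cuda_available else "float32"
    return {
        "writer_device": device,
        "painter_device": device,
        "writer_dtype": dtype,
        "painter_dtype": dtype,
    }


def cuda_is_available() -> bool:
    """Report whether PyTorch can see a CUDA device (False if torch is missing)."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


settings = pick_runtime_settings(cuda_is_available())
print(settings)
```

Under the assumption above, the resulting dictionary could be passed on as `StoryTellerConfig(**settings)`.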

In [1]:
import os
from storyteller import StoryTeller, StoryTellerConfig
from storyteller.utils import set_seed

set_seed(42)
os.makedirs("out", exist_ok=True)
config = StoryTellerConfig(
    writer_device="cuda",
    painter_device="cuda",
    writer_dtype="float16",
    painter_dtype="float16",
)
story_teller = StoryTeller(config)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


Downloading (…)ain/model_index.json:   0%|          | 0.00/537 [00:00<?, ?B/s]

text_encoder/pytorch_model.fp16.safetensors not found


Fetching 22 files:   0%|          | 0/22 [00:00<?, ?it/s]

Downloading (…)tokenizer/merges.txt:   0%|          | 0.00/525k [00:00<?, ?B/s]

Downloading (…)cheduler_config.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

Downloading (…)_encoder/config.json:   0%|          | 0.00/633 [00:00<?, ?B/s]

Downloading (…)rocessor_config.json:   0%|          | 0.00/342 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/824 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/460 [00:00<?, ?B/s]

Downloading (…)d79/unet/config.json:   0%|          | 0.00/909 [00:00<?, ?B/s]

Downloading (…)tokenizer/vocab.json:   0%|          | 0.00/1.06M [00:00<?, ?B/s]

Downloading model.fp16.safetensors:   0%|          | 0.00/681M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.36G [00:00<?, ?B/s]

Downloading pytorch_model.fp16.bin:   0%|          | 0.00/681M [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/1.36G [00:00<?, ?B/s]

Downloading (…)del.fp16.safetensors:   0%|          | 0.00/1.73G [00:00<?, ?B/s]

Downloading (…)torch_model.fp16.bin:   0%|          | 0.00/1.73G [00:00<?, ?B/s]

Downloading (…)ch_model.safetensors:   0%|          | 0.00/3.46G [00:00<?, ?B/s]

Downloading (…)on_pytorch_model.bin:   0%|          | 0.00/3.46G [00:00<?, ?B/s]

Downloading (…)9d79/vae/config.json:   0%|          | 0.00/611 [00:00<?, ?B/s]

Downloading (…)on_pytorch_model.bin:   0%|          | 0.00/335M [00:00<?, ?B/s]

Downloading (…)torch_model.fp16.bin:   0%|          | 0.00/167M [00:00<?, ?B/s]

Downloading (…)del.fp16.safetensors:   0%|          | 0.00/167M [00:00<?, ?B/s]

Downloading (…)ch_model.safetensors:   0%|          | 0.00/335M [00:00<?, ?B/s]

 > Downloading model to /root/.local/share/tts/tts_models--en--ljspeech--glow-tts


100%|██████████| 344M/344M [00:08<00:00, 41.8MiB/s]


 > Model's license - MPL
 > Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
 > Downloading model to /root/.local/share/tts/vocoder_models--en--ljspeech--multiband-melgan


100%|██████████| 82.8M/82.8M [00:02<00:00, 36.4MiB/s]


 > Model's license - MPL
 > Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
 > Using model: glow_tts
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.1
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Vocoder Model: multiband_melgan
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func

We then call `story_teller.generate(...)`. To keep the run short, we generate only 5 slides (`num_images=5`).

In [2]:
story_teller.generate(
    writer_prompt="Once upon a time, unicorns roamed the Earth.",
    painter_prompt_prefix="Beautiful painting",
    num_images=5,
    output_dir="out",
)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


  0%|          | 0/20 [00:00<?, ?it/s]

 > Text splitted to sentences.
['Once upon a time, unicorns roamed the Earth.']
 > Processing time: 1.5207033157348633
 > Real-time factor: 0.4183907480529264


  0%|          | 0/20 [00:00<?, ?it/s]

 > Text splitted to sentences.
["It was only that the earth was, and had never stopped being, divided by great nations, with no space to roam (this will explain why this year there weren't any unicorns on the land)."]
 > Processing time: 1.7319765090942383
 > Real-time factor: 0.1452492013993487


  0%|          | 0/20 [00:00<?, ?it/s]

 > Text splitted to sentences.
['Today, it has become a vast majority where there is little more than an out of date rainbow of unicorns.']
tədeɪ, ɪt hæz bɪkʌm ə væst məd͡ʒɔɹəti wɛɹ ðɛɹ ɪz lɪtəl mɔɹ ðən ən aʊt əv deɪt ɹeɪnboʊ əv junɪkɔɹnz.
 [!] Character '͡' not found in the vocabulary. Discarding it.
 > Processing time: 1.7564148902893066
 > Real-time factor: 0.21955186128616333


  0%|          | 0/20 [00:00<?, ?it/s]

 > Text splitted to sentences.
['In a recent report entitled "The Unicorns of the Universe", there are many new unicorns.']
 > Processing time: 1.2805664539337158
 > Real-time factor: 0.20614188112690132


  0%|          | 0/20 [00:00<?, ?it/s]

 > Text splitted to sentences.
['I was wondering if there was a reason for all these new unicorns to come here and join in, to provide some reassurance.']
 > Processing time: 1.586329698562622
 > Real-time factor: 0.1971467775120943


You can either download the generated video from `out/out.mp4` or use `moviepy` to watch it directly in Colab.
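Before downloading or displaying the video, it can help to confirm that a file was actually written. The sketch below uses only the standard library; `find_videos` is a hypothetical helper, and it assumes the output directory is `out`, as passed to `output_dir` earlier in this notebook.

```python
from pathlib import Path
from typing import List


def find_videos(output_dir: str) -> List[Path]:
    """Return the non-empty .mp4 files in output_dir, sorted by name."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return []
    # Skip zero-byte files, which usually indicate an interrupted write.
    return sorted(p for p in directory.glob("*.mp4") if p.stat().st_size > 0)


videos = find_videos("out")
if videos:
    for video in videos:
        print(f"{video} ({video.stat().st_size / 1e6:.1f} MB)")
else:
    print("No video found in 'out'; check the generation step for errors.")
```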

In [3]:
%%capture
from moviepy.editor import VideoFileClip

path = "out/out.mp4"
clip = VideoFileClip(path)
html = clip.ipython_display(width=280)


In [4]:
html
