# Creating Subtitles with AI and Machine Learning

In this session, we will split into groups. Each group will use this Jupyter Notebook to automatically generate a subtitle file using the technologies we discussed previously.

Below, find your group and follow the instructions.

## Group 1 - A

Mason, Gordon A.: WWII Oral History
https://digital.lib.utk.edu/collections/islandora/object/wwiioh%3A2001
Language Model: Tiny


## Group 1 - B

Mason, Gordon A.: WWII Oral History
https://digital.lib.utk.edu/collections/islandora/object/wwiioh%3A2001
Language Model: Base

In [1]:
group1a = {
    "request": "https://digital.lib.utk.edu/collections/islandora/object/wwiioh%3A2001/datastream/PROXY_MP3",
    "language_model": "tiny",
    "name": "wwiioh_2001.mp3",
    "language": "English"
}

group1b = {
    "request": "https://digital.lib.utk.edu/collections/islandora/object/wwiioh%3A2001/datastream/PROXY_MP3",
    "language_model": "base",
    "name": "wwiioh_2001.mp3",
    "language": "English"
}

group2a = {
    "request": "https://digital.lib.utk.edu/collections/islandora/object/rfta:156/datastream/PROXY_MP3",
    "language_model": "tiny",
    "name": "rfta_156.mp3",
    "language": "English"
}

group2b = {
    "request": "https://digital.lib.utk.edu/collections/islandora/object/rfta:156/datastream/PROXY_MP3",
    "language_model": "base",
    "name": "rfta_156.mp3",
    "language": "English"
}

group3a = {
    "request": "https://digital.lib.utk.edu/collections/islandora/object/rfta:165/datastream/MP4",
    "language_model": "tiny",
    "name": "rfta_165.mp4",
    "language": "English"
}

group3b = {
    "request": "https://digital.lib.utk.edu/collections/islandora/object/rfta:165/datastream/MP4",
    "language_model": "base",
    "name": "rfta_165.mp4",
    "language": "English"
}

group3c = {
    "request": "https://digital.lib.utk.edu/collections/islandora/object/rfta:165/datastream/MP4",
    "language_model": "base",
    "name": "rfta_165.mp4",
    "language": "Spanish"
}

group4a = {
    "request": "https://digital.lib.utk.edu/collections/islandora/object/cdf:10411/datastream/MP4",
    "language_model": "tiny",
    "name": "cdf_10411.mp4",
    "language": "English"
}

group4b = {
    "request": "https://digital.lib.utk.edu/collections/islandora/object/cdf:10411/datastream/MP4",
    "language_model": "base",
    "name": "cdf_10411.mp4",
    "language": "English"
}

## Choose Group

In the section below, set `group` equal to the variable for your group (e.g. `group1a`, `group1b`, `group2a`, `group2b`).

It should look something like:

```
group = group3c
```

In [2]:
group = group3c

## Download Resource

The code below downloads the A/V file so we can generate subtitles for it.

In [3]:
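import requests
import os

# Fetch the audio/video file for the selected group.
r = requests.get(group['request'])
r.raise_for_status()  # stop here if the server returned an HTTP error
# Save the file under downloads/ using the name from the group settings.
os.makedirs('downloads', exist_ok=True)
with open(f"downloads/{group['name']}", 'wb') as our_sample:
    our_sample.write(r.content)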
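
The cell above holds the whole file in memory before writing it, which is fine for short clips. For longer recordings, streaming the response to disk in chunks may be gentler on memory. The sketch below shows one way to do that with the standard library instead of `requests`; `download_to` and its parameters are hypothetical names introduced here for illustration, not part of the workshop code.

```python
import os
import shutil
import urllib.request


def download_to(url, dest_dir, name, chunk_size=1024 * 1024):
    """Stream `url` into dest_dir/name and return the path that was written.

    Hypothetical helper: copies the response in chunks of `chunk_size` bytes
    rather than loading the whole file into memory first.
    """
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, name)
    # urlopen raises an exception if the request fails (for example, HTTP 404).
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return path
```

In this notebook it could be called as `download_to(group["request"], "downloads", group["name"])`.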


## Generate a Closed Captioning File

In [4]:
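import warnings

# @Todo: remove this when numba is updated to 0.53.1
# The filter must be set before whisper is imported so the warning is hidden.
warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")
import whisper
import os
from whisper.utils import get_writer

# Path to the file downloaded in the previous step.
file = f"downloads/{group['name']}"
# Load the Whisper model chosen for this group ("tiny" or "base").
model = whisper.load_model(group['language_model'])
# Transcribe the audio; fp16=False uses 32-bit floats, which is what CPUs support.
result = model.transcribe(file, fp16=False, language=group["language"])
# Write the transcript as a WebVTT file in class_test/.
os.makedirs('class_test', exist_ok=True)
output_file = f'sample_{group["language_model"]}.vtt'
writer = get_writer('vtt', 'class_test')
writer(result, output_file)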
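
To make the output less of a black box: the dictionary returned by `transcribe` includes a `segments` list, where each segment carries a `start` time, an `end` time (both in seconds), and the recognized `text`. A WebVTT file is essentially those segments written out as timed cues. The sketch below illustrates that mapping. It is not Whisper's own writer, whose timestamp formatting may differ; `format_timestamp` and `segments_to_vtt` are hypothetical helpers and the sample segments are invented for illustration.

```python
def format_timestamp(seconds):
    """Render a time in seconds as HH:MM:SS.mmm, a valid WebVTT cue timestamp."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{milliseconds:03d}"


def segments_to_vtt(segments):
    """Build WebVTT text from dicts that have 'start', 'end', and 'text' keys."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


# Invented sample data shaped like the segments in a transcription result.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hola"},
    {"start": 2.5, "end": 5.0, "text": " Gracias"},
]
print(segments_to_vtt(sample_segments))
```

Each cue is a timing line containing ` --> ` followed by the caption text, which is why the reading step later in this notebook can drop timestamps by filtering on that marker.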

## Read the VTT

The cell below prints the text of the closed caption file without the timestamps.

Let's compare your results to what's online and discuss the differences.

* [Group 1](https://samvera-labs.github.io/clover-iiif/?iiif-content=https://digital.lib.utk.edu/assemble/manifest/wwiioh/2001)
* [Group 2](https://samvera-labs.github.io/clover-iiif/?iiif-content=https://digital.lib.utk.edu/assemble/manifest/rfta/156)
* [Group 3](https://samvera-labs.github.io/clover-iiif/?iiif-content=https://digital.lib.utk.edu/assemble/manifest/rfta/165)
* [Group 4](https://samvera-labs.github.io/clover-iiif/?iiif-content=https://digital.lib.utk.edu/assemble/manifest/cdf/10411) 
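
If you would like a rough number to go with the discussion, one option is to compare the two transcripts word by word. The sketch below assumes you have pasted the online captions into a string yourself; `vtt_text` and `word_similarity` are hypothetical helpers, and the score is `difflib`'s general-purpose similarity ratio, not a formal word error rate.

```python
import difflib


def vtt_text(vtt):
    """Return the caption text of a WebVTT string as one line of words.

    Skips the WEBVTT header, cue timing lines, and blank lines.
    """
    kept = []
    for line in vtt.splitlines():
        line = line.strip()
        if not line or line.startswith("WEBVTT") or "-->" in line:
            continue
        kept.append(line)
    return " ".join(kept)


def word_similarity(candidate, reference):
    """Return a 0..1 score for how closely two transcripts match, word by word."""
    a = candidate.lower().split()
    b = reference.lower().split()
    # autojunk=False keeps frequent words from being ignored in long transcripts.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Invented example: a generated caption file versus a hand-typed reference.
generated = "WEBVTT\n\n00:00.000 --> 00:02.000\nmuchas gracias por su historia\n"
reference = "Muchas gracias por la historia"
print(word_similarity(vtt_text(generated), reference))
```

A score of 1.0 means the word sequences are identical after lowercasing. Punctuation still counts as part of a word here, so stripping it first may give a fairer comparison.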

In [5]:
with open(f'class_test/sample_{group["language_model"]}.vtt', 'r') as current:
    for line in current:
        if ' --> ' not in line and 'WEBVTT' not in line and line != '\n':
            print(line)

Hoy es tres de marzo del veinte veintiduno y como parte del proyecto de historia oral de

incendios forestales de chimney tops dos, resurgiendo de las inínsas. Hoy tenemos el

honor de entrevistar a Julia Rodríguez sobre sus experiencias durante y después de los incendios

forestales chimney tops dos en 2016. Mi nombre es Cristiano de Díctlíbera y hoy es

haré conduciendo la entrevista a Julia. Julia, muchas gracias por su disposición a compartir su

historia y experiencia con nosotros. Lo apreciamos mucho. Un placer estar aquí para ustedes.

Gracias. Vamos a empezar con si no puede decir por favor su nombre. Julia, Jamilette,

Rodríguez. Y cuándo ha vivido en Galibera y nos puede contar cómo llegó a la área de donde

sucede originalmente. Hoy originalmente de Honduras llegué aquí en el 2000 en el año 2000.

Llegué esta ciudad porque a mi esposo le gustó porque era una ciudad pequeña y cómo hizo trabajo.

Ok, nos puede hablar un poco sobre su vida en Galiber. Por ejemplo, tiene familia

## Group 3 Only -- Convert Spanish to English

In [7]:
if "rfta_165.mp4" in file:
    result = model.transcribe(file, fp16=False, language=group["language"], task='translate')
    os.makedirs('class_test', exist_ok=True)
    output_file = f'sample_{group["language_model"]}_translated.vtt'
    writer = get_writer('vtt', 'class_test')
    writer(result, output_file)
    with open(f'class_test/sample_{group["language_model"]}_translated.vtt', 'r') as current:
        for line in current:
            if ' --> ' not in line and 'WEBVTT' not in line and line != '\n':
                print(line)

is

you

make

other

because I liked it because it was a small city and how did you work?

ok we can talk a little about your life in galleyburg for example you have a family in the area

you work and how did you interact with the community? I have two brothers here

my sister and my brother work in hotel in a bandera and because I

grew up in this city it is a little difficult when I arrived here because I realized that my English was not

nothing lived in a place where it was moved I could not communicate with anyone so I decided that I had to do something

I only had two options one was to return from where it comes or learn the language I went to school I went to school with my

two children because I also had to take care of them I took them to school and I had them close to me where I could be

safe while I was studying classes then I felt that I could do many things if it was very easy for me and so it was

as I got to the city of work I had more access to many things that I co