### Transcription using whisper

This script uses OpenAI's whisper to automatically transcribe videos.

For this script we need these libraries:
- json: Python's built-in library for reading and writing JSON, a format that maps directly onto the dictionary whisper returns for each transcription
- glob: a built-in library that finds files whose paths match a pattern, such as every `.mp4` file in a folder
- pandas: our trusted data analysis library
- whisper: OpenAI's open-source [`whisper`](https://github.com/openai/whisper) library

In [1]:
import json 
import glob

import pandas as pd
import whisper


Using glob we find every video in the data directory:

In [2]:
# folder that holds the videos
base = "../output/demure/"
# collect the path of every .mp4 file in that folder
paths = glob.glob(base + "*.mp4")
len(paths)

10
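
With `glob.glob`, the order of the results depends on the file system, and only the one pattern it is given is matched. A minimal sketch of a variant, assuming a reproducible order and possibly more than one extension are wanted; the helper name `find_videos` and its `extensions` parameter are illustrative and not part of the original notebook:

```python
from pathlib import Path


def find_videos(folder, extensions=(".mp4",)):
    """Return sorted paths (as strings) of files in `folder` with the given extensions.

    `find_videos` is a hypothetical helper, not part of glob or whisper.
    """
    folder = Path(folder)
    matches = [
        str(p)
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    ]
    # sorting makes the order reproducible between runs and machines
    return sorted(matches)
```

It could stand in for the cell above as `paths = find_videos("../output/demure/")`.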

Load whisper's `base` model:

In [3]:
# load model
model = whisper.load_model("base")


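In the next cells we transcribe the videos. `language="es"` tells whisper to decode the audio as Spanish, and `fp16=False` makes it use 32-bit floats, which is what whisper needs when running on a CPU. Later we collect the transcripts of all videos in a list. First, we test the model on a single video: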

In [4]:
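for path in paths:
    # transcribe a single video, decoding the audio as Spanish
    result = model.transcribe(path, language="es", fp16=False)
    print(result["text"])
    # stop after the first video; this is only a test
    break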

 ¿Qué es lo que hace para trabajar? Muy dimuloso, muy mindful. No me gusta trabajar con un recorredor grado. No me parece un clon cuando me voy a trabajar. No me gusta mucho mucho a lo que me gusta. ¿Qué es lo que hace para trabajar? La forma que me into el interview es la forma que me into el trabajo. A veces, a veces me into el interview Look like Marge Simpson y into el trabajo, Look like Patty and Selma, no, Damiro. Estoy muy modesto, estoy muy mindful. Alan Boye, un poco de broma's<|es|>
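
Besides `"text"`, the dictionary returned by `model.transcribe` also carries a `"language"` key and a `"segments"` list whose entries include `"start"` and `"end"` (in seconds) and `"text"`. A small sketch for inspecting a result segment by segment; `format_segments` is a hypothetical helper, and the `example` dictionary is made-up data in the same shape, not real whisper output:

```python
def format_segments(result):
    """Turn a whisper-style result dict into timestamped lines."""
    lines = []
    for seg in result.get("segments", []):
        # start and end are offsets from the beginning of the audio, in seconds
        lines.append(f'[{seg["start"]:6.2f} - {seg["end"]:6.2f}] {seg["text"].strip()}')
    return lines


# made-up example in the shape whisper returns (not real output)
example = {
    "text": " Hola. Adiós.",
    "language": "es",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hola."},
        {"start": 1.5, "end": 3.0, "text": " Adiós."},
    ],
}

for line in format_segments(example):
    print(line)
```

Passing the `result` from the cell above instead of `example` would show where in the video each piece of the transcript was heard.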


Now let us run the code for all files:

In [5]:
%%time
rows = []
for path in paths:
    result = model.transcribe(path, language="es", fp16=False)
    # keep the file path next to its transcript so each row stays identifiable
    row = {
        "file_name": path,
        "transcript": result["text"],
    }
    rows.append(row)
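
Transcribing many files can take a while, and a single unreadable video would stop the loop and lose the work done so far. A hedged sketch of one way to guard against that: `transcribe_all` is a hypothetical helper (not part of whisper) that takes any function mapping a path to a whisper-style result, so it can be tried with a stand-in before being pointed at the real model. The extra `error` field is an addition of this sketch.

```python
def transcribe_all(paths, transcribe_fn):
    """Transcribe every path, recording failures instead of stopping.

    `transcribe_fn` is any callable that takes a path and returns a dict
    with a "text" key, for example a small wrapper around model.transcribe.
    """
    rows = []
    for path in paths:
        try:
            result = transcribe_fn(path)
            rows.append({"file_name": path, "transcript": result["text"], "error": None})
        except Exception as exc:
            # keep going, but note which file failed and why
            rows.append({"file_name": path, "transcript": None, "error": str(exc)})
    return rows
```

With the model loaded earlier, this could be called as `rows = transcribe_all(paths, lambda p: model.transcribe(p, language="es", fp16=False))`.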

Now that we have the transcripts, we can once again turn them into a dataframe and export it:

In [6]:
transcripts = pd.DataFrame(rows)



In [7]:
transcripts.head()

| | file_name | transcript |
|---|---|---|
| 0 | ../output/demure/7399736793119247662.mp4 | Qué healed Re Balance para meditzar ha climbs... |
| 1 | ../output/demure/7425252333770018090.mp4 | Ok.我們eremos un anas a hablar about these wigs... |
| 2 | ../output/demure/7424994658423541034.mp4 | ¿Se hace dubota de una fondad? ¿ου tal conpl ... |
| 3 | ../output/demure/7424650065811164459.mp4 | secondary |
| 4 | ../output/demure/7425288919949856043.mp4 | que iréado por ser algo de sus 느낌이, pero por ... |


In [8]:
len(transcripts)


10

In [9]:
transcripts.to_csv("../output/transcripts.csv", index=False)
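
The `json` library imported in the first cell can also be used to store the results. A sketch, under the assumption that a JSON copy of the transcripts is wanted alongside the CSV: the hypothetical helpers `save_rows_json` and `load_rows_json` write and read the list of row dictionaries, with `ensure_ascii=False` so accented and other non-ASCII characters stay readable in the file.

```python
import json


def save_rows_json(rows, out_path):
    """Write a list of row dicts to `out_path` as UTF-8 JSON."""
    with open(out_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as "¿" or "é" as-is
        json.dump(rows, f, ensure_ascii=False, indent=2)


def load_rows_json(in_path):
    """Read the rows back as a list of dicts."""
    with open(in_path, encoding="utf-8") as f:
        return json.load(f)
```

For example, `save_rows_json(rows, "../output/transcripts.json")` would write the file next to the CSV; the file name is an assumption of this sketch.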