### Transcription using whisper

This script uses OpenAI's whisper to automatically transcribe the audio of videos.

For this script we need these libraries:
- json: a library for reading and writing JSON data; whisper returns each transcription as a JSON-like dictionary
- glob: a library that finds files on your computer whose paths match a pattern
- pandas: our trusted data analysis library
- whisper: OpenAI's open-source [`whisper`](https://github.com/openai/whisper) library

In [1]:
import json 
import glob

import pandas as pd
import whisper

Using glob we find every video in the data directory:

In [2]:
base = "../data/"
paths = glob.glob(base + "*.mp4")
len(paths)

7
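`glob.glob` returns paths in arbitrary order, so it can help to sort them before processing. The following is a minimal sketch rather than part of the original notebook: `find_videos` is a hypothetical helper, and a temporary directory stands in for `../data/`.

```python
import glob
import os
import tempfile


def find_videos(base, pattern="*.mp4"):
    """Return the sorted list of files in `base` that match `pattern`."""
    # os.path.join avoids depending on a trailing slash in `base`.
    return sorted(glob.glob(os.path.join(base, pattern)))


# Demonstration on a throwaway directory (a stand-in for ../data/).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.mp4", "a.mp4", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in find_videos(tmp)]

print(found)
```

Only the two `.mp4` files are matched; `notes.txt` is ignored by the pattern.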

Load OpenAI's model:

In [3]:
# load model
model = whisper.load_model("base")


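In the next cells we first write a short loop that transcribes each video and prints the text. Passing `language="en"` tells whisper that the audio is in English, so it skips automatic language detection. In the second version of the loop we collect the transcripts in a list instead of printing them: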

In [None]:
for path in paths:
    result = model.transcribe(path, language="en")
    print(result["text"])
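Besides `"text"`, the dictionary returned by `model.transcribe` also carries a `"segments"` list with start and end times in seconds. The sketch below formats such segments as timestamped lines. `format_segments` is a hypothetical helper, and `example_result` is a hand-written stand-in with made-up values, not real whisper output.

```python
def format_segments(result):
    """Turn a whisper-style result dict into '[start - end] text' lines."""
    lines = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()  # trim surrounding whitespace
        lines.append(f"[{seg['start']:.2f} - {seg['end']:.2f}] {text}")
    return lines


# Made-up data shaped like a transcription result (illustrative only).
example_result = {
    "text": " Hello there. How are you?",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 3.0, "text": " How are you?"},
    ],
}

for line in format_segments(example_result):
    print(line)
```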



In [None]:
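rows = []
for path in paths:
    # Transcribe the video's audio track as English.
    result = model.transcribe(path, language="en")
    # Keep one row per video: the source file and its transcript.
    row = {
        "file_name": path,
        "transcript": result["text"],
    }
    rows.append(row)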
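The loop above can be wrapped in a function so it is easy to rerun and so that one unreadable file does not stop the whole batch. This is a sketch, not the notebook's own code: `transcribe_all` and `FakeModel` are hypothetical names, and the fake model only mimics the `transcribe(path, language=...)` call so the example runs without downloading whisper. With the real library you would pass the object returned by `whisper.load_model("base")` instead.

```python
class FakeModel:
    """Stand-in for a whisper model; returns a canned result dict."""

    def transcribe(self, path, language=None):
        if path.endswith(".broken"):
            raise RuntimeError("could not read file")
        return {"text": f" transcript of {path}", "language": language}


def transcribe_all(model, paths, language="en"):
    """Transcribe each path and collect one row per successful file."""
    rows = []
    for path in paths:
        try:
            result = model.transcribe(path, language=language)
        except Exception as exc:  # e.g. an unreadable or corrupt file
            print(f"skipping {path}: {exc}")
            continue
        rows.append({"file_name": path, "transcript": result["text"].strip()})
    return rows


rows = transcribe_all(FakeModel(), ["a.mp4", "clip.broken", "b.mp4"])
print(rows)
```

The failing file is reported and skipped, and the remaining rows can go straight into `pd.DataFrame(rows)`.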

Now that we have the transcripts, we load them into a pandas DataFrame and preview the first rows:

In [6]:
transcripts = pd.DataFrame(rows)



In [7]:
transcripts.head()

| | file_name | transcript |
|---|---|---|
| 0 | ../data/@joolieannie_7404929915893681451.mp4 | Ladies, let's be mindful when we use our phon... |
| 1 | ../data/@joolieannie_1724362477829.mp4 | give me one that's like the size of like a fi... |
| 2 | ../data/@joolieannie_1723610748575.mp4 | You see how I do my makeup for work? Very dem... |
| 3 | ../data/@joolieannie_1724324953244.mp4 | Divas I'm in Los Angeles and Zillow needs my ... |
| 4 | ../data/@joolieannie_1724362572281.mp4 | Hi, Tvaz. Okay, so I've been going to the sam... |


In [8]:
len(transcripts)


7

In [9]:
transcripts.to_csv("../output/transcripts.csv", index=False)
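To check the export, the file can be read back with `pd.read_csv`. The sketch below uses made-up rows and a temporary file rather than `../output/transcripts.csv`. Because the file is written with `index=False`, it contains only the two data columns; without that argument, the index would be written as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the real transcripts.
rows = [
    {"file_name": "a.mp4", "transcript": "Hello, world"},
    {"file_name": "b.mp4", "transcript": "Second clip"},
]
transcripts = pd.DataFrame(rows)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "transcripts.csv")
    # index=False keeps the row index out of the file.
    transcripts.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

print(list(restored.columns))
print(len(restored))
```

The comma inside `"Hello, world"` survives the round trip because pandas quotes fields that contain the delimiter.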