<a href="https://colab.research.google.com/github/pszemraj/vid2cleantxt/blob/master/colab_notebooks/vid2cleantext_single_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# vid2cleantxt - single file version on Colab
Peter Szemraj

[Link to full GitHub Repo](https://github.com/pszemraj/vid2cleantxt)

## Purpose

A quick demo of vid2cleantxt.


* Downloads a video to transcribe, converts the video file to audio chunks, runs those chunks through facebook's wav2vec2 pretrained speech transcription model. 
* After saving original transcription, it also creates a version that is spell-corrected, and a third version with sentence boundary disambiguation (i.e. it adds periods into sentences)

## Instructions

*NOTE: key inputs have now been made easier to edit/input with the google "forms" feature, look for those as things to change.*

The two main things that need to be done to make this work are:
1. Specify what the input filename and filepath are
    * In this demo it is already specified for you (uses **requests** to get vid file from project repo)
2. Adjust model main parameters
    * with a GPU, should be stable @ **20** seconds
    * with a CPU, bit of a trial-and-error process
    * <font color='orange'> **Before running script, do Runtime->Change Runtime Type-> GPU in the top menu** </font>

Sections where these parameters need to be updated are indicated in the file below (or see table of contents).

** **

<font color='orange'> This example was designed to be run in the Google Colab environment but should work locally with a few tweaks (i.e. get rid of google colab libraries) </font>

In [1]:
want_to_download_results = True  # @param {type:"boolean"}

In [2]:
!nvidia-smi

Tue Aug 17 05:12:29 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Setup

In [3]:
%%capture

!pip install pysbd
!pip install transformers
!pip install texthero
!pip install wordninja
!pip install yake
!pip install symspellpy
!pip install pycuda
!pip install gputil
!pip install psutil
!pip install humanize
!pip install moviepy --pre --upgrade
!apt install ffmpeg
!pip install -U tqdm
!pip install -U neuspell
!pip install clean-text[gpl]

import math
import os
import pprint as pp
import shutil
import time
import re
from datetime import datetime
from os import listdir
from os.path import isfile, join

import librosa
import moviepy.editor as mp
import moviepy
import pandas as pd
import pkg_resources
import pysbd
import texthero as hero
import torch
import wordninja
import yake
from natsort import natsorted
from symspellpy import SymSpell
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import pycuda.driver as cuda
import psutil
import humanize
import GPUtil as GPU
from tqdm.auto import tqdm
import neuspell
from tqdm.auto import tqdm
from cleantext import clean

# Function Definitions


## generic

In [4]:
%%capture
# for display in jupyter notebooks / colab only
from IPython.display import HTML, display


def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )


get_ipython().events.register("pre_run_cell", set_css)

In [5]:
# define user functions


def increase_font():
    from IPython.display import Javascript

    display(
        Javascript(
            """
  for (rule of document.styleSheets[0].cssRules){
    if (rule.selectorText=='body') {
      rule.style.fontSize = '24px'
      break
    }
  }
  """
        )
    )


def reset_font():
    from IPython.display import Javascript

    display(
        Javascript(
            """
  for (rule of document.styleSheets[0].cssRules){
    if (rule.selectorText=='body') {
      rule.style.fontSize = '14px'
      break
    }
  }
  """
        )
    )


def corr(s):
    # adds space after period if there isn't one
    # removes extra spaces
    return re.sub(r"\.(?! )", ". ", re.sub(r" +", " ", s))


def cleantxt_wrap(ugly_text):
    # a wrapper for clean text with options different than default

    # https://pypi.org/project/clean-text/
    cleaned_text = clean(
        ugly_text,
        fix_unicode=True,  # fix various unicode errors
        to_ascii=True,  # transliterate to closest ASCII representation
        lower=True,  # lowercase text
        no_line_breaks=True,  # fully strip line breaks as opposed to only normalizing them
        no_urls=True,  # replace all URLs with a special token
        no_emails=True,  # replace all email addresses with a special token
        no_phone_numbers=True,  # replace all phone numbers with a special token
        no_numbers=False,  # replace all numbers with a special token
        no_digits=False,  # replace all digits with a special token
        no_currency_symbols=True,  # replace all currency symbols with a special token
        no_punct=True,  # remove punctuations
        replace_with_punct="",  # instead of removing punctuations you may replace them
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        replace_with_number="<NUM>",
        replace_with_digit="0",
        replace_with_currency_symbol="<CUR>",
        lang="en",  # set to 'de' for German special handling
    )

    return cleaned_text


def beautify_filename(filename, num_words=20, start_reverse=False, word_separator="_"):
    # takes a filename stored as text, removes extension, separates into X words ...
    # and returns a nice filename with the words separateed by
    # useful for when you are reading files, doing things to them, and making new files

    filename = str(filename)
    index_file_Ext = filename.rfind(".")
    current_name = str(filename)[:index_file_Ext]  # get rid of extension
    clean_name = cleantxt_wrap(current_name)  # wrapper with custom defs
    file_words = wordninja.split(clean_name)
    # splits concatenated text into a list of words based on common word freq
    if len(file_words) <= num_words:
        num_words = len(file_words)

    if start_reverse:
        t_file_words = file_words[-num_words:]
    else:
        t_file_words = file_words[:num_words]

    pretty_name = word_separator.join(t_file_words)  # see function argument

    # NOTE IT DOES NOT RETURN THE EXTENSION
    return pretty_name[
        : (len(pretty_name) - 1)
    ]  # there is a space always at the end, so -1


def quick_keys(
    filename, filepath, max_ngrams=3, num_keywords=20, save_db=False, verbose=False
):
    # uses YAKE to quickly determine keywords in a text file. Saves Keywords and YAKE score (0 means very important) in
    # an excel file (from a dataframe)
    # yes, the double entendre is intended.
    file = open(join(filepath, filename), "r", encoding="utf-8", errors="ignore")
    text = file.read()
    file.close()

    language = "en"
    deduplication_threshold = 0.3  # technically a hyperparameter
    custom_kw_extractor = yake.KeywordExtractor(
        lan=language,
        n=max_ngrams,
        dedupLim=deduplication_threshold,
        top=num_keywords,
        features=None,
    )
    yake_keywords = custom_kw_extractor.extract_keywords(text)
    phrase_db = pd.DataFrame(yake_keywords)
    if verbose:
        print("YAKE keywords are: \n", yake_keywords)
        print("dataframe looks like: \n")
        pp.pprint(phrase_db.head())

    if len(phrase_db) == 0:
        print("warning - no phrases were able to be extracted... ")
        return None

    phrase_db.columns = ["key_phrase", "YAKE_sore"]

    # add a column for how many words the phrases contain
    yake_kw_len = []
    yake_kw_freq = []
    for entry in yake_keywords:
        entry_wordcount = len(str(entry).split(" ")) - 1
        yake_kw_len.append(entry_wordcount)

    for index, row in phrase_db.iterrows():
        search_term = row["key_phrase"]
        entry_freq = text.count(str(search_term))
        yake_kw_freq.append(entry_freq)

    word_len_series = pd.Series(yake_kw_len, name="No. Words in Phrase")
    word_freq_series = pd.Series(yake_kw_freq, name="Phrase Freq. in Text")
    phrase_db2 = pd.concat([phrase_db, word_len_series, word_freq_series], axis=1)
    # add column names and save file as excel because CSVs suck
    phrase_db2.columns = [
        "key_phrase",
        "YAKE Score (Lower = More Important)",
        "num_words",
        "freq_in_text",
    ]

    if save_db:
        # saves individual file if user asks
        yake_fname = (
            beautify_filename(filename=filename, start_reverse=False)
            + "_top_phrases_YAKE.xlsx"
        )
        phrase_db2.to_excel(join(filepath, yake_fname), index=False)

    # print out top 10 keywords, or if desired num keywords less than 10, all of them
    max_no_disp = 10
    if num_keywords > max_no_disp:
        num_phrases_disp = max_no_disp
    else:
        num_phrases_disp = num_keywords

    if verbose:
        print("Top Key Phrases from YAKE, with max n-gram length: ", max_ngrams, "\n")
        pp.pprint(phrase_db2.head(n=num_phrases_disp))
    else:
        list_o_words = phrase_db2["key_phrase"].to_list()
        print("top 5 phrases are: \n")
        if len(list_o_words) < 5:
            pp.pprint(list_o_words)
        else:
            pp.pprint(list_o_words[:5])

    return phrase_db2


def digest_txt_directory(file_directory, identifer="", verbose=False, make_folder=True):
    run_date = datetime.now()
    files_to_merge = natsorted(
        [
            f
            for f in listdir(file_directory)
            if isfile(join(file_directory, f)) & f.endswith(".txt")
        ]
    )
    outfilename = (
        "Zealous_MERGED_words_" + identifer + run_date.strftime("_%d%m%Y_%H") + ".txt"
    )

    og_wd = os.getcwd()
    os.chdir(file_directory)

    if make_folder:
        folder_name = "merged_txt_files"
        if not os.path.isdir(join(file_directory, folder_name)):
            os.mkdir(
                join(file_directory, folder_name)
            )  # make a place to store outputs if one does not exist
        output_loc = join(file_directory, folder_name)

        outfilename = join(folder_name, outfilename)

        if verbose:
            print("created new folder. new full path is: \n", output_loc)

    count = 0
    with open(outfilename, "w") as outfile:

        for names in files_to_merge:

            with open(names) as infile:
                count += 1
                outfile.write("Start of: " + names + "\n")
                outfile.writelines(infile.readlines())

            outfile.write("\n")

    print("Merged {} text files together.".format(count))
    if verbose:
        print("the merged file is located at: \n", os.getcwd())
    os.chdir(og_wd)


def validate_output_directories(directory, verbose=False):

    # checks and creates folders

    t_folder_name = "wav2vec2_sf_transcript"
    m_folder_name = "wav2vec2_sf_metadata"

    # check if transcription folder exists. If not, create it
    if not os.path.isdir(join(directory, t_folder_name)):
        os.mkdir(
            join(directory, t_folder_name)
        )  # make a place to store outputs if one does not exist

        if verbose:
            print(
                "needed to create the folder: {}. Its location is: \n".format(
                    t_folder_name
                ),
                join(directory, t_folder_name),
            )
    t_path_full = join(directory, t_folder_name)

    # check if metadata folder exists. If not, create it
    if not os.path.isdir(join(directory, m_folder_name)):
        os.mkdir(
            join(directory, m_folder_name)
        )  # make a place to store outputs if one does not exist

        if verbose:
            print(
                "needed to create the folder: {}. Its location is: \n".format(
                    m_folder_name
                ),
                join(directory, m_folder_name),
            )

    m_path_full = join(directory, m_folder_name)

    output_locs = {"t_out": t_path_full, "m_out": m_path_full}

    return output_locs


def move2completed(from_dir, filename, new_folder="completed", verbose=False):

    # this is the better version
    old_filepath = join(from_dir, filename)

    new_filedirectory = join(from_dir, new_folder)

    if not os.path.isdir(new_filedirectory):
        os.mkdir(new_filedirectory)
        if verbose:
            print("created new directory for files at: \n", new_filedirectory)

    new_filepath = join(new_filedirectory, filename)

    try:
        shutil.move(old_filepath, new_filepath)
        print("successfully moved the file {} to */completed.".format(filename))
    except:
        print(
            "ERROR! unable to move file to \n{}. Please investigate".format(
                new_filepath
            )
        )


print("loaded all generic functions at: ", datetime.now())

# create time log

time_log = []
time_log_desc = []
time_log.append(time.perf_counter())
time_log_desc.append("start")

loaded all generic functions at:  2021-08-17 05:16:03.529439


## hardware monitoring

In [6]:
def clear_GPU_cache(verbose=False):

    GPUs = GPU.getGPUs()

    if len(GPUs) > 0:
        check_runhardware_torch()
        torch.cuda.empty_cache()
        print("\nchecked and cleared cache")
    else:
        print("\nNo GPU being used :( time = ", datetime.now())
    if verbose:
        print("-----------End of Cache Clear----------------")


print("loaded all hardware functions at: ", datetime.now())


def check_runhardware_torch(verbose=False):
    # https://www.run.ai/guides/gpu-deep-learning/pytorch-gpu/

    GPUs = GPU.getGPUs()

    if len(GPUs) > 0:
        if verbose:
            print("\n ------------------------------")
            print("Checking CUDA status for PyTorch")

        torch.cuda.init()

        print("Cuda availability (PyTorch): ", torch.cuda.is_available())

        # Get Id of default device
        torch.cuda.current_device()
        if verbose:
            print(
                "Name of GPU: ", torch.cuda.get_device_name(device=0)
            )  # '0' is the id of your GPU
            print("------------------------------\n")
        return True

    else:
        print("No GPU being used :(")
        return False


def torch_validate_cuda(verbose=False):
    GPUs = GPU.getGPUs()
    num_gpus = len(GPUs)
    try:
        torch.cuda.init()
        if not torch.cuda.is_available():
            print(
                "WARNING - CUDA is not being used in processing - expect longer runtime"
            )
            if verbose:
                print("GPU util detects {} GPUs on your system".format(num_gpus))
    except:
        print(
            "WARNING - unable to start CUDA. If you wanted to use a GPU, exit and check hardware."
        )


def check_runhardware(verbose=False):
    # ML package agnostic hardware check
    GPUs = GPU.getGPUs()

    if verbose:
        print("\n ------------------------------")
        print("Checking hardware with psutil")
    try:
        gpu = GPUs[0]
    except:
        if verbose:
            print("GPU not available - ", datetime.now())
        gpu = None
    process = psutil.Process(os.getpid())

    CPU_load = psutil.cpu_percent()
    if CPU_load > 0:
        cpu_load_string = "loaded at {} % |".format(CPU_load)
    else:
        # the first time process.cpu_percent() is called it returns 0 which can be confusing
        cpu_load_string = "|"
    print(
        "\nGen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
        " | Proc size: " + humanize.naturalsize(process.memory_info().rss),
        " | {} CPUs ".format(psutil.cpu_count()),
        cpu_load_string,
    )

    if len(GPUs) > 0 and GPUs is not None:
        print(
            "GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB\n".format(
                gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal
            )
        )
    else:
        print("No GPU being used :(", "\n-----------------\n")


def only_clear_GPU_cache(verbose=False):

    GPUs = GPU.getGPUs()

    if len(GPUs) > 0:
        torch.cuda.empty_cache()
        if verbose:
            print("\nchecked and cleared cache")
    else:
        print("\nClearCache - No GPU being used :( time = ", datetime.now())


# create time log

time_log.append(time.perf_counter())
time_log_desc.append("loaded hardware functions")

loaded all hardware functions at:  2021-08-17 05:16:03.616676


## for video conversion / transcription

In [7]:
def convert_vidfile(
    vidfilename,
    start_time=0,
    end_time=6969,
    input_directory="",
    output_directory="",
    new_filename="",
):
    # takes a video file and creates an audiofile with various parameters
    # NOTE video filename is required
    if len(input_directory) < 1:
        my_clip = mp.VideoFileClip(vidfilename)
    else:
        my_clip = mp.VideoFileClip(join(input_directory, vidfilename))

    if end_time == 6969:
        modified_clip = my_clip.subclip(t_start=int(start_time * 60))
    else:
        modified_clip = my_clip.subclip(
            t_start=int(start_time * 60), t_end=int(end_time * 60)
        )

    converted_filename = (
        vidfilename[: (len(vidfilename) - 4)]
        + "-converted_"
        + datetime.now().strftime("day_%d_time_%H-%M-%S_")
        + ".wav"
    )
    # update_filename
    if len(new_filename) > 0:
        converted_filename = new_filename

    if len(output_directory) < 1:
        modified_clip.audio.write_audiofile(converted_filename)
    else:
        # removed 'verbose=False,' from argument of function (removed from Dev)
        modified_clip.audio.write_audiofile(
            join(output_directory, converted_filename), logger=None
        )

    audio_conv_results = {
        "output_filename": converted_filename,
        "output_folder": output_directory,
        "clip_length": modified_clip.duration,
    }

    return audio_conv_results


def convert_vid_for_transcription(
    vid2beconv, len_chunks, input_directory, output_directory
):
    # Oriented specifically for the "wav2vec2" model speech to text transcription
    # takes a video file, turns it into .wav audio chunks of length <input> and stores them in a specific location
    # TODO: add try/except clause in case the user already has an audio file the want to transcribe
    my_clip = mp.VideoFileClip(join(input_directory, vid2beconv))
    number_of_chunks = math.ceil(my_clip.duration / len_chunks)  # to get in minutes
    print("converting into " + str(number_of_chunks) + " audio chunks")
    preamble = beautify_filename(vid2beconv)
    outfilename_storage = []
    print(
        "separating audio into chunks starting at ",
        datetime.now().strftime("_%H.%M.%S"),
    )
    update_incr = math.ceil(number_of_chunks / 10)

    for i in tqdm(
        range(number_of_chunks),
        desc="Converting Video to Audio Chunks",
        total=number_of_chunks,
    ):

        start_time = i * len_chunks
        if i == number_of_chunks - 1:
            this_clip = my_clip.subclip(t_start=start_time)
        else:
            this_clip = my_clip.subclip(
                t_start=start_time, t_end=(start_time + len_chunks)
            )
        this_filename = preamble + "_run_" + str(i) + ".wav"
        outfilename_storage.append(this_filename)

        if this_clip.audio is not None:
            # removed 'verbose=False,' from argument of function (removed from Dev)
            this_clip.audio.write_audiofile(
                join(output_directory, this_filename), logger=None
            )
        else:
            print("\n WARNING: chunk {} is empty / has no audio".format(i))

    print("Finished creating audio chunks at ", datetime.now().strftime("_%H.%M.%S"))
    print("Files are located in ", output_directory)
    return outfilename_storage


def transcribe_video_wav2vec(
    transcription_model, directory, vid_clip_name, chunk_length_seconds
):
    # this is the same process as used in the single video transcription, now as a function. Note that spell correction
    # and keyword extraction are now done separately in the script
    # user needs to pass in: the model, the folder the video is in, and the name of the video
    output_path_full = directory

    # Split Video into Audio Chunks-----------------------------------------------

    print("\n============================================================")
    print("Converting video to audio for file: ", vid_clip_name)
    print("============================================================\n")

    # create audio chunk folder
    output_folder_name = "audio_chunks"
    if not os.path.isdir(join(directory, output_folder_name)):
        os.mkdir(
            join(directory, output_folder_name)
        )  # make a place to store outputs if one does not exist
    path2audiochunks = join(directory, output_folder_name)
    chunk_directory = convert_vid_for_transcription(
        vid2beconv=vid_clip_name,
        input_directory=directory,
        len_chunks=chunk_length_seconds,
        output_directory=path2audiochunks,
    )

    print("\n============================================================")
    print(
        "converted video to audio. About to start transcription loop for file: ",
        vid_clip_name,
    )
    print("============================================================\n")
    torch_validate_cuda()
    check_runhardware()
    time_log.append(time.perf_counter())
    time_log_desc.append("converted video to audio")
    full_transcription = []
    before_loop_st = time.perf_counter()
    GPU_update_incr = math.ceil(len(chunk_directory) / 2)

    # Load audio chunks by name, pass into model, append output text-----------------------------------------------

    for audio_chunk in tqdm(
        chunk_directory,
        total=len(chunk_directory),
        desc="wav2vec2 model for " + vid_clip_name,
    ):

        current_loc = chunk_directory.index(audio_chunk)
        if (current_loc % GPU_update_incr == 0) and (GPU_update_incr != 0):
            # provide update on GPU usage
            check_runhardware()

        # load dat chunk
        audio_input, rate = librosa.load(join(path2audiochunks, audio_chunk), sr=16000)
        # MODEL
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        input_values = tokenizer(
            audio_input, return_tensors="pt", padding="longest", truncation=True
        ).input_values.to(device)
        transcription_model = transcription_model.to(device)
        logits = transcription_model(input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = str(tokenizer.batch_decode(predicted_ids)[0])
        full_transcription.append(transcription + "\n")
        # empty memory so you don't overload the GPU
        del input_values
        del logits
        del predicted_ids
        del audio_input
        torch.cuda.empty_cache()

    print(
        "\nFinished audio transcription of "
        + vid_clip_name
        + " and now saving metrics."
    )

    # build metadata log -------------------------------------------------
    mdata = []
    mdata.append("original file name: " + vid_clip_name + "\n")
    mdata.append(
        "number of recorded audio chunks: "
        + str(len(chunk_directory))
        + " of lengths seconds each"
        + str(chunk_length_seconds)
        + "\n"
    )
    approx_input_len = (len(chunk_directory) * chunk_length_seconds) / 60
    mdata.append(
        "approx {0:3f}".format(approx_input_len) + " minutes of input audio \n"
    )
    mdata.append(
        "transcription date: "
        + datetime.now().strftime("date_%d_%m_%Y_time_%H-%M-%S")
        + "\n"
    )
    full_text = " ".join(full_transcription)
    transcript_length = len(full_text)
    mdata.append(
        "length of transcribed text: " + str(transcript_length) + " characters \n"
    )
    t_word_count = len(full_text.split(" "))
    mdata.append(
        "total word count: " + str(t_word_count) + " words (based on spaces) \n"
    )

    # delete audio chunks in folder -------------------------------------------------
    # TODO: add try/except for deleting folder as not technically needed to achieve goal
    shutil.rmtree(path2audiochunks)
    print("\nDeleted Audio Chunk Folder + Files")

    # compile results -------------------------------------------------
    transcription_results = {
        "audio_transcription": full_transcription,
        "metadata": mdata,
    }

    print(
        "\nFinished transcription successfully for "
        + vid_clip_name
        + " at "
        + datetime.now().strftime("date_%d_%m_%Y_time_%H-%M-%S")
    )
    return transcription_results


print("loaded all transcription specific functions at: ", datetime.now())

loaded all transcription specific functions at:  2021-08-17 05:16:03.870653


## spell correction

In [8]:
%%capture
# neuspell prints out the whole upon definition, hence ^ jupyter magic

def symspell_file(filepath, filename, dist=2, keep_numb_words=True, create_folder=True, save_metrics=False,
                  verbose=False):
    # given a text (has to be text) file, reads the file, autocorrects any words it deems misspelled, saves as new file
    # it can store the new file in a sub-folder it creates as needed
    # distance represents how far it searches for a better spelling. higher dist = higher RT.
    # https://github.com/mammothb/symspellpy

    script_start_time = time.perf_counter()
    sym_spell = SymSpell(max_dictionary_edit_distance=dist, prefix_length=7)
    print("\nPySymSpell - Starting to correct the file: ", filename)
    # ------------------------------------

    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
    bigram_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_bigramdictionary_en_243_342.txt")
    # term_index is the column of the term and count_index is the
    # column of the term frequency
    sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
    sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

    # ------------------------------------
    file = open(join(filepath, filename), 'r', encoding="utf-8", errors='ignore')
    textlines = file.readlines()  # return a list
    file.close()

    if create_folder:
        # create a folder
        output_folder_name = "auto-corrected" 
        if not os.path.isdir(join(filepath, output_folder_name)):
            os.mkdir(join(filepath, output_folder_name))  # make a place to store outputs if one does not exist
        filepath = join(filepath, output_folder_name)

    if verbose:
        print("loaded text with {0:6d} lines ".format(len(textlines)))

    corrected_list = []

    # iterate through list of lines. Pass each line to be corrected. 
    #Append / sum results from each line till done
    for line in textlines:
        if line == "":
            # blank line, skip to next run
            continue

        # correct the line of text using spellcorrect_line() which returns a dictionary
        suggestions = sym_spell.lookup_compound(phrase=line, max_edit_distance=dist, 
                                                ignore_non_words=keep_numb_words,
                                                ignore_term_with_digits=keep_numb_words)
        all_sugg_for_line = []
        for suggestion in suggestions:
            all_sugg_for_line.append(suggestion.term)

        # append / sum / log results from correcting the line

        corrected_list.append(' '.join(all_sugg_for_line) + "\n")

    # finished iterating through lines. Now sum total metrics

    corrected_doc = "".join(corrected_list)
    corrected_fname = "Corrected_SSP_" + beautify_filename(filename, 
                                                           num_words=10, start_reverse=False) + ".txt"

    # proceed to saving
    file_out = open(join(filepath, corrected_fname), 'w',
                    encoding="utf-8", errors='ignore')
    file_out.writelines(corrected_doc)
    file_out.close()

    # report RT
    if verbose:
        script_rt_m = (time.perf_counter() - script_start_time) / 60
        print("RT for this file was {0:5f} minutes".format(script_rt_m))
        print("output folder for this transcription is: \n", 
              filepath)

    print("Done correcting ", filename, " at time: ", 
          datetime.now().strftime("%H:%M:%S"), "\n")

    corr_file_Data = {
        "corrected_ssp_text": corrected_doc,
        "corrected_ssp_fname": corrected_fname,
        "output_path": filepath,
    }
    return corr_file_Data


# preload defaults
sym_spell = SymSpell(max_dictionary_edit_distance=3, prefix_length=7)

dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
bigram_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_bigramdictionary_en_243_342.txt")
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

print("loaded defaults - ", datetime.now())

def symspell_freetext(textlines, dist=3, keep_numb_words=True, verbose=False,
                      d_path=dictionary_path, b_path=bigram_path, default=sym_spell):
    # https://github.com/mammothb/symspellpy

    if dist != 3:

        # have to recreate object each time because doesn't match pre-built

        sym_spell = SymSpell(max_dictionary_edit_distance=dist, prefix_length=7)
        sym_spell.load_dictionary(d_path, term_index=0, count_index=1)
        sym_spell.load_bigram_dictionary(b_path, term_index=0, count_index=2)
    else:
        sym_spell=default

    corrected_list = []

    if type(textlines) == str:
        textlines = [textlines] # put in a list if a string

    if verbose:
        print("\nStarting to correct text with {0:6d} lines ".format(len(textlines)))
        print("the type of textlines var is ",type(textlines))

    # iterate through list of lines. Pass each line to be corrected. Append / sum results from each line till done
    for line_obj in textlines:
        line = ''.join(line_obj) 
        if verbose:
            print("line {} in the text is: ".format(textlines.index(line_obj)))
            pp.pprint(line) 
        if line == "":
            # blank line, skip to next run
            continue

        suggestions = sym_spell.lookup_compound(phrase=line, max_edit_distance=dist, 
                                                ignore_non_words=keep_numb_words,
                                                ignore_term_with_digits=keep_numb_words)
        all_sugg_for_line = []
        for suggestion in suggestions:
            all_sugg_for_line.append(suggestion.term)

        # append / sum / log results from correcting the line

        corrected_list.append(' '.join(all_sugg_for_line) + "\n")

    # join corrected text

    corrected_text = "".join(corrected_list)

    if verbose:
        print("Finished correcting w/ symspell at time: ", datetime.now(), "\n")

    return corrected_text


# START OF NEUSPELL

checker = neuspell.SclstmbertChecker()
checker.from_pretrained()

def neuspell_freetext(textlines, verbose=False):

    corrected_list = []

    if type(textlines) == str:
        textlines = [textlines] # put in a list if a string

    # iterate through list of lines. Pass each line to be corrected. Append / sum results from each line till done
    for line_obj in textlines:
        line = ''.join(line_obj) 

        if verbose:
            print("line {} in the text is: ".format(textlines.index(line_obj)))
            pp.pprint(line) 
        if line == "" or (len(line) <= 5):
            # blank line, skip to next run
            continue

        line = line.lower()
        corrected_text = checker.correct_strings([line])
        corrected_text_f = " ".join(corrected_text)

        corrected_list.append(corrected_text_f + "\n")

    # join corrected text

    corrected_text = " ".join(corrected_list)

    if verbose:
        print("Finished correcting w/ neuspell at time: ", datetime.now(), "\n")

    return corrected_text

def SBD_freetext(text, verbose=False):
    # input should be STRING
    # use pysbd to segment 

    if isinstance(text, list):
        print("Warning, input ~text~ has type {}. Will convert to str".format(type(text)))
        text = " ".join(text)

    seg = pysbd.Segmenter(language="en", clean=True)
    sentences = []
    sentences = seg.segment(text)

    if verbose:
        print("input text of {} words was split into ".format(len(text.split(" "))),
              len(sentences), "sentences")

    # take segments and make them sentences
    
    capitalized = []
    for sentence in sentences:
        if sentence and sentence.strip():
            # ensure that the line is not all spaces
            first_letter = sentence[0].upper()
            rest = sentence[1:]
            capitalized.append(first_letter + rest)

    seg_and_capital = ". ".join(capitalized)

    return seg_and_capital

print("loaded all spell correction functions at: ", datetime.now())



# Load Files and Model

Note that you can also connect to a google drive folder if you want to transcribe a large video file or several video files (described in the "multi" script [here](https://colab.research.google.com/drive/1UMCSh9XdvUABjDJpFUrHPj4uy3Cc26DC?usp=sharing)

The code to do so would be as follows:

```
# create interface to upload / interact with google drive and video files

from google.colab import files
from google.colab import drive
drive.mount('/content/drive')
# google will ask you to click link, approve, and paste code

# after authentication, you can work using the path "/content/drive/My Drive"
# if it works it will say "Mounted at /content/drive"

# part 2: specify where in the drive the files are located

filename = "President John F. Kennedy's Peace Speech.mp4"
filepath = "/content/drive/My Drive/Programming/vid2cleantxt_colabfiles"

print('Will use the following as directory/file: ')
pp.pprint(''.join([filepath, filename]))

```

** **

## Instructions P1: Input File Details

Specify the URL of the file you want to transcribe. This script downloads the file from the vid2clntext github repo and saves it to the VM's working directory using the **requests** library. 

- the URL can be changed to anything as long as it downloads a video file (and the filename is updated as relevant)

<font color='orange'> update the **input_path** variable to a custom filepath if desired (i.e. you are running this locally) </font>

In [9]:
filename = "JFK_rice_moon_speech.mp4"  # @param {type:"string"}
URL = "https://github.com/pszemraj/vid2cleantxt/raw/master/example_JFK_speech/TEST_folder_edition/JFK_rice_moon_speech.mp4"  # @param {type:"string"}

In [10]:
# using requests
import requests

file_loc = join(os.getcwd(), "transcription")
os.makedirs(file_loc, exist_ok=True)
input_path = os.path.join(file_loc, filename)  # filename taken from above
print("starting to download and save file ")
r = requests.get(URL, allow_redirects=True)
open(input_path, "wb").write(r.content)  # URL taken from above input
print("successfully saved ", filename, " - ", datetime.now())

starting to download and save file 
successfully saved  JFK_rice_moon_speech.mp4  -  2021-08-17 05:17:23.657168


## Instructions P2: Update chunk_length

Update the variable 'chunk_length' to your use case. A good value is one that doesn't cause Colab to crash and is greater than a sentence length (for context, grammar purposes).

 

- If Colab is using a GPU, 20 seconds should be fine. If Colab is only able to use a CPU, may need to be decreased. 

- You can use the `!nvidia-smi` command in a cell to check GPU status.


---

*NOTE* recommended value for `chunk_length`  depends on which GPU Colab assigns you. A chunk length of 20 works on a Tesla K-80, which is typically what the free version gets allocated. A chunk length of 30 works on a Tesla P-100 16gb, standard for Colab Pro. 

In [11]:
chunk_length = 20  # @param {type:"number"}

In [12]:
# load huggingface model
time_log.append(time.perf_counter())
time_log_desc.append("starting to load model")

# load pretrained model
wav2vec2_model = "facebook/wav2vec2-large-960h-lv60-self"
# wav2vec2_model = "facebook/wav2vec2-base-960h" # faster+smaller, less accurate
print("\nPreparing to load model: " + wav2vec2_model)
tokenizer = Wav2Vec2Tokenizer.from_pretrained(wav2vec2_model)
model = Wav2Vec2ForCTC.from_pretrained(wav2vec2_model)

# (in seconds) if model fails to work or errors out (and there isn't some other
# obvious error, reduce chunk_length.

print("loaded the following model:", wav2vec2_model, " at ", datetime.now())
time_log.append(time.perf_counter())
time_log_desc.append("loaded model")


Preparing to load model: facebook/wav2vec2-large-960h-lv60-self


Downloading:   0%|          | 0.00/291 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/162 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.61k [00:00<?, ?B/s]


The class `Wav2Vec2Tokenizer` is deprecated and will be removed in version 5 of Transformers. Please use `Wav2Vec2Processor` or `Wav2Vec2CTCTokenizer` instead.



Downloading:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

loaded the following model: facebook/wav2vec2-large-960h-lv60-self  at  2021-08-17 05:18:06.131490


# Run Transformer Model (wav2vec2)

In [13]:
# load videos, run through the model
st = time.perf_counter()
time_log.append(st)
time_log_desc.append("starting transcription")

t_results = transcribe_video_wav2vec(
    transcription_model=model,
    directory=file_loc,
    vid_clip_name=filename,
    chunk_length_seconds=chunk_length,
)
end_t = time.perf_counter()
time_log.append(end_t)
time_log_desc.append("finished transcription")

# t_results is a dictonary containing the transcript and associated metadata
full_transcription = t_results.get("audio_transcription")
metadata = t_results.get("metadata")

print("completed transcription in {} minutes".format(round((end_t - st) / 60, 2)))


Converting video to audio for file:  JFK_rice_moon_speech.mp4

converting into 9 audio chunks
separating audio into chunks starting at  _05.18.07


Converting Video to Audio Chunks:   0%|          | 0/9 [00:00<?, ?it/s]

Finished creating audio chunks at  _05.18.11
Files are located in  /content/transcription/audio_chunks

converted video to audio. About to start transcription loop for file:  JFK_rice_moon_speech.mp4


Gen RAM Free: 24.3 GB  | Proc size: 6.2 GB  | 4 CPUs  loaded at 20.8 % |
GPU RAM Free: 16278MB | Used: 2MB | Util   0% | Total 16280MB



wav2vec2 model for JFK_rice_moon_speech.mp4:   0%|          | 0/9 [00:00<?, ?it/s]


Gen RAM Free: 24.3 GB  | Proc size: 6.2 GB  | 4 CPUs  loaded at 39.1 % |
GPU RAM Free: 16278MB | Used: 2MB | Util   0% | Total 16280MB


Gen RAM Free: 24.1 GB  | Proc size: 6.6 GB  | 4 CPUs  loaded at 20.6 % |
GPU RAM Free: 14113MB | Used: 2167MB | Util  13% | Total 16280MB


Finished audio transcription of JFK_rice_moon_speech.mp4 and now saving metrics.

Deleted Audio Chunk Folder + Files

Finished transcription successfully for JFK_rice_moon_speech.mp4 at date_17_08_2021_time_05-18-31
completed transcription in 0.42 minutes


# Post-Transcription

## Spell Check, SBD, Keywords

If you got to here, your colab file was able to run the model and transcribe it. Now a little cleaning up, then done.

In [14]:
# create output locations and store full transcription

time_log.append(time.perf_counter())
time_log_desc.append("starting saving output files")

# check if directories for output exist. If not, create them
storage_locs = validate_output_directories(file_loc)
output_path_transcript = storage_locs.get("t_out")
output_path_metadata = storage_locs.get("m_out")

# label and store this transcription
vid_preamble = beautify_filename(filename, num_words=15, start_reverse=False)

# transcription
transcribed_filename = (
    vid_preamble + "_tscript_" + datetime.now().strftime("_%H.%M.%S") + ".txt"
)
tscript_path = join(output_path_transcript, transcribed_filename)
with open(tscript_path, "w", encoding="utf-8", errors="ignore") as tfile:
    tfile.writelines(full_transcription)
# metadata
metadata_filename = "metadata for " + vid_preamble + " transcription.txt"
meta_path = join(output_path_metadata, metadata_filename)
with open(meta_path, "w", encoding="utf-8", errors="ignore") as mfile:
    mfile.writelines(metadata)

print("saved files at the following locations")
print("transcript at: " + output_path_transcript)
print("metadata at: " + output_path_metadata)
time_log.append(time.perf_counter())
time_log_desc.append("saved output files to local runtime")

saved files at the following locations
transcript at: /content/transcription/wav2vec2_sf_transcript
metadata at: /content/transcription/wav2vec2_sf_metadata


### Spell Correction

**Note that symspell is used here for the spelling + grammar checker instead of Neuspell.** If you wish to use the better version with Neuspell, that is implemented as an example [here](https://colab.research.google.com/drive/1qOUkiPMaUZgBTMfCFF-fCRTPCMg1997J?usp=sharing)
 

In [15]:
# spell correction, sentence disambiguation, and keyword extraction

# Go through base transcription files and spell correct them and get keywords
print("\n Starting to spell-correct and extract keywords\n")
seg = pysbd.Segmenter(language="en", clean=True)
tf_pretty_name = beautify_filename(
    transcribed_filename, start_reverse=False, num_words=10
)
# auto-correct spelling (wav2vec2 doesn't enforce spelling on its output)
corr_results_fl = symspell_file(
    filepath=output_path_transcript,
    filename=transcribed_filename,
    keep_numb_words=True,
    create_folder=True,
    dist=2,
)
output_path_impr = corr_results_fl.get("output_path")

# Write version of transcription with sentences / boundaries inferred with periods. All text in one line
seg_list = seg.segment(corr_results_fl.get("corrected_ssp_text"))
seg_text = ". ".join(seg_list)
seg_outname = "SegTEXT " + tf_pretty_name + ".txt"
seg_text_path = join(output_path_impr, seg_outname)
with open(seg_text_path, "w", encoding="utf-8", errors="ignore") as file_seg:
    file_seg.write(seg_text)

# extract keywords from transcription (once spell-corrected)
key_phr_fl = quick_keys(
    filepath=output_path_impr,
    filename=corr_results_fl.get("corrected_ssp_fname"),
    num_keywords=50,
    max_ngrams=3,
    save_db=False,
)
key_phr_fl.to_excel(
    os.path.join(
        output_path_transcript, tf_pretty_name + "YAKE_extracted_keywords.xlsx"
    )
)

time_log.append(time.perf_counter())
time_log_desc.append("transcription spell-corrected + keywords extracted")


 Starting to spell-correct and extract keywords


PySymSpell - Starting to correct the file:  jfk_rice_moon_speec_tscript__05.18.31.txt
Done correcting  jfk_rice_moon_speec_tscript__05.18.31.txt  at time:  05:18:36 

top 5 phrases are: 

['forty thousand miles',
 'space promise high',
 'hundred feet tall',
 'propulsion guidance control',
 'guidance control communications']


## location of "final" transcription

In [16]:
increase_font()
print(
    "The final transcription outputs (i.e. the best / fully corrected) are here: \n",
    output_path_impr,
)

<IPython.core.display.Javascript object>

The final transcription outputs (i.e. the best / fully corrected) are here: 
 /content/transcription/wav2vec2_sf_transcript/auto-corrected


## Download generated files

In [17]:
from google.colab import files

if want_to_download_results:
    zip_dir = join(os.getcwd(), "zipped_outputs")
    os.makedirs(zip_dir, exist_ok=True)
    output_header = "vid2cleantxt_output_" + datetime.now().strftime("%d%m%Y")
    shutil.make_archive(join(zip_dir, output_header), "zip", file_loc)
    files.download(join(zip_dir, output_header + ".zip"))
    print("download started - ", datetime.now())
else:
    print("want_to_download_results is set to: ", want_to_download_results)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

download started -  2021-08-17 05:18:38.981721


## Log & Exit

In [18]:
print("\n\n----------------- Script Complete -----------------")
print("time of completion block: ", datetime.now())
print("Transcription file + more can be found here: ", output_path_transcript)
print("Metadata for each transcription is located: ", output_path_metadata)
time_log.append(time.perf_counter())
time_log_desc.append("End")
# save runtime database
time_records_db = pd.DataFrame(
    list(zip(time_log_desc, time_log)), columns=["Event", "Time (sec)"]
)
time_records_db.to_excel(
    join(output_path_metadata, tf_pretty_name + "transcription_time_log.xlsx")
)
# total
print("total runtime was {0:3f}".format((time_log[-1] - time_log[0]) / 60), " minutes")



----------------- Script Complete -----------------
time of completion block:  2021-08-17 05:18:39.003680
Transcription file + more can be found here:  /content/transcription/wav2vec2_sf_transcript
Metadata for each transcription is located:  /content/transcription/wav2vec2_sf_metadata
total runtime was 2.591219  minutes
