<a href="https://colab.research.google.com/github/pszemraj/vid2cleantxt/blob/fix-neuspell/colab_notebooks/vid2cleantext_multi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# vid2cleantxt - Multi file version on Colab

> PURPOSE: transcribe a series of media files to text from either a URL to a zip file containing said media or directly from a Google Drive folder.

- developed as part of the [vid2cleantxt](https://github.com/pszemraj/vid2cleantxt) repo
- by [Peter Szemraj](https://github.com/pszemraj)

---



In [1]:
#@title print out GPU info
#@markdown this is the Colab-allocated GPU. 
#@markdown - <font color="orange"> If the output here says it fails, no
#@markdown GPU is being used. go to runtime at the top of your colab to set runtime to GPU.
#@markdown - To change runtime, go to Runtime->Change Runtime Type and set Hardware Acceleration to GPU </font>


!nvidia-smi

Thu Feb 24 00:40:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# setup

In [2]:
#@markdown add auto-Colab formatting with `IPython.display`
from IPython.display import HTML, display
# colab formatting
def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )

get_ipython().events.register("pre_run_cell", set_css)

In [3]:
#@title setup inputs and outputs
#@markdown - <font color="orange"> **This is where you setup the input media. This cell will prompt you to link your google drive.** </font>

#@markdown - if `use_url` is True, it assumes that the media is provided in a link that it 
#@markdown is supposed to download. 
#@markdown - if `use_url` is False, it assumes that the media is in the `directory` folder
#@markdown of your google drive.
#@markdown - output files are by default saved to the `directory` folder in your Drive, so ensure it is defined.
#@markdown - optionally, set `download_output_files` to `True` to download everything
#@markdown as a zip file. 
import os
from os.path import join
from google.colab import files
from google.colab import drive

drive.mount("/content/drive")

directory = "/content/drive/MyDrive/Programming/vid2cleantxt/test"  # @param {type:"string"}
# set to false if you don't want it to download a zipped file of all the text
download_output_files = True  # @param {type:"boolean"}
use_url = True  # @param {type:"boolean"}
URL_of_media = "https://www.dropbox.com/sh/0u3easov5bygo5l/AACXWZ_gXhAsSTHg64FYsjf5a?dl=1"  # @param {type:"string"}
#@markdown note that `URL_of_media` assumes the media files are contained in a `zip` file
# specify file inputs

print("Will use the following as directory/file: ")
print(directory)

URL_save_folder = join(os.getcwd(), "downloaded_media")
os.makedirs(URL_save_folder, exist_ok=True)

if use_url:
    print("Using the URL as the source for video files.")
    print("Videos will be saved here: \n{}".format(URL_save_folder))
from datetime import datetime

run_start = datetime.now()
tag_date = "started_" + run_start.strftime("%m/%d/%Y, %H-%M")

Mounted at /content/drive
Will use the following as directory/file: 
/content/drive/MyDrive/Programming/vid2cleantxt/test
Using the URL as the source for video files.
Videos will be saved here: 
/content/downloaded_media


# Install, Import 

- imports and installs may take several minutes.

In [4]:
#@title set torch version
#@markdown fix any compatibility issues with a100 GPU
!pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html -q
!pip install https://storage.googleapis.com/jax-releases/cuda111/jaxlib-0.1.71+cuda111-cp37-none-manylinux2010_x86_64.whl -q

# see this issue https://github.com/googlecolab/colabtools/issues/2452 for colab A100 GPU

[K     |██████████████▋                 | 834.1 MB 117.9 MB/s eta 0:00:09tcmalloc: large alloc 1147494400 bytes == 0x55666b846000 @  0x7fd9fa702615 0x5566324603bc 0x55663254118a 0x5566324631cd 0x556632555b3d 0x5566324d7458 0x5566324d202f 0x556632464aba 0x5566324d72c0 0x5566324d202f 0x556632464aba 0x5566324d3cd4 0x556632556986 0x5566324d3350 0x556632556986 0x5566324d3350 0x556632556986 0x5566324d3350 0x556632464f19 0x5566324a8a79 0x556632463b32 0x5566324d71dd 0x5566324d202f 0x556632464aba 0x5566324d3cd4 0x5566324d202f 0x556632464aba 0x5566324d2eae 0x5566324649da 0x5566324d3108 0x5566324d202f
[K     |██████████████████▌             | 1055.7 MB 1.2 MB/s eta 0:10:23tcmalloc: large alloc 1434370048 bytes == 0x5566afe9c000 @  0x7fd9fa702615 0x5566324603bc 0x55663254118a 0x5566324631cd 0x556632555b3d 0x5566324d7458 0x5566324d202f 0x556632464aba 0x5566324d72c0 0x5566324d202f 0x556632464aba 0x5566324d3cd4 0x556632556986 0x5566324d3350 0x556632556986 0x5566324d3350 0x556632556986 0x5566324d335

In [5]:
%%capture
#@markdown import / install libraries as needed

!pip install pysbd
!pip install -U transformers
!pip install wordninja
!pip install yake
!pip install symspellpy
!pip install pycuda
!pip install gputil
!pip install humanize
!pip install -U plotly
!pip install moviepy --pre --upgrade
!apt install ffmpeg
!pip install -U tqdm
!pip install -U neuspell
!pip install clean-text
!pip install -U dropbox

# !apt-get install ffmpeg # if you get ffmpeg errors

import math, re
import os, shutil, time, gc
import pprint as pp
from datetime import datetime
from os import listdir
from os.path import isfile, join

import dropbox
import librosa
import moviepy.editor as mp
import moviepy
import pandas as pd
import pkg_resources
import pysbd
import torch
import wordninja
import yake
from natsort import natsorted
from symspellpy import SymSpell
import transformers
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import pycuda.driver as cuda
import psutil
import humanize
import GPUtil
import GPUtil as GPU
import neuspell
from tqdm.auto import tqdm
from cleantext import clean

# Function Definitions

- there is **a lot** of code in here, which is sort of organized. 
- It should only need to be opened / adjusted for debugging any errors or implementing improvements, or if you love reading python code.

## generic functions 

In [6]:
# define user functions


def increase_font():
    from IPython.display import Javascript

    display(
        Javascript(
            """
  for (rule of document.styleSheets[0].cssRules){
    if (rule.selectorText=='body') {
      rule.style.fontSize = '24px'
      break
    }
  }
  """
        )
    )


def reset_font():
    from IPython.display import Javascript

    display(
        Javascript(
            """
  for (rule of document.styleSheets[0].cssRules){
    if (rule.selectorText=='body') {
      rule.style.fontSize = '14px'
      break
    }
  }
  """
        )
    )


def corr(s):
    # adds space after period if there isn't one
    # removes extra spaces
    return re.sub(r"\.(?! )", ". ", re.sub(r" +", " ", s))


def shorten_title(title_text, max_no=20):
    if len(title_text) < max_no:
        return title_text
    else:
        return title_text[:max_no] + "..."


def digest_txt_directory(file_directory, identifer="", verbose=False, make_folder=True):
    run_date = datetime.now()
    files_to_merge = natsorted(
        [
            f
            for f in listdir(file_directory)
            if isfile(join(file_directory, f)) & f.endswith(".txt")
        ]
    )
    outfilename = (
        "Zealous_MERGED_words_" + identifer + run_date.strftime("_%d%m%Y_%H") + ".txt"
    )

    og_wd = os.getcwd()
    os.chdir(file_directory)

    if make_folder:
        folder_name = "merged_txt_files"
        if not os.path.isdir(join(file_directory, folder_name)):
            os.mkdir(
                join(file_directory, folder_name)
            )  # make a place to store outputs if one does not exist
        output_loc = join(file_directory, folder_name)

        outfilename = join(folder_name, outfilename)

        if verbose:
            print("created new folder. new full path is: \n", output_loc)

    count = 0
    with open(outfilename, "w") as outfile:

        for names in files_to_merge:

            with open(names) as infile:
                count += 1
                outfile.write("Start of: " + names + "\n")
                outfile.writelines(infile.readlines())

            outfile.write("\n")

    print("Merged {} text files together.".format(count))
    if verbose:
        print("the merged file is located at: \n", os.getcwd())
    os.chdir(og_wd)


def validate_output_directories(directory, verbose=False):

    # checks and creates folders

    t_folder_name = "wav2vec2_sf_transcript"
    m_folder_name = "wav2vec2_sf_metadata"

    # check if transcription folder exists. If not, create it'

    t_path_full = join(directory, t_folder_name)
    m_path_full = join(directory, m_folder_name)
    create_folder(t_path_full)
    create_folder(m_path_full)

    output_locs = {"t_out": t_path_full, "m_out": m_path_full}

    return output_locs


def move2completed(from_dir, filename, new_folder="completed", verbose=False):

    # this is the better version
    old_filepath = join(from_dir, filename)

    new_filedirectory = join(from_dir, new_folder)
    create_folder(new_filedirectory)

    new_filepath = join(new_filedirectory, filename)

    try:
        shutil.move(old_filepath, new_filepath)
        if verbose:
            print("moved {} to */completed.".format(filename))
    except:
        print(
            "Warning! unable to move file to \n{}. Please investigate".format(
                new_filepath
            )
        )

### clean filenames

In [7]:
def cleantxt_wrap(ugly_text):
    # a wrapper for clean text with options different than default

    # https://pypi.org/project/clean-text/
    cleaned_text = clean(
        ugly_text,
        fix_unicode=True,  # fix various unicode errors
        to_ascii=True,  # transliterate to closest ASCII representation
        lower=True,  # lowercase text
        no_line_breaks=True,  # fully strip line breaks as opposed to only normalizing them
        no_urls=True,  # replace all URLs with a special token
        no_emails=True,  # replace all email addresses with a special token
        no_phone_numbers=True,  # replace all phone numbers with a special token
        no_numbers=False,  # replace all numbers with a special token
        no_digits=False,  # replace all digits with a special token
        no_currency_symbols=True,  # replace all currency symbols with a special token
        no_punct=True,  # remove punctuations
        replace_with_punct="",  # instead of removing punctuations you may replace them
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        replace_with_number="<NUM>",
        replace_with_digit="0",
        replace_with_currency_symbol="<CUR>",
        lang="en",  # set to 'de' for German special handling
    )

    return cleaned_text


def beautify_filename(filename, num_words=20, start_reverse=False, word_separator="_"):
    # takes a filename stored as text, removes extension, separates into X words ...
    # and returns a nice filename with the words separateed by
    # useful for when you are reading files, doing things to them, and making new files

    filename = str(filename)
    index_file_Ext = filename.rfind(".")
    current_name = str(filename)[:index_file_Ext]  # get rid of extension
    if current_name[-1].isnumeric():
        current_name = current_name + "V2CT"
    clean_name = cleantxt_wrap(current_name)  # wrapper with custom defs
    file_words = wordninja.split(clean_name)
    # splits concatenated text into a list of words based on common word freq
    if len(file_words) <= num_words:
        num_words = len(file_words)

    if start_reverse:
        t_file_words = file_words[-num_words:]
    else:
        t_file_words = file_words[:num_words]

    pretty_name = word_separator.join(t_file_words)  # see function argument

    # NOTE IT DOES NOT RETURN THE EXTENSION
    return pretty_name[
        : (len(pretty_name) - 1)
    ]  # there is a space always at the end, so -1

In [8]:
def fast_scandir(dirname):
    # return all subfolders in a given filepath

    subfolders = [f.path for f in os.scandir(dirname) if f.is_dir()]
    for dirname in list(subfolders):
        subfolders.extend(fast_scandir(dirname))
    return subfolders  # list


def create_folder(directory):
    os.makedirs(directory, exist_ok=True)


def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]


def chunky_pandas(my_df, num_chunks=4):
    n = int(len(my_df) // num_chunks)
    list_df = [my_df[i : i + n] for i in range(0, my_df.shape[0], n)]

    return list_df

In [9]:
import os
from os.path import basename
from natsort import natsorted
import pprint as pp


def load_dir_files(directory, req_extension=".txt", return_type="list", verbose=False):
    appr_files = []
    # r=root, d=directories, f = files
    for r, d, f in os.walk(directory):
        for prefile in f:
            if prefile.endswith(req_extension):
                fullpath = join(r, prefile)
                appr_files.append(fullpath)

    appr_files = natsorted(appr_files)

    if verbose:
        print("A list of files in the {} directory are: \n".format(directory))
        if len(appr_files) < 10:
            pp.pprint(appr_files)
        else:
            pp.pprint(appr_files[:10])
            print("\n and more. There are a total of {} files".format(len(appr_files)))

    if return_type.lower() == "list":
        return appr_files
    else:
        if verbose:
            print("returning dictionary")

        appr_file_dict = {}
        for this_file in appr_files:
            appr_file_dict[basename(this_file)] = this_file

        return appr_file_dict

In [10]:
print("loaded all generic functions at: ", datetime.now())

# create time log

time_log = []
time_log_desc = []
time_log.append(time.time())
time_log_desc.append("start")

loaded all generic functions at:  2022-02-24 00:47:29.295614


### time log

In [11]:
def get_timestamp(exact=False):
    """
    get_timestamp - return a timestamp in the format YYYY-MM-DD_HH-MM-SS (exact=False)
        or YYYY-MM-DD_HH-MM-SS-MS (exact=True)
    exact : bool, optional, by default False,  if True, return a timestamp with seconds
    """
    ts = (
        datetime.now().strftime("%b-%d-%Y_-%H-%M-%S")
        if exact
        else datetime.now().strftime("%b-%d-%Y_-%H")
    )
    return ts


In [12]:
def timelog_analytics(
    time_log, time_log_desc, save_path=None, show_plot=False, save_plot=True
):
    # takes the two lists used everywhere in this file and makes a relatively useful report
    if save_path is None:
        save_path = os.getcwd()
    time_records_db = pd.DataFrame(
        list(zip(time_log_desc, time_log)), columns=["Event", "Time (sec)"]
    )
    start_time_block = time_records_db.loc[0, "Time (sec)"]
    time_records_db["time_mins"] = 0
    time_records_db["diff_time_mins"] = 0
    prior_time = 0
    for index, row in time_records_db.iterrows():
        normalized_time = row["Time (sec)"] - start_time_block
        time_records_db.loc[index, "Time (sec)"] = normalized_time
        time_records_db.loc[index, "time_mins"] = normalized_time / 60

        if prior_time == 0:
            time_records_db.loc[index, "diff_time_mins"] = normalized_time / 60
            prior_time = normalized_time / 60
        else:
            # conditional check mostly for safety reasons
            time_records_db.loc[index, "diff_time_mins"] = (
                normalized_time / 60
            ) - prior_time
            prior_time = normalized_time / 60

    time_records_db.to_excel(
        join(
            save_path,
            "vid2cleantxt_"
            + datetime.now().strftime("%d%m%Y")
            + "transc_time_log.xlsx",
        )
    )
    print(
        "total runtime was {} minutes".format(round((time_log[-1] - time_log[0]) / 60))
    )

    total_time = time_records_db["diff_time_mins"].sum()

    def get_time_frac(section_time):
        return section_time / total_time

    time_records_db["duration_frac"] = time_records_db["diff_time_mins"].apply(
        get_time_frac
    )
    time_records_db.loc[
        time_records_db["duration_frac"] < 0.03, "Event"
    ] = "Misc."  # Represent only large events
    figtitle = "Run Time Viz - transc of {} media files on {}".format(
        len(approved_files), datetime.now().strftime("%d.%m.%Y")
    )
    fig = px.pie(
        time_records_db,
        values="diff_time_mins",
        names="Event",
        title=figtitle,
        template="plotly_dark",
    )
    if show_plot:
        fig.show()
    if save_plot:
        fig.to_html(
            join(save_path, "transcriptions {} run time viz.html".format(tag_date)),
            include_plotlyjs=True,
            default_width=1280,
            default_height=720,
        )

In [13]:
#@title colab download alias

from google.colab import files

download = files.download

### download zip file to colab


In [14]:
import re


def URL_string_filter(text):
    custom_printable = (
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ._"
    )

    filtered = "".join((filter(lambda i: i in custom_printable, text)))

    return filtered


def getFilename_fromCd(cd):

    if not cd:
        return None
    fname = re.findall("filename=(.+)", cd)
    if len(fname) > 0:
        output = fname[0]
    elif cd.find("/"):
        possible_fname = url.rsplit("/", 1)[1]
        output = URL_string_filter(possible_fname)
    else:
        output = None
    return output

In [15]:
import shutil, lzma, bz2, zlib  # zipfile formats
import requests
from os.path import getsize
from datetime import datetime


def get_zip_URL(
    URLtoget, extract_loc=None, file_header="dropboxexport_", verbose=False
):

    r = requests.get(URLtoget, allow_redirects=True)
    names = getFilename_fromCd(r.headers.get("content-disposition"))
    fixed_fnames = names.split(";")  # split the multiple results
    this_filename = file_header + URL_string_filter(fixed_fnames[0])

    # define paths and save the zip file
    if extract_loc is None:
        extract_loc = "dropbox_dl"
    dl_place = join(os.getcwd(), extract_loc)
    create_folder(dl_place)
    save_loc = join(os.getcwd(), this_filename)
    open(save_loc, "wb").write(r.content)
    if verbose:
        print("downloaded file size was {} MB".format(getsize(save_loc) / 1000000))

    # unpack the archive
    shutil.unpack_archive(save_loc, extract_dir=dl_place)
    if verbose:
        print("extracted zip file - ", datetime.now())
        x = load_dir_files(dl_place, req_extension="", verbose=verbose)

    # remove original
    try:
        os.remove(save_loc)
        del save_loc
    except:
        print("unable to delete original zipfile - check if exists", datetime.now())

    print("finished extracting zip - ", datetime.now())

    return dl_place

## check hardware


In [16]:
def gpu_mem_total():
    # Returns the total memory of the first available GPU
    try:
        gpus = GPUtil.getGPUs()
    except:
        LOGGER.warning(
            "Unable to detect GPU model. Is your GPU configured? Is Colab Runtime set to GPU?"
        )
        return np.nan
    if len(gpus) == 0:
        raise ValueError("No GPUs detected in the system")
    return gpus[0].memoryTotal

checks and resets

In [17]:
def clear_GPU_cache(verbose=False):

    GPUs = GPU.getGPUs()

    if len(GPUs) > 0:
        check_runhardware_torch()
        torch.cuda.empty_cache()
        print("\nchecked and cleared cache")
    else:
        print("\nNo GPU being used :( time = ", datetime.now())
    if verbose:
        print("-----------End of Cache Clear----------------")


print("loaded all hardware functions at: ", datetime.now())


def check_runhardware_torch(verbose=False):
    # https://www.run.ai/guides/gpu-deep-learning/pytorch-gpu/

    GPUs = GPU.getGPUs()

    if len(GPUs) > 0:
        if verbose:
            print("\n ------------------------------")
            print("Checking CUDA status for PyTorch")

        torch.cuda.init()

        print("Cuda availability (PyTorch): ", torch.cuda.is_available())

        # Get Id of default device
        torch.cuda.current_device()
        if verbose:
            print(
                "Name of GPU: ", torch.cuda.get_device_name(device=0)
            )  # '0' is the id of your GPU
            print("------------------------------\n")
        return True

    else:
        print("No GPU being used :(")
        return False


def torch_validate_cuda(verbose=False):
    GPUs = GPU.getGPUs()
    num_gpus = len(GPUs)
    try:
        torch.cuda.init()
        if not torch.cuda.is_available():
            print(
                "WARNING - CUDA is not being used in processing - expect longer runtime"
            )
            if verbose:
                print("GPU util detects {} GPUs on your system".format(num_gpus))
    except:
        print(
            "WARNING - unable to start CUDA. If you wanted to use a GPU, exit and check hardware."
        )


def check_runhardware(verbose=False):
    # ML package agnostic hardware check
    GPUs = GPU.getGPUs()

    if verbose:
        print("\n ------------------------------")
        print("Checking hardware with psutil")
    try:
        gpu = GPUs[0]
    except:
        if verbose:
            print("GPU not available - ", datetime.now())
        gpu = None
    process = psutil.Process(os.getpid())

    CPU_load = psutil.cpu_percent()
    if CPU_load > 0:
        cpu_load_string = "loaded at {} % |".format(CPU_load)
    else:
        # the first time process.cpu_percent() is called it returns 0 which can be confusing
        cpu_load_string = "|"
    print(
        "\nGen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
        " | Proc size: " + humanize.naturalsize(process.memory_info().rss),
        " | {} CPUs ".format(psutil.cpu_count()),
        cpu_load_string,
    )

    if len(GPUs) > 0 and GPUs is not None:
        print(
            "GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB\n".format(
                gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal
            )
        )
    else:
        print("No GPU being used :(", "\n-----------------\n")


def only_clear_GPU_cache(verbose=False):

    GPUs = GPU.getGPUs()

    if len(GPUs) > 0:
        torch.cuda.empty_cache()
        if verbose:
            print("\nchecked and cleared cache")
    else:
        print("\nClearCache - No GPU being used :( time = ", datetime.now())


# create time log

time_log.append(time.time())
time_log_desc.append("loaded hardware functions")

loaded all hardware functions at:  2022-02-24 00:47:29.456656


## spell correction

symspell is defined for backup purposes. It is faster than neuspell and decently accurate. It does not do grammar though.

In [18]:
%%capture

def symspell_file(filepath, filename, dist=2, keep_numb_words=True, create_folder=True, save_metrics=False,
                  verbose=False):
    # given a text (has to be text) file, reads the file, autocorrects any words it deems misspelled, saves as new file
    # it can store the new file in a sub-folder it creates as needed
    # distance represents how far it searches for a better spelling. higher dist = higher RT.
    # https://github.com/mammothb/symspellpy

    script_start_time = time.time()
    sym_spell = SymSpell(max_dictionary_edit_distance=dist, prefix_length=7)
    print("\nPySymSpell - Starting to correct the file: ", filename)
    # ------------------------------------

    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
    bigram_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_bigramdictionary_en_243_342.txt")
    # term_index is the column of the term and count_index is the
    # column of the term frequency
    sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
    sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

    # ------------------------------------
    file = open(join(filepath, filename), 'r', encoding="utf-8", errors='ignore')
    textlines = file.readlines()  # return a list
    file.close()

    if create_folder:
        # create a folder
        output_folder_name = "auto-corrected" 
        if not os.path.isdir(join(filepath, output_folder_name)):
            os.mkdir(join(filepath, output_folder_name))  # make a place to store outputs if one does not exist
        filepath = join(filepath, output_folder_name)

    if verbose:
        print("loaded text with {0:6d} lines ".format(len(textlines)))

    corrected_list = []

    # iterate through list of lines. Pass each line to be corrected. 
    #Append / sum results from each line till done
    for line in textlines:
        if line == "":
            # blank line, skip to next run
            continue

        # correct the line of text using spellcorrect_line() which returns a dictionary
        suggestions = sym_spell.lookup_compound(phrase=line, max_edit_distance=dist, 
                                                ignore_non_words=keep_numb_words,
                                                ignore_term_with_digits=keep_numb_words)
        all_sugg_for_line = []
        for suggestion in suggestions:
            all_sugg_for_line.append(suggestion.term)

        # append / sum / log results from correcting the line

        corrected_list.append(' '.join(all_sugg_for_line) + "\n")

    # finished iterating through lines. Now sum total metrics

    corrected_doc = "".join(corrected_list)
    corrected_fname = "Corrected_SSP_" + beautify_filename(filename, 
                                                           num_words=10, start_reverse=False) + ".txt"

    # proceed to saving
    file_out = open(join(filepath, corrected_fname), 'w',
                    encoding="utf-8", errors='ignore')
    file_out.writelines(corrected_doc)
    file_out.close()

    # report RT
    if verbose:
        script_rt_m = (time.time() - script_start_time) / 60
        print("RT for this file was {0:5f} minutes".format(script_rt_m))
        print("output folder for this transcription is: \n", 
              filepath)

    print("Done correcting ", filename, " at time: ", 
          datetime.now().strftime("%H:%M:%S"), "\n")

    corr_file_Data = {
        "corrected_ssp_text": corrected_doc,
        "corrected_ssp_fname": corrected_fname,
        "output_path": filepath,
    }
    return corr_file_Data


# preload defaults
sym_spell = SymSpell(max_dictionary_edit_distance=3, prefix_length=7)

dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
bigram_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_bigramdictionary_en_243_342.txt")
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

print("loaded defaults - ", datetime.now())

def symspell_freetext(textlines, dist=3, keep_numb_words=True, verbose=False,
                      d_path=dictionary_path, b_path=bigram_path, default=sym_spell):
    # https://github.com/mammothb/symspellpy

    if dist != 3:

        # have to recreate object each time because doesn't match pre-built

        sym_spell = SymSpell(max_dictionary_edit_distance=dist, prefix_length=7)
        sym_spell.load_dictionary(d_path, term_index=0, count_index=1)
        sym_spell.load_bigram_dictionary(b_path, term_index=0, count_index=2)
    else:
        sym_spell=default

    corrected_list = []

    if type(textlines) == str:
        textlines = [textlines] # put in a list if a string

    if verbose:
        print("\nStarting to correct text with {0:6d} lines ".format(len(textlines)))
        print("the type of textlines var is ",type(textlines))

    # iterate through list of lines. Pass each line to be corrected. Append / sum results from each line till done
    for line_obj in textlines:
        line = ''.join(line_obj) 
        if verbose:
            print("line {} in the text is: ".format(textlines.index(line_obj)))
            pp.pprint(line) 
        if line == "":
            # blank line, skip to next run
            continue

        suggestions = sym_spell.lookup_compound(phrase=line, max_edit_distance=dist, 
                                                ignore_non_words=keep_numb_words,
                                                ignore_term_with_digits=keep_numb_words)
        all_sugg_for_line = []
        for suggestion in suggestions:
            all_sugg_for_line.append(suggestion.term)

        # append / sum / log results from correcting the line

        corrected_list.append(' '.join(all_sugg_for_line) + "\n")

    # join corrected text

    corrected_text = "".join(corrected_list)

    if verbose:
        print("Finished correcting w/ symspell at time: ", datetime.now(), "\n")

    return corrected_text



neuspell

- a better spellchecker (but more intensive)

In [19]:
# START OF NEUSPELL

try:
    checker = neuspell.BertChecker()
    checker.from_pretrained()
    use_base = False
except Exception as e:
    print("Note: failed to initialize Neuspell. Will use SymSpell backup.")
    print(f"Error causing Neuspell failure is:\n{e}")
    use_base = True


def neuspell_freetext(textlines, verbose=False):

    corrected_list = []

    if type(textlines) == str:
        textlines = [textlines]  # put in a list if a string

    # iterate through list of lines. Pass each line to be corrected. Append / sum results from each line till done
    for line_obj in textlines:
        line = "".join(line_obj)

        if verbose:
            print("line {} in the text is: ".format(textlines.index(line_obj)))
            pp.pprint(line)
        if line == "" or (len(line) <= 5):
            # blank line, skip to next run
            continue

        line = line.lower()
        corrected_text = checker.correct_strings([line])
        corrected_text_f = " ".join(corrected_text)

        corrected_list.append(corrected_text_f + "\n")

    # join corrected text

    corrected_text = " ".join(corrected_list)

    if verbose:
        print("Finished correcting w/ neuspell at time: ", datetime.now(), "\n")

    return corrected_text

/usr/local/lib/python3.7/dist-packages/neuspell/../data/checkpoints/subwordbert-probwordnoise created
Pretrained model downloading start (may take few seconds to couple of minutes based on download speed) ...
Pretrained model download success
loading vocab from path:/usr/local/lib/python3.7/dist-packages/neuspell/../data/checkpoints/subwordbert-probwordnoise/vocab.pkl
initializing model


Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]

SubwordBert(
  (bert_dropout): Dropout(p=0.2, inplace=False)
  (bert_model): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): 

sentence boundary disambiguation

In [20]:
def SBD_freetext(text, verbose=False):
    # input should be STRING
    # use pysbd to segment

    if isinstance(text, list):
        print(
            "Warning, input ~text~ has type {}. Will convert to str".format(type(text))
        )
        text = " ".join(text)

    seg = pysbd.Segmenter(language="en", clean=True)
    sentences = []
    sentences = seg.segment(text)

    if verbose:
        print(
            "input text of {} words was split into ".format(len(text.split(" "))),
            len(sentences),
            "sentences",
        )

    # take segments and make them sentences

    capitalized = []
    for sentence in sentences:
        if sentence and sentence.strip():
            # ensure that the line is not all spaces
            first_letter = sentence[0].upper()
            rest = sentence[1:]
            capitalized.append(first_letter + rest)

    seg_and_capital = ". ".join(capitalized)

    return seg_and_capital

### pipeline

In [21]:
def spellcorrect_pipeline(filepath, filename, verbose=False, 
                          basic_sc=False):
    # uses two functions (neuspell_freetext, SBD_freetext)
    # in a pipeline
    _input = join(filepath, filename)
    with open(_input, "r", encoding="utf-8", errors="ignore") as fi:
        textlines = fi.readlines()  # return a list
    textlines = [line.lower() for line in textlines]

    if basic_sc:
        sc_textlines = symspell_freetext(textlines, verbose=verbose)
    else:
        sc_textlines = neuspell_freetext(textlines, verbose=verbose)

    loc_SC = "spell_corrected"
    if not os.path.isdir(join(filepath, loc_SC)):
        os.mkdir(
            join(filepath, loc_SC)
        )  # make a place to store outputs if one does not exist

    sc_outname = (
        "NSC_" + beautify_filename(filename, num_words=15, start_reverse=False) + ".txt"
    )

    file_sc = open(
        join(filepath, loc_SC, sc_outname), "w", encoding="utf-8", errors="replace"
    )
    file_sc.writelines(sc_textlines)
    file_sc.close()
    quick_sc_fixes = {
        " ' ": "'",
    }
    if isinstance(sc_textlines, list):
        SBD_sc_textlines = []
        for line in sc_textlines:
            if isinstance(line, list):
                # handles weird corner cases
                line = " ".join(line)

            sentenced = SBD_freetext(line, verbose=verbose)
            for key, value in quick_sc_fixes.items():

                sentenced = sentenced.replace(key, value)

            SBD_sc_textlines.append(sentenced)
    else:
        SBD_sc_textlines = SBD_freetext(sc_textlines, verbose=verbose)

        for key, value in quick_sc_fixes.items():

            SBD_sc_textlines = SBD_sc_textlines.replace(key, value)

    # SBD_text = " ".join(SBD_sc_textlines)

    loc_SBD = "FULLY_COMPLETE"
    if not os.path.isdir(join(filepath, loc_SBD)):
        os.mkdir(
            join(filepath, loc_SBD)
        )  # make a place to store outputs if one does not exist

    SBD_outname = (
        "FIN_" + beautify_filename(filename, num_words=15, start_reverse=False) + ".txt"
    )
    ncsbd_path = join(filepath, loc_SBD, SBD_outname)
    file_sc = open(ncsbd_path, "w", encoding="utf-8", errors="replace")
    file_sc.writelines(SBD_sc_textlines)
    file_sc.close()
    pipelineout = {
        "original_transcript_text": " ".join(textlines),
        "spellcorrected_text": " ".join(sc_textlines),
        "final_text": " ".join(SBD_sc_textlines),
        "spell_corrected_dir": join(filepath, loc_SC),
        "sc_filename": sc_outname,
        "SBD_dir": join(filepath, loc_SBD),
        "SBD_filename": SBD_outname,
    }

    return pipelineout

In [22]:
print("loaded all spell correction functions at: ", datetime.now())

loaded all spell correction functions at:  2022-02-24 00:47:50.008910


## vid2cleantext specific

things that are more or less unique to video conversion / audio transcription.

### convert media

In [23]:
!pip install -U -q pydub
from natsort import natsorted
from pydub import AudioSegment

#@title new function `prep_transc_pydub`
def prep_transc_pydub(
    _vid2beconv,
    in_dir,
    out_dir,
    len_chunks=15,
    verbose=False,
):
    """
    prep_transc_pydub - prepares audio files for transcription using pydub

    Parameters
    ----------
    _vid2beconv : str, the name of the video file to be converted
    in_dir : str or Path, the path to the video file directory
    out_dir : str or Path, the path to the output audio file directory
    len_chunks : int, optional, by default 15, the length of the audio chunks in seconds
    verbose : bool, optional, by default False
        [description], by default False

    Returns
    -------
    list, the list of audio filepaths created
    """

    load_path = join(in_dir, _vid2beconv) if in_dir is not None else _vid2beconv
    vid_audio = AudioSegment.from_file(load_path)
    sound = AudioSegment.set_channels(vid_audio, 1)

    create_folder(out_dir)  # create the output directory if it doesn't exist
    dur_seconds = len(sound) / 1000
    n_chunks = math.ceil(dur_seconds / len_chunks)  # to get in minutes, round up
    preamble = shorten_title(_vid2beconv)
    chunk_fnames = []
    # split sound in 5-second slices and export
    slicer = 1000 * len_chunks  # in milliseconds
    st = time.perf_counter()
    for i, chunk in enumerate(sound[::slicer]):
        chunk_name = f"{preamble}_clipaudio_{i}.wav"
        with open(join(out_dir, chunk_name), "wb") as f:
            chunk.export(f, format="wav")
        chunk_fnames.append(chunk_name)
    rt = round(time.perf_counter() -st, 5)
    print(f"\ncreated audio chunks in {rt} seconds - {get_timestamp()}")
    if verbose:
        print(f" files saved to {out_dir}")

    return natsorted(chunk_fnames)


### transcribe (main)

In [24]:
def transcribe_wav2vec(
    transcription_model, directory, vid_clip_name, 
    chunk_length_seconds, verbose=False
):
    # this is the same process as used in the single video transcription, now as a function. Note that spell correction
    # and keyword extraction are now done separately in the script
    # user needs to pass in: the model, the folder the video is in, and the name of the video
    output_path_full = directory

    # Split Video into Audio Chunks-----------------------------------------------

    print("Starting file: ", vid_clip_name)

    # create audio chunk folder
    output_folder_name = "audio_chunks"
    path2audiochunks = join(directory, output_folder_name)
    os.makedirs(path2audiochunks, exist_ok=True)
    chunk_directory = prep_transc_pydub(    
                                        vid_clip_name,
                                        in_dir=directory,
                                        out_dir=path2audiochunks,
                                        len_chunks=chunk_length_seconds,
                                        verbose=verbose,
                                    )
   

    if verbose:
        print(
            "converted video to audio. About to start transcription loop for file: ",
            vid_clip_name,
        )
    torch_validate_cuda()
    check_runhardware()
    time_log.append(time.time())
    time_log_desc.append("converted video to audio")
    full_transcription = []
    before_loop_st = time.time()
    GPU_update_incr = math.ceil(len(chunk_directory) / 2)
    _need_update = True
    # Load audio chunks by name, pass into model, append output text-----------------------------------------------

    for audio_chunk in tqdm(
        chunk_directory,
        total=len(chunk_directory),
        desc= f"transcribing {shorten_title(vid_clip_name)}:\t",
    ):

        current_loc = chunk_directory.index(audio_chunk)

        if (current_loc % GPU_update_incr == 0) and (GPU_update_incr != 0) and _need_update:
            # provide update on GPU usage
            check_runhardware()
            _need_update = False

        # load dat chunk
        audio_input, rate = librosa.load(
                                        join(path2audiochunks,
                                            audio_chunk),
                                        sr=16000
                                    )
        # MODEL
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        input_values = tokenizer(
                    audio_input, return_tensors="pt", padding="longest", 
                ).input_values.to(
                    device
            )
        transcription_model = transcription_model.to(device)
        logits = transcription_model(input_values).logits
        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = str(tokenizer.batch_decode(predicted_ids)[0])
        full_transcription.append(transcription + "\n")
        # empty memory so you don't overload the GPU
        del input_values, logits, predicted_ids, audio_input
        torch.cuda.empty_cache()

    if verbose:
        print("\nFinished transc. of {}, saving metrics ".format(vid_clip_name))

    # build metadata log -------------------------------------------------
    mdata = []
    mdata.append("original file name: " + vid_clip_name + "\n")
    mdata.append(
        "number of recorded audio chunks: "
        + str(len(chunk_directory))
        + " of lengths seconds each"
        + str(chunk_length_seconds)
        + "\n"
    )
    approx_input_len = (len(chunk_directory) * chunk_length_seconds) / 60
    mdata.append(
        "approx {0:3f}".format(approx_input_len) + " minutes of input audio \n"
    )
    mdata.append(
        "transcription date: "
        + datetime.now().strftime("date_%d_%m_%Y_time_%H-%M-%S")
        + "\n"
    )
    full_text = " ".join(full_transcription)
    transcript_length = len(full_text)
    mdata.append(
        "length of transcribed text: " + str(transcript_length) + " characters \n"
    )
    t_word_count = len(full_text.split(" "))
    mdata.append(
        "total word count: " + str(t_word_count) + " words (based on spaces) \n"
    )

    # delete audio chunks in folder -------------------------------------------------
    try:
        shutil.rmtree(path2audiochunks)
        if verbose:
            print("\nDeleted Audio Chunk Folder + Files")
    except:
        print("warning - could not delete the audio chunk folder on VM")
    # compile results -------------------------------------------------
    transcription_results = {
        "audio_transcription": full_transcription,
        "metadata": mdata,
    }

    return transcription_results

In [25]:
def save_transcript_outputs(fileheader, file_id, output_dict):

    full_transcription = output_dict.get("audio_transcription")
    metadata = output_dict.get("metadata")

    # check if directories for output exist. If not, create them
    storage_locs = validate_output_directories(directory)
    output_path_transcript = storage_locs.get("t_out")
    output_path_metadata = storage_locs.get("m_out")

    transcribed_filename = fileheader + "_transcription_" + file_id + ".txt"
    transcribed_file = open(
        join(output_path_transcript, transcribed_filename),
        "w",
        encoding="utf-8",
        errors="ignore",
    )
    transcribed_file.writelines(full_transcription)
    transcribed_file.close()
    # metadata
    metadata_filename = (
        "metadata - " + fileheader + "_transcription_" + file_id + ".txt"
    )
    metadata_file = open(
        join(output_path_metadata, metadata_filename),
        "w",
        encoding="utf-8",
        errors="ignore",
    )
    metadata_file.writelines(metadata)
    metadata_file.close()

    print("saved outputs for file ID {} - ".format(file_id), datetime.now())

### keyword extraction

In [26]:
def quick_keys(
    filename, filepath, max_ngrams=3, num_keywords=20, save_db=False, verbose=False
):
    # uses YAKE to quickly determine keywords in a text file. Saves Keywords and YAKE score (0 means very important) in
    # an excel file (from a dataframe)
    # yes, the double entendre is intended.
    file = open(join(filepath, filename), "r", encoding="utf-8", errors="ignore")
    text = file.read()
    file.close()

    language = "en"
    deduplication_threshold = 0.6  # technically a hyperparameter
    custom_kw_extractor = yake.KeywordExtractor(
        lan=language,
        n=max_ngrams,
        dedupLim=deduplication_threshold,
        top=num_keywords,
        features=None,
    )
    yake_keywords = custom_kw_extractor.extract_keywords(text)
    phrase_db = pd.DataFrame(yake_keywords)
    if verbose:
        print("YAKE keywords are: \n", yake_keywords)
        print("dataframe looks like: \n")
        pp.pprint(phrase_db.head())

    if len(phrase_db) == 0:
        print("warning - no phrases were able to be extracted... ")
        return None

    phrase_db.columns = ["key_phrase", "YAKE_score"]

    # add a column for how many words the phrases contain
    yake_kw_len = []
    yake_kw_freq = []
    for entry in yake_keywords:
        entry_wordcount = len(str(entry).split(" ")) - 1
        yake_kw_len.append(entry_wordcount)

    for index, row in phrase_db.iterrows():
        search_term = row["key_phrase"]
        entry_freq = text.count(str(search_term))
        yake_kw_freq.append(entry_freq)

    word_len_series = pd.Series(yake_kw_len, name="No. Words in Phrase")
    word_freq_series = pd.Series(yake_kw_freq, name="Phrase Freq. in Text")
    phrase_db2 = pd.concat([phrase_db, word_len_series, word_freq_series], axis=1)
    # add column names and save file as excel because CSVs suck
    phrase_db2.columns = [
        "key_phrase",
        "YAKE Score (Lower = More Important)",
        "num_words",
        "freq_in_text",
    ]

    if save_db:
        # saves individual file if user asks
        yake_fname = (
            beautify_filename(filename=filename, start_reverse=False)
            + "_top_phrases_YAKE.xlsx"
        )
        phrase_db2.to_excel(join(filepath, yake_fname), index=False)

    # print out top 10 keywords, or if desired num keywords less than 10, all of them
    max_no_disp = 10
    if num_keywords > max_no_disp:
        num_phrases_disp = max_no_disp
    else:
        num_phrases_disp = num_keywords

    if verbose:
        print("Top Key Phrases from YAKE, with max n-gram length: ", max_ngrams, "\n")
        pp.pprint(phrase_db2.head(n=num_phrases_disp))
    else:
        list_o_words = phrase_db2["key_phrase"].to_list()
        print("top 5 phrases are: \n")
        if len(list_o_words) < 5:
            pp.pprint(list_o_words)
        else:
            pp.pprint(list_o_words[:5])

    return phrase_db2

In [27]:
get_timestamp()

'Feb-24-2022_-00'

# Specify Parameters, File Locations, Load wav2vec2 Model

## load media files

In [28]:
from pathlib import Path
#@markdown define allowed extension types and a function to validate if media present in user folder

extensions = [".mp4",".mov",".mkv",
                  ".mp3",".wav",".ogg", ".m4a",
                  ]
def is_media_empty(dir):
    _dir = Path(dir)
    dirfiles = [f for f in _dir.iterdir() if f.is_file()]
    if len(dirfiles) < 1: return True

    for df in dirfiles:
        this_ext = df.suffix
        if any([m in this_ext for m in extensions]):
            return False
    return True

print(f'currently allowed media types are:\n{extensions}')

currently allowed media types are:
['.mp4', '.mov', '.mkv', '.mp3', '.wav', '.ogg', '.m4a']


In [29]:
#@title load media files 
#@markdown allowed extensions defined in cell above
# iterate through and grab files:
if use_url:
    if is_media_empty(URL_save_folder):
        URL_save_folder = get_zip_URL(
                URL_of_media, extract_loc=URL_save_folder, verbose=True
        )
    else:
        print(f"found media files in {URL_save_folder}, skipping download")
    # no else cause URL_save_folder exists
    files_to_munch = natsorted(
        [f for f in listdir(URL_save_folder) if isfile(join(URL_save_folder, f))]
    )
else:
    files_to_munch = natsorted(
        [f for f in listdir(directory) if isfile(join(directory, f))]
    )
total_files_1 = len(files_to_munch)
removed_count_1 = 0
approved_files = []
# remove non-media files
for prefile in files_to_munch:
    if any([m in prefile for m in extensions]):
        approved_files.append(prefile)
    else:
        files_to_munch.remove(prefile)
        removed_count_1 += 1

print(
    "out of {0:3d} file(s) originally in the folder, ".format(total_files_1),
    "{0:3d} non-media files were removed".format(removed_count_1),
)
print(
    "\n {0:3d} media file(s) in folder will be transcribed.".format(len(approved_files))
)

pp.pprint(approved_files)

downloaded file size was 401.137433 MB
extracted zip file -  2022-02-24 00:48:07.425251
A list of files in the /content/downloaded_media directory are: 

['/content/downloaded_media/MIT_MatricesSGD_000.mp4',
 '/content/downloaded_media/MIT_MatricesSGD_001.mp4',
 '/content/downloaded_media/MIT_MatricesSGD_002.mp4',
 '/content/downloaded_media/MIT_Signals_000.mp4',
 '/content/downloaded_media/MIT_Signals_001.mp4',
 '/content/downloaded_media/MIT_Signals_002.mp4',
 '/content/downloaded_media/MIT_VibrationsAndWaves000.mp4',
 '/content/downloaded_media/MIT_VibrationsAndWaves001.mp4',
 '/content/downloaded_media/MIT_VibrationsAndWaves002.mp4',
 '/content/downloaded_media/MIT_VibrationsAndWaves003.mp4']

 and more. There are a total of 10 files
finished extracting zip -  2022-02-24 00:48:07.465639
out of  10 file(s) originally in the folder,    0 non-media files were removed

  10 media file(s) in folder will be transcribed.
['MIT_MatricesSGD_000.mp4',
 'MIT_MatricesSGD_001.mp4',
 'MIT_Matric

## Load wav2vec2 model (#3)

- <font color = "orange">enter chunk length & choose a model. defaults should work fine for most cases 
- with recent upgrades made on the function to convert media to `.wav` audio chunks there is no "runtime penalty" of a smaller chunk length as before. Just keep in mind as the number decreases **too low** then the model won't have the relevant auditory context to determine what is being said effectively.




In [30]:
chunk_length =   15# @param {type:"integer"}
_model = "facebook/hubert-large-ls960-ft" #@param ["facebook/wav2vec2-large-960h-lv60-self", "facebook/hubert-large-ls960-ft", "facebook/hubert-xlarge-ls960-ft", "facebook/wav2vec2-base-960h"] {allow-input: true}
#@markdown model name is the tag on huggingface.co

In [31]:
#@title load model from huggingface hub
%%time
from transformers import (
    Wav2Vec2ForMaskedLM, 
    Wav2Vec2CTCTokenizer, 
    Wav2Vec2Processor,
    HubertForCTC,
)
time_log.append(time.time())

gpu_mem = round(gpu_mem_total() / 1024, 2)

if gpu_mem < 15 and chunk_length > 20:
    print("GPU memory of {} is too low.. setting chunk length to 20".format(gpu_mem))
    chunk_length = 20  # automatically adjust down to avoid issues

time_log_desc.append("starting to load model")

print("\nPreparing to load model: " + _model)
tokenizer = Wav2Vec2Processor.from_pretrained(_model)

if "hubert" in _model.lower():
    model = HubertForCTC.from_pretrained(_model,
                                         gradient_checkpointing=True,
                                         low_cpu_mem_usage=False, # set to true if issues
                                    )
else:
    model = Wav2Vec2ForCTC.from_pretrained(_model)
# (in seconds) if model fails to work or errors out (and there isn't some other
# obvious error, reduce this number. 20-25 is a good start.
print("loaded the following model:", _model, " at ", datetime.now())
time_log.append(time.time())
time_log_desc.append("loaded model")


Preparing to load model: facebook/hubert-large-ls960-ft


Downloading:   0%|          | 0.00/212 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/138 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.34k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/291 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.18G [00:00<?, ?B/s]

loaded the following model: facebook/hubert-large-ls960-ft  at  2022-02-24 00:48:38.688115
CPU times: user 22.2 s, sys: 3.62 s, total: 25.8 s
Wall time: 31.2 s


---

# Run Transformers Model (wav2vec2 or huBERT)

In [32]:
import gc
#@markdown initial check - hardware
gc.collect()
check_runhardware()


Gen RAM Free: 50.3 GB  | Proc size: 4.7 GB  | 8 CPUs  loaded at 11.8 % |
GPU RAM Free: 16158MB | Used: 2MB | Util   0% | Total 16160MB



In [33]:
#@title Transcription Loop
#@markdown - here's where the transformer model is applied.
#@markdown - transcription speed depends on a lot of things, most notably what 
#@markdown GPU the runtime was assigned (check at the top of notebook)


if use_url:
    vid_src_folder = URL_save_folder
else:
    vid_src_folder = directory

storage_locs = validate_output_directories(directory)
output_path_transcript = storage_locs.get("t_out")
output_path_metadata = storage_locs.get("m_out")

for filename in tqdm(
    approved_files, 
    total=len(approved_files), 
    desc="Main Proc: \t"
):

    t_results = transcribe_wav2vec(
        transcription_model=model,
        directory=vid_src_folder,
        vid_clip_name=filename,
        chunk_length_seconds=chunk_length,
    )
    # t_results is a dictonary containing the transcript and associated metadata
    # label and store this transcription
    vid_preamble = beautify_filename(
        filename, num_words=30, start_reverse=False
    )  # gets a nice phrase from filename
    # transcription
    ID = str(1 + approved_files.index(filename))
    save_transcript_outputs(vid_preamble, ID, t_results)

    if not use_url:
        move2completed(directory, filename=filename)
    # ^ if G Drive, move file to a "done" folder

    # save runtime database after each run
    time_records_db = pd.DataFrame(
        list(zip(time_log_desc, time_log)), columns=["Event", "Time (sec)"]
    )
    time_records_db.to_excel(
        join(output_path_metadata, "mid_loop_runtime_database.xlsx")
    )


Main Proc: 	:   0%|          | 0/10 [00:00<?, ?it/s]

Starting file:  MIT_MatricesSGD_000.mp4

created audio chunks in 0.10433 seconds - Feb-24-2022_-00

Gen RAM Free: 50.3 GB  | Proc size: 4.7 GB  | 8 CPUs  loaded at 11.2 % |
GPU RAM Free: 16158MB | Used: 2MB | Util   0% | Total 16160MB



transcribing MIT_MatricesSGD_000....:	:   0%|          | 0/81 [00:00<?, ?it/s]


Gen RAM Free: 50.3 GB  | Proc size: 4.7 GB  | 8 CPUs  loaded at 22.1 % |
GPU RAM Free: 16158MB | Used: 2MB | Util   0% | Total 16160MB

saved outputs for file ID 1 -  2022-02-24 00:49:56.591374
Starting file:  MIT_MatricesSGD_001.mp4

created audio chunks in 0.11098 seconds - Feb-24-2022_-00

Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 15.7 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB



transcribing MIT_MatricesSGD_001....:	:   0%|          | 0/80 [00:00<?, ?it/s]


Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 24.0 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB

saved outputs for file ID 2 -  2022-02-24 00:51:02.142953
Starting file:  MIT_MatricesSGD_002.mp4

created audio chunks in 0.07107 seconds - Feb-24-2022_-00

Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 14.1 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB



transcribing MIT_MatricesSGD_002....:	:   0%|          | 0/53 [00:00<?, ?it/s]


Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 23.8 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB

saved outputs for file ID 3 -  2022-02-24 00:51:45.218994
Starting file:  MIT_Signals_000.mp4

created audio chunks in 0.10909 seconds - Feb-24-2022_-00

Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 14.1 % |
GPU RAM Free: 13105MB | Used: 3055MB | Util  19% | Total 16160MB



transcribing MIT_Signals_000.mp4:	:   0%|          | 0/81 [00:00<?, ?it/s]


Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 22.1 % |
GPU RAM Free: 13105MB | Used: 3055MB | Util  19% | Total 16160MB

saved outputs for file ID 4 -  2022-02-24 00:52:50.602036
Starting file:  MIT_Signals_001.mp4

created audio chunks in 0.10934 seconds - Feb-24-2022_-00

Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 14.2 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB



transcribing MIT_Signals_001.mp4:	:   0%|          | 0/81 [00:00<?, ?it/s]


Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 24.4 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB

saved outputs for file ID 5 -  2022-02-24 00:53:56.736700
Starting file:  MIT_Signals_002.mp4

created audio chunks in 0.031 seconds - Feb-24-2022_-00

Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 14.2 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB



transcribing MIT_Signals_002.mp4:	:   0%|          | 0/21 [00:00<?, ?it/s]


Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 24.0 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB

saved outputs for file ID 6 -  2022-02-24 00:54:14.546603
Starting file:  MIT_VibrationsAndWaves000.mp4

created audio chunks in 0.17993 seconds - Feb-24-2022_-00

Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 14.1 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB



transcribing MIT_VibrationsAndWav...:	:   0%|          | 0/81 [00:00<?, ?it/s]


Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 21.0 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB

saved outputs for file ID 7 -  2022-02-24 00:55:20.704703
Starting file:  MIT_VibrationsAndWaves001.mp4

created audio chunks in 0.17497 seconds - Feb-24-2022_-00

Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 14.2 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB



transcribing MIT_VibrationsAndWav...:	:   0%|          | 0/81 [00:00<?, ?it/s]


Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 21.1 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB

saved outputs for file ID 8 -  2022-02-24 00:56:26.574142
Starting file:  MIT_VibrationsAndWaves002.mp4

created audio chunks in 0.17484 seconds - Feb-24-2022_-00

Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 14.2 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB



transcribing MIT_VibrationsAndWav...:	:   0%|          | 0/80 [00:00<?, ?it/s]


Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 22.3 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB

saved outputs for file ID 9 -  2022-02-24 00:57:31.968425
Starting file:  MIT_VibrationsAndWaves003.mp4

created audio chunks in 0.07246 seconds - Feb-24-2022_-00

Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 14.1 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB



transcribing MIT_VibrationsAndWav...:	:   0%|          | 0/54 [00:00<?, ?it/s]


Gen RAM Free: 50.1 GB  | Proc size: 7.8 GB  | 8 CPUs  loaded at 22.2 % |
GPU RAM Free: 13107MB | Used: 3053MB | Util  19% | Total 16160MB

saved outputs for file ID 10 -  2022-02-24 00:58:16.329804


---

# Post Model: Save files, Spell Check, SBD, Keywords

If you got to here, your colab file was able to run the model and transcribe it. Now a little cleaning up, then done.

In [34]:
#@markdown Merge Text Files: if `True` creates a **new** file that is all the transcribed text together.
merge_transc = False  # @param {type:"boolean"}

In [35]:
#@title Merge Files
#@markdown use `digest_txt_directory()` function to merge text files
locations = validate_output_directories(directory, verbose=True)
output_path_transcript = locations.get("t_out")
output_path_metadata = locations.get("m_out")

if merge_transc:
    time_log.append(time.time())
    time_log_desc.append("Merging Transcriptions")

    pr_time = datetime.now().strftime("date_%d_%m_%Y_time_%H")
    print("Creating merged files from original transcriptions")
    digest_txt_directory(
        output_path_transcript, identifer="original_tscripts" + pr_time
    )
    digest_txt_directory(
        output_path_metadata,
        identifer="ALLE_metadata_for_tscript_run" + pr_time,
        make_folder=False,
    )

    time_log.append(time.time())
    time_log_desc.append("Finished Merging Transcriptions")
else:
    print("the value of merge_transc is set to ", merge_transc)

the value of merge_transc is set to  False


In [36]:
#@title Validate text files to spell-check
#@markdown reload everything from the text directory in case of changes
# first, you need to go through the output directory of transcripts and make sure that all those files are gucci
transcripts_to_munch = natsorted(
    [
        f
        for f in listdir(output_path_transcript)
        if isfile(join(output_path_transcript, f))
    ]
)
t_files = len(transcripts_to_munch)
removed_count_t = 0
approved_txt_files = []
# remove non-.txt files
for tfile in transcripts_to_munch:
    if tfile.endswith(".txt"):
        approved_txt_files.append(tfile)
    else:
        transcripts_to_munch.remove(tfile)
        removed_count_t += 1

print(
    "out of {0:3d} file(s) originally in the folder, ".format(t_files),
    "{0:3d} non-txt files were removed".format(removed_count_t),
)

approved_txt_files

out of  11 file(s) originally in the folder,    1 non-txt files were removed


['mit_matrices_sgd_001_v_2_c_transcription_2.txt',
 'mit_matrices_sgd_002_v_2_c_transcription_3.txt',
 'mit_signals_000_v_2_c_transcription_4.txt',
 'mit_signals_001_v_2_c_transcription_5.txt',
 'mit_signals_002_v_2_c_transcription_6.txt',
 'mit_vibrations_and_waves_000_v_2_c_transcription_7.txt',
 'mit_vibrations_and_waves_001_v_2_c_transcription_8.txt',
 'mit_vibrations_and_waves_002_v_2_c_transcription_9.txt',
 'mit_vibrations_and_waves_003_v_2_c_transcription_10.txt']

In [37]:
#@title Spellcorrect Pipeline
#@markdown 1. lower output text and correct with Neuspell (or SymSpell in case of issues)
#@markdown 2. apply pySBD for sentence boundary detection
#@markdown 3. extract keywords with the YAKE library

transcript_run_qk = pd.DataFrame()  # empty df to hold all the keywords

for orig_tscript in tqdm(
    approved_txt_files,
    total=len(approved_txt_files),
    desc="spellcorrect_pipeline on transcriptions",
):

    current_loc = approved_txt_files.index(orig_tscript) + 1  # add 1 bc start at 0
    print(
        "\nStarting file {} of ".format(current_loc),
        len(approved_txt_files),
        " | ",
        orig_tscript,
    )

    PL_out = spellcorrect_pipeline(
        output_path_transcript, 
        orig_tscript,
        basic_sc=use_base,
        verbose=False,
    )  # verbose is just for debug
    directory_for_keywords = PL_out.get("spell_corrected_dir")
    filename_for_keywords = PL_out.get("sc_filename")

    qk_df = quick_keys(
        filepath=directory_for_keywords,
        filename=filename_for_keywords,
        num_keywords=25,
        max_ngrams=3,
        save_db=False,
        verbose=False,
    )

    print("completed keywords")
    transcript_run_qk = pd.concat([transcript_run_qk, qk_df], axis=1)

# save overall transcription file
date_field = datetime.now().strftime("%d%m%Y")
folder_desc = basename(directory)
keyword_db_name = "YAKE - all keywords - {} - {}.csv".format(folder_desc, date_field)
keywords_total_path = join(output_path_transcript, keyword_db_name)
transcript_run_qk.to_csv(keywords_total_path, index=True)
download(keywords_total_path)

# print results
print(
    "Transcription files used to extract KW can be found in: \n ",
    directory_for_keywords,
)
print(
    "A file with keyword results is in {} \ntitled {}".format(
        output_path_transcript, keyword_db_name
    )
)

time_log.append(time.time())
time_log_desc.append("SC, keywords")

spellcorrect_pipeline on transcriptions:   0%|          | 0/9 [00:00<?, ?it/s]


Starting file 1 of  9  |  mit_matrices_sgd_001_v_2_c_transcription_2.txt
top 5 phrases are: 

['stochastic gradient descent',
 'tiny step size',
 'true gradient',
 'dimensional optimization problem',
 'millions stochastic steps']
completed keywords

Starting file 2 of  9  |  mit_matrices_sgd_002_v_2_c_transcription_3.txt
top 5 phrases are: 

['training data point',
 'gradients sarcastic gradients',
 'stochastic gradient starts',
 'neural network training',
 'large euro network']
completed keywords

Starting file 3 of  9  |  mit_signals_000_v_2_c_transcription_4.txt
top 5 phrases are: 

['butter worth filter',
 'discreet time filter',
 'continuous time frequency',
 'time butter worth',
 'time frequency axis']
completed keywords

Starting file 4 of  9  |  mit_signals_001_v_2_c_transcription_5.txt
top 5 phrases are: 

['discreet time filter',
 'continuous time filter',
 'butter worth filter',
 'filter continuous time',
 'time filter frequency']
completed keywords

Starting file 5 of  9  

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Transcription files used to extract KW can be found in: 
  /content/drive/MyDrive/Programming/vid2cleantxt/test/wav2vec2_sf_transcript/spell_corrected
A file with keyword results is in /content/drive/MyDrive/Programming/vid2cleantxt/test/wav2vec2_sf_transcript 
titled YAKE - all keywords - test - 24022022.csv


---

# Save, Download, Exit


In [38]:
#@title Download All Files in .zip
#@markdown this needs to be specified in _Setup_ at the beginning
import os, shutil
from os.path import basename

zip_dir = join(directory, "zipped_outputs")
os.makedirs(zip_dir, exist_ok=True)

date_field = datetime.now().strftime("%d%m%Y")
folder_desc = basename(directory)
base_header = date_field + folder_desc
# transcriptions
transc_header = "vid2clntxt_transcripts_archive" + base_header
zip_path_t = join(zip_dir, transc_header)
shutil.make_archive(zip_path_t, "zip", output_path_transcript)
# metadata
meta_header = "vid2clntxt_metadata_archive" + base_header
zip_path_m = join(zip_dir, meta_header)
shutil.make_archive(zip_path_m, "zip", output_path_metadata)

if download_output_files:
    files.download(join(zip_dir, zip_path_t + ".zip"))
    time.sleep(5) # browsers do not like when 2 files download instantly
    files.download(join(zip_dir, zip_path_m + ".zip"))
    print("downloaded files - ", datetime.now())
else:
    print("download_output_files is set to: ", download_output_files)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

downloaded files -  2022-02-24 00:59:06.077182


In [39]:
#@title  Exit block
increase_font()
print(
    "\n\n----------------------------------- Script Complete -------------------------------"
)
print(datetime.now())
print("Transcription files + more in folder: \n", output_path_transcript)
print("More specifically, best transcriptions in: \n", PL_out.get("SBD_dir"))
print("Metadata for each transcription located @ \n", output_path_metadata)
time_log.append(time.time())
time_log_desc.append("End")

<IPython.core.display.Javascript object>



----------------------------------- Script Complete -------------------------------
2022-02-24 00:59:06.088982
Transcription files + more in folder: 
 /content/drive/MyDrive/Programming/vid2cleantxt/test/wav2vec2_sf_transcript
More specifically, best transcriptions in: 
 /content/drive/MyDrive/Programming/vid2cleantxt/test/wav2vec2_sf_transcript/FULLY_COMPLETE
Metadata for each transcription located @ 
 /content/drive/MyDrive/Programming/vid2cleantxt/test/wav2vec2_sf_metadata
