In [1]:
from os.path import join, dirname
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
import json
import time
import jiwer


In [3]:
# Run only if needed
#pip install jiwer
#pip install ibm_watson

# Train Custom Language Model

This notebook walks through the training and testing of a language model for the transcription of the meetings. As part of this PoC, we want to test the Telephony LSM model or Multimedia LSM. Even though the transcription will be drawn from a multimedia video or audio, telephony LSM often is more accurate on multimedia data than Multimedia LSM.

## Prerequisites

You will need an IBM Cloud account with a Speech to Text service. A text file (corpus) with a real vocabulary and language typical of the use case or client, ideally coming from a real transcription. Additionally, you want to have a separate audio and transcription to test and compare the accuracy of our models. The file `extractAudio.py` can help you extract the audio from the video if needed.

In [2]:
authenticator = IAMAuthenticator('< Your STT API KEY >')
speech_to_text = SpeechToTextV1(
    authenticator=authenticator
)
speech_to_text.set_service_url('< Your STT service URL >')

In [3]:
# Used Parameters
lm_name_prefix = 'CustomPOC_'
corpus_name = 'corpus_demo'
corpus_file = 'corpus.txt'

In [4]:
def createCustom(lm_name_prefix, speech_to_text, base_model = 'en-US_Multimedia'):
    lm_name = lm_name_prefix+base_model
    language_model = speech_to_text.create_language_model(
        lm_name,
        base_model
    ).get_result()
    print(json.dumps(language_model, indent=2))
    return language_model

In [5]:
def addCorpus(lm_id, corpus_name, corpus_file, speech_to_text):
    with open(corpus_file, 'rb') as corpus_f:
        speech_to_text.add_corpus(
            lm_id,
            corpus_name,
            corpus_f
        )

In [6]:
def trainCustom(lm_id, speech_to_text):
    speech_to_text.train_language_model(lm_id)

In [7]:
def getCustom(lm_id, speech_to_text):
    language_model = speech_to_text.get_language_model(lm_id).get_result()
    #print(json.dumps(language_model, indent=2))
    return language_model

In [8]:
def trainCorpus(lm_id, corpus_name, corpus_file, speech_to_text):
    ts = 100
    addCorpus(lm_id, corpus_name, corpus_file, speech_to_text)
    print("Added Corpus")
    # Wait
    ready = False
    for i in range(ts):
        lm = getCustom(lm_id, speech_to_text)
        if lm['status'] != "ready":
            print("Wait Corpus")
            time.sleep(60)
        else:
            print(lm)
            ready=True
    # Train
    if not ready:
        "Not Ready to train!"
        return False
    trainCustom(lm_id, speech_to_text)
    print("Training")
    # Wait
    for i in range(ts):
        lm = getCustom(lm_id, speech_to_text)
        if lm['status'] != "available":
            print("Wait Training")
            time.sleep(60)
        else:
            return True
    return False

## Telephony LSM

In this section, we will train a language model for the Telephony LSM model. This model is intended for low-frequency audio, however, it produces good results for multimedia recordings as well.

In [9]:
base_model_tel = 'en-US_Telephony_LSM'

lm_response = createCustom(lm_name_prefix, speech_to_text, base_model_tel)
lm_id_tel = lm_response['customization_id']


{
  "customization_id": "fd697ab8-712e-4eb0-8571-244bf45ce9e3"
}


The previous id is the language customization id. You will need it for the implementation. Please copy it to a safe place. The following cell code can take a while to run (15-20 mins).

In [10]:
trainCorpus(lm_id_tel, corpus_name, corpus_file, speech_to_text)

Added Corpus
Wait Corpus
Wait Corpus
Wait Corpus
Wait Corpus
Wait Corpus
Wait Corpus
Wait Corpus
Wait Corpus
Wait Corpus
Wait Corpus
{'owner': '33104b95-9ae9-4790-b404-40e8528b19e8', 'base_model_name': 'en-US_Telephony_LSM', 'customization_id': 'fd697ab8-712e-4eb0-8571-244bf45ce9e3', 'dialect': 'en-US', 'versions': ['en-US_Telephony_LSM.v2023-10-31'], 'created': '2024-06-19T20:10:50.038Z', 'name': 'CustomPOC_en-US_Telephony_LSM', 'description': '', 'progress': 0, 'language': 'en-US', 'updated': '2024-06-19T20:21:07.219Z', 'status': 'ready'}
{'owner': '33104b95-9ae9-4790-b404-40e8528b19e8', 'base_model_name': 'en-US_Telephony_LSM', 'customization_id': 'fd697ab8-712e-4eb0-8571-244bf45ce9e3', 'dialect': 'en-US', 'versions': ['en-US_Telephony_LSM.v2023-10-31'], 'created': '2024-06-19T20:10:50.038Z', 'name': 'CustomPOC_en-US_Telephony_LSM', 'description': '', 'progress': 0, 'language': 'en-US', 'updated': '2024-06-19T20:21:07.219Z', 'status': 'ready'}
{'owner': '33104b95-9ae9-4790-b404-40e8

True

## Multimedia LSM

In this section, we will train a language model for the Multimedia LSM model.

In [11]:
base_model_MM = 'en-US_Multimedia_LSM'

lm_response = createCustom(lm_name_prefix, speech_to_text, base_model_MM)
lm_id_MM = lm_response['customization_id']

{
  "customization_id": "4618f695-8653-46f4-b3b5-6d7c85f2f9bc"
}


The previous id is the language customization id. You will need it for the implementation. Please copy it to a safe place. The following cell code can take a while to run.

In [12]:
trainCorpus(lm_id_MM, corpus_name, corpus_file, speech_to_text)

Added Corpus
Wait Corpus
Wait Corpus
Wait Corpus
Wait Corpus
{'owner': '33104b95-9ae9-4790-b404-40e8528b19e8', 'base_model_name': 'en-US_Multimedia_LSM', 'customization_id': '4618f695-8653-46f4-b3b5-6d7c85f2f9bc', 'dialect': 'en-US', 'versions': ['en-US_Multimedia_LSM.v2023-11-17'], 'created': '2024-06-19T20:33:44.471Z', 'name': 'CustomPOC_en-US_Multimedia_LSM', 'description': '', 'progress': 0, 'language': 'en-US', 'updated': '2024-06-19T20:37:10.440Z', 'status': 'ready'}
{'owner': '33104b95-9ae9-4790-b404-40e8528b19e8', 'base_model_name': 'en-US_Multimedia_LSM', 'customization_id': '4618f695-8653-46f4-b3b5-6d7c85f2f9bc', 'dialect': 'en-US', 'versions': ['en-US_Multimedia_LSM.v2023-11-17'], 'created': '2024-06-19T20:33:44.471Z', 'name': 'CustomPOC_en-US_Multimedia_LSM', 'description': '', 'progress': 0, 'language': 'en-US', 'updated': '2024-06-19T20:37:10.440Z', 'status': 'ready'}
{'owner': '33104b95-9ae9-4790-b404-40e8528b19e8', 'base_model_name': 'en-US_Multimedia_LSM', 'customizati

True

## Quality Test

This section will focus on testing which model will perform better. We will need at least 1 sample with a baseline transcriptio to compare to. You can use shorter audios than the original meeting sessions. Ideally you would have a big enough test set to do bootstrap the accuracy of the model. However, since this is a POC one or a couple should sufice.

We will be using the Word Error Rate (WER) to measure the accuracy of the model. This rate considers the substitutions, deletions, and insertions to calculate the accuracy.

> [How to calculate WER](https://medium.com/@johnidouglasmarangon/how-to-calculate-the-word-error-rate-in-python-ce0751a46052)

In [13]:
def transcribe(lm_id, base_model, audio_name, speech_to_text):
    with open(audio_name, 'rb') as audio_file:
        speech_recognition_results = speech_to_text.recognize(
            audio=audio_file,
            content_type='audio/mp3', # Change if needed
            model= base_model,
            smart_formatting=True,
            language_customization_id=lm_id
        ).get_result()
    transcript = ""
    for result in speech_recognition_results['results']:
        for alternative in result['alternatives']:
            transcript += alternative['transcript']
    return transcript

Change the audio name and reference file to the baseline transcription you will be using.

In [14]:
audio_name = 'audio.mp3'
reference_file = 'corpus.txt'

reference = ''
with open(reference_file) as f:
    # Read the contents of the file into a variable
    reference = f.read()

transforms = jiwer.Compose(
    [
        jiwer.ExpandCommonEnglishContractions(),
        jiwer.RemoveEmptyStrings(),
        jiwer.ToLowerCase(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip(),
        jiwer.RemovePunctuation(),
        jiwer.ReduceToListOfListOfWords(),
    ]
)

In [15]:
base_model_tel = 'en-US_Telephony_LSM'
#lm_id_tel = "365c3238-a016-46aa-8b61-85a4c827da94"

base_model_MM = 'en-US_Multimedia_LSM'
#lm_id_MM = "809f2dfa-c826-4f89-a905-e37c62ed5ccc"

Generating the telephony transcription.

In [16]:
# Telephone model transcription

transcription_tel = transcribe(lm_id_tel, base_model_tel, audio_name,speech_to_text)

In [17]:
wer_tel = jiwer.wer(
                reference,
                transcription_tel,
                truth_transform=transforms,
                hypothesis_transform=transforms,
            )

Generating the multimedia transcription.

In [18]:
# Multimedia model transcription

transcription_MM = transcribe(lm_id_MM, base_model_MM, audio_name, speech_to_text)

In [19]:
wer_mult = jiwer.wer(
                reference,
                transcription_MM,
                truth_transform=transforms,
                hypothesis_transform=transforms,
            )

Remember that WER is an error rate. A lower value indicates better accuracy, while a higher value indicate worse accuracy.

In [20]:
print(f"Word Error Rate (WER) Telephony: {wer_tel*100:.2f}% (or {(1-wer_tel)*100:.2f}% Accuracy)")
print(f"Word Error Rate (WER) Multimedia: {wer_mult*100:.2f}% (or {(1-wer_mult)*100:.2f}% Accuracy)")

Word Error Rate (WER) Telephony: 12.29% (or 87.71% Accuracy)
Word Error Rate (WER) Multimedia: 19.61% (or 80.39% Accuracy)


In [21]:
print(transcription_MM)

okay okay we're one steve there i think you're muted yes so it's yeah my bad was alright so be sure history analysis of the nba final project let's start let's go with the abstracts let's see okay is objective still temporal bison station yeah still there said the same that's that's okay ok perfect and what the keywords so keywords we have only these or i think i think we should add more specific isolations into the keyword so let's add dendrogram animated bubbles animated lines bars and lines as well ok some skirts let's go with that and 4 so now this morning to the introduction tell me a little bit about it case so let's school paragraph by paragraph i think this first one it's okay i think these are the correct references that we are using the status correct ok i think i think that that is good 3.revolution okay yeah i'll double check that but i think that that is fine as well this do pre okay i think i think this second one is also okay is jump into this last one okay there is some

In [22]:
print(transcription_tel)

ok okay where one steve are you there i think you're muted yes it's yeah my bad was all right so visual history analysis of the nba final project let's start let's go with the abstracts let's see ok is the objective still temporal visualization yeah yeah it's still still the same that's that's okay ok perfect and what all the keywords so keywords we have only these 4 i think i think we should add more specific visualizations into the keyword so let's add dendrogram animated bubbles animated lines bars and lines as well ok sounds good let's go with that and or so now let's move on to the introduction tell me a little bit about it okay so let's go paragraph by paragraph i think this first one it's okay you think these are the correct references that we are using this date is correct ok i think i think that that is good 3.revolution okay yeah i'll double check that but i think that that is fine as well this dupree okay i think i think this second one it's also ok let's jump into this last

## Clean Up

USE **ONLY** AFTER THE POC IS DONE.

In [None]:
# If needed, uncomment and fill with your ids.
#lm_id_tel = "{telephony_id}"
#lm_id_MM = "{multimedia_id}"

In [36]:
# Delete Corpus
speech_to_text.delete_corpus(
    lm_id_tel,
    corpus_name
)

<ibm_cloud_sdk_core.detailed_response.DetailedResponse at 0x117d201f0>

In [40]:
# Delete Language Model
speech_to_text.delete_language_model(lm_id_tel)

<ibm_cloud_sdk_core.detailed_response.DetailedResponse at 0x117e2b130>

In [None]:
# Delete Corpus
speech_to_text.delete_corpus(
    lm_id_MM,
    corpus_name
)

In [None]:
# Delete Language Model
speech_to_text.delete_language_model(lm_id_MM)