# AI-enhanced Speech Analytics Process

## Feature checklist
- [ ] Speech length
- [ ] Words per minute
- [ ] Recognized text
- [ ] Word count
- [ ] Character count
- [ ] Stemming/lemmatization
- [ ] Popular words and stopwords
- [ ] Sentiment analysis
- [ ] Parts of speech tagging
- [ ] Text summarization
- [ ] Spelling correction
- [ ] Tagalog language support


## Challenges
* Filipino language
* Code switching

## Implemeting speech recognition
Implementing speech recognition using a TedEd 3 minutes talk titled 'Try something new for 30 days' by Matt Cutts as audio data

In [1]:
import speech_recognition as sr 
from pydub import AudioSegment
from pydub.playback import play

Converting the audio file to wav format and feeding it to Google speech recognition engine to get the transcript.

In [2]:
recognizer = sr.Recognizer()

try:
  audio_full_filename = 'TedEd - Try something new for 30 days.mp3'
  audio_filename = audio_full_filename[0:audio_full_filename.rfind('.')]
  raw_audios_directory = 'audios/raw'
  converted_audios_directory = 'audios/converted'

  # Converting the audio file to wav format
  audio_file = AudioSegment.from_file(f'{raw_audios_directory}/{audio_full_filename}')
  test = audio_file.export(f'{converted_audios_directory}/{audio_filename}.wav', format='wav')

  # Feeding the wav audio to Google speech recognition engine
  with sr.AudioFile(f'{converted_audios_directory}/{audio_filename}.wav') as source:
    audio_data = recognizer.record(source)
    recognized_text = recognizer.recognize_google(audio_data)
    print(f'Recognized text: {recognized_text}')

except sr.UnknownValueError():
  recognizer = sr.Recognizer()

Recognized text: a few years ago i felt like i was stuck in a rat so i decided to follow in the footsteps of the great american philosopher morgan spurlock and try something new for 30 days the idea is actually pretty simple think about something you've always wanted to add your life and try it for the next 30 days it turns out 30 days is just about the right amount of time to add a new habit or subtract the habit like watching the news from your life there's a few things that i learned while doing these 30 day challenges the first was instead of the months flying by forgotten the time was much more memorable this was part of a challenge i did to take a picture everyday for a month and i remember exactly where i was and what i was doing that day i also noticed that as i started to do more and harder 30 day challenges myself confidence grew i went from death dwelling computer nerd to the kind of guy who bikes to work for fun even last year i ended up hiking up mount kilimanjaro the high

## Using BERT Restore Punctuation pretrained model for sentence boundary recognition and punctuation restoration
The [BERT Restore Punctuation](https://huggingface.co/felflare/bert-restore-punctuation) model by [felflare](https://huggingface.co/felflare) is a pretrained transformer model designed to restore punctuation and capitalization in text. This model is based on the bert-base-uncased architecture and has been fine-tuned specifically for punctuation restoration on Yelp Reviews. It is highly effective for use cases such as automatic speech recognition (ASR) outputs or any other scenarios where text has lost its punctuation and capitalization.


In [30]:
%%capture

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

In [29]:
class RestorePunctuation:
  _LABEL_MAP = {
    "LABEL_0": "OU",
    "LABEL_1": "OO",
    "LABEL_2": ".O",
    "LABEL_3": "!O",
    "LABEL_4": ",O",
    "LABEL_5": ".U",
    "LABEL_6": "!U",
    "LABEL_7": ",U",
    "LABEL_8": ":O",
    "LABEL_9": ";O",
    "LABEL_10": ":U",
    "LABEL_11": "'O",
    "LABEL_12": "-O",
    "LABEL_13": "?O",
    "LABEL_14": "?U",
  }

  def __init__(self):
    self._tokenizer = AutoTokenizer.from_pretrained("felflare/bert-restore-punctuation")
    self._model = AutoModelForTokenClassification.from_pretrained("felflare/bert-restore-punctuation")
    self._pipe = pipeline('token-classification', model=self._model, tokenizer=self._tokenizer)

  def restore(self, text: str):
    predictions = self._pipe(text)
    
    restored_text = ''
    for token_prediction in predictions:
      label = self._LABEL_MAP[token_prediction['entity']]

      if "U" in label:
        restored_text += (token_prediction['word'].capitalize())
      else:
        restored_text += (token_prediction['word'])

      for punctuation in [".", ",", "'", "-", ":", ";", "!", "?"]:
        if punctuation in label:
          restored_text += punctuation
       
      restored_text += ' '

    restored_text = (
      restored_text
      .replace(' ##', '')
      .replace(" ' ", "'"))
      
    return restored_text
  

In [32]:
restored_text = RestorePunctuation().restore(text=recognized_text)
restored_text

"A few years ago I felt like I was stuck in a rat. so I decided to follow in the footsteps of the great American philosopher Morgan Spurlock and try something new for 30 days. The idea is actually pretty simple. Think about something you've always wanted to add your life and try it for the next 30 days. It turns out 30 days is just about the right amount of time to add a new habit or subtract the habit. like watching the news from your life. There's a few things that I learned while doing these 30 day challenges. The first was instead of the months flying by forgotten, the time was much more memorable. This was part of a challenge I did to take a picture everyday for a month and I remember exactly where I was and what I was doing that day. I also noticed that as I started to do more and harder 30 day challenges myself, confidence grew. I went from death dwelling computer nerd to the kind of guy who bikes to work for fun. Even last year I ended up hiking up Mount Ki,limanjar,o, the high

## Using pretrained model for text summarization

In [37]:
%%capture
from transformers import AutoModelForSeq2SeqLM

In [50]:
class TextSummarize:
  def __init__(self):
    self._tokenizer = AutoTokenizer.from_pretrained("Falconsai/text_summarization")
    self._model = AutoModelForSeq2SeqLM.from_pretrained("Falconsai/text_summarization")
    self._pipe = pipeline('summarization', model=self._model, tokenizer=self._tokenizer)

  def summarize(self, text:str):
    summary = self._pipe(text)[0]['summary_text'].replace(' . ', '. ')
    return summary

In [51]:
TextSummarize().summarize(restored_text)

"The idea is actually pretty simple. Think about something you've always wanted to add your life and try it for the next 30 days. It turns out 30 days is just about the right amount of time to add a new habit or subtract the habit. The first was instead of the months flying by forgotten, the time was much more memorable. I went from death dwelling computer nerd to the kind of guy who bikes to work for fun ."

## Natural Language Processing with Spacy

In [9]:
%%capture
!python -m spacy download en_core_web_sm
import spacy

Loading small core english model

In [10]:
nlp = spacy.load('en_core_web_sm')

Tokenizing the restored transcript

In [11]:
doc = nlp(restored_text)
for token in list(doc)[:10]:
  print(f"Word: {token.text} \tLemma: {token.lemma_} \tPOS: {token.pos_} \tEntity: {token.ent_type_}")

Word: The 	Lemma: the 	POS: DET 	Entity: ORG
Word: Big 	Lemma: Big 	POS: PROPN 	Entity: ORG
Word: Brown 	Lemma: Brown 	POS: PROPN 	Entity: ORG
Word: Fox 	Lemma: Fox 	POS: PROPN 	Entity: ORG


Getting the sentences

In [12]:
for sentence in list(doc.sents)[:10]:
  print(sentence)

The Big Brown Fox


Getting the token, word, and character count (pre-standardization)

In [13]:
def get_count_stats(doc):
  token_count = len(doc)
  character_count = len(restored_text)
  word_count = len([token.text for token in doc if not token.is_punct and not token.is_space and not token.like_num])

  return token_count, word_count, character_count

In [14]:
token_count, word_count, character_count = get_count_stats(doc)

print(f'Token count: {token_count}')
print(f'Word count: {word_count}')
print(f'Character count: {character_count}')

Token count: 4
Word count: 4
Character count: 18


### Text standardizing

Lemmatizing and removing stop words

In [15]:
def standardize_doc(doc):
  lemmatized_text = ""
  stop_words = []

  for token in doc:
    if token.is_punct:
      continue

    if token.is_stop:
      stop_words.append(token.text.lower())
    else:
      lemmatized_text += f"{token.lemma_} "

  return lemmatized_text, stop_words

In [16]:
lemmatized_text, stop_words = standardize_doc(doc)
doc_lemmatized = nlp(lemmatized_text)

In [17]:
for token in list(doc_lemmatized)[:10]:
  print(f"Word: {token.text} \tLemma: {token.lemma_} \tPOS: {token.pos_} \tEntity: {token.ent_type_}")

Word: Big 	Lemma: Big 	POS: PROPN 	Entity: ORG
Word: Brown 	Lemma: Brown 	POS: PROPN 	Entity: ORG
Word: Fox 	Lemma: Fox 	POS: PROPN 	Entity: ORG


Getting the token, word, and character count (pre-standardization)

In [18]:
token_count, word_count, character_count = get_count_stats(doc_lemmatized)

print(f'Token count: {token_count}')
print(f'Word count: {word_count}')
print(f'Character count: {character_count}')

Token count: 3
Word count: 3
Character count: 18


### Getting word frequency

In [19]:
from collections import Counter

In [20]:
def get_word_list(doc):
  word_list = [token.text.lower() for token in doc if not token.is_punct and not token.is_space and not token.like_num]
  return word_list

10 most popular words (pre-standardization)

In [21]:
Counter(get_word_list(doc)).most_common(10)

[('the', 1), ('big', 1), ('brown', 1), ('fox', 1)]

10 most popular words (standardized)

In [22]:
Counter(get_word_list(doc_lemmatized)).most_common(10)

[('big', 1), ('brown', 1), ('fox', 1)]

10 most popular stopwords

In [23]:
Counter(stop_words).most_common(10)

[('the', 1)]