### Speech Summarizer

Building a speech summarizer that transcribes spoken input and then summarizes it is straightforward with pretrained transformer-based models and open-source libraries.

The speech file <b>must</b> meet the following specifications:
1. <b>wav</b> file format
2. sample rate of <b>16 kHz</b>
3. <b>mono</b> (single-channel) audio
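
The three requirements above can be checked before running the notebook. The following is a minimal sketch that uses only Python's standard `wave` module; the helper name `check_wav` and its default arguments are illustrative and not part of the original notebook.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of problems; an empty list means the file meets the spec."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            channels = wav.getnchannels()
    except (wave.Error, EOFError) as exc:
        # Raised when the file is not a WAV container the wave module can parse.
        return [f"not a readable WAV file: {exc}"]

    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels} (mono)")
    return problems
```

Calling `check_wav(INPUT_AUDIO_FILE)` after the Variables cells returns an empty list when the file is usable as-is; otherwise each entry names one requirement that is not met.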

<b>Note:</b> Before running the notebook, update the variables in the <b>Variables</b> section if you need to change the inputs.

The variable <b>INPUT_AUDIO_FILE</b> holds the path of the audio file to be summarised.

The <b>Summarize</b> cell writes the summary of the input audio file to <b>OUTPUT_SUMMARISED_FILE</b>.

To run the notebook cell by cell, select a cell and click the <b>Run</b> button below the <b>Menu</b> bar. To run all cells, select <b>Cell --> Run All</b> from the Menu bar.

#### Variables

In [1]:
INPUT_AUDIO_FILE=r"input/speech.wav"

Default variables for Transcription

In [2]:
DEEPSPEECH_MODEL_FILE=r"models/deepspeech-0.9.3-models.tflite"  # download the DeepSpeech model to this location
OUTPUT_TRANSCRIPT_FILE=r"output/transcript.txt"

Default variables for Summarisation

In [3]:
PYTORCH_MODEL_URL=r"https://cdn-lfs.huggingface.co/facebook/bart-large-cnn/2ac2745c02ac987d82c78a14b426de58d5e4178ae8039ba1c6881eccff3e82f1"
PYTORCH_MODEL_FILE=r"models/bart-large-cnn/pytorch_model.bin"
PYTORCH_MODEL_DIR=r"models/bart-large-cnn"
OUTPUT_SUMMARISED_FILE=r"output/summarised_text.txt"
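
The later cells write into `models/` and `output/`, and both the shell redirect and `curl -o` fail if the target directory does not exist. A minimal sketch for creating those directories up front is shown here; the helper name `ensure_parent_dirs` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def ensure_parent_dirs(*file_paths):
    """Create the parent directory of each given file path if it is missing."""
    parents = []
    for file_path in file_paths:
        parent = Path(file_path).parent
        # parents=True builds nested directories; exist_ok=True makes reruns safe.
        parent.mkdir(parents=True, exist_ok=True)
        parents.append(parent)
    return parents
```

For this notebook, the call would be `ensure_parent_dirs(DEEPSPEECH_MODEL_FILE, PYTORCH_MODEL_FILE, OUTPUT_TRANSCRIPT_FILE, OUTPUT_SUMMARISED_FILE)`.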

#### Transcribe speech into text

In [4]:
from IPython.display import Audio

# Render an inline audio player for the input file
Audio(INPUT_AUDIO_FILE)

Use the DeepSpeech model to transcribe the speech file (e.g., speech.wav) into the text file transcript.txt.

In [5]:
!deepspeech --model {DEEPSPEECH_MODEL_FILE} --audio {INPUT_AUDIO_FILE} > {OUTPUT_TRANSCRIPT_FILE}

Loading model from file models/deepspeech-0.9.3-models.tflite
TensorFlow: v2.3.0-6-g23ad988fcd
DeepSpeech: v0.9.3-0-gf2e9c858
Loaded model in 0.0169s.
Running inference.
Inference took 50.301s for 67.892s audio file.
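
The timing line in this log can be turned into a real-time factor (inference time divided by audio duration), where a value below 1 means transcription runs faster than playback. The sketch below parses the line with a regular expression; the helper name `real_time_factor` is illustrative.

```python
import re


def real_time_factor(log_line):
    """Parse 'Inference took Xs for Ys audio file.' and return X / Y."""
    match = re.search(r"Inference took ([\d.]+)s for ([\d.]+)s audio file", log_line)
    if match is None:
        raise ValueError("line does not look like a DeepSpeech timing line")
    inference_s, audio_s = (float(group) for group in match.groups())
    return inference_s / audio_s


rtf = real_time_factor("Inference took 50.301s for 67.892s audio file.")
print(f"real-time factor: {rtf:.2f}")  # prints "real-time factor: 0.74"
```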


#### Load generated transcript into memory

In [6]:
# Read the transcript written by DeepSpeech and drop surrounding whitespace
with open(OUTPUT_TRANSCRIPT_FILE) as file:
    transcribed_text = file.read().strip()

In [7]:
transcribed_text

"there are certain things that a necessary to ensure that the future is good and some of those things are in the long term having long term sustainable transportand sustainable energy generation and to be a space exploring civilization and for humanity to be out their among the stars and be a multi planetry species i think the being a multi planetory species and being out there among the stars is important for the long terme survival of humanity and that's one reason kind of like life insurance for life collectively life as we know it but then the part that i find personally most motivating is that it creates a sense of adventure and it makes people excited about the future if you con sider two futures one way we are forever confine to earth until eventually something terrible happens ot another future where we are out there on many planets may be even going beyond the solar system i think that second version is incredibly exciting and inspiring at there need to be reasons to get up in

#### Summarize transcribed speech

##### Download model for summarisation

In [8]:
!curl -o {PYTORCH_MODEL_FILE} {PYTORCH_MODEL_URL}
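
Where `curl` is not available, the same download can be sketched with Python's standard library. The helper name `download_file` and the temporary `.part` suffix are assumptions made for illustration; unlike plain `curl -o`, `urllib` follows HTTP redirects by default.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Stream `url` to `destination`, creating parent directories as needed."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    partial = destination.with_name(destination.name + ".part")
    # Copy in chunks so the large model file is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    # Rename only after the transfer finished, so an interrupted download
    # does not leave a truncated file under the final name.
    partial.replace(destination)
    return destination
```

In this notebook the call would be `download_file(PYTORCH_MODEL_URL, PYTORCH_MODEL_FILE)`.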

##### Import libraries

In [9]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

##### Load models

In [10]:
tokenizer = AutoTokenizer.from_pretrained(PYTORCH_MODEL_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(PYTORCH_MODEL_DIR)
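
`from_pretrained` on a local directory needs more than the weights file downloaded earlier. The sketch below checks for the files a BART checkpoint directory usually contains; the exact file names are an assumption based on the common Hugging Face layout, and the helper name `missing_model_files` is hypothetical.

```python
from pathlib import Path

# Assumed layout of a local BART checkpoint directory.
REQUIRED_FILES = ("config.json", "pytorch_model.bin")
# Either a single fast-tokenizer file or the vocab/merges pair is assumed to suffice.
TOKENIZER_ALTERNATIVES = (("tokenizer.json",), ("vocab.json", "merges.txt"))


def missing_model_files(model_dir):
    """Return the names of expected files that are absent from model_dir."""
    model_dir = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]
    has_tokenizer = any(
        all((model_dir / name).is_file() for name in group)
        for group in TOKENIZER_ALTERNATIVES
    )
    if not has_tokenizer:
        missing.append("tokenizer.json (or vocab.json + merges.txt)")
    return missing
```

Running `missing_model_files(PYTORCH_MODEL_DIR)` before the load cell gives a readable list of what still has to be placed in the directory.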

##### Summarize

In [11]:
input_ids = tokenizer(f"summarize: {transcribed_text}", return_tensors='pt').input_ids
outputs = model.generate(input_ids)
# Decode once and reuse the result for both the file and the notebook output
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
with open(OUTPUT_SUMMARISED_FILE, 'w') as file:
    file.write(summary)
print(summary)

Being a multi planetry species is important for the long terme survival of humanity. It creates a sense of adventure and it makes people excited about the future. There need to be reasons to get up in the morning you know life can't just be about solving problems otherwise what's the point.
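
The bart-large-cnn model accepts at most 1024 input tokens, so a transcript much longer than this one would need to be split before summarising. Because DeepSpeech transcripts carry no punctuation, one simple approach is to split on word count. The sketch below does that; the helper name `chunk_words` and the 400-word default are assumptions chosen to stay well under the token limit, not values from the original notebook.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive pieces of at most max_words words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

Each piece returned by `chunk_words(transcribed_text)` could then be passed through the same tokenizer and `model.generate` calls as above, and the partial summaries joined.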


### Customisation

#### Transcription

In [12]:
audio = r"input/audio1.wav"
# audio = r"input/audio2.wav"
# audio = r"input/audio3.wav"
# audio = r"input/audio4.wav"
# audio = r"input/audio5.wav"

In [13]:
text = r"output/text1.txt"
# text = r"output/text2.txt"
# text = r"output/text3.txt"
# text = r"output/text4.txt"
# text = r"output/text5.txt"

In [14]:
from IPython.display import Audio

Audio(audio)

In [15]:
!deepspeech --model {DEEPSPEECH_MODEL_FILE} --audio {audio} > {text}

Loading model from file models/deepspeech-0.9.3-models.tflite
TensorFlow: v2.3.0-6-g23ad988fcd
DeepSpeech: v0.9.3-0-gf2e9c858
Loaded model in 0.0179s.
Running inference.
Inference took 1.940s for 3.586s audio file.
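
Instead of uncommenting one `audio`/`text` pair at a time, all five files can be processed in a loop. This is a minimal sketch that assumes the `input/audioN.wav` and `output/textN.txt` naming used above and calls the same `deepspeech` command with the same `--model` and `--audio` flags; the helper names `build_jobs` and `transcribe` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_jobs(input_dir, output_dir, count):
    """Pair input_dir/audioN.wav with output_dir/textN.txt for N = 1..count."""
    return [
        (Path(input_dir) / f"audio{n}.wav", Path(output_dir) / f"text{n}.txt")
        for n in range(1, count + 1)
    ]


def transcribe(model_file, audio_path, text_path):
    """Run the deepspeech CLI once and write its standard output to text_path."""
    with open(text_path, "w") as out:
        subprocess.run(
            ["deepspeech", "--model", str(model_file), "--audio", str(audio_path)],
            stdout=out,
            check=True,  # raise CalledProcessError if deepspeech exits with an error
        )
```

A loop such as `for audio, text in build_jobs("input", "output", 5): transcribe(DEEPSPEECH_MODEL_FILE, audio, text)` would then transcribe every file in turn.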


In [16]:
# Read the transcript for the selected audio file
with open(text) as file:
    transcribed_text = file.read().strip()
transcribed_text

'we must find a new home in the star s'

#### Summarisation

In [17]:
text = ("The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, "
        "and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. "
        "During its construction, the Eiffel Tower surpassed the Washington Monument to become "
        "the tallest man-made structure in the world, a title it held for 41 years until "
        "the Chrysler Building in New York City was finished in 1930. "
        "It was the first structure to reach a height of 300 metres. "
        "Due to the addition of a broadcasting aerial at the top of the tower in 1957, "
        "it is now taller than the Chrysler Building by 5.2 metres (17 ft). "
        "Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France "
        "after the Millau Viaduct.")

In [18]:
input_ids = tokenizer(f"summarize: {text}", return_tensors='pt').input_ids
outputs = model.generate(input_ids)
# with open(OUTPUT_SUMMARISED_FILE, 'w') as file:
#     file.write(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world.
