# Note before running

`elevenlabslib` and `pydub` are not in `requirements.txt` yet, because it is still undecided whether we will use ElevenLabs or the SpeechT5 TTS method; ElevenLabs is a paid service.  
To run this notebook, follow the instructions in run.md, then also run:
```
pip install elevenlabslib pydub
```

The API key is not pushed to the repo; create your own at https://beta.elevenlabs.io/
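
One way to keep the key out of the notebook is to read it from an environment variable. The sketch below assumes a variable named `ELEVENLABS_API_KEY`; that name and the `load_api_key` helper are illustrative choices, not something the repo defines.

```python
import os

def load_api_key(env_var="ELEVENLABS_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption for this sketch; use whatever name you export.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the TTS cells.")
    return api_key
```

Calling `API_KEY = load_api_key()` in a cell would then provide the value that the ElevenLabs functions in this notebook take as an argument.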

In [8]:
from elevenlabslib import ElevenLabsUser
from transformers import pipeline
import pandas as pd
import os
from pydub import AudioSegment
import io
from asr import whisper_asr
from tts import speecht5_tts

In [9]:
#Placeholder summarization using the default Hugging Face summarization pipeline

def summarize(text):
    summarizer = pipeline("summarization")
    summary = summarizer(text)
    summarized_text = summary[0]['summary_text']
    return summarized_text
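
Summarization models generally accept a limited input length, so a long transcript may fail or lose content when passed in one piece. A minimal sketch of splitting text into word-bounded chunks before summarizing; `chunk_words` and the 400-word default are illustrative assumptions, not values tuned for any particular model.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words and rejoin each slice.
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```

Each chunk could then be passed to `summarize` and the partial summaries joined, at the cost of losing context across chunk boundaries.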

In [10]:
#ElevenLabs TTS: this could be costly. Voice cloning is a paid-only feature; the free tier allows 10000 characters/month

def elevenlabs_tts(input_text, API_KEY, voice_name, output_path):
    user = ElevenLabsUser(API_KEY)
    voice = user.get_voices_by_name(voice_name)[0]
    returned_value = voice.generate_audio_bytes(input_text)

    recording = AudioSegment.from_file(io.BytesIO(returned_value), format="mp3")
    #ElevenLabs always returns mp3 bytes. Maybe it should be converted to wav with ffmpeg?
    recording.export(output_path, format='mp3')

def list_available_voices(API_KEY):
    user = ElevenLabsUser(API_KEY)
    for voice in user.get_available_voices():
        print(voice.get_name())
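
Since the free tier is limited by character count, it can help to check the length of the text before sending it. A minimal sketch, assuming the 10000 characters/month figure from the comment above; `remaining_characters` and the `used_so_far` bookkeeping are hypothetical, and the actual quota should be checked in your ElevenLabs account.

```python
FREE_TIER_MONTHLY_CHARACTERS = 10_000  # figure taken from the note above; verify for your account

def remaining_characters(text, used_so_far=0, limit=FREE_TIER_MONTHLY_CHARACTERS):
    """Return how many characters would be left after synthesizing text.

    A negative result means the request would exceed the assumed limit.
    """
    return limit - used_so_far - len(text)
```

For example, `remaining_characters(input_text) < 0` could be used as a guard before calling `elevenlabs_tts`.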

In [11]:
#Speech to text

target_file_name = '2minutepaper.wav'
whisper_asr(target_file_name)

88.96806900000001
57.461708000000044


In [12]:
#Reading back the text files. This should be done later asynchronously/with multithreading
target_file_name_base, ext = os.path.splitext(target_file_name)
dfs = []

for file in sorted(os.listdir()): #os.listdir() returns names in arbitrary order; sort so the chunks are joined in sequence
    if target_file_name_base in file and file.endswith('csv'):
        print(file)
        dfs.append(pd.read_csv(file, delimiter=';', names=['start', 'end', 'text'], encoding='ISO-8859-1'))

df = pd.concat(dfs).reset_index(drop=True)
df

final_lines = ' '.join(df['text'])
final_lines

2minutepaper.wav_0.csv
2minutepaper.wav_1.csv


"Dear Fellow Scholars, this is 2 Minute Papers with Dr. Károly Zsolnai-Fehér. This is GPT-4, OpenAI's new language model AI that we just talked about and today I would love to show you that it has barely been out and it is already taking the world by storm. Here are 5 incredible ways it already is in use in the real world, in real products, some of which you can probably try right now or at the very least soon. One, in the Language Learning app Duolingo, they now supercharge the process of learning a new language with GPT-4. We can have scripted conversations, for instance, we can order coffee in French. That is very cool, but that's not the interesting part yet. Now look, GPT-4 will look at our conversation and give us tips on how to make our French feel a little more natural. And don't forget, this can be a two-way conversation, so if we don't quite understand what it is trying to teach us, we can also ask for additional examples. Loving it! Two, Stripe, a payment processor company, 

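The transcript chunks are written as `<audio file>_<index>.csv`, as in the file listing above. A minimal sketch of ordering such names by their numeric index, so that chunk 10 would come after chunk 2; the `chunk_index` helper and the naming pattern it parses are assumptions based on the two file names shown.

```python
import re

def chunk_index(file_name):
    """Extract the trailing chunk number from names like 'audio.wav_3.csv'."""
    match = re.search(r"_(\d+)\.csv$", file_name)
    if match is None:
        raise ValueError(f"No chunk index found in {file_name!r}")
    return int(match.group(1))

# Hypothetical file names; plain string sorting would put _10 before _2.
files = ["2minutepaper.wav_10.csv", "2minutepaper.wav_2.csv", "2minutepaper.wav_0.csv"]
ordered = sorted(files, key=chunk_index)
print(ordered)
```

Sorting on the parsed integer rather than on the raw string keeps the transcript in order once a recording is split into ten or more chunks.
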
In [13]:
#Summary

input_text = summarize(final_lines)
input_text

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


" GPT-4, OpenAI's new language model AI, has barely been out and it is already taking the world by storm . It already helps us learn new languages, organize our knowledge, it can become an excellent tutor when studying, helps us fixing computer code and also makes our car more useful ."

In [14]:
#TTS with 2 (and a half) methods
#API_KEY must already be defined in this session; it is deliberately not stored in the repo

speecht5_tts(input_text, embedding_type = 'custom', custom_embedding_path = '2minutepaper.wav', output_path = 'speecht5_tts_2minute_paper_voice_clone.wav') #Trying voice cloning here
speecht5_tts(input_text, embedding_type = 'female1', output_path = 'speecht5_tts_2minute_paper.wav') #Basic female voice
elevenlabs_tts(input_text, API_KEY, 'Bella', 'elevenlabs_tts_2minute_paper.mp3') #Female voice

Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ..\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Found cached dataset cmu-arctic-xvectors (C:/Users/kissa/.cache/huggingface/datasets/Matthijs___cmu-arctic-xvectors/default/0.0.1/a62fea1f9415e240301ea0042ffad2a3aadf4d1caa7f9a8d9512d631723e781f)
