In [None]:
Building a Text-to-Speech (TTS) and Speech Recognition system using Hugging Face involves working with pre-trained models and possibly fine-tuning them for specific tasks. Here's an end-to-end guide to creating both TTS and speech recognition systems.

Part 1: Text-to-Speech (TTS)
1. Environment Setup
Install Necessary Libraries: Start by installing the required libraries.
bash
pip install transformers datasets torchaudio soundfile
2. Model Selection
Choose a Pre-Trained TTS Model: Hugging Face Transformers includes TTS models such as SpeechT5, VITS, and Bark (wav2vec2 is a speech recognition model, not a TTS model). Tacotron2 is not part of transformers, but torchaudio ships a pre-trained Tacotron2 bundle paired with a WaveRNN vocoder, which this example uses for speech synthesis.
python
import torch
import torchaudio

# Pre-trained character-based Tacotron2 + WaveRNN bundle (trained on LJSpeech).
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH

text_processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().eval()
vocoder = bundle.get_vocoder().eval()
3. Text Preprocessing
Tokenize Input Text: Convert the input text into the character IDs that Tacotron2 accepts as input.
python
text = "Hugging Face is making NLP accessible to everyone."
tokens, lengths = text_processor(text)
4. Generate Speech
Generate Mel Spectrogram: Use the Tacotron2 model to generate a mel spectrogram from the tokens.

python
with torch.inference_mode():
    mel, mel_lengths, alignments = tacotron2.infer(tokens, lengths)
Convert to Audio Waveform: Pass the mel spectrogram through the vocoder to generate the audio waveform. WaveRNN is slow on CPU, so long sentences can take a while.

python
with torch.inference_mode():
    waveforms, waveform_lengths = vocoder(mel, mel_lengths)
5. Save the Audio
Save the Generated Audio: Use the soundfile library to save the audio file at the vocoder's sample rate.
python
import soundfile as sf

# The vocoder returns a batch of waveforms; take the first (and only) one.
audio = waveforms[0].cpu().numpy()
sf.write("output.wav", audio, vocoder.sample_rate)
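A quick way to confirm that a saved file is usable is to read its header back and check the sample rate and duration. The sketch below is a stand-in that uses only Python's standard `wave` module rather than soundfile. The 440 Hz tone, the file name `tone.wav`, and the helper names `write_tone` and `wav_info` are illustrative assumptions, not part of the pipeline above.

```python
import math
import struct
import wave


def write_tone(path, sample_rate=22050, seconds=0.5, freq=440.0):
    """Write a mono 16-bit PCM sine tone; stands in for synthesized audio."""
    n = int(sample_rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / sample_rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) from the WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


write_tone("tone.wav")
print(wav_info("tone.wav"))
```

The same `wav_info` check can be pointed at `output.wav`, provided the file was written as PCM WAV. A sample rate that differs from the one the vocoder reports usually explains audio that plays too fast or too slow.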
6. Deploy the TTS System
Deploy as a Web Service: Use Flask to create a simple API for generating speech from text.
python
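import io
import soundfile as sf
import torch
import torchaudio
from flask import Flask, request, send_file

# Load the models once at startup, not on every request.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
text_processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().eval()
vocoder = bundle.get_vocoder().eval()

app = Flask(__name__)

@app.route("/tts", methods=["POST"])
def tts():
    text = request.json["text"]
    with torch.inference_mode():
        tokens, lengths = text_processor(text)
        mel, mel_lengths, _ = tacotron2.infer(tokens, lengths)
        waveforms, _ = vocoder(mel, mel_lengths)
    # Write the WAV into memory so concurrent requests do not share one file.
    buffer = io.BytesIO()
    sf.write(buffer, waveforms[0].cpu().numpy(), vocoder.sample_rate, format="WAV")
    buffer.seek(0)
    return send_file(buffer, mimetype="audio/wav", as_attachment=True, download_name="output.wav")

if __name__ == "__main__":
    app.run(debug=True)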
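To try the endpoint, a client only needs to POST JSON with a `text` field and save the bytes it gets back. The following is a minimal sketch using the standard library. The URL `http://127.0.0.1:5000/tts` assumes Flask's default development host and port, and the helper names `build_tts_request` and `synthesize` are hypothetical.

```python
import json
import urllib.request


def build_tts_request(text, url="http://127.0.0.1:5000/tts"):
    """Build a JSON POST request for the /tts endpoint (URL is an assumption)."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="reply.wav"):
    """Send the request and save the returned WAV bytes to out_path."""
    with urllib.request.urlopen(build_tts_request(text)) as response:
        with open(out_path, "wb") as f:
            f.write(response.read())
    return out_path


# Building the request needs no running server; synthesize() does.
req = build_tts_request("Hello from the client.")
print(req.get_method(), req.full_url)
```

With the Flask app running in another process, `synthesize("Hello from the client.")` should write the returned audio to `reply.wav`.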
Part 2: Speech Recognition
1. Environment Setup
Install Necessary Libraries: The same setup as TTS, but you'll focus on speech recognition models like wav2vec2.
bash
pip install transformers datasets torchaudio
2. Model Selection
Choose a Pre-Trained Speech Recognition Model: Hugging Face provides the wav2vec2 model for speech-to-text tasks.
python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
3. Load and Preprocess Audio
Load an Audio File: Load a speech audio file for transcription.

python
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")
Resample the Audio: Resample the audio to the model's required sampling rate.

python
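resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
# torchaudio.load returns (channels, time); average the channels so that
# stereo files become a single mono waveform before resampling.
speech = resampler(speech_array.mean(dim=0)).numpy()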
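The reason resampling matters is simple arithmetic: the model treats every 16,000 samples as one second of audio, so a clip recorded at another rate and passed through unchanged appears stretched or compressed in time. The sketch below works through a hypothetical 10-second clip recorded at 44.1 kHz. The function names are illustrative, and the resampled length is an approximation (rounded up), not a statement about any library's exact output.

```python
import math

TARGET_RATE = 16_000  # sampling rate the wav2vec2 checkpoint expects


def duration_seconds(num_samples, sample_rate):
    """Duration implied by interpreting num_samples at sample_rate."""
    return num_samples / sample_rate


def resampled_length(num_samples, orig_rate, new_rate=TARGET_RATE):
    """Approximate number of samples after resampling (rounded up)."""
    return math.ceil(num_samples * new_rate / orig_rate)


num_samples, orig_rate = 441_000, 44_100  # hypothetical 10 s clip

print(duration_seconds(num_samples, orig_rate))    # true duration: 10.0
print(duration_seconds(num_samples, TARGET_RATE))  # misread as 16 kHz: 27.5625
print(resampled_length(num_samples, orig_rate))    # after resampling: 160000
```

Without resampling, the model would hear the clip slowed down by a factor of about 2.76, which typically produces a garbled transcription.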
Prepare Model Inputs: Use the processor to normalize the waveform and convert it into input tensors for the model.

python
inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
4. Run Speech Recognition
Generate Transcription: Use the wav2vec2 model to transcribe the audio to text.
python
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])
print(transcription)
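The `argmax` plus `processor.decode` step is greedy CTC decoding: pick the most likely token per audio frame, collapse consecutive repeats, then drop the blank symbol. The sketch below reproduces that logic in plain Python on a toy example. The six-token vocabulary and frame IDs are made up for illustration, and the use of `<pad>` as the blank and `|` as the word separator is an assumption about the checkpoint's vocabulary. The real mapping lives in the processor's tokenizer.

```python
def ctc_greedy_decode(ids, id_to_token, blank="<pad>", word_sep="|"):
    """Collapse repeated ids, drop blanks, and join tokens into text."""
    tokens = []
    previous = None
    for i in ids:
        if i != previous:  # collapse consecutive repeats of the same id
            token = id_to_token[i]
            if token != blank:  # the blank only separates repeated letters
                tokens.append(token)
        previous = i
    return "".join(tokens).replace(word_sep, " ").strip()


# Toy vocabulary for illustration only.
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}

# One id per audio frame, as torch.argmax(logits, dim=-1) would produce.
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1]

print(ctc_greedy_decode(frame_ids, vocab))  # HELLO
```

Note how the blank between the two runs of `L` is what lets the double letter survive the collapse step.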
5. Deploy the Speech Recognition System
Deploy as a Web Service: Use Flask to create a simple API for converting speech to text.
python
from flask import Flask, request, jsonify
import torch
import torchaudio

app = Flask(__name__)

@app.route("/stt", methods=["POST"])
def stt():
    # `model` and `processor` are the objects loaded in step 2.
    audio_file = request.files["audio"]
    speech_array, sampling_rate = torchaudio.load(audio_file)
    # Downmix to mono and resample from this upload's own rate to 16 kHz;
    # a resampler built for an earlier file may have the wrong source rate.
    speech_array = speech_array.mean(dim=0)
    speech = torchaudio.functional.resample(speech_array, sampling_rate, 16000).numpy()
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])
    return jsonify({"transcription": transcription})

if __name__ == "__main__":
    app.run(debug=True)
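The endpoint reads the upload from `request.files["audio"]`, so a client has to send the file as `multipart/form-data` under the field name `audio`. The sketch below builds such a request by hand with the standard library. The URL assumes Flask's default development host and port, and `build_stt_request`, the file name `clip.wav`, and the placeholder bytes are hypothetical.

```python
import urllib.request
import uuid


def build_stt_request(audio_bytes, filename="clip.wav", url="http://127.0.0.1:5000/stt"):
    """Build a multipart/form-data POST with the audio under the "audio" field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes; in practice, read them from a real WAV file.
req = build_stt_request(b"RIFF-placeholder")
print(req.get_method(), req.get_header("Content-type"))
```

With the service running, `urllib.request.urlopen(req)` should return a JSON body whose `transcription` key holds the recognized text.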
Part 3: Monitoring and Maintenance
Monitor Performance: Track request latency and error rates for both services, measure transcription accuracy (for example, word error rate) on a held-out sample, and collect user feedback.
Fine-Tuning: As more data becomes available, fine-tune the models to improve performance on your specific use case.
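One concrete metric for monitoring the speech recognition side is word error rate (WER): the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal pure-Python version with simple lowercasing and whitespace tokenization. Those normalization choices, the function name, and the example sentences are assumptions, and established evaluation libraries may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"  # one substitution, one deletion
print(round(word_error_rate(reference, hypothesis), 3))  # 0.333
```

Logging this value for a fixed evaluation set after each model update gives a simple signal for whether fine-tuning is helping.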
Part 4: Documentation and Sharing
Document the Process: Provide documentation on how to use, maintain, and extend the TTS and Speech Recognition systems.
Share the Models: Share the trained models and code on platforms like GitHub or Hugging Face Model Hub.
This guide gives you a comprehensive approach to building and deploying both TTS and Speech Recognition systems using Hugging Face's NLP models and the torchaudio library. These systems can be customized further based on specific needs and use cases.

In [1]:
!pip install transformers datasets torchaudio soundfile


Collecting torchaudio
  Downloading torchaudio-2.4.0-cp312-cp312-win_amd64.whl.metadata (6.4 kB)
Collecting soundfile
  Downloading soundfile-0.12.1-py2.py3-none-win_amd64.whl.metadata (14 kB)
Collecting torch==2.4.0 (from torchaudio)
  Using cached torch-2.4.0-cp312-cp312-win_amd64.whl.metadata (27 kB)
Downloading torchaudio-2.4.0-cp312-cp312-win_amd64.whl (2.4 MB)
Using cached torch-2.4.0-cp312

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.18.1 requires torch==2.3.1, but you have torch 2.4.0 which is incompatible.

[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [None]:
import torch
import torchaudio

# Pre-trained character-based Tacotron2 + WaveRNN bundle (trained on LJSpeech).
# transformers has no Tacotron2 class, so the torchaudio bundle is used here.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH

text_processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().eval()
vocoder = bundle.get_vocoder().eval()


In [None]:
text = "Hugging Face is making NLP accessible to everyone."
tokens, lengths = text_processor(text)


In [None]:
with torch.inference_mode():
    mel, mel_lengths, alignments = tacotron2.infer(tokens, lengths)


In [None]:
with torch.inference_mode():
    waveforms, waveform_lengths = vocoder(mel, mel_lengths)


In [None]:
import soundfile as sf

# The vocoder returns a batch of waveforms; take the first (and only) one.
audio = waveforms[0].cpu().numpy()
sf.write("output.wav", audio, vocoder.sample_rate)


In [None]:
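import io
import soundfile as sf
import torch
import torchaudio
from flask import Flask, request, send_file

# Load the models once at startup, not on every request.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
text_processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().eval()
vocoder = bundle.get_vocoder().eval()

app = Flask(__name__)

@app.route("/tts", methods=["POST"])
def tts():
    text = request.json["text"]
    with torch.inference_mode():
        tokens, lengths = text_processor(text)
        mel, mel_lengths, _ = tacotron2.infer(tokens, lengths)
        waveforms, _ = vocoder(mel, mel_lengths)
    # Write the WAV into memory so concurrent requests do not share one file.
    buffer = io.BytesIO()
    sf.write(buffer, waveforms[0].cpu().numpy(), vocoder.sample_rate, format="WAV")
    buffer.seek(0)
    return send_file(buffer, mimetype="audio/wav", as_attachment=True, download_name="output.wav")

if __name__ == "__main__":
    # The auto-reloader tries to restart the process, which fails in a notebook.
    app.run(debug=True, use_reloader=False)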


In [None]:
# Part 2: Speech Recognition

In [None]:
pip install transformers datasets torchaudio


In [None]:
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")


In [None]:
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")


In [None]:
resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
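# torchaudio.load returns (channels, time); average the channels so that
# stereo files become a single mono waveform before resampling.
speech = resampler(speech_array.mean(dim=0)).numpy()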


In [None]:
inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)


In [None]:
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0])
print(transcription)


In [None]:
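from flask import Flask, request, jsonify
import torch
import torchaudio

app = Flask(__name__)

@app.route("/stt", methods=["POST"])
def stt():
    # `model` and `processor` are the objects loaded in the earlier cell.
    audio_file = request.files["audio"]
    speech_array, sampling_rate = torchaudio.load(audio_file)
    # Downmix to mono and resample from this upload's own rate to 16 kHz;
    # a resampler built for an earlier file may have the wrong source rate.
    speech_array = speech_array.mean(dim=0)
    speech = torchaudio.functional.resample(speech_array, sampling_rate, 16000).numpy()
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])
    return jsonify({"transcription": transcription})

if __name__ == "__main__":
    # The auto-reloader tries to restart the process, which fails in a notebook.
    app.run(debug=True, use_reloader=False)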


In [None]:
Part 3: Monitoring and Maintenance
Monitor Performance: Track request latency and error rates for both services, measure transcription accuracy (for example, word error rate) on a held-out sample, and collect user feedback.
Fine-Tuning: As more data becomes available, fine-tune the models to improve performance on your specific use case.
Part 4: Documentation and Sharing
Document the Process: Provide documentation on how to use, maintain, and extend the TTS and Speech Recognition systems.
Share the Models: Share the trained models and code on platforms like GitHub or Hugging Face Model Hub.
This guide gives you a comprehensive approach to building and deploying both TTS and Speech Recognition systems using Hugging Face's NLP models and the torchaudio library. These systems can be customized further based on specific needs and use cases.