-
Notifications
You must be signed in to change notification settings - Fork 192
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
add general voice-cloning finetuning code (#281)
* add general voice-cloning code * add copyright * Refine the readme Signed-off-by: hshen14 <haihao.shen@intel.com> * Refine the readme Signed-off-by: hshen14 <haihao.shen@intel.com> * Refine the readme Signed-off-by: hshen14 <haihao.shen@intel.com> * enhance finetuning Signed-off-by: Spycsh <sihan.chen@intel.com> * refine README --------- Signed-off-by: hshen14 <haihao.shen@intel.com> Signed-off-by: Spycsh <sihan.chen@intel.com> Co-authored-by: hshen14 <haihao.shen@intel.com>
- Loading branch information
Showing
16 changed files
with
277 additions
and
33 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
55 changes: 55 additions & 0 deletions
55
intel_extension_for_transformers/neural_chat/examples/tts_finetuning/README.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# Voice Cloning by finetuning a Text-To-Speech (TTS) model | ||
|
||
This example shows you how to clone an arbitrary person's voice by finetuning SpeechT5. | ||
|
||
## Prepare data | ||
|
||
Under this `tts_finetuning` example directory, please make sure there exist two directories: `audios/` and `texts/`, | ||
where the former contains all the audio files and the latter contains all the texts corresponding to each audio file. | ||
|
||
The audio and text file names should be formatted like following (We just need to make sure every audio file has the same name with its text file): | ||
|
||
``` | ||
audios/ | ||
<audio_name_0>.mp3 | ||
<audio_name_1>.mp3 | ||
<audio_name_2>.mp3 | ||
... | ||
texts/ | ||
<audio_name_0>.txt | ||
<audio_name_1>.txt | ||
<audio_name_2>.txt | ||
... | ||
``` | ||
|
||
|
||
Users can use their own audios and corresponding texts, or they can download from the Internet. Here are the [audio samples](https://github.com/audio-samples/audio-samples.github.io/tree/master/samples/mp3/ted_speakers/FeiFeiLi) that we use in this example. | ||
|
||
Then, we can prepare the texts of those audio files by just listening and writing the texts manually, or running one audio-speech-recognition (ASR) model by using Intel Extension For Transformers ASR interface: | ||
|
||
```python | ||
# Replace <xxxxx_sample-0> with your input audio name | ||
python asr.py -i audios/<xxxxx_sample-0>.mp3 -m openai/whisper-tiny | ||
``` | ||
|
||
For simplicity in this example, we have already generated the texts of the aforementioned audio samples under the `texts/` folder. | ||
|
||
## Finetuning | ||
|
||
After preparing the dataset, we can start finetuning. We can just run the following command and the finetuned model is by default named `finetuned_model.pt`. | ||
|
||
``` | ||
python finetune.py | ||
``` | ||
|
||
You can change the arguments in `TTSDatasetArguments` or `TTSModelArguments` in the `finetune.py` passed to Intel Extension For Transformers TTS interface to customize your finetuning process. | ||
|
||
## Inference | ||
|
||
Now we have our finetuned model `finetuned_model.pt`, so let us check the quality and performance of that. We can run the following command: | ||
``` | ||
python inference.py | ||
``` | ||
|
||
Then you will see a prompt in the console `Write a sentence to let the talker speak:`, and you can enter your input text in the console and press ENTER on your keyboard to generate the speech with your finetuned voice. |
30 changes: 30 additions & 0 deletions
30
intel_extension_for_transformers/neural_chat/examples/tts_finetuning/asr.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# !/usr/bin/env python | ||
# -*- coding: utf-8 -*- | ||
# | ||
# Copyright (c) 2023 Intel Corporation | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
|
||
from intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.asr import AudioSpeechRecognition | ||
import argparse | ||
parser = argparse.ArgumentParser( | ||
prog='asr', | ||
description='Audio Speech Recognition') | ||
parser.add_argument('-i', '--input_audio') | ||
parser.add_argument('-m', '--model_name_or_path', default="openai/whisper-tiny") | ||
parser.add_argument('-d', '--device', default="cuda") | ||
args = parser.parse_args() | ||
asr = AudioSpeechRecognition(model_name_or_path=args.model_name_or_path, device=args.device) | ||
text = asr.audio2text(args.input_audio) | ||
print(text) |
34 changes: 34 additions & 0 deletions
34
intel_extension_for_transformers/neural_chat/examples/tts_finetuning/finetune.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# !/usr/bin/env python | ||
# -*- coding: utf-8 -*- | ||
# | ||
# Copyright (c) 2023 Intel Corporation | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from intel_extension_for_transformers.neural_chat.pipeline.plugins.audio.finetuning.tts_finetuning import TTSFinetuning | ||
from intel_extension_for_transformers.neural_chat.config import TTSFinetuningConfig, TTSDatasetArguments, TTSModelArguments | ||
import torch | ||
import os | ||
|
||
workdir = os.getcwd() | ||
data_args = TTSDatasetArguments(audio_folder_path=os.path.join(workdir, "audios"), | ||
text_folder_path=os.path.join(workdir, "texts"),) | ||
model_args = TTSModelArguments(step=1000, warmup_step=125, learning_rate=1e-5) | ||
finetuning_config = TTSFinetuningConfig(data_args, model_args) | ||
|
||
tts_fintuner = TTSFinetuning(finetuning_config=finetuning_config) | ||
finetuned_model = tts_fintuner.finetune() | ||
|
||
torch.save(finetuned_model, "finetuned_model.pt") | ||
|
||
|
134 changes: 134 additions & 0 deletions
134
intel_extension_for_transformers/neural_chat/examples/tts_finetuning/inference.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,134 @@ | ||
# !/usr/bin/env python | ||
# -*- coding: utf-8 -*- | ||
# | ||
# Copyright (c) 2023 Intel Corporation | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech | ||
from datasets import load_dataset, Audio, Dataset, Features, ClassLabel | ||
import os | ||
import torch | ||
from speechbrain.pretrained import EncoderClassifier | ||
from dataclasses import dataclass | ||
from typing import Any, Dict, List, Union | ||
from transformers import SpeechT5HifiGan | ||
import soundfile as sf | ||
from datetime import datetime | ||
from num2words import num2words | ||
|
||
workdir = os.getcwd() | ||
|
||
model = torch.load("finetuned_model.pt") | ||
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts") | ||
|
||
spk_model_name = "speechbrain/spkrec-xvect-voxceleb" | ||
device = "cuda" if torch.cuda.is_available() else "cpu" | ||
speaker_model = EncoderClassifier.from_hparams( | ||
source=spk_model_name, | ||
run_opts={"device": device}, | ||
savedir=os.path.join("/tmp", spk_model_name) | ||
) | ||
def create_speaker_embedding(waveform): | ||
with torch.no_grad(): | ||
speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform)) | ||
speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2) | ||
speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy() | ||
return speaker_embeddings | ||
|
||
audio_dataset = Dataset.from_dict({"audio": [os.path.join(workdir, "audios/samples_mp3_ted_speakers_FeiFeiLi_sample-0.mp3")]}).cast_column("audio", Audio(sampling_rate=16000)) | ||
sembeddings = create_speaker_embedding(audio_dataset[0]["audio"]['array']) | ||
speaker_embeddings = torch.tensor(sembeddings).unsqueeze(0) | ||
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device) | ||
|
||
def correct_abbreviation(text): | ||
correct_dict = { | ||
"A": "Eigh", | ||
"B": "bee", | ||
"C": "cee", | ||
"D": "dee", | ||
"E": "yee", | ||
"F": "ef", | ||
"G": "jee", | ||
"H": "aitch", | ||
"I": "I", | ||
"J": "jay", | ||
"K": "kay", | ||
"L": "el", | ||
"M": "em", | ||
"N": "en", | ||
"O": "o", | ||
"P": "pee", | ||
"Q": "cue", | ||
"R": "ar", | ||
"S": "ess", | ||
"T": "tee", | ||
"U": "u", | ||
"V": "vee", | ||
"W": "doubleliu", | ||
"X": "ex", | ||
"Y": "wy", | ||
"Z": "zed" | ||
} | ||
words = text.split() | ||
results = [] | ||
for idx, word in enumerate(words): | ||
if word.isupper(): | ||
for c in word: | ||
if c in correct_dict: | ||
results.append(correct_dict[c]) | ||
else: | ||
results.append(c) | ||
else: | ||
results.append(word) | ||
return " ".join(results) | ||
|
||
def correct_number(text): | ||
"""Ignore the year or other exception right now""" | ||
words = text.split() | ||
results = [] | ||
for idx, word in enumerate(words): | ||
if word.isdigit(): | ||
try: | ||
word = num2words(word) | ||
except Exception as e: | ||
print(f"num2words fail with word: {word} and exception: {e}") | ||
else: | ||
try: | ||
val = int(word) | ||
word = num2words(word) | ||
except ValueError: | ||
try: | ||
val = float(word) | ||
word = num2words(word) | ||
except ValueError: | ||
pass | ||
results.append(word) | ||
return " ".join(results) | ||
|
||
|
||
while True: | ||
try: | ||
text = input("Write a sentence to let the talker speak:\n") | ||
text = correct_abbreviation(text) | ||
text = correct_number(text) | ||
inputs = processor(text=text, return_tensors="pt") | ||
spectrogram = model.generate_speech(inputs["input_ids"].to(device), speaker_embeddings.to(device)) | ||
with torch.no_grad(): | ||
speech = vocoder(spectrogram) | ||
now = datetime.now() | ||
time_stamp = now.strftime("%d_%m_%Y_%H_%M_%S") | ||
sf.write(f"output_{time_stamp}.wav", speech.cpu().numpy(), samplerate=16000) | ||
except Exception as e: | ||
print(f"Catch exception: {e}") | ||
print("Restarting\n") |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-0.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
a cramp is no small danger on a swim |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-1.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
he said the same phrase thirty times |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-2.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
plus the bright rows without least |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-3.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
two plus seven is less than ten |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-4.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
the glow deepened in the eyes of the sweet girl |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-5.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
bring your problems to the wise chief |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-6.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
write a fond note to the friend you cherish |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-7.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
clothes and lodging are free to new men |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-8.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
we from an events take a bad turn |
1 change: 1 addition & 0 deletions
1
.../neural_chat/examples/tts_finetuning/texts/samples_mp3_ted_speakers_FeiFeiLi_sample-9.txt
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
port is a strong line with a smoky taste |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters