<a href="https://colab.research.google.com/github/mhrice/Ai-Jam-2/blob/master/projects/voice_conversion/Local_Youtube_DL_%2B_Voice_conversion_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Voice Conversion Synthesizer

**What is voice conversion?**

Voice conversion changes the characteristics of one person's voice so that it sounds like another person's voice.

**Features of this system:**

Depending on the audio you provide, the converted voice can:

- Speak in any language with different expressive styles and emotions (neutral, normal, crying, angry, screaming).
- Speak fluently and naturally for long periods of time.
- Sing in any language and with multiple vocal techniques.

**What do I need to convert a voice?**

You need a model of that voice that has already been trained on at least 10 minutes of audio. The more audio you use and the higher its quality, the better the model will be.

Follow the instructions in this notebook carefully and please be patient.

In [None]:
#@title ↓ 1. Press play on the cell to download and install the necessary software.

#@markdown #### (_This may take a few minutes_)

!pip install gTTS

!pip install --upgrade --no-cache-dir gdown

!pip install pydub

!pip install wget

from IPython.display import clear_output 
from google.colab import files 
import os

!rm -rf /content/sample_data

!git clone https://github.com/Mixomo/diff-svc.git
 
%cd /content/diff-svc
print('Installing torch')
#!pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
!pip install torch torchvision torchaudio
!pip install yt_dlp
!pip install -r requirements_short.txt
# Quote the version specifier so the shell does not treat < and > as redirections
!pip install "tensorboard<2.9,>=2.8"
%reload_ext tensorboard
print('Downloading pretrained models')
%cd "/content/"
%mkdir -p /content/diff-svc/checkpoints/
!gdown --id 167Q2iTsf6cROH_IFnXbkbSdSK1q4VTbg -O checkpoints.zip
!unzip /content/checkpoints.zip -d /content/diff-svc/
!gdown --id 17H40ZgJ1Me5CtEAFSolZGAK8YCmifrAh
!unzip /content/nyaru.zip -d /content/diff-svc/checkpoints/

!gdown 1v7IXf0o5NqSfI3z2Pm_T8-JZc9YEEGHw -O /content/diff-svc/checkpoints/nsf_hifigan_finetune_20221211.zip
!gdown 1-DjQo-pbRjuUQ5WI7IVuOCwqGuUx-jmA -O /content/diff-svc/checkpoints/nsf_hifigan_20221211.zip

!unzip /content/diff-svc/checkpoints/nsf_hifigan_20221211.zip -d /content/diff-svc/checkpoints
!unzip /content/diff-svc/checkpoints/nsf_hifigan_finetune_20221211.zip -d /content/diff-svc/checkpoints

print("All set ✔️")
    #@markdown ___

In [None]:
#@title Youtube Download and Demucs Class
from google.colab import files
import torch
import torchaudio
from torchaudio.pipelines import HDEMUCS_HIGH_MUSDB_PLUS
from torchaudio.transforms import Fade
import os
from urllib.parse import parse_qs, urlparse
import yt_dlp
from pathlib import Path


class DownloadAndSplit:
    def __init__(self, sample_rate: int = 48000, fade_overlap: float = 0.1):
        self.sample_rate = sample_rate
        self.fade_overlap = fade_overlap
        print("Loading model...")
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        bundle = HDEMUCS_HIGH_MUSDB_PLUS

        sample_rate = bundle.sample_rate
        self.model = bundle.get_model()
        self.model.to(self.device)
        self.root = Path("./downloads")
        self.root.mkdir(parents=True, exist_ok=True)

    def process(self, url: str, skip_download: bool = False):
        root = Path("./downloads")
        if not skip_download:
            print("Downloading song...")
            youtube_id = self.youtube_url_to_id(url)
            path_to_download = root / youtube_id
            path_to_download.mkdir(exist_ok=True)
            output_file_path = path_to_download / "output.wav"
            self.download_youtube(url, youtube_id, path_to_download)
        else:
            output_file_path = Path(url)
            youtube_id = output_file_path.stem
            path_to_download = root / output_file_path.stem
            path_to_download.mkdir(exist_ok=True)

        print("Separating...")
        # torchaudio.load takes frame_offset as its second positional argument, so pass only the path
        mixture, sr = torchaudio.load(str(output_file_path))
        mixture = torchaudio.functional.resample(mixture, sr, self.sample_rate)
        sources = self.separate(mixture, model=self.model, device=self.device)
        vocals = sources["vocals"]
        torchaudio.save(
            f"{path_to_download}/vocals.wav",
            vocals.cpu(),
            sample_rate=self.sample_rate,
        )
        others = torch.zeros_like(vocals)
        for source in sources:
            if source != "vocals":
                others += sources[source]
        torchaudio.save(
            f"{path_to_download}/others.wav",
            others.cpu(),
            sample_rate=self.sample_rate,
        )

        return vocals, others, youtube_id

    def download_youtube(self, url: str, youtube_id: str, output_path: str):
        options = {
            "format": "bestaudio/best",
            "postprocessors": [
                {
                    "key": "FFmpegExtractAudio",
                    "preferredcodec": "wav",
                    "preferredquality": "192",
                }
            ],
            "outtmpl": os.path.join(output_path, f"output.%(ext)s"),
        }
        with yt_dlp.YoutubeDL(options) as youtube_dl:
            youtube_dl.download([url])

    def youtube_url_to_id(self, url: str) -> str:
        url_data = urlparse(url)
        query = parse_qs(url_data.query)
        video_id = query["v"][0]
        return video_id

    def separate(
        self,
        mix: torch.Tensor,
        model: torch.nn.Module,
        device: torch.device,
        segment: int = 10,
        overlap: float = 0.1,
        overlap_frames: float = 0.1,  # overwritten below from overlap * self.sample_rate
    ):
        mix = mix.unsqueeze(0).to(self.device)  # Add batch dimension
        batch, channels, length = mix.shape
        chunk_len = int(self.sample_rate * segment * (1 + overlap))
        start = 0
        end = chunk_len
        overlap_frames = overlap * self.sample_rate
        fade = Fade(
            fade_in_len=0, fade_out_len=int(overlap_frames), fade_shape="linear"
        )
        ref = mix.mean(0)
        mix = (mix - ref.mean()) / ref.std()  # normalization

        sources = torch.zeros(
            batch, len(model.sources), channels, length, device=device
        )

        while start < length - overlap_frames:
            chunk = mix[:, :, start:end]
            with torch.no_grad():
                out = model.forward(chunk)
            out = fade(out)
            sources[:, :, :, start:end] += out
            if start == 0:
                fade.fade_in_len = int(overlap_frames)
                start += int(chunk_len - overlap_frames)
            else:
                start += chunk_len
            end += chunk_len
            if end >= length:
                fade.fade_out_len = 0
        sources = sources * ref.std() + ref.mean()
        sources_list = model.sources
        sources = sources.squeeze(0)  # Drop batch dimension
        sources = list(sources)
        dict_sources = dict(zip(sources_list, sources))
        return dict_sources
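
The `youtube_url_to_id` method above reads only the `v` query parameter, so a `youtu.be` short link would raise a `KeyError`. A minimal sketch of a more tolerant lookup follows. The function name `extract_video_id` is hypothetical and not part of the notebook, and it handles only the two URL shapes shown in the comments.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> str:
    """Return the video ID from a watch URL or a youtu.be short link."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parts.path.lstrip("/").split("/")[0]
    else:
        # Watch URLs carry the ID in the "v" query parameter.
        video_id = parse_qs(parts.query).get("v", [""])[0]
    if not video_id:
        raise ValueError(f"Could not find a video ID in {url!r}")
    return video_id
```

Raising a `ValueError` with the offending URL makes a bad link easier to spot than a bare `KeyError` on `"v"`.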


In [None]:
#@title 2. Enter the Google Drive URLs in the corresponding fields (**VOICE MODEL and MODEL CONFIG**). Make sure they are shared in "anyone with the link" mode.

#@markdown You can find some models in the "model" channel of the Diff-SVC discord

VOICE_MODEL = "https://drive.google.com/file/d/141YeMJFt-q-u1dDOoKvD_SvyJbzWC35N/" #@param {type: "string"}

MODEL_CONFIG = "https://drive.google.com/file/d/1O6vu688nCFWeQlOB5UWKA0VEsB_9yzDc/" #@param {type: "string"}

# Extract the file ID from the URL
VOICE_MODEL_ID = VOICE_MODEL.split("/")[-2]
MODEL_CONFIG_ID = MODEL_CONFIG.split("/")[-2]

# Download the file using gdown
!gdown https://drive.google.com/uc?id=$VOICE_MODEL_ID -O last_checkpoint.ckpt

!gdown https://drive.google.com/uc?id=$MODEL_CONFIG_ID -O config.yaml

print("All set ✔️")

#@markdown ___
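
The cell above takes the file ID with `split("/")[-2]`, which works for links ending in `/` or `/view?...` but returns `d` when the link ends directly after the ID. As a sketch only, a regular expression can cover these shapes. The helper name `drive_file_id` is hypothetical, and the patterns assume the ID contains only letters, digits, `_` and `-`.

```python
import re


def drive_file_id(url: str) -> str:
    """Extract the file ID from common Google Drive share-link shapes."""
    # Matches .../file/d/<ID>, with or without a trailing "/view?..." part.
    match = re.search(r"/d/([A-Za-z0-9_-]+)", url)
    if match is None:
        # Some links carry the ID as a query parameter: ...?id=<ID>
        match = re.search(r"[?&]id=([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"No Drive file ID found in {url!r}")
    return match.group(1)
```

With a helper like this, `VOICE_MODEL_ID = drive_file_id(VOICE_MODEL)` would give the same result for the default URLs in the form fields.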

In [None]:
#@title Download and Split



sample_rate = 44100
# Process a YouTube link, or a local audio file when use_local is checked
song_link = "/content/senoritapitchedup.mp3" #@param {type: "string"}
use_local = True #@param {type: "boolean"}
ds = DownloadAndSplit(sample_rate=sample_rate)
vocals, others, youtube_id = ds.process(song_link, skip_download=use_local)
filename = f"./downloads/{youtube_id}/vocals.wav"

In [None]:
#@title Workaround for numba/numpy issue. Please restart runtime after running this cell
!pip install numpy==1.23.5 
!pip install librosa==0.9.1
!pip install numba==0.56.4

In [None]:
#@title ↓ 3. Load the model.


%cd "/content/diff-svc/"

import os
os.environ['PYTHONPATH']='.'

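# A "!" shell assignment only affects a throwaway subshell, so set the variable in this process
os.environ['CUDA_VISIBLE_DEVICES'] = '0'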


from utils.hparams import hparams
from preprocessing.data_gen_utils import get_pitch_parselmouth,get_pitch_crepe
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
import utils
import librosa
import torchcrepe
from infer import *
import logging
from infer_tools.infer_tool import *

logging.getLogger('numba').setLevel(logging.WARNING)

# Project folder name, the one used during training
project_name = "any" 
model_path = "/content/last_checkpoint.ckpt"
config_path="/content/config.yaml" 
hubert_gpu=True
svc_model = Svc(project_name,config_path,hubert_gpu, model_path)
print('model loaded')
print("All set ✔️")


In [None]:
#@title 4. Voice conversion settings and execution

#@markdown #### (_This may take a few minutes_)

#@markdown **NOTE: You must adjust the settings before executing the cell**.

%cd "/content/diff-svc/"

wav_fn = f"/content/downloads/{youtube_id}/vocals.wav"

demoaudio, sr = librosa.load(wav_fn)

#@markdown ___

#@markdown #### ``This parameter shifts the pitch of the audio you provided so that it matches the target voice. It is measured in semitones and can be positive or negative. A value of 12 is equivalent to one octave. ``
#@markdown ___
#@markdown #### ``If the conversion is from a male voice to a male voice, or from a female voice to a female voice, you may leave this value at 0. ``
#@markdown ___
#@markdown #### ``On the other hand, if the conversion is from a male voice to a female voice, we recommend increasing the value until the converted voice sounds to your liking. Reference value: 12.``
#@markdown ___
#@markdown #### ``And if the conversion is made from a female voice to a male voice, lower the value (negative numbers) until the converted voice sounds to your liking. Reference value: -12.``
#@markdown ___
#@markdown #### ``You are free to experiment with intermediate, higher, or lower values until you achieve the desired result. It is a matter of trial and error.``
#@markdown ___
key = 0#@param {type: "integer"}
#@markdown ___
#@markdown #### ``This parameter is used to modify the steps of each generation / synthesis. The minimum value that gives a decent voice quality is 20. You can increase the value to try to improve the quality of the conversion, although it will take more time. A recommended value may be 50. ``

pndm_speedup = 20 #@param {type: "integer"}

wav_gen='/content/converted_audio.wav' 

add_noise_step = 500 

thre = 0.05 

use_crepe= False

use_pe=False
#@markdown ___
#@markdown #### ``This parameter uses the mel spectrogram of the incoming audio as the starting point for the conversion. Sometimes it gives better results, sometimes not. Experimentation is recommended. ``
use_gt_mel= False #@param {type: "boolean"}

f0_tst, f0_pred, audio = run_clip(svc_model,file_path=wav_fn, key=key, acc=pndm_speedup, use_crepe=use_crepe, use_pe=use_pe, thre=thre,
                                        use_gt_mel=use_gt_mel, add_noise_step=add_noise_step,project_name=project_name,out_path=wav_gen)

print("All set ✔️")
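
The `key` setting above is given in semitones, with 12 equal to one octave. In equal temperament, a shift of n semitones multiplies frequency by 2^(n/12). The sketch below only illustrates that arithmetic. The function name `semitones_to_ratio` is hypothetical, and how Diff-SVC applies the key internally is not shown in this notebook.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift of the given number of semitones."""
    # One octave (12 semitones) doubles the frequency.
    return 2.0 ** (semitones / 12.0)


if __name__ == "__main__":
    for key in (-12, 0, 7, 12):
        print(f"key={key:+d} -> frequency x{semitones_to_ratio(key):.4f}")
```

A key of 12 therefore doubles every pitch and a key of -12 halves it, which is why these are the reference values for male-to-female and female-to-male conversions.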

In [None]:
#@title ↓ 5. Play the audio with the converted voice.
#@markdown If you don't like the result, change the settings in the above cell and run it again until you find the value you like.

#@markdown **With long audio files (full songs, for example), the player may take a while to appear and can end up crashing Colab. In such cases we recommend downloading the audio directly to your computer with the cell below.**

ipd.display(ipd.Audio(audio, rate=hparams['audio_sample_rate'], normalize=True))

In [None]:
#@title ↓ 6. Combine and download the audio
import torch.nn.functional as F
import soundfile as sf
from torchaudio.functional import resample
from pydub import AudioSegment
%cd /content/

def get_track_pitch_shift(vocal_key):
  # Fold the vocal key into the nearest equivalent shift within one octave (-5 to 6 semitones)
  ps = vocal_key % 12
  if ps > 6:
    ps -= 12
  return ps

# Load the converted vocals as a tensor
converted_vocals, sr = torchaudio.load(wav_gen)
converted_vocals = converted_vocals.to(others.device)
converted_vocals = resample(converted_vocals, sr, sample_rate).squeeze(0)
# Convert to stereo
converted_vocals = torch.stack([converted_vocals, converted_vocals])

# Pitch shift beat to match vocals
ps = get_track_pitch_shift(key)
track = torchaudio.functional.pitch_shift(others, sample_rate, n_steps=ps)

# Pad the vocals to match the others
diff = track.shape[1] - converted_vocals.shape[1]
if diff > 0:
    converted_vocals = F.pad(converted_vocals, (0, diff))
elif diff < 0:
    track = F.pad(track, (0, -diff))

full_song = converted_vocals + track


sf.write("/content/converted_vocals.wav", converted_vocals.T.cpu(), sample_rate)
sf.write("/content/track.wav", track.T.cpu(), sample_rate)
sf.write("/content/output.wav", full_song.T.cpu(), sample_rate)


song = AudioSegment.from_wav("/content/converted_vocals.wav")
song.export("vocals.mp3", format="mp3")
files.download('/content/vocals.mp3')

song = AudioSegment.from_wav("/content/track.wav")
song.export("track.mp3", format="mp3")
files.download('/content/track.mp3')

song = AudioSegment.from_wav("/content/output.wav")
song.export("output.mp3", format="mp3")
files.download('/content/output.mp3') 
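
The cell above pads the shorter signal with zeros and then adds the vocals to the backing track. The same idea can be sketched on plain Python lists for a single channel. The function name `pad_and_mix` is hypothetical, and the peak scaling at the end is an assumption that the notebook does not make: it is included because summed samples outside [-1, 1] may clip when written to an integer PCM file.

```python
from typing import List


def pad_and_mix(vocals: List[float], track: List[float]) -> List[float]:
    """Zero-pad the shorter signal, sum both, and scale down if the sum exceeds 1.0."""
    length = max(len(vocals), len(track))
    # Pad with silence so both signals have the same number of samples.
    vocals = vocals + [0.0] * (length - len(vocals))
    track = track + [0.0] * (length - len(track))
    mixed = [v + t for v, t in zip(vocals, track)]
    # Assumption: rescale only when the mix would leave the [-1, 1] range.
    peak = max((abs(s) for s in mixed), default=0.0)
    if peak > 1.0:
        mixed = [s / peak for s in mixed]
    return mixed
```

Scaling by the peak keeps the balance between vocals and track unchanged while lowering the overall level.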

In [None]:
#@title ↓ 7. Delete the reference audio to re-generate or upload a new one.

!rm -f /content/reference_audio.wav

NOTE: To generate a new voice conversion, change the settings in step 4, then run steps 4, 5, and 6 again.
_______________________
*If cell 5 or cell 6 raises an error, rerun it or try a shorter audio file.

When you have finished using the system, go to Runtime > Disconnect and delete runtime.