## Project Overview

The objective of this Proof of Concept (PoC) is to develop a system that generates lip-synced videos from user-provided text. Users specify the language, gender, and speaker code, and choose one of four images, each representing a different person. The system produces a video in which the selected person appears to speak the provided text in a realistic, synchronized manner.

## Scope of Work
Input Parameters:

* Text: User-provided text that the selected person will be lip-syncing.
* Language: The language in which the text is written and will be spoken.
* Gender: The gender of the speaker, affecting the voice modulation.
* Speaker Code: A unique identifier for different speaker voice profiles.
* Image Option: Selection of one out of four predefined images of persons who will appear in the video.
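
The five inputs above can be grouped into a single request object. The following is a minimal sketch, not part of the notebook: `VideoRequest` and `IMAGE_OPTIONS` are hypothetical names, and the four image labels are the keys of the notebook's image-to-video mapping.

```python
from dataclasses import dataclass

# Image labels as they appear in the notebook's image-to-video mapping.
IMAGE_OPTIONS = ("Black-or-African-American", "Asian", "Indian", "White-American")


@dataclass(frozen=True)
class VideoRequest:
    """Hypothetical container for the user-supplied parameters."""
    text: str
    language: str
    gender: str
    speaker_code: str
    image_option: str

    def __post_init__(self):
        # Reject empty text and unknown image options early.
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if self.image_option not in IMAGE_OPTIONS:
            raise ValueError(f"image_option must be one of {IMAGE_OPTIONS}")


request = VideoRequest(
    text="Hi my name is Sia",
    language="English",
    gender="Female",
    speaker_code="v2/en_speaker_9",
    image_option="Indian",
)
```

Validating at construction time means a bad image choice is reported before any audio or video is generated.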

Expected Output:

A lip-synced video featuring the chosen person, in which the person's lip movements are aligned with the speech generated from the provided text.

## Technical Specifications
* Programming Language: Python (or any other suitable language)
* Frameworks and Libraries:
  * Text-to-Speech: Use the Suno Bark small model (`suno/bark-small`) for TTS.
  * Lip Sync: Use the Wav2Lip model for generating realistic lip-synced videos.

## Running the code

Upload the `requirements.txt` and `speakers.json` files to the runtime, then install the dependencies with the command below:

pip install -r requirements.txt

## Installing libraries and models

In [None]:
pip install -r /content/requirements.txt

In [None]:
pip install gradio -q

In [None]:
# Clone the Wav2Lip repository from GitHub
!git clone https://github.com/zabique/Wav2Lip

# Download the pretrained Wav2Lip model
!wget 'https://iiitaphyd-my.sharepoint.com/personal/radrabha_m_research_iiit_ac_in/_layouts/15/download.aspx?share=EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA' -O '/content/Wav2Lip/checkpoints/wav2lip_gan.pth'

# Install a specific package
a = !pip install https://raw.githubusercontent.com/AwaleSajil/ghc/master/ghc-1.0-py3-none-any.whl

# Install requirements from the Wav2Lip repository
!cd Wav2Lip && pip install -r requirements.txt -q

# Install youtube-dl for downloading videos
!pip install -q youtube-dl

# Install librosa for audio processing (specific version 0.9.1)
!pip install librosa==0.9.1

# Install moviepy for video editing
!pip install -q moviepy

variable_name = False  # Initialize a variable (not used in this snippet)

# Remove the default sample data directory and create a new one
!rm -rf /content/sample_data
!mkdir /content/sample_data

#Required libraries for audio processing and display
import torch
import scipy.io.wavfile  # For reading and writing WAV files
from transformers import BarkModel, AutoProcessor  # Hugging Face Transformers
from IPython.display import Audio, HTML  # For displaying audio and HTML in Jupyter notebooks
import time  # For time-related functions
from base64 import b64encode  # For encoding binary data to base64
import json  # For working with JSON data
from moviepy.editor import VideoFileClip, AudioFileClip  # For editing video and audio files
import random  # For generating random numbers
import gradio as gr





## Text to speech module

In [5]:

# Open the 'speakers.json' file in read mode
with open('speakers.json', 'r') as f:
    # Load the JSON data from the file into a Python dictionary
    speakers = json.load(f)
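
The functions that follow read the keys `language`, `gender`, and `code` from each entry, so `speakers.json` is presumably a list of objects with those keys. Here is a small sketch of that assumed shape, with a hypothetical `validate_speakers` helper and made-up sample entries (the real file may contain more fields):

```python
import json

REQUIRED_KEYS = {"language", "gender", "code"}


def validate_speakers(entries):
    """Return the entries unchanged, raising if any lacks a required key."""
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
    return entries


# Made-up sample in the assumed format.
sample_json = """
[
  {"language": "English", "gender": "Female", "code": "v2/en_speaker_9"},
  {"language": "English", "gender": "Male", "code": "v2/en_speaker_6"}
]
"""
sample_speakers = validate_speakers(json.loads(sample_json))
```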


In [6]:
def choose_voice(data, language, gender, speaker_code):
    """
    Choose a specific voice based on language, gender, and speaker code.

    Parameters:
    - data (list of dict): List of speaker dictionaries, each containing 'language', 'gender', and 'code'.
    - language (str): The desired language of the speaker.
    - gender (str): The desired gender of the speaker.
    - speaker_code (str): The specific code of the speaker.

    Returns:
    - dict or None: The dictionary of the selected speaker if found, otherwise None.
    """

    # Filter speakers by the desired language
    speakers_by_language = [d for d in data if d['language'] == language]

    # Further filter by the desired gender
    speakers_by_gender = [d for d in speakers_by_language if d['gender'] == gender]

    # Find the specific speaker with the given code
    selected_speaker = next((d for d in speakers_by_gender if d['code'] == speaker_code), None)

    return selected_speaker
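
The three filtering steps can also be collapsed into a single pass over the list. This is a minimal sketch of that alternative, using a hypothetical `find_speaker` name and made-up sample entries; it should give the same result as `choose_voice` for the same inputs.

```python
def find_speaker(data, language, gender, speaker_code):
    """Return the first entry matching all three fields, or None."""
    return next(
        (
            d for d in data
            if d["language"] == language
            and d["gender"] == gender
            and d["code"] == speaker_code
        ),
        None,
    )


# Made-up sample entries for illustration.
sample = [
    {"language": "English", "gender": "Female", "code": "v2/en_speaker_9"},
    {"language": "Hindi", "gender": "Male", "code": "v2/hi_speaker_2"},
]

match = find_speaker(sample, "English", "Female", "v2/en_speaker_9")
no_match = find_speaker(sample, "English", "Male", "v2/en_speaker_9")
```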



In [8]:
def text_to_speech(text, language, gender, speaker_code):
    """
    Convert text to speech using the specified parameters.

    Parameters:
    - text (str): The text to convert to speech.
    - language (str): The desired language of the voice.
    - gender (str): The desired gender of the voice.
    - speaker_code (str): The code of the specific speaker to use.

    Returns:
    - speech_output (tensor): The generated speech output.
    - sampling_rate (int): The sampling rate of the generated speech.
    """

    # Initialize model and processor
    model = BarkModel.from_pretrained("suno/bark-small")
    processor = AutoProcessor.from_pretrained("suno/bark-small")

    # Determine the computation device (GPU if available, otherwise CPU)
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # Choose the specific voice based on parameters
    voice = choose_voice(speakers, language, gender, speaker_code)

    if voice is None:
        raise ValueError("The specified voice was not found.")

    # Process the text with the voice preset
    inputs = processor(text, voice_preset=voice['code'])

    # Generate speech from the processed inputs
    speech_output = model.generate(**inputs.to(device))

    # Retrieve the sampling rate from the model's configuration
    sampling_rate = model.generation_config.sample_rate

    return speech_output, sampling_rate
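
`text_to_speech` loads the model and processor on every call. One possible refinement, sketched here on the assumption that the same checkpoint is reused across calls, is to cache the loaded objects. `get_bark` is a hypothetical helper, not part of the notebook.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_bark(model_id="suno/bark-small"):
    """Load the Bark model and processor once and reuse them afterwards."""
    # Imported lazily so that defining this helper does not itself
    # require the heavy dependencies to be loaded.
    import torch
    from transformers import AutoProcessor, BarkModel

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = BarkModel.from_pretrained(model_id).to(device)
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor, device
```

Inside `text_to_speech`, the two `from_pretrained` calls and the device selection could then be replaced by `model, processor, device = get_bark()`.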


In [9]:
def save_audio(text, language, gender, speaker_code):
    """
    Generate speech from text and save it as a WAV file.

    Parameters:
    - text (str): The text to convert to speech.
    - language (str): The desired language of the voice.
    - gender (str): The desired gender of the voice.
    - speaker_code (str): The code of the specific speaker to use.
    """
    # Generate speech output and retrieve the sampling rate
    speech_output, sampling_rate = text_to_speech(text, language, gender, speaker_code)

    # Convert speech output tensor to numpy array and ensure it's 1D
    audio_data = speech_output.cpu().numpy().squeeze()

    # Save the audio data to a WAV file
    scipy.io.wavfile.write("/content/audio.wav", rate=sampling_rate, data=audio_data)
    tts_audio_path = "/content/audio.wav"

    print("Audio saved as /content/audio.wav")
    return tts_audio_path
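
`scipy.io.wavfile.write` stores the array with whatever dtype it is given, so a float array produces a floating-point WAV file. If a downstream tool expects 16-bit PCM instead, the samples can be converted first. This minimal sketch assumes the audio lies in the range [-1, 1]; `to_int16` is a hypothetical helper.

```python
import numpy as np


def to_int16(audio):
    """Convert float audio in [-1, 1] to 16-bit PCM samples."""
    clipped = np.clip(audio, -1.0, 1.0)  # guard against out-of-range samples
    return (clipped * 32767).astype(np.int16)


samples = np.array([0.0, 0.5, -1.0, 1.2], dtype=np.float32)
pcm = to_int16(samples)
```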



## Wav2lip module

In [10]:
def generate_video(tts_audio_path, video_path):
    """
    Generate a lip-synced video using Wav2Lip.

    Parameters:
    - tts_audio_path (str): Path to the text-to-speech audio file.
    - video_path (str): Path to the input video file.

    Returns:
    - None. The result is written to /content/Wav2Lip/results/result_voice.mp4.
    """

    # Run the Wav2Lip inference script (IPython shell syntax; the braces are filled from the local variables)
    !cd Wav2Lip && python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "{video_path}" --audio "{tts_audio_path}"

    print("Video generation complete.")
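
The `!` shell syntax only works inside IPython or Jupyter. Outside a notebook, the same command can be issued with `subprocess`. Here is a sketch under that assumption, using only the flags that appear in the command above. `build_wav2lip_command` and `run_wav2lip` are hypothetical names.

```python
import subprocess
import sys


def build_wav2lip_command(video_path, audio_path,
                          checkpoint="checkpoints/wav2lip_gan.pth"):
    """Build the argument list for Wav2Lip's inference script."""
    return [
        sys.executable, "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", video_path,
        "--audio", audio_path,
    ]


def run_wav2lip(video_path, audio_path, repo_dir="/content/Wav2Lip"):
    """Run inference from the repository directory; raise if it fails."""
    command = build_wav2lip_command(video_path, audio_path)
    subprocess.run(command, cwd=repo_dir, check=True)


command = build_wav2lip_command("/content/Indian.mp4", "/content/audio.wav")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, and `check=True` turns a failed inference run into a Python exception.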

## Moviepy module

In [11]:
def sync_audio_with_video(video_path, tts_audio_path):
    """
    Synchronize TTS audio with a video clip, adjusting the video duration or looping if needed.

    Parameters:
    - video_path (str): Path to the input video file.
    - tts_audio_path (str): Path to the TTS audio file.

    Returns:
    - None. The output video is written to /content/output.mp4.
    """
    # Load the video and audio clips
    video_clip = VideoFileClip(video_path)
    audio_clip = AudioFileClip(tts_audio_path)

    # Determine the duration of the video and audio
    video_duration = video_clip.duration
    audio_duration = audio_clip.duration

    # Calculate the random start time for the video subclip
    if video_duration > audio_duration:
        max_start_time = video_duration - audio_duration
        start_time = random.uniform(0, max_start_time)
        video_clip = video_clip.subclip(start_time, start_time + audio_duration)
    else:
        # If the video duration is less than or equal to the audio duration, loop the video
        video_clip = video_clip.loop(duration=audio_duration)

    # Set the audio of the video clip to the new TTS audio
    final_video = video_clip.set_audio(audio_clip)

    # Write the result to a file
    final_video.write_videofile("/content/output.mp4", codec="libx264", audio_codec="aac")

    print("Video saved")
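
The trim-or-loop decision in `sync_audio_with_video` can be isolated from MoviePy and checked on its own. This minimal sketch uses a hypothetical `plan_clip` helper that mirrors that logic. For a 10 s video and 4 s of audio, the start time is drawn from the interval [0, 6].

```python
import random


def plan_clip(video_duration, audio_duration, rng=random):
    """Return ("trim", start, end) or ("loop", 0.0, audio_duration)."""
    if video_duration > audio_duration:
        # Pick a random window of the audio's length inside the video.
        start = rng.uniform(0, video_duration - audio_duration)
        return ("trim", start, start + audio_duration)
    # Video is shorter than (or equal to) the audio: repeat it until the audio ends.
    return ("loop", 0.0, audio_duration)


mode, start, end = plan_clip(10.0, 4.0, random.Random(0))
short_plan = plan_clip(3.0, 5.0)
```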


## Display video

In [12]:
def display_video(file_path):
    """
    Display a video in an IPython notebook using base64 encoding.

    Parameters:
    - file_path (str): Path to the video file to be displayed.

    Returns:
    - HTML: HTML object to render the video in the notebook.
    """
    # Read the video file in binary mode
    with open(file_path, 'rb') as video_file:
        mp4 = video_file.read()

    # Encode the video file in base64
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()

    # Create HTML for displaying the video
    video_html = f"""
    <video width="50%" height="50%" controls>
          <source src="{data_url}" type="video/mp4">
    </video>"""

    # Return the HTML object
    return HTML(video_html)
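
Embedding the file as a base64 data URL keeps the notebook self-contained, at the cost of size: base64 encodes every 3 bytes as 4 characters, so the embedded string is roughly a third larger than the file. Here is a small sketch of the encoding step on its own, with a hypothetical `to_data_url` helper:

```python
from base64 import b64encode


def to_data_url(payload, mime_type="video/mp4"):
    """Encode raw bytes as a base64 data URL."""
    return f"data:{mime_type};base64," + b64encode(payload).decode("ascii")


url = to_data_url(b"abc")
# 300 input bytes become 400 base64 characters.
encoded_length = len(b64encode(bytes(300)))
```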



In [13]:
wav2lip_output_path = "/content/Wav2Lip/results/result_voice.mp4"
moviepy_output_path = "/content/output.mp4"

## Giving the inputs here

In [14]:
text = "Hi my name is Sia, i am news reader for evening news channel"
language = "English"
gender = "Female"
speaker_code = "v2/en_speaker_9"
video_path = "/content/Indian.mp4"
wav2lip_path = "/content/Wav2Lip"

## Calling the tts function to generate audio

In [None]:
tts_audio_path = save_audio(text, language, gender, speaker_code)

Generating this audio on a T4 GPU took 30 s the first time and 17 s the second time.

## Calling the wav2lip function to make a video

Upload the video files before running this cell.

In [None]:
generate_video(tts_audio_path, video_path)

Generating the video took 46 s the first time and 36 s the second time.

In [None]:
display_video(wav2lip_output_path)

## Calling moviepy function to generate video

In [None]:
sync_audio_with_video(video_path, tts_audio_path)

Moviepy - Building video /content/output.mp4.
MoviePy - Writing audio in outputTEMP_MPY_wvf_snd.mp4
MoviePy - Done.
Moviepy - Writing video /content/output.mp4

Moviepy - Done !
Moviepy - video ready /content/output.mp4
Video saved


MoviePy took 1 s to generate a 5 s video.

In [None]:
display_video(moviepy_output_path)

In [13]:
def gradio_function(text, language, gender, speaker_code, selected_image, model):

    video_path = get_video_path(selected_image)
    tts_audio_path = save_audio(text, language, gender, speaker_code)

    if model == "Wav2Lip":
        generate_video(tts_audio_path, video_path)
        result_video_path = "/content/Wav2Lip/results/result_voice.mp4"
    elif model == "MoviePy":
        sync_audio_with_video(video_path, tts_audio_path)
        result_video_path = "/content/output.mp4"
    else:
        raise ValueError("Invalid model choice. Choose either 'Wav2Lip' or 'MoviePy'.")

    return result_video_path

In [14]:
models = ["Wav2Lip", "MoviePy"]

In [15]:


def update_genders(language):
    # Filter speakers by the selected language
    speakers_by_language = [d for d in speakers if d['language'] == language]
    # Extract unique genders
    genders = sorted(set(d['gender'] for d in speakers_by_language))
    return genders

def update_codes(language, gender):
    # Filter speakers by the selected language and gender
    speakers_by_language_gender = [d for d in speakers if d['language'] == language and d['gender'] == gender]
    # Extract unique codes
    codes = sorted(set(d['code'] for d in speakers_by_language_gender))
    return codes

def on_language_change(language):
    # Update genders based on the selected language
    genders = update_genders(language)
    return gr.update(choices=genders, value=None)

def on_gender_change(language, gender):
    # Update codes based on the selected language and gender
    codes = update_codes(language, gender)
    return gr.update(choices=codes, value=None)
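
Both lookup helpers rescan the full speaker list on every dropdown change. For a larger list, the choices could be precomputed once into a nested mapping. This sketch uses a hypothetical `build_index` helper and made-up sample entries:

```python
from collections import defaultdict


def build_index(entries):
    """Map language -> gender -> sorted list of speaker codes."""
    index = defaultdict(lambda: defaultdict(set))
    for entry in entries:
        index[entry["language"]][entry["gender"]].add(entry["code"])
    # Convert to plain dicts with sorted lists for stable dropdown order.
    return {
        language: {gender: sorted(codes) for gender, codes in genders.items()}
        for language, genders in index.items()
    }


# Made-up sample entries for illustration.
sample = [
    {"language": "English", "gender": "Female", "code": "v2/en_speaker_9"},
    {"language": "English", "gender": "Male", "code": "v2/en_speaker_6"},
    {"language": "English", "gender": "Male", "code": "v2/en_speaker_0"},
]
index = build_index(sample)
```

With such an index, the gender choices for a language are `sorted(index[language])` and the code choices are `index[language][gender]`.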




In [22]:
# Predefined image-to-video mapping
image_to_video_mapping = {
    "Black-or-African-American": "/content/African.mp4",
    "Asian": "/content/Asian.mp4",
    "Indian": "/content/Indian.mp4",
    "White-American": "/content/American.mp4"
}

# List of image names for dropdown
image_names = list(image_to_video_mapping.keys())

In [23]:
def select_image(image_name):
    image_path = f"/content/{image_name}.jpeg"
    return image_path

def get_video_path(image_name):
    return image_to_video_mapping.get(image_name, "")
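
`get_video_path` returns an empty string for an unknown name, which would only surface later as a failure inside Wav2Lip or MoviePy. A stricter variant, sketched with a hypothetical `require_video_path` helper and a made-up two-entry mapping, fails immediately instead:

```python
def require_video_path(image_name, mapping):
    """Return the mapped video path, raising a clear error if unknown."""
    try:
        return mapping[image_name]
    except KeyError:
        known = ", ".join(sorted(mapping))
        raise ValueError(f"Unknown image '{image_name}'. Known: {known}") from None


sample_mapping = {"Indian": "/content/Indian.mp4", "Asian": "/content/Asian.mp4"}
path = require_video_path("Indian", sample_mapping)
```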

In [24]:
select_image("Black-or-African-American")

'/content/Black-or-African-American.jpeg'

In [25]:
languages = sorted(set(d['language'] for d in speakers))

In [26]:
# Gradio interface
with gr.Blocks(theme=gr.themes.Glass(primary_hue="blue", text_size="lg").set(input_text_size="*text_lg")) as app:
  gr.Markdown("# Talking Head Video Generator Tool")


  with gr.Row():
    text_input = gr.Textbox(label="Text", lines=3, elem_classes="label")

  with gr.Row():

    language_input = gr.Dropdown(choices=languages, label="Language")
    gender_input = gr.Dropdown(choices=[], label="Gender")
    speaker_code_input = gr.Dropdown(choices=[], label="Speaker Code")

  with gr.Row():

    model_input = gr.Radio(choices=models, label="Model")
    image_dropdown = gr.Dropdown(choices=image_names, label="Select Image")
    selected_image = gr.Image(label="Selected Image", width=128, height=128)

  with gr.Row():
    generate_button = gr.Button("Generate Video")

    result_output = gr.Video(label="Result", width=256, height=256, show_download_button=True, loop=True, show_share_button=True)



  # Define dynamic updates for the dependent dropdowns
  language_input.change(on_language_change, inputs=language_input, outputs=gender_input)
  gender_input.change(on_gender_change, inputs=[language_input, gender_input], outputs=speaker_code_input)

  # Show the selected image when the dropdown changes
  image_dropdown.change(
    fn=select_image,
    inputs=image_dropdown,
    outputs=selected_image
  )


  # Button click to process and generate the video
  generate_button.click(
    gradio_function,
    inputs=[text_input, language_input, gender_input, speaker_code_input, image_dropdown, model_input],
    outputs=result_output
  )


In [None]:
app.launch(debug=True)