

---



## Task 1: Semantic Chunking of a YouTube Video

**Problem Statement:**

The objective is to extract high-quality, meaningful (semantic) segments from the specified YouTube video: [Watch Video](https://www.youtube.com/watch?v=Sby1uJ_NFIY).

Suggested workflow:
1. **Download Video and Extract Audio:** Download the video and separate the audio component.
2. **Transcription of Audio:** Utilize an open-source Speech-to-Text model to transcribe the audio. *Provide an explanation of the chosen model and any techniques used to enhance the quality of the transcription.*
3. **Time-Align Transcript with Audio:** *Describe the methodology and steps for aligning the transcript with the audio.*
4. **Semantic Chunking of Data:** Slice the data into audio-text pairs, using both semantic information from the text and voice activity information from the audio, with each audio-chunk being less than 15s in length. *Explain the logic used for semantic chunking and discuss the strengths and weaknesses of your approach.*



# Task 1:

Step 1: Downloading dependencies

In [None]:
# Install the modules used to download and process the YouTube video
!pip install pytube
!pip install moviepy

Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0


Choice of Speech-to-Text Model: Whisper by OpenAI

Reason:
Whisper is an open-source speech-to-text model from OpenAI, chosen for its straightforward installation and robust feature set.

Its command-line tool writes the transcript in several formats, including txt, vtt, srt, and tsv; the vtt, srt, and tsv files also carry start and end timestamps.

In [None]:
# Install Whisper and ffmpeg
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install -y ffmpeg

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-aekpmzb1
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-aekpmzb1
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20231117)
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->openai-whisper==20231117)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==1

In [None]:
from pytube import YouTube
import os
import subprocess

Step 2: Downloading the video from YouTube

In [None]:
# Download the audio-only stream; download() returns the path of the saved file
video_url = "https://www.youtube.com/watch?v=Sby1uJ_NFIY"
yt = YouTube(video_url)
video_path = yt.streams.filter(only_audio=True).first().download(filename='video')

# Convert to 16-bit PCM WAV (44.1 kHz, stereo) using ffmpeg; -y overwrites an existing audio.wav
subprocess.run(['ffmpeg', '-y', '-i', video_path, '-vn', '-acodec', 'pcm_s16le', '-ar', '44100', '-ac', '2', 'audio.wav'], check=True)

# Remove the downloaded source file
os.remove(video_path)

Step 3: Transcription

In [None]:
# Transcribe the audio with Whisper (medium model)
transcript = !whisper "audio.wav" --model medium

Step 4: Preprocessing

In [None]:
# Split the transcript into fixed-size chunks.
# Note: the split is by word count, so the "times" below are word indices, not seconds.
chunk_duration = 15  # words per chunk
transcript_path = '/content/audio.txt'  # plain-text transcript written by Whisper
with open(transcript_path, encoding='utf-8') as f:
    words = f.read().split()
num_chunks = (len(words) + chunk_duration - 1) // chunk_duration  # ceiling division

# For aligning timestamps with the text, the .tsv file Whisper writes is enough

# Generate chunks with start and end word indices
chunks = []
for i in range(num_chunks):
    start_time = i * chunk_duration
    end_time = min((i + 1) * chunk_duration, len(words))
    text_chunk = ' '.join(words[start_time:end_time])
    chunks.append({
        "chunk_id": i + 1,
        "chunk_length": end_time - start_time,
        "text": text_chunk,
        "start_time": start_time,
        "end_time": end_time
    })

print(chunks)


[{'chunk_id': 1, 'chunk_length': 15, 'text': '/content/audio.txt', 'start_time': 0, 'end_time': 1}]
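For time alignment, the segment timestamps that Whisper writes next to the text can be read back directly. The sketch below parses a TSV transcript into timed segments. It assumes the layout is a `start`, `end`, `text` header with times in integer milliseconds, which is how the Whisper CLI's TSV output is understood to look; check this against the actual file before relying on it. The sample string is made up for illustration, and `parse_whisper_tsv` is a hypothetical helper name.

```python
import csv
import io

def parse_whisper_tsv(tsv_text):
    """Parse TSV transcript text into a list of timed segments (in seconds)."""
    # QUOTE_NONE: treat quotation marks inside the transcript as ordinary text
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    segments = []
    for row in reader:
        segments.append({
            "start": int(row["start"]) / 1000.0,  # assumed milliseconds -> seconds
            "end": int(row["end"]) / 1000.0,
            "text": row["text"].strip(),
        })
    return segments

# Made-up sample in the assumed layout, for illustration only
sample = (
    "start\tend\ttext\n"
    "0\t2500\tHello and welcome.\n"
    "2500\t6000\tToday we talk about chunking.\n"
)
segments = parse_whisper_tsv(sample)
for seg in segments:
    print(f'{seg["start"]:.2f}-{seg["end"]:.2f}s: {seg["text"]}')
```

In the notebook, the same function could be applied to the contents of the `.tsv` file written alongside `/content/audio.txt` (the file name `audio.tsv` is an assumption based on the input name `audio.wav`).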



Step 5: Semantic Chunking

The purpose of semantic chunking is to split the text into meaningful, self-contained segments. This can be done with models such as BERT, for example by identifying named entities, or with sentiment analysis to pick out impactful words.

In [None]:
# Semantic Chunking
def semantic_chunking(chunks):
    semantic_chunks = []
    for chunk in chunks:
        # Placeholder: each fixed-size chunk is passed through unchanged,
        # assuming every chunk is important
        semantic_chunks.append(chunk)
    return semantic_chunks

# Apply semantic chunking
semantic_chunks = semantic_chunking(chunks)
print(semantic_chunks)


[{'chunk_id': 1, 'chunk_length': 15, 'text': '/content/audio.txt', 'start_time': 0, 'end_time': 1}]
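The function above is a placeholder, so here is one possible sketch of chunking that uses both text and timing, offered as an illustration rather than the method used in this notebook. It greedily merges timed segments (dictionaries with `start`, `end`, and `text`, in seconds) and closes a chunk either when adding the next segment would reach the 15 s limit, or when the chunk ends a sentence and is followed by a pause. The pause between segments is only a rough stand-in for voice activity information; the 0.6 s threshold and the name `chunk_segments` are assumptions.

```python
def chunk_segments(segments, max_len=15.0, pause=0.6):
    """Greedily merge timed segments into chunks shorter than max_len seconds."""
    groups, current = [], []
    for seg in segments:
        if current:
            # Would the chunk reach the limit if this segment were added?
            too_long = seg["end"] - current[0]["start"] >= max_len
            # Does the chunk so far end a sentence, followed by a pause?
            sentence_end = current[-1]["text"].rstrip().endswith((".", "?", "!"))
            gap = seg["start"] - current[-1]["end"]
            if too_long or (sentence_end and gap >= pause):
                groups.append(current)
                current = []
        current.append(seg)
    if current:
        groups.append(current)
    return [
        {
            "chunk_id": i + 1,
            "start_time": g[0]["start"],
            "end_time": g[-1]["end"],
            "chunk_length": round(g[-1]["end"] - g[0]["start"], 3),
            "text": " ".join(s["text"] for s in g),
        }
        for i, g in enumerate(groups)
    ]

# Made-up segments for illustration
demo = [
    {"start": 0.0, "end": 4.0, "text": "Hello and welcome."},
    {"start": 5.0, "end": 9.0, "text": "Today we look at"},
    {"start": 9.0, "end": 13.0, "text": "semantic chunking."},
    {"start": 14.0, "end": 20.0, "text": "First, the audio."},
]
demo_chunks = chunk_segments(demo)
for c in demo_chunks:
    print(c)
```

A strength of this approach is that chunk boundaries fall on segment boundaries and, where possible, on sentence ends. A weakness is that a single segment that is itself 15 s or longer passes through unsplit; handling that would need word-level timestamps or a dedicated voice activity detector.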


# **Bonus**

3.1: Implemented Gradio as the interface

Here, the link to a YouTube video is entered in a textbox, and the transcribed text of the video is returned.







In [None]:
from pytube import YouTube
import os
import subprocess
# Function called by Gradio with the link entered in the textbox
def downloading(link):
  video_url = link
  yt = YouTube(video_url)
  # download() returns the path of the saved audio-only stream
  video_path = yt.streams.filter(only_audio=True).first().download(filename='video2')

  # Extract audio using ffmpeg; -y overwrites an audio2.wav left over from an earlier run
  subprocess.run(['ffmpeg', '-y', '-i', video_path, '-vn', '-acodec', 'pcm_s16le', '-ar', '44100', '-ac', '2', 'audio2.wav'], check=True)

  # Remove the downloaded source file
  os.remove(video_path)
  transcript = !whisper "audio2.wav" --model medium
  # The shell capture is a list of lines; join it into one string for the Textbox
  return "\n".join(transcript)


In [None]:
!pip install gradio

Collecting gradio
  Downloading gradio-4.31.3-py3-none-any.whl (12.3 MB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.2.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client==0.16.3 (from gradio)
  Downloading gradio_client-0.16.3-py3-none-any.whl (315 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)

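Before running the cell below, delete any existing files related to audio2 (in any format) from the /content folder: an existing file is detected and the run stops (ffmpeg, for one, does not overwrite an existing output file unless it is called with `-y`).

The interface also fails intermittently, so a request may need to be submitted more than once.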

In [None]:
import gradio as gr

# Creating the interface using Gradio
# live=True is omitted so the download runs on Submit, not on every edit of the textbox
iface = gr.Interface(
    fn=downloading,
    inputs=gr.Textbox(lines=2, label="Enter a link to a YouTube video"),
    outputs=gr.Textbox(lines=15, label="Transcribed text:"),
    title="Transcription of a YouTube video",
    description="Extracts the audio and converts it to text using Whisper"
)

iface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://8b4b33f65ea8a33b33.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




Bonus 3.2:


## Task 2: Exploratory Data Analysis of New Testament Audio and Text

**Problem Statement:**

The objective of this task is to conduct a comprehensive exploratory data analysis (EDA) on the audio and text data of the 260 chapters of the New Testament in your mother tongue (excluding English). The data should be obtained through web scraping from [Faith Comes By Hearing](https://www.faithcomesbyhearing.com/).

The workflow for this task should include:
1. **Web Scraping:** Systematically download the audio files and their corresponding textual content for each of the 260 chapters of the New Testament from the specified website.
2. **Data Preparation:** Organize the data by chapters, ensuring each audio file is matched with its corresponding text.
3. **Exploratory Data Analysis:** Analyze the data to uncover patterns, anomalies, or insights that could benefit applications such as Text to Speech (TTS) and Speech to Text (STT) technologies. Your analysis should explore various facets of the data, including audio quality, text clarity, and alignment between text and spoken content.

**Judgement Criteria:**

Your submission will be evaluated based on:
- **Efficiency and Reliability of Web Scraping Techniques:** The methods and tools used for downloading the chapters efficiently and reliably.
- **Data Analysis Methods:** The techniques and approaches used for analyzing the audio and text data.
- **Quality of Data Analysis:** How effectively the analysis addresses potential applications for the Speech team, including TTS and STT technologies.
- **Creativity in Analysis:** Innovative approaches in data handling and analysis, and the use of relevant metrics to assess data quality and applicability.

**Submission Requirements:**

Your submission should include the following components:
- **Report on Key Performance Indicators (KPIs):** A concise report detailing the key findings from your analysis, focusing on aspects that are critical for improving TTS and STT applications.
- **Methodological Explanation:** A thorough explanation of the methods used for both web scraping and the exploratory data analysis. This should include challenges faced and how they were overcome.
- **Supporting Materials:** Include code snippets and visualizations that highlight significant insights from the data. These should be well-documented and easy to understand, demonstrating the logic behind your analytical decisions.

The report should be structured to clearly present the methodology, findings, and implications of your analysis. It should be technical yet accessible, aimed at stakeholders who may have varying levels of familiarity with data analysis techniques.


---



It was a challenging and interesting task!
Below are the steps I followed to conduct an exploratory data analysis on the audio and text data scraped from the New Testament in a language other than English:







Step 1. **Web Scraping:**
   
   Used Beautiful Soup to systematically download the audio files and their corresponding text for each of the 260 chapters from the Faith Comes By Hearing website, in Hindi.



In [None]:
!pip install requests beautifulsoup4



In [None]:
import requests
from bs4 import BeautifulSoup # HTML parsing for web scraping
from urllib.parse import unquote # decodes percent-encoded characters in the extracted URL
import re
import os

In [None]:
def scrape_audio_and_text():
    base_url = "https://live.bible.is/bible/"
    language_code = "hinohc"
    # NOTE: "gen" is Genesis, an Old Testament book with 50 chapters. The 260
    # New Testament chapters are spread over 27 books (Matthew to Revelation),
    # so a full run has to loop over those book codes instead.
    book_code = "gen"
    chapters = 260
    # Create directories if they don't exist
    os.makedirs("audio", exist_ok=True)
    os.makedirs("text", exist_ok=True)

    for chapter in range(1, chapters + 1):
        url = f"{base_url}{language_code}/{book_code}/{chapter}"
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.content, "html.parser")

        # The chapter text is stored in a div with classes "chapter justify"
        text_element = soup.find("div", class_="chapter justify")
        if text_element:
            text_content = text_element.get_text(strip=True)
            with open(f"text/{book_code}_{chapter}.txt", "w", encoding="utf-8") as text_file:
                text_file.write(text_content)
            print(f"Text for chapter {chapter} downloaded.")
        else:
            print(f"No text found for chapter {chapter}.")

        # The audio URL is expected in the src of the video tag with class "audio_player"
        video_tag = soup.find("video", class_="audio_player")
        audio_url = None
        # Guard against a missing src attribute, which would make unquote() fail
        if video_tag and video_tag.get("src"):
            audio_url = unquote(video_tag.get("src"))
            print(audio_url)
        if audio_url:
            audio_response = requests.get(audio_url, timeout=60)

            with open(f"audio/{book_code}_{chapter}.mp3", "wb") as audio_file:
                audio_file.write(audio_response.content)
            print(f"Audio for chapter {chapter} downloaded.")
        else:
            print(f"No audio found for chapter {chapter}.")


The biggest problem I am facing is extracting the audio URL from the `src` attribute of the video tag.

Printing the extracted value shows `_`, which may be because the URL is not being extracted properly.

One workaround is to inspect the pages manually, copy the audio links into the working directory, and then download them by iterating over that list.
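Separately from the audio issue, the 260 New Testament chapters are spread over 27 books, so the page URLs have to be built per book rather than by counting chapters 1 to 260 within one book. The sketch below lists the chapter count of each book and builds one URL per chapter, reusing the `base_url` and `language_code` from the scraper above. The three-letter book codes follow the common USFM convention, and lowercasing them mirrors the `"gen"` code used above; whether the site accepts these exact codes is an assumption to verify. `chapter_urls` is a hypothetical helper name.

```python
# Chapter counts of the 27 New Testament books, keyed by USFM-style book code
NT_CHAPTERS = {
    "MAT": 28, "MRK": 16, "LUK": 24, "JHN": 21, "ACT": 28, "ROM": 16,
    "1CO": 16, "2CO": 13, "GAL": 6, "EPH": 6, "PHP": 4, "COL": 4,
    "1TH": 5, "2TH": 3, "1TI": 6, "2TI": 4, "TIT": 3, "PHM": 1,
    "HEB": 13, "JAS": 5, "1PE": 5, "2PE": 3, "1JN": 5, "2JN": 1,
    "3JN": 1, "JUD": 1, "REV": 22,
}

def chapter_urls(base_url="https://live.bible.is/bible/", language_code="hinohc"):
    """Return (book_code, chapter, url) for every New Testament chapter."""
    urls = []
    for book, n_chapters in NT_CHAPTERS.items():
        for chapter in range(1, n_chapters + 1):
            # Assumption: the site expects the book code in lowercase
            url = f"{base_url}{language_code}/{book.lower()}/{chapter}"
            urls.append((book, chapter, url))
    return urls

all_urls = chapter_urls()
print(len(all_urls))   # total number of chapter pages
print(all_urls[0])     # first entry: Matthew, chapter 1
```

Naming the saved files by both book code and chapter (for example `text/MAT_1.txt`) keeps the audio and text of each chapter easy to pair later.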

Step 2. **Data Preparation:**
   
   Organize the data by chapters, ensuring that each audio file is correctly matched with its corresponding text.


In [None]:

def prepare_data(book_code="gen"):
    audio_data = []
    text_data = []
    chapters = 260

    # Create directories if they don't exist
    os.makedirs("audio", exist_ok=True)
    os.makedirs("text", exist_ok=True)

    for chapter in range(1, chapters + 1):
        # File names must match those written by scrape_audio_and_text()
        audio_file_path = f"audio/{book_code}_{chapter}.mp3"
        text_file_path = f"text/{book_code}_{chapter}.txt"

        # Keep a chapter only when both files exist, so the two lists stay aligned
        if os.path.exists(audio_file_path) and os.path.exists(text_file_path):
            with open(audio_file_path, "rb") as audio_file:
                audio_data.append(audio_file.read())

            with open(text_file_path, "r", encoding="utf-8") as text_file:
                text_data.append(text_file.read())

    return audio_data, text_data


Step 3. **Exploratory Data Analysis:**
   
   Tried to use visualization techniques to find any discontinuities in the audio files.
   


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import librosa
import librosa.display

# Visualize audio data
def visualize_audio(audio_file_path):
    # Load audio file
    audio_data, sr = librosa.load(audio_file_path, sr=None)

    # Plot audio waveform
    plt.figure(figsize=(14, 5))
    librosa.display.waveshow(audio_data, sr=sr)
    plt.title('Audio Waveform')
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.show()
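Discontinuities such as dropouts or long pauses can also be located numerically rather than by eye. The sketch below is one simple way to do that, under stated assumptions: it computes the RMS energy of short frames and reports stretches where the energy stays below a threshold. The frame size, threshold, and minimum duration are illustrative values, `find_silences` is a hypothetical helper, and the demo signal is synthetic.

```python
import math

def find_silences(samples, sr, frame_ms=20, threshold=0.01, min_silence_s=0.5):
    """Return (start_s, end_s) spans where the frame RMS stays below threshold."""
    frame_len = max(1, int(sr * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    spans, run_start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        t = i * frame_len / sr  # start time of this frame in seconds
        if rms < threshold:
            if run_start is None:
                run_start = t  # a quiet run begins here
        elif run_start is not None:
            if t - run_start >= min_silence_s:
                spans.append((run_start, t))
            run_start = None
    # Close a quiet run that lasts until the end of the signal
    end_t = n_frames * frame_len / sr
    if run_start is not None and end_t - run_start >= min_silence_s:
        spans.append((run_start, end_t))
    return spans

# Synthetic demo: 1 s of a 50 Hz tone, 1 s of silence, 1 s of tone, at 1 kHz
sr = 1000
tone = [0.5 * math.sin(2 * math.pi * 50 * n / sr) for n in range(sr)]
signal = tone + [0.0] * sr + tone
silences = find_silences(signal, sr)
print(silences)
```

The same function accepts the array returned by `librosa.load`, although for full chapters a vectorized NumPy version would be much faster than this pure-Python loop.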

In [None]:
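# Main function to execute the workflow
def main():
    scrape_audio_and_text()
    audio_data, text_data = prepare_data()
    # visualize_audio() expects a single file, so plot each downloaded .mp3 in turn
    audio_dir = '/content/audio'
    for name in sorted(os.listdir(audio_dir)):
        if name.endswith('.mp3'):
            visualize_audio(os.path.join(audio_dir, name))


if __name__ == "__main__":
    main()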

Text for chapter 1 downloaded.
No audio found for chapter 1.
Text for chapter 2 downloaded.
No audio found for chapter 2.
Text for chapter 3 downloaded.
No audio found for chapter 3.
Text for chapter 4 downloaded.
No audio found for chapter 4.
Text for chapter 5 downloaded.
No audio found for chapter 5.
Text for chapter 6 downloaded.
No audio found for chapter 6.
Text for chapter 7 downloaded.
No audio found for chapter 7.
Text for chapter 8 downloaded.
No audio found for chapter 8.
Text for chapter 9 downloaded.
No audio found for chapter 9.
Text for chapter 10 downloaded.
No audio found for chapter 10.
Text for chapter 11 downloaded.
No audio found for chapter 11.
Text for chapter 12 downloaded.
No audio found for chapter 12.
Text for chapter 13 downloaded.
No audio found for chapter 13.
Text for chapter 14 downloaded.
No audio found for chapter 14.
Text for chapter 15 downloaded.
No audio found for chapter 15.
Text for chapter 16 downloaded.
No audio found for chapter 16.
Text for c

Text for chapter 28 downloaded.
No audio found for chapter 28.
Text for chapter 29 downloaded.
No audio found for chapter 29.
Text for chapter 30 downloaded.
No audio found for chapter 30.
Text for chapter 31 downloaded.
No audio found for chapter 31.
Text for chapter 32 downloaded.
No audio found for chapter 32.
Text for chapter 33 downloaded.
No audio found for chapter 33.
Text for chapter 34 downloaded.
No audio found for chapter 34.
Text for chapter 35 downloaded.
No audio found for chapter 35.
Text for chapter 36 downloaded.
No audio found for chapter 36.
No text found for chapter 37.
No audio found for chapter 37.
Text for chapter 38 downloaded.
No audio found for chapter 38.
Text for chapter 39 downloaded.
No audio found for chapter 39.
No text found for chapter 40.
No audio found for chapter 40.
Text for chapter 41 downloaded.
No audio found for chapter 41.
Text for chapter 42 downloaded.
No audio found for chapter 42.
No text found for chapter 43.
No audio found for chapter 43

  audio_data, sr = librosa.load(audio_file_path, sr=None)
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


IsADirectoryError: [Errno 21] Is a directory: '/content/audio'

Additional Step: Improvements

1. The audio URL needs to be extracted or decoded with a more suitable method; `unquote` was tried but did not help much.

2. Alternatively, the audio links can be collected by manual inspection and then iterated over to download the files for the EDA.

3. For text analysis, the audio can be transcribed as in Task 1 and compared against the downloaded text as ground truth, using metrics such as Word Error Rate (WER), Character Error Rate (CER), and Levenshtein distance to quantify transcription accuracy.

4. Finally, the text can be refined with spelling or grammar checking, and a vocabulary dictionary can be built for future use.
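The metrics named in point 3 can be sketched in a few lines. The example below computes the Levenshtein (edit) distance with dynamic programming and derives WER from word sequences and CER from character sequences. It is a minimal illustration with made-up English strings, not a result from this dataset, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(1, len(ref_words))

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance divided by reference length."""
    return edit_distance(reference, hypothesis) / max(1, len(reference))

reference = "the quick brown fox"
hypothesis = "the quick brown box"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

For Hindi text, note that iterating over a Python string counts Unicode code points, so vowel signs and other combining marks are scored as separate characters; normalizing the text and removing punctuation before scoring is usually worthwhile.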