## Set catalog and schema

We set the catalog and schema to organise our data and ensure it is stored in the correct location. Change these to suit your workspace.

In [None]:
CATALOG = "marcell"
SCHEMA = "call_centre_processing"

Create the catalog, schema, and volume if they don't already exist, then create directories for the compressed archive, the raw audio files, and the models.

In [None]:
spark.sql(f"CREATE CATALOG IF NOT EXISTS {CATALOG}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.data")
dbutils.fs.mkdirs(f"/Volumes/{CATALOG}/{SCHEMA}/data/compressed/LJSpeech")
dbutils.fs.mkdirs(f"/Volumes/{CATALOG}/{SCHEMA}/data/raw_audio/LJSpeech")
dbutils.fs.mkdirs(f"/Volumes/{CATALOG}/{SCHEMA}/data/models")

## Download raw audio files

We download the [LJSpeech dataset](https://paperswithcode.com/dataset/ljspeech) and extract it to the raw audio directory. This is a collection of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. The files are distributed as a tar.bz2 archive, so we first download the archive and then extract it.

In [None]:
# Download the LJSpeech dataset

import urllib.request

url = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"
target_file_path = f"/Volumes/{CATALOG}/{SCHEMA}/data/compressed/LJSpeech/LJSpeech-1.1.tar.bz2"
urllib.request.urlretrieve(url, target_file_path)
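
Re-running the download cell repeats the whole transfer. Below is a minimal sketch of a guard that skips the download when a non-empty file is already present. `download_if_missing` is a hypothetical helper, not part of the original notebook, and it assumes the volume path is reachable through ordinary local file APIs, as the cell above already does.

```python
import os
import urllib.request


def download_if_missing(url: str, dest_path: str) -> bool:
    """Download `url` to `dest_path` unless a non-empty file already exists there.

    Returns True if a download happened, False if it was skipped.
    Hypothetical helper for illustration only.
    """
    # Skip the transfer when a previous run already produced a non-empty file
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return False
    urllib.request.urlretrieve(url, dest_path)
    return True
```

In this notebook it could be called as `download_if_missing(url, target_file_path)`. Note that the size check cannot tell a complete archive from a partially written one, so after an interrupted download the file should be deleted before re-running.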

Extracting the archive can take quite some time (more than an hour).

In [None]:
# Extract the LJSpeech dataset
# The archive is a tar.bz2 file, so we use tarfile (zipfile cannot read it)

import tarfile

extract_to_path = f"/Volumes/{CATALOG}/{SCHEMA}/data/raw_audio/LJSpeech"
with tarfile.open(target_file_path, "r:bz2") as tar_ref:
    tar_ref.extractall(extract_to_path)
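
Since extraction runs for a long time, a quick sanity check afterwards can confirm that it completed. The sketch below counts the WAV files in a directory so the result can be compared with the 13,100 clips mentioned above. `count_wav_files` is a hypothetical helper, and the `LJSpeech-1.1/wavs` subdirectory is assumed from the path used later in this notebook.

```python
from pathlib import Path


def count_wav_files(wav_dir: str) -> int:
    """Count the .wav files directly inside `wav_dir` (non-recursive).

    Hypothetical helper for illustration only.
    """
    return sum(1 for p in Path(wav_dir).glob("*.wav") if p.is_file())
```

For example, `count_wav_files(f"{extract_to_path}/LJSpeech-1.1/wavs")` should return 13100 if every clip was extracted.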

## Create reference dataframe

We create a reference dataframe that contains the file paths of the raw audio files. We will use this dataframe to parallelize the inference process.

In [None]:
import pyspark.sql.functions as F

df_file_reference = spark.createDataFrame(dbutils.fs.ls(f"/Volumes/{CATALOG}/{SCHEMA}/data/raw_audio/LJSpeech/LJSpeech-1.1/wavs/"))\
  .withColumn("file_path", F.expr("substring(path, 6, length(path))")) # strip the leading "dbfs:" scheme, keeping the path from /Volumes onwards

df_file_reference.display()
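
The `substring(path, 6, length(path))` expression is 1-indexed, so it drops the first five characters (`dbfs:`) and keeps the rest of the path, including the leading slash. A plain-Python sketch of the same transformation is shown below. `strip_dbfs_scheme` is a hypothetical helper for illustration, and it assumes the listed paths start with `dbfs:`.

```python
def strip_dbfs_scheme(path: str) -> str:
    """Return `path` without a leading "dbfs:" scheme, if one is present.

    Hypothetical helper mirroring the Spark expression substring(path, 6, length(path)).
    """
    prefix = "dbfs:"
    # Slice off the five-character scheme; leave other paths untouched
    return path[len(prefix):] if path.startswith(prefix) else path


example = "dbfs:/Volumes/marcell/call_centre_processing/data/raw_audio/LJSpeech/LJSpeech-1.1/wavs/"
print(strip_dbfs_scheme(example))
```

Unlike the Spark expression, this version leaves paths without the `dbfs:` prefix unchanged instead of cutting their first five characters.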

Write the dataframe to a Delta table.

In [None]:
# Use an f-string so that CATALOG and SCHEMA are interpolated into the table name
df_file_reference.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(f"{CATALOG}.{SCHEMA}.recording_file_reference")