# Deep Learning for Unsupervised Audio Mining
Joshua Higginbotham

## Running this Notebook

This report is formatted as a Jupyter notebook with a .ipynb file extension.
Most notebooks are written in Python, but this one is written in Julia. If you
are already familiar with Julia and know how to use it with Jupyter, or don't 
mind figuring it out for yourself, then feel free to run this notebook in 
whatever way works best for you.

But if you are new to any or all of Julia, Jupyter, or Julia in Jupyter, then
I have provided some scripts and instructions to get you started. They were
written and tested on Ubuntu 18.04. If you do not have access to a Linux
workstation, the basic steps are the same on Windows and Mac, but you may
need to do some additional setup if you want to use the provided bash
scripts directly.

### The Project Folder

A Julia project is defined by a file called `Project.toml`, which lists all of its package dependencies.

Create an empty directory and `cd` into it. Once inside, copy the following into a file called `Project.toml`:

```toml
# Project.toml

[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
Conda = "8f4d0f93-b110-5947-807f-2305c1781a2d"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
DataSets = "c9661210-8a83-48f0-b833-72e62abce419"
DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
IJulia = "7073ff75-c697-5162-941a-fcdaad2a7d2a"
OhMyREPL = "5fb14364-9ced-5910-84b2-373655c76a03"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
StatsPlots = "f3b207a7-027a-5e70-b257-86293d7955fd"
Tar = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e"
WAV = "8149f6b0-98f6-5db9-b78f-408fbbb8ef88"
```
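
As a quick sanity check that the file was saved correctly, Julia's `TOML` standard library (bundled with Julia 1.6 and later) can parse it and list the declared dependencies. The sketch below parses an abbreviated copy of the file from a string so that it runs anywhere; inside the project directory you would more likely call `TOML.parsefile("Project.toml")`. The helper name `dependency_names` is my own and is not part of any package.

```julia
using TOML

# Abbreviated copy of the [deps] table, kept inline so the example
# does not depend on the current working directory.
project_text = """
[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
WAV = "8149f6b0-98f6-5db9-b78f-408fbbb8ef88"
"""

# Hypothetical helper: return the sorted package names found in [deps].
# A file without a [deps] table yields an empty list.
function dependency_names(toml_text::AbstractString)
    parsed = TOML.parse(toml_text)
    deps = get(parsed, "deps", Dict{String,Any}())
    return sort(collect(keys(deps)))
end

dep_names = dependency_names(project_text)
println(dep_names)
```

If a package you expect is missing from the printed list, the file was probably truncated or pasted outside the `[deps]` table.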

### Getting the Project Requirements

The following bash script downloads Julia version 1.6.1, installs the dependencies defined in `Project.toml`, and does a clean installation of both Jupyter (local to Julia, so it should not interfere with your existing Jupyter installation if you have one) and the kernel that allows it to run Julia notebooks in addition to Python ones.

The setup process involves opening a new Jupyter notebook from Julia and then closing it. The script does this by killing every running `jupyter-notebook` process, so it is advisable to save your work and close any running Jupyter processes beforehand.

Finally, it downloads the `chime_home` dataset into the project directory.

If starting from scratch, this entire script usually takes 30 minutes to an hour to run to completion. The longest parts are setting up Jupyter and downloading the `chime_home` dataset, which is about 4 GB and has to be decompressed. Given that, as well as the system requirements for Julia and Jupyter, make sure you have enough space on your hard drive before getting started.

```bash
# install.sh

# install julia
wget -c https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.1-linux-x86_64.tar.gz -O - | tar xz

# get packages and set up IJulia
echo -e "\n" | ./julia-1.6.1/bin/julia --project=@. -e "using Pkg; Pkg.update(); using IJulia; notebook(detached=true)"

# kill notebook, not reachable by default so we need to start properly later anyway
killall -9 jupyter-notebook
# if the killall command is not available, you may need to install it:
# sudo apt install psmisc

# install chime_home dataset
wget -c https://archive.org/download/chime-home/chime_home.tar.gz -O - | tar xz
```

### Running the Code

Once Julia has installed Jupyter, you can run Julia notebooks by invoking Jupyter from within your Julia installation. The following script does this, adding the `--no-browser` flag and backgrounding the process.

```bash
# run.sh

# runs miniconda jupyter notebook on default port.
# usually this is 8888, but if you're connecting
# remotely then edit this command to make sure it
# runs on the same port as your SSH tunnel. also,
# if you're not able to connect then make sure you
# get the right URL token from stdout.
~/.julia/conda/3/bin/jupyter notebook --no-browser &
```

The `--no-browser` flag is helpful if you want to run Jupyter remotely and connect to it from the web browser on your local workstation. The comments in `run.sh` mention SSH tunnels, because I have found them to be the easiest way to get things running and connected. This requires creating an SSH tunnel to the same Linux server you are running Jupyter on, as the same user you want to run Jupyter as, before invoking `run.sh`. Once that's done, you should be able to connect to your remote Jupyter process in your local web browser using whatever localhost port you chose when creating the tunnel (I've been using 8888, Jupyter's default port). See [here](https://docs.anaconda.com/anaconda/user-guide/tasks/remote-jupyter-notebook/) for more help with Jupyter and SSH tunnels.

Once you finally connect to Jupyter, navigate to wherever you saved this notebook and click on it to open it (it should be called `unsupervised-audio-mining.ipynb` by default). From that point on it's just a Jupyter notebook running Julia, so refer to the [Julia](https://docs.julialang.org/en/v1/), [Jupyter Notebook](https://jupyter-notebook.readthedocs.io/en/stable/), and [IJulia](https://julialang.github.io/IJulia.jl/stable/) (Julia backend for Jupyter) documentation for further assistance.

### Cleaning Up

If you don't plan on using Julia, Jupyter, or the `chime_home` dataset later, you can run the following script to remove all of them. This will leave you with the directory you created earlier, `Project.toml`, and anything else you added, which you can do with as you please.

```bash
# uninstall.sh

# this script is mostly useful for showing you
# what things get installed by install.sh and
# how you can remove them. don't run it directly
# unless it won't mess up an existing julia or
# jupyter installation.

# local julia install
rm -rf julia-1.6.1/

# julia home
rm -rf ~/.julia

# jupyter home
rm -rf ~/.jupyter

# data
rm -rf chime_home/
```

In [1]:
# navigate (within your notebook session) to whatever
# directory you saved Project.toml in. pwd() should output
# the full path of the directory you saved Project.toml in.
pwd()

# then run the following cell to activate the environment
# you created earlier using Project.toml. this is so Julia
# starts running in the same virtual environment you installed
# the project dependencies in.

"/home/ubuntu/csce-5380-project"

In [2]:
] activate .

In [3]:
using Pkg
using WAV
using CSV
using DelimitedFiles
using DataFrames
using DataSets
using Flux
using Plots
using StatsPlots

In [4]:
Pkg.project().path # this should output the full path of the Project.toml you created earlier

"/home/ubuntu/csce-5380-project/Project.toml"

Initial questions:

- How can I load this data?
- What other pre-processing does it require?
    - Background noise estimation
- What networks do I need to build?
    - Their code literally gives you all the layers and hyperparameters
- How do I train them?
    - DNN
        - Chunks
        - Batches
        - Multiple epochs
        - Binary cross-entropy objective function
        - Stochastic Gradient Descent optimization
    - DAE
        - Multiple-frame MBK to predict center (context)
        - Batches
        - Multiple epochs
        - Mean squared error objective function
        - Stochastic Gradient Descent optimization
- How do I evaluate them?
    - Equal error rate, average across 5 folds
    - Precision
    - Recall
    - F1
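
The evaluation metrics listed above can be sketched in plain Julia. This is a minimal version under my own assumptions: the function names are hypothetical, labels are Booleans for a single tag, and the equal error rate is approximated by scanning the observed scores as thresholds and picking the one where the false positive and false negative rates are closest. The scores and labels are made-up toy values, not results.

```julia
# Precision, recall, and F1 for one tag at a fixed decision threshold.
function precision_recall_f1(scores::AbstractVector{<:Real},
                             labels::AbstractVector{Bool};
                             threshold::Real = 0.5)
    predicted = scores .>= threshold
    true_pos = count(predicted .& labels)
    false_pos = count(predicted .& .!labels)
    false_neg = count(.!predicted .& labels)
    prec = true_pos + false_pos == 0 ? 0.0 : true_pos / (true_pos + false_pos)
    rec = true_pos + false_neg == 0 ? 0.0 : true_pos / (true_pos + false_neg)
    f1 = prec + rec == 0 ? 0.0 : 2 * prec * rec / (prec + rec)
    return (precision = prec, recall = rec, f1 = f1)
end

# Approximate equal error rate: try every observed score as a threshold
# and keep the point where the two error rates are closest together.
function equal_error_rate(scores::AbstractVector{<:Real},
                          labels::AbstractVector{Bool})
    n_pos = count(labels)
    n_neg = length(labels) - n_pos
    (n_pos == 0 || n_neg == 0) &&
        throw(ArgumentError("need both positive and negative labels"))
    best_gap, eer = Inf, 1.0
    for threshold in sort(unique(scores))
        predicted = scores .>= threshold
        fpr = count(predicted .& .!labels) / n_neg
        fnr = count(.!predicted .& labels) / n_pos
        gap = abs(fpr - fnr)
        if gap < best_gap
            best_gap, eer = gap, (fpr + fnr) / 2
        end
    end
    return eer
end

# Toy example (made-up numbers).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [true, true, true, false, false, false]
metrics = precision_recall_f1(scores, labels)
eer = equal_error_rate(scores, labels)
println(metrics, " EER = ", eer)
```

For the five-fold average, the same `equal_error_rate` call would be made once per held-out fold and the results averaged.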

Notes:
- Training and network hyperparameters are given
- Will use the given training and development folds, and 16kHz rather than 48kHz
- Will have to reconstruct their input pre-processing manually based on the inputs to the different networks, because it is "done offstage" in this repo
    - Something something MFCCs and MBKs and background noise (check out their equation, and see if the second paper does something similar)
    - Fortunately they give you a lot of the different specs, but you may just need to note in your report that you're kinda guessing
    - Don't have to match their performance either
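
One piece of that pre-processing can already be sketched: the DAE item above says it takes multiple frames and predicts the center one. The code below is a guess at that framing step only. It assumes a feature matrix with one column per frame (standing in for MBK features, which it does not compute), a placeholder context size, and edge frames padded by repetition, none of which comes from the original code. The name `context_windows` is hypothetical.

```julia
# Stack each frame with `context` neighbours on either side.
# `features` has one row per feature and one column per frame.
# Edges are padded by repeating the first or last frame (an assumption).
function context_windows(features::AbstractMatrix{T}, context::Integer) where {T<:Real}
    context >= 0 || throw(ArgumentError("context must be non-negative"))
    n_features, n_frames = size(features)
    width = 2 * context + 1
    inputs = Matrix{T}(undef, n_features * width, n_frames)
    for t in 1:n_frames
        for (k, offset) in enumerate(-context:context)
            src = clamp(t + offset, 1, n_frames)
            rows = ((k - 1) * n_features + 1):(k * n_features)
            inputs[rows, t] = features[:, src]
        end
    end
    targets = copy(features)  # column t is the center frame of window t
    return inputs, targets
end

# Toy feature matrix: 2 features by 4 frames (made-up values).
features = Float32[1 2 3 4; 10 20 30 40]
inputs, targets = context_windows(features, 1)
println(size(inputs))
```

Each column of `inputs` would be one training example for the DAE and the matching column of `targets` the frame it should reconstruct.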



In [37]:
project = DataSets.load_project(Dict(
    "data_config_version"=>0,
    "datasets"=>[Dict(
        "name"=>"chime_home",
        "uuid"=>"73b60068-bf34-11eb-17e9-5fa4ccb60cd2", # UUID generated externally, no need to change
        "description"=>"4-second audio chunks at different sample rates with labels and other metadata",
        "storage"=>Dict(
            "driver"=>"FileSystem",
            "type"=>"BlobTree",
            "path"=>joinpath(".", "chime_home")
        )
    )]
))

DataSets.DataProject:
  chime_home => 73b60068-bf34-11eb-17e9-5fa4ccb60cd2

In [38]:
chime_home = open(BlobTree, dataset(project, "chime_home"))

📂 Tree  @ /home/ubuntu/csce-5380-project/chime_home
 📄 CHANGES
 📄 LICENSE
 📄 README
 📄 VERSION
 📁 chunks
 📄 development_chunks_raw.csv
 📄 development_chunks_refined.csv
 📄 development_chunks_refined_crossval_dcase2016.csv
 📄 evaluation_chunks_raw.csv
 📄 evaluation_chunks_refined.csv

In [58]:
# this function can nicely display everything
# in chime_home except the chunks directory
function print_data_file(s)
    open(String, chime_home[s]) do data
        print(data)
    end
end

print_data_file("README") # description of data set useful for report, might even just include as output

Summary
-------------------
The audio recordings and annotations included in this archive form the CHiME-Home dataset. To cite this dataset, please use the reference provided at the end of this document. Please refer to the same publication for a more detailed description of the dataset.

The CHiME-Home dataset is a collection of annotated domestic environment audio recordings. The audio recordings were originally made for the CHiME project (cf. Christensen et al., 2010; Barker et al., 2013). In the CHiME-Home dataset, 4-second audio chunks are each associated with multiple labels, based on a set of 7 labels associated with sound sources in the acoustic environment. For a total of 6137 4-second audio chunks, there are 3 sets of multi-label annotations. A subset of the 6137 chunks (1946 chunks) is further specified, for which annotators' label assignments agree strongly (cf. Foster et al., 2015). For clearer distinction, please refer to the 2762-chunk dataset as CHiME-Home-refine; corre

In [68]:
# read all chunk grouping CSV files into 
# DataFrames with appropriate headers
chunk_headers=["id", "name"]
dev_chunks_raw = CSV.read(
    IOBuffer(open(String, chime_home["development_chunks_raw.csv"])), 
    DataFrame; header=chunk_headers);
dev_chunks_refined = CSV.read(
    IOBuffer(open(String, chime_home["development_chunks_refined.csv"])), 
    DataFrame; header=chunk_headers);
eval_chunks_raw = CSV.read(
    IOBuffer(open(String, chime_home["evaluation_chunks_raw.csv"])), 
    DataFrame; header=chunk_headers);
eval_chunks_refined = CSV.read(
    IOBuffer(open(String, chime_home["evaluation_chunks_refined.csv"])), 
    DataFrame; header=chunk_headers);
dev_chunks_refined_folds = CSV.read(
    IOBuffer(open(String, chime_home["development_chunks_refined_crossval_dcase2016.csv"])), 
    DataFrame; header=vcat(chunk_headers, ["fold"]))

| Row | id (Int64) | name (String) | fold (Int64) |
|---|---|---|---|
| 1 | 0 | CR_lounge_220110_0731.s0_chunk27 | 1 |
| 2 | 1 | CR_lounge_220110_0731.s0_chunk18 | 1 |
| 3 | 2 | CR_lounge_220110_0731.s0_chunk70 | 1 |
| 4 | 3 | CR_lounge_220110_0731.s0_chunk0 | 1 |
| 5 | 4 | CR_lounge_220110_0731.s0_chunk39 | 1 |
| 6 | 5 | CR_lounge_220110_0731.s0_chunk5 | 1 |
| 7 | 6 | CR_lounge_220110_0731.s0_chunk24 | 1 |
| 8 | 7 | CR_lounge_220110_0731.s0_chunk72 | 1 |
| 9 | 8 | CR_lounge_220110_0731.s0_chunk15 | 1 |
| 10 | 9 | CR_lounge_220110_0731.s0_chunk16 | 1 |
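
The `fold` column is what drives cross-validation: for each fold, the chunks assigned to it are held out and the rest are used for training. A minimal sketch of that split follows, using plain vectors so it runs without the dataset. The function name `split_fold` and the toy values are my own; in the notebook the two vectors would presumably be the `name` and `fold` columns of `dev_chunks_refined_folds`.

```julia
# Split chunk names into training and validation sets for one held-out fold.
function split_fold(chunk_names::AbstractVector{<:AbstractString},
                    chunk_folds::AbstractVector{<:Integer},
                    held_out::Integer)
    length(chunk_names) == length(chunk_folds) ||
        throw(DimensionMismatch("names and folds must have the same length"))
    is_validation = chunk_folds .== held_out
    return (train = chunk_names[.!is_validation],
            validation = chunk_names[is_validation])
end

# Toy stand-ins for the `name` and `fold` columns (made-up values).
toy_names = ["chunk_a", "chunk_b", "chunk_c", "chunk_d", "chunk_e"]
toy_folds = [1, 1, 2, 3, 2]

# Iterate over whichever fold ids actually occur in the data.
for held_out in sort(unique(toy_folds))
    parts = split_fold(toy_names, toy_folds, held_out)
    println("fold ", held_out, ": ", length(parts.train), " train, ",
            length(parts.validation), " validation")
end
```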


In [59]:
chunks = chime_home["chunks"]

📂 Tree chunks @ /home/ubuntu/csce-5380-project/chime_home
 📄 CR_lounge_200110_1601.s0_chunk0.16kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk0.48kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk0.csv
 📄 CR_lounge_200110_1601.s0_chunk1.16kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk1.48kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk1.csv
 📄 CR_lounge_200110_1601.s0_chunk10.16kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk10.48kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk10.csv
 📄 CR_lounge_200110_1601.s0_chunk11.16kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk11.48kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk11.csv
 📄 CR_lounge_200110_1601.s0_chunk12.16kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk12.48kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk12.csv
 📄 CR_lounge_200110_1601.s0_chunk13.16kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk13.48kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk13.csv
 📄 CR_lounge_200110_1601.s0_chunk14.16kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk14.48kHz.wav
 📄 CR_lounge_200110_1601.s0_chunk14.csv
 📄 CR_lounge_200110_1601

In [69]:
#=
    for my next trick, i will aggregate the entire data set into
    one data frame, with the CSV information for each chunk together
    with the 16kHz and 48kHz sound data.

    ...actually, what i want to do is make a function that retrieves
    all that data given a chunk name, which means not having to search
    the chunks directory, and also means that i can construct data sets
    from the given CSV files on-demand and release them when i'm finished
    with them.

    unsure how to format individual elements for usage with DataFrames,
    except perhaps as 1-element DataFrames. that might work.
=#

function chunk(name::AbstractString)
    return false;
end

chunk (generic function with 1 method)
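
The `chunk` stub above does not load anything yet. As a starting point, here is a sketch of the two pieces that do not need the dataset on disk: building the three file names for a chunk (following the naming pattern visible in the `chunks` listing) and parsing a per-chunk CSV into a `Dict`. The names `chunk_files` and `parse_metadata` are hypothetical, and the two-column `key,value` layout is an assumption about those CSV files that should be checked against a real one.

```julia
# File names belonging to one chunk, following the pattern
# <name>.16kHz.wav, <name>.48kHz.wav, <name>.csv seen in the listing.
function chunk_files(name::AbstractString)
    return (wav16 = name * ".16kHz.wav",
            wav48 = name * ".48kHz.wav",
            metadata = name * ".csv")
end

# Parse "key,value" lines into a Dict. Lines without a comma are skipped.
# ASSUMPTION: the per-chunk CSV files use this two-column layout.
function parse_metadata(text::AbstractString)
    metadata = Dict{String,String}()
    for line in split(text, '\n'; keepempty = false)
        fields = split(strip(line), ','; limit = 2)
        length(fields) == 2 || continue
        metadata[String(fields[1])] = String(fields[2])
    end
    return metadata
end

files = chunk_files("CR_lounge_200110_1601.s0_chunk0")
println(files.wav16)

# Made-up metadata text, only to exercise the parser.
meta = parse_metadata("example_key,example_value\nother_key,other_value\n")
println(meta["example_key"])
```

In the notebook, the text passed to `parse_metadata` could come from `open(String, chunks[files.metadata])`, mirroring how `print_data_file` reads files. Reading the audio itself is left out of this sketch.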