# Event-based Visual Microphone

### Purpose
This notebook extracts audible sound from a video of an object vibrating in response to that sound. The process begins by transforming a standard RGB video into an event video. This step is necessary because an event camera was not available at the time of this project. The event video is then processed to recover the sound.

### Objectives
- To break an RGB video into segments.
- To convert the RGB videos to event videos, using a simulator inspired by the v2e toolbox.
- To convert the event videos to sound using Abe Davis' Visual Microphone method.
- To use a bandwidth extension model to enhance the recovered sound.
- To visualise the recovered signal.

### Dependencies
To run this notebook, you will need the following libraries:
- `cv2`: OpenCV
- `numpy`: NumPy
- `scipy`: SciPy
- `torch`: PyTorch
- `librosa`: Librosa
- `tensorflow`: TensorFlow
- `matplotlib`: Matplotlib
- `soundfile`: Soundfile
- `sounddevice`: Sounddevice

In [1]:
import utilities.video_frames as frames
import event_camera.simulator_utils as camera
import steerable_pyramid.davis_method as pyramid

## Break RGB video into segments
The code below assesses the video size and breaks the video into 2GB segments. The segments are stored in a temporary folder in Documents, which is deleted automatically at the end.
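
As a rough illustration of the size-based split, the sketch below plans frame ranges for each segment. It is not the implementation of `frames.extract_video_segments`: the helper name `plan_segments` is hypothetical, the 2GB cap is assumed to mean 2 GiB, and bytes are assumed to be spread evenly across frames, which is only approximately true for compressed video.

```python
import math


def plan_segments(file_size_bytes, total_frames, max_segment_bytes=2 * 1024**3):
    """Return (start_frame, end_frame) pairs, with end_frame exclusive.

    Hypothetical helper: assumes every frame takes about the same number
    of bytes, so equal frame counts give roughly equal segment sizes.
    """
    if file_size_bytes <= 0 or total_frames <= 0:
        return []
    # Smallest number of segments that keeps each one under the cap.
    n_segments = max(1, math.ceil(file_size_bytes / max_segment_bytes))
    frames_per_segment = math.ceil(total_frames / n_segments)
    return [
        (start, min(start + frames_per_segment, total_frames))
        for start in range(0, total_frames, frames_per_segment)
    ]
```

In practice the file size could come from `os.path.getsize(video_path)` and the frame count from the video reader; both are left to the caller here.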

<span style="color:red"> Enter the input video file path and fps below: </span>

In [2]:
framerate = 2200
video_path = '/Volumes/Omkar 5T/video_dataset/plants.avi'

In [None]:
frames.extract_video_segments(video_path)

## Convert RGB video to event video
This process uses a model inspired by the v2e toolbox, which was created by Tobi Delbruck, Yuhuang Hu and Zhe He. The only difference is that our model doesn't include event noise. There are two important variables:
- **Cut-off frequency** (`cutoff_freq`): sets the bandwidth of each event pixel. More information can be extracted at lower frequencies, which makes a low cut-off well suited to low-light conditions.
- **Positive and negative thresholds** (`pos_thresh` and `neg_thresh`): chosen entirely from the cut-off frequency and the maximum allowable number of events for any pixel.
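
To make the role of these two variables concrete, here is a minimal noise-free sketch in the spirit of v2e. It is an assumption about the general approach, not the code behind `camera.event_simulator`: the function name `simulate_events`, the `log(I + 1)` offset, the first-order low-pass filter, and the units (Hz for the cut-off, seconds for the sampling period) are all choices made for illustration.

```python
import numpy as np


def simulate_events(frames, sampling_period, cutoff_freq, pos_thresh, neg_thresh):
    """Return signed event counts per pixel and frame, shape (T, H, W).

    frames: non-negative grayscale intensities with shape (T, H, W).
    """
    frames = np.asarray(frames, dtype=np.float64)
    log_frames = np.log(frames + 1.0)  # offset avoids log(0); an assumed constant

    # First-order IIR low-pass filter modelling the finite pixel bandwidth.
    tau = 1.0 / (2.0 * np.pi * cutoff_freq)
    alpha = sampling_period / (tau + sampling_period)

    filtered = log_frames[0].copy()   # low-pass filter state
    reference = log_frames[0].copy()  # log intensity at each pixel's last event
    events = np.zeros(log_frames.shape, dtype=np.int64)

    for t in range(1, len(log_frames)):
        filtered += alpha * (log_frames[t] - filtered)
        diff = filtered - reference
        # Number of whole thresholds crossed since the last event.
        n_on = np.floor(np.maximum(diff, 0.0) / pos_thresh).astype(np.int64)
        n_off = np.floor(np.maximum(-diff, 0.0) / neg_thresh).astype(np.int64)
        # Move the reference by exactly the amount the events account for.
        reference += n_on * pos_thresh - n_off * neg_thresh
        events[t] = n_on - n_off
    return events
```

In this sketch a lower cut-off smooths the log intensity more strongly, so the same brightness change takes longer to cross a threshold, while smaller thresholds produce more events per unit change.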

### Ideal parameters for Visual Microphone dataset
For the slow-motion RGB videos from Abe Davis' Visual Microphone dataset, the ideal parameters are:
- MIDI Chips Bag: `cutoff_freq` of `3e-4`, `pos_thresh` of `8e-7` and `neg_thresh` of `8e-7`
- MIDI Plants: `cutoff_freq` of `3e-5`, `pos_thresh` of `9e-8` and `neg_thresh` of `9e-8`
- Speech Chips Bag 2.2kHz:
- Speech Chips Bag 20kHz:

<span style="color:red"> Enter the simulation parameters below: </span>

In [16]:
cutoff_freq = 3e-5
pos_thresh = 9e-8
neg_thresh = 9e-8

In [None]:
video_fps = 30
sampling_period = 1/video_fps
camera.event_simulator(sampling_period, cutoff_freq, pos_thresh, neg_thresh)

The code below displays the event video after conversion.

In [None]:
camera.show_video()

## Convert event video to sound

This step uses a phase-based method. The method applies a steerable pyramid to get a phase response at several scales and orientations. The responses are then flattened and averaged into a single time-series signal. The important parameters for this process are:
- **Number of Scales** (`nscales`): defines the number of levels of the pyramid. For the visual microphone this was set to 2.
- **Number of Orientations** (`norientations`): defines the number of steerable filters at each level of the pyramid. For the visual microphone this was set to 4.
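
The flatten-and-average step can be sketched for a single scale and orientation as follows. This is not `pyramid.ebvmSoundfromVideo`: it assumes the complex subband coefficients have already been computed, the function name `subband_to_signal` is hypothetical, and the normalisation by the total weight is a choice made here. The squared-amplitude weighting follows the description in the Visual Microphone paper; the alignment and summation across subbands is omitted.

```python
import numpy as np


def subband_to_signal(coeffs, reference_index=0):
    """Collapse complex subband coefficients of shape (T, H, W) to a 1-D signal.

    Each pixel's phase is compared with its phase in a reference frame,
    and the per-pixel phase variations are averaged with weights A^2.
    """
    coeffs = np.asarray(coeffs, dtype=np.complex128)
    reference = coeffs[reference_index]
    # Wrapped phase difference to the reference frame, per pixel.
    phase_variation = np.angle(coeffs * np.conj(reference))
    # Squared amplitude down-weights weakly textured pixels, where phase is noisy.
    weights = np.abs(coeffs) ** 2
    total_weight = weights.sum(axis=(1, 2))
    total_weight[total_weight == 0] = 1.0  # guard against empty subbands
    return (weights * phase_variation).sum(axis=(1, 2)) / total_weight
```

With `nscales = 2` and `norientations = 4` there would be eight such signals, one per subband, which would then be combined into the final recovered sound.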

<span style="color:red"> Enter the steerable pyramid parameters and recovered sound file path below: </span>

In [19]:
nscales = 2
norientations = 4
save_path = '/Volumes/Omkar 5T/chips1.wav'

This step takes a long time to run, typically around 2 hours.

In [None]:
pyramid.ebvmSoundfromVideo(save_path, nscales, norientations, framerate)