<a href="https://colab.research.google.com/github/june-oh/2023_AI_Academy_ASR/blob/main/1_Audio_file_handling_using_torchaudio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Audio File Handling with TorchAudio

## Python Audio Manipulation Packages
### Torchaudio
<img src="https://github.com/pytorch/audio/raw/main/docs/source/_static/img/logo.png" height=120>

The aim of torchaudio is to apply PyTorch to the audio domain. 



### Librosa 

<img src="https://github.com/librosa/librosa/raw/main/docs/img/librosa_logo_text.svg" height=120>

A python package for music and audio analysis.



### Library imports
- `torch` : deep learning library (PyTorch) that makes it easy to design and train models
- `torchaudio` : library for handling audio as torch tensors
- `pandas` : library for tabular data (DataFrame, CSV, Excel)
- `matplotlib` : visualization library
- `IPython.display` : library for IPython display widgets, such as an audio player
- `pathlib` : path library that makes file paths easy to work with


In [None]:
import torch
import torchaudio     
import torchaudio.transforms as T
import torch.nn.functional as F

import pandas as pd
import matplotlib.pyplot as plt
import IPython.display as ipd
from pathlib import Path

## Version Check

In [None]:
#version check
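
One possible way to fill in the version check is sketched below. It reads the installed package metadata with the standard library, so it also runs where a package is missing. `installed_version` is a hypothetical helper name, not part of any of these libraries. Inside the notebook, `torch.__version__` and `torchaudio.__version__` report the same information.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the version of each library imported above.
for name in ("torch", "torchaudio", "pandas", "matplotlib"):
    print(f"{name}: {installed_version(name)}")
```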

# Load Audio File
## Data : free-spoken-digit-dataset

A spoken-audio counterpart to the MNIST dataset.

https://github.com/Jakobovski/free-spoken-digit-dataset

<img src="https://drive.google.com/uc?id=1yEjXMS5-KTrYriyPhrSaJqeneBTStao_">

### Current status
- 6 speakers
- 3,000 recordings (50 of each digit per speaker)
- English pronunciations
### Organization
Files are named in the following format: `{digitLabel}_{speakerName}_{index}.wav`. Example: `7_jackson_32.wav`

### Usage
The test set officially consists of the first 10% of the recordings. Recordings numbered 0-4 (inclusive) are in the test set, and recordings numbered 5-49 are in the training set.
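
The naming scheme and the split rule can be turned into code roughly as follows. This is a minimal sketch that assumes speaker names contain no underscore; `Recording`, `parse_recording`, and `is_test` are hypothetical names introduced here for illustration.

```python
from pathlib import Path
from typing import NamedTuple


class Recording(NamedTuple):
    digit: int
    speaker: str
    index: int


def parse_recording(path) -> Recording:
    """Split '{digitLabel}_{speakerName}_{index}.wav' into its three fields."""
    digit, speaker, index = Path(path).stem.split("_")
    return Recording(int(digit), speaker, int(index))


def is_test(path) -> bool:
    """Recordings with index 0-4 are test data, 5-49 are training data."""
    return parse_recording(path).index < 5


path = "recordings/7_jackson_32.wav"
print(parse_recording(path), "test" if is_test(path) else "train")
```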



In [None]:
# clone git repo

In [None]:
#check file list

In [None]:
#check file list in dir

```
acquire_data  metadata.py	    README.md	upload_to_hub.py
__init__.py   pip_requirements.txt  recordings	utils
```

`recordings` : the directory that holds the audio files

In [None]:
#collect list of files using pathlib (later cells expect the list in `audios`)

In [None]:
#check Path

Check the `Path` of each audio file.

In [None]:
#Path, name, stem
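
The `pathlib` steps above can be sketched as follows. To stay runnable anywhere, the example builds a temporary directory with empty stand-in files instead of using the cloned repository. The file names are made up for illustration, and the list name `audios` is chosen to match what later cells use.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    recordings = Path(tmp) / "recordings"
    recordings.mkdir()
    for filename in ("0_george_0.wav", "7_jackson_32.wav", "notes.txt"):
        (recordings / filename).touch()

    # glob("*.wav") yields only the wav files; sorting gives a stable order.
    audios = sorted(recordings.glob("*.wav"))
    names = [p.name for p in audios]        # file name with extension
    stems = [p.stem for p in audios]        # file name without extension
    suffixes = [p.suffix for p in audios]   # extension, including the dot

print(names)
print(stems)
print(suffixes)
```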

## Listening to a `wav` file with `ipd.Audio`

```
??ipd.Audio
```

In [None]:
??ipd.Audio

In [None]:
#listen audio file using ipd.Audio

## Audio Meta data

- `sample_rate` is the sampling rate of the audio
- `num_channels` is the number of channels
- `num_frames` is the number of frames per channel
- `bits_per_sample` is bit depth
- `encoding` is the sample coding format
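
To make these fields concrete, the sketch below writes a short synthetic tone to an in-memory WAV file and reads the metadata back. It deliberately uses Python's standard `wave` module rather than `torchaudio.info`, so it runs without torchaudio. The tone, its length, and the dictionary keys are assumptions chosen to mirror the list above. `encoding` is left out because `wave` handles only PCM data.

```python
import io
import math
import struct
import wave

sample_rate = 8000
# 0.5 s of a 440 Hz tone as 16-bit integers (synthetic stand-in for a recording).
samples = [int(0.5 * 32767 * math.sin(2 * math.pi * 440 * n / sample_rate))
           for n in range(sample_rate // 2)]

buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)           # mono
    writer.setsampwidth(2)           # 2 bytes = 16 bits per sample
    writer.setframerate(sample_rate)
    writer.writeframes(struct.pack(f"<{len(samples)}h", *samples))

buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    meta = {
        "sample_rate": reader.getframerate(),
        "num_channels": reader.getnchannels(),
        "num_frames": reader.getnframes(),
        "bits_per_sample": reader.getsampwidth() * 8,
    }
print(meta)
```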

In [None]:
#audio meta data check 
# torchaudio.info

## Loading audio files with torchaudio
### Loading audio data
To load audio data, you can use `torchaudio.load()`.

This function accepts a path-like object or file-like object as input.

The returned value is a tuple of waveform (`Tensor`) and sample rate (`int`).

By default, the resulting tensor object has `dtype=torch.float32` and its value range is `[-1.0, 1.0]`.

For the list of supported formats, please refer to the torchaudio documentation.
```
waveform, sample_rate = torchaudio.load(SAMPLE_WAV)
```
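
The `[-1.0, 1.0]` value range comes from rescaling the integer samples stored in the file. The sketch below illustrates the idea for 16-bit PCM, assuming the common convention of dividing by 32768. It is an illustration of the scaling only, not torchaudio's actual code, and `pcm16_to_float` is a hypothetical helper.

```python
def pcm16_to_float(samples):
    """Map signed 16-bit integers (-32768..32767) to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


# Extreme and mid-range 16-bit values.
waveform = pcm16_to_float([-32768, -16384, 0, 16384, 32767])
print(waveform)
```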

In [None]:
??torchaudio.load

In [None]:
#load audio to tensor (later cells expect the waveform in `y` and the sampling rate in `sr`)

In [None]:
#check sampling rate

In [None]:
#check samples

In [None]:
#check type of samples

In [None]:
#check shape

In [None]:
#check file duration(second)

In [None]:
#mono to stereo

In [None]:
#stereo to mono

In [None]:
#tensor slicing
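
One possible way to approach the duration, channel conversion, and slicing cells above is sketched here on a synthetic tone, so no audio file is needed. The 8 kHz sampling rate and the 440 Hz tone are assumptions. Copying the channel and averaging the channels are common choices for these conversions, not the only ones.

```python
import math

import torch

sr = 8000                                    # assumed sampling rate
t = torch.arange(sr) / sr                    # one second of time stamps
y = torch.sin(2 * math.pi * 440 * t).unsqueeze(0)   # mono waveform, shape (1, 8000)

duration = y.shape[1] / sr                   # seconds = frames / sampling rate
stereo = y.repeat(2, 1)                      # mono -> stereo: copy the channel
mono = stereo.mean(dim=0, keepdim=True)      # stereo -> mono: average the channels
clip = y[:, int(0.25 * sr):int(0.5 * sr)]    # keep the samples from 0.25 s to 0.5 s

print(duration, stereo.shape, mono.shape, clip.shape)
```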

Likewise, `ipd.Audio` can read and play a `torch.Tensor` variable (pass the sampling rate with `rate=`).

In [None]:
??ipd.Audio

In [None]:
#listen audio using tensor and ipd.audio

## Visualizing a `torch.Tensor` waveform
Audio samples can be plotted with `matplotlib.pyplot`.

Python slicing lets you zoom in on a specific interval.


In [None]:
data = list(torch.sin(torch.tensor(range(10))))
data

In [None]:
plt.plot(data)

In [None]:
y[0]

In [None]:
plt.plot(y[0])

In [None]:
start,end = 100,150
plt.plot(y[0])
plt.axvline(start,color='r')
plt.axvline(end,color='r')

In [None]:
plt.plot(y[0][0:1024])
plt.plot(torch.hann_window(1024))


In [None]:
data = y[0][0:1024]*torch.hann_window(1024)

In [None]:
plt.plot(y[0][0:1024])

In [None]:
plt.plot(data)

In [None]:
torch.hann_window(1024)

In [None]:
start,dur = 1000,150
#plt.bar(range(dur),y[0][start:start+dur])
plt.figure(figsize=(10,2),dpi=100)
plt.plot(range(dur),y[0][start:start+dur])
plt.show()

### Inspecting samples with `matplotlib.pyplot.stem`
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.stem.html
```python
??matplotlib.pyplot.stem
```

In [None]:
plt.figure(figsize=(10,2),dpi=100)
plt.stem(range(dur),y[0][start:start+dur])

Drawing it together with the line plot

In [None]:
#plt.bar(range(dur),y[0][start:start+dur])
plt.figure(figsize=(10,2),dpi=100)
plt.plot(range(dur),y[0][start:start+dur])
plt.show()
plt.figure(figsize=(10,2),dpi=100)
plt.plot(range(dur),y[0][start:start+dur])
plt.stem(range(dur),y[0][start:start+dur])
plt.show()

## Audio feature extraction 
### Overview of audio features

<img src="https://download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png" width=600>



Frequency domain

STFT (short-time Fourier transform): a DFT applied to each windowed frame of the signal

<img src="https://upload.wikimedia.org/wikipedia/commons/6/61/FFT-Time-Frequency-View.png?20171130134719" width=600>
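
To show what a single DFT frame contains, here is a naive, unwindowed DFT in plain Python. It is a teaching sketch under stated assumptions (a synthetic 1000 Hz tone, 8 kHz sampling rate, 256-sample frame), not how torchaudio computes it. A real STFT multiplies each frame by a window first and uses an FFT.

```python
import cmath
import math


def dft_power(frame):
    """Power spectrum |X[k]|^2 for k = 0 .. N/2 of a real-valued frame (naive O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


sample_rate, n_fft = 8000, 256
# A 1000 Hz tone falls exactly on bin 1000 / (8000 / 256) = 32.
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n_fft)]
power = dft_power(frame)
peak_bin = max(range(len(power)), key=power.__getitem__)
print(len(power), peak_bin, peak_bin * sample_rate / n_fft)
```

Only bins 0 to N/2 are kept because the spectrum of a real signal is symmetric, which is where the `n_fft // 2 + 1` frequency bins of a spectrogram come from.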

### Raw Spectrogram 
Use the `torchaudio.transforms.Spectrogram` class, imported here as `T.Spectrogram`.
```python
n_fft = 1024
win_length = None
hop_length = 512

# Define transform
spectrogram = T.Spectrogram(
    n_fft=n_fft,
    win_length=win_length,
    hop_length=hop_length,
    center=True,
    pad_mode="reflect",
    power=2.0,
)
```

In [None]:
??T.Spectrogram

In [None]:
print(audios[1])
y,sr = torchaudio.load(audios[1])
print(y.shape)
plt.plot(y[0])


In [None]:
plt.plot(y[0][0:1024])

In [None]:
n_fft=256
win_length = n_fft
hop_length=win_length//2
start,dur = 0,2911
plt.figure(figsize=(16,1),dpi=300)
plt.plot(y[0][start:start+dur])
i=0

for x in range(start,dur,hop_length):
  i+=1  
  c='r' if i%2==0 else 'g'
  plt.axvline(x,color=c)
  plt.axvspan(x,x+win_length,color='r',alpha=0.1)


In [None]:
plt.plot(torch.hann_window(256)) #1/2 overlap 128 
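
A half-window hop is a natural partner for the Hann window: with 50% overlap, the tapered end of one frame and the rising start of the next add up to a constant. The sketch below checks this in plain Python. It assumes the periodic Hann formula, which is the form `torch.hann_window` uses by default. `hann` is a hypothetical helper.

```python
import math


def hann(n_points):
    """Periodic Hann window, w[n] = 0.5 - 0.5 * cos(2*pi*n / N)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]


win_length = 256
hop_length = win_length // 2
window = hann(win_length)

# Second half of one frame's window plus first half of the next frame's window.
overlap_sum = [window[n + hop_length] + window[n] for n in range(hop_length)]
print(min(overlap_sum), max(overlap_sum))
```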

In [None]:
spec_converter = T.Spectrogram(n_fft=n_fft,
                               win_length=win_length,
                               hop_length=hop_length)
spec = spec_converter(y)

In [None]:
spec.shape

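`torch.Size([1, 129, 23])` 

1 : channel (or batch) dimension

129 : number of frequency bins, `n_fft // 2 + 1` (n_fft = 256) 

23 : number of frames, `len(y[0]) // hop_length + 1` with `center=True`; this equals `ceil(len(y[0]) / hop_length)` unless the length is an exact multiple of `hop_length`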
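
The frame count can be checked with a few lines of arithmetic. The sketch assumes a centered STFT (`center=True`, the setting shown in the `T.Spectrogram` example earlier), where the frame count is `num_samples // hop_length + 1`. It compares that with the ceiling formula for two lengths: 2911, the segment length used in the plotting cell, and 2944, an exact multiple of the hop. `num_frames` is a hypothetical helper.

```python
import math


def num_frames(num_samples: int, hop_length: int) -> int:
    """Frame count of a centered STFT: one frame at sample 0, then one per hop."""
    return num_samples // hop_length + 1


hop_length = 128
for num_samples in (2911, 2944):   # 2944 = 23 * 128
    print(num_samples,
          num_frames(num_samples, hop_length),
          math.ceil(num_samples / hop_length))
```

The two formulas agree for 2911 samples but differ by one frame when the length is an exact multiple of the hop.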


In [None]:
import math
print(len(y[0]) )
print(len(y[0])/hop_length )
print(math.ceil(len(y[0])/hop_length))


In [None]:
??T.Spectrogram

In [None]:
print(len(y[0]))
print(len(y[0])//hop_length+1)

In [None]:
spec, spec.shape

In [None]:
plt.plot(y[0])
plt.show()
plt.imshow(spec[0],origin="lower",aspect='auto',interpolation='nearest')
plt.colorbar()

In [None]:
n_fft=256
win_length = n_fft
hop_length=win_length//2
start,dur = 0,2911
plt.figure(figsize=(16,1),dpi=300)
plt.plot(y[0][start:start+dur])
i=0

for x in range(start,dur,hop_length):
  i+=1  
  c='r' if i%2==0 else 'g'
  plt.axvline(x,color=c)
  plt.axvspan(x,x+win_length,color='r',alpha=0.1)
plt.show()
plt.plot(spec[0,:,5])
plt.show()

### AmplitudeToDB
Turn a tensor from the power/amplitude scale to the decibel scale.

`torchaudio.transforms.AmplitudeToDB(stype: str = 'power', top_db: Optional[float] = None)`
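
The conversion itself is a logarithm. The sketch below shows the arithmetic for a single value, assuming a reference of 1.0 and a floor of `1e-10` to avoid `log10(0)`. It omits the optional `top_db` clipping, and `to_db` is a hypothetical helper rather than the library's implementation.

```python
import math


def to_db(value: float, stype: str = "power", amin: float = 1e-10) -> float:
    """Decibels relative to 1.0: 10*log10 for power, 20*log10 for magnitude."""
    multiplier = 10.0 if stype == "power" else 20.0
    return multiplier * math.log10(max(value, amin))


# A power of 100 and a magnitude of 10 describe the same signal level.
print(to_db(1.0), to_db(100.0), to_db(10.0, stype="magnitude"), to_db(0.0))
```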
  

In [None]:
db_converter = T.AmplitudeToDB()

In [None]:
db_spec = db_converter(spec)
plt.imshow(db_spec[0],origin='lower',aspect='auto',interpolation='nearest')
plt.colorbar()

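Sampling rate 8000 Hz -> highest representable frequency (Nyquist) 4000 Hz

129 bins cover 0 Hz to 4000 Hz

Bin spacing: 4000 / 128 = 8000 / 256 = 31.25 Hz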
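
The mapping from bin index to frequency can be written out directly. This sketch assumes the values used in this notebook (8000 Hz sampling rate, `n_fft = 256`); bin `k` sits at `k * sample_rate / n_fft` Hz.

```python
sample_rate, n_fft = 8000, 256

n_bins = n_fft // 2 + 1                 # 129 bins, indices 0 .. 128
bin_width = sample_rate / n_fft         # spacing between neighboring bins in Hz
frequencies = [k * bin_width for k in range(n_bins)]

print(n_bins, bin_width, frequencies[0], frequencies[1], frequencies[-1])
```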

In [None]:
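db_spec = db_converter(spec)
plt.imshow(db_spec[0],origin='lower',aspect='auto',interpolation='nearest')
# label every 16th bin with its frequency in Hz (bin k sits at k * sample_rate / n_fft)
ticks = list(range(0, 129, 16))
plt.yticks(ticks, [f"{k * 8000 / 256:.0f}" for k in ticks])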

In [None]:
spec.shape

### Mel-Spectrogram
Reference: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
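
As a rough illustration of how a mel filterbank is laid out, the sketch below places band edges at equal steps on the mel scale and converts them back to Hz. It assumes the HTK-style formula `mel = 2595 * log10(1 + f / 700)`. The function names are hypothetical, and the parameters (`f_min=20`, `f_max=4000`, `n_mels=10`) are borrowed from the cells in this section.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style conversion from Hz to mel."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_mels: int):
    """n_mels + 2 frequencies in Hz, equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges(f_min=20, f_max=4000, n_mels=10)
print([round(f) for f in edges])
```

Each triangular filter `i` rises from `edges[i]`, peaks at `edges[i + 1]`, and falls to zero at `edges[i + 2]`, so the filters get wider as the frequency increases.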

In [None]:
mel_scale = T.MelScale(n_mels=64,sample_rate=8000,f_min=20,f_max=4000,n_stft=129)

In [None]:
plt.imshow(mel_scale.fb,aspect='auto',origin='lower')

In [None]:
fb = torchaudio.functional.melscale_fbanks(n_freqs=129,
                                           f_min=20,
                                           f_max=4000,
                                           n_mels=10,
                                           sample_rate=8000)
fb.shape

In [None]:
plt.plot(fb.T[5])

In [None]:

# plot every triangular mel filter (one row of fb.T per mel band)
for band in fb.T:
  plt.plot(band)
plt.show()

In [None]:
mel_converter = T.MelSpectrogram(sample_rate=8000,n_mels=64,n_fft=256,hop_length=n_fft//2)

In [None]:
mel_spec = mel_converter(y)
plt.imshow(mel_spec[0],aspect='auto',interpolation='nearest',origin='lower')

In [None]:
mel_spec = db_converter(mel_spec)
plt.imshow(mel_spec[0],aspect='auto',interpolation='nearest',origin='lower')

### MFCC
```python
CLASS torchaudio.transforms.MFCC(
        sample_rate: int = 16000, 
        n_mfcc: int = 40, 
        dct_type: int = 2, 
        norm: str = 'ortho', 
        log_mels: bool = False, 
        melkwargs: Optional[dict] = None)
```
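
With `dct_type=2` and `norm='ortho'`, the final step from a log-mel frame to MFCCs is an orthonormal DCT-II that keeps only the first `n_mfcc` coefficients. The sketch below shows that step for one frame in plain Python. The band energies are made-up numbers, and `dct2_ortho` is a hypothetical helper, not the library's implementation.

```python
import math


def dct2_ortho(values, n_coeffs):
    """First n_coeffs coefficients of the orthonormal DCT-II of `values`."""
    n = len(values)
    coeffs = []
    for k in range(n_coeffs):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


# Hypothetical mel-band energies in dB for one frame (8 bands for brevity).
log_mel_frame = [-20.0, -12.0, -8.0, -10.0, -15.0, -25.0, -30.0, -32.0]
mfcc_frame = dct2_ortho(log_mel_frame, n_coeffs=4)
print([round(c, 2) for c in mfcc_frame])
```

Coefficient 0 tracks the overall level of the frame, while the higher coefficients describe progressively finer detail of the spectral envelope.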

In [None]:
??T.MFCC

In [None]:
melkwargs={
        "n_fft":256,
        "n_mels": 64,
        "hop_length": 256//2,
        "mel_scale": "htk",
    }
mfcc_converter = T.MFCC(sample_rate=8000,n_mfcc=13,melkwargs=melkwargs)

In [None]:
mfcc = mfcc_converter(y)
mfcc.shape

In [None]:
#mfcc = db_converter(mfcc)
plt.imshow(mfcc[0],origin='lower',aspect='auto',interpolation='nearest')