## 05-3 Distinguishing Sentences and Speakers
- The Whisper API does not identify who is speaking

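### Exercise. Identifying speakers by time segment with a speaker diarization model
- pyannote.audio
    - Provides speaker diarization
    - Open-source toolkit
    - Runs on the PyTorch machine learning framework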

Download the speaker diarization model from Hugging Face and prepare it for use.
Follow the Requirements:
- Accept pyannote/segmentation-3.0 user conditions
- Accept pyannote/speaker-diarization-community-1 user conditions
- Accept pyannote/speaker-diarization-3.1 user conditions
- Create access token at hf.co/settings/tokens.
- The guide documentation is updated slowly; recent changes are often missing from it, which frequently causes problems

#### Installing pyannote.audio and numpy

In [33]:
%pip install pyannote.audio
%pip install numpy==1.26

Collecting numpy (from asteroid-filterbanks>=0.4.0->pyannote.audio)
  Using cached numpy-2.4.1-cp312-cp312-macosx_14_0_arm64.whl.metadata (6.6 kB)
Using cached numpy-2.4.1-cp312-cp312-macosx_14_0_arm64.whl (5.2 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.0
    Uninstalling numpy-1.26.0:
      Successfully uninstalled numpy-1.26.0
Successfully installed numpy-2.4.1
Note: you may need to restart the kernel to use updated packages.


Collecting numpy==1.26
  Using cached numpy-1.26.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (53 kB)
Using cached numpy-1.26.0-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.4.1
    Uninstalling numpy-2.4.1:
      Successfully uninstalled numpy-2.4.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyannote-core 6.0.1 requires numpy>=2.0, but you have numpy 1.26.0 which is incompatible.
pyannote-metrics 4.0.0 requires numpy>=2.2.2, but you have numpy 1.26.0 which is incompatible.
scipy 1.17.0 requires numpy<2.7,>=1.26.4, but you have numpy 1.26.0 which is incompatible.
Successfully installed numpy-1.26.0
Note: you may need to restart the kernel to use updated packages.


#### Using pyannote.audio
- Copy the example code from the Usage section of the model's detail page
- Adjust the token-related part

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

HUGGING_FACE_TOKEN = os.getenv("HUGGING_FACE_TOKEN")

In [4]:
# instantiate the pipeline
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.1",
  token=HUGGING_FACE_TOKEN
)

Modify the code to use a CUDA-capable GPU when one is available (CUDA is not supported on this machine).
On macOS, CUDA is not available; to run on the GPU, use MPS (Metal Performance Shaders) instead:

In [5]:
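import torch

# Use CUDA when it is present; otherwise only report that it is missing.
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))
    print('cuda is available')
else:
    print('cuda is not available')

# On Apple Silicon, move the pipeline to the MPS backend instead.
if torch.backends.mps.is_available():
    pipeline.to(torch.device("mps"))
    print("mps is available")
else:
    print("mps is not available")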

cuda is not available
mps is available
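
The two checks above can be folded into one helper that picks a device in priority order. This is only a minimal sketch: `choose_device` is a hypothetical name, not part of PyTorch or pyannote.audio, and it takes the availability flags as arguments so the selection logic can be read and run without a GPU.

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device name in priority order: CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# On the machine used in this section, CUDA is unavailable and MPS is available.
device_name = choose_device(cuda_available=False, mps_available=True)
print(device_name)
```

In the notebook, the flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the result would be passed on as `pipeline.to(torch.device(device_name))`.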


### Separating speakers in an audio file
- To test speaker diarization, use an audio file with at least two speakers

#### Running diarization and saving the result as an RTTM file
- RTTM = Rich Transcription Time Marked
    - An RTTM file is a text format used mainly in speech processing, especially to record speaker diarization results
    - Think of it as a timeline log that records who spoke, when, and for how long
- Fix for NameError: name 'AudioDecoder' is not defined
    - https://discuss.huggingface.co/t/problem-with-pyannote-speaker-diarization-3-1/169415
    - https://huggingface.co/datasets/John6666/forum2/blob/main/torchcodec_windows_error_1.md
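
To make the RTTM layout concrete, here is a minimal sketch that parses one line into a speaker, a start time, and an end time. It assumes the common ten-field RTTM layout (type, file, channel, onset, duration, two unused fields, speaker name, two more unused fields) with `<NA>` as the placeholder for unused fields. `Turn` and `parse_rttm_line` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    speaker: str
    start: float
    end: float


def parse_rttm_line(line: str) -> Turn:
    """Parse one SPEAKER line of an RTTM file into (speaker, start, end)."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"not a SPEAKER line: {line!r}")
    start = float(fields[3])     # turn onset in seconds
    duration = float(fields[4])  # turn duration in seconds
    return Turn(speaker=fields[7], start=start, end=start + duration)


# Sample line in the assumed layout; unused fields are written as "<NA>".
sample = "SPEAKER waveform 1 0.993 5.805 <NA> <NA> SPEAKER_00 <NA> <NA>"
turn = parse_rttm_line(sample)
print(turn.speaker, turn.start, round(turn.end, 3))
```

The same field order is what the column names passed to `pd.read_csv` later in this section rely on.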

In [14]:
import torchaudio

# Preload -> bypass TorchCodec
waveform, sr = torchaudio.load("../audio/싼기타_비싼기타.mp3")
out = pipeline({"waveform": waveform, "sample_rate": sr})

ann = out.speaker_diarization
with open("싼기타_비싼기타.rttm", "w", encoding="utf-8") as rttm:
    ann.write_rttm(rttm)



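### Saving the result as a DataFrame with pandas
- Diarization works well, but a single speaker's utterance is split across multiple rows
- When the same speaker keeps talking, merge those rows into one
- pandas makes it easy to manipulate data in DataFrame form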
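
Before moving to pandas, the merging idea can be sketched in plain Python. This is an illustration under assumptions, not the notebook's method: the rows are a small illustrative subset in `(speaker_id, start, end)` form, and `itertools.groupby` stands in for the numbering step used later.

```python
from itertools import groupby

# (speaker_id, start, end) rows in time order; an illustrative subset.
turns = [
    ("SPEAKER_00", 0.993, 6.798),
    ("SPEAKER_00", 7.405, 11.388),
    ("SPEAKER_01", 32.414, 33.173),
    ("SPEAKER_01", 33.545, 37.106),
    ("SPEAKER_00", 41.611, 42.455),
]

merged = []
# groupby only groups adjacent items, so a speaker who returns later
# starts a new group instead of being merged with their earlier turns.
for speaker, group in groupby(turns, key=lambda t: t[0]):
    rows = list(group)
    start = min(r[1] for r in rows)  # earliest start in the run
    end = max(r[2] for r in rows)    # latest end in the run
    merged.append((speaker, start, end))

for row in merged:
    print(row)
```

The five input rows collapse into three merged turns, one per uninterrupted run of the same speaker.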

#### Converting RTTM to CSV and displaying it as a DataFrame

In [1]:
import pandas as pd
rttm_path = "./싼기타_비싼기타.rttm"

df_rttm = pd.read_csv(
    rttm_path,  # path to the RTTM file
    sep=' ',  # fields are separated by a single space
    header=None,  # the file has no header row
    names=['type', 'file', 'chnl', 'start', 'duration', 'C1', 'C2', 'speaker_id', 'C3', 'C4']
)

display(df_rttm)

,type,file,chnl,start,duration,C1,C2,speaker_id,C3,C4
0,SPEAKER,waveform,1,0.993,5.805,,,SPEAKER_00,,
1,SPEAKER,waveform,1,7.405,3.983,,,SPEAKER_00,,
2,SPEAKER,waveform,1,11.759,4.927,,,SPEAKER_00,,
3,SPEAKER,waveform,1,17.210,10.665,,,SPEAKER_00,,
4,SPEAKER,waveform,1,28.668,1.536,,,SPEAKER_00,,
...,...,...,...,...,...,...,...,...,...,...
83,SPEAKER,waveform,1,414.481,2.970,,,SPEAKER_01,,
84,SPEAKER,waveform,1,417.755,3.476,,,SPEAKER_00,,
85,SPEAKER,waveform,1,423.644,0.776,,,SPEAKER_01,,
86,SPEAKER,waveform,1,424.741,3.527,,,SPEAKER_01,,


#### Adding the end time of each utterance

In [2]:
# Compute end as start + duration
df_rttm['end'] = df_rttm['start'] + df_rttm['duration']

display(df_rttm)

,type,file,chnl,start,duration,C1,C2,speaker_id,C3,C4,end
0,SPEAKER,waveform,1,0.993,5.805,,,SPEAKER_00,,,6.798
1,SPEAKER,waveform,1,7.405,3.983,,,SPEAKER_00,,,11.388
2,SPEAKER,waveform,1,11.759,4.927,,,SPEAKER_00,,,16.686
3,SPEAKER,waveform,1,17.210,10.665,,,SPEAKER_00,,,27.875
4,SPEAKER,waveform,1,28.668,1.536,,,SPEAKER_00,,,30.204
...,...,...,...,...,...,...,...,...,...,...,...
83,SPEAKER,waveform,1,414.481,2.970,,,SPEAKER_01,,,417.451
84,SPEAKER,waveform,1,417.755,3.476,,,SPEAKER_00,,,421.231
85,SPEAKER,waveform,1,423.644,0.776,,,SPEAKER_01,,,424.420
86,SPEAKER,waveform,1,424.741,3.527,,,SPEAKER_01,,,428.268


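#### Adding a number column to track consecutive utterances
- To distinguish speakers and record the order of utterances, give each utterance a number that increases whenever the speaker changes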

In [3]:
df_rttm["number"] = None  # create the number column, initialized to None
df_rttm.at[0, "number"] = 0

display(df_rttm)

,type,file,chnl,start,duration,C1,C2,speaker_id,C3,C4,end,number
0,SPEAKER,waveform,1,0.993,5.805,,,SPEAKER_00,,,6.798,0
1,SPEAKER,waveform,1,7.405,3.983,,,SPEAKER_00,,,11.388,
2,SPEAKER,waveform,1,11.759,4.927,,,SPEAKER_00,,,16.686,
3,SPEAKER,waveform,1,17.210,10.665,,,SPEAKER_00,,,27.875,
4,SPEAKER,waveform,1,28.668,1.536,,,SPEAKER_00,,,30.204,
...,...,...,...,...,...,...,...,...,...,...,...,...
83,SPEAKER,waveform,1,414.481,2.970,,,SPEAKER_01,,,417.451,
84,SPEAKER,waveform,1,417.755,3.476,,,SPEAKER_00,,,421.231,
85,SPEAKER,waveform,1,423.644,0.776,,,SPEAKER_01,,,424.420,
86,SPEAKER,waveform,1,424.741,3.527,,,SPEAKER_01,,,428.268,


#### Numbering the speaker turns
- Starting from the second row (i = 1), if the speaker_id matches the previous row (i - 1), copy that row's number
- Otherwise, add 1 to the previous number to assign a new one

In [5]:
for i in range(1, len(df_rttm)):
    if df_rttm.at[i, "speaker_id"] != df_rttm.at[i-1, "speaker_id"]:
        df_rttm.at[i, "number"] = df_rttm.at[i-1, "number"] + 1
    else:
        df_rttm.at[i, "number"] = df_rttm.at[i-1, "number"]

display(df_rttm.head(10)) 

,type,file,chnl,start,duration,C1,C2,speaker_id,C3,C4,end,number
0,SPEAKER,waveform,1,0.993,5.805,,,SPEAKER_00,,,6.798,0
1,SPEAKER,waveform,1,7.405,3.983,,,SPEAKER_00,,,11.388,0
2,SPEAKER,waveform,1,11.759,4.927,,,SPEAKER_00,,,16.686,0
3,SPEAKER,waveform,1,17.21,10.665,,,SPEAKER_00,,,27.875,0
4,SPEAKER,waveform,1,28.668,1.536,,,SPEAKER_00,,,30.204,0
5,SPEAKER,waveform,1,32.414,0.759,,,SPEAKER_01,,,33.173,1
6,SPEAKER,waveform,1,33.545,3.561,,,SPEAKER_01,,,37.106,1
7,SPEAKER,waveform,1,37.628,3.763,,,SPEAKER_01,,,41.391,1
8,SPEAKER,waveform,1,41.611,0.844,,,SPEAKER_00,,,42.455,2
9,SPEAKER,waveform,1,41.645,1.063,,,SPEAKER_01,,,42.708,3
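
The row-by-row loop can also be written without an explicit loop. The following is a sketch of an alternative, not the notebook's approach: it compares each `speaker_id` with the previous row via `Series.shift()` and takes a running count of the changes. The small frame `df` is illustrative.

```python
import pandas as pd

# Small illustrative frame; only speaker_id matters for the numbering.
df = pd.DataFrame(
    {"speaker_id": ["SPEAKER_00", "SPEAKER_00", "SPEAKER_01",
                    "SPEAKER_01", "SPEAKER_00", "SPEAKER_01"]}
)

# True wherever the speaker differs from the previous row. The first row
# is compared against a missing value, so it also counts as a change.
changed = df["speaker_id"] != df["speaker_id"].shift()

# A running count of changes starts at 1; subtract 1 so numbering starts at 0.
df["number"] = changed.cumsum() - 1

print(df["number"].tolist())
```

This also gives the column an integer dtype instead of the object dtype that results from initializing it with None.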


#### Grouping consecutive rows from the same speaker
- Group rows with the same number into one, taking the minimum start and the maximum end

In [None]:
df_rttm_grouped = df_rttm.groupby("number").agg(
    start=pd.NamedAgg(column='start', aggfunc='min'),
    end=pd.NamedAgg(column='end', aggfunc='max'),
    speaker_id=pd.NamedAgg(column='speaker_id', aggfunc='first')
)

display(df_rttm_grouped)

number,start,end,speaker_id
0,0.993,30.204,SPEAKER_00
1,32.414,41.391,SPEAKER_01
2,41.611,42.455,SPEAKER_00
3,41.645,42.708,SPEAKER_01
4,42.674,44.024,SPEAKER_00
5,45.813,67.109,SPEAKER_01
6,67.227,82.786,SPEAKER_00
7,84.659,102.564,SPEAKER_01
8,103.492,117.532,SPEAKER_00
9,119.759,138.676,SPEAKER_01


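#### Adding the utterance duration and removing the index
- Remove the number index, since the final result is easier to process when it contains only ordinary columns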

In [7]:
df_rttm_grouped["duration"] = df_rttm_grouped["end"] - df_rttm_grouped["start"]
df_rttm_grouped = df_rttm_grouped.reset_index(drop=True)
display(df_rttm_grouped)

,start,end,speaker_id,duration
0,0.993,30.204,SPEAKER_00,29.211
1,32.414,41.391,SPEAKER_01,8.977
2,41.611,42.455,SPEAKER_00,0.844
3,41.645,42.708,SPEAKER_01,1.063
4,42.674,44.024,SPEAKER_00,1.35
5,45.813,67.109,SPEAKER_01,21.296
6,67.227,82.786,SPEAKER_00,15.559
7,84.659,102.564,SPEAKER_01,17.905
8,103.492,117.532,SPEAKER_00,14.04
9,119.759,138.676,SPEAKER_01,18.917


#### Saving the speaker diarization result to a CSV file

In [9]:
df_rttm_grouped.to_csv(
    "./싼기타_비싼기타_rttm.csv",
    sep=',',
    index=False
)