[20220717] Weekly AI ArXiv Talk (만담) - Episode 59 #59

jungwoo-ha · 2022-07-16T01:55:47Z

News

Conferences
- NAACL 2022: great work, everyone
- ICML 2022: July 17-23, Baltimore, USA
  - CLOVA schedule: https://naver-career.gitbook.io/en/teams/clova-cic/events/clova-and-ai-lab-icml-2022
- COLING 2022 reviews: coming out soon
Andrej Karpathy leaves Tesla
Medical AI idea competition using HyperCLOVA (through July 29)
- Competition details: https://blog.naver.com/clova_ai/222812418031

ArXiv

Language Modelling with Pixels
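- PIXEL: a language model that takes text rendered as an image as input, instead of tokens from a text vocabulary (from a European research group)
- Benefits: 1) it avoids the trade-off between vocabulary size and the cost of computing output probabilities, and 2) it favors joint multilingual training, including scripts that are not alphabet-based
- Similar attempts existed before; perhaps it is feasible now because image representation learning has improved?
- Trained BERT-style, but reconstructing patch pixels rather than tokens
- Clearly strong on non-Latin scripts (including emoji); somewhat weaker than BERT on Latin scripts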
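
The masked patch reconstruction idea above can be illustrated with a toy sketch. This is not the PIXEL implementation: the tiny binary "rendered" image, the patch width, the mask ratio, and all function names are assumptions made for illustration, and mean squared error is only one common choice of pixel reconstruction loss.

```python
import random


def split_into_patches(image, patch_width):
    """Split a rendered text image (a list of pixel rows) into
    non-overlapping column patches, each flattened row by row."""
    height = len(image)
    width = len(image[0])
    if width % patch_width != 0:
        raise ValueError("image width must be a multiple of patch_width")
    patches = []
    for start in range(0, width, patch_width):
        patch = [image[r][c]
                 for r in range(height)
                 for c in range(start, start + patch_width)]
        patches.append(patch)
    return patches


def choose_masked(num_patches, mask_ratio, seed=0):
    """Pick which patch indices are hidden from the model (hypothetical helper)."""
    rng = random.Random(seed)
    k = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), k))


def masked_mse(target_patches, predicted_patches, masked_idx):
    """Mean squared pixel error, computed only over the masked patches."""
    total = 0.0
    count = 0
    for i in masked_idx:
        for t, p in zip(target_patches[i], predicted_patches[i]):
            total += (t - p) ** 2
            count += 1
    return total / count


# Toy 2x8 "rendered" image; a real renderer would produce a much larger bitmap.
image = [
    [0, 1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 1, 0],
]
patches = split_into_patches(image, patch_width=2)
masked = choose_masked(len(patches), mask_ratio=0.25)
# A model would predict the pixels of the masked patches; here an all-zero guess.
blank_prediction = [[0] * len(p) for p in patches]
loss = masked_mse(patches, blank_prediction, masked)
```

The loss is computed only on masked patches, mirroring how BERT scores only masked tokens; the difference is that the target is raw pixels, so no output vocabulary is needed.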
Interesting papers
- Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios
  - Implemented on TensorRT and CoreML, then evaluated on a range of benchmarks (from ByteDance)
- Language Models (Mostly) Know What They Know
  - Quantitative analysis and evaluation of what LMs output; a set of experiments toward building trustworthy AI (from Anthropic)
- Benchmarking Omni-Vision Representation through the Lens of Visual Realms
  - 21 broad categories (realms), 7,372 non-overlapping semantic concepts, and 1M images, used to evaluate pretrained models
  - https://zhangyuanhan-ai.github.io/OmniBenchmark/
- Productivity assessment of neural code completion
  - A paper from the GitHub team analyzing how much Copilot contributes to developer productivity
- CogVideo demo

ghlee3401 · 2022-07-17T01:51:21Z

Arxiv

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders
- INTERSPEECH 2022 | Microsoft Azure Speech, Microsoft Research Asia | TTS
- Sample URL: https://cognitivespeech.github.io/delightfultts2
1. Problem
  - Conventional TTS trains the acoustic model and the vocoder separately, which introduces errors (cascaded error)
  - The intermediate acoustic feature, the mel-spectrogram, has lost phase information
2. Goal
  - Propose a model that trains the acoustic model and the vocoder jointly, without features such as the mel-spectrogram
3. Method
  - Acoustic model (based on DelightfulTTS)
    - DelightfulTTS is used to predict the output of the VQ-GAN encoder
    - It is combined with the vocoder (the VQ-GAN decoder) and trained jointly
    - Notably, the loss against the speech representation from the VQ-GAN encoder uses structural similarity (SSIM), a measure normally used for images
  - Codec network (based on VQ-GAN)
    - Learns frame-level speech representations without the mel-spectrogram or other signal-processing-based features
    - The VQ-GAN encoder extracts the speech representation
    - The VQ-GAN decoder synthesizes the waveform
    - The decoder architecture is based on HiFi-GAN
4. Results
  - Achieves a better MOS than FastSpeech2 and DelightfulTTS
  - The proposed model is also rated better in CMOS comparisons
5. Conclusion
  - Trains the AM and the vocoder together on text/audio pairs, without the cascaded error of two-stage TTS models
  - Outperforms the previous DelightfulTTS with a similar parameter count and synthesis speed
  - Without an intermediate feature such as the mel-spectrogram, I expect bias from the training data may become more severe
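
The notes above do not spell out the quantization step inside the codec network. As a rough illustration only, the following sketch shows the nearest-neighbor codebook lookup used in vector-quantized auto-encoders in general; it is not taken from the DelightfulTTS 2 paper, and the codebook values, frame vectors, and function names are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frame, codebook):
    """Return (index, code vector) of the codebook entry closest to the frame."""
    best_index = min(range(len(codebook)),
                     key=lambda i: squared_distance(frame, codebook[i]))
    return best_index, codebook[best_index]


def quantize_sequence(frames, codebook):
    """Map each frame-level encoder output to its nearest codebook index."""
    return [quantize(frame, codebook)[0] for frame in frames]


# Hypothetical 2-dimensional codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
# Hypothetical frame-level encoder outputs.
frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
indices = quantize_sequence(frames, codebook)
```

In a setup like the one described, the decoder would consume the selected code vectors to synthesize the waveform, while the acoustic model would be trained to predict the encoder-side representation from text.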
PoeticTTS - Controllable Poetry Reading for Literary Studies
- INTERSPEECH 2022 | University of Stuttgart (Germany) | TTS (German)
- Sample URL: https://poetictts.github.io/
- Code URL: https://github.com/DigitalPhonetics/IMS-Toucan
- Hugging Face demo URL: https://huggingface.co/spaces/Flux9665/PoeticTTS
1. Problem
  - Speech synthesis for poetry is hard because its intonation patterns (poetic intonation) differ from those of ordinary speech
  - Characteristics: short intonation units, more pauses, intonation units of relatively equal length, and repetition of pitch patterns
  - The speech needs prosody that suits the expression, and the prosody must be finely controllable
2. Goal
  - Controllable lyric poetry synthesis
3. Method
  - Architecture: FastSpeech2 & HiFi-GAN, with Conformer blocks from the IMS Toucan toolkit in the encoder and decoder
  - Controllability: F0 and energy are averaged per phone, FastPitch-style, so that they can be adjusted
  - Speaker embedding: to handle unseen speakers, uses embeddings from a model trained on VoxCeleb 1 and 2 with the SpeechBrain toolkit
  - Pre-training: trained on diverse speech in multiple languages, then fine-tuned on German data
4. Results
  - The Prose Model is trained on both English and German data
  - The Poetry Model is the Prose Model fine-tuned on poetic data
  - "Prose text" refers to prose; "Poetry text" refers to verse
5. Conclusion
  - Using TTS, prosodic parameters can be copied from a reference recording so that poems are read as naturally as a human would read them, and the prosody can be adjusted
  - The task is fresh, but the method is not particularly novel, which is a pity
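
The FastPitch-style control mentioned under Method (one F0 and energy value per phone) can be sketched roughly as follows. This is a simplified illustration and not code from IMS Toucan: the function names, the toy F0 contour, and the phone durations are assumptions, and real systems typically handle unvoiced frames and normalization more carefully.

```python
def average_per_phone(frame_values, durations):
    """Average a frame-level contour (e.g. F0 or energy) over each phone.

    durations[i] is the number of frames belonging to phone i.
    """
    if sum(durations) != len(frame_values):
        raise ValueError("durations must sum to the number of frames")
    averages = []
    start = 0
    for d in durations:
        segment = frame_values[start:start + d]
        averages.append(sum(segment) / len(segment) if segment else 0.0)
        start += d
    return averages


def scale_values(phone_values, factor):
    """Adjust prosody by scaling every phone-level value (hypothetical control)."""
    return [v * factor for v in phone_values]


# Toy frame-level F0 contour in Hz covering three phones (3, 2, and 2 frames).
f0_frames = [100.0, 110.0, 120.0, 0.0, 0.0, 200.0, 220.0]
durations = [3, 2, 2]
phone_f0 = average_per_phone(f0_frames, durations)   # one value per phone
raised_f0 = scale_values(phone_f0, 1.5)              # raise pitch by 50%
```

Working with one value per phone keeps the control surface small: a user can edit or copy a handful of numbers per line of verse instead of a full frame-level contour.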
Interesting papers
- Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS
  - INTERSPEECH 2022 | Neosapience | TTS
  - Sample URL: https://style-tts.github.io/demo/ (good samples)
  - Proposes a TTS model that transfers style across speakers, using a style encoder inspired by Meta-StyleSpeech and style embeddings from a Korean Sentence BERT (SBERT) model; covers the emotions natural, happy, angry, and sadness
- Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis
  - arXiv | Northwestern Polytechnical University, Jiaotong University, Mobvoi AI Lab (China) | Emotional TTS
  - Sample URL: https://silyfox.github.io/cspc/
  - Extracts speaker and emotion information from the reference mel-spectrogram with a speaker classifier and an emotion classifier respectively, and applies a gradient inversion loss from the emotion classifier to the speaker encoder so that the speaker encoder output carries only speaker information
  - In particular, a loss is applied so that the emotion embedding and the speaker embedding are orthogonal to each other
  - Intermediate features from an ASR model are used for prosody prediction, and the speaker ID is fed separately into the decoder
- WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training
  - INTERSPEECH 2022 | Tencent | SVS
  - Sample URL: https://zzw922cn.github.io/wesinger2/
  - Builds on the earlier WeSinger with a multi-receptive-field post-net composed of convolutions with various dilation rates, plus multi-random area discriminators (MARADs) that learn on rectangular regions of various sizes
  - The vocoder is HiFi-GAN, fed with quantized linearly-interpolated F0, and it uses a multi-scale discriminator (MSD), a multi-period discriminator (MPD), and a multi-length discriminator (MLD)
  - Feels like a mash-up (?) that mostly borrows and mixes components from earlier work. Performance has probably improved, but it is not very appealing
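
For the cross-speaker emotion transfer paper in the list above, the orthogonality constraint between the emotion and speaker embeddings could look roughly like the sketch below. The notes do not give the paper's exact formulation, so the squared cosine similarity used here is an assumption, as are the function names and the toy vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def orthogonality_loss(emotion_emb, speaker_emb):
    """Penalty that is 0 when the embeddings are orthogonal and 1 when parallel.

    Assumed form: squared cosine similarity (not confirmed by the source).
    """
    return cosine_similarity(emotion_emb, speaker_emb) ** 2


# Toy 2-dimensional embeddings.
loss_orthogonal = orthogonality_loss([1.0, 0.0], [0.0, 3.0])  # no shared direction
loss_parallel = orthogonality_loss([1.0, 0.0], [2.0, 0.0])    # same direction
```

Minimizing a penalty of this kind pushes the two embeddings toward carrying non-overlapping information, which complements the gradient inversion loss on the speaker encoder.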

veritas9872 · 2022-07-17T09:58:20Z

Pure Transformers are Powerful Graph Learners
Arxiv: https://arxiv.org/abs/2207.02505
GitHub: https://github.com/jw9730/tokengt (No upload yet).

Code Translation with Compiler Representations
https://arxiv.org/abs/2207.03578

A synthetic protein-level neural network in mammalian cells
https://doi.org/10.1101/2022.07.10.499405

Interesting Papers & Projects:
Introducing The World’s Largest Open Multilingual Language Model: BLOOM
Blog: https://bigscience.huggingface.co/blog/bloom
Website: https://huggingface.co/bigscience/bloom

CS25: Transformers United
Website: https://web.stanford.edu/class/cs25
YouTube: https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM

kimyoungdo0122 · 2022-07-17T13:08:11Z

News

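Jinwon's MODUPOP (모두팝) event!!
- Held offline next Tuesday (the 19th) at the Modulabs (모두의연구소) Gangnam campus!
- The topic is "A chronological look at ImageNet SOTA models"
- Registration link
Greater creative control for AI image generation
- Paper link
- Research on text-to-image generation that adds constraints for controlling the model output, such as sketches, on top of the usual text prompt
- Demo video
- It will reportedly be an oral presentation at this year's ECCV!
Microsoft Responsible AI Standard v2
- Adds substantially more practical, actionable guidance than the Responsible AI Standard announced in 2019
- Accountability, Transparency, Fairness, Reliability and Safety, Privacy and Security, Inclusiveness
- Each item now comes with many concrete, actionable guides covering documentation, principles, evaluation, and so on
- Organizations that actually build and deploy AI services may want to consult some of the items (planners and managers in particular)
Hugging Face BigScience Responsible AI License ("RAIL")
- Aims to create a license that lets people across nationalities, cultures, and organizations use LLMs such as BigScience's, while requiring responsible use
- If you use the BigScience BLOOM model, services built on downstream use automatically fall under the RAIL license, so that they too are used responsibly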

jungwoo-ha closed this as completed Jan 7, 2023
