In [1]:
!pwd

/content


In [2]:
%cd spiritlm
!ls

/content/spiritlm
assets		    CONTRIBUTING.md  examples	    README.md		  setup.py	     tests
checkpoints	    data	     LICENSE	    requirements.dev.txt  spiritlm
CODE_OF_CONDUCT.md  env.yml	     MODEL_CARD.md  requirements.txt	  spiritlm.egg-info


In [2]:
!git clone https://github.com/facebookresearch/spiritlm.git
%cd spiritlm

Cloning into 'spiritlm'...
remote: Enumerating objects: 74, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 74 (delta 5), reused 62 (delta 5), pack-reused 9 (from 1)
Receiving objects: 100% (74/74), 3.63 MiB | 27.15 MiB/s, done.
Resolving deltas: 100% (5/5), done.
/content/spiritlm


In [3]:
!ls

assets		    CONTRIBUTING.md  examples	    README.md		  setup.py
checkpoints	    data	     LICENSE	    requirements.dev.txt  spiritlm
CODE_OF_CONDUCT.md  env.yml	     MODEL_CARD.md  requirements.txt	  tests


In [4]:
!cat requirements.txt

omegaconf>=2.2.0
librosa>=0.10
local-attention>=1.9
encodec>=0.1
transformers
fairscale>=0.4
sentencepiece
pyarrow>=14.0
torchfcpe>=0.0.4

In [5]:
!pip install -r requirements.txt
!pip install -e '.[eval]'

Obtaining file:///content/spiritlm
  Preparing metadata (setup.py) ... done
Collecting omegaconf>=2.2.0 (from spiritlm==0.1.0)
  Downloading omegaconf-2.3.0-py3-none-any.whl.metadata (3.9 kB)
Collecting local-attention>=1.9 (from spiritlm==0.1.0)
  Downloading local_attention-1.9.15-py3-none-any.whl.metadata (683 bytes)
Collecting encodec>=0.1 (from spiritlm==0.1.0)
  Downloading encodec-0.1.1.tar.gz (3.7 MB)
  Preparing metadata (setup.py) ... done
Collecting fairscale>=0.4

In [3]:
from spiritlm.model.spiritlm_model import Spiritlm, OutputModality, GenerationInput, ContentType

from transformers import GenerationConfig
import IPython.display as ipd

def display_outputs(outputs):
    for output in outputs:
        if output.content_type == ContentType.TEXT:
            print(output.content)
        else:
            ipd.display(ipd.Audio(output.content, rate=16_000))

We support two variants of Spirit LM models, `Spirit LM Base` and `Spirit LM Expressive`. Both `Spirit LM Base` and `Spirit LM Expressive` are fine-tuned from the 7B Llama 2 model on text-only, speech-only and aligned speech+text datasets.

Compared to `Spirit LM Base`, `Spirit LM Expressive` captures not only the semantics but also **expressivity** from the speech.

## `Spirit LM Base`

In [4]:
spirit_lm = Spiritlm("spirit-lm-base-7b")

OSError: Incorrect path_or_model_id: '/content/spiritlm/checkpoints/spiritlm_model/spirit-lm-base-7b'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
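The `OSError` above suggests that the model folder is not at the location the loader expects. A minimal sketch for checking this before constructing `Spiritlm` is shown below. The helper names `checkpoint_dir` and `has_checkpoint` are hypothetical, and the folder layout (`checkpoints/spiritlm_model/<model name>`) is assumed from the path in the error message only.

```python
from pathlib import Path


def checkpoint_dir(model_name, root="checkpoints"):
    """Build the folder path seen in the error message: <root>/spiritlm_model/<model_name>."""
    return Path(root) / "spiritlm_model" / model_name


def has_checkpoint(model_name, root="checkpoints"):
    """Return True when the folder exists and contains at least one entry."""
    folder = checkpoint_dir(model_name, root)
    return folder.is_dir() and any(folder.iterdir())


if __name__ == "__main__":
    name = "spirit-lm-base-7b"
    if not has_checkpoint(name):
        # The loader would fail here with the OSError shown above.
        print(f"Missing checkpoint folder: {checkpoint_dir(name)}")
```

If the check fails, the model weights probably still need to be placed in that folder before the cell can run.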

### Generation

The `interleaved_inputs` argument of the `generate` function is a list whose items are either
- a `GenerationInput` composed of `content_type` and `content`, or
- a tuple of (`'speech'`/`'text'`, `content`).

The inputs are interleaved in the order in which they appear in the list.
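As an illustration of the tuple form, the sketch below builds a mixed text and speech prompt and checks its shape before it would be passed to `generate`. The helper `check_interleaved_inputs` is hypothetical and not part of the library; it assumes only the lowercase `'text'` and `'speech'` strings used in this notebook.

```python
VALID_TYPES = {"text", "speech"}  # assumption: the two strings used in this notebook


def check_interleaved_inputs(inputs):
    """Validate the tuple form: a list of (content_type, content) pairs."""
    for position, item in enumerate(inputs):
        if not (isinstance(item, tuple) and len(item) == 2):
            raise ValueError(f"item {position} is not a (type, content) pair")
        content_type, _ = item
        if content_type not in VALID_TYPES:
            raise ValueError(f"item {position} has unknown type {content_type!r}")
    # Return the modality sequence, in list order.
    return [content_type for content_type, _ in inputs]


mixed_prompt = [
    ("text", "The old man said:"),
    ("speech", "examples/audio/7143-88743-0029.flac"),
    ("text", "and then"),
]
print(check_interleaved_inputs(mixed_prompt))  # ['text', 'speech', 'text']
```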

`output_modality` controls the output modality.
- To generate only text, set it to `OutputModality.TEXT` or `'text'`.
- To generate only speech, set it to `OutputModality.SPEECH` or `'speech'`.
- If you have no constraint on the output modality, use `OutputModality.ARBITRARY` or `'arbitrary'`.

The output of generation is also a list (of `GenerationOuput`). When `output_modality` is `OutputModality.TEXT` or `OutputModality.SPEECH`, the list should have only one element.
When `output_modality` is `OutputModality.ARBITRARY`, the list can contain multiple elements of different types (`ContentType.TEXT` or `ContentType.SPEECH`).
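Because an arbitrary-modality result can mix text and speech elements, it can help to separate them before further processing. The sketch below shows one way to do that. `StandInContentType` and `StandInOutput` are stand-ins defined here so the example runs without the library, and `split_by_modality` is a hypothetical helper; the only thing assumed about real outputs is that they expose `content` and a `content_type` enum member named `TEXT` or `SPEECH`.

```python
from dataclasses import dataclass
from enum import Enum


class StandInContentType(Enum):  # stand-in for the library's ContentType
    TEXT = "TEXT"
    SPEECH = "SPEECH"


@dataclass
class StandInOutput:  # stand-in for one element of the returned list
    content: object
    content_type: StandInContentType


def split_by_modality(outputs):
    """Split a mixed output list into (texts, speech_items), keeping list order."""
    texts, speech_items = [], []
    for output in outputs:
        if output.content_type.name == "TEXT":
            texts.append(output.content)
        else:
            speech_items.append(output.content)
    return texts, speech_items


example = [
    StandInOutput("i want to see it", StandInContentType.TEXT),
    StandInOutput([0.0, 0.1, -0.1], StandInContentType.SPEECH),  # placeholder samples
    StandInOutput("good loop to hold it up", StandInContentType.TEXT),
]
texts, speech_items = split_by_modality(example)
print(len(texts), len(speech_items))  # 2 1
```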

The generation arguments can either be passed through `generation_config=GenerationConfig(args)` or directly in `generate(args)`.

For a full list of generation arguments, see:
https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig

Note that the following two commands give the same outputs:

In [None]:
spirit_lm.generate(
    interleaved_inputs=[
        GenerationInput(
            content="The largest country in the world is",
            content_type=ContentType.TEXT,
        )
    ],
    output_modality=OutputModality.TEXT,
    generation_config=GenerationConfig(
        max_new_tokens=20,
        do_sample=False,
    ),
)

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


[GenerationOuput(content='Russia. Russia is a country that is located in the northern part of the Eurasian continent.', content_type=<ContentType.TEXT: 'TEXT'>)]

In [None]:
spirit_lm.generate(
    interleaved_inputs=[('text', "The largest country in the world is")],
    output_modality='text',
    max_new_tokens=20,
    do_sample=False,
)

[GenerationOuput(content='Russia. Russia is a country that is located in the northern part of the Eurasian continent.', content_type=<ContentType.TEXT: 'TEXT'>)]

#### T -> T generation

In [None]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "Here is a story about a cute cat named Meow:")],
    output_modality='text',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=30,
        do_sample=True,
    ),
)
display_outputs(outputs)

A very cute cat, a black and white cat, named Meow was born to a family that was very good to her.
She


#### T -> S generation

In [None]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "One of the most beautiful cities in the world is")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(outputs)

#### S -> T generation

When the `content` is speech, we accept several types:
1) The audio `Path`: e.g., `"examples/audio/7143-88743-0029.flac"` or `Path("examples/audio/7143-88743-0029.flac")`
2) The audio `bytes`: e.g., `open("examples/audio/7143-88743-0029.flac", "rb").read()`
3) The audio `Tensor`: e.g., `torchaudio.load("examples/audio/7143-88743-0029.flac")[0].squeeze(0)`
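The list above uses `examples/audio/...` while the cells below use `../audio/...`, so which path resolves depends on the working directory. The sketch below picks the first candidate that exists and reads it in the `bytes` form. `first_existing` is a hypothetical helper, and the two candidate paths are simply the ones that appear in this notebook.

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first candidate that is an existing file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


audio_path = first_existing([
    "examples/audio/7143-88743-0029.flac",  # relative to the repository root
    "../audio/7143-88743-0029.flac",        # relative to a sibling folder of audio/
])
if audio_path is None:
    print("Audio file not found from", Path.cwd())
else:
    audio_bytes = audio_path.read_bytes()  # the `bytes` form from the list above
    print(audio_path, len(audio_bytes), "bytes")
```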

In [None]:
ipd.Audio("../audio/7143-88743-0029.flac")

In [None]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('speech', "../audio/7143-88743-0029.flac")],
    output_modality='text',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=30,
        do_sample=True,
    ),
)
display_outputs(outputs)

the old man led the way to a corner of the cave where he kept his stock of skins and furs in a pile and there were


#### S -> S generation

In [None]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('speech', "../audio/7143-88743-0029.flac")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(outputs)

#### Arbitrary generation

In [None]:
interleaved_outputs = spirit_lm.generate(
    interleaved_inputs=[('speech', "../audio/7143-88743-0029.flac")],
    output_modality='arbitrary',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(interleaved_outputs)

 i want to see it he had a big knife in his hand and he cut off a strip of the skin of the ox hide and


 good loop to hold it up well i said you are a man he cried and so you think you are and so you are now it is i am glad of that well here i


#### Specify the prompt by a string of tokens

This can be useful when constructing few-shot prompts.

Note that when `prompt` is given, `interleaved_inputs` is not used.

In [None]:
outputs = spirit_lm.generate(
    prompt="[St71][Pi39][Hu99][Hu49][Pi57][Hu38][Hu149][Pi48][Hu71][Hu423][Hu427][Pi56][Hu492][Hu288][Pi44][Hu315][Hu153][Pi42][Hu389][Pi59][Hu497][Hu412][Pi51][Hu247][Hu354][Pi44][Hu7][Hu96][Pi43][Hu452][Pi0][Hu176][Hu266][Pi54][St71][Hu77][Pi13][Hu248][Hu336][Pi39][Hu211][Pi25][Hu166][Hu65][Pi58][Hu94][Hu224][Pi26][Hu148][Pi44][Hu492][Hu191][Pi26][Hu440][Pi13][Hu41][Pi20][Hu457][Hu79][Pi46][Hu382][Hu451][Pi26][Hu332][Hu216][Hu114][Hu340][St71][Pi40][Hu478][Hu74][Pi26][Hu79][Hu370][Pi56][Hu272][Hu370][Pi51][Hu53][Pi14][Hu477][Hu65][Pi46][Hu171][Hu60][Pi41][Hu258][Hu111][Pi40][Hu338][Hu23][Pi39][Hu338][Hu23][Hu338][St71][Pi57][Hu7][Hu338][Hu149][Pi59][Hu406][Hu7][Hu361][Hu99][Pi20][Hu209][Hu479][Pi35][Hu50][St71][Hu7][Hu149][Pi55][Hu35][Pi13][Hu130][Pi3][Hu169][Pi52][Hu72][Pi9][Hu434][Hu119][Hu272][Hu4][Pi20][Hu249][Hu245][Pi57][Hu433][Pi56][Hu159][Hu294][Hu139][Hu359][Hu343][Hu269][Hu302][St71][Hu226][Pi32][Hu370][Hu216][Pi39][Hu459][Hu424][Pi57][Hu226][Pi46][Hu382][Hu7][Pi27][Hu58][Hu138][Pi20][Hu428][Hu397][Pi44][Hu350][Pi32][Hu306][Pi59][Hu84][Hu11][Hu171][Pi42][Hu60][Pi48][Hu314][Hu227][St71][Hu355][Pi56][Hu9][Hu58][Pi44][Hu138][Hu226][Pi25][Hu370][Hu272][Pi56][Hu382][Hu334][Pi26][Hu330][Hu176][Pi56][Hu307][Pi46][Hu145][Hu248][Pi56][Hu493][Hu64][Pi40][Hu44][Hu388][Pi39][Hu7][Hu111][Pi59][St71][Hu23][Hu481][Pi13][Hu149][Pi15][Hu80][Hu70][Pi47][Hu431][Hu457][Pi13][Hu79][Pi27][Hu249][Pi55][Hu245][Pi54][Hu433][Pi36][Hu316][Pi53][Hu180][Pi3][Hu458][Pi26][Hu86][St71][Pi43][Hu225][Pi49][Hu103][Hu60][Pi3][Hu96][Hu119][Pi39][Hu129][Pi41][Hu356][Hu218][Pi14][Hu4][Hu259][Pi56][Hu392][Pi46][Hu490][Hu75][Pi14][Hu488][Hu166][Pi46][Hu65][Hu171][Pi40][Hu60][Hu7][Hu54][Pi39][Hu85][St83][Pi40][Hu361]",
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
)
display_outputs(outputs)
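The prompt string above is a sequence of bracketed tokens, each a short prefix followed by an index (for example `[Hu99]`, `[Pi39]`, `[St71]`). When assembling such prompts by hand, a quick parse can help check that the string is well formed. The sketch below is a hypothetical helper, not part of the library, and it makes no assumption about what each prefix means beyond the pattern visible in the string.

```python
import re
from collections import Counter

# One token looks like "[", letters, digits, "]", as in the prompt above.
TOKEN_PATTERN = re.compile(r"\[([A-Za-z]+)(\d+)\]")


def parse_tokens(prompt):
    """Return the tokens as (prefix, index) pairs, in order of appearance."""
    return [(prefix, int(index)) for prefix, index in TOKEN_PATTERN.findall(prompt)]


def count_by_prefix(prompt):
    """Count how many tokens carry each prefix."""
    return Counter(prefix for prefix, _ in parse_tokens(prompt))


# The first seven tokens of the prompt used above.
sample = "[St71][Pi39][Hu99][Hu49][Pi57][Hu38][Hu149]"
print(parse_tokens(sample)[:3])       # [('St', 71), ('Pi', 39), ('Hu', 99)]
print(dict(count_by_prefix(sample)))  # {'St': 1, 'Pi': 2, 'Hu': 4}
```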

## `Spirit LM Expressive`

In [None]:
spirit_lm = Spiritlm("spirit-lm-expressive-7b")

  [INFO]: device is not None, use cuda:0
  [INFO]    > call by:torchfcpe.tools.spawn_infer_cf_naive_mel_pe_from_pt
  [WARN] args.model.use_harmonic_emb is None; use default False
  [WARN]    > call by:torchfcpe.tools.spawn_cf_naive_mel_pe


  ckpt = torch.load(pt_path, map_location=torch.device(device))
Some weights of Wav2Vec2StyleEncoder were not initialized from the model checkpoint at checkpoints/speech_tokenizer/style_encoder_w2v2 and are newly initialized: ['_float_tensor']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  WeightNorm.apply(module, name, dim)


In [None]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "I am so deeply saddened, it feels as if my heart is shattering into a million pieces and I can't hold back the tears that are streaming down my face.")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
    speaker_id=1,
)
display_outputs(outputs)



In [None]:
outputs = spirit_lm.generate(
    interleaved_inputs=[('text', "Wow!!! Congratulations!!! I'm so excited that")],
    output_modality='speech',
    generation_config=GenerationConfig(
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=200,
        do_sample=True,
    ),
    speaker_id=1,
)
display_outputs(outputs)