## MusicGen in MindNLP

MusicGen is a Transformer-based model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. It was proposed in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet et al. from Meta AI.

The MusicGen model can be decomposed into three distinct stages:
1. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations
2. The MusicGen decoder is then trained to predict discrete audio tokens, or *audio codes*, conditioned on these hidden-states
3. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform

The pre-trained MusicGen checkpoints use Google's [t5-base](https://huggingface.co/t5-base) as the text encoder model, and [EnCodec 32kHz](https://huggingface.co/facebook/encodec_32khz) as the audio compression model. The MusicGen decoder is a pure language model architecture,
trained from scratch on the task of music generation.

The novelty in the MusicGen model is how the audio codes are predicted. Traditionally, each codebook has to be predicted by a separate model (i.e. hierarchically) or by continuously refining the output of the Transformer model (i.e. upsampling). MusicGen uses an efficient *token interleaving pattern*, thus eliminating the need to cascade multiple models to predict a set of codebooks. Instead, it is able to generate the full set of codebooks in a single forward pass of the decoder, resulting in much faster inference.

<p align="center">
  <img src="https://github.com/sanchit-gandhi/codesnippets/blob/main/delay_pattern.png?raw=true" width="600"/>
</p>


**Figure 1:** Codebook delay pattern used by MusicGen. Figure taken from the [MusicGen paper](https://arxiv.org/abs/2306.05284).
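
The delay pattern in the figure can be illustrated with a small pure-Python sketch. It assumes a toy grid of four codebooks with made-up token ids, and uses a hypothetical `PAD` placeholder for positions that hold no token yet. Codebook `k` is shifted right by `k` steps, so a single decoder step can emit one token for every codebook. This is an illustration of the idea, not the MindNLP implementation.

```python
PAD = -1  # hypothetical placeholder id for positions with no token yet


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook k right by k steps.

    `codes` is a list of K token lists, all of the same length T.
    The result has K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    seq_len = len(codes[0])
    delayed = []
    for k, row in enumerate(codes):
        left = [pad] * k                        # delay of k steps
        right = [pad] * (num_codebooks - 1 - k)  # fill up to the common length
        delayed.append(left + list(row) + right)
    return delayed


def revert_delay_pattern(delayed):
    """Undo the shift and recover the original K x T grid."""
    num_codebooks = len(delayed)
    seq_len = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + seq_len] for k, row in enumerate(delayed)]


codes = [
    [11, 12, 13, 14],  # codebook 0
    [21, 22, 23, 24],  # codebook 1
    [31, 32, 33, 34],  # codebook 2
    [41, 42, 43, 44],  # codebook 3
]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
```

Each column of `delayed` corresponds to one decoding step. Reverting the pattern after generation restores the time-aligned codes that the audio decoder expects.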


## Environment Setup

    python = 3.9
    mindspore = 2.3.1
    mindnlp = 0.4.0
    jieba
    soundfile
    librosa

**Online platforms for running the code:**
1. [Huawei Cloud AI Gallery](https://pangu.huaweicloud.com/gallery/asset-detail.html?id=c72241ed-465f-418d-b58a-ed4aabb0eb73)
2. [Large Model Platform AI Lab (unified entry point)](https://xihe.mindspore.cn/projects)




Test

In [1]:
import mindspore
print(mindspore.__version__)
print(mindspore.Tensor([1.0]))  # should print a Tensor object
from mindspore import _c_expression
# print(dir(_c_expression))  # check whether a 'Tensor' attribute exists
from mindspore import dtype as mstype  # officially recommended import style

2.6.0
[1.]


## Load the Model

The pre-trained MusicGen small, medium and large checkpoints can be loaded from the [pre-trained weights](https://huggingface.co/models?search=facebook/musicgen-) on the Hugging Face Hub. Change the repo id to match the checkpoint size you wish to load. We'll default to the small checkpoint, which is the fastest of the three but has the lowest audio quality:

In [2]:
from mindnlp.transformers import MusicgenForConditionalGeneration
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

  from .autonotebook import tqdm as notebook_tqdm
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.833 seconds.
Prefix dict has been built successfully.


[MS_ALLOC_CONF]Runtime config:  enable_vmm:True  vmm_align_size:2MB


Some weights of MusicgenForConditionalGeneration were not initialized from the model checkpoint at facebook/musicgen-small and are newly initialized: ['audio_encoder.decoder.layers.0.conv.weight', 'audio_encoder.decoder.layers.10.block.1.conv.weight', 'audio_encoder.decoder.layers.10.block.3.conv.weight', 'audio_encoder.decoder.layers.12.conv.weight', 'audio_encoder.decoder.layers.13.block.1.conv.weight', 'audio_encoder.decoder.layers.13.block.3.conv.weight', 'audio_encoder.decoder.layers.15.conv.weight', 'audio_encoder.decoder.layers.3.conv.weight', 'audio_encoder.decoder.layers.4.block.1.conv.weight', 'audio_encoder.decoder.layers.4.block.3.conv.weight', 'audio_encoder.decoder.layers.6.conv.weight', 'audio_encoder.decoder.layers.7.block.1.conv.weight', 'audio_encoder.decoder.layers.7.block.3.conv.weight', 'audio_encoder.decoder.layers.9.conv.weight', 'audio_encoder.encoder.layers.0.conv.weight', 'audio_encoder.encoder.layers.1.block.1.conv.weight', 'audio_encoder.encoder.layers.1.blo

## Generation

MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly
better results than greedy, thus we encourage sampling mode to be used where possible. Sampling is enabled by default,
and can be explicitly specified by setting `do_sample=True` in the call to `MusicgenForConditionalGeneration.generate` (see below).
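
To make the difference between the two modes concrete, here is a minimal sketch of how a single token could be chosen from a vector of logits. The logits are made up and the helper names `greedy_pick` and `sample_pick` are hypothetical; the real `generate` method works on tensors and applies further processing.

```python
import math
import random


def greedy_pick(logits):
    """Greedy decoding: always return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature=1.0, rng=random):
    """Sampling: draw an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)          # seeded so the run is repeatable

greedy_tokens = [greedy_pick(logits) for _ in range(5)]
sampled_tokens = [sample_pick(logits, temperature=1.5, rng=rng) for _ in range(5)]

print(greedy_tokens)   # the same token every time
print(sampled_tokens)  # varies with the random draws
```

Greedy decoding is deterministic, while sampling can pick lower-scoring tokens, which adds variety to the output.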

### Unconditional Generation

The inputs for unconditional (or 'null') generation can be obtained through the method `MusicgenForConditionalGeneration.get_unconditional_inputs`. We can then run auto-regressive generation using the `.generate` method, specifying `do_sample=True` to enable sampling mode:

In [2]:
unconditional_inputs = model.get_unconditional_inputs(num_samples=1)

audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)

The audio outputs are a three-dimensional MindSpore tensor of shape `(batch_size, num_channels, sequence_length)`. To listen
to the generated audio samples, you can either play them in a Jupyter notebook:

In [10]:
from IPython.display import Audio

sampling_rate = model.config.audio_encoder.sampling_rate
Audio(audio_values[0].asnumpy(), rate=sampling_rate)

Or save them as a `.wav` file using a third-party library, e.g. `scipy` (note here that we also need to remove the channel dimension from our audio tensor):

In [4]:
import scipy.io.wavfile  # import the submodule explicitly so scipy.io.wavfile is available

scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].asnumpy())

The argument `max_new_tokens` specifies the number of new tokens to generate. As a rule of thumb, you can work out the length of the generated audio sample in seconds by using the frame rate of the EnCodec model:

In [5]:
audio_length_in_s = 256 / model.config.audio_encoder.frame_rate

audio_length_in_s

5.12
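
The same rule of thumb can be turned around to choose `max_new_tokens` for a target duration. The sketch below assumes a frame rate of 50 Hz, which is what the result above implies (256 tokens for 5.12 s); the helper names are hypothetical, and in practice you would read the value from `model.config.audio_encoder.frame_rate`.

```python
import math

FRAME_RATE_HZ = 50  # assumed: implied by 256 tokens -> 5.12 s above


def tokens_for_duration(seconds, frame_rate=FRAME_RATE_HZ):
    """Smallest token count that covers at least `seconds` of audio."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(seconds * frame_rate, 6))


def duration_for_tokens(num_tokens, frame_rate=FRAME_RATE_HZ):
    """Length in seconds of `num_tokens` generated audio tokens."""
    return num_tokens / frame_rate


print(tokens_for_duration(10))   # tokens needed for 10 s of audio
print(duration_for_tokens(256))  # seconds produced by 256 tokens
```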

### Text-Conditional Generation

The model can generate an audio sample conditioned on a text prompt through use of the `MusicgenProcessor` to pre-process
the inputs. The pre-processed inputs can then be passed to the `.generate` method to generate text-conditional audio samples.
Again, we enable sampling mode by setting `do_sample=True`:

In [None]:
from mindnlp.transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="ms",
)

audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)

Audio(audio_values[0].asnumpy(), rate=sampling_rate)

The `guidance_scale` is used in classifier-free guidance (CFG), setting the weighting between the conditional logits
(which are predicted from the text prompts) and the unconditional logits (which are predicted from an unconditional or
'null' prompt). A higher guidance scale encourages the model to generate samples that are more closely linked to the input
prompt, usually at the expense of poorer audio quality. CFG is enabled by setting `guidance_scale > 1`. For best results,
use `guidance_scale=3` (the default) for text- and audio-conditional generation.
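
As a rough illustration of this weighting, the sketch below combines two made-up logit vectors using the commonly used CFG formulation, `uncond + scale * (cond - uncond)`. The formula is an assumption about how the guidance is applied, and `apply_cfg` is a hypothetical helper rather than a MindNLP function.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Move the unconditional logits toward the conditional ones.

    guidance_scale == 1 returns the conditional logits unchanged;
    larger values exaggerate the difference between the two.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


cond = [2.0, 0.0, -1.0]   # made-up logits given the text prompt
uncond = [1.0, 0.0, 0.0]  # made-up logits given the 'null' prompt

print(apply_cfg(cond, uncond, 1.0))  # [2.0, 0.0, -1.0]
print(apply_cfg(cond, uncond, 3.0))  # [4.0, 0.0, -3.0]
```

With a scale of 3, tokens favoured by the prompt gain score and tokens disfavoured by it lose score, which is why higher scales follow the prompt more closely.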

### Audio-Prompted Generation

The same `MusicgenProcessor` can be used to pre-process an audio prompt that is used for audio continuation. In the
following example, we load an audio file from the Hugging Face Hub using `load_dataset` from `mindnlp.dataset`, pre-process it using the processor class,
and then forward the inputs to the model for generation:

In [7]:
from mindnlp.dataset import load_dataset

dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset.create_dict_iterator(output_numpy=True)))["audio"]
print('done')

done


In [8]:
# take the first half of the audio sample
sample["array"] = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=sample["array"],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone"],
    padding=True,
    return_tensors="ms",
)

In [35]:
print("Model inputs keys:", inputs.keys())  # inspect the input keys that are actually available

Model inputs keys: dict_keys(['input_ids', 'attention_mask', 'input_values', 'padding_mask'])


In [None]:
# original version
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
Audio(audio_values[0].asnumpy(), rate=sampling_rate)

To demonstrate batched audio-prompted generation, we'll slice our sample audio by two different proportions to give two audio samples of different length.
Since the input audio prompts vary in length, they will be *padded* to the length of the longest audio sample in the batch before being passed to the model.

To recover the final audio samples, the `audio_values` generated can be post-processed to remove padding by using the processor class once again:

In [None]:
from IPython.display import Audio
sampling_rate = model.config.audio_encoder.sampling_rate
print('begin')

sample = next(iter(dataset.create_dict_iterator(output_numpy=True)))["audio"]

# take the first quarter of the audio sample
sample_1 = sample["array"][: len(sample["array"]) // 4]

# take the first half of the audio sample
sample_2 = sample["array"][: len(sample["array"]) // 2]

inputs = processor(
    audio=[sample_1, sample_2],
    sampling_rate=sample["sampling_rate"],
    text=["80s blues track with groovy saxophone", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="ms",
)

audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
print('done')
# post-process to remove padding from the batched audio
audio_values = processor.batch_decode(audio_values, padding_mask=inputs.padding_mask)

Audio(audio_values[0], rate=sampling_rate)

begin
done
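
The padding and un-padding steps can be pictured with a small pure-Python sketch. It assumes right-padding with zeros and a mask that is 1 for real audio and 0 for padding; both are assumptions made for illustration, and the actual processor may pad on a different side or use a different mask convention. The helpers `pad_batch` and `strip_padding` are hypothetical.

```python
def pad_batch(samples, pad_value=0.0):
    """Pad every sample to the length of the longest one.

    Returns the padded batch and a mask (1 = real audio, 0 = padding).
    """
    longest = max(len(s) for s in samples)
    padded, mask = [], []
    for s in samples:
        extra = longest - len(s)
        padded.append(list(s) + [pad_value] * extra)
        mask.append([1] * len(s) + [0] * extra)
    return padded, mask


def strip_padding(padded, mask):
    """Keep only the positions that the mask marks as real audio."""
    return [
        [x for x, keep in zip(row, m) if keep]
        for row, m in zip(padded, mask)
    ]


short = [0.1, 0.2]            # stands in for the quarter-length prompt
long = [0.3, 0.4, 0.5, 0.6]   # stands in for the half-length prompt

padded, mask = pad_batch([short, long])
restored = strip_padding(padded, mask)

print(padded)
print(mask)
print(restored)
```

Using the mask rather than the sample values to strip padding means genuine silence (zeros) inside the audio is never removed by mistake.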


## Generation Config

The default parameters that control the generation process, such as sampling, guidance scale and number of generated tokens, can be found in the model's generation config, and updated as desired. Let's first inspect the default generation config:

In [13]:
model.generation_config

GenerationConfig {
  "bos_token_id": 2048,
  "decoder_start_token_id": 2048,
  "do_sample": true,
  "guidance_scale": 3.0,
  "max_length": 1500,
  "pad_token_id": 2048
}

Alright! We see that the model defaults to using sampling mode (`do_sample=True`), a guidance scale of 3, and a maximum generation length of 1500 (which is equivalent to 30s of audio). You can update any of these attributes to change the default generation parameters:

In [14]:
# increase the guidance scale to 4.0
model.generation_config.guidance_scale = 4.0

# set the max new tokens to 256
model.generation_config.max_new_tokens = 256

# set the softmax sampling temperature to 1.5
model.generation_config.temperature = 1.5

Re-running generation now will use the newly defined values in the generation config:

In [15]:
audio_values = model.generate(**inputs)

Note that any arguments passed to the generate method will **supersede** those in the generation config, so setting `do_sample=False` in the call to generate will supersede the setting of `model.generation_config.do_sample` in the generation config.
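
This precedence rule behaves like a dictionary merge in which call-time arguments win. The sketch below mimics that with plain dictionaries; `resolve_generation_args` is a hypothetical helper used only to illustrate the rule, not part of MindNLP.

```python
def resolve_generation_args(config_defaults, **call_kwargs):
    """Start from the config defaults, then let call-time arguments override them."""
    resolved = dict(config_defaults)  # copy so the defaults stay untouched
    resolved.update(call_kwargs)
    return resolved


# values mirroring the generation config after the updates above
config_defaults = {
    "do_sample": True,
    "guidance_scale": 4.0,
    "max_new_tokens": 256,
    "temperature": 1.5,
}

resolved = resolve_generation_args(config_defaults, do_sample=False)
print(resolved)
```

The override applies only to that one call: the stored defaults are unchanged and are used again the next time `generate` is called without the argument.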