
How do I add speech-to-text and text-to-speech models when using xinference? #2359

Closed
2 tasks done
andylzming opened this issue Feb 2, 2024 · 5 comments · Fixed by #3045
Labels
📚 documentation Improvements or additions to documentation

Comments

andylzming commented Feb 2, 2024

Self Checks

Dify version

0.5.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

How do I add speech-to-text and text-to-speech models when using xinference? Xinference already supports Image model and Audio model.

[Three screenshots attached]

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

@andylzming andylzming added the 🐞 bug Something isn't working label Feb 2, 2024

dosubot bot commented Feb 2, 2024

🤖

Hey @andylzming! Great to see you back here, diving into more adventures with dify. 🚀

To add speech-to-text and text-to-speech models when using xinference in a self-hosted Docker environment with Dify version 0.5.2, you need to extend the provided TongyiText2SpeechModel class and implement the required methods.

Here is a simplified example of how you can do this:

from typing import Any, Optional

from flask import Response  # adjust to whichever Response type your Dify version uses


class MyText2SpeechModel(TongyiText2SpeechModel):
    def _invoke(self, model: str, credentials: dict, content_text: str, streaming: bool, user: Optional[str] = None) -> Any:
        # Your implementation here
        pass

    def validate_credentials(self, model: str, credentials: dict, user: Optional[str] = None) -> None:
        # Your implementation here
        pass

    def _tts_invoke(self, model: str, credentials: dict, content_text: str, user: Optional[str] = None) -> Response:
        # Your implementation here
        pass

    def _tts_invoke_streaming(self, model: str, credentials: dict, content_text: str, user: Optional[str] = None) -> Any:
        # Your implementation here
        pass

    def _process_sentence(self, sentence: str, model: str, credentials: dict, audio_type: str):
        # Your implementation here
        pass

You need to replace the # Your implementation here comments with your own code. The _invoke method is used to invoke the text-to-speech model, _tts_invoke and _tts_invoke_streaming are used to process the text and convert it to speech, and _process_sentence is used to process each sentence individually. The validate_credentials method is used to validate the credentials for the text-to-speech model.
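To make the shape of these methods concrete, here is a self-contained toy sketch of the sentence-by-sentence flow. It does not use Dify's real base class or any real TTS backend; the class name, the stand-in base class, and the fake audio encoding are all hypothetical:

```python
from typing import Optional


class DummyText2SpeechModel:
    """Stand-in for TongyiText2SpeechModel so the sketch runs on its own."""

    def _invoke(self, model: str, credentials: dict, content_text: str,
                streaming: bool, user: Optional[str] = None) -> bytes:
        # Split the text into sentences and synthesize each one in turn.
        sentences = [s.strip() for s in content_text.split(".") if s.strip()]
        chunks = [self._process_sentence(s, model, credentials, "mp3")
                  for s in sentences]
        return b"".join(chunks)

    def _process_sentence(self, sentence: str, model: str,
                          credentials: dict, audio_type: str) -> bytes:
        # Placeholder: a real implementation would call the TTS backend here
        # and return the synthesized audio bytes for this sentence.
        return f"<{audio_type}:{sentence}>".encode()


audio = DummyText2SpeechModel()._invoke("fake-tts", {}, "Hello. World.",
                                        streaming=False)
print(audio)  # b'<mp3:Hello><mp3:World>'
```

In a real subclass, `_process_sentence` would hold the per-sentence call to the speech backend, while `_invoke` decides between streaming and non-streaming paths.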

Once you have implemented these methods, you can use your new class in your Docker environment. Make sure to provide the necessary credentials and other parameters when invoking the model.

You can find more details about the TongyiText2SpeechModel class in the dify codebase.

Please note that I'm unable to view the images you've attached. If you need further assistance, could you please provide more details or context in text form?


@dosubot dosubot bot added 📚 documentation Improvements or additions to documentation NewModelSupport labels Feb 2, 2024
crazywoola (Member) commented Feb 2, 2024

See https://docs.dify.ai/tutorials/model-configuration/xinference#starting-xinference for more details. Please use the correct template when filing a bug report.

@crazywoola crazywoola reopened this Feb 2, 2024
@crazywoola crazywoola removed the 🐞 bug Something isn't working label Feb 2, 2024
andylzming (Author) commented

@crazywoola Is the functionality of adding Image models and Audio models (including speech-to-text and text-to-speech) through xinference planned for the next version?

takatost (Collaborator) commented Feb 7, 2024

@crazywoola Is the functionality of adding Image models and Audio models (including speech-to-text and text-to-speech) through xinference planned for the next version?

We might integrate these models in our spare time, though we're not sure. If you're interested, contributions are welcome and we would really appreciate them!

leslie2046 (Contributor) commented
#3045
