
Ollama based models are "not" providing full tokens (stops around 150 or so) on self-hosted docker(dify) based environment #2461

Closed

ashmanpan opened this issue Feb 18, 2024 · 12 comments

Labels: 🐞 bug Something isn't working

Comments

@ashmanpan

Self Checks

Dify version

0.5.4

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Run models on Ollama locally on a Mac M3.
  2. Create an app using an Ollama-based model (I used llama2).
  3. Use the two-model debugging view, with chat-gpt-4 as the second model for comparison.
  4. Set max tokens on both models to 4096.
  5. Ask questions that require relatively long answers. You will see ChatGPT give the full details in its answer, but the local Ollama-based model stops after 150 or so tokens.

✔️ Expected Behavior

The Ollama-based model should also give a full, long response like ChatGPT, until it reaches the max token limit of 4096.

❌ Actual Behavior

The Ollama-based model used through Dify stops partway through the response after 150 or so tokens.

I tried the same Ollama model in the terminal with the same questions and got the full response, so it is not an Ollama issue.
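
For anyone who wants to script that terminal check, here is a minimal sketch of the equivalent call against the Ollama HTTP API (assuming Ollama is listening on its default port 11434 and the llama2 tag is pulled; the prompt is only a placeholder):

import json
import urllib.request

# Minimal sketch: call Ollama directly, outside Dify, and count generated tokens.
# Assumes a local Ollama server on the default port with the llama2 tag pulled.
payload = {
    "model": "llama2",
    "prompt": "Explain the history of the internet in detail.",  # placeholder question
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])
print("tokens generated:", body.get("eval_count"))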

@ashmanpan ashmanpan added the 🐞 bug Something isn't working label Feb 18, 2024

dosubot bot commented Feb 18, 2024

🤖

Hey there, @ashmanpan! 👋 I'm Dosu, your friendly neighborhood bot here to help you tackle bugs, answer your questions, and guide you on contributing while we wait for a human maintainer to swing by. I'm diving into your issue with the Ollama models and will get back to you with a full answer shortly!

@ashmanpan
Author

Details and a snapshot are available here - https://discord.com/channels/1082486657678311454/1206116751801327616

@ashmanpan ashmanpan changed the title Ollama based models are "not" providing full tokens (stops around 150 or so) on self-hosted docker(dify) based envirnment Ollama based models are "not" providing full tokens (stops around 150 or so) on self-hosted docker(dify) based environment Feb 18, 2024

dosubot bot commented Feb 18, 2024

🤖

Based on the information you've provided, it seems like the issue you're experiencing might be related to the 'num_predict' parameter in the Ollama model configuration. This parameter controls the maximum number of tokens to predict when generating text. Even though you've set the maximum token limit to 4096, the 'num_predict' parameter might be limiting the number of tokens generated by the model.

In the Dify version 0.5.4 codebase, the default 'num_predict' parameter value for Ollama based models is set to 128. Here's the relevant code snippet:

ParameterRule(
    name='num_predict',
    use_template='max_tokens',
    label=I18nObject(en_US="Num Predict"),
    type=ParameterType.INT,
    help=I18nObject(en_US="Maximum number of tokens to predict when generating text. "
                          "(Default: 128, -1 = infinite generation, -2 = fill context)"),
    default=128,
    min=-2,
    max=int(credentials.get('max_tokens', 4096)),
),

You can find this in the api/core/model_runtime/model_providers/ollama/llm/llm.py file.

If you want the model to generate more tokens, you might need to adjust the 'num_predict' parameter to a higher value or set it to -1 for infinite generation.
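
For illustration, here is a minimal sketch of what that option looks like when passed directly to the Ollama API (this is not Dify's internal request code; it assumes a local Ollama server on the default port 11434 with the llama2 tag pulled, and the prompt is a placeholder):

import json
import urllib.request

def generate(num_predict: int) -> dict:
    # Sketch only: sends one non-streaming request to a local Ollama server.
    payload = {
        "model": "llama2",
        "prompt": "Explain TCP congestion control in detail.",  # placeholder question
        "stream": False,
        "options": {"num_predict": num_predict},
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

for cap in (128, -1):  # 128 mirrors the default above; -1 removes the cap
    body = generate(cap)
    print(f"num_predict={cap}: {body.get('eval_count')} tokens generated")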

Unfortunately, I can't view the additional details and snapshots you've provided at the link as I'm unable to access URLs. If you could provide more information or any error messages you're encountering directly in this thread, it would be helpful in further diagnosing the issue.

Please note that you're using Dify version 0.5.4, and the latest version is 0.5.5. It might be worth updating to the latest version to see if the issue persists, as it could have been addressed in the latest release.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.


About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@takatost
Collaborator

You can try adjusting the Num Predict model parameter to an appropriate token size for the text you want generated.

@ashmanpan
Author

ashmanpan commented Feb 18, 2024

I have already deployed Dify on Docker containers. Which specific container has this file, and how can I edit it, or what is the way to change this?

Another question: why such a low number as the default? This is a local LLM, so nobody is going to be charged for a large number of tokens, so why?

@ashmanpan
Author

ashmanpan commented Feb 18, 2024

[4 screenshots attached]

@ashmanpan
Author

The snapshot below is after upgrading to 0.5.5.

[screenshot attached]

@ashmanpan
Author

ashmanpan commented Feb 18, 2024

[screenshot attached]

The first two models are running locally on Ollama - you can clearly see the difference.

@takatost
Collaborator

[screenshot attached]

This number is set according to the default values provided in the Ollama API documentation; you can change this value here.

@ashmanpan
Author

ashmanpan commented Feb 19, 2024 via email

@akhilmadhumenon

When we set a limit on num_predict, the sentences get cut off midway and the model does not complete them. How can we ensure that it completes the sentence while staying within the maximum token limit?

@sivertheisholt

When we set a limit on num_predict, the sentences get cut off midway and the model does not complete them. How can we ensure that it completes the sentence while staying within the maximum token limit?

Did you figure this out? I have the same problem: responses are cut off mid-sentence when num_predict is set to 64.
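
One possible workaround, sketched below under the assumption of a local Ollama server on the default port with the llama2 tag (this is not a built-in Dify or Ollama feature, and the prompt is a placeholder): treat num_predict as a hard cap, detect when the capped reply stops mid-sentence, and send a single follow-up request asking the model to finish it.

import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint

def generate(prompt: str, num_predict: int) -> str:
    # Sketch only: one non-streaming request with a hard cap on generated tokens.
    payload = {
        "model": "llama2",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

question = "Summarize how DNS resolution works."  # placeholder question
text = generate(question, 64)
if not text.rstrip().endswith((".", "!", "?")):
    # The capped reply ended mid-sentence: ask once more for just the ending.
    follow_up = (
        f"{question}\n\nPartial answer so far:\n{text}\n\n"
        "Continue the partial answer and finish the unfinished sentence."
    )
    text += generate(follow_up, 64)
print(text)

Note that this still caps each individual request, so the combined output can exceed 64 tokens; whether that is acceptable depends on why the limit was set in the first place.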
