
[DISCUSSION] I've now got this working with Ollama's chat completion API #16

jukofyork opened this issue Dec 27, 2023 · 12 comments

@jukofyork (Contributor)

Not an issue but I can't see any discussion board for this project...

I've now got this working directly against the Ollama chat completion API endpoint, so it's now possible to use the plug-in with local LLM instances:

https://github.com/jmorganca/ollama/blob/main/docs/api.md#generate-a-chat-completion

I originally tried to use LiteLLM to emulate the OpenAI API and then have it communicate with Ollama, but it didn't really work.

So instead I've just made a hacked version of the plug-in and got it to communicate directly, and after finally getting to the bottom of why it was hanging after a couple of pages of text (see my other issue for the solution), it seems to be working pretty well. The main changes needed were:

  • Add the ability to specify CHAT_URL rather than always using the fixed value of "/v1/chat/completions" (Ollama uses "/api/chat").
  • Traverse the response tree to find the "message" key (Ollama's API has it one level up, but it still uses the "role" and "content" sub-keys).
  • Detect the end of a streamed response using the "done" key instead of looking for "[DONE]" (a rough sketch of the parsing difference follows this list).
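
For illustration, here is a minimal sketch of the parsing difference (not the plug-in's actual code; it assumes Jackson is available for JSON handling): OpenAI streams SSE lines prefixed with "data: " and terminated by "[DONE]", whereas Ollama streams one JSON object per line and sets "done": true on the final one.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class StreamLineParser {

	private static final ObjectMapper MAPPER = new ObjectMapper();

	/** Ollama: one JSON object per line; returns the text fragment, or null when finished. */
	public static String parseOllamaLine(String line) throws Exception {
		JsonNode node = MAPPER.readTree(line);
		if (node.path("done").asBoolean(false)) {
			return null; // end of stream
		}
		// "message" sits at the top level but keeps the usual "role"/"content" sub-keys
		return node.path("message").path("content").asText("");
	}

	/** OpenAI: SSE lines of the form "data: {...}", terminated by "data: [DONE]". */
	public static String parseOpenAiLine(String line) throws Exception {
		if (!line.startsWith("data: ")) {
			return "";
		}
		String payload = line.substring("data: ".length());
		if ("[DONE]".equals(payload)) {
			return null; // end of stream
		}
		return MAPPER.readTree(payload).path("choices").path(0).path("delta").path("content").asText("");
	}
}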

AFAIK none of the open-source LLMs can handle the function-calling format OpenAI's models use, so that isn't active yet, but I'm pretty sure I can get it to work using prompts, at least to some extent. LiteLLM seems to have the ability to do this using the "--add_function_to_prompt" command line option:

https://litellm.vercel.app/docs/proxy/cli

I can probably tidy up the code in a couple of days if anyone is interested?

@gradusnikov (Owner)

Hi @jukofyork. Thank you for your input. I've merged your changes.

@jukofyork (Contributor Author) commented Jan 5, 2024

Hi @gradusnikov ,

No problem and glad to be of help!

I've got the Ollama port running really well now. I still have to tidy it up, and I've stripped out a lot of stuff that didn't yet work well with locally run LLMs: function calling via prompts barely worked and required streaming to be turned off, the local LLMs couldn't really create a working diff file, a lot of the stuff specific to JavaDoc and the Eclipse Java AST didn't apply, and so on.

One thing you might want to add to your code is to make the right-click menu context-sensitive:

package eclipse.plugin.assistai.handlers;

import org.eclipse.e4.core.di.annotations.Evaluate;
import org.eclipse.e4.ui.workbench.modeling.EModelService;
import org.eclipse.ui.PlatformUI;
import org.eclipse.ui.texteditor.ITextEditor;

/**
 * Handles the contributions to the Eclipse menu based on the active editor.
 */
public class MenuContributionsHandler {

	@Evaluate
	public boolean evaluate(EModelService modelService) {
		// Get the active editor, guarding against a missing window/page (e.g. during startup)
		var window = PlatformUI.getWorkbench().getActiveWorkbenchWindow();
		if (window == null || window.getActivePage() == null) {
			return false;
		}
		var activeEditor = window.getActivePage().getActiveEditor();

		// Only show the menu contribution when a text editor is active
		return activeEditor instanceof ITextEditor;
	}
}

Then add this to 'fragment.e4xmi':

<fragments xsi:type="fragment:StringModelFragment" xmi:id="_pVgfIIrOEeW7h_qdP9N9fw"
           featurename="menuContributions" parentElementId="xpath:/">
  <elements xsi:type="menu:MenuContribution" xmi:id="_BducUIrPEeW7h_qdP9N9fw"
            elementId="eclipse.plugin.assistai.menucontribution.1"
            positionInParent="after=additions" parentId="popup">
    . . .

I also found that the view part defined via the 'fragment.e4xmi' file was buggy on Linux and would appear blank after opening and closing it, so I moved it out into a class that extends ViewPart to fix this. The bug is probably not your code's fault: the Linux version of Eclipse's Browser widget is broken and has a known bug where it shows up blank unless you use 'export WEBKIT_DISABLE_COMPOSITING_MODE=1' or 'export WEBKIT_DISABLE_DMABUF_RENDERER=1'. A minimal sketch of such a ViewPart is below.
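
Roughly, the replacement looks like this (a minimal sketch only; the class and view ID names are hypothetical, not necessarily what the fork uses):

package eclipse.plugin.assistai.views;

import org.eclipse.swt.SWT;
import org.eclipse.swt.browser.Browser;
import org.eclipse.swt.widgets.Composite;
import org.eclipse.ui.part.ViewPart;

public class ChatView extends ViewPart {

	// Must match the id declared for the view in plugin.xml (hypothetical value)
	public static final String ID = "eclipse.plugin.assistai.views.ChatView";

	private Browser browser;

	@Override
	public void createPartControl(Composite parent) {
		// Host the chat UI in an SWT Browser widget
		browser = new Browser(parent, SWT.NONE);
		browser.setText("<html><body></body></html>");
	}

	@Override
	public void setFocus() {
		if (browser != null && !browser.isDisposed()) {
			browser.setFocus();
		}
	}
}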

I'm not sure if it's possible with the 'fragment.e4xmi' view, but with one extended from ViewPart you can easily add to the toolbar and drop-down menu:

private void contributeToActionBars() {
	IActionBars bars = getViewSite().getActionBars();
	fillLocalPullDown(bars.getMenuManager());
	fillLocalToolBar(bars.getToolBarManager());
}

private void fillLocalPullDown(IMenuManager manager) {
	manager.add(cancelAction);
	manager.add(stopAction);
	manager.add(new Separator());
	manager.add(discussAction);
}

private void fillLocalToolBar(IToolBarManager manager) {
	manager.add(cancelAction);
	manager.add(stopAction);
	manager.add(new Separator());
	manager.add(discussAction);
}

I did actually manage to get the right-click context menu working without the 'fragment.e4xmi' stuff (which then lets you programmatically add to the menu instead of having to edit each command, menu item, etc. by hand), but for some reason it didn't work properly with the dependency injection and I need to revisit it to see what was wrong.

I've also made it so that you can send "system" messages to the view and they come up in a blue chat bubble. This might be worth adding to your code so that you can print out error messages for the user to see when they can't connect properly to the OpenAI servers, etc.
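
Something along these lines works (a rough sketch, not the actual view code): escape the message and inject a styled div into the Browser's DOM on the SWT UI thread.

import org.eclipse.swt.browser.Browser;
import org.eclipse.swt.widgets.Display;

public final class SystemMessages {

	private SystemMessages() {}

	// Illustrative only: appends a "system" bubble; the 'system-bubble' CSS class is hypothetical
	public static void showSystemMessage(Browser browser, String message) {
		// Keep the injected string safe to embed in a JavaScript literal
		String escaped = message.replace("\\", "\\\\").replace("'", "\\'").replace("\n", "\\n");
		Display.getDefault().asyncExec(() -> {
			if (!browser.isDisposed()) {
				browser.execute(
					"var div = document.createElement('div');"
					+ "div.className = 'system-bubble';" // styled blue via the view's CSS
					+ "div.textContent = '" + escaped + "';"
					+ "document.body.appendChild(div);");
			}
		});
	}
}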

One final thing I've done is refine the prompts using the LLMs themselves. I have 8 now: code generation (for boilerplate code), code completion, discuss, document, code-review, refactor, optimize and debugging (+ the "fix errors" special case).

To refine them I used several of the different local LLMs (codellama, deepseek-coder, etc) with instructions like "I'm writing an Eclipse IDE plugin for an AI assistant and would like you to comment on how good my prompts are". I then got them to make really wordy prompts that have lots of detail for each of the 8 categories above. This eventually produced really great prompts, but they used a lot of tokens to encode...

So finally I used the most competent local LLM I have (deepseek-llm-67b), which was about the only one that truly understood what I was doing (and didn't get confused and start trying to answer the prompts it was supposed to be refining! 🤣), to compress them down whilst keeping the crucial information.

You can probably use ChatGPT itself for this whole process, but I did find that a few iterations of expanding the prompts into very wordy/detailed versions and then compressing them back down works extremely well. After a few iterations the models eventually can't find anything useful to add or change, and the final prompt ends up worded in a way they comprehend very well.

I'm going to tidy up the code over the next few days: see if I can get the context menu working, harden the code that communicates with the Ollama server, etc. But after that I'm happy to share everything I've got back with you - I've no intention of making/supporting a proper fork and it will otherwise stay private.

I will then try to see if I can get the stuff you have working with Eclipse's Java AST to work with Eclipse's CDT AST for C++.

Anyway, I just want to say thanks for creating this - I think it's a great project and I hope it becomes more popular!

Juk

@cpfeiffer

Sounds great, looking forward to trying this!

@gradusnikov (Owner) commented Jan 10, 2024

Hi @jukofyork

I find function calling very useful, esp. after adding web search and web read. I think I will add more, as this is a simple and quite powerful way to make the LLM answer more accurately. I have not tried function calling with other LLMs, but maybe the approach from about 6 months ago would work, where people were defining function definitions as part of the system message, along with the function call format? Or I can simply disable function calling in Settings?

@jukofyork (Contributor Author)

Hi again,

I've got the communication with the Ollama server working fairly robustly at last: their server is a Go wrapper around llama.cpp and it's very prone to crashing from OOM errors, but exponential backoff seems to give it time to restart itself, and all the random HTTP disconnects aren't a problem now.
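
The backoff itself is nothing fancy; a minimal sketch of the idea (a hypothetical helper, not the actual client code) is to retry the request with an exponentially growing delay so the Ollama server has time to come back up after an OOM crash:

import java.io.IOException;
import java.util.concurrent.Callable;

public final class Backoff {

	private Backoff() {}

	/** Retries the call with delays of 1s, 2s, 4s, ... up to maxAttempts (illustrative helper). */
	public static <T> T withExponentialBackoff(Callable<T> call, int maxAttempts) throws Exception {
		long delayMillis = 1000;
		for (int attempt = 1; ; attempt++) {
			try {
				return call.call();
			} catch (IOException e) {
				// Treat connection resets/refusals as "the server is restarting"
				if (attempt >= maxAttempts) {
					throw e;
				}
				Thread.sleep(delayMillis);
				delayMillis *= 2;
			}
		}
	}
}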

I've struggled a lot with the dependency injection: either things not getting injected, causing baffling null pointer exceptions, or other weird things like the ILog (which I've now added a listener to, to display messages in blue chat bubbles) seeming to have multiple copies instead of being a singleton, etc. I'm still not 100% sure why, but I think it's Eclipse's own dependency injection somehow interfering. Anyway, I had to strip a lot of it away to make sure everything works.

I've iterated over a few different methods of using the interface and finally settled on the right-click context menu plus a toggle button to decide whether the full file should be sent as extra context or not. This, along with grabbing and appending anything in the edit box to the end of the prompt message, seems to be the most usable. I think the Continue input method (https://continue.dev/) with their slash commands might be worth looking at too, but this is working so well now I don't really have the motivation to try it.

I did consider seeing if I could get a "tree" of responses like a lot of the LLM web apps implement (with undo, sideways edits, etc.), and possibly even see if I could journal the stuff getting sent to the Browser widget so it can be serialised and restored, but I don't think it will really be that useful, as long conversations soon exhaust the context windows of all the available locally runnable LLMs...

I've added lots of other little fixes like rate-limiting and buffering the streaming to 5 events per second, as I found it could badly lag the main Eclipse UI thread for some of the smaller/faster models that can send 20-50 events per second.
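
The buffering can be as simple as the following sketch (a hypothetical helper, not the actual code): the streaming thread appends tokens to a shared buffer, and an SWT timer flushes it to the view at most five times per second.

import java.util.function.Consumer;
import org.eclipse.swt.widgets.Display;

public class StreamBuffer {

	private static final int FLUSH_INTERVAL_MS = 200; // ~5 UI updates per second

	private final StringBuilder pending = new StringBuilder();
	private final Display display;
	private final Consumer<String> appendToView;

	/** Must be constructed on the UI thread, since timerExec requires it. */
	public StreamBuffer(Display display, Consumer<String> appendToView) {
		this.display = display;
		this.appendToView = appendToView;
		scheduleFlush();
	}

	/** Called from the streaming (non-UI) thread for every token received. */
	public synchronized void add(String token) {
		pending.append(token);
	}

	private void scheduleFlush() {
		display.timerExec(FLUSH_INTERVAL_MS, () -> {
			String chunk;
			synchronized (this) {
				chunk = pending.toString();
				pending.setLength(0);
			}
			if (!chunk.isEmpty()) {
				appendToView.accept(chunk); // runs on the UI thread
			}
			scheduleFlush();
		});
	}
}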

Anyway, it's mainly just a case of tidying up the View code, and then I will share it back via GitHub - hopefully some of the stuff will be useful.

@jukofyork (Contributor Author)

> Hi @jukofyork
>
> I find function calling very useful, esp. after adding web search and web read. I think I will add more, as this is a simple and quite powerful way to make the LLM answer more accurately. I have not tried function calling with other LLMs, but maybe the approach from about 6 months ago would work, where people were defining function definitions as part of the system message, along with the function call format? Or I can simply disable function calling in Settings?

Yeah, there is quite an interesting discussion on this here:

ollama/ollama#1729

They are defining the functions in the system message and then doing 4-5 shot teaching by making the first few messages examples of calling the functions.
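
As a purely illustrative example (made up here, not taken from that thread or from LiteLLM), the system message describes the tools and the expected call format, and the first few messages then demonstrate it:

public final class FunctionPromptExample {

	// Hypothetical system message describing one tool and the expected call format
	public static final String SYSTEM = """
		You have access to the following function:
		  web_search(query: string) - searches the web and returns the top results.
		When you need a function, reply with ONLY a JSON object such as:
		  {"function": "web_search", "arguments": {"query": "..."}}
		Otherwise answer normally.
		""";

	// The first few user/assistant messages then show the model worked examples of
	// producing that JSON (few-shot teaching), before the real conversation starts.
}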

@jukofyork (Contributor Author) commented Feb 23, 2024

@gradusnikov I have added lots of things you might find useful, e.g.:

  • Use of a CompareEditor to selectively apply changes.
  • The Browser has undo, autoscroll, jump to bubble, etc. all implemented.
  • Use of the StringTemplate library (https://github.com/antlr/stringtemplate4/blob/master/doc/cheatsheet.md) for the prompts, and a special <<switch-roles>> token that can be used for delayed responses, forced replies, multi-shot prompts, etc.

The final thing I want to do is allow multiple copies of the view to be opened, and then I'll upload it to GitHub later this week.

I'm happy for others to try it out and use it, but Ollama is very buggy and I don't want to spend lots of time helping people get Ollama working, or step on @gradusnikov's toes, since it's his project after all and I've stripped out as much as I have added... I'll try to create a plug-in installer and add some instructions too, but it's more going to be left as a foundation for others to build on rather than an active fork I want to maintain.

@jukofyork (Contributor Author) commented Apr 7, 2024

@gradusnikov

I've done my best to commit the code to GitHub (no idea why it's ended up in a subfolder like that, though 😕):

https://github.com/jukofyork/aiassistant

The bits that are probably most useful to you:

  • ReviewChangesBrowserFunction: Implements a CompareEditor, which was particularly hard due to the Eclipse docs being out of date (see the URLs for links that explain the new/correct way to implement it). Make sure to set "Ignore white space" in Eclipse's Compare/Patch settings, or else CompareEditors don't work well in general.
  • MenuVisibilityHandler: Used to stop the right-click context menu appearing all over Eclipse (there is also a similar visibility handler for the 'Fix Errors' and 'Fix Warnings' options).
  • URLFieldEditor: Strictly allows only valid URLs with port numbers (I had Eclipse break really badly when I input a bad URL and had to manually find and edit the preference store to fix it!) - a rough sketch of the idea follows this list.
  • IndentationFormatter: Useful for removing and reapplying indentation (mainly for the different BrowserFunction classes).
  • LanguageFileExtensions: Loads a JSON file with all the {language, extensions} tuples used by highlight.js.
  • BrowserScriptGenerator: Has code in it to do things like undo, scroll to top/bottom, scroll to previous/next message, detect if the scrollbar is at the bottom, etc.
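
The validation idea is captured by the sketch below (a hypothetical reimplementation, not the repository's actual URLFieldEditor class): only accept http/https URLs that include an explicit port, e.g. http://localhost:11434.

import java.net.URI;
import org.eclipse.jface.preference.StringFieldEditor;
import org.eclipse.swt.widgets.Composite;

public class UrlFieldEditorSketch extends StringFieldEditor {

	public UrlFieldEditorSketch(String name, String label, Composite parent) {
		super(name, label, parent);
		setErrorMessage("Enter a valid http(s) URL including a port, e.g. http://localhost:11434");
	}

	@Override
	protected boolean doCheckState() {
		try {
			// Reject anything that isn't an http/https URL with a host and an explicit port
			URI uri = new URI(getStringValue());
			boolean httpLike = "http".equals(uri.getScheme()) || "https".equals(uri.getScheme());
			return httpLike && uri.getHost() != null && uri.getPort() != -1;
		} catch (Exception e) {
			return false;
		}
	}
}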

There are also lots of small changes to do with the main view you may or may not want to use:

  • Saves the previous messages in a buffer that the yellow arrows can cycle through.
  • Blanks out everything but the 'STOP' button whilst working.
  • Only scrolls to the bottom if already at the bottom (so you can read the text further up whilst it's streaming, etc.).
  • Right-click menu to allow external chunks of code or documentation to be pasted in without getting mangled.
  • Hotkeys like Shift+Enter to allow insertion of newlines in the user message area, delayed replies, etc.

The prompts are the best I can come up with after a couple of months of trying. In general I've found the fewer newlines the better, and starting your tasks with a '#' symbol seems to help (possibly the models treat it as a Markdown heading, or maybe they have been over-trained on Python comments). I've made it so the prompts use the StringTemplate library now:

https://github.com/antlr/stringtemplate4/blob/master/doc/cheatsheet.md

and added several other possibly useful context variables and a special <<switch-roles>> tag that can be used for delaying responses, forcing responses, multi-shot learning, etc. (have a look in the prompts for examples of its use).
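
For anyone unfamiliar with StringTemplate, basic usage looks roughly like this (a made-up template for illustration, not one of the repository's actual prompt files):

import org.stringtemplate.v4.ST;

public class PromptTemplateExample {
	public static void main(String[] args) {
		// ST4's default delimiters are '<' and '>', so <language> and <code> below
		// are template attributes filled in at render time (illustrative names only).
		ST prompt = new ST(
			"# Task: review the following <language> code and list any bugs.\n<code>");
		prompt.add("language", "Java");
		prompt.add("code", "int x = 1 / 0;");
		System.out.println(prompt.render());
	}
}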


I've had to strip out all of the @Inject stuff. I tried different versions of Javax/Jakarta and just got really weird bugs where nearly identical instances would fail to inject and give null pointer exceptions... Then one day Eclipse did some kind of update and absolutely nothing worked (even after a full reinstall/revert!), so I just had to create a new project and move each class back in one at a time.

I also found the fragment.e4xmi stuff to be very buggy (e.g. the main view would be blank if you closed and reopened it, etc.) and converted the main view and context menu to use 'plugin.xml' extensions instead.

I think, from reading the issues here and on the Eclipse Marketplace, that the Javax/Jakarta stuff and the fragment.e4xmi view are the main cause of the problems being reported.

I also had to move all the dependencies into the main plug-in, as for some reason this caused me a lot of problems too (possibly because I was trying to edit the forked version, though).


I did have more advanced code for the networking (due to Ollama being so buggy and crashing often from OOM errors), but I had to remove it: because Eclipse only uses a single GUI thread, it caused more problems than it solved (i.e. the menus kept freezing, etc.).

One thing I didn't fix, but which probably needs looking at, is the O(n^2) complexity of the way the streamed tokens get added to the browser window: it gets slower and slower and starts to cause the main Eclipse GUI thread to stall. The best solution I could find without completely rewriting the code is to use estimateMaximumLag and concatenate to an internal buffer (see 'OllamaChatCompletionClient.processStreamingResponse'). This is still O(n^2), but it does stop the whole Eclipse GUI from stalling as more and more gets added to the Subscription class's buffer.


There are probably a lot of other changes that I've forgotten to mention here, but I would just like to say thanks for creating the base project!

@jukofyork (Contributor Author) commented Apr 7, 2024

Just noticed that some random 'ToDo' list with prompts has come up as the main README - I'll see if I can tidy it up tomorrow (I don't really use Git and seem to always make a mess of it ☹️).

I've also deliberately not added an actual binary release for the plugin: firstly, I don't want to take away from this project, and secondly, I don't want to become an Ollama technical support person...

If anybody wants to use it: you just need to install the plug-in development tools in Eclipse, use 'Import Fragments' and 'Export plugin', and it should work.

@gradusnikov (Owner) commented Apr 8, 2024 via email

@jukofyork (Contributor Author) commented Apr 8, 2024

> Hi jukofyork! Thank you very much for your edits. I will try to integrate your changes with the main branch. Cheers! /w.

No problem and I hope it is helpful :)

I've updated the README to hopefully better explain how to build/use the forked version, and have added a few pictures and notes that I might have forgotten to mention above.

I have some other work I need to do for the next few weeks, but the next things I want to look at are:

  • Add a spell checker to the user input area as spelling mistakes can degrade LLM performance quite a lot.
  • TAB-autocompletion using a smaller/faster LLM. I'm not sure how feasible this is, but I was going to look at the Eclipse source to see how it does the autocompletion when you press Ctrl+Space.
  • Try to programmatically add custom prompts to the context menu. I've found the large "monolithic" prompts don't work that well, and smaller/more focused ones seem to work best. I saved a lot of smaller prompts I found on a website in JSON format in the quick-prompts.json file but haven't got any further (I also experimented with different CoT-type prompts and some are still in the Ideas.md file).
  • See if I can fix the O(n^2) complexity of the streaming. I'm not 100% sure how, though, as multi-line Markdown syntax makes it very hard to decide when there is no need to go back to previous line(s), etc. I noticed OpenWebUI seems to buffer lines/paragraphs somehow, as you can see things like code blocks, bolded text, etc. suddenly "pop out" when it encounters the closing syntax; whereas your method doesn't do this and looks much nicer during the streaming.
  • See if I can add back the exponential-backoff code and generally make the API calls more robust. Ollama seems to have been fixing a lot of their memory-size calculations, but I still get random OOM errors that cause the Ollama server to go down for a few seconds ☹️

I'll be sure to share back anything I find, and I'll have a look through your latest code when I get back to it - the fork is based on a pull I made sometime last December, and I see you have made quite a lot of changes since then.

There are also quite a few changes to the Ollama API underway: OpenAI compatibility, function calling, etc. are on their to-do list, so it's probably a good time to leave it and see what they do next too.

@jukofyork (Contributor Author) commented Apr 12, 2024

I'm finding Ollama to be too buggy to use now - it seems that for each bug they fix they create two more, and their Go wrapper around llama.cpp's server is getting more and more impenetrable to fix anything in... It's some strange mix of llama.cpp's server.cpp code from a few months ago, which imports and uses much newer llama.h and llama.cpp files... I can only see things getting worse in the future ☹️

So it looks like I'm going to have to start using the llama.cpp server directly, but I'm not sure whether I should leave in or just remove the Ollama server code now:

  • The recent bugs in Ollama now mean that the <<switch-roles>> stuff doesn't work any more and only the last "user" message sent gets seen.
  • The fix they tried to make to the spaghetti code that handles the chat/completions endpoint now has the exact opposite bug with the system message (i.e. it used to get ignored if you didn't have a default one defined in your modelfile, but now it does the opposite).
  • They seem to have broken the {{.First}} variable for the Go text/template engine.

and so on... It's really so buggy now that I don't actually trust that what is getting sent to the server is what you expect (the bug where the system message was getting ignored went unnoticed for months!).

The problem is that if I leave the Ollama code in, then options like "Add Context" won't actually work (nor will any future multi-shot prompts), but at the same time I'm reluctant to remove it, as sometime in the future they may actually start to fix some of these bugs. Things like being able to list available models, load new models, and allow the GPU VRAM to be unloaded after 5 minutes if you don't send keep-alive messages were all much nicer than what is going to be possible with the stock llama.cpp server 😕


@gradusnikov

On another note, I have been researching how the completion engine works in Eclipse, and specifically the CDT code that is used for C++ completion.

It looks horrifically complex, but like anything in Eclipse it is probably not that bad if you can get a minimal working example running... I doubt I'll have time to look at this properly for a few weeks, but I think it would be worth looking into.
