
Add integration and/or system tests #144

Open
mattlindsey opened this issue Jun 7, 2023 · 14 comments

Comments

@mattlindsey
Contributor

I think we need the ability to add and run some 'integration' tests that exercise interactions between high-level components and use actual APIs and keys. They would be run only on request and could be run before each release.

Start with a simple question to ChainOfThought with OpenAI, like in the README, with the expectation that the result should be similar but not exactly equal to the result given in the README, since I assume the AI can respond slightly differently each time the test is called.
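
For example, a rough sketch of what such an opt-in spec could look like (the agent constructor here just mirrors the README example of the time and the :integration tag is only illustrative):

```ruby
# Hypothetical opt-in integration spec -- constructor arguments mirror the README
# example and may not match the gem's current API.
RSpec.describe "ChainOfThought against the live OpenAI API", :integration do
  it "answers the README distance question with a plausible result" do
    agent = Agent::ChainOfThoughtAgent.new(
      llm: :openai,
      llm_api_key: ENV.fetch("OPENAI_API_KEY"),
      tools: ["calculator"]
    )

    answer = agent.run(
      question: "How many full soccer fields would be needed to cover the distance between NYC and DC in a straight line?"
    )

    # The wording varies between runs, so match loosely instead of comparing
    # against the README output verbatim.
    expect(answer).to match(/soccer fields/i)
  end
end
```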

@mattlindsey
Contributor Author

mattlindsey commented Jun 7, 2023

I'm going to try implementing a simple https://cucumber.io/ test. It might work well here, but if it doesn't add value we don't have to use it:

```gherkin
Feature: Chain Of Thought
  Decompose multi-step problems into intermediate steps

  Scenario: Multistep with distance calculation
    Given I want to know a difficult distance calculation
    When I ask "How many full soccer fields would be needed to cover the distance between NYC and DC in a straight line?"
    Then I should be told something like "Approximately 2,945 soccer fields"
```
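
The step definitions behind this would be plain Ruby; something roughly like the sketch below (the agent setup is an assumption, mirroring the README):

```ruby
# features/step_definitions/chain_of_thought_steps.rb (sketch)
Given("I want to know a difficult distance calculation") do
  # Hypothetical agent setup -- adjust to the gem's actual constructor.
  @agent = Agent::ChainOfThoughtAgent.new(llm: :openai, llm_api_key: ENV["OPENAI_API_KEY"], tools: ["calculator"])
end

When("I ask {string}") do |question|
  @answer = @agent.run(question: question)
end

Then("I should be told something like {string}") do |_expected|
  # Responses differ between runs, so check for overlap rather than equality.
  expect(@answer.downcase).to include("soccer fields")
end
```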

@andreibondarev
Collaborator

@mattlindsey Do you envision that this would actually run in CI?

I'm also struggling a bit to figure out what value these feature tests would bring to this library.

@mattlindsey
Contributor Author

If you run them in CI I think you'd catch errors sooner; for example, I think there's a gem dependency error right now. (Might be wrong.)
Also, the agents are fairly high level, so testing their interaction with other components via 'integration' tests is certainly necessary somewhere, I think.

@andreibondarev
Collaborator

I hope @technicalpickles doesn't mind that I pull him in. There was a mention in Discord of executing Jupyter notebooks or README code snippets. Would you happen to have any thoughts here?

@mattlindsey
Contributor Author

Also see #145, where I implemented a couple of tests that give a better idea.

@mattlindsey
Contributor Author

And for a wider range of testing, it would be good if someone implemented Langchain::LLM::HuggingFace#complete.

@technicalpickles
Contributor

> Start with a simple question to ChainOfThought with OpenAI, like in the README, with the expectation that the result should be similar but not exactly equal to the result given in the README, since I assume the AI can respond slightly differently each time the test is called.

I was doing a course on deeplearning.ai that talked about how, if you set temperature=0, you should get the same results. The course was taught using Jupyter notebooks, and the results they got in the exercises matched what the AI returned when I ran them in the notebooks. I think it can be considered relatively stable?
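
(Illustrative only -- whether the gem's OpenAI wrapper passes temperature through like this is an assumption:)

```ruby
# Hypothetical: pin temperature to 0 for (mostly) repeatable completions.
llm = Langchain::LLM::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])
llm.complete(prompt: "How many soccer fields fit between NYC and DC?", temperature: 0)
```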

> There was a mention in Discord of executing Jupyter notebooks or README code snippets. Would you happen to have any thoughts here?

Yep! Here is what I suggested:

> I've been thinking about getting the code in the README and in examples to run as part of CI. I did something like that for openfeature-sdk (open-feature/ruby-sdk#40) ... I think the challenge for the README is making sure the fragment is complete enough to run, as well as having the right environment variables to make the call.
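
A rough sketch of that idea (not the openfeature-sdk implementation) would be to extract the fenced Ruby blocks from the README and run each one inside a tagged spec:

```ruby
# Hypothetical README runner: every fenced Ruby block must execute without raising.
readme   = File.read("README.md")
snippets = readme.scan(/```ruby\n(.*?)```/m).flatten

RSpec.describe "README snippets", :integration do
  snippets.each_with_index do |code, index|
    it "snippet ##{index + 1} runs cleanly" do
      expect { eval(code) }.not_to raise_error
    end
  end
end
```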


In both cases, I'm starting to think we could get pretty far by stubbing the response from the LLM. That could help cover everything leading up to the request. The most common way I've done this is with VCR and/or webmock. The main downside there is that it doesn't capture changes that happen on the remote end, obviously. If we are using existing libraries to do those interactions though, it's probably a pretty good tradeoff.
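
For reference, a minimal VCR setup along those lines might look like this (the cassette path and filtered key are just conventional choices, not anything in the repo today):

```ruby
# spec/support/vcr.rb (sketch)
require "vcr"

VCR.configure do |config|
  config.cassette_library_dir = "spec/fixtures/vcr_cassettes"
  config.hook_into :webmock
  # Lets specs opt in with :vcr metadata instead of explicit use_cassette calls.
  config.configure_rspec_metadata!
  # Keep real keys out of recorded cassettes.
  config.filter_sensitive_data("<OPENAI_API_KEY>") { ENV["OPENAI_API_KEY"] }
end
```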

@mattlindsey
Contributor Author

mattlindsey commented Jun 12, 2023

Thanks @technicalpickles. I'm going to try the method you used in open-feature to run our README examples with temperature=0. It will still have to be an optional script or spec, since it would require env variables - like you said.

When you say stubbing the response from the LLM, do you mean like below? Or recording responses with VCR for every example? Because the idea was to run everything against live services. https://github.com/andreibondarev/langchainrb/blob/9dd8add0703c8cc9f5d250ee7a3559f45053d7e3/spec/langchain/llm/openai_spec.rb#L68

@andreibondarev
Collaborator

andreibondarev commented Jun 13, 2023

@mattlindsey
> I'm going to try implementing a simple https://cucumber.io/ test.

I don't see much value in using Cucumber. In the case of web apps -- it brings a lot of value by abstracting the engineer away from "clicking" through the UI. It's also useful when QA engineers are primarily writing these tests, because it gives them a nice DSL.

We need to figure out whether we'd like these tests to run against real (non-mocked) services, with actual API keys/creds.

If yes -- then let's go with Jupyter notebooks. These would need to be run locally by a dev; we can't run them in CI because it costs $$$ to run.

If not -- then these tests/scripts should be in RSpec.

We have a pretty large testing matrix: think "Num of vectorsearch DBs X Num of LLMs", i.e. we're saying that any LLM in the project (that supports embed()) can generate embeddings for any vectorsearch DB.
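
To make that matrix concrete, one hypothetical way to enumerate the combinations in RSpec (the class lists below are placeholders, not the gem's actual coverage):

```ruby
# Hypothetical matrix spec: every LLM that supports embed() against every vectorsearch DB.
LLM_NAMES          = %w[OpenAI Cohere HuggingFace]       # placeholders
VECTORSEARCH_NAMES = %w[Weaviate Qdrant Pinecone Milvus] # placeholders

RSpec.describe "embedding matrix", :integration do
  LLM_NAMES.product(VECTORSEARCH_NAMES).each do |llm, db|
    it "stores #{llm} embeddings in #{db}" do
      # ...build both clients, index a document, then run a similarity search...
    end
  end
end
```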

@mattlindsey @technicalpickles Thoughts?

@technicalpickles
Contributor

> When you say stubbing the response from the LLM, do you mean like below? Or recording responses with VCR for every example? Because the idea was to run everything against live services.

That is what I meant, yeah. I think we can still get some value out of having everything but the LLM response, since there are plenty of other moving parts.

> If yes -- then let's go with Jupyter notebooks. These would need to be run locally by a dev; we can't run them in CI because it costs $$$ to run.

If that is going to require providing an API key anyway, we may as well do it in plain Ruby. We could even have an RSpec tag to indicate that something uses the API, and have it automatically included/excluded depending on whether ENV['OPENAI_API_KEY'] is present.

```ruby
describe Whatever, :openai_integration => true do
  it "works" do
    # ...
  end
end
```

Then run:

$ rspec --tag openai_integration

To exclude by default, we can add --tag ~openai_integration to the .rspec file, which holds default arguments.
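
A minimal sketch of wiring that up in spec_helper.rb, so the tag toggles automatically based on whether the key is present:

```ruby
# spec/spec_helper.rb (sketch)
RSpec.configure do |config|
  # Skip API-backed examples unless a real key is available.
  config.filter_run_excluding openai_integration: true unless ENV["OPENAI_API_KEY"]
end
```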

> These would need to be run locally by a dev; we can't run them in CI because it costs $$$ to run.

That said, it makes me wonder if OpenAI has any policies for open source development?

OpenAI is also on Azure, and Azure has open source credits we could apply for: https://opensource.microsoft.com/azure-credits/

@mattlindsey
Contributor Author

@andreibondarev Can Jupyter notebooks run Ruby? I'm thinking RSpec in a separate 'integration' directory with the tags described by Josh sounds good.

Looks like Azure takes 3-4 weeks to reply if you want to request access to their 'Azure OpenAI' service (https://learn.microsoft.com/en-us/azure/cognitive-services/openai/overview). But would that mean a new LLM class in langchainrb? I don't see any Ruby examples in the documentation, so I'm not sure.

@technicalpickles
Contributor

> Can Jupyter notebooks run Ruby?

I saw it in the boxcars gem, which is in the same space as this gem:
https://github.com/BoxcarsAI/boxcars/blob/main/notebooks/boxcars_examples.ipyn

@mattlindsey
Contributor Author

@technicalpickles I added a similar 'getting started' Jupyter notebook in #185, but it was somewhat difficult to get working and seems to give errors sometimes. Take a look if you want, but I don't want to waste your time!

@mattlindsey
Contributor Author

I did get a notebook working, but it's very picky and may not be worth the effort to maintain.
I'll post it here just in case: https://gist.github.com/mattlindsey/5f6388d6ff76c2decdccb723bb4ed4c5#file-getting_started-ipynb
