## LangSmith in depth

## LS definition, terminology and FAQs

#### What is LangSmith, in its own words?
* A unified DevOps platform
* for developing,
* collaborating,
* testing,
* deploying,
* and monitoring
* LLM Applications.

#### LangSmith pitch messages
* Get your LLM App from prototype to production.
    * All-in-one platform for every step of the LLM-powered application cycle. 
* LangSmith turns LLM "magic" into enterprise-ready applications.
    * No more "guessing" or "development by vibes": use testing to confirm that your LLM Application performs as desired.

#### LangSmith Terminology
* **Traces**: record of interactions with the LLM Application. Can contain more info beyond just input and output.
* **LLM Calls**: calls to the LLM model. Input and output.
* **Run**: a single execution of the LLM App to process an input and generate an output.
* **Annotation Queues**: to add human labels and feedback on traces.
* **Datasets** for evaluation, few-shot prompting or fine-tuning: build datasets from examples, production data or existing sources.
* **(LangSmith Prompt) Hub**: for prompt engineering experimentation.
* **Collaboration**: between developers and subject matter experts.
* **Auto-evaluation**:
    * use an LLM and prompt to score your application output,
    * or write your own functional evaluation tests to record different measures of effectiveness.
* **Regression testing**: ensure that new features or updates do not adversely affect the existing functionalities of the LLM Application.
* **Online evaluation** (coming soon):
    * continuously track qualitative characteristics of any live application
    * and spot issues in real-time with monitoring.
* **Observability**: monitoring the health and performance of the LLM Application, with insights into its behavior, the performance of the LLM model, and the interactions between the LLM Application and the LLM model.
* **Deployment, One-Click deploy**: deployment to LangServe (in beta, only in Plus and Enterprise versions).
* More LS Terminology (feedback, tags, metadata) in [LS Concepts](https://docs.smith.langchain.com/tracing/concepts)

#### FAQ Section
* You can use LS even if you don't use LC.
* You can self-host LS (with Enterprise pricing).
* LS traces are encrypted and stored securely.
* You can specify the % of traces you send to LS (**this can reduce your cost of LS**).
* LS does not make your app slower.
* LS does not use your data.
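
The sampling point above can be sketched in a few lines. This is a minimal illustration, not the SDK's implementation: it assumes the sampling rate is configured through the `LANGCHAIN_TRACING_SAMPLING_RATE` environment variable (the name used in the tracing how-to guide; check it against your SDK version), and `should_trace` is a hypothetical helper that only shows what a rate of 0.25 means in practice.

```python
import os
import random

# Assumption: the SDK reads the sampling rate from this environment variable.
# Verify the exact name in the LangSmith docs for the version you use.
os.environ["LANGCHAIN_TRACING_SAMPLING_RATE"] = "0.25"


def should_trace(rate: float, rng: random.Random) -> bool:
    """Hypothetical helper: keep a trace with probability `rate`."""
    return rng.random() < rate


rate = float(os.environ["LANGCHAIN_TRACING_SAMPLING_RATE"])
rng = random.Random(0)  # seeded so the sketch is reproducible
kept = sum(should_trace(rate, rng) for _ in range(10_000))
print(f"kept {kept} of 10000 traces (~{kept / 10_000:.0%})")
```

With a rate of 0.25, roughly one trace in four is sent, so the traced volume (and the cost tied to it) drops accordingly.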

## LS: Initial Operations

#### How to create a free account in LangSmith
* Go to [smith.langchain.com](https://smith.langchain.com)

#### How to create a new LS project
Normally one LLM App is associated with one Project, but in reality a Project can be associated with any collection of traces.
* Click on New Project
* In the Project page, you will see four tabs:
    * Traces
    * LLM Calls
    * Monitor
    * **Setup: log your first run**
        * create API key
        * select desired runtime environment
        * you can use LS with a LangChain app, with the SDKs (Python or TypeScript), or via the API.
        * Here we explain how to use LS with a LangChain app. For the other options, check the LS [Tracing Quickstart Guide](https://docs.smith.langchain.com/tracing/quick_start).
        * To use LS with a LangChain app, follow the [step-by-step guide](https://docs.smith.langchain.com/tracing/quick_start):
            * (recommended) create a new virtual environment
            * install langchain
            * set the LS environment variables
                * via terminal
                * or via .env file (remove the export keyword)
            * run any LLM app  
* Add data to .env file
    * Create a new LangSmith API Key.
    * Copy your API Key in the .env file.
* Use the .env file in your LC project.
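
The `.env` step above can be sketched as follows. This is a minimal, standard-library illustration under stated assumptions: the variable names (`LANGCHAIN_TRACING_V2`, `LANGCHAIN_ENDPOINT`, `LANGCHAIN_API_KEY`, `LANGCHAIN_PROJECT`) are the ones I recall from the quickstart guide and should be checked there, the values are placeholders, and `parse_env` is a hypothetical helper rather than part of any library.

```python
import os

# Placeholder .env content; in a real project this lives in a file named `.env`.
SAMPLE_ENV = """\
# LangSmith settings (placeholder values)
export LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY="<your-api-key>"
LANGCHAIN_PROJECT=my-first-project
"""


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping comments, blanks, and a leading `export`."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


settings = parse_env(SAMPLE_ENV)
os.environ.update(settings)  # do this before the LangChain app starts
print(sorted(settings))
```

In practice a package such as `python-dotenv` handles this loading; the sketch only shows why the `export` keyword is dropped when the same lines move from the terminal into a `.env` file.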

#### How to see the traces of your project
* Go to Projects > YourProjectName
* Go to the Traces tab
* Click on each trace to see the trace details
    * Input
    * Latency
    * In test dataset?
    * With annotation?

## Prototyping phase: challenges solved by LS

#### Have LS enabled since day 1

#### How to use LS Playground to iterate and experiment: How to experiment with prompts in the LS Hub Prompt Playground Environment
* You can use it just by going to Hub.
* In the LS Hub you will find examples of prompts that other developers are using for many different cases (purpose, model, etc).
    * Clicking on a prompt example will open the Prompt Playground Environment.
    * One interesting feature here is prompt versioning. With this you can see how the initial prompt has evolved with the contributions of different people. This can be an interesting way of collaborating between developers and subject matter experts (product managers, marketing people, etc).
* You can import a prompt from the LS Hub into your LangChain Application without having to copy the entire prompt. See detailed instructions on how to do this [here](https://docs.smith.langchain.com/hub/dev-setup#3-pull-an-object-from-the-hub-and-use-it).  
* But the interesting thing is that you can use it from any trace.
    * You can open any trace in this Prompt Playground Environment and change the prompt, the LLM model, or the LLM model features (temperature, etc).
    * There are 2 LLM models you can use here for free: GooglePalm and Fireworks.
    * To use ChatGPT you will need to enter your OpenAI API key. When you are in the Playground, there is a button in the top right corner called "Secrets & API Keys" to do that.

#### How to use LS Comparison View to compare the performance of alternative approaches.
In the Dataset dashboard, go to one dataset, select several tests that you performed using that dataset and click on the Compare button to see the Comparison View.
* We can use this to compare several tests: compare outputs, compare performance, etc.

#### How to use LS to create a Test Dataset and include trace examples in it
* Datasets section > create Dataset.
* After clicking on a particular trace, click on Add to Dataset.

## LS Datasets: Advanced Tips

#### How to evaluate your LLM Application with a Test Dataset
In the prototyping phase, you will create your own Test Dataset. In the Beta Testing phase, you will add to that initial dataset real examples from your beta users (mostly relevant cases where the user has labelled the LLM answer as THUMBS UP or THUMBS DOWN).
* You can use the test dataset to evaluate different versions of your LLM Application (with different LLM models, with different LLM Model features like temperature, with different prompts, etc) and compare the performance in terms of accuracy, latency, cost, etc.
* It is very useful to use the Comparison View to compare the performance of different versions of the LLM Application with the Test Dataset.
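
The comparison described above can be sketched without any LangSmith code. This is only an illustration of the idea: `DATASET`, `app_v1`, `app_v2` and `evaluate` are hypothetical stand-ins, and exact-match accuracy is one possible metric among many.

```python
import time
from statistics import mean

# Hypothetical test dataset: each example pairs an input with a reference output.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 * 3", "expected": "9"},
    {"input": "10 - 7", "expected": "3"},
]


def app_v1(question: str) -> str:
    """Stand-in for one version of the app (answers every example correctly)."""
    return {"2 + 2": "4", "3 * 3": "9", "10 - 7": "3"}[question]


def app_v2(question: str) -> str:
    """Stand-in for another version (gets one example wrong)."""
    return {"2 + 2": "4", "3 * 3": "6", "10 - 7": "3"}[question]


def evaluate(app, dataset):
    """Run the app on every example and record accuracy and mean latency."""
    correct, latencies = 0, []
    for example in dataset:
        start = time.perf_counter()
        output = app(example["input"])
        latencies.append(time.perf_counter() - start)
        correct += output == example["expected"]
    return {"accuracy": correct / len(dataset), "mean_latency_s": mean(latencies)}


results = {name: evaluate(app, DATASET) for name, app in [("v1", app_v1), ("v2", app_v2)]}
for name, metrics in results.items():
    print(name, metrics)
```

The Comparison View presents this kind of per-version summary side by side, along with the individual outputs for each example.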

#### How many examples should the Test Dataset have?
* The LS team says that the average Test Dataset has around 20 examples when an LLM App Development team starts the Beta Testing phase, but the right number really depends on each project and on how much time and effort the team wants to (or can) invest in evaluation.

#### LS Datasets can be used for more things other than Evaluation
The main use of LS Datasets is evaluation, but some teams have also used them for other purposes like:
* Few-shot prompting,
* Or even fine-tuning.

#### Offline Evaluation vs. Online Evaluation
* Offline Evaluation: current LS Evaluation. Your LLM App is tested against a test dataset.
* Online Evaluation: next LS feature.
    * Evaluators will run on a sample of your traffic. For example, evaluate 20% of your down-voted traces with a particular evaluator in production, with real data.
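
Since online evaluation is described here as an upcoming feature, the following is only a sketch of the sampling idea, not of any LangSmith API. The trace records, `sample_fraction` and `conciseness_evaluator` are all hypothetical.

```python
import random

# Hypothetical production traces: user_score 0 = thumbs down, 1 = thumbs up.
traces = [
    {"id": i, "user_score": 0 if i % 4 == 0 else 1, "output": "word " * (i + 1)}
    for i in range(100)
]


def sample_fraction(items, fraction, seed=0):
    """Return a reproducible random sample holding `fraction` of the items."""
    rng = random.Random(seed)
    return rng.sample(items, round(len(items) * fraction))


def conciseness_evaluator(trace, max_words=40):
    """Hypothetical evaluator: 1 if the output is short enough, else 0."""
    return int(len(trace["output"].split()) <= max_words)


down_voted = [t for t in traces if t["user_score"] == 0]
sampled = sample_fraction(down_voted, 0.20)  # evaluate 20% of down-voted traces
scores = {t["id"]: conciseness_evaluator(t) for t in sampled}
print(f"evaluated {len(sampled)} of {len(down_voted)} down-voted traces: {scores}")
```

Sampling keeps evaluation cost bounded when the evaluator itself is an LLM call.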

## Beta testing phase: challenges solved by LS


#### How to use LS to filter traces with negative human feedback to understand the problems behind them.
In the project > traces view, click on any of the tags displayed on the right sidebar. Those tags were created by you when you coded, for example, the THUMBS UP and THUMBS DOWN buttons of your LLM App.
* This works if the LLM App UI has something like thumbs up/down buttons: you can then filter traces by the tag attached to each button.

#### How to use LS to inspect interesting traces and how to add annotations ("human feedback") to one trace
* Click on each trace to see the trace details
    * Input
    * Latency
    * In test dataset?
    * With annotation?
* After clicking on a particular trace, click on send to annotation queue. There you can add human feedback to the trace like:
    * default LS tags:
        * correctness.
        * faithfulness.
        * conciseness.
        * context relevancy.
        * etc.
    * custom tags.
    * notes: human feedback.
* You can also annotate one individual run inside a trace.
    * For example, if you find out that the problem with a bad trace is in the retrieval step, it could be interesting to annotate the Retriever run with feedback that can help you to fix it later.
* Annotation is especially useful in a collaborative environment, where the LLM App Developer and the Subject Matter Expert can both comment on the highlighted traces.

#### How to use LS to expand the Test Dataset by adding runs as examples.
* You can add traces as examples to the Test Dataset.
* Runs are the different steps that compose a trace.

## Production phase: challenges solved by LS

#### Keep using LS as in the Beta Testing phase to keep processing and analyzing user feedback.

#### How to use LS to monitor key metrics.
* With the Monitor tab of the project.
* Key metrics: cost, number of tokens per trace, latency, etc.
* LS allows for tag and metadata grouping, which allows users to mark different versions of their applications with different identifiers and view how they are performing side-by-side within each chart. This is helpful for A/B testing changes in prompt, model, or retrieval strategy.
    * After opening the Monitor tab, click on the tag or metadata buttons that are located on the top and select the metadata or tag that you want to use to display the monitoring data.
* Apart from A/B Testing, monitoring is a great way to see whether your application in production is performing better or worse over time.

#### How to use LS to mark different versions for A/B Testing of prompts, models or retrieval strategies.
* One of the most interesting uses of monitoring is to compare the performance of different versions of your LLM App.
    * Clicking on the Metadata button and selecting one particular metadata parameter will show you how the different versions of your app behave. It allows you to A/B Test different configurations of your app (for example, different LLM models, prompts or retrieval strategies) and see how each of them affects performance metrics (cost, number of tokens per trace, latency, etc).
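
To make the metadata grouping concrete, here is a minimal sketch of the aggregation that such a chart implies. The run records, the `prompt_version` metadata key and `group_metrics` are hypothetical; in LangSmith the grouping is done for you in the Monitor tab.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records, each carrying the metadata set when the run was logged.
runs = [
    {"metadata": {"prompt_version": "A"}, "latency_s": 1.2, "total_tokens": 310},
    {"metadata": {"prompt_version": "B"}, "latency_s": 0.9, "total_tokens": 280},
    {"metadata": {"prompt_version": "A"}, "latency_s": 1.4, "total_tokens": 330},
    {"metadata": {"prompt_version": "B"}, "latency_s": 1.1, "total_tokens": 260},
]


def group_metrics(runs, key):
    """Group runs by one metadata key and average the metrics per group."""
    groups = defaultdict(list)
    for run in runs:
        groups[run["metadata"].get(key, "unknown")].append(run)
    return {
        value: {
            "runs": len(members),
            "mean_latency_s": mean(r["latency_s"] for r in members),
            "mean_tokens": mean(r["total_tokens"] for r in members),
        }
        for value, members in groups.items()
    }


summary = group_metrics(runs, "prompt_version")
for version, metrics in sorted(summary.items()):
    print(version, metrics)
```

The same grouping works for any identifier you attach to runs, such as a model name or a retrieval strategy.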

## LS: Advanced Tips

#### Advanced tip: Deploying multiple versions of the LLM app to production and monitoring performance
* You can route percentages of the traffic to different versions of your application in production.
* Or you can route all the traffic to just one production app and clone that traffic in a "shadow pipeline" with a different version.
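
Both strategies can be sketched in plain Python. This is an assumption-laden illustration, not something LangSmith provides: `bucket`, `route` and `handle_with_shadow` are hypothetical helpers, and hashing the user ID is just one common way to get a stable traffic split.

```python
import hashlib


def bucket(user_id: str) -> float:
    """Map a user ID to a stable number in [0, 1) using a hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def route(user_id: str, candidate_share: float = 0.1) -> str:
    """Send roughly `candidate_share` of users to the candidate version."""
    return "candidate" if bucket(user_id) < candidate_share else "stable"


def handle_with_shadow(question, primary, shadow, shadow_log):
    """Serve the primary answer; run the shadow version on the same input and log it."""
    answer = primary(question)
    try:
        shadow_log.append({"input": question, "shadow_output": shadow(question)})
    except Exception as exc:  # a shadow failure must never affect the user
        shadow_log.append({"input": question, "shadow_error": repr(exc)})
    return answer


share = sum(route(f"user-{i}") == "candidate" for i in range(10_000)) / 10_000
print(f"candidate share: {share:.1%}")

log = []
print(handle_with_shadow("hi", str.upper, str.lower, log), log)
```

Tagging each run with the version that produced it (for example through metadata) is what lets the Monitor tab show the versions side by side.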

#### Advanced tip: What LLM models can you use with LS?
* LS can work with any LLM model.

#### Advanced tip: Can I use LS if my LLM App is not developed using LangChain?
* Yes.
* You have different ways to use LS: via API and via SDK (Python or TypeScript).
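
As a rough sketch of the SDK route, the Python SDK exposes a `traceable` decorator that can wrap ordinary functions. The example below assumes the `langsmith` package and the tracing environment variables are set up as in the quickstart; `answer_question` is a hypothetical function, and the no-op fallback exists only so the sketch runs when the SDK is not installed.

```python
try:
    from langsmith import traceable  # decorator from the LangSmith Python SDK
except ImportError:
    def traceable(**_kwargs):
        """No-op fallback so this sketch still runs without the SDK installed."""
        def decorator(func):
            return func
        return decorator


@traceable(run_type="chain", name="answer_question")
def answer_question(question: str) -> str:
    # Hypothetical app logic; replace with a call to any model or service.
    return f"You asked: {question}"


print(answer_question("What is LangSmith?"))
```

Whether a run is actually sent depends on the tracing settings in the environment; the function's return value is the same either way.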

#### Advanced tip: Can I use LS with multi-modal apps?
* LS is starting to experiment with this. It currently works with GPT-4 Vision.

#### Advanced tip: How to deploy your project to LangServe
* This option is in beta and only available for Plus and Enterprise versions.

## LS Guides and Recommendations

#### LS Tracing: How-to Guides
* **[How to specify the % of traces you send to LS](https://docs.smith.langchain.com/tracing/faq/logging_and_viewing#setting-a-sampling-rate-for-tracing)**
* [How to add metadata and tags to traces](https://docs.smith.langchain.com/tracing/faq/customizing_trace_attributes#adding-metadata-and-tags-to-traces)
* [How to add an annotation to one particular trace](https://docs.smith.langchain.com/tracing/faq/logging_feedback#annotating-traces-with-feedback)
* [See all How-to Guides](https://docs.smith.langchain.com/tracing/faq)
* [See the LangChain-specific guides](https://docs.smith.langchain.com/tracing/faq/langchain_specific_guides)

#### LS Tracing: Use Cases Guides
* [Monitor application sentiment](https://docs.smith.langchain.com/tracing/use_cases/track-sentiment)
* [Summarize app usage](https://docs.smith.langchain.com/tracing/use_cases/summarize-usage)
* [Few-shot prompting with LS datasets](https://docs.smith.langchain.com/tracing/use_cases/few-shot-datasets)

#### LS Evaluation: Quickstart
* [Quickstart](https://docs.smith.langchain.com/evaluation/quickstart)
* [Concepts](https://docs.smith.langchain.com/evaluation/concepts)

#### LS Evaluation: How-To Guides
* [How to create custom evaluators](https://docs.smith.langchain.com/evaluation/faq/custom-evaluators)
* [How to paraphrase examples to expand your dataset](https://docs.smith.langchain.com/evaluation/faq/expand-datasets-paraphrase)
* [How to create new examples using prompting](https://docs.smith.langchain.com/evaluation/faq/expand-datasets-prompting)
* [Other How-To Guides](https://docs.smith.langchain.com/evaluation/faq)

#### LS Evaluation: Recommendations
* Test early and often
* Create domain-specific evaluators
* Use labels where possible
* Use aggregate evals
* Measure model stability
* Measure performance of subsets
* Evaluate production data
* Don't train on test datasets
* Test the model yourself
* Ask appropriate questions
* Interesting resources to learn more
* [See these recommendations in detail](https://docs.smith.langchain.com/evaluation/recommendations)

#### LS Monitoring
* [More details](https://docs.smith.langchain.com/monitoring)

#### LS Prompt Hub
* [Quickstart](https://docs.smith.langchain.com/hub/quickstart)

#### More info
* Blog post [LangChain announces general availability of LangSmith and 25M Series A led by Sequoia Capital](https://blog.langchain.dev/langsmith-ga/).
* Video [LangSmith In-Depth, by LangSmith Team](https://www.youtube.com/watch?v=3wAON0Lqviw).


## LS Next Features: Detailed View
#### Support for regression testing.
* Closer integration with CI/CD pipelines, so you can run LS tests in Github actions, or in Gitlab, etc.
* Make it very simple for people to make corrections on scores that were submitted by LLM evaluators.
#### Online evaluation on a sample of production data.
* Be able to configure an online evaluator in a very simple manner.
#### Better filtering and conversation support.
* Today you can do this with metadata filters, but in the future LS will have a simpler and more intuitive way of filtering traces from chat conversations.
#### Easy deployment of applications with hosted LangServe.
#### Enterprise features to support the administration and security needs for our largest customers.
* Permissioning. For example: who in the team has permission to make annotations? 
