Merged
6 changes: 3 additions & 3 deletions examples/Custom-LLM-as-a-Judge.ipynb
@@ -499,7 +499,7 @@
"It looks like the numeric rater scored almost 94% in total. That's not bad, but if 6% of your evals are incorrectly judged, that could make it very hard to trust them. Let's dig into the Braintrust\n",
"UI to get some insight into what's going on.\n",
"\n",
"![Partial credit](../images/Custom-LLM-as-a-Judge/Partial-Credit.gif)\n",
"![Partial credit](../images/Custom-LLM-as-a-Judge-Partial-Credit.gif)\n",
"\n",
"It looks like a number of the incorrect answers were scored with numbers between 1 and 10. However, we do not currently have any insight into why the model gave these scores. Let's see if we can\n",
"fix that next.\n"
@@ -670,11 +670,11 @@
"It doesn't look like adding reasoning helped the score (in fact, it's half a percent worse). However, if we look at one of the failures, we'll get some insight into\n",
"what the model was thinking. Here is an example of a hallucinated answer:\n",
"\n",
"![Output](../images/Custom-LLM-as-a-Judge/Output.png)\n",
"![Output](../images/Custom-LLM-as-a-Judge-Output.png)\n",
"\n",
"And the score along with its reasoning:\n",
"\n",
"![Reasoning](../images/Custom-LLM-as-a-Judge/Reasoning.png)\n"
"![Reasoning](../images/Custom-LLM-as-a-Judge-Reasoning.png)\n"
]
},
{
Binary file removed images/Custom-LLM-as-a-Judge/Classifier.png
Binary file not shown.