Min score for vector learning resources endpoint by shanbady · Pull Request #3285 · mitodl/mit-learn

shanbady · 2026-05-04T19:17:22Z

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/11079

Description (What does it do?)

This PR adds a score_cutoff parameter to the vector learning resources endpoint that controls the minimum threshold score for results that are returned when there is a query string present.

There are some quirks in how this had to be implemented to get the facet counts to behave as expected (documented here)

How this works

When the "hybrid search" admin option is enabled:

- If there is no query, pagination is enabled on the API and facet counts are fetched directly from Qdrant (the existing default behavior).
- If there is a query, the min score is in effect. Instead of paginating, we return all results above the min score for the query (plus any filters):
  - There is no pagination (all results are shown on a single page).
  - Clicking any facet applies the filter to the already-retrieved results on the frontend.
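
The two modes described above can be sketched as a small decision function. This is only an illustration of the control flow as described in this PR, not the code in `vector_search/views.py`; the names `SearchPlan` and `plan_vector_search` are hypothetical, and treating a cutoff of 0 as "no cutoff" is an assumption based on the review discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchPlan:
    """Hypothetical summary of how a single request would be served."""

    paginate: bool                    # whether limit/offset paging applies
    facets_from: str                  # "qdrant" or "returned_hits"
    score_threshold: Optional[float]  # forwarded to Qdrant when set


def plan_vector_search(q: str, score_cutoff: float) -> SearchPlan:
    """Pick the search mode for a request (illustrative only)."""
    has_query = bool(q and q.strip())
    if has_query and score_cutoff > 0:
        # Query present: return every hit above the cutoff on a single page,
        # and derive facet counts from those same hits.
        return SearchPlan(
            paginate=False,
            facets_from="returned_hits",
            score_threshold=score_cutoff,
        )
    # No query (or no effective cutoff): existing paginated behavior,
    # with facet counts fetched from Qdrant.
    return SearchPlan(paginate=True, facets_from="qdrant", score_threshold=None)
```

For example, `plan_vector_search("biology", 0.1)` yields an unpaginated plan with `score_threshold=0.1`, while `plan_vector_search("", 0.1)` keeps the paginated default.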

How can this be tested?

Check out this branch.
Make sure you have courses/learning resources embedded. If not, run `python manage.py generate_embeddings --courses --skip-contentfiles`.
Log in as an admin/superuser.
Visit the search page and enable the "hybrid search" admin option.
Try some queries. When using a query string you should get a small handful of results (those that score above the threshold). Try filtering by different facets and check that the counts line up.

github-actions · 2026-05-04T19:17:45Z

OpenAPI Changes

1 changes: 0 error, 0 warning, 1 info

Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

Copilot

Pull request overview

Adds a score_cutoff query parameter to the v0 vector learning-resources search endpoint to apply a minimum similarity threshold when a query string is present, and adjusts the Search UI behavior so facet filtering/counts remain coherent when the endpoint returns “all hits above cutoff” (no pagination).

Changes:

Backend: introduce score_cutoff -> Qdrant score_threshold, and when q + cutoff are present, bypass pagination by fetching all hits above the cutoff and recomputing facet counts from the returned hits.
Frontend: for vector searches with a query, request an “unfaceted” vector search and apply facet filtering + facet counts locally; hide pagination in this mode.
API/docs/tests: update request serializer + OpenAPI spec + generated v0 client, and add/adjust tests for the new behavior.
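
The "recompute facet counts from the returned hits" step might look roughly like the following. This is a minimal sketch under stated assumptions, not the PR's implementation: the function name `facet_counts`, the flat dict shape of a hit, and the `key`/`doc_count` bucket format are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Any, Iterable


def facet_counts(
    hits: Iterable[dict[str, Any]], facet_fields: list[str]
) -> dict[str, list[dict[str, Any]]]:
    """Count facet values over an already-retrieved set of hits."""
    counters: dict[str, Counter] = {field: Counter() for field in facet_fields}
    for hit in hits:
        for field in facet_fields:
            value = hit.get(field)
            if value is None:
                continue
            # A facet field may hold one value or a list of values.
            values = value if isinstance(value, (list, tuple, set)) else [value]
            # Count each hit at most once per distinct value.
            counters[field].update(set(values))
    # Sort buckets by descending count, then by key for a stable order.
    return {
        field: [
            {"key": key, "doc_count": count}
            for key, count in sorted(
                counter.items(), key=lambda kv: (-kv[1], str(kv[0]))
            )
        ]
        for field, counter in counters.items()
    }
```

Because the counts are derived from the same hits that are returned, they cannot disagree with what the page shows, at the cost of having to fetch every hit above the cutoff.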

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| vector_search/views.py | Implements `score_cutoff`, changes query-with-cutoff behavior to fetch all hits and recompute aggregations from returned hits. |
| vector_search/views_test.py | Updates pagination test to disable cutoff; adds test asserting cutoff bypasses pagination and forwards `score_threshold`. |
| vector_search/utils_test.py | Updates hybrid-search test expectations to account for new cutoff-driven control flow. |
| vector_search/serializers.py | Adds `score_cutoff` request field (default 0.1). |
| openapi/specs/v0.yaml | Documents `score_cutoff` in the v0 vector search endpoint. |
| frontends/main/src/page-components/SearchDisplay/SearchDisplay.tsx | Implements local facet filtering/counts + pagination suppression for vector query searches. |
| frontends/main/src/app-pages/SearchPage/SearchPage.test.tsx | Adds UI test asserting pagination hidden + local facet filtering avoids extra API calls. |
| frontends/api/src/generated/v0/api.ts | Regenerates client types/params to include `score_cutoff`. |

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

abeglova

I'm not sure how to review this. I don't think this is a good approach. Both the manual filtering in the UI and the manual facet aggregation seem not great, and it doesn't seem like there is an easy fix that keeps the current approach. The current score threshold is high enough that lots of relevant resources are just not returned if you make a query that should have lots of results, like "AI". If you lower the threshold and some queries return more results than fit on a page, the filtering does not work correctly. I think realistically we need to either do something else or rethink how facets work.

abeglova · 2026-05-06T19:20:44Z

+      return true
+    }
+    const resourceValues = getResourceFacetValues(resource, facet)
+    return selectedValues.some((value) => resourceValues.includes(value))


Why are you filtering in the UI and not the backend? I think manually filtering for facets is a bad idea in general, but at least it should be done in the backend so the results are cacheable. Also, filtering the results in Python wouldn't be fast, but at least it would be faster than filtering in the UI code.

Also, with the 0.1 cutoff I never had a query return more than 15 or so results locally, even for very broad searches like "science". The facets worked as expected, but many relevant results were not returned. When I lowered the cutoff so that some queries would return more results than fit on a page, Qdrant still limited the number of results returned, and some facets reported more results than were shown because the extra results did not fit on the first page.

I tried again with score_cutoff 0.00001 and got the expected results. I previously tried setting score_cutoff=0, which caused the backend to still page but the UI to filter.

I went this route to get the interface and facet counts to line up (regardless of the Qdrant/OpenSearch endpoint), and to work around some fundamental quirks in how Qdrant applies filtering as well as a limitation of its facet API (I left some notes on this here).

OpenSearch:

Search “Biology” -> I get a stable set of matching docs -> compute facets over that same set -> clicking a facet ("Earth Science", etc.) narrows that same set

Qdrant:

Search “Biology” -> I get the best candidates -> fuse/rank them -> cut by score/limit -> apply filters over the surviving candidates

Qdrant's facet API does not support hybrid/complex searches involving a pre-filter or passing in a score (presumably it would be just as expensive as performing the entire search and manually aggregating counts).

This was a non-issue without the score cutoff, since all searches would yield all results.

Qdrant performs neither a pre-filter nor a post-filter (do the search first, then apply the filters) the way OpenSearch does.

It traverses a filterable HNSW as it performs the search. To complicate this further, Qdrant also dynamically selects how it determines the filter counts depending on the cardinality of the field(s) being filtered.
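
A toy illustration of why facet counts computed over a stable set can disagree with the hits that survive a score cutoff. The catalog, scores, and function names below are made up for the example, and it does not model Qdrant's HNSW traversal or its facet API; it only shows the counting mismatch.

```python
from collections import Counter

# Toy catalog: (title, topic, similarity score for the query "biology").
# All values are invented for illustration.
CATALOG = [
    ("Intro to Biology", "Biology", 0.42),
    ("Marine Ecosystems", "Earth Science", 0.18),
    ("Geology Basics", "Earth Science", 0.07),
    ("Cell Chemistry", "Chemistry", 0.05),
]


def counts_over_stable_set(docs):
    """Facet counts over every candidate, ignoring the score cutoff."""
    return Counter(topic for _, topic, _ in docs)


def counts_after_cutoff(docs, score_cutoff):
    """Facet counts over only the hits that survive the score cutoff."""
    survivors = [doc for doc in docs if doc[2] >= score_cutoff]
    return Counter(topic for _, topic, _ in survivors)


stable = counts_over_stable_set(CATALOG)
cut = counts_after_cutoff(CATALOG, score_cutoff=0.1)
```

Here `stable["Earth Science"]` is 2 while `cut["Earth Science"]` is 1, so a facet count taken before the cutoff would promise a result the page never shows.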

abeglova · 2026-05-06T19:43:28Z

-                    />
-                  )}
-                />
+                {!isVectorQuerySearch && (


What is the plan for this long term? What should happen to searches that have more results than can fit on a page?

As a first step, to make it easier to iterate on vector-specific logic and interface, my plan was to properly subclass and separate out the vector-specific logic on the frontend into its own component as part of this ticket.

Long term we could consider:

- What it currently does: limit the number of results by using a threshold score when dealing with the "brute force branch" (query is present + filters are applied + there is a cutoff_score > 0). It's a non-issue if we are just dealing with the learning resources collection and the cutoff is appropriate (we are not dealing with thousands of results after applying a specific query + filter + cutoff_score). If we present the user with more than 10 results after the query + filter + cutoff_score filters, given the size of our learning resource catalog, there is likely something else wrong, IMO.

- Changing the design of the search page/facet counts: consider hiding counts once a query is present, or even altogether.

Long term I agree with what you said, "I think realistically we need to either do something else or rethink how facets work". This is a first dip (admin-only, to see how it works or doesn't).

shanbady · 2026-05-08T17:19:25Z

@abeglova This is ready for another look. I resolved the min-score = 0 issue you spotted and added an absolute hard limit for the case where we need to fetch all results without pagination.
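
The two fixes mentioned here (only forwarding the threshold when it is greater than 0, and capping the unpaginated fetch) could be sketched as below. This is a hedged illustration, not the merged code: `build_search_params` and `MAX_UNPAGINATED_RESULTS` are hypothetical names, and the value 1000 is a placeholder since the actual limit is not stated in this thread. Only `score_threshold`, `limit`, and `offset` correspond to parameters named in the PR or in Qdrant's search API.

```python
from typing import Any

# Hypothetical hard limit; the real setting name and value are not given here.
MAX_UNPAGINATED_RESULTS = 1000


def build_search_params(
    q: str,
    score_cutoff: float,
    limit: int,
    offset: int,
    max_results: int = MAX_UNPAGINATED_RESULTS,
) -> dict[str, Any]:
    """Build search keyword arguments for one request (illustrative only)."""
    params: dict[str, Any] = {}
    fetch_all = bool(q) and score_cutoff > 0
    if fetch_all:
        # Forward the threshold only when it is actually in effect.
        params["score_threshold"] = score_cutoff
        # No pagination: fetch everything above the cutoff, up to the hard cap.
        params["limit"] = max_results
        params["offset"] = 0
    else:
        # A cutoff of 0 (or no query) keeps the normal paginated request.
        params["limit"] = min(limit, max_results)
        params["offset"] = offset
    return params
```

With `score_cutoff=0` the request stays paginated and carries no `score_threshold`, which avoids the mismatch where the backend paged but the UI filtered locally.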

abeglova · 2026-05-11T20:23:14Z

+    score_cutoff = serializers.FloatField(
+        required=False,
+        default=settings.VECTOR_SEARCH_MIN_SCORE,
+        min_value=settings.VECTOR_SEARCH_MIN_SCORE,


Can you remove this from the serializer now that this is an environment variable? Or, I think the better thing to do is to add it to the search params, similar to the other admin variables.

Also why is min_value the same as default?

I was planning on factoring it out to be configurable in the next feature that builds on top of this. What do you think about setting min_value=0 and default=0.1 for now?

In terms of removing this: I think we may need to adjust this (have the frontend pass it in) depending on the interface (channel pages may need a slightly different score to be effective), just because of how this applies in Qdrant. I'm going to move this to constants instead of a variable setting.

abeglova · 2026-05-11T20:25:06Z

+          type: number
+          format: double
+          minimum: 0.1
+          default: 0.1


This will be incorrect if VECTOR_SEARCH_MIN_SCORE changes.

That's a good point. I suggested hard-coded values for now here.

abeglova

lgtm

shanbady added 9 commits April 29, 2026 10:14

adding cutoff to serializer

af1d88b

adding param to search method

f5c9a43

refactor of sync search view

523242f

ensuring all results are returned if there is a query with score_cutoff

12c2b60

ensuring all results are returned if there is a query with score_cutoff

e1be69a

spec update

f94e51e

adding frontend changes

d2c5310

fix test

1abb1a9

fix tests

4c3cb58

shanbady added the Work in Progress label May 4, 2026

shanbady changed the title ~~score_threshold for vector learning resources endpoint~~ Min score for vector learning resources endpoint May 4, 2026

shanbady added 7 commits May 5, 2026 11:35

introduce manual aggregation method

b6b35f4

re-introduce spacing at bottom

24e8865

fix facet behavior

0079288

fix lint

21c0d70

fix lint

6d7cee8

adding docstring

36d6302

refactor to not use pandas

82ae812

shanbady marked this pull request as ready for review May 5, 2026 18:13

Copilot AI review requested due to automatic review settings May 5, 2026 18:13

Copilot started reviewing on behalf of shanbady May 5, 2026 18:13

Copilot AI reviewed May 5, 2026

Comment thread vector_search/serializers.py Outdated

Comment thread vector_search/views.py Outdated

Comment thread frontends/main/src/page-components/SearchDisplay/SearchDisplay.tsx Outdated

Comment thread vector_search/views.py

shanbady and others added 7 commits May 5, 2026 14:36

Potential fix for pull request finding

67ac603

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fix facet sorting

23a1a12

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

update spec

e728300

style lint

9367bda

avoid call to aggregations

f50511b

move import to top

23dbf24

adding test for aggregation buckets

daf2add

shanbady added the Needs Review label and removed the Work in Progress label May 5, 2026

abeglova self-assigned this May 6, 2026

abeglova requested changes May 6, 2026

shanbady added 8 commits May 7, 2026 12:59

only include the threshold if greater than 0

6f27eb1

refactor min score default values. add min score to settings

ff84670

enforce limit and remove count call to fetch total

db9cab0

add max limit setting

33c5e26

add test

8f48410

udpate spec

c9ba0ee

fix test after refactor

a167157

Merge branch 'main' into shanbady/score_cuttoff-alt

671cc2f

shanbady requested a review from abeglova May 8, 2026 17:19

shanbady added 6 commits May 11, 2026 10:03

Merge branch 'main' into shanbady/score_cuttoff-alt

c24202b

set default to min score

0714153

fix test

d81384e

spec update

9a018bb

fix test

6481729

fix sorting

2e217a6

abeglova requested changes May 11, 2026

moving min score to constant

4c14d17

shanbady requested a review from abeglova May 12, 2026 15:50

abeglova approved these changes May 12, 2026

shanbady merged commit ef1dcc8 into main May 12, 2026
13 checks passed

shanbady deleted the shanbady/score_cuttoff-alt branch May 12, 2026 18:50

This was referenced May 13, 2026

Release 0.67.3 #3335

Closed

Release 0.68.0 #3339

Merged
