-
Notifications
You must be signed in to change notification settings - Fork 81
Improving the BestProximity algorithm #1
Comments
Improvement of the generated successorsI am wondering how I can express and solve a graph theory problem. I am not sure how I can express it in terms of graph and edges. The problemSo here is my problem, I have multiple nodes, a, b and c here, that can be in different positions.
A node position is composed of two parts:
A path length can be computed this way:
Examplesa[0], b[1] and c[2] gives a length of 2 because the length between a[0] and b[1] is 1 and the length between b[1] and c[2] is 1. a[0], b[1001] and c[1002] gives a length of 9 because the length between a[0] and b[1001] is 8 and the length between b[1001] and c[1002] is 1. a[3], b[3] and c[2] gives a length of 2 because the length between a[3] and b[3] is 0 and the length between b[3] and c[2] is 2. Here is a Rust code to help understandconst MAX_DISTANCE: u32 = 8;
fn index_proximity(lhs: u32, rhs: u32) -> u32 {
if lhs <= rhs {
cmp::min(rhs - lhs, MAX_DISTANCE)
} else {
cmp::min(lhs - rhs, MAX_DISTANCE) + 1
}
}
fn positions_proximity(lhs: u32, rhs: u32) -> u32 {
let (lhs_attr, lhs_index) = (lhs / 1000, lhs % 1000);
let (rhs_attr, rhs_index) = (rhs / 1000, rhs % 1000);
if lhs_attr != rhs_attr {
MAX_DISTANCE
} else {
index_proximity(lhs_index, rhs_index)
}
} What I would like to retrieveI would like to retrieve the N shortests paths the first time I call the algorithm then be able to call it again to retrieve the next shortests paths (paths that are just longer than the latest returned). I know that I will need to rewrite my how A* algorithm myself. I will be able to keep the open and closed list to retrieve the shortest paths iteratively. Once the shortest path has been found I return all paths with the same shortest length that are not in the closed list, I just need to create an Iterator where the open and closed lists are in the state of the Iterator. What do you think? How can I represent this problem?First attemptI first tried to represent this problem by generating the new path in the successors function. The start path was [a[0]] and the successors of it were:
But I am not sure it is the right way to represent the problem as the path returned by A* is now a path of paths... and I have to retrieve only the last element of the path, which represent the solution. Note that it kind of worked with dijkstra but I had to call it multiple times and return Second attemptI saw the I tried to make the successors function not return entire paths but moves. I returned a struct containing the layer at which the node was located, the position selected. So a path is only valid if the layer which that node represent is the latest one (i.e the c layer). A node can only move to the next layer, this is how the successors function works, it will output all possible moves from the given node position to all the positions of the next layer along with the cost to moving to it. There is just a strange thing: I can have multiple starts. Therefore my start node is an uninit node, when the successors function is called on it, it will produce all the nodes of the first layer. Is this correct? I always ignore the first node of the returned paths as it is an "invalid" one. This attempt work perfectly 🎉 Now I need to be able to keep the state of the algorithm to be able to iterate another time and find the second shortests paths. Here is a Rust code that work for this problemuse std::cmp;
use pathfinding::directed::astar::astar_bag;
const MAX_DISTANCE: u32 = 8;
fn index_proximity(lhs: u32, rhs: u32) -> u32 {
if lhs <= rhs {
cmp::min(rhs - lhs, MAX_DISTANCE)
} else {
cmp::min(lhs - rhs, MAX_DISTANCE) + 1
}
}
fn positions_proximity(lhs: u32, rhs: u32) -> u32 {
let (lhs_attr, lhs_index) = (lhs / 1000, lhs % 1000);
let (rhs_attr, rhs_index) = (rhs / 1000, rhs % 1000);
if lhs_attr != rhs_attr {
MAX_DISTANCE
} else {
index_proximity(lhs_index, rhs_index)
}
}
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct Node {
init: bool,
layer: usize,
position: u32,
}
fn main() {
let positions = &[
&[0, 4, 1000][..], // eminem
&[ 1, 5, ], // rap
&[ 4, 7, 1000], // god
];
let start = Node { init: false, layer: 0, position: 0 };
let (solutions, cost) = astar_bag(
&start,
|n| if !n.init {
positions[0].iter().map(|p| (Node { init: true, layer: 0, position: *p }, 0)).collect()
} else if n.layer == positions.len() - 1 {
vec![] // We reached the highest layer
} else {
let layer = n.layer + 1;
positions[layer].iter().map(|p| {
let proximity = positions_proximity(n.position, *p);
(Node { init: true, layer, position: *p }, proximity)
}).collect()
},
|_| 0,
|n| n.layer == positions.len() - 1,
)
.unwrap();
let mut solutions: Vec<_> = solutions.collect();
solutions.sort_unstable();
dbg!(cost, solutions);
} |
Another thing to notice is that I see way better performances for "real kylie minogue lives" (398ms) than with "minogue kylie lives real" (1267ms). |
Outdated, we now use a word pair proximity docids database combined with a custom depth first search (called mana DFS), this database stores a key of the form |
338: Fix string fields sorting r=Kerollmops a=shekhirin Resolves #333 <details> <summary>curl checks</summary> ```console ➜ ~ curl -s 'localhost:7700/indexes/movies/search' -d '{"sort": ["title:asc"], "limit": 30}' | jq -r '.hits | map(.title)[]' #1 Cheerleader Camp #Horror #RealityHigh #Roxy #SquadGoals $5 a Day $9.99 '71 (2) (500) Days of Summer (Girl)Friend *batteries not included ...And God Created Woman ...And Justice for All ...E fuori nevica! .45 1 1 Mile To You 1 Night 10 10 Cloverfield Lane 10 giorni senza mamma 10 Items or Less 10 Rillington Place 10 Rules for Sleeping Around 10 Things I Hate About You 10 to Midnight 10 Years 10,000 BC 10,000 Saints ➜ ~ curl -s 'localhost:7700/indexes/movies/search' -d '{"sort": ["title:desc"], "limit": 30}' | jq -r '.hits | map(.title)[]' 크게 될 놈 왓칭 뷰티플 마인드 노무현과 바보들 ハニー Счастье – это… Часть 2 СОТКА Смотри мою любовь Позивний 'Бандерас' Лошо момиче Күлүк Хомус Куда течет море Каникулы президента Ακίνητο Ποτάμι Üç Harfliler: Beddua È nata una Star? Æon Flux Ága À propos de Nice À Nos Amours À l'aventure ¡Three Amigos! Zulu Dawn Zulu Zulu Zu: Warriors from the Magic Mountain Zu Warriors Zorro Zorba the Greek Zootopia ``` </details> Co-authored-by: Alexey Shekhirin <a.shekhirin@gmail.com>
The best proximity algorithm is a system that returns the best proximities between positions of the different query words.
How the engine works
Consider that the search engine have indexed many documents and those documents contain the words "eminem", "rap" and "god". The documents ids will be stored under each of the words apparition under the positions they appear.
If a user searches for "eminem rap god" then this is what the engine will be able to retrieve from the key-value store, a list associated to the query words.
In the example above the engine shows that the word "eminem" appears in many different positions in the documents: in the first attribute at index 0 but also in position 2 and 4 (i.e. 0, 2 and 4). it also appears in the second attribute at index 0 and 1 (i.e. 1000 and 1001).
These numbers doesn't represent documents ids. Once the engine have found the best proximity between these positions it retrieves the documents associated to these words at these exact positions. For example the word "eminem" at position 1000 can contain documents 30, 32, 35 and 72.
Splitting documents ids related to a given word reduces the number of documents ids to do intersections on and allows to return the best documents for a given query (by proximity and position) on the fly.
Once these documents ids have been retrieved the engine will do a simple intersection between them to know which documents are concerned.
Improve the generated successors
We use a function that produces successors of a given path between the differents words positions. This successors function does not known if a positions combination (an intersection) gives documents or not, we must introduce this feature to avoid producing successors paths when not documents will be considered.
Keep the graph exploration algorithm state between iterations
We execute a graph exploration algorithm (dikjstra, A-star) to retrieve the minimum proximity we can find between the different words and positions the engine retrieved for the user query.
This algorithm will return the best path related to its computed proximity but if not enough documents results from this execution we execute it a second time by ignoring paths that we have already seen in the previous iteration. Doing so means that we droped the whole data structures we filled from the previous iteration to re-explore the same nodes, this is wasted time. We must keep the open closed list.
The text was updated successfully, but these errors were encountered: