TODO added task about indexing deep objects using lifti
h0lg committed Jan 17, 2023
1 parent 4a476f6 commit 9870b12
Showing 2 changed files with 20 additions and 2 deletions.
1 change: 1 addition & 0 deletions CacheModels.cs
@@ -52,6 +52,7 @@ public sealed class CaptionTrack
/// <summary>The <see cref="Video.Id"/>. Needs to be set before indexing to generate a valid <see cref="Key"/>.</summary>
internal string VideoId { private get; set; }

// I currently work around it like this: Here I concatenate the VideoId and LanguageName to form the index key
/// <summary>Used for indexing. Contains <see cref="VideoId"/> and <see cref="LanguageName"/>
/// separated by <see cref="MultiPartKeySeparator"/> to identify the matched video and caption track.</summary>
internal string Key => VideoId + MultiPartKeySeparator + LanguageName;
21 changes: 19 additions & 2 deletions VideoIndex.cs
@@ -33,8 +33,22 @@ internal VideoIndexRepository(string directory)
.WithField(nameof(Video.Title), v => v.Title)
.WithField(nameof(Video.Keywords), v => v.Keywords)
.WithField(nameof(Video.Description), v => v.Description))
/* TODO How can I index the nested caption tracks while
- still being able to identify the matched one by language?
- supporting field restrictions?
There doesn't seem to be an API for "deep-indexing" an object.
This is how I'd imagine it:
.WithNested(nameof(Video.CaptionTracks),
v => v.CaptionTracks, // accessor for enumerable property
trackBuilder => trackBuilder // tokenization builder for nested CaptionTrack
.WithKey(track => track.LanguageName) // identifies the nested CaptionTrack in the context of a Video
.WithField(nameof(CaptionTrack.Captions), t => t.GetFullText())) // the deep field to index
*/

// I currently work around it like this: Here I configure the object tokenization of the nested CaptionTracks
.WithObjectTokenization<CaptionTrack>(itemOptions => itemOptions
.WithKey(t => t.Key) // using a composite key that also identifies the parent Video
.WithField(nameof(CaptionTrack.Captions), t => t.GetFullText()))
.WithQueryParser(o => o.WithFuzzySearchDefaults(
maxEditDistance: termLength => (ushort)(termLength / 3),
@@ -104,6 +118,7 @@ internal async Task AddAsync(Video video, CancellationToken cancellation)

foreach (var track in video.CaptionTracks)
{
// I currently work around it like this: Here I set the VideoId on the track before indexing
track.VideoId = video.Id; // set for indexing
await Index.AddAsync(track);
}
@@ -141,16 +156,18 @@ internal async Task AddAsync(Video video, CancellationToken cancellation)
var matches = results
.Select(result =>
{
// I currently work around it like this: Here I split up the Video.Id and CaptionTrack.LanguageName again
var ids = result.Key.Split(CaptionTrack.MultiPartKeySeparator);
var videoId = ids[0];
var language = ids.Length > 1 ? ids[1] : null;
return new { videoId, language, result };
})
// make sure to only return results for the requested videos if specified; index may contain more
.Where(m => relevantVideos == default || relevantVideos.Keys.Contains(m.videoId))
.GroupBy(m => m.videoId) // and then group by Video.Id
.Select(group => new
{
// to get all results in one match
VideoId = group.Key,
InMetaData = group.SingleOrDefault(m => m.language == null)?.result,
InCaptions = group.Where(m => m.language != null),

4 comments on commit 9870b12

@h0lg h0lg commented on 9870b12 Jan 28, 2023

@mikegoatly You can use LIFTI to search the subtitles and other metadata of YouTube videos now - thanks also to you for being so responsive :)
I was wondering whether you have any ideas about how best to index object graphs and could have a brief look at my implementation. Can this be done more elegantly without encoding composite IDs like you see me do above? Thanks!
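
For context, the composite-ID workaround asked about here amounts to the round trip sketched below. This is a stand-alone illustration, not the repository's code: the `CompositeKey` helper and the separator value `#` are assumptions, since the commit only shows the name `MultiPartKeySeparator` and not its value.

```csharp
using System;
using System.Linq;

public static class CompositeKey
{
    // Assumed separator value; it must never occur inside a video ID.
    public const char Separator = '#';

    // Caption tracks are indexed under "<videoId><separator><languageName>".
    public static string Encode(string videoId, string languageName)
        => videoId + Separator + languageName;

    // Splits a result key back into its parts. A key without a separator
    // belongs to the video's own metadata, so Language is null.
    public static (string VideoId, string Language) Decode(string key)
    {
        var parts = key.Split(Separator, 2);
        return (parts[0], parts.Length > 1 ? parts[1] : null);
    }

    public static void Main()
    {
        var resultKeys = new[]
        {
            "abc",                      // match in video metadata
            Encode("abc", "English"),   // match in a caption track
            Encode("abc", "German"),
            Encode("xyz", "English"),
        };

        // Group all matches by video, as the search code in the diff does.
        foreach (var group in resultKeys.Select(Decode).GroupBy(k => k.VideoId))
        {
            var languages = group.Where(k => k.Language != null).Select(k => k.Language);
            Console.WriteLine($"{group.Key}: {string.Join(", ", languages)}");
        }
        // prints "abc: English, German" and then "xyz: English"
    }
}
```

The cost of this approach is that every consumer of a search result has to know the encoding, which is what motivates the question.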

@mikegoatly mikegoatly commented on 9870b12 Jan 31, 2023

@h0lg would it make sense to treat the nested chunks of data as some sort of dynamic field associated to the parent object? It seems to me that what you want is something like this:

.WithDynamicFields(
    v => v.CaptionTracks, // The set of nested objects to treat as dynamic fields
    ct => ct.LanguageName, // A delegate to read the name of the field from each nested object
    ct => ct.GetFullText()) // A delegate to read the text for each nested object

That way a property in the nested object can be used as the field name (or derived from it to have some sort of differentiation from other fields, e.g. $"Language_{ct.LanguageName}")

I'm not sure of the method name, but does that make sense?
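
To make the idea concrete without relying on any LIFTI API, here is a plain C# sketch of what "treating nested objects as dynamic fields of the parent" could mean. Everything in it is an assumption for illustration: `Flatten` is a hypothetical helper, and `Video` and `CaptionTrack` are minimal stand-ins for the types in the commit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-ins for the commit's types; only the members used here.
public sealed class CaptionTrack
{
    public string LanguageName { get; set; }
    public string FullText { get; set; }
    public string GetFullText() => FullText;
}

public sealed class Video
{
    public string Id { get; set; }
    public List<CaptionTrack> CaptionTracks { get; } = new List<CaptionTrack>();
}

public static class DynamicFieldSketch
{
    // Hypothetical helper (not part of LIFTI): turns each nested object into
    // one named field of its parent. Throws if two children yield the same name.
    public static Dictionary<string, string> Flatten<TParent, TChild>(
        TParent parent,
        Func<TParent, IEnumerable<TChild>> children,
        Func<TChild, string> fieldName,
        Func<TChild, string> fieldText)
        => children(parent).ToDictionary(fieldName, fieldText);

    public static void Main()
    {
        var video = new Video { Id = "abc" };
        video.CaptionTracks.Add(new CaptionTrack { LanguageName = "English", FullText = "hello world" });
        video.CaptionTracks.Add(new CaptionTrack { LanguageName = "German", FullText = "hallo welt" });

        // The field name carries the language, so the parent's key can stay the plain video ID.
        var fields = Flatten(
            video,
            v => v.CaptionTracks,
            ct => $"Language_{ct.LanguageName}",
            ct => ct.GetFullText());

        foreach (var field in fields)
            Console.WriteLine($"{video.Id} / {field.Key}: {field.Value}");
    }
}
```

With fields named this way, a match would identify the video by its key and the caption track by the field name, so no composite key needs to be parsed.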

@mikegoatly

Tracking this in issue #66

@h0lg h0lg commented on 9870b12 Feb 6, 2023

@mikegoatly Thanks for the response. I'll have a look at mikegoatly/lifti#66!

> @h0lg would it make sense to treat the nested chunks of data as some sort of dynamic field associated to the parent object?

Yeah, that API looks like it would be sufficient for what I'm trying to do. As far as I understand, your approach is to flatten selected properties of nested objects or collections into the owner as named fields instead of creating a builder for the entire nested object.

> That way a property in the nested object can be used as the field name (or derived from it to have some sort of differentiation from other fields, e.g. $"Language_{ct.LanguageName}")

I hadn't thought of writing field-specific queries for a specific item of a nested collection. But yeah, LGTM.

My only doubt is whether other users would want to map more than one property of a nested object or collection. At that point a separate builder would start making sense, to avoid repeating the same navigation properties for each mapped nested property. But that's not a real requirement for me at the moment, and I can understand if you don't want to design for it and would rather keep it simple.
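
A rough sketch of what such a separate builder might look like, in plain C# and independent of LIFTI. All names are hypothetical: `NestedBuilder`, the `Track` stand-in, and its second property `Url` exist only for illustration. The point is that the key accessor is stated once and shared by every mapped property.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for a nested object with more than one property worth indexing.
public sealed class Track
{
    public string LanguageName { get; set; }
    public string Captions { get; set; }
    public string Url { get; set; } // hypothetical second property
}

// Hypothetical builder: names the child key once, then maps any number of fields.
public sealed class NestedBuilder<TChild>
{
    private readonly Func<TChild, string> key;
    private readonly List<KeyValuePair<string, Func<TChild, string>>> fields =
        new List<KeyValuePair<string, Func<TChild, string>>>();

    public NestedBuilder(Func<TChild, string> key) => this.key = key;

    public NestedBuilder<TChild> WithField(string name, Func<TChild, string> read)
    {
        fields.Add(new KeyValuePair<string, Func<TChild, string>>(name, read));
        return this;
    }

    // Yields one "<childKey>.<fieldName>" entry per child and mapped field.
    public IEnumerable<KeyValuePair<string, string>> ReadFields(IEnumerable<TChild> children) =>
        from child in children
        from field in fields
        select new KeyValuePair<string, string>($"{key(child)}.{field.Key}", field.Value(child));
}

public static class NestedBuilderDemo
{
    public static void Main()
    {
        var tracks = new List<Track>
        {
            new Track { LanguageName = "English", Captions = "hello world", Url = "https://example.com/en" },
        };

        var builder = new NestedBuilder<Track>(t => t.LanguageName)
            .WithField(nameof(Track.Captions), t => t.Captions)
            .WithField(nameof(Track.Url), t => t.Url);

        foreach (var entry in builder.ReadFields(tracks))
            Console.WriteLine($"{entry.Key}: {entry.Value}");
    }
}
```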
