Search Google developer videos
With minor tweaking, the app and API can be used to build search for any YouTube channel with manually captioned videos.
For those who prefer to access information by reading text rather than watching videos, the app provides downloadable transcripts:
The transcripts have Google Translate built in, so you can choose to read them in a different language. Caption highlighting is synchronised with video playback, and you can tap or click on any part of a transcript to navigate through the video.
Search for something: simpl.info/s
Readable transcripts for one or more videos: simpl.info/s/t?id=ngBy0H_q-GY,3i9WFgMuKHs
Link to a query: simpl.info/s?q=breakpoint
Data and transcript for a video: shearch.me/2UKPRbrw3Kk
Search any field for a query, spaces OK — can be a bit slow: shearch.me?q=http 203
More shortcuts: c for captions, s for speaker — speakers are parsed from transcript: shearch.me?c=svg&s=alex
Specify ranges for commentCount, dislikeCount, favoriteCount, likeCount, viewCount: shearch.me?speaker=Jake&viewCount>10000
Use any of these values to specify order: shearch.me?speaker=Jake&viewCount>10000&sort=viewCount
Add a hyphen for descending order: shearch.me?speaker=Jake&viewCount>10000&sort=-viewCount
Spaces are OK: shearch.me?speakers=Reto Meier&title=Android
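Query strings like the ones above can also be assembled in code. The following is a minimal sketch rather than part of the project: `buildQuery` and its option names (`match`, `min`, `sort`, `descending`) are hypothetical, and only the parameter syntax (`field=value`, `field>number`, `sort=-field`) is taken from the examples. Spaces are left as they are, on the assumption that the API accepts them as shown.

```javascript
// Hypothetical helper: builds a shearch.me query string from plain objects.
// The function and option names are assumptions made for illustration.
function buildQuery({ match = {}, min = {}, sort, descending = false } = {}) {
  const parts = [];

  // Field matches, for example speakers=Reto Meier
  for (const [field, value] of Object.entries(match)) {
    parts.push(`${field}=${value}`);
  }

  // Lower bounds on quantity fields, for example viewCount>10000
  for (const [field, amount] of Object.entries(min)) {
    parts.push(`${field}>${amount}`);
  }

  // A leading hyphen on the sort field requests descending order
  if (sort) {
    parts.push(`sort=${descending ? '-' : ''}${sort}`);
  }

  return `shearch.me?${parts.join('&')}`;
}

console.log(buildQuery({
  match: { speaker: 'Jake' },
  min: { viewCount: 10000 },
  sort: 'viewCount',
  descending: true,
}));
// shearch.me?speaker=Jake&viewCount>10000&sort=-viewCount
```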
More complex stuff works too:
shearch.me?(title=Android Wear|description=Android Wear)&speakers=Reto
shearch.me?(title=Android Wear|description=Android Wear)&speakers=[Reto,Wayne]
shearch.me?title="Android Wear"|title=WebRTC
shearch.me?(title=Android Wear|description=Android Wear)&speakers=Timothy
Fuzzy matching — with apologies to Wayne :): shearch.me?speakers=pekarsky~
For dates, use 'from' and 'to', which can cope with anything Date can handle:
shearch.me?from=Feb // assumes text-only is a month this year
shearch.me?from=April 2014
shearch.me?from=2013-03-01&to=2013-05-01
shearch.me?from=2013&to=2014 // midnight, 1 January to midnight, 1 January
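As an illustration of the date handling described above, here is a rough sketch of how a 'from' or 'to' value might be interpreted. It is not the project's code: `parseDateParam` and the month table are hypothetical. A bare month name is treated as the first day of that month in the current year, and anything else is passed to the `Date` constructor.

```javascript
const MONTHS = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
  'august', 'september', 'october', 'november', 'december'];

// Hypothetical sketch: interpret a 'from' or 'to' query value.
// 'now' is a parameter so the current year can be fixed when testing.
function parseDateParam(text, now = new Date()) {
  const trimmed = text.trim();

  // Text-only input of three or more letters: try to read it as a month name
  if (/^[a-z]{3,}$/i.test(trimmed)) {
    const lower = trimmed.toLowerCase();
    const month = MONTHS.findIndex((name) => name.startsWith(lower));
    if (month !== -1) {
      return new Date(now.getFullYear(), month, 1);
    }
  }

  // Everything else is left to the Date constructor
  const date = new Date(trimmed);
  if (Number.isNaN(date.getTime())) {
    throw new RangeError(`Cannot parse date: ${text}`);
  }
  return date;
}

console.log(parseDateParam('2013-03-01').toISOString()); // 2013-03-01T00:00:00.000Z
```

Note that JavaScript parses ISO date-only strings such as 2013-03-01 or 2013 as UTC, whereas the month-name branch here uses local time, and parsing of non-ISO strings such as April 2014 depends on the JavaScript engine.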
Get total for any quantity field — this query returns the total number of views for all videos: shearch.me?count=views
Get total for any query and quantity field: shearch.me?speakers=butcher&count=views
Get all individual values for any quantity field for all videos — returns an object keyed by amounts, values are number of occurrences for each amount: shearch.me?countall=views
Get all individual values for any quantity field for any query: shearch.me?speakers=reto&countall=views
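To make the difference between count and countall concrete, here is a minimal in-memory sketch of the two results. It is not the API's implementation: the function names and the `viewCount` sample data are assumptions for illustration only.

```javascript
// Hypothetical equivalents of count and countall for an array of video objects.

// count: the total of one quantity field across all videos
function count(videos, field) {
  return videos.reduce((total, video) => total + (video[field] || 0), 0);
}

// countall: an object keyed by amount, whose values are the number of
// videos that have exactly that amount
function countAll(videos, field) {
  const tally = {};
  for (const video of videos) {
    const amount = video[field];
    if (typeof amount !== 'number') continue; // skip videos without the field
    tally[amount] = (tally[amount] || 0) + 1;
  }
  return tally;
}

const sample = [{ viewCount: 120 }, { viewCount: 45 }, { viewCount: 120 }];
console.log(count(sample, 'viewCount'));    // 285
console.log(countAll(sample, 'viewCount')); // 120 occurs twice, 45 once
```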
Build a chart from results (views for videos that mention 'Chrome'): simpl.info/s/chart.html
Issues and pull requests welcome.
There are three code directories:
Middle layer Node app to get data from the database. For testing, you can run this locally with the app running from localhost. The live version is on Nodejitsu at shearch.me, for queries like this: shearch.me?captions=svg&speaker=alex (same as shearch.me?c=svg&s=alex).
Why didn't you use Firebase?
Cloudant has Lucene search built in, and is based on CouchDB, which is easy to use from Node.
Firebase can now be used with Elasticsearch, but at the start of the project this required extra installation.
Why didn't you just use MySQL or …
An SQL database with Lucene for full text search might have been more appropriate than CouchDB.
(This kind of search is actually much easier with Firebase now.)
How was CouchDB?
Problems came with full text search:
- Full text search is not built into CouchDB, though it can be added on with Lucene or other search engines.
- CouchDB searches return entire documents, with no 'partial' results. (In my case, a document represents all data for a video.) So, for example, to return only captions that include 'Android Wear', it's necessary to retrieve all the documents (in their entirety) that have captions that mention 'Android Wear' then filter.
- CouchDB search queries cannot be combined: for example, 'get me all videos from 2013 with WebRTC in the title'. So, again, you have to add your own filter.
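The two workarounds described above (filtering captions out of whole documents, and combining criteria after the search) can be sketched roughly as follows. The document shape used here (`title`, `publishedAt`, `captions` with `start` and `text`) is an assumption for illustration, not the project's actual schema, and the function names are hypothetical.

```javascript
// Assumed document shape (hypothetical):
// { title: string, publishedAt: ISO date string,
//   captions: [{ start: number, text: string }] }

// Keep only the captions that mention the phrase, dropping documents with none.
function matchingCaptions(docs, phrase) {
  const needle = phrase.toLowerCase();
  return docs
    .map((doc) => ({
      title: doc.title,
      captions: doc.captions.filter((c) => c.text.toLowerCase().includes(needle)),
    }))
    .filter((doc) => doc.captions.length > 0);
}

// Combine two criteria in application code:
// published in a given year AND title contains some text.
function fromYearWithTitle(docs, year, titleText) {
  const needle = titleText.toLowerCase();
  return docs.filter((doc) =>
    new Date(doc.publishedAt).getUTCFullYear() === year &&
    doc.title.toLowerCase().includes(needle));
}
```

Both functions run over documents that have already been retrieved in full, which is the cost the list above describes.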
How big is the database?
Around 250MB, but more like 150MB without transcripts: the transcript for each document is really just a convenience to make it quick and simple to retrieve human readable transcripts, and replicates the captions (with a few tweaks).
How often is the data updated?
At present the database is updated manually to avoid code changes breaking it.
Why didn't you use io.js?
No big reason. Node.js has been around longer.
How many videos have transcripts?
When the repo was created: 4312 videos, 3550 with transcripts.
How did you get the speaker names?
With a bit of sneaky regexing these are parsed from transcripts. NB: speaker names are not parsable for many captions, so speaker search results may not always be complete.
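For a flavour of what this kind of parsing can look like, here is a minimal sketch. It is not the regex the project uses: it assumes, purely for illustration, that a transcript labels speakers in upper case followed by a colon (for example `JAKE ARCHIBALD: Hello`). The function names are hypothetical.

```javascript
// Convert 'JAKE ARCHIBALD' to 'Jake Archibald'.
function toTitleCase(name) {
  return name
    .toLowerCase()
    .replace(/(^|[\s'-])([a-z])/g, (whole, sep, ch) => sep + ch.toUpperCase());
}

// Hypothetical sketch: collect unique speaker names from a transcript,
// assuming labels of the form 'UPPER CASE NAME:'.
function extractSpeakers(transcript) {
  // One or more upper-case words (two or more characters each), then a colon
  const pattern = /\b([A-Z][A-Z'-]+(?: [A-Z][A-Z'-]+)*):/g;
  const speakers = new Set();
  for (const match of transcript.matchAll(pattern)) {
    speakers.add(toTitleCase(match[1]));
  }
  return [...speakers];
}

console.log(extractSpeakers('JAKE ARCHIBALD: Hi. PAUL LEWIS: Hello. JAKE ARCHIBALD: Right.'));
// [ 'Jake Archibald', 'Paul Lewis' ]
```

A pattern like this misses transcripts that label speakers differently or not at all, and it can mistake upper-case words such as `NOTE:` for names, which is consistent with speaker results being incomplete.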
Why are caption matches returned as span elements?
The primary use for the caption matches is within HTML markup. Returning JSON for each span might be neater and less verbose, but for most apps that would entail extra effort transforming to HTML.
How long does it take to store and index data?
This depends a lot on connectivity. From work, the app gets and inserts the video data and transcripts in under three minutes. From home, it takes about 10 minutes.
Indexing takes about 10 minutes.
What build tools do you use?
JSCS and JSHint, run with Grunt and Git hooks to force validation on commit.
- General code refactoring.
- Unit tests.
- Better error handling.
- Better Node socket handling: a lot of the code is deliberately synchronous to avoid errors.
- The shearch.me API is HTTP only as yet.
- Use the official YouTube Captions API.
- Move to Firebase. When the project started it was a bit tricky to implement full-text search with Firebase, so Cloudant was chosen (which has full-text search built in). It's now pretty simple to use Firebase with Elasticsearch, so the data will be ported at some stage.
- Database updates are done manually at the moment — mostly to avoid messing up the sample app. Easily automated.
Copyright 2015 Google, Inc.
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Please note: this is not a Google product.