URL previewing support #688

Merged
merged 44 commits into from Apr 11, 2016

Conversation

Owner

ara4n commented Apr 3, 2016

  • add design sketch doc for a URL preview API
  • add a new SpiderHttpClient derived from SimpleHttpClient, which follows redirects and handles gzip CTE correctly
  • add get_file support to SimpleHttpClient, knowingly duplicated for now from matrixfederationclient.
  • add a preview_url_resource to implement the new media/r0/preview_url API. This:
    • spiders a given URL, extracting or synthesising OpenGraph metadata for it via lxml, returning the metadata as a JSON blob
    • returns a cache (either from on disk or in memory) for the metadata of the URL as of the requested point in time, if available.
    • deduplicates requests to spider a URL such that only one req is in flight at any point.
  • adds support for thumbnailing SVGs (by just passing back the original image for now)
  • adds the local_media_repository_url_cache table to the DB for the on-disk URL cache
  • adds get_url_cache and store_url_cache to media_repository.py to wrap the new table

N.B. following redirects will not work correctly until https://twistedmatrix.com/trac/ticket/8265 is merged. Unsure if it's worth maintaining our own Twisted fork until then.

Given I'm hardly a python/twisted expert, review would be particularly appreciated on:

  • Whether I should be doing anything with log contexts - I've seen some worrying warnings about log contexts not being preserved
  • Whether I should be doing anything smarter with the types or flavours of exceptions I raise. In particular, some exceptions from the async_get method seem to lose their stack traces entirely - I haven't spotted the pattern yet.

This is part of a set of PRs spanning vector-web, matrix-react-sdk, matrix-js-sdk and synapse.
See also vector-im/vector-web#1343 and matrix-org/matrix-react-sdk#260 and matrix-org/matrix-js-sdk#122
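For orientation, the new endpoint takes the target url and an optional ts (as-of timestamp in milliseconds) as query parameters and returns the OpenGraph metadata as JSON. A minimal client-side sketch of constructing such a request (the homeserver hostname and timestamp are hypothetical placeholders, and auth is omitted):

```python
from urllib.parse import urlencode

# Hypothetical homeserver base URL; the endpoint path comes from the PR.
base = "https://matrix.example.com/_matrix/media/r0/preview_url"
params = {
    "url": "https://matrix.org/blog/",  # page to preview
    "ts": 1459641600000,                # optional "as of" timestamp in ms
}
request_url = base + "?" + urlencode(params)
print(request_url)
```

The server answers with a JSON blob of og:* keys (og:title, og:description, og:image and friends), served from the in-memory or on-disk cache when a recent enough copy exists.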

ara4n added some commits Jan 24, 2016

initial WIP of a tentative preview_url endpoint - incomplete, untested, experimental, etc. just putting it here for safekeeping for now

@ara4n ara4n referenced this pull request in vector-im/riot-web Apr 3, 2016

Closed

Embed URL previews using opengraph tags and similar #184

@NegativeMjark NegativeMjark commented on an outdated diff Apr 4, 2016

synapse/rest/media/v1/preview_url_resource.py
+            del og["og:image"]
+
+        if 'og:description' not in og:
+            meta_description = tree.xpath(
+                "//*/meta"
+                "[translate(@name, 'DESCRIPTION', 'description')='description']"
+                "/@content")
+            if meta_description:
+                og['og:description'] = meta_description[0]
+            else:
+                # text_nodes = tree.xpath("//h1/text() | //h2/text() | //h3/text() | "
+                #                         "//p/text() | //div/text() | //span/text() | //a/text()")
+                text_nodes = tree.xpath("//text()[not(ancestor::header | ancestor::nav | "
+                                        "ancestor::aside | ancestor::footer | "
+                                        "ancestor::script | ancestor::style)]" +
+                                        "[ancestor::body]")
@NegativeMjark

NegativeMjark Apr 4, 2016

Contributor

Maybe add a comment explaining what this is doing and why.
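For context on the first xpath: translate(@name, 'DESCRIPTION', 'description') is XPath 1.0's idiom for case-insensitive attribute matching, since XPath 1.0 has no lower-case() function. A rough stdlib-only sketch of the same idea, without lxml (the sample HTML is illustrative):

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Collects the content of <meta name="description"> tags, matching the
    name attribute case-insensitively -- the same effect the XPath
    translate(@name, 'DESCRIPTION', 'description') trick achieves."""
    def __init__(self):
        super().__init__()
        self.descriptions = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)  # attribute *names* are already lowercased by HTMLParser
        if d.get("name", "").lower() == "description":
            self.descriptions.append(d.get("content", ""))

parser = MetaDescriptionParser()
parser.feed('<html><head><meta NAME="Description" content="hello"></head></html>')
print(parser.descriptions)  # → ['hello']
```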

@NegativeMjark NegativeMjark and 1 other commented on an outdated diff Apr 4, 2016

synapse/rest/media/v1/preview_url_resource.py
+
+            # store OG in history-aware DB cache
+            yield self.store.store_url_cache(
+                url,
+                media_info["response_code"],
+                media_info["etag"],
+                media_info["expires"],
+                json.dumps(og),
+                media_info["filesystem_id"],
+                media_info["created_ts"],
+            )
+
+            respond_with_json_bytes(request, 200, json.dumps(og), send_cors=True)
+        except:
+            # XXX: if we don't explicitly respond here, the request never returns.
+            # isn't this what server.py's wrapper is meant to be doing for us?
@NegativeMjark

NegativeMjark Apr 4, 2016

Contributor

This shouldn't be necessary.

@ara4n

ara4n Apr 7, 2016

Owner

empirically it is... unsure how to proceed on this one.

@ara4n

ara4n Apr 8, 2016

Owner

ah, the problem was that I had been returning rather than raising in the async GET when things went wrong, I think. Now fixed.

@NegativeMjark NegativeMjark commented on the diff Apr 4, 2016

synapse/rest/media/v1/preview_url_resource.py
+    def _rebase_url(self, url, base):
+        base = list(urlparse(base))
+        url = list(urlparse(url))
+        if not url[0]:  # fix up schema
+            url[0] = base[0] or "http"
+        if not url[1]:  # fix up hostname
+            url[1] = base[1]
+        if not url[2].startswith('/'):
+            url[2] = re.sub(r'/[^/]+$', '/', base[2]) + url[2]
+        return urlunparse(url)
+
+    @defer.inlineCallbacks
+    def _download_url(self, url, user):
+        # TODO: we should probably honour robots.txt... except in practice
+        # we're most likely being explicitly triggered by a human rather than a
+        # bot, so are we really a robot?
@NegativeMjark

NegativeMjark Apr 4, 2016

Contributor

We need to restrict which hosts the GET requests are sent to. Otherwise we have a bit of a security problem.
That probably involves applying an IP blacklist after resolving DNS.
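A minimal sketch of the kind of post-resolution check being suggested here (the ranges below are illustrative internal networks, not what the PR eventually ships as url_preview_ip_range_blacklist):

```python
import ipaddress

# Illustrative internal ranges an admin might blacklist; the real
# url_preview_ip_range_blacklist is deployment configuration.
BLACKLIST = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("169.254.0.0/16"),
]

def is_blacklisted(ip_str):
    """Check a *resolved* IP address against the blacklist. This must run
    after DNS resolution, otherwise a hostname that resolves to an internal
    address (DNS rebinding) slips straight past a hostname-level check."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in BLACKLIST)

print(is_blacklisted("127.0.0.1"))     # → True (loopback)
print(is_blacklisted("198.51.100.7"))  # → False (public TEST-NET-2 address)
```

This is why the check ends up implemented in a custom endpoint wrapped around the connection, rather than as URL string matching.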

@NegativeMjark NegativeMjark commented on an outdated diff Apr 4, 2016

synapse/storage/media_repository.py
@@ -50,6 +50,59 @@ def store_local_media(self, media_id, media_type, time_now_ms, upload_name,
desc="store_local_media",
)
+    def get_url_cache(self, url, ts):
+        """Get the media_id and ts for a cached URL as of the given timestamp
+        Returns:
+            None if the URL isn't cached.
+        """
+        def get_url_cache_txn(txn):
+            # get the most recently cached result (relative to the given ts)
+            sql = (
+                "SELECT response_code, etag, expires, og, media_id, max(download_ts)"
@NegativeMjark

NegativeMjark Apr 4, 2016

Contributor

You probably want to be doing ORDER BY download_ts DESC LIMIT 1 rather than max(download_ts)
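The underlying issue: in standard SQL, selecting bare columns alongside an aggregate like max() leaves those columns coming from an unspecified row (SQLite happens to pick the max row as a special case; PostgreSQL rejects the query outright). A small sqlite3 sketch of the suggested form, using an illustrative cut-down table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url_cache (url TEXT, og TEXT, download_ts INTEGER)")
conn.executemany(
    "INSERT INTO url_cache VALUES (?, ?, ?)",
    [("http://example.com", "old", 100),
     ("http://example.com", "new", 200)],
)

# ORDER BY ... LIMIT 1 pins every selected column to one well-defined row,
# unlike "SELECT og, max(download_ts) ..." where og is unspecified in
# standard SQL once an aggregate enters the select list.
row = conn.execute(
    "SELECT og, download_ts FROM url_cache"
    " WHERE url = ? AND download_ts <= ?"
    " ORDER BY download_ts DESC LIMIT 1",
    ("http://example.com", 250),
).fetchone()
print(row)  # → ('new', 200)
```

The min(download_ts) query a few lines further down has the same problem and the same fix (ORDER BY download_ts ASC LIMIT 1).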

@NegativeMjark NegativeMjark commented on an outdated diff Apr 4, 2016

synapse/storage/media_repository.py
+            # get the most recently cached result (relative to the given ts)
+            sql = (
+                "SELECT response_code, etag, expires, og, media_id, max(download_ts)"
+                " FROM local_media_repository_url_cache"
+                " WHERE url = ? AND download_ts <= ?"
+            )
+            txn.execute(sql, (url, ts))
+            row = txn.fetchone()
+
+            if not row[3]:
+                # ...or if we've requested a timestamp older than the oldest
+                # copy in the cache, return the oldest copy (if any)
+                sql = (
+                    "SELECT response_code, etag, expires, og, media_id, min(download_ts)"
+                    " FROM local_media_repository_url_cache"
+                    " WHERE url = ? AND download_ts > ?"
@NegativeMjark

NegativeMjark Apr 4, 2016

Contributor

You probably want to be doing ORDER BY download_ts ASC LIMIT 1 rather than min(download_ts)

@NegativeMjark NegativeMjark commented on an outdated diff Apr 4, 2016

synapse/rest/media/v1/preview_url_resource.py
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .base_resource import BaseMediaResource
+
+from twisted.web.server import NOT_DONE_YET
+from twisted.internet import defer
+from lxml import html
@NegativeMjark

NegativeMjark Apr 4, 2016

Contributor

Ugh, I wish we weren't adding an lxml dependency, but I don't see how to do the xpath stuff otherwise.
It might be worth adding some notes to upgrade.rst to remind us about it when we do a release.

Contributor

NegativeMjark commented Apr 4, 2016

I think you need to run apt-get install libxslt1-dev before you can install lxml on debian fwiw.

Contributor

NegativeMjark commented Apr 4, 2016

Can we make the entire thing optional somehow? We probably can't run it by default anyway given that it needs an IP blacklist.

@NegativeMjark NegativeMjark commented on an outdated diff Apr 4, 2016

synapse/rest/media/v1/preview_url_resource.py
+    @defer.inlineCallbacks
+    def _async_render_GET(self, request):
+
+        try:
+            # XXX: if get_user_by_req fails, what should we do in an async render?
+            requester = yield self.auth.get_user_by_req(request)
+            url = request.args.get("url")[0]
+            if "ts" in request.args:
+                ts = int(request.args.get("ts")[0])
+            else:
+                ts = self.clock.time_msec()
+
+            # first check the memory cache - good to handle all the clients on this
+            # HS thundering away to preview the same URL at the same time.
+            try:
+                og = self.cache[url]
@NegativeMjark

NegativeMjark Apr 4, 2016

Contributor

Use cache.get() rather than try: except:
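The suggestion, as a tiny sketch (a plain dict stands in for the resource's in-memory cache here):

```python
cache = {}  # stand-in for the resource's in-memory URL cache

# the try/except form the review is objecting to:
try:
    og = cache["http://example.com"]
except KeyError:
    og = None

# the equivalent, flatter form being suggested:
og = cache.get("http://example.com")
print(og)  # → None
```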

@NegativeMjark NegativeMjark commented on an outdated diff Apr 4, 2016

synapse/rest/media/v1/preview_url_resource.py
+
+            if dims:
+                og["og:image:width"] = dims['width']
+                og["og:image:height"] = dims['height']
+            else:
+                logger.warn("Couldn't get dims for %s" % url)
+
+        # define our OG response for this media
+        elif self._is_html(media_info['media_type']):
+            # TODO: somehow stop a big HTML tree from exploding synapse's RAM
+
+            try:
+                tree = html.parse(media_info['filename'])
+                og = yield self._calc_og(tree, media_info, requester)
+            except UnicodeDecodeError:
+                # XXX: evil evil bodge
@NegativeMjark

NegativeMjark Apr 4, 2016

Contributor

Could you elaborate on what the bodge is?

ara4n added some commits Apr 7, 2016

Add url_preview_enabled config option to turn on/off preview_url endpoint. Defaults to off.

Add url_preview_ip_range_blacklist to let admins specify internal IP ranges that must not be spidered.
Add url_preview_url_blacklist to let admins specify URL patterns that must not be spidered.
Implement a custom SpiderEndpoint and associated support classes to implement url_preview_ip_range_blacklist
Add commentary and generally address PR feedback
Owner

ara4n commented Apr 8, 2016

incorporate all the PR feedback - @NegativeMjark PTAL

ara4n and others added some commits Apr 8, 2016

@NegativeMjark NegativeMjark commented on an outdated diff Apr 8, 2016

synapse/rest/media/v1/preview_url_resource.py
+import ujson as json
+
+import logging
+logger = logging.getLogger(__name__)
+
+try:
+    from lxml import html
+except ImportError:
+    pass
+
+
+class PreviewUrlResource(BaseMediaResource):
+    isLeaf = True
+
+    def __init__(self, hs, filepaths):
+        if not html:
@NegativeMjark

NegativeMjark Apr 8, 2016

Contributor

The not html check probably throws if lxml isn't installed.
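To spell out what the reviewer is pointing at: if the import fails, the name html is never bound, so not html raises NameError instead of evaluating to True. A sketch of the failure mode and the usual sentinel fix (the module name is deliberately bogus to force the ImportError):

```python
try:
    import definitely_not_a_real_module_xyz as html  # stands in for lxml.html
except ImportError:
    pass  # 'html' is never bound, so any later reference is a NameError

try:
    broken = not html   # this is what the __init__ check attempts
except NameError:
    html = None         # binding a sentinel in the except branch instead
                        # makes a later "if not html:" check safe

print(html is None)  # → True
```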

@NegativeMjark NegativeMjark commented on an outdated diff Apr 8, 2016

synapse/rest/media/v1/preview_url_resource.py
+import logging
+logger = logging.getLogger(__name__)
+
+try:
+    from lxml import html
+except ImportError:
+    pass
+
+
+class PreviewUrlResource(BaseMediaResource):
+    isLeaf = True
+
+    def __init__(self, hs, filepaths):
+        if not html:
+            logger.warn("Disabling PreviewUrlResource as lxml not available")
+            raise
@NegativeMjark

NegativeMjark Apr 8, 2016

Contributor

maybe raise an actual exception rather than None.
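For context on "rather than None": a bare raise with no exception in flight doesn't re-raise anything useful. Under Python 3 it raises RuntimeError("No active exception to re-raise"); under Python 2, which this PR targets, it fails with a TypeError about NoneType not being an exception. A quick demonstration (Python 3 semantics):

```python
def init_without_lxml():
    # mimics the diff: log a warning, then a bare `raise` with
    # no active exception in flight
    raise

try:
    init_without_lxml()
except RuntimeError as e:
    caught = str(e)

print(caught)  # → No active exception to re-raise
```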

Owner

ara4n commented Apr 8, 2016

@NegativeMjark addressed these too, and now throwing sensible exceptions. PTAL

@NegativeMjark NegativeMjark commented on an outdated diff Apr 11, 2016

synapse/rest/media/v1/preview_url_resource.py
+    isLeaf = True
+
+    def __init__(self, hs, filepaths):
+        try:
+            if html:
+                pass
+        except:
+            raise RunTimeError("Disabling PreviewUrlResource as lxml not available")
+
+        if not hasattr(hs.config, "url_preview_ip_range_blacklist"):
+            logger.warn(
+                "For security, you must specify an explicit target IP address "
+                "blacklist in url_preview_ip_range_blacklist for url previewing "
+                "to work"
+            )
+            raise RunTimeError(
@NegativeMjark

NegativeMjark Apr 11, 2016

Contributor

It's RuntimeError, not RunTimeError. This sort of typo can be picked up by running flake8 synapse, fwiw.

@NegativeMjark NegativeMjark commented on an outdated diff Apr 11, 2016

synapse/rest/media/v1/preview_url_resource.py
+            self.cache[url] = og
+
+            # store OG in history-aware DB cache
+            yield self.store.store_url_cache(
+                url,
+                media_info["response_code"],
+                media_info["etag"],
+                media_info["expires"],
+                json.dumps(og),
+                media_info["filesystem_id"],
+                media_info["created_ts"],
+            )
+
+            respond_with_json_bytes(request, 200, json.dumps(og), send_cors=True)
+        except Exception as e:
+            raise e
@NegativeMjark

NegativeMjark Apr 11, 2016

Contributor

You can probably get rid of the try: except: block
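A sketch of why the wrapper is a no-op: except Exception as e: raise e delivers the same exception to the caller as not catching at all (and under Python 2 it additionally resets the traceback to the raise e line, which is exactly the kind of stack-trace loss mentioned in the PR description):

```python
def fetch_og():
    raise ValueError("boom")

def with_reraise():
    try:
        return fetch_og()
    except Exception as e:
        raise e   # no-op wrapper: the same exception reaches the caller

def without_wrapper():
    return fetch_og()   # identical behaviour from the caller's side

results = []
for fn in (with_reraise, without_wrapper):
    try:
        fn()
    except ValueError as e:
        results.append(str(e))

print(results)  # → ['boom', 'boom']
```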

Contributor

NegativeMjark commented Apr 11, 2016

Other than fixing the typos and style warnings, it LGTM. I'm slightly concerned by the lack of tests for it, though.

@ara4n ara4n merged commit 4bd3d25 into develop Apr 11, 2016

0 of 8 checks passed

Flake8 + Packaging (Commit) Build #373 origin/matthew/preview_urls failed in 37 sec
Flake8 + Packaging (Merged PR) Build finished.
Sytest Postgres (Commit) Build #364 origin/matthew/preview_urls in progress...
Sytest Postgres (Merged PR) Build started sha1 is merged.
Sytest SQLite (Commit) Build #369 origin/matthew/preview_urls in progress...
Sytest SQLite (Merged PR) Build started sha1 is merged.
Unit Tests (Commit) Build #417 in progress...
Unit Tests (Merged PR) Build started sha1 is merged.

@richvdh richvdh deleted the matthew/preview_urls branch Dec 1, 2016
