Make failures report on Treeherder #53

Merged

JohanLorenzo merged 5 commits into mozilla-releng:master from JohanLorenzo:failures-on-treeherder on Jan 19, 2017


3 participants

@JohanLorenzo
Contributor
JohanLorenzo commented Jan 12, 2017 edited

[WIP: Tests are on their way]

This requires:

  1. the presence of a public/logs/live_backing.log
  2. the file to be gzip'd
  3. and served as a gzip on S3

Edit: Depends on #59
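The three requirements above can be sketched roughly like this (a minimal illustration, not the actual scriptworker code; the function name and paths are hypothetical):

```python
import gzip
import shutil


def prepare_live_backing_log(src_path, dest_path):
    """Gzip a task log so Treeherder can pick it up as live_backing.log.

    Hypothetical sketch: the real upload would also need the S3 PUT to
    carry these headers so the file is served as a gzip.
    """
    with open(src_path, 'rb') as f_in, gzip.open(dest_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    # Headers the PUT request would need for S3 to serve the gzip correctly.
    return {
        'Content-Type': 'text/plain',
        'Content-Encoding': 'gzip',
    }
```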

@coveralls

Coverage Status

Coverage decreased (-0.06%) to 99.942% when pulling 16aa27e on JohanLorenzo:failures-on-treeherder into 13fa850 on mozilla-releng:master.

@escapewindow

Let's add tests and address the comments. It'll be good to see logs on treeherder!

scriptworker/log.py
@@ -88,7 +88,8 @@ def get_log_filenames(context):
Returns:
tuple: log file path, error log file path
"""
- log_file = os.path.join(context.config['task_log_dir'], 'task_output.log')
+ # XXX Even though our logs aren't live, Treeherder looks for live_backing.log to show errors in failures summary
+ log_file = os.path.join(context.config['task_log_dir'], 'live_backing.log')
@escapewindow
escapewindow Jan 12, 2017 Member

I think we should use task_error.log as the live_backing.log.
Currently task_error.log contains STDERR + STDOUT; task_output.log has STDOUT. If we use task_output.log, we'll drop everything that goes to STDERR.

Also, in practice, I haven't found task_output.log useful. We could actually move to one combined task log file. If you're up for it, let's do it. If you want to limit scope, we can stay with live_backing.log + a mostly useless task_output.log.
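Merging STDERR into STDOUT to get one combined task log can be sketched like this (a minimal illustration with a hypothetical command, not scriptworker's actual task runner):

```python
import os
import subprocess
import sys
import tempfile


def run_task_with_combined_log(command, log_path):
    # Hypothetical sketch: point STDERR at the same file as STDOUT so a
    # single log captures everything, including stack traces and any
    # library output written to stderr.
    with open(log_path, 'wb') as log_file:
        return subprocess.run(command, stdout=log_file,
                              stderr=subprocess.STDOUT)


# Example: a command that writes to both streams ends up in one file.
log_path = os.path.join(tempfile.mkdtemp(), 'live_backing.log')
run_task_with_combined_log(
    [sys.executable, '-c',
     'import sys; print("to stdout"); print("to stderr", file=sys.stderr)'],
    log_path,
)
```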

@JohanLorenzo
JohanLorenzo Jan 13, 2017 Contributor

I agree. In practice, I haven't noticed any difference between the two files (except that task_output.log contains the exit code at the end). I'll get rid of task_output in this PR.

@escapewindow
escapewindow Jan 13, 2017 Member

Thanks!
I think signing logging output goes to stderr, and stack traces as well. A single unified log seems like a win.

@JohanLorenzo
JohanLorenzo Jan 17, 2017 Contributor

In order to make the reviews simpler, I made the renaming/deletion in #59

scriptworker/task.py
@@ -229,16 +230,15 @@ def guess_content_type(path):
}
args = [get_task_id(context.claim_task), get_run_id(context.claim_task),
target_path, payload]
+ _compress_if_text_plain(path, payload['contentType'])
@escapewindow
escapewindow Jan 12, 2017 Member

I'm not 100% sure we should compress all artifacts that are text/plain. If this ends up compressing chainOfTrust.json.asc, do we automatically decompress on download or will we have to decompress before reading?

If we're not sure, we might want to do the compression when we copy the logs to the upload_dir.

@JohanLorenzo
JohanLorenzo Jan 13, 2017 edited Contributor

On my machine, /etc/mime.types maps asc to application/pgp-signature. As a consequence, we currently don't gzip it. For example:

 curl -I https://public-artifacts.taskcluster.net/SwE-nbJ_RQalWmmEfMbCuA/0/public/chainOfTrust.json.asc

I agree this may not be the case on every machine. I see several solutions:

  • Either we don't gzip asc files. In this case, we have to:
    • document that /etc/mime.types is extensively used, and that different machines may produce different uploads
    • make the puppet module fill /etc/mime.types for us. For the record, our CentOS VMs don't have this file.
  • Or we gzip files that are listed (hardcoded?) in scriptworker. At the moment, I'm thinking of files like: logs, json, asc, html/xml. If we go this way, here are the implications:
    • Browsers will be able to gunzip some files, which means:
      • If you want to use curl or wget, you have to pass extra args or pipe them (respectively --compressed and wget ... | gunzip -c -). Documenting that behavior would be nice.
      • If you want the browser to understand every file, you have to force Content-Type to text/plain. This doesn't cause any issues once the file is downloaded.

I checked out some existing jobs listed on Treeherder. The majority of them just gzip live_backing.log. However, I saw some of them compressing html or json files.

I currently prefer the second solution, for 4 reasons:

  1. We save space (and bandwidth) on S3, so the cost of hosting artifacts is lower. (For instance chainOfTrust.json.asc gets its size reduced by half)
  2. Fewer unexpected behaviors to document.
  3. Every text file is handled the same.
  4. Thanks to specifying text/plain for artifacts like asc, we make them readable from the browser (instead of downloadable, which is what happens on my computer). Thus, fewer steps to see the content of these files; hence, faster dev and troubleshooting!
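A rough sketch of that second option (the extension lists and function name here are assumptions for illustration, not scriptworker's actual implementation):

```python
import mimetypes

# Assumed lists for illustration; the real sets would be decided in review.
_GZIP_EXTENSIONS = ('.log', '.json', '.asc', '.html', '.xml')
_FORCE_TEXT_PLAIN = ('.log', '.asc')


def guess_content_type_and_encoding(path):
    """Pick content type and encoding without trusting /etc/mime.types.

    Extensions in _FORCE_TEXT_PLAIN are forced to text/plain so browsers
    display them instead of downloading them; extensions in
    _GZIP_EXTENSIONS get Content-Encoding: gzip.
    """
    if path.endswith(_FORCE_TEXT_PLAIN):
        content_type = 'text/plain'
    else:
        content_type, _ = mimetypes.guess_type(path)
    encoding = 'gzip' if path.endswith(_GZIP_EXTENSIONS) else None
    return content_type or 'application/octet-stream', encoding
```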

What do you think @escapewindow ?

@escapewindow
escapewindow Jan 13, 2017 Member

I would love asc files to be text/plain, so we can read the chain of trust artifacts in the browser without downloading first. If gzipping doesn't affect the artifact when we download from the queue (i.e., we don't have to gunzip before we can read it), then having asc files gzipped sounds like a win. If we have to gunzip, then we'd have to change scriptworker.cot.verify.download_cot to detect when an asc file is gzipped and gunzip it (docker-worker asc files may not be gzipped, for example).

@JohanLorenzo
JohanLorenzo Jan 16, 2017 edited Contributor

Good news! No need to modify download_cot(). aiohttp gunzips files for us when downloaded (behavior tested locally).

scriptworker/task.py
skip_auto_headers = [aiohttp.hdrs.CONTENT_TYPE]
log.info("uploading {path} to {url}...".format(path=path, url=tc_response['putUrl']))
with open(path, "rb") as fh:
with aiohttp.Timeout(context.config['artifact_upload_timeout']):
async with context.session.put(
- tc_response['putUrl'], data=fh, headers=headers,
+ tc_response['putUrl'], data=fh, headers=_craft_artifact_put_hearders(tc_response),
@escapewindow
escapewindow Jan 12, 2017 Member

typo? s,hearders,headers, ?

@JohanLorenzo
JohanLorenzo Jan 13, 2017 Contributor

Typo!

scriptworker/task.py
+ log.debug('"{}" is NOT a plain text file, nothing to compress.'.format(artifact_path))
+
+
+def _craft_artifact_put_hearders(tc_response):
@escapewindow
escapewindow Jan 12, 2017 Member

the typo matches here, which means we're internally consistent at least :)

@JohanLorenzo
JohanLorenzo Jan 13, 2017 Contributor

Thank you autocompletion! 😃

scriptworker/task.py
+ headers = {
+ aiohttp.hdrs.CONTENT_TYPE: tc_response['contentType'],
+ }
+ if tc_response['contentType'] == 'text/plain':
@escapewindow
escapewindow Jan 12, 2017 Member

same here. Instead of changing the contentType here for all text/plain, we could go back to guess_content_type where we set the default contentType for all .log files to text/plain, and change that? If we decide we want to gzip all text/plain artifacts no matter what, then we can keep this.

@JohanLorenzo
JohanLorenzo Jan 13, 2017 Contributor

I agree, guess_content_type is a better central place for handling content-type and encoding. While testing the arguments above, I realized I had to change how a given content-type is handled in several places.

@coveralls

Coverage Status

Coverage decreased (-0.5%) to 99.55% when pulling 4db26ae on JohanLorenzo:failures-on-treeherder into 6a6b809 on mozilla-releng:master.

@coveralls

Coverage Status

Coverage remained the same at 100.0% when pulling af229f2 on JohanLorenzo:failures-on-treeherder into 6a6b809 on mozilla-releng:master.

@coveralls

Coverage Status

Coverage remained the same at 100.0% when pulling f89ef4f on JohanLorenzo:failures-on-treeherder into 6a6b809 on mozilla-releng:master.

@coveralls

Coverage Status

Coverage remained the same at 100.0% when pulling 61f543e on JohanLorenzo:failures-on-treeherder into 6a6b809 on mozilla-releng:master.

@JohanLorenzo JohanLorenzo requested a review from escapewindow Jan 17, 2017

@JohanLorenzo
Contributor

Highlights of what's changed since your last review:

  • Using one file is done in #59. #59 should be merged before this PR (otherwise Treeherder still won't look in the correct location)
  • Artifacts management (as in downloading, creating...) is done in a new package scriptworker.artifacts. If we merge this PR, we'll need to upgrade the version to 1.1.0b1. I don't mind rolling back this change and holding it until 1.0.0 is out.
  • Tests related to compression were added. ⚠️ We don't check if the put request has the correct headers and content. This is due to a limitation of mocks in context managers (not yet supported). If you have a workaround, I'll be glad to make the changes.
  • No test related to download_file() was added. I initially wanted to check that gunzipping was done, but the fake required deeper knowledge of aiohttp (like knowing which attribute is decoded, and in what order). I don't believe that's the right way to proceed; we'd be testing the library more than our actual code. That's why I decided to leave the tests in scriptworker.utils as is.
@escapewindow
Member
escapewindow commented Jan 17, 2017 edited

Artifacts management (as in downloading, creating...) is done in a new package scriptworker.artifacts. If we merge this PR, we'll need to upgrade the version to 1.1.0b1. I don't mind rolling back this change and keep it until 1.0.0 is out.

As mentioned in IRC, pre-1.0.0 is not stable. It isn't clear if that means 1.0.0 pre-alpha is locked, or it's locked at beta1 or rc1. I'm fine being more strict or more loose at this point. Once we hit 1.0.0 (when we're tier1 on central) I imagine we'll increment the major and minor versions fairly quickly.

To be more clear about this specific PR: If we consider 1.0.0bX fair game for api changes, let's land pre-1.0.0. If we think we'll need to bump the minor version for this PR, let's wait til 1.0.0 releases.

N/m, let's merge this and ship whatever version. 2.0.0b1 :)

@escapewindow

Thanks! Diffing the function moves across files was the long pole here.

I'm leaning towards shipping this sooner rather than later. Whether that means we ship 1.0.0b7, 1.1.0b1, 1.1.0b7, 2.0.0b1, or 2.0.0b7 doesn't matter that much to me.

scriptworker/artifacts.py
+ log.debug('"{}" is encoded with "{}" and has mime/type "{}"'.format(artifact_path, encoding, content_type))
+
+ if encoding is None and content_type in _GZIP_SUPPORTED_CONTENT_TYPE:
+ log.debug('"{}" can be gzip\'d. Compressing...'.format(artifact_path))
@escapewindow
escapewindow Jan 17, 2017 Member

I'd log.info this, and possibly turn the final log.info into log.debug. The reasoning for me is, if we hit an error while compressing between the first log.debug and the log.info, we won't necessarily know what went wrong if verbose isn't True.

@JohanLorenzo
JohanLorenzo Jan 19, 2017 Contributor

I'm good with turning this line into info. I'd keep the final log.info as well. Otherwise, say an error occurs after gzipping: I'm afraid people may wonder whether files were gzip'd in the end or not. This may lead them to wrong assumptions about what's failing.
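The logging pattern being agreed on can be sketched like this (hypothetical function, not the exact scriptworker code): info-level lines bracket the compression, so a failure in between is diagnosable even when debug logging is off, and readers can still tell whether the file ended up gzip'd.

```python
import gzip
import logging
import os
import shutil

log = logging.getLogger(__name__)


def compress_artifact(artifact_path):
    log.info('Compressing "%s"...', artifact_path)
    tmp_path = artifact_path + '.gz'
    with open(artifact_path, 'rb') as f_in, gzip.open(tmp_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    os.rename(tmp_path, artifact_path)
    # Final info line: without it, an error later in the upload would leave
    # readers guessing whether compression actually finished.
    log.info('Compressed "%s"', artifact_path)
```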

JohanLorenzo added some commits Jan 12, 2017
@JohanLorenzo JohanLorenzo Make failures report on Treeherder
This requires:
  1. the presence of a public/logs/live_backing.log
  2. the file to be gzip'd
  3. and served as a gzip on S3
b284502
@JohanLorenzo JohanLorenzo Gzip more type of files and force some extension to text/plain
This also creates the package scriptworker.artifacts
34d3917
@JohanLorenzo JohanLorenzo Attempt to fix tests d82637b
@JohanLorenzo JohanLorenzo Add tests for gzip new feature
ead4b9b
@coveralls

Coverage Status

Coverage remained the same at 100.0% when pulling ead4b9b on JohanLorenzo:failures-on-treeherder into 0bc0cce on mozilla-releng:master.

@JohanLorenzo JohanLorenzo Address review comments
1214f88
@coveralls

Coverage Status

Coverage remained the same at 100.0% when pulling 1214f88 on JohanLorenzo:failures-on-treeherder into 0bc0cce on mozilla-releng:master.

@JohanLorenzo
Contributor

Conflicts happened when I rebased this PR against #59. Tests passing are a good sign I didn't mess up the resolution. I double-checked the diff to see if there were leftovers of "log" or "filehandle" => we're good.

Finally I ran a session of manual testing with https://tools.taskcluster.net/task-inspector/#TRZPZW65RO-adh6aMZb-1Q/ => the task failed as it should have (Google Play refused an existing APK). Logs were uploaded as text/plain + gzip and Treeherder correctly reported the error: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-tier=3&revision=5cf9f0f65d201fc4631530c8f00e3a14887c1ad1&filter-searchStr=gps&selectedJob=70404051

@JohanLorenzo JohanLorenzo merged commit d9355eb into mozilla-releng:master Jan 19, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage remained the same at 100.0%
Details