Make master green again! #2002

Closed
1 of 3 tasks
ghost opened this issue Nov 24, 2015 · 29 comments
Labels
kind/bug A bug in existing code (including security flaws) topic/test failure Topic test failure

Comments

@ghost

ghost commented Nov 24, 2015

There have been a lot of very horrible CI failures lately, ranging from time-based ones to plain worrying ones (like ipfs add -r hash mismatches).

  • Let's look at each failure (!)
  • Collect types of failure in this issue
  • Fix them

the failures

@ghost ghost added kind/bug A bug in existing code (including security flaws) topic/test failure Topic test failure labels Nov 24, 2015
@rht
Contributor

rht commented Nov 24, 2015

Some specimens can be found in #1989 (excluding the travis_wait commits).

There are 3 major types from this PR:

  1. https://travis-ci.org/ipfs/go-ipfs/jobs/92502227 ipfs add -r --hidden mountdir/planets >actual stalled.
  2. https://travis-ci.org/ipfs/go-ipfs/jobs/92499693 one of the auto-gc tests fails.
  3. ipfs refs -r counts 77 instead of 79.

I don't see more beyond the 3 types.
The rest are the usual statistical goroutine errs.
All of these happen only in the osx sharness tests.

@rht
Contributor

rht commented Nov 24, 2015

(#1984 has the most recent test I ran, and they all somehow passed)

@ghost ghost mentioned this issue Nov 26, 2015
42 tasks
@jbenet
Member

jbenet commented Nov 30, 2015

👍 yes please

@cryptix
Contributor

cryptix commented Dec 3, 2015

I'm seeing this in t0060-daemon.sh quite often.

not ok 19 - 'ipfs daemon' should be able to run with a pipe attached to stdin (issue #861)
#
#         yes | ipfs daemon --init >stdin_daemon_out 2>stdin_daemon_err &
#         pollEndpoint -ep=/version -v -tout=1s -tries=10 >stdin_poll_apiout 2>stdin_poll_apierr &&
#         test_kill_repeat_10_sec $! ||
#         test_fsh cat stdin_daemon_out || test_fsh cat stdin_daemon_err || test_fsh cat stdin_poll_apiout || test_fsh cat stdin_poll_apierr
#

It passes maybe one in 5 times, and in the failure case it leaves an ipfs binary running on the system, which is confusing on the next run. Digging deeper, the daemon simply didn't start up in time... (Also found a small print bug in pollEndpoint in the meantime, PR incoming)

edit: feel free to cherry pick f7fb258
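
For anyone trying to reproduce the slow-start case locally, here is a minimal sketch of a retry loop in plain POSIX shell, in the spirit of what the test does with pollEndpoint's -tries and -tout flags. The helper name poll_until is made up for illustration and is not part of the go-ipfs test suite. The curl line in the comment assumes an API listening on 127.0.0.1:6312, which is an assumption about the test environment.

```shell
#!/bin/sh
# poll_until TRIES DELAY CMD [ARGS...]   (hypothetical helper, not from go-ipfs)
# Run CMD up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
poll_until() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (assumed address, not executed here):
#   poll_until 60 1 curl -sf http://127.0.0.1:6312/version >/dev/null
```

Raising the number of tries only hides a slow start. It does nothing about a daemon left over from a previous failed run, which still has to be killed before the next attempt.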

@whyrusleeping
Member

@cryptix t0060 fails if you have a daemon already running.

@ghost
Author

ghost commented Dec 4, 2015

Doesn't properly kill the daemon after some test in https://circleci.com/gh/ipfs/go-ipfs/1981

expecting success: 
        IPFS_PID=$! &&
        pollEndpoint -ep=/version -host=$ADDR_API -v -tout=1s -tries=60 2>poll_apierr > poll_apiout ||
        test_fsh cat actual_daemon || test_fsh cat daemon_err || test_fsh cat poll_apierr || test_fsh cat poll_apiout

> cat actual_daemon
Initializing daemon...
Swarm listening on /ip4/10.0.3.235/tcp/13212
Swarm listening on /ip4/10.0.4.1/tcp/13212
Swarm listening on /ip4/127.0.0.1/tcp/13212
API server listening on /ip4/127.0.0.1/tcp/6312

> cat daemon_err
Error: serveHTTPGateway: manet.Listen(/ip4/127.0.0.1/tcp/9312) failed: listen tcp4 127.0.0.1:9312: bind: address already in use

> cat poll_apierr
00:33:29.862 DEBUG pollEndpoi: starting at %s, tries: %d, timeout: %s, url: %s2015-11-21 00:33:29.862385236 +0000 UTC 60 1s {http  <nil> 127.0.0.1:6312 /version   } main.go:57
00:33:29.862 DEBUG pollEndpoi: get failed: Get http://127.0.0.1:6312/version: dial tcp 127.0.0.1:6312: getsockopt: connection refused main.go:66
00:33:30.863 DEBUG pollEndpoi: get failed: Get http://127.0.0.1:6312/version: dial tcp 127.0.0.1:6312: getsockopt: connection refused main.go:66
[snip]

> cat poll_apiout

not ok 6 - 'ipfs daemon' is ready

@cryptix
Contributor

cryptix commented Dec 4, 2015

> @cryptix t0060 fails if you have a daemon already running.

Yes, but that failure looks different, the first cases fail.

@ghost
Author

ghost commented Dec 14, 2015

New failure on OSX: https://travis-ci.org/ipfs/go-ipfs/jobs/96669162#L147

--- FAIL: TestReconnect5 (1.94s)
    reconnect_test.go:231: host 0 <peer.ID bBGhBC> has 4 conns! not zero.
FAIL

And a timeout on OSX sharness: https://travis-ci.org/ipfs/go-ipfs/jobs/96669163

expecting success: 
        mkdir -p testdir &&
        echo "hello test" >testdir/test.txt &&
        ipfs add -r testdir &&
        curl -i "http://localhost:$PORT_API/api/v0/refs?arg=QmTcJAn3JP8ZMAKS6WS75q8sbTyojWKbxcUHgLYGWur4Ym&stream-channels=true&encoding=text" >actual_output

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
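
To help with the "collect types of failure" task, here is a rough sketch of how saved CI logs could be sorted by the marker strings quoted so far in this thread. The function name classify_failure and the category labels are invented for illustration, and the order of the checks is an assumption: a port conflict is treated as the root cause even when a "not ok" line follows it.

```shell
#!/bin/sh
# classify_failure LOGFILE   (hypothetical helper, labels are made up)
# Print a coarse failure category based on strings seen in this thread's logs.
classify_failure() {
    log=$1
    if grep -q 'address already in use' "$log"; then
        # A leftover daemon still holds the port.
        echo port-conflict
    elif grep -q 'No output has been received in the last' "$log"; then
        # Travis killed the job after a period without output.
        echo stalled-build
    elif grep -q -e '--- FAIL:' "$log"; then
        # A Go unit test failed.
        echo go-test-failure
    elif grep -q '^not ok ' "$log"; then
        # A sharness test case failed.
        echo sharness-failure
    else
        echo unknown
    fi
}
```

Running it over a directory of downloaded logs, for example `for f in logs/*; do echo "$f: $(classify_failure "$f")"; done`, gives a quick tally. The `logs/` directory is a placeholder.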

@jbenet
Member

jbenet commented Dec 15, 2015

I like xmas and all, but the red is getting so annoying :S

@rht
Contributor

rht commented Dec 16, 2015

Wouldn't this be most effective if debugged by someone on osx? Is the err specific to travis osx only?

@rht
Contributor

rht commented Dec 16, 2015

Saw a green on 0.3.8: #2081

@chriscool
Contributor

Yeah, it should be possible to bisect this by running the tests many times on each commit instead of just once.

There is also this project that has a bisect that uses Bayesian Search Theory to find intermittent bugs:

https://github.com/Ealdwulf/bbchop

Yeah, it has not been updated for a long time but it might still be interesting.
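
A plain (non-Bayesian) version of the idea can be sketched with git bisect run, which treats exit status 0 as good and 1 to 127 (except 125) as bad. The helper name run_n_times and the file name run_n_times.sh are invented for illustration, and the test command in the comment is a placeholder that would also need to rebuild the binary at each commit. Assuming independent runs, if a bad commit fails a fraction p of the time, N runs all pass with probability (1-p)^N, so with p = 0.2 and N = 10 a bad commit is still marked good about 11% of the time.

```shell
#!/bin/sh
# run_n_times N CMD [ARGS...]   (hypothetical helper)
# Run CMD up to N times. Stop and return 1 at the first failure;
# return 0 only if all N runs pass.
run_n_times() {
    n=$1
    shift
    run=1
    while [ "$run" -le "$n" ]; do
        if ! "$@"; then
            echo "run $run of $n failed" >&2
            return 1
        fi
        run=$((run + 1))
    done
    return 0
}

# Example (placeholders in angle brackets, not executed here):
#   git bisect start <bad-commit> <good-commit>
#   git bisect run sh -c '. ./run_n_times.sh && run_n_times 10 <build-and-test-command>'
```

A larger N lowers the chance of a wrong "good" verdict but multiplies the bisect time.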

@chriscool
Contributor

PR #2040 should have fixed: #2026 sharness t0020-init.sh fails to cleanup on OSX

@rht
Contributor

rht commented Dec 16, 2015

@chriscool the bisection algorithm code you linked is only usable if it can be run on travis. Or, if the test failures are reproducible on osx, then sure, bbchop can be used.

> PR #2040 should have fixed: #2026 sharness t0020-init.sh fails to cleanup on OSX

ok I will check 10x if this cures the test fails in general.

@chriscool
Contributor

> @chriscool the bisection alg code you linked is only possible if it can be run on travis. Or if the test fails are reproducible on osx.

Yeah, unfortunately it looks like CI services like Travis and Circle don't have a nice bisecting service (maybe we should suggest it to them), but I thought it could be interesting to talk about bbchop while at it.

> > PR #2040 should have fixed: #2026 sharness t0020-init.sh fails to cleanup on OSX
>
> ok I will check 10x if this cures the test fails in general.

Great, thanks!

@rht
Contributor

rht commented Dec 17, 2015

I have narrowed down the osx build stall to this PR #1914 ...the culprit.

@jbenet
Member

jbenet commented Dec 17, 2015

@rht great detective work! but wait, one problem:

the closenotify thing caused hangs in <go1.5.1. we moved to go1.5.2, and i updated the travis.yml to use 1.5.2 some time after #1914 (should've been in the same PR but wasn't). this may cause false positives for #1914 -- try rerunning that one with go1.5.2 in this line: https://github.com/ipfs/go-ipfs/blob/master/.travis.yml#L10

@rht
Contributor

rht commented Dec 18, 2015

reran with 1.5.2, still build errs/stalls: #2088

@rht
Contributor

rht commented Dec 18, 2015

The gc errs are likely due to a race condition between periodic gc and pinning: when added files are gc-ed before they are pinned, the pinning stalls because the blocks are available neither locally nor remotely.

@Kubuxu
Member

Kubuxu commented Dec 28, 2015

stalled: https://travis-ci.org/ipfs/go-ipfs/jobs/99128512
on t0045-ls.sh

@thelinuxkid
Contributor

I am having the timeout issue with sharness tests too: #2119. #2120 adds fuse support for Travis CI but also switches the Ubuntu version to Trusty. However, I ran the tests multiple times (go1.5.2) and the timeout is my only issue, albeit an intermittent one -- redoing a test can eventually yield green: https://travis-ci.org/thelinuxkid/go-ipfs/builds/98464151. https://github.com/thelinuxkid/travis-rerun might help with debugging. I was able to build the Trusty image Travis uses. I will try to replicate the timeout there when I am in town again next week.

@Kubuxu
Member

Kubuxu commented Jan 5, 2016

Failed on: t0200-unixfs-ls
https://travis-ci.org/ipfs/go-ipfs/jobs/100336579#L7521

@rht rht mentioned this issue Jan 23, 2016
@whyrusleeping
Member

I think this effort was successful. Master has been quite green lately.


7 participants