Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: fix leaderID error when state changed #2415

Merged
merged 1 commit into from
Apr 24, 2015

Conversation

xiaost
Copy link
Contributor

@xiaost xiaost commented Apr 24, 2015

I'm try to use raft module in my project and found the bug:
after follower becoming leader, the IsLeader() api don't return correctly 😭 😢

@otoolep
Copy link
Contributor

otoolep commented Apr 24, 2015

Thanks @xiaost -- I'll take a look. In the meantime can you please sign the CLA here:

http://influxdb.com/community/cla.html

We can't merge this change until then.

@xiaost
Copy link
Contributor Author

xiaost commented Apr 24, 2015

@otoolep LTGM, signed. 😃

@otoolep
Copy link
Contributor

otoolep commented Apr 24, 2015

Can you describe the problems you had with the system before this change
was in place?

On Friday, April 24, 2015, xiaost notifications@github.com wrote:

@otoolep https://github.com/otoolep LTGM, signed. [image: 😃]


Reply to this email directly or view it on GitHub
#2415 (comment).

@xiaost
Copy link
Contributor Author

xiaost commented Apr 24, 2015

@otoolep
I using influxdb raft pkg in my distributed system.
everything works well without election.

A -> Leader
B -> Follower
C -> Follower

After A restarted:
B -> Follower -> Candidate -> Leader. here is the log:

[raft] 2015/04/24 14:15:38 log state change: follower => candidate (term=1)
[raft] 2015/04/24 14:15:38 log state change: candidate => leader (term=2)

after election, B should be leader but IsLeader() return false

log.Printf("isLeader : %v, State: %v", l.IsLeader(), l.State())

outputs:

2015/04/25 04:25:39 isLeader : false, State: leader

@otoolep
Copy link
Contributor

otoolep commented Apr 24, 2015

We have discovered a leader-election bug and are fixing it in PR #2418. How easily can you reproduce the error you are seeing? I'd like to know if you can reproduce the issue after we merge 2418.

@xiaost
Copy link
Contributor Author

xiaost commented Apr 24, 2015

#2418 seems to be another bug i think. :-(

ping @jwilder

test code:

package main

import (
    "encoding/binary"
    "fmt"
    "io"
    "log"
    "net/http"
    "net/url"
    "os"
    "sync"
    "time"

    "github.com/influxdb/influxdb/raft"
)

func main() {
    os.RemoveAll("/tmp/testraft1")
    os.RemoveAll("/tmp/testraft2")
    os.RemoveAll("/tmp/testraft3")
    log1 := StartRaft("/tmp/testraft1", 3771, "")
    log2 := StartRaft("/tmp/testraft2", 3772, "http://127.0.0.1:3771")
    log3 := StartRaft("/tmp/testraft3", 3773, "http://127.0.0.1:3771")
    t := time.Tick(1 * time.Second)
    failover := false
    for _ = range t {
        if failover == false {
            log.Printf("log1 isLeader : %v, State: %v", log1.IsLeader(), log1.State())
        }
        log.Printf("log2 isLeader : %v, State: %v", log2.IsLeader(), log2.State())
        log.Printf("log3 isLeader : %v, State: %v", log3.IsLeader(), log3.State())
        if log1.IsLeader() {
            log1.Close()
            failover = true
        }
    }
}

func StartRaft(path string, port int, join string) *raft.Log {
    bindstr := fmt.Sprintf("127.0.0.1:%d", port)
    urlstr := fmt.Sprintf("http://%s/", bindstr)
    myurl, _ := url.Parse(urlstr)

    l := raft.NewLog()
    //l.DebugEnabled = true
    l.SetURL(*myurl)
    l.FSM = NewIndexFSM()
    l.Open(path)
    index, _ := l.LastLogIndexTerm()
    if index == 0 {
        if join != "" {
            joinURL, err := url.Parse(join)
            if err != nil {
                log.Fatal(err)
            }
            if err := l.Join(*joinURL); err != nil {
                if err != raft.ErrInitialized {
                    log.Fatal(err)
                }
            }
        } else {
            if err := l.Initialize(); err != nil {
                if err != raft.ErrInitialized {
                    log.Fatal(err)
                }
            }
        }
    }
    h := raft.Handler{Log: l}
    mux := http.NewServeMux()
    mux.HandleFunc("/raft/", h.ServeHTTP)
    go func() {
        log.Println(http.ListenAndServe(bindstr, mux))
    }()
    return l
}

// IndexFSM represents a state machine that only records the last applied index.
type IndexFSM struct {
    mu    sync.Mutex
    index uint64
}

func NewIndexFSM() *IndexFSM {
    fsm := &IndexFSM{}
    return fsm
}

// MustApply updates the index.
func (fsm *IndexFSM) Apply(entry *raft.LogEntry) error {
    fsm.mu.Lock()
    fsm.index = entry.Index
    fsm.mu.Unlock()
    return nil
}

// Index returns the highest applied index.
func (fsm *IndexFSM) Index() uint64 {
    fsm.mu.Lock()
    defer fsm.mu.Unlock()
    return fsm.index
}

// WriteTo writes a snapshot of the FSM to w.
func (fsm *IndexFSM) WriteTo(w io.Writer) (n int64, err error) {
    fsm.mu.Lock()
    defer fsm.mu.Unlock()
    return 0, binary.Write(w, binary.BigEndian, fsm.index)
}

// ReadFrom reads an FSM snapshot from r.
func (fsm *IndexFSM) ReadFrom(r io.Reader) (n int64, err error) {
    fsm.mu.Lock()
    defer fsm.mu.Unlock()
    return 0, binary.Read(r, binary.BigEndian, &fsm.index)
}

@otoolep
Copy link
Contributor

otoolep commented Apr 24, 2015

Thanks for the test program @xiaost -- we'll check it out.

@jwilder
Copy link
Contributor

jwilder commented Apr 24, 2015

I was able to reproduce this issue in a different way as well.

@otoolep
Copy link
Contributor

otoolep commented Apr 24, 2015

OK, we looked at the code, this does seem be a real issue, good catch @xiaost. From looking at the code, the following would occur if IsLeader does not return the correct result:

  • CQs would not run.
  • Leader redirection would not function correctly.

@jwilder
Copy link
Contributor

jwilder commented Apr 24, 2015

👍 Verified this fixed the issue for my test as well.

@otoolep
Copy link
Contributor

otoolep commented Apr 24, 2015

Thanks again @xiaost -- merging now.

otoolep added a commit that referenced this pull request Apr 24, 2015
raft: fix leaderID error when state changed
@otoolep otoolep merged commit 5894f65 into influxdata:master Apr 24, 2015
@otoolep
Copy link
Contributor

otoolep commented Apr 24, 2015

@benbjohnson

@xiaost xiaost deleted the fix-leaderid-error branch April 24, 2015 23:17
@benbjohnson
Copy link
Contributor

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants