NWBib down due to problems on quaoar2 #302

Closed
acka47 opened this issue Mar 29, 2016 · 3 comments

acka47 commented Mar 29, 2016

Over Easter (26-28 March), several Nagios messages concerning quaoar2 and also emphytos came in.

The first critical one (26.03.2016, 16:07):

***** Nagios *****

Notification Type: PROBLEM

Service: Check disk - ALL
Host: quaoar2
Address: 193.30.112.171
State: CRITICAL

Date/Time: Sat Mar 26 16:07:04 CET 2016

Additional Info:

DISK CRITICAL - free space: / 22274 MB (5% inode=97%): /files/open_data 489833 MB (31% inode=90%): /usr/remotesrc 12982 MB (7% inode=98%):

From then on, CRITICAL messages like this one for quaoar2 kept coming in.

The first regarding emphytos (26.03.2016, 23:58):

***** Nagios *****

Notification Type: PROBLEM

Service: Frontend
Host: emphytos-lobid
Address: 193.30.112.187
State: CRITICAL

Date/Time: Sat Mar 26 23:58:54 CET 2016

Additional Info:

CRITICAL - Socket timeout after 10 seconds

I realized yesterday evening that NWBib was down and asked @jschnasse to restart the Play app this morning. NWBib is up again, but strangely the detail view for NWBib resources doesn't work, e.g. http://nwbib.de/HT018866841. For other hbz01 resources it does work, e.g. http://nwbib.de/HT018715226.

Besides resolving this bug, I need some more documentation and probably training to deal with something like this myself.
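
For reference, a rough diagnostic sketch of what one could check first in a case like this (assuming shell access to quaoar2 and an Elasticsearch HTTP endpoint on the default port 9200; both are assumptions, not documented settings):

# Check free space on the root filesystem the Nagios alert complains about
df -h /
# Check overall Elasticsearch cluster health (green/yellow/red)
curl 'http://localhost:9200/_cluster/health?pretty'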

fsteeg commented Mar 29, 2016

The issue seems to be that quaoar2 hasn't recovered from its critical state yet.

The lobid API is configured to access the cluster via quaoar3, so that remained working. NWBib accessed the cluster via quaoar2 for its classification data, so only NWBib titles failed. Configured NWBib to access the cluster via quaoar3:

http://nwbib.de/HT018866841
http://nwbib.de/HT018826410
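
A quick way to spot-check such a detail view from the command line (plain curl, nothing NWBib-specific; the URL is one of those reported above):

# Should print 200 once the configuration change is effective
curl -s -o /dev/null -w '%{http_code}\n' http://nwbib.de/HT018866841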

Also deleted old indexes on the cluster via quaoar3 to make space (lack of disk space seems to be the original issue on quaoar2). The load of the elasticsearch process on quaoar2 is high, so I'll just leave it running and we'll see if the changes made via quaoar3 propagate and quaoar2 sorts itself out.
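
Listing and deleting indexes goes through the regular Elasticsearch REST API; a minimal sketch, assuming the cluster is reachable on quaoar3:9200 and using a placeholder index name (the actual index names are not part of this issue):

# List indexes with their on-disk size to find obsolete ones
curl 'http://quaoar3:9200/_cat/indices?v'
# Delete an obsolete index (placeholder name)
curl -XDELETE 'http://quaoar3:9200/some-old-index'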

@fsteeg fsteeg added review and removed working labels Mar 29, 2016
@fsteeg fsteeg assigned acka47 and unassigned fsteeg Mar 29, 2016
acka47 commented Mar 30, 2016

It doesn't look like quaoar2 is going to recover on its own. Nagios mail from 30.03.2016, 04:17:

***** Nagios *****

Notification Type: PROBLEM

Service: Check disk - ALL
Host: quaoar2
Address: 193.30.112.171
State: WARNING

Date/Time: Wed Mar 30 04:17:04 CEST 2016

Additional Info:

DISK WARNING - free space: / 27256 MB (7% inode=97%): /files/open_data 489209 MB (31% inode=90%): /usr/remotesrc 12982 MB (7% inode=98%):

fsteeg commented Apr 5, 2016

After manual deletion of some indexes, a restart of elasticsearch on quaoar2, and some time, the cluster is now back to green status. Closing.

Attempted restart on quaoar2 with:
sudo service elasticsearch restart

But it did not start up, as checked with:
sudo service elasticsearch status

It came back after:
sudo service elasticsearch start
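
To confirm the recovery, one can ask the cluster health endpoint to wait for green status; a minimal check, assuming the default port 9200 on quaoar2:

# Blocks until the cluster is green or the timeout expires
curl 'http://quaoar2:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty'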

Opened hbz/lobid-resources#67 to avoid this kind of problem in the future.

@fsteeg fsteeg closed this as completed Apr 5, 2016
@fsteeg fsteeg removed the working label Apr 5, 2016