
Unable to connect with database: JC_BAD_CONNECTION #235

Open
harshajayaweeraXHJ opened this issue Jan 26, 2020 · 11 comments

@harshajayaweeraXHJ

My application runs a sampling query periodically, and it was working fine without any issue for around 30 hours. Then it suddenly gave the following error:

Caused by: com.toshiba.mwcloud.gs.common.GSConnectionException: [145028:JC_BAD_CONNECTION] Failed to update by notification (address=/239.0.0.1:31999, reason=Receive timed out)
at com.toshiba.mwcloud.gs.subnet.NodeResolver.updateMasterInfo(NodeResolver.java:815)
at com.toshiba.mwcloud.gs.subnet.NodeResolver.prepareConnectionAndClusterInfo(NodeResolver.java:522)
at com.toshiba.mwcloud.gs.subnet.NodeResolver.getPartitionCount(NodeResolver.java:205)
at com.toshiba.mwcloud.gs.subnet.GridStoreChannel$5.execute(GridStoreChannel.java:2106)
at com.toshiba.mwcloud.gs.subnet.GridStoreChannel.executeStatement(GridStoreChannel.java:1675)
... 38 more
Caused by: java.net.SocketTimeoutException: Receive timed out

Why is this happening? What is the cause?

This is the output of gs_stat -u admin/admin:

{
"checkpoint": {
"archiveLog": 0,
"backupOperation": 0,
"duplicateLog": 0,
"endTime": 1580053987745,
"mode": "NORMAL_CHECKPOINT",
"normalCheckpointOperation": 1470,
"pendingPartition": 0,
"periodicCheckpoint": "ACTIVE",
"requestedCheckpointOperation": 0,
"startTime": 1580053987741
},
"cluster": {
"activeCount": 0,
"autoGoal": "ACTIVE",
"clusterName": "defaultCluster",
"clusterRevisionId": "4e9be62e-7911-48a4-8d93-19af09be7a15",
"clusterRevisionNo": 17651,
"clusterStatus": "SUB_CLUSTER",
"designatedCount": 1,
"loadBalancer": "ACTIVE",
"nodeList": [
{
"address": "10.128.0.2",
"port": 10040
}
],
"nodeStatus": "ABNORMAL",
"notificationMode": "MULTICAST",
"partitionStatus": "INITIAL",
"startupTime": "2020-01-25T15:20:31.377Z",
"syncCount": 0
},
"currentTime": "2020-01-26T17:20:39Z",
"performance": {
"backupCount": 0,
"batchFree": 0,
"bufferHashCollisionCount": 0,
"checkpointFileAllocateSize": 5443584,
"checkpointFileFlushCount": 0,
"checkpointFileFlushTime": 0,
"checkpointFileSize": 5439488,
"checkpointFileUsageRate": 0.927710843373494,
"checkpointMemory": 196608,
"checkpointMemoryLimit": 1073741824,
"checkpointWriteSize": 270139392,
"checkpointWriteTime": 214,
"currentCheckpointWriteBufferSize": 0,
"currentTime": 1580059239771,
"expirationDetail": {
"autoExpire": false,
"erasableExpiredTime": "1970-01-01T00:00:00.000Z",
"latestExpirationCheckTime": "1970-01-01T00:00:00.000Z"
},
"logFileFlushCount": 8832,
"logFileFlushTime": 38224,
"numBackground": 0,
"numConnection": 2,
"numNoExpireTxn": 0,
"numSession": 0,
"numTxn": 0,
"ownerCount": 128,
"peakProcessMemory": 86626304,
"processMemory": 86626304,
"recoveryReadSize": 262144,
"recoveryReadTime": 0,
"recoveryReadUncompressTime": 0,
"storeCompressionMode": "NO_BLOCK_COMPRESSION",
"storeDetail": {
"batchFreeMapData": {
"storeMemory": 0,
"storeUse": 0,
"swapRead": 0,
"swapWrite": 0
},
"batchFreeRowData": {
"storeMemory": 0,
"storeUse": 0,
"swapRead": 0,
"swapWrite": 0
},
"mapData": {
"storeMemory": 131072,
"storeUse": 131072,
"swapRead": 0,
"swapWrite": 0
},
"metaData": {
"storeMemory": 131072,
"storeUse": 131072,
"swapRead": 0,
"swapWrite": 0
},
"rowData": {
"storeMemory": 4784128,
"storeUse": 4784128,
"swapRead": 0,
"swapWrite": 0
}
},
"storeMemory": 5046272,
"storeMemoryLimit": 1073741824,
"storeTotalUse": 5046272,
"swapRead": 0,
"swapReadSize": 0,
"swapReadTime": 0,
"swapReadUncompressTime": 0,
"swapWrite": 0,
"swapWriteCompressTime": 0,
"swapWriteSize": 0,
"swapWriteTime": 0,
"syncReadSize": 0,
"syncReadTime": 0,
"syncReadUncompressTime": 0,
"totalBackupLsn": 0,
"totalLockConflictCount": 0,
"totalOtherLsn": 0,
"totalOwnerLsn": 110220,
"totalReadOperation": 4733,
"totalRowRead": 2325894,
"totalRowWrite": 55108,
"totalWriteOperation": 55108,
"txnDetail": {
"totalBackgroundOperation": 0
}
},
"recovery": {
"progressRate": 1
},
"version": "4.3.0-36424 CE"
}

@knonomura
Member

Thank you for using GridDB and for the report.

"cluster": {
"clusterStatus": "SUB_CLUSTER", <--
"nodeStatus": "ABNORMAL", <--

I guess an error has occurred on the server, e.g. an IO_ERROR.
Could you please check the gridstore-*.log files under $GS_LOG for lines containing "ERROR"?

@harshajayaweeraXHJ
Author

harshajayaweeraXHJ commented Jan 27, 2020

Sure, I will. Thanks. Can you give more specific examples, please? What kind of I/O errors could they be? A read/write permission error, or something like that?

@harshajayaweeraXHJ
Author

harshajayaweeraXHJ commented Jan 27, 2020

The log files in $GS_LOG say:

2020-01-26T15:53:07.742Z xxx 11745 ERROR EVENT_ENGINE [1042:CM_PLATFORM_ERROR] Listener error (engine=TRANSACTION_SERVICE, reason=Platform error (errno=24, description=Too many open files))

and also

2020-01-26T15:53:07.744Z xxx 11751 ERROR CLUSTER_OPERATION [1042:CM_PLATFORM_ERROR] [ERROR] SetError requested (reason=Prepare checkpoint failed. reason=(Failed to open (path=data/gs_log_1_1307.log, reason=Platform error (errno=24, description=Too many open files)))

What does this mean, "Too many open files"?

The above errors repeat for a while, and then it gives the following messages:

2020-01-26T15:53:07.746Z xxx 11751 WARNING CLUSTER_OPERATION [180060:CLM_STATUS_TO_SUBMASTER] Cluster status change to SUB_MASTER

2020-01-26T15:57:08.109Z xxx 11734 INFO SYSTEM_SERVICE [50912:SC_TRACE_STATS] clusterStatus=SUB_CLUSTER, nodeStatus=ABNORMAL,

It changes to SUB_MASTER, and then nodeStatus changes to ABNORMAL.

What is the root cause of this issue? When it says "too many open files", which files does it mean?

@knonomura
Member

Thank you for the error information.

My application runs a sampling query periodically and it was working fine without any issue for around 30 hours.

Let me confirm some points.

  • Does your application use many connections?
  • Are the application and the GridDB server on the same machine?
  • Could you please tell me the output of the following command?
ulimit -n

@harshajayaweeraXHJ
Author

  1. Yes, it creates a new connection on every call, like below:
this.store = GridStoreFactory.getInstance().getGridStore(createDBProps());
this.ts = store.putTimeSeries(COLLECTION_NAME, Payload.class);

and then, once the query is done, it leaves the connection open. (Is there any recommended approach to handling DB connections, or any pooling solution available for GridDB, like HikariCP?)

  2. Yes, my application and the GridDB server are on the same machine.

  3. The output of ulimit -n is 1024.

@knonomura
Member

knonomura commented Jan 28, 2020

If you call GridStoreFactory#getGridStore() but don't call GridStore#close(),
the connection (a socket file descriptor) remains open.

If so, please use the following method (A) or (B):
(A) Close GridStore object

GridStore store = GridStoreFactory.getInstance().getGridStore(props);
...
store.close();

(B) Reuse GridStore object

this.store = GridStoreFactory.getInstance().getGridStore(props);

ts = this.store.putTimeSeries(name, XXX.class); // or store.getTimeSeries(name, XXX.class)
...
ts.close();

ts = this.store.putTimeSeries(name, XXX.class); // or store.getTimeSeries(name, XXX.class)
...
ts.close();
...

The Java client library has a connection pool.
If necessary, please set maxConnectionPoolSize.
http://griddb.org/griddb_nosql/manual/GridDB_API_Reference.html#GridStoreFactory.html__setProperties(java.util.Properties)

And it might be better to increase the file descriptor limit.
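Raising the limit can be sketched as follows (a minimal example; the "gsadm" user name and the 65536 value in the limits.conf lines are assumptions to adapt to your environment):

```shell
# Show the current soft and hard limits on open file descriptors
ulimit -Sn
ulimit -Hn

# Raise the soft limit for the current shell session, up to the hard limit
ulimit -n "$(ulimit -Hn)"

# To make the change permanent for the GridDB service user (assumed here
# to be "gsadm"), add lines like these to /etc/security/limits.conf as root:
#   gsadm  soft  nofile  65536
#   gsadm  hard  nofile  65536
```

Note that a limits.conf change only applies to new login sessions, so the GridDB node would need to be restarted under the new limit.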

@harshajayaweeraXHJ
Author

Thank you very much for the directions.
But my question still remains: what is the root cause of the above errors?
What is "Too many open files"?

@harshajayaweeraXHJ
Author

Also, of the methods you suggested above, which one is recommended if I make frequent calls to the DB? (The front end usually issues a data-sampling query every 5 seconds.)

@knonomura
Member

What is the root cause for above errors?
What is "Too many open files"?

This is the error message for errno=24 among the Linux OS errors.
It is defined in errno.h.

#define EMFILE          24      /* Too many open files */

http://man7.org/linux/man-pages/man3/errno.3.html

Here, "files" means file descriptors, which include things like sockets.
I guess your process has reached the per-process limit on open file descriptors.
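A quick way to check this on Linux is to count the entries under /proc/<pid>/fd, which include sockets as well as regular files. A minimal sketch (the "gsserver" process name is an assumption; substitute your application's pid to watch the client side):

```shell
# Count the file descriptors currently open in a process via /proc.
# $$ (the current shell) is used as a stand-in here; substitute the pid
# of your Java application, or of the GridDB server (e.g. from
# "pgrep gsserver"), to watch the count grow as connections leak.
pid=$$
ls /proc/"$pid"/fd | wc -l

# Sockets count too: they show up as symlinks of the form "socket:[inode]"
ls -l /proc/"$pid"/fd | grep -c 'socket:' || true
```

If the count approaches the ulimit -n value (1024 in your case), new connections and log-file opens will start failing with EMFILE.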

@knonomura
Member

Also, of the methods you suggested above, which one is recommended if I make frequent calls to the DB? (The front end usually issues a data-sampling query every 5 seconds.)

I often use case (B) when queries are repeated in a thread or a function.

Note: a GridStore object isn't thread-safe. So, if you use multiple threads, please assign a GridStore object to each thread.
http://griddb.org/griddb_nosql/manual/GridDB_API_Reference.html#GridStore.html__
Thread safety of each method is not guaranteed.

@harshajayaweeraXHJ
Author

What is the root cause for above errors?
What is "Too many open files"?

This is the error message for errno=24 among the Linux OS errors.
It is defined in errno.h.

#define EMFILE 24 /* Too many open files */
http://man7.org/linux/man-pages/man3/errno.3.html

Here, "files" means file descriptors, which include things like sockets.
I guess your process has reached the per-process limit on open file descriptors.

But my server doesn't have any open files. Is this referring to database connections in this case?
