Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempting to read from a non-existent database can create many distinct client IDs in the session metrics #202

Closed
sebright opened this issue May 13, 2020 · 9 comments · Fixed by #206 or #766
Assignees
Labels
api: spanner priority: p2 🚨 type: bug

Comments

@sebright
Copy link
Contributor

@sebright sebright commented May 13, 2020

I noticed that calling Spanner.getDatabaseClient with a non-existent database and then trying to read from the database can cause new client IDs to appear in the OpenCensus session metrics. The client IDs for the non-existent database seem problematic because they can cause the size of the metrics to grow.

Here is a simplified reproduction:

google-cloud-spanner version: 1.54.0
opencensus-java version: 0.26.0

import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.InstanceConfigId;
import com.google.cloud.spanner.InstanceId;
import com.google.cloud.spanner.InstanceInfo;
import com.google.cloud.spanner.Key;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import io.opencensus.exporter.stats.stackdriver.StackdriverStatsExporter;
import java.util.Arrays;
import org.junit.Test;

public class ClientIdTest {
  @Test
  public void testClientId() throws Exception {
    StackdriverStatsExporter.createAndRegister();
    Spanner spanner = SpannerOptions.getDefaultInstance().getService();
    String projectId = spanner.getOptions().getProjectId();
    InstanceId instanceId = InstanceId.of(projectId, "my-instance");
    InstanceConfigId instanceConfigId = InstanceConfigId.of(projectId, "regional-us-central1");
    spanner
        .getInstanceAdminClient()
        .createInstance(
            InstanceInfo.newBuilder(instanceId)
                .setInstanceConfigId(instanceConfigId)
                .setDisplayName("my-instance")
                .setNodeCount(1)
                .build())
        .get();
    DatabaseId databaseId = DatabaseId.of(instanceId, "my-database");

    for (int i = 0; i < 100; i++) {
      try {
        spanner
            .getDatabaseClient(databaseId)
            .singleUse()
            .readRow("MyTable", Key.of(0), Arrays.asList("MyColumn"));
      } catch (Exception e) {
        // ignore
      }
    }
    Thread.sleep(300_000);
  }
}

The metrics exported to Stackdriver contain 100 client IDs. For example, here is in_use_sessions:

session-metrics

session-metrics-2

@product-auto-label product-auto-label bot added the api: spanner label May 13, 2020
@olavloite olavloite self-assigned this May 13, 2020
@yoshi-automation yoshi-automation added the triage me label May 14, 2020
@skuruppu skuruppu added type: bug and removed triage me labels May 15, 2020
@sebright
Copy link
Contributor Author

@sebright sebright commented May 15, 2020

I tried updating to master after #206 was merged, but I'm still getting a large number of client IDs. I also tried running the reproduction above, but this time the client IDs ranged from 1 to 50 instead of 1 to 100.

@sebright sebright reopened this May 15, 2020
@yoshi-automation yoshi-automation added the triage me label May 15, 2020
@olavloite olavloite added priority: p2 and removed triage me labels May 15, 2020
@olavloite
Copy link
Contributor

@olavloite olavloite commented May 15, 2020

@sebright
I'm not really able to reproduce this locally. I've just added an additional test case based on your reproduction code above, but I'm seeing the same client id each time.

The client ids in the range 1 to 50 especially baffles me. Is your test code literally equal to the reproduction code you mentioned above, or could it be that the test code is somehow running parts of it in parallel?

@olavloite
Copy link
Contributor

@olavloite olavloite commented Jul 10, 2020

@sebright Friendly ping. Is this still an issue?

@sebright
Copy link
Contributor Author

@sebright sebright commented Jul 14, 2020

@olavloite Sorry for the delay. I tried running the reproduction from the first comment again with google-cloud-spanner 1.58.0. This time I didn't see any stats recorded for the metric in_use_sessions, but I saw the same behavior as in #202 (comment) for max_in_use_sessions. There were 50 distinct client IDs (client-1 through client-50).

@olavloite
Copy link
Contributor

@olavloite olavloite commented Jul 16, 2020

@sebright I'm still not able to reproduce this locally using both a real Cloud Spanner instance as well as a local mock server. I've therefore created a small integration test so we can use that to try to figure out where the difference is coming from. The integration test in #353 is based on the example at the beginning of this issue and checks that only one client id is created, and it currently succeeds.

Note that:

  1. The integration test itself does not create a new instance, but uses the instance that is created for the integration test run. That should however as far as I can see make any difference for this case.
  2. All the read operations will fail as the database that is requested does not exist. Those errors are ignored.

At the end of the test it verifies that only one client id has been generated and used. Would you mind having a look and see if you can spot where the difference in outcome could be coming from in this test case, compared to the test that you run on your setup?
The outcome of the test run can be found here: https://source.cloud.google.com/results/invocations/03df0d70-cdef-41aa-ab8b-5208de7e6959/targets/github%2Fjava-spanner%2Fgoogle-cloud-spanner%2Ftarget%2Ffailsafe-reports%2Fintegration%2Fcom.google.cloud.spanner.it.ITClientIdTest/tests

@skuruppu
Copy link
Contributor

@skuruppu skuruppu commented Sep 16, 2020

Friendly ping @sebright. If you could look at the code that @olavloite wrote, it would help us to repro the issue.

@skuruppu
Copy link
Contributor

@skuruppu skuruppu commented Oct 13, 2020

Since we haven't had updates here in a while, I'm going to close the issue. Please re-open if this issue is still happening.

@sebright
Copy link
Contributor Author

@sebright sebright commented Dec 29, 2020

Thanks for sharing the integration test. I rebased #353 and tried running it, and I think I understand the difference. My reproduction uses opencensus 0.26.0 and has a dependency on opencensus-impl, and the integration test uses opencensus 0.24.0 without opencensus-impl. The integration test fails when I add these dependencies to google-cloud-spanner/pom.xml:

    <dependency>
      <groupId>io.opencensus</groupId>
      <artifactId>opencensus-api</artifactId>
      <version>0.26.0</version>
    </dependency>
    <dependency>
      <groupId>io.opencensus</groupId>
      <artifactId>opencensus-contrib-grpc-util</artifactId>
      <version>0.26.0</version>
    </dependency>
    <dependency>
      <groupId>io.opencensus</groupId>
      <artifactId>opencensus-impl-core</artifactId>
      <version>0.26.0</version>
    </dependency>
    <dependency>
      <groupId>io.opencensus</groupId>
      <artifactId>opencensus-impl</artifactId>
      <version>0.26.0</version>
    </dependency>

It looks like opencensus-impl 0.26.0 throws an exception when multiple time series are recorded with the same client ID. The test alternates between throwing "Database not found" and "A different time series with the same labels already exists" exceptions in each iteration of the for loop. Here is the stack trace for the time series error:

java.lang.IllegalArgumentException: A different time series with the same labels already exists.
	at io.opencensus.implcore.metrics.DerivedLongGaugeImpl.createTimeSeries(DerivedLongGaugeImpl.java:91)
	at com.google.cloud.spanner.SessionPool.initMetricsCollection(SessionPool.java:2381)
	at com.google.cloud.spanner.SessionPool.<init>(SessionPool.java:1862)
	at com.google.cloud.spanner.SessionPool.createPool(SessionPool.java:1839)
	at com.google.cloud.spanner.SessionPool.createPool(SessionPool.java:1798)
	at com.google.cloud.spanner.SpannerImpl.getDatabaseClient(SpannerImpl.java:225)
	at com.google.cloud.spanner.it.ITClientIdTest.testClientId(ITClientIdTest.java:56)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.junit.runners.Suite.runChild(Suite.java:128)
	at org.junit.runners.Suite.runChild(Suite.java:27)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.apache.maven.surefire.junitcore.JUnitCore.run(JUnitCore.java:55)
	at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:137)
	at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeEager(JUnitCoreWrapper.java:107)
	at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:83)
	at org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:75)
	at org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:157)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)

EDIT: I also noticed that the list SpannerImpl.invalidatedDbClients grew by one element every two iterations, which could cause memory usage to increase.

@sebright sebright reopened this Dec 29, 2020
@yoshi-automation yoshi-automation added the 🚨 label Dec 29, 2020
@olavloite olavloite removed the 🚨 label Dec 31, 2020
@yoshi-automation yoshi-automation added the 🚨 label Dec 31, 2020
olavloite added a commit that referenced this issue Dec 31, 2020
Adding the same time series multiple times will cause an error in OpenCensus. A time
series could be added multiple times for the same Database client id if a database
client was invalidated, for example because if was created for a database that did
not exist.

Fixes #202
@olavloite olavloite removed the 🚨 label Dec 31, 2020
@yoshi-automation yoshi-automation added the 🚨 label Dec 31, 2020
@olavloite
Copy link
Contributor

@olavloite olavloite commented Dec 31, 2020

@sebright Thanks for the reproduction case! I'm able to reproduce and solve it for version 0.26 now.

See the linked PR for more information, but the TLDR is that OpenCensusImpl 0.24 does not allow the same metric to be registered multiple times, which means that it is not solvable for that version. 0.26 does work as expected with the change in #766.

@olavloite olavloite removed the 🚨 label Dec 31, 2020
@yoshi-automation yoshi-automation added the 🚨 label Dec 31, 2020
thiagotnunes pushed a commit that referenced this issue Jan 14, 2021
Adding the same time series multiple times will cause an error in OpenCensus. A time
series could be added multiple times for the same Database client id if a database
client was invalidated, for example because if was created for a database that did
not exist.

Fixes #202
thiagotnunes pushed a commit that referenced this issue May 6, 2021
Adding the same time series multiple times will cause an error in OpenCensus. A time
series could be added multiple times for the same Database client id if a database
client was invalidated, for example because if was created for a database that did
not exist.

Fixes #202
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner priority: p2 🚨 type: bug
Projects
None yet
4 participants