
Inserting lots of data to Google BigQuery randomly throws an SSL Exception #10625

Closed · PureKrome opened this issue Jul 7, 2023 · 24 comments

Labels: api: bigquery (Issues related to the BigQuery API) · priority: p2 (Moderately-important priority; fix may not be included in the next release) · type: question (Request for information or clarification; not an issue)

@PureKrome commented Jul 7, 2023

👋🏻 G'Day!

Problem

An SSL error is randomly thrown while importing a lot of data into Google BigQuery.

Details

I've got a (one-off) maintenance task that's trying to insert about 10 million rows into a Google BigQuery table. The code is nothing too crazy, because this is a one-off import.

Randomly (well, it feels random) an SSL error is thrown, which crashes my app.

Authentication failed because the remote party sent a TLS alert: 'DecryptError'.

I've added in some retrying ability (using Polly for .NET) and we're back on track.

I'm not really sure how to reproduce it, but I've provided some info here that may help. Sometimes it's happened after 30-40k rows have been pushed up; other times, hundreds of thousands. With Polly, it retries and the import continues ... until the next random failure, which Polly again retries OK, and we rinse/repeat.

[screenshot: the exception stack trace]

Environment details

PS C:\Users\justi> dotnet --info
.NET SDK:
 Version:   7.0.304
 Commit:    7e794e2806

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.22621
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\7.0.304\

Host:
  Version:      7.0.7
  Architecture: x64
  Commit:       5b20af47d9

  • Package name and version: Google.Cloud.BigQuery.V2 v3.2.0

Here's some sample code I've got which is doing this:

// Requires: using Polly;
// Configure the retry policy with Polly
var retryPolicy = Policy.Handle<Exception>()
    .WaitAndRetryAsync(3, retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)), (exception, timeSpan, retryCount, context) =>
    {
        // Log the exception to the console
        Console.WriteLine($"Retry #{retryCount} after {timeSpan.TotalSeconds} seconds due to: {exception}");
    });

try
{
    // Execute the retry policy for the HTTP request
    var pollyResult = await retryPolicy.ExecuteAsync(async () =>
    {
        var result = await client.InsertRowsAsync(destinationTable, rows, null, cancellationToken);

        return result;
    });


    _logger.LogDebug("Finished sending all rows to Google BigQuery. Status: {bigQueryInsertStatus}", pollyResult.Status);

    pollyResult.ThrowOnAnyError();

}
catch (Exception exception)
{
    // Log the final exception to the console
    Console.WriteLine($"Error occurred after retries: {exception}");

    throw;
}
@PureKrome changed the title to "Inserting lots of data to Google BigQuery randomly throws an SSL Exception" on Jul 7, 2023
@amanda-tarafa added the type: question, api: bigquery, and priority: p2 labels on Jul 7, 2023
@jskeet self-assigned this on Jul 7, 2023
@jskeet (Collaborator) commented Jul 7, 2023

Hi @PureKrome,

We've discussed this in the team, and we'd like to at least try to reproduce the problem for the sake of completeness, but we suspect we won't want to actually make any changes to how retries are performed... while it may be safe to automatically retry in your particular case, at the point where the network transport has problems we're in unusual territory, and in many cases retrying would be the wrong approach.

Could you let us know:

  • What your insert code looks like (just which methods you're using, and ideally a rough idea of how large each row is, in terms of fields and total data)
  • What your network is like - are you going through a proxy which might be causing problems, for example? (That seems to be the most likely cause of problems, to be honest.)
  • How often you see the errors in terms of elapsed time (due to some internal details, I wouldn't be entirely surprised to see issues once every hour, for example)

@jskeet (Collaborator) commented Jul 7, 2023

I've tried to reproduce this in https://github.com/jskeet/google-cloud-dotnet/tree/issue-10625 - I've not seen the problem yet.

My most recent attempt consisted of:

  • 10 million rows
  • Inserting 1000 rows per request
  • Each row consists of an integer ID, and a text field with 20 random ASCII characters

(The test took about 25 minutes from home.)

If you'd be happy to run the repro code in your environment, the results would be interesting.

@PureKrome (Author) commented:

Thanks @jskeet for showing interest in this issue - I really, really appreciate it.

I'll try and answer as many of the questions as possible:

What your insert code looks like (just which methods you're using, and ideally a rough idea of how large each row is, in terms of fields and total data)

var result = await client.InsertRowsAsync(destinationTable, rows, null, cancellationToken);

I have 7 tables I'm inserting data into. So I just spent the last 2 hours running my app to see when/where it fails and whether a specific table is always the cause. Two different tables had the same error at random times. The first one was at rows 83k -> 84k. The second one was at rows 2.433m -> 2.434m. I've also seen it happen at various/random places.

What your network is like - are you going through a proxy which might be causing problems, for example? (That seems to be the most likely cause of problems, to be honest.)

The errors occurred both from my home network and through the office network. Neither has anything special. Home has a pi-hole, but that's DNS. Office has nothing like that. I don't have anything special setup on my dev box here. E.g. no custom firewalls, etc.

How often you see the errors in terms of elapsed time (due to some internal details, I wouldn't be entirely surprised to see issues once every hour, for example)

Pretty random. Sometimes it's 5-10 mins, other times an hour+. It's usually never longer than an hour -> tonight's run was unique in how far I got (from 80k -> 2.4m).

If you'd be happy to run the repro code in your environment, the results would be interesting.

Test 1

OK. So I've done this and it worked 100% from my home env/location. Here's some data to help explain what I did (so we are both on the same page with no assumptions).


PS C:\Users\justi\Downloads\google-cloud-dotnet-issue-10625\issues\Issue10625> dotnet run <snipped> issue10625 test_table 20 10000000 1000
12:31:11 Inserted 0 rows
12:31:13 Inserted 10000 rows
...
12:39:22 Inserted 9970000 rows
12:39:22 Inserted 9980000 rows
12:39:22 Inserted 9990000 rows
PS C:\Users\justi\Downloads\google-cloud-dotnet-issue-10625\issues\Issue10625>

and this was with a simple Wi-Fi connection to home internet (25 up).

I had to change the global.json to:

{
    "sdk": {
        "version": "6.0.100",
        "rollForward": "latestFeature"
    }
}

and this was how I did my creds:

[screenshot: credential setup]

The datasetId was in the same region (australia-southeast2) as the other datasetId that was showing my issue. The only difference between the two datasets' info was that the new one (which I just created) has "Time travel window: 7 days" (maybe this is something new?).

Test 2

This time, I'm going to try using the async methods, to get closer to my code, which uses async (I'm getting desperate...).

[screenshot: test run output using the async methods]

.. and this worked 💯 fine, like before.

Question.

rough idea of how large each row is

How could I calculate how large each row is?

With the 2 errors that occurred in the run mentioned above:

  • Exception 1 : table has 21 columns
  • Exception 2 : table has 9 columns

cheers!

@jskeet (Collaborator) commented Jul 7, 2023

Thanks for the info. Will try with more columns next week, although it's hard to see how that would affect things.

I may also try with a delay to deliberately run the test over a long time period without actually having to insert billions of rows.

How could I calculate how large each row is?

Just roughly in terms of what the field types are, and how long any strings or byte arrays are - very roughly, as in are we talking about 100 bytes per row or 100K...

@PureKrome (Author) commented:

👋🏻 G'Day @jskeet

Sorry for the slow/late reply. Here's some more info:

  **** Reading in rows: 1  to 1,000 ****
Total size of data (in bytes): 142,460. MaxSize:200 MinSize:100
  **** Reading in rows: 1,001  to 2,000 ****
Total size of data (in bytes): 140,902. MaxSize:206 MinSize:102
  **** Reading in rows: 2,001  to 3,000 ****
Total size of data (in bytes): 140,830. MaxSize:212 MinSize:100
  **** Reading in rows: 3,001  to 4,000 ****
Total size of data (in bytes): 140,628. MaxSize:210 MinSize:94
  **** Reading in rows: 4,001  to 5,000 ****
Total size of data (in bytes): 141,728. MaxSize:208 MinSize:94
  **** Reading in rows: 5,001  to 6,000 ****
Total size of data (in bytes): 139,766. MaxSize:198 MinSize:100
  **** Reading in rows: 6,001  to 7,000 ****
Total size of data (in bytes): 141,180. MaxSize:240 MinSize:96
  **** Reading in rows: 7,001  to 8,000 ****
Total size of data (in bytes): 140,360. MaxSize:196 MinSize:98
  **** Reading in rows: 8,001  to 9,000 ****
Total size of data (in bytes): 146,100. MaxSize:210 MinSize:98
  **** Reading in rows: 9,001  to 10,000 ****
Total size of data (in bytes): 140,604. MaxSize:208 MinSize:92
  **** Reading in rows: 10,001  to 11,000 ****
Total size of data (in bytes): 139,920. MaxSize:196 MinSize:98
  **** Reading in rows: 11,001  to 12,000 ****
Total size of data (in bytes): 140,680. MaxSize:216 MinSize:102
  **** Reading in rows: 12,001  to 13,000 ****
Total size of data (in bytes): 140,378. MaxSize:196 MinSize:90
  **** Reading in rows: 13,001  to 14,000 ****
Total size of data (in bytes): 140,988. MaxSize:210 MinSize:98
  **** Reading in rows: 14,001  to 15,000 ****
Total size of data (in bytes): 138,344. MaxSize:244 MinSize:102
  **** Reading in rows: 15,001  to 16,000 ****
Total size of data (in bytes): 139,504. MaxSize:198 MinSize:106
  **** Reading in rows: 16,001  to 17,000 ****
Total size of data (in bytes): 142,420. MaxSize:196 MinSize:104
  **** Reading in rows: 17,001  to 18,000 ****
Total size of data (in bytes): 139,298. MaxSize:228 MinSize:98
  **** Reading in rows: 18,001  to 19,000 ****
Total size of data (in bytes): 140,160. MaxSize:206 MinSize:104
  **** Reading in rows: 19,001  to 20,000 ****
Total size of data (in bytes): 143,954. MaxSize:196 MinSize:104
  **** Reading in rows: 20,001  to 21,000 ****

So out of the 10-mil-odd DB rows, here's some data for the first 20k rows.

Feels like each POCO is just over 200 bytes.

The `record` has **21** properties:
using System.Text;

public record ListingData(
        int ListingId,
        DateTime UpdatedOn,
        string StatusType,
        string CategoryType,
        string? StreetNumber,
        string? Street,
        string Suburb,
        string? City,
        string State,
        int Postcode,
        bool IsStreetDisplayed,
        DateTime? OnMarketOn,
        string ProviderName,
        bool IsVisible,
        bool IsLive,
        string Tier,
        int BedroomsCount,
        int BathroomsCount,
        int CarsCount,
        int OfficeId,
        DateTime ImportedOn)
    {
        public int CalculateMemorySize()
        {
            // calculate the memory size as bytes of this single instance
            var sizeOfStatusType = Encoding.Unicode.GetByteCount(StatusType);
            var sizeOfCategoryType = Encoding.Unicode.GetByteCount(CategoryType);
            var sizeOfStreetNumber = string.IsNullOrEmpty(StreetNumber) ? 0 : Encoding.Unicode.GetByteCount(StreetNumber);
            var sizeOfStreet = string.IsNullOrEmpty(Street) ? 0 : Encoding.Unicode.GetByteCount(Street);
            var sizeOfSuburb = Encoding.Unicode.GetByteCount(Suburb);
            var sizeOfCity = string.IsNullOrEmpty(City) ? 0 : Encoding.Unicode.GetByteCount(City);
            var sizeOfState = Encoding.Unicode.GetByteCount(State);
            var sizeOfProviderName = Encoding.Unicode.GetByteCount(ProviderName);
            var sizeOfTier = Encoding.Unicode.GetByteCount(Tier);

            var sizeOfInstance = System.Runtime.CompilerServices.Unsafe.SizeOf<ListingData>();

            var totalSize = sizeOfInstance +
                            sizeOfStatusType +
                            sizeOfCategoryType +
                            sizeOfStreetNumber +
                            sizeOfStreet +
                            sizeOfSuburb +
                            sizeOfCity +
                            sizeOfState +
                            sizeOfProviderName +
                            sizeOfTier;

            return totalSize;
        }
    }

does this help?

@jskeet (Collaborator) commented Jul 10, 2023

You're using Encoding.Unicode.GetByteCount() there, which will mean two bytes per character... I suspect they're mostly ASCII characters, so that'll be 1 byte per character in UTF-8. That suggests it's more like 50-100 bytes per row in terms of data - with some overhead for the number of fields. I'll edit my test accordingly, but I'd be surprised if that made a difference. Worth testing though.
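
As a minimal sketch of that 2-bytes-vs-1-byte difference (the string here is a made-up, mostly-ASCII field value, used only for illustration):

using System;
using System.Text;

// A made-up, mostly-ASCII field value.
var suburb = "Melbourne";

// Encoding.Unicode is UTF-16: 2 bytes per ASCII character.
Console.WriteLine(Encoding.Unicode.GetByteCount(suburb)); // 18

// UTF-8 is 1 byte per ASCII character, which is closer to what the insert
// request actually sends (the JSON body is UTF-8 on the wire).
Console.WriteLine(Encoding.UTF8.GetByteCount(suburb)); // 9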

@PureKrome (Author) commented:

That suggests it's more like 50-100 bytes per row in terms of data

Kewl - which suggests it's not much data then, versus thousands and thousands of bytes per row.

does the "region" have any difference? could there be some rogue SSL cert or server, in the region I'm pushing the data up to?

@jskeet (Collaborator) commented Jul 10, 2023

does the "region" have any difference? could there be some rogue SSL cert or server, in the region I'm pushing the data up to?

The region may well be relevant, although I think the chances of a rogue cert on Google's side are slim.

I'll definitely try with australia-southeast2 as well though.

@jskeet (Collaborator) commented Jul 10, 2023

Okay, next test - still in multi-region, US:

Text fields per row (in addition to ID): 20
Size of text per field: 20
Rows per insert: 1000
Total rows inserted: 10000000
Time taken (seconds): 4337

(Code is in the branch.) Now starting to run against a dataset in Australia (inserting from London).

@PureKrome (Author) commented:

The region may well be relevant, although I think the chances of a rogue cert on Google's side are slim.

:) Yep! I was really fishing hard, with that idea 😬

@jskeet (Collaborator) commented Jul 10, 2023

And for a dataset in Australia:

Text fields per row (in addition to ID): 20
Size of text per field: 20
Rows per insert: 1000
Total rows inserted: 10000000
Time taken (seconds): 11055

So still no joy in reproducing at the moment :(

@jskeet (Collaborator) commented Jul 13, 2023

@PureKrome: I'm not sure what the next step is at this point. If you can run the tests (which are now checked in at https://github.com/googleapis/google-cloud-dotnet/tree/main/issues/Issue10625) some more to see if the problem is reproduced that way, that would be useful. (I'd suggest using the GOOGLE_APPLICATION_CREDENTIALS environment variable to refer to a JSON file with your service account rather than modifying the code.)

I'm on holiday for a week now, so won't be able to run any more tests (although I'll still monitor this issue). Looking at the stack trace again, it's interesting that it's failing when establishing the connection... I'm not sure why that would happen during the run. Unless you're creating a new BigQueryClient for each batch of rows? If you're able to provide a similar standalone program that does reproduce the issue for you, that would be really helpful.
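
For reference, a minimal sketch of the GOOGLE_APPLICATION_CREDENTIALS suggestion above (the path and project ID are placeholders): with the variable set, the client picks up the service-account credentials automatically, so the repro code doesn't need modifying.

// Set before running, e.g. in PowerShell:
//   $env:GOOGLE_APPLICATION_CREDENTIALS = "C:\path\to\service-account.json"
using Google.Cloud.BigQuery.V2;

// No explicit credential: the client falls back to Application Default
// Credentials, which read the JSON key file named by the env var.
var client = await BigQueryClient.CreateAsync("your-project-id");
// ... then use the client exactly as before, e.g. client.InsertRowsAsync(...)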

@PureKrome (Author) commented:

👋🏻 G'Day Jon!

Unless you're creating a new BigQueryClient for each batch of rows?

YES! This was what I was doing, actually! These lines:

// IDs: the current batch of IDs, e.g. 1->1000, 1001->2000, etc.
public async Task UploadDataAsync(IList<int> ids, CancellationToken cancellationToken)
{
    // Do some business logic + MS-SQL call.
    var (users, uniSubjects) = await _repository.GetDataAsync(ids, cancellationToken);

    var client = BigQueryClient.Create(_projectId, _googleCredential);
    var usersTableReference = client.GetTableReference(_datasetId, "users");
    var subjectsTableReference = client.GetTableReference(_datasetId, "subjects");

    // NOTE: there's no AWAIT here. This is just -the task-
    var usersTask = UploadToBigQueryAsync(
        client, 
        usersTableReference, 
        users, 
        ConvertToUsersHandler, 
        cancellationToken);

    // Again, no AWAIT.
    var subjectsTask = UTBQA(...);

    await Task.WhenAll(usersTask, subjectsTask);   
}

So as you can see here, each time we did a batch of 1000 rows to upload to BQ, I would create the client AND the table references!

So now I've changed the code to be something like this:

// CTOR
public BigQueryService(...)
{
    _client = new Lazy<BigQueryClient>(() => BigQueryClient.Create(projectId, googleCredential));
    _usersTableReference = new Lazy<TableReference>(() => _client.Value.GetTableReference(_datasetId, "users"));
    _subjectsTableReference = new Lazy<TableReference>(() => _client.Value.GetTableReference(_datasetId, "subjects"));
}

public async Task UploadDataAsync(IList<int> ids, CancellationToken cancellationToken)
{
    // Do some business logic + MS-SQL call.
    var (users, uniSubjects) = await _repository.GetDataAsync(ids, cancellationToken);

    // Create the client and table references when FIRST TIME required/requested/called.
    var usersTask = UploadToBigQueryAsync(
        _client.Value, 
        _usersTableReference.Value, 
        users, 
        ConvertToUsersHandler, 
        cancellationToken);

    // Again, no AWAIT.
    var subjectsTask = UTBQA(...);

    await Task.WhenAll(usersTask, subjectsTask);  
}

And the instance of this class is a singleton / single instance in my main app. The Lazy<T> could be overkill, but I just wanted to give that a crack.

So, given this information, I'm still trying to understand the relationship between the error (which the stack trace suggests happens during BQ InsertRowsAsync) and creating a client (too many times)?

I'll also give your code another run, as requested. I'll report back.

@jskeet (Collaborator) commented Jul 13, 2023

Each client has its own HttpClient, which would create a separate set of network connections. I suspect you're running out of TCP ports or similar. I strongly suspect this will clear up the problem. I might try making a change to the repro to create a new client for each batch and see if that fixes the problem - then I could document it accordingly.

@jskeet (Collaborator) commented Jul 13, 2023

I tried my code with the client initialization within the loop, and it was still working fine - but I didn't let it do a huge run due to other constraints. It's definitely worth changing your code to only create a single client, though. Table references are simple - no need to worry about caching/reusing those.
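
As a rough sketch of that shape (the class and method names here are hypothetical, loosely based on the snippets above): one client for the lifetime of the service, with table references created wherever they're needed.

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Google.Apis.Auth.OAuth2;
using Google.Cloud.BigQuery.V2;

public class BigQueryUploader
{
    private readonly BigQueryClient _client;
    private readonly string _datasetId;

    public BigQueryUploader(string projectId, GoogleCredential credential, string datasetId)
    {
        // One client (and therefore one underlying HttpClient / connection pool)
        // for the lifetime of the uploader, rather than one per batch.
        _client = BigQueryClient.Create(projectId, credential);
        _datasetId = datasetId;
    }

    public async Task UploadBatchAsync(
        string tableId,
        IEnumerable<BigQueryInsertRow> rows,
        CancellationToken cancellationToken)
    {
        // Table references are lightweight; creating one per call is fine.
        var table = _client.GetTableReference(_datasetId, tableId);

        var result = await _client.InsertRowsAsync(table, rows, null, cancellationToken);
        result.ThrowOnAnyError();
    }
}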

@PureKrome (Author) commented:

Hi @jskeet - I really hope you're having a rest on your break so maybe read this reply when you get back 🏖️ 🌞

When I refactored to a single client, I never got the issue again. So it feels like this is not really an issue about INSERTING data - it's really about creating heaps of clients.

Maybe something like this might be used to see if this is an issue with the client creation?

using Google.Cloud.BigQuery.V2;

string projectId = "...";

// Create 1,000 clients
var tasks = new List<Task>(1000);

for (var i = 0; i < 100_000; i++)
{
    tasks.Clear();

    for (var j = 0; j < 1000; j++)
    {
        var task = CreateTaskAndRunASingleThing(projectId);
        tasks.Add(task);
    }

    await Task.WhenAll(tasks);

    Console.WriteLine($"Iteration {i:N0} completed");



    async Task CreateTaskAndRunASingleThing(string bigQueryProjectId)
    {
        // Note: a local function parameter can't reuse the name of the enclosing
        // 'projectId' local, hence the different name here.
        // Credentials are provided via a JSON file via the GOOGLE_APPLICATION_CREDENTIALS env var.
        var client = await BigQueryClient.CreateAsync(bigQueryProjectId);

        // NOTE: not sure if this is needed?
        await client.GetBigQueryServiceAccountEmailAsync(); // So -some- type of network traffic is executed.
    }
}

@jskeet (Collaborator) commented Jul 16, 2023

Not really answering... but I should check when I'm back whether we cache a service for default credentials, which would make my current test code unsuitable. At the very least, using the same credential for each client will be a bit different. And we don't use self-signed JWTs for BigQuery.

@jskeet (Collaborator) commented Jul 21, 2023

Hmm... I still can't reproduce this when inserting, even with loads of clients each with a new credential. But it may well depend on precise environmental factors at that point. I'm really glad to hear that using a single client seems to have resolved the problem - but I'll leave this issue open and have another go at reproducing it next week with the code you've suggested above.

Just for clarity, does that code reproduce the issues on your machine?

@PureKrome (Author) commented:

Just for clarity, does that code reproduce the issues on your machine?

Sorry - I don't understand. Which code do you mean - the code that causes the SSL issue?

I also might see if I can reproduce it by packaging up my original code (1 client per upload-to-BigQuery) into a single exe and then running it from a VM in Azure. That way, I can see if it's a weird issue from my machine, and isolate some hardware out of the equation.

@jskeet (Collaborator) commented Jul 21, 2023

Sorry - I don't understand. Which code do you mean - the code that causes the SSL issue?

Yup. I haven't been able to reproduce that anywhere yet. Using a single client is a better idea anyway, but at the moment I don't know whether it's really related to this issue.

@jskeet (Collaborator) commented Jul 31, 2023

@PureKrome: Have you seen enough data now with a single client to know whether this has actually fixed the issue? While I still haven't reproduced the problem myself, I can imagine it could be related, which would be good to know in case we see other users with the same issue. It's a shame that I can't reproduce it, but if your code now works, I'd like to close this issue. If you're still seeing the error, we should absolutely continue to pursue it...

@PureKrome (Author) commented Aug 2, 2023

👋🏻 G'Day @jskeet - OK, interesting! I tried running my code again on an Azure VM and it didn't seem to error at all. Ran it on my localhost and yeah ... it took a while, but I got the error popping up again.

So I think it's fair to say it's one (or more) of the following:

  • It's something weird with my laptop
  • It's something weird with my home network (wifi? my pi-hole handling DNS requests?)

I was expecting the error to pop up when I did it on the Azure VM - but it didn't.

I feel really bad about this as I honestly thought it might be a rogue setting somewhere outside of my little bubble. Serious lessons learnt here.

I'm going to close this card.

Again Jon - thank you so much for showing some interest in this post. Keep up the awesome work.

Thanks, mate.

@jskeet (Collaborator) commented Aug 2, 2023

I feel really bad about this as I honestly thought it might be a rogue setting somewhere outside of my little bubble. Serious lessons learnt here.

Please don't feel bad about this at all - I thought it was a great collaboration. Okay, we didn't find anything in the end - but we still ended up with a code improvement for you, and I'll definitely remember this error if it comes up for another customer.
Please keep reporting bugs :)

@PureKrome (Author) commented:

Thanks, Jon, for the kind words. Yep - an action item here was definitely to reuse the client instead of creating a new one for each item to save. So there was some good that came out of this.

Thank you again for all your patience and great questions in trying to resolve this. Really appreciated it. 👏🏻 🍰 👍🏻
