[10.x] The validator is very slow with large datasets when the key contains asterisks #49375

Closed
eusonlito opened this issue Dec 14, 2023 · 20 comments


@eusonlito
Contributor

eusonlito commented Dec 14, 2023

Laravel Version

10.37.3

PHP Version

8.3.0

Database Driver & Version

No response

Description

As reported in previous Laravel versions, the Laravel Validator is very slow with large datasets:

I have tried to understand and review the validation process, but it is too complex for me to be able to send a PR. I could only find that it iterates thousands of times over all values when calculating the validation rules. The problem only seems to appear when the fields to validate contain asterisks.

Here is a tweet with some tests: https://twitter.com/lito_ordes/status/1735082009666494594

And the repo to test: https://github.com/eusonlito/laravel-validator-slow

Steps To Reproduce

Test Results

"Read Data Time: 0.0120"
"Items: 11351"
"validatorAsterisk Time: 61.9842"
"validatorFixed Prepare: 0.0048"
"validatorFixed Time: 0.1179"
"validatorChunks1000 Time: 2.1746"
"validatorChunks5000 Time: 10.9196"
"validatorChunks10000 Time: 46.9144"

Added memory usage and more chunk sizes:

"Read Data: 0.0120 seconds / 32.00 memory"
"Items: 11351"
"validatorAsterisk: 60.9496 seconds / 68.50 memory"
"rulesFixed: 0.0049 seconds / 60.50 memory"
"validatorFixed: 0.1193 seconds / 74.00 memory"
"validatorChunks100: 0.3551 seconds / 64.00 memory"
"validatorChunks1000: 2.1089 seconds / 66.00 memory"
"validatorChunks5000: 10.5173 seconds / 70.00 memory"
"validatorChunks10000: 46.5916 seconds / 77.50 memory"
@driesvints
Member

Thanks but I just don't think these are realistic examples, sorry. Sending that much data at once through the validator will slow things down. There are other ways to work with data and validate it. Of course, if you could ever find out ways to improve validation speed, we'd be open to a PR.

@eusonlito
Contributor Author

Thank you very much for your reply @driesvints.

In any case, with all due respect to you and your work, saying that the validator of the best PHP framework does not work with a large amount of data (and I don't know if 10k rows can even be considered a large amount) is to disregard a problem that clearly exists in the code.

The chunk examples show that there really is a superlinear (roughly quadratic) scaling problem that, in my opinion, should be fixed.

As for realistic examples: I'm using the validator to validate CSV and Excel sheets, and also to accept product listings in ecommerce import systems, bulk pricing updates, log analysis, etc. Right now I am replacing that same validation with one of my own, which does exactly the same job the framework system should do.

Again, thank you very much for your great work, and I think it would be a good idea to reconsider the decision to close the issue, or at least take a look at it to see what can be done.

Best regards!

@K2ouMais

The performance problems aren't just because of a big file. I have a JSON object with only 50 items in it, but a lot of fields to be validated, and I can also see the same behaviour.

It would be really awesome if you could reconsider this and take a look at whether it can be improved.

@bastien-phi
Contributor

Having the same issue when validating CSV files with a few hundred lines. I feel like there is some room for performance improvement in the validator. I agree with @eusonlito: please reconsider this issue, as it is a realistic example.

@tpetry
Contributor

tpetry commented Dec 15, 2023

Encountered the same problem in the past with an API response that had just 7k rows and needed to be validated to ensure it was valid against the spec. The performance was sometimes so slow that execution was stopped because of max_execution_time.

@driesvints Are these realistic enough examples to reopen it and let the community think of a solution? It affects many people, and I guess each one has wasted hours researching the issue (in production) and many more hours trying to find the problem, in the end abandoning Laravel's validation functionality for this part.

@comes
Contributor

comes commented Dec 15, 2023

referencing this PR #49386 here

@comes
Contributor

comes commented Dec 15, 2023

To close the circle, @eusonlito: with the improved \Arr::dot() from the linked PR, validation is still slow and can be improved. But these are the results:

"Read Data: 0.0078 seconds / 32.00 memory" // app/Console/Commands/Slow.php:116
"Items: 11351" // app/Console/Commands/Slow.php:36
"validatorAsterisk: 9.1073 seconds / 72.50 memory" // app/Console/Commands/Slow.php:116
"rulesFixed: 0.0034 seconds / 64.50 memory" // app/Console/Commands/Slow.php:116
"validatorFixed: 0.1015 seconds / 78.00 memory" // app/Console/Commands/Slow.php:116
"validatorChunks10: 0.2462 seconds / 68.00 memory" // app/Console/Commands/Slow.php:116
"validatorChunks100: 0.2528 seconds / 68.00 memory" // app/Console/Commands/Slow.php:116
"validatorChunks1000: 0.8292 seconds / 70.00 memory" // app/Console/Commands/Slow.php:116
"validatorChunks5000: 3.1408 seconds / 74.00 memory" // app/Console/Commands/Slow.php:116
"validatorChunks10000: 7.2774 seconds / 81.50 memory" // app/Console/Commands/Slow.php:116```

@driesvints
Member

Look everyone. For performance improvements we really just appreciate any PRs you could send. It seems @bastien-phi did just that and already sent in a PR which seems to help. If there are any additional speed improvements to be made, we'd appreciate more PRs with proper benchmarks.

Encountered the same problem in the past with an API response that had just 7k rows and needed to be validated to ensure it was valid against the spec. The performance was sometimes so slow that execution was stopped because of max_execution_time.

@tpetry I just don't feel this is a realistic example, sorry. 7K rows for an API response is just too much and should be paginated.

Same for 10K-row Excel sheets, CSV tables, etc. These should be batched, and the validations of these rows should be aggregated into a result set.

I know everyone in this thread feels I'm a bit quick to judge here, but it seems we don't see eye to eye and just have a difference of opinion. That's fine, and the best thing you can do in this situation is either attempt a PR to improve performance if there are any bottlenecks, or try to find a different way to solve the problem for your use case. We don't go into performance improvements ourselves as we have other priorities/work of our own. In these situations we appreciate an effort from the community.

I also realise most of you think I'm wrong in this situation, but I'd appreciate it if you also respected my opinion even if it doesn't align with your own (mostly referring to people who downvoted me). I appreciate all of you who replied with thoughtful responses in this thread.

@AlexandreGerault
Contributor

AlexandreGerault commented Dec 19, 2023

@driesvints I'm sorry if you feel the downvotes were disrespectful; that was not the intention. I apologize.

When I downvoted your feedback, it was only to show that I disagree based on the information we have (which suggested exponential complexity). However, I don't expect you to change your mind based on downvotes and say "We'll look at it and fix it"; I understand the core team has other priorities.

🙏

@comes
Contributor

comes commented Dec 19, 2023

Hey folks,

I'm the guy who figured out the issue with Arr::dot(), and the one who created the poor PR #49380 with a solution.

I spent a little time over the past few days on the second bottleneck, and I was able to boil it down to a few lines of code within the ValidationRuleParser.

https://github.com/laravel/framework/blob/10.x/src/Illuminate/Validation/ValidationRuleParser.php#L145-L174

    protected function explodeWildcardRules($results, $attribute, $rules)
    {
        $pattern = str_replace('\*', '[^\.]*', preg_quote($attribute, '/'));

        $data = ValidationData::initializeAndGatherData($attribute, $this->data);

        foreach ($data as $key => $value) {
            if (Str::startsWith($key, $attribute) || (bool) preg_match('/^'.$pattern.'\z/', $key)) {
                foreach ((array) $rules as $rule) {
                    if ($rule instanceof NestedRules) {
                        $compiled = $rule->compile($key, $value, $data);

                        $this->implicitAttributes = array_merge_recursive(
                            $compiled->implicitAttributes,
                            $this->implicitAttributes,
                            [$attribute => [$key]]
                        );

                        $results = $this->mergeRules($results, $compiled->rules);
                    } else {
                        $this->implicitAttributes[$attribute][] = $key;

                        $results = $this->mergeRules($results, $key, $rule);
                    }
                }
            }
        }

        return $results;
    }

Mainly, $rule->compile(...) causes a significant bottleneck.

I've created a simple benchmark within ValidationForEachTest.php:

    public function testAsterisk()
    {
        $subset = [];

        for ($i = 0; $i < 550; $i++) {
            $subset[] = [
                'id' => $i,
                'name' => 'Name ' . $i,
                'email' => 'email' . $i . '@example.com',
                'date' => now()->addDays($i)->format('Y-m-d'),
                'another_sub_array' => [
                    'street' => 'Street ' . $i,
                    'city' => 'City ' . $i,
                    'country' => 'Country ' . $i,
                ],
            ];
        }

        $data = ['items' => $subset];

        $rules = [
            'items.*' => Rule::forEach(function () {
                return [
                    'id' => 'integer',
                    'name' => ['required', 'string'],
                    'email' => ['required', 'email'],
                    'date' => ['required', 'date'],
                ];
            }),
        ];

        $trans = $this->getIlluminateArrayTranslator();

        // $v = new Validator($trans, $data, $rules);
        $v = new Validator($trans, $data, []);
        $start = microtime(true);
        $res = (new ValidationRuleParser($data))->explode($rules);
        // dd($res);
        dd(microtime(true) - $start);

        $this->assertFalse($v->fails());
    }

If one of you can figure out how to improve this code, I'll do my best to contribute all my knowledge. Maybe it is the countless calls to array_merge_recursive(). I honestly don't know yet.

If we call $precompiled = (new ValidationRuleParser($data))->explode($rules); with our ruleset and the data, the ValidationRuleParser will return the compiled array of rules, which the issue opener @eusonlito called rulesFixed.

If we then feed the validator the precompiled ruleset, new Validator($trans, $data, $precompiled->rules), it is fast as hell.

I learned this trick from the amazing @calebporzio, who used it in Livewire: https://github.com/livewire/livewire/blob/2.x/src/ComponentConcerns/ValidatesInput.php#L217-L238.
Maybe we can improve Laravel's ValidationRuleParser with his idea.
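
Putting the pieces of that trick together, a minimal sketch (assuming $trans, $data and $rules as in the benchmark above; per the numbers in this thread, the explode step is the expensive part and only has to run once):

    use Illuminate\Validation\ValidationRuleParser;
    use Illuminate\Validation\Validator;

    // Compile the wildcard rules against the data once...
    $precompiled = (new ValidationRuleParser($data))->explode($rules);

    // ...then hand the already-expanded rules to the Validator.
    $v = new Validator($trans, $data, $precompiled->rules);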

@AlexandreGerault
Contributor

AlexandreGerault commented Dec 20, 2023

I think array_merge_recursive is indeed slow and should be avoided inside loops.
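
A standalone sketch of the usual fix (illustrative only; this is not the framework's code): collect the per-iteration fragments and merge once after the loop, instead of re-merging the growing accumulator on every pass:

    $fragments = [];

    // Collect one small fragment per iteration instead of merging immediately.
    foreach (range(0, 9999) as $i) {
        $fragments[] = ['items' => ["items.{$i}"]];
    }

    // Single merge after the loop: linear overall.
    $merged = array_merge_recursive(...$fragments);

versus the per-iteration pattern, which copies the ever-growing accumulator each time and scales quadratically:

    $merged = [];

    foreach ($fragments as $fragment) {
        $merged = array_merge_recursive($merged, $fragment);
    }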

@K2ouMais

K2ouMais commented Dec 20, 2023

I am wondering if, instead of calling array_merge_recursive on every iteration of the loop, this would help:

$mergeArrays = $compiled->implicitAttributes + $this->implicitAttributes + [$attribute => [$key]];

$this->implicitAttributes = array_merge_recursive(...$mergeArrays);

Could the array union become a problem?

@comes
Contributor

comes commented Dec 20, 2023

I am wondering if, instead of calling array_merge_recursive on every iteration of the loop, this would help:

$mergeArrays = $compiled->implicitAttributes + $this->implicitAttributes + [$attribute => [$key]];

$this->implicitAttributes = array_merge_recursive(...$mergeArrays);

Could the array union become a problem?

ArgumentCountError: array_merge_recursive() does not accept unknown named parameters

(The array union above produces string keys, and spreading an array with string keys passes them to PHP as named arguments, which array_merge_recursive() does not accept.)

@K2ouMais

K2ouMais commented Dec 20, 2023

I am wondering if, instead of calling array_merge_recursive on every iteration of the loop, this would help:

$mergeArrays = $compiled->implicitAttributes + $this->implicitAttributes + [$attribute => [$key]];

$this->implicitAttributes = array_merge_recursive(...$mergeArrays);

Could the array union become a problem?

ArgumentCountError: array_merge_recursive() does not accept unknown named parameters

Thank you for your help. I couldn't try it from my vacation, but I am really looking forward to a performance improvement here, as I always have problems with the validator.

EDIT:
I took a look at your test, and it seems I don't quite understand what you are trying to test here.
Maybe there is a performance bottleneck when parsing rules, but I think the bottleneck everyone is complaining about might be somewhere else.

This is what I have done, and performance seems to be a lot better, even when I am using the asterisk; but I am not using the Rule::forEach() method.

        // $rules = [
        //     'items.*' => Rule::forEach(function () {
        //         return [
        //             'id' => 'integer',
        //             'name' => ['required', 'string'],
        //             'email' => ['required', 'email'],
        //             'date' => ['required', 'date'],
        //         ];
        //     }),
        // ];

        $rules = [
            'items' => ['required', 'array'],
            'items.*.id' => ['integer'],
            'items.*.name' => ['required', 'string'],
            'items.*.email' => ['required', 'email'],
            'items.*.date' => ['required', 'date'],
        ];

@AJenbo
Contributor

AJenbo commented Apr 19, 2024

I noticed that validating an array with 1000 rows takes about 1 minute, but validating them one row at a time only takes about 3 seconds. There are no rules that require anything from multiple rows, so it feels like there should be a way to improve this, though I have no idea what part of the code is causing the slowdown.
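
For anyone who wants to try the row-at-a-time approach, a rough sketch (assuming rows with no cross-row rules; $rows and the rules are illustrative):

    use Illuminate\Support\Facades\Validator;

    $errors = [];

    // Validate each row on its own, avoiding wildcard expansion entirely,
    // and aggregate the failures into one array keyed by row index.
    foreach ($rows as $i => $row) {
        $validator = Validator::make($row, [
            'id'    => ['required', 'integer'],
            'email' => ['required', 'email'],
        ]);

        if ($validator->fails()) {
            $errors["rows.{$i}"] = $validator->errors()->toArray();
        }
    }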

@333sec

333sec commented Apr 20, 2024

When validating large data such as CSV data, the following tips can greatly improve processing speed.

  • Replace asterisks in rules with actual keys in advance.
  • Eliminate unnecessary rules in advance. (e.g., when actual data is null and rules are also nullable)

When I validate large data in production, I use my own $rules = optimizeValidateRule($data, $rules) function to optimize the rules.
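
A rough sketch of the second tip (illustrative only; 333sec's actual optimizeValidateRule() was not shared). It assumes the rules have already been expanded to concrete dot keys and the data has been flattened with Arr::dot():

    // Drop rules whose target value is null when the rule set already
    // allows null: there is nothing left for the validator to check.
    function pruneNullableRules(array $rules, array $flatData): array
    {
        foreach ($rules as $key => $rule) {
            $ruleList = is_string($rule) ? explode('|', $rule) : (array) $rule;

            if (in_array('nullable', $ruleList, true)
                && array_key_exists($key, $flatData)
                && $flatData[$key] === null) {
                unset($rules[$key]);
            }
        }

        return $rules;
    }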

@AJenbo
Contributor

AJenbo commented Apr 20, 2024

@333sec can you share your optimizeValidateRule()? Otherwise it's not exactly very useful advice.

@knobel-dk

For those finding this thread on Google because they are sending large payloads over HTTP, let me share an idea and a quick benchmark: skip the Laravel validation and, instead, new up native PHP objects. The difference is remarkable.

<?php

declare(strict_types=1);

readonly class RoomBulkUploaderSpeedTest {
    public function __construct(public int $room_id, public string $external_id)
    {
        throw_if($this->room_id < 0, new \DomainException("Example of a validation rule"));
    }
}

$payload = array_map(fn($i) => ['room_id' => $i, 'external_id' => "string_{$i}"], range(1, 5000));
\Illuminate\Support\Benchmark::dd([
    'validation' => fn() => \Illuminate\Support\Facades\Validator::validate([
        'array' => $payload,
    ], [
        'array.*.room_id' => 'required|integer',
        'array.*.external_id' => 'required|string',
    ]),
    'newing_up' => fn() => array_map(fn($i) => new RoomBulkUploaderSpeedTest(...$i), $payload)
], 1);

"validation" => "636.862ms"
"newing_up" => "15.103ms"

If you want a 422 instead of a 500, you can catch the exception thrown in this code and return a 422. Depending on your use case, you may want to omit, or be sure to include, declare(strict_types=1);.
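
A hypothetical sketch of that 422 handling inside a controller (the exception type matches the DomainException thrown above; everything else is illustrative):

    try {
        $rooms = array_map(
            fn ($row) => new RoomBulkUploaderSpeedTest(...$row),
            $payload
        );
    } catch (\DomainException $e) {
        // Surface the constructor's validation failure as a 422 instead of a 500.
        return response()->json(['message' => $e->getMessage()], 422);
    }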

@comes
Contributor

comes commented May 3, 2024

Well, you lose the power of validation.

Instead of

    [
      'array.*.room_id' => 'required|integer',
      'array.*.external_id' => 'required|string',
    ]

explode the `*` yourself:

    $rules = array_merge(...array_map(fn ($i) => [
        "array.{$i}.room_id" => 'required|integer',
        "array.{$i}.external_id" => 'required|string',
    ], range(0, 4999)));

It will validate way faster.

The slow part in validation is resolving the `*` asterisk.

    readonly class RoomBulkUploaderSpeedTest
    {
      public function __construct(public int $room_id, public string $external_id)
      {
        throw_if(
          $this->room_id < 0,
          new \DomainException("Example of a validation rule")
        );
      }
    }

    $payload = array_map(
      fn($i) => ["room_id" => $i, "external_id" => "string_{$i}"],
      range(1, 5000)
    );

    // Expand the wildcard rules to concrete keys up front, one pair of
    // rules per row, merged into a single flat rules array.
    $rules = array_merge(...array_map(
      fn($i) => [
        "array.{$i}.room_id" => "required|integer",
        "array.{$i}.external_id" => "required|string"
      ],
      array_keys($payload)
    ));

    \Illuminate\Support\Benchmark::dd(
      [
        "validation" => fn() => \Illuminate\Support\Facades\Validator::validate(
          ["array" => $payload],
          $rules
        ),
        "newing_up" => fn() => array_map(
          fn($i) => new RoomBulkUploaderSpeedTest(...$i),
          $payload
        )
      ],
      1
    );

Not blazing fast, but way faster.

@Devil4ngle

Devil4ngle commented May 7, 2024

@333sec can you share your optimizeValidateRule()? Otherwise it's not exactly very useful advice.

This is what I came up with; it goes from 60 seconds to 10 seconds with a large dataset (still bad).
$data is the data you want to validate. Note that if you have validation rules like
required_if:test.*.test2
they will no longer work.

    // Methods of a helper class; they expand wildcard rule keys against
    // the concrete keys actually present in the data.
    public static function optimizeRules(array $rules, array $data): array
    {
        $optimizedRules = [];

        foreach ($rules as $ruleKey => $ruleValue) {
            $explodedRuleKey = explode('.', $ruleKey);
            self::recursiveRuleBuilder($explodedRuleKey, $data, '', $ruleValue, $optimizedRules);
        }

        return $optimizedRules;
    }

    private static function recursiveRuleBuilder($ruleParts, $data, $currentKey, $ruleValue, &$optimizedRules)
    {
        $part = array_shift($ruleParts);

        if (empty($ruleParts)) {
            // Last segment: record the fully expanded rule key.
            $optimizedRules[$currentKey . $part] = $ruleValue;

            return;
        }

        if ($part === '*') {
            // Expand the wildcard against every key present in the data.
            foreach ($data as $key => $value) {
                self::recursiveRuleBuilder($ruleParts, $data[$key], $currentKey . $key . '.', $ruleValue, $optimizedRules);
            }
        } elseif (isset($data[$part])) {
            self::recursiveRuleBuilder($ruleParts, $data[$part], $currentKey . $part . '.', $ruleValue, $optimizedRules);
        }
    }
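
Hypothetical usage of the helper above (assuming the two methods live in a RulesHelper class; the class name is illustrative):

    use Illuminate\Support\Facades\Validator;

    // Expand the wildcard rules against the concrete data first, then validate.
    $optimized = RulesHelper::optimizeRules($rules, $data);

    $validator = Validator::make($data, $optimized);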
