Resolve unhandled exceptions when analyzing folders that contain large files #2494

Merged
merged 15 commits into from
Jul 8, 2022

Conversation

marmegh
Contributor

@marmegh marmegh commented Jul 1, 2022

OutOfMemory and NullReferenceException exceptions are being thrown when analyzing folders containing large files. This change adopts a file size limit in kilobytes and defaults to 1024.

Added unit testing to cover combinations of file sizes and file size limits.

This is related to #621.

/// Gets or sets the maximum file size (in kilobytes) that will be analyzed.
/// If not set, it will analyze all sizes.
/// </summary>
int FileSizeInKilobytes { get; set; }
Collaborator

@eddynaka eddynaka Jul 1, 2022

FileSizeInKilobytes

Since we are here, can we rename this to MaxFileSizeInKilobytes or something that actually explains why this is needed?
#WontFix

Contributor Author

This was mentioned in the other PR: it would be a breaking change for SPAM. Is that acceptable?

Collaborator

I think, if that would make things clearer and we tell people, yes.

What are your thoughts?

Collaborator

Another strategy could be the following:

  1. we rename the C# property
  2. we keep the exposed argument.

So, internally, we use the good name, and we announce that the argument will be renamed in the next release. :)

Member

Please do not take this change now. A breaking change of this kind needs to come with a thoughtful implementation; this is new work that is slowing down the immediate change.

So, go ahead and file an issue. We have other patterns for deprecating command-line args (for example, --hashes is supposed to be obsoleted!!). We do this in the manner Eddy describes: retain the old argument and mark it obsolete, and/or update the code to properly transform old command lines to new ones.

So, Ed, be sure to capture your excellent thoughts in the issue that you file. :)
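
The deprecation pattern described in this thread could look roughly like the following. This is only a sketch under assumptions: `AnalyzeOptionsSketch`, `ArgumentRewriter`, and the `--max-file-size-in-kb` flag are hypothetical names chosen for illustration and are not taken from the SDK; only `--file-size-in-kb` appears in this PR.

```csharp
using System;

// Hypothetical options class for illustration; not the SDK's actual type.
public class AnalyzeOptionsSketch
{
    // New, clearer name used internally.
    public int MaxFileSizeInKilobytes { get; set; }

    // Old name is retained so existing callers keep compiling;
    // it simply forwards to the new property.
    [Obsolete("Use MaxFileSizeInKilobytes instead.")]
    public int FileSizeInKilobytes
    {
        get => MaxFileSizeInKilobytes;
        set => MaxFileSizeInKilobytes = value;
    }
}

public static class ArgumentRewriter
{
    // Transforms an old command line into the new form before parsing.
    // "--max-file-size-in-kb" is an assumed replacement flag name.
    public static string[] Upgrade(string[] args)
    {
        var result = new string[args.Length];
        for (int i = 0; i < args.Length; i++)
        {
            result[i] = args[i] == "--file-size-in-kb"
                ? "--max-file-size-in-kb"
                : args[i];
        }
        return result;
    }
}
```

Keeping the old property as a forwarding shim means the rename itself is not breaking; the old name can be removed in a later release once callers have migrated.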

[Fact]
public void MultithreadedAnalyzeCommandBase_TargetFileSizeTestCases()
{
Random random = new Random();
Collaborator

@eddynaka eddynaka Jul 1, 2022

Random random = new Random();

You should preserve the seed; otherwise, if this test fails, you won't know which seed was used and won't be able to reproduce the failure easily. #Resolved

Collaborator

Looking closer, you just need to use the TestRule.s_seed

Member

Please do step through other tests that show the output that we expect for any random use. Eddy, be sure to explain this sort of thing in detail, it's important for others to understand completely (you may have done this already offline). Basically, any test that uses a random value generator MUST emit the seed used to initialize the PRNG in its output. For tests that fail only due to specific randomized values, this seed can be hard-coded on the dev box to ensure reproducibility of the problem.

By the way! I recently fixed a test bug of this kind in an MS service. I was co-developing a change and my build failed, test rerun succeeded. When I looked at history, I could clearly see this was a flaky test, but it only failed once every several hundred pipeline builds. :) The issue was a random value generator that only provoked a negative condition in a very small % of cases. Ouch! That bug had lived in this code base for over a year...
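
A minimal sketch of the "emit the seed" rule follows. The helper name `SeededRandomSketch` and its parameters are assumptions made for illustration; the repository's own mechanism (the `TestRule.s_seed` mentioned above) may work differently.

```csharp
using System;

// Hypothetical helper for illustration; not part of the SDK's test utilities.
public static class SeededRandomSketch
{
    // Creates a PRNG and reports the seed through the supplied logger so that
    // a failing run can be replayed by passing the same value as fixedSeed.
    public static Random Create(Action<string> log, int? fixedSeed = null)
    {
        int seed = fixedSeed ?? Environment.TickCount;
        log($"Random seed: {seed}");
        return new Random(seed);
    }
}
```

In a test, `SeededRandomSketch.Create(Console.WriteLine)` would record the seed in the test output; to reproduce a failure on a dev box, the logged value can be hard-coded as `fixedSeed`.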

int randomMaxFileSize = random.Next(1, int.MaxValue - 1);
long randomFileSize = (long)random.Next(2, int.MaxValue - 1);


Collaborator

@eddynaka eddynaka Jul 1, 2022

remove #Resolved

Member

I would so dearly love to not see this kind of feedback in code reviews, i.e., for all style suggestions to mostly be resolved by an automated code formatter. Is there not an autofix for this in a Roslyn analyzer or command-line tool? It's important that we stay clean and detail-oriented, so I appreciate all of us pointing this sort of thing out, to be clear. Just encouraging us to remove as many issues of this kind from ever occurring, wherever we can.

Contributor Author

We usually have a step that removes these extra lines as part of buildAndTest.cmd for most of our repos.

{
long fileSize = FileSystem.GetFileSize(path) / 1024;

return (maxFileSize == -1 || fileSize < maxFileSize);
Collaborator

@eddynaka eddynaka Jul 1, 2022

fileSize < maxFileSize

I think a few cases are missing:
maxfile = -1 vs -1/0/1/big
maxfile = file = 10 (a reasonable number) #Resolved

Contributor Author

Updated test cases and added some explicit follow up confirming this method separately.
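
For reference, the boundary cases raised in this thread can be written down against a standalone version of the predicate. This is a sketch that mirrors the comparison shown in the diff (`maxFileSize == -1 || fileSize < maxFileSize` after dividing bytes by 1024); the class name, method name, and case table are illustrative assumptions, not the PR's actual test code.

```csharp
// Hypothetical standalone version of the size check, for illustration only.
public static class FileSizeFilterSketch
{
    // -1 means "no limit"; otherwise the size in whole kilobytes
    // (integer division) must be strictly below the limit.
    public static bool IsWithinLimit(long fileSizeInBytes, int maxFileSizeInKilobytes)
    {
        long fileSizeInKilobytes = fileSizeInBytes / 1024;
        return maxFileSizeInKilobytes == -1 || fileSizeInKilobytes < maxFileSizeInKilobytes;
    }

    // Explicit boundary matrix: no limit, zero, one, equal sizes, and extremes.
    public static readonly (long bytes, int maxKb, bool expected)[] Cases =
    {
        (0L, -1, true),                 // no limit, empty file
        (long.MaxValue, -1, true),      // no limit, huge file
        (0L, 0, false),                 // 0 KB is not < 0
        (0L, 1, true),                  // 0 KB < 1
        (10L * 1024, 10, false),        // file size equals the limit
        (10L * 1024 - 1, 10, true),     // rounds down to 9 KB
        (1024L, int.MaxValue, true),    // largest possible limit
    };
}
```

Note that with a strict `<` comparison, a file whose size equals the limit is excluded, which is exactly the `maxfile = file = 10` case asked about above.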

@@ -96,6 +96,13 @@ public Uri TargetUri

public DefaultTraces Traces { get; set; }


Collaborator

@eddynaka eddynaka Jul 1, 2022

remove empty line #Resolved

@@ -122,6 +129,11 @@ internal bool ValidateOutputFileCanBeCreated(IAnalysisContext context, string ou
return succeeded;
}

internal bool ValidateFileSizeInKilobytes(int fileSizeInKilobytes)
Member

@michaelcfanning michaelcfanning Jul 1, 2022

ValidateFileSizeInKilobytes

This factoring looks questionable to me. Your method name is as long as its implementation. :)

In general, a utility method only has value if it contains some specialized logic that you don't want to maintain in multiple places. A simple 'greater than 0' comparison doesn't seem to meet that test.

There is another subtle issue: 'ValidateFileSizeInKilobytes' suggests that more validation is done than actually is. You encode 'kilobytes' and 'file size' in the name, and yet the implementation of this helper seems to be 'VerifyIsGreaterThanZero', and that's it.

Having said all this, I don't even see a caller for this method. :) So maybe I'm making a whole lot out of pointing out that you left some intermediate code in the file. :)

Final point, though, just to keep beating this to death, this helper doesn't access instance data and, if it were to survive, could be static.

#Resolved

new {
expectedExitReason = ExitReason.None,
fileSize = (long)ulong.MinValue,
maxFileSize = randomMaxFileSize
Member

@michaelcfanning michaelcfanning Jul 1, 2022

randomMaxFileSize

Don't do this. :) There really isn't a good use for a random value in this test pattern, in my opinion. Why is that? Because the test matrix of useful values is so clear. You need all boundaries (max & min), you need zero, -1, and then a couple of values within the interior range.

Using random to generate something just won't buy you much, and in the meantime you've opted into a more complex test pattern which, for example, leads others to require you to follow a specific pattern. Don't do that unless you really think you derive value from it.

#Resolved


var options = new TestAnalyzeOptions
{
Threads = 10,
Member

@michaelcfanning michaelcfanning Jul 1, 2022

Threads = 10

Why are you opting into this behavior, and enabling hashes, etc? This is a nice test, but it's doing a lot, why is that? Can't we make this a bit more focused on the matter at hand, file size validation?

When you start adding things, you increase performance costs, but you also risk introducing non-obvious breaks in tests. So, let's say we had a problem introduced by our hashing argument to --insert. This test may fail, and we'd all be confused about why the file size test broke.

This is just an example. But the principle here is that we want targeted unit tests that directly focus on some test work, as far as possible.

It isn't always possible to create a clean union. But for this example, I definitely don't see why we'd explicitly set any args, would prefer defaults and minimal options otherwise. #Resolved

Contributor Author

Hashes, etc. were part of the original customer escalation. It's fairly conclusive that the repro and fix are the same with and without these, so I've removed them.

@@ -1,5 +1,9 @@
# SARIF Package Release History (SDK, Driver, Converters, and Multitool)

## Unreleased

* BUGFIX: Resolve OutofMemoryException and NullReferenceException' failures by adding `--file-size-in-kb` argument to specify a max file size to analyze or default to 1024KB. [#2494](https://github.com/microsoft/sarif-sdk/pull/2494)
Member

OutofMemoryException

enquote this and the other exception kind.

/// <returns>
/// A long representing the size of the file in bytes.
/// </returns>
public long GetFileSize(string path)
Member

@michaelcfanning michaelcfanning Jul 6, 2022

GetFileSize

Let's call this FileInfoLength. #Pending

filter,
SearchOption.TopDirectoryOnly))
foreach (string file in FileSystem.DirectoryEnumerateFiles(directory, filter, SearchOption.TopDirectoryOnly)
.Where(file => IsTargetWithinFileSizeLimit(file, _rootContext.FileSizeInKilobytes)))
Member

@michaelcfanning michaelcfanning Jul 6, 2022

Where

Please put your predicate/handling within the collection loop; the rule is to avoid chained collection traversal. #Pending

filter,
SearchOption.TopDirectoryOnly))
foreach (string file in FileSystem.DirectoryEnumerateFiles(directory, filter, SearchOption.TopDirectoryOnly)
.Where(file => IsTargetWithinFileSizeLimit(file, _rootContext.FileSizeInKilobytes)))
Member

@michaelcfanning michaelcfanning Jul 6, 2022

Where

put this filter predicate into the code block. #Pending
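
One way to read this request is to move the size check into the loop body instead of chaining `.Where(...)` onto the enumeration. The sketch below uses `System.IO` directly rather than the SDK's `FileSystem` abstraction, and the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical enumerator for illustration; the real code goes through
// the SDK's FileSystem abstraction rather than System.IO directly.
public static class TargetEnumeratorSketch
{
    public static IEnumerable<string> EnumerateTargets(
        string directory, string filter, int maxFileSizeInKilobytes)
    {
        foreach (string file in Directory.EnumerateFiles(
            directory, filter, SearchOption.TopDirectoryOnly))
        {
            // The predicate lives inside the loop body, so the file list
            // is traversed once and no LINQ operator is chained on top.
            long sizeInKilobytes = new FileInfo(file).Length / 1024;
            if (maxFileSizeInKilobytes != -1 && sizeInKilobytes >= maxFileSizeInKilobytes)
            {
                continue; // Too large: skip this target.
            }

            yield return file;
        }
    }
}
```

Keeping the check in the body also leaves an obvious place to log or count skipped files later, which a chained filter would hide.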

/// Gets or sets the maximum file size (in kilobytes) that will be analyzed.
/// If not set, it will analyze all sizes.
/// </summary>
public int FileSizeInKilobytes { get; set; } = -1;
Member

@michaelcfanning michaelcfanning Jul 6, 2022

-1

remove this #Pending

/// Gets or sets the maximum file size (in kilobytes) that will be analyzed.
/// If not set, it will analyze all sizes.
/// </summary>
int FileSizeInKilobytes { get; set; }
Member

@michaelcfanning michaelcfanning Jul 6, 2022

FileSizeInKilobytes

Change this to MaxFileSizeInKilobytes. Do not accept this change into SPAM without making sure SPAM supports its old argument as well for now.

You can look at how we handled deprecating --hashes in favor of --insert Hashes in the BinSkim code base for an example. #Pending

@@ -30,6 +30,8 @@ public class AnalyzeTestContext : IAnalysisContext

public DefaultTraces Traces { get; set; }

public int FileSizeInKilobytes { get; set; } = -1;
Member

@michaelcfanning michaelcfanning Jul 6, 2022

-1;

remove this. only the attribute should set the default value. #Pending

/// Gets or sets the maximum file size (in kilobytes) that will be analyzed.
/// If not set, it will analyze all sizes.
/// </summary>
public int FileSizeInKilobytes { get; set; } = -1;
Member

@michaelcfanning michaelcfanning Jul 6, 2022

-1

Remove this. Did you author a test for the validation command that this is honored? #Pending

@@ -9,6 +9,7 @@

using Microsoft.CodeAnalysis.Sarif;
using Microsoft.CodeAnalysis.Sarif.Driver;
using Microsoft.VisualBasic;
Member

@michaelcfanning michaelcfanning Jul 6, 2022

Microsoft

Why is this here? #Pending

/// Gets or sets the maximum file size (in kilobytes) that will be analyzed.
/// If not set, it will analyze all sizes.
/// </summary>
public int FileSizeInKilobytes { get; set; } = -1;
Member

@michaelcfanning michaelcfanning Jul 6, 2022

-1

Remove this. #Pending

@michaelcfanning
Member

michaelcfanning commented Jul 6, 2022

public class AnalyzeTestContext : IAnalysisContext

Why is this in here? See if you can delete it. I don't understand why a test file is in the production code; it looks like it might be a mistake.


In reply to: 1176781298


Refers to: src/Sarif.Multitool/AnalyzeTestContext.cs:9 in eb91a3a.

Member

@michaelcfanning michaelcfanning left a comment

:shipit:

@@ -1,5 +1,9 @@
# SARIF Package Release History (SDK, Driver, Converters, and Multitool)

## Unreleased

* BUGFIX: Resolve OutofMemoryException and NullReferenceException' failures by adding `--file-size-in-kb` argument to specify a max file size to analyze or default to 1024KB. [#2494](https://github.com/microsoft/sarif-sdk/pull/2494)
Member

@michaelcfanning michaelcfanning Jul 6, 2022

BUGFIX

FEATURE: Add file-size-in-kb argument that allows filtering scan targets by file size. #Pending

@marmegh
Contributor Author

marmegh commented Jul 6, 2022

public class AnalyzeTestContext : IAnalysisContext

Eddy provided the PR for historical context. This is not part of this change, but this and related files should eventually be moved to a test project.


In reply to: 1176781298


Refers to: src/Sarif.Multitool/AnalyzeTestContext.cs:9 in eb91a3a.

MaxFileSizeInKilobytes = testCase.maxFileSize
};

var command = new TestMultithreadedAnalyzeCommand(mockFileSystem.Object);
Collaborator

@eddynaka eddynaka Jul 7, 2022

TestMultithreadedAnalyzeCommand

You are only testing the multithreaded command base. If you take a look, we have some tests that take a boolean that decides whether the run is single-threaded or multithreaded, since the input/output is exactly the same apart from the actual command.
#Resolved

if (exitReason != ExitReason.None)
{
var exception = command.ExecutionException as ExitApplicationException<ExitReason>;
exception.Should().NotBeNull();

Check warning

Code scanning / CodeQL

Dereferenced variable may be null

Variable [exception](1) may be null here because of [this](2) assignment.
Comment on lines +863 to +864
bool expectedToBeWithinLimits = testCase.maxFileSize == -1 ||
testCase.fileSize / 1024 < testCase.maxFileSize;

Check warning

Code scanning / CodeQL

Useless assignment to local variable

This assignment to [expectedToBeWithinLimits](1) is useless, since its value is never read.
Collaborator

@eddynaka eddynaka left a comment

:shipit:

@marmegh marmegh merged commit ce8c5cb into main Jul 8, 2022
@marmegh marmegh deleted the spamICM branch July 8, 2022 20:13
This was referenced Jul 15, 2022
3 participants