Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[analyzer] TaintPropagation checker strlen() should not propagate #66086

Merged
merged 1 commit into from
Sep 19, 2023

Conversation

dkrupp
Copy link
Contributor

@dkrupp dkrupp commented Sep 12, 2023

strlen(..) call should not propagate taintedness,
because it brings in many false positive findings. It is a common pattern to copy user provided input to another buffer. In these cases we always
get warnings about tainted data used as the malloc parameter:

buf = malloc(strlen(tainted_txt) + 1); // false warning

This pattern can lead to a denial of service attack only, when the attacker can directly specify the size of the allocated area as an arbitrary large number (e.g. the value is converted from a user provided string).

Later, we could reintroduce strlen() as a taint propagating function with the consideration not to emit warnings when the tainted value cannot be "arbitrarily large" (such as the size of an already allocated memory block).

The change has been evaluated on the following open source projects:

In all cases the lost reports are originating from copying untrusted environment variables into another buffer.

There are 2 types of lost false positive reports:

  1. Where the warning is emitted at the malloc call by the TaintPropagation Checker
    len = strlen(portnumber_filename)+4+1; temp_portnumber_filename = malloc(len);

  2. When pointers are set based on the length of the tainted string by the ArrayOutofBoundsv2 checker.
    For example this case.

@dkrupp dkrupp requested review from a team as code owners September 12, 2023 13:41
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:static analyzer clang:dataflow Clang Dataflow Analysis framework - https://clang.llvm.org/docs/DataFlowAnalysisIntro.html labels Sep 12, 2023
@llvmbot
Copy link
Collaborator

llvmbot commented Sep 12, 2023

@llvm/pr-subscribers-clang-static-analyzer-1

Changes

strlen(..) call should not propagate taintedness,
because it brings in many false positive findings. It is a common pattern to copy user provided input to another buffer. In these cases we always
get warnings about tainted data used as the malloc parameter:

buf = malloc(strlen(tainted_txt) + 1); // false warning

This pattern can lead to a denial of service attack only, when the attacker can directly specify the size of the allocated area as an arbitrary large number (e.g. the value is converted from a user provided string).

Later, we could reintroduce strlen() as a taint propagating function with the consideration not to emit warnings when the tainted value cannot be "arbitrarily large" (such as the size of an already allocated memory block).

The change has been evaluated on the following open source projects:

In all cases the lost reports are originating from copying untrusted environment variables into another buffer.

There are 2 types of lost false positive reports:

  1. Where the warning is emitted at the malloc call by the TaintPropagation Checker
    len = strlen(portnumber_filename)+4+1; temp_portnumber_filename = malloc(len);

  2. When pointers are set based on the length of the tainted string by the ArrayOutofBoundsv2 checker.
    For example this case.

--
Full diff: https://github.com/llvm/llvm-project/pull/66086.diff

4 Files Affected:

  • (modified) clang/docs/analyzer/checkers.rst (+2-2)
  • (modified) clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp (-2)
  • (modified) clang/test/Analysis/taint-diagnostic-visitor.c (+7-6)
  • (modified) clang/test/Analysis/taint-generic.c (-18)
diff --git a/clang/docs/analyzer/checkers.rst b/clang/docs/analyzer/checkers.rst
index 54ea49e7426cc86..dbd6d7787823530 100644
--- a/clang/docs/analyzer/checkers.rst
+++ b/clang/docs/analyzer/checkers.rst
@@ -2599,8 +2599,8 @@ Default propagations rules:
  ``memcpy``, ``memmem``, ``memmove``, ``mbtowc``, ``pread``, ``qsort``,
  ``qsort_r``, ``rawmemchr``, ``read``, ``recv``, ``recvfrom``, ``rindex``,
  ``strcasestr``, ``strchr``, ``strchrnul``, ``strcasecmp``, ``strcmp``,
- ``strcspn``, ``strlen``, ``strncasecmp``, ``strncmp``, ``strndup``,
- ``strndupa``, ``strnlen``, ``strpbrk``, ``strrchr``, ``strsep``, ``strspn``,
+ ``strcspn``, ``strncasecmp``, ``strncmp``, ``strndup``,
+ ``strndupa``, ``strpbrk``, ``strrchr``, ``strsep``, ``strspn``,
  ``strstr``, ``strtol``, ``strtoll``, ``strtoul``, ``strtoull``, ``tolower``,
  ``toupper``, ``ttyname``, ``ttyname_r``, ``wctomb``, ``wcwidth``
 
diff --git a/clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp b/clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
index 3dcb45c0b110383..9df69c1ad1b525e 100644
--- a/clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
@@ -694,8 +694,6 @@ void GenericTaintChecker::initTaintRules(CheckerContext &C) const {
       {{{"strpbrk"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
       {{{"strndup"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
       {{{"strndupa"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
-      {{{"strlen"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
-      {{{"strnlen"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
       {{{"strtol"}}, TR::Prop({{0}}, {{1, ReturnValueIndex}})},
       {{{"strtoll"}}, TR::Prop({{0}}, {{1, ReturnValueIndex}})},
       {{{"strtoul"}}, TR::Prop({{0}}, {{1, ReturnValueIndex}})},
diff --git a/clang/test/Analysis/taint-diagnostic-visitor.c b/clang/test/Analysis/taint-diagnostic-visitor.c
index 663836836d3db67..1eb926f25f9a778 100644
--- a/clang/test/Analysis/taint-diagnostic-visitor.c
+++ b/clang/test/Analysis/taint-diagnostic-visitor.c
@@ -10,6 +10,7 @@ int scanf(const char *restrict format, ...);
 int system(const char *command);
 char* getenv( const char* env_var );
 size_t strlen( const char* str );
+int atoi( const char* str );
 void *malloc(size_t size );
 void free( void *ptr );
 char *fgets(char *str, int n, FILE *stream);
@@ -54,11 +55,11 @@ void taintDiagnosticVLA(void) {
 // propagating through variables and expressions
 char *taintDiagnosticPropagation(){
   char *pathbuf;
-  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+  char *size=getenv("SIZE"); // expected-note {{Taint originated here}}
                                  // expected-note@-1 {{Taint propagated to the return value}}
-  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+  if (size){ // expected-note {{Assuming 'size' is non-null}}
 	               // expected-note@-1 {{Taking true branch}}
-    pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+    pathbuf=(char*) malloc(atoi(size)); // expected-warning{{Untrusted data is used to specify the buffer size}}
                                                 // expected-note@-1{{Untrusted data is used to specify the buffer size}}
                                                 // expected-note@-2 {{Taint propagated to the return value}}
     return pathbuf;
@@ -71,12 +72,12 @@ char *taintDiagnosticPropagation(){
 char *taintDiagnosticPropagation2(){
   char *pathbuf;
   char *user_env2=getenv("USER_ENV_VAR2");//unrelated taint source
-  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+  char *size=getenv("SIZE"); // expected-note {{Taint originated here}}
                                  // expected-note@-1 {{Taint propagated to the return value}}
   char *user_env=getenv("USER_ENV_VAR");//unrelated taint source
-  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+  if (size){ // expected-note {{Assuming 'size' is non-null}}
 	               // expected-note@-1 {{Taking true branch}}
-    pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+    pathbuf=(char*) malloc(atoi(size)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
                                                 // expected-note@-1{{Untrusted data is used to specify the buffer size}}
                                                 // expected-note@-2 {{Taint propagated to the return value}}
     return pathbuf;
diff --git a/clang/test/Analysis/taint-generic.c b/clang/test/Analysis/taint-generic.c
index b7906d201e4fad3..a614453c63af671 100644
--- a/clang/test/Analysis/taint-generic.c
+++ b/clang/test/Analysis/taint-generic.c
@@ -915,24 +915,6 @@ void testStrndupa(size_t n) {
   clang_analyzer_isTainted_charp(result); // expected-warning {{YES}}
 }
 
-size_t strlen(const char *s);
-void testStrlen() {
-  char s[10];
-  scanf("%9s", s);
-
-  size_t result = strlen(s);
-  clang_analyzer_isTainted_int(result); // expected-warning {{YES}}
-}
-
-size_t strnlen(const char *s, size_t maxlen);
-void testStrnlen(size_t maxlen) {
-  char s[10];
-  scanf("%9s", s);
-
-  size_t result = strnlen(s, maxlen);
-  clang_analyzer_isTainted_int(result); // expected-warning {{YES}}
-}
-
 long strtol(const char *restrict nptr, char **restrict endptr, int base);
 long long strtoll(const char *restrict nptr, char **restrict endptr, int base);
 unsigned long int strtoul(const char *nptr, char **endptr, int base);

@llvmbot
Copy link
Collaborator

llvmbot commented Sep 12, 2023

@llvm/pr-subscribers-clang

Changes

strlen(..) call should not propagate taintedness,
because it brings in many false positive findings. It is a common pattern to copy user provided input to another buffer. In these cases we always
get warnings about tainted data used as the malloc parameter:

buf = malloc(strlen(tainted_txt) + 1); // false warning

This pattern can lead to a denial of service attack only, when the attacker can directly specify the size of the allocated area as an arbitrary large number (e.g. the value is converted from a user provided string).

Later, we could reintroduce strlen() as a taint propagating function with the consideration not to emit warnings when the tainted value cannot be "arbitrarily large" (such as the size of an already allocated memory block).

The change has been evaluated on the following open source projects:

In all cases the lost reports are originating from copying untrusted environment variables into another buffer.

There are 2 types of lost false positive reports:

  1. Where the warning is emitted at the malloc call by the TaintPropagation Checker
    len = strlen(portnumber_filename)+4+1; temp_portnumber_filename = malloc(len);

  2. When pointers are set based on the length of the tainted string by the ArrayOutofBoundsv2 checker.
    For example this case.

--
Full diff: https://github.com/llvm/llvm-project/pull/66086.diff

4 Files Affected:

  • (modified) clang/docs/analyzer/checkers.rst (+2-2)
  • (modified) clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp (-2)
  • (modified) clang/test/Analysis/taint-diagnostic-visitor.c (+7-6)
  • (modified) clang/test/Analysis/taint-generic.c (-18)
diff --git a/clang/docs/analyzer/checkers.rst b/clang/docs/analyzer/checkers.rst
index 54ea49e7426cc86..dbd6d7787823530 100644
--- a/clang/docs/analyzer/checkers.rst
+++ b/clang/docs/analyzer/checkers.rst
@@ -2599,8 +2599,8 @@ Default propagations rules:
  ``memcpy``, ``memmem``, ``memmove``, ``mbtowc``, ``pread``, ``qsort``,
  ``qsort_r``, ``rawmemchr``, ``read``, ``recv``, ``recvfrom``, ``rindex``,
  ``strcasestr``, ``strchr``, ``strchrnul``, ``strcasecmp``, ``strcmp``,
- ``strcspn``, ``strlen``, ``strncasecmp``, ``strncmp``, ``strndup``,
- ``strndupa``, ``strnlen``, ``strpbrk``, ``strrchr``, ``strsep``, ``strspn``,
+ ``strcspn``, ``strncasecmp``, ``strncmp``, ``strndup``,
+ ``strndupa``, ``strpbrk``, ``strrchr``, ``strsep``, ``strspn``,
  ``strstr``, ``strtol``, ``strtoll``, ``strtoul``, ``strtoull``, ``tolower``,
  ``toupper``, ``ttyname``, ``ttyname_r``, ``wctomb``, ``wcwidth``
 
diff --git a/clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp b/clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
index 3dcb45c0b110383..9df69c1ad1b525e 100644
--- a/clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/GenericTaintChecker.cpp
@@ -694,8 +694,6 @@ void GenericTaintChecker::initTaintRules(CheckerContext &C) const {
       {{{"strpbrk"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
       {{{"strndup"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
       {{{"strndupa"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
-      {{{"strlen"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
-      {{{"strnlen"}}, TR::Prop({{0}}, {{ReturnValueIndex}})},
       {{{"strtol"}}, TR::Prop({{0}}, {{1, ReturnValueIndex}})},
       {{{"strtoll"}}, TR::Prop({{0}}, {{1, ReturnValueIndex}})},
       {{{"strtoul"}}, TR::Prop({{0}}, {{1, ReturnValueIndex}})},
diff --git a/clang/test/Analysis/taint-diagnostic-visitor.c b/clang/test/Analysis/taint-diagnostic-visitor.c
index 663836836d3db67..1eb926f25f9a778 100644
--- a/clang/test/Analysis/taint-diagnostic-visitor.c
+++ b/clang/test/Analysis/taint-diagnostic-visitor.c
@@ -10,6 +10,7 @@ int scanf(const char *restrict format, ...);
 int system(const char *command);
 char* getenv( const char* env_var );
 size_t strlen( const char* str );
+int atoi( const char* str );
 void *malloc(size_t size );
 void free( void *ptr );
 char *fgets(char *str, int n, FILE *stream);
@@ -54,11 +55,11 @@ void taintDiagnosticVLA(void) {
 // propagating through variables and expressions
 char *taintDiagnosticPropagation(){
   char *pathbuf;
-  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+  char *size=getenv("SIZE"); // expected-note {{Taint originated here}}
                                  // expected-note@-1 {{Taint propagated to the return value}}
-  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+  if (size){ // expected-note {{Assuming 'size' is non-null}}
 	               // expected-note@-1 {{Taking true branch}}
-    pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+    pathbuf=(char*) malloc(atoi(size)); // expected-warning{{Untrusted data is used to specify the buffer size}}
                                                 // expected-note@-1{{Untrusted data is used to specify the buffer size}}
                                                 // expected-note@-2 {{Taint propagated to the return value}}
     return pathbuf;
@@ -71,12 +72,12 @@ char *taintDiagnosticPropagation(){
 char *taintDiagnosticPropagation2(){
   char *pathbuf;
   char *user_env2=getenv("USER_ENV_VAR2");//unrelated taint source
-  char *pathlist=getenv("PATH"); // expected-note {{Taint originated here}}
+  char *size=getenv("SIZE"); // expected-note {{Taint originated here}}
                                  // expected-note@-1 {{Taint propagated to the return value}}
   char *user_env=getenv("USER_ENV_VAR");//unrelated taint source
-  if (pathlist){ // expected-note {{Assuming 'pathlist' is non-null}}
+  if (size){ // expected-note {{Assuming 'size' is non-null}}
 	               // expected-note@-1 {{Taking true branch}}
-    pathbuf=(char*) malloc(strlen(pathlist)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
+    pathbuf=(char*) malloc(atoi(size)+1); // expected-warning{{Untrusted data is used to specify the buffer size}}
                                                 // expected-note@-1{{Untrusted data is used to specify the buffer size}}
                                                 // expected-note@-2 {{Taint propagated to the return value}}
     return pathbuf;
diff --git a/clang/test/Analysis/taint-generic.c b/clang/test/Analysis/taint-generic.c
index b7906d201e4fad3..a614453c63af671 100644
--- a/clang/test/Analysis/taint-generic.c
+++ b/clang/test/Analysis/taint-generic.c
@@ -915,24 +915,6 @@ void testStrndupa(size_t n) {
   clang_analyzer_isTainted_charp(result); // expected-warning {{YES}}
 }
 
-size_t strlen(const char *s);
-void testStrlen() {
-  char s[10];
-  scanf("%9s", s);
-
-  size_t result = strlen(s);
-  clang_analyzer_isTainted_int(result); // expected-warning {{YES}}
-}
-
-size_t strnlen(const char *s, size_t maxlen);
-void testStrnlen(size_t maxlen) {
-  char s[10];
-  scanf("%9s", s);
-
-  size_t result = strnlen(s, maxlen);
-  clang_analyzer_isTainted_int(result); // expected-warning {{YES}}
-}
-
 long strtol(const char *restrict nptr, char **restrict endptr, int base);
 long long strtoll(const char *restrict nptr, char **restrict endptr, int base);
 unsigned long int strtoul(const char *nptr, char **endptr, int base);

@dkrupp dkrupp self-assigned this Sep 12, 2023
Copy link
Contributor

@NagyDonat NagyDonat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very straightforward change and the results on open source projects clearly show that it's an improvement over the status quo.

@NagyDonat
Copy link
Contributor

Also note that the buildkite/github-pull-request (build#1418) failure is completely irrelevant: it complains about spaces found at the end of two lines in the file clang/include/clang/Basic/riscv_vector.td (which is completely unrelated to this commit).

@steakhal
Copy link
Contributor

I can understand the frustration of the FPs. However, propagating taint there is the right thing to do.
To me, the fault is on the diagnostic on the malloc. Those are the cause of the FPs, thus that needs to be removed instead of the propagation.
I have this opinion even if the empirical results suggest that this would improve the perceived accuracy of the analysis. But to me, we would just mask the root cause.

I haven't looked the the content of the patch (yet), neither the diff's. I'll try to have a deeper look tomorrow.
I just wanted to share my concerns, after seeing an approval.

@steakhal
Copy link
Contributor

steakhal commented Sep 13, 2023

I actually wanted to propose another patch where the wchar variant of strlen would propagate taint, BTW. I still plan to do it, we will see when I reach that.

EDIT: I actually proposed that a couple days ago in #66074 (wcslen) XD

@steakhal
Copy link
Contributor

I finished the review of this PR.

By looking at the disappeared reports you attached, I'm convinced that the MsgTaintedBufferSize diagnostics give little to no benefit in general. On the other side, I've seen good hits for OOBV2 in the presence of taint - even if that's rarely the case. On the theory side, I also believe that propagation should happen on strlen and similar functions.

Consequently, I agree with the raised problems, but I disagree with the approach.
I would rather remove the MsgTaintedBufferSize diagnostic to resolve those FPs.
Alternatively, we can also think of creating a heuristic to reduce such FPs. For e.g. check if the most significant bit of the allocation size is proven to be unset (aka. checked some meaningful upperbounds) and suppress reports in that case, but report otherwise.
Would it be okay with you to proceed by not removing taint propagation?

On the same token, I think we should be able to separately enable/disable diagnostics on the GenericTaintChecker (including that they should not sink execution paths if they are disabled); but that's a different subject.

Comment on lines 918 to 967
size_t strlen(const char *s);
void testStrlen() {
char s[10];
scanf("%9s", s);

size_t result = strlen(s);
clang_analyzer_isTainted_int(result); // expected-warning {{YES}}
}

size_t strnlen(const char *s, size_t maxlen);
void testStrnlen(size_t maxlen) {
char s[10];
scanf("%9s", s);

size_t result = strnlen(s, maxlen);
clang_analyzer_isTainted_int(result); // expected-warning {{YES}}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I oppose removing FN tests. They are good at documenting intent, if for nothing else.
It might be even better to add comments there about why we think it's okay and intentional to not propagate taint there. Also, adding a PR link would give the possibility to look deeper to understand the why.

@dkrupp
Copy link
Contributor Author

dkrupp commented Sep 14, 2023

If we remove the malloc(..) as the taint sink, we would lose some true positive findings where the size of the allocated
area is specified directly as a number by the attacker:

char *size=getenv("SIZE");
if (size){
    pathbuf=(char*) malloc(atoi(size)+1); // warn: denial of service attack!
...
}

The above example is prone to denial of service attack as the the attacker just specifies an arbitrarily large number to which a buffer will be allocated. The attacker needs much less resources to specify a large number than the recevier to allocate a large chuck of memory.

On the other hand when we have a code like this:

char *user_txt=getenv("SIZE");
if (user_txt){ 
    pathbuf=(char*) malloc(strlen(user_txt)+1); // No Warning as the malloc parameter comes from the size of an already allocated buffer
...
}

Here we should not warn as the size passed to malloc is the size of an already allocated buffer. So invested resources by the attacker to provide the large string and the server allocating another buffer to contain that string is symmetrical. So not prone to DoS attack.

A more sophisticated longer term solution could be that we add a flag to the taint info (or introduce a taint type) that the tainted value was originating from an existing buffer size and then specify the malloc sink so that it should not warn in that case. I know we cannot do this know, but the taint analysis could be extended into this direction.

Back to this solution.
Please note that this is only the default configuration of the checker.
The user could add the stren as a propagator into the taint config file.
If we decide to remove strlen() as a propagator (as is in this patch) we could highlight this in the documentation of the checker that the user may want to add it back.

So for me either solution would work:
a) remove strlen() as a propagator and note it in the checker doc
b) remove malloc() as a sink and note it in the checker doc
c) don't do anything and live with the false positives

Which one would you prefer?

@NagyDonat
Copy link
Contributor

By looking at the disappeared reports you attached, I'm convinced that the MsgTaintedBufferSize diagnostics give little to no benefit in general. On the other side, I've seen good hits for OOBV2 in the presence of taint - even if that's rarely the case. On the theory side, I also believe that propagation should happen on strlen and similar functions.

You're right that theoretically speaking the strlen-like functions propagate taint, but their return value is practically always constrained from above (if the string has a null terminator, the length is less than its extent in memory) and I think that propagating taint without considering these constraint is worse than not propagating taint.

By the way, could you show an OOBV2 true positive that involves taint propagated by strlen call? My impression was that the "index is tainted" errors that involve strlen are typically false positives where the analyzer cannot deduce that s[strlen(s)] is valid and e.g. s[strlen(s)-1] is also valid when the string is nonempty.

Consequently, I agree with the raised problems, but I disagree with the approach. I would rather remove the MsgTaintedBufferSize diagnostic to resolve those FPs. Alternatively, we can also think of creating a heuristic to reduce such FPs. For e.g. check if the most significant bit of the allocation size is proven to be unset (aka. checked some meaningful upperbounds) and suppress reports in that case, but report otherwise. Would it be okay with you to proceed by not removing taint propagation?

I agree that MsgTaintedBufferSize reports should be limited to cases where the tainted value can be large and I would suggest a threshold that's significantly lower than the SIZE_MAX/2 implied by your "most significant bit" suggestion (ideally the threshold value would be configurable). However, I think this eliminates the false positives related to strlen only if we also add some logic that can reliably "tag" the return value of strlen with reasonable upper bounds.

I'd be happy to accept a patch that (1) ensures that strlen propagates taint (2) reliably puts upper bounds on the result of strlen and (3) limits MsgTaintedBufferSize to reporting cases where the tainted value can be large. However, as long as we don't have this refined code, I think the best temporary solution is removing taint propagation from strlen as the untainted return value is a good practical approximation for a tainted value with strong upper bounds.

@steakhal
Copy link
Contributor

So for me either solution would work:
a) remove strlen() as a propagator and note it in the checker doc
b) remove malloc() as a sink and note it in the checker doc
c) don't do anything and live with the false positives

TBH I would prefer (b). I see removing the whole MsgTaintedBufferSize is a bit too wild. And alloca with unconstrained tainted size should be still detected.
(c) is definitely not an option, assuming that there were honest motivations behind this patch.

By the way, could you show an OOBV2 true positive that involves taint propagated by strlen call? My impression was that the "index is tainted" errors that involve strlen are typically false positives where the analyzer cannot deduce that s[strlen(s)] is valid and e.g. s[strlen(s)-1] is also valid when the string is nonempty.

I'm sure I've seen it in the Juliet suite. I believe something similar must have been the motivation for me proposing wcslen as a taint propagator because it was missing from the taint-dataflow. Since the test cases are generated, and there are char counterparts, I'd bet there would be a missing taint-flow there as well, if we didn't propagate on strlen.
I also understand that the Juliet is an artificial benchmark, that might or might not represent real-world scenarios.

Consequently, I agree with the raised problems, but I disagree with the approach. I would rather remove the MsgTaintedBufferSize diagnostic to resolve those FPs. Alternatively, we can also think of creating a heuristic to reduce such FPs. For e.g. check if the most significant bit of the allocation size is proven to be unset (aka. checked some meaningful upperbounds) and suppress reports in that case, but report otherwise. Would it be okay with you to proceed by not removing taint propagation?

I agree that MsgTaintedBufferSize reports should be limited to cases where the tainted value can be large and I would suggest a threshold that's significantly lower than the SIZE_MAX/2 implied by your "most significant bit" suggestion (ideally the threshold value would be configurable). However, I think this eliminates the false positives related to strlen only if we also add some logic that can reliably "tag" the return value of strlen with reasonable upper bounds.

Yea, upperbounding strlen result I'm not sure if it really makes sense, just to fulfill the heuristic imposed on malloc.

Let's discuss then how much benefit warning on tainted malloc allocations provide.
If it turns out the tainted malloc does not really work, we should just remove that from sinks.

@NagyDonat
Copy link
Contributor

Putting an upper bound on strlen is not just for malloc, it's also needed for ArrayBoundV2.

As a very clear example, this function strfuzz_ends_with from twin functions iterates backwards on two strings and runs into an "Out of bounds memory access (index is tainted)" false positive when it tries to access the last character of a nonempty string.

There are also other more complicated false positives from vim and postgres.

Based on these I'd say that propagating taint in strlen is not acceptable without providing upper bounds. (These FPs are not extraordinarily common, but it's simply too ugly when the analyzer says that accessing the last character of a string may be out of bounds access.)

@steakhal
Copy link
Contributor

Putting an upper bound on strlen is not just for malloc, it's also needed for ArrayBoundV2.

As a very clear example, this function strfuzz_ends_with from twin functions iterates backwards on two strings and runs into an "Out of bounds memory access (index is tainted)" false positive when it tries to access the last character of a nonempty string.
There are also other more complicated false positives from vim and postgres.

Yes, I've also seen similar cases around the result of getenv. I believe we also discussed this in D159105.
The important thing here is that they are not that frequent and pretty easy to conclude if it's a TP or FP.

Based on these I'd say that propagating taint in strlen is not acceptable without providing upper bounds. (These FPs are not extraordinarily common, but it's simply too ugly when the analyzer says that accessing the last character of a string may be out of bounds access.)

Let's see how it works out in practice. I won't object to this change.

Do you also plan to partially revert wcslen in 61924da to match the strlen behavior?

@dkrupp
Copy link
Contributor Author

dkrupp commented Sep 14, 2023

Putting an upper bound on strlen is not just for malloc, it's also needed for ArrayBoundV2.
As a very clear example, this function strfuzz_ends_with from twin functions iterates backwards on two strings and runs into an "Out of bounds memory access (index is tainted)" false positive when it tries to access the last character of a nonempty string.
There are also other more complicated false positives from vim and postgres.

Yes, I've also seen similar cases around the result of getenv. I believe we also discussed this in D159105. The important thing here is that they are not that frequent and pretty easy to conclude if it's a TP or FP.

Based on these I'd say that propagating taint in strlen is not acceptable without providing upper bounds. (These FPs are not extraordinarily common, but it's simply too ugly when the analyzer says that accessing the last character of a string may be out of bounds access.)

Let's see how it works out in practice. I won't object to this change.

Do you also plan to partially revert wcslen in 61924da to match the strlen behavior?

yes, I will do that.

@steakhal
Copy link
Contributor

Request another round of review once you are happy with the content and addressed the open comments.
On the grand scheme we are aligned.

@steakhal
Copy link
Contributor

As I'm not a maintainer, I could not push to your branch.
Here is a patch that I think has the missing pieces to satisfy my review.
0001-fixup-analyzer-TaintPropagation-checker-strlen-shoul.patch.txt
Apply it to your branch by git am 0001-fixup-analyzer-TaintPropagation-checker-strlen-shoul.patch.txt.
After that, I'm okay to squash merge this PR, if you are also okay with my suggestions.

strlen(..) call should not propagate taintedness,
because it brings in many false positive findings.
It is a common pattern to copy user provided input
to another buffer. In these cases we always
get warnings about tainted data used as the malloc parameter:

buf = malloc(strlen(tainted_txt) + 1); // false warning

This pattern can lead to a denial of service attack only, when
the attacker can directly specify the size of the allocated area
as an arbitrary large number (e.g. the value is converted
from a user provided string).

Later, we could reintroduce strlen() as a taint propagating function
with the consideration not to emit warnings when the tainted value
cannot be "arbitrarily large" (such as the size of an already allocated
memory block).
@dkrupp
Copy link
Contributor Author

dkrupp commented Sep 18, 2023

As I'm not a maintainer, I could not push to your branch. Here is a patch that I think has the missing pieces to satisfy my review. 0001-fixup-analyzer-TaintPropagation-checker-strlen-shoul.patch.txt Apply it to your branch by git am 0001-fixup-analyzer-TaintPropagation-checker-strlen-shoul.patch.txt. After that, I'm okay to squash merge this PR, if you are also okay with my suggestions.

Thanks for the suggestions. I squashed it.

Copy link
Contributor

@steakhal steakhal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope in the future we will have a more satisfying solution.
LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:dataflow Clang Dataflow Analysis framework - https://clang.llvm.org/docs/DataFlowAnalysisIntro.html clang:static analyzer clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants