Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summarize std::string's when created from data. #89110

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

Jlalond
Copy link
Contributor

@Jlalond Jlalond commented Apr 17, 2024

std::string children are normally formatted as their summary [my_string_prop] = "hello world". Sbvalues/ValueObjects created from data do not follow the same summary formatter path and instead hit the value object printer. This change fixes that and adds tests for the future.

Example:
image

Will now summarize like all other strings to "test3"

Added tests in TestDataFormatterStdString.py and a formatter for a custom struct.

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@Jlalond Jlalond changed the title Summarize std::sting's when created from data. Summarize std::string's when created from data. Apr 17, 2024
@llvmbot llvmbot added the lldb label Apr 17, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Apr 17, 2024

@llvm/pr-subscribers-lldb

Author: Jacob Lalonde (Jlalond)

Changes

std::string children are normally formatted as their summary [my_string_prop] = "hello world". Sbvalues/ValueObjects created from data do not follow the same summary formatter path and instead hit the value object printer. This change fixes that and adds tests for the future.

Added tests in TestDataFormatterStdString.py and a formatter for a custom struct.


Full diff: https://github.com/llvm/llvm-project/pull/89110.diff

4 Files Affected:

  • (modified) lldb/source/Plugins/Language/CPlusPlus/LibStdcpp.cpp (+27-4)
  • (added) lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/ConvertToDataFormatter.py (+50)
  • (modified) lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/TestDataFormatterStdString.py (+28)
  • (modified) lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/main.cpp (+13)
diff --git a/lldb/source/Plugins/Language/CPlusPlus/LibStdcpp.cpp b/lldb/source/Plugins/Language/CPlusPlus/LibStdcpp.cpp
index 86bb575af5ca34..0da01e9ca07fb6 100644
--- a/lldb/source/Plugins/Language/CPlusPlus/LibStdcpp.cpp
+++ b/lldb/source/Plugins/Language/CPlusPlus/LibStdcpp.cpp
@@ -254,13 +254,13 @@ bool lldb_private::formatters::LibStdcppStringSummaryProvider(
   } else
     addr_of_string =
         valobj.GetAddressOf(scalar_is_load_addr, &addr_type);
-  if (addr_of_string != LLDB_INVALID_ADDRESS) {
+  if (addr_of_string != LLDB_INVALID_ADDRESS ||
+        addr_type == eAddressTypeHost) {
     switch (addr_type) {
     case eAddressTypeLoad: {
       ProcessSP process_sp(valobj.GetProcessSP());
       if (!process_sp)
         return false;
-
       StringPrinter::ReadStringAndDumpToStreamOptions options(valobj);
       Status error;
       lldb::addr_t addr_of_data =
@@ -287,8 +287,31 @@ bool lldb_private::formatters::LibStdcppStringSummaryProvider(
       } else
         return true;
     } break;
-    case eAddressTypeHost:
-      break;
+    case eAddressTypeHost: {
+      // We have the host address of our std::string
+      // But we need to read the pointee data from the debugged process.
+      ProcessSP process_sp(valobj.GetProcessSP());
+      DataExtractor data;
+      Status error;
+      valobj.GetData(data, error);
+      if (error.Fail())
+        return false;
+      // We want to read the address from std::string, which is the first 8 bytes.
+      lldb::offset_t offset = 0;
+      lldb::addr_t addr = data.GetAddress(&offset);
+      if (!addr)
+      {
+        stream.Printf("nullptr");
+        return true;
+      }
+
+      std::string contents;
+      process_sp->ReadCStringFromMemory(addr, contents, error);
+      if (error.Fail())
+        return false;
+      stream.Printf("%s", contents.c_str());
+      return true;
+    } break;
     case eAddressTypeInvalid:
     case eAddressTypeFile:
       break;
diff --git a/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/ConvertToDataFormatter.py b/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/ConvertToDataFormatter.py
new file mode 100644
index 00000000000000..57e42c920f8294
--- /dev/null
+++ b/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/ConvertToDataFormatter.py
@@ -0,0 +1,50 @@
+# coding=utf8
+"""
+Helper formmater to verify Std::String by created via SBData
+"""
+
+import lldb
+
+class SyntheticFormatter:
+    def __init__(self, valobj, dict):
+        self.valobj = valobj
+
+    def num_children(self):
+        return 6
+
+    def has_children(self):
+        return True
+
+    def get_child_at_index(self, index):
+        name = None
+        match index:
+            case 0:
+                name = "short_string"
+            case 1:
+                name = "long_string"
+            case 2:
+                name = "short_string_ptr"
+            case 3:
+                name = "long_string_ptr"
+            case 4:
+                name = "short_string_ref"
+            case 5:
+                name = "long_string_ref"
+            case _:
+                return None
+
+        child = self.valobj.GetChildMemberWithName(name)
+        valType = child.GetType()
+        return self.valobj.CreateValueFromData(name,
+                child.GetData(),
+                valType)
+
+
+def __lldb_init_module(debugger, dict):
+    typeName = "string_container"
+    debugger.HandleCommand(
+        'type synthetic add -x "'
+        + typeName
+        + '" --python-class '
+        + f"{__name__}.SyntheticFormatter"
+    )
diff --git a/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/TestDataFormatterStdString.py b/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/TestDataFormatterStdString.py
index 0b78fb7dc3911c..2adfe47f9a3cfd 100644
--- a/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/TestDataFormatterStdString.py
+++ b/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/TestDataFormatterStdString.py
@@ -37,6 +37,8 @@ def test_with_run_command(self):
             substrs=["stopped", "stop reason = breakpoint"],
         )
 
+        self.runCmd("command script import ./ConvertToDataFormatter.py", check=True)
+
         # This is the function to remove the custom formats in order to have a
         # clean slate for the next test case.
         def cleanup():
@@ -60,6 +62,7 @@ def cleanup():
         var_rQ = self.frame().FindVariable("rQ")
         var_pq = self.frame().FindVariable("pq")
         var_pQ = self.frame().FindVariable("pQ")
+        var_str_container = self.frame().FindVariable("sc")
 
         self.assertEqual(var_wempty.GetSummary(), 'L""', "wempty summary wrong")
         self.assertEqual(
@@ -89,6 +92,31 @@ def cleanup():
             "pQ summary wrong",
         )
 
+        self.assertEqual(
+            var_str_container.GetChildAtIndex(0).GetSummary(),
+            '"u22"',
+            "string container child wrong")
+        self.assertEqual(
+            var_str_container.GetChildAtIndex(1).GetSummary(),
+            '"quite a long std::string with lots of info inside it inside a struct"',
+            "string container child wrong")
+        self.assertEqual(
+            var_str_container.GetChildAtIndex(2).GetSummary(),
+            '"u22"',
+            "string container child wrong")
+        self.assertEqual(
+            var_str_container.GetChildAtIndex(3).GetSummary(),
+            '"quite a long std::string with lots of info inside it inside a struct"',
+            "string container child wrong")
+        self.assertEqual(
+            var_str_container.GetChildAtIndex(4).GetSummary(),
+            '"u22"',
+            "string container child wrong")
+        self.assertEqual(
+            var_str_container.GetChildAtIndex(5).GetSummary(),
+            '"quite a long std::string with lots of info inside it inside a struct"',
+            "string container child wrong")
+
         self.runCmd("next")
 
         self.assertEqual(var_S.GetSummary(), 'L"!!!!!"', "new S summary wrong")
diff --git a/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/main.cpp b/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/main.cpp
index eefb96c4573e4e..b169a2e6945568 100644
--- a/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/main.cpp
+++ b/lldb/test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/string/main.cpp
@@ -1,5 +1,14 @@
 #include <string>
 
+struct string_container {
+    std::string short_string;
+    std::string long_string;
+    std::string *short_string_ptr = &short_string;
+    std::string *long_string_ptr = &long_string;
+    std::string &short_string_ref = short_string;
+    std::string &long_string_ref = long_string;
+};
+
 int main()
 {
     std::wstring wempty(L"");
@@ -11,6 +20,10 @@ int main()
     std::string Q("quite a long std::strin with lots of info inside it");
     auto &rq = q, &rQ = Q;
     std::string *pq = &q, *pQ = &Q;
+    string_container sc;
+    sc.short_string = "u22";
+    sc.long_string = "quite a long std::string with lots of info inside it inside a struct";
+
     S.assign(L"!!!!!"); // Set break point at this line.
     return 0;
 }

@@ -254,13 +254,13 @@ bool lldb_private::formatters::LibStdcppStringSummaryProvider(
} else
addr_of_string =
valobj.GetAddressOf(scalar_is_load_addr, &addr_type);
if (addr_of_string != LLDB_INVALID_ADDRESS) {
if (addr_of_string != LLDB_INVALID_ADDRESS ||
addr_type == eAddressTypeHost) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this since we iterate all cases anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the switch statement is within the if block, if we don't add this exception we will return false and fall back to the ValueObjectPrinter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh wow, so for this specific situation we will have addr_of_string == LLDB_INVALID_ADDRESS?
Would you mind adding a comment for that?

@jimingham
Copy link
Collaborator

Most of this change makes sense. However, you are assuming that if the address type of a ValueObject is Host, its children are necessarily going to be in the target. But there's actually an API that answers that question: ValueObject::GetAddressTypeOfChildren. You should consult that rather than assuming the children live in the target.

@Jlalond
Copy link
Contributor Author

Jlalond commented Apr 17, 2024

@jimingham

What should we do if the child address type is File? I don't know how that could ever happen, and should we support it? Because currently we didn't support summarizing host or file.

@jeffreytan81 jeffreytan81 self-requested a review April 23, 2024 18:08
@@ -37,6 +37,8 @@ def test_with_run_command(self):
substrs=["stopped", "stop reason = breakpoint"],
)

self.runCmd("command script import ./ConvertToDataFormatter.py", check=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of changing existing test method to test create from data case, let's add a new test method instead so that both default and create from data formatters are tested.

@jimingham
Copy link
Collaborator

@jimingham

What should we do if the child address type is File? I don't know how that could ever happen, and should we support it? Because currently we didn't support summarizing host or file.

It doesn't look like C++ allows you to make constexpr std::strings, so for now I don't think there's a way to put constructed std::string's in the DATA segments of object files. But if that were possible, handling them shouldn't be any different from reading std::string contents from process memory.

Comment on lines 261 to 262
if (addr_of_string != LLDB_INVALID_ADDRESS ||
addr_type == eAddressTypeHost) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does addr_of_string = valobj.GetAddressOf(scalar_is_load_addr, &addr_type); return eAddressTypeHost with addr_of_string being set to LLDB_INVALID_ADDRESS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That I don't know. ValueObject does explicitly return LLDB_INVALID_ADDRESS when addressType == eAddressTypeHost. I thought this was weird and potentially returning the address of the in memory buffer made more sense, but that seemed like a major refactor for a minor string issue

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be more careful here. GetAddressOf is really meant to do "can you find an address in the target for this object". We use it that way in a whole bunch of places, e.g.:

      cstr_address = GetAddressOf(true, &cstr_address_type);
    } else {
      // We have a pointer
      cstr_address = GetPointerValue(&cstr_address_type);
    }

    if (cstr_address == 0 || cstr_address == LLDB_INVALID_ADDRESS) {
      if (cstr_address_type == eAddressTypeHost && is_array) {
        const char *cstr = GetDataExtractor().PeekCStr(0);

So in that case we are expecting a host address type to return an invalid address from GetAddressOf.

This change worries me, I don't think it will be what other code expects? Maybe you can get the value you need like the code above does instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could take the approach you provided. Greg and I originally talked and mentioned how GetAddressOf will return the pointer within a buffer in certain cases. I'm not enough of an expert here to have strong opinions but I think we're in a confusing middleground, where GetAddressOf works for Load addresses, but fails for a host address even if we have a pointer internally to a data buffer.

I think solving that is more fundamental to my (2nd?) PR in lldb, but if we can all agree on the returning invaild for host addresses. I'm okay with that, as long as we also add a comment/documentation to the call that HostAddress addresses are invalid and to instead use a data extractor. My only opinion on that as a new contributor is the API to work with ValueObject's data is clunky, because I can seemingly only construct a copy via GetData.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be more careful here. GetAddressOf is really meant to do "can you find an address in the target for this object". We use it that way in a whole bunch of places, e.g.:

      cstr_address = GetAddressOf(true, &cstr_address_type);
    } else {
      // We have a pointer
      cstr_address = GetPointerValue(&cstr_address_type);
    }

    if (cstr_address == 0 || cstr_address == LLDB_INVALID_ADDRESS) {
      if (cstr_address_type == eAddressTypeHost && is_array) {
        const char *cstr = GetDataExtractor().PeekCStr(0);

So in that case we are expecting a host address type to return an invalid address from GetAddressOf.

This change worries me, I don't think it will be what other code expects? Maybe you can get the value you need like the code above does instead?

The code above works, but I think it doesn't total sense from a readability standpoint. We have a GetAddressOf function that return the address kind and a value, I am not sure why we wouldn't want to take advantage of that and use it? The above code is a bit interesting as it makes a bunch of assumptions like "is this a host address and if this is an array, I know my data is in the data extractor at offset zero". I mean it works, but does it make sense to duplicate these kind of assumptions?

Make sense to fix GetAddressOf to take advantage of the API it is implementing. If the address kind can be host and we can return a valid host address value, I would say we use it. We will need to look over all uses of this API internally if we do change it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine too, but you have to be careful here since getting host vrs. target address attribution wrong is one quick way to crash lldb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to fix GetAddressOf to take advantage of the API it is implementing. If the address kind can be host and we can return a valid host address value, I would say we use it. We will need to look over all uses of this API internally if we do change it.

If we decide to go forward with this refactor, I think we should probably split this out into it's own independent PR and put this one on pause. As Jim mentioned above there are currently places that make assumptions based on the value object being host and getting back an invalid address, and we would need to correct for that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to fix GetAddressOf to take advantage of the API it is implementing. If the address kind can be host and we can return a valid host address value, I would say we use it. We will need to look over all uses of this API internally if we do change it.

If we decide to go forward with this refactor, I think we should probably split this out into it's own independent PR and put this one on pause. As Jim mentioned above there are currently places that make assumptions based on the value object being host and getting back an invalid address, and we would need to correct for that.

I would be interesting to see how many places do this kind of pattern. If it is only a few locations, then yes a quick PR to enable this correctly would be nice, but if it is many, we can just copy the code that Jim mentions to work around the issue.

Comment on lines 304 to 305
if (child_addressType == eAddressTypeLoad)
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The { goes at the end of the if line per llvm coding guidelines

Comment on lines 311 to 312
if (!addr)
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The { goes at the end of the if line per llvm coding guidelines

Comment on lines 325 to 326
if (child_addressType == eAddressTypeHost)
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The { goes at the end of the if line per llvm coding guidelines

Comment on lines 296 to 316
DataExtractor data;
Status error;
valobj.GetData(data, error);
if (error.Fail())
return false;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the code that detects the short string optimization? I don't see it here anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't short string optimization be covered by the check at the address type of children (Line 303)? If the children are host we directly read the std::string from the data extractor

if (!dataStart)
return false;

const std::string* str = static_cast<const std::string*>(dataStart);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't assume that the host std::string will match the debugee std::string. If we are remote debugging to another architecture, then we will have problems.

return true;
}

if (child_addressType == eAddressTypeHost) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is never being run because no one will set the child address type to host memory.

Copy link

github-actions bot commented May 21, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link

github-actions bot commented May 21, 2024

✅ With the latest revision this PR passed the Python code formatter.

…ould not use the built in summary provider, but default to the valueobjectprinter

std::string children are normally formatted as their summary [my_string_prop] = "hello world".
Sbvalues created from data do not follow the same summary formatter path and instead hit the value object printer.
This change fixes that and adds tests for the future.

Added tests in TestDataFormatterStdString.py and a formatter for a custom struct.
…ge libstdcpp to dereference that address to get char* for std::string when formatting eAddressTypeHost
@Jlalond Jlalond force-pushed the std-string-create-value-by-data-summary-fix branch from 29b55fc to afc30d4 Compare May 23, 2024 14:07
@Jlalond
Copy link
Contributor Author

Jlalond commented May 23, 2024

@jimingham it turned out when created by data, the value object's address points to it's buffer which itself contains the char* of the string.

I'm not sure how we can support the small string optimization if this is the case, my testing was against a small string "test2", and it was always saved as a char* for the value object. I think for this PR we can probably ignore it and open an issue to support the small string optimization.

Please review when you have time

#include "lldb/Utility/Endian.h"
#include "lldb/Utility/Status.h"
#include "lldb/Utility/Stream.h"
#include "lldb/lldb-types.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually put the top level lldb includes first in the lldb includes list.

<= GetCompilerType().GetByteSize(exe_scope)
|| m_value.GetValueType() == Value::ValueType::LoadAddress)
return DoCast(compiler_type);
if (compiler_type.GetByteSize(exe_scope) <=
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a reformatting, right? We try to only clang-format the code the PR actually changes, (using git-clang-format) as it makes reviews easier. Also clang-format changes its mind about how to format over time, and so doing other code introduces spurious diffs that make archeology harder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah I ran git-clang-format against this.
Specifically, git fetch upstream; git-clang-format upstream/main and it produced this. I'll manually revert this one

@@ -0,0 +1,49 @@
# coding=utf8
"""
Helper formmater to verify Std::String by created via SBData
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formmater -> formatter

@@ -17,6 +17,15 @@ def setUp(self):
# Find the line number to break at.
self.line = line_number("main.cpp", "// Set break point at this line.")

# This is the function to remove the custom formats in order to have a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make all your custom summaries in a category (e.g. pass -w SOME_UNIQUE_NAME then you can delete just the category with type category delete. That way you know you are only deleting what you added.

Categories come into being disabled, so if you do it this way, you need to add:

type category enable SOME_UNIQUE_NAME

to your __lldb_init_module

Copy link
Collaborator

@jimingham jimingham May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see you didn't originally do that, so don't feel required to change what was there just moved... But it's still the better way to do it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is already old enough that we can lump it in!

@jimingham
Copy link
Collaborator

jimingham commented May 23, 2024

We should definitely put more comments in the ValueObject.h header, a lot of those interfaces are mysterious.

GetAddressOf is solving a very particular problem, which is that if someone does:

(lldb) expr MakeMeAFooObject()
$0= {whatever}
(lldb) expr CallAFunctionTakingAFooPointer(&$0)

People really want that to work, but $0 is a "constant result" so where it lives is not clear. It's also important for data formatters that want to use a pointer to some ValueObject or ValueObject child. For expr results we make a shadow copy in the target so we can resolve this issue (this is one of the jobs of the m_live_address ivar).

So it is really for finding addresses in the target, and should fail if we can't realize that.

It's other job is if an expression result is a pointer to a variable in the target, you want to be able to use & on that result, so we keep a connection in this case between the expression result and the "real thing" from the expression so that works. This part of the ValueObject system has some black magic in it for sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants