Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add core and usnic support for fid_nic #4433

Merged
merged 1 commit into from
Oct 16, 2018

Conversation

jsquyres
Copy link
Member

@jsquyres jsquyres commented Sep 22, 2018

This PR adds support for https://ofiwg.github.io/libfabric/master/man/fi_nic.3.html.

However, there are (at least) 2 problems that need to be resolved:

EDIT: there's only 1 problem left

  1. https://ofiwg.github.io/libfabric/master/man/fi_nic.3.html states that fid_nic.prov_attr is only understood by the provider.
    1. How do we know how to tostr() it?
    2. How do we know how to free it (during fi_freeinfo())?

Do we need additional provider ops for these functions?

@shefty
Copy link
Member

shefty commented Sep 22, 2018

info->handle is a fid, which includes a type field.

We either need to narrow the scope of what prov_attr can be (e.g. a pointer to a single allocated memory region), or find a way to hand it to the provider for processing.

@jsquyres
Copy link
Member Author

Ok, fixed the code to check for FI_CLASS_NIC now.

prov_attr needs to be nailed down, though.

src/fabric.c Outdated
free(nic->prov_attr);
}
free(nic);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to a separate function? I haven't looked and don't remember, but I'm guessing fi_dupinfo needs to be updated as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to a subroutine. Also added handling for fi_dupinfo().

Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to see what we can do for the prov_attr. But I'm fine having whatever that is go in as separate changes.


free(nic->prov_attr);
free(nic);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you call ofi_free_nic() instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! Fixed.

@jsquyres jsquyres force-pushed the pr/usnic-fid_nic branch 2 times, most recently from 40f75ca to 81db8ef Compare September 24, 2018 21:02
@jsquyres
Copy link
Member Author

Ok, cleaned up the comments and the commit messages.

@bturrubiates usnic fi_info -v output on an old model VIC looks like this:

fi_info:
    caps: [ FI_MSG, FI_RECV, FI_SEND, FI_SOURCE ]
    mode: [ FI_MSG_PREFIX, FI_LOCAL_MR ]
    addr_format: FI_SOCKADDR_IN
    src_addrlen: 16
    dest_addrlen: 0
    src_addr: fi_sockaddr_in://10.10.0.6:0
    dest_addr: (null)
    handle (fid_nic):
        fi_device_attr:
            name: usnic_0
            device_id: UCSC-PCIE-CSC-02 (VIC 1225)
            device_version: 0xcf
            vendor_id: 0x1137
            driver: usnic_verbs
            firmware: 4.1(3f)
        fi_bus_attr:
            fi_bus_type: FI_BUS_UNKNOWN
        fi_link_attr:
            address: 10.10.0.6/16
            mtu: 9000
            speed: 10000
            state: FI_LINK_UP
            network_type: Ethernet
    fi_tx_attr:
...etc.

shefty
shefty previously approved these changes Sep 24, 2018
Copy link
Member

@shefty shefty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core changes look good

@shefty
Copy link
Member

shefty commented Sep 24, 2018

For the prov_attr, we need to extend struct fi_provider if we want to be able to print them or allow a complex structure. So, I guess the first decision is how flexible do we want provider attributes to be?

My current thinking is to add tostr(), allocinfo, and freeinfo() calls to fi_provider. The provider would only be responsible for handling provider specific portions of input data.

@jsquyres
Copy link
Member Author

Did you mean allocnic and freenic? Or were you specifically being more general in terms of allocating/freeing an entire info? Can you flesh out a little more about what you're thinking for these 2 functions?

@shefty
Copy link
Member

shefty commented Sep 24, 2018

I was considering adding the other external API calls to the provider, but the provider would only deal with the provider specific pieces. For example, today if an app calls fi_dupinfo() on an fi_info where the handle is set to FI_CLASS_CONNREQ, the handle is ignored. So, I'd like to have a more generic solution.

These would be the additions:

dupinfo(const struct fi_info *info, struct fi_info *new_info);
freeinfo(struct fi_info *info);
tostr(const void *data, enum fi_type datatype);

There needs to be enough information in whatever input data there is in order to route the call to the correct provider. So, not all calls into the main API can find their way to the provider, even it provider specific data is present.

For dupinfo/freeinfo, the provider only deals with the handle field. But maybe this would allow a provider to store data beyond the end of the user visible structure(s)?


nic = calloc(1, sizeof(*nic));
if (!nic)
goto nomem;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nomem label makes a call to ofi_free_nic if the calloc fails. With the current version of the code, that will cause a seg fault because ofi_free_nic is not safe if a NULL pointer is passed in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed ofi_free_nic() to handle NULL.

src/fabric.c Outdated
@@ -481,6 +481,31 @@ FI_DESTRUCTOR(fi_fini(void))
ofi_osd_fini();
}

void ofi_free_nic(struct fid_nic *nic)
{
if (nic->device_attr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to handle a NULL nic pointer in here somehow?

Maybe wrap this in one more if-block, or maybe make all of these something like:

if (nic && nic->device_attr) {
}

if (ret < 0) {
USDF_DBG(
"asprintf failed while creating device_id\n");
ret = -ENOMEM;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're missing a goto in this cleanup path. You set ret but then overwrite it immediately afterward.

if (ret < 0) {
USDF_DBG(
"asprintf failed while creating device_version\n");
ret = -ENOMEM;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, think you're missing a goto.

if (ret < 0) {
USDF_DBG(
"asprintf failed while creating vendor_id\n");
ret = -ENOMEM;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goto missing.

if (!la)
goto nomem;
uint8_t *p = (uint8_t*) &dap->uda_ipaddr_be;
ret = asprintf(&la->address, "%d.%d.%d.%d/%d",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that we use anything other than v4, but if we ever do support anything else this conversion can probably be replaced with inet_ntop.

if (ret < 0) {
USDF_DBG(
"asprintf failed while creating link address\n");
ret = -ENOMEM;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing goto.

src/fabric.c Outdated
@@ -507,6 +532,10 @@ void DEFAULT_SYMVER_PRE(fi_freeinfo)(struct fi_info *info)
free(info->fabric_attr->prov_name);
free(info->fabric_attr);
}
if (info->handle && info->handle->fclass == FI_CLASS_NIC) {
struct fid_nic *nic = (struct fid_nic*) info->handle;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this, you can just directly ofi_free_nic(info->handle);.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The types are different -- a cast of some flavor or another is necessary.

I chose to make the param of ofi_free_nic() be (struct fid_nic *) because, well, you're freeing a struct fid_nic, not a generic fid. So I figured the onus should be on the caller to make sure the right type is passed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The types are different -- a cast of some flavor or another is necessary.

Ooops, I thought handle here was void *.

src/fabric.c Outdated
la_dup->mtu = la->mtu;
la_dup->speed = la->speed;
la_dup->state = la->state;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it's very important, but should you #undef MEMBER_STRDUP here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure; I have no strong opinions about this. Done.

src/fabric.c Outdated
@@ -853,6 +957,14 @@ struct fi_info *DEFAULT_SYMVER_PRE(fi_dupinfo)(const struct fi_info *info)
if (dup->fabric_attr->prov_name == NULL)
goto fail;
}
if (info->handle && info->handle->fclass == FI_CLASS_NIC) {
int ret;
ret = ofi_dup_nic((struct fid_nic*) info->handle,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not using the return value, and you don't need the cast. This can be expressed clearer as:

if (ofi_dup_nic(info->handle, &dup->handle) < 0)
    goto fail;

@shefty
Copy link
Member

shefty commented Sep 25, 2018

We discussed this some at today's ofiwg. @j-xiong made a suggestion to add function pointers to prov_attr, basically turning it into a struct fid. Along those lines, we have struct fid_nic. We can add additional ops to fid_nic for handling the prov_attr, or use the existing fid_nic::control() function to handle the prov_attr.

For example, fid_nic::close() could be called to free the nic data from fi_freeinfo(). Internally, the provider can call the generic helper ofi_free_nic() to free the common structures, then free the prov_attr themselves.

Similarly, we can define FI_CONTROL codes for fid_nic that would copy or print the NIC attributes. A generic helper would handle the common structures, but the provider would handle the prov_attr field.

@jsquyres jsquyres force-pushed the pr/usnic-fid_nic branch 2 times, most recently from 8a9d812 to 8795e93 Compare September 25, 2018 17:37
ba->bus_type = FI_BUS_UNKNOWN;

la = calloc(1, sizeof(*la));
if (!la) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency, remove braces here.

src/fabric.c Outdated
ofi_free_nic(nic_dup);
return -FI_ENOMEM;

#undef MEMBER_STRUP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?

@jsquyres
Copy link
Member Author

@bturrubiates Sigh. Where is my brain today? Fixed.

@jsquyres
Copy link
Member Author

Any idea why AppVeyor is failing? That seems to have nothing to do with this PR:

msg_pingpong -I 5
SERVER OUTPUT: 
----
CLIENT OUTPUT: 
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
64      5       640         0.10s      0.01   10099.70       0.00
256     5       2.5k        0.78s      0.00   78397.40       0.00
1k      5       10k         0.10s      0.10    9999.60       0.00
4k      5       40k         0.92s      0.04   92297.20       0.00
64k     5       640k        0.10s      6.55    9999.60       0.00
1m      5       10m         0.10s    101.81   10299.70       0.00
----
server error code = -1073741819 
test failed

See https://ci.appveyor.com/project/shefty/libfabric/build/4384/job/j40ldvxy4097d5by#L2146

@shefty
Copy link
Member

shefty commented Sep 25, 2018

I restarted appveyor to see if the error re-appears. I can't recall any recent changes that would have broken windows.

@jsquyres
Copy link
Member Author

@shefty Failed again, but a different test this time -- https://ci.appveyor.com/project/shefty/libfabric/build/4387/job/i7nbp95e1w3v249x#L2037:

msg_sockets
SERVER OUTPUT: 
----
CLIENT OUTPUT: 
bound_addr: "[127.0.0.1]:9229"
Sending message...
Send completion received
----
server error code = -1073741819 
test failed

@shefty
Copy link
Member

shefty commented Sep 25, 2018

Other recent PRs passed appveyor, so maybe there's something in this change that the existing tests aren't handling, or there's some other issue. The changes you've made to the core look okay to me though.

@shefty
Copy link
Member

shefty commented Sep 25, 2018

I am in the process of pulling in the common pieces of this PR into a separate series, with an updated proposal to handle the prov_attr component. Please allow that PR to merge first, then we can update the usnic provider accordingly.

@jsquyres
Copy link
Member Author

@shefty no problem

@shefty shefty dismissed their stale review September 26, 2018 03:23

working on defining prov_attr

@shefty
Copy link
Member

shefty commented Sep 26, 2018

Please take a look at #4443. That contains a proposal for dealing with the prov_attr by making use of the fid_nic component of the structure.

@shefty
Copy link
Member

shefty commented Oct 4, 2018

FYI - I have been sidetracked from this, but I do plan on getting back to it. I have some patches in progress locally to add a fid_nic to fi_info. But that breaks the ABI, so it requires some additional compatibility code.

@jsquyres
Copy link
Member Author

jsquyres commented Oct 4, 2018

Me too -- I was planning on refreshing this PR according to the new fid_nic code on master "soon".

@jsquyres
Copy link
Member Author

jsquyres commented Oct 7, 2018

@shefty @bturrubiates Revamped, now there's two commits:

  1. Added default fid_nic tostr functions.
    • Open question here: the fid tostr function takes a len parameter that is effectively obviated in all the tostr.c functions by just using the BUFSIZ constant. I left in the len parameter and even pass it through to the sub functions in case we want to do something with it, but... we need to decide something here.
  2. Re-did usnic fid_nic support for usnic to use the new core infrastructure.

@shefty
Copy link
Member

shefty commented Oct 8, 2018

I have a patch to fixup the fid tostr function.

I still need to update the ABI compat area to extend fi_info.

@shefty
Copy link
Member

shefty commented Oct 8, 2018

See PR #4495 for what I have. I also have an incomplete patch locally to address the ABI change

@shefty
Copy link
Member

shefty commented Oct 12, 2018

In theory, the core has all the updates that are needed now. This PR should just need the usnic specific patch now (c787f7e).

@jsquyres
Copy link
Member Author

@bturrubiates All previous review comments addressed, and updated to utilize fid_nic core infrastructure. This PR now ready for review.

Output for usnic looks like this:

fi_info:
...
    fid_nic:
        fi_device_attr:
            name: usnic_0
            device_id: UCSC-PCIE-CSC-02 (VIC 1225)
            device_version: 0xcf
            vendor_id: 0x1137
            driver: usnic_verbs
            firmware: 4.1(3f)
        fi_bus_attr:
            fi_bus_type: FI_BUS_UNKNOWN
        fi_link_attr:
            address: 10.10.0.1
            mtu: 9000
            speed: 10000
            state: FI_LINK_UP
            network_type: Ethernet

la = nic->link_attr;

socklen_t size = 32;
la->address = calloc(1, size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use INET_ADDRSTRLEN.


nic = ofi_nic_dup(NULL);
if (!nic)
goto nomem;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nomem label will dereference nic, so this will likely cause a crash in the error path.

if (!da->firmware)
goto nomem;

ba = nic->bus_attr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor thing, but you probably don't need to use a variable for ba since you only use it once. Probably easier to just nic->bus_addr->bus_type = FI_BUS_UNKNOWN.

The usnic driver does not currently export PCI information, so that
data is left empty in the fid_nic.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres
Copy link
Member Author

@bturrubiates Fixed.

@jsquyres jsquyres merged commit 7e9c2bd into ofiwg:master Oct 16, 2018
@jsquyres jsquyres deleted the pr/usnic-fid_nic branch October 16, 2018 01:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants