Kernel SMB (cifs) fails to work by name, it works just be IP after joining the domain, from Window 10 Client #254

Closed
BobTB opened this Issue Aug 20, 2018 · 29 comments

BobTB commented Aug 20, 2018

OK. How to make R151026 fail with SMB:

  1. Install a brand new latest OmniOS from ISO, set a static IP, and set a hostname (ax55)
  2. Set DNS and gateway on OmniOS (DNS set to the Windows Server AD DNS for local resolution), then run pkg update
  3. Add a DNS record for the OmniOS IP in the Windows DNS server
  4. Install Napp-it and add a simple pool with a filesystem and an SMB share (not guest enabled)
  5. Test access to the share using the DNS name of OmniOS and its IP - it will ask for a password, which is OK
  6. Go to Napp-it and join the OmniOS server to the AD domain as usual; this works for me every time (Server 2016, 2012, 2008, whatever)
  7. Open the share again from Windows 10 (\\ax55) - if you are lucky it will open, with no password asked, and work for a few minutes
  8. Wait some time, or just reboot the Windows 10 PC from which you are accessing the \\ax55 share.
    Try to open the share \\ax55 again. Experience the failure - either "windows can't find...." or "unspecified error ..."

You can still resolve the name ax55 to its IP from the command prompt, and you can still ping OmniOS by IP and by hostname from Windows 10; just the share is not working.
You can still access the share by using the IP directly (\\192.168....)
It still works in Windows 7, XP, Server 2008 etc...

Now you can disconnect the OmniOS server from the domain, reboot it, whatever, and the share will never work (by name) again in Windows 10.
If I change the hostname, it will work again until it is joined to the domain, and then it will fail as described above.

R151022 works as expected, nothing breaks, everything works. Something was changed from R151022 to R151026 to cause this...

Enabling or disabling SMB 1 doesn't change anything; I tried it all.

BobTB commented Aug 20, 2018

I went and installed everything I could from r151022 to r151026. For every version I did a clean ISO install, repeated the steps I described above, and checked whether it worked. Then I did pkg update and repeated.

r151022 works
r151024 works up to the latest 00afcd0 (July 2018)
r151026 does not work at all; the ISO I have is r151026-673c59f55d, which already fails as described above.

I can try to apply patches to r151024 until it breaks, but I do not know how or where to get them.

Member

citrus-it commented Aug 20, 2018

Starting with the biggest change in SMB between r151024 & 26, can you please try installing the following hot-fix on r151024?

# pkg apply-hot-fix --be-name=1575 https://downloads.omniosce.org/pkg/r151024/1575.p5p

BobTB commented Aug 20, 2018

I just applied the patch and rebooted to make sure. It broke SMB: it now acts exactly like r151026, with no access to the share from W10 machines anymore, except by IP. Therefore something in here is the culprit.

gwr commented Aug 21, 2018

The main clue about where to look is that, when joined to a
domain and when connecting via a name, the client will send a
Kerberos blob for authentication. By contrast, when connecting
via IP address the client will send NTLMSSP authentication.
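
As a rough illustration of that distinction, the sketch below classifies a captured security blob by scanning for well-known byte patterns: the 8-byte `NTLMSSP\0` message signature, and the DER encoding of the Kerberos 5 mechanism OID 1.2.840.113554.1.2.2. The names `classify_blob` and `auth_kind` are made up for this example, and this is only a heuristic for eyeballing traces, not how smbsrv or smbd actually parse SPNEGO.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
enum auth_kind { AUTH_UNKNOWN, AUTH_NTLMSSP, AUTH_KRB5 };

/* Return 1 if needle occurs anywhere in hay, else 0. */
static int
contains(const uint8_t *hay, size_t hlen, const uint8_t *needle, size_t nlen)
{
	if (nlen == 0 || hlen < nlen)
		return (0);
	for (size_t i = 0; i + nlen <= hlen; i++) {
		if (memcmp(hay + i, needle, nlen) == 0)
			return (1);
	}
	return (0);
}

/*
 * Heuristic only: look for the NTLMSSP message signature first, then
 * for the Kerberos 5 mechanism OID. A real server walks the SPNEGO
 * ASN.1 structure instead of pattern matching.
 */
enum auth_kind
classify_blob(const uint8_t *blob, size_t len)
{
	static const uint8_t ntlmssp_sig[] =
	    { 'N', 'T', 'L', 'M', 'S', 'S', 'P', 0 };
	/* DER content bytes of OID 1.2.840.113554.1.2.2 */
	static const uint8_t krb5_oid[] =
	    { 0x2a, 0x86, 0x48, 0x86, 0xf7, 0x12, 0x01, 0x02, 0x02 };

	if (contains(blob, len, ntlmssp_sig, sizeof (ntlmssp_sig)))
		return (AUTH_NTLMSSP);
	if (contains(blob, len, krb5_oid, sizeof (krb5_oid)))
		return (AUTH_KRB5);
	return (AUTH_UNKNOWN);
}
```

Feeding it the security buffer from the session setup request in a by-name and a by-IP capture should be enough to confirm which mechanism the client chose in each case.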

I suggest getting a network trace of a connection attempt
(taken on the illumos server) when the share becomes
inaccessible. If possible, simultaneously get dtrace logs
with /usr/lib/smbsrv/dtrace/smbsrv.d (for the kernel
smbsrv module) and smbd-all.d (for smbd).

If we can get the same captures & dtrace both before and after
the suspect change and compare them, that should let us
narrow down to where things might have gone wrong.

citrus-it self-assigned this Aug 21, 2018

citrus-it added the bug label Aug 21, 2018

BobTB commented Aug 21, 2018

I was unable to run the smbd-all.d script; I was told that line 30: "pid0' does not contain a valid pid". Probably because it is not running at all? I am just using the kernel smbsrv...

The smbsrv.d script worked, and I made smbsrv.d traces of when it fails (after the patch) and when it works (before the patch), both run on r151024.

I hope this will show why this is happening. If I need to do something else (or I did not do this right) I am more than prepared to do it.

smbsrv.d.trace.zip

SMB parameters:

system_comment=
max_workers=1024
netbios_enable=false
netbios_scope=
lmauth_level=4
keep_alive=5400
wins_server_1=
wins_server_2=
wins_exclude=
signing_enabled=true
signing_required=false
restrict_anonymous=false
pdc=
ads_site=
ddns_enable=false
autohome_map=/etc
ipv6_enable=false
print_enable=false
traverse_mounts=true
map=
unmap=
disposition=
max_protocol=

BobTB commented Aug 21, 2018

I also made a dtrace when accessing the share by IP on a patched system, where it doesn't work by name. Attached here.

smbsrv.d.trace.ip.zip

Member

citrus-it commented Aug 21, 2018

For the smbd-all trace, do:

dtrace -s /usr/lib/smbsrv/dtrace/smbd-all.d -p `pgrep smbd`

Could you also grab a network packet trace in both the working and non-working cases? You might want to inspect the files with something like Wireshark to check they're OK to share.

snoop -d <network device> -o /tmp/file.snp 192.168.X.Y

I think the ideal would be all three traces from the same connection attempt so that they can be correlated.

BobTB commented Aug 21, 2018

I got this when running

dtrace -s /usr/lib/smbsrv/dtrace/smbd-all.d -p `pgrep smbd`

dtrace: failed to compile script /usr/lib/smbsrv/dtrace/smbd-all.d: line 95: probe description pid2580:libc_hwcap1.so.1:syslog:entry does not match any probes

pgrep smbd does give me a pid, so it's running. If I can get this running I will do the simultaneous traces.

citrus-it changed the title from Kernel SMB (cfis) fails to work by name, it works just be IP after joining the domain, from Window 10 Client to Kernel SMB (cifs) fails to work by name, it works just be IP after joining the domain, from Window 10 Client Aug 21, 2018

Member

citrus-it commented Aug 21, 2018

I'm having the same trouble running smbd-all.d - working fine on bloody but not previous releases. For now, just grab the other bits if you can and I'll look at this tomorrow.

gwr commented Aug 22, 2018

The smbsrv kmod is giving up on the logon attempt after the
first auth. message send/recv. That generally means that the
authentication service up in smbd returned an error code.

We need to see what's happening in the smbd auth. service.
Either smbd-all.d or smbd-authsvc.d will show that.
(the latter may be sufficient and less "noisy")

BTW, the user-level dtrace scripts need

dtrace -p `pgrep smbd` ...

BobTB commented Aug 22, 2018

I managed to run all three (smbd-all.d, smbd-authsvc.d and smbsrv.d) again. Attached is a zip with all three traces. They were taken while the share was accessed from a client PC, at the moment it fails to open, 5 seconds or so in total. The files are somewhat larger than before.

share_fail.zip

gwr commented Aug 23, 2018

The function smb_decode_krb5_pac is failing.

Unfortunately, that dtrace script does not hook all the probes we want to see.
It would help a lot if you could edit smbd-authsvc.d and add all probes in
libmlsvc and libmlrpc. Have a look at smbd-rpcsvc.d and consider adding
the "mask" logic (to reduce some of the noise in the mlrpc code).

BobTB commented Aug 23, 2018

Thank you very much for looking into this, but this is now beyond my knowledge and capability. If anyone can help here with a fixed dtrace script for me to run, I can run it.

Member

citrus-it commented Aug 24, 2018

Give this one a try please - run it the same way as before. As @gwr said, smb_decode_krb5_pac is failing with RPC_NT_PROTOCOL_ERROR and this should give some more information on where it is failing.

dtrace -Zs smbd-254.d -p `pgrep smbd`

smbd-254.d.txt

BobTB commented Aug 24, 2018

Thank you. Here is the trace now. It's really short; I hope I captured what is needed.

smbd-244.zip

Member

citrus-it commented Aug 24, 2018

Could you try this one too please?
smbd-254.d.txt

BobTB commented Aug 24, 2018

Here it goes.

trace2.zip

gwr commented Aug 25, 2018

Sorry, still need more info. Can you please attempt the logon again with this dtrace script running?
smbd-254b.d.txt
Thanks

BobTB commented Aug 25, 2018

No problem, here it is!

trace3.txt

gwr commented Aug 25, 2018

Got the name of libmlrpc wrong. Again please...
smbd-254c.d.txt

gwr commented Aug 25, 2018

If you have a snoop of this at the same time, that would be handy.
(tcp port 445)

BobTB commented Aug 26, 2018

Ok, I just did another go at this. Just in case it helps I took two sets.

There are 4 files. snoopOK and traceOK which were taken when the share is accessed from Windows 7 client, which works ok.

snoopFail and traceFail were taken when the share was accessed from Windows 10 client, which fails. Both files in each set were taken simultaneously.

trace3.zip

gwr commented Aug 26, 2018

Can you send me your krb5 keytab? (unicast -- it's sensitive)

Alternatively, load the keytab into wireshark, decode frame 19 in snoop-fail, and send me the full decode.

gwr commented Aug 26, 2018

libmlrpc is failing while decoding a UTF-16 string like:
Jxxxxxxx Fxxxx~ - KFM d.o.o.
(exact name obscured here for privacy)

The failure happens in the function ndr_s_wchar, which calls ndr_inner
for each wchar in the string. Those all succeed, but then ndr_s_wchar
calls ndr__wcstombs, which calls ndr__wcslen (OK so far, len=0x1c)
but then (not in the trace) the call into libc:uconv_u16tou8
apparently fails, and then ndr__wcstombs returns -1

  0                            -> ndr__wcstombs 	0x81bcfbc	0xfc84c9d4	0x1c	0x1b	0x81b7a94	0x0
  0                              -> ndr__wcslen 	0xfc84c9d4	0xfc84c9ac	0xfc84c980	0xfe66926c	0x0	0x81bce98
  0                              <- ndr__wcslen 	0x1c
  0                            <- ndr__wcstombs 	0xffffffff
  0                          <- ndr_s_wchar   	0x0

At this point, we could add libc:uconv_u16tou8 to the dtrace script,
or run this again with a breakpoint set in ndr__wcstombs and just
look around when the debugger stops with that string.

If we can get the exact UTF-16 string (either from debug or from the decoded trace)
then we could experiment calling libc:uconv_u16tou8 with that string and the same
buffer sizes etc. to figure out what's going wrong.

gwr commented Aug 26, 2018

I think I see the (likely) problem. Decoding UTF-16 strings to UTF-8 can result in longer strings, and the length limit passed to uconv_u16tou8 is smaller than what the caller has actually provided.
Any chance you can try this fix?

diff --git a/usr/src/lib/libmlrpc/common/ndr_process.c ...
index 3188500a8b..c32433a291 100644
--- a/usr/src/lib/libmlrpc/common/ndr_process.c
+++ b/usr/src/lib/libmlrpc/common/ndr_process.c
@@ -1984,7 +1984,8 @@ ndr_s_wchar(ndr_ref_t *encl_ref)
 	 */
 	if (nds->m_op == NDR_M_OP_UNMARSHALL) {
 		wcs[wlen] = 0;
-		slen = ndr__wcstombs(valp, wcs, wlen);
+		slen = encl_ref->size_is * NDR_MB_CHAR_MAX;
+		slen = ndr__wcstombs(valp, wcs, slen);
 		if (slen == (size_t)-1)
 			return (0);
 		valp[slen] = '\0';
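
To make the failure mode concrete, here is a minimal sketch of a UTF-16 to UTF-8 conversion with an output limit. `u16_to_u8` is a hypothetical stand-in written for this example; it is not libc's uconv_u16tou8 or the libmlrpc code, and it handles BMP characters only. It only mimics the behaviour discussed above: the conversion fails when the output limit equals the number of UTF-16 units and at least one character needs more than one byte.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a bounded UTF-16 -> UTF-8 conversion.
 * BMP only (no surrogate pairs). Returns the number of bytes written,
 * excluding the terminating NUL, or (size_t)-1 if the converted string
 * plus NUL does not fit in dst_max bytes.
 */
size_t
u16_to_u8(char *dst, size_t dst_max, const uint16_t *src, size_t src_len)
{
	size_t out = 0;

	if (dst_max == 0)
		return ((size_t)-1);

	for (size_t i = 0; i < src_len; i++) {
		uint16_t c = src[i];
		size_t need = (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;

		/* Leave room for this character and the final NUL. */
		if (out + need + 1 > dst_max)
			return ((size_t)-1);

		if (need == 1) {
			dst[out++] = (char)c;
		} else if (need == 2) {
			dst[out++] = (char)(0xC0 | (c >> 6));
			dst[out++] = (char)(0x80 | (c & 0x3F));
		} else {
			dst[out++] = (char)(0xE0 | (c >> 12));
			dst[out++] = (char)(0x80 | ((c >> 6) & 0x3F));
			dst[out++] = (char)(0x80 | (c & 0x3F));
		}
	}
	dst[out] = '\0';
	return (out);
}
```

For the two-unit string "Fž" (`{ 'F', 0x017E }`), a limit of 3 bytes makes the call return `(size_t)-1`, while a limit of 4 succeeds and returns 3. Sizing the limit from the worst-case expansion rather than from the wide-character count, as the diff does, avoids that failure.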

Member

citrus-it commented Aug 26, 2018

Here you go:

# pkg apply-hot-fix --be-name=1575b https://downloads.omniosce.org/pkg/r151024/1575b.p5p

BobTB commented Aug 26, 2018

Great! This is fixed. I tried with various clients (from XP, 7, Vista, 8.1 to w10) and it works every time. Great! Thank you all for bearing with me.

gwr commented Aug 26, 2018

Thanks for the test case. Is there an issue opened for this yet?
The key to this is login where some of the AD-provided strings (i.e. user full name)
contain characters that expand when converted from UTF-16 to UTF-8.
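
A quick way to see which names are affected is to compute the UTF-8 length of a UTF-16 string and compare it with the number of 16-bit units. The helper names and the `U8_BYTES_PER_U16_MAX` constant below are made up for this sketch (they are not the libmlrpc definitions). The bound of 3 bytes per unit follows from the encodings themselves: a BMP character needs at most 3 UTF-8 bytes, and a surrogate pair uses 2 units for 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Worst-case UTF-8 bytes per UTF-16 code unit (hypothetical name). */
#define	U8_BYTES_PER_U16_MAX	3

/* Exact number of UTF-8 bytes needed for n UTF-16 units (no NUL). */
size_t
utf8_len_of_utf16(const uint16_t *s, size_t n)
{
	size_t bytes = 0;

	for (size_t i = 0; i < n; i++) {
		uint16_t c = s[i];

		if (c < 0x80) {
			bytes += 1;
		} else if (c < 0x800) {
			bytes += 2;
		} else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n &&
		    s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
			bytes += 4;	/* valid surrogate pair */
			i++;
		} else {
			bytes += 3;
		}
	}
	return (bytes);
}

/* Buffer size that always fits n units plus the terminating NUL. */
size_t
utf8_buf_size_for(size_t n)
{
	return (n * U8_BYTES_PER_U16_MAX + 1);
}
```

A pure ASCII name has the same length in both encodings, which would explain why only some accounts trigger the problem; any name with an accented or otherwise non-ASCII character comes out longer in UTF-8 than its UTF-16 unit count.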

Member

citrus-it commented Aug 26, 2018
