Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

netdata/packaging/installer: Finish up netdata installer refactoring #5705

Closed
paulkatsoulakis opened this issue Mar 25, 2019 · 10 comments
Closed
Labels
area/packaging Packaging and operating systems support priority/medium
Milestone

Comments

@paulkatsoulakis
Copy link
Contributor

paulkatsoulakis commented Mar 25, 2019

Feature idea summary

This issue is a wrap-up task of PR #5505
We will use it to track any further problems identified during testing and also make it the link between the old PR and the new that i will be producing with any further fixes we may come up with.

CC: @cakrit

Expected behavior

netdata-installer.sh runs as before, with the fixes included.

@paulkatsoulakis paulkatsoulakis added area/packaging Packaging and operating systems support priority/medium labels Mar 25, 2019
@paulkatsoulakis paulkatsoulakis added this to the v1.14-rc0 milestone Mar 25, 2019
@paulkatsoulakis
Copy link
Contributor Author

Last time the previous PR was tested, the following problem was identified when run stable and improved versions of netdata in brand new containers:

Raw diff of logs between the two versions

$ diff install.err.paulfantom install.err.stable 
7c7
< 
---
> Press ENTER to build and install netdata to your system > 
131c131
< [/root/netdata52]# chown -R root:netdata /usr/libexec/netdata 
---
> [/root/netdata52]# chown -R root /usr/libexec/netdata 
158c158
< [/root/netdata52]# chown root:netdata /usr/libexec/netdata/plugins.d/cgroup-network-helper.sh 
---
> [/root/netdata52]# chown root /usr/libexec/netdata/plugins.d/cgroup-network-helper.sh 
177c177
< [/root/netdata52]# tar -xf /tmp/netdata-go-BmvTXc/config.tar.gz -C /usr/lib/netdata/conf.d/ 
---
> [/root/netdata52]# tar -xf /tmp/netdata-go-nXVIre/config.tar.gz -C /usr/lib/netdata/conf.d/ 
183c183
< [/root/netdata52]# mv /tmp/netdata-go-BmvTXc/go.d.plugin-v0.2.0.linux-amd64 /usr/libexec/netdata/plugins.d/go.d.plugin 
---
> [/root/netdata52]# mv /tmp/netdata-go-nXVIre/go.d.plugin-v0.2.0.linux-amd64 /usr/libexec/netdata/plugins.d/go.d.plugin 
191a192,197
> [/root/netdata52]# chmod a+rX /usr/libexec 
>  OK   
> 
> [/root/netdata52]# chmod a+rX /usr/share/netdata 
>  OK   
> 
211,213c217,218
< Ooops! it seems netdata is not started.
< [/root/netdata52]# /usr/bin/systemctl start netdata 
<  OK   
---
> OK. NetData Started!
> 
215c220
<  OK  netdata started! 
---
> -------------------------------------------------------------------------------
218c223
< [/root/netdata52]# curl -sSL --connect-timeout 10 --retry 3 http://localhost:19999/netdata.conf 
---
> [/root/netdata52]# curl -s -o /etc/netdata/netdata.conf.new http://localhost:19999/netdata.conf 

paulkats@etrius.local:~/Documents/Development/Personal
$ 

From the above, the noticeable are the following (they are easier to understand with vimdiff

  1. These permissions are only set in stable version, we no longer do it on the refactored verson. Intentional?
> [/root/netdata52]# chmod a+rX /usr/libexec 
>  OK   
> 
> [/root/netdata52]# chmod a+rX /usr/share/netdata 
>  OK   
  1. Netdata seems to start with systemctl restart netdata when doing it with the stable code, while in the refactored version we fail on the first attempt and it retries in order to succeed. Here's the message on the refactored code that indicates that glitch
211,213c217,218
< Ooops! it seems netdata is not started.
< [/root/netdata52]# /usr/bin/systemctl start netdata 
<  OK   
---

@paulkatsoulakis
Copy link
Contributor Author

paulkatsoulakis commented Mar 27, 2019

The chmod removal on the branched change was introduced intentionally by cde4264, depre

According to #1292 bug, this addition was attending a problem back then.
According to @paulfantom input, we are removing it now, because:

  1. We create the netdata dir with the right permissions in the first place
  2. We don't have to and we shouldn't touch the system folder.

@paulkatsoulakis
Copy link
Contributor Author

The other difference that i am investigating next is the post-install start up of the service.
Seems that the refactored version is producing the following error, that is auto-fixed but introduces delay in the installation completion:

211,213c217,218
< Ooops! it seems netdata is not started.
< [/root/netdata52]# /usr/bin/systemctl start netdata 
<  OK   

@paulkatsoulakis
Copy link
Contributor Author

I 've spotted a couple more nits from the log file that i have committed already.
Regarding the above error, unfortunately i couldn't reproduce it on my laptop so i can't do much.
I went through the implementation that raises this message and i couldn't not see a use case raising it.
Unfortunately for this i will have to wait until i have my LXC containers playground environment available that gave me the reproduce.

Moving forward with my checks, i executed the following commands, so that i can then validate the generated folder structures from the installer

$ docker run -it -v "XXXXXXXXX/install_dir_refactored:/install_dir_refactored:rw" -v "${PWD}:/code:rw" -w /code "netdata/os-test:ubuntu1804" ./netdata-installer.sh --dont-start-it --dont-wait --install /install_dir_refactored
$ docker run -it -v "XXXXXXXXX/install_dir_stable:/install_dir_stable:rw" -v "${PWD}:/code:rw" -w /code "netdata/os-test:ubuntu1804" ./netdata-installer.sh --dont-start-it --dont-wait --install /install_dir_stable

@paulkatsoulakis
Copy link
Contributor Author

executing the tree command on both install folders and then vimdiffing them, i confirm that they generate an identical tree structure

XX:~/XX/XX/Personal
$ tree -d install_dir_refactored > refactored
XX:~/XX/XX/Personal
$ tree -d install_dir_stable> stable 

@paulkatsoulakis
Copy link
Contributor Author

I also confirmed the attribute details on each file by doing the following

 5353  3/29/2019 03:06  ls -ltrR install_dir_refactored > 11
 5354  3/29/2019 03:06  ls -ltrR install_dir_stable >> 22
 5357  3/29/2019 03:08  cat 11 | sort > 111
 5358  3/29/2019 03:08  cat 22 | sort > 222
 5359  3/29/2019 03:08  vimdiff 111 222

The only differences are in some file sizes because of the different parent dir names in the paths primarily and also some timestamps because of the different execution time

@paulkatsoulakis
Copy link
Contributor Author

Now, in order to make sure we correctly evaluate the user/group ownership details we should run the content listing from within the container.
So i ran the following to help me get the correct user/group information from the container

$  docker run -it -v "/Users/paulkats/Documents/Development/Personal/install_dir_stable:/install_dir_stable:rw" -v "${PWD}:/code:rw" -w /code "netdata/os-test:ubuntu1804" /bin/bash -c './netdata-installer.sh --dont-start-it --dont-wait --install /install_dir_stable; ls -ltrR /install_dir_stable' > ../stable_contents
$ docker run -it -v "/Users/paulkats/Documents/Development/Personal/install_dir_refactored:/install_dir_refactored:rw" -v "${PWD}:/code:rw" -w /code "netdata/os-test:ubuntu1804" /bin/bash -c './netdata-installer.sh --dont-start-it --dont-wait --install /install_dir_refactored; ls -ltrR /install_dir_refactored' > ../refactored_contents

@paulkatsoulakis
Copy link
Contributor Author

Refactor code | Stable
1683 checking for xenstat_init in -lxenstat... no | 1684 checking for xenstat_init in -lxenstat...

1706 checking whether C compiler accepts -flto... yes | 1707 checking whether C compiler accepts -flto...

So even though captured contents of files are fine, with the previous commands we also captured the output of the installation. It seems that we have a difference on these commands.
Stable does not produce a response while optimised does.

I will have to check whether this is again one of the intentional changes

@paulkatsoulakis
Copy link
Contributor Author

We finally nailed the root cause for this difference also!
With the support of @vlvkobal on the troubleshooting, we noticed that the install process was stalling during the pidof execution for some time.

we deep dived and we discovered that when we switched to command -v, we basically introduced an endless loop because of the BASH function pidof that was returned from command -v as the command to execute. That resulted on the script to end up on an endless loop that eventually crashed and made the installer continue to the next approach for netdata restart (it does that during the restart sequences).

To mitigate this, we basically renamed BASH function pidof to safe_pidof which is actually more accurate name too and thus we prevented the loop. PR is already updated.

The other difference that i am investigating next is the post-install start up of the service.
Seems that the refactored version is producing the following error, that is auto-fixed but introduces delay in the installation completion:

211,213c217,218
< Ooops! it seems netdata is not started.
< [/root/netdata52]# /usr/bin/systemctl start netdata 
<  OK   

@paulkatsoulakis
Copy link
Contributor Author

Merged all fixes, closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/packaging Packaging and operating systems support priority/medium
Projects
None yet
Development

No branches or pull requests

1 participant