KaveToolbox is the data analytics toolkit of the KAVE. It is installed with AmbariKave and can also be installed stand-alone (http://beta.kave.io). A wiki for the entire KAVE is maintained with the cluster installer, on the AmbariKave wiki.
Installing more pip modules on top of kave toolbox as the root user:
- Installing a python module into anaconda when the first install used root privileges
- Installing some derived library (some ROOT component) which needs ROOT and python integration
Why is this complicated?
- The root user by default does not have the KAVE environment set up: it uses the system python and cannot see the KAVE components
- Whatever new modules you try to install as root will then build against the system libraries, not KaveToolbox
What errors might I see?
- Complaints that you don't have the right privileges
- Software you think you've installed does not link or run correctly against KaveToolbox
- Software you think you've installed does not run for your user, but seems to run for the root user
- sudo su #changes you to the root user
- source /opt/KaveToolbox/pro/scripts/KaveEnv.sh # the regular kave environment is not automatically fired for the root user
- #install as normal
    sudo su
    source /opt/KaveToolbox/pro/scripts/KaveEnv.sh
    conda update conda
    pip install pymongo
- RootNotes: using ROOT plotting in ipython notebooks
- StatTools: tool for confidence level calculations
- LogMon : monitor a logfile (e.g. hive logfile)
- gdown : Auto download files from google docs/drive
- geomaps : Utilities to give a postal code-based map, or other geographical-based map in ipython Notebook
- Python through (ana)conda, includes SciPy, numpy, pip etc., (continuum.io)
- ROOT, CERN's data analysis package (root.cern.ch)
- R with integration into iPython notebook (http://nbviewer.ipython.org/github/dboyliao/cookbook-code/blob/master/notebooks/chapter07_stats/08_r.ipynb)
- Additional hadoopy-python modules, dumbo, mrjob, pyleus and pymongo_hadoop (if hadoop is available)
- Pentaho kettle (only if specifically configured, see ReleaseNotes.md for details), graphical process and data management tool (http://community.pentaho.com/projects/data-integration/)
- robomongo (only if specifically configured, see ReleaseNotes.md for details)
- IPython notebooks for RootNotes, StatTools, R and geomaps
CentOS6, CentOS7, Redhat7, Ubuntu 14 and Ubuntu 16 are used for testing, although no guarantees are given.
Only bash as a default shell is supported at the moment, users with a different default have reported many problems.
Please get in touch if you would like to make enquiries about this.
KaveToolbox aims to make the installation of our key analytics software and libraries seamless, so that one-click deployment is possible and encouraged. It takes the pain out of working out prerequisites and compilation for most of our software. When you just want to get stuck straight into the data, you can bring along the same toolbox. It ensures a common environment, which allows simpler code distribution across all data nodes: "fire and forget" instead of "push and pray".
KaveToolbox recognises two types of distribution:
- Node - no x-windows, needs libraries necessary for linking/running jobs, but no GUI management for that
- Workstation - complete data analysis workstation, with all graphical components, vnc, x-windows, etc.
Node: 5 GB of disk space for the software, additional 2 GB temp space needed during installation
Workstation: 7 GB of disk space for the software, additional 2 GB temp space needed during installation
Node: 1 core 2GB of RAM
Workstation: 2 core 4GB RAM
An internet connection (many packages will be downloaded from various sites)
On Centos6/7, review your yum.conf file to make sure you are not excluding packages that need to be installed
Nodes are likely to have even higher requirements because of other services such as Hadoop or Storm.
- (2 core + 4 GB RAM)+(1 core + 2 GB RAM)*(number of simultaneous users)
- (100 GB)*(number of all-time users) home directory
- 20 GB "/" free on top of system size, or direct mount of 20 GB as /opt/
- 100 GB "/tmp" size
- GB Ethernet with high upload bandwidth for VNC connections
- We recommend that any servers/services requiring 100% uptime (e.g. Hue/Ganglia/nagios/ldap) are not run on the analysis workstation, since analysis users have erratic usage with very high peaks. Run such services on dedicated servers in the network instead.
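The sizing rules of thumb above can be turned into a quick calculation. The following is a minimal bash sketch, not part of KaveToolbox: the function name `kave_workstation_sizing` and its two arguments are hypothetical, and it only reproduces the arithmetic from the list above.

```shell
# Sketch only: kave_workstation_sizing is a hypothetical helper, not a KaveToolbox script.
# It applies the rules of thumb listed above:
#   (2 cores + 4 GB RAM) + (1 core + 2 GB RAM) * simultaneous users
#   100 GB of home directory per all-time user, 20 GB for /opt, 100 GB for /tmp
kave_workstation_sizing() {
    local simultaneous_users="$1"   # users working at the same time
    local all_time_users="$2"       # everyone who will ever get a home directory
    local cores=$(( 2 + 1 * simultaneous_users ))
    local ram_gb=$(( 4 + 2 * simultaneous_users ))
    local home_gb=$(( 100 * all_time_users ))
    echo "cores=${cores} ram_gb=${ram_gb} home_gb=${home_gb} opt_gb=20 tmp_gb=100"
}

# Example: 5 simultaneous users out of 12 users in total
kave_workstation_sizing 5 12
```

For 5 simultaneous users and 12 all-time users this gives 7 cores, 14 GB of RAM and 1200 GB of home directory space.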
We also release the software packaged within docker containers. See http://hub.docker.com/r/kave/kavetoolbox. For example:
docker run -it kave/kavetoolbox:3.7-Beta.c7.node /bin/bash
When making a local installation you have two choices:
- Installing a released version from the repos server
- Installing the head, branch or specific tag from GIT
We recommend installing with the default configuration. If you want to modify the configuration, you can create the file /etc/kave/CustomInstall.py. For an example and more information, run the installer with --help
- 1: Installing the released version, for example, 3.7-Beta
    yum -y install wget curl tar zip unzip gzip python
    wget http://repos:firstname.lastname@example.org/noarch/KaveToolbox/3.7-Beta/kavetoolbox-installer-3.7-Beta.sh
    sudo bash kavetoolbox-installer-3.7-Beta.sh [--quiet]
(--quiet is for a quieter install, remove the brackets!) Remember the help at this stage [--help]
( NB: yum is the standard package manager for Centos/Redhat. To install on Ubuntu the equivalent is apt-get )
( NB: the repository server uses a semi-private password only as a means of avoiding robots and reducing DOS attacks. This password is intended to be widely known and is used here as an extension of the URL )
- 2: Installing the head from git, Example given using ssh.
    #test ssh keys with
    ssh -T email@example.com
    #if this works,
    git clone firstname.lastname@example.org:KaveIO/KaveToolbox.git
    #then install with
    sudo ./KaveToolbox/scripts/KaveInstall [--quiet]
(--quiet is for a quieter install, remove the brackets!) Remember the help at this stage [--help]
- Then to browse through examples
    cd /opt/KaveToolbox/pro/examples
    ipython notebook
And/or visit http://nbviewer.ipython.org/
Optional: Editing configuration files
- Default will install into directories in /opt
- Default will not overwrite existing packages
- Default configurations are well-tested, read all the configurations from config/kavedefaults.py
- To override configurations, create a simple python file in /etc/kave/CustomInstall.py
- This python file should be used to logically overwrite any property of a service appearing in kavedefaults.py, and it will not be overwritten on re-install/upgrade
- To override pip requirements, create and edit the file /etc/kave/requirements.txt
- For an example and more information call ./KaveToolbox/scripts/KaveInstall --help
Optional: Set mirrors/nearside cache
- A list of mirrors of where to locate our software can be added to /etc/kave/mirror .
- The "mirror" file is interpreted line-by-line. Use it to list nearside cache directories or nearside mirrors of the KPMG repository.
- All mirrors listed here must follow the same directory structure as the main repository, this looks like: mirror/os-version(s)/KaveToolbox/toolbox-version(s)/files.ext
- See more details below in setting up such a cache
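To make the required directory structure concrete, the bash sketch below prints where a given file would have to live under each entry of a mirror file. The helper `kave_mirror_candidates` is hypothetical and is not how the installer is known to resolve mirrors; it only illustrates the layout mirror/os-version/KaveToolbox/toolbox-version/file described above, using a temporary file in place of /etc/kave/mirror.

```shell
# Sketch only: kave_mirror_candidates is a hypothetical helper, not part of KaveToolbox.
# It reads a mirror file line by line and prints the location each mirror would
# need to provide for one file, following the documented directory structure.
kave_mirror_candidates() {
    local mirror_file="$1" os_dir="$2" version="$3" file="$4"
    local mirror
    while IFS= read -r mirror || [ -n "$mirror" ]; do
        [ -z "$mirror" ] && continue            # skip blank lines
        # strip one trailing slash so the joined path has no double slash
        echo "${mirror%/}/${os_dir}/KaveToolbox/${version}/${file}"
    done < "$mirror_file"
}

# Demonstration with a temporary stand-in for /etc/kave/mirror
demo_mirror="$(mktemp)"
printf '%s\n' "/my/shared/dir" "http://my/local/apache/mirror" > "$demo_mirror"
kave_mirror_candidates "$demo_mirror" noarch 3.7-Beta kavetoolbox-installer-3.7-Beta.sh
rm -f "$demo_mirror"
```

Each printed location should exist in your cache if the mirror follows the same structure as the main repository.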
Optional: Additional installation options
- The installer script has more options to help steer the installation
- take a look at the --help for the KaveInstall script for more details.
- Examples include automatically cleaning old versions from /opt. (--clean-after)
- Examples include completely cleaning directories before install from /opt (--clean-before)
- If you want to only select a certain list of components to install, this is possible with command-line arguments, e.g. KaveInstall KaveToolbox anaconda will only install the KaveToolbox scripts and anaconda python, but nothing else
- "Warning: end of file not at end of line" during installation: this means you don't have enough virtual memory for the compilation of root. Modify configuration file for "low memory mode"
- Other errors in root or python installation: if installation fails, it may be due to conflicts with a previous install. Try touch ~/.nokaveEnv and then obtain a clean shell, possibly via ssh
- ProtectNotebooks.sh script: if run as root, it adds a system-wide ipython_notebook_config.py file; if run as a user, it adds a user-level ipython_notebook_config.py file. This file chooses a default port based on the username and protects notebooks with the user's login password
- Re-running the installer over a pre-existing installation will only install new software and pick up new configuration changes.
- New software will be installed into versioned directories, to make it easier to track
- In case of an error during installation the install will stop. To complete an incomplete installation, re-run the installer. This will not delete any partially created directories; you will need to do that yourself
- To fix some component within a broken installation, delete any installed directories in /opt (or wherever you specified them to be) and re-run the installer. It will only install those parts you either deleted or that didn't work the first time.
- To perform a complete re-install, remove the relevant directories from /opt, like /opt/root, /opt/kettle etc., or add the --clean-before flag to the script
- To re-install only the core KaveToolbox with any new features, see Updating
- To re-install specific components, add the component names as arguments: ' KaveInstall eclipse kettle --clean-before '
There are three possible update mechanisms
- Downloading/rerunning the latest install script (from git or from the repository -> 2 methods)
- Running the KaveUpdate script
    sudo /opt/KaveToolbox/pro/scripts/KaveUpdate --list
    sudo /opt/KaveToolbox/pro/scripts/KaveUpdate --help
    sudo /opt/KaveToolbox/pro/scripts/KaveUpdate --quiet
The update script works well for updating between 2.X versions, and can also be used for 1.X, but only with either:
- the --clean-before flag.
- or by moving/removing directories in /opt, e.g. moving /opt/KaveToolbox/ to /opt/KaveToolbox/1.X and /opt/anaconda to /opt/anaconda/2.2 (version of old anaconda install)
The --clean-after flag is a common addition to the update to remove deprecated software after install
If you are trying to upgrade from 1.X to 2.X, either use --clean-before to remove the previous install, or move /opt/KaveToolbox/ to /opt/KaveToolbox/1.X and /opt/anaconda to /opt/anaconda/2.2 before installation
The correct paths to directly use our tools will be automatically added to your environment provided:
- you are not the root user
- you do not have .nokaveEnv in your home directory
- you use bash as your default shell
In other cases you will need to get/set environment manually
source [directory, e.g. /opt/KaveToolbox/pro/scripts]/KaveEnv.sh
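The three conditions above can be checked before deciding whether to source KaveEnv.sh by hand. This is a minimal bash sketch under stated assumptions: `kave_env_is_automatic` is a hypothetical helper, it takes the login shell from `$SHELL`, and it only mirrors the documented conditions rather than the actual logic inside the KAVE scripts.

```shell
# Sketch only: kave_env_is_automatic is a hypothetical helper, not part of KaveToolbox.
# It returns 0 when all three documented conditions for automatic setup hold.
kave_env_is_automatic() {
    local user_id="$1" home_dir="$2" login_shell="$3"
    [ "$user_id" -ne 0 ] || return 1                          # not the root user
    [ ! -e "${home_dir}/.nokaveEnv" ] || return 1             # no opt-out file
    [ "$(basename "$login_shell")" = "bash" ] || return 1     # bash is the default shell
    return 0
}

if kave_env_is_automatic "$(id -u)" "$HOME" "${SHELL:-/bin/sh}"; then
    echo "KAVE environment should be set automatically for this user"
else
    echo "Source KaveEnv.sh manually, e.g. from /opt/KaveToolbox/pro/scripts"
fi
```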
The ASCII-art KAVE banner only shows up for interactive, non-dumb terminals. To turn off the KAVE banner even in that case do
To disable automatic setting of the environment for this user:
To force setting the environment for this user in case they would normally be skipped, first remove .nokaveEnv, then:
Test if it works?
- take a look at the examples!
    cd $KAVETOOLBOX/examples
    ipython notebook
    --> Choose, for example, rootnotes.ipynb --> Kernel --> Restart --> Cells --> RunAll
Migration to Python3 (as of July 6 2017)
We have migrated to python3 as the default
Internet during installation, firewalls and nearside cache/mirror options
Ideally all of your nodes will have access to the internet during installation in order to download software.
If this is not the case, you can, possibly, implement a near-side cache/mirror of all required software. This is not very easy, but once it is done one time, you can keep it for later.
- Centos6: Howto
- EPEL: Mirror FAQ , Mirroring
- Ambari: Local Repositories Deploying HDP behind a firewall
Setting up a local near-side cache for the KAVE tool stack is quite easy. First either copy the entire repository website to your own internal apache server, or copy the contents of the directories to your own shared directory visible from every node.
    mkdir -p /my/shared/dir
    cd /my/shared/dir
    # -r downloads recursively (-R is the reject-list option and would not work here)
    wget -r http://repos:email@example.com/
Then create a /etc/kave/mirror file on each node with the new top-level directory to try first before looking for our website:
    echo "/my/shared/dir" >> /etc/kave/mirror
    echo "http://my/local/apache/mirror" >> /etc/kave/mirror
So long as the directory structure of the nearside cache is identical to our website, you can drop, remove or replace any packages you will never install, and update the cache as our repo server updates.
How can I install from behind a proxy?
You might consider creating a near-side cache and/or configuring your proxy settings correctly. Since we use wget for the downloads, your existing proxy settings (e.g. HTTP_PROXY environment variable) should be sufficient.
Don't forget that the root user/sudo also must communicate over the proxy, and this may mean propagating the right environment variables. Try adding:
    Defaults env_keep +="http_proxy"
    Defaults env_keep +="https_proxy"
to your sudoers file with visudo
I can reach the repos server fine, but the installer still tells me there is a problem!
We can't troubleshoot your networking issues for you. If you are trying to install from behind a proxy, check the "How can I install from behind a proxy?" FAQ. Also talk with your network administrator and decide if you need to set up a nearside cache.
The precompiled root version will not run from a non-default configuration
- This is to be expected.
- If you have edited the configuration file to change installed packages or locations, it is quite likely that root will not install from the precompiled version correctly.
- To fix this, revert your copy of kaveconfiguration.py to the default settings and re-install root, or, configure/compile root yourself in this new location like:
    cd /root/install/location
    ./configure [options e.g. linuxx8664gcc --enable-python --enable-mathmore --enable-minuit2 --enable-roofit --fail-on-missing]
    make -j numcores
or follow the instructions on the root website to install root yourself
ROOT installation fails during yum install
Many different packages are needed. Did you maybe run out of space? Or did you ignore kernel packages in your yum.conf?
Check /etc/yum.conf and see if there is anything you are ignoring or forbidding from installing.
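One way to perform this check is to list the `exclude` lines in the yum configuration. The bash sketch below assumes the packages are being ignored through the standard `exclude=` option; the helper name `yum_excludes` is hypothetical, and the function prints nothing when no such line exists or the file cannot be read.

```shell
# Sketch only: yum_excludes is a hypothetical helper, not part of KaveToolbox.
# It prints every exclude= line found in a yum configuration file.
yum_excludes() {
    local conf="${1:-/etc/yum.conf}"
    [ -r "$conf" ] || return 0        # nothing to report if the file is unreadable
    grep -E '^[[:space:]]*exclude[[:space:]]*=' "$conf" || true
}

# Check the default configuration file; any output lists packages yum will skip
yum_excludes
```

If a line such as `exclude=kernel*` is printed, remove or narrow it before re-running the installer. Running `df -h` alongside this covers the disk space question.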
can't checkout due to gnome-ssh-askpass Gtk-WARNING cannot open display
This is gnome trying to spawn an x window to have you enter your password. Work around by:
Is it possible to install the software without root/superuser privileges?
So long as the prerequisites are already installed (see the yum install commands in kaveconfiguration.py), it is in principle possible to install all the software we package into a local directory. However, this is not implemented yet. It would also not permit seamless integration of all users and machines in a network, and it would not be possible to automatically source the environment for all users.
This is usually your local browser which is blocking things:
- Not allowed to run unsafe/unverified scripts (look for the tell-tale icon in the browser toolbar)
- Not allowed to display mixed content (if you are trying to display https, look for the tell-tale icon in the browser toolbar)
In the first case, you can simply permit scripts running, by clicking on the correct icon and choosing the correct option.
VNC opens to a black screen with just a small dialog box?
On Centos7, for some reason the vnc installation/start does not by default recognise the gnome installation
To fix this, edit your .vnc/xstartup file to contain:
    #!/bin/sh
    [ -r /etc/sysconfig/i18n ] && . /etc/sysconfig/i18n
    export LANG
    export SYSFONT
    vncconfig -iconic &
    unset SESSION_MANAGER
    unset DBUS_SESSION_BUS_ADDRESS
    OS=`uname -s`
    if [ $OS = 'Linux' ]; then
      case "$WINDOWMANAGER" in
        *gnome*)
          if [ -e /etc/SuSE-release ]; then
            PATH=$PATH:/opt/gnome/bin
            export PATH
          fi
          ;;
      esac
    fi
    if [ -x /etc/X11/xinit/xinitrc ]; then
      exec /etc/X11/xinit/xinitrc
    fi
    if [ -f /etc/X11/xinit/xinitrc ]; then
      exec sh /etc/X11/xinit/xinitrc
    fi
    [ -r $HOME/.Xresources ] && xrdb $HOME/.Xresources
    xsetroot -solid grey
    xterm -geometry 80x24+10+10 -ls -title "$VNCDESKTOP Desktop" &
    twm &
Cannot open ipython/jupyter notebooks: Permission denied: '/run/user/0/jupyter'
This is caused by an environment variable being inherited from one user to the next. Simple fix: unset XDG_RUNTIME_DIR (note: the variable name without the $).
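A slightly more careful variant of this fix only clears the variable when it points somewhere the current user cannot write to, which is the situation behind the error above. This is a bash sketch with a hypothetical helper name, `clear_foreign_runtime_dir`; it must be run in the same shell that will start the notebook, since it changes that shell's environment.

```shell
# Sketch only: clear_foreign_runtime_dir is a hypothetical helper, not part of KaveToolbox.
# It unsets XDG_RUNTIME_DIR when the directory it names is not writable by the
# current user, e.g. when it was inherited from another user's session.
clear_foreign_runtime_dir() {
    if [ -n "${XDG_RUNTIME_DIR:-}" ] && [ ! -w "$XDG_RUNTIME_DIR" ]; then
        echo "Unsetting XDG_RUNTIME_DIR (was: $XDG_RUNTIME_DIR)"
        unset XDG_RUNTIME_DIR
    fi
}

clear_foreign_runtime_dir
```

After this, start the notebook from the same shell as usual.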
Cannot import pandas in notebook, notebooks run under different kernel!
In some cases we have seen that users have a file ~/.local/share/jupyter/kernels/python2/kernel.json where the wrong python executable is given.
Easy fix, change the name of the python executable to simply 'python' in this file.