Online Monitor for the Pandas Dataframe Transformations
====================================================================================

$$\\[2pt]$$


Igor Marfin **[Unister GmbH @ 2014]** <igor.marfin@unister.de>


$$\\[40pt]$$

# Table of Contents
* [0.1 Abstract](#0.1-Abstract)
* [0.2 Initialization](#0.2-Initialization)
* [0.3 Introduction to the topic and motivation](#0.3-Introduction-to-the-topic-and-motivation)
	* [0.3.1 Utility functions](#0.3.1-Utility-functions)


$$\\[10pt]$$

## 0.1 Abstract

--------------------------

Today I want to introduce my project, **Online Monitor for the Pandas Dataframe Transformations**. The source code and documentation can be found in the
repository <a name="ref-1"/>[(I.Marfin:Repository_of_the_project, 2015)](#cite-PROJECTREPO).
This project illustrates the practical steps of building an online monitor that
traces the status of long-running transformations of a pandas dataframe.

More details can be found at https://bitbucket.org/iggy_floyd/pandas-dataframe-transformation-monitor.



$$\\[5pt]$$

## 0.2 Initialization

--------------------------



To set up the Python environment for the data analysis and give the notebook a nicer style, run the following commands at the beginning of the session:

In [2]:
import sys
# Put the locally installed packages first on the path,
# so that numpy 1.9 is picked up instead of the system-wide version 1.6.
sys.path = ['/usr/local/lib/python2.7/dist-packages'] + sys.path

%matplotlib inline
%pylab inline
ion()

import os
import logging

import matplotlib
import numpy as np
import matplotlib.pyplot as pl
import matplotlib as mpl
import pymc as pm

# plotting and dataframe modules
import seaborn as sns  # seaborn makes nicer plots of the data
import pandas as pd
import scipy.stats as stats

# Set up logging.
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Load the notebook style.
from book_format import load_style, figsize, set_figsize
load_style()



Populating the interactive namespace from numpy and matplotlib


In [2]:
%%javascript
IPython.load_extensions("calico-spell-check", "calico-document-tools",
"calico-cell-tools");


<IPython.core.display.Javascript object>

$$\\[5pt]$$

## 0.3 Introduction to the topic and motivation 

----------------

My motivation for developing this project is quite straightforward.

A large part of any statistical inference is the preparation of data in a consumable
format and the generation of new data on the basis of existing information.
For ``pythonic`` statisticians, this means that

1. a pandas dataframe has to be built from the database or file

2. new data should be inserted into the dataframe

3. the dataframe should be transformed (normalized)


Each of these steps usually takes a long time if the number of entries is large.
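
The three steps above can be sketched on a tiny example. This is only an illustration written for Python 3: the in-memory CSV, the column names (`user`, `amount`) and the min-max normalization are assumptions chosen for brevity, not data or code from the project.

```python
import io

import pandas as pd

# Step 1: build a dataframe from a file.
# An in-memory CSV stands in for a real file or database here (assumption).
csv_source = io.StringIO("user,amount\nalice,10\nbob,30\ncarol,20\n")
df = pd.read_csv(csv_source)

# Step 2: insert new data derived from the existing columns.
df["amount_doubled"] = df["amount"] * 2

# Step 3: transform (normalize) a column to the [0, 1] range.
amount = df["amount"]
df["amount_norm"] = (amount - amount.min()) / (amount.max() - amount.min())

print(df)
```

With three rows this finishes instantly. With millions of rows, each step can run long enough that some feedback on its progress becomes desirable.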

If you run your analysis on your desktop, you can trace your computations on the
display via calls to ``sys.stdout.write(...)`` or ``print(...)``.
When your code is running in a cloud or on some remote server,
this approach becomes problematic. Logging your calculations to files on the remote machine
does not help either if you don't have ``ssh`` access to the server.
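
For the desktop case, such a trace could look like the following sketch (Python 3). The function `normalize` and its `report_every` parameter are hypothetical names introduced only for this illustration.

```python
import sys


def normalize(values, report_every=2):
    """Scale values to [0, 1] and write a progress trace to stdout."""
    low, high = min(values), max(values)
    total = len(values)
    result = []
    for index, value in enumerate(values, start=1):
        result.append((value - low) / (high - low))
        # Report every few entries and once more at the very end.
        if index % report_every == 0 or index == total:
            sys.stdout.write("processed %d of %d entries\n" % (index, total))
    return result


normalized = normalize([1, 2, 3, 4, 5])
```

The trace only reaches whoever is looking at the terminal or notebook that runs the code, which is exactly the limitation described above.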


So it would be interesting to develop a system that monitors the
dataframe transformations in real time and is accessible over the Internet.
The project <a name="ref-2"/>[(I.Marfin:Repository_of_the_project, 2015)](#cite-PROJECTREPO) illustrates the practical steps one can take to
build such a system.
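
To make the idea concrete, here is a minimal sketch of such a monitor (Python 3). It is not the project's implementation: a plain HTTP endpoint from the standard library that clients poll is swapped in for simplicity, and all names (`progress`, `transform_rows`, `StatusHandler`) are hypothetical.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Shared progress state, guarded by a lock because the transformation
# and the HTTP server run in different threads.
progress = {"done": 0, "total": 0}
progress_lock = threading.Lock()


def transform_rows(rows, func):
    """Apply func to every row and record the progress after each one."""
    with progress_lock:
        progress["done"], progress["total"] = 0, len(rows)
    result = []
    for row in rows:
        result.append(func(row))
        with progress_lock:
            progress["done"] += 1
    return result


class StatusHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the current progress as JSON."""

    def do_GET(self):
        with progress_lock:
            body = json.dumps(progress).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), StatusHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

squares = transform_rows(list(range(1000)), lambda x: x ** 2)

# Any HTTP client can poll the status while (or after) the job runs.
connection = HTTPConnection("127.0.0.1", server.server_port)
connection.request("GET", "/")
status = json.loads(connection.getresponse().read().decode("utf-8"))
connection.close()
print(status)

server.shutdown()
```

Polling keeps the sketch short. A push-based channel such as a WebSocket would avoid repeated requests, at the cost of extra dependencies.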

$$\\[5pt]$$

## Installation

----------------
Installation is easy.

Simply clone the project

    git clone https://iggy_floyd@bitbucket.org/iggy_floyd/pandas-dataframe-transformation-monitor.git
    
    
and check that your system has all the required components installed:

* python
* scientific python modules

To do this, run the default rule of the ``Makefile`` with the following
shell command:

    make
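
If you prefer to check the Python side by hand, a quick test could look like the sketch below (Python 3). It does not reproduce what the ``Makefile`` does. The list of modules is an assumption taken from the imports used in this notebook.

```python
import importlib.util
import sys

# Assumed list of required modules, based on the imports in this notebook.
REQUIRED_MODULES = ["numpy", "pandas", "scipy", "matplotlib", "seaborn"]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("python", sys.version.split()[0])
missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("missing modules:", ", ".join(missing))
else:
    print("all required modules are installed")
```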
    
    

In [1]:
!cd bin; ./setup.daemons.sh; export DAEMON_NAME=MONITORSOCKETIO_MULTI_TRANSFORM; ./daemon.sh start;
#!cd bin; ./setup.daemons.sh; export DAEMON_NAME=NGROK; ./daemon.sh start;


^C


In [52]:
from IPython.display import IFrame
IFrame("templates/monitor_multi_transform_socketio.html", width="100%", height=800)



In [3]:
# "./daemon.sh stop" does not seem to work here, so the processes are killed by name instead
#!cd bin; ./setup.daemons.sh; export DAEMON_NAME=MONITORSOCKETIO_MULTI_TRANSFORM; ./daemon.sh stop;
# "grep -v grep" keeps the grep process itself out of the list of PIDs
!ps aux | grep monitorserver_multi_transform | grep -v grep | awk '{print $2}' | xargs -I {} kill -9 {}
!ps aux | grep ngrok | grep -v grep | awk '{print $2}' | xargs -I {} kill -9 {}

In [17]:
%install_ext https://raw.githubusercontent.com/mozilla/kuma-lib/master/packages/ipython/IPython/Extensions/jobctrl.py

Installed jobctrl.py. To use it, type:
  %load_ext jobctrl


In [19]:
%%script bash --bg --out script_out

sleep 10
echo hi!

Starting job # 0 in a separate thread.


In [20]:
%%script bash --bg --out script_out2
echo hi!

Starting job # 2 in a separate thread.


In [18]:
%load_ext jobctrl

ImportError: cannot import name genutils

In [7]:
from IPython import parallel
c = parallel.Client()
view = c.load_balanced_view()



In [10]:
%%px --local



class MyLongAssComputation(object):
    def compute(self, x):
        return x**2
    
def compute_func(x):
    import sys
    sys.path = ['/usr/local/lib/python2.7/dist-packages'] + sys.path # to fix the problem with numpy: this replaces  1.6 version by 1.9
    import pandas as pd
    import pickle
    c = MyLongAssComputation()
    return c.compute(x)

In [11]:

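import pickle

x = np.arange(1, 1000)
squared = view.map_sync(compute_func, x)
# pickle needs a binary file handle; the context manager closes it afterwards
with open('output.pickle', 'wb') as output_file:
    pickle.dump(squared, output_file)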


CompositeError: one or more exceptions from call to method: compute_func
[0:apply]: ValueError: numpy.dtype has the wrong size, try recompiling
[2:apply]: ValueError: numpy.dtype has the wrong size, try recompiling
[1:apply]: ValueError: numpy.dtype has the wrong size, try recompiling
[1:apply]: ImportError: cannot import name hashtable
.... 995 more exceptions ...

In [4]:
### Do not delete the following Markdown section!
### It contains the BibTeX references!

<!--bibtex

@Article{PROJECTREPO,
  Author    = {I. Marfin: Repository_of_the_project,},
  Title     = {Online Monitor for the Pandas Dataframe Transformations}, 
  year      = 2015,
  url       = "https://bitbucket.org/iggy_floyd/pandas-dataframe-transformation-monitor",  
}


@Article{WARNER,
  Author    = {D. WARNER NORTH,},
  Title     = {A Tutorial Introduction to Decision Theory}, 
  year      = 1968,
  url       = "https://drive.google.com/file/d/0B5OwgVT-YmdbVDdYVFk2LXlxVWc/view?usp=sharing",  
}

@Article{Risk_aversion,
  Author    = {Wikimedia Foundation, Inc},
  Title     = {Risk aversion}, 
  year      = 2015,
  url       = "https://en.wikipedia.org/wiki/Risk_aversion",  
}

@Article{James_DT,
  Author    = {James F.,},
  Title     = {Decision Theory},
  year      = 2012,
  url       = "https://www.luminpdf.com/viewer/SJqiwgnfkYz9gqQHP?sk=336a7cb6-5f20-421a-9ff2-6540eea71c7e",  
}


@Article{Jordan_DT,
  Author    = {Jordan I.M.,},
  Title     = {Lecture3: Decision theory},
  year      = 2014,
  url       = "http://www.cs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture3.pdf",  
}

@Article{Price_List,
  Author    = {www.datagrabber.org,},
  Title     = {Facebook Price Is Right SHOWCASE (Retail Prices)},
  year      = 2015,
  url       = "http://www.datagrabber.org/welcome-to-datagrabber-org/price-is-right-showcase-cheat-prices/?doing_wp_cron=1439468477.1399750709533691406250",  
}

@Article{Google_marketlist,
  Author    = {Google Finance,},
  Title     = {Finance Data Listing and Disclaimers},
  year      = 2015,
  url       = "http://www.google.com/intl/en/googlefinance/disclaimer/",  
}

@Article{MaximillianVitek,
  Author    = {Maximillian Vitek,},
  Title     = {A python module for getting intraday data from Google Finance},
  year      = 2014,
  url       = "https://github.com/maxvitek/intradata",  
}


-->

# References

<a name="cite-PROJECTREPO"/><sup>[^](#ref-1) [^](#ref-2) </sup>I. Marfin. 2015. _Online Monitor for the Pandas Dataframe Transformations_. [URL](https://bitbucket.org/iggy_floyd/pandas-dataframe-transformation-monitor)

<a name="cite-WARNER"/><sup>[^](#ref-3) </sup>D. Warner North. 1968. _A Tutorial Introduction to Decision Theory_. [URL](https://drive.google.com/file/d/0B5OwgVT-YmdbVDdYVFk2LXlxVWc/view?usp=sharing)

