# In Memory Data Mining Tools

luoq08@gmail.com OR hzluoqiang@corp.netease.com

### general process of data mining

* data acquisition
  * crawler
* [data wrangling](https://en.wikipedia.org/wiki/Data_wrangling)
  * load data(sql, csv/xls, json, html)
  * clean
  * transformation (join, group, sort)
  * data visualization
* model

### focus of this talk

* on a single machine
  * 48 cores, 2 TB storage, 256 GB RAM
  * [Big RAM is eating big data](http://datascience.la/big-ram-is-eating-big-data-size-of-datasets-used-for-analytics/)
* data fits in memory (sometimes out of core)
* open source: Python and some shell tools

Not included:

* Hadoop, MPI, Spark

### python

[Why is Python so popular in machine learning?](https://www.quora.com/Why-is-Python-so-popular-in-machine-learning)

pros:

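* easy to use
* libraries: the Swiss Army knife of machine learning
* hot spots can be sped up in C (though C extensions are not easy to write)
* dynamic (live coding, objects are easy to manipulate)

cons:

* parallelism: CPython's GIL keeps threads from running Python code in parallel, so CPU-bound work needs multiprocessing
* dynamic (errors surface at run time, so tests are needed)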

## setup

1. Install [anaconda](https://www.continuum.io/downloads)
```
conda install xxx
pip install yyy
```

2. start jupyter
```
jupyter notebook
```
When Jupyter runs on a remote server, an [ssh tunnel](http://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot) may be useful for reaching it from a local browser.

3. import libraries

```python
import numpy as np
import scipy.sparse as sp
import pandas as pd

from sklearn.linear_model import LogisticRegression
import xgboost as xgb

import gensim
import nltk
from sklearn.feature_extraction.text import CountVectorizer

import matplotlib.pyplot as plt
import seaborn
seaborn.set()
%matplotlib inline
```

## SciPy

The SciPy Stack: Scientific Computing Tools for Python

<div>
<ul>
<li>
<div class="thumbnail">
<div class="pull-left img">
  <a href="http://numpy.org">
  <img alt="numpy" src="https://scipy.org/_static/images/numpylogo_med.png" width="64">
  </a>
</div>
<div class="img-label">
  <h4 class="media-heading"><a href="http://numpy.org">NumPy</a></h4>
  Base N-dimensional array package
</div>
</div>
</li>

<li>
<div class="thumbnail">
<div class="pull-left img">
  <a href="https://scipy.org/scipylib/index.html">
  <img alt="scipy" src="https://scipy.org/_static/images/scipy_med.png" width="64">
  </a>
</div>
<div class="img-label">
  <h4 class="media-heading"><a href="https://scipy.org/scipylib/index.html">SciPy library</a></h4>
  Fundamental library for scientific computing
</div>
</div>
</li>

<li>
<div class="thumbnail">
<div class="pull-left img">
  <a href="http://matplotlib.org/">
  <img alt="matplotlib" src="https://scipy.org/_static/images/matplotlib_med.png" width="64">
  </a>
</div>
<div class="img-label">
  <h4 class="media-heading"><a href="http://matplotlib.org/">Matplotlib</a></h4>
  Comprehensive 2D Plotting
</div>
</div>
</li>

<li>
<div class="thumbnail">
<div class="pull-left img">
  <a href="http://ipython.org">
  <img alt="ipython" src="https://scipy.org/_static/images/ipython.png" width="64">
  </a>
</div>
<div class="img-label">
  <h4 class="media-heading"><a href="http://ipython.org">IPython</a></h4>
  Enhanced Interactive Console
</div>
</div>
</li>

<li>
<div class="thumbnail">
<div class="pull-left img">
  <a href="http://sympy.org/">
  <img alt="sympy" src="https://scipy.org/_static/images/sympy_logo.png" width="64">
  </a>
</div>
<div class="img-label">
  <h4 class="media-heading"><a href="http://sympy.org/">SymPy</a></h4>
  Symbolic mathematics
</div>
</div>
</li>

<li>
<div class="thumbnail">
<div class="pull-left img">
  <a href="http://pandas.pydata.org/">
  <img alt="pandas badge" src="https://scipy.org/_static/images/pandas_badge2.jpg" width="64">
  </a>
</div>
<div class="img-label">
  <h4 class="media-heading"><a href="http://pandas.pydata.org/">pandas</a></h4>
  Data structures &amp; analysis
</div>
</div>
</li>

</ul>
</div>

## [jupyter](http://jupyter.org/)


* live [REPL](https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop); similar to [Mathematica](https://www.wolfram.com/mathematica/) and [Sage](http://www.sagemath.org/) notebooks
* multiple languages: Python, R, Julia, Scala ([a list of kernels](https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages))
* rich content: markdown, html, picture, js, interactive widget
* share: export to html, slides, nbviewer


### jupyter tour
[Try it](https://try.jupyter.org/)

[Awesome Data Science: 1.0 Jupyter Notebook Tour](https://www.youtube.com/watch?v=e9cSF3eVQv0)

More examples: 
* [Jupyter Notebook Viewer](http://nbviewer.jupyter.org/)
* [A gallery of interesting IPython Notebooks](https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks)

MATLAB-like environment:

* numpy: dense matrix
* scipy: scipy.sparse for sparse matrix; linear algebra, optimization ...
* matplotlib: matlab like plotting; image based, static
* [seaborn](https://web.stanford.edu/~mwaskom/software/seaborn/): matplotlib based; high level; more attractive
* [mpld3](http://mpld3.github.io/): bring matplotlib to d3

### [numpy](http://www.numpy.org/)

* [tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
* [Numpy for Matlab users](https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html)
* [another NumPy for MATLAB users](http://mathesaurus.sourceforge.net/matlab-numpy.html)
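
As a small taste of the dense-array style those tutorials cover, here is a minimal sketch. The matrix `A` and vector `b` are made-up values for illustration only.

```python
import numpy as np

# A small dense matrix and vector (values chosen only for illustration)
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([1.0, 1.0])

# Matrix-vector product, the NumPy counterpart of MATLAB's A * b
y = A @ b

# Column means, then broadcasting subtracts them from every row without a loop
col_means = A.mean(axis=0)
centered = A - col_means

# Solve the linear system A x = b
x = np.linalg.solve(A, b)

print(y, col_means, x)
```

Note that `*` on arrays is element-wise in NumPy; `@` is the matrix product.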

### [scipy](http://docs.scipy.org/doc/scipy/reference/)

* [sparse matrix](http://docs.scipy.org/doc/scipy/reference/sparse.html)

```python
X = sp.csr_matrix((V, (I, J)))
```
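
The constructor call above takes the matrix in coordinate (triplet) form. A minimal sketch with made-up values for `V`, `I` and `J` could look like this; passing `shape` explicitly is an assumption added here so that trailing empty rows or columns are kept.

```python
import numpy as np
import scipy.sparse as sp

# Triplet form: V[k] is stored at row I[k], column J[k]
I = np.array([0, 0, 1, 2])
J = np.array([0, 2, 1, 2])
V = np.array([1.0, 2.0, 3.0, 4.0])

# Without shape, the size is inferred from the largest index in I and J
X = sp.csr_matrix((V, (I, J)), shape=(3, 4))

print(X.nnz)        # number of stored entries
print(X.toarray())  # dense copy, only sensible for small matrices
```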

* [optimization](http://docs.scipy.org/doc/scipy-0.15.1/reference/optimize.html)
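
For the optimization module, a minimal sketch is to minimize the Rosenbrock test function, which ships with SciPy as `rosen` together with its gradient `rosen_der`. The starting point and the choice of BFGS are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Classic starting point for the 2-D Rosenbrock function (minimum at [1, 1])
x0 = np.array([-1.2, 1.0])

# BFGS with the analytic gradient supplied through jac
result = minimize(rosen, x0, jac=rosen_der, method="BFGS")

print(result.x)    # location of the minimum found
print(result.fun)  # function value there
```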

### [matplotlib](http://matplotlib.org/)

* matlab plot
* [gallery](http://matplotlib.org/gallery.html)
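
A minimal plotting sketch in the MATLAB-like style follows. It assumes a headless run and therefore selects the `Agg` backend; in a notebook, `%matplotlib inline` would be used instead. The file name in the last comment is hypothetical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; skip this line in a notebook
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), "--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

# fig.savefig("trig.png")  # hypothetical output file
```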

## [pandas](http://pandas.pydata.org/)

* I/O: pd.read_csv, pd.read_excel, pd.read_sql, pd.read_json, pd.read_hdf
* DataFrame object: like an np.array with row and column labels; each column can have its own type; suited to tabular data
* Panel object: a 3D DataFrame, sometimes useful (e.g. stock data); deprecated and later removed in newer pandas releases
* group by, sort, join (pd.merge), reshape/pivot
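
The join, group-by, sort and pivot operations listed above can be sketched on two tiny tables. The `orders` and `users` frames and their column names are invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "amount": [10.0, 5.0, 7.5, 2.5],
})
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "city": ["Hangzhou", "Beijing", "Hangzhou"],
})

# join: attach each user's city to their orders
merged = pd.merge(orders, users, on="user_id", how="left")

# group by + aggregate, then sort by the aggregated value
per_city = (merged.groupby("city")["amount"]
                  .sum()
                  .sort_values(ascending=False))

# reshape: one row per city, one column per user
wide = merged.pivot_table(index="city", columns="user_id",
                          values="amount", aggfunc="sum")

print(per_city)
print(wide)
```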

caution:

* slow when used as a key-value store (element-by-element access); vectorize instead
* many tricks
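
To make the first caution concrete, the sketch below computes the same total twice: once with per-cell lookups and once with a single column expression. The frame and its column names are made up; timings are not shown because they depend on the machine, but the loop is typically far slower.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.arange(1.0, 10001.0),
                   "qty": np.ones(10000)})

# Slow: one .loc lookup per cell, treating the frame like a key-value store
total_loop = 0.0
for i in range(len(df)):
    total_loop += df.loc[i, "price"] * df.loc[i, "qty"]

# Fast: one vectorized expression over whole columns
total_vec = (df["price"] * df["qty"]).sum()

print(total_loop, total_vec)
```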

more:

* [10 Minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
* [Things in Pandas I Wish I'd Known Earlier](http://nbviewer.jupyter.org/github/rasbt/python_reference/blob/master/tutorials/things_in_pandas.ipynb)


### [scikit-learn](http://scikit-learn.org/stable/)

* active community and development; clear interface
* good documentation with reference
* full set of algorithms; pipeline; parameter tuning
* wrap famous tools [libsvm](https://www.csie.ntu.edu.tw/~cjlin/libsvm/), [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/)



#### sklearn API

```python
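from sklearn.linear_model import LogisticRegression
# L1 regularization is chosen with `penalty`; liblinear is one solver that supports it
model = LogisticRegression(penalty='l1', solver='liblinear')
model.fit(X_train, y_train)      # learn from the training data
model.predict(X_test)            # predicted class labels
model.predict_proba(X_test)      # predicted class probabilities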
```

<table>
<tr style="border:None; font-size:20px; padding:10px;"><th>``model.predict``</th><th>``model.transform``</th></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Classification</td><td>Preprocessing</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Regression</td><td>Dimensionality Reduction</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Clustering</td><td>Feature Extraction</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>&nbsp;</td><td>Feature selection</td></tr>

</table>
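
The two columns of the table can be combined in one object: a pipeline chains a transformer with a predictor. The sketch below is one possible setup, not a recommended recipe; the iris data set, the step names `scale` and `clf`, and the grid of `C` values are all arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

pipe = Pipeline([
    ("scale", StandardScaler()),                 # transform step (preprocessing)
    ("clf", LogisticRegression(max_iter=1000)),  # predict step (classification)
])

# Parameter tuning: 5-fold cross-validation over the regularization strength
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy on held-out data
```

Because the scaler sits inside the pipeline, it is refit on each cross-validation training fold, so no information from the validation folds leaks into preprocessing.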

* [flowchart](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
* [doc](http://scikit-learn.org/stable/documentation.html)
* [API](http://scikit-learn.org/stable/modules/classes.html)

## showcase

* [SciPy 2015 Scikit-learn Tutorial](https://github.com/luoq/scipy_2015_sklearn_tutorial)

others:

* [Sample pipeline for text feature extraction and evaluation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html)
* [python data visualization for iris data set](https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations)
* [preliminary exploration for titanic competition](https://www.kaggle.com/letfly/titanic/preliminary-exploration)

### shell

* cat, head, tail, grep, sed, awk, sort, uniq, comm; understand pipe
* [csvkit](https://github.com/wireservice/csvkit), [jq](https://stedolan.github.io/jq/)
* [parallel](https://www.gnu.org/software/parallel/): [tutorial](https://www.gnu.org/software/parallel/parallel_tutorial.html)
* more: [Data Science at the Command Line](http://datascienceatthecommandline.com/)

### demo

```bash
function work(){
pv --rate -i 5 \
 | csvcut -c 'images_array_1,images_array_2' | csvjson --stream \
  | parallel --gnu -k --pipe -N 20  --jobs 16 python -m feature.image_feature

}

# Generate image feature for training data set and testing data set
cat data/data_files/image_itemPairs_train.csv | work > data/data_files/image_feature_train.csv
cat data/data_files/image_itemPairs_test.csv | work > data/data_files/image_feature_test.csv
```

feature/image_feature.py (run above as the module feature.image_feature):
```python
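import sys

if __name__ == '__main__':
    # parallel --pipe feeds each job a chunk of records on stdin, one per line
    for line in sys.stdin:
        line = line.rstrip()
        # do something with line and store the output in `result`
        ...
        print(result)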
```
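
A runnable sketch of such a line filter is given below. It assumes each input line is one JSON object, as `csvjson --stream` produces, with the two column names selected in the pipeline above. The "feature" (counting comma-separated ids) and the helper names `count_ids`, `process_line` and `main` are placeholders, not the real image feature. The sample input is fed through `io.StringIO` so the example runs on its own; the real script would call `main(sys.stdin)`.

```python
import io
import json


def count_ids(value):
    """Count comma-separated ids in a field; None or empty counts as 0."""
    if value is None or str(value).strip() == "":
        return 0
    return len(str(value).split(","))


def process_line(line):
    """Turn one JSON record into one CSV row (placeholder feature only)."""
    record = json.loads(line)
    n1 = count_ids(record.get("images_array_1"))
    n2 = count_ids(record.get("images_array_2"))
    return "{},{}".format(n1, n2)


def main(stream):
    """Read records line by line and print one output row per record."""
    for line in stream:
        line = line.rstrip()
        if line:  # skip blank lines
            print(process_line(line))


# Stand-in for sys.stdin with two made-up records
sample = io.StringIO(
    '{"images_array_1": "11, 12", "images_array_2": "13"}\n'
    '{"images_array_1": null, "images_array_2": "14, 15, 16"}\n'
)
main(sample)
```

Keeping the per-line work in a plain function makes it easy to test without the shell pipeline.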

## more data munging tools

### [bokeh](http://bokeh.pydata.org/en/latest/)


* __interactive__ visualization library that targets modern __web browsers__
* [quickstart](http://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart)

### [d3.js](https://d3js.org/)

Data-Driven Documents

```
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.
```

* binds data to the DOM and manipulates it
* powerful
* be prepared to write code



* [A Visual Introduction to Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
* A free online book: [Interactive Data Visualization for the Web](http://chimera.labs.oreilly.com/books/1230000000345/)

## data acquisition

web crawling

* crawler: [pyspider](https://github.com/binux/pyspider), [scrapy](http://scrapy.org/) (Python 3 support arrived with Scrapy 1.1)

* html parsing: [pyquery](https://pythonhosted.org/pyquery/), [lxml](http://lxml.de/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)

* chrome: [SelectorGadget](http://selectorgadget.com/), Network panel

### data clean

* [openrefine](http://openrefine.org/)

## More Machine Learning tools

### [xgboost](https://github.com/dmlc/xgboost)

* performance proven in various Kaggle competitions
* handles nonlinear relations
* handles missing values; no need for standardization
* fast, scalable
* supports R, Python, Julia, Scala, Java
* sklearn-compatible interface

## Text/ NLP

* [nltk](http://www.nltk.org/): NLP; tokenizer, stemmer, wrapper for CoreNLP; slow?
* [snowballstemmer](https://github.com/shibukawa/snowball_py): python interface for Snowball stemmer
* [jieba](https://github.com/fxsjy/jieba):  Chinese text segmentation; dictionary matters
* [sklearn.feature_extraction.text](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text): generate document term matrix
* [gensim](https://radimrehurek.com/gensim/): lda, word2vec
* [stanford corenlp](http://stanfordnlp.github.io/CoreNLP/): java; deep nlp
* [jpype](https://github.com/originell/jpype): run java code
* [More](http://note.luoq.me/machinelearning/tools)
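
As an illustration of the document-term matrix mentioned for sklearn.feature_extraction.text, here is a minimal sketch using `CountVectorizer` with its default tokenizer. The three toy documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vectorizer = CountVectorizer()
# Rows are documents, columns are terms; the result is a scipy.sparse matrix
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # the learned terms
print(X.toarray())                     # dense copy, fine for a toy corpus
```

`vectorizer.vocabulary_` maps each term to its column index, so `X[0, vectorizer.vocabulary_["the"]]` is the count of "the" in the first document.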

### [vowpal wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki)

* out of core; online; scalable; $10^{12}$ sparse features; linear model
* hashing trick for raw text feature
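
The same two ideas, the hashing trick and online learning of a linear model, can be sketched without Vowpal Wabbit itself. The example below swaps in scikit-learn's `HashingVectorizer` and `SGDClassifier.partial_fit` as stand-ins; it is not VW's implementation, and the tiny text batches and labels are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hashing trick: tokens are hashed straight to column indices, so no
# vocabulary has to be built or kept in memory
vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier(random_state=0)  # linear model trained by SGD

# Made-up mini-batches standing in for a stream too large for memory
batches = [
    (["cheap pills buy now", "meeting at noon"], [1, 0]),
    (["buy cheap watches", "lunch meeting tomorrow"], [1, 0]),
]

for texts, labels in batches:
    X = vectorizer.transform(texts)  # stateless, so no fit step is needed
    clf.partial_fit(X, labels, classes=np.array([0, 1]))

pred = clf.predict(vectorizer.transform(["buy cheap pills"]))
print(pred)
```

Since the vectorizer is stateless, each batch can be transformed and discarded, which is what makes the out-of-core workflow possible.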

## How to learn more

* Google is your friend
* youtube

[My list](http://note.luoq.me/machinelearning/)