Updated the userguide

myui · Nov 17, 2016 · ee244c3
1 parent f6cef1f
commit ee244c3

Showing 36 changed files with 834 additions and 508 deletions.
diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
@@ -92,6 +92,8 @@
 * [Webspam Tutorial](binaryclass/webspam.md)
     * [Data pareparation](binaryclass/webspam_dataset.md)
     * [PA1, AROW, SCW](binaryclass/webspam_scw.md)
+
+* [Kaggle Titanic Tutorial](binaryclass/titanic_rf.md)
 
 ## Part VI - Multiclass classification
 

diff --git a/docs/gitbook/anomaly/lof.md b/docs/gitbook/anomaly/lof.md
@@ -19,6 +19,8 @@
         
 This article introduce how to find outliers using [Local Outlier Detection (LOF)](http://en.wikipedia.org/wiki/Local_outlier_factor) on Hivemall.
 
+<!-- toc -->
+
 # Data Preparation
 
 ```sql
@@ -36,9 +38,9 @@ ROW FORMAT DELIMITED
 STORED AS TEXTFILE LOCATION '/dataset/lof/hundred_balls';
 ```
 
-Download [hundred_balls.txt](https://github.com/myui/hivemall/blob/master/resources/examples/lof/hundred_balls.txt) that is originally provides in [this article](http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259).
+Download [hundred_balls.txt](https://gist.githubusercontent.com/myui/f8b44ab925bc198e6d11b18fdd21269d/raw/bed05f811e4c351ed959e0159405690f2f11e577/hundred_balls.txt) that is originally provides in [this article](http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259).
 
-You can find outliers in [this picture](http://next.rikunabi.com/tech/contents/ts_report/img/201303/002259/part1_img1.jpg). As you can see, Rowid `87` is apparently an outlier.
+In this example, Rowid `87` is apparently an outlier.
 
 ```sh
 awk '{FS=" "; OFS=" "; print NR,$0}' hundred_balls.txt | \
@@ -144,11 +146,15 @@ where
 ;
 ```
 
-_Note: `list_neighbours` table SHOULD be created because `list_neighbours` is used multiple times._
+> #### Caution
+>
+> `list_neighbours` table SHOULD be created because `list_neighbours` is used multiple times.
 
-_Note: [`each_top_k`](https://github.com/myui/hivemall/pull/196) is supported from Hivemall v0.3.2-3 or later._
+# Parallelize Top-k computation
 
-_Note: To parallelize a top-k computation, break LEFT-hand table into piece as describe in [this page](https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF#parallelization-of-similarity-computation-using-with-clause)._
+> #### Info
+>
+> To parallelize a top-k computation, break LEFT-hand table into piece as describe in [this page](../misc/topk.html).
 
 ```sql
 WITH k_distance as (

diff --git a/docs/gitbook/binaryclass/a9a_lr.md b/docs/gitbook/binaryclass/a9a_lr.md
@@ -1,98 +1,91 @@
-<!--
-  Licensed to the Apache Software Foundation (ASF) under one
-  or more contributor license agreements.  See the NOTICE file
-  distributed with this work for additional information
-  regarding copyright ownership.  The ASF licenses this file
-  to you under the Apache License, Version 2.0 (the
-  "License"); you may not use this file except in compliance
-  with the License.  You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing,
-  software distributed under the License is distributed on an
-  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-  KIND, either express or implied.  See the License for the
-  specific language governing permissions and limitations
-  under the License.
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
 -->
-        
-a9a
-===
-http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a
-
-_Training with iterations is OBSOLUTE in Hivemall._  
-_Using amplifier and shuffling inputs is RECOMMENDED in Hivemall._
-
----
-
-## UDF preparation
-
-```sql
-select count(1) from a9atrain;
--- set total_steps ideally be "count(1) / #map tasks"
-set hivevar:total_steps=32561;
-
-select count(1) from a9atest;
-set hivevar:num_test_instances=16281;
-```
-
-## training
-```sql
-create table a9a_model1 
-as
-select 
- cast(feature as int) as feature,
- avg(weight) as weight
-from 
- (select 
-     logress(addBias(features),label,"-total_steps ${total_steps}") as (feature,weight)
-  from 
-     a9atrain
- ) t 
-group by feature;
-```
-_"-total_steps" option is optional for logress() function._  
-_I recommend you NOT to use options (e.g., total_steps and eta0) if you are not familiar with those options. Hivemall then uses an autonomic ETA (learning rate) estimator._
-
-## prediction
-```sql
-create or replace view a9a_predict1 
-as
-WITH a9atest_exploded as (
-select 
-  rowid,
-  label,
-  extract_feature(feature) as feature,
-  extract_weight(feature) as value
-from 
-  a9atest LATERAL VIEW explode(addBias(features)) t AS feature
-)
-select
-  t.rowid, 
-  sigmoid(sum(m.weight * t.value)) as prob,
-  CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
-from 
-  a9atest_exploded t LEFT OUTER JOIN
-  a9a_model1 m ON (t.feature = m.feature)
-group by
-  t.rowid;
-```
-
-## evaluation
-```sql
-create or replace view a9a_submit1 as
-select 
-  t.label as actual, 
-  pd.label as predicted, 
-  pd.prob as probability
-from 
-  a9atest t JOIN a9a_predict1 pd 
-    on (t.rowid = pd.rowid);
-```
-
-```sql
-select count(1) / ${num_test_instances} from a9a_submit1 
-where actual == predicted;
-```
-> 0.8430071862907684
+
+<!-- toc -->
+
+# UDF preparation
+
+```sql
+select count(1) from a9atrain;
+-- set total_steps ideally be "count(1) / #map tasks"
+set hivevar:total_steps=32561;
+
+select count(1) from a9atest;
+set hivevar:num_test_instances=16281;
+```
+
+# training
+```sql
+create table a9a_model1 
+as
+select 
+ cast(feature as int) as feature,
+ avg(weight) as weight
+from 
+ (select 
+     logress(addBias(features),label,"-total_steps ${total_steps}") as (feature,weight)
+  from 
+     a9atrain
+ ) t 
+group by feature;
+```
+_"-total_steps" option is optional for logress() function._  
+_I recommend you NOT to use options (e.g., total_steps and eta0) if you are not familiar with those options. Hivemall then uses an autonomic ETA (learning rate) estimator._
+
+# prediction
+```sql
+create or replace view a9a_predict1 
+as
+WITH a9atest_exploded as (
+select 
+  rowid,
+  label,
+  extract_feature(feature) as feature,
+  extract_weight(feature) as value
+from 
+  a9atest LATERAL VIEW explode(addBias(features)) t AS feature
+)
+select
+  t.rowid, 
+  sigmoid(sum(m.weight * t.value)) as prob,
+  CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
+from 
+  a9atest_exploded t LEFT OUTER JOIN
+  a9a_model1 m ON (t.feature = m.feature)
+group by
+  t.rowid;
+```
+
+# evaluation
+```sql
+create or replace view a9a_submit1 as
+select 
+  t.label as actual, 
+  pd.label as predicted, 
+  pd.prob as probability
+from 
+  a9atest t JOIN a9a_predict1 pd 
+    on (t.rowid = pd.rowid);
+```
+
+```sql
+select count(1) / ${num_test_instances} from a9a_submit1 
+where actual == predicted;
+```
+> 0.8430071862907684
diff --git a/docs/gitbook/binaryclass/a9a_minibatch.md b/docs/gitbook/binaryclass/a9a_minibatch.md
@@ -17,13 +17,12 @@
   under the License.
 -->
         
-This page explains how to apply [Mini-Batch Gradient Descent](https://class.coursera.org/ml-003/lecture/106) for the training of logistic regression explained in [this example](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)). 
-
-See [this page](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)) first. This content depends on it.
+This page explains how to apply [Mini-Batch Gradient Descent](https://class.coursera.org/ml-003/lecture/106) for the training of logistic regression explained in [this example](./a9a_lr.html). 
+So, refer [this page](./a9a_lr.html) first. This content depends on it.
 
 # Training
 
-Replace `a9a_model1` of [this example](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)).
+Replace `a9a_model1` of [this example](./a9a_lr.html).
 
 ```sql
 set hivevar:total_steps=32561;

diff --git a/docs/gitbook/binaryclass/kdd2010a_dataset.md b/docs/gitbook/binaryclass/kdd2010a_dataset.md
@@ -19,9 +19,9 @@
         
 [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (algebra)](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (algebra))
 
-* # of classes: 2
-* # of data: 8,407,752 (training) / 510,302 (testing)
-* # of features: 20,216,830 in about 2.73 GB (training) / 20,216,830 (testing) 
+* the number of classes: 2
+* the number of data: 8,407,752 (training) / 510,302 (testing)
+* the number of features: 20,216,830 in about 2.73 GB (training) / 20,216,830 (testing) 
 
 ---
 # Define training/testing tables

diff --git a/docs/gitbook/binaryclass/kdd2010b_dataset.md b/docs/gitbook/binaryclass/kdd2010b_dataset.md
@@ -19,9 +19,9 @@
         
 [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (bridge to algebra)](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (bridge to algebra))
 
-* # of classes: 2
-* # of data: 19,264,097 / 748,401 (testing)
-* # of features: 29,890,095 / 29,890,095 (testing)
+* the number of classes: 2
+* the number of examples: 19,264,097 (training) / 748,401 (testing)
+* the number of features: 29,890,095 (training) / 29,890,095 (testing)
 
 ---
 # Define training/testing tables

diff --git a/docs/gitbook/binaryclass/news20_scw.md b/docs/gitbook/binaryclass/news20_scw.md
@@ -16,7 +16,7 @@
   specific language governing permissions and limitations
   under the License.
 -->
-        
+
 ## UDF preparation
 ```
 use news20;