From 2efe9d14f08b60c211db3d88d4893a35457bad3c Mon Sep 17 00:00:00 2001 From: myui Date: Thu, 17 Nov 2016 21:16:14 +0900 Subject: [PATCH] Updated the userguide --- docs/gitbook/SUMMARY.md | 2 + docs/gitbook/anomaly/lof.md | 16 +- docs/gitbook/binaryclass/a9a_lr.md | 187 +++++----- docs/gitbook/binaryclass/a9a_minibatch.md | 7 +- docs/gitbook/binaryclass/kdd2010a_dataset.md | 6 +- docs/gitbook/binaryclass/kdd2010b_dataset.md | 6 +- docs/gitbook/binaryclass/news20_scw.md | 2 +- docs/gitbook/binaryclass/titanic_rf.md | 318 ++++++++++++++++++ docs/gitbook/binaryclass/webspam_scw.md | 2 +- docs/gitbook/eval/lr_datagen.md | 6 +- docs/gitbook/eval/stat_eval.md | 10 +- docs/gitbook/ft_engineering/hashing.md | 4 +- docs/gitbook/getting_started/input-format.md | 14 +- .../getting_started/permanent-functions.md | 5 +- docs/gitbook/misc/generic_funcs.md | 203 ++++++----- docs/gitbook/misc/topk.md | 17 +- docs/gitbook/multiclass/iris_dataset.md | 2 +- docs/gitbook/multiclass/iris_randomforest.md | 4 +- docs/gitbook/multiclass/iris_scw.md | 2 +- docs/gitbook/multiclass/news20_scw.md | 2 +- docs/gitbook/recommend/item_based_cf.md | 4 +- docs/gitbook/recommend/movielens_fm.md | 7 +- docs/gitbook/recommend/movielens_mf.md | 20 +- docs/gitbook/recommend/news20_knn.md | 2 +- docs/gitbook/regression/e2006_arow.md | 2 +- .../gitbook/regression/kddcup12tr2_adagrad.md | 254 +++++++------- .../gitbook/regression/kddcup12tr2_dataset.md | 2 +- .../regression/kddcup12tr2_lr_amplify.md | 6 +- docs/gitbook/tips/addbias.md | 2 +- docs/gitbook/tips/emr.md | 2 + docs/gitbook/tips/hadoop_tuning.md | 2 + docs/gitbook/tips/mixserver.md | 169 +++++----- docs/gitbook/tips/rand_amplify.md | 12 +- docs/gitbook/tips/rowid.md | 27 +- docs/gitbook/tips/rt_prediction.md | 16 +- 35 files changed, 834 insertions(+), 508 deletions(-) create mode 100644 docs/gitbook/binaryclass/titanic_rf.md diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md index 7ef1b9bf..c333c989 100644 --- a/docs/gitbook/SUMMARY.md +++ b/docs/gitbook/SUMMARY.md @@ -92,6 +92,8 @@ * [Webspam Tutorial](binaryclass/webspam.md) * [Data pareparation](binaryclass/webspam_dataset.md) * [PA1, AROW, SCW](binaryclass/webspam_scw.md) + +* [Kaggle Titanic Tutorial](binaryclass/titanic_rf.md) ## Part VI - Multiclass classification diff --git a/docs/gitbook/anomaly/lof.md b/docs/gitbook/anomaly/lof.md index 48990f86..39a6e9f6 100644 --- a/docs/gitbook/anomaly/lof.md +++ b/docs/gitbook/anomaly/lof.md @@ -19,6 +19,8 @@ This article introduce how to find outliers using [Local Outlier Detection (LOF)](http://en.wikipedia.org/wiki/Local_outlier_factor) on Hivemall. + + # Data Preparation ```sql @@ -36,9 +38,9 @@ ROW FORMAT DELIMITED STORED AS TEXTFILE LOCATION '/dataset/lof/hundred_balls'; ``` -Download [hundred_balls.txt](https://github.com/myui/hivemall/blob/master/resources/examples/lof/hundred_balls.txt) that is originally provides in [this article](http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259). +Download [hundred_balls.txt](https://gist.githubusercontent.com/myui/f8b44ab925bc198e6d11b18fdd21269d/raw/bed05f811e4c351ed959e0159405690f2f11e577/hundred_balls.txt) that is originally provides in [this article](http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259). -You can find outliers in [this picture](http://next.rikunabi.com/tech/contents/ts_report/img/201303/002259/part1_img1.jpg). As you can see, Rowid `87` is apparently an outlier. +In this example, Rowid `87` is apparently an outlier. 
```sh
awk '{FS=" "; OFS=" "; print NR,$0}' hundred_balls.txt | \
@@ -144,11 +146,15 @@ where
;
```

-_Note: `list_neighbours` table SHOULD be created because `list_neighbours` is used multiple times._
+> #### Caution
+>
+> The `list_neighbours` table SHOULD be created because `list_neighbours` is used multiple times.

-_Note: [`each_top_k`](https://github.com/myui/hivemall/pull/196) is supported from Hivemall v0.3.2-3 or later._
+# Parallelize Top-k computation

-_Note: To parallelize a top-k computation, break LEFT-hand table into piece as describe in [this page](https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF#parallelization-of-similarity-computation-using-with-clause)._
+> #### Info
+>
+> To parallelize a top-k computation, break the LEFT-hand table into pieces as described in [this page](../misc/topk.html).

```sql
WITH k_distance as (
diff --git a/docs/gitbook/binaryclass/a9a_lr.md b/docs/gitbook/binaryclass/a9a_lr.md
index 17d91c06..9bac63ee 100644
--- a/docs/gitbook/binaryclass/a9a_lr.md
+++ b/docs/gitbook/binaryclass/a9a_lr.md
@@ -1,98 +1,91 @@
-
-
-a9a
-===
-http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a
-
-_Training with iterations is OBSOLUTE in Hivemall._
-_Using amplifier and shuffling inputs is RECOMMENDED in Hivemall._
-
----
-
-## UDF preparation
-
-```sql
-select count(1) from a9atrain;
--- set total_steps ideally be "count(1) / #map tasks"
-set hivevar:total_steps=32561;
-
-select count(1) from a9atest;
-set hivevar:num_test_instances=16281;
-```
-
-## training
-```sql
-create table a9a_model1
-as
-select
- cast(feature as int) as feature,
- avg(weight) as weight
-from
- (select
-     logress(addBias(features),label,"-total_steps ${total_steps}") as (feature,weight)
-  from
-     a9atrain
- ) t
-group by feature;
-```
-_"-total_steps" option is optional for logress() function._
-_I recommend you NOT to use options (e.g., total_steps and eta0) if you are not familiar with those options.
Hivemall then uses an autonomic ETA (learning rate) estimator._ - -## prediction -```sql -create or replace view a9a_predict1 -as -WITH a9atest_exploded as ( -select - rowid, - label, - extract_feature(feature) as feature, - extract_weight(feature) as value -from - a9atest LATERAL VIEW explode(addBias(features)) t AS feature -) -select - t.rowid, - sigmoid(sum(m.weight * t.value)) as prob, - CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label -from - a9atest_exploded t LEFT OUTER JOIN - a9a_model1 m ON (t.feature = m.feature) -group by - t.rowid; -``` - -## evaluation -```sql -create or replace view a9a_submit1 as -select - t.label as actual, - pd.label as predicted, - pd.prob as probability -from - a9atest t JOIN a9a_predict1 pd - on (t.rowid = pd.rowid); -``` - -```sql -select count(1) / ${num_test_instances} from a9a_submit1 -where actual == predicted; -``` -> 0.8430071862907684 \ No newline at end of file + + + +# UDF preparation + +```sql +select count(1) from a9atrain; +-- set total_steps ideally be "count(1) / #map tasks" +set hivevar:total_steps=32561; + +select count(1) from a9atest; +set hivevar:num_test_instances=16281; +``` + +# training +```sql +create table a9a_model1 +as +select + cast(feature as int) as feature, + avg(weight) as weight +from + (select + logress(addBias(features),label,"-total_steps ${total_steps}") as (feature,weight) + from + a9atrain + ) t +group by feature; +``` +_"-total_steps" option is optional for logress() function._ +_I recommend you NOT to use options (e.g., total_steps and eta0) if you are not familiar with those options. Hivemall then uses an autonomic ETA (learning rate) estimator._ + +# prediction +```sql +create or replace view a9a_predict1 +as +WITH a9atest_exploded as ( +select + rowid, + label, + extract_feature(feature) as feature, + extract_weight(feature) as value +from + a9atest LATERAL VIEW explode(addBias(features)) t AS feature +) +select + t.rowid, + sigmoid(sum(m.weight * t.value)) as prob, + CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label +from + a9atest_exploded t LEFT OUTER JOIN + a9a_model1 m ON (t.feature = m.feature) +group by + t.rowid; +``` + +# evaluation +```sql +create or replace view a9a_submit1 as +select + t.label as actual, + pd.label as predicted, + pd.prob as probability +from + a9atest t JOIN a9a_predict1 pd + on (t.rowid = pd.rowid); +``` + +```sql +select count(1) / ${num_test_instances} from a9a_submit1 +where actual == predicted; +``` +> 0.8430071862907684 diff --git a/docs/gitbook/binaryclass/a9a_minibatch.md b/docs/gitbook/binaryclass/a9a_minibatch.md index eaa7a06e..a79ed863 100644 --- a/docs/gitbook/binaryclass/a9a_minibatch.md +++ b/docs/gitbook/binaryclass/a9a_minibatch.md @@ -17,13 +17,12 @@ under the License. --> -This page explains how to apply [Mini-Batch Gradient Descent](https://class.coursera.org/ml-003/lecture/106) for the training of logistic regression explained in [this example](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)). - -See [this page](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)) first. This content depends on it. +This page explains how to apply [Mini-Batch Gradient Descent](https://class.coursera.org/ml-003/lecture/106) for the training of logistic regression explained in [this example](./a9a_lr.html). +So, refer [this page](./a9a_lr.html) first. This content depends on it. 
# Training

-Replace `a9a_model1` of [this example](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)).
+Replace `a9a_model1` of [this example](./a9a_lr.html).

```sql
set hivevar:total_steps=32561;
diff --git a/docs/gitbook/binaryclass/kdd2010a_dataset.md b/docs/gitbook/binaryclass/kdd2010a_dataset.md
index ca221c31..7634f66e 100644
--- a/docs/gitbook/binaryclass/kdd2010a_dataset.md
+++ b/docs/gitbook/binaryclass/kdd2010a_dataset.md
@@ -19,9 +19,9 @@

[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (algebra)](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (algebra))

-* # of classes: 2
-* # of data: 8,407,752 (training) / 510,302 (testing)
-* # of features: 20,216,830 in about 2.73 GB (training) / 20,216,830 (testing)
+* the number of classes: 2
+* the number of examples: 8,407,752 (training) / 510,302 (testing)
+* the number of features: 20,216,830 in about 2.73 GB (training) / 20,216,830 (testing)

---
# Define training/testing tables
diff --git a/docs/gitbook/binaryclass/kdd2010b_dataset.md b/docs/gitbook/binaryclass/kdd2010b_dataset.md
index 41f05132..291a7839 100644
--- a/docs/gitbook/binaryclass/kdd2010b_dataset.md
+++ b/docs/gitbook/binaryclass/kdd2010b_dataset.md
@@ -19,9 +19,9 @@

[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (bridge to algebra)](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (bridge to algebra))

-* # of classes: 2
-* # of data: 19,264,097 / 748,401 (testing)
-* # of features: 29,890,095 / 29,890,095 (testing)
+* the number of classes: 2
+* the number of examples: 19,264,097 (training) / 748,401 (testing)
+* the number of features: 29,890,095 (training) / 29,890,095 (testing)

---
# Define training/testing tables
diff --git a/docs/gitbook/binaryclass/news20_scw.md b/docs/gitbook/binaryclass/news20_scw.md
index fa1da7f5..c3f51f47 100644
--- a/docs/gitbook/binaryclass/news20_scw.md
+++ b/docs/gitbook/binaryclass/news20_scw.md
@@ -16,7 +16,7 @@
specific language governing permissions and limitations
under the License.
-->
-
+
## UDF preparation
```
use news20;
diff --git a/docs/gitbook/binaryclass/titanic_rf.md b/docs/gitbook/binaryclass/titanic_rf.md
new file mode 100644
index 00000000..1a9786e9
--- /dev/null
+++ b/docs/gitbook/binaryclass/titanic_rf.md
@@ -0,0 +1,318 @@
+
+
+This example shows the basic usage of RandomForest on Hivemall using the [Kaggle Titanic](https://www.kaggle.com/c/titanic) dataset.
+The example gives a baseline score without any feature engineering.
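+
+The `train.csv` and `test.csv` files used below are available from the [competition data page](https://www.kaggle.com/c/titanic/data). As a convenience, they can also be fetched with the official Kaggle CLI (an assumption of this sketch; the `kaggle` command must be installed and configured with an API token):
+
+```sh
+# download and unpack the Titanic dataset into the current directory
+kaggle competitions download -c titanic
+unzip titanic.zip
+```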
+
+
+
+# Data preparation
+
+```sql
+create database titanic;
+use titanic;
+
+drop table train;
+create external table train (
+  passengerid int, -- unique id
+  survived int, -- target label
+  pclass int,
+  name string,
+  sex string,
+  age int,
+  sibsp int, -- Number of Siblings/Spouses Aboard
+  parch int, -- Number of Parents/Children Aboard
+  ticket string,
+  fare double,
+  cabin string,
+  embarked string
+)
+ROW FORMAT DELIMITED
+  FIELDS TERMINATED BY '|'
+  LINES TERMINATED BY '\n'
+STORED AS TEXTFILE LOCATION '/dataset/titanic/train';
+
+hadoop fs -rm /dataset/titanic/train/train.csv
+# convert the CSV into '|'-separated format, dropping the header line and unquoting the name column
+awk '{ FPAT="([^,]*)|(\"[^\"]+\")";OFS="|"; } NR >1 {$1=$1;$4=substr($4,2,length($4)-2);print $0}' train.csv | hadoop fs -put - /dataset/titanic/train/train.csv
+
+drop table test_raw;
+create external table test_raw (
+  passengerid int,
+  pclass int,
+  name string,
+  sex string,
+  age int,
+  sibsp int, -- Number of Siblings/Spouses Aboard
+  parch int, -- Number of Parents/Children Aboard
+  ticket string,
+  fare double,
+  cabin string,
+  embarked string
+)
+ROW FORMAT DELIMITED
+  FIELDS TERMINATED BY '|'
+  LINES TERMINATED BY '\n'
+STORED AS TEXTFILE LOCATION '/dataset/titanic/test_raw';
+
+hadoop fs -rm /dataset/titanic/test_raw/test.csv
+# same conversion for the test set (the name column is the 3rd field here)
+awk '{ FPAT="([^,]*)|(\"[^\"]+\")";OFS="|"; } NR >1 {$1=$1;$3=substr($3,2,length($3)-2);print $0}' test.csv | hadoop fs -put - /dataset/titanic/test_raw/test.csv
+```
+
+## Data preparation for RandomForest
+
+```sql
+set hivevar:output_row=true;
+
+drop table train_rf;
+create table train_rf
+as
+WITH train_quantified as (
+  select
+    quantify(
+      ${output_row}, passengerid, survived, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked
+    ) as (passengerid, survived, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked)
+  from (
+    select * from train
+    order by passengerid asc
+  ) t
+)
+select
+  rand(31) as rnd,
+  passengerid,
+  array(pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked) as features,
+  survived
+from
+  train_quantified
+;
+
+drop table test_rf;
+create table test_rf
+as
+WITH test_quantified as (
+  select
+    quantify(
+      output_row, passengerid, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked
+    ) as (passengerid, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked)
+  from (
+    -- need training data to assign consistent ids to categorical variables
+    select * from (
+      select
+        1 as train_first, false as output_row, passengerid, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked
+      from
+        train
+      union all
+      select
+        2 as train_first, true as output_row, passengerid, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked
+      from
+        test_raw
+    ) t0
+    order by train_first asc, passengerid asc
+  ) t1
+)
+select
+  passengerid,
+  array(pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked) as features
+from
+  test_quantified
+;
+```
+
+---
+
+# Training
+
+`select guess_attribute_types(pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked) from train limit 1;`
+> Q,C,C,Q,Q,Q,C,Q,C,C
+
+`Q` and `C` represent quantitative and categorical variables, respectively.
+
+*Caution:* Note that the output of `guess_attribute_types` is not perfect. Revise it by yourself.
+For example, `pclass` is a categorical variable.
+
+```sql
+set hivevar:attrs=C,C,C,Q,Q,Q,C,Q,C,C;
+
+drop table model_rf;
+create table model_rf
+AS
+select
+  train_randomforest_classifier(features, survived, "-trees 500 -attrs ${attrs}")
+  -- as (model_id, model_type, pred_model, var_importance, oob_errors, oob_tests)
+from
+  train_rf
+;
+
+select
+  array_sum(var_importance) as var_importance,
+  sum(oob_errors) / sum(oob_tests) as oob_err_rate
+from
+  model_rf;
+
+> [137.00242639169272,1194.2140119834373,328.78017188176966,628.2568660509628,200.31275032394072,160.12876797647078,1083.5987543408116,664.1234312561456,422.89449844090393,130.72019667694784]	0.18742985409652077
+```
+
+# Prediction
+
+```sql
+SET hivevar:classification=true;
+set hive.auto.convert.join=true;
+SET hive.mapjoin.optimized.hashtable=false;
+SET mapred.reduce.tasks=16;
+
+drop table predicted_rf;
+create table predicted_rf
+as
+SELECT
+  passengerid,
+  predicted.label,
+  predicted.probability,
+  predicted.probabilities
+FROM (
+  SELECT
+    passengerid,
+    rf_ensemble(predicted) as predicted
+  FROM (
+    SELECT
+      t.passengerid,
+      -- hivemall v0.4.1-alpha.2 or before
+      -- tree_predict(p.model, t.features, ${classification}) as predicted
+      -- hivemall v0.4.1-alpha.3 or later
+      tree_predict(p.model_id, p.model_type, p.pred_model, t.features, ${classification}) as predicted
+    FROM (
+      SELECT model_id, model_type, pred_model FROM model_rf
+      DISTRIBUTE BY rand(1)
+    ) p
+    LEFT OUTER JOIN test_rf t
+  ) t1
+  group by
+    passengerid
+) t2
+;
+```
+
+# Kaggle submission
+
+```sql
+drop table predicted_rf_submit;
+create table predicted_rf_submit
+  ROW FORMAT DELIMITED
+    FIELDS TERMINATED BY ","
+    LINES TERMINATED BY "\n"
+  STORED AS TEXTFILE
+as
+SELECT passengerid, label as survived
+FROM predicted_rf
+ORDER BY passengerid ASC;
+```
+
+```sh
+hadoop fs -getmerge /user/hive/warehouse/titanic.db/predicted_rf_submit predicted_rf_submit.csv
+
+sed -i -e "1i PassengerId,Survived" predicted_rf_submit.csv
+```
+
+The accuracy would be `0.76555` for a Kaggle submission.
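+
+The resulting `predicted_rf_submit.csv` can be uploaded on the competition page. Alternatively, here is a sketch using the official Kaggle CLI (an assumption; the `kaggle` command must be installed and configured):
+
+```sh
+# submit the prediction file to the Titanic competition
+kaggle competitions submit -c titanic -f predicted_rf_submit.csv -m "Hivemall RandomForest baseline"
+```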
+ +--- + +# Test by dividing training dataset + +```sql +drop table train_rf_07; +create table train_rf_07 +as +select * from train_rf +where rnd < 0.7; + +drop table test_rf_03; +create table test_rf_03 +as +select * from train_rf +where rnd >= 0.7; + +drop table model_rf_07; +create table model_rf_07 +AS +select + train_randomforest_classifier(features, survived, "-trees 500 -attrs ${attrs}") +from + train_rf_07; + +select + array_sum(var_importance) as var_importance, + sum(oob_errors) / sum(oob_tests) as oob_err_rate +from + model_rf_07; +> [116.12055542977338,960.8569891444097,291.08765260103837,469.74671636586226,163.721292772701,120.784769882858,847.9769298113661,554.4617571355476,346.3500941757221,97.42593940113392] 0.1838351822503962 + +SET hivevar:classification=true; +SET hive.mapjoin.optimized.hashtable=false; +SET mapred.reduce.tasks=16; + +drop table predicted_rf_03; +create table predicted_rf_03 +as +SELECT + passengerid, + predicted.label, + predicted.probability, + predicted.probabilities +FROM ( + SELECT + passengerid, + rf_ensemble(predicted) as predicted + FROM ( + SELECT + t.passengerid, + -- hivemall v0.4.1-alpha.2 or before + -- tree_predict(p.model, t.features, ${classification}) as predicted + -- hivemall v0.4.1-alpha.3 or later + tree_predict(p.model_id, p.model_type, p.pred_model, t.features, ${classification}) as predicted + FROM ( + SELECT model_id, model_type, pred_model FROM model_rf_07 + DISTRIBUTE BY rand(1) + ) p + LEFT OUTER JOIN test_rf_03 t + ) t1 + group by + passengerid +) t2 +; + +create or replace view rf_submit_03 as +select + t.survived as actual, + p.label as predicted, + p.probabilities +from + test_rf_03 t + JOIN predicted_rf_03 p on (t.passengerid = p.passengerid) +; + +select count(1) from test_rf_03; +> 260 + +set hivevar:testcnt=260; + +select count(1)/${testcnt} as accuracy +from rf_submit_03 +where actual = predicted; + +> 0.8 +``` diff --git a/docs/gitbook/binaryclass/webspam_scw.md b/docs/gitbook/binaryclass/webspam_scw.md index cadd0abd..067e8f2a 100644 --- a/docs/gitbook/binaryclass/webspam_scw.md +++ b/docs/gitbook/binaryclass/webspam_scw.md @@ -152,4 +152,4 @@ from select count(1)/70000 from webspam_scw_submit1 where actual = predicted; ``` -> Prediction accuracy: 0.9778714285714286 \ No newline at end of file +> Prediction accuracy: 0.9778714285714286 diff --git a/docs/gitbook/eval/lr_datagen.md b/docs/gitbook/eval/lr_datagen.md index 8fa5239a..c0cbce0e 100644 --- a/docs/gitbook/eval/lr_datagen.md +++ b/docs/gitbook/eval/lr_datagen.md @@ -17,7 +17,7 @@ under the License. --> -_Note this feature is supported on hivemall v0.2-alpha3 or later._ + # create a dual table @@ -33,10 +33,10 @@ INSERT INTO TABLE dual SELECT count(*)+1 FROM dual; ```sql create table regression_data1 as -select lr_datagen("-n_examples 10k -n_features 10 -seed 100") as (label,features) +select lr_datagen('-n_examples 10k -n_features 10 -seed 100') as (label,features) from dual; ``` -Find the details of the option in [LogisticRegressionDataGeneratorUDTF.java](https://github.com/myui/hivemall/blob/master/core/src/main/java/hivemall/dataset/LogisticRegressionDataGeneratorUDTF.java#L69). +Find the details of the option, run `lr_datagen('-help')`. You can generate a sparse dataset as well as a dense dataset. By the default, a sparse dataset is generated. 
```sql
diff --git a/docs/gitbook/eval/stat_eval.md b/docs/gitbook/eval/stat_eval.md
index 6b0af8e8..149adf85 100644
--- a/docs/gitbook/eval/stat_eval.md
+++ b/docs/gitbook/eval/stat_eval.md
@@ -17,7 +17,9 @@
under the License.
-->

-Using the [E2006 tfidf regression example](https://github.com/myui/hivemall/wiki/E2006-tfidf-regression-evaluation-(PA,-AROW)), we explain how to evaluate the prediction model on Hive.
+Using the [E2006 tfidf regression example](../regression/e2006_arow.html), we explain how to evaluate the prediction model on Hive.
+
+

# Scoring by evaluation metrics

@@ -69,7 +71,7 @@ from t;
```
> 1.9610366706408238	1.9610366706408238

---
-**References**
+# References
+
* R2 http://en.wikipedia.org/wiki/Coefficient_of_determination
-* Evaluation Metrics https://www.kaggle.com/wiki/Metrics
\ No newline at end of file
+* Evaluation Metrics https://www.kaggle.com/wiki/Metrics
diff --git a/docs/gitbook/ft_engineering/hashing.md b/docs/gitbook/ft_engineering/hashing.md
index daf4a232..f467002d 100644
--- a/docs/gitbook/ft_engineering/hashing.md
+++ b/docs/gitbook/ft_engineering/hashing.md
@@ -17,10 +17,10 @@
under the License.
-->

-Hivemall supports [Feature Hashing](https://github.com/myui/hivemall/wiki/Feature-hashing) (a.k.a. hashing trick) through `feature_hashing` and `mhash` functions.
+Hivemall supports [Feature Hashing](https://en.wikipedia.org/wiki/Feature_hashing) (a.k.a. hashing trick) through `feature_hashing` and `mhash` functions.
Find the differences in the following examples.

-_Note: `feature_hashing` UDF is supported since Hivemall `v0.4.2-rc.1`._
+

## `feature_hashing` function

diff --git a/docs/gitbook/getting_started/input-format.md b/docs/gitbook/getting_started/input-format.md
index 698c0953..59e6a5f3 100644
--- a/docs/gitbook/getting_started/input-format.md
+++ b/docs/gitbook/getting_started/input-format.md
@@ -24,14 +24,14 @@ Here, we use [EBNF](http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Fo

# Input Format for Classification

-The classifiers of Hivemall takes 2 (or 3) arguments: *features*, *label*, and *options* (a.k.a. [hyperparameters](http://en.wikipedia.org/wiki/Hyperparameter)). The first two arguments of training functions (e.g., [logress](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)) and [train_scw](https://github.com/myui/hivemall/wiki/news20-binary-classification-%232-(CW,-AROW,-SCW))) represents training examples.
+The classifiers of Hivemall take 2 (or 3) arguments: *features*, *label*, and *options* (a.k.a. [hyperparameters](http://en.wikipedia.org/wiki/Hyperparameter)). The first two arguments of training functions represent training examples.

In Statistics, *features* and *label* are called [Explanatory variable and Response Variable](http://www.oswego.edu/~srp/stats/variable_types.htm), respectively.

# Features format (for classification and regression)

The format of *features* is common between (binary and multi-class) classification and regression.

-Hivemall accepts ARRAY<INT|BIGINT|TEXT> for the type of *features* column.
+Hivemall accepts `ARRAY<INT|BIGINT|TEXT>` for the type of *features* column.

Hivemall uses a *sparse* data format (cf. [Compressed Row Storage](http://netlib.org/linalg/html_templates/node91.html)) which is similar to [LIBSVM](http://stackoverflow.com/questions/12112558/read-write-data-in-libsvm-format) and [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format).

@@ -52,7 +52,7 @@ Here is an instance of a features.
10:3.4  123:0.5  34567:0.231
```

-*Note:* As mentioned later, *index* "0" is reserved for a [Bias/Dummy variable](https://github.com/myui/hivemall/wiki/Using-explicit-addBias()-for-a-better-prediction).
+*Note:* As mentioned later, *index* "0" is reserved for a [Bias/Dummy variable](../tips/addbias.html).

In addition to numbers, you can use a TEXT value for an index. For example, you can use array("height:1.5", "length:2.0") for the features.
```
@@ -80,15 +80,15 @@ Note 1.0 is used for the weight when omitting *weight*.

Note that "0" is reserved for a Bias variable (called dummy variable in Statistics).

-The [addBias](https://github.com/myui/hivemall/wiki/Using-explicit-addBias()-for-a-better-prediction) function is Hivemall appends "0:1.0" as an element of array in *features*.
+The [addBias](../tips/addbias.html) function of Hivemall appends "0:1.0" as an element of the array in *features*.

## Feature hashing

-Hivemall supports [feature hashing/hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) through [mhash function](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset#converting-feature-representation-by-feature-hashing).
+Hivemall supports [feature hashing/hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) through the [mhash function](../ft_engineering/hashing.html#mhash-function).

The mhash function takes a feature (i.e., *index*) of TEXT format and generates a hash number of a range from 1 to 2^24 (=16777216) by the default setting.

-Feature hashing is useful where the dimension of feature vector (i.e., the number of elements in *features*) is so large. Consider applying [mhash function]((https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset#converting-feature-representation-by-feature-hashing)) when a prediction model does not fit in memory and OutOfMemory exception happens.
+Feature hashing is useful where the dimension of the feature vector (i.e., the number of elements in *features*) is very large. Consider applying the [mhash function](../ft_engineering/hashing.html#mhash-function) when a prediction model does not fit in memory and an OutOfMemory exception happens.

In general, you don't need to use mhash when the dimension of feature vector is less than 16777216. If feature *index* is very long TEXT (e.g., "xxxxxxx-yyyyyy-weight:55.3") and uses huge memory spaces, consider using mhash as follows:
```
feature(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), extract_weight("xx
@@ -103,7 +103,7 @@

## Feature Normalization

-Feature (weight) normalization is important in machine learning. Please refer [https://github.com/myui/hivemall/wiki/Feature-scaling](https://github.com/myui/hivemall/wiki/Feature-scaling) for detail.
+Feature (weight) normalization is important in machine learning. Please refer to [this article](../ft_engineering/scaling.html) for details.

***

diff --git a/docs/gitbook/getting_started/permanent-functions.md b/docs/gitbook/getting_started/permanent-functions.md
index 75156fee..7afc780b 100644
--- a/docs/gitbook/getting_started/permanent-functions.md
+++ b/docs/gitbook/getting_started/permanent-functions.md
@@ -21,8 +21,6 @@ Hive v0.13 or later supports [permanent functions](https://cwiki.apache.org/conf

Permanent functions are useful when you are using Hive through Hiveserver or to avoid hivemall installation for each session.
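+
+For example, once the jar has been uploaded to HDFS (see the next section), a permanent function can be registered once and then called from any session. The following is a minimal sketch; the schema name and jar path are illustrative, and Hivemall's version UDF is used only as a simple function to register:
+
+```sql
+CREATE DATABASE IF NOT EXISTS hivemall;
+
+-- register a permanent UDF backed by the jar on HDFS
+CREATE FUNCTION hivemall.hivemall_version
+AS 'hivemall.HivemallVersionUDF'
+USING JAR 'hdfs:///apps/hivemall/hivemall-with-dependencies.jar';
+
+SELECT hivemall.hivemall_version();
+```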
-_Note: This feature is supported since hivemall-0.3 beta 3 or later._ - # Put hivemall jar to HDFS @@ -58,4 +56,5 @@ show functions "hivemall.*"; ``` > #### Caution -You need to specify "hivemall." prefix to call hivemall UDFs in your queries if UDFs are loaded into non-default scheme, in this case _hivemall_. +> +> You need to specify "hivemall." prefix to call hivemall UDFs in your queries if UDFs are loaded into non-default scheme, in this case _hivemall_. diff --git a/docs/gitbook/misc/generic_funcs.md b/docs/gitbook/misc/generic_funcs.md index 9749dae1..b3a0421a 100644 --- a/docs/gitbook/misc/generic_funcs.md +++ b/docs/gitbook/misc/generic_funcs.md @@ -19,61 +19,63 @@ This page describes a list of useful Hivemall generic functions. + + # Array functions ## Array UDFs - `array_concat(array x1, array x2, ..)` - Returns a concatenated array -```sql -select array_concat(array(1),array(2,3)); -> [1,2,3] -``` + ```sql + select array_concat(array(1),array(2,3)); + > [1,2,3] + ``` - `array_intersect(array x1, array x2, ..)` - Returns an intersect of given arrays -```sql -select array_intersect(array(1,3,4),array(2,3,4),array(3,5)); -> [3] -``` + ```sql + select array_intersect(array(1,3,4),array(2,3,4),array(3,5)); + > [3] + ``` - `array_remove(array original, int|text|array target)` - Returns an array that the target is removed from the original array -```sql -select array_remove(array(1,null,3),array(null)); -> [3] - -select array_remove(array("aaa","bbb"),"bbb"); -> ["aaa"] -``` + ```sql + select array_remove(array(1,null,3),array(null)); + > [3] + + select array_remove(array("aaa","bbb"),"bbb"); + > ["aaa"] + ``` -- `sort_and_uniq_array(array)` - Takes an array of type int and returns a sorted array in a natural order with duplicate elements eliminated +- `sort_and_uniq_array(array)` - Takes an array of type INT and returns a sorted array in a natural order with duplicate elements eliminated -```sql -select sort_and_uniq_array(array(3,1,1,-2,10)); -> [-2,1,3,10] -``` + ```sql + select sort_and_uniq_array(array(3,1,1,-2,10)); + > [-2,1,3,10] + ``` - `subarray_endwith(array original, int|text key)` - Returns an array that ends with the specified key - -```sql -select subarray_endwith(array(1,2,3,4), 3); -> [1,2,3] -``` + + ```sql + select subarray_endwith(array(1,2,3,4), 3); + > [1,2,3] + ``` - `subarray_startwith(array original, int|text key)` - Returns an array that starts with the specified key -```sql -select subarray_startwith(array(1,2,3,4), 2); -> [2,3,4] -``` + ```sql + select subarray_startwith(array(1,2,3,4), 2); + > [2,3,4] + ``` -- `subarray(array orignal, int fromIndex, int toIndex)` - Returns a slice of the original array between the inclusive fromIndex and the exclusive toIndex +- `subarray(array orignal, int fromIndex, int toIndex)` - Returns a slice of the original array between the inclusive `fromIndex` and the exclusive `toIndex` -```sql -select subarray(array(1,2,3,4,5,6), 2,4); -> [3,4] -``` + ```sql + select subarray(array(1,2,3,4,5,6), 2,4); + > [3,4] + ``` ## Array UDAFs @@ -87,47 +89,45 @@ select subarray(array(1,2,3,4,5,6), 2,4); - `to_bits(int[] indexes)` - Returns an bitset representation if the given indexes in long[] -```sql -select to_bits(array(1,2,3,128)); ->[14,-9223372036854775808] -``` + ```sql + select to_bits(array(1,2,3,128)); + >[14,-9223372036854775808] + ``` - `unbits(long[] bitset)` - Returns an long array of the give bitset representation -```sql -select unbits(to_bits(array(1,4,2,3))); -> [1,2,3,4] -``` + ```sql + select 
unbits(to_bits(array(1,4,2,3))); + > [1,2,3,4] + ``` - `bits_or(array b1, array b2, ..)` - Returns a logical OR given bitsets -```sql -select unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3)))); -> [1,2,3,4] -``` + ```sql + select unbits(bits_or(to_bits(array(1,4)),to_bits(array(2,3)))); + > [1,2,3,4] + ``` ## Bitset UDAF - `bits_collect(int|long x)` - Returns a bitset in array - # Compression functions -- `deflate(TEXT data [, const int compressionLevel])` - Returns a compressed BINARY obeject by using Deflater. +- `deflate(TEXT data [, const int compressionLevel])` - Returns a compressed BINARY object by using Deflater. The compression level must be in range [-1,9] -```sql -select base91(deflate('aaaaaaaaaaaaaaaabbbbccc')); -> AA+=kaIM|WTt!+wbGAA -``` + ```sql + select base91(deflate('aaaaaaaaaaaaaaaabbbbccc')); + > AA+=kaIM|WTt!+wbGAA + ``` - `inflate(BINARY compressedData)` - Returns a decompressed STRING by using Inflater - -```sql -select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc')))); -> aaaaaaaaaaaaaaaabbbbccc -``` + ```sql + select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc')))); + > aaaaaaaaaaaaaaaabbbbccc + ``` # Map functions @@ -152,33 +152,33 @@ select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc')))); # Math functions -- `sigmoid(x)` - Returns 1.0 / (1.0 + exp(-x)) +- `sigmoid(x)` - Returns `1.0 / (1.0 + exp(-x))` # Text processing functions - `base91(binary)` - Convert the argument from binary to a BASE91 string -```sql -select base91(deflate('aaaaaaaaaaaaaaaabbbbccc')); -> AA+=kaIM|WTt!+wbGAA -``` + ```sql + select base91(deflate('aaaaaaaaaaaaaaaabbbbccc')); + > AA+=kaIM|WTt!+wbGAA + ``` - `unbase91(string)` - Convert a BASE91 string to a binary -```sql -select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc')))); -> aaaaaaaaaaaaaaaabbbbccc -``` + ```sql + select inflate(unbase91(base91(deflate('aaaaaaaaaaaaaaaabbbbccc')))); + > aaaaaaaaaaaaaaaabbbbccc + ``` - `normalize_unicode(string str [, string form])` - Transforms `str` with the specified normalization form. The `form` takes one of NFC (default), NFD, NFKC, or NFKD -```sql -select normalize_unicode('ハンカクカナ','NFKC'); -> ハンカクカナ - -select normalize_unicode('㈱㌧㌦Ⅲ','NFKC'); -> (株)トンドルIII -``` + ```sql + select normalize_unicode('ハンカクカナ','NFKC'); + > ハンカクカナ + + select normalize_unicode('㈱㌧㌦Ⅲ','NFKC'); + > (株)トンドルIII + ``` - `split_words(string query [, string regex])` - Returns an array containing splitted strings @@ -186,44 +186,37 @@ select normalize_unicode('㈱㌧㌦Ⅲ','NFKC'); - `tokenize(string englishText [, boolean toLowerCase])` - Returns words in array -- `tokenize_ja(String line [, const string mode = "normal", const list stopWords, const list stopTags])` - returns tokenized strings in array - -```sql -select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。"); +- `tokenize_ja(String line [, const string mode = "normal", const list stopWords, const list stopTags])` - returns tokenized strings in array. Refer [this article](../misc/tokenizer.html) for detail. 
-> ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal"," モード"] -``` - -https://github.com/myui/hivemall/wiki/Tokenizer + ```sql + select tokenize_ja("kuromojiを使った分かち書きのテストです。第二引数にはnormal/search/extendedを指定できます。デフォルトではnormalモードです。"); + + > ["kuromoji","使う","分かち書き","テスト","第","二","引数","normal","search","extended","指定","デフォルト","normal"," モード"] + ``` # Other functions - `convert_label(const int|const float)` - Convert from -1|1 to 0.0f|1.0f, or from 0.0f|1.0f to -1|1 -- `each_top_k(int K, Object group, double cmpKey, *)` - Returns top-K values (or tail-K values when k is less than 0) - -https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF +- `each_top_k(int K, Object group, double cmpKey, *)` - Returns top-K values (or tail-K values when k is less than 0). Refer [this article](../misc/topk.html) for detail. - `generate_series(const int|bigint start, const int|bigint end)` - Generate a series of values, from start to end -```sql -WITH dual as ( - select 1 -) -select generate_series(1,9) -from dual; - -1 -2 -3 -4 -5 -6 -7 -8 -9 -``` - -A similar function to PostgreSQL's `generate_serics`. -http://www.postgresql.org/docs/current/static/functions-srf.html -- `x_rank(KEY)` - Generates a pseudo sequence number starting from 1 for each key \ No newline at end of file + ```sql + select generate_series(1,9); + + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + ``` + + A similar function to PostgreSQL's `generate_serics`. + http://www.postgresql.org/docs/current/static/functions-srf.html + +- `x_rank(KEY)` - Generates a pseudo sequence number starting from 1 for each key diff --git a/docs/gitbook/misc/topk.md b/docs/gitbook/misc/topk.md index d6e7b93c..6a805146 100644 --- a/docs/gitbook/misc/topk.md +++ b/docs/gitbook/misc/topk.md @@ -23,7 +23,10 @@ This function is particularly useful for applying a similarity/distance function `each_top_k` is very fast when compared to other methods running top-k queries (e.g., [`rank/distribute by`](https://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/)) in Hive. -## Caution + + +# Caution + * `each_top_k` is supported from Hivemall v0.3.2-3 or later. * This UDTF assumes that input records are sorted by `group`. Use `DISTRIBUTE BY group SORT BY group` to ensure that. Or, you can use `LEFT OUTER JOIN` for certain cases. * It takes variable lengths arguments in `argN`. @@ -32,7 +35,9 @@ This function is particularly useful for applying a similarity/distance function * If k is less than 0, reverse order is used and `tail-K` records are returned for each `group`. * Note that this function returns [a pseudo ranking](http://www.michaelpollmeier.com/selecting-top-k-items-from-a-list-efficiently-in-java-groovy/) for top-k. It always returns `at-most K` records for each group. The ranking scheme is similar to `dense_rank` but slightly different in certain cases. -# Efficient Top-k Query Processing using `each_top_k` +# Usage + +## Efficient Top-k Query Processing using `each_top_k` Efficient processing of Top-k queries is a crucial requirement in many interactive environments that involve massive amounts of data. Our Hive extension `each_top_k` helps running Top-k processing efficiently. @@ -87,7 +92,8 @@ FROM ( ``` > #### Note -`CLUSTER BY x` is a synonym of `DISTRIBUTE BY x CLASS SORT BY x` and required when using `each_top_k`. +> +> `CLUSTER BY x` is a synonym of `DISTRIBUTE BY x CLASS SORT BY x` and required when using `each_top_k`. 
The function signature of `each_top_k` is `each_top_k(int k, ANY group, double value, arg1, arg2, ..., argN)` and it returns a relation `(int rank, double value, arg1, arg2, .., argN)`.

If `k` is less than 0, reverse order is used and tail-K records are returned for each `group`.

The ranking semantics of `each_top_k` follows SQL's `dense_rank` and then limits results by `k`.

> #### Caution
-`each_top_k` is benefical where the number of grouping keys are large. If the number of grouping keys are not so large (e.g., less than 100), consider using `rank() over` instead.
-
-# Usage
+>
+> `each_top_k` is beneficial where the number of grouping keys is large. If the number of grouping keys is not so large (e.g., less than 100), consider using `rank() over` instead.

## top-k clicks

diff --git a/docs/gitbook/multiclass/iris_dataset.md b/docs/gitbook/multiclass/iris_dataset.md
index 38a68310..e67737e5 100644
--- a/docs/gitbook/multiclass/iris_dataset.md
+++ b/docs/gitbook/multiclass/iris_dataset.md
@@ -113,7 +113,7 @@ select * from iris_scaled limit 3;
> 3       Iris-setosa     ["1:0.11111101","2:0.5","3:0.05084745","4:0.041666664","0:1.0"]
```

-_[LibSVM web page](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#iris) provides a normalized (using [ZScore](https://github.com/myui/hivemall/wiki/Feature-scaling)) version of Iris dataset._
+_[LibSVM web page](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#iris) provides a normalized (using [ZScore](../ft_engineering/scaling.html#feature-scaling-by-zscore)) version of Iris dataset._

# Create training/test data
diff --git a/docs/gitbook/multiclass/iris_randomforest.md b/docs/gitbook/multiclass/iris_randomforest.md
index fd85471c..4b0750c3 100644
--- a/docs/gitbook/multiclass/iris_randomforest.md
+++ b/docs/gitbook/multiclass/iris_randomforest.md
@@ -16,8 +16,6 @@ specific language governing permissions and limitations
under the License.
-->
-
-*NOTE: RandomForest is being supported from Hivemall v0.4 or later.*

# Dataset

@@ -323,4 +321,4 @@ WHERE
actual = predicted
;
```
-> 0.9533333333333334
\ No newline at end of file
+> 0.9533333333333334
diff --git a/docs/gitbook/multiclass/iris_scw.md b/docs/gitbook/multiclass/iris_scw.md
index fd85471c..79cdaf47 100644
--- a/docs/gitbook/multiclass/iris_scw.md
+++ b/docs/gitbook/multiclass/iris_scw.md
@@ -323,4 +323,4 @@ WHERE
actual = predicted
;
```
-> 0.9533333333333334
\ No newline at end of file
+> 0.9533333333333334
diff --git a/docs/gitbook/multiclass/news20_scw.md b/docs/gitbook/multiclass/news20_scw.md
index f6f21afb..24e0fad0 100644
--- a/docs/gitbook/multiclass/news20_scw.md
+++ b/docs/gitbook/multiclass/news20_scw.md
@@ -335,4 +335,4 @@ where actual == predicted;
drop table news20mc_scw2_model1;
drop table news20mc_scw2_predict1;
drop view news20mc_scw2_submit1;
-```
\ No newline at end of file
+```
diff --git a/docs/gitbook/recommend/item_based_cf.md b/docs/gitbook/recommend/item_based_cf.md
index 2eb7890b..a674f708 100644
--- a/docs/gitbook/recommend/item_based_cf.md
+++ b/docs/gitbook/recommend/item_based_cf.md
@@ -90,7 +90,7 @@ group by
**Caution:** _Item-Item cooccurrence matrix is a symmetric matrix that has the number of total occurrence for each diagonal element .
If the size of items are `k`, then the size of expected matrix is `k * (k - 1) / 2`, usually a very large one._ -_Better to use [2.2.2.](https://github.com/myui/hivemall/wiki/Item-based-Collaborative-Filtering#limiting-size-of-elements-in-cooccurrence_upper_triangular) instead of [2.2.1.](https://github.com/myui/hivemall/wiki/Item-based-Collaborative-Filtering#221-create-cooccurrence-table-directly) for creating a `cooccurrence` table where dataset is large._ +_Better to use [2.2.2.](#222-create-cooccurrence-table-from-upper-triangular-matrix-of-cooccurrence) instead of [2.2.1.](#221-create-cooccurrence-table-directly) for creating a `cooccurrence` table where dataset is large._ ### 2.2.1. Create cooccurrence table directly @@ -257,7 +257,7 @@ GROUP BY Item-Item similarity computation is known to be computation complexity `O(n^2)` where `n` is the number of items. Depending on your cluster size and your dataset, the optimal solution differs. -**Note:** _Better to use [3.1.1.](https://github.com/myui/hivemall/wiki/Item-based-Collaborative-Filtering#311-similarity-computation-using-the-symmetric-property-of-item-similarity-matrix) scheme where dataset is large._ +**Note:** _Better to use [3.1.1.](#311-similarity-computation-using-the-symmetric-property-of-item-similarity-matrix) scheme where dataset is large._ ### 3.1. Shuffle heavy similarity computation diff --git a/docs/gitbook/recommend/movielens_fm.md b/docs/gitbook/recommend/movielens_fm.md index eac80132..ad593247 100644 --- a/docs/gitbook/recommend/movielens_fm.md +++ b/docs/gitbook/recommend/movielens_fm.md @@ -21,8 +21,7 @@ _Caution: Factorization Machine is supported from Hivemall v0.4 or later._ # Data preparation -First of all, please create `ratings` table described in the following page: -https://github.com/myui/hivemall/wiki/MovieLens-Dataset +First of all, please create `ratings` table described in [this article](../recommend/movielens_dataset.html). ```sql use movielens; @@ -190,7 +189,7 @@ usage: train_fm(array x, double y [, const string options]) - ```sql -- workaround for a bug --- https://github.com/myui/hivemall/wiki/Map-side-Join-causes-ClassCastException-on-Tez:-LazyBinaryArray-cannot-be-cast-to-%5BLjava.lang.Object; +-- https://issues.apache.org/jira/browse/HIVE-11051 set hive.mapjoin.optimized.hashtable=false; drop table fm_predict; @@ -222,7 +221,7 @@ from # Fast Factorization Machines Training using Int Features Training of Factorization Machines (FM) can be done more efficietly, in term of speed, by using INT features. -In this section, we show how to run FM training by using int features, more specifically by using [feature hashing](https://github.com/myui/hivemall/wiki/Feature-hashing). +In this section, we show how to run FM training by using int features, more specifically by using [feature hashing](../ft_engineering/hashing.html). ```sql set hivevar:factor=10; diff --git a/docs/gitbook/recommend/movielens_mf.md b/docs/gitbook/recommend/movielens_mf.md index f275df82..ca38fecb 100644 --- a/docs/gitbook/recommend/movielens_mf.md +++ b/docs/gitbook/recommend/movielens_mf.md @@ -17,9 +17,9 @@ under the License. --> -This page explains how to run matrix factorization on [MovieLens 1M dataset](https://github.com/myui/hivemall/wiki/MovieLens-Dataset). +This page explains how to run matrix factorization on [MovieLens 1M dataset](../recommend/movielens_dataset.html). -*Caution:* Matrix factorization is supported in Hivemall v0.3 or later. 
+
## Calculate the mean rating in the training dataset
```sql
select avg(rating) from training;
> 3.593565
```

# Set variables (hyperparameters) for training
```sql
-- mean rating
set hivevar:mu=3.593565;
-- number of factors
set hivevar:factor=10;
-- maximum number of training iterations
set hivevar:iters=50;
```
-See [this article](https://github.com/myui/hivemall/wiki/List-of-parameters-of-Matrix-Factorization) or [OnlineMatrixFactorizationUDTF#getOption()](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/mf/OnlineMatrixFactorizationUDTF.java#L123) to get the details of options.

-Note that there are no need to set an exact value for $mu. It actually works without setting $mu but recommended to set one for getting a better prediction.
+Note that there is no need to set an exact value for `$mu`. It actually works without setting `$mu`, but setting it is recommended for getting a better prediction.

_Due to [a bug](https://issues.apache.org/jira/browse/HIVE-8396) in Hive, do not issue comments in CLI._

@@ -56,13 +55,17 @@ select
avg(m_bias) as Bi
from (
select
-    train_mf_sgd(userid, movieid, rating, "-factor ${factor} -mu ${mu} -iter ${iters}") as (idx, u_rank, m_rank, u_bias, m_bias)
+    train_mf_sgd(userid, movieid, rating, '-factor ${factor} -mu ${mu} -iter ${iters}') as (idx, u_rank, m_rank, u_bias, m_bias)
from
training
) t
group by idx;
```

-Note: Hivemall also provides *train_mf_adagrad* for training using AdaGrad.
+
+> #### Note
+>
+> Hivemall also provides *train_mf_adagrad* for training using AdaGrad.
+> The `-help` option shows a complete list of hyperparameters.

# Predict

@@ -109,9 +112,10 @@ from (
ON (t2.movieid = p2.idx)
) t;
```
-> 0.6728969407733578 (MAE)
-> 0.8584162122694449 (RMSE)
+
+| MAE | RMSE |
+|:---:|:----:|
+| 0.6728969407733578 | 0.8584162122694449 |

# Item Recommendation

diff --git a/docs/gitbook/recommend/news20_knn.md b/docs/gitbook/recommend/news20_knn.md
index 1e0ae974..fca9db5b 100644
--- a/docs/gitbook/recommend/news20_knn.md
+++ b/docs/gitbook/recommend/news20_knn.md
@@ -119,4 +119,4 @@ limit ${topn};
| 8482 | 0.15229382 |

-Refer [this page](https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF#top-k-similarity-computation) for efficient top-k kNN computation.
\ No newline at end of file
+Refer to [this page](../misc/topk.html#top-k-similarity-computation) for efficient top-k kNN computation.
diff --git a/docs/gitbook/regression/e2006_arow.md b/docs/gitbook/regression/e2006_arow.md index a02dfa86..abdb725f 100644 --- a/docs/gitbook/regression/e2006_arow.md +++ b/docs/gitbook/regression/e2006_arow.md @@ -275,4 +275,4 @@ select from e2006tfidf_arowe_submit; ``` -> 0.37789148212861856 0.14280197226536404 0.2357339155291536 0.5060283955470721 \ No newline at end of file +> 0.37789148212861856 0.14280197226536404 0.2357339155291536 0.5060283955470721 diff --git a/docs/gitbook/regression/kddcup12tr2_adagrad.md b/docs/gitbook/regression/kddcup12tr2_adagrad.md index f6c76752..1b82bd92 100644 --- a/docs/gitbook/regression/kddcup12tr2_adagrad.md +++ b/docs/gitbook/regression/kddcup12tr2_adagrad.md @@ -1,128 +1,128 @@ - - -_Note adagrad/adadelta is supported from hivemall v0.3b2 or later (or in the master branch)._ - -# Preparation -```sql -add jar ./tmp/hivemall-with-dependencies.jar; -source ./tmp/define-all.hive; - -use kdd12track2; - --- SET mapreduce.framework.name=yarn; --- SET hive.execution.engine=mr; --- SET mapreduce.framework.name=yarn-tez; --- SET hive.execution.engine=tez; -SET mapred.reduce.tasks=32; -- [optional] set the explicit number of reducers to make group-by aggregation faster -``` - -# AdaGrad -```sql -drop table adagrad_model; -create table adagrad_model -as -select - feature, - avg(weight) as weight -from - (select - adagrad(features,label) as (feature,weight) - from - training_orcfile - ) t -group by feature; - -drop table adagrad_predict; -create table adagrad_predict - ROW FORMAT DELIMITED - FIELDS TERMINATED BY "\t" - LINES TERMINATED BY "\n" - STORED AS TEXTFILE -as -select - t.rowid, - sigmoid(sum(m.weight)) as prob -from - testing_exploded t LEFT OUTER JOIN - adagrad_model m ON (t.feature = m.feature) -group by - t.rowid -order by - rowid ASC; -``` - -```sh -hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adagrad_predict adagrad_predict.tbl - -gawk -F "\t" '{print $2;}' adagrad_predict.tbl > adagrad_predict.submit - -pypy scoreKDD.py KDD_Track2_solution.csv adagrad_predict.submit -``` ->AUC(SGD) : 0.739351 - ->AUC(ADAGRAD) : 0.743279 - -# AdaDelta -```sql -drop table adadelta_model; -create table adadelta_model -as -select - feature, - cast(avg(weight) as float) as weight -from - (select - adadelta(features,label) as (feature,weight) - from - training_orcfile - ) t -group by feature; - -drop table adadelta_predict; -create table adadelta_predict - ROW FORMAT DELIMITED - FIELDS TERMINATED BY "\t" - LINES TERMINATED BY "\n" - STORED AS TEXTFILE -as -select - t.rowid, - sigmoid(sum(m.weight)) as prob -from - testing_exploded t LEFT OUTER JOIN - adadelta_model m ON (t.feature = m.feature) -group by - t.rowid -order by - rowid ASC; -``` - -```sh -hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adadelta_predict adadelta_predict.tbl - -gawk -F "\t" '{print $2;}' adadelta_predict.tbl > adadelta_predict.submit - -pypy scoreKDD.py KDD_Track2_solution.csv adadelta_predict.submit -``` ->AUC(SGD) : 0.739351 - ->AUC(ADAGRAD) : 0.743279 - -> AUC(AdaDelta) : 0.746878 \ No newline at end of file + +_Note adagrad/adadelta is supported from hivemall v0.3b2 or later (or in the master branch)._ + +# Preparation +```sql +add jar ./tmp/hivemall-with-dependencies.jar; +source ./tmp/define-all.hive; + +use kdd12track2; + +-- SET mapreduce.framework.name=yarn; +-- SET hive.execution.engine=mr; +-- SET mapreduce.framework.name=yarn-tez; +-- SET hive.execution.engine=tez; +SET mapred.reduce.tasks=32; -- [optional] set the explicit number of reducers to make group-by 
aggregation faster +``` + +# AdaGrad +```sql +drop table adagrad_model; +create table adagrad_model +as +select + feature, + avg(weight) as weight +from + (select + adagrad(features,label) as (feature,weight) + from + training_orcfile + ) t +group by feature; + +drop table adagrad_predict; +create table adagrad_predict + ROW FORMAT DELIMITED + FIELDS TERMINATED BY "\t" + LINES TERMINATED BY "\n" + STORED AS TEXTFILE +as +select + t.rowid, + sigmoid(sum(m.weight)) as prob +from + testing_exploded t LEFT OUTER JOIN + adagrad_model m ON (t.feature = m.feature) +group by + t.rowid +order by + rowid ASC; +``` + +```sh +hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adagrad_predict adagrad_predict.tbl + +gawk -F "\t" '{print $2;}' adagrad_predict.tbl > adagrad_predict.submit + +pypy scoreKDD.py KDD_Track2_solution.csv adagrad_predict.submit +``` +>AUC(SGD) : 0.739351 + +>AUC(ADAGRAD) : 0.743279 + +# AdaDelta +```sql +drop table adadelta_model; +create table adadelta_model +as +select + feature, + cast(avg(weight) as float) as weight +from + (select + adadelta(features,label) as (feature,weight) + from + training_orcfile + ) t +group by feature; + +drop table adadelta_predict; +create table adadelta_predict + ROW FORMAT DELIMITED + FIELDS TERMINATED BY "\t" + LINES TERMINATED BY "\n" + STORED AS TEXTFILE +as +select + t.rowid, + sigmoid(sum(m.weight)) as prob +from + testing_exploded t LEFT OUTER JOIN + adadelta_model m ON (t.feature = m.feature) +group by + t.rowid +order by + rowid ASC; +``` + +```sh +hadoop fs -getmerge /user/hive/warehouse/kdd12track2.db/adadelta_predict adadelta_predict.tbl + +gawk -F "\t" '{print $2;}' adadelta_predict.tbl > adadelta_predict.submit + +pypy scoreKDD.py KDD_Track2_solution.csv adadelta_predict.submit +``` +>AUC(SGD) : 0.739351 + +>AUC(ADAGRAD) : 0.743279 + +> AUC(AdaDelta) : 0.746878 diff --git a/docs/gitbook/regression/kddcup12tr2_dataset.md b/docs/gitbook/regression/kddcup12tr2_dataset.md index 15bfbfdd..c32958f6 100644 --- a/docs/gitbook/regression/kddcup12tr2_dataset.md +++ b/docs/gitbook/regression/kddcup12tr2_dataset.md @@ -35,7 +35,7 @@ http://www.kddcup2012.org/c/kddcup2012-track2 | training.txt | 9.9GB | 149,639,105 | | serid_profile.txt | 283MB | 23,669,283 | -![tables](https://raw.github.com/myui/hivemall/master/resources/examples/kddtrack2/tables.png) +![tables](../resources/images/kddtrack2tables.png) _Tokens are actually not used in this example. Try using them on your own._ diff --git a/docs/gitbook/regression/kddcup12tr2_lr_amplify.md b/docs/gitbook/regression/kddcup12tr2_lr_amplify.md index e402ce4d..5ede9533 100644 --- a/docs/gitbook/regression/kddcup12tr2_lr_amplify.md +++ b/docs/gitbook/regression/kddcup12tr2_lr_amplify.md @@ -21,7 +21,7 @@ This article explains *amplify* technique that is useful for improving predictio Iterations are mandatory in machine learning (e.g., in [stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)) to get good prediction models. However, MapReduce is known to be not suited for iterative algorithms because IN/OUT of each MapReduce job is through HDFS. -In this example, we show how Hivemall deals with this problem. We use [KDD Cup 2012, Track 2 Task](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset) as an example. +In this example, we show how Hivemall deals with this problem. We use [KDD Cup 2012, Track 2 Task](../regression/kddcup12tr2_dataset.html) as an example. **WARNING**: rand_amplify() is supported in v0.2-beta1 and later. 
@@ -73,7 +73,7 @@
The above query is executed by 2 MapReduce jobs as shown below:
amplifier
-Using *trainning_x3* instead of the plain training table results in higher and better AUC (0.746214) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+Using *trainning_x3* instead of the plain training table results in a better AUC (0.746214) in [this example](../regression/kddcup12tr2_lr.html#evaluation).

A problem in amplify() is that the shuffle (copy) and merge phase of the stage 1 could become a bottleneck.
When the training table is so large that involves 100 Map tasks, the merge operator needs to merge at least 100 files by (external) merge sort!

@@ -108,7 +108,7 @@
The map-local multiplication and shuffling has no bottleneck in the merge phase.
rand_amplify elapsed
-Using *rand_amplify* results in a better AUC (0.743392) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+Using *rand_amplify* results in a better AUC (0.743392) in [this example](../regression/kddcup12tr2_lr.html#evaluation).

---
# Conclusion
diff --git a/docs/gitbook/tips/addbias.md b/docs/gitbook/tips/addbias.md
index dfa4bfc0..021ca64d 100644
--- a/docs/gitbook/tips/addbias.md
+++ b/docs/gitbook/tips/addbias.md
@@ -28,7 +28,7 @@
Then, the predicted model considers bias existing in the dataset and the predict

**addBias()** of Hivemall, adds a bias to a feature vector.
To enable a bias clause, use addBias() for **both**_(important!)_ training and test data as follows.
-The bias _b_ is a feature of "0" ("-1" in before v0.3) by the default. See [AddBiasUDF](https://github.com/myui/hivemall/blob/master/src/main/hivemall/ftvec/AddBiasUDF.java) for the detail.
+The bias _b_ is a feature of "0" ("-1" before v0.3) by default. See [AddBiasUDF](https://github.com/myui/hivemall/blob/master/core/src/main/java/hivemall/ftvec/AddBiasUDF.java) for details.

Note that Bias is expressed as a feature that found in all training/testing examples.

diff --git a/docs/gitbook/tips/emr.md b/docs/gitbook/tips/emr.md
index 61cb25b6..049e6dae 100644
--- a/docs/gitbook/tips/emr.md
+++ b/docs/gitbook/tips/emr.md
@@ -16,6 +16,8 @@
specific language governing permissions and limitations
under the License.
-->
+
+

## Prerequisite

Learn how to use Hive with Elastic MapReduce (EMR).
diff --git a/docs/gitbook/tips/hadoop_tuning.md b/docs/gitbook/tips/hadoop_tuning.md
index 71255089..507e19d7 100644
--- a/docs/gitbook/tips/hadoop_tuning.md
+++ b/docs/gitbook/tips/hadoop_tuning.md
@@ -16,6 +16,8 @@
specific language governing permissions and limitations
under the License.
-->
+
+

# Prerequisites

diff --git a/docs/gitbook/tips/mixserver.md b/docs/gitbook/tips/mixserver.md
index bd58279a..f9878e65 100644
--- a/docs/gitbook/tips/mixserver.md
+++ b/docs/gitbook/tips/mixserver.md
@@ -1,87 +1,86 @@
-
-
-In this page, we will explain how to use model mixing on Hivemall. The model mixing is useful for a better prediction performance and faster convergence in training classifiers.
-
-
-
-Prerequisite
-============
-
-* Hivemall v0.3 or later
-
-We recommend to use Mixing in a cluster with fast networking. The current standard GbE is enough though.
-
-Running Mix Server
-===================
-
-First, put the following files on server(s) that are accessible from Hadoop worker nodes:
-* [target/hivemall-mixserv.jar](https://github.com/myui/hivemall/releases)
-* [bin/run_mixserv.sh](https://github.com/myui/hivemall/raw/master/bin/run_mixserv.sh)
-
-_Caution: hivemall-mixserv.jar is large in size and thus only used for Mix servers._
-
-```sh
-# run a Mix Server
-./run_mixserv.sh
-```
-
-We assume in this example that Mix servers are running on host01, host03 and host03.
-The default port used by Mix server is 11212 and the port is configurable through "-port" option of run_mixserv.sh.
-
-See [MixServer.java](https://github.com/myui/hivemall/blob/master/mixserv/src/main/java/hivemall/mix/server/MixServer.java#L90) to get detail of the Mix server options.
-
-We recommended to use multiple MIX servers to get better MIX throughput (3-5 or so would be enough for normal cluster size). The MIX protocol of Hivemall is *horizontally scalable* by adding MIX server nodes.
-
-Using Mix Protocol through Hivemall
-===================================
-
-[Install Hivemall](https://github.com/myui/hivemall/wiki/Installation) on Hive.
-
-_Make sure that [hivemall-with-dependencies.jar](https://github.com/myui/hivemall/raw/master/target/hivemall-with-dependencies.jar) is used for installation. The jar contains minimum requirement jars (netty,jsr305) for running Hivemall on Hive._
-
-Now, we explain that how to use mixing in [an example using KDD2010a dataset](https://github.com/myui/hivemall/wiki/KDD2010a-classification).
-
-Enabling the mixing on Hivemall is simple as follows:
-```sql
-use kdd2010;
-
-create table kdd10a_pa1_model1 as
-select
- feature,
- cast(voted_avg(weight) as float) as weight
-from
- (select
-     train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight)
-  from
-     kdd10a_train_x3
- ) t
-group by feature;
-```
-
-All you have to do is just adding "*-mix*" training option as seen in the above query.
-
-The effect of model mixing
-===========================
-
-In my experience, the MIX improved the prediction accuracy of the above KDD2010a PA1 training on a 32 nodes cluster from 0.844835019263103 (w/o mix) to 0.8678096499719774 (w/ mix).
-
+
+In this page, we will explain how to use model mixing on Hivemall. Model mixing is useful for better prediction performance and faster convergence in training classifiers.
+You can find a brief explanation of the internal design of the MIX protocol in [this slide](http://www.slideshare.net/myui/hivemall-mix-internal).
+
+
+
+Prerequisite
+============
+
+* Hivemall v0.3 or later
+
+  We recommend using Mixing in a cluster with fast networking. The current standard GbE is enough, though.
+
+Running Mix Server
+===================
+
+First, put the following files on server(s) that are accessible from Hadoop worker nodes:
+* [target/hivemall-mixserv.jar](https://github.com/myui/hivemall/releases)
+* [bin/run_mixserv.sh](https://github.com/myui/hivemall/raw/master/bin/run_mixserv.sh)
+
+_Caution: hivemall-mixserv.jar is large in size and thus only used for Mix servers._
+
+```sh
+# run a Mix Server
+./run_mixserv.sh
+```
+
+We assume in this example that Mix servers are running on host01, host02 and host03.
+The default port used by Mix server is 11212 and the port is configurable through the "-port" option of run_mixserv.sh.
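+
+For example, to run a Mix server on a non-default port (a sketch; the port number is arbitrary):
+
+```sh
+# listen on port 11213 instead of the default 11212
+./run_mixserv.sh -port 11213
+```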
+
+See [MixServer.java](https://github.com/myui/hivemall/blob/master/mixserv/src/main/java/hivemall/mix/server/MixServer.java#L90) for details of the Mix server options.
+
+We recommend using multiple MIX servers for better MIX throughput (3-5 or so would be enough for a normal cluster size). The MIX protocol of Hivemall is *horizontally scalable* by adding MIX server nodes.
+
+Using Mix Protocol through Hivemall
+===================================
+
+[Install Hivemall](../getting_started/installation.html) on Hive.
+
+_Make sure that [hivemall-with-dependencies.jar](https://github.com/myui/hivemall/raw/master/target/hivemall-with-dependencies.jar) is used for installation. The jar contains the minimum required jars (netty, jsr305) for running Hivemall on Hive._
+
+Now, we explain how to use mixing in [an example using the KDD2010a dataset](../binaryclass/kdd2010a_dataset.html).
+
+Enabling mixing on Hivemall is as simple as follows:
+```sql
+use kdd2010;
+
+create table kdd10a_pa1_model1 as
+select
+  feature,
+  cast(voted_avg(weight) as float) as weight
+from
+  (select
+     train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight)
+   from
+     kdd10a_train_x3
+  ) t
+group by feature;
+```
+
+All you have to do is add the "*-mix*" training option, as seen in the above query.
+
+The effect of model mixing
+===========================
+
+In my experience, the MIX improved the prediction accuracy of the above KDD2010a PA1 training on a 32-node cluster from 0.844835019263103 (w/o mix) to 0.8678096499719774 (w/ mix).
+
The overhead of using the MIX protocol is *almost negligible* because the MIX communication is efficiently handled using asynchronous non-blocking I/O.
Furthermore, the training time could be improved on certain settings because of the faster convergence due to mixing.
\ No newline at end of file
diff --git a/docs/gitbook/tips/rand_amplify.md b/docs/gitbook/tips/rand_amplify.md
index cd546ec6..6d68dea1 100644
--- a/docs/gitbook/tips/rand_amplify.md
+++ b/docs/gitbook/tips/rand_amplify.md
@@ -21,16 +21,16 @@ This article explains *amplify* technique that is useful for improving predictio

Iterations are mandatory in machine learning (e.g., in [stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)) to get good prediction models.
However, MapReduce is known to be ill-suited for iterative algorithms because the IN/OUT of each MapReduce job goes through HDFS.

-In this example, we show how Hivemall deals with this problem. We use [KDD Cup 2012, Track 2 Task](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-dataset) as an example.
+In this example, we show how Hivemall deals with this problem. We use the [KDD Cup 2012, Track 2 Task](../regression/kddcup12tr2_dataset.html) as an example.

-**WARNING**: rand_amplify() is supported in v0.2-beta1 and later.
+

---
# Amplify training examples in Map phase and shuffle them in Reduce phase

Hivemall provides the **amplify** UDTF to enumerate iteration effects in machine learning without several MapReduce steps.
The amplify function returns multiple rows for each row.
-The first argument ${xtimes} is the multiplication factor.
+The first argument `${xtimes}` is the multiplication factor.
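+
+For instance, a minimal sketch on a hypothetical one-column table `t` (the real queries below amplify an actual training table):
+
+```sql
+-- every row of t is returned 3 times
+select amplify(3, x) as x from t;
+```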
In the following examples, the multiplication factor is set to 3.

```sql

@@ -72,9 +72,9 @@ group by feature;

The above query is executed by 2 MapReduce jobs as shown below:

amplifier

-Using *trainning_x3* instead of the plain training table results in higher and better AUC (0.746214) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+Using *trainning_x3* instead of the plain training table results in a better AUC (0.746214) in [this example](../regression/kddcup12tr2_lr_amplify.html#conclusion).

-A problem in amplify() is that the shuffle (copy) and merge phase of the stage 1 could become a bottleneck.
+A problem in `amplify()` is that the shuffle (copy) and merge phase of stage 1 could become a bottleneck.
When the training table is so large that it involves 100 Map tasks, the merge operator needs to merge at least 100 files by (external) merge sort!

Note that the actual bottleneck is not M/R iterations but shuffling training instances. Iteration without shuffling (as in [the Spark example](http://spark.incubator.apache.org/examples.html)) causes very slow convergence and results in requiring more iterations. Shuffling cannot be avoided even in iterative MapReduce variants.

@@ -107,7 +107,7 @@ The map-local multiplication and shuffling has no bottleneck in the merge phase

randamplify_elapsed

-Using *rand_amplify* results in a better AUC (0.743392) in [this](https://github.com/myui/hivemall/wiki/KDDCup-2012-track-2-CTR-prediction-(regression\)) example.
+Using *rand_amplify* results in a better AUC (0.743392) in [this example](../regression/kddcup12tr2_lr_amplify.html#conclusion).

---
# Conclusion
diff --git a/docs/gitbook/tips/rowid.md b/docs/gitbook/tips/rowid.md
index 2b244016..ed6431ee 100644
--- a/docs/gitbook/tips/rowid.md
+++ b/docs/gitbook/tips/rowid.md
@@ -16,7 +16,21 @@ specific language governing permissions and limitations
under the License.
-->
-
+
+
+# Rowid generator provided in Hivemall
+You can use the [rowid() function](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/tools/mapred/RowIdUDF.java) to generate a unique rowid in Hivemall v0.2 or later.
+```sql
+select
+  rowid() as rowid, -- returns ${task_id}-${sequence_number}
+  *
+from
+  xxx
+```
+
+# Other Rowid generation schemes using SQL
+
 ```sql
 CREATE TABLE xxx
 AS
@@ -37,14 +51,3 @@
select
*
from
a9atest;
```
-
-***
-# Rowid generator provided in Hivemall v0.2 or later
-You can use [rowid() function](https://github.com/myui/hivemall/blob/master/src/main/java/hivemall/tools/mapred/RowIdUDF.java) to generate an unique rowid in Hivemall v0.2 or later.
-```sql
-select
- rowid() as rowid, -- returns ${task_id}-${sequence_number}
- *
-from
- xxx
-```
\ No newline at end of file
diff --git a/docs/gitbook/tips/rt_prediction.md b/docs/gitbook/tips/rt_prediction.md
index c3423784..25d9ff70 100644
--- a/docs/gitbook/tips/rt_prediction.md
+++ b/docs/gitbook/tips/rt_prediction.md
@@ -16,23 +16,25 @@ specific language governing permissions and limitations
under the License.
-->
-
-Hivemall provides a batch learning scheme that builds prediction models on Apache Hive.
+
+Apache Hivemall provides a batch learning scheme that builds prediction models on Apache Hive.
The learning process itself is a batch process; however, online/real-time prediction can be achieved by carrying out predictions on a transactional relational DBMS.
In this article, we explain how to run a real-time prediction using a relational DBMS.
-We assume that you have already run the [a9a binary classification task](https://github.com/myui/hivemall/wiki#a9a-binary-classification).
+We assume that you have already run the [a9a binary classification task](../binaryclass/a9a.html).
+
+

# Prerequisites

- MySQL

-Put mysql-connector-java.jar (JDBC driver) on $SQOOP_HOME/lib.
+  Put mysql-connector-java.jar (JDBC driver) on $SQOOP_HOME/lib.

- [Sqoop](http://sqoop.apache.org/)

-Sqoop 1.4.5 does not support Hadoop v2.6.0. So, you need to build packages for Hadoop 2.6.
-To do that you need to edit build.xml and ivy.xml as shown in [this patch](https://gist.github.com/myui/e8db4a31b574103133c6).
+  Sqoop 1.4.5 does not support Hadoop v2.6.0, so you need to build the packages for Hadoop 2.6.
+  To do that, edit build.xml and ivy.xml as shown in [this patch](https://gist.github.com/myui/e8db4a31b574103133c6).

# Preparing Model Tables on MySQL

@@ -228,7 +230,7 @@ where

1 row in set (0.00 sec)
```

-Similar to [the way in Hive](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)#prediction), you can run prediction as follows:
+Similar to [the way in Hive](../binaryclass/a9a_lr.html#prediction), you can run prediction as follows:

```sql
select