This repository has been archived by the owner on Oct 8, 2019. It is now read-only.

Updated the userguide
myui committed Nov 17, 2016
1 parent f6cef1f commit ee244c3
Showing 36 changed files with 834 additions and 508 deletions.
2 changes: 2 additions & 0 deletions docs/gitbook/SUMMARY.md
@@ -92,6 +92,8 @@
* [Webspam Tutorial](binaryclass/webspam.md)
* [Data preparation](binaryclass/webspam_dataset.md)
* [PA1, AROW, SCW](binaryclass/webspam_scw.md)

* [Kaggle Titanic Tutorial](binaryclass/titanic_rf.md)

## Part VI - Multiclass classification

16 changes: 11 additions & 5 deletions docs/gitbook/anomaly/lof.md
@@ -19,6 +19,8 @@
This article introduces how to find outliers using [Local Outlier Factor (LOF)](http://en.wikipedia.org/wiki/Local_outlier_factor) on Hivemall.

<!-- toc -->

# Data Preparation

```sql
@@ -36,9 +38,9 @@ ROW FORMAT DELIMITED
STORED AS TEXTFILE LOCATION '/dataset/lof/hundred_balls';
```

Download [hundred_balls.txt](https://github.com/myui/hivemall/blob/master/resources/examples/lof/hundred_balls.txt), which was originally provided in [this article](http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259).
Download [hundred_balls.txt](https://gist.githubusercontent.com/myui/f8b44ab925bc198e6d11b18fdd21269d/raw/bed05f811e4c351ed959e0159405690f2f11e577/hundred_balls.txt), which was originally provided in [this article](http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259).

You can find outliers in [this picture](http://next.rikunabi.com/tech/contents/ts_report/img/201303/002259/part1_img1.jpg). As you can see, Rowid `87` is apparently an outlier.
In this example, Rowid `87` is apparently an outlier.

```sh
awk '{FS=" "; OFS=" "; print NR,$0}' hundred_balls.txt | \
@@ -144,11 +146,15 @@ where
;
```

_Note: `list_neighbours` table SHOULD be created because `list_neighbours` is used multiple times._
> #### Caution
>
> `list_neighbours` table SHOULD be created because `list_neighbours` is used multiple times.
_Note: [`each_top_k`](https://github.com/myui/hivemall/pull/196) is supported in Hivemall v0.3.2-3 or later._
# Parallelize Top-k computation

_Note: To parallelize a top-k computation, break the left-hand table into pieces as described in [this page](https://github.com/myui/hivemall/wiki/Efficient-Top-k-computation-on-Apache-Hive-using-Hivemall-UDTF#parallelization-of-similarity-computation-using-with-clause)._
> #### Info
>
> To parallelize a top-k computation, break the left-hand table into pieces as described in [this page](../misc/topk.html).
```sql
WITH k_distance as (
```
187 changes: 90 additions & 97 deletions docs/gitbook/binaryclass/a9a_lr.md
@@ -1,98 +1,91 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
a9a
===
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a

_Training with iterations is OBSOLETE in Hivemall._
_Using amplifier and shuffling inputs is RECOMMENDED in Hivemall._

---

## UDF preparation

```sql
select count(1) from a9atrain;
-- total_steps should ideally be set to "count(1) / #map tasks"
set hivevar:total_steps=32561;

select count(1) from a9atest;
set hivevar:num_test_instances=16281;
```

## training
```sql
create table a9a_model1
as
select
cast(feature as int) as feature,
avg(weight) as weight
from
(select
logress(addBias(features),label,"-total_steps ${total_steps}") as (feature,weight)
from
a9atrain
) t
group by feature;
```
_The "-total_steps" option is optional for the logress() function._
_I recommend NOT using options (e.g., total_steps and eta0) unless you are familiar with them. When they are omitted, Hivemall uses an autonomic ETA (learning rate) estimator._

## prediction
```sql
create or replace view a9a_predict1
as
WITH a9atest_exploded as (
select
rowid,
label,
extract_feature(feature) as feature,
extract_weight(feature) as value
from
a9atest LATERAL VIEW explode(addBias(features)) t AS feature
)
select
t.rowid,
sigmoid(sum(m.weight * t.value)) as prob,
CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
from
a9atest_exploded t LEFT OUTER JOIN
a9a_model1 m ON (t.feature = m.feature)
group by
t.rowid;
```

## evaluation
```sql
create or replace view a9a_submit1 as
select
t.label as actual,
pd.label as predicted,
pd.prob as probability
from
a9atest t JOIN a9a_predict1 pd
on (t.rowid = pd.rowid);
```

```sql
select count(1) / ${num_test_instances} from a9a_submit1
where actual == predicted;
```
> 0.8430071862907684

<!-- toc -->

# UDF preparation

```sql
select count(1) from a9atrain;
-- total_steps should ideally be set to "count(1) / #map tasks"
set hivevar:total_steps=32561;

select count(1) from a9atest;
set hivevar:num_test_instances=16281;
```

# training
```sql
create table a9a_model1
as
select
cast(feature as int) as feature,
avg(weight) as weight
from
(select
logress(addBias(features),label,"-total_steps ${total_steps}") as (feature,weight)
from
a9atrain
) t
group by feature;
```
_The "-total_steps" option is optional for the logress() function._
_I recommend NOT using options (e.g., total_steps and eta0) unless you are familiar with them. When they are omitted, Hivemall uses an autonomic ETA (learning rate) estimator._
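
The `avg(weight) ... group by feature` part of the training query merges the `(feature, weight)` rows returned by `logress()` into one weight per feature. The following Python sketch illustrates only that averaging step, assuming several training tasks each emit their own rows; the function name `average_weights` and the sample values are hypothetical and not part of Hivemall.

```python
from collections import defaultdict


def average_weights(emitted):
    """Average the weights emitted for each feature.

    Mirrors `avg(weight) ... group by feature` in the training query.
    `emitted` is an iterable of (feature, weight) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical rows: two tasks both emitted features 0 and 3, one emitted 7.
emitted = [(0, 0.5), (3, -1.0), (0, 1.5), (3, -2.0), (7, 0.25)]
model = average_weights(emitted)
```

A feature that only one task emitted keeps its single weight unchanged, since the average of one value is the value itself.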

# prediction
```sql
create or replace view a9a_predict1
as
WITH a9atest_exploded as (
select
rowid,
label,
extract_feature(feature) as feature,
extract_weight(feature) as value
from
a9atest LATERAL VIEW explode(addBias(features)) t AS feature
)
select
t.rowid,
sigmoid(sum(m.weight * t.value)) as prob,
CAST((case when sigmoid(sum(m.weight * t.value)) >= 0.5 then 1.0 else 0.0 end) as FLOAT) as label
from
a9atest_exploded t LEFT OUTER JOIN
a9a_model1 m ON (t.feature = m.feature)
group by
t.rowid;
```
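
For one test row, the prediction view above sums `weight * value` over the row's features, applies `sigmoid`, and assigns label `1.0` when the probability is at least `0.5`. The Python sketch below walks through that computation. It assumes features are `"index:value"` strings and that `addBias` appends a constant bias feature `"0:1.0"`; both are assumptions about Hivemall's behaviour, and the helper names and sample model are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, as applied to the weighted sum in the view."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(features, model, bias_feature="0:1.0"):
    """Return (prob, label) for one row of "index:value" feature strings.

    `bias_feature` stands in for what addBias() is assumed to append.
    """
    total = 0.0
    for item in list(features) + [bias_feature]:
        index, value = item.split(":")
        weight = model.get(int(index))
        # Features missing from the model contribute nothing, like the
        # LEFT OUTER JOIN followed by sum(), which skips NULL products.
        # (If no feature matched at all, Hive's sum() would yield NULL;
        # this sketch uses 0.0 in that case.)
        if weight is not None:
            total += weight * float(value)
    prob = sigmoid(total)
    label = 1.0 if prob >= 0.5 else 0.0
    return prob, label


# Hypothetical model: feature index -> averaged weight.
model = {0: -0.5, 3: 1.0, 7: 2.0}
prob, label = predict(["3:1.0", "7:0.5", "99:1.0"], model)
```

Here the weighted sum is `1.0 * 1.0 + 2.0 * 0.5 - 0.5 * 1.0 = 1.5`, so the probability is above `0.5` and the row is labelled `1.0`; feature `99` is ignored because the model has no weight for it.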

# evaluation
```sql
create or replace view a9a_submit1 as
select
t.label as actual,
pd.label as predicted,
pd.prob as probability
from
a9atest t JOIN a9a_predict1 pd
on (t.rowid = pd.rowid);
```

```sql
select count(1) / ${num_test_instances} from a9a_submit1
where actual == predicted;
```
> 0.8430071862907684
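
The value reported above is the accuracy: the number of rows where `actual` equals `predicted`, divided by the number of test instances. A minimal Python sketch of the same ratio, using made-up labels rather than the a9a data (the function name `accuracy` is hypothetical):

```python
def accuracy(actual, predicted):
    """Fraction of rows whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Made-up labels: three of the four predictions match.
actual = [1.0, 0.0, 1.0, 1.0]
predicted = [1.0, 0.0, 0.0, 1.0]
score = accuracy(actual, predicted)
```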
7 changes: 3 additions & 4 deletions docs/gitbook/binaryclass/a9a_minibatch.md
@@ -17,13 +17,12 @@
under the License.
-->
This page explains how to apply [Mini-Batch Gradient Descent](https://class.coursera.org/ml-003/lecture/106) for the training of logistic regression explained in [this example](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)).

See [this page](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)) first. This content depends on it.
This page explains how to apply [Mini-Batch Gradient Descent](https://class.coursera.org/ml-003/lecture/106) for the training of logistic regression explained in [this example](./a9a_lr.html).
So, refer to [this page](./a9a_lr.html) first; this content depends on it.

# Training

Replace `a9a_model1` of [this example](https://github.com/myui/hivemall/wiki/a9a-binary-classification-(logistic-regression)).
Replace `a9a_model1` of [this example](./a9a_lr.html).

```sql
set hivevar:total_steps=32561;
```
6 changes: 3 additions & 3 deletions docs/gitbook/binaryclass/kdd2010a_dataset.md
@@ -19,9 +19,9 @@
[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (algebra)](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (algebra))

* # of classes: 2
* # of data: 8,407,752 (training) / 510,302 (testing)
* # of features: 20,216,830 in about 2.73 GB (training) / 20,216,830 (testing)
* the number of classes: 2
* the number of data: 8,407,752 (training) / 510,302 (testing)
* the number of features: 20,216,830 in about 2.73 GB (training) / 20,216,830 (testing)

---
# Define training/testing tables
Expand Down
6 changes: 3 additions & 3 deletions docs/gitbook/binaryclass/kdd2010b_dataset.md
@@ -19,9 +19,9 @@
[http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (bridge to algebra)](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010 (bridge to algebra))

* # of classes: 2
* # of data: 19,264,097 / 748,401 (testing)
* # of features: 29,890,095 / 29,890,095 (testing)
* the number of classes: 2
* the number of examples: 19,264,097 (training) / 748,401 (testing)
* the number of features: 29,890,095 (training) / 29,890,095 (testing)

---
# Define training/testing tables
Expand Down
2 changes: 1 addition & 1 deletion docs/gitbook/binaryclass/news20_scw.md
@@ -16,7 +16,7 @@
specific language governing permissions and limitations
under the License.
-->

## UDF preparation
```
use news20;
```
