Upgraded for 0.9.1 and Readme Completed
mohanaprasad1994 committed Mar 28, 2015
1 parent 154da35 commit cd1de65
Showing 16 changed files with 1,215 additions and 16 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@ data/*.txt
manifest.json
pio.log
target/
/pio.sbt
4 changes: 4 additions & 0 deletions .gitignore~
@@ -0,0 +1,4 @@
data/*.txt
manifest.json
pio.log
target/
211 changes: 202 additions & 9 deletions README.md
@@ -1,18 +1,211 @@
# Classification Engine Template
# PredictionIO-MLlib-Decision-Trees-Template
## Overview
An engine template is an almost-complete implementation of an engine. In this Engine Template, we have integrated Apache Spark MLlib's Decision Tree algorithm by default.
The default use case of this classification Engine Template is to predict the service plan (plan) a user will subscribe to, based on three of the user's properties: attr0, attr1 and attr2.
You can customize it easily to fit your specific use case and needs.

## Documentation
We are going to show you how to create your own classification engine for production use based on this template.

Please refer to http://docs.prediction.io/templates/classification/quickstart/
## Usage

## Versions
### Event Data Requirements
By default, the template requires the following events to be collected (you can check this in TemplateFolder/data/import_eventserver.py):

### develop
- user $set event, which sets the attributes of the user

### Input Query
- array of feature values (3 features)
```
{"features": [0, 2, 0]}
```

### v0.1.1
### Output Predicted Result
- the predicted label
```
{"label":0.0}
```

### Dataset
We will be using the sample data set from https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_naive_bayes_data.txt.
The training sample events have the following format (generated by data/import_eventserver.py):
```
client.create_event(
    event="$set",
    entity_type="user",
    entity_id=str(count),  # use the count num as user ID
    properties={
        "attr0": int(attr[0]),
        "attr1": int(attr[1]),
        "attr2": int(attr[2]),
        "plan": int(plan)
    }
)
```
## Install and Run PredictionIO
First you need to [install PredictionIO 0.9.1](http://docs.prediction.io/install/) (if you haven't done it).
Let's say you have installed PredictionIO at /home/yourname/PredictionIO/. For convenience, add PredictionIO's binary command path to your PATH, i.e. /home/yourname/PredictionIO/bin
```
$ PATH=$PATH:/home/yourname/PredictionIO/bin; export PATH
```
Once you have completed the installation process, please make sure all the components (PredictionIO Event Server, Elasticsearch, and HBase) are up and running.

```
$ pio-start-all
```
For versions before 0.9.1, you need to individually get the PredictionIO Event Server, Elasticsearch, and HBase up and running.

You can check the status by running:
```
$ pio status
```
## Download Template
Clone the current repository by executing the following command in the directory where you want the code to reside:

```
git clone https://github.com/mohanaprasad1994/PredictionIO-MLlib-Decision-Trees-Template.git MyClassification
```
## Generate an App ID and Access Key
Let's assume you want to use this engine in an application named "MyApp1". You will need to collect some training data for machine learning modeling. You can generate an App ID and Access Key that represent "MyApp1" on the Event Server easily:
```
$ pio app new MyApp1
```
You should find the following in the console output:
```
...
[INFO] [App$] Initialized Event Store for this app ID: 1.
[INFO] [App$] Created new app:
[INFO] [App$] Name: MyApp1
[INFO] [App$] ID: 1
[INFO] [App$] Access Key: 3mZWDzci2D5YsqAnqNnXH9SB6Rg3dsTBs8iHkK6X2i54IQsIZI1eEeQQyMfs7b3F
```
Take note of the Access Key and App ID. You will need the Access Key to refer to "MyApp1" when you collect data. At the same time, you will use App ID to refer to "MyApp1" in engine code.
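
If you script your setup, the two values can be pulled out of the console output rather than copied by hand. The following is only a sketch: `parse_app_credentials` is a hypothetical helper (it is not part of PredictionIO), and it assumes the log lines look exactly like the sample output shown above.

```python
import re

# Console output copied from the `pio app new MyApp1` example above.
console_output = """\
[INFO] [App$] Initialized Event Store for this app ID: 1.
[INFO] [App$] Created new app:
[INFO] [App$] Name: MyApp1
[INFO] [App$] ID: 1
[INFO] [App$] Access Key: 3mZWDzci2D5YsqAnqNnXH9SB6Rg3dsTBs8iHkK6X2i54IQsIZI1eEeQQyMfs7b3F
"""


def parse_app_credentials(output):
    """Hypothetical helper: extract (app_id, access_key) from the log text."""
    # Match only the dedicated "ID:" line, not the "app ID: 1." sentence.
    app_id = re.search(r"\[App\$\]\s+ID:\s*(\d+)", output)
    access_key = re.search(r"\[App\$\]\s+Access Key:\s*(\S+)", output)
    if app_id is None or access_key is None:
        raise ValueError("App ID or Access Key not found in output")
    return int(app_id.group(1)), access_key.group(1)


app_id, access_key = parse_app_credentials(console_output)
```

The App ID goes into engine.json, while the Access Key is passed to the import script or SDK client.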

Running `pio app list` returns a list of the names and IDs of the apps created in the Event Server.

```
$ pio app list
[INFO] [App$] Name | ID | Access Key | Allowed Event(s)
[INFO] [App$] MyApp1 | 1 | 3mZWDzci2D5YsqAnqNnXH9SB6Rg3dsTBs8iHkK6X2i54IQsIZI1eEeQQyMfs7b3F | (all)
[INFO] [App$] MyApp2 | 2 | io5lz6Eg4m3Xe4JZTBFE13GMAf1dhFl6ZteuJfrO84XpdOz9wRCrDU44EUaYuXq5 | (all)
[INFO] [App$] Finished listing 2 app(s).
```

## Collecting Data

Next, let's collect some training data. By default, the Classification Engine Template reads 4 properties of a user record: attr0, attr1, attr2 and plan.

You can send this data to the PredictionIO Event Server in real time by making an HTTP request or through the EventClient of an SDK.
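
As a rough illustration of the HTTP route, the sketch below builds (but does not send) a `$set` event request using only the Python standard library. It assumes the Event Server listens on its default address `http://localhost:7070` and accepts events at `/events.json?accessKey=...`; `YOUR_ACCESS_KEY` is a placeholder and `build_set_event_request` is a hypothetical helper name.

```python
import json
from urllib.request import Request

EVENT_SERVER_URL = "http://localhost:7070"  # assumed default Event Server address
ACCESS_KEY = "YOUR_ACCESS_KEY"              # placeholder: use your app's Access Key


def build_set_event_request(user_id, attr0, attr1, attr2, plan):
    """Hypothetical helper: build a POST request carrying one user $set event."""
    event = {
        "event": "$set",
        "entityType": "user",
        "entityId": str(user_id),
        "properties": {
            "attr0": attr0,
            "attr1": attr1,
            "attr2": attr2,
            "plan": plan,
        },
    }
    url = "{}/events.json?accessKey={}".format(EVENT_SERVER_URL, ACCESS_KEY)
    return Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_set_event_request(user_id=1, attr0=0, attr1=2, attr2=0, plan=1)
# To actually send it (requires a running Event Server):
#   from urllib.request import urlopen
#   urlopen(request)
```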

Although you can integrate your app with PredictionIO and collect training data in real time, we are going to import a sample dataset with the provided scripts for demonstration purposes.

Execute the following command in the engine directory (MyClassification) to get the sample dataset from the MLlib repo:
```
curl https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_naive_bayes_data.txt --create-dirs -o data/sample_decision_trees.txt
```

A Python import script, import_eventserver.py, is provided in the template to import the data into the Event Server using the Python SDK.
Replace the value of the access_key parameter with your Access Key and run:
```
$ cd MyClassification
$ python data/import_eventserver.py --access_key 3mZWDzci2D5YsqAnqNnXH9SB6Rg3dsTBs8iHkK6X2i54IQsIZI1eEeQQyMfs7b3F
```
You should see the following output:
```
Importing data...
6 events are imported.
```
This Python script converts the data file into the event format required by the Event Server.
Now the training data is stored as events inside the Event Store.
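
The conversion can be pictured roughly as follows. This is a sketch, not the template's actual script: it assumes each line of the data file has the form `label,f0 f1 f2` (a label, a comma, then three space-separated feature values), the two sample lines are illustrative, and `line_to_properties` is a hypothetical helper name.

```python
def line_to_properties(line):
    """Hypothetical helper: turn one 'label,f0 f1 f2' line into event properties."""
    plan, features = line.strip().split(",")
    attr = features.split(" ")
    return {
        "attr0": int(attr[0]),
        "attr1": int(attr[1]),
        "attr2": int(attr[2]),
        "plan": int(plan),
    }


# Illustrative lines in the assumed format (not necessarily taken from the file).
sample_lines = ["0,1 0 0\n", "1,0 2 0\n"]

events = []
for count, line in enumerate(sample_lines, start=1):
    events.append({
        "event": "$set",
        "entity_type": "user",
        "entity_id": str(count),  # use the running count as the user ID
        "properties": line_to_properties(line),
    })
```

Each dictionary in `events` carries the same keyword arguments that the `client.create_event(...)` call in the Dataset section expects.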

## Deploy the Engine as a Service
Now you can build, train, and deploy the engine. First, make sure you are under the MyClassification directory.

### Engine.json

Under the directory, you should find an engine.json file; this is where you specify parameters for the engine.
Make sure the appId defined in the file matches your App ID. (This links the template engine with the app.)

Parameters for the Decision Tree model are to be set here.

numClasses: Number of classes in the dataset

maxDepth: Max depth of the tree generated

maxBins: Max number of bins used when discretizing continuous features

```
{
"id": "default",
"description": "Default settings",
"engineFactory": "org.template.classification.ClassificationEngine",
"datasource": {
"params": {
"appId": 1
}
},
"algorithms": [
{
"name": "decisiontree",
"params": {
"numClasses": 3,
"maxDepth": 5,
"maxBins": 100
}
}
]
}
```
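
Before building, it can help to sanity-check the file. The sketch below is one possible check, not something the template ships: `check_engine_config` is a hypothetical helper, the JSON is embedded as a string so the example is self-contained, and the validation rules (matching appId, positive integer parameters, at least two classes) are assumptions about what a sensible configuration looks like.

```python
import json

# Same content as the engine.json shown above, embedded for a self-contained example.
ENGINE_JSON = """
{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.template.classification.ClassificationEngine",
  "datasource": {"params": {"appId": 1}},
  "algorithms": [
    {
      "name": "decisiontree",
      "params": {"numClasses": 3, "maxDepth": 5, "maxBins": 100}
    }
  ]
}
"""


def check_engine_config(config, expected_app_id):
    """Hypothetical helper: validate appId and return the decision tree params."""
    app_id = config["datasource"]["params"]["appId"]
    if app_id != expected_app_id:
        raise ValueError("appId {} does not match App ID {}".format(app_id, expected_app_id))
    params = config["algorithms"][0]["params"]
    for name in ("numClasses", "maxDepth", "maxBins"):
        if not isinstance(params[name], int) or params[name] <= 0:
            raise ValueError("{} must be a positive integer".format(name))
    if params["numClasses"] < 2:
        raise ValueError("numClasses must be at least 2 for classification")
    return params


params = check_engine_config(json.loads(ENGINE_JSON), expected_app_id=1)
```

In practice you would read the file with `json.load(open("engine.json"))` and pass the App ID reported by `pio app list`.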
### Build

Start with building your MyClassification engine.
```
$ pio build
```
This command may take a few minutes the first time; subsequent builds should take less than a minute. You can also run it with --verbose to see all log messages.

Upon successful build, you should see a console message similar to the following.
```
[INFO] [Console$] Your engine is ready for training.
```

### Training the Predictive Model

Train your engine.

```
$ pio train
```
When your engine is trained successfully, you should see a console message similar to the following.

```
[INFO] [CoreWorkflow$] Training completed successfully.
```
### Deploying the Engine

Now your engine is ready to deploy.

```
$ pio deploy
```
This will deploy an engine that binds to http://localhost:8000. You can visit that page in your web browser to check its status.

## Use the Engine

Now you can try to retrieve predicted results. For example, to predict the label (i.e. plan in this case) of a user with attr0=2, attr1=0 and attr2=0, send the JSON { "features": [2, 0, 0] } to the deployed engine; it returns a JSON object containing the predicted plan. Send the query by making an HTTP request or through the EngineClient of an SDK:
```python
import predictionio
engine_client = predictionio.EngineClient(url="http://localhost:8000")
# print(...) with parentheses works under both Python 2 and Python 3
print(engine_client.send_query({"features": [2, 0, 0]}))
```
The following is sample JSON response:

```
{"label":0.0}
```
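
For the plain HTTP route, a minimal sketch with the Python standard library might look like this. It assumes the deployed engine accepts queries via POST at `/queries.json` on `http://localhost:8000`; `build_query_request` and `parse_prediction` are hypothetical helper names, and the request is built but not sent so the example runs without a deployed engine.

```python
import json
from urllib.request import Request

ENGINE_URL = "http://localhost:8000"  # address the deployed engine binds to


def build_query_request(features):
    """Hypothetical helper: build a POST request for the engine's query endpoint."""
    body = json.dumps({"features": features}).encode("utf-8")
    return Request(
        ENGINE_URL + "/queries.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(response_text):
    """Hypothetical helper: extract the predicted label from the JSON response."""
    return json.loads(response_text)["label"]


request = build_query_request([2, 0, 0])
# To actually send it (requires the engine to be deployed):
#   from urllib.request import urlopen
#   response_text = urlopen(request).read().decode("utf-8")
label = parse_prediction('{"label":0.0}')  # sample response from above
```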

The sample query can be found in **test.py**.

- Persist RDD to memory (.cache()) in DataSource for better performance

### v0.1.0

- initial version