Introduction

A Spark program that can:

read data (the program can read two existing formats)
train a clustering algorithm
use an existing model to make predictions on new data

How to compile

To compile, make sure you have already installed:

JDK 8
sbt

If you run into memory problems while compiling, set JAVA_OPTS=-Xmx2G.

Then go to the root path of the project and run:

sbt assembly

How to run the program

Run run_model.sh.

You can modify the script to change the default values:

action=training
inputFile=data-test/sample.json
nbIterations=10
nbClusters=5
modelPath=data-test/model

When the program finishes, run run_prediction.sh to make predictions on new data:

action=predict
inputFile=data-test/sample.json
modelPath=data-test/model
outputFile=data-test/result.json

Backlogs

Convert to JSON Lines format

The input data is currently in plain JSON format, so it cannot be split across a distributed system and read in parallel. We need to convert this input to JSON Lines format (one JSON record per line).
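As a rough illustration of the conversion, the sketch below rewrites a JSON file as JSON Lines using only the Python standard library. It assumes the top level of the input is an array of records, which this README does not confirm. The function name `json_to_jsonlines` and any file paths are hypothetical.

```python
import json
from pathlib import Path


def json_to_jsonlines(src: str, dst: str) -> int:
    """Rewrite a JSON array file as JSON Lines; return the record count."""
    # Assumption: the top level of the input is a JSON array of records.
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    with open(dst, "w", encoding="utf-8") as out:
        for record in records:
            # json.dumps without indent escapes newlines inside strings,
            # so each record stays on exactly one line.
            out.write(json.dumps(record) + "\n")
    return len(records)
```

This loads the whole file into memory, so it only suits inputs that fit on one machine. Once converted, each line can be parsed on its own, which is what lets a distributed reader split the file.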

Jenkins automation

The build process and tests must be automated with Jenkins.

Deployment automation

Deployment of the master branch should be automated. This can be a step in the Jenkins pipeline.

Better error handling

For the moment there is only one log message, which counts the total number of invalid input records. There should be a better way to persist these errors (Kibana, database, HDFS ...) so that they can be investigated later.
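One possible direction, sketched here outside Spark and under stated assumptions: instead of only counting invalid records, write each one to a separate JSON Lines file together with its line number and the parse error. The function name `split_valid_invalid`, the reject record layout, and the use of a local file as the sink are all hypothetical. The real target (Kibana, a database, HDFS) is still an open choice.

```python
import json


def split_valid_invalid(lines, reject_path):
    """Parse JSON Lines input and persist unparsable lines to reject_path.

    Returns (valid_records, invalid_count).
    """
    valid = []
    invalid_count = 0
    with open(reject_path, "w", encoding="utf-8") as rejects:
        for line_no, line in enumerate(lines, start=1):
            try:
                valid.append(json.loads(line))
            except json.JSONDecodeError as err:
                invalid_count += 1
                # Hypothetical reject layout: line number, reason, raw text.
                reject = {"line": line_no, "error": err.msg, "raw": line}
                rejects.write(json.dumps(reject) + "\n")
    return valid, invalid_count
```

The count that is logged today is still available as the second return value, while the reject file keeps enough context to reproduce each failure.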

Result persistence

For the moment the result is collected to the driver and persisted in a simple text file. This data should instead be stored in a cluster (HDFS, database).

Model updates

Find an automated process that makes updating the model easier.

ziboumima/clustering