Release CaraML - 1.0.0 (#29)
* Added Unit Tests trait

* Added Unit Tests trait (#1)

* Started work on YAML Parser

* WIP - Yaml parser

* extract stages from yamls

* development of parser

* TAG: Parser with annotations

* yaml parser stages creation

* cleaned dataset stage

* fixed cara parser file

* Add LogisticRegression Class with building lr model

* Finalized LogisticRegression Class and move GetMethode function to the trait class CaraStage

* Merged LogisticRegressionStage and cleaned

* Building Spark ML pipelines and unit tests

* Refactored CaraParser and added parse method + updated tests

* yaml_parser: update unit tests for CaraYaml

* Updated CaraParser adding try, update tests

* Feature/model schema (#3)

* LogisticRegressionTest contains error to clear

* Finalize LogisticRegression's class and tests

* refactor names to camel case and correct spaces

Co-authored-by: merzouk <merzoukoumedda@gmail.com>

* Started model training

* Added Evaluator parser

* Evolution of parser

* Feature/dataset parser (#6)

* first implementation of HashingTF, IDF,Tokenizer,Word2Vec

* add new Dataset Features + Fix build function

* fix CountVectorizerModel + handle Model classes

* build edited, in progress for

* add tests for all classes -- must review CountVModel to fix tests

* fixed CountVectorizerModel Test

* getMethode completed + all class and tests ok + indentation ok

* fixed PR changes

* Feature/yaml parser (#7)

* Added Evaluator parser

* Evolution of parser

* Added tuner parser

* Feature/yaml parser (#8)

* Added Evaluator parser

* Evolution of parser

* Added tuner parser

* Added companion object to CaraParser

* Feature/yaml parser (#9)

* Added Evaluator parser

* Evolution of parser

* Added tuner parser

* Added companion object to CaraParser

* Added tuner to CaraPipeline

* skeleton for CaraModel

* Renamed CaraYaml class to CaraYaml Reader

* Created CaraModel Pipeline skeleton for train

* first commit branch

* finish generateModel method and add CaraModelTest class

* review cara_pipine_model test

* Feature/cara pipeline model (#10)

* first commit branch

* finish generateModel method and add CaraModelTest class

* review cara_pipine_model test

Co-authored-by: merzouk <merzoukoumedda@gmail.com>

* Changed datasetPath to dataset itself

* finalize class LinearRegression plus tests (#11)

Co-authored-by: merzouk <merzoukoumedda@gmail.com>

* Added evaluation method

* updated cara model

* Feature/model schema (#13)

* LogisticRegressionTest contains error to clear

* Finalize LogisticRegression's class and tests

* refactor names to camel case and correct spaces

* Adjust LogisticRegression code format and add DecisionTreeClassifier model class and tests

* Add GBTClassifier model class and tests

* tests not ended

* finalize tests for new model classes

* CarastageMapper update

* update caraMapperModel

Co-authored-by: merzouk <merzoukoumedda@gmail.com>

* Feature/model schema (#15)

* LogisticRegressionTest contains error to clear

* Finalize LogisticRegression's class and tests

* refactor names to camel case and correct spaces

* Adjust LogisticRegression code format and add DecisionTreeClassifier model class and tests

* Add GBTClassifier model class and tests

* tests not ended

* finalize tests for new model classes

* CarastageMapper update

* update caraMapperModel

* Add Kmeans, LDA and NaiveBayes models and class tests

Co-authored-by: merzouk <merzoukoumedda@gmail.com>

* added MulticlassClassificationEvaluator (#16)

* Overwrite save

* Fixed the case where no tuner is specified

* Removed sparksession from CaraModel

* Feature/model schema (#19)

* LogisticRegressionTest contains error to clear

* Finalize LogisticRegression's class and tests

* refactor names to camel case and correct spaces

* Adjust LogisticRegression code format and add DecisionTreeClassifier model class and tests

* Add GBTClassifier model class and tests

* tests not ended

* finalize tests for new model classes

* CarastageMapper update

* update caraMapperModel

* Add Kmeans, LDA and NaiveBayes models and class tests

* add DecisionTreeRegressor class and tests

* Add RandomForestRegressor class and tests

* Add GBTRegressor class and tests

Co-authored-by: merzouk <merzoukoumedda@gmail.com>

* Global refactoring (#20)

* Made build method for stages generic

* Code review on source code

* Added some scaladoc

* Started reviewing tests

* Refacto on unit tests

* Renamed packages

* Added father package

* Publish to repository

* Feature/readme documentation (#22)

* Set readme plan

* Update ReadMe

* Update README.md

* ReadMe Updates

* ReadMe updates

* updates ReadMe

* ReadMe updates

* Update README.md

* Updates ReadMe

* Updates ReadMe

* Update README.md

* ReadMe Updates

* ReadMe updates

* Update README.md

* Update README.md

* Add Schema

* Update README.md

* Update README.md

* Update README.md

* Update ReadMe add CaraML jar link

* update readme

Co-authored-by: merzouk <merzoukoumedda@gmail.com>

* fix ReadMe (#23)

Co-authored-by: merzouk <merzoukoumedda@gmail.com>

* Fix readme requirements (#24)

* update readme

* update readme

* update readme

* update readme

* update readme

Co-authored-by: merzouk <merzoukoumedda@gmail.com>

* Changed build version for release

* Feature/generate report (#28)

* generateReport fixed + modelEvaluate

* fixed Resources files

* Generate Report finished

* code clean generateReport

Co-authored-by: merzouk <merzoukoumedda@gmail.com>
Co-authored-by: SAI-Aghylas <55828644+SAI-Aghylas@users.noreply.github.com>
Co-authored-by: merzouk13 <57535044+merzouk13@users.noreply.github.com>
4 people committed Jul 3, 2021
1 parent 9f07bd2 commit e9dff0f
Showing 74 changed files with 4,183 additions and 5 deletions.
Binary file added PA.PNG
186 changes: 185 additions & 1 deletion README.md
@@ -1 +1,185 @@
# CaraML

## *Presentation*

CaraML is a Scala/Apache Spark framework for distributed machine learning programs that uses Apache Spark MLlib in the simplest possible way. There is no need to write hundreds or thousands of lines of code: you just describe a pipeline of models and/or transformations. The purpose is to do "Machine Learning as Code".



## *Requirements*

To use the CaraML framework, you must satisfy the following requirements:

- Scala version >= 2.12.13
- Spark version >= 3.1
- Java 11



## *Installation*

- Spark : [Download here](https://spark.apache.org/downloads.html)
- Scala : [Download here](https://www.scala-lang.org/download/)
- CaraML library : [CaraML](https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/jsarni/caraml_2.12/1.0.0-SNAPSHOT/)



## *Usage*

To use CaraML, add the framework dependency to your Spark application:

- Sbt

```scala
libraryDependencies += "io.github.jsarni" %% "caraml" % "1.0.0"
```

- Gradle

```groovy
compile group: 'io.github.jsarni', name: 'caraml_2.12', version: '1.0.0'
```

- Maven

```xml
<dependency>
  <groupId>io.github.jsarni</groupId>
  <artifactId>caraml_2.12</artifactId>
  <version>1.0.0</version>
</dependency>
```

CaraML needs the following information:

- A prepared dataset that will be transformed and used to train the models
- The path where the final trained model and its metrics will be saved
- The path of the CaraYaml file, in which the user declares and configures the pipeline stages (SparkML models and/or SparkML transformations)

The YAML file describes a pipeline of stages; each stage can be a SparkML model or a SparkML data preprocessing method.
All CaraYaml files must start with the "CaraPipeline:" keyword and can contain the following keywords:

### *CaraPipeline*
* **"CaraPipeline:"** : keyword that must be set at the beginning of each CaraYaml file
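
As a rough illustration of this rule, the Scala sketch below checks that a YAML document has a top-level `CaraPipeline` key. It relies on SnakeYAML, which appears among the dependencies in `build.sbt`; the object and method names are hypothetical, and this is not CaraML's own parser.

```scala
import org.yaml.snakeyaml.Yaml

// Hypothetical helper, for illustration only (not part of the CaraML API).
object CaraYamlCheck {

  /** Returns true when the document's top level is a mapping that has a "CaraPipeline" key. */
  def startsWithCaraPipeline(text: String): Boolean =
    new Yaml().load[AnyRef](text) match {
      case m: java.util.Map[_, _] => m.containsKey("CaraPipeline")
      case _                      => false // scalar, list, or empty document
    }

  def main(args: Array[String]): Unit = {
    val valid   = "CaraPipeline:\n- stage: Tokenizer\n"
    val invalid = "stage: Tokenizer\n"
    println(startsWithCaraPipeline(valid))
    println(startsWithCaraPipeline(invalid))
  }
}
```

A check like this only validates the top-level keyword; it says nothing about whether the stages listed under it are supported.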


### *Stage*
* **"- stage:"** is a keyword used to declare and describe a stage. A stage can be an Estimator or a Transformer:
* **SparkML Estimator**: the name of the SparkML model that you want to use in the stage.
* **SparkML Transformer**: the name of the SparkML feature transformation that you want to apply to your dataset (preprocessing).


Each stage is followed by the "params:" keyword, which contains one or more parameters/hyperparameters of the stage and their values.

```yaml
params:
- "Param1 name" : "Param value"
- "Param2 name" : "Param value"
- ....
- "Paramn name" : "Param value"
```
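
One plausible way to apply such name/value pairs is to call the matching `set<Name>` method on the stage by reflection, so that a parameter such as `MaxIter: 5` becomes `setMaxIter(5)`. The sketch below shows the idea on a stand-in class; `DemoStage` and `applyParams` are hypothetical names, and this is an assumption about the mechanism, not CaraML's actual implementation.

```scala
object ParamSetterSketch {

  // Hypothetical stand-in for a Spark ML stage; real stages expose similar setters.
  class DemoStage {
    private var maxIter: Int = 100
    private var regParam: Double = 0.0
    def setMaxIter(value: Int): this.type = { maxIter = value; this }
    def setRegParam(value: Double): this.type = { regParam = value; this }
    def getMaxIter: Int = maxIter
    def getRegParam: Double = regParam
  }

  /** Calls set<Name>(value) on `target` for every (name, value) entry. */
  def applyParams[T <: AnyRef](target: T, params: Map[String, Any]): T = {
    params.foreach { case (name, value) =>
      // Look up a public one-argument method called set<Name>.
      val setter = target.getClass.getMethods
        .find(m => m.getName == s"set$name" && m.getParameterCount == 1)
        .getOrElse(throw new IllegalArgumentException(s"Unknown parameter: $name"))
      // Boxed values are unboxed by Method.invoke for primitive parameters.
      setter.invoke(target, value.asInstanceOf[AnyRef])
    }
    target
  }

  def main(args: Array[String]): Unit = {
    val stage = applyParams(new DemoStage, Map("MaxIter" -> 5, "RegParam" -> 0.3))
    println(s"maxIter=${stage.getMaxIter}, regParam=${stage.getRegParam}")
    // prints "maxIter=5, regParam=0.3"
  }
}
```

With this approach an unknown parameter name fails fast with an `IllegalArgumentException` instead of being silently ignored.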

### *Evaluator*
* **"- evaluator:"** is used to evaluate the model output and returns scalar metrics


### *Tuner*
* **"- tuner:"** is used to tune ML algorithms, allowing users to optimize the hyperparameters of algorithms and Pipelines

Each tuner is followed by the "params:" keyword, which contains one or more parameters/hyperparameters of the tuner and their values.

```yaml
params:
- "Param1 name" : "Param value"
- "Param2 name" : "Param value"
- ....
- "Paramn name" : "Param value"
```

### **CaraYaml example**
```yaml
CaraPipeline:
- stage: LogisticRegression
  params:
  - MaxIter: 5
  - RegParam: 0.3
  - ElasticNetParam: 0.8
- stage: Tokenizer
  params:
  - InputCol: Input
  - OutputCol: ResCol
- evaluator: MulticlassClassificationEvaluator
- tuner: TrainValidationSplit
  params:
  - TrainRatio: 0.8
```
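
For comparison, here is roughly what the same pipeline looks like when written by hand against the Spark ML API. This is a sketch that assumes `spark-mllib` is on the classpath; the object and method names are hypothetical, the stage order simply mirrors the YAML above, and it is not the code CaraML generates.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// Hypothetical hand-written counterpart of the CaraYaml example.
object CaraYamlEquivalentSketch {

  def buildTuner(): TrainValidationSplit = {
    // stage: LogisticRegression with its three params
    val lr = new LogisticRegression()
      .setMaxIter(5)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)

    // stage: Tokenizer with its input and output columns
    val tokenizer = new Tokenizer()
      .setInputCol("Input")
      .setOutputCol("ResCol")

    val pipeline = new Pipeline().setStages(Array[PipelineStage](lr, tokenizer))

    // evaluator: MulticlassClassificationEvaluator (default metric)
    val evaluator = new MulticlassClassificationEvaluator()

    // tuner: TrainValidationSplit with TrainRatio 0.8 and an empty parameter grid
    new TrainValidationSplit()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(new ParamGridBuilder().build())
      .setTrainRatio(0.8)
  }
}
```

Calling `buildTuner().fit(dataset)` on a prepared `Dataset` would then train and validate the pipeline; the YAML form expresses the same configuration without any of this wiring.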

**For more details and documentation you can refer to the Spark [MLlib](https://spark.apache.org/docs/3.1.2/ml-guide.html) documentation**



## *SparkML available components in CaraML*

This section lists all the SparkML components that you can use with the CaraML framework

### *Models*

* **Classification**

- LogisticRegression [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#logistic-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/LogisticRegression.html)
- DecisionTreeClassifier [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#decision-tree-classifier) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.html)
- GBTClassifier (Gradient-boosted tree classifier) [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#gradient-boosted-tree-classifier) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/GBTClassifier.html)
- NaiveBayes [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#naive-bayes) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/NaiveBayes.html)
- RandomForestClassifier [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#random-forest-classifier) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/RandomForestClassifier.html)

* **Regression**

- LinearRegression [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#linear-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/LinearRegression.html)
- DecisionTreeRegressor [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#decision-tree-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.html)
- RandomForestRegressor [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#random-forest-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html)
- GBTRegressor (Gradient-boosted tree Regressor) [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#gradient-boosted-tree-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/GBTRegressor.html)


* **Clustering**

- K-means [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-clustering.html#k-means) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/clustering/KMeans.html)
- LDA (Latent Dirichlet allocation) [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-clustering.html#latent-dirichlet-allocation-lda) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/clustering/LDA.html)

### *Dataset operation*

- Binarizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#binarizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Binarizer.html)
- BucketedRandomProjectionLSH [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#bucketed-random-projection-for-euclidean-distance) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html)
- Bucketizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#bucketizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Bucketizer.html)
- ChiSqSelector [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#chisqselector) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/ChiSqSelector.html)
- CountVectorizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#countvectorizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/CountVectorizer.html)
- HashingTF [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tf-idf) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/HashingTF.html)
- IDF [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tf-idf) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/IDF.html)
- RegexTokenizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tokenizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html)
- Tokenizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tokenizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Tokenizer.html)
- Word2Vec [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#word2vec) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Word2Vec.html)



### *Tuner*

- CrossValidator [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-tuning.html#cross-validation) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/tuning/CrossValidator.html)
- TrainValidationSplit [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-tuning.html#train-validation-split) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html)


### *Evaluator*

- RegressionEvaluator [Documentation](https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html)
- MulticlassClassificationEvaluator [Documentation](https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html)


## *CaraML schema*
![schema](PA.PNG?raw=true)


## *Example*

For a practical example, refer to this [link](https://github.com/jsarni/CaraMLTest), a GitHub project that uses the CaraML framework.

24 changes: 20 additions & 4 deletions build.sbt
@@ -7,22 +7,38 @@ version := "1.0.0"
organization := "io.github.jsarni"
homepage := Some(url("https://github.com/jsarni/CaraML"))
scmInfo := Some(ScmInfo(url("https://github.com/jsarni/CaraML"), "git@github.com:jsarni/CaraML.git"))
developers :=
List(
developers := List(
Developer("Juba", "SARNI", "juba.sarni@gmail.com", url("https://github.com/jsarni")),
Developer("Merzouk", "OUMEDDAH", "merzoukoumeddah@gmail.com ", url("https://github.com/merzouk13")),
Developer("Aghylas", "SAI", "aghilassai@gmail.com", url("https://github.com/SAI-Aghylas"))
)
)
licenses += ("Apache-2.0", url("http://www.apache.org/licenses/LICENSE-2.0"))
publishMavenStyle := true

publishTo := Some(
if (isSnapshot.value)
"Sonatype Snapshots Nexus" at "https://s01.oss.sonatype.org/content/repositories/snapshots"
else
"Sonatype Releases Nexus" at "https://s01.oss.sonatype.org/content/repositories/releases"
)

// Dependencies
val scalaTest = "org.scalatest" %% "scalatest" % "3.2.7" % Test
val mockito = "org.mockito" %% "mockito-scala" % "1.16.37" % Test
val spark = "org.apache.spark" %% "spark-mllib" % "3.1.1"
val snakeYaml = "org.yaml" % "snakeyaml" % "1.28"
val jacksonCore = "com.fasterxml.jackson.core" % "jackson-core" % "2.10.5"
val jacksonDataformat = "com.fasterxml.jackson.dataformat" % "jackson-dataformat-yaml" % "2.10.5"
val jacksonAnnotation = "com.fasterxml.jackson.core" % "jackson-annotations" % "2.10.5"

lazy val caraML = (project in file("."))
.settings(
name := "CaraML",
libraryDependencies += scalaTest,
libraryDependencies += mockito,
libraryDependencies += spark
libraryDependencies += spark,
libraryDependencies += snakeYaml,
libraryDependencies += jacksonCore,
libraryDependencies += jacksonDataformat,
libraryDependencies += jacksonAnnotation
)
2 changes: 2 additions & 0 deletions project/plugins.sbt
@@ -0,0 +1,2 @@
addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "2.3")
addSbtPlugin("com.jsuereth" % "sbt-pgp" % "1.1.1")
3 changes: 3 additions & 0 deletions src/main/resources/body_part1.txt
@@ -0,0 +1,3 @@
<div id="Stage Name/ Model Name Gray" style="width: 1000px; margin: 0 auto;" class="shadow p-1 mb-1 bg-body">
<div class=" p-3 mb-2 bg-secondary text-white" >
<p style="color: #FFFFFF; height: 10px ; "class="text-center"><b> <!--Block1 END-->
11 changes: 11 additions & 0 deletions src/main/resources/body_part2.txt
@@ -0,0 +1,11 @@
<!-- Put Model Name--> <!--Block2 BEGIN--></b>
<table class ="table table-striped " style=" width: 1000px; margin: 0 auto;">
<thead class="thead-blue" >
<tr style="height: 18px;">
<th scope="col"> </th>
<th scope="col" class="text-center" style="font-size: 25px;"> Metric</th>
<th scope="col" class="text-center" style="font-size: 25px;"> Value </th>
</tr>
</thead>
<tbody>
<!--Block2 END-->
5 changes: 5 additions & 0 deletions src/main/resources/body_part3.txt
@@ -0,0 +1,5 @@
<!--BEGIN 2ND LOOP-->
<!--Block1 BEGIN-->
<tr style="height: 18px;">
<th scope="row"></th>
<td> <!--Block1 END -->
Binary file added src/main/resources/caraML_logo_200x100.png
35 changes: 35 additions & 0 deletions src/main/resources/header.txt
@@ -0,0 +1,35 @@
<!doctype html>
<html lang="fr">

<head>

<title>CaraML Report</title>

<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">

</head>

<body>

<figure class="figure">
<p> </p>
</figure>

<div style="font-family: Lucida Sans Unicode, Lucida Grande, sans-serif; color: #5e9ca0;">
<div class="shadow-sm p-1 mb-5 bg-white rounded">
<h1 >CaraML : Pipeline Metrics Report</h1>
</div>
</div>

<p class="text-center shadow-sm p-1 mb-1 bg-white rounded" >This is the Fit report of your PipelineModel&nbsp; &nbsp;&nbsp;</p>

<h2 class="text-center shadow-sm p-1 mb-1 bg-white rounded" style="color: #2e6c80;">Summary</h2>

<div id="Models and Metrics BLUE" class=" p-3 mb-2 bg-info text-white" >
<p style="color: #FFFFFF; font-size: 25px;"class="text-center">Models and Metrics </p>
</div>
<!-- BEGIN 1ST LOOP FROM HERE -->
<!--Block1 BEGIN-->
