Release CaraML - 1.0.0 #29

Merged
merged 66 commits into from
Jul 3, 2021
66 commits
8a7abbc
Added Unit Tests trait
jsarni Apr 10, 2021
6d3e447
Added Unit Tests trait (#1)
jsarni Apr 10, 2021
d26488b
Started work on YAML Parser
jsarni Apr 10, 2021
d53b246
Merge remote-tracking branch 'origin/feature/unit_test_package' into …
jsarni Apr 10, 2021
cc4f6ca
WIP - Yaml parser
jsarni Apr 12, 2021
811afcb
extract stages from yamls
jsarni Apr 17, 2021
827eb88
Merge remote-tracking branch 'origin/feature/yaml_parser' into featur…
Apr 17, 2021
c1ebe7d
development of parser
jsarni Apr 23, 2021
fc2181d
TAG: Parser with annotations
jsarni Apr 24, 2021
74d42f9
yaml parser stages creation
jsarni Apr 24, 2021
74cb16b
cleaned dataset stage
jsarni Apr 24, 2021
8897a7a
Merge remote-tracking branch 'origin/feature/yaml_parser' into featur…
Apr 24, 2021
2df19a2
fixed cara parser file
jsarni Apr 24, 2021
14218fb
Add LogisticRegression Class with building lr model
Apr 26, 2021
ce00929
Finalized LogisticRegression Class and move GetMethode function to th…
May 1, 2021
ec98e61
Merge branch 'feature/model_schema' into feature/yaml_parser
jsarni May 2, 2021
e6966a6
Merged LogisticRegressionStage and cleaned
jsarni May 2, 2021
d356c25
Buildin spark ml pipelines and unit tests
jsarni May 2, 2021
7822b57
Refactored CaraParser and added parse method + updated tests
jsarni May 2, 2021
692097e
yaml_parser: update unit tests for CaraYaml
jsarni May 6, 2021
c879a20
Updated CaraParser adding try, update tests
jsarni May 9, 2021
cf1852b
Merge pull request #2 from jsarni/feature/yaml_parser
SAI-Aghylas May 9, 2021
8ec7732
Feature/model schema (#3)
merzouk13 May 15, 2021
74fe3e2
Started model training
jsarni May 16, 2021
269fd6c
Added Evaluator parser
jsarni May 16, 2021
90bba10
Evolution of parser
jsarni May 23, 2021
419adb1
Merge branch 'feature/yaml_parser' into feature/cara_pipeline
jsarni May 23, 2021
3968405
Feature/dataset parser (#6)
SAI-Aghylas May 25, 2021
8b661fb
Feature/yaml parser (#7)
jsarni May 25, 2021
532cba8
Merge branch 'develop' into feature/cara_pipeline
jsarni May 25, 2021
b6bd2e1
Feature/yaml parser (#8)
jsarni May 25, 2021
709100f
Merge branch 'develop' into feature/cara_pipeline
jsarni May 25, 2021
cbcdd91
Feature/yaml parser (#9)
jsarni May 25, 2021
81edac1
Merge branch 'develop' into feature/cara_pipeline
jsarni May 25, 2021
d19488a
skeleton for CaraModel
jsarni May 25, 2021
9312f25
Renamed CaraYaml class to CaraYaml Reader
jsarni May 26, 2021
e9c9a83
Created CaraModel Pipeline skeleton for train
jsarni May 26, 2021
3a03728
first commit branch
May 26, 2021
76672d3
finish generateModel method and add CaraModelTest class
May 26, 2021
8f830ae
review cara_pipine_model test
May 27, 2021
8b69ce8
Feature/cara pipeline model (#10)
merzouk13 May 27, 2021
0f90ca1
Changed datasetPath to dataset itself
jsarni Jun 5, 2021
b0c71fa
finilize class LinearRegression plus tests (#11)
merzouk13 Jun 5, 2021
a464225
Merge branch 'develop' into feature/cara_pipeline
jsarni Jun 5, 2021
00263b0
Added evaluation method
jsarni Jun 5, 2021
a532e5c
updated cara model
jsarni Jun 18, 2021
36f534b
Merge branch 'feature/cara_pipeline_model' into feature/cara_pipeline
jsarni Jun 18, 2021
f1e469a
Merge pull request #12 from jsarni/feature/cara_pipeline
merzouk13 Jun 18, 2021
93aeef5
Feature/model schema (#13)
merzouk13 Jun 18, 2021
05e6c6e
Feature/model schema (#15)
merzouk13 Jun 27, 2021
45e706b
added MulticlassClassificationEvaluator (#16)
jsarni Jun 27, 2021
e168d09
Overwrite save
jsarni Jun 27, 2021
72a90bd
Merge pull request #17 from jsarni/fix/overwrite_save
merzouk13 Jun 27, 2021
80605d7
Fixed the case where no tuner is specified
jsarni Jun 27, 2021
d16a9bf
Merge pull request #18 from jsarni/fix/parser_tuner
merzouk13 Jun 27, 2021
1186b17
Removed sparksession from CaraModel
jsarni Jun 27, 2021
84100c5
Feature/model schema (#19)
merzouk13 Jun 30, 2021
e527d76
Global refactoring (#20)
jsarni Jul 1, 2021
b4ad04e
Publish to repository
jsarni Jul 1, 2021
d791eeb
Merge pull request #21 from jsarni/feature/publish_artefact
merzouk13 Jul 2, 2021
4c3bbcf
Feature/readme documentation (#22)
merzouk13 Jul 2, 2021
c8c43a8
fix ReadMe (#23)
merzouk13 Jul 2, 2021
ed5918f
Fix readme requirements (#24)
merzouk13 Jul 3, 2021
35f20cf
Changed build version for release
jsarni Jul 3, 2021
e1147b2
Merge pull request #26 from jsarni/prepare_release
merzouk13 Jul 3, 2021
7df7a97
Feature/generate report (#28)
SAI-Aghylas Jul 3, 2021
Binary file added PA.PNG
186 changes: 185 additions & 1 deletion README.md
@@ -1 +1,185 @@
# CaraML
# CaraML

## *Presentation*

CaraML is a Scala/Apache Spark framework for distributed machine learning programs that uses Apache Spark MLlib in the simplest possible way. There is no need to write hundreds or thousands of lines of code: you only describe a pipeline of models and/or transformations. The goal is "Machine Learning as Code".



## *Requirements*

To use the CaraML framework, you must satisfy the following requirements:

- Scala version >= 2.12.13
- Spark version >= 3.1
- Java 11



## *Installation*

- Spark : [Download here](https://spark.apache.org/downloads.html)
- Scala : [Download here](https://www.scala-lang.org/download/)
- CaraML library : [CaraML](https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/jsarni/caraml_2.12/1.0.0-SNAPSHOT/)



## *Usage*

To use CaraML, add the framework dependency to your Spark application:

- Sbt

```scala
libraryDependencies += "io.github.jsarni" %% "caraml" % "1.0.0"
```

- Gradle

```groovy
implementation group: 'io.github.jsarni', name: 'caraml_2.12', version: '1.0.0'
```
- Maven

```xml
<dependency>
  <groupId>io.github.jsarni</groupId>
  <artifactId>caraml_2.12</artifactId>
  <version>1.0.0</version>
</dependency>
```

CaraML needs the following information:

- A prepared dataset that will be transformed and used to train the models
- The path where the final trained model and its metrics will be saved
- The path of the CaraYaml file, in which the user declares and configures the pipeline stages (SparkML models and/or SparkML transformations)

The YAML file describes a pipeline of stages; each stage is either a SparkML model or a SparkML data preprocessing method.
Every CaraYaml file must start with the "CaraPipeline:" keyword and can contain the following keywords:

### *CaraPipeline*
* **"CaraPipeline:"** : keyword that must appear at the beginning of every CaraYaml file
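
As a rough illustration of that layout, the sketch below reads a CaraYaml document with Jackson's YAML data format module and lists the top-level entries it finds under `CaraPipeline:`. It is not CaraML's own parser; the names `CaraYamlSketch`, `example` and `topLevelEntries` are hypothetical and exist only for this example.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory
import scala.collection.JavaConverters._

// Hypothetical helper, not part of CaraML: it only shows the expected file layout.
object CaraYamlSketch {
  private val mapper = new ObjectMapper(new YAMLFactory())

  // A small CaraYaml document used for illustration.
  val example: String =
    """CaraPipeline:
      |- stage: LogisticRegression
      |  params:
      |  - MaxIter: 5
      |- stage: Tokenizer
      |- evaluator: MulticlassClassificationEvaluator
      |- tuner: TrainValidationSplit
      |  params:
      |  - TrainRatio: 0.8
      |""".stripMargin

  // Returns (keyword, component name) pairs in file order.
  def topLevelEntries(yaml: String): List[(String, String)] = {
    val pipeline = Option(mapper.readTree(yaml))
      .flatMap(root => Option(root.get("CaraPipeline")))
      .filter(_.isArray)
      .getOrElse(throw new IllegalArgumentException("File must start with CaraPipeline:"))

    pipeline.elements().asScala.toList.flatMap { entry =>
      // Each list item is expected to carry exactly one of these keywords.
      Seq("stage", "evaluator", "tuner").collectFirst {
        case key if entry.has(key) => key -> entry.get(key).asText()
      }
    }
  }
}
```

Calling `CaraYamlSketch.topLevelEntries(CaraYamlSketch.example)` yields one pair per list item, which makes it easy to check that a file declares the stages, evaluator and tuner you expect.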


### *Stage*
* **"- stage:"** : keyword used to declare and describe a stage. A stage is either an Estimator or a Transformer:
  * **SparkML Estimator** : the name of the SparkML model that you want to use in the stage.
  * **SparkML Transformer** : the name of the SparkML feature transformation that you want to apply to your dataset (preprocessing).


Each stage is followed by the "params:" keyword, which contains one or more parameters/hyperparameters of the stage and their values.

```yaml
params:
- "Param1 name" : "Param value"
- "Param2 name" : "Param value"
- ....
- "Paramn name" : "Param value"
```
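
How CaraML turns these parameter names into calls on the underlying Spark stage is not described here. One plausible approach, shown purely as an assumption, is reflection: a name such as `MaxIter` is mapped to a setter called `setMaxIter`, and the raw value is converted to the setter's argument type. `ParamSetterSketch`, `ToyStage` and `applyParam` are hypothetical names; `ToyStage` stands in for a real Spark ML stage so the sketch runs without Spark.

```scala
// Assumption-based sketch: maps "MaxIter" -> setMaxIter via Java reflection.
object ParamSetterSketch {

  // Hypothetical stand-in for a Spark ML stage exposing setX-style setters.
  class ToyStage {
    var maxIter: Int = 100
    var regParam: Double = 0.0
    def setMaxIter(v: Int): this.type = { maxIter = v; this }
    def setRegParam(v: Double): this.type = { regParam = v; this }
  }

  // Finds the one-argument setter for `name` and invokes it with the converted value.
  def applyParam(stage: AnyRef, name: String, raw: String): Unit = {
    val setterName = s"set$name"
    val method = stage.getClass.getMethods
      .find(m => m.getName == setterName && m.getParameterCount == 1)
      .getOrElse(throw new IllegalArgumentException(s"Unknown parameter: $name"))

    // Convert the raw YAML value to the type the setter expects.
    val arg: AnyRef = method.getParameterTypes.head match {
      case t if t == classOf[Int]     => Int.box(raw.toInt)
      case t if t == classOf[Double]  => Double.box(raw.toDouble)
      case t if t == classOf[Boolean] => Boolean.box(raw.toBoolean)
      case _                          => raw
    }
    method.invoke(stage, arg)
  }
}
```

With this sketch, `applyParam(stage, "MaxIter", "5")` has the same effect as calling `stage.setMaxIter(5)` directly, and an unknown name fails fast with an `IllegalArgumentException`.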

### *Evaluator*
* **"- evaluator:"** : keyword used to declare the evaluator, which evaluates the model output and returns scalar metrics


### *Tuner*
* **"- tuner:"** : keyword used to declare the tuner, which optimizes the hyperparameters of algorithms and Pipelines

Each tuner is followed by the "params:" keyword, which contains one or more parameters/hyperparameters of the tuner and their values.

```yaml
params:
- "Param1 name" : "Param value"
- "Param2 name" : "Param value"
- ....
- "Paramn name" : "Param value"
```

### **CaraYaml example**
```yaml
CaraPipeline:
- stage: LogisticRegression
  params:
  - MaxIter: 5
  - RegParam: 0.3
  - ElasticNetParam: 0.8
- stage: Tokenizer
  params:
  - InputCol: Input
  - OutputCol: ResCol
- evaluator: MulticlassClassificationEvaluator
- tuner: TrainValidationSplit
  params:
  - TrainRatio: 0.8
```
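
For comparison, here is roughly what the same configuration looks like when written by hand against the Spark ML API. This is a sketch of the equivalent plain Spark code, not of what CaraML generates internally; the object name `EquivalentSparkPipeline` is hypothetical, and the empty parameter grid is an assumption because the YAML above declares none. It needs `spark-mllib` on the classpath and only builds the objects, so no `SparkSession` is required until `fit` is called.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// Hand-written counterpart of the CaraYaml example (illustrative only).
object EquivalentSparkPipeline {
  // "- stage: LogisticRegression" with its three params
  val lr: LogisticRegression = new LogisticRegression()
    .setMaxIter(5)
    .setRegParam(0.3)
    .setElasticNetParam(0.8)

  // "- stage: Tokenizer" with its input and output columns
  val tokenizer: Tokenizer = new Tokenizer()
    .setInputCol("Input")
    .setOutputCol("ResCol")

  // Stages are kept in the order in which the YAML declares them.
  val pipeline: Pipeline = new Pipeline()
    .setStages(Array[PipelineStage](lr, tokenizer))

  // "- evaluator: MulticlassClassificationEvaluator"
  val evaluator = new MulticlassClassificationEvaluator()

  // "- tuner: TrainValidationSplit" with TrainRatio 0.8 and an empty grid (assumption)
  val tuner: TrainValidationSplit = new TrainValidationSplit()
    .setEstimator(pipeline)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(new ParamGridBuilder().build())
    .setTrainRatio(0.8)
}
```

Training would then be `EquivalentSparkPipeline.tuner.fit(dataset)` on a prepared DataFrame; the YAML form replaces all of the setter calls above with declarative entries.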

**For more details and documentation you can refer to the Spark [MLlib](https://spark.apache.org/docs/3.1.2/ml-guide.html) documentation**



## *SparkML available components in CaraML*

This section lists all the SparkML components that you can use with the CaraML framework.

### *Models*

* **Classification**

- LogisticRegression [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#logistic-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/LogisticRegression.html)
- DecisionTreeClassifier [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#decision-tree-classifier) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.html)
- GBTClassifier (Gradient-boosted tree classifier) [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#gradient-boosted-tree-classifier) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/GBTClassifier.html)
- NaiveBayes [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#naive-bayes) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/NaiveBayes.html)
- RandomForestClassifier [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#random-forest-classifier) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/RandomForestClassifier.html)

* **Regression**

- LinearRegression [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#linear-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/LinearRegression.html)
- DecisionTreeRegressor [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#decision-tree-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.html)
- RandomForestRegressor [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#random-forest-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html)
- GBTRegressor (Gradient-boosted tree regressor) [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#gradient-boosted-tree-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/GBTRegressor.html)


* **Clustering**

- K-means [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-clustering.html#k-means) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/clustering/KMeans.html)
- LDA (Latent Dirichlet allocation) [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-clustering.html#latent-dirichlet-allocation-lda) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/clustering/LDA.html)

### *Dataset operation*

- Binarizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#binarizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Binarizer.html)
- BucketedRandomProjectionLSH [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#bucketed-random-projection-for-euclidean-distance) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html)
- Bucketizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#bucketizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Bucketizer.html)
- ChiSqSelector [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#chisqselector) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/ChiSqSelector.html)
- CountVectorizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#countvectorizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/CountVectorizer.html)
- HashingTF [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tf-idf) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/HashingTF.html)
- IDF [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tf-idf) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/IDF.html)
- RegexTokenizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tokenizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html)
- Tokenizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tokenizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Tokenizer.html)
- Word2Vec [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#word2vec) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Word2Vec.html)



### *Tuner*

- CrossValidator [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-tuning.html#cross-validation) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/tuning/CrossValidator.html)
- TrainValidationSplit [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-tuning.html#train-validation-split) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html)


### *Evaluator*

- RegressionEvaluator [Documentation](https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html)
- MulticlassClassificationEvaluator [Documentation](https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html)


## *CaraML schema*
![schema](PA.PNG?raw=true)


## *Example*

For a practical example, see [CaraMLTest](https://github.com/jsarni/CaraMLTest), a GitHub project that uses the CaraML framework.

24 changes: 20 additions & 4 deletions build.sbt
@@ -7,22 +7,38 @@ version := "1.0.0"
organization := "io.github.jsarni"
homepage := Some(url("https://github.com/jsarni/CaraML"))
scmInfo := Some(ScmInfo(url("https://github.com/jsarni/CaraML"), "git@github.com:jsarni/CaraML.git"))
developers :=
List(
developers := List(
Developer("Juba", "SARNI", "juba.sarni@gmail.com", url("https://github.com/jsarni")),
Developer("Merzouk", "OUMEDDAH", "merzoukoumeddah@gmail.com ", url("https://github.com/merzouk13")),
Developer("Aghylas", "SAI", "aghilassai@gmail.com", url("https://github.com/SAI-Aghylas"))
)
)
licenses += ("Apache-2.0", url("http://www.apache.org/licenses/LICENSE-2.0"))
publishMavenStyle := true

publishTo := Some(
if (isSnapshot.value)
"Sonatype Snapshots Nexus" at "https://s01.oss.sonatype.org/content/repositories/snapshots"
else
"Sonatype Releases Nexus" at "https://s01.oss.sonatype.org/content/repositories/releases"
)

// Dependencies
val scalaTest = "org.scalatest" %% "scalatest" % "3.2.7" % Test
val mockito = "org.mockito" %% "mockito-scala" % "1.16.37" % Test
val spark = "org.apache.spark" %% "spark-mllib" % "3.1.1"
val snakeYaml = "org.yaml" % "snakeyaml" % "1.28"
val jacksonCore = "com.fasterxml.jackson.core" % "jackson-core" % "2.10.5"
val jacksonDataformat = "com.fasterxml.jackson.dataformat" % "jackson-dataformat-yaml" % "2.10.5"
val jacksonAnnotation = "com.fasterxml.jackson.core" % "jackson-annotations" % "2.10.5"

lazy val caraML = (project in file("."))
.settings(
name := "CaraML",
libraryDependencies += scalaTest,
libraryDependencies += mockito,
libraryDependencies += spark
libraryDependencies += spark,
libraryDependencies += snakeYaml,
libraryDependencies += jacksonCore,
libraryDependencies += jacksonDataformat,
libraryDependencies += jacksonAnnotation
)
2 changes: 2 additions & 0 deletions project/plugins.sbt
@@ -0,0 +1,2 @@
addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "2.3")
addSbtPlugin("com.jsuereth" % "sbt-pgp" % "1.1.1")
3 changes: 3 additions & 0 deletions src/main/resources/body_part1.txt
@@ -0,0 +1,3 @@
<div id="Stage Name/ Model Name Gray" style="width: 1000px; margin: 0 auto;" class="shadow p-1 mb-1 bg-body">
<div class=" p-3 mb-2 bg-secondary text-white" >
<p style="color: #FFFFFF; height: 10px ; "class="text-center"><b> <!--Block1 END-->
11 changes: 11 additions & 0 deletions src/main/resources/body_part2.txt
@@ -0,0 +1,11 @@
<!-- Put Model Name--> <!--Block2 BEGIN--></b>
<table class ="table table-striped " style=" width: 1000px; margin: 0 auto;">
<thead class="thead-blue" >
<tr style="height: 18px;">
<th scope="col"> </th>
<th scope="col" class="text-center" style="font-size: 25px;"> Metric</th>
<th scope="col" class="text-center" style="font-size: 25px;"> Value </th>
</tr>
</thead>
<tbody>
<!--Block2 END-->
5 changes: 5 additions & 0 deletions src/main/resources/body_part3.txt
@@ -0,0 +1,5 @@
<!--BEGIN 2ND LOOP-->
<!--Block1 BEGIN-->
<tr style="height: 18px;">
<th scope="row"></th>
<td> <!--Block1 END -->
Binary file added src/main/resources/caraML_logo_200x100.png
35 changes: 35 additions & 0 deletions src/main/resources/header.txt
@@ -0,0 +1,35 @@
<!doctype html>
<html lang="fr">

<head>

<title>CaraML Report</title>

<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">

<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">

</head>

<body>

<figure class="figure">
<p> </p>
</figure>

<div style="font-family: Lucida Sans Unicode, Lucida Grande, sans-serif; color: #5e9ca0;">
<div class="shadow-sm p-1 mb-5 bg-white rounded">
<h1 >CaraML : Pipeline Metrics Report</h1>
</div>
</div>

<p class="text-center shadow-sm p-1 mb-1 bg-white rounded" >This is the Fit report of your PipelineModel&nbsp; &nbsp;&nbsp;</p>

<h2 class="text-center shadow-sm p-1 mb-1 bg-white rounded" style="color: #2e6c80;">Summary</h2>

<div id="Models and Metrics BLUE" class=" p-3 mb-2 bg-info text-white" >
<p style="color: #FFFFFF; font-size: 25px;"class="text-center">Models and Metrics </p>
</div>
<!-- BEGIN 1ST LOOP FROM HERE -->
<!--Block1 BEGIN-->