jsarni · SAI-Aghylas · Jul 3, 2021 · Apr 10, 2021 · Apr 10, 2021 · Apr 10, 2021
diff --git a/PA.PNG b/PA.PNG
diff --git a/README.md b/README.md
@@ -1 +1,185 @@
-# CaraML
+# CaraML
+
+## *Presentation*
+
+CaraML is a Scala/Apache Spark framework for distributed Machine Learning programs that uses Apache Spark MLlib in the simplest possible way. There is no need to write hundreds or thousands of lines of code: you just describe a pipeline of models and/or transformations. The purpose is to do "Machine Learning as Code".
+
+
+
+## *Requirements*
+
+To use the CaraML framework, you must satisfy the following requirements:
+
+- Scala version >= 2.12.13
+- Spark version >= 3.1
+- Java 11
+
+
+
+## *Installation* 
+
+  - Spark :  [Download here](https://spark.apache.org/downloads.html) 
+  - Scala : [Download here](https://www.scala-lang.org/download/)
+  - CaraML library : [CaraML](https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/jsarni/caraml_2.12/1.0.0-SNAPSHOT/)
+
+
+
+## *Usage*
+
+To use CaraML, add the framework dependency to your Spark application:
+
+- Sbt
+
+```scala
+  libraryDependencies += "io.github.jsarni" %% "caraml" % "1.0.0"
+```
+
+- Gradle
+
+```groovy
+    compile group: 'io.github.jsarni', name: 'caraml_2.12', version: '1.0.0'
+```
+- Maven
+
+```xml
+    <dependency>
+      <groupId>io.github.jsarni</groupId>
+      <artifactId>caraml_2.12</artifactId>
+      <version>1.0.0</version>
+    </dependency>
+```
+
+CaraML needs the following information:
+
+- A prepared dataset that will be used to apply transformations and train models
+- The path where the final trained model and its metrics will be saved
+- The path of the CaraYaml file, in which the user declares and configures the pipeline as stages of SparkML models and/or SparkML transformations
+
+The YAML file describes a pipeline of stages; each stage can be a SparkML model or a SparkML data preprocessing method.
+All CaraYaml files must start with the "CaraPipeline:" keyword and can contain the following keywords:
+
+### *CaraPipeline*
+* **"CaraPipeline:"** : keyword that must be set at the beginning of each CaraYaml file
+
+
+### *Stage*
+* **"- stage:"** is a keyword used to declare and describe a stage. A stage can be an Estimator or a Transformer:
+  * **SparkML Estimator** : the name of the SparkML model that you want to use in the stage.
+  * **SparkML Transformer** : the name of the SparkML feature transformation that you want to apply to your dataset (preprocessing).
+
+
+Each stage is followed by the "params:" keyword, which contains one or more parameters/hyperparameters of the stage and their values.
+
+```yaml
+    params:
+        - "Param1 name" : "Param value"
+        - "Param2 name" : "Param value"
+        - ....
+        - "Paramn name" : "Param value"
+```
+
+### *Evaluator*
+* **"- evaluator:"** declares the evaluator, which evaluates the model output and returns scalar metrics
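+
+To make "returns scalar metrics" concrete, here is a minimal sketch of what a SparkML evaluator does when used directly. It is not CaraML's own code: the object name `EvaluatorSketch`, the toy label/prediction values and the choice of the `accuracy` metric are assumptions made for illustration.
+
+```scala
+import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
+import org.apache.spark.sql.SparkSession
+
+// Hypothetical stand-alone example, independent of CaraML
+object EvaluatorSketch {
+  // Evaluates a tiny hand-made set of predictions and returns one scalar metric
+  def accuracy(spark: SparkSession): Double = {
+    import spark.implicits._
+
+    // Toy (label, prediction) pairs standing in for a model's output
+    val predictions = Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0))
+      .toDF("label", "prediction")
+
+    val evaluator = new MulticlassClassificationEvaluator()
+      .setLabelCol("label")
+      .setPredictionCol("prediction")
+      .setMetricName("accuracy")
+
+    evaluator.evaluate(predictions)
+  }
+
+  def main(args: Array[String]): Unit = {
+    val spark = SparkSession.builder()
+      .master("local[1]")
+      .appName("evaluator-sketch")
+      .getOrCreate()
+    println(s"accuracy = ${accuracy(spark)}")
+    spark.stop()
+  }
+}
+```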
+
+
+### *Tuner*
+* **"- tuner:"** declares the tuner, which is used to tune ML algorithms by optimizing the hyperparameters of algorithms and Pipelines
+
+Each tuner is followed by the "params:" keyword, which contains one or more parameters/hyperparameters of the tuner and their values.
+
+```yaml
+    params:
+        - "Param1 name" : "Param value"
+        - "Param2 name" : "Param value"
+        - ....
+        - "Paramn name" : "Param value"
+```
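+
+For readers unfamiliar with SparkML tuning, the sketch below shows what a tuner does in plain Spark MLlib: it tries several hyperparameter combinations and keeps the best one according to an evaluator. The object name `TunerSketch` and the grid values are illustrative assumptions, and whether a hyperparameter grid can be declared in a CaraYaml file is not covered here.
+
+```scala
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
+import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
+
+// Hypothetical stand-alone example, independent of CaraML
+object TunerSketch {
+  val lr = new LogisticRegression().setMaxIter(5)
+
+  // Candidate hyperparameter values to search over (illustrative values)
+  val paramGrid = new ParamGridBuilder()
+    .addGrid(lr.regParam, Array(0.1, 0.3))
+    .addGrid(lr.elasticNetParam, Array(0.0, 0.8))
+    .build()
+
+  // 3-fold cross-validation over every combination in the grid
+  val tuner = new CrossValidator()
+    .setEstimator(lr)
+    .setEvaluator(new MulticlassClassificationEvaluator())
+    .setEstimatorParamMaps(paramGrid)
+    .setNumFolds(3)
+}
+```
+
+Calling `TunerSketch.tuner.fit(dataset)` on a prepared DataFrame with `features` and `label` columns would train one model per combination and fold, then retain the best-performing parameters.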
+
+### **CaraYaml example**
+```yaml
+CaraPipeline:
+- stage: LogisticRegression
+  params:
+    - MaxIter: 5
+    - RegParam: 0.3
+    - ElasticNetParam: 0.8
+- stage: Tokenizer
+  params:
+    - InputCol: Input
+    - OutputCol: ResCol
+
+
+- evaluator: MulticlassClassificationEvaluator
+- tuner: TrainValidationSplit
+  params:
+    - TrainRatio: 0.8
+
+
+```
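+
+As a rough illustration, the CaraYaml example above might correspond to hand-written Spark MLlib code along the following lines. This is only a sketch under assumptions: the object name `CaraYamlEquivalent` is hypothetical, the empty parameter grid is a placeholder, and the way CaraML actually builds the pipeline internally may differ.
+
+```scala
+import org.apache.spark.ml.{Pipeline, PipelineStage}
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
+import org.apache.spark.ml.feature.Tokenizer
+import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
+
+// Hypothetical hand-written counterpart of the CaraYaml example (not CaraML's own code)
+object CaraYamlEquivalent {
+  // "- stage: LogisticRegression" with its three params
+  val logisticRegression = new LogisticRegression()
+    .setMaxIter(5)
+    .setRegParam(0.3)
+    .setElasticNetParam(0.8)
+
+  // "- stage: Tokenizer" with its two params
+  val tokenizer = new Tokenizer()
+    .setInputCol("Input")
+    .setOutputCol("ResCol")
+
+  // Stages are kept in the order in which they are declared in the YAML file
+  val pipeline = new Pipeline()
+    .setStages(Array[PipelineStage](logisticRegression, tokenizer))
+
+  // "- evaluator: MulticlassClassificationEvaluator"
+  val evaluator = new MulticlassClassificationEvaluator()
+
+  // "- tuner: TrainValidationSplit" with TrainRatio 0.8.
+  // The empty grid means no hyperparameter search (an assumption of this sketch).
+  val tuner = new TrainValidationSplit()
+    .setEstimator(pipeline)
+    .setEvaluator(evaluator)
+    .setEstimatorParamMaps(new ParamGridBuilder().build())
+    .setTrainRatio(0.8)
+}
+```
+
+Each YAML parameter name (`MaxIter`, `InputCol`, `TrainRatio`, ...) lines up with the SparkML setter of the same name prefixed with `set`, which is why the Spark MLlib documentation is the reference for the available parameters.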
+
+**For more details and documentation you can refer to the Spark [MLlib](https://spark.apache.org/docs/3.1.2/ml-guide.html) documentation**
+
+
+
+## *SparkML available components in CaraML*
+
+This section lists all the SparkML components that you can use with the CaraML framework.
+
+### *Models*
+
+* **Classification**
+
+  - LogisticRegression [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#logistic-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/LogisticRegression.html)
+  - DecisionTreeClassifier [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#decision-tree-classifier) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.html)
+  - GBTClassifier (Gradient-boosted tree classifier) [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#gradient-boosted-tree-classifier) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/GBTClassifier.html)
+  - NaiveBayes [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#naive-bayes) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/NaiveBayes.html)
+  - RandomForestClassifier [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#random-forest-classifier) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/classification/RandomForestClassifier.html)
+
+* **Regression**
+
+  - LinearRegression [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#linear-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/LinearRegression.html)
+  - DecisionTreeRegressor [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#decision-tree-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.html)
+  - RandomForestRegressor [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#random-forest-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html)
+  - GBTRegressor (Gradient-boosted tree Regressor) [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-classification-regression.html#gradient-boosted-tree-regression) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/regression/GBTRegressor.html)
+
+
+* **Clustering**
+
+  - K-means [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-clustering.html#k-means) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/clustering/KMeans.html)
+  - LDA (Latent Dirichlet allocation) [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-clustering.html#latent-dirichlet-allocation-lda) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/clustering/LDA.html)
+
+### *Dataset operation*
+
+- Binarizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#binarizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Binarizer.html)
+- BucketedRandomProjectionLSH [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#bucketed-random-projection-for-euclidean-distance) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html)
+- Bucketizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#bucketizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Bucketizer.html)
+- ChiSqSelector [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#chisqselector) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/ChiSqSelector.html)
+- CountVectorizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#countvectorizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/CountVectorizer.html)
+- HashingTF [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tf-idf) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/HashingTF.html)
+- IDF [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tf-idf) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/IDF.html)
+- RegexTokenizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tokenizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/RegexTokenizer.html)
+- Tokenizer [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#tokenizer) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Tokenizer.html)
+- Word2Vec [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-features.html#word2vec) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/feature/Word2Vec.html)
+
+
+
+### *Tuner*
+
+- CrossValidator [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-tuning.html#cross-validation) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/tuning/CrossValidator.html)
+- TrainValidationSplit [Spark MLlib example](https://spark.apache.org/docs/3.1.2/ml-tuning.html#train-validation-split) and [Documentation](https://spark.apache.org/docs/3.1.2/api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html)
+
+
+### *Evaluator*
+
+- RegressionEvaluator [Documentation](https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html)
+- MulticlassClassificationEvaluator [Documentation](https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html)
+
+
+## *CaraML schema*
+![schema](PA.PNG?raw=true)
+
+
+## *Example*
+
+For a practical example, you can refer to this [Link](https://github.com/jsarni/CaraMLTest), a GitHub project that uses the CaraML framework.
+
diff --git a/build.sbt b/build.sbt
@@ -7,22 +7,38 @@ version := "1.0.0"
 organization := "io.github.jsarni"
 homepage := Some(url("https://github.com/jsarni/CaraML"))
 scmInfo := Some(ScmInfo(url("https://github.com/jsarni/CaraML"), "git@github.com:jsarni/CaraML.git"))
-developers :=
-  List(
+developers := List(
     Developer("Juba", "SARNI", "juba.sarni@gmail.com", url("https://github.com/jsarni")),
     Developer("Merzouk", "OUMEDDAH", "merzoukoumeddah@gmail.com ", url("https://github.com/merzouk13")),
     Developer("Aghylas", "SAI", "aghilassai@gmail.com", url("https://github.com/SAI-Aghylas"))
-  )
+)
+licenses += ("Apache-2.0", url("http://www.apache.org/licenses/LICENSE-2.0"))
+publishMavenStyle := true
+
+publishTo := Some(
+    if (isSnapshot.value)
+      "Sonatype Snapshots Nexus" at "https://s01.oss.sonatype.org/content/repositories/snapshots"
+    else
+      "Sonatype Releases Nexus" at "https://s01.oss.sonatype.org/content/repositories/releases"
+)
 
 // Dependencies
 val scalaTest = "org.scalatest" %% "scalatest" % "3.2.7" % Test
 val mockito = "org.mockito" %% "mockito-scala" % "1.16.37" % Test
 val spark = "org.apache.spark" %% "spark-mllib" % "3.1.1"
+val snakeYaml = "org.yaml" % "snakeyaml" % "1.28"
+val jacksonCore = "com.fasterxml.jackson.core" % "jackson-core" % "2.10.5"
+val jacksonDataformat = "com.fasterxml.jackson.dataformat" % "jackson-dataformat-yaml" % "2.10.5"
+val jacksonAnnotation = "com.fasterxml.jackson.core" % "jackson-annotations" % "2.10.5"
 
 lazy val caraML = (project in file("."))
   .settings(
     name := "CaraML",
     libraryDependencies += scalaTest,
     libraryDependencies += mockito,
-    libraryDependencies += spark
+    libraryDependencies += spark,
+    libraryDependencies += snakeYaml,
+    libraryDependencies += jacksonCore,
+    libraryDependencies += jacksonDataformat,
+    libraryDependencies += jacksonAnnotation
   )
diff --git a/project/plugins.sbt b/project/plugins.sbt
@@ -0,0 +1,2 @@
+addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "2.3")
+addSbtPlugin("com.jsuereth" % "sbt-pgp" % "1.1.1")
diff --git a/src/main/resources/body_part1.txt b/src/main/resources/body_part1.txt
@@ -0,0 +1,3 @@
+<div id="Stage Name/ Model Name Gray" style="width: 1000px; margin: 0 auto;" class="shadow p-1 mb-1 bg-body">
+        <div class=" p-3 mb-2 bg-secondary text-white" >
+          <p style="color: #FFFFFF; height: 10px ; " class="text-center"><b> <!--Block1 END-->
diff --git a/src/main/resources/body_part2.txt b/src/main/resources/body_part2.txt
@@ -0,0 +1,11 @@
+<!-- Put Model Name--> <!--Block2 BEGIN--></b>
+    <table  class ="table table-striped " style=" width: 1000px; margin: 0 auto;">
+      <thead class="thead-blue" >
+        <tr style="height: 18px;">
+          <th scope="col"> </th>
+            <th scope="col" class="text-center" style="font-size: 25px;"> Metric</th>
+            <th scope="col" class="text-center" style="font-size: 25px;"> Value </th>
+        </tr>
+      </thead>
+      <tbody>
+        <!--Block2 END-->
diff --git a/src/main/resources/body_part3.txt b/src/main/resources/body_part3.txt
@@ -0,0 +1,5 @@
+<!--BEGIN 2ND LOOP-->
+        <!--Block1 BEGIN-->
+        <tr style="height: 18px;">
+           <th scope="row"></th>
+              <td> <!--Block1 END -->
diff --git a/src/main/resources/caraML_logo_200x100.png b/src/main/resources/caraML_logo_200x100.png
diff --git a/src/main/resources/header.txt b/src/main/resources/header.txt
@@ -0,0 +1,35 @@
+<!doctype html>
+<html lang="en">
+
+  <head>
+
+    <title>CaraML Report</title>
+
+    <meta charset="utf-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
+
+    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
+
+  </head>
+
+  <body>
+
+    <figure class="figure">
+      <p>     </p>
+    </figure>
+
+    <div style="font-family: Lucida Sans Unicode, Lucida Grande, sans-serif; color: #5e9ca0;">
+      <div class="shadow-sm p-1 mb-5 bg-white rounded">
+        <h1 >CaraML : Pipeline Metrics Report</h1>
+      </div>
+    </div>
+
+    <p class="text-center shadow-sm p-1 mb-1 bg-white rounded" >This is the fit report of your PipelineModel</p>
+
+    <h2 class="text-center shadow-sm p-1 mb-1 bg-white rounded" style="color: #2e6c80;">Summary</h2>
+
+      <div id="Models and Metrics BLUE" class=" p-3 mb-2 bg-info text-white"  >
+        <p style="color: #FFFFFF; font-size: 25px;" class="text-center">Models and Metrics </p>
+      </div>
+      <!-- BEGIN 1ST LOOP FROM HERE -->
+      <!--Block1 BEGIN-->