Pypliner-data-processor

The aim of this project is to bring modularity to the development of data processes built as an execution sequence of scripts.

The original idea was to provide a central script that receives a configuration file as input, from which it can import and execute the different scripts.

The power of this tool lies in the fact that it can inject the result of a given script into a target script as a parameter, driven entirely by a simple JSON configuration file.

Install

First, clone this repository:

git clone https://github.com/jossefaz/pypliner-data-processor.git

Create a new configuration file (.json) following the example below.

    [
        {
            "run_example" : {
                "Tool" : "tool_example",
                "Args" : {
                    "param1" : "This is a test",
                    "param2" : "This is another test"
                }
            },
            "run_example2" : {
                "Tool" : "tool_example",
                "Args" : {
                    "param1" : "This is a test 2",
                    "param2" : "This is another test 2"
                }
            },
            "order" : ["run_example", "run_example2"]
        }
    ]

Now let's explain the different parameters:

The configuration file is basically a list of objects (dictionaries), where each one represents an execution pipeline.

So it begins as a list ([]).
Inside it, we define a global object for each pipeline.
The keys of this object are the names of the processes (feel free to give each one a name that makes explicit what the process aims to achieve).

So far we have

    [
        {
            "run_my_first_script" : {
            }
        }
    ]

Now we need some mandatory keys to indicate which script this process will execute and what its arguments are.
Let's add the Tool and Args keys:

 "run_my_first_script" : {
     "Tool" : "tool_example"
      "Args" : {  
          "param1" : "This is a test",  
          "param2" : "This is another test"  
          }  
  }

The Tool key must be the name of an existing script (that you will write) under the directory Tools -> executables -> <ENVIRONMENT>.
The ENVIRONMENT folder corresponds to the runtime environment (dev, prod or test), which is selected by the --env runtime variable (see the Run section below).
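
For illustration, the resulting layout could look like this (the file name is just an example):

    Tools/
        executables/
            dev/
                tool_example.py
            prod/
            test/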

The tool_example script used above can be found in the Tools -> executables -> dev directory, and it is very basic:


    def main(param1, param2):
        print("param1 is : ", param1)
        print("param2 is : ", param2)
        return param1 + " from main"

As you can see here: the names of the parameters must be the same as those defined in the configuration file. If that is not the case, an ArgumentMissingException is raised; this is a custom exception (see the Exceptions section below).

Each script must have a main function. This function may be parameter-less, but if you add parameters to your main function, their names must match those in the configuration file.
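
As an illustration only, not the project's actual code, here is a minimal sketch of how a runner could import a tool and validate its arguments; the names run_tool and ArgumentMissingException (a stand-in for the project's custom exception) are assumptions:

    import importlib
    import inspect

    class ArgumentMissingException(Exception):
        """Stand-in for the project's custom exception."""

    def run_tool(env, tool_name, args):
        # Import the script from Tools/executables/<ENVIRONMENT>
        module = importlib.import_module(f"Tools.executables.{env}.{tool_name}")
        # Every parameter of the script's main() must appear in the configured Args
        expected = inspect.signature(module.main).parameters
        missing = [name for name in expected if name not in args]
        if missing:
            raise ArgumentMissingException(f"Missing arguments: {missing}")
        return module.main(**args)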

The order list defines the order in which pypliner will execute the scripts. The order in which processes appear in the configuration does not matter; only the order list determines the execution order.

Result injection

In the world of data processing, it is often necessary to link the results of different processes together: for example, the result of process 1 will often be used as a basis for process 2 to run.

Pypliner lets you inject the result of one script into another simply by writing the source process's name in the configuration:

    [
        {
            "run_example1" : {
                "Tool" : "tool_example",
                "Args" : {
                    "param1" : "This is a test",
                    "param2" : "This is another test"
                }
            },
            "run_example2" : {
                "Tool" : "tool_example",
                "Args" : {
                    "param1" : "run_example1",  <<==== here we inject the result of the run_example1 process, defined above, as a parameter of this second script
                    "param2" : "This is another test 2"
                }
            },
            "order" : ["run_example1", "run_example2"]  <<==== to inject between processes you must respect the order too: you cannot inject the result of a process that has not been executed yet
        }
    ]
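
To make the mechanics concrete, here is a minimal sketch (reusing the hypothetical run_tool from the sketch above) of how ordered execution with result injection could be implemented; this is an assumption about the approach, not the project's actual code:

    def run_pipeline(env, pipeline):
        results = {}
        for process_name in pipeline["order"]:
            process = pipeline[process_name]
            args = dict(process["Args"])
            # If an argument value names an already-executed process,
            # replace it with that process's result (injection)
            for key, value in args.items():
                if isinstance(value, str) and value in results:
                    args[key] = results[value]
            results[process_name] = run_tool(env, process["Tool"], args)
        return results

Because results are collected while walking the order list, an argument can only be replaced by the result of a process that has already run, which is exactly the ordering constraint noted above.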

Run

Runtime variables

  • --logpath, -lp : path for writing logs (the default value is './logs', i.e. the running directory)

  • --logconfig, -lc : path of the log configuration file (this project comes with a default logger configuration file that you can use as a base for your own logger configuration; the default value is "Config/prod/logger.json")

  • --config, -cfg : path of the configuration file

  • --env, -e : defines the runtime environment ('DEV', 'PROD' and 'TEST' are the possible values; the default is 'DEV')
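
Putting it together, and assuming the entry-point script is called pypliner.py (the actual file name may differ), a run could look like:

    python pypliner.py --config ./my_pipeline.json --env DEV --logpath ./logs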
