

Data-join-tabular


Purpose & Philosophy

This repository contains the tabular data join service.

Description

Takes two sets of input data and produces one output containing the associated data, joined on a common column.

Specification

  • Read different file formats (CSV, GeoJSON, SHP)
  • Optional arguments depending on the type of input file (e.g. separator for a CSV)
  • The name of the new columns created (suffix, prefix...)
  • Type of join ('left', 'right', 'outer', 'inner', 'cross')
  • Validate the output data

Technologies

pandas.merge

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
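The column-on-column behavior described above can be sketched with a few toy DataFrames (hypothetical column names, not from the repository's test data):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# Column-on-column join: the DataFrame indexes are ignored.
inner = pd.merge(left, right, on="id", how="inner")  # only ids present in both
outer = pd.merge(left, right, on="id", how="outer")  # union of keys, NaN where missing

# Cross merge: cartesian product of the rows; no `on` specification is allowed.
cross = pd.merge(left, right, how="cross")
```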

  • the use of conventional commits, semantic versioning and semantic releasing which automates the whole package release workflow including: determining the next version number, generating the release notes, and publishing the artifacts (project tarball, docker images, etc.)
  • a uniform way for managing the project lifecycle (dependencies management, building, testing)
  • KISS principles: simple for developers
  • a consistent coding style

Usage

The usage is given as follows:

Usage: data-join-tabular join [OPTIONS]

  Represents cli 'join' command

Options:
  -i1, --input1 FILE              path to first file to join  [required]
  -i2, --input2 FILE              path to second file to join  [required]
  -s1, --sep1 TEXT                separator for reading the first file
  -s2, --sep2 TEXT                separator for reading the second file
  -sr, --sufrigh TEXT             the suffix to add to overlapping column
                                  names in right
  -sl, --sufleft TEXT             the suffix to add to overlapping column
                                  names in left
  -onm, --outname TEXT            output file name, if not provided, output
                                  name will be the same as file1
  -o, --on TEXT                   Column or index level names to join on.
                                  These must be found in both DataFrames.
                                  If on is None and not merging on indexes
                                  then this defaults to the intersection of
                                  the columns in both DataFrames
  -v, --validate [one_to_one|one_to_many|many_to_one|many_to_many]
                                  If specified, checks if merge is of
                                  specified type.        “one_to_one” or
                                  “1:1”: check if merge keys are unique in
                                  both left and right datasets.
                                  “one_to_many” or “1:m”: check if merge keys
                                  are unique in left dataset.
                                  “many_to_one” or “m:1”: check if merge keys
                                  are unique in right dataset.
                                  “many_to_many” or “m:m”: allowed, but does
                                  not result in checks.
                                  It will raise a MergeError if the validation fails
  -how, --how [left|right|outer|inner|cross]
                                  Type of merge to be performed.    left: use
                                  only keys from left frame, similar to a SQL
                                  left outer join; preserve key order.
                                  right: use only keys from right frame,
                                  similar to a SQL right outer join; preserve
                                  key order.    outer: use union of keys from
                                  both frames, similar to a SQL full outer
                                  join; sort keys lexicographically.    inner:
                                  use intersection of keys from both frames,
                                  similar to a SQL inner join; preserve the
                                  order of the left keys.    cross: creates
                                  the cartesian product from both frames,
                                  preserves the order of the left keys.
  -so, --sort TEXT                Sort the join keys lexicographically in the
                                  result DataFrame. If False,        the order
                                  of the join keys depends on the join type
                                  (how keyword).
  -or, --onrigh TEXT              Column name to join in the right DataFrame
  -ol, --onleft TEXT              Column name to join in the left DataFrame, 
                                  it must be sorted to match the on_right columns
  -out, --output DIRECTORY        output directory where output file will be
                                  written  [default: .]
  -f, --force                     overwrite existing file
  -ft, --fix-types                fix types issues
  --dry-run                       passthrough, will not write anything
  --help                          Show this message and exit.
poetry run data-join-tabular  join -i1 ./tests/data/inputs1/input_test1.csv -i2 ./tests/data/inputs2/input_test1.csv -o categorie -o statut -o effectif -o genre -s1 ';' -s2 ';' -out ./tests/data -f
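The -v/--validate option maps onto the validate argument of pandas.merge. A minimal sketch of that behavior, using hypothetical mini-datasets in place of the CSV inputs above:

```python
import pandas as pd

# Hypothetical stand-ins for the two input files.
left = pd.DataFrame({"genre": ["F", "M"], "effectif": [10, 20]})
right = pd.DataFrame({"genre": ["F", "M"], "statut": ["ok", "ko"]})

# validate="one_to_one" checks that the merge keys are unique on both sides.
merged = pd.merge(left, right, on="genre", how="inner", validate="one_to_one")

# Duplicate keys on the right side make a one-to-one validation raise MergeError.
failed = False
try:
    pd.merge(left, pd.concat([right, right]), on="genre", validate="one_to_one")
except pd.errors.MergeError:
    failed = True
```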

System requirements

Python

The repository targets Python 3.9 and higher.

Poetry

The repository uses Poetry for Python packaging and dependency management. Be sure to have it properly installed before proceeding.

  curl -sSL https://install.python-poetry.org | python3 

Docker

Follow the official Docker documentation to install and configure Docker on your local machine.

What's included

This template provides the following:

  • poetry for dependency management.
  • flake8 for linting python code.
  • mypy for static type checks.
  • pytest for unit testing.
  • click to easily set up your project commands.
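As an illustration of the click-based setup, here is a trimmed-down, hypothetical version of the 'join' command (the real project declares many more options, shown in the Usage section):

```python
import click

@click.command()
@click.option("-i1", "--input1", type=click.Path(), required=True,
              help="path to first file to join")
@click.option("-s1", "--sep1", default=",",
              help="separator for reading the first file")
def join(input1, sep1):
    """Represents cli 'join' command"""
    # The real command would read, merge and write files here.
    click.echo(f"joining {input1} (sep={sep1!r})")

if __name__ == "__main__":
    join()
```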

The project is also configured to enforce code quality by declaring some CI workflows:

  • conventional commits
  • lint
  • unit test
  • semantic release

Everyday activity

Build

The project is built with Poetry.

poetry install

Lint

⚠️ Be sure to write code compliant with linters or else you'll be rejected by the CI.

Code linting is performed by flake8.

poetry run flake8 --count --show-source --statistics

Static type check is performed by mypy.

poetry run mypy .

To improve code quality, we use additional linters in our workflows. To avoid being rejected by the CI, please run these additional linters as well.

Markdown linting is performed by markdownlint-cli.

markdownlint "**/*.md"  

Docker linting is performed by dockerfilelint and hadolint.

dockerfilelint Dockerfile
hadolint Dockerfile

Unit Test

⚠️ Be sure to write tests that succeed or else you'll be rejected by the CI.

Unit tests are performed by the pytest testing framework.

poetry run pytest -v
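A unit test for the join logic could look like the following (a hypothetical test written for illustration, not one taken from the repository):

```python
import pandas as pd

def test_inner_join_keeps_only_common_keys():
    # An inner join keeps only rows whose key appears in both inputs.
    left = pd.DataFrame({"id": [1, 2], "a": ["x", "y"]})
    right = pd.DataFrame({"id": [2, 3], "b": ["u", "v"]})
    result = pd.merge(left, right, on="id", how="inner")
    assert list(result["id"]) == [2]
    assert list(result.columns) == ["id", "a", "b"]
```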

Build & run docker image (locally)

Build a local docker image using the following command line:

docker build -t data-join-tabular .

Once built, you can run the container locally with the following command line:

docker run -ti --rm data-join-tabular

You want to get involved? 😍

Please check out the OKP4 health files: