
# 2. Source Specification


A source in Tightlock represents the location of the data that will be processed by a destination. The accompanying destination is responsible for defining the data schema that must be present in the corresponding source. For more information about the schemas required by each connection available in Tightlock, see the destination specification article.

Note that all sources require a unique_id field, which Tightlock uses to guarantee batch consistency. When creating a source, you can set unique_id to the column name of any field that is guaranteed to be distinct in your data.

unique_id defaults to the column name "id" for all available data sources, so if you don't use a custom field, make sure to add an "id" column and populate it with unique identifiers for your data.
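
For example, a minimal CSV source file that relies on that default could look like this (illustrative columns and values):

```
id,email,purchase_value
1,user1@example.com,10.50
2,user2@example.com,7.25
```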

The required fields for each available source type are listed below.

## BigQuery

### Config

| Name | Type | Description |
|------|------|-------------|
| dataset | str | The name of your BigQuery dataset. |
| table | str | The name of your BigQuery table. |
| credentials | str | The full service-account credentials JSON string. Not needed if your backend runs in the same GCP project as the BigQuery table. |
| unique_id | str | Unique id column name to be used by BigQuery. Defaults to 'id' when nothing is provided. |
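
For example, the source portion of a config pointing at a BigQuery table could look like the sketch below (placeholder values; any wrapper keys Tightlock expects around the source entry are omitted here):

```json
{
  "dataset": "my_dataset",
  "table": "customer_events",
  "unique_id": "user_id"
}
```

Here unique_id overrides the default "id" column, and credentials is omitted on the assumption that the backend runs in the same GCP project as the table.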

## Google Cloud Storage

NOTE: The GCS source supports typical Hadoop-supported file types (CSV, JSON, Avro, Parquet, etc.). If using the CSV format, use the .csvh extension so that headers are taken into account (see the sample_data directory for reference).

### Config

| Name | Type | Description |
|------|------|-------------|
| bucket_name | str | The name of the GCS bucket that contains your data. |
| location* | str | The name of the GCS bucket folder that contains your data. |
| unique_id | str | Unique id column name to be used by the GCS source engine. Defaults to 'id' when nothing is provided. |

*: All files in the location folder compose the "table". This means that you can partition a table across multiple files inside a folder, as long as they all have the same structure.
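
As a sketch, a GCS source entry built from the fields above might look like this (placeholder values only, wrapper keys omitted):

```json
{
  "bucket_name": "my-gcs-bucket",
  "location": "exports/customers",
  "unique_id": "user_id"
}
```

Every file under exports/customers (for CSV, files with the .csvh extension) would then be read as a single table.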

## AWS S3

NOTE: The S3 source supports typical Hadoop-supported file types (CSV, JSON, Avro, Parquet, etc.). If using the CSV format, use the .csvh extension so that headers are taken into account (see the sample_data directory for reference).

### Config

| Name | Type | Description |
|------|------|-------------|
| bucket_name | str | The name of the S3 bucket that contains your data. |
| location* | str | The name of the S3 bucket folder that contains your data. |
| secret_key | str | Optional AWS secret key (only needed when running Tightlock on a separate cloud environment). |
| access_key | str | Optional AWS access key (only needed when running Tightlock on a separate cloud environment). |
| unique_id | str | Unique id column name to be used by the S3 source engine. Defaults to 'id' when nothing is provided. |

*: All files in the location folder compose the "table". This means that you can partition a table across multiple files inside a folder, as long as they all have the same structure.
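
A corresponding S3 sketch, with the optional keys included for a Tightlock instance running outside AWS (placeholder values only, wrapper keys omitted):

```json
{
  "bucket_name": "my-s3-bucket",
  "location": "exports/customers",
  "access_key": "YOUR_ACCESS_KEY",
  "secret_key": "YOUR_SECRET_KEY",
  "unique_id": "user_id"
}
```

When Tightlock runs inside the same AWS environment as the bucket, access_key and secret_key can simply be dropped.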

## Local File

NOTE: The Local File source type is typically used for development or testing purposes. This source type only has access to files located in the sample_data directory, so only files deployed alongside the code can be referenced at configuration creation time.

### Config

| Name | Type | Description |
|------|------|-------------|
| location | str | The path to your local file, relative to the container's 'data' folder (which is mapped to the host's 'sample_data' folder). |
| unique_id | str | Unique id column name to be used by the local file engine. Defaults to 'id' when nothing is provided. |
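
For instance, to read a headered CSV shipped in sample_data, a Local File source sketch could be (hypothetical file name, wrapper keys omitted):

```json
{
  "location": "my_sample.csvh",
  "unique_id": "id"
}
```

Setting unique_id to "id" here is redundant, since that is already the default; it is shown only for completeness.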