2. Source Specification
A source in Tightlock represents the location of the data that will be processed by a destination. The accompanying destination is responsible for defining the data schema that must be present in the corresponding source. For more information about the schemas required by each connection available in Tightlock, see the destination specification article.
Note that all sources require a unique_id field, which Tightlock uses to guarantee batch consistency. When creating a source, you can set unique_id to the column name of any field that is guaranteed to be distinct in your data.
unique_id defaults to the column name "id" for all available source types, so if you don't use a custom field, make sure to add an "id" column and populate it with unique identifiers for your data.
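For example, a minimal dataset that relies on the default unique_id might look like the following (hypothetical sample data; the column names are illustrative):

```
id,email,score
1,user_a@example.com,10
2,user_b@example.com,20
```

Every row carries a distinct id value, so no custom unique_id needs to be configured.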
The required fields for each available source type are listed below.
BigQuery
Name | Type | Description |
---|---|---|
dataset | str | The name of your BigQuery dataset. |
table | str | The name of your BigQuery table. |
credentials | str | The full credentials service-account JSON string. Not needed if your backend is located in the same GCP project as the BigQuery table. |
unique_id | str | Unique id column name to be used by BigQuery. Defaults to 'id' when nothing is provided. |
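Putting these fields together, a BigQuery source definition might look like the following sketch (the dataset, table, and column names are hypothetical; credentials is omitted on the assumption that the backend runs in the same GCP project as the table):

```json
{
  "dataset": "marketing_data",
  "table": "conversions",
  "unique_id": "transaction_id"
}
```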
GCS
NOTE: The GCS source supports typical Hadoop-supported file types (CSV, JSON, Avro, Parquet, etc.). If using the CSV format, use the .csvh extension so that headers are taken into account (see the sample_data directory for reference).
Name | Type | Description |
---|---|---|
bucket_name | str | The name of the GCS bucket that contains your data. |
location* | str | The name of the GCS bucket folder that contains your data. |
unique_id | str | Unique id column name to be used by GCS source engine. Defaults to 'id' when nothing is provided. |
*: All files in the location folder compose the "table". This means you can partition a table across multiple files inside a folder, as long as they all share the same structure.
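A GCS source definition could then look like this sketch (the bucket and folder names are hypothetical):

```json
{
  "bucket_name": "my-data-bucket",
  "location": "exports/conversions",
  "unique_id": "id"
}
```

All files under exports/conversions would be read together as a single table.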
S3
NOTE: The S3 source supports typical Hadoop-supported file types (CSV, JSON, Avro, Parquet, etc.). If using the CSV format, use the .csvh extension so that headers are taken into account (see the sample_data directory for reference).
Name | Type | Description |
---|---|---|
bucket_name | str | The name of the S3 bucket that contains your data. |
location* | str | The name of the S3 bucket folder that contains your data. |
secret_key | str | Optional AWS secret key (only needed when running Tightlock on a separate cloud environment). |
access_key | str | Optional AWS access key (only needed when running Tightlock on a separate cloud environment). |
unique_id | str | Unique id column name to be used by S3 source engine. Defaults to 'id' when nothing is provided. |
*: All files in the location folder compose the "table". This means you can partition a table across multiple files inside a folder, as long as they all share the same structure.
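An S3 source running from a separate cloud environment might be sketched as follows (the bucket, folder, and key values are hypothetical placeholders):

```json
{
  "bucket_name": "my-data-bucket",
  "location": "exports/conversions",
  "access_key": "<AWS_ACCESS_KEY>",
  "secret_key": "<AWS_SECRET_KEY>",
  "unique_id": "id"
}
```

If Tightlock runs in the same AWS environment as the bucket, access_key and secret_key can be omitted.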
Local File
NOTE: The Local File source type is typically used for development or testing. This source type only has access to files located in the sample_data directory, so only files deployed alongside the code can be referenced at configuration creation time.
Name | Type | Description |
---|---|---|
location | str | The path to your local file, relative to the container 'data' folder (which is mapped to the host 'sample_data' folder). |
unique_id | str | Unique id column name to be used by local file engine. Defaults to 'id' when nothing is provided. |
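For example, a Local File source pointing at a CSV-with-headers file shipped in sample_data might look like this sketch (the filename is hypothetical):

```json
{
  "location": "my_sample.csvh",
  "unique_id": "id"
}
```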