Welcome to the Kaggle Data Pipeline Template! This repository serves as a blueprint for creating automated data pipelines using the Kaggle API. It includes the necessary scripts and file structure to get your own data pipeline up and running quickly.
You will need:
- A Kaggle account.
- API credentials from Kaggle.
To use this repository as a template, follow these steps:
- Click on the `Use this template` button to create a new repository.
- Clone the new repository to your local machine.
- Install the Kaggle API by running `pip install kaggle`.
To use the Kaggle API, you will need to add your API credentials as secrets to your repository. To do this, follow these steps:
- Download your Kaggle API credentials file (`kaggle.json`) from the Account section of your Kaggle profile.
- Navigate to the 'Settings' tab of your new repository.
- Click on 'Secrets'.
- Click on 'New repository secret'.
- Name the new secret `KAGGLE_USERNAME` and set its value to your Kaggle username (found in the `kaggle.json` file).
- Create another secret named `KAGGLE_KEY` and set its value to the corresponding key from the `kaggle.json` file.
By doing this, you ensure that your Kaggle credentials are kept secure and are not shared publicly.
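The Kaggle API client can read these two values from environment variables of the same names. Here is a minimal sketch of a pre-flight check a pipeline script could run before authenticating. It assumes your workflow exposes the secrets as the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables; the helper names are hypothetical and not part of this template.

```python
import os

# Environment variable names the Kaggle API client looks for.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(env=None):
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def require_kaggle_credentials(env=None):
    """Raise a readable error if any credential variable is missing."""
    missing = missing_kaggle_credentials(env)
    if missing:
        raise RuntimeError(f"Missing Kaggle credentials: {', '.join(missing)}")
```

Calling `require_kaggle_credentials()` at the top of a script makes a misconfigured secret fail early with a clear message rather than partway through an upload.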
Once you have set up your Kaggle API credentials, you can start using the data pipeline.
- `data` folder: Place any datasets you generate in the `data` folder and reference them directly in the `dataset-metadata.json` file. For more information on structuring the metadata file, see the Metadata Wiki. The contents of `data_card.md` can be included under the `description` key of your metadata file.
- `utils` folder: A convenient place for your pipeline's helper functions. It includes an example that updates the data card in the metadata file.
- `tests` folder: Contains basic tests from Evidently to monitor data integrity before pushing to Kaggle. See the Evidently documentation for details.
- `requirements.txt`: List all Python packages used. Pin versions where applicable.
- `generator.py`: Build your data generator here. Basic code is included to update a Kaggle dataset version.
- `actions` folder: Contains an example GitHub Action that runs the `generator.py` script on the first day of every month. This template will not do anything by itself; you must set up a GitHub Action by navigating to the 'Actions' tab and creating a new workflow.
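As an illustration of the data card step described above, the following sketch copies the text of a data card into the `description` key of a metadata file. It is not the helper shipped in `utils`; the function name `update_description` and the example paths are assumptions made for illustration.

```python
import json
from pathlib import Path


def update_description(metadata_path, data_card_path):
    """Copy the data card text into the metadata file's "description" key."""
    metadata_path = Path(metadata_path)
    # Load the existing metadata so all other keys are preserved.
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    metadata["description"] = Path(data_card_path).read_text(encoding="utf-8")
    # Write the updated metadata back in place.
    metadata_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata
```

A call such as `update_description("data/dataset-metadata.json", "data_card.md")` would run before pushing a new version, assuming those are the locations of the two files in your repository.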
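# Create new dataset version
import datetime

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # uses KAGGLE_USERNAME / KAGGLE_KEY or kaggle.json
# Upload the contents of ./data/ as a new version with dated notes
api.dataset_create_version(
    "./data/",
    version_notes=f"Updated on {datetime.datetime.now().strftime('%Y-%m-%d')}",
)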
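# Create new dataset
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # uses KAGGLE_USERNAME / KAGGLE_KEY or kaggle.json
# Create a private dataset from ./data/ (requires dataset-metadata.json there)
api.dataset_create_new(
    "./data/",
    public=False,
    quiet=False,
    convert_to_csv=True,
    dir_mode="skip",
)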
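
The two snippets above are alternatives: the first upload of a dataset uses `dataset_create_new`, and later updates use `dataset_create_version`. Below is a minimal sketch of how a generator script might choose between them. The function `push_dataset` and its `first_upload` flag are hypothetical, and `api` is assumed to be an already authenticated `KaggleApi` instance.

```python
import datetime


def push_dataset(api, folder="./data/", first_upload=False):
    """Create the dataset on first upload, otherwise publish a new version."""
    if first_upload:
        # First upload: create a private dataset from the folder.
        return api.dataset_create_new(folder, public=False, dir_mode="skip")
    # Later runs: add a version with dated notes.
    notes = f"Updated on {datetime.datetime.now().strftime('%Y-%m-%d')}"
    return api.dataset_create_version(folder, version_notes=notes)
```

Passing the API object in as an argument keeps the function easy to exercise with a stand-in object, without touching Kaggle.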
For any questions or concerns, please open an issue on this repository.
© 2023 jon-bown