Skip to content

jaimevalero/push-kaggle-dataset

Use this GitHub action with your project
Add this Action to an existing workflow or create a new one
View on Marketplace

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Push Kaggle Dataset action

This action push data from a github repository to a dataset at kaggle.

Use this action to keep synchronized your datasets at kaggle with your github repositories.

Please bear in mind that this action do NOT work with kernels nor notebooks, so it is of not use on competitions.

Inputs

id

Dataset identifier in format {username}/{dataset}). Default is "{KAGGLE_USERNAME}/{GITHUB_REPO_NAME}" Where KAGGLE_USERNAME is a secret - see sections secret You cannot upload data to other kaggle user.

files

Files to upload. Default is "*.csv"

title

Title of the dataset.

Only if it is a new dataset. Otherwise it is not used. Default is the dataset id.

Eg: if the dataset is mlg-ulb/creditcardfraud, the default title would be 'creditcardfraud'

subtitle

Subtitle of the dataset. We highly recommend entering a subtitle for your Dataset. Only if it is a new dataset. Otherwise it is not used. Must be between 20 and 80 characters. If the subtitle has fewer than 20 characters, trailing white spaces are added.

description

Description of the dataset, only if it has to be created. Only if it is a new dataset. Otherwise it is not used.

is_public

Visibility of the the new dataset. Boolean (True or False) Default is "*.False"- private datasets are created by default. Only if it is a new dataset. Otherwise it is not used.

Secrets

You have to configure your secrets at: Settings >> Secrets

  • ${{ secrets.KAGGLE_USERNAME }} - Required The dataset owner.
  • ${{ secrets.KAGGLE_KEY }} - Required The API key for your user. You can create your api key here.

Examples usage

Create a main.yml file like this in the path your repo, in the path .github/workflows/main.yml

Change the fields in that yaml : id, files, title, subtitle and description. Add the secrets.

Please consider that if you are NOT creating a new dataset, only updating it, the following fields are not used, so you can skip it :

  • title
  • subtitle
  • description
  • is_public

Example

Example1 : Create a new dataset, uploading a file, "titanic.csv"

name: upload

# Controls when the action will run. Triggers the workflow on push or pull request
# events but only for the master branch
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  upload:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2
      # Runs a single command using the runners shell
      - name: Upload datasets
        uses: jaimevalero/push-kaggle-dataset@v3.1 # This is the action
        env:
          # Do not leak your credentials.
          KAGGLE_USERNAME: ${{ secrets.KAGGLE_USERNAME }}
          KAGGLE_KEY: ${{ secrets.KAGGLE_KEY }}

        with:
          id:  "jaimevalero/my-new-dataset"
          title: "Testing github actions for upload datasets"
          subtitle: "Titanic data2"
          description: "## Example of dataset syncronized by github actions <br/>Source https://github.com/jaimevalero/test-actions and https://github.com/jaimevalero/push-kaggle-dataset <br/> "
          files:  titanic.csv
          is_public: true

Example2 : Upload more than one file

You can use:

  • wildcards (eg: *.xlsx )
  • directory names (eg: images )

If you use directory names, you will upload every file from that directory. Please bear in mind that files in subdirectories are packaged in tar file, due to API behaviour.

In case you use more than one line, you should use the "|" operator.

          files: |
            titanic.csv
            *.xlsx
            images

Hint: Skip workflow based on commit message

You can configure your workflow itself. En this example, the workflow only triggers when a string is found in the commit message

  upload:
   if: "contains(github.event.commits[0].message, 'madrid_results.csv')"

Complete yaml for these examples could be find at examples directory