This is a CLI tool (written in Python) that loads your data from MySQL into Google BigQuery. Although I tried to optimize it for speed, some TODOs remain. The script is based on Ryan's import script.
- libmysqlclient-dev (on Ubuntu just run: `sudo apt install libmysqlclient-dev`)
- Google Cloud SDK
- Python
The following steps are based on the official Google documentation:
```
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb https://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
sudo apt-get install google-cloud-sdk-app-engine-python
```
To configure it, run:
```
gcloud init
```
To authenticate your app, create a service account key, name it `google_key.json`, and place it in this folder.
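As a sketch (assuming the script uses a Google Cloud client library, which is not confirmed here), you can also point the client libraries at the key via the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable instead of a hard-coded path:

```python
import os

# Point Google client libraries at the service account key.
# "google_key.json" is the filename this README asks you to create.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "google_key.json"

# Any google-cloud client created after this picks the key up, e.g.:
#   from google.cloud import bigquery
#   client = bigquery.Client(project="your-project-id")
```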
I recommend using a virtualenv. Run the following command to install all dependencies using pip:

```
pip install -r requirements.txt
```
I successfully used it with PyPy:

```
sudo apt install pypy pypy-dev
virtualenv -p pypy ve
```
| | cpython | pypy |
|---|---|---|
| real | 0m6.964s | 0m9.186s |
| user | 0m0.888s | 0m2.500s |
| sys | 0m0.152s | 0m0.244s |
| | 3000 | 5000 | 10000 |
|---|---|---|---|
| real | 5m8.826s | 5m28.480s | 5m27.202s |
| user | 0m28.468s | 0m32.820s | 0m36.176s |
| sys | 0m2.728s | 0m3.724s | 0m4.548s |
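The batch size (`-s`) controls how many rows are grouped together per load. As a sketch (the helper below is my illustration, not code from the script), chunking a row stream into fixed-size batches works like this:

```python
def batches(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield batch

# Example: 7 rows in batches of 3 -> batch sizes 3, 3, 1
sizes = [len(b) for b in batches(range(7), 3)]
```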
```
Usage: run.py [OPTIONS]

Options:
  -h, --host TEXT           MySQL hostname
  -d, --database TEXT       MySQL database  [required]
  -u, --user TEXT           MySQL user
  -p, --password TEXT       MySQL password
  -t, --table TEXT          MySQL table  [required]
  -i, --projectid TEXT      Google BigQuery Project ID  [required]
  -n, --dataset TEXT        Google BigQuery Dataset name  [required]
  -l, --limit INTEGER       max num of rows to load
  -s, --batch_size INTEGER  num of rows to load per batch
  --help                    Show this message and exit.
```
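A typical invocation might look like this (host, credentials, and database/table/project names below are placeholders, not defaults from the script):

```shell
python run.py \
  -h localhost -u root -p secret \
  -d shop -t orders \
  -i my-gcp-project -n my_dataset \
  -l 100000 -s 5000
```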
- use async load?
- adjust bucket-size
- add support for other data types