What is Cdiscount starter?
This is a ready-to-use, end-to-end sample solution for the currently running Kaggle Cdiscount challenge.
It covers data loading and augmentation, model training (many different architectures), ensembling, and submission generation.
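To give a flavour of the augmentation step, here is a minimal sketch using Keras' ImageDataGenerator; the parameter values are purely illustrative and the pipeline's actual settings may differ:

from keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings; the values used by the pipeline may differ.
augmenter = ImageDataGenerator(
    rotation_range=15,       # random rotations up to 15 degrees
    width_shift_range=0.1,   # horizontal shifts up to 10% of the image width
    height_shift_range=0.1,  # vertical shifts up to 10% of the image height
    horizontal_flip=True,    # random horizontal flips
    rescale=1. / 255,        # scale pixel values to [0, 1]
)

# x_train / y_train stand for a batch of images and labels loaded from the data:
# train_generator = augmenter.flow(x_train, y_train, batch_size=32)
# model.fit_generator(train_generator, steps_per_epoch=len(x_train) // 32, epochs=10)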
Check out our collection of public projects.
In this open-source solution you will find references to neptune.ml. It is a free platform for community users, which we use daily to keep track of our experiments. Please note that using neptune.ml is not necessary to proceed with this solution: you may run it as a plain Python script.
How to run Cdiscount starter?
Install the requirements
pip install -r requirements.txt
Install neptune by simply running
pip install neptune-cli
Finish neptune installation by running
Finally, open neptune and create a project named cdiscount. Check the project key, because you will need it later (most likely it is CDIS).
Now, you are ready to run the code and train some models...
A remark about the competition data: we have uploaded the data to the neptune platform. It is available in the
/public/Cdiscount directory. Moreover, we created
meta_data files for the large .bson files in the
/public/Cdiscount/meta directory. This makes processing considerably faster.
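The metadata matters because the competition .bson files are tens of gigabytes, so scanning them repeatedly is slow. Below is a minimal sketch of how an offset index for such a file could be built; it only illustrates the idea and is not the exact format produced by create_metadata in run_manager.py:

import struct
import pandas as pd

def build_bson_index(bson_path):
    # Scan the .bson file once, recording the byte offset and length of every
    # document, so single products can later be read with file.seek()
    # instead of iterating over the whole file again.
    rows = []
    with open(bson_path, 'rb') as f:
        offset = 0
        while True:
            length_bytes = f.read(4)
            if len(length_bytes) < 4:
                break
            # Every BSON document starts with its total length as a little-endian int32.
            doc_length = struct.unpack('<i', length_bytes)[0]
            rows.append({'offset': offset, 'length': doc_length})
            f.seek(offset + doc_length)  # jump to the start of the next document
            offset += doc_length
    return pd.DataFrame(rows)

# index_df = build_bson_index('/public/Cdiscount/train.bson')
# index_df.to_csv('train_index.csv', index=False)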
You can run this end-to-end solution in two ways:
- If you wish to work on your own machine, you can run
neptune run run_manager.py -- run_pipeline
- Deploying in the cloud via neptune is super easy. The more advanced option is to run
neptune send run_manager.py \
    --config experiment_config.yaml \
    --pip-requirements-file requirements.txt \
    --project-key CDIS \
    --environment keras-2.0-gpu-py3 \
    --worker gcp-gpu-medium \
    -- run_pipeline
Collect results and upload to Kaggle
Go to /output/project_data/submissions, get your submission file, upload it to Kaggle, and check your rank in the competition!
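For reference, a Cdiscount submission is a CSV with an _id column (product id) and a category_id column (predicted category). A tiny sketch of writing one with pandas; the prediction values are placeholders:

import pandas as pd

# predictions maps product _id -> predicted category_id (placeholder values only).
predictions = {10: 1000010653, 14: 1000010653, 21: 1000004141}

submission = pd.DataFrame(sorted(predictions.items()), columns=['_id', 'category_id'])
submission.to_csv('/output/project_data/submissions/my_submission.csv', index=False)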
custom data directories
If you do not wish to use the default data directories, you can specify custom paths in the configuration file (see also the sanity-check sketch after the listing below):
raw_data_dir: /public/Cdiscount
meta_data_dir: /public/Cdiscount/meta
meta_data_processed_dir: /output/project_data/meta_processed
models_dir: /output/project_data/models
predictions_dir: /output/project_data/predictions
submissions_dir: /output/project_data/submissions
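Before a long run it can be worth checking that the paths you configured actually exist. A small, hypothetical helper (not part of run_manager.py), assuming the directories are stored as top-level keys in a YAML file named config.yaml:

import os
import yaml

def check_data_dirs(config_path='config.yaml'):
    # Load the YAML config and warn about every *_dir entry that does not exist yet.
    with open(config_path) as f:
        config = yaml.safe_load(f)
    for key, path in config.items():
        if key.endswith('_dir') and not os.path.isdir(path):
            print('Warning: {} points to a missing directory: {}'.format(key, path))

check_data_dirs()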
meta data creation
If you want to create the metadata locally, run
python run_manager.py create_metadata
and your metadata will be stored in the metadata directory specified in the configuration file.
Since the dataset is very large, we suggest sampling the training data down to a manageable size. Something like the 1000 most common categories and 1000 images per category seems reasonable to start with. Nevertheless, you can tweak it however you want in the configuration file (experiment_config.yaml), as illustrated by the snippet and the sampling sketch below:
properties:
  - key: top_categories
    value: 100
  - key: images_per_category
    value: 100
  - key: epochs
    value: 10
  - key: pipeline_name
    value: InceptionPipeline
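To illustrate what top_categories and images_per_category control, here is a sketch of sampling a metadata table with pandas; the column name category_id and the helper itself are assumptions, not the pipeline's actual code:

import pandas as pd

def sample_metadata(meta_df, top_categories=100, images_per_category=100):
    # Keep only the most frequent categories...
    keep = meta_df['category_id'].value_counts().index[:top_categories]
    sampled = meta_df[meta_df['category_id'].isin(keep)]
    # ...and at most images_per_category rows from each remaining category.
    sampled = sampled.groupby('category_id', group_keys=False).apply(
        lambda g: g.sample(min(len(g), images_per_category), random_state=1234)
    )
    return sampled.reset_index(drop=True)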
hyperparameter space search
run without neptune
We give you an option to run this code without neptune. The transition is seamless; just follow these steps:
Download the competition data to some folder
specify the data directories in the configuration file (see custom data directories above)
run the Python code
python run_manager.py run_pipeline
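For reference, running without neptune simply means calling the script's command-line entry point directly. A minimal sketch of what such an entry point could look like; the real run_manager.py is more elaborate and these function bodies are placeholders:

import argparse

def create_metadata():
    print('creating metadata...')      # placeholder for the real metadata step

def run_pipeline():
    print('running the pipeline...')   # placeholder for training, ensembling and submission

COMMANDS = {'create_metadata': create_metadata, 'run_pipeline': run_pipeline}

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Cdiscount starter entry point')
    parser.add_argument('command', choices=sorted(COMMANDS))
    args = parser.parse_args()
    COMMANDS[args.command]()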
Please feel free to modify this code in order to improve your score. Add new models, pre- and post-processing routines or ensembling methods.
Have fun competing on this Kaggle challenge!