This repository contains data, code and Jupyter notebooks for a tutorial on data validation in data science projects. The tutorial has three sections, one for each step in the production data science model life cycle:
- Database management (using Great Expectations)
- Training pipeline (using Pandera)
- Model serving (using Pydantic)
Each section comes with a notebook containing explanations, code snippets and exercises.
If you would like to watch me run through these notebooks at PyData London 2022, see this YouTube video: Data Validation for Data Science | PyData London 2022
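To give a flavour of the model serving step, here is a minimal sketch of request validation with Pydantic. It is not taken from the notebooks: the `HouseFeatures` schema, the `validate_request` helper, the chosen fields and their bounds are all illustrative assumptions.

```python
from pydantic import BaseModel, Field, ValidationError


class HouseFeatures(BaseModel):
    """Hypothetical request schema; fields and bounds are assumptions."""

    LotArea: int = Field(..., gt=0)             # lot size must be positive
    YearBuilt: int = Field(..., ge=1800)        # assumed lower bound
    OverallQual: int = Field(..., ge=1, le=10)  # assumed 1-10 quality scale


def validate_request(payload: dict):
    """Return (model, None) on success or (None, error messages) on failure."""
    try:
        return HouseFeatures(**payload), None
    except ValidationError as exc:
        # Each error carries the offending field location and a message.
        messages = [
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        ]
        return None, messages
```

Calling `validate_request({"LotArea": -1, "YearBuilt": 2003, "OverallQual": 7})` returns no model and a single message pointing at `LotArea`, so a serving endpoint could reject the request before it reaches the model.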
The dataset used in this tutorial is taken from the House Prices
prediction competition on Kaggle.
It consists of two CSV files located in the `data` folder: `train.csv` and `test.csv`.
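As a quick sanity check before opening the notebooks, the two files can be inspected with the standard library alone. This is a sketch, not part of the tutorial: `summarize_csv` is a hypothetical helper, and the `data/` paths assume the script is run from the repository root.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)               # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are data records
    return header, row_count


if __name__ == "__main__":
    for name in ("train.csv", "test.csv"):
        csv_path = Path("data") / name  # assumed location, relative to repo root
        if csv_path.exists():
            columns, rows = summarize_csv(csv_path)
            print(f"{name}: {len(columns)} columns, {rows} rows")
```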
To follow the notebooks and exercises, there are two options:
- Use your own Python environment with Jupyter installed. Start the notebooks with
the `jupyter notebook` command, select the notebook you want to run in the `notebooks` folder and follow the instructions. To use all of the features of the different tools, Python 3.8 or later is recommended.
- Use Google Colaboratory, which needs no pre-installation. Click the link to go to
the repository's GitHub page. Choose one
of the notebooks in the `notebooks` folder and, from the interactive view, click the link to open it in Colab.