In my view, a good AI solution requires a balance between the model and the quality of the data, and I lean more toward the data side. Andrew Ng and his team demonstrated that data quality is key by showing it in an experiment with real-world data.
The common practice among researchers is to hold the data fixed while trying to improve the code. But when the dataset size is modest (<10,000 examples), Andrew Ng suggests ML teams will make faster progress by improving the data, provided the dataset is good.
The table below shows the result of the experiment, which illustrates why a data-centric approach beats a model-centric one. If your model is already at its best, the task of improving it further to reach 90% accuracy sounds almost impossible.
For the model-centric approach, the improvements came from neural architecture search and state-of-the-art architectures, whereas the data-centric approach focused on identifying inconsistencies and cleaning noisy labels. The results show what the data-centric approach can achieve.
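As a toy illustration of the data-centric side (this is not Ng's actual experiment, and `find_label_conflicts` is a hypothetical helper), one simple cleaning step is to flag inputs that appear in the dataset with more than one label:

```python
from collections import defaultdict

def find_label_conflicts(examples):
    """Return inputs that appear with more than one distinct label.

    Hypothetical helper for illustration; `examples` is a list of
    (input_text, label) pairs.
    """
    labels_by_text = defaultdict(set)
    for text, label in examples:
        labels_by_text[text].add(label)
    return {text: sorted(labels)
            for text, labels in labels_by_text.items()
            if len(labels) > 1}

# Toy defect-inspection dataset: the same input is labeled two different ways.
data = [("scratch on casing", "defect"),
        ("scratch on casing", "ok"),
        ("dent near port", "defect")]
conflicts = find_label_conflicts(data)
# conflicts == {"scratch on casing": ["defect", "ok"]}
```

Flagged inputs can then be sent back for relabeling, which is often cheaper than another round of architecture search.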
Andrew Ng mentioned how everyone jokes that ML is 80% data preparation, yet no one seems to care. A quick look at arXiv gives an idea of the direction ML research is heading: there is unprecedented competition around beating benchmarks. If Google has BERT, then OpenAI has GPT-3. But these fancy models make up only 20% of a business problem.
In fact, MLOps is essential to connect the dots and take these steps to the next level while ensuring consistency, completeness, and relevancy. The most important objective of MLOps is to ensure a high-quality and consistent flow of data throughout all stages of a project.
There are a number of goals enterprises want to achieve by successfully implementing MLOps across the organization, including:
- Deployment and automation
- Reproducibility of models and predictions
- Governance and regulatory compliance
- Scalability
- Monitoring and management
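To make the monitoring goal concrete, a minimal sketch of one common check (the metric and thresholds here are illustrative, not part of any specific MLOps product) compares the distribution of a model's training-time predictions against its live predictions:

```python
import math
from collections import Counter

def psi(expected, actual, bins):
    """Population Stability Index: ~0 means the two samples are
    similarly distributed; larger values signal drift.

    Illustrative monitoring check, not tied to a specific tool.
    """
    e_counts, a_counts = Counter(expected), Counter(actual)
    n_e, n_a = len(expected), len(actual)
    score = 0.0
    for b in bins:
        p_e = max(e_counts[b] / n_e, 1e-6)  # floor avoids log(0)
        p_a = max(a_counts[b] / n_a, 1e-6)
        score += (p_a - p_e) * math.log(p_a / p_e)
    return score

# Training-time predictions vs. live predictions for a binary classifier.
train = ["pos"] * 80 + ["neg"] * 20
live = ["pos"] * 50 + ["neg"] * 50
drift = psi(train, live, ["pos", "neg"])
# drift well above 0 indicates the live distribution has shifted
```

A monitoring system would run a check like this on a schedule and alert when the score crosses a chosen threshold.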
Orchestrators in TFX automate task execution and monitor TFX components. One of the most widely used TFX orchestrators is Apache Beam. Apache Beam is a unified batch and stream distributed processing API that acts as an abstraction layer over the underlying distributed processing framework. This allows you to work on diverse backends such as Apache Spark, Google Cloud Dataflow, or your local machine.
This repo gives a broad overview of how to use each TFX component both standalone and as part of an MLOps pipeline. The notebooks in this repo depend on each other: each notebook expects the previous one to have been executed. Each notebook explains the standalone execution of a component and orchestrates it using TFX's interactive context. We have used the metadata store heavily to establish links between notebooks. Follow the sequence mentioned below:
Take the following steps for a smooth start:
step 1:
Clone the repo and create a virtual environment in the path [root_dir]/Tensorflow-Extended-tutorial
python -m venv env
Activate the environment using the command below.
If you are using Windows:
env\Scripts\activate
If you are using a Linux-based system:
source env/bin/activate
step 2:
Install all required packages:
pip install -r requirements.txt
step 3:
For the model training pipeline, you need to download some pretrained model weights from here and extract them to the path
[root_dir]/Tensorflow-Extended-tutorial/models
(or)
you can download them on the fly by changing the value of the parameter in the config.py file
FILE PATH: [root_dir]/Tensorflow-Extended-tutorial/utils/configurations
Change line 15 to => UNIVERSAL_EMBEDDING_MODEL = "https://tfhub.dev/google/universal-sentence-encoder/4"
Everything done! Let's go!
The sequence to follow for a better understanding of TFX is given below. The notebooks are created so that each one depends on the previous one.
- Data Ingestion and mldatastore
- Data Validation
- Data Preprocessing
- Model Training
- Tuner + Training
Also try to explore Apache Beam and the helper functions we have used in the utils folder.
Note: this repo is still in development. We plan to cover all TFX components within a month, and we also plan to develop an end-to-end pipeline producing a production-ready model with MLOps-based deployment, organized the way it would be in a real organization.