In my view, a good AI solution requires a balance between the model and the quality of the data, and I lean more toward the data side. Andrew Ng and his team demonstrated that data quality is key by showing it in an experiment with real-world data.
The common practice among researchers is to hold the data fixed while trying to improve the code. But when the dataset size is modest (<10,000 examples), Andrew Ng suggests ML teams will make faster progress by improving the data, provided the dataset is good.
The table below shows the result of the experiment, which illustrates why a data-centric approach beats a model-centric one. If your model is already at its best, the task of improving it further to reach 90% accuracy sounds almost impossible.
For the model-centric approach, the improvements came from neural architecture search and state-of-the-art architectures, whereas the data-centric approach focused on identifying inconsistencies and cleaning noisy labels. The results show what the data-centric approach can achieve.
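As a toy illustration of the data-centric side (this is not Ng's actual experiment, and `find_label_conflicts` is a hypothetical helper), one simple cleaning step is to flag inputs that appear in the dataset with more than one label:

```python
from collections import defaultdict

def find_label_conflicts(examples):
    """Return inputs that appear with more than one distinct label.

    Hypothetical helper for illustration; `examples` is a list of
    (input_text, label) pairs.
    """
    labels_by_text = defaultdict(set)
    for text, label in examples:
        labels_by_text[text].add(label)
    return {text: sorted(labels)
            for text, labels in labels_by_text.items()
            if len(labels) > 1}

# Toy defect-inspection dataset: the same input is labeled two different ways.
data = [("scratch on casing", "defect"),
        ("scratch on casing", "ok"),
        ("dent near port", "defect")]
conflicts = find_label_conflicts(data)
# conflicts == {"scratch on casing": ["defect", "ok"]}
```

Flagged inputs can then be sent back for relabeling, which is often cheaper than another round of architecture search.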
Andrew Ng mentioned how everyone jokes that ML is 80% data preparation, yet no one seems to care. A quick look at arXiv gives an idea of the direction ML research is heading: there is unprecedented competition around beating benchmarks. If Google has BERT, then OpenAI has GPT-3. But these fancy models make up only 20% of a business problem.
In fact, MLOps is essential to connect the dots and take these steps to the next level while ensuring consistency, completeness, and relevancy. The most important objective of MLOps is to ensure a high-quality and consistent flow of data throughout all stages of a project.
There are a number of goals enterprises want to achieve by successfully implementing MLOps across the organization, including:
- Deployment and automation
- Reproducibility of models and predictions
- Governance and regulatory compliance
- Scalability
- Monitoring and management
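To make the monitoring goal concrete, a minimal sketch of one common check (the metric and thresholds here are illustrative, not part of any specific MLOps product) compares the distribution of a model's training-time predictions against its live predictions:

```python
import math
from collections import Counter

def psi(expected, actual, bins):
    """Population Stability Index: ~0 means the two samples are
    similarly distributed; larger values signal drift.

    Illustrative monitoring check, not tied to a specific tool.
    """
    e_counts, a_counts = Counter(expected), Counter(actual)
    n_e, n_a = len(expected), len(actual)
    score = 0.0
    for b in bins:
        p_e = max(e_counts[b] / n_e, 1e-6)  # floor avoids log(0)
        p_a = max(a_counts[b] / n_a, 1e-6)
        score += (p_a - p_e) * math.log(p_a / p_e)
    return score

# Training-time predictions vs. live predictions for a binary classifier.
train = ["pos"] * 80 + ["neg"] * 20
live = ["pos"] * 50 + ["neg"] * 50
drift = psi(train, live, ["pos", "neg"])
# drift well above 0 indicates the live distribution has shifted
```

A monitoring system would run a check like this on a schedule and alert when the score crosses a chosen threshold.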
Orchestrators in TFX automate task execution and monitor TFX components. One of the most widely used TFX orchestrators is Apache Beam. Apache Beam is a unified batch and stream distributed processing API that acts as an abstraction layer over the underlying distributed processing framework. This allows you to work on diverse backends such as Apache Spark, Google Cloud Dataflow, or your local machine.
This repo gives a broad overview of how to use each TFX component both standalone and as part of an MLOps pipeline. The notebooks in this repo depend on each other: each notebook expects the previous one to have been executed. Each notebook explains the standalone execution of a component and orchestrates it using TFX's interactive context. We have used the metadata store heavily to establish links between notebooks. Follow the sequence mentioned below:
Take the following steps for a smooth start:
step 1:
Clone the repo and create a virtual environment in the path [root_dir]/Tensorflow-Extended-tutorial
python -m venv env
Activate the environment using the command below.
If you are using Windows:
env\Scripts\activate
If you are using a Linux-based system:
source env/bin/activate
step 2:
Install all required packages:
pip install -r requirements.txt
step 3:
For the model training pipeline, you need to download some pretrained model weights from here and extract them to the path
[root_dir]/Tensorflow-Extended-tutorial/models
(or)
you can download them on the fly by changing the value of the parameter in the config.py file
FILE PATH: [root_dir]/Tensorflow-Extended-tutorial/utils/configurations
Change line 15 to => UNIVERSAL_EMBEDDING_MODEL = "https://tfhub.dev/google/universal-sentence-encoder/4"
Everything done! Let's go!
The sequence to follow for a better understanding of TFX is given below. The notebooks are created so that each one depends on the previous one.
- Data Ingestion and mldatastore
- Data Validation
- Data Preprocessing
- Model Training
- Tuner + Training
Also try to explore Apache Beam and the helper functions we have used in the utils folder.
Note: this repo is still in development. We plan to cover all TFX components within a month, and we also plan to develop an end-to-end pipeline producing a production-ready model with MLOps-based deployment, organized the way it would be in a real organization.