Can small mobile models understand text?

In this project we investigated whether small mobile models can understand text. To that end, we trained a MobileNet image encoder paired with a Lite-Transformer text encoder, following the CLIP and CLIP-Lite methods, on $10\%$ of the MS-COCO Captions dataset. The results of this training are presented below.
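For intuition, the sketch below shows the symmetric contrastive (InfoNCE) objective that CLIP-style dual-encoder training optimizes. The function name, tensor shapes, and temperature value are illustrative assumptions, not the exact implementation in `train_coco.py`.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Illustrative sketch: both inputs are (batch, dim), and the temperature
    value is a common default, not necessarily the one used in this repo.
    """
    # Project both sets of embeddings onto the unit sphere.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Cosine-similarity logits between every image and every caption.
    logits = image_features @ text_features.t() / temperature

    # Matched image-caption pairs lie on the diagonal of the logits matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In this setup, MobileNet produces the image embeddings and the Lite-Transformer produces the text embeddings, and the loss pulls matched pairs together in a shared space.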

Zero-shot Capabilities

[Figure: zero-shot classification examples]

The model exhibits zero-shot classification ability despite having only $44M$ parameters, far fewer than the original CLIP model.
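Concretely, zero-shot classification with a dual-encoder model of this kind can be sketched as below. The `image_encoder`, `text_encoder`, and `tokenizer` callables are hypothetical stand-ins for models that map into a shared embedding space; this is not the repository's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer,
                       image: torch.Tensor, class_names: list[str],
                       prompt: str = "A photo of a") -> str:
    """Pick the class whose prompted caption best matches the image.

    Illustrative sketch: the encoders and tokenizer are assumed to exist
    and to produce embeddings in the same space.
    """
    # Embed the image once; add a batch dimension for the encoder.
    image_feat = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)

    # Build one prompted caption per candidate class, e.g. "A photo of a dog".
    captions = [f"{prompt} {name}" for name in class_names]
    text_feats = F.normalize(text_encoder(tokenizer(captions)), dim=-1)

    # Cosine similarity between the image and every caption; highest wins.
    similarities = (image_feat @ text_feats.t()).squeeze(0)
    return class_names[similarities.argmax().item()]
```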

Visual-Text Grounding

[Figure: visual-text grounding example]

When prompted to recognize a specific object in an image, the model was able to highlight that object, as shown in the figure above, indicating that it possesses visual-text grounding capability.
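One common way to produce such highlights, sketched below, is to compute a cosine-similarity map between the prompt embedding and the spatial feature map of the image encoder (taken before global pooling). This is an illustrative approach under the assumption that both have already been projected into the same embedding space; it is not necessarily how the repository generates its grounding figures.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def grounding_heatmap(spatial_features: torch.Tensor,
                      text_feature: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity map between a prompt embedding and image locations.

    Illustrative sketch: spatial_features is (C, H, W) from the image
    backbone, text_feature is (C,), both assumed to share the same space.
    """
    c, h, w = spatial_features.shape

    # Treat every spatial location as its own C-dimensional embedding.
    locations = F.normalize(spatial_features.reshape(c, h * w).t(), dim=-1)  # (H*W, C)
    text = F.normalize(text_feature, dim=-1)                                 # (C,)

    # Cosine similarity per location, reshaped back into an H x W heatmap
    # and rescaled to [0, 1] so it can be overlaid on the input image.
    heatmap = (locations @ text).reshape(h, w)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    return heatmap
```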

Run this project

  1. Clone the repository by running the following command in your terminal: `git clone https://github.com/lucifermorningstar1305/qriousAI.git`

  2. Set up the training environment using the following command: `conda env create -f qrious_env.yml`

  3. To train the models, run the following command:

```
python train_coco.py \
    --train_data_path <path where you have stored the MS-COCO Captions training dataset as a CSV> \
    --val_data_path <path where you have stored the MS-COCO Captions validation dataset as a CSV> \
    --config_path ./configs/config.yaml \
    --checkpoint_filename <filename for your checkpoint> \
    --max_epochs 500 \
    --early_stopping_patience 5 \
    --data_size .1 \
    --accumulate_grad_batches 10
```
    
  4. To evaluate the models, run the following command:

```
python evaluate_models.py \
    --root_dir <path to store the evaluation dataset> \
    --dataset <name of the PyTorch dataset to download> \
    --model_checkpoint <checkpoint of the model to evaluate> \
    --config_path ./configs/config.yaml \
    --prompt "A photo of a"
```

Results on Standard Benchmarks

[Figure: top-1 and top-5 accuracy on standard benchmarks]

The above chart shows the model's top-1 and top-5 accuracy on standard computer vision benchmarks. The scores are low compared to the original CLIP because only a small fraction of the data was used for training; with more data these accuracies can be improved.
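For reference, the top-1 and top-5 numbers in the chart correspond to the standard top-k accuracy metric, which can be computed from the image-to-class similarity scores as in this generic sketch (not the repository's evaluation code):

```python
import torch

def top_k_accuracy(scores: torch.Tensor, targets: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (N, num_classes) similarity scores; targets: (N,) integer labels.
    """
    top_k = scores.topk(k, dim=-1).indices                 # (N, k) class indices
    hits = (top_k == targets.unsqueeze(-1)).any(dim=-1)    # (N,) boolean hits
    return hits.float().mean().item()
```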

Inaccurate Results

There are cases where the model fails to perform zero-shot classification or to locate specific objects in an image. The following figure highlights those cases:

[Figure: examples of inaccurate zero-shot classification and visual-text grounding]

The first figure shows a zero-shot classification failure: given an image of a modern concept car, the model classifies the image as a jet.

The second figure shows a visual-text grounding failure: given the prompt "The person in the image", the model highlights the husky instead.

These inaccuracies indicate that the model needs to be trained on more data in order to recognize objects more accurately.

About

Exploring the capabilities of lightweight vision and language models in contrastive multimodal learning.
