Skip to content

gmh5225/simpleVLM

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🚀 SimpleVLM: A Vision-Language Model from Scratch

SimpleVLM

Welcome to the SimpleVLM repository! This project focuses on building a simple Vision-Language Model (VLM) by implementing the LlaMA-SmolLM2 and the SigLip2 Vision Model from scratch. We have also integrated KV-Caching, designed to enhance performance and efficiency.

🌟 Table of Contents

📖 Overview

In the field of artificial intelligence, combining vision and language is a growing area of research. SimpleVLM aims to provide a straightforward implementation of this concept, making it easier for developers and researchers to experiment with and understand Vision-Language Models.

Why SimpleVLM?

  • Educational Resource: This project serves as a learning tool for those interested in deep learning and multimodal models.
  • Modular Design: Each component is designed to be modular, allowing users to customize and extend functionality.
  • Performance: With the integration of KV-Caching, we enhance the efficiency of the model, making it suitable for various applications.

⚙️ Features

  • LlaMA-SmolLM2 Implementation: A lightweight version of the LlaMA model, designed for ease of use and adaptability.
  • SigLip2 Vision Model: A robust vision model that works seamlessly with language inputs.
  • KV-Caching Support: Implemented from scratch, this feature improves model performance during inference.
  • Multimodal Capabilities: SimpleVLM can handle both text and image inputs, making it versatile for various tasks.
  • Fine-tuning Options: Users can fine-tune the models for specific tasks, enhancing their effectiveness.

🛠️ Installation

To get started with SimpleVLM, follow these steps:

  1. Clone the Repository:

    git clone https://github.com/SuyogKamble/simpleVLM.git
    cd simpleVLM
  2. Install Requirements: Ensure you have Python 3.7 or later installed. Then, run:

    pip install -r requirements.txt
  3. Set Up Environment: It’s advisable to create a virtual environment for this project. You can do this using venv or conda.

📚 Usage

Once you have installed SimpleVLM, you can start using it for your projects. Here’s a simple example to get you started:

from simpleVLM import SimpleVLM

model = SimpleVLM()
output = model.predict(image_path="path/to/image.jpg", text="Describe this image.")
print(output)

Model Inference

To run inference with the model, you need to provide both an image and a text prompt. The model will then generate a response based on the inputs.

🧠 Models

LlaMA-SmolLM2

The LlaMA-SmolLM2 model is designed for text generation tasks. It can generate coherent and contextually relevant text based on input prompts.

SigLip2 Vision Model

The SigLip2 Vision Model processes images and extracts meaningful features. This model works in tandem with the language model to provide a comprehensive understanding of multimodal inputs.

🏋️ Training

If you wish to train the models on your own datasets, follow these steps:

  1. Prepare Your Dataset: Ensure your dataset is formatted correctly. It should contain paired image and text data.

  2. Training Script: Use the provided training script to start training your model.

    python train.py --data_path path/to/dataset
  3. Monitoring: You can monitor the training process using TensorBoard.

🤝 Contributing

We welcome contributions to SimpleVLM! If you have ideas for features, improvements, or bug fixes, please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Make your changes and commit them.
  4. Push your branch and create a pull request.

Please ensure your code follows the existing style and includes appropriate tests.

📄 License

This project is licensed under the MIT License. Feel free to use and modify the code as per the license terms.

🚀 Releases

For the latest releases and updates, visit our Releases section. Download the latest version and start building your own Vision-Language Model.

If you encounter any issues or have questions, feel free to check the "Releases" section for more information.

🌐 Topics

This repository covers various topics including:

  • Computer Vision
  • Deep Learning
  • Fine-tuning LLMs
  • Fine-tuning Vision Models
  • Hugging Face
  • LLM
  • Multimodal
  • NLP
  • PyTorch
  • Transformers
  • Vision-Language Model
  • VLM

🎉 Conclusion

Thank you for checking out SimpleVLM! We hope this project serves as a valuable resource for your work in the field of Vision-Language Models. Whether you are a researcher, developer, or enthusiast, we encourage you to explore the capabilities of this model and contribute to its growth.

For further details and updates, please visit our Releases section.

About

building a simple VLM. Implementing LlaMA-SmolLM2 from scratch + SigLip2 Vision Model. KV-Caching is supported and implemented from scratch as well

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • Jupyter Notebook 99.3%
  • Python 0.7%