# **Final Project of AI & ML (Generative AI)**

**Image_Captioning-and-Text_Recognition**

This project builds a Gradio interface for two tasks: image captioning, and text recognition for text that is either handwritten or printed (digital). The results of both tasks are shown in English and Arabic.

* Install and import the required libraries: `transformers` and `torch` for loading the pretrained models, `gradio` for the interface, and `PIL` and `requests` for fetching and processing the images.



In [1]:
# Install needed library
!pip install gradio
!pip install transformers
!pip install torch



In [2]:
# Import needed library
from PIL import Image
import gradio as gr
import torch
import requests
import re
from transformers import pipeline, BlipProcessor, BlipForConditionalGeneration, TrOCRProcessor, VisionEncoderDecoderModel



* Download example images for the Gradio interface



In [3]:
# Download example images for the image-captioning tab
img_urls_1 = ['https://i.pinimg.com/564x/f7/f5/bd/f7f5bd929e05a852ff423e6e02deea54.jpg', 'https://i.pinimg.com/564x/b4/29/69/b4296962cb76a72354a718109835caa3.jpg',
        'https://i.pinimg.com/564x/f2/68/8e/f2688eccd6dd60fdad89ef78950b9ead.jpg']
for idx1, url1 in enumerate(img_urls_1):
  image = Image.open(requests.get(url1, stream=True).raw)
  image.save(f"image_{idx1}.png")

In [4]:
# Download example images for the two text-recognition tabs
img_urls_2 = ['https://i.pinimg.com/564x/14/b0/07/14b0075ccd5ea35f7deffc9e5bd6de30.jpg', 'https://newsimg.bbc.co.uk/media/images/45510000/jpg/_45510184_the_writings_466_180.jpg',
        'https://cdn.shopify.com/s/files/1/0047/1524/9737/files/Cetaphil_Face_Wash_Ingredients_Optimized.png?v=1680923920', 'https://github.com/kawther12h/Image_Captioning-and-Text_Recognition/blob/main/handText22.jpg?raw=true','https://github.com/kawther12h/Image_Captioning-and-Text_Recognition/blob/main/handText11.jpg?raw=true']
for idx2, url2 in enumerate(img_urls_2):
  image = Image.open(requests.get(url2, stream=True).raw)
  image.save(f"tx_image_{idx2}.png")
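
The two download loops above stop at the first failed request, while the Gradio examples later refer to the saved files by index. A minimal sketch of a more defensive variant, using only the standard library, could look like the following. `download_examples` and its parameters are hypothetical names, not part of the notebook, and unlike the original loops this sketch stores the raw bytes under the URL's own extension instead of re-encoding to PNG with PIL.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_examples(urls, prefix, out_dir=".", timeout=10.0):
    """Download each URL to <out_dir>/<prefix>_<index><ext>.

    Returns a dict mapping each index to the saved Path, or to None when
    the download failed, so one bad URL does not abort the whole loop.
    (Hypothetical helper; the notebook itself uses requests + PIL.)
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = {}
    for idx, url in enumerate(urls):
        # Keep the extension found in the URL path; fall back to ".img".
        ext = Path(urlparse(url).path).suffix or ".img"
        target = out_path / f"{prefix}_{idx}{ext}"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                target.write_bytes(response.read())
            saved[idx] = target
        except (OSError, ValueError):
            # URLError and HTTPError are OSError subclasses;
            # ValueError covers malformed URLs.
            saved[idx] = None
    return saved
```

With the notebook's lists, a call such as `download_examples(img_urls_1, "image")` would return paths that could be checked for `None` before they are passed to `examples=`.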

**Image Captioning**

In [5]:
# Load Blip model and processor for captioning
processor_blip = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model_blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

# Load marefa model for translation (English to Arabic)
translate = pipeline("translation",model="marefa-nlp/marefa-mt-en-ar")

def caption_and_translate(img, min_len, max_len):
    # Generate English caption
    raw_image = Image.open(img).convert('RGB')     # Open the image file and convert it to RGB

    inputs_blip = processor_blip(raw_image, return_tensors="pt")     # Prepare the image tensors for the BLIP model

    out_blip = model_blip.generate(**inputs_blip, min_length=int(min_len), max_length=int(max_len))     # Generate the token IDs of the English caption

    english_caption = processor_blip.decode(out_blip[0], skip_special_tokens=True)     # Decode the token IDs into text


    # Translate caption from English to Arabic
    arabic_caption = translate(english_caption)
    arabic_caption = arabic_caption[0]['translation_text']

    # The Arabic caption is formatted with right-to-left directionality.
    translated_caption = f'<div dir="rtl">{arabic_caption}</div>'


    # Return both caption and translated caption
    return english_caption, translated_caption


# Gradio interface with multiple outputs
img_cap_en_ar = gr.Interface(
    fn=caption_and_translate, # The function that processes the image
    #Users can upload an image and adjust the minimum and maximum caption lengths
    inputs=[gr.Image(type='filepath'),
            gr.Slider(label='Minimum Length', minimum=1, maximum=500, value=30),
            gr.Slider(label='Maximum Length', minimum=1, maximum=500, value=100)],

    outputs=[gr.Textbox(label='English Caption'),
             gr.HTML(label='Arabic Caption')],

    title='Image Captioning | وصف الصورة',
    description="Upload an image to generate an English & Arabic caption | قم برفع صورة وأرسلها ليظهر لك وصف للصورة",
    examples =[["image_0.png"], ["image_1.png"], ["image_2.png"]]
)
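
The two sliders are independent, so a user can set the minimum length above the maximum, and slider values may arrive as floats. A minimal sketch of a guard that could run before `model_blip.generate` is shown here. `normalize_lengths` is a hypothetical helper, and its default bounds simply mirror the slider range used above.

```python
def normalize_lengths(min_len, max_len, lower=1, upper=500):
    """Return (min_length, max_length) as ints with lower <= min <= max <= upper.

    Hypothetical helper: rounds slider values to ints, clamps them to the
    slider range, and swaps them if the user set min above max.
    """
    lo = int(round(min_len))
    hi = int(round(max_len))
    # Clamp both values into [lower, upper].
    lo = max(lower, min(lo, upper))
    hi = max(lower, min(hi, upper))
    # Swap if the pair is inverted.
    if lo > hi:
        lo, hi = hi, lo
    return lo, hi
```

Inside `caption_and_translate`, the pair could be computed once with `min_len, max_len = normalize_lengths(min_len, max_len)` and then passed on to the generate call.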







**Text Recognition**

In [6]:
# Load the OCR-Donut model
text_rec = pipeline("image-to-text", model="jinhybr/OCR-Donut-CORD")

# Load marefa model for translation (English to Arabic)
translate = pipeline("translation",model="marefa-nlp/marefa-mt-en-ar")

# Function to process the image and extract text
def extract_text(image):
    # Pass the image to the pipeline to extract text
    result = text_rec(image)

    # Take the generated string and strip Donut's XML-like markup tags
    text = result[0]['generated_text']
    text = re.sub(r'<[^>]*>', '', text)  # Remove every <...> tag

    # Translate extracted text from English to Arabic
    arabic_text3 = translate(text)
    arabic_text3 = arabic_text3[0]['translation_text']

    # Wrap the translation in a right-to-left block
    htranslated_text = f'<div dir="rtl">{arabic_text3}</div>'

    # Return the extracted text and its Arabic translation
    return text, htranslated_text

# Define the Gradio interface
text_recognition = gr.Interface(
    fn=extract_text,                    # The function that processes the image
    inputs=gr.Image(type="pil"),        # Input is an image (PIL format)

    outputs=[gr.Textbox(label='Extracted Text'), gr.HTML(label='Translation of Extracted Text')],   # Outputs: extracted text and its Arabic translation

    title="Text Extraction and Translation | استخراج النص وترجمته",
    description="Upload an image, then click Submit to extract its text and translate it to Arabic | قم برفع الصورة وأرسلها ليظهر لك النص من الصورة",
    examples =[["tx_image_0.png"], ["tx_image_2.png"]],
)
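
Two details of `extract_text` are worth a closer look. Removing tags with an empty replacement can glue neighbouring fields together, and the translated string is placed into HTML without escaping. The sketch below shows one possible way to handle both with the standard library. `strip_markup` and `rtl_block` are hypothetical helper names, and the sample string in the usage note is illustrative, not real model output.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")


def strip_markup(text):
    """Replace every <...> tag with a space, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    return " ".join(no_tags.split())


def rtl_block(text):
    """Wrap text in a right-to-left <div>, escaping HTML special characters."""
    return f'<div dir="rtl">{html.escape(text)}</div>'
```

For example, `strip_markup("<s_nm> Latte</s_nm><s_price> 4.50</s_price>")` keeps the two fields apart as `"Latte 4.50"`, and `rtl_block` could replace the f-string that builds `htranslated_text`.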



**Handwritten Text Recognition**

In [7]:
# Load trocr model for handwritten text extraction
processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')

# Load marefa model for translation (English to Arabic)
translate = pipeline("translation",model="marefa-nlp/marefa-mt-en-ar")

# Function to process the image and extract text

def recognize_handwritten_text(image2):
  pixel_values = processor(images=image2, return_tensors="pt").pixel_values # Preprocess the image into a pixel-value tensor

  generated_ids = model.generate(pixel_values) # Generate token IDs for the recognized text

  generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] # Decode the token IDs into text

  # Translate extracted text from English to Arabic
  arabic_text2 = translate(generated_text)
  arabic_text2 = arabic_text2[0]['translation_text']

  #Formats the translated text in right-to-left direction
  htranslated_text = f'<div dir="rtl">{arabic_text2}</div>'

  # Return the extracted text and translated text
  return generated_text, htranslated_text

# Gradio interface with image upload input and text output
handwritten_rec = gr.Interface(
    fn=recognize_handwritten_text,
    inputs=gr.Image(type="pil"),
    outputs=[gr.Textbox(label='English Text'),
             gr.HTML(label='Arabic Text')],
    title="Handwritten Text Extraction | استخراج النص المكتوب بخط اليد وترجمته",
    description="Upload an image, then click Submit to extract its text and translate it to Arabic | قم برفع الصورة وأرسلها ليظهر لك النص من الصورة",
    examples =[["tx_image_1.png"], ["tx_image_3.png"]]
)


Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-base-handwritten and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



**Launch the gradio interface**

In [None]:
# Combine all interfaces into a tabbed interface
demo = gr.TabbedInterface([img_cap_en_ar, text_recognition, handwritten_rec], ["Extract_Caption", "Extract_Digital_text", "Extract_HandWritten_text"])
demo.launch(debug=True, share=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://169cf06a81d40dfeb0.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Token indices sequence length is longer than the specified maximum sequence length for this model (756 > 512). Running this sequence through the model will result in indexing errors
Your input_length: 756 is bigger than 0.9 * max_length: 512. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1935, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1520, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "/usr/local/lib/python3.10/di
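
The warnings in the log above report a 756-token input against the translation model's 512-token limit, which suggests the extracted text was too long to translate in one call. One possible workaround, sketched here under the assumption that word count is an acceptable rough proxy for token count, is to translate the text in smaller pieces. `chunk_words`, `translate_long`, and the `translate_fn` callable are hypothetical names; in this notebook `translate_fn` could wrap the `translate` pipeline, for example `lambda s: translate(s)[0]['translation_text']`.

```python
def chunk_words(text, max_words=100):
    """Split text into pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def translate_long(text, translate_fn, max_words=100):
    """Translate text chunk by chunk and join the results with spaces.

    translate_fn is any callable that maps a string to a string
    (an assumption of this sketch, not an API of the notebook).
    """
    parts = [translate_fn(chunk) for chunk in chunk_words(text, max_words)]
    return " ".join(parts)
```

Splitting on word boundaries can cut a sentence in half, so translation quality at the seams may suffer; splitting on sentence boundaries or counting tokens with the model's own tokenizer would be a more careful variant.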