## Part 4 Assignment: From Image to Text: Understanding Visual Content with Language Models

### Objective

The goal of this assignment is to explore how visual information can be represented in textual form and how large language models (LLMs) can reason about such representations.  
You will convert a personal image into a text-based (ASCII-style) image and investigate whether a language model can understand and describe the visual content from the textual representation alone.

---

### Task Description

#### Step 1: Image Acquisition

- Capture or generate an image of yourself.
  - You may use a photograph, a self-portrait, or an AI-generated image that represents you.
- The image should clearly show a human face or upper body.

---

#### Step 2: Image-to-Text Conversion

- Convert the image into a **text-based image representation** (e.g., ASCII art).
- The conversion should:
  - Use a fixed-width (monospace) font.
  - Preserve basic visual structure such as contours, shading, or facial features.
- You may use Python with any library (e.g., PIL or OpenCV).

Below is an example:

![Input Image](imgs/image.png)
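
The core of the conversion in Step 2 is a lookup from pixel brightness to a character. Below is a minimal sketch, assuming the image is already grayscale and available as rows of integers in the range 0-255. The ramp `RAMP` and the helper name `to_ascii` are illustrative choices, not something the assignment prescribes.

```python
# Assumed character ramp, ordered from visually dense (dark) to sparse (light).
RAMP = "@#*+=-:. "


def to_ascii(gray_rows, ramp=RAMP):
    """Map each brightness value (0-255) to one character of the ramp."""
    last = len(ramp) - 1
    lines = []
    for row in gray_rows:
        # Integer arithmetic keeps 0 -> ramp[0] and 255 -> ramp[last].
        lines.append("".join(ramp[value * last // 255] for value in row))
    return "\n".join(lines)


# A tiny horizontal gradient from black to white.
gradient = [[0, 64, 128, 192, 255]] * 2
print(to_ascii(gradient))
```

A longer ramp gives finer shading, but neighboring characters become harder to tell apart, so a short ramp is often a reasonable starting point.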

---

#### Step 3: Language Model Interpretation

- Provide the text-based image as input to a language model.
- Ask the model to:
  - Describe what it sees in the text representation.
  - Infer high-level attributes (e.g., whether it looks like a face, a person, or an object).
- Record the model’s response.
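
One way to prepare the input for Step 3 is to wrap the text-based image in a prompt that states its dimensions and marks where it begins and ends. This is only a sketch: the helper name `build_prompt`, the `START`/`END` markers, and the exact wording of the questions are assumptions made for illustration.

```python
def build_prompt(ascii_art: str) -> str:
    """Wrap a text-based image in a prompt that states its size and the questions to answer."""
    rows = ascii_art.strip("\n").split("\n")
    width = max(len(row) for row in rows)
    header = (
        "The text between the START and END markers is an image rendered as "
        f"{len(rows)} rows of up to {width} characters each."
    )
    questions = (
        "1. Describe what you see in the text representation.\n"
        "2. Does it look like a face, a person, or an object?"
    )
    return f"{header}\n\nSTART\n" + "\n".join(rows) + f"\nEND\n\n{questions}"


# A tiny placeholder image, only to show the prompt layout.
tiny = "@@..\n@...\n"
print(build_prompt(tiny))
```

Stating the size and delimiting the image may help the model treat the block as a grid rather than as ordinary text, although whether it improves the description is something to check experimentally.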

In [None]:
import numpy as np
import cv2
from pathlib import Path
import os
import requests
from PIL import Image
from io import BytesIO

CHARS = '@W#$OEXC[(/?=^~_.` '

urls = [
    "https://github.com/hacker-is-undefeatable/ISDN3150_Week2_Assignment/blob/main/imgs/image_1.jpg?raw=true",
    "https://github.com/hacker-is-undefeatable/ISDN3150_Week2_Assignment/blob/main/imgs/image_2.jpg?raw=true",
    "https://github.com/hacker-is-undefeatable/ISDN3150_Week2_Assignment/blob/main/imgs/image_3.jpg?raw=true"
]

def mono(input_img, target_h=100, target_w=100):
    """
    Convert a grayscale image into a text-based (ASCII-style) image.

    input_img: a grayscale image of shape [h, w], given as a numpy array (dtype uint8).
    target_h: maximum height (in rows) of the generated string-based image.
    target_w: maximum width (in characters) of the generated string-based image.

    Note:
        1) You may not want to generate an image of an extremely large size.
        2) To implement this, you may need to resize the image using the resize function from the cv2 library.
        3) You may need to keep the aspect ratio when giving target_h and target_w.

    return:
        a string representing the image, one text line per image row
    """
    # Todo 1: implement your code here
    # Get original dimensions
    h, w = input_img.shape
    
    # Calculate aspect ratio and resize while maintaining aspect ratio
    aspect_ratio = w / h
    if aspect_ratio > target_w / target_h:
        # Width is the constraint
        new_w = target_w
        new_h = int(target_w / aspect_ratio)
    else:
        # Height is the constraint
        new_h = target_h
        new_w = int(target_h * aspect_ratio)
    
    # Resize the image
    resized_img = cv2.resize(input_img, (new_w, new_h))

    # Equalize histogram to improve contrast
    resized_img = cv2.equalizeHist(resized_img)
    
    # Normalize pixel values to 0-255 range
    resized_img = (resized_img - resized_img.min()) / (resized_img.max() - resized_img.min() + 1e-5) * 255
    resized_img = resized_img.astype(np.uint8)
    
    # Map pixel values to characters, building each row with join instead of repeated +=
    lines = []
    for row in resized_img:
        # Map pixel brightness (0-255) to character index (0 to len(CHARS)-1)
        lines.append("".join(CHARS[int((pixel / 255) * (len(CHARS) - 1))] for pixel in row))
    ascii_art = "\n".join(lines) + "\n"
    
    return ascii_art


# Todo 2: read images from multiple URLs
ascii_arts = []

for i, url in enumerate(urls):
    print(f"\nProcessing image {i+1}...")
    
    # Download and read the image from URL; stop early on HTTP errors or a stalled request
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    im = Image.open(BytesIO(response.content))
    
    # Convert PIL image to numpy array
    im = np.array(im)
    
    # Convert to grayscale if it is an RGB/RGBA image
    if len(im.shape) == 3:
        im = cv2.cvtColor(im, cv2.COLOR_RGB2GRAY)
    
    # Convert this to string art
    ascii_art = mono(im, target_h=50, target_w=100)
    ascii_arts.append(ascii_art)
    print(f"Image {i+1} ASCII art generated")
    print(ascii_art)


Processing image 1...
CEXEX(/.^~?^^^^^^~~~___..._~=/[?((([[(XXXEXCC[CEX?
CXXX[_[_(_C=^^^^^#@_C?([XO@`~.=?/=?/C[XXEEXXXCEE[[
X[[/((C[[[?^[^^~.W#E[[/ECXXW`````_~=/[/~(CEE/.EO/X
C/CC/[CC(/[(^^~~_WX@[~OC([[O`````.~~=_(~((CXECEE^=
OX([?/=?^^^^~~~.@@($#E^^?/X.$``````.^?/?[``CEEOO=[
XCX^C/C(/^~~~~..(@WX/O(E/(E^E`````````_~`.~=[EEEOX
EXC[/._~=_.~_.[X.$O[(CXC(XX/E```....`````~CCXCEEOO
X(=.^[[_`.`..WE_@OOE(OW/[=~(X```...__.```?CXXX[?EO
[~[._[```_...W@##^EO[/^_~X@[````.______`==([X[C=^X
[[~(~`.__....W@@.=OX$=__~/EO````______~_``=_?~(^^=
/C[~^......__#@..~$(X~___^O````.____~~~~~.```~~^^=
/_~_.....__.W#@.`^/OXCO_OOOC_``____~~~~~~~_``_~^==
^(~.........XW@.C?OO[WO@OE[([___~__~~~~~~~~~~~^^/(
~^.......`.`@@@@[EO[C@@@#@O/X((____~~^^^~^~~~_~=([
.......`....##@/[$_$?_$$($W[[XO/(=?^~^^^^^^^~~~?([
...``....``.@$@(@#_~_[[OO$#$$O([_.(_/^^^^^^^~~=~(C
_.``````````WE@@#@W[[[[##W$@@@EX(_[~^^//====^?(([[
^^.`````````WWWWW@@OCWOX@@@@@C(XXXXX/[(?//===~^[[[
=^~_....`````#W@@@@OE@X@@@@@@WXCE[^^^^[(=?#==^?([[
=^^_____
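
The docstring of `mono` notes that the aspect ratio should be kept when choosing `target_h` and `target_w`. One detail the resize above does not account for is that a monospace character cell is usually taller than it is wide, so a pixel-for-character mapping tends to look vertically stretched when displayed. The sketch below shows one possible correction. The helper name `fit_size` and the default `char_aspect=0.5` (cell width divided by cell height) are assumptions; the real ratio depends on the font and line spacing.

```python
def fit_size(h, w, target_h, target_w, char_aspect=0.5):
    """Return (rows, cols) that fit inside target_h x target_w and keep the image shape.

    char_aspect is the assumed width/height ratio of one character cell.
    """
    # Each row of text covers more image height than a column covers width,
    # so the image needs only h * char_aspect rows per w columns.
    effective_h = h * char_aspect
    scale = min(target_w / w, target_h / effective_h)
    cols = max(1, int(w * scale))
    rows = max(1, int(effective_h * scale))
    return rows, cols


# A square image fills 100 columns with only 50 rows.
print(fit_size(100, 100, target_h=50, target_w=100))
# A tall portrait image is limited by the row budget instead.
print(fit_size(200, 100, target_h=50, target_w=100))
```

The returned values could be passed to `cv2.resize` as `(cols, rows)`, since that function takes the size as width first and height second.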

In [27]:
from openai import AzureOpenAI
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-10-21",
    azure_endpoint="https://hkust.azure-api.net"
)

# Todo 3: Analyze each text-based image with the LLM
for i, ascii_art in enumerate(ascii_arts):
    print(f"\n{'='*60}")
    print(f"Analyzing ASCII art for image {i+1}...")
    print(f"{'='*60}\n")
    
    prompt_content = f"""Please analyze the following text-based image representation and describe what you see:

{ascii_art}

Based on this ASCII art representation, can you:
1. Describe what you observe in the text-based image
2. Infer what type of object or subject this represents (e.g., face, person, landscape)
3. Identify any visible features or characteristics"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that can analyze text-based image representations and understand visual content from ASCII art."},
            {
                "role": "user",
                "content": prompt_content
            }
        ]
    )
    print(response.choices[0].message.content)


Analyzing ASCII art for image 1...

1. **Description of the Text-Based Image:**
   The ASCII art representation consists mainly of letters, symbols, and various characters arranged in a structured pattern. This arrangement creates a visual texture rather than a clear image. The background contains a mixture of characters that include 'W', 'C', 'O', 'X', and various punctuation marks. There are regions that appear clearer and denser, while others seem more chaotic in arrangement.

2. **Type of Object or Subject:**
   While the ASCII art is quite abstract and does not depict a recognizable object or subject clearly, the patterns and forms generated by the text suggest a structure, possibly a landscape or abstract representation. The repetition of certain characters like 'W', 'C', and 'O' could imply an environment, perhaps something that represents a complex pattern like water or foliage, although it is not definitive.

3. **Visible Features or Characteristics:**
   - The art showcases 