# ISDN3150-Week2-Section3: Introduction to LLM & Image Generation

***>>> What you will learn in this workshop:***
- How to call an LLM API to process large batches of tasks automatically
- How to use the HKUST GenAI API
- How to use the HKUST GenAI API to call DALL-E-3 for image generation

## Part1: Using LLM by Calling API

Most of you have tried, and are probably quite familiar with, using an LLM through a web app. However, chatting in a web app is manual, so it can be neither scaled nor automated. In many cases we have a large batch of materials and questions to process, and there are basically two ways to do that:

1. Deploy an open-source LLM locally -> this requires compute and storage resources

2. Call the API of any hosted LLM you can find -> this costs you some money (the example below uses Alibaba Cloud DashScope; API keys: https://bailian.console.aliyun.com/cn-beijing/?spm=a2c4g.11186623.0.0.1c185e8esFkknc&tab=app#/api-key)
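
Hosted APIs authenticate each request with a key, and a key pasted into a notebook is easy to leak when the notebook is shared. The sketch below shows one common alternative: reading the key from an environment variable. The helper name `load_api_key` and the variable name `DASHSCOPE_API_KEY` are assumptions made for illustration, not something this workshop prescribes.

```python
import os


def load_api_key(var_name: str = "DASHSCOPE_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    `var_name` is a hypothetical default; use whatever name you exported.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a readable message instead of a failed API request later.
        raise RuntimeError(
            f"Set the {var_name} environment variable before calling the API."
        )
    return key


# Usage (assumes the variable was exported before starting the notebook):
# api_key = load_api_key()
```

The returned string can then be passed as the `api_key` argument when the client is created.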

In [1]:
!uv pip install dashscope
!uv pip install openai

Using Python 3.12.12 environment at: /usr
Resolved 19 packages in 321ms
Prepared 1 package in 66ms
Installed 1 package in 9ms
 + dashscope==1.25.12
Using Python 3.12.12 environment at: /usr
Audited 1 package in 133ms


In [None]:
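import os
from openai import OpenAI

# DashScope exposes an OpenAI-compatible endpoint, so the standard OpenAI client works.
client = OpenAI(
    # Use the environment variable if it is set (the name DASHSCOPE_API_KEY is an assumption);
    # otherwise replace the placeholder with your own key.
    api_key=os.getenv("DASHSCOPE_API_KEY", "!!!put your api keys here"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Send one chat request: a system prompt plus a single user question.
completion = client.chat.completions.create(
    model="qwen2-7b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is self attention?"}
    ]
)
# The reply text is stored in the first choice.
print(completion.choices[0].message.content)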

Self-attention is a mechanism used in artificial intelligence, particularly in natural language processing (NLP) tasks, where a model can weigh the importance of different parts of the input data when making predictions or decisions. It allows each element in a sequence to attend to all other elements in the same sequence, enabling the model to capture complex dependencies and interactions between elements.

In the context of neural networks, self-attention typically involves an operation called "multi-head attention," which combines multiple attention mechanisms, each focusing on different aspects of the input. This approach helps to improve the model's ability to understand the nuances of the input data and perform more sophisticated reasoning.

Self-attention was popularized by the Transformer architecture, introduced by Vaswani et al. in 2017, which has since become a cornerstone in NLP due to its ability to handle long-range dependencies in sequences and its efficiency in parallel

Suppose we have dozens of sentences, each describing a person doing something, and we want to extract the action label of each sentence.

In [14]:
data = [
    "A man in a green tank top and white pants stands on a purple yoga mat and brings a wine glass from a high-reaching position down towards his face to drink from it.",
    "A man in a green tank top and white sweatpants stands on a yoga mat, holding a wine glass on his forearm while performing a gentle side bend exercise.",
    "A man in a blue t-shirt and black shorts demonstrates a badminton stance, moving from a ready position to a defensive squat while holding a racket.",
    "A man in a blue shirt, black shorts, and glasses stands in a ready badminton stance, demonstrating a backhand drop shot swing without a shuttlecock.",
    "A man wearing a cap, a dark shirt, and light-colored pants stands on a set of steps while holding a skateboard and talking on his phone.",
    "A young man in a grey t-shirt and red shorts performs overhead arm circles behind his back while holding a purple towel.",
    "A man in a white tank top and dark shorts catches a basketball with both hands.",
    "A man in a white tank top and dark shorts stands in an athletic stance on a basketball court, holding a basketball and slightly turning his torso.",
    "A man in a white tank top and dark shorts practices a basketball jab step and shot fake on a basketball court.",
    "A man in a white tank top and black compression pants holds a basketball in a triple threat stance on a basketball court, then pivots on his right foot while turning his body to his right.",
    "A man wearing a black short-sleeved shirt and khaki shorts stands still, facing away from the camera, while looking at a painting in an art gallery.",
    "A man in a black shirt and khaki shorts stands still with his hands clasped behind his back while looking at a painting in an art gallery.",
    "A man with grey hair, wearing a blue shirt and black pants, stands in front of a screen and gestures with both hands while talking.",
    "A man in a blue shirt and black pants stands in place while talking and gesturing with his hands.",
    "A woman in a black tank top and leggings performs plie squats while holding onto the back of a red chair for support.",
    "A woman in black workout clothes performs a sumo squat while holding onto the back of a red chair for support.",
    "A man in a green jacket and jeans stands still while gesturing with his hands as he talks.",
    "A bearded man wearing a dark puffer jacket, black t-shirt, and dark pants stands on a hoverboard, talking and gesturing with his hands while spinning in place.",
    "A man wearing a black hoodie and blue jeans walks forward on the grass.",
    "A man in a white cap and grey shirt walks around an indoor golf practice area, holding a golf club and preparing for a shot.",
    "A man in a grey pullover, blue pants, and a white cap is standing and demonstrating a golf swing posture, bringing the club down to address the ball.",
    "A man in a grey long-sleeved shirt, blue pants, and a white cap stands in a golf address position, looks up at the camera, and then returns his gaze to the golf ball on the floor.",
    "A man in a grey jacket, blue jeans, and a white cap stands on a green mat, holding a golf club and practicing his swing.",
    "A man wearing a grey jacket, blue jeans, and a white cap walks in a small circle on a green mat while holding a golf club and adjusting his glove.",
    "A man in a grey jacket, blue jeans, and a white cap demonstrates a golf swing with a golf club in an indoor setting.",
    "A man in a grey sweatshirt, blue jeans, and a white cap practices his golf swing.",
    "A man wearing a white cap, grey jacket, and blue jeans holds a golf club and walks in a small circle in a golf simulation room.",
    "A man in a grey pullover and a white cap walks from a golf bag, picks up a golf club, and walks forward while looking at it.",
    "A man in a white cap, grey long-sleeved shirt, and blue jeans walks from the right side of a room toward the left, before turning to face the wall and picking up a golf club.",
    "A man in a white cap, grey jacket, and blue jeans walks towards the camera in a golf studio, holding a golf club and transitioning into a setup position.",
    "A man wearing a grey pullover, blue jeans, and a white cap practices his golf swing indoors.",
    "A man in a grey shirt, jeans, and a white cap stands in a golf practice room, holding a golf club and looking around."
]

# Send one request per sentence and print the label the model returns.
for sentence in data:
    completion = client.chat.completions.create(
        model="qwen2-7b-instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"The given sentence describes a person is doing something, only response a single action label word: {sentence}"}
        ]
    )
    print(f"[given sentence]: {sentence}")
    print(f"[extracted action label]: {completion.choices[0].message.content}")
    print()

[given sentence]: A man in a green tank top and white pants stands on a purple yoga mat and brings a wine glass from a high-reaching position down towards his face to drink from it.
[extracted action label]: Drinking

[given sentence]: A man in a green tank top and white sweatpants stands on a yoga mat, holding a wine glass on his forearm while performing a gentle side bend exercise.
[extracted action label]: Performing

[given sentence]: A man in a blue t-shirt and black shorts demonstrates a badminton stance, moving from a ready position to a defensive squat while holding a racket.
[extracted action label]: demonstrates

[given sentence]: A man in a blue shirt, black shorts, and glasses stands in a ready badminton stance, demonstrating a backhand drop shot swing without a shuttlecock.
[extracted action label]: "staging"

[given sentence]: A man wearing a cap, a dark shirt, and light-colored pants stands on a set of steps while holding a skateboard and talking on his phone.
[extracted
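
The raw replies above are not uniform: capitalization varies and one label is wrapped in quotes. A batch job also has to survive the occasional failed request. The following is a minimal sketch of both ideas, under stated assumptions: `normalize_label` and `label_sentences` are hypothetical helpers, and `ask` stands for any function that sends one sentence to the model and returns the reply text (for example, a thin wrapper around the `client.chat.completions.create` call used above).

```python
import re
import time
from typing import Callable, Iterable, List


def normalize_label(raw: str) -> str:
    """Reduce a raw model reply to a single lowercase word."""
    # Keep only the first run of letters; this drops quotes, punctuation
    # and any extra words the model may have added.
    match = re.search(r"[A-Za-z]+", raw)
    return match.group(0).lower() if match else ""


def label_sentences(
    sentences: Iterable[str],
    ask: Callable[[str], str],
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> List[str]:
    """Label every sentence, retrying failed requests with exponential backoff."""
    labels = []
    for sentence in sentences:
        for attempt in range(max_retries):
            try:
                labels.append(normalize_label(ask(sentence)))
                break
            except Exception:
                # Give up after the last attempt; otherwise wait and retry.
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)
    return labels
```

Passing the request function in as `ask` keeps the batching logic independent of any particular provider, so the same loop can be reused with the HKUST GenAI client in Part2.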

## Part2: Use HKUST GenAI API

Follow the instructions at https://itso.hkust.edu.hk/services/it-infrastructure/azure-openai-api-service to set up your account and API key.

In [17]:
!uv pip install openai

import openai
print("openai library version: ", openai.__version__)

Using Python 3.12.12 environment at: /usr
Audited 1 package in 88ms
openai library version:  2.17.0


In [None]:
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="!!!put your api keys here",
    api_version="2025-02-01-preview",
    azure_endpoint="https://hkust.azure-api.net"
)

response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the difference between self attention and cross attention?"}
    ]
)
print(response.choices[0].message.content)

Short answer
- Self-attention: queries (Q), keys (K) and values (V) all come from the same source (the same sequence). It lets each position in that sequence attend to other positions in the same sequence.
- Cross-attention: queries come from one source and keys/values come from a different source. It lets one sequence (or modality) look up information in another.

Core equation (same for both)
attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V

What changes
- Self-attention: Q = K = V = X (one tensor).
- Cross-attention: Q = Y (e.g., decoder states), K = V = X (e.g., encoder outputs or image features).

Where you see them
- Encoder layers usually use self-attention to model relationships inside the input sequence.
- Decoder stacks in encoder–decoder transformers use masked self-attention (to prevent peeking at future tokens) plus a cross-attention layer that lets the decoder queries attend to encoder outputs.
- Multimodal models use cross-attention to let text attend to image features or 
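
The formula in the reply above can be checked with a few lines of plain Python. This is only an illustrative sketch: like the simplified statement `Q = K = V = X` in the reply, it omits the learned projection matrices that a real Transformer layer applies, and the tiny inputs `X` and `Y` are made up for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V on lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one sequence with 3 positions
Y = [[0.5, 0.5]]                          # a second sequence with 1 position

self_out = attention(X, X, X)   # self-attention: Q, K, V from the same sequence
cross_out = attention(Y, X, X)  # cross-attention: Q from Y, K and V from X
```

The output always has one row per query, so `self_out` has three rows and `cross_out` has one; that is the only structural difference between the two calls.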

# Image Generation

## Part3: Image Generation

We can also use HKUST GenAI to call image generation models such as DALL-E-3.
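
Image requests take a few parameters whose values are restricted, and a bad value is only reported after the request is sent. The sketch below validates the arguments locally first. The helper `build_image_request` is hypothetical, and the allowed values are those documented for DALL-E 3 in the OpenAI API; whether the HKUST gateway accepts all of them is an assumption.

```python
# Values documented for DALL-E 3 in the OpenAI API (assumed to apply here too).
DALLE3_SIZES = {"1024x1024", "1792x1024", "1024x1792"}
DALLE3_QUALITIES = {"standard", "hd"}


def build_image_request(prompt, size="1024x1024", quality="standard"):
    """Return keyword arguments for an image request, or raise ValueError."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if size not in DALLE3_SIZES:
        raise ValueError(f"size must be one of {sorted(DALLE3_SIZES)}")
    if quality not in DALLE3_QUALITIES:
        raise ValueError(f"quality must be one of {sorted(DALLE3_QUALITIES)}")
    # DALL-E 3 generates one image per request, so n is fixed at 1.
    return {
        "model": "dall-e-3",
        "prompt": prompt,
        "size": size,
        "quality": quality,
        "n": 1,
    }


# Usage with an existing client:
# response = client.images.generate(**build_image_request("A ballet dancer"))
```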

In [None]:
!uv pip install openai
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="!!!put your api keys here",
    api_version="2025-02-01-preview",
    azure_endpoint="https://hkust.azure-api.net"
)

response = client.images.generate(
    model="dall-e-3",
    prompt="A ballet dancer with challenging pose and sophisticated dressing",
    size="1024x1024",
    quality="standard",
    n=1
)
image_url = response.data[0].url
print(f"Generated image URL: {image_url}")

Using Python 3.12.12 environment at: /usr
Audited 1 package in 89ms
Generated image URL: https://dalleproduse.blob.core.windows.net/private/images/8826ed9f-9fa1-4a33-bdeb-e0e790ee29e5/generated_00.png?se=2026-02-12T08%3A46%3A19Z&sig=luGQ4D3NVCbfJW6DAVmwt%2BM5geBvMxNV79AiNeAdCaI%3D&ske=2026-02-14T04%3A03%3A28Z&skoid=09ba021e-c417-441c-b203-c81e5dcd7b7f&sks=b&skt=2026-02-07T04%3A03%3A28Z&sktid=33e01921-4d64-4f8c-a055-5bdaffd5e33d&skv=2020-10-02&sp=r&spr=https&sr=b&sv=2020-10-02
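
The returned URL is a temporary signed link: its `se` query parameter is the expiry time of the Azure Storage shared access signature, so the image should be saved locally before the link stops working. A minimal sketch follows, assuming the URL keeps this format; `sas_expiry` and `download_image` are hypothetical helper names, and the short example URL is made up (only its `se` value is copied from the output above).

```python
from datetime import datetime
from urllib.parse import parse_qs, urlparse
from urllib.request import urlretrieve


def sas_expiry(url):
    """Return the link's expiry as an aware datetime, or None if there is no `se` parameter."""
    query = parse_qs(urlparse(url).query)
    values = query.get("se")
    if not values:
        return None
    # parse_qs already decodes %3A to ':'; swap the trailing 'Z' for an explicit UTC offset.
    return datetime.fromisoformat(values[0].replace("Z", "+00:00"))


def download_image(url, path="generated.png"):
    """Save the image behind `url` to `path` and return the local path."""
    saved_path, _headers = urlretrieve(url, path)
    return saved_path


# Hypothetical shortened URL for illustration only.
example_url = (
    "https://example.blob.core.windows.net/private/images/generated_00.png"
    "?se=2026-02-12T08%3A46%3A19Z&sp=r"
)
expires_at = sas_expiry(example_url)

# Usage with the real response (performs a network request):
# download_image(image_url, "ballet_dancer.png")
```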
