# 04 - Multimodal and Structured Data

So far we have only looked at models that take text as input and produce text as output. Many models, however, support a wider range of inputs and outputs. Alternatives to plain text include:

* Image
* Video
* Audio
* JSON

We'll now briefly look at how local LLMs can support multimodal inputs and outputs, as well as structured data.

## 4.1 Image analysis

Ollama supports the [LLaVA family of models](https://llava-vl.github.io/), which allow for image analysis and understanding.

In [1]:
import ollama

In [2]:
messages=[
    dict(role='user',
         content='Describe this image',
         images=['../img/HF-logo-horizontal.png'])
]

In [3]:
%%time
result = ollama.chat(
    model="llava-phi3:3.8b",
    messages=messages
)

CPU times: user 4.74 ms, sys: 10.2 ms, total: 14.9 ms
Wall time: 9.36 s


In [4]:
print(result['message']['content'])

The image presents a playful and friendly scene. Dominating the center of the black background is a yellow smiley face, its arms stretched out in an open hug. The eyes of the smiley face are filled with joy, adding to the overall cheerful vibe of the image.

Just below this heartwarming figure, there's a line of text that reads "Huggling Face". The words are written in a blue color and appear to be slightly tilted downwards, creating a dynamic contrast with the upward-facing smiley face above them. The combination of these elements creates a sense of comfort and positivity.
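
The message above attaches the image by a relative path, which only resolves correctly when the notebook runs from the expected working directory. As a minimal sketch, a small helper can resolve and check the path before the request is sent. The name `build_image_message` is hypothetical and is not part of the `ollama` package; the helper only builds the message dict and does not call the model.

```python
from pathlib import Path


def build_image_message(prompt: str, image_path: str) -> dict:
    """Return a user message dict that attaches one image by absolute path.

    Hypothetical helper: it fails early with a clear error if the file is missing.
    """
    path = Path(image_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Image not found: {path}")
    # Same structure as the `messages` entry used above
    return dict(role='user', content=prompt, images=[str(path)])
```

The returned dict can be placed in the `messages` list passed to `ollama.chat`, exactly like the hand-written message earlier in this section.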


### EXERCISE: Use the LLaVA model to analyze the context window trend graph

*(5 minutes)*

Using the same technique, analyze the file in the tutorial repo at this path:

`img/meibel-ai-context-window-size-history.png`

1. Get a general description
2. Ask specific questions about the image
3. Ask for the main observations based on the data in the graph

## 4.2 Structured Data

There are two general mechanisms which help coerce model output into a structured format:

* A model trained to produce structured output in a particular format (e.g. [Osmosis 600M](https://huggingface.co/osmosis-ai/Osmosis-Structure-0.6B))
* A model wrapped with a layer that enforces structured input and/or output
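
To make the second mechanism concrete, here is a deliberately simplified sketch of one possible wrapping layer: call the model, check the reply, and retry if it is not valid. This is an illustration only and is not how any particular library implements enforcement. The names `generate_structured`, `generate`, and `required_keys` are hypothetical, and `generate` stands in for any function that sends a prompt to a model and returns its text reply.

```python
import json
from typing import Callable


def generate_structured(generate: Callable[[str], str],
                        prompt: str,
                        required_keys: set,
                        max_attempts: int = 3) -> dict:
    """Call `generate` until it returns a JSON object containing all required keys."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        if not isinstance(data, dict):
            last_error = "expected a JSON object"
            continue
        missing = required_keys - set(data)
        if not missing:
            return data
        last_error = f"missing keys: {sorted(missing)}"
    raise ValueError(f"No valid output after {max_attempts} attempts ({last_error})")
```

A layer like this only rejects bad output after the fact. Schema-based approaches, such as the one used in the rest of this section, describe the expected structure to the runtime up front.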

Ollama accepts a JSON schema through its `format` parameter, and [`pydantic`](https://docs.pydantic.dev/latest/) models provide a straightforward way to define that schema and validate the output against it. Some other options include:

* [`instructor`](https://useinstructor.com/)
* [`outlines`](https://github.com/dottxt-ai/outlines)
* [BAML](https://boundaryml.com/)

In [5]:
from ollama import chat
from pydantic import BaseModel

In [6]:
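from typing import Optional, Type, Union

class ChatSession:
    """Chat wrapper that keeps the message history and can parse replies into a Pydantic model."""

    def __init__(self,
                 model: str,
                 format: Optional[Type[BaseModel]] = None,
                 system: str = 'You are a helpful chatbot'):
        self.model    = model
        self.format   = format   # a Pydantic model class (not an instance), or None for plain text
        self.system   = system
        # The history starts with the system prompt
        self.messages = [dict(role='system', content=system)]

    def prompt(self, msg: str) -> Union[str, BaseModel]:
        self.messages.append(dict(role='user', content=msg))

        # Pass the model's JSON schema so the reply is constrained to that structure
        json_format = self.format.model_json_schema() if self.format else None
        raw_response = chat(model=self.model,
                            messages=self.messages,
                            format=json_format).message.content

        # Parse the reply into the Pydantic model if one was given
        if self.format:
            response = self.format.model_validate_json(raw_response)
        else:
            response = raw_response

        # Keep the raw text (not the parsed object) in the history
        self.messages.append(dict(role='assistant', content=raw_response))

        return response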

In [8]:
class Address(BaseModel):
    street: str
    city:   str
    state:  str
    zip:    str
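
`ChatSession` sends `Address.model_json_schema()` as the `format` argument. For a flat model like `Address`, that schema has roughly the shape shown below. The dict here is written by hand as an approximation, so the exact keys may differ between Pydantic versions, and `conforms` is a hypothetical stand-in for real validation that only handles required string fields.

```python
import json

# Hand-written approximation of Address.model_json_schema() (an assumption, not captured output)
address_schema = {
    "title": "Address",
    "type": "object",
    "properties": {
        "street": {"title": "Street", "type": "string"},
        "city":   {"title": "City", "type": "string"},
        "state":  {"title": "State", "type": "string"},
        "zip":    {"title": "Zip", "type": "string"},
    },
    "required": ["street", "city", "state", "zip"],
}


def conforms(raw_json: str, schema: dict) -> bool:
    """Minimal check: a JSON object in which every required key holds a string."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), str) for key in schema["required"])


reply = ('{"street": "765 Loyalist Drive", "city": "Saint John", '
         '"state": "New Brunswick", "zip": "E2K 5G5"}')
print(conforms(reply, address_schema))  # prints True
```

In the notebook itself, `model_validate_json` does this job properly and returns an `Address` instance instead of a boolean.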

In [9]:
cs = ChatSession(model='gemma2:2b', 
                 format=Address, 
                 system='Please extract addresses from each prompt')

In [None]:
%%time
me = cs.prompt('I live at 123 Washington St, in the city of Syracuse, NY. My zip code is 13444')
me

In [12]:
me.street

'123 Washington St'

In [11]:
%%time
parents = cs.prompt('''
My parents live in Saint John, New Brunswick.
Their street address is 765 Loyalist Drive.
Their postal code is E2K 5G5.
''')
parents

CPU times: user 3.58 ms, sys: 4.6 ms, total: 8.18 ms
Wall time: 1.37 s


Address(street='765 Loyalist Drive', city='Saint John', state='New Brunswick', zip='E2K 5G5')

In [13]:
parents.zip

'E2K 5G5'

In [14]:
type(parents)

__main__.Address

In [None]:
cs.messages

### EXERCISE: Extract movie review into JSON

*(5 minutes)*

Using the text content of [these short movie reviews](https://letterboxd.com/paragraphfilms/reviews/) as the prompt input (not including the score), create a Pydantic `MovieReview` model that captures the key elements of each review: movie name, year of release, sentiment, and summary.