In [None]:
!wget --quiet -O items.zip https://github.com/jsoma/nicar24-beyond-chatgpt/raw/main/items.zip
!unzip -o -q items.zip

## Image analysis

In [None]:
!pip install --quiet timm transformers

### Object detection

Object detection does just what it says it does: identifies objects in a picture!

[See a zero-shot object detector here](https://huggingface.co/spaces/wendys-llc/object-detection)

Let's look at this image:

<img src="coffee.png" style="max-width: 400px; display: block; margin: auto;">

We'll use the most common model, [facebook/detr-resnet-50](https://huggingface.co/facebook/detr-resnet-50)

In [None]:
from transformers import pipeline
pipe = pipeline("object-detection", model="facebook/detr-resnet-50")

pipe("coffee.png")
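
The pipeline hands back a list of dictionaries, one per detected object, each with a `label`, a `score`, and a `box`. Here is a minimal sketch of tallying what was found, assuming that output shape. The `count_objects` helper, the 0.9 cutoff, and the sample data are hypothetical and only for illustration; they are not part of transformers or real model output.

```python
from collections import Counter

def count_objects(detections, min_score=0.9):
    """Count detected labels, ignoring low-confidence detections."""
    return Counter(d["label"] for d in detections if d["score"] >= min_score)

# Hand-written sample in the shape the pipeline returns (not real model output)
sample = [
    {"score": 0.99, "label": "cup", "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 140}},
    {"score": 0.95, "label": "cup", "box": {"xmin": 200, "ymin": 30, "xmax": 290, "ymax": 150}},
    {"score": 0.42, "label": "spoon", "box": {"xmin": 120, "ymin": 160, "xmax": 180, "ymax": 175}},
]

counts = count_objects(sample)
print(counts)
```

With the real model, `count_objects(pipe("coffee.png"))` should give the same kind of tally.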

### Image segmentation

Image segmentation assigns a label to every individual pixel in an image. It's really useful for questions like "using satellite photography, what percentage of our city is greenery?"

[See an example here](https://huggingface.co/spaces/thiagohersan/maskformer-coco-vegetation-gradio)
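
To make the "what percentage is greenery?" question concrete, here is a rough sketch of the arithmetic once you have a mask for one label. It assumes the mask arrives as a flat sequence of pixel values where non-zero means "this pixel belongs to the label". The `mask_fraction` helper and the toy mask are made up for illustration.

```python
def mask_fraction(pixels):
    """Return the share of pixels that are switched on (non-zero) in a mask."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if p) / len(pixels)

# A tiny hand-made 2x4 mask, flattened row by row: 255 = greenery, 0 = everything else
toy_mask = [255, 255, 0, 0,
            255, 0,   0, 0]

print(f"{mask_fraction(toy_mask):.1%} greenery")
```

With a transformers `image-segmentation` pipeline, each result typically carries its `mask` as a PIL image, so something like `mask_fraction(result["mask"].getdata())` should work for the vegetation label, though the label names depend on the model you pick.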

### Panoptic segmentation

[Another example](https://huggingface.co/spaces/wendys-llc/mask2former-demo)

### Image description

I'm going to come out strong and say **image descriptions are not good for alt text**. The best alt text is written by someone who knows why the article is written, what is being written about, and why the image is relevant. By understanding that context, the writer of the alt text is able to convey more than just a flat description that might miss important details.

...but despite that, [here is a good tweet thread](https://twitter.com/arvindsatya1/status/1674876209543389184) about a dataset on using ML to write captions for data visualizations.

In [None]:
!pip install --quiet clip_interrogator pillow

In [None]:
from clip_interrogator import Config, Interrogator
ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))

In [None]:
from PIL import Image
image = Image.open("coffee.png").convert('RGB')

ci.interrogate(image)

One of the big reasons for using image descriptions is *trying to make similar images with AI*. OpenAI offers a model called CLIP that you can use to (somewhat) reverse-engineer the prompt that was used to create an image.

[Check out the CLIP Interrogator](https://huggingface.co/spaces/pharmapsychotic/CLIP-Interrogator)

## Table and form analysis

As journalists, *we love to complain about PDFs*. While we might be used to using tools like Tabula to wrangle CSVs out of them, the world has moved on to newer, nicer tools. Most of the time it has to do with invoices: auto-detecting tables, OCR, and figuring out what the different parts of the page are (the total, the items, the billing address) are all now reasonably possible with open-source models.

> Speaking of table extraction: if you haven't seen [pdfplumber](https://github.com/jsvine/pdfplumber), it's a *delight*. I made [a Hugging Face demo](https://huggingface.co/spaces/wendys-llc/pdfplumber-demo) of it even though it isn't AI-powered.

### Image-based Q&A

Asking questions about an image is related to all of the above image-describing bits: it can only answer questions that involve things that can be identified/described by CLIP/object detection/etc.

[Here's a great example comparing different models](https://huggingface.co/spaces/wendys-llc/comparing-VQA-models)

Another part of that is **document-based Q&A**, which is mostly "let's look at invoices." [You can see the most popular model here](https://huggingface.co/impira/layoutlm-document-qa), and the one that's [fine-tuned for invoices here](https://huggingface.co/impira/layoutlm-invoices).

Let's take a look at this invoice:

<img src="invoice-template-us-neat-750px.png" style="width: 300px; margin: auto;">

Along with the model, we need to install `pytesseract` to read the text on the page. It depends on the tesseract binary, which you can install with `apt install tesseract-ocr` on Colab or `brew install tesseract` on macs.

In [None]:
!pip install --quiet pytesseract

In [None]:
from transformers import pipeline

nlp = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

In [None]:
nlp(
    "invoice-template-us-neat-750px.png",
    "What is the invoice number?"
)
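
The result comes back as a list of candidate answers, each with a `score` and an `answer`. A small sketch of keeping only an answer you can reasonably trust, assuming that shape. The `best_answer` helper, the 0.5 threshold, and the sample values are hypothetical, not real output from the model.

```python
def best_answer(results, min_score=0.5):
    """Return the highest-scoring answer text, or None if nothing clears min_score."""
    if not results:
        return None
    top = max(results, key=lambda r: r["score"])
    return top["answer"] if top["score"] >= min_score else None

# Hand-written sample in the shape the pipeline returns (values are invented)
sample = [
    {"score": 0.92, "answer": "us-001", "start": 15, "end": 15},
    {"score": 0.03, "answer": "11/02/2019", "start": 22, "end": 22},
]

print(best_answer(sample))
```

Returning `None` for weak answers is a deliberate choice here: a blank you have to check by hand is usually safer than a confident-looking wrong invoice number.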

You can also use full URLs instead of local files. This image is from [here](https://templates.invoicehome.com/invoice-template-us-neat-750px.png).

**But when you're working with PDFs**, you probably want to use the [docquery](https://github.com/impira/docquery) library, which is basically the same, but a little easier to use. Sadly it needs an old version of pydantic and transformers and will probably cause all sorts of trouble. Hopefully you use virtual environments! It also isn't actively maintained anymore. *But* it works with PDFs!

> You'll need to run `brew install poppler` on macs or `apt install poppler-utils` on Colab.

In [None]:
!pip install --quiet "docquery" pydantic==1.10 transformers==4.23

Let's look at the same document above, but as a PDF with a text layer.

In [None]:
from docquery import document, pipeline

p = pipeline('document-question-answering')

In [None]:
doc = document.load_document("invoice.pdf")

In [None]:
p(question="What is the invoice number?", **doc.context)

In [None]:
p(question="What is the sales tax rate?", **doc.context)

In [None]:
p(question="What are the items on the invoice?", **doc.context)

## Image classification (doing it yourself)

[Head over to here](https://huggingface.co/autotrain)! Autotrain is in the middle ground between "the most convenient tool" and "sometimes it breaks and you're confused about how to fix it." You can [see how to use it with image classification here](https://huggingface.co/docs/autotrain/image_classification). **You need to change `mixed_precision` from `fp16` to `false` in order for it to work!**

In the example below, we're classifying amber mines like [this piece from Texty](https://texty.org.ua/d/2018/amber_eng/). It used to be the subject of [a whole set of education materials about machine learning](https://newsinitiative.withgoogle.com/resources/trainings/hands-on-machine-learning/investigating-stories-with-machine-learning/) from Google News Initiative, but now you can reproduce it all *in all of five minutes*.

After you use the autotrainer, it builds you a new model, [like this one](https://huggingface.co/wendys-llc/autotrain-stfol-259iu). It'll be named something awful, and it won't give you an alert or anything: it will just magically appear and your space will shut down. You'll need to make the model public, then you can do something similar to the below.

Images to test: [this](51.52066320000001_26.477740000000196_09_Mar_2015_GMT_super_9.png) and [this](51.51732900000001_26.461450000000195_09_Mar_2015_GMT_super_6.png) and [this](51.61886970000046_27.210790000000262_05_Oct_2015_GMT_super_8.png) and [this](51.46398180000005_26.31484000000018_09_Mar_2015_GMT_super_10.png) and [this](51.601393_26.555139_09_Mar_2015_GMT_super_24.png) and [this](51.46731600000005_26.320270000000182_09_Mar_2015_GMT_super_20.png). It works magically and we only have 25 examples of each!

> We're going to reinstall transformers so that we make sure we have the most recent one.

In [None]:
!pip install -U --quiet transformers timm

In [None]:
from transformers import pipeline

classifier = pipeline("image-classification", model="wendys-llc/autotrain-stfol-259iu")

In [None]:
# 51.52066320000001_26.477740000000196_09_Mar_2015_GMT_super_9.png
# 51.51732900000001_26.461450000000195_09_Mar_2015_GMT_super_6.png
# 51.61886970000046_27.210790000000262_05_Oct_2015_GMT_super_8.png
# 51.46398180000005_26.31484000000018_09_Mar_2015_GMT_super_10.png
# 51.601393_26.555139_09_Mar_2015_GMT_super_24.png
# 51.46731600000005_26.320270000000182_09_Mar_2015_GMT_super_20.png

from PIL import Image
image = Image.open("51.601393_26.555139_09_Mar_2015_GMT_super_24.png")
classifier(image)
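
Rather than swapping filenames in by hand, you could run every test image through the classifier in one go. This is a sketch only: `top_labels` is a hypothetical helper, and `fake_classifier` with its made-up labels stands in for the real model so the example runs without downloading anything.

```python
def top_labels(paths, classify):
    """Map each image path to the (label, score) of its highest-scoring prediction."""
    results = {}
    for path in paths:
        predictions = classify(path)
        best = max(predictions, key=lambda p: p["score"])
        results[path] = (best["label"], best["score"])
    return results

# Stand-in for the real classifier; the labels and scores are invented
def fake_classifier(path):
    return [{"label": "positive", "score": 0.8}, {"label": "negative", "score": 0.2}]

summary = top_labels(["a.png", "b.png"], fake_classifier)
print(summary)
```

With the real model, passing `classifier` and the list of six filenames in place of `fake_classifier` and the placeholder paths should work, since the pipeline accepts file paths as well as PIL images.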

## Format shifting

A valuable concept for making use of all these new tools is *format shifting*, the idea that we can convert willy-nilly between different types of media.

### Audio to text

Speech to text – transcription – is something we've covered at length already. Using OpenAI's [Whisper](https://github.com/openai/whisper) we're easily able to convert from audio to text.

In [None]:
!pip install --quiet openai-whisper

In [None]:
import whisper

model = whisper.load_model("tiny.en")
result = model.transcribe("sample-4.mp3")
result["text"]
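
Besides the full text, the result also includes a `segments` list where each piece has a `start`, an `end`, and its own `text`. Here is a small sketch of turning those into timestamped lines, which is handy when you need to jump back to the audio. The helper names and the sample segments are made up for illustration.

```python
def format_timestamp(seconds):
    """Turn a number of seconds into an MM:SS string."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def timestamped_lines(segments):
    """Prefix each segment's text with the time it starts."""
    return [f"[{format_timestamp(s['start'])}] {s['text'].strip()}" for s in segments]

# Hand-written segments in the shape Whisper returns (not a real transcript)
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": " Hello and welcome."},
    {"start": 64.5, "end": 70.0, "text": " Let's get started."},
]

lines = timestamped_lines(sample_segments)
for line in lines:
    print(line)
```

On a real transcription you would call `timestamped_lines(result["segments"])`.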

### Text to speech

[Elevenlabs](https://elevenlabs.io/) is my favorite tool for doing voice work. I (unfortunately) haven't found an open-source model that does nearly as well. Point it to five minutes of video and *voilà*, you have the ability to create oddly lifelike speech from a piece of text.

### Speech to ...speech?

One of the problems with text-to-speech is it *probably* won't capture the intonation, the feeling, the stress we're looking for. If we're especially picky about how good our communication is, we can use a speech-to-speech model! A speech-to-speech model can basically mask our voice with someone else's, maintaining all of those emotional characteristics that are lost in the text-to-speech process.

You can also use [Elevenlabs](https://elevenlabs.io/) for this.

### Video to images

The easiest way to process video is *not* looking for a video-specific model. Instead, you just cut up your video into a collection of still frames and run image analysis on each and every one of them!

> `!brew install ffmpeg` for macs

In [None]:
!ffmpeg -i sora-dog.mp4 -vf fps=1 output%d.png
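
With `fps=1` we get one image per second of video, numbered from 1. If you later need to say *when* something appeared, you can work backwards from the filename. This is a rough sketch under that assumption: `frame_seconds` is a hypothetical helper, and the result is approximate because ffmpeg picks the frame nearest each tick rather than an exact instant.

```python
import re

def frame_seconds(filename, fps=1):
    """Estimate the offset in seconds of a frame file like 'output12.png'."""
    match = re.search(r"(\d+)\.png$", filename)
    if match is None:
        raise ValueError(f"No frame number found in {filename!r}")
    frame_number = int(match.group(1))
    # Frames are numbered from 1, so the first frame sits at roughly 0 seconds
    return (frame_number - 1) / fps

print(frame_seconds("output12.png"))
```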

Once we've extracted the frames, we can use an **object detection model** to see what's inside each one.

In [None]:
!pip install -U --quiet transformers timm

In [None]:
from PIL import Image
image = Image.open("output1.png")

In [None]:
from transformers import pipeline
pipe = pipeline("object-detection", model="facebook/detr-resnet-50")

pipe("output1.png")
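
To cover the whole video instead of a single frame, you could loop over every extracted image and note where each object shows up. A minimal sketch, assuming the detector returns dictionaries with `label` and `score`. The `frames_per_label` helper, the 0.9 cutoff, and the fake results are hypothetical stand-ins so this runs without the model.

```python
from collections import defaultdict

def frames_per_label(frame_paths, detect, min_score=0.9):
    """For each label, list the frames where it was detected with enough confidence."""
    seen = defaultdict(list)
    for path in frame_paths:
        labels = {d["label"] for d in detect(path) if d["score"] >= min_score}
        for label in sorted(labels):
            seen[label].append(path)
    return dict(seen)

# Invented detections standing in for real model output
fake_results = {
    "output1.png": [{"score": 0.98, "label": "dog"}],
    "output2.png": [
        {"score": 0.97, "label": "dog"},
        {"score": 0.95, "label": "person"},
        {"score": 0.30, "label": "cat"},
    ],
}

index = frames_per_label(sorted(fake_results), fake_results.get)
print(index)
```

With the real pipeline you would pass `pipe` as `detect` along with your list of frame files. Watch the ordering if you build that list with a plain sort, since `output10.png` sorts before `output2.png`.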

While rare, there *are* a few video-specific models. They might do **scene segmentation**, splitting videos into chunks, or **speaker identification**, which uses cues like face detection and whose mouth is moving to identify who on the audio track is speaking.

## Object tracking in videos

[Norfair](https://github.com/tryolabs/norfair) is a Python library for tracking objects across video frames.

### Text to video

OpenAI's [Sora](https://openai.com/sora) is the new entrant to generative video, and it looks *great*. Previously folks were excited about [Runway ML](https://runwayml.com/) buuuuut Sora blows it out of the water.

## Image to 3D

3D is moving at breakneck speed, but much of it is based around tools as opposed to models. You can [see a Hugging Face demo here](https://huggingface.co/spaces/zxhezexin/OpenLRM).

# Reflection

Image- and video-based models aren't nearly as easy to use as the text-based models! They're sometimes great but at other points fall flat on their face. They're also kind of tough to install and wrangle from a software point of view.

Fun, though!