# 2: Natural Language Processing (NLP)

### Build the `chatbot` pipeline using 🤗 Transformers Library

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

In [None]:
from transformers import pipeline

- Define the conversation pipeline

In [None]:
chatbot = pipeline(task="conversational",
                   model="facebook/blenderbot-400M-distill")

Info about ['blenderbot-400M-distill'](https://huggingface.co/facebook/blenderbot-400M-distill)

In [None]:
user_message = """
What are some fun activities I can do in the winter?
"""

In [None]:
from transformers import Conversation

In [None]:
conversation = Conversation(user_message)

In [None]:
print(conversation)

In [None]:
conversation = chatbot(conversation)

In [None]:
print(conversation)

- You can continue the conversation with the chatbot with:
```
print(chatbot(Conversation("What else do you recommend?")))
```
- However, the chatbot may provide an unrelated response because it does not have memory of any prior conversations.

- To include prior turns in the LLM's context, add the new message to the existing `conversation` object, which already holds the previous chat history.
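
The idea of carrying prior turns forward can be sketched without the library. The snippet below is a minimal illustration, assuming the history is kept as a list of `{"role", "content"}` dictionaries like the one passed to `add_message`. The `add_turn` helper and the placeholder assistant reply are hypothetical and are not part of 🤗 Transformers.

```python
# Minimal sketch of chat "memory": the full history is resent on every turn.
# `add_turn` is a hypothetical helper, not part of the transformers library.

def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content.strip()}]

history = []
history = add_turn(history, "user",
                   "What are some fun activities I can do in the winter?")
# Placeholder reply; a real model would generate this text.
history = add_turn(history, "assistant", "(model reply goes here)")
history = add_turn(history, "user", "What else do you recommend?")

# The model would receive all three messages, not just the last one.
for message in history:
    print(f'{message["role"]}: {message["content"]}')
```

Because the second user question travels together with the first exchange, the model has the context it needs to interpret "what else".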

In [None]:
conversation.add_message(
    {"role": "user",
     "content": """
What else do you recommend?
"""
    })

In [None]:
print(conversation)

In [None]:
conversation = chatbot(conversation)

print(conversation)

- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
- [LMSYS Chatbot Arena Leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)

### Try it yourself! 
- Try chatting with the model!

# 3: Translation and Summarization

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

### Build the `translation` pipeline using 🤗 Transformers Library

In [None]:
from transformers import pipeline 
import torch

In [None]:
translator = pipeline(task="translation",
                      model="facebook/nllb-200-distilled-600M",
                      torch_dtype=torch.bfloat16) 

NLLB: No Language Left Behind: ['nllb-200-distilled-600M'](https://huggingface.co/facebook/nllb-200-distilled-600M).



In [None]:
text = """\
My puppy is adorable, \
Your kitten is cute.
Her panda is friendly.
His llama is thoughtful. \
We all have nice pets!"""

In [None]:
text_translated = translator(text,
                             src_lang="eng_Latn",
                             tgt_lang="fra_Latn")

To choose other languages, you can find the other language codes on the page: [Languages in FLORES-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200)

For example:
- Afrikaans: afr_Latn
- Chinese: zho_Hans
- Egyptian Arabic: arz_Arab
- French: fra_Latn
- German: deu_Latn
- Greek: ell_Grek
- Hindi: hin_Deva
- Indonesian: ind_Latn
- Italian: ita_Latn
- Japanese: jpn_Jpan
- Korean: kor_Hang
- Persian: pes_Arab
- Portuguese: por_Latn
- Russian: rus_Cyrl
- Spanish: spa_Latn
- Swahili: swh_Latn
- Thai: tha_Thai
- Turkish: tur_Latn
- Vietnamese: vie_Latn
- Zulu: zul_Latn
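
To avoid typing the codes by hand, a small lookup table can help. This is only a sketch: the dictionary holds a handful of codes copied from the list above (plus `eng_Latn` from the translator call), and `translation_kwargs` is a hypothetical helper, not part of the pipeline API.

```python
# A few FLORES-200 codes copied from the list above.
FLORES_CODES = {
    "English": "eng_Latn",
    "French": "fra_Latn",
    "German": "deu_Latn",
    "Spanish": "spa_Latn",
    "Japanese": "jpn_Jpan",
}

def translation_kwargs(source, target):
    """Build the src_lang/tgt_lang keyword arguments for the translator call."""
    try:
        return {"src_lang": FLORES_CODES[source],
                "tgt_lang": FLORES_CODES[target]}
    except KeyError as err:
        raise ValueError(
            f"No FLORES-200 code listed for {err.args[0]!r}") from None

kwargs = translation_kwargs("English", "German")
print(kwargs)
```

With the pipeline loaded, the result could be passed along as `translator(text, **kwargs)`.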

In [None]:
text_translated

## Free up some memory before continuing
- In order to have enough free memory to run the rest of the code, please run the following to free up memory on the machine.

In [None]:
import gc

In [None]:
del translator

In [None]:
gc.collect()

### Build the `summarization` pipeline using 🤗 Transformers Library

In [None]:
summarizer = pipeline(task="summarization",
                      model="facebook/bart-large-cnn",
                      torch_dtype=torch.bfloat16)

Model info: ['bart-large-cnn'](https://huggingface.co/facebook/bart-large-cnn)

In [None]:
text = """Paris is the capital and most populous city of France, with
          an estimated population of 2,175,601 residents as of 2018,
          in an area of more than 105 square kilometres (41 square
          miles). The City of Paris is the centre and seat of
          government of the region and province of Île-de-France, or
          Paris Region, which has an estimated population of
          12,174,880, or about 18 percent of the population of France
          as of 2017."""

In [None]:
summary = summarizer(text,
                     min_length=10,
                     max_length=100)

In [None]:
summary

### Try it yourself! 
- Try this model with your own texts!

# 4: Sentence Embeddings

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

### Build the `sentence embedding` pipeline using 🤗 Transformers Library

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

More info on [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [None]:
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The movies are awesome']

In [None]:
embeddings1 = model.encode(sentences1, convert_to_tensor=True)

In [None]:
embeddings1

In [None]:
sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

In [None]:
embeddings2 = model.encode(sentences2, 
                           convert_to_tensor=True)

In [None]:
print(embeddings2)

* Calculate the cosine similarity between two sentences as a measure of how similar they are to each other.

In [None]:
from sentence_transformers import util

In [None]:
cosine_scores = util.cos_sim(embeddings1,embeddings2)

In [None]:
print(cosine_scores)

In [None]:
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
                                                 sentences2[i],
                                                 cosine_scores[i][i]))
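
For intuition about what `util.cos_sim` measures, here is a plain-Python sketch of cosine similarity on toy vectors. The three-dimensional vectors are made up for illustration; real sentence embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real sentence embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0.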

### Try it yourself! 
- Try this model with your own sentences!

# 5: Zero-Shot Audio Classification

The `librosa` library may need to have [ffmpeg](https://www.ffmpeg.org/download.html) installed. 
- This page on [librosa](https://pypi.org/project/librosa/) provides installation instructions for ffmpeg.

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

### Prepare the dataset of audio recordings

In [None]:
from datasets import load_dataset, load_from_disk

# This dataset is a collection of 5-second sound clips
dataset = load_dataset("ashraq/esc50",
                      split="train[0:10]")
# dataset = load_from_disk("ashraq/esc50/train")

In [None]:
audio_sample = dataset[0]

In [None]:
audio_sample

In [None]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])

### Build the `audio classification` pipeline using 🤗 Transformers Library

In [None]:
from transformers import pipeline

In [None]:
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused")

More info on [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused).

### Sampling Rate for Transformer Models
- How long does 1 second of high-resolution audio (192,000 Hz) appear to a model such as Whisper, which is trained to expect audio at 16,000 Hz?

In [None]:
(1 * 192000) / 16000

- The 1 second of high resolution audio appears to the model as if it is 12 seconds of audio.

- How about 5 seconds of audio?

In [None]:
(5 * 192000) / 16000

- 5 seconds of high resolution audio appears to the model as if it is 60 seconds of audio.
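
The same arithmetic can be wrapped in a small function. This is a sketch with a hypothetical helper name, `apparent_duration`; it only restates the two calculations above.

```python
def apparent_duration(seconds, actual_rate_hz, expected_rate_hz):
    """Duration the model perceives when samples recorded at actual_rate_hz
    are read as though they were recorded at expected_rate_hz."""
    num_samples = seconds * actual_rate_hz
    return num_samples / expected_rate_hz

one_second = apparent_duration(1, 192_000, 16_000)
five_seconds = apparent_duration(5, 192_000, 16_000)
print(one_second, five_seconds)
```

When the two rates match, the function returns the original duration, which is the situation the resampling step is meant to reach.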

In [None]:
zero_shot_classifier.feature_extractor.sampling_rate

In [None]:
audio_sample["audio"]["sampling_rate"]

* Set the correct sampling rate for the input and the model.

In [None]:
from datasets import Audio

In [None]:
dataset = dataset.cast_column(
    "audio",
     Audio(sampling_rate=48_000))

In [None]:
audio_sample = dataset[0]

In [None]:
audio_sample

In [None]:
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]

In [None]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

In [None]:
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane"]

In [None]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

### Try it yourself! 
- Try this model with some other labels and audio files!

# 6: Automatic Speech Recognition

The `librosa` library may need to have [ffmpeg](https://www.ffmpeg.org/download.html) installed. 
- This page on [librosa](https://pypi.org/project/librosa/) provides installation instructions for ffmpeg.

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

### Data preparation

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("librispeech_asr",
                       split="train.clean.100",
                       streaming=True,
                       trust_remote_code=True)

In [None]:
example = next(iter(dataset))

In [None]:
dataset_head = dataset.take(5)
list(dataset_head)

In [None]:
list(dataset_head)[2]

In [None]:
example

In [None]:
from IPython.display import Audio as IPythonAudio

IPythonAudio(example["audio"]["array"],
             rate=example["audio"]["sampling_rate"])

### Build the pipeline

In [None]:
from transformers import pipeline

In [None]:
asr = pipeline(task="automatic-speech-recognition",
               model="distil-whisper/distil-small.en")

Info about [distil-whisper/distil-small.en](https://huggingface.co/distil-whisper)

In [None]:
asr.feature_extractor.sampling_rate

In [None]:
example['audio']['sampling_rate']

In [None]:
asr(example["audio"]["array"])

In [None]:
example["text"]

### Build a shareable app with Gradio

### Troubleshooting Tip
- Note, in the classroom, you may see the code for creating the Gradio app run indefinitely.
  - This is specific to this classroom environment when it's serving many learners at once; you wouldn't experience this issue if you ran this code on your own machine.
- To fix this, please restart the kernel (Menu Kernel->Restart Kernel) and re-run the code in the lab from the beginning of the lesson.

In [None]:
import os
import gradio as gr

In [None]:
demo = gr.Blocks()

In [None]:
def transcribe_speech(filepath):
    if filepath is None:
        gr.Warning("No audio found, please retry.")
        return ""
    output = asr(filepath)
    return output["text"]

In [None]:
mic_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="microphone",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never")

To learn more about building apps with Gradio, you can check out the short course: [Building Generative AI Applications with Gradio](https://www.deeplearning.ai/short-courses/building-generative-ai-applications-with-gradio/), also taught by Hugging Face.

In [None]:
file_transcribe = gr.Interface(
    fn=transcribe_speech,
    inputs=gr.Audio(sources="upload",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never",
)

In [None]:
with demo:
    gr.TabbedInterface(
        [mic_transcribe,
         file_transcribe],
        ["Transcribe Microphone",
         "Transcribe Audio File"],
    )

demo.launch(inbrowser=True, share=True)

In [None]:
demo.close()

## Note: Please stop the demo before continuing with the rest of the lab.
- The app will continue running unless you run
  ```Python
  demo.close()
  ```
- If you run another Gradio app (later in this lesson) without first closing this app, you'll see an error message:
  ```Python
  OSError: Cannot find empty port in range
  ```

* Testing with a longer audio file

In [None]:
import soundfile as sf
import io

In [None]:
audio, sampling_rate = sf.read('narration_example.wav')

In [None]:
sampling_rate

In [None]:
asr.feature_extractor.sampling_rate

In [None]:
asr(audio)

_Note:_ Running the cell above will return:
```
ValueError: We expect a single channel audio input for AutomaticSpeechRecognitionPipeline
```


* Convert the audio from stereo to mono (using librosa)

In [None]:
audio.shape

In [None]:
import numpy as np

audio_transposed = np.transpose(audio)

In [None]:
audio_transposed.shape

In [None]:
import librosa

In [None]:
audio_mono = librosa.to_mono(audio_transposed)
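
Conceptually, downmixing stereo to mono averages the channels sample by sample, which is my understanding of what `librosa.to_mono` does. The plain-Python sketch below illustrates that idea on made-up sample values. The local `to_mono` function is a stand-in written for this example, not librosa's implementation.

```python
def to_mono(channels):
    """Average per-channel sample lists into one mono list.

    `channels` has shape (n_channels, n_samples), matching the
    transposed array in the cell above.
    """
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]

# Made-up sample values for a left and a right channel.
left = [0.5, 0.0, -0.5, 1.0]
right = [0.5, 1.0, -0.5, 0.0]
mono = to_mono([left, right])
print(mono)
```

This also shows why the array is transposed first: the channels need to be on the first axis so that each row is one channel.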

In [None]:
IPythonAudio(audio_mono,
             rate=sampling_rate)

In [None]:
asr(audio_mono)

_Warning:_ The cell above might throw a warning because the sample rate of the audio sample is not the same as the sample rate of the model.

Let's check and fix this!

In [None]:
sampling_rate

In [None]:
asr.feature_extractor.sampling_rate

In [None]:
audio_16KHz = librosa.resample(audio_mono,
                               orig_sr=sampling_rate,
                               target_sr=16000)

In [None]:
asr(
    audio_16KHz,
    chunk_length_s=30, # 30 seconds
    batch_size=4,
    return_timestamps=True,
)["chunks"]
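
To get a feel for `chunk_length_s`, the sketch below splits a recording into consecutive 30-second windows. It is a simplification under stated assumptions: as far as I know the real pipeline also overlaps neighbouring chunks and merges their transcripts, which this example leaves out. `chunk_spans` is a hypothetical helper.

```python
def chunk_spans(total_seconds, chunk_length_s=30):
    """Return (start, end) times, in seconds, of consecutive windows."""
    spans = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_length_s, total_seconds)
        spans.append((start, end))
        start = end
    return spans

# A 75-second recording needs three windows; the last one is shorter.
spans = chunk_spans(75)
print(spans)
```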

* Build the Gradio interface.

In [None]:
import gradio as gr
demo = gr.Blocks()

In [None]:
def transcribe_long_form(filepath):
    if filepath is None:
        gr.Warning("No audio found, please retry.")
        return ""
    output = asr(
      filepath,
      max_new_tokens=256,
      chunk_length_s=30,
      batch_size=8,
    )
    return output["text"]

In [None]:
mic_transcribe = gr.Interface(
    fn=transcribe_long_form,
    inputs=gr.Audio(sources="microphone",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never")

file_transcribe = gr.Interface(
    fn=transcribe_long_form,
    inputs=gr.Audio(sources="upload",
                    type="filepath"),
    outputs=gr.Textbox(label="Transcription",
                       lines=3),
    allow_flagging="never",
)

In [None]:
with demo:
    gr.TabbedInterface(
        [mic_transcribe,
         file_transcribe],
        ["Transcribe Microphone",
         "Transcribe Audio File"],
    )
demo.launch(share=True)

In [None]:
# demo.close()

## Note: Please stop the demo before continuing with the rest of the lab.
- The app will continue running unless you run
  ```Python
  demo.close()
  ```
- If you run another Gradio app (later in this lesson) without first closing this app, you'll see an error message:
  ```Python
  OSError: Cannot find empty port in range
  ```

### Try it yourself!
- Try this model with your own audio files!

In [None]:
import soundfile as sf
import io

audio, sampling_rate = sf.read('narration_example.wav')

In [None]:
sampling_rate

In [None]:
asr.feature_extractor.sampling_rate

# 7: Text to Speech

**Note:** `py-espeak-ng` is only available on Linux operating systems.

To run locally on a Linux machine, use these commands:
```
    sudo apt-get update
    sudo apt-get install espeak-ng
    pip install py-espeak-ng
```

### Build the `text-to-speech` pipeline using the 🤗 Transformers Library

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging

logging.set_verbosity_error()

In [None]:
from transformers import pipeline

narrator = pipeline("text-to-speech",
                    model="kakao-enterprise/vits-ljs")

Info about [kakao-enterprise/vits-ljs](https://huggingface.co/kakao-enterprise/vits-ljs)

In [None]:
text = """
Researchers at the Allen Institute for AI, \
HuggingFace, Microsoft, the University of Washington, \
Carnegie Mellon University, and the Hebrew University of \
Jerusalem developed a tool that measures atmospheric \
carbon emitted by cloud servers while training machine \
learning models. After a model’s size, the biggest variables \
were the server’s location and time of day it was active.
"""
text = """
Maybe I said so Javiera, I really dont remember. I could have said it. Because you said it \
"""

In [None]:
narrated_text = narrator(text)

In [None]:
from IPython.display import Audio as IPythonAudio

IPythonAudio(narrated_text["audio"][0],
             rate=narrated_text["sampling_rate"])

### Try it yourself! 
- Try this model with your own text to speech examples!

# 8: Object Detection

**Note:** `py-espeak-ng` is only available on Linux operating systems.

To run locally on a Linux machine, use these commands:
```
    sudo apt-get update
    sudo apt-get install espeak-ng
    pip install py-espeak-ng
```

### Build the `object-detection` pipeline using 🤗 Transformers Library

- This model was released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Carion et al. (2020).

In [None]:
from helper import load_image_from_url, render_results_in_image

In [None]:
from transformers import pipeline

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

from helper import ignore_warnings
ignore_warnings()

In [None]:
od_pipe = pipeline("object-detection", "facebook/detr-resnet-50")

Info about [facebook/detr-resnet-50](https://huggingface.co/facebook/detr-resnet-50)

Explore the [Hugging Face Hub for more object detection models](https://huggingface.co/models?pipeline_tag=object-detection&sort=trending).

### Use the Pipeline

In [None]:
from PIL import Image

In [None]:
raw_image = Image.open('huggingface_friends.jpg')
raw_image.resize((569, 491))

In [None]:
pipeline_output = od_pipe(raw_image)

- Return the results from the pipeline using the helper function `render_results_in_image`.

In [None]:
processed_image = render_results_in_image(
    raw_image, 
    pipeline_output)

In [None]:
processed_image

### Using `Gradio` as a Simple Interface

- Use [Gradio](https://www.gradio.app) to create a demo for the object detection app.
- The demo makes it look friendly and easy to use.
- You can share the demo with your friends and colleagues as well.

In [None]:
import os
import gradio as gr

In [None]:
def get_pipeline_prediction(pil_image):
    
    pipeline_output = od_pipe(pil_image)
    
    processed_image = render_results_in_image(pil_image,
                                            pipeline_output)
    return processed_image

In [None]:
demo = gr.Interface(
  fn=get_pipeline_prediction,
  inputs=gr.Image(label="Input image", 
                  type="pil"),
  outputs=gr.Image(label="Output image with predicted instances",
                   type="pil")
)

- `share=True` will provide an online link to access the demo.

In [None]:
demo.launch(share=False)

In [None]:
demo.close()

### Close the app
- Remember to call `.close()` on the Gradio app when you're done using it.

### Make an AI Powered Audio Assistant

- Combine the object detector with a text-to-speech model that will describe aloud what is in the image.

- Inspect the output of the object detection pipeline.

In [None]:
pipeline_output

In [None]:
od_pipe

In [None]:
raw_image = Image.open('huggingface_friends.jpg')
raw_image.resize((284, 245))

In [None]:
from helper import summarize_predictions_natural_language

In [None]:
text = summarize_predictions_natural_language(pipeline_output)

In [None]:
text
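
The course's `summarize_predictions_natural_language` helper is not shown here. As a rough idea of what such a function might do, the sketch below counts labels in a list of detections and builds one sentence. It assumes each detection is a dictionary with `score`, `label`, and `box` keys, as the object-detection pipeline returns. The function name `summarize_detections`, the sample detections, and the naive pluralization are all illustrative and are not the helper's actual implementation.

```python
from collections import Counter

def summarize_detections(detections):
    """Turn a list of {'label': ...} dicts into one English sentence."""
    counts = Counter(d["label"] for d in detections)
    if not counts:
        return "The detector found nothing."
    # Naive pluralization: just append "s" when the count is not 1.
    parts = [f"{n} {label}" if n == 1 else f"{n} {label}s"
             for label, n in counts.items()]
    if len(parts) == 1:
        listing = parts[0]
    else:
        listing = ", ".join(parts[:-1]) + " and " + parts[-1]
    return f"The detector found {listing}."

# Made-up detections in the same shape as the pipeline output.
sample_output = [
    {"score": 0.99, "label": "person",
     "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 220}},
    {"score": 0.98, "label": "person",
     "box": {"xmin": 150, "ymin": 25, "xmax": 250, "ymax": 230}},
    {"score": 0.95, "label": "fork",
     "box": {"xmin": 60, "ymin": 200, "xmax": 90, "ymax": 240}},
]
summary_text = summarize_detections(sample_output)
print(summary_text)
```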

### Generate Audio Narration of an Image

In [None]:
tts_pipe = pipeline("text-to-speech",
                    model="kakao-enterprise/vits-ljs")

More info about [kakao-enterprise/vits-ljs](https://huggingface.co/kakao-enterprise/vits-ljs).

In [None]:
narrated_text = tts_pipe(text)

### Play the Generated Audio

In [None]:
from IPython.display import Audio as IPythonAudio

In [None]:
IPythonAudio(narrated_text["audio"][0],
             rate=narrated_text["sampling_rate"])

### Try it yourself! 
- Try these models with other images!

# 9: Segmentation

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

### Mask Generation with SAM

The [Segment Anything Model (SAM)](https://segment-anything.com) model was released by Meta AI.

In [None]:
from transformers import pipeline

In [None]:
sam_pipe = pipeline("mask-generation",
    "Zigeng/SlimSAM-uniform-77")

Info about [Zigeng/SlimSAM-uniform-77](https://huggingface.co/Zigeng/SlimSAM-uniform-77)

In [None]:
from PIL import Image

In [None]:
raw_image = Image.open('meta_llamas.jpg')
raw_image.resize((720, 375))

- Running this will take some time
- The higher the value of 'points_per_batch', the more efficient pipeline inference will be
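
One way to see why a larger `points_per_batch` can help: the prompt points are processed in batches, so larger batches mean fewer forward passes (at the cost of more memory per pass). A rough sketch, assuming a hypothetical total of 1,024 prompt points; the real number depends on the pipeline settings.

```python
import math

def num_batches(num_points, points_per_batch):
    """Number of batches needed to process all prompt points."""
    return math.ceil(num_points / points_per_batch)

TOTAL_POINTS = 1024  # illustrative figure, not read from the pipeline

for batch_size in (16, 32, 64):
    print(batch_size, "points per batch ->",
          num_batches(TOTAL_POINTS, batch_size), "batches")
```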

In [None]:
output = sam_pipe(raw_image, points_per_batch=32)

In [None]:
from helper import show_pipe_masks_on_image

In [None]:
show_pipe_masks_on_image(raw_image, output)

_Note:_ The segmentation colors you get when running this code might differ from the ones you see in the video.

### Faster Inference: Infer an Image and a Single Point

In [None]:
from transformers import SamModel, SamProcessor

In [None]:
model = SamModel.from_pretrained(
    "Zigeng/SlimSAM-uniform-77")

processor = SamProcessor.from_pretrained(
    "Zigeng/SlimSAM-uniform-77")

In [None]:
raw_image.resize((720, 375))

- Segment the blue shirt Andrew is wearing.
- Give any single 2D point that would be in that region (blue shirt).

In [None]:
input_points = [[[1600, 700]]]

- Create the input using the image and the single point.
- `return_tensors="pt"` means to return PyTorch Tensors.

In [None]:
inputs = processor(raw_image,
                 input_points=input_points,
                 return_tensors="pt")

- Given the inputs, get the output from the model.

In [None]:
import torch

In [None]:
with torch.no_grad():
    outputs = model(**inputs)

In [None]:
predicted_masks = processor.image_processor.post_process_masks(
    outputs.pred_masks,
    inputs["original_sizes"],
    inputs["reshaped_input_sizes"]
)

- The length of `predicted_masks` corresponds to the number of images used in the input.

In [None]:
len(predicted_masks)

- Inspect the size of the first ([0]) predicted mask

In [None]:
predicted_mask = predicted_masks[0]
predicted_mask.shape

In [None]:
outputs.iou_scores

In [None]:
from helper import show_mask_on_image

In [None]:
for i in range(3):
    show_mask_on_image(raw_image, predicted_mask[:, i])

## Depth Estimation with DPT

- This model was introduced in the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by Ranftl et al. (2021) and first released in [isl-org/DPT](https://github.com/isl-org/DPT).

In [None]:
depth_estimator = pipeline(task="depth-estimation",
                        model="Intel/dpt-hybrid-midas")

Info about ['Intel/dpt-hybrid-midas'](https://huggingface.co/Intel/dpt-hybrid-midas)

In [None]:
raw_image = Image.open('gradio_tamagochi_vienna.png')
raw_image.resize((806, 621))

- If you'd like to generate this image or something like it, check out the short course on [Gradio](https://www.deeplearning.ai/short-courses/building-generative-ai-applications-with-gradio/) and go to the lesson "Image Generation App".

In [None]:
output = depth_estimator(raw_image)

In [None]:
output

- Post-process the output image to resize it to the size of the original image.

In [None]:
output["predicted_depth"].shape

In [None]:
output["predicted_depth"].unsqueeze(1).shape

In [None]:
prediction = torch.nn.functional.interpolate(
    output["predicted_depth"].unsqueeze(1),
    size=raw_image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

In [None]:
prediction.shape

In [None]:
raw_image.size[::-1]

In [None]:
prediction

- Normalize the predicted tensors (between 0 and 255) so that they can be displayed.

In [None]:
import numpy as np 

In [None]:
output = prediction.squeeze().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)

In [None]:
depth
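
The normalization step can be illustrated on a handful of numbers. This plain-Python sketch mirrors the `output * 255 / np.max(output)` formula and the truncation done by `.astype("uint8")`. The depth values are made up, and `scale_to_uint8` is a hypothetical helper that assumes non-negative inputs with a positive maximum.

```python
def scale_to_uint8(values):
    """Scale non-negative numbers so that the maximum maps to 255."""
    peak = max(values)  # assumed to be greater than zero
    # int() truncates toward zero, like casting to uint8 does here.
    return [int(v * 255 / peak) for v in values]

depth_values = [0.0, 2.5, 5.0, 10.0]  # made-up depth predictions
pixels = scale_to_uint8(depth_values)
print(pixels)
```

The largest predicted depth always becomes 255, so the image uses the full brightness range regardless of the raw scale of the predictions.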

### Demo using Gradio

### Troubleshooting Tip
- Note, in the classroom, you may see the code for creating the Gradio app run indefinitely.
  - This is specific to this classroom environment when it's serving many learners at once; you wouldn't experience this issue if you ran this code on your own machine.
- To fix this, please restart the kernel (Menu Kernel->Restart Kernel) and re-run the code in the lab from the beginning of the lesson.

In [None]:
import os
import gradio as gr
from transformers import pipeline

In [None]:
def launch(input_image):
    out = depth_estimator(input_image)

    # resize the prediction
    prediction = torch.nn.functional.interpolate(
        out["predicted_depth"].unsqueeze(1),
        size=input_image.size[::-1],
        mode="bicubic",
        align_corners=False,
    )

    # normalize the prediction
    output = prediction.squeeze().numpy()
    formatted = (output * 255 / np.max(output)).astype("uint8")
    depth = Image.fromarray(formatted)
    return depth

In [None]:
iface = gr.Interface(launch, 
                     inputs=gr.Image(type='pil'), 
                     outputs=gr.Image(type='pil'))

In [None]:
iface.launch(share=False)

In [None]:
iface.close()

### Close the app
- Remember to call `.close()` on the Gradio app when you're done using it.

### Try it yourself! 
- Try this model with your own images!

# 10: Image Retrieval

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

- Load the model and the processor

In [None]:
from transformers import BlipForImageTextRetrieval

In [None]:
model = BlipForImageTextRetrieval.from_pretrained(
    "Salesforce/blip-itm-base-coco")

More info about [Salesforce/blip-itm-base-coco](https://huggingface.co/Salesforce/blip-itm-base-coco).

In [None]:
from transformers import AutoProcessor

In [None]:
processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-itm-base-coco")

In [None]:
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'

In [None]:
from PIL import Image
import requests

In [None]:
raw_image =  Image.open(
    requests.get(img_url, stream=True).raw).convert('RGB')

In [None]:
raw_image

### Test whether the image matches the text

In [None]:
text = "an image of a woman and a dog on the beach"

In [None]:
inputs = processor(images=raw_image,
                   text=text,
                   return_tensors="pt")

In [None]:
inputs

In [None]:
itm_scores = model(**inputs)[0]

In [None]:
itm_scores

In [None]:
import torch

- Use a softmax layer to get the probabilities

In [None]:
itm_score = torch.nn.functional.softmax(
    itm_scores,dim=1)

In [None]:
itm_score

In [None]:
print(f"""\
The image and text are matched \
with a probability of {itm_score[0][1]:.4f}""")
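
To show what the softmax step does to a pair of raw scores, here is a plain-Python sketch. The logits are made up for illustration and follow the same order as above, where index 1 is the "match" score.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum improves numerical stability.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits in the order [no match, match]; not real model output.
probs = softmax([-1.0, 2.0])
print([round(p, 4) for p in probs])
```

The larger logit receives the larger probability, and the two probabilities always add up to 1.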

### Try it yourself! 
- Try this model with your own images and texts!

# 11: Image Captioning

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

import warnings
warnings.filterwarnings("ignore", message="Using the model-agnostic default `max_length`")

- Load the Model and the Processor.

In [None]:
from transformers import BlipForConditionalGeneration

In [None]:
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

Info about [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)

In [None]:
from transformers import AutoProcessor

In [None]:
processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")

- Load the image.

In [None]:
from PIL import Image
import requests

In [None]:
# image = Image.open("./beach.jpeg")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
image =  Image.open(
    requests.get(img_url, stream=True).raw).convert('RGB')

In [None]:
image

### Conditional Image Captioning

In [None]:
text = "a photograph of"
inputs = processor(image, text, return_tensors="pt")

In [None]:
inputs

In [None]:
out = model.generate(**inputs)

In [None]:
out

In [None]:
print(processor.decode(out[0], skip_special_tokens=True))

### Unconditional Image Captioning

In [None]:
inputs = processor(image,return_tensors="pt")

In [None]:
out = model.generate(**inputs)

In [None]:
print(processor.decode(out[0], skip_special_tokens=True))

### Try it yourself! 
- Try this model with your own images and texts!

# 12: Visual Question & Answering

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

import warnings
warnings.filterwarnings("ignore", message="Using the model-agnostic default `max_length`")

* Load the Model and the Processor.

In [None]:
from transformers import BlipForQuestionAnswering

In [None]:
model = BlipForQuestionAnswering.from_pretrained(
    "Salesforce/blip-vqa-base")

Info about [Salesforce/blip-vqa-base](https://huggingface.co/Salesforce/blip-vqa-base)

In [None]:
from transformers import AutoProcessor

In [None]:
processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-vqa-base")

- Load the image.

In [None]:
from PIL import Image
import requests

In [None]:
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
image =  Image.open(
    requests.get(img_url, stream=True).raw).convert('RGB')

In [None]:
image

- Write the `question` you want to ask the model about the image.

In [None]:
question = "how many dogs are in the picture?"

In [None]:
inputs = processor(image, question, return_tensors="pt")

In [None]:
out = model.generate(**inputs)

In [None]:
print(processor.decode(out[0], skip_special_tokens=True))

### Try it yourself! 
- Try this model with your own images and questions!

# 13: Zero-Shot Image Classification

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

- Load the model and the processor.

In [None]:
from transformers import CLIPModel

In [None]:
model = CLIPModel.from_pretrained(
    "openai/clip-vit-large-patch14")

In [None]:
from transformers import AutoProcessor

In [None]:
processor = AutoProcessor.from_pretrained(
    "openai/clip-vit-large-patch14")

More info about [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14).

- Load the image.

In [None]:
from PIL import Image
import requests

In [None]:
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
image =  Image.open(
    requests.get(img_url, stream=True).raw).convert('RGB')

In [None]:
image

- Set the list of labels from which you want the model to classify the image (above).

In [None]:
labels = ["a photo of a cat", "a photo of a dog"]

In [None]:
inputs = processor(text=labels,
                   images=image,
                   return_tensors="pt",
                   padding=True)

In [None]:
outputs = model(**inputs)

In [None]:
outputs

In [None]:
outputs.logits_per_image

In [None]:
probs = outputs.logits_per_image.softmax(dim=1)[0]

In [None]:
probs

In [None]:
probs = list(probs)
for i in range(len(labels)):
  print(f"label: {labels[i]} - probability of {probs[i].item():.4f}")
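
Once each label has a probability, picking the predicted class is a matter of ranking them. A minimal sketch, using made-up probabilities in place of real model output; `example_probs` and `ranked` are names introduced only for this example.

```python
example_labels = ["a photo of a cat", "a photo of a dog"]
example_probs = [0.02, 0.98]  # illustrative values, not model output

# Pair each label with its probability and sort from most to least likely.
ranked = sorted(zip(example_labels, example_probs),
                key=lambda pair: pair[1],
                reverse=True)
best_label, best_prob = ranked[0]
print(f"predicted label: {best_label} ({best_prob:.2f})")
```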

### Try it yourself! 
- Try this model with your own images and labels!

# 14: Deploy ML Models on 🤗 Hub using Gradio

- Welcome to the last lesson - ML deployment using 🤗 Hub and Gradio libraries.
- This lesson is optional.  You can watch the video first to see a walkthrough of how to deploy to Hugging Face Spaces.
- If you would like to follow along or deploy to Hugging Face Spaces later, you can do so by creating a free account on https://huggingface.co/
- You are not required to create an account to complete this lesson, as this lesson contains screenshots and instructions for how to deploy, but does not have any code that requires you to have a Hugging Face account.

- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:

```
    !pip install transformers
    !pip install gradio
    !pip install gradio_client
```

- Note that if you run into issues when making an API call to your own space, you can try to upgrade your version of gradio_client:

```
pip install -U gradio_client
```

- Here is some code that suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

import warnings
warnings.filterwarnings("ignore", 
                        message="Using the model-agnostic default `max_length`")

## 🤗 Spaces

- To follow the instructions provided in the video, you can create a Hugging Face account [here](https://huggingface.co).

### Deploying to Hugging Face Spaces

- Go to [https://huggingface.co/spaces](https://huggingface.co/spaces)
- Click the "Create new Space" button.

- Give the Space a name, such as "blip-image-captioning".
- Choose a license, such as Apache 2.0.
- For "Select the Space SDK", click "Gradio".
- For Hardware, choose the default free option: "CPU Basic".

- Leave the visibility set to "Public".
- Click "Create Space".

- You will see a new page with instructions for how to get started by cloning and updating the Space's Git repo.
- You can also add the required files directly in the web browser if you'd like to get a small app running quickly. Click on "Files" at the top.

- Click on "+ Add file"->"Create new File".

### Add requirements.txt


- Add a file called requirements.txt.
- Paste in the following:

```
transformers
torch
gradio
```

- Leave "Commit Directly to the main branch" selected.
- Click "commit new file to main".

### Add app.py

- In the textbox "Name Your File", type "app.py"
- In the textbox for your code, paste in the code that you ran above, or copy this block below:



```Python
import gradio as gr
from transformers import pipeline

pipe = pipeline("image-to-text",
                model="Salesforce/blip-image-captioning-base")


def launch(input):
    out = pipe(input)
    return out[0]['generated_text']

iface = gr.Interface(launch,
                     inputs=gr.Image(type='pil'),
                     outputs="text")

iface.launch()
```
- Notice that `iface.launch()` does not set `share=True`. The Space itself hosts the app, so no temporary share link is needed.
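
Before committing, it can help to check the logic of `launch` without downloading the model. The sketch below swaps in a hypothetical `fake_pipe` that mimics the list-of-dicts shape that `out[0]['generated_text']` relies on. The caption text is invented for illustration, and the extra `pipe` parameter is an assumption added only to make the function easy to test.

```python
def fake_pipe(image):
    """Hypothetical stand-in for the image-to-text pipeline (no model is loaded)."""
    # Mimics the output shape used in app.py: a list with one dict per caption.
    return [{"generated_text": "a placeholder caption"}]

def launch(image, pipe=fake_pipe):
    """Return the first generated caption for an image."""
    out = pipe(image)
    return out[0]["generated_text"]

# The stub ignores its input, so None is enough for this check.
caption = launch(image=None)
print(caption)
```

In the real `app.py`, the function still calls the actual `pipe` created by `pipeline(...)`.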

- Leave "Commit Directly to the main branch" selected.
- Click "Commit new file to main".

### View the app

- You will see that the app is still "Building" for a few minutes.
- You can click on the "App" menu to the left of the "Files" menu to see the console as the space is being built.

- When the build is done, you'll see your app!
- At the bottom, you can click "Use via API" to see sample code for calling your model through an API.

- If `gradio_client` is not installed yet, run `pip install gradio_client` first.
- In the classroom, gradio_client should already be installed for you.
- Copy the sample code, which will look something like this:

```Python
from gradio_client import Client

client = Client("eddyS/blip-image-captioning-2")
result = client.predict(
		"https://raw.githubusercontent.com/gradio-app/gradio/main/test/test_files/bus.png",	# filepath  in 'input' Image component
		api_name="/predict"
)
print(result)
```

- Note, you can replace the string within `client.predict()` with a string that points to a local file.
- In the classroom, there are two image files that you can use.
  - "kittens.jpg"
  - "huggingface_friends.jpg"
  - Feel free to upload your own to the file directory.
 
So your code may look like this:
```Python
from gradio_client import Client

client = Client("eddyS/blip-image-captioning-2")
result = client.predict(
		"kittens.jpg",
		api_name="/predict"
)
print(result)
```
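
Since the argument can be either a URL or a local path, a small check before the call can catch a mistyped filename early. This is only a sketch: `resolve_image_source` is a hypothetical helper, not part of `gradio_client`, and the URL and filename below are placeholders.

```python
from pathlib import Path

def resolve_image_source(source):
    """Return a URL unchanged, or an absolute path for an existing local file."""
    if source.startswith(("http://", "https://")):
        return source
    path = Path(source)
    if not path.is_file():
        raise FileNotFoundError(f"Image file not found: {path.resolve()}")
    return str(path.resolve())

# A URL passes straight through.
print(resolve_image_source("https://example.com/bus.png"))

# A missing local file raises a clear error before any API call is made.
error_message = None
try:
    resolve_image_source("no_such_image.jpg")
except FileNotFoundError as err:
    error_message = str(err)
print(error_message)
```

The returned string could then be passed to `client.predict()` in place of the hard-coded filename.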

- Inspect the information in the API.

```Python
client.view_api()
```
- The output may look like this:


```
Client.predict() Usage Info
---------------------------
Named API endpoints: 1

 - predict(input, api_name="/predict") -> output
    Parameters:
     - [Image] input: filepath 
    Returns:
     - [Textbox] output: str 

```

In [None]:
from gradio_client import Client, file

client = Client(src="hjerpe/blip-image-captioning")
result = client.predict(
    file("https://cms.eichertrucksandbuses.com/uploads/truck/sub-category/a933e5958e4a354cfb8d22665bd244fd.png"),  # filepath in 'parameter_1' Image component
    api_name="/predict"
)
print(result)

In [None]:
client.view_api()

You can modify the API call to include your access token.

```Python
from gradio_client import Client

# hf_access_token is loaded from the environment (see the next section).
client = Client("eddyS/blip-image-captioning-2",
                hf_token=hf_access_token)
result = client.predict(
    "kittens.jpg",
    api_name="/predict"
)
print(result)
```

### Saving your access token securely
- Avoid hard-coding the access token in your code.

```Python
HF_TOKEN="abc1234" # not recommended
```

- Instead, you can save your access token to a file named ".env":

```
HF_ACCESS_TOKEN="abc123"
```

Then load that environment variable with the `dotenv` library:

```Python
# !pip install python-dotenv # install library
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())
hf_access_token = os.getenv("HF_ACCESS_TOKEN")
```
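
Note that `os.getenv` returns `None` when the variable is missing, and that may only show up later as a confusing error. A minimal sketch of failing early instead is shown below. `require_env` is a hypothetical helper, and the dummy value is set only so the example runs on its own.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value

# Demonstration only: set a dummy value if none exists so the example is runnable.
os.environ.setdefault("HF_ACCESS_TOKEN", "dummy-token-for-demo")

hf_access_token = require_env("HF_ACCESS_TOKEN")
# Report only whether a token was found; avoid printing the token itself.
print("Token loaded:", bool(hf_access_token))
```

In real use, drop the `setdefault` line and call `require_env` right after `load_dotenv(find_dotenv())`.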

### ZeroGPU Spaces
- [ZeroGPU Explorers](https://huggingface.co/zero-gpu-explorers): A place to spin up free GPUs on demand for your Spaces.