<a href="https://colab.research.google.com/github/pratim808/smol-course/blob/main/7_inference/notebooks%20/inference_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:

from transformers import pipeline

# Step 1: Set up a basic text-generation pipeline
print("Step 1: Basic pipeline setup")
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",  # gated model: accept the license on the Hub first
    torch_dtype="auto",   # load weights in the dtype recorded in the checkpoint
    device_map="auto"     # place the model on the available GPU, else CPU
)

# The pipeline exposes the tokenizer it uses internally
tokenizer = generator.tokenizer

# Tokenize an example prompt manually to inspect its token IDs and attention mask
prompt = "Write a haiku about a cat:"
tokenized_input = tokenizer(prompt, return_tensors="pt")

print("Tokenized Input:")
print(tokenized_input)

# Generate a continuation for a different prompt, capped at 100 new tokens.
# By default the returned text contains the prompt followed by the continuation.
response = generator(
    "Write a short story about a lost puppy:",
    max_new_tokens=100
)
print(response[0]['generated_text'])




Step 1: Basic pipeline setup



Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Tokenized Input:
{'input_ids': tensor([[128000,   8144,    264,   6520,  39342,    922,    264,   8415,     25]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Write a short story about a lost puppy: Max
Max was a small, fluffy white puppy with big brown eyes and a wagging tail that never seemed to stop. He had been separated from his mother and siblings in the chaos of a big storm, and now he was all alone in a strange new place.

As the sun began to set, Max curled up under a bush, shivering with fear and uncertainty. He had never spent a night away from his cozy little nest with his mother and siblings. The wind howled and the trees
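The story above stops mid-sentence ("The wind howled and the trees") because generation ends once `max_new_tokens=100` tokens have been produced, unless an end-of-sequence token is chosen first. The loop below is a minimal, framework-free sketch of that stopping rule. `next_token` is a hypothetical stand-in for the model's next-token choice and the token IDs are made up; this is not how transformers implements `generate()` internally.

```python
from typing import Callable, List


def generate(prompt_ids: List[int],
             next_token: Callable[[List[int]], int],
             max_new_tokens: int,
             eos_token_id: int) -> List[int]:
    """Append tokens one at a time until EOS appears or the token budget is spent."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)   # hypothetical model call: pick the next token ID
        ids.append(token)
        if token == eos_token_id:  # the model chose to stop on its own
            break
    return ids


# Hypothetical "model": counts upward and emits EOS (ID 0) after token 7
def toy_next_token(ids: List[int]) -> int:
    last = ids[-1]
    return 0 if last >= 7 else last + 1


print(generate([1, 2], toy_next_token, max_new_tokens=3, eos_token_id=0))   # [1, 2, 3, 4, 5]
print(generate([1, 2], toy_next_token, max_new_tokens=10, eos_token_id=0))  # [1, 2, 3, 4, 5, 6, 7, 0]
```

With a budget of 3 the output is cut off like the puppy story; with a budget of 10 the toy model reaches EOS first and stops early.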


In [8]:
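# Step 2: Configure generation parameters
# Note: the prompt is passed as plain text, not in the chat format the Instruct model
# was trained on, so the model may continue the sentence instead of translating it.
print("\nStep 2: Pipeline with generation parameters")
response = generator(
    "Translate this to Bengali: The cat is on the table",
    max_new_tokens=150,     # upper bound on the number of generated tokens
    do_sample=True,         # sample from the distribution instead of greedy decoding
    temperature=0.8,        # < 1.0 sharpens the distribution, > 1.0 flattens it
    top_k=50,               # consider only the 50 most likely tokens
    top_p=0.95,             # keep the smallest set whose cumulative probability reaches 0.95
    num_return_sequences=2  # return two independent samples
)

for r in response:
    print(r['generated_text'])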

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Step 2: Pipeline with generation parameters
Translate this to Bengali: The cat is on the table, while the man is beside the chair. The cat is sleeping, and the man is sitting in the chair. ২০২৩ সালের চীনে চারু নজর সূত্র আলোচনা করছেন মার্ক অপটিক্স প্রাইভেট। তিনি বলেছেন যে নজর সূত্রটি প্রথম �
Translate this to Bengali: The cat is on the table, and he is also on the wall. He is very happy when he sees the beautiful painting on the wall.
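With `do_sample=True` the next token is drawn at random from a filtered distribution, which is why the two returned sequences differ. The function below is a simplified, stdlib-only sketch of how `temperature`, `top_k`, and `top_p` can reshape one step's logits before sampling. It applies them in that order (an assumption about the ordering), it is not the transformers implementation, and the toy logits are made up.

```python
import math
import random
from typing import Dict, List


def filter_distribution(logits: List[float], temperature: float = 1.0,
                        top_k: int = 0, top_p: float = 1.0) -> Dict[int, float]:
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax (lower = sharper distribution)
    scaled = [x / temperature for x in logits]
    # Softmax, shifted by the max for numerical stability
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token IDs from most to least likely
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)
    probs_k = {i: probs[i] / mass for i in ranked}  # renormalize over survivors

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs_k[i]
        if cumulative >= top_p:
            break

    # Renormalize the final set so it sums to 1
    mass = sum(probs_k[i] for i in kept)
    return {i: probs_k[i] / mass for i in kept}


def sample(dist: Dict[int, float], rng: random.Random) -> int:
    """Draw one token ID from the filtered distribution."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
dist = filter_distribution(logits, temperature=0.8, top_k=3, top_p=0.95)
print(dist)
print(sample(dist, random.Random(0)))
```

Setting `top_k=1` (or a very small `top_p`) collapses the distribution onto the single most likely token, which behaves like greedy decoding.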


In [3]:
import nest_asyncio
from transformers import pipeline
import gradio as gr


def load_model():
    """Load a text-generation pipeline; return None if loading fails."""
    try:
        print("Loading Model...")
        # A smaller model (GPT-2) keeps the demo fast to load and test
        model = pipeline(
            "text-generation",
            model="gpt2",
            device_map="auto"
        )
        print("Model loaded successfully!")
        return model
    except Exception as e:
        print(f"Error loading model: {e}")
        return None


# Load the model once and reuse it for every request
generator = load_model()


# Gradio callback: turn a prompt into generated text
def generate_text_gradio(prompt: str) -> str:
    try:
        if not prompt:
            return "No prompt provided"
        if generator is None:
            return "Model failed to load"
        response = generator(
            prompt,
            max_new_tokens=100,
            do_sample=True,
            temperature=0.7,
        )
        # The pipeline returns a list of dicts; guard against an empty or malformed result
        if response and response[0] and "generated_text" in response[0]:
            return response[0]["generated_text"]
        return ""
    except Exception as e:
        print(f"An error occurred in generating text: {e}")
        return "An error occurred: " + str(e)


if __name__ == "__main__":
    # Patch asyncio so nested event loops work inside the notebook's running loop
    nest_asyncio.apply()

    iface = gr.Interface(
        fn=generate_text_gradio,
        inputs=gr.Textbox(lines=5, placeholder="Enter your prompt here"),
        outputs="text",
        title="Text Generation Demo"
    )
    # debug=True keeps the cell running so logs and errors appear in the notebook
    iface.launch(debug=True)

Loading Model...


Device set to use cuda:0


Model loaded successfully!
Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://897d73ea9f13578eed.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://897d73ea9f13578eed.gradio.live
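The cell above loads the model into a module-level `generator` as soon as the cell runs. One alternative, sketched here as an assumption rather than the notebook's approach, is to load lazily on first use and cache the result with `functools.lru_cache`, so every later call reuses the same instance. `expensive_load` is a hypothetical stand-in for `pipeline(...)` so the sketch runs without downloading a model.

```python
from functools import lru_cache

load_calls = 0  # counts how many times the expensive loader actually runs


def expensive_load():
    """Hypothetical stand-in for pipeline("text-generation", model="gpt2")."""
    global load_calls
    load_calls += 1
    return object()  # placeholder for the loaded pipeline


@lru_cache(maxsize=1)
def get_generator():
    """Load the model on the first call; later calls return the cached instance."""
    return expensive_load()


first = get_generator()
second = get_generator()
print(first is second, load_calls)  # True 1
```

Unlike the `return None` fallback in `load_model`, an exception raised inside `get_generator` is not cached by `lru_cache`, so a failed load is retried on the next call.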


In [8]:
!pip install gradio

Collecting gradio
  Downloading gradio-5.13.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.6.0 (from gradio)
  Downloading gradio_client-1.6.0-py3-none-any.whl.metadata (7.1 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.2.2 (from gradio)
  Downloading ruff-0.9.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.1.6 (from gradio)
  Downloading safehttpx-0.1.6-py3-none-any.whl.me

In [20]:
!pip install -U transformers fastapi uvicorn torch

Collecting transformers
  Downloading transformers-4.48.1-py3-none-any.whl.metadata (44 kB)
Downloading transformers-4.48.1-py3-none-any.whl (9.7 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.47.1
    Uninstalling transformers-4.47.1:
      Successfully uninstalled transformers-4.47.1
Successfully installed transformers-4.48.1


In [10]:
!pip install nest_asyncio



In [4]:
!pip install fastapi uvicorn transformers

Collecting uvicorn
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Downloading uvicorn-0.34.0-py3-none-any.whl (62 kB)
Installing collected packages: uvicorn
Successfully installed uvicorn-0.34.0


In [5]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [8]:
# KeyDataset wraps a dataset and yields only the value stored under one key (column)
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset

# Load the first two validation examples of a small LibriSpeech test subset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:2]")
dataset


Dataset({
    features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
    num_rows: 2
})
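`KeyDataset` lets the pipeline consume only the `audio` column instead of whole rows. The class below is a hypothetical, simplified re-implementation for illustration, named `SimpleKeyDataset` to avoid confusion with the real class in `transformers.pipelines.pt_utils`. It assumes the wrapped dataset supports `len()` and integer indexing that returns a dict, as a `datasets.Dataset` row does; the rows are made up.

```python
from typing import Any, Dict, Sequence


class SimpleKeyDataset:
    """Hypothetical sketch: expose one field of each row as its own sequence."""

    def __init__(self, dataset: Sequence[Dict[str, Any]], key: str):
        self.dataset = dataset
        self.key = key

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, i: int) -> Any:
        # Fetch row i, then return only the requested column
        return self.dataset[i][self.key]


rows = [
    {"file": "a.flac", "text": "HELLO", "speaker_id": 1},
    {"file": "b.flac", "text": "WORLD", "speaker_id": 2},
]
texts = SimpleKeyDataset(rows, "text")
print(len(texts), list(texts))  # 2 ['HELLO', 'WORLD']
```

`list(texts)` works because Python falls back to calling `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised.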

In [10]:
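from transformers import pipeline

# No task is given, so pipeline() infers it from the checkpoint.
# This is a tiny, randomly initialized test model, so its transcriptions are gibberish.
pipe = pipeline(model="hf-internal-testing/tiny-random-wav2vec2", device=0)
# Passing a dataset makes the pipeline yield one result per example
for out in pipe(KeyDataset(dataset, "audio")):
    print(out)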


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at hf-internal-testing/tiny-random-wav2vec2 and are newly initialized: ['wav2vec2.feature_extractor.conv_layers.1.layer_norm.bias', 'wav2vec2.feature_extractor.conv_layers.1.layer_norm.weight', 'wav2vec2.feature_extractor.conv_layers.2.layer_norm.bias', 'wav2vec2.feature_extractor.conv_layers.2.layer_norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Device set to use cuda:0


{'text': "EYB  ZB COE C BEZCYCZ HO MOWWB EM BWOB ZMEG  B COEB BE BEC B U OB BE BCB BEWUBB BXYWBESWYCB SBBB SSEZ C Z WH UB F IGVB SB Z<unk> XOES CZ BBXOXFBB  OBY W B VM OFOWUONFWB ZCX B M WZ Q S C Q BC CQBF FOMB BOT ZWYBZ WB  B CM B C B WZCWWW BHU EOYTO YWB BZ SHZBGEM Q OO T B BM XZ QW C OFBZMSEHB BE ZZBX M Q XB<unk> CEVWZ FOHSB W B O Z ZW S ZB O VM <s> D EUCKH XNC D Q BG B O BW U  U  MBE CBYE  WB HFQUBQBUWZ B MW BMPY F ZBU  EB B WBOF S XFOBB ZB X B MOT W B CEO WBM   BBXBBEOBECB B UM C BP FMBWB BZ WFCED Z B B FXB Z OZ OBBZ NVD UBZC W B WYCWY X CE CW B WB MWU BWN B DECF GEF'C WZS CS BYWB<s>FZ'Z<s>ZGBU ECFEY BF ZOZ O UWBSSZBBBBW   O O DBB BZWFUW ZWOZYCGOYCOT WC O CZ BD BBBBBBX X W T B BC BZC FWYBFO FBCE X Z PEZ CE B WEDBMBO BN B BY Y  W B BMCB XOXQ  BSZES Z M CF S FB BBXBB B C CSZ EF SEQF S BEC BNO BN  SU EH  WRFBS WB  W B OEZ WS X B F B X ZBBE BBEHB B BU BECBSXHB BSQWFW BSZXH BWSEG W VQETZMCZ UCXW Z DBE<s> O SXZX MB W RX YYOBSUBWOCFYEF O B O B C Z UBEZBE BTB C   CBFCB V W B BF W ZBBESBBE
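Because the pipeline receives a dataset, it can pull inputs lazily and, when a `batch_size` argument is passed in the call, group them into batches and then unbatch the results. The generator below sketches that batch-then-unbatch pattern in plain Python. `fake_model` is hypothetical, and the real pipeline additionally handles preprocessing, padding, and moving tensors to the device.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_batched(inputs: Iterable[T],
                model: Callable[[List[T]], List[R]],
                batch_size: int) -> Iterator[R]:
    """Lazily group inputs into batches, run the model once per batch, yield results one by one."""
    batch: List[T] = []
    for item in inputs:
        batch.append(item)
        if len(batch) == batch_size:
            yield from model(batch)  # unbatch: emit one result per input
            batch = []
    if batch:  # flush the final, possibly smaller batch
        yield from model(batch)


calls = []  # records the batch sizes the fake model received


def fake_model(batch: List[str]) -> List[dict]:
    """Hypothetical batched model: 'transcribe' by upper-casing each input."""
    calls.append(len(batch))
    return [{"text": x.upper()} for x in batch]


for out in run_batched(["a", "b", "c", "d", "e"], fake_model, batch_size=2):
    print(out)
print(calls)  # [2, 2, 1]
```

Results are still consumed one at a time, as in the `for out in pipe(...)` loop above, while the model sees fewer, larger calls.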