In [None]:
'''
text = (
    "Hello TechEmporium, I recently purchased a set of noise-canceling headphones "
    "from your online store in Canada. Upon receiving the item, I noticed a defect: "
    "the left earcup doesn’t produce sound. As someone who relies on quality audio "
    "for my work, this is a major inconvenience. I would appreciate a replacement or "
    "a refund. Attached are the order details and proof of purchase. Please let me "
    "know how we can resolve this issue promptly. Best regards, Alex."
)
'''

In [None]:
import os
import torch
from transformers import pipeline, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoModelForQuestionAnswering, AutoModelForSeq2SeqLM, AutoTokenizer
from introdl.utils import get_device, wrap_print_text

# overload print to wrap text
print = wrap_print_text(print)

# get GPU if available
device = get_device()


# Sample Text
text = """I ordered the Samsung Galaxy S24 Ultra from Tech Haven, expecting next-day delivery, 
but after three days, I hadn’t even received a shipping update. After waiting 45 minutes on hold, 
customer service told me there was a stock issue—yet no one had informed me! 

When the package finally arrived a week late, it contained a Google Pixel 8 Pro instead. 
The support rep was apologetic but said an exchange would take another two weeks.  

I paid $1,200 for the wrong phone, dealt with delays and poor communication, and now have to wait even longer. 
Tech Haven, you need to do better! Sincerely, Jamie."""

def print_model_info(pipe):
    model = pipe.model
    model_size = sum(p.numel() for p in model.parameters())
    print(f"Model: {model.name_or_path}, Size: {model_size:,} parameters")

# Sentiment Analysis
print("\n**Sentiment Analysis**")
sentiment_pipeline = pipeline("sentiment-analysis", device=device)
print_model_info(sentiment_pipeline)
sentiment_result = sentiment_pipeline(text)
print(sentiment_result)

# Named Entity Recognition (NER)
print("\n**Named Entity Recognition**")
ner_pipeline = pipeline("ner", grouped_entities=True, device=device)
print_model_info(ner_pipeline)
ner_result = ner_pipeline(text)
print(ner_result)

# Question Answering
print("\n**Question Answering**")
qa_pipeline = pipeline("question-answering", device=device)
print_model_info(qa_pipeline)
question = "What is the defect?"
qa_result = qa_pipeline(question=question, context=text)
print(qa_result)

# Translation (English to Spanish)
print("\n**Translation**")
translation_pipeline = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es", device=device)
print_model_info(translation_pipeline)
translation_result = translation_pipeline(text, max_length=200)
print(translation_result)

# Summarization
print("\n**Summarization**")
summarization_pipeline = pipeline("summarization", device=device)
print_model_info(summarization_pipeline)
summarization_result = summarization_pipeline(text, max_length=50, min_length=25, do_sample=False)
print(summarization_result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



**Sentiment Analysis**
Model: distilbert/distilbert-base-uncased-finetuned-sst-2-english, Size:
66,955,010 parameters


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9983344674110413}]

**Named Entity Recognition**


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  attn_output = torch.nn.functional.scaled_dot_product_attention(
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision 

Model: dbmdz/bert-large-cased-finetuned-conll03-english, Size: 332,538,889
parameters
[{'entity_group': 'MISC', 'score': 0.9910073, 'word': 'Samsung Galaxy S24
Ultra', 'start': 14, 'end': 38}, {'entity_group': 'ORG', 'score': 0.99012053,
'word': 'Tech Haven', 'start': 44, 'end': 54}, {'entity_group': 'MISC', 'score':
0.992503, 'word': 'Google Pixel 8 Pro', 'start': 325, 'end': 343},
{'entity_group': 'ORG', 'score': 0.96618426, 'word': 'Tech Haven', 'start': 551,
'end': 561}, {'entity_group': 'PER', 'score': 0.98039377, 'word': 'Jamie',
'start': 597, 'end': 602}]

**Question Answering**
Model: distilbert/distilbert-base-cased-distilled-squad, Size: 65,192,450
parameters
{'score': 0.11689740419387817, 'start': 222, 'end': 233, 'answer': 'stock
issue'}

**Translation**
Model: Helsinki-NLP/opus-mt-en-es, Size: 77,943,296 parameters


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'translation_text': 'Pedí el Samsung Galaxy S24 Ultra de Tech Haven, esperando
la entrega del día siguiente, pero después de tres días, ni siquiera había
recibido una actualización de envío. Después de esperar 45 minutos en espera, el
servicio al cliente me dijo que había un problema de existencias — sin embargo,
nadie me había informado! Cuando el paquete finalmente llegó una semana tarde,
que contenía un Google Pixel 8 Pro en su lugar. El representante de apoyo era
apologético, pero dijo que un intercambio tomaría otras dos semanas. Pagué
$1,200 por el teléfono equivocado, trató con retrasos y mala comunicación, y
ahora tienen que esperar aún más. Tech Haven, usted necesita hacer mejor!
Sinceramente, Jamie.'}]

**Summarization**
Model: sshleifer/distilbart-cnn-12-6, Size: 305,510,400 parameters
[{'summary_text': ' Tech Haven sent a Samsung Galaxy S24 Ultra to Tech Haven,
expecting next-day delivery . The package arrived a week late, it contained a
Google Pixel 8 Pro instead . An ex

: 

In [None]:
import os
import openai
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Cache for local models
MODEL_CACHE = {}

def llm_prompt(prompt, model_str, max_length=256):
    """
    Generate a response from an LLM based on the given model string.
    
    Parameters:
        - prompt (str): The input text prompt.
        - model_str (str): The model identifier (e.g., 'gpt-4o' or 'meta-llama/Llama-3.2-3B-Instruct').
        - max_length (int, optional): Maximum response length. Default is 256.
    
    Returns:
        - str: The model-generated response.
    """
    
    # Case 1: OpenAI API Model
    if model_str in ["gpt-4o", "gpt-4", "gpt-3.5-turbo"]:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OpenAI API key not found. Set OPENAI_API_KEY in the environment.")
        
        openai.api_key = api_key
        try:
            response = openai.ChatCompletion.create(
                model=model_str,
                messages=[{"role": "system", "content": "You are an AI assistant."},
                          {"role": "user", "content": prompt}],
                max_tokens=max_length,
                temperature=0.7
            )
            return response['choices'][0]['message']['content'].strip()
        except Exception as e:
            return f"OpenAI API error: {str(e)}"

    # Case 2: Local Hugging Face Model
    else:
        if model_str not in MODEL_CACHE:
            try:
                print(f"Loading model: {model_str} (this may take a while)...")
                MODEL_CACHE[model_str] = {
                    "model": AutoModelForCausalLM.from_pretrained(model_str, torch_dtype=torch.float16, device_map="auto"),
                    "tokenizer": AutoTokenizer.from_pretrained(model_str)
                }
            except Exception as e:
                return f"Error loading model {model_str}: {str(e)}"
        
        model = MODEL_CACHE[model_str]["model"]
        tokenizer = MODEL_CACHE[model_str]["tokenizer"]

        # Tokenize input
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # Generate response
        with torch.no_grad():
            output = model.generate(**inputs, max_length=max_length, temperature=0.7)

        return tokenizer.decode(output[0], skip_special_tokens=True)



In [None]:
# Using OpenAI GPT-4o with a lower temperature for more deterministic responses
response = llm_prompt("Explain quantum entanglement.", "gpt-4o", temperature=0.3)
print(response)

# Using LLaMA with a higher temperature for more creative output
response = llm_prompt("Tell me a sci-fi story.", "meta-llama/Llama-3.2-3B-Instruct", temperature=1.2)
print(response)


I've updated your `llm_prompt` function to include **search strategies** for local Hugging Face models, giving you control over how responses are generated. You can now specify **different decoding strategies** using a new parameter:  
- **`search_strategy="greedy"`** (default) – simple, deterministic response  
- **`search_strategy="beam"`** – beam search for better fluency  
- **`search_strategy="top_k"`** – random sampling from the top-k most likely tokens  
- **`search_strategy="top_p"`** – nucleus sampling for diverse responses  
- **`search_strategy="contrastive"`** – contrastive search for high-quality output  

---

### **Updated `llm_prompt` Function**
```python
import os
import openai
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Cache for local models to avoid reloading
MODEL_CACHE = {}

def llm_prompt(prompt, model_str, max_length=256, temperature=0.7, search_strategy="greedy", top_k=50, top_p=0.9, num_beams=5):
    """
    Generate a response from an LLM based on the given model string.
    
    Parameters:
        - prompt (str): The input text prompt.
        - model_str (str): The model identifier (e.g., 'gpt-4o' or 'meta-llama/Llama-3.2-3B-Instruct').
        - max_length (int, optional): Maximum response length. Default is 256.
        - temperature (float, optional): Controls randomness (0.0 = deterministic, 1.0+ = creative). Default is 0.7.
        - search_strategy (str, optional): Decoding strategy. Options: 'greedy', 'beam', 'top_k', 'top_p', 'contrastive'. Default is 'greedy'.
        - top_k (int, optional): Used if `search_strategy="top_k"`. Default is 50.
        - top_p (float, optional): Used if `search_strategy="top_p"`. Default is 0.9.
        - num_beams (int, optional): Used if `search_strategy="beam"`. Default is 5.

    Returns:
        - str: The model-generated response.
    """

    # Case 1: OpenAI API Model
    if model_str in ["gpt-4o", "gpt-4o-mini", "o1", "o1-mini"]:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OpenAI API key not found. Set OPENAI_API_KEY in the environment.")

        openai.api_key = api_key
        try:
            response = openai.ChatCompletion.create(
                model=model_str,
                messages=[{"role": "system", "content": "You are an AI assistant."},
                          {"role": "user", "content": prompt}],
                max_tokens=max_length,
                temperature=temperature  # Adjustable temperature
            )
            return response['choices'][0]['message']['content'].strip()
        except Exception as e:
            return f"OpenAI API error: {str(e)}"

    # Case 2: Local Hugging Face Model (e.g., LLaMA)
    else:
        if model_str not in MODEL_CACHE:
            try:
                print(f"Loading model: {model_str} (this may take a while)...")
                MODEL_CACHE[model_str] = {
                    "model": AutoModelForCausalLM.from_pretrained(model_str, torch_dtype=torch.float16, device_map="auto"),
                    "tokenizer": AutoTokenizer.from_pretrained(model_str)
                }
            except Exception as e:
                return f"Error loading model {model_str}: {str(e)}"
        
        model = MODEL_CACHE[model_str]["model"]
        tokenizer = MODEL_CACHE[model_str]["tokenizer"]

        # Tokenize input
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # Apply search strategy
        gen_kwargs = {"max_length": max_length, "temperature": temperature}

        if search_strategy == "greedy":
            gen_kwargs["do_sample"] = False  # Greedy decoding
        elif search_strategy == "beam":
            gen_kwargs["num_beams"] = num_beams  # Beam search
        elif search_strategy == "top_k":
            gen_kwargs.update({"do_sample": True, "top_k": top_k})  # Top-k sampling
        elif search_strategy == "top_p":
            gen_kwargs.update({"do_sample": True, "top_p": top_p})  # Nucleus sampling
        elif search_strategy == "contrastive":
            gen_kwargs.update({"penalty_alpha": 0.6, "top_k": 4})  # Contrastive search

        # Generate response
        with torch.no_grad():
            output = model.generate(**inputs, **gen_kwargs)

        return tokenizer.decode(output[0], skip_special_tokens=True)
```

---

### **New Features & Search Strategies**
| Search Strategy  | How It Works | Best Used For |
|------------------|-------------|--------------|
| **greedy** (default) | Picks the most likely word at each step. | Fast, deterministic output. |
| **beam** | Considers multiple possible sequences and picks the best. | Producing high-quality, fluent text. |
| **top_k** | Samples from the top-k most likely words. | Adding variety while staying relevant. |
| **top_p** | Samples dynamically from the most probable subset of words. | More diverse, natural text. |
| **contrastive** | Uses **contrastive search** to balance diversity and quality. | High-quality open-ended generation. |

---

### **Example Usage**
#### **1️⃣ OpenAI GPT-4o (API)**
```python
response = llm_prompt("Explain quantum mechanics.", "gpt-4o", temperature=0.3)
print(response)
```

#### **2️⃣ LLaMA Using Greedy Decoding**
```python
response = llm_prompt("Tell me a story.", "meta-llama/Llama-3.2-3B-Instruct", search_strategy="greedy")
print(response)
```

#### **3️⃣ LLaMA Using Beam Search for Higher Fluency**
```python
response = llm_prompt("Summarize this article.", "meta-llama/Llama-3.2-3B-Instruct", search_strategy="beam", num_beams=5)
print(response)
```

#### **4️⃣ LLaMA Using Top-k Sampling for More Variety**
```python
response = llm_prompt("Write a poem about AI.", "meta-llama/Llama-3.2-3B-Instruct", search_strategy="top_k", top_k=40)
print(response)
```

#### **5️⃣ LLaMA Using Contrastive Search for High-Quality Responses**
```python
response = llm_prompt("Describe the future of AI.", "meta-llama/Llama-3.2-3B-Instruct", search_strategy="contrastive")
print(response)
```

---

### **Why This Update is Useful**
✅ **More Control** – Choose between fluency (beam), diversity (top-p), or determinism (greedy).  
✅ **Better Text Quality** – Contrastive search improves coherence.  
✅ **Preserves API Flexibility** – GPT models still work the same way.  
✅ **Fast & Efficient** – **Caches** the model for reuse instead of reloading it each time.  

Would you like any additional customization, such as **temperature-dependent sampling adjustments** or **log probabilities** for debugging? 🚀