In [None]:
Creating an end-to-end use case for text generation using Hugging Face's NLP models involves steps similar to the text classification example but tailored for generating text rather than classifying it. Here’s a detailed guide for this use case:

1. Environment Setup
Install Necessary Libraries: First, install the required libraries.
pip install transformers datasets torch
2. Data Collection & Preprocessing (Optional)
A pre-trained text generation model such as GPT-2 can be used as-is, so a custom dataset is needed only if you fine-tune it. If you’re using the pre-trained model directly, you can skip data collection.
Load Dataset (Optional): If you want to fine-tune the model on specific text, you can load a dataset.
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
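Preprocess Data (Optional): Tokenize the dataset. GPT-2's tokenizer has no padding token by default, so set one before padding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 ships without a pad token; reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token

# Drop empty lines so that no training example consists of padding only.
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)

def tokenize_function(examples):
    # Pad and truncate to a fixed length (128 is an arbitrary choice; GPT-2 accepts up to 1024 tokens).
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)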
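Padding every example to a fixed length is one option. A common alternative for causal language modeling is to concatenate the tokenized texts and cut them into equal-sized blocks, so that no padding is needed. The sketch below illustrates the idea in plain Python; `group_texts` and `block_size` are illustrative names that do not appear in the guide above.

```python
from itertools import chain


def group_texts(examples, block_size=128):
    """Concatenate tokenized examples and split them into fixed-size blocks.

    `examples` maps column names (e.g. "input_ids") to lists of token lists,
    which is the shape a batched `dataset.map` call passes to its function.
    """
    # Flatten each column into one long list of tokens.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder so every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training the labels are the inputs themselves;
    # the model shifts them by one position internally.
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
blocks = group_texts(batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With the `datasets` library, a function like this would typically be applied through `map(..., batched=True)` to a dataset that was tokenized without padding and whose raw `text` column has been removed.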
3. Model Selection
Choose a Pre-Trained Model: Select a pre-trained text generation model. GPT-2 is commonly used for text generation tasks.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
4. Training the Model (Optional)
Fine-Tune the Model: If you want to fine-tune the model on your custom dataset.
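from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    logging_dir="./logs",
)

# With mlm=False the collator copies input_ids into labels (causal LM objective),
# which the Trainer needs in order to compute a loss.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)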

trainer.train()
Fine-tuning might require significant computational resources, so you can skip this step if you're using a pre-trained model for text generation.
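Fine-tuning a causal language model requires a `labels` field so that a loss can be computed. The following is a minimal sketch of how such labels can be derived from padded inputs, assuming the PyTorch convention that positions labeled -100 are ignored by the cross-entropy loss. `build_labels` is a hypothetical helper that masks by attention mask; a library collator such as `DataCollatorForLanguageModeling(mlm=False)` plays a comparable role but may differ in detail.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_labels(input_ids, attention_mask):
    """Copy input_ids to labels, masking padded positions out of the loss."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Two toy sequences; the first one ends with two padding positions.
input_ids = [[464, 2003, 50256, 50256], [7454, 2402, 257, 640]]
attention_mask = [[1, 1, 0, 0], [1, 1, 1, 1]]
labels = build_labels(input_ids, attention_mask)
print(labels)  # [[464, 2003, -100, -100], [7454, 2402, 257, 640]]
```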
5. Text Generation
Generate Text: Use the model to generate text based on a prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Once upon a time"
generated_text = generator(prompt, max_length=50, num_return_sequences=1)
print(generated_text)
Custom Generation Parameters: You can customize the generation process with parameters like max_length, temperature, num_return_sequences, etc.
# do_sample=True enables sampling, which temperature, top_k and top_p rely on.
generated_text = generator(prompt, max_length=100, num_return_sequences=3, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
for i, text in enumerate(generated_text):
    print(f"Generated Text {i+1}:\n{text['generated_text']}\n")
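To make the effect of `temperature`, `top_k`, and `top_p` more concrete, here is a small pure-Python sketch of how these parameters can reshape a next-token distribution before sampling. It illustrates the idea only and is not the library's implementation; `filtered_probs` and the toy logits are assumptions made for the example.

```python
import math
import random


def filtered_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return next-token probabilities after temperature, top-k and top-p filtering."""
    # Temperature < 1 sharpens the distribution; temperature > 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Softmax, subtracting the maximum for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep only the k most likely tokens
    mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose renormalized cumulative
    # probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; all others get probability 0.
    kept_mass = sum(probs[i] for i in kept)
    kept_set = set(kept)
    return [probs[i] / kept_mass if i in kept_set else 0.0 for i in range(len(probs))]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # scores for a 4-token toy vocabulary
probs = filtered_probs(toy_logits, temperature=0.7, top_k=3, top_p=0.8)
random.seed(0)
next_token = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, next_token)
```

In this toy case only the two most likely tokens survive the filters, so sampling can only return index 0 or 1. Lower values of `top_p` or `top_k` narrow the candidate set further, while a higher `temperature` spreads probability more evenly across it.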
6. Model Evaluation (Optional)
Evaluate Text Quality: If you’ve fine-tuned the model, evaluate the quality of the generated text using metrics like perplexity or human evaluation.
Perplexity: Calculate perplexity to assess how well the model predicts the next word in the sequence.
import math

# Perplexity is the exponential of the average cross-entropy loss on the evaluation set,
# so no separate metric object is needed.
results = trainer.evaluate()
print(f"Perplexity: {math.exp(results['eval_loss'])}")
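The relationship between loss and perplexity can be checked on a toy example. Perplexity is the exponential of the mean negative log-likelihood per token; the sketch below computes it from a list of per-token log-probabilities. The `perplexity` helper and the toy values are illustrative only.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every observed token is as
# uncertain as a uniform choice among 4 options, so its perplexity is about 4.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```

Lower perplexity means the model assigns higher probability to the observed text; a perplexity of 1 would correspond to predicting every token with certainty.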
7. Model Deployment
Save the Model: Save the trained or fine-tuned model for deployment.
model.save_pretrained("./gpt2-model")
tokenizer.save_pretrained("./gpt2-model")
Load the Saved Model for Inference: Reload the saved model from disk to serve it on your own server. Serving it through the Hugging Face Inference API instead would first require uploading it to the Model Hub.
from transformers import pipeline

generator = pipeline("text-generation", model="./gpt2-model")
prompt = "The future of AI"
generated_text = generator(prompt, max_length=50)
print(generated_text)
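For the custom-server route, one possible shape is a small HTTP endpoint that accepts a prompt and returns the generated text. The sketch below uses only the Python standard library and keeps the request handling separate from the model, so it can be exercised with a stand-in function. All names here (`handle_generate`, `make_handler`, `serve`, the `prompt` and `max_length` JSON fields) are assumptions made for illustration, not part of any Hugging Face API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_generate(body, generate_fn):
    """Turn a JSON request body into a (status, JSON response string) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
        max_length = int(payload.get("max_length", 50))
        if not isinstance(prompt, str):
            raise TypeError("prompt must be a string")
    except (ValueError, KeyError, TypeError, AttributeError):
        return 400, json.dumps({"error": "expected a JSON object with a string 'prompt'"})
    return 200, json.dumps({"generated_text": generate_fn(prompt, max_length)})


def make_handler(generate_fn):
    """Build a request handler class bound to a specific generation function."""
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, response = handle_generate(self.rfile.read(length), generate_fn)
            data = response.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)
    return Handler


def serve(generate_fn, port=8000):
    """Serve generation requests on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(generate_fn)).serve_forever()


# Stand-in for the real model, used here so the sketch runs without downloads.
def echo_generate(prompt, max_length):
    return (prompt + " ...")[:max_length]


print(handle_generate(b'{"prompt": "The future of AI"}', echo_generate))
```

To connect this to the pipeline loaded above, `generate_fn` could be a function that calls `generator(prompt, max_length=max_length)` and returns the `generated_text` field of the first result, passed to `serve`. A production deployment would also need concerns this sketch leaves out, such as authentication, timeouts, and concurrency.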
8. Model Monitoring & Maintenance
Monitor Performance: Once deployed, continuously monitor the model’s performance in production.
Update the Model: Periodically fine-tune or retrain the model with new data to maintain or improve performance.
9. Documentation and Sharing
Document the Workflow: Provide thorough documentation of the steps involved for reproducibility.
Share the Model: Optionally, share your model on the Hugging Face Model Hub for others to use.
This code outline provides a comprehensive guide for using Hugging Face's NLP models to create an end-to-end text generation application. Depending on your specific requirements, you can adjust the steps, especially if you decide to use a pre-trained model directly without fine-tuning.

In [1]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")


  from .autonotebook import tqdm as notebook_tqdm
Downloading readme: 100%|█████████████████████████████████████████████████████████| 10.5k/10.5k [00:00<00:00, 22.3kB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████| 733k/733k [00:00<00:00, 1.15MB/s]
Downloading data: 100%|███████████████████████████████████████████████████████████| 6.36M/6.36M [00:02<00:00, 2.47MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████| 657k/657k [00:00<00:00, 1.00MB/s]
Generating test split: 100%|████████████████████████████████████████████| 4358/4358 [00:00<00:00, 174794.42 examples/s]
Generating train split: 100%|████████████████████████████████████████| 36718/36718 [00:00<00:00, 1577062.43 examples/s]
Generating validation split: 100%|██████████████████████████████████████| 3760/3760 [00:00<00:00, 939620.06 examples/s]


In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length")

tokenized_datasets = dataset.map(tokenize_function, batched=True)


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Map:   0%|                                                                             | 0/4358 [00:00<?, ? examples/s]


ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
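
The ValueError above arises because GPT-2's tokenizer ships without a padding token. One common workaround, the one the error message itself suggests, is to reuse the end-of-sequence token for padding. The sketch below wraps that fix in two small helpers; `ensure_pad_token`, `make_tokenize_function`, and the `max_length=128` default are illustrative choices rather than library functions or required values.

```python
def ensure_pad_token(tokenizer):
    """Reuse the end-of-sequence token for padding when no pad token is set."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


def make_tokenize_function(tokenizer, max_length=128):
    """Build a function suitable for dataset.map(..., batched=True)."""
    def tokenize_function(examples):
        # Pad and truncate every example to the same fixed length.
        return tokenizer(
            examples["text"],
            truncation=True,
            padding="max_length",
            max_length=max_length,
        )
    return tokenize_function


# Intended use with the objects defined in the cells above:
#   tokenizer = ensure_pad_token(tokenizer)
#   dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)  # skip empty lines
#   tokenized_datasets = dataset.map(make_tokenize_function(tokenizer), batched=True)
```

Setting an explicit `max_length` also keeps each example short; without it, `padding="max_length"` pads to the model's maximum input size, which for GPT-2 is 1024 tokens.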

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")


In [None]:
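from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    logging_dir="./logs",
)

# With mlm=False the collator copies input_ids into labels (causal LM objective),
# which the Trainer needs in order to compute a loss.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)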

trainer.train()


In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Once upon a time"
generated_text = generator(prompt, max_length=50, num_return_sequences=1)
print(generated_text)


In [None]:
# do_sample=True enables sampling, which temperature, top_k and top_p rely on.
generated_text = generator(prompt, max_length=100, num_return_sequences=3, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
for i, text in enumerate(generated_text):
    print(f"Generated Text {i+1}:\n{text['generated_text']}\n")


In [None]:
import math

# Perplexity is the exponential of the average cross-entropy loss on the evaluation set,
# so no separate metric object is needed.
results = trainer.evaluate()
print(f"Perplexity: {math.exp(results['eval_loss'])}")


In [None]:
model.save_pretrained("./gpt2-model")
tokenizer.save_pretrained("./gpt2-model")


In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="./gpt2-model")
prompt = "The future of AI"
generated_text = generator(prompt, max_length=50)
print(generated_text)
