<a href="https://colab.research.google.com/github/hsan666666/AutoTrader/blob/main/Another_copy_of_01_working_with_raw_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Foundation Language Models

Foundation language models are not trained to complete chat turns, nor are they explicitly trained to follow specific instructions such as "Summarize the following text ...". Nonetheless, when prompted properly, these models can perform such tasks impressively well.

These models are trained on vast amounts of text data usually crawled from the Internet, including books, code, articles, news, etc. They are trained to complete the input text with the most probable words. Keep that in mind when trying to prompt such models.
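
To make "complete the input text with the most probable words" concrete, here is a toy sketch. It is not how Qwen or GPT-2 work internally: it is a simple bigram counter that always appends the most frequent next word seen in a tiny made-up corpus. The corpus and the names `train_bigrams` and `greedy_complete` are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follow = defaultdict(Counter)
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        follow[current][nxt] += 1
    return follow

def greedy_complete(follow, primer, max_new_words=5):
    """Repeatedly append the most frequent follower of the last word."""
    words = primer.split()
    for _ in range(max_new_words):
        candidates = follow.get(words[-1])
        if not candidates:  # the last word was never followed by anything
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Made-up toy corpus, used only for illustration
toy_corpus = "the cat sat on the mat . the cat sat on the sofa ."
bigrams = train_bigrams(toy_corpus)
print(greedy_complete(bigrams, "the", max_new_words=3))  # the cat sat on
```

A primer whose last word never appeared in the corpus gets no continuation at all. A real LM degrades far more gracefully, but it is likewise guided by what it saw during training.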

One important lesson from this exercise: do not try to make these models do what they have not seen in the pre-training data, and your life will be easier. This maxim also applies to chat LMs.


For this exercise, we will be using open-source models that we can get from Hugging Face. If you have time, do try other models as well.


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

_new_max_tokens = 32 if not torch.cuda.is_available() else 128

# If you are using Apple M1, M2, M3 on your local machine
# if not torch.cuda.is_available() and torch.backends.mps.is_available():
#   if torch.backends.mps.is_built():
#       device = "mps"

Our model is Qwen1.5 with 0.5B parameters. It is small enough to run on the CPU (slow) as well as on a GPU (much faster). Switch your runtime to a T4 GPU to speed up completion (Runtime -> Change Runtime Type). Note that Google Colab limits GPU usage, and the resources are not guaranteed.

In [None]:
model_name = "Qwen/Qwen1.5-0.5B"
# if you want a higher degree of challenge, use the following models:
# model_name = "gpt2"
# model_name = "gpt2-medium"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

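def complete(primer, max_new_tokens=_new_max_tokens):
  """Return the primer together with the model's continuation of it."""
  inputs = tokenizer(primer, return_tensors="pt").to(device)

  # Use EOS as the padding token during generation (GPT-2, for example, defines no pad token)
  outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)

  # generate() returns prompt + continuation; keep only the newly generated tokens
  generated_ids = outputs[0][len(inputs.input_ids[0]):]

  return primer, tokenizer.decode(generated_ids, skip_special_tokens=True)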

Completion is the standard task for LMs. We can use it to write a story from a primer.

In [None]:
prompt = "Once upon a time, in a quiet village, a young girl discovered a hidden world in her backyard. She was exploring the dense bushes when she stumbled upon a"

primer, completion = complete(prompt)

print("START >>\n\n", primer, sep="")
print()
print("COMPLETION >>\n\n", completion, sep="")


Although we encounter the completion task all the time (Gmail suggests how to finish a sentence, a keyboard app on your phone recommends the next word, etc.), in business settings we are interested in solving other problems, such as summarization.

In the following example, try to make the LM summarize this passage for you.

In [None]:
text_to_summarize = \
"""Artificial intelligence (AI) is a rapidly growing field with applications in various industries.
From healthcare to finance, AI is transforming how we approach complex problems. In healthcare, AI
is being used to develop advanced diagnostic tools and personalized treatment plans. In finance,
AI is enhancing fraud detection and algorithmic trading. Despite these advancements,
there are ethical concerns surrounding AI, including privacy and bias issues."""

prompt = \
"""{text}

In summary,
"""

primer, completion = complete(prompt.format(text=text_to_summarize))

print("START >>\n\n", primer, sep="")
print()
print("COMPLETION >>\n\n", completion, sep="")
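
A base model does not know where the summary should end. It keeps generating until it emits an end-of-sequence token or reaches `max_new_tokens`, so the completion may run on past the part you care about. A minimal post-processing sketch, assuming the useful part ends at the first blank line (the helper name `first_paragraph` is hypothetical):

```python
def first_paragraph(completion: str) -> str:
    """Keep the text up to the first blank line, or all of it if there is none."""
    head, _, _ = completion.strip().partition("\n\n")
    return head.strip()

# Made-up completion text, used only to illustrate the trimming
raw = "AI is transforming many industries.\n\nRelated articles: ..."
print(first_paragraph(raw))
```

Whether a blank line is a good stopping point depends on the prompt; treat it as a starting heuristic rather than a rule.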

One of the most prominent use cases for LMs is writing code. Could you make the LM write code for you that:

1. Checks whether a number is prime: https://en.wikipedia.org/wiki/Prime_number
2. Finds the greatest common divisor of two integers: https://en.wikipedia.org/wiki/Greatest_common_divisor

In [None]:
prompt = \
"""
<put your prompt here>
"""

primer, completion = complete(prompt)

print("START >>\n\n", primer, sep="")
print()
print("COMPLETION >>\n\n", completion, sep="")
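
Code produced by a small LM should not be trusted blindly. One way to judge the completion is to compare it against implementations you trust. The following is a minimal reference sketch, not part of the original exercise; the function names and the choice of trial division and the Euclidean algorithm are assumptions.

```python
def is_prime(n: int) -> bool:
    """Return True if n is prime, using trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2  # even divisors were ruled out above
    return True

def gcd(a: int, b: int) -> int:
    """Greatest common divisor via the Euclidean algorithm."""
    a, b = abs(a), abs(b)
    while b:
        a, b = b, a % b
    return a

print(is_prime(97), is_prime(91), gcd(48, 18))  # True False 6
```

Running the LM's functions on the same inputs and comparing the results is a quick sanity check.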

Showcase the poetic skills of the LM: make it write a haiku about nature (https://en.wikipedia.org/wiki/Haiku).

In [None]:
prompt = \
"""
<put your prompt here>
"""

primer, completion = complete(prompt)

print("START >>\n\n", primer, sep="")
print()
print("COMPLETION >>\n\n", completion, sep="")

Make the LM answer a question based on the passage.

In [None]:
reference_text = \
"""The Amazon rainforest is the largest tropical rainforest in the world, covering over 5.5 million square kilometers.
It is home to an estimated 390 billion individual trees and is often referred to as the 'lungs of the Earth'
due to its role in producing oxygen and absorbing carbon dioxide. The rainforest is also home to millions of
species of plants and animals, many of which are not found anywhere else on Earth."""

question = "Why is the Amazon rainforest referred to as the 'lungs of the Earth'?"

prompt = \
"""{text}"""

primer, completion = complete(prompt.format(text=reference_text))

print("START >>\n\n", primer, sep="")
print()
print("COMPLETION >>\n\n", completion, sep="")
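
The starter prompt above contains only the passage, so the model has no reason to answer the question. One possible framing, offered as a sketch rather than the intended solution, is to lay the prompt out like a reading-comprehension page and stop exactly where the answer should begin. The labels "Passage:", "Question:" and "Answer:" and the names `qa_template` and `build_qa_prompt` are assumptions.

```python
# Hypothetical template: the labels are an assumption, chosen because this
# layout resembles question-and-answer text a model may have seen in training.
qa_template = """Passage: {text}

Question: {question}
Answer:"""

def build_qa_prompt(text: str, question: str) -> str:
    """Fill the template so the prompt ends right where the answer should start."""
    return qa_template.format(text=text.strip(), question=question.strip())

# Made-up example passage and question
example_prompt = build_qa_prompt(
    "Water boils at 100 degrees Celsius at sea level.",
    "At what temperature does water boil at sea level?",
)
print(example_prompt)
```

In the notebook, the same helper could be called as `complete(build_qa_prompt(reference_text, question))`.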

Make the LM come up with a product name!

In [None]:
product_description = \
"""The headphones feature cutting-edge technology to deliver premium sound quality and the best noise-cancelling
performance on the market. Real-time audio processors and high-performance mics power the specially designed
driver unit, for wide frequency reproduction, deep bass and clear vocals. They are designed to immerse you in
a sound so good, it feels like you’re in the studio with your favourite artists. The
headphones raise the bar for distraction-free listening and clarity."""

prompt = \
"""{text}"""

primer, completion = complete(prompt.format(text=product_description))

print("START >>\n\n", primer, sep="")
print()
print("COMPLETION >>\n\n", completion, sep="")
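
A base model often picks up a task more readily when the prompt shows a few solved examples and leaves the last one open (few-shot prompting). The sketch below builds such a prompt. The example products and names are invented for illustration, and the helper `build_few_shot_prompt` is hypothetical.

```python
def build_few_shot_prompt(examples, new_description):
    """Join (description, name) pairs and leave the last name open for the LM."""
    blocks = [
        f"Product description: {description}\nProduct name: {name}"
        for description, name in examples
    ]
    blocks.append(f"Product description: {new_description}\nProduct name:")
    return "\n\n".join(blocks)

# Invented examples, not real products
few_shot_examples = [
    ("A lightweight running shoe with a breathable mesh upper.", "AeroStride"),
    ("A compact blender for smoothies on the go.", "BlendPocket"),
]
few_shot_prompt = build_few_shot_prompt(
    few_shot_examples, "Noise-cancelling headphones with deep bass."
)
print(few_shot_prompt)
```

In the notebook, `product_description` could be passed as the second argument and the result handed to `complete`.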