<a href="https://colab.research.google.com/github/angelaaaateng/AIR_AI_Engineering_Course_2024/blob/main/Day1/2_LMMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Large Language Models

## "Next Word Prediction" in Python through LLMs like GPT-Neo

GPT-Neo is an open-source language model created by EleutherAI, designed as a free alternative to models like OpenAI's GPT-3. It is based on the same underlying architecture (the Transformer) and can perform tasks such as text generation, summarization, and language understanding. GPT-Neo is available in different sizes, distinguished by parameter count (1.3B, 2.7B, etc.), and can be fine-tuned or used out of the box for various natural language processing tasks.

**Import Libraries:**

The code imports torch for handling tensors and the transformers library to use pre-trained language models like GPT-Neo.

In [None]:
import torch
from transformers import AutoTokenizer, GPTNeoForCausalLM

**Loading the Tokenizer and Model:**

The tokenizer and GPT-Neo model are loaded from the transformers library using the `from_pretrained` method. The model used here is the 1.3 billion parameter version of GPT-Neo.

In [None]:
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

model

**Tokenizing Input Text:**

The input text "Language models are" is tokenized into a format that the model understands and converted into tensors.

In [None]:
# Tokenize the input text
input_ids = tokenizer("Language models are", return_tensors="pt")
input_ids

`input_ids`: The input text "Language models are" is converted into token IDs, where each ID is the integer index of a token (a word or subword) in the model's vocabulary. Here the result is `tensor([[32065, 4981, 389]])`, one ID per token.

`attention_mask`: This tensor, `tensor([[1, 1, 1]])`, indicates that all three tokens are real input and should be attended to during processing (there are no padding tokens).
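
The attention mask only becomes interesting when sequences of different lengths are batched together. The sketch below is a toy, dependency-free illustration of that idea. `pad_batch` and `PAD_ID` are hypothetical names made up for this example, not part of `transformers`, and the padding ID of 0 is an arbitrary choice.

```python
# Toy illustration of padding and attention masks (no model download needed).
# PAD_ID is a hypothetical padding ID chosen for this sketch only.
PAD_ID = 0

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that should be ignored
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# "Language models are" and the shorter "Language models", using the IDs shown above
batch = pad_batch([[32065, 4981, 389], [32065, 4981]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With a real tokenizer, the same effect comes from passing `padding=True` when encoding a list of texts; GPT-style tokenizers may need a pad token assigned first.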

**Generate Next Token:**

The model generates the next token based on the input sequence. Because `output_scores=True` and `return_dict_in_generate=True` are set, it also returns a score for every token in the vocabulary.

In [None]:
# Generate the next token with scores
gen_tokens = model.generate(**input_ids, max_new_tokens=1,
                            output_scores=True, return_dict_in_generate=True)

gen_tokens
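
With default settings, `generate` typically picks the highest-scoring token at each step (greedy decoding). The sketch below mimics that loop with a toy stand-in for the model. The vocabulary, the numbers in `TOY_SCORES`, and the function `greedy_generate` are all invented for illustration; a real model conditions on the whole sequence, not just the last token.

```python
# Toy stand-in for a language model: maps the last token to scores over a tiny vocabulary.
# All tokens and numbers here are invented for illustration only.
TOY_SCORES = {
    "Language": {"models": 2.0, "are": 0.1, "useful": 0.3},
    "models": {"models": 0.0, "are": 1.5, "useful": 0.2},
    "are": {"models": 0.1, "are": 0.0, "useful": 1.2},
    "useful": {"models": 0.5, "are": 0.4, "useful": 0.0},
}

def greedy_generate(tokens, max_new_tokens=1):
    """Append the highest-scoring next token, one step at a time."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        scores = TOY_SCORES[tokens[-1]]            # scores for every candidate token
        next_token = max(scores, key=scores.get)   # greedy choice: the argmax
        tokens.append(next_token)
    return tokens

print(greedy_generate(["Language", "models", "are"], max_new_tokens=1))
```

Raising `max_new_tokens` repeats the same step, feeding each chosen token back in as input.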

**Extract and Sort Scores:**

The output scores are extracted, and the indices of the 20 highest-scoring tokens are selected in descending order of score. These scores are unnormalized values (logits), not probabilities.

In [None]:
# Extract the output scores
output_scores = gen_tokens["scores"]
scores_tensor = output_scores[0]

print(output_scores)
scores_tensor

In [None]:
# Sort the tokens by their scores in descending order
sorted_indices = torch.argsort(scores_tensor[0], descending=True)[:20]
sorted_indices

**Displaying Results:**

A loop iterates through the top 20 tokens, decodes each one back into text, and displays it with its corresponding score. The scores are raw, pre-softmax values: a higher score means the model considers that token a more likely continuation.

In [None]:
# Loop through the top 20 token indices and display token name and score
for index in sorted_indices:
    token_id = index
    token_name = tokenizer.decode([token_id.item()])
    token_score = scores_tensor[0][index].item()
    print(f"Token: {token_name}, Score: {token_score}")
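
To read the scores as probabilities, they can be passed through a softmax. The following is a minimal, dependency-free sketch on made-up scores; `softmax`, `top_k`, and `toy_scores` are hypothetical names for this example and are not taken from the notebook.

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k):
    """Return (token_id, probability) pairs for the k highest-scoring tokens."""
    probs = softmax(scores)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Made-up scores for a four-token vocabulary
toy_scores = [2.0, 1.0, 0.1, -1.0]
for token_id, prob in top_k(toy_scores, k=3):
    print(f"Token ID: {token_id}, Probability: {prob:.3f}")
```

On the real tensor, `torch.softmax(scores_tensor, dim=-1)` performs the same normalization over the whole vocabulary.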

## Open-Source LLMs

In [None]:
# Install the necessary libraries
# !pip install transformers
# !pip install sentencepiece

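import sentencepiece as spm

# NOTE: this path must point to an already-trained SentencePiece model file;
# the cell fails if 'test/test_model.model' does not exist in your environment.
sp = spm.SentencePieceProcessor(model_file='test/test_model.model')

# Encode a sentence into token IDs
sp.encode('This is a test')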

In [None]:


# Import the T5 model and tokenizer
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the T5 model and tokenizer (T5-small for faster execution)
model_name = "t5-small"  # You can also use "t5-base" for a bigger model
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Set the translation prompt (T5 expects prompts in the format 'translate English to <language>: ...')
prompt = "translate English to Romanian: Hello, how are you?"

# Tokenize the input prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate the translation
output = model.generate(input_ids, max_length=50)

# Decode the output to text
response = tokenizer.decode(output[0], skip_special_tokens=True)

# Print the translated text
print(response)


In [None]:
response
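
T5 selects its task from the text prefix of the prompt. The sketch below builds translation prompts for the target languages the original T5 checkpoints were trained on. `build_t5_prompt` and `T5_TRANSLATION_TARGETS` are hypothetical helpers for this example, not part of `transformers`.

```python
# Target languages covered by the original T5 translation tasks.
# build_t5_prompt is a hypothetical helper for this sketch, not a transformers API.
T5_TRANSLATION_TARGETS = ("German", "French", "Romanian")

def build_t5_prompt(text, target_language):
    """Prefix the text with the task description T5 expects for translation."""
    if target_language not in T5_TRANSLATION_TARGETS:
        raise ValueError(
            f"Unsupported target language {target_language!r}; "
            f"expected one of {T5_TRANSLATION_TARGETS}"
        )
    return f"translate English to {target_language}: {text}"

for language in T5_TRANSLATION_TARGETS:
    print(build_t5_prompt("Hello, how are you?", language))
```

Each resulting string can be tokenized and passed to `model.generate` in the same way as the Romanian prompt above.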

In [None]:
# Install the latest OpenAI package
!pip install openai

# Import the OpenAI client
from openai import OpenAI

# Set your API key here (avoid committing a real key to version control)
api_key = "XXXX API KEY"

# Create the client
client = OpenAI(api_key=api_key)

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Translate the following from English to French: 'Hello, how are you?'"
        }
    ]
)

# Print only the text of the model's reply
print(completion.choices[0].message.content)
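
Hardcoding an API key in a notebook makes it easy to leak. One common alternative, sketched here under the assumption that the key has been exported as the `OPENAI_API_KEY` environment variable, is to read it at runtime. `load_api_key` is a hypothetical helper name, not part of the `openai` package.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from an environment variable instead of the source code."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "set it before creating the client."
        )
    return key

# Usage (assumes the variable is set in your environment):
# client = OpenAI(api_key=load_api_key())
```

The `OpenAI()` client also reads `OPENAI_API_KEY` on its own when no `api_key` argument is passed, so the helper mainly adds a clearer error message.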