<a href="https://colab.research.google.com/github/napsugark/LLM_Course/blob/main/LLM_Learning_Path_1_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to LLMs**



# Note on running Colab Notebooks:



Google Colaboratory (Colab) notebooks are interactive, cloud-based Jupyter notebooks that
- let you write and execute Python code in your browser
- include formatted text using Markdown
- provide free access to GPUs and TPUs (which makes them especially useful for machine learning, data science, and AI)
- let you execute code blocks (cells) independently with Shift + Enter or the Run button and get instant feedback (outputs, plots).

Code cells share the same global state (variables, functions, imports, etc.).
Running a cell out of order may result in inconsistent variable states or errors. Jupyter notebooks do not enforce top-to-bottom execution, but following it avoids issues.
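
The effect of shared state can be sketched in plain Python. This is only an illustration, not how Colab works internally: the two strings are hypothetical stand-ins for notebook cells, and a dictionary plays the role of the notebook's global state.

```python
# Each string stands in for one notebook cell (hypothetical cells for illustration).
cell_define = "counter = 0"
cell_increment = "counter += 1"

# A dict plays the role of the notebook's shared global state.
shared_state = {}

# Running the cells top to bottom works as expected.
exec(cell_define, shared_state)
exec(cell_increment, shared_state)

# Re-running the second cell changes the shared variable again.
exec(cell_increment, shared_state)
counter_after_rerun = shared_state["counter"]
print(counter_after_rerun)  # 2

# Running the second cell first, in a fresh session, fails because
# `counter` has not been defined yet.
fresh_state = {}
try:
    exec(cell_increment, fresh_state)
    out_of_order_error = None
except NameError as err:
    out_of_order_error = err
print(type(out_of_order_error).__name__)  # NameError
```

If a notebook ends up in a confusing state, restarting the runtime and running all cells from the top restores a consistent one.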

# Note on costs

To use OpenAI models via Azure, you will need access to an Azure OpenAI resource. If you haven't been given an endpoint and API key, ask your mentors to provide you with one.

On that resource there are two deployments, "gpt-4o-mini" and "text-embedding-ada-002".

Please keep in mind that Azure OpenAI is priced pay-as-you-go on a per-token basis. In other words, every call you make to Azure OpenAI costs money, and the more text you send to an LLM and the more text it generates, the more expensive it gets. While the per-token cost is relatively low (e.g. GPT-4o mini costs 0.63 EUR per 1 million output tokens), it can add up quickly for large use cases. Please use the resources carefully, avoid unnecessary repeated calls, and check with your mentors first if you want to process data that is much larger than any of the examples or assignments. Thank you!
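
To get a feeling for how token counts translate into cost, here is a minimal sketch. It uses only the output-token price quoted above (0.63 EUR per 1 million tokens) and ignores input-token cost; the function name and the example token counts are made up for illustration, and actual prices may differ for your deployment.

```python
# Price quoted in the text above: 0.63 EUR per 1 million output tokens.
# Assumption: this figure may be outdated; check the current pricing.
PRICE_EUR_PER_MILLION_OUTPUT_TOKENS = 0.63


def estimate_output_cost_eur(output_tokens,
                             price_per_million=PRICE_EUR_PER_MILLION_OUTPUT_TOKENS):
    """Rough cost estimate in EUR for a given number of generated tokens."""
    return output_tokens / 1_000_000 * price_per_million


# Hypothetical workloads: one short answer, a small batch, a large batch.
for tokens in (500, 100_000, 50_000_000):
    cost = estimate_output_cost_eur(tokens)
    print(f"{tokens:>12,} output tokens cost about {cost:.4f} EUR")
```

After a real call, the OpenAI Python client reports the token counts on the response (for example `response.usage.completion_tokens`), which could be passed to a function like this one.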

# Azure AI Client Setup

Store AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY as secrets in Google Colab (look for the key icon in the sidebar on the left) and grant this notebook access to them (via the toggle next to each variable). These secrets are linked to your Google account: somebody else looking at the same notebook won't see them, yet they are available to you in any notebook you open. You will, however, have to grant notebook access again in each other notebook.

In [None]:
from openai import AzureOpenAI
from google.colab import userdata

# Read the endpoint and API key from the Colab secrets configured above
client = AzureOpenAI(
    azure_endpoint=userdata.get('AZURE_OPENAI_ENDPOINT'),
    api_key=userdata.get('AZURE_OPENAI_API_KEY'),
    api_version="2024-02-01",
)

In [None]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello there!"},
    ]
)
print(response.choices[0].message.content)

# Basic concepts to know

At the end of this module you should have an understanding of the following concepts:
- Token
- Pretraining
- RLHF
- Reasoning Models
- Embedding
- Similarity Metrics


# Materials:


You should study the following mandatory materials (and any additional materials you find helpful) with the aim of gaining a basic understanding of the concepts listed above.

### Mandatory

- Introduction to LLMs with Andrej Karpathy (1 hour): https://www.youtube.com/watch?v=zjkBMFhNj_g

- Embeddings: https://www.datacamp.com/blog/vector-embedding

- Similarity Metrics (you can skip the embedding part): https://www.newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python#toc-0

### Optional

Introduction to LLMs:
- Microsoft Generative AI for Beginners, Lesson 01 and 02: https://microsoft.github.io/generative-ai-for-beginners/#/


Embeddings:
- Embedding projector: https://projector.tensorflow.org/
- Embeddings technical (section on embeddings only): https://developers.google.com/machine-learning/crash-course/embeddings

How do LLMs work?:
- Deep Dive into LLMs by Andrej Karpathy (3.5 hrs) : https://www.youtube.com/watch?v=7xTGNNLPyMI
- Large Language Models: https://developers.google.com/machine-learning/crash-course/llm
- Intro article: https://medium.com/data-science-at-microsoft/how-large-language-models-work-91c362f5b78f


NLP:
- https://www.deeplearning.ai/resources/natural-language-processing/

AI Engineering:
- https://newsletter.pragmaticengineer.com/p/ai-engineering-with-chip-huyen

# Coding

## Example: Create a vector embedding

In [None]:
example_sentence = "This is a sentence about cats"


In [None]:
cat_embedding = client.embeddings.create(
    input=example_sentence,
    model="text-embedding-ada-002").data[0].embedding

Observe that the embedding is a numerical representation of the sentence: a list of 1536 numbers, each between -1 and 1, that describes the position of the sentence in the embedding space and captures its semantic meaning.

In [None]:
cat_embedding[:20]

In [None]:
len(cat_embedding)

## Example: Calculating the cosine similarity between two embeddings

In [None]:
dog_embedding = client.embeddings.create(
    input="This is a fun fact about dogs",
    model="text-embedding-ada-002").data[0].embedding

In [None]:
import numpy as np


def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v.

    Arguments:
        u -- an embedding vector of shape (n,)
        v -- an embedding vector of shape (n,)

    Returns:
        similarity -- the cosine similarity between u and v.
    """
    # Compute the dot product between u and v
    dot = np.dot(u, v)

    # Compute the L2 norm of u
    norm_u = np.linalg.norm(u)

    # Compute the L2 norm of v
    norm_v = np.linalg.norm(v)

    # Compute the cosine similarity
    similarity = dot / (norm_u * norm_v)

    return similarity


In [None]:
cosine_similarity(dog_embedding, cat_embedding)
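
To build intuition for the values this returns, it can help to try the same formula on tiny vectors where the answer is easy to check by hand. The sketch below is a dependency-free variant written only for illustration; the name `cosine_similarity_toy` and the 2-dimensional vectors are made up and are not real embeddings.

```python
import math


def cosine_similarity_toy(u, v):
    """Same formula as above, dot(u, v) / (|u| * |v|), in plain Python."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing in the same direction (length does not matter).
same_direction = cosine_similarity_toy([1.0, 2.0], [2.0, 4.0])

# Vectors at a right angle to each other.
orthogonal = cosine_similarity_toy([1.0, 0.0], [0.0, 1.0])

# Vectors pointing in opposite directions.
opposite = cosine_similarity_toy([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated in direction, and values close to -1 mean they point in opposite directions.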

## Assignment: Cosine similarities

Calculate the cosine similarity between the embeddings of each of the following pairs and compare the results:
- mother and father
- car and dog
- sun and rain