# <u>Large Language Models (LLMs)</u>

## ***Quick summary of AI vs ML***

<img src="images/aivsml.PNG" alt="Drawing" style="height:450; width: 750px;"/>

AI can be thought of as a discipline, like physics, whereas ML is a specific area within that discipline.

## A typical classical ML model

<img src="images/classical-ml.PNG" alt="Drawing" style="height:450; width: 750px;"/>

## What are Large Language Models (LLMs)

LLMs are machine learning models that are ***really good at understanding and generating human language***. LLMs are trained on massive datasets of text, and they can be used for a variety of tasks.

Here is an analogy that might help you understand LLMs. Imagine a big book that contains a huge amount of the world's written information; if you want to learn something, you look it up in the book. An LLM plays a similar role: it has learned from a very large body of text and uses what it learned to answer the questions you ask. The analogy is loose, though. An LLM does not store the text and look answers up; it stores patterns learned from the text in its parameters, so its answers can be wrong.

Some of the tasks LLMs can do are:

- Natural language understanding (NLU): LLMs can be used to understand the meaning of text, even if it is ambiguous or grammatically incorrect.
- Natural language generation (NLG): LLMs can be used to generate text, such as news articles, blog posts, and creative writing.
- Question answering (QA): LLMs can be used to answer questions about a variety of topics.
- Machine translation (MT): LLMs can be used to translate text from one language to another.


Here are a few examples of LLMs:
- GPT-3 (Generative Pre-trained Transformer 3): **175 billion parameters**
- PaLM (Pathways Language Model): **540 billion parameters**
- BERT (Bidirectional Encoder Representations from Transformers): **340 million parameters**

## What does 175 billion parameters mean?

In a large language model (LLM), parameters are the variables that are learned during training. They represent the relationships between different parts of language, such as words, phrases, and sentences. The more parameters an LLM has, the more complex the relationships it can learn.

When we say that an LLM has 175 billion parameters, it means that the model contains 175 billion learned numbers (weights) that are adjusted during training. The parameter count describes the size of the model, not the size of the training data, which is measured separately, usually in tokens. A model with more parameters can capture more complex relationships, but it also needs more data and compute to train well.
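To make the idea of a parameter concrete, here is a minimal sketch that counts the learnable numbers in a tiny fully connected network. The layer sizes and the helper name `dense_layer_params` are made up for illustration; real LLMs follow the same bookkeeping at a vastly larger scale.

```python
# Toy example: count the learnable parameters of a small fully connected network.
# The layer sizes below are hypothetical and chosen only for illustration.

def dense_layer_params(n_inputs: int, n_outputs: int) -> int:
    """One weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

layer_sizes = [4, 8, 2]  # 4 inputs -> 8 hidden units -> 2 outputs

total = sum(
    dense_layer_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
print(total)  # (4*8 + 8) + (8*2 + 2) = 58
```

None of these 58 numbers depends on how much text the network is trained on; the training data only determines what values they end up with.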

## LLMs are general-purpose language models that can be pre-trained and fine-tuned for specific purposes

<table><tr>
<td> <img src="images/pre-trained.PNG" alt="Drawing" style="height:450; width: 500px;"/> </td>
    <td></td>
<td> <img src="images/fine-tuned.PNG" alt="Drawing" style="height:450; width: 500px;"/> </td>
</tr></table>

## A typical gen AI model (LLMs)

<img src="images/genAI.PNG" alt="Drawing" style="height:450; width: 750px;"/>

# <u>What are Transformers?</u>

Transformers are a neural network architecture (proposed by Google: https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html) that learns context by tracking the relationships in sequential data, such as the words in a sentence. The original Transformer transforms one sequence into another with the help of two parts: an encoder and a decoder.

<img src="images/transformer-arch.PNG" alt="Drawing" style="height:450; width: 750px;"/>



**Although the architecture diagram shows the encoder and the decoder as single components, each one is a stack of many layers.**



<table><tr>
<td> <img src="images/encoders.PNG" alt="Drawing" style="height:450; width: 500px;"/> </td>
    <td></td>
<td> <img src="images/decoders.PNG" alt="Drawing" style="height:450; width: 500px;"/> </td>
</tr></table>

## Self-attention in Transformers

<img src="images/self-attention.PNG" alt="Drawing" style="height:450; width: 800px;"/>
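The idea in the figure can be sketched in a few lines of plain Python. This is a simplified illustration of scaled dot-product attention, not the code of any particular library: the token vectors are made up, and the same vectors are used as queries, keys and values, whereas a real Transformer first passes them through learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-dimensional token vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence, and each row sums to 1.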

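## How do language models look at text?

In a language modelling task, a model is trained to predict a missing word in a sequence of words. In general there are two types of language models: **auto-regressive** models, which predict the next word from the words before it, and **auto-encoding** models, which predict a hidden word using the words on both sides of it.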
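The difference between the two objectives can be sketched as the (input, target) pairs each kind of model is trained on. The function names and the `[MASK]` placeholder handling are illustrative assumptions, not any library's API.

```python
def autoregressive_examples(tokens):
    """Each prefix is the input; the token that follows it is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def autoencoding_example(tokens, mask_index, mask_token="[MASK]"):
    """Hide one token; the model sees both sides and must recover it."""
    masked = list(tokens)
    target = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, target

tokens = ["istanbul", "is", "a", "great", "city"]

# Auto-regressive: only the past context is visible.
for context, target in autoregressive_examples(tokens):
    print(context, "->", target)

# Auto-encoding: the context on both sides of the gap is visible.
print(autoencoding_example(tokens, mask_index=3))
```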


<img src="images/tasks.PNG" alt="Drawing" style="height:450; width: 750px;"/>

##### A great example of an auto-regressive model is next-word suggestion (predictive text) on phones

<img src="images/nlg.PNG" alt="Drawing" style="height:450; width: 750px;" />

# <u>Auto-encoding Models - BERT</u>

<img src="images/bert.PNG" alt="Drawing" style="height:450; width: 750px;" />

**At a high level, when the sentence "Istanbul is a great city" is fed to BERT, it creates a representation for each token.**

*Note: [SEP] and [CLS] are reserved tokens in BERT*

<img src="images/bert-token.PNG" alt="Drawing" style="height:450; width: 750px;" />
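Before BERT sees a sentence, the text is split into tokens and wrapped in the reserved tokens. The sketch below illustrates the idea with a greedy longest-match-first subword split in the style of WordPiece. The vocabulary is a tiny made-up one and the function names are hypothetical; the real tokenizer's vocabulary and splits will differ.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(sentence, vocab):
    """Lower-case, split on spaces, split into pieces, add [CLS] and [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"istanbul", "is", "a", "great", "city", "pu", "##ru"}

print(encode("Istanbul is a great city", toy_vocab))
print(encode("puru", toy_vocab))
```

A word that is not in the vocabulary as a whole can still be represented as a sequence of smaller pieces, which is why a vocabulary of a few tens of thousands of entries can cover open-ended text.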

In [64]:
from transformers import BertModel, BertTokenizer
from bertviz import head_view
import torch
import pandas as pd

In [65]:
#loading the bert base model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [66]:
"puru" in tokenizer.vocab

False

In [73]:
text = "My friend told me about this class and I love it so far. She was right."

tokens = tokenizer.encode(text)
# unsqueeze(0) adds a batch dimension: the shape changes from (20,) to (1, 20)
inputs = torch.tensor(tokens).unsqueeze(0)
inputs

tensor([[ 101, 2026, 2767, 2409, 2033, 2055, 2023, 2465, 1998, 1045, 2293, 2009,
         2061, 2521, 1012, 2016, 2001, 2157, 1012,  102]])

In [74]:
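# get attention scores from the BERT model; with output_attentions=True the
# third output ([2], also available as .attentions) holds one tensor per layer
attention = model(inputs, output_attentions = True)[2]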

In [75]:
# take the last of the 12 encoder layers, average over its attention heads
# (mean(1)) and select the first (and only) item in the batch ([0])
final_attention = attention[-1].mean(1)[0]

In [76]:
#put the output in a df

attention_df = pd.DataFrame(final_attention.detach()).applymap(float).round(3)
attention_df.columns = tokenizer.convert_ids_to_tokens(tokens)
attention_df.index = tokenizer.convert_ids_to_tokens(tokens)

attention_df

,[CLS],my,friend,told,me,about,this,class,and,i,love,it,so,far,.,she,was,right,.,[SEP]
[CLS],0.112,0.03,0.019,0.009,0.009,0.018,0.049,0.082,0.034,0.022,0.027,0.033,0.009,0.036,0.114,0.056,0.079,0.131,0.048,0.083
my,0.029,0.023,0.018,0.008,0.009,0.014,0.026,0.016,0.018,0.013,0.011,0.019,0.008,0.01,0.437,0.028,0.031,0.024,0.04,0.218
friend,0.022,0.01,0.132,0.011,0.005,0.01,0.009,0.014,0.012,0.005,0.011,0.006,0.006,0.006,0.434,0.026,0.011,0.007,0.029,0.235
told,0.013,0.004,0.013,0.093,0.004,0.011,0.005,0.005,0.007,0.003,0.007,0.005,0.006,0.003,0.544,0.008,0.004,0.004,0.019,0.241
me,0.03,0.013,0.013,0.012,0.014,0.015,0.017,0.012,0.021,0.012,0.012,0.014,0.008,0.01,0.481,0.011,0.008,0.009,0.038,0.249
about,0.021,0.01,0.009,0.022,0.008,0.085,0.018,0.013,0.013,0.006,0.013,0.017,0.011,0.008,0.459,0.005,0.003,0.005,0.031,0.243
this,0.03,0.014,0.005,0.005,0.009,0.016,0.067,0.023,0.013,0.01,0.012,0.016,0.007,0.008,0.459,0.006,0.004,0.005,0.037,0.254
class,0.029,0.011,0.008,0.006,0.005,0.013,0.029,0.099,0.011,0.01,0.014,0.016,0.007,0.009,0.427,0.009,0.005,0.005,0.037,0.25
and,0.033,0.017,0.008,0.008,0.012,0.009,0.013,0.01,0.092,0.013,0.012,0.01,0.01,0.011,0.44,0.012,0.012,0.012,0.035,0.23
i,0.027,0.018,0.011,0.006,0.01,0.01,0.018,0.014,0.03,0.029,0.017,0.012,0.007,0.016,0.452,0.015,0.013,0.012,0.038,0.242


In [77]:
tokens_as_list = tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention,tokens_as_list)


<img src="images/benefits.PNG" alt="Drawing" style="height:450; width: 750px;" />

#### Below is an example of how an LLM is used for a Q&A task

In [54]:
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import numpy as np
from datasets import load_dataset

from sentence_transformers import SentenceTransformer,util
from transformers import pipeline

from random import sample,seed,shuffle
from sentence_transformers import InputExample, losses, evaluation
from torch.utils.data import DataLoader

In [55]:
query = "tiffin girls"
headers = {"User-Agent": "Mozilla/5.0"}
cookies = {"CONSENT": "YES+cb.20210720-07-p0.en+FX+410"}

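# fetch the search results page and keep the first 5000 characters of its visible text
google_html = BeautifulSoup(requests.get(f'https://www.google.com/search?q={query}',headers=headers, 
                                         cookies=cookies).text, 'html.parser').get_text()[:5000]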



In [57]:
nlp = pipeline('question-answering', model="deepset/roberta-base-squad2",
              tokenizer="deepset/roberta-base-squad2",max_length=10)

In [58]:
google_html

"tiffin girls - Google SearchGoogle×Please click here if you are not redirected within a few seconds.    AllImagesNewsMaps Videos Shopping Books Search tools    Any timeAny timePast hourPast 24 hoursPast weekPast monthPast yearAll resultsAll resultsVerbatimThe Tiffin Girls' Schoolwww.tiffingirls.orgTiffin Girls' has a wonderful culture and ethos, which is evident when you first step into the building. Students live up to the motto 'sapere aude – dare\xa0...Year 7 AdmissionsSixth Form AdmissionsFind UsAbout UsThe Tiffin Girls' School 4.3  (18)  \nSchool in Kingston, EnglandDirectionsWebsiteTiffin Girls' School is a girls' selective school in Kingston upon Thames, Southwest London, England; it moved from voluntary aided status to become an academy in 2011.Address: Richmond Rd, Kingston upon Thames KT2 5PLHours: Closed ⋅ Opens 8\u202famPhone: 020 8546 0773Reviews aren't verified by Google, but Google checks for and removes fake content when it's identified. People also askHow hard is it t

In [60]:
nlp("is tiffin girls a selective school?",google_html)

{'score': 0.33677807450294495,
 'start': 2894,
 'end': 2941,
 'answer': "The Tiffin Girls' School is a selective academy"}
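The answer above is a span copied out of the context rather than newly generated text. Extractive QA models of this kind score each token as a possible start and as a possible end of the answer, then pick the best valid pair. The sketch below illustrates that selection step with made-up scores; `best_span` is a hypothetical helper, and the real pipeline additionally maps the chosen tokens back to character offsets such as the `start` and `end` values shown.

```python
def best_span(start_scores, end_scores, max_answer_len):
    """Pick (i, j) with i <= j < i + max_answer_len that maximises start + end score."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["tiffin", "girls", "is", "a", "selective", "school", "in", "kingston"]
# Made-up scores: one start score and one end score per token.
start_scores = [0.1, 0.0, 0.2, 2.5, 1.0, 0.3, 0.0, 0.1]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.9, 2.8, 0.1, 0.4]

i, j = best_span(start_scores, end_scores, max_answer_len=4)
print(" ".join(tokens[i : j + 1]))  # a selective school
```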

### Summary of auto-encoding models

- Natural language understanding (NLU) models
- Use only the encoder layers of the Transformer
- Produce contextual representations (vectors) of sentences and their tokens

# <u>Auto-regressive Models - GPT</u>

<img src="images/gpt.PNG" alt="Drawing" style="height:450; width: 750px;" />

**Generative** - "Generative" indicates that the model creates tokens based on one side of the context only: the past context.

GPT refers to a family of models:

- GPT-1, released in 2018: 0.117B parameters
- GPT-2, released in 2019: 1.5B parameters
- GPT-3, released in 2020: 175B parameters

In [78]:
from transformers import pipeline, set_seed, GPT2Tokenizer, GPT2LMHeadModel
from torch import tensor, numel
from bertviz import model_view

# to ensure we get the same results
set_seed(42)

In [79]:
generator = pipeline('text-generation', model='gpt2')
generator("Hello, I am a computer engineer and I", max_length=30, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I am a computer engineer and I don't want to be a computer scientist. As for all those people who want to study programming and software"},
 {'generated_text': 'Hello, I am a computer engineer and I am starting a company. It will eventually take four to five weeks. It was an opportunity to learn.'},
 {'generated_text': "Hello, I am a computer engineer and I love it! If you are interested in this, please see the post about the 'Designer's manual"}]

In [81]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dro

**The ModuleList holds the stack of decoder layers (12 decoder layers). The final lm_head takes each token's vector representation (in_features=768) and outputs one score per token in the vocabulary (out_features=50257).**
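The sizes in the printout are enough to estimate how many parameters this model has. The sketch below is back-of-the-envelope arithmetic, not a call into the library. The vocabulary size, position count and hidden size come from the printout, and the 12 layers from the note above; the 4x width of the MLP and the sharing of the lm_head weights with the token embedding are taken from the published configuration of the smallest GPT-2 checkpoint and should be read as assumptions here, since the printout is truncated.

```python
# Sizes for the smallest GPT-2 checkpoint ("gpt2").
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 768, 12
n_inner = 4 * n_embd  # assumption: MLP width is 4x the hidden size

def linear(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

layer_norm = 2 * n_embd  # one scale and one shift per dimension

embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe
block = (
    layer_norm                     # ln_1
    + linear(n_embd, 3 * n_embd)   # attn.c_attn: queries, keys, values in one matrix
    + linear(n_embd, n_embd)       # attn.c_proj
    + layer_norm                   # ln_2
    + linear(n_embd, n_inner)      # mlp.c_fc
    + linear(n_inner, n_embd)      # mlp.c_proj
)
total = embeddings + n_layer * block + layer_norm  # plus the final layer norm

# Assumption: lm_head reuses the wte matrix, so it adds no new parameters.
print(f"{total:,}")  # 124,439,808
```

The result, roughly 0.12 billion, is on the same scale as the figure quoted for the smallest model in the GPT family.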

### Masked Self Attention 

**Unlike BERT, GPT masks the tokens that come after each position, so the model cannot cheat by looking ahead and has to predict the next token.**

<img src="images/masked-attention.PNG" alt="Drawing" style="height:450; width: 750px;" />
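A minimal sketch of the masking idea, in plain Python with made-up scores: each position may only attend to itself and to earlier positions, so the attention weights form a lower-triangular matrix. Real implementations typically achieve the same effect by setting the masked scores to negative infinity before the softmax; the function name here is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row i may only attend to columns 0..i; later positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # drop the future positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Made-up raw attention scores for a 4-token sequence.
scores = [
    [0.5, 1.0, 0.2, 0.1],
    [0.3, 0.8, 0.9, 0.4],
    [1.2, 0.1, 0.7, 0.6],
    [0.2, 0.2, 0.2, 0.2],
]
masked_weights = causal_attention_weights(scores)
for row in masked_weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is always `[1.0, 0.0, 0.0, 0.0]`, and every entry above the diagonal is zero.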

In [88]:
phrase = "my friend was right about this class. It is so fun!"

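# NOTE: `tokenizer` is still the BERT tokenizer loaded earlier (hence [CLS] and
# [SEP] in the output below), while `model` is GPT-2. GPT-2 ships its own
# tokenizer, GPT2Tokenizer.from_pretrained("gpt2"), whose token ids match the model.
encoded_phrase = tokenizer(phrase, return_tensors='pt')

response = model(**encoded_phrase, output_attentions=True, output_hidden_states=True)

tokens = tokenizer.convert_ids_to_tokens(encoded_phrase['input_ids'][0])

# attentions[9] is the layer at index 9 (the 10th layer); the next [0] selects
# the first item in the batch and the final [0] selects head 0

arr = response.attentions[9][0][0]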
n_digits = 3

attention_df = pd.DataFrame((torch.round(arr*10**n_digits)/(10**n_digits)).detach()).applymap(float)
attention_df.columns = tokens
attention_df.index = tokens

attention_df

,[CLS],my,friend,was,right,about,this,class,.,it,is,so,fun,!,[SEP]
[CLS],1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
my,0.837,0.163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
friend,0.726,0.14,0.134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
was,0.722,0.082,0.104,0.092,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
right,0.652,0.08,0.112,0.088,0.069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
about,0.739,0.042,0.09,0.048,0.05,0.031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
this,0.731,0.04,0.054,0.048,0.038,0.018,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
class,0.694,0.04,0.072,0.035,0.057,0.027,0.062,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.0
.,0.778,0.023,0.038,0.027,0.03,0.017,0.038,0.012,0.038,0.0,0.0,0.0,0.0,0.0,0.0
it,0.799,0.021,0.03,0.019,0.025,0.013,0.034,0.009,0.028,0.023,0.0,0.0,0.0,0.0,0.0


## Challenges of using pre-trained auto-regressive models as-is (bias, misinformation)

In [97]:
generator("The man works as a", max_length=8, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The man works as a police officer,'},
 {'generated_text': 'The man works as a journalist for The'},
 {'generated_text': 'The man works as a security guard at'}]

In [98]:
generator("The women works as a", max_length=8, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The women works as a schoolteacher'},
 {'generated_text': 'The women works as a full-time'},
 {'generated_text': 'The women works as a janitor and'}]

In [99]:
generator("The earth is", max_length=12, num_return_sequences=10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The earth is flat, but our atmosphere is thick. Our'},
 {'generated_text': 'The earth is very cold, I have been on a train'},
 {'generated_text': 'The earth is a circle which is the center of the Sun'},
 {'generated_text': 'The earth is so far removed from reality that this is also'},
 {'generated_text': 'The earth is also surrounded by many oceans that can be either'},
 {'generated_text': 'The earth is being pushed as hard as it is moving.'},
 {'generated_text': 'The earth is thin and fragile. We have nothing to worry'},
 {'generated_text': 'The earth is a sphere because it is a circular body and'},
 {'generated_text': 'The earth is flat on September 20th, 2016.\n'},
 {'generated_text': "The earth is going down and, so he says, '"}]

# Few-shot, zero-shot learning
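A few-shot prompt places a handful of solved examples ahead of the input the model should complete, while a zero-shot prompt contains no examples at all. The sketch below shows one way such a prompt could be assembled; `build_few_shot_prompt` is a hypothetical helper and the `Text:` / `Sentiment:` / `###` layout is just one possible convention.

```python
def build_few_shot_prompt(task, examples, query, separator="###"):
    """Concatenate a task title, solved examples and the unsolved query."""
    parts = [task]
    for text, label in examples:
        parts.append(f"Text: {text}\nSentiment: {label}\n{separator}")
    parts.append(f"Text: {query}\nSentiment:")
    return "\n".join(parts)

examples = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("My day has been really great!", "Positive"),
]
prompt = build_few_shot_prompt(
    "Sentiment Analysis", examples, "This new music video was so good"
)
print(prompt)
```

Passing an empty `examples` list gives the zero-shot version of the same prompt.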

In [100]:
generator(""" Sentiment Analysis
Text:  hate it when my phone battery dies.
Sentiment: Negative
###
Text: My day has been really great!
Sentiment: Positive
###
Text: This new music video was so good
Sentiment:""" , top_k=2, temperature=0.1, max_length=55)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


' Sentiment Analysis\nText:  hate it when my phone battery dies.\nSentiment: Negative\n###\nText: My day has been really great!\nSentiment: Positive\n###\nText: This new music video was so good\nSentiment: Positive\n'
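The `top_k` and `temperature` arguments control how the next token is chosen. The sketch below illustrates their effect on a made-up set of scores: `top_k` keeps only the k highest-scoring candidates, and a temperature below 1 sharpens the distribution so that the top candidate is picked almost every time. The function name and the scores are hypothetical, and this is a simplified picture of what a generation library does internally.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None):
    """Apply top-k filtering and temperature, then softmax over what is left."""
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]  # keep the k highest-scoring tokens
    scaled = [(tok, score / temperature) for tok, score in items]
    m = max(score for _, score in scaled)
    exps = [(tok, math.exp(score - m)) for tok, score in scaled]
    total = sum(e for _, e in exps)
    return {tok: e / total for tok, e in exps}

# Made-up scores for the token that follows "Sentiment:".
logits = {" Positive": 2.0, " Negative": 1.5, " Neutral": 0.5, " the": -1.0}

default_dist = next_token_distribution(logits)
sharp_dist = next_token_distribution(logits, temperature=0.1, top_k=2)
print(default_dist)
print(sharp_dist)
```

With the default settings the probability is spread over all four candidates; with `top_k=2` and `temperature=0.1` almost all of it lands on the highest-scoring token, which makes the output close to deterministic.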

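## Summary of auto-regressive models

- Natural language generation (NLG) models
- Use only the decoder layers of the Transformer
- Work on the concept of predicting the next token from the past context only (masked self-attention)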

## A few examples of gen AI today

- Image generators such as DALL-E (https://labs.openai.com/) - example below
- Music generators such as MusicLM (https://google-research.github.io/seanet/musiclm/examples/)
- Text generators such as ChatGPT

<img src="images/news.PNG" alt="Drawing" style="height:450; width: 750px;" />

*Reference: all images are from Google Cloud lab videos and the book by Sinan Ozdemir, except the last one, which is from DALL-E.*