# A peek into ChatGPT
* Evolution of ChatGPT : From neural network to large language model
* Leveraging ChatGPT : From fine-tuning to zero-shot learning

ChatGPT is a language model specialized for chat conversation. The GPT in ChatGPT stands for Generative Pre-trained Transformer. We will try to demystify ChatGPT and build intuition for the fundamental concepts underpinning this technology.


## ChatGPT sample
 
 ![ChatGPT-DogPrompt](https://drive.google.com/uc?export=view&id=1sJbH_2M9JDkfALNJJwCaFqTxqDV1qb-m)
<br/>

Visit [LearnGPT.com](https://www.learngpt.com/) to review other samples and learn more about ChatGPT.

<br/>

## Function metaphor

Both GPT and ChatGPT can be thought of as functions that transform an input string into an output string.

<br/>

 ![Function](https://drive.google.com/uc?export=view&id=1JMsfaMS1cO-5Gw0Wlm2o5w16dt4Nd82U)




## Writing Functions with AI

A function is a very general concept, and as developers we have been writing functions for a long while. Functions map an input (x) to a certain output (y). Instead of writing functions in the traditional way, deep learning allows us to build arbitrary functions with the help of AI. We now have general machinery that, given enough data (x) and supervision (y), can produce the mapping function (aka model, f). This model f can then transform unseen input (x) to output (y).

### Function Anatomy
Our function consists of layers of neurons that receive input and transform it for the next layer. The transformations depend on the strength of the interconnections (aka weights) between neurons.
* Training : We progressively tweak the weights so that the quality of the transformation improves.
* Inference : We keep the weights frozen and pass new input to the pre-trained model.
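
The split between training and inference can be sketched with the smallest possible "network": a single neuron with one weight and one bias. This is only an illustration, assuming a toy dataset generated from y = 2x + 1 and plain gradient descent on squared error. The names `predict` and `train` and the hyperparameters are illustrative choices.

```python
# Toy "function learned from data": one neuron, y_hat = w * x + b.
# Assumption: data follows y = 2x + 1; hyperparameters are illustrative.

def predict(x, w, b):
    """Inference: weights are frozen, we only transform the input."""
    return w * x + b


def train(xs, ys, epochs=2000, lr=0.05):
    """Training: progressively tweak w and b to reduce squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (predict(x, w, b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(x, w, b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]          # supervision (y) for each input (x)

w, b = train(xs, ys)                  # training phase
print(predict(10.0, w, b))            # inference on an unseen input
```

A real model has many layers and millions or billions of weights, but the two phases look the same: update weights during training, then leave them untouched at inference.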

<br/>

![NeuralNet](https://drive.google.com/uc?export=view&id=1Tiwy1JJdP3Mu6URLVn1xjL0r8JQ0hbcv)



## Reusing Functions with AI (No training)

### Reusing functions as-is

We have a model hub that hosts pre-trained models (aka functions) for various tasks. Given a task, the library downloads the right function and uses it to transform input to output. This works well if someone has already built the functions we need and they are available in the hub.

Let us explore a few tasks using the Hugging Face library.
* Sentiment Analysis
* Text Summarization
* Text Generation



In [None]:
#Install hugging-face libraries
!pip install datasets evaluate transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
#Load libraries
from transformers import pipeline

In [None]:
#Sentiment Analysis
classifier = pipeline("sentiment-analysis")

input_text = "Its a sure-footed performance from an actor that never fails to hold the attention of its audience"

output_text = classifier(input_text)

output_text

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998669624328613}]

In [None]:
#Text Summarization
summarizer = pipeline("summarization")

input_text = """The James Webb Space Telescope has for the first time peered inside a planet-forming disk of dust surrounding a nearby star, 
                a development promising to supercharge the search for exoplanets. 
                Using the James Webb Space Telescope's near-infrared camera (NIRCam), a team of astronomers led by Kellen Lawson, 
                a postdoctoral program fellow at NASA Goddard Space Flight Center, observed the surroundings of a red dwarf star known as AU Microscopii or AU Mic. 
                Red dwarfs are rather unassuming stars that make up the largest population of stars in our galaxy, the Milky Way. 
                Most of the time, red dwarf stars are too faint to see in the visible light, which is why observing them in the heat-carrying infrared wavelengths, 
                those that Webb specializes in, is useful. Astronomers have known that AU Mic is surrounded by a planet-forming disk of gas and dust and have previously 
                discovered two exoplanets orbiting the star thanks to the star's periodic dimming caused by the planets' crossing in front of the star, 
                which was detected by NASA's exoplanet hunter TESS.
               """

output_text = summarizer(input_text)

output_text

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The James Webb Space Telescope has for the first time peered inside a planet-forming disk of dust surrounding a nearby star . Red dwarf stars are too faint to see in the visible light, which is why observing them in the heat-carrying infrared wavelengths is useful . Astronomers have previously discovered two exoplanets orbiting the star .'}]

In [None]:
#Text Generation
generator = pipeline("text-generation")

input_text = "The greenhouse effect is essential to life on Earth, but human-made emissions are"

output_text = generator(input_text)

output_text

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The greenhouse effect is essential to life on Earth, but human-made emissions are harming human life as much as possible.\n\nEven more harmful is the use of fossil fuels, which, in turn, are harming human life as well as nature,'}]

### Pretrained Models
We can view these sample tasks as mappings from input (x) to output (y). In these cases, we relied on pre-existing models that were downloaded from the model hub. Since we did not do any training ourselves, all of these are examples of using pre-trained models.

|Task|Pre-trained model(f)|x|y|
|---|---|---|--- |
| Sentiment Analysis| distilbert-base-uncased-finetuned-sst-2-english  | text  |  sentiment(+ve, -ve) |
| Summarizer| sshleifer/distilbart-cnn-12-6  | text  |  summary |
| Generator| gpt2  | text  |  continuation text |

<br/>

Often we do not find the model we need in the hub. The available models may have been trained on a different task and fare poorly on our application. One option is to build a new model from scratch, but this process is expensive and data-hungry. Is there a relatively cheaper path to good performance? Can we tweak an existing model for a different but related task?




## Reusing functions with AI (Partial training)

All of us are familiar with tweaking a code base to support a new scenario. We call it _refactoring_. Software engineers build layered architectures that enable code reuse through modular design. In a typical refactoring, most of the changes in the code happen at the top layer while the earlier layers remain unchanged. We can appeal to this intuition when we reuse functions via **Transfer Learning**.

### Transfer Learning - Concept
A model is a stack of layers of neurons. The earlier layers are grouped together as the body, and the later layers are called the head. We take the body of a pre-trained model and attach a new head for our specific task. There are two common approaches to transfer learning.
* Feature extraction : The pre-trained body is frozen and only the task-specific head is trained. This is less costly in terms of compute.
* Fine tuning : Both the body and the head are retrained, though the head parameters are updated more than the body. This is costlier but may achieve better task performance.

![Transfer Learning](https://drive.google.com/uc?export=view&id=1XTfY6_piQdzEm0nmhec6uKZMMTitIVOf)
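
Feature extraction can be sketched without any deep learning library. In this minimal illustration the "pre-trained body" is a fixed linear map with made-up weights (`BODY_WEIGHTS` is hypothetical) and the head is a small logistic regression. Only the head is updated during training.

```python
import math

# Assumption: these "pre-trained" body weights are made up for illustration.
BODY_WEIGHTS = [[1.0, 1.0],    # feature 0: x1 + x2
                [1.0, -1.0]]   # feature 1: x1 - x2


def body(x):
    """Frozen body: maps raw input to features; never updated below."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in BODY_WEIGHTS]


def head(features, head_w, head_b):
    """Task-specific head: logistic regression on top of body features."""
    z = sum(w * f for w, f in zip(head_w, features)) + head_b
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, epochs=1000, lr=0.1):
    """Feature extraction: only the head parameters receive updates."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in data:
            feats = body(x)                       # body stays frozen
            error = head(feats, head_w, head_b) - label
            head_w = [w - lr * error * f for w, f in zip(head_w, feats)]
            head_b -= lr * error
    return head_w, head_b


# New task: label is 1 when the first coordinate exceeds the second.
data = [([2.0, 1.0], 1), ([0.5, 1.5], 0), ([3.0, 0.0], 1), ([1.0, 2.0], 0)]
head_w, head_b = train_head(data)
print(head(body([2.5, 0.5]), head_w, head_b))    # probability of class 1
```

Fine-tuning would differ in one respect: the body weights would also receive gradient updates, usually smaller ones than the head.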




### Transfer Learning - Motivation 

A natural question is why transfer learning works. Why do the earlier weights of the model not need to be altered for a new task? Let us look at a vision task to build some intuition.

Zeiler and Fergus looked at various layers and found a way to visualize weights and activations to understand what the model is learning.

We now look at a neural network trained on an image classification task. Instead of looking at the weights as a collection of numbers, we visualize them through feature activations and the corresponding images.


* Layer 1-2
![Layer1-2](https://drive.google.com/uc?export=view&id=15PQuA7c0uqcPb_CmpcWpT38gT8QxRbNl)


* Layer 3
![Layer3](https://drive.google.com/uc?export=view&id=1wDWyvPs5aHOdcxJDUw-dzsPEQhJSFQGU)


* Layer 4-5
![Layer4-5](https://drive.google.com/uc?export=view&id=1PWDZoUEBaHCkvq7vTCxw1Z06dqzqFxBq)


<br>
We see that hidden layers capture progressively more complex features as we move from input to output. This also means that earlier layers can be useful in many different tasks. For example, a model trained on image classification can be re-used for digit (0-9) recognition.

## GPT

GPT is a large language model. 

* Large : GPT is large because it has a large number of parameters. GPT-3 has 175 billion parameters. Large models have been shown to be adept at tasks other than those they were trained on.

* Language model : GPT is a language model trained on a large amount of text from the internet. The training task involves predicting the next token given a starting context. Once training is complete, GPT is able to produce meaningful continuations of the starting text.
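
Next-token prediction can be illustrated with a far simpler model than GPT. The sketch below is a bigram counter, not a Transformer: it looks only at the previous token, whereas GPT conditions on a long context. The tiny corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Assumption: a tiny made-up corpus stands in for "text from the internet".
corpus = "the cat sat on the mat . the cat ate the fish ."


def train_bigram_model(text):
    """Count, for every token, which tokens follow it in the training text."""
    tokens = text.split()
    next_counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        next_counts[current][following] += 1
    return next_counts


def generate(model, prompt, max_new_tokens=5):
    """Greedy decoding: repeatedly append the most frequent next token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        candidates = model.get(tokens[-1])
        if not candidates:          # unseen context: nothing to predict
            break
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)


model = train_bigram_model(corpus)
print(generate(model, "the cat"))
```

The training objective is the same in spirit: given what came before, predict what comes next. GPT replaces the count table with a deep network that can generalize to contexts it has never seen.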



### Transformer family

GPT belongs to a family of language models based on the Transformer architecture. Even before the advent of the Transformer, encoder-decoder models were common for translation tasks. The job of the encoder was to encode the input sequence into a state that the decoder could later use to generate a sequence in another language. The original Transformer paper, [Attention is all you need](https://arxiv.org/abs/1706.03762), also used an encoder-decoder model for translation. Encoders and decoders can also be seen as functions. If we consider English-to-French translation, we can break it down as follows.
*  state = encoder(english-sentence)
*  french-sentence = decoder(state)

<br/>

Over time it was realized that we can build Transformer models using just one of the blocks (encoder or decoder). We now have 3 different branches of Transformer-based language models.

* Encoder only : BERT
* Decoder only : GPT
* Encoder-Decoder : T5, BART


<br/>

GPT is trained on the next-word prediction task. Once trained, the model is able to complete a sentence given a starting prompt. In essence, the model learns to mimic the internet.

<br/>

![Transformer Family](https://drive.google.com/uc?export=view&id=1AwJqpLcxAwBETTV8yZYzIR2FLBTnTKfR)








### Attention : Motivation
The Transformer was a very successful architecture. One of its key ingredients is attention. Attention is a general concept where the decoder can focus on specific words of the encoder input at each decoding step. There is a related concept of self-attention, where the representation of a word is derived from its surrounding context. Language is replete with cases where context alters the meaning of a word, and capturing this nuance is important to successfully complete NLP tasks.

![Self-Attention](https://drive.google.com/uc?export=view&id=1c6Bq6L762_uPkgFf94HtufyHfQMrB4GH)


In the example, we can observe that the token 'flies' pays attention to different tokens in first and second sentence.
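
The mechanics of self-attention can be sketched in a few lines. This is a simplified illustration of scaled dot-product attention in which queries, keys and values are all the raw token vectors. Real Transformers first apply learned projections, which are omitted here. The tokens and their 2-d embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with queries = keys = values.

    Assumption: the learned query/key/value projections of a real
    Transformer are omitted to keep the sketch short.
    """
    dim = len(embeddings[0])
    weights, outputs = [], []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in embeddings]
        row = softmax(scores)
        weights.append(row)
        # Each output is a context-aware mix of all token vectors.
        outputs.append([sum(w * vec[i] for w, vec in zip(row, embeddings))
                        for i in range(dim)])
    return weights, outputs


# Hypothetical 2-d embeddings for the tokens "time", "flies", "arrow".
tokens = ["time", "flies", "arrow"]
embeddings = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
weights, outputs = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each row of `weights` says how much one token attends to every token in the sentence. Changing the surrounding tokens changes those weights, which is how the same word can end up with different representations in different sentences.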

### Reusing function via Few-shot Learning
Large language models exhibit few-shot learning capabilities. Essentially, we get good model performance without retraining the initial model on the desired task. The inference task is reformulated to bring it closer to the training task, i.e. continuing a piece of text. GPT can therefore be used for a myriad of new tasks without expensive retraining.

![Few-Shot](https://drive.google.com/uc?export=view&id=1c4EnbgGit9hUzN7GWmObjjUMCudH0GxJ)
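
In practice, few-shot learning comes down to how the prompt string is assembled. The helper below is a hypothetical sketch, and the `=>` separator and example pairs are illustrative. It packs a task description, a few solved examples and the new query into a single text that a language model can continue.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Pack an instruction, solved examples and a new query into one prompt."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")      # the model is expected to continue here
    return "\n".join(lines)


# Assumption: example pairs in the style of an English-to-French prompt.
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
prompt = build_few_shot_prompt("Translate English to French:", examples, "peppermint")
print(prompt)
```

The resulting string is passed to the frozen model as ordinary input. No weights change. The examples in the prompt are the only "training" the model sees for the new task.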



## Ways to build new functions

We need new capabilities for our application. We have the following possibilities to explore.

|Function metaphor|AI task| Considerations |
|---|---| ---|
| Write _new_ function| Training from scratch | Costly in compute and data  | 
| _Refactor_ existing function| Fine-tuning existing model| Moderate requirements on new data and compute  | 
| _Pass custom function_ to pre-existing function| Few-shot learning, inference-time conditioning | Cheapest, requires large models, no training  |

<br/>

GPT and ChatGPT are large models that are expensive to train. Hence there is a lot of focus on exploiting few-shot learning on frozen language models. This opens the door for **Prompt Engineering**, a way to make the best use of large language models by building prompts optimized for the task at hand.



## ChatGPT
ChatGPT is a language model specialized for chat conversation. It follows our instructions and provides meaningful responses. The initial GPT language model is fine-tuned to achieve this. Fine-tuning is done in the following ways.

* Supervised fine tuning
* RL based fine tuning : RLHF or Reinforcement Learning From Human Feedback

### RL setting

![Reinforcement Learning](https://drive.google.com/uc?export=view&id=1eOctLUCRatLxnYCZUXPH8MpZWmOHUKFY)


In reinforcement learning, we have an agent interacting with the environment.

* Action : Agent takes an action against the environment.
* Reward : Environment occasionally gives reward to agent.
* State  : Agent receives a state from environment. Agent decides an action based on the state.
* Policy : Policy maps a state to one of the actions. 
* Learning : Agent learns an optimal policy to maximize reward. An optimal policy gives us best action for a given state. 

For example, in a chess game, we have the following notions.
* Action : Chess move
* Reward : Win or Loss
* State  : State of the chess board
* Policy : Given a state of chess board, take appropriate action.
* Learning : Learn best policy in each state so as to win the chess game.
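
These notions can be made concrete on a problem much smaller than chess. The sketch below assumes a hypothetical five-cell corridor where the agent starts in cell 0 and is rewarded for reaching cell 4. It uses tabular Q-learning, which is one of many RL algorithms and is chosen here only for brevity.

```python
import random

N_STATES = 5                      # states 0..4; state 4 is the goal
ACTIONS = ["left", "right"]


def step(state, action):
    """Environment: returns (next_state, reward) for an agent action."""
    if action == "right":
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values while exploring with uniformly random moves."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(100):                      # cap episode length
            action = rng.choice(ACTIONS)
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
            if state == N_STATES - 1:             # goal reached
                break
    return q


q = q_learning()
# Policy: in each non-goal state, pick the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

Here the state is the agent's cell, the action is a move left or right, the reward arrives only at the goal, and the learned policy maps every cell to the move that leads toward the reward.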


We are given an initial language model. We would like a reward signal that grounds the model in human values and guides it to produce good-quality responses that are honest, helpful and harmless.

Once the model is fine-tuned, we hope to get a good quality chat engine that acts as an AI assistant.


### Supervised Fine Tuning

In supervised fine tuning, we collect pairs of prompts and responses, including high-quality responses written by humans for a given prompt. This training data is then used to fine-tune the initial language model using Transfer Learning.

![Supervised Fine Tuning](https://drive.google.com/uc?export=view&id=1ySIZBn4BSxTOcQhoGAZ9jU3OFuDhgMo2)



### RLHF : Reinforcement Learning From Human Feedback

The goal of RLHF is to ground the model's outputs in human preferences, beyond the loss function used to build the initial language model. We use RL for this step. The prompt puts the model in some initial state. We would like the model to learn a policy (i.e. pick the best action, which here means a meaningful continuation of the conversation) so that its responses are highly valued by humans.

How do we go about shaping the reward function with human inputs? 

We will need a _reward model_ to score model response. Once we have a good reward model, we can use it to fine tune the _initial language model_.



#### Reward Model
We sample multiple responses to a given prompt from the language model. These responses are ranked by humans. A higher-ranked response corresponds to more reward. This data is used to build a **Reward Model** that maps a given prompt-response pair to a scalar reward.

![Reward Model](https://drive.google.com/uc?export=view&id=1_4S4JwixPCyYKM_HfPSZml5fmsIibB_c)
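
One common way to turn human rankings into a training signal, used for example in the InstructGPT paper listed in the references, is a pairwise loss: for every pair of responses, the reward model is penalized when the preferred response does not score higher. The sketch below shows that loss on hypothetical scores. The response names and reward values are made up.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst human ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)); small when the order is right."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# Hypothetical scalar rewards a reward model assigned to three responses.
rewards = {"response_a": 2.0, "response_b": 0.5, "response_c": -1.0}
human_ranking = ["response_a", "response_b", "response_c"]   # best first

pairs = ranking_to_pairs(human_ranking)
loss = sum(pairwise_loss(rewards[p], rewards[r]) for p, r in pairs) / len(pairs)
print(pairs)
print(loss)
```

Training the reward model means adjusting its weights so that this loss goes down, i.e. so that its scalar scores agree with the human ordering.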



#### RL Fine tuning
We begin with two identical copies of the language model, with the goal of tuning one of them using Reinforcement Learning. The fine-tuning reward has two components.

* Human Reward : The earlier reward model is used to get the reward for the response from fine-tuned model. 

* Prediction Shift Penalty : The same prompt is passed to both models and the divergence between their outputs is measured. This constrains the fine-tuned model from drifting too far from the initial language model. It plays a role similar to freezing the early layers of a neural network when doing Transfer Learning in a supervised setting.

Combined reward is used to guide the update to the language model. Once training is complete, we get our **ChatGPT**.


![RLHF](https://drive.google.com/uc?export=view&id=1_kN-xD5C_IO_OjKS8W3BU7SJmH2kaZG2)
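
The combined reward can be written as the reward-model score minus a scaled divergence between the two models' output distributions. The sketch below assumes KL divergence as the measure and a hypothetical weight `beta`. The token probabilities are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two probability distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_reward(human_reward, tuned_probs, initial_probs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the initial model."""
    return human_reward - beta * kl_divergence(tuned_probs, initial_probs)


# Hypothetical next-token distributions from the two models for one prompt.
initial_probs = [0.5, 0.3, 0.2]
tuned_probs = [0.7, 0.2, 0.1]

print(combined_reward(1.0, initial_probs, initial_probs))  # no drift, no penalty
print(combined_reward(1.0, tuned_probs, initial_probs))    # drift lowers reward
```

A larger `beta` keeps the tuned model closer to the initial one. A smaller `beta` lets the reward model dominate.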

## Applications

Large language models (LLMs) like ChatGPT give us language generation abilities coupled with a vast store of knowledge. They are also capable of in-context learning, where they pick up a new task from a few examples. This gives us a wide design space to imagine and build new applications. An LLM would be a great asset in your dev toolbox. So, go forth and **prompt** !!

Here is a sample from the new Bing. Bing is prompted to solve a riddle that appears to be gibberish, and it does a remarkable job by leveraging its LLM.

<br/>

![Bing Riddle](https://drive.google.com/uc?export=view&id=1MxN05lXUkCo6p2KoPBQWw-tirpmcWl_6)






## References
* [Attention is All You Need](https://arxiv.org/abs/1706.03762)
* [ChatGPT: Optimizing Language Models for Dialogue](https://openai.com/blog/chatgpt/)
* [GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
* [InstructGPT : Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155.pdf)
* [Proximal Policy Optimization](https://arxiv.org/pdf/1707.06347.pdf)
* [Reinforcement Learning](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf)
* [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146)


## Image Credits
* [Bing riddle](https://twitter.com/thomasrice_au/status/1627128539886780417?lang=en)
* [Illustrating RLHF](https://huggingface.co/blog/rlhf)
* [Language models are few shot learners](https://arxiv.org/pdf/2005.14165v4.pdf)
* [Transformers Book](https://transformersbook.com/)
* [Visualizing and Understanding CNN](https://arxiv.org/abs/1311.2901)



## Questions