# Investigating Few-Shot Learning and Mimicking GPT-3
In this notebook, we begin with a broad investigation into zero-shot, one-shot, and few-shot learning. This includes a conceptual explanation of the material, written after watching and following several videos and tutorials provided by OpenAI and other large ML organizations. As you may expect, we quickly stumbled across GPT-3 (also mentioned as a cool product in class), and we were of course fascinated by its capabilities. Unfortunately, we couldn't find any online interface to try out the OpenAI product, so the actual coding in this notebook mimics some of GPT-3's capabilities with similar but smaller models that can run on our computers: GPTneo and flair's TARS models. The code in this notebook isn't long, but it covers real-world products related to the topics we learned in this biweekly interval in class, and it contains our original markdown explanations and comments.

## Few-Shot Learning Explanations
Below is a summary of what we learned about few-shot learning from videos and tutorials. Training the "prior worldly knowledge" described below takes a lot of computation time and data, and it is mostly not covered in tutorials. Instead, there are plenty of pretrained model packages that people typically just import, so we felt the need to give some thorough explanations before showing code that relies heavily on importing someone else's pretrained knowledge.
### Intro
"Few-shot learning" is an umbrella term for zero-shot, one-shot, and N-shot learning models. The overall goal of these models is to learn a task even when only a limited number of data points are available to train on. After watching some of the tutorial videos online, we thought of a good analogy: few-shot learning is extremely close to how the average human brain operates. We can see only one picture of a person and then recognize whether a "testing" picture shows the same person or not. This is because we have seen so many other people in our lifetime that we can use our prior knowledge of the world to improve our learning abilities. With the methods we worked on previously in this course, we would need to train our model on many pictures of the same person before it could start recognizing their face successfully.

For natural language processing, the same concepts apply. A typical RNN or similar model requires extensive training data in order to interpret the words and context of a sequence. With few-shot learning, we can simply import "prior worldly knowledge" into our system and let it figure out what is going on in our "test" data sequence without much training data. The names N-shot, one-shot, and zero-shot state how many labeled examples the few-shot learning model is given before it sees test data: N, one, or none. In practice, few-shot learning makes it possible to train models on rare, under-measured topics. It also reduces the amount of data processing required to build a model.
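
To make the zero-, one-, and N-shot distinction concrete, here is a minimal sketch of how the number of "shots" can show up when prompting a GPT-3-style language model. The function name `build_prompt`, the "Review:/Sentiment:" layout, and the example reviews are all our own hypothetical choices for illustration, not something taken from a specific library.

```python
def build_prompt(task, examples, query):
    """Assemble a text prompt with len(examples) demonstrations (the "shots")."""
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The query is left unanswered so the language model completes the label.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


task = "Classify the sentiment of each review as positive or negative."
shots = [  # made-up demonstration examples
    ("I loved every minute of it.", "positive"),
    ("A complete waste of time.", "negative"),
]
query = "The plot was thin but the acting was great."

zero_shot = build_prompt(task, [], query)        # no examples
one_shot = build_prompt(task, shots[:1], query)  # one example
few_shot = build_prompt(task, shots, query)      # N = 2 examples
print(few_shot)
```

The only thing that changes between the three prompts is how many solved examples precede the query; the model itself is not retrained.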

### Approaches
The three main approaches are similarity, learning, and data.
#### Similarity
Few-shot learning models based on similarity learn patterns in their "prior worldly knowledge" that can then be used to separate categories of data, even when the test data is completely unseen. To learn these patterns, the model must compare across multiple classification tasks, so that what it learns is the high-level distinction "similar" versus "not similar" rather than a specific one such as "dog" versus "cat". As a very simplified NLP example (one we made up): a model that has learned to distinguish "loving" from "not loving" sequences and "funny" from "not funny" sequences can reuse what it has learned about similarity between sequences to distinguish "positive" from "negative" review comments after only a few labeled reviews. In another notebook we turned in this week, we implement a siamese network, which trains this "prior worldly knowledge" on two networks at once.
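
A minimal sketch of the similarity idea, assuming the sequences have already been turned into vectors by some pretrained encoder. The three-dimensional "embeddings" below are made up, and `classify_by_similarity` is a hypothetical helper; a real system would use a learned similarity, for example from a siamese network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_by_similarity(query, support):
    """Return the label whose few support examples are, on average, most similar to the query."""
    scores = {}
    for label, vectors in support.items():
        scores[label] = sum(cosine(query, v) for v in vectors) / len(vectors)
    return max(scores, key=scores.get), scores


# Two labeled examples per class (a "2-shot" support set) with made-up embeddings.
support = {
    "positive": [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]],
    "negative": [[0.1, 0.9, 0.3], [0.2, 0.8, 0.2]],
}
query = [0.7, 0.3, 0.2]  # embedding of an unseen review (also made up)

label, scores = classify_by_similarity(query, support)
print(label, scores)
```

No training happens on the new classes here: all of the "knowledge" sits in the encoder that produced the vectors, and the few labeled examples only serve as reference points.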

#### Learning
In few-shot learning based on learning, the "prior worldly knowledge" consists of constraints on the learning procedure itself, such as hyperparameters and update rules. If the model has already been fit to the hyperparameters of previous, similar models, it doesn't take many training data points for it to work out how these apply to the input sequence and follow the necessary operations. LSTM-based meta-learners and MAML are both examples of this type of approach.
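
As a rough illustration of the MAML idea, the sketch below meta-learns a single starting weight for a family of toy regression tasks of the form y = slope * x. This is a simplified first-order variant under our own assumptions: the tasks, learning rates, and function names are made up, and the same points are reused for the inner and outer steps, whereas full MAML uses second-order gradients and separate support and query sets.

```python
def grad(w, w_task, xs):
    """Gradient of the mean squared error of y = w*x against targets y = w_task*x."""
    return sum(2 * (w * x - w_task * x) * x for x in xs) / len(xs)


def loss(w, w_task, xs):
    """Mean squared error of y = w*x against targets y = w_task*x."""
    return sum((w * x - w_task * x) ** 2 for x in xs) / len(xs)


task_slopes = [1.0, 2.0, 3.0]  # each "task" is a line y = slope * x (made up)
xs = [-1.0, -0.5, 0.5, 1.0]    # the few training inputs available per task
inner_lr, outer_lr = 0.1, 0.1

w = 0.0  # shared initialization that meta-training adjusts
for _ in range(500):
    meta_grad = 0.0
    for slope in task_slopes:
        w_adapted = w - inner_lr * grad(w, slope, xs)  # inner step: adapt to one task
        meta_grad += grad(w_adapted, slope, xs)        # first-order outer gradient
    w -= outer_lr * meta_grad / len(task_slopes)

print(f"meta-learned initialization: {w:.3f}")

# Adapting to an unseen task takes a single gradient step from the learned start.
new_slope = 2.5
w_new = w - inner_lr * grad(w, new_slope, xs)
print(loss(w, new_slope, xs), loss(w_new, new_slope, xs))
```

In this toy setting the learned starting point settles in the middle of the training tasks, so one small step is enough to reduce the error on a new, similar task.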

#### Data
Models based on prior knowledge of the data at hand can also be categorized as few-shot learning. Often, this involves pretrained knowledge organized into "families" of data classes available on the internet. Importing knowledge from families of datasets that are the same as, or extremely similar to, the one you are training on reduces the amount of training data you need. Common methods of data-based few-shot learning include pen-stroke models and analogies by Facebook AI.

Some key sources to learn all of this included: 
- https://analyticsindiamag.com/an-introductory-guide-to-few-shot-learning-for-beginners/
- https://www.youtube.com/watch?v=VqPmrYFvKf8
- https://www.analyticsvidhya.com/blog/2021/05/an-introduction-to-few-shot-learning/
- https://jaketae.github.io/study/gpt/

## Trying out GPTneo 
Out of curiosity about GPT-3, below we try GPTneo (EleutherAI's GPT-Neo), a smaller open-source model in the style of GPT-3 that can be downloaded through the Hugging Face `transformers` library and run on PyTorch. It is mostly fun to mess around with, but it also inspires some cool project extensions, as it truly demonstrates the power of few-shot learning in language models. It also illustrates what we mentioned earlier: these few-shot learning applications can take very few lines of code, because they are designed to rely on someone else's previously trained model that is available on the internet. After this, we build an implementation that is a little more involved.

A good source for this implementation was this YouTube video:
- https://www.youtube.com/watch?v=6MI0f6YjJIk


In [1]:
# Following the tutorial: install the tools needed to run GPTneo
!pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
!pip install transformers

Looking in links: https://download.pytorch.org/whl/torch_stable.html

× The package index page being used does not have a proper HTML doctype declaration.
╰─> Problematic URL: https://download.pytorch.org/whl/torch_stable.html

note: This is an issue with the page at the URL mentioned above.
hint: You might need to reach out to the owner of that package index, to get this fixed. See https://github.com/pypa/pip/issues/10825 for context.
ERROR: Could not find a version that satisfies the requirement torch==1.8.1+cu111 (from versions: 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2)
ERROR: No matching distribution found for torch==1.8.1+cu111

In [2]:
from transformers import pipeline 
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B') #import "prior knowledge" from GPTneo


In [3]:
prompt = "The 12th president of the US was"  # A random prompt that we thought of
res = generator(prompt, max_length=50, do_sample=True, temperature=0.9)  # Sample a continuation; max_length=50 caps the total length (prompt included) at 50 tokens, not characters
print(res[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The 12th president of the US was born in Hawaii, but moved to Washington DC, where his parents ran a restaurant. However, Trump has always had an extremely high opinion of himself.

The real estate mogul has been married since 1981 and


We tried many fun prompts with this model. If you enter "import pandas as pd", the model outputs Python code that could plausibly follow. We also entered a problem from our numerics homework, and the model output LaTeX code for mathematics that was relevant and more or less followed logically. GPT-3 has been trained on much more knowledge than GPTneo, so we would expect much better results if we had the resources to use it.
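
The generation call above passes `do_sample=True` and `temperature=0.9`. As we understand it, sampling draws the next token at random from the model's predicted distribution, and the temperature rescales the scores before they are turned into probabilities. The sketch below shows that rescaling on made-up scores; the token list, the numbers, and the helper names are hypothetical, and the real pipeline does this over the model's full vocabulary.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Draw one token at random according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = ["born", "elected", "a", "not"]
logits = [2.0, 1.5, 0.5, 0.1]  # made-up scores, not real model outputs

rng = random.Random(0)  # fixed seed so reruns draw the same tokens
for temperature in (0.5, 0.9, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs], sample_token(tokens, logits, temperature, rng))
```

Lower temperatures concentrate the probability on the top-scoring token, while higher temperatures flatten the distribution and make the generated text more varied.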

## Performing zero-shot learning using "flair" models for sequence classification 
Since we were very interested but not entirely satisfied with just using the GPTneo implementation above, we found "flair", a PyTorch-based package that performs zero-shot learning in a slightly more involved way. It has families of pretrained "TARS" models for different languages (below we use English so we can enjoy our results). After loading the "prior worldly knowledge", we define the classes that a prompted sequence should be sorted into. The TARS model then compares the prompted sequence against those classes using its prior knowledge and classifies it, all without expending our own resources to train a model on large amounts of data!

A good tutorial we found on this is here:
- https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_10_TRAINING_ZERO_SHOT_MODEL.md
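
As we understand the TARS approach, classification is recast as a yes/no question: each candidate label is paired with the text as "label [SEP] text", and the pretrained transformer decides whether the pair matches. The sketch below only illustrates that input format. The scoring function is a hypothetical keyword stand-in that we made up, not flair's model, and none of these helper names exist in flair.

```python
def tars_style_inputs(text, labels, separator="[SEP]"):
    """Pair the text with every candidate label: '<label> [SEP] <text>'."""
    return {label: f"{label} {separator} {text}" for label in labels}


# Hypothetical stand-in for the pretrained transformer. The real model scores
# each pair as True/False; this toy version only looks for hand-picked cue words.
CUE_WORDS = {"happy": {"glad", "great", "love"}, "sad": {"sorry", "miss", "cry"}}


def toy_true_score(label, text):
    """Return 1.0 if any cue word for the label appears in the text, else 0.0."""
    words = {word.strip("!.,?").lower() for word in text.split()}
    return 1.0 if words & CUE_WORDS.get(label, set()) else 0.0


def predict_zero_shot_sketch(text, labels, threshold=0.5):
    """Show the label-text pairs, then keep every label whose score clears the threshold."""
    inputs = tars_style_inputs(text, labels)
    for pair in inputs.values():
        print(pair)
    return [label for label in labels if toy_true_score(label, text) > threshold]


predicted = predict_zero_shot_sketch("I am so glad you liked it!", ["happy", "sad"])
print(predicted)
```

Because the label is part of the input, new label names can be tried at prediction time without changing the model, which is what makes the zero-shot setting possible.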

In [4]:
!pip install flair


In [5]:
from flair.models import TARSClassifier
from flair.data import Sentence

tars = TARSClassifier.load('tars-base')  # Load prior worldly knowledge (pretrained TARS model)

classes = ["happy", "sad"]  # Define the candidate classes

sentence = Sentence("I am so glad you liked it!")  # Input sentence

tars.predict_zero_shot(sentence, classes)  # Run zero-shot classification with the TARS model
print(sentence)

2022-02-23 03:32:17,420 loading file /Users/simon/.flair/models/tars-base-v8.pt
Sentence: "I am so glad you liked it !"   [− Tokens: 8  − Sentence-Labels: {'happy-sad': [happy (0.8667)]}]


Next, instead of loading the "base" TARS model for English, we load the named entity recognition (NER) model. This prior knowledge allows our model to identify named entities in our input sequences.

In [6]:
from flair.models import TARSTagger
from flair.data import Sentence

tars = TARSTagger.load('tars-ner')  # Load prior worldly knowledge (pretrained TARS NER model)

labels = ["Soccer Team", "University", "Vehicle", "River", "City", "Country", "Person", "Movie", "TV Show"]  # Define the entity labels

sentences = [
    Sentence("The Humboldt University of Berlin is situated near the Spree in Berlin, Germany"),
    Sentence("Bayern Munich played against Real Madrid"),
    Sentence("I flew with an Airbus A380 to Peru to pick up my Porsche Cayenne"),
    Sentence("Game of Thrones is my favorite series"),
]  # Input sentences

tars.add_and_switch_to_new_task('task 1', labels, label_type='ner')  # Register our labels as a new NER task and switch to it

for sentence in sentences:
    tars.predict(sentence)
    print(sentence.to_tagged_string("ner"))

2022-02-23 03:32:23,343 loading file /Users/simon/.flair/models/tars-ner.pt
The Humboldt <B-University> University <I-University> of <I-University> Berlin <E-University> is situated near the Spree <S-River> in Berlin <S-City> , Germany <S-Country>
Bayern <B-Soccer Team> Munich <E-Soccer Team> played against Real <B-Soccer Team> Madrid <E-Soccer Team>
I flew with an Airbus <B-Vehicle> A380 <E-Vehicle> to Peru <S-City> to pick up my Porsche <B-Vehicle> Cayenne <E-Vehicle>
Game <B-TV Show> of <I-TV Show> Thrones <E-TV Show> is my favorite series


As we can see, this is working very well! The one visible slip is that "Peru" is tagged as a City rather than a Country. This reminds us of when you have "heard of a sports team" but know nothing about it and don't remember where you heard of it in the first place. When someone says "Real Madrid" in a sentence, you somehow know, from no context other than prior worldly knowledge, that they are talking about soccer.
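
The tagged strings printed above mark each entity token with a prefix: B for the first token of an entity, I for a token inside it, E for its last token, and S for a single-token entity. Here is a small sketch of how that text output could be collapsed into whole entities. The regular expression and the `extract_entities` helper are our own and assume exactly the format printed above; they are not part of flair.

```python
import re

# One tagged token looks like: Munich <E-Soccer Team>
TAGGED_TOKEN = re.compile(r"(\S+) <([BIES])-([^>]+)>")


def extract_entities(tagged):
    """Collect (entity text, label) pairs from a string tagged with B/I/E/S prefixes."""
    entities, current = [], []
    for token, prefix, label in TAGGED_TOKEN.findall(tagged):
        if prefix == "S":    # single-token entity
            entities.append((token, label))
        elif prefix == "B":  # first token of a longer entity
            current = [token]
        elif prefix == "I":  # token inside an entity
            current.append(token)
        else:                # "E": last token closes the entity
            current.append(token)
            entities.append((" ".join(current), label))
            current = []
    return entities


line = ("Bayern <B-Soccer Team> Munich <E-Soccer Team> played against "
        "Real <B-Soccer Team> Madrid <E-Soccer Team>")
print(extract_entities(line))
```

Untagged tokens such as "played" and "against" never match the pattern, so they are skipped without any special handling.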

In the next snippet we get a little more involved, implementing a more difficult classification that requires additional training of our model. In principle, only a few data points are needed, although the cell below trains on the full GO_EMOTIONS training split (43,410 sentences, according to the log). Here we take the "prior worldly knowledge" from a TARS model and define a label dictionary of emotions from the GO_EMOTIONS corpus for classification. Thus, our model already has a very good idea of how English and natural language work, but needs to be briefly trained on what "emotions" are before it can perform.

In [None]:
from flair.datasets import GO_EMOTIONS
from flair.models import TARSClassifier
from flair.trainers import ModelTrainer

tars = TARSClassifier.load('resources/taggers/trec/best-model.pt')  # Load prior worldly knowledge from a previously saved TARS checkpoint

# Load a new flair corpus, e.g. GO_EMOTIONS
new_corpus = GO_EMOTIONS()
label_type = "emotion"
label_dict = new_corpus.make_label_dictionary(label_type=label_type)

# Register the emotion labels as a TARS task and switch to it before training
tars.add_and_switch_to_new_task("GO_EMOTIONS",
                                label_dictionary=label_dict,
                                label_type=label_type)

trainer = ModelTrainer(tars, new_corpus)  # Set up the trainer for the emotions corpus
trainer.train(base_path='resources/taggers/go_emotions', # path to store the model artifacts
              learning_rate=0.02, 
              mini_batch_size=16,
              mini_batch_chunk_size=4, 
              max_epochs=10, 
              )

2022-02-23 03:32:47,783 loading file resources/taggers/trec/best-model.pt
2022-02-23 03:32:53,158 Reading data from /Users/simon/.flair/datasets/go_emotions
2022-02-23 03:32:53,159 Train: /Users/simon/.flair/datasets/go_emotions/train.txt
2022-02-23 03:32:53,160 Dev: /Users/simon/.flair/datasets/go_emotions/dev.txt
2022-02-23 03:32:53,160 Test: /Users/simon/.flair/datasets/go_emotions/test.txt
2022-02-23 03:32:54,084 Initialized corpus /Users/simon/.flair/datasets/go_emotions (label type name is 'emotion')
2022-02-23 03:32:54,085 Computing label dictionary. Progress:


100%|██████████| 43410/43410 [00:12<00:00, 3470.37it/s]

2022-02-23 03:33:40,240 Corpus contains the labels: emotion (#43410)
2022-02-23 03:33:40,241 Created (for label 'emotion') Dictionary with 29 tags: <unk>, NEUTRAL, ANGER, FEAR, ANNOYANCE, SURPRISE, GRATITUDE, DESIRE, OPTIMISM, ADMIRATION, CONFUSION, AMUSEMENT, APPROVAL, CARING, EMBARRASSMENT, REALIZATION, DISAPPOINTMENT, GRIEF, SADNESS, CURIOSITY, JOY, LOVE, EXCITEMENT, DISAPPROVAL, REMORSE, DISGUST, RELIEF, PRIDE, NERVOUSNESS
2022-02-23 03:33:40,241 Task `GO_EMOTIONS` already exists in TARS model. Switching to it.
2022-02-23 03:33:40,244 ----------------------------------------------------------------------------------------------------
2022-02-23 03:33:40,246 Model: "TARSClassifier(
  (tars_model): TextClassifier(
    (loss_function): CrossEntropyLoss()
    (document_embeddings): TransformerDocumentEmbeddings(
      (model): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512

2022-02-23 03:33:40,247 ----------------------------------------------------------------------------------------------------
2022-02-23 03:33:40,247 Corpus: "Corpus: 43410 train + 5426 dev + 5427 test sentences"
2022-02-23 03:33:40,248 ----------------------------------------------------------------------------------------------------
2022-02-23 03:33:40,249 Parameters:
2022-02-23 03:33:40,249  - learning_rate: "0.02"
2022-02-23 03:33:40,250  - mini_batch_size: "16"
2022-02-23 03:33:40,250  - patience: "3"
2022-02-23 03:33:40,251  - anneal_factor: "0.5"
2022-02-23 03:33:40,251  - max_epochs: "10"
2022-02-23 03:33:40,251  - shuffle: "True"
2022-02-23 03:33:40,252  - train_with_dev: "False"
2022-02-23 03:33:40,252  - batch_growth_annealing: "False"
2022-02-23 03:33:40,253 ----------------------------------------------------------------------------------------------------
2022-02-23 03:33:40,254 Model training base path: "resources/taggers/go_emotions"
2022-02-23 03:33:40,254 ------------




2022-02-23 04:23:32,851 epoch 1 - iter 271/2714 - loss 0.01426208 - samples/sec: 1.46 - lr: 0.020000
2022-02-23 05:13:47,877 epoch 1 - iter 542/2714 - loss 0.01390668 - samples/sec: 1.44 - lr: 0.020000
2022-02-23 06:04:33,292 epoch 1 - iter 813/2714 - loss 0.01415939 - samples/sec: 1.42 - lr: 0.020000
2022-02-23 06:57:58,626 epoch 1 - iter 1084/2714 - loss 0.01404646 - samples/sec: 1.35 - lr: 0.020000
2022-02-23 07:51:22,612 epoch 1 - iter 1355/2714 - loss 0.01407711 - samples/sec: 1.35 - lr: 0.020000
2022-02-23 08:44:43,219 epoch 1 - iter 1626/2714 - loss 0.01407313 - samples/sec: 1.35 - lr: 0.020000
2022-02-23 09:39:03,155 epoch 1 - iter 1897/2714 - loss 0.01411025 - samples/sec: 1.33 - lr: 0.020000
2022-02-23 10:32:33,857 epoch 1 - iter 2168/2714 - loss 0.01406061 - samples/sec: 1.35 - lr: 0.020000
2022-02-23 11:24:14,160 epoch 1 - iter 2439/2714 - loss 0.01400876 - samples/sec: 1.40 - lr: 0.020000
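
The numbers in the log above are enough for a back-of-the-envelope estimate of how long this training run would take. The sketch below uses only values printed in the log and passed to `trainer.train`, and it assumes the throughput stays constant and ignores evaluation time, so treat the result as a rough guide.

```python
import math
from datetime import datetime

train_sentences = 43410  # "Corpus: 43410 train ..." in the log
mini_batch_size = 16     # value passed to trainer.train
max_epochs = 10          # value passed to trainer.train

# Number of mini-batches needed to see every training sentence once.
iters_per_epoch = math.ceil(train_sentences / mini_batch_size)
print(iters_per_epoch)

# Two consecutive progress lines from the log, 271 iterations apart.
t0 = datetime.fromisoformat("2022-02-23 04:23:32")
t1 = datetime.fromisoformat("2022-02-23 05:13:47")
seconds_per_iter = (t1 - t0).total_seconds() / 271

hours_per_epoch = iters_per_epoch * seconds_per_iter / 3600
print(f"{hours_per_epoch:.1f} h per epoch, "
      f"{hours_per_epoch * max_epochs:.0f} h for {max_epochs} epochs")
```

The iteration count agrees with the "2714" shown in the progress lines, and the implied throughput of 16 sentences per iteration agrees with the logged samples/sec.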


In [4]:

tars = TARSClassifier.load('tars-base')  # Load prior worldly knowledge

existing_tasks = tars.list_existing_tasks()
print(f"Existing tasks are: {existing_tasks}")  # Check which tasks the model already knows


tars.switch_to_task("GO_EMOTIONS")  # Switch to the emotions task

sentence = Sentence("I absolutely love this!")
tars.predict(sentence)
print(sentence)

Existing tasks are: {'AGNews', 'DBPedia', 'IMDB', 'SST', 'TREC_6', 'NEWS_CATEGORY', 'Amazon', 'Yelp', 'GO_EMOTIONS'}
Sentence: "I absolutely love this !"   [− Tokens: 5  − Sentence-Labels: {"label": [LOVE (0.9708)]}]


I reran this notebook (ill-advised), so the training above didn't finish, but luckily we can still see the output from the previous run above. The one prediction shown, LOVE with a score of 0.9708 for "I absolutely love this!", looks right.