<a href="https://colab.research.google.com/github/jonkrohn/NLP-with-LLMs/blob/main/code/GPyT-code-completion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPyT for Python code completion

This notebook, based on [one by Sinan Ozdemir](https://github.com/sinanuozdemir/oreilly-gpt-hands-on-nlg/blob/main/notebooks/Third_Party_Models.ipynb), shows how easy it is to use the `transformers` library to work with **third-party models**. (Sinan's notebook contains several more examples.)

The [GPyT model](https://huggingface.co/Sentdex/GPyT) is a GPT-2 architecture trained from scratch on Python code from GitHub. It generates Python code completions: given the start of a snippet, it predicts the tokens that follow.

### Load dependencies

In [None]:
%%capture
!pip install transformers==4.28.0

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

### Load model

In [None]:
tokenizer = AutoTokenizer.from_pretrained("Sentdex/GPyT")
model = AutoModelForCausalLM.from_pretrained("Sentdex/GPyT")

### Perform inference

In [None]:
input_code = """import pandas as pd
import numpy as np

df = pd"""  # I'd expect a read_csv here

# The prompt is passed to GPyT with newlines swapped for the "<N>" token
converted = input_code.replace("\n", "<N>")

tokenized = tokenizer.encode(converted, return_tensors='pt')
resp = model.generate(tokenized,
                      max_new_tokens=8,  # at most 8 tokens beyond the prompt
                      pad_token_id=tokenizer.eos_token_id)  # suppresses warning

decoded = tokenizer.decode(resp[0])
reformatted = decoded.replace("<N>", "\n")  # restore real newlines

print(reformatted)

import pandas as pd
import numpy as np

df = pd.read_csv('data/data
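To try other prompts without repeating the convert, generate, and restore steps, they can be wrapped in small helpers. The following is a minimal sketch, not part of the original notebook: the names `NEWLINE_TOKEN`, `to_model_format`, `from_model_format`, and `complete_code` are hypothetical, and `complete_code` assumes it is handed the `model` and `tokenizer` objects loaded earlier. It uses `max_new_tokens`, which caps only the generated tokens rather than the prompt plus the completion.

```python
NEWLINE_TOKEN = "<N>"  # placeholder the notebook substitutes for "\n"


def to_model_format(code: str) -> str:
    """Replace real newlines with the placeholder before tokenizing."""
    return code.replace("\n", NEWLINE_TOKEN)


def from_model_format(text: str) -> str:
    """Turn placeholder tokens in decoded text back into real newlines."""
    return text.replace(NEWLINE_TOKEN, "\n")


def complete_code(model, tokenizer, code: str, max_new_tokens: int = 8) -> str:
    """Hypothetical helper: return the prompt followed by the generated tokens.

    `model` and `tokenizer` are assumed to be the objects loaded with
    `from_pretrained` earlier in the notebook.
    """
    input_ids = tokenizer.encode(to_model_format(code), return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,        # limit on generated tokens only
        pad_token_id=tokenizer.eos_token_id,  # same setting as the cell above
    )
    return from_model_format(tokenizer.decode(output_ids[0]))
```

With the model loaded, `print(complete_code(model, tokenizer, input_code))` would repeat the experiment above, and a larger `max_new_tokens` lets a completion run past the point where the eight-token budget cuts it off. The round trip through `to_model_format` and `from_model_format` is lossless only when the prompt does not itself contain the literal text `<N>`.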
