## Familiarization with LLM packages such as `LangChain`, `OpenAI` & `tiktoken`

In [8]:
import credentials  # local module that holds the API keys used below

### 1. Tiktoken

Reference: OpenAI Cookbook, "How to count tokens with tiktoken": https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb


An encoding specifies how text is converted into tokens. Different models use different encodings, so the same string can map to different token IDs and token counts depending on the model.
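To make the idea of an encoding concrete, here is a minimal sketch of a toy tokenizer. The vocabulary `TOY_VOCAB` and the helpers `toy_encode` / `toy_decode` are hypothetical and exist only for illustration. The toy uses greedy longest-match lookup, which is a simplification and not the byte-pair-encoding procedure that `tiktoken` implements. It shows only that a word with a leading space and the same word without one can be separate vocabulary entries.

```python
# Toy illustration only: a hypothetical vocabulary, NOT a real tiktoken encoding.
TOY_VOCAB = {
    "Tomorrow": 0,
    " it": 1, "it": 2,
    " will": 3, "will": 4,
    " rain": 5, "rain": 6,
    ".": 7,
}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def toy_encode(text):
    """Greedy longest-match tokenization against TOY_VOCAB."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no toy token matches {text[pos:]!r}")
    return ids


def toy_decode(ids):
    """Map token IDs back to text by concatenating their strings."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


spaced = toy_encode("Tomorrow it will rain.")
joined = toy_encode("Tomorrowitwillrain.")
print(spaced)              # IDs of the space-prefixed entries
print(joined)              # IDs of the entries without a leading space
print(toy_decode(spaced))  # round-trips to the original text
```

The two inputs differ only in whitespace, yet they yield different ID sequences because the space is part of the token itself.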

`tiktoken` supports three encodings used by OpenAI models:

| Encoding name           | OpenAI models                                       |
|-------------------------|-----------------------------------------------------|
| `cl100k_base`           | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`  |
| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003`|
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                         |
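The table can also be written as a plain lookup, sketched below. `MODEL_TO_ENCODING` and `encoding_name_for` are hypothetical names, and the dictionary is only a transcription of the rows above, not an exhaustive or up-to-date list. In practice `tiktoken.encoding_for_model` performs this lookup.

```python
# Transcribed from the table above; illustrative only, not a complete list.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_name_for(model):
    """Hypothetical helper: return the encoding name listed for a model."""
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        raise KeyError(f"model {model!r} is not listed in the table") from None


print(encoding_name_for("gpt-3.5-turbo"))
print(encoding_name_for("text-davinci-003"))
```

Knowing the encoding name matters because token counts are only comparable between models that share an encoding.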

In [9]:
import tiktoken

In [10]:
encoding = tiktoken.get_encoding("cl100k_base")
# Alternative: look up the encoding by model name instead of by encoding name
# encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [None]:
text = 'Tomorrow it will rain in Budapest.'
encoding.encode(text)

[91273, 433, 690, 11422, 304, 70695, 13]

In [None]:
text = 'TomorrowitwillraininBudapest.'
encoding.encode(text)

[91273, 275, 14724, 30193, 258, 33, 664, 28724, 13]

In [None]:
[encoding.decode_single_token_bytes(token) for token in [91273, 275, 14724, 30193, 258, 33, 664, 28724, 13]]

[b'Tomorrow', b'it', b'will', b'rain', b'in', b'B', b'ud', b'apest', b'.']

In [None]:
[encoding.decode_single_token_bytes(token).decode("utf-8") for token in [91273, 275, 14724, 30193, 258, 33, 664, 28724, 13]]

['Tomorrow', 'it', 'will', 'rain', 'in', 'B', 'ud', 'apest', '.']
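Decoding each token's bytes to `str` works here because every token is plain ASCII. Tokens are byte strings, though, and a single token is not guaranteed to fall on a UTF-8 character boundary. The sketch below simulates that case with the standard library only. The two-way split of `"é"` is an assumption made for illustration, not the output of any real encoding, and `safe_decode` is a hypothetical helper.

```python
# "é" is two bytes in UTF-8; pretend (as an assumption) that an encoding
# placed each byte in a different token.
char_bytes = "é".encode("utf-8")           # b'\xc3\xa9'
pieces = [char_bytes[:1], char_bytes[1:]]  # simulated per-token byte strings


def safe_decode(piece):
    """Decode bytes to str, substituting U+FFFD for invalid sequences."""
    return piece.decode("utf-8", errors="replace")


# Strict decoding of an incomplete sequence raises UnicodeDecodeError.
try:
    pieces[0].decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

lenient = [safe_decode(piece) for piece in pieces]
whole = b"".join(pieces).decode("utf-8")  # joining the bytes first is lossless

print(strict_ok)
print(lenient)
print(whole)
```

When per-token strings are needed for display, `errors="replace"` avoids the exception. To recover the exact text, join the bytes first, or decode the full token list with `encoding.decode`.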

In [None]:
encoding.decode([433])

' it'

In [None]:
encoding.decode([275])

'it'

### 2. `LangChain`, `Pinecone` & other vector databases / embedding models for document representation

Set up environment

In [None]:
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(
    api_key=credentials.pinecone_api,      # Pinecone API key from the local credentials module
    environment=credentials.pinecone_loc,  # Pinecone environment name
)

Load PDF

### 3. `OpenAI`

In [None]:
from langchain.llms import OpenAI

In [None]:
llm = OpenAI(model_name='text-davinci-003', openai_api_key=credentials.openai_api)

In [None]:
# Billing must be set up on the OpenAI account before this call will succeed
# llm('Tell me a joke')

In [14]:
llm.get_num_tokens('Tell me a joke.')

5

In [13]:
llm.get_num_tokens('Tellmeajoke.')

5