# Loading Text, Chunking, Embedding and Upserting into Pinecone Index

Most of this is adapted from James Briggs' notebook: https://www.pinecone.io/learn/langchain-retrieval-augmentation/

### 1. Load Text

In [17]:
doc_path = "feedback_data.txt"

# Read the whole file into a single string
with open(doc_path, 'r') as f:
    contents = f.read()

In [18]:
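# set up tokenizer
# Note: text-embedding-ada-002 tokenizes with cl100k_base, so p50k_base counts are approximate for it
import tiktoken
tokenizer = tiktoken.get_encoding('p50k_base')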


# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

# sample
tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

28

### 2. Create chunking function

In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_text(contents)
chunks[0]

"FEEDBACK DATA:\nName\tTitle\tRating\tDate\tReview\tPros\tCons\nJohn Doe\tSoftware Engineer\t5\t8/1/23\tLove the functionality and ease of use!\tIntuitive UI, great customer support\tMinor bugs on macOS\nJane Smith\tProduct Manager\t4\t8/2/23\tVery helpful for our team's workflow.\tIntegration with other tools, flexibility\tLacks a few features we wanted\nRobert Brown\tIT Specialist\t3\t8/3/23\tIt's decent but could be improved.\tReliable performance, easy installation\tExpensive, occasional crashes\nLucy White\tData Analyst\t4\t8/4/23\tGreat software but needs some tweaks.\tEfficient data processing, good documentation\tHard to understand error messages\nMichael Green\tTeam Lead\t5\t8/5/23\tHighly recommended for all teams!\tSmooth collaboration features, robust\tNone so far\nElla Black\tSoftware Tester\t2\t8/6/23\tToo many bugs for my liking.\tDecent UI\tCrashes often, hard to troubleshoot\nAnna Johnson\tDevOps Engineer\t4\t8/7/23\tWorks seamlessly with our deployment pipeline.\tGood

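To make the `chunk_size` and `chunk_overlap` settings more concrete, here is a minimal sketch of fixed-size windows that share a few items with their neighbour. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter breaks on the listed separators and merges the pieces while measuring them with `length_function`). The helper name `chunk_with_overlap` is hypothetical, and plain words stand in for tokens.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a list into windows of at most chunk_size items,
    each sharing chunk_overlap items with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Words stand in for tokens in this illustration
words = "one two three four five six seven eight nine ten".split()
windows = chunk_with_overlap(words, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(window)
```

Each window starts with the last word of the previous one, which is the same idea as the 20-token overlap configured above: context at a chunk boundary appears in both neighbouring chunks.
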
### 3. Create Embeddings

In [20]:
# initialize embedding function
from langchain.embeddings.openai import OpenAIEmbeddings
import os

# Read the key from the environment instead of hard-coding it in the notebook.
# There is a free tier. Still trying to figure out how to use the Azure deployment instead.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

model_name = 'text-embedding-ada-002'

# set embeddings function
embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

In [21]:
# Build the records for upserting into the Pinecone index. Format: (id, embedding, metadata)
from uuid import uuid4

# Embed all chunks in one call instead of one request per chunk
embeddings = embed.embed_documents(chunks)
vectors = [(str(uuid4()), embedding, {"text": text}) for text, embedding in zip(chunks, embeddings)]


#### How the 'vectors' (embeddings) look when printed
Each chunk of data is represented by a vector of 1536 elements.

![vectors](assets/vectors.png)
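If the screenshot does not render, a record can also be inspected in code. This is a small sketch, assuming each record is an `(id, values, metadata)` tuple as built above. The helper `summarize_record` is hypothetical, and the example uses a stand-in list of zeros rather than a real embedding.

```python
from uuid import uuid4


def summarize_record(record, preview=3):
    """Return a compact description of one (id, values, metadata) tuple."""
    record_id, values, metadata = record
    return {
        "id": record_id,
        "dimension": len(values),          # number of elements in the embedding
        "first_values": values[:preview],  # short preview instead of the full vector
        "metadata_keys": sorted(metadata),
    }


# Stand-in record: a real one would hold the embedding returned by the API
example_record = (str(uuid4()), [0.0] * 1536, {"text": "example chunk"})
summary = summarize_record(example_record)
print(summary["dimension"], summary["metadata_keys"])
```

Running the same helper over `vectors[0]` should report a dimension of 1536 and a single `text` metadata key.
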

### 4. Prep Pinecone Index

In [22]:
import pinecone

index_name = 'fd'
dimension = 1536  # output size of text-embedding-ada-002

# The key is read from the environment instead of being hard-coded (os is imported in the embeddings cell)
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],  # get yours from pinecone.io. there is a free tier.
    environment="gcp-starter"
)

# delete index if it exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

# create index
pinecone.create_index(
    name=index_name,
    metric='cosine',
    dimension=dimension
)

### 5. Upsert vectors to index

In [10]:
# connect to index
index = pinecone.Index(index_name)

# upsert vectors to pinecone
# each tuple already carries its values and metadata, so no extra flags are needed
index.upsert(
    vectors=vectors,
    #namespace=index_name,
)

index.describe_index_stats()

ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'Content-Length': '104', 'date': 'Thu, 24 Aug 2023 15:58:18 GMT', 'x-envoy-upstream-service-time': '4', 'server': 'envoy', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"code":3,"message":"Vector dimension 1536 does not match the dimension of the index 1540","details":[]}
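
The response body says the vectors have 1536 elements but the index was 1540-dimensional when the upsert ran. The upsert cell is numbered `In [10]` while the cell that creates the index with `dimension=1536` is `In [22]`, so this output was probably produced against an index created earlier with a different dimension. That is an inference from the cell numbers, not something the notebook confirms. Re-running the cells in order should recreate the index at 1536. A check like the sketch below can catch the mismatch before the request is sent. `check_dimensions` is a hypothetical helper, and the index dimension is passed in by hand here. In the 2.x client it is assumed to be readable via `pinecone.describe_index(index_name).dimension`.

```python
def check_dimensions(vectors, index_dimension):
    """Raise ValueError if any (id, values, metadata) tuple has the wrong length."""
    for record_id, values, _metadata in vectors:
        if len(values) != index_dimension:
            raise ValueError(
                f"Vector {record_id} has dimension {len(values)}, "
                f"but the index expects {index_dimension}"
            )


# Stand-in data reproducing the mismatch reported above
sample_vectors = [("a", [0.0] * 1536, {"text": "chunk"})]

try:
    check_dimensions(sample_vectors, index_dimension=1540)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)

print(mismatch_message)
```

Calling `check_dimensions(vectors, dimension)` just before `index.upsert(...)` turns the server-side 400 into a local error that names the offending record.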
