<a href="https://colab.research.google.com/github/kanalive/notebooks/blob/main/chatwithcode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Your API Keys

In [4]:
import os
import json
# Load environment object from JSON
file_path = '/content/drive/MyDrive/keys/keys.json'  # Replace with the actual file path
with open(file_path, 'r') as file:
    loaded_object = json.load(file)

os.environ['OPENAI_API_KEY'] = loaded_object['OPEN_AI_API_KEY']
os.environ['ACTIVELOOP_TOKEN'] = loaded_object['ACTIVELOOP_TOKEN']
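
The cell above assumes both entries exist in the JSON file. As a minimal sketch, the loading step could fail early with a clear message when an entry is missing or empty. The helper name `load_keys` and the placeholder values are hypothetical; the key names are the ones used in the cell above.

```python
import json
import os
import tempfile

# Key names as they appear in the cell above.
REQUIRED_KEYS = ("OPEN_AI_API_KEY", "ACTIVELOOP_TOKEN")


def load_keys(path, required=REQUIRED_KEYS):
    """Return the required entries from a JSON key file.

    Raises KeyError if any required entry is missing or empty.
    """
    with open(path, "r") as fh:
        data = json.load(fh)
    missing = [name for name in required if not data.get(name)]
    if missing:
        raise KeyError(f"Missing or empty keys in {path}: {', '.join(missing)}")
    return {name: data[name] for name in required}


# Demo with a throwaway file and placeholder values (not real credentials).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(
        {"OPEN_AI_API_KEY": "sk-placeholder", "ACTIVELOOP_TOKEN": "al-placeholder"},
        tmp,
    )
demo_keys = load_keys(tmp.name)
os.remove(tmp.name)
```

In the notebook, the returned values would then be copied into `os.environ` exactly as the original cell does.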

In [3]:
!python3 -m pip install --upgrade langchain deeplake openai tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Embedding

In [5]:
import os

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings(disallowed_special=())


# Index the code base

In [6]:
!git clone https://github.com/twitter/the-algorithm  # replace with any repository of your choice


Cloning into 'the-algorithm'...
remote: Enumerating objects: 8683, done.
remote: Counting objects: 100% (1979/1979), done.
remote: Compressing objects: 100% (1416/1416), done.
remote: Total 8683 (delta 394), reused 1948 (delta 390), pack-reused 6704
Receiving objects: 100% (8683/8683), 7.41 MiB | 13.33 MiB/s, done.
Resolving deltas: 100% (2607/2607), done.


# Process data with LangChain

In [7]:
import os
from langchain.document_loaders import TextLoader

root_dir = './the-algorithm'
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            # Skip files that fail to load, such as binaries that are not valid UTF-8 text.
            pass

In [8]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
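
To make the effect of `chunk_size` more concrete, here is a simplified pure-Python sketch of separator-based chunking: the text is cut on a separator and the pieces are greedily packed into chunks of at most `chunk_size` characters. This illustrates the general idea only and is not LangChain's implementation; the function name `split_on_separator` is hypothetical, and the blank-line separator is an assumption.

```python
def split_on_separator(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks.

    Each chunk holds at most chunk_size characters, except that a single
    piece longer than chunk_size is emitted as its own oversized chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # ignore empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # the piece still fits (or starts a new chunk)
        else:
            chunks.append(current)  # close the full chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
sample_chunks = split_on_separator(sample, chunk_size=10)
```

With `chunk_size=10`, the first two pieces fit together in one chunk and the third starts a new one. No characters are shared between neighbouring chunks, which corresponds to `chunk_overlap=0` in the cell above.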



In [9]:
username = "kanalive" # replace with your username from app.activeloop.ai
db = DeepLake(dataset_path=f"hub://{username}/twitter-algorithm", embedding_function=embeddings, public=True)  # public=True makes the dataset publicly available
db.add_documents(texts)


Your Deep Lake dataset has been successfully created!



This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/kanalive/twitter-algorithm


 

hub://kanalive/twitter-algorithm loaded successfully.


Evaluating ingest: 100%|██████████| 31/31 [02:53<00:00


Dataset(path='hub://kanalive/twitter-algorithm', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype       shape       dtype  compression
  -------   -------     -------     -------  ------- 
 embedding  generic  (30996, 1536)  float32   None   
    ids      text     (30996, 1)      str     None   
 metadata    json     (30996, 1)      str     None   
   text      text     (30996, 1)      str     None   


['2f45cef4-034d-11ee-bc9c-0242ac1c000c',
 '2f45d188-034d-11ee-bc9c-0242ac1c000c',
 '2f45d21e-034d-11ee-bc9c-0242ac1c000c',
 '2f45d2a0-034d-11ee-bc9c-0242ac1c000c',
 '2f45d30e-034d-11ee-bc9c-0242ac1c000c',
 '2f45d372-034d-11ee-bc9c-0242ac1c000c',
 '2f45d3cc-034d-11ee-bc9c-0242ac1c000c',
 '2f45d430-034d-11ee-bc9c-0242ac1c000c',
 '2f45d48a-034d-11ee-bc9c-0242ac1c000c',
 '2f45d4ee-034d-11ee-bc9c-0242ac1c000c',
 '2f45d548-034d-11ee-bc9c-0242ac1c000c',
 '2f45d5ac-034d-11ee-bc9c-0242ac1c000c',
 '2f45d610-034d-11ee-bc9c-0242ac1c000c',
 '2f45d66a-034d-11ee-bc9c-0242ac1c000c',
 '2f45d6c4-034d-11ee-bc9c-0242ac1c000c',
 '2f45d71e-034d-11ee-bc9c-0242ac1c000c',
 '2f45d778-034d-11ee-bc9c-0242ac1c000c',
 '2f45d7dc-034d-11ee-bc9c-0242ac1c000c',
 '2f45d840-034d-11ee-bc9c-0242ac1c000c',
 '2f45d89a-034d-11ee-bc9c-0242ac1c000c',
 '2f45d91c-034d-11ee-bc9c-0242ac1c000c',
 '2f45d980-034d-11ee-bc9c-0242ac1c000c',
 '2f45d9da-034d-11ee-bc9c-0242ac1c000c',
 '2f45da34-034d-11ee-bc9c-0242ac1c000c',
 '2f45da8e-034d-

In [12]:
db = DeepLake(dataset_path=f"hub://{username}/twitter-algorithm", read_only=True, embedding_function=embeddings)


This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/kanalive/twitter-algorithm




hub://kanalive/twitter-algorithm loaded successfully.

Deep Lake Dataset in hub://kanalive/twitter-algorithm already exists, loading from the storage
Dataset(path='hub://kanalive/twitter-algorithm', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype       shape       dtype  compression
  -------   -------     -------     -------  ------- 
 embedding  generic  (30996, 1536)  float32   None   
    ids      text     (30996, 1)      str     None   
 metadata    json     (30996, 1)      str     None   
   text      text     (30996, 1)      str     None   


  

In [13]:
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10
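
The settings above ask for maximal marginal relevance (MMR) over a candidate pool of `fetch_k` results, returning `k` of them. The following is a rough, self-contained sketch of how MMR selection with cosine similarity can work. It is not Deep Lake's or LangChain's code; the names `mmr_select` and `lambda_mult` and the toy vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=10, fetch_k=100, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    # Step 1: keep only the fetch_k candidates most similar to the query.
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    pool = ranked[:fetch_k]
    selected = []
    # Step 2: repeatedly pick the candidate that is relevant to the query
    # but not too similar to anything already selected.
    while pool and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


# Toy data: candidates 0 and 1 are duplicates, candidate 2 points elsewhere.
toy_candidates = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
picked = mmr_select([2.0, 1.0], toy_candidates, k=2, fetch_k=3)
```

On the toy data, ranking by similarity alone would return the two duplicates, whereas MMR selects one of them plus the dissimilar vector. That is the motivation for enabling it when many code chunks are near-identical.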

In [14]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name='gpt-3.5-turbo')  # or switch to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

# Start asking questions!

In [15]:
questions = [
    "What does favCountParams do?",
] 
chat_history = []

for question in questions:  
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What does favCountParams do? 

**Answer**: Unfortunately, I cannot determine the purpose of `favCountParams` from the given context. It seems to be one of the optional parameters for `ThriftLinearFeatureRankingParams`, but I do not have any additional information about what it does or how it is used in the code. 
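
The loop above passes the accumulated `chat_history` back into the chain on every turn, which is what lets a follow-up question refer to an earlier one. The sketch below isolates that pattern using a stub in place of the real chain, so it runs without API keys. `fake_qa`, `ask_all`, and the second question are hypothetical and exist only for illustration.

```python
def fake_qa(inputs):
    """Hypothetical stand-in for the chain: reports how many prior turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"(stub) answering {inputs['question']!r} with {turns} prior turn(s)"}


def ask_all(chain, questions):
    """Ask each question in order, feeding earlier (question, answer) pairs back in."""
    chat_history = []
    for question in questions:
        result = chain({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
    return chat_history


history = ask_all(fake_qa, ["What does favCountParams do?", "Where is it defined?"])
for asked, answer in history:
    print(f"-> {asked}\n   {answer}")
```

Assuming the chain accepts the same dictionary input as in the cell above, replacing `fake_qa` with `qa` would run the same loop against the real model.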

