# **Training Llama 2 7B on a cancer inhibitor dataset of small-molecule activity against protein kinases**

## **Install replicate to run the Llama 2 LLM using an API**

In [None]:
! pip install replicate

Collecting replicate
  Downloading replicate-0.26.1-py3-none-any.whl (40 kB)
Collecting httpx<1,>=0.21.0 (from replicate)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.21.0->replicate)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.21.0->replicate)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, replicate
Successfully installed h11-0.14.0 ht

## **Set Replicate API token**
Create a token at https://replicate.com/account/api-tokens and paste it into the cell below.

In [None]:
import os

os.environ["REPLICATE_API_TOKEN"] = ""
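Pasting the token into the cell means it is saved with the notebook. A minimal sketch of an alternative, assuming the token is either already exported in the environment or typed in at a hidden prompt; `load_token` is a hypothetical helper, not part of the `replicate` package:

```python
import os
from getpass import getpass


def load_token(name: str = "REPLICATE_API_TOKEN", prompt_fn=getpass) -> str:
    """Return the token from the environment, prompting for it if it is absent."""
    token = os.environ.get(name, "").strip()
    if not token:
        # Hidden prompt, so the value never appears in the notebook output.
        token = prompt_fn(f"Enter {name}: ").strip()
    if not token:
        raise ValueError(f"{name} is empty")
    os.environ[name] = token  # the replicate client reads this variable
    return token
```

Calling `load_token()` once before `import replicate` would replace the hardcoded assignment above.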

# **Testing chatbot text generation**

In [None]:
import replicate

# Prompts
pre_prompt = "You are a helpful assistant. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'."  # Instruction that sets the model's role
prompt_input = "What is a llama?"  # The question the model answers

# Named model_input to avoid shadowing Python's built-in input()
model_input = {
    "top_p": 0.1,
    "prompt": f"{pre_prompt} {prompt_input} Assistant: ",
    "temperature": 0.1,
    "max_new_tokens": 500,
    "min_new_tokens": -1
}

# Stream the response and print each piece as it arrives
for event in replicate.stream(
    "meta/llama-2-7b-chat",
    input=model_input
):
    print(event, end="")


 Hello! I'm here to help you with any questions you may have. A llama is a domesticated South American mammal that is closely related to the alpaca. Llamas are known for their distinctive long necks, ears, and soft, woolly coats. They are often used as pack animals in the Andes region, where they originated, and are also kept as pets in many parts of the world. Is there anything else I can help you with?
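The request above sets both `top_p` and `temperature` to 0.1, which keeps generation close to deterministic. The sketch below illustrates the general idea of top-p (nucleus) filtering on a made-up next-token distribution. It is an illustration only, not the sampling code used by Replicate or Llama 2, and `top_p_filter` is a hypothetical name.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most-likely tokens whose total probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalise so the surviving probabilities sum to 1.
    return {token: p / total for token, p in kept.items()}


# Toy next-token distribution (made-up numbers).
toy = {"mammal": 0.6, "animal": 0.3, "robot": 0.1}
print(top_p_filter(toy, 0.1))   # only the most likely token survives
print(top_p_filter(toy, 0.95))  # all three tokens survive
```

With `top_p=0.1` the model can almost only pick its single most likely token at each step, which is why repeated runs of the cell tend to give the same answer.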

# **Using Kaggle Cancer Inhibitor dataset for training**

Dataset: https://www.kaggle.com/datasets/xiaotawkaggle/inhibitors

In [None]:
! pip install kaggle



In [None]:
! mkdir -p ~/.kaggle

Upload kaggle.json to this Colab notebook so the Kaggle API can authenticate.
To create the file, go to https://www.kaggle.com/settings > API > Create new token

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json
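If the download step fails, the credentials file is the first thing to check. A small sketch, assuming the standard `kaggle.json` layout with `username` and `key` fields; `check_kaggle_credentials` is a hypothetical helper, not part of the `kaggle` package:

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path: Path) -> list[str]:
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        # Any group/other permission bit means the file is readable by other users.
        problems.append(f"permissions are {oct(mode)}; expected 0o600")
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    if not isinstance(creds, dict):
        return problems + ["file does not contain a JSON object"]
    for key in ("username", "key"):
        if not creds.get(key):
            problems.append(f"missing field: {key}")
    return problems


print(check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json"))
```

An empty list means the file is in place, private to the current user, and has both fields.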

In [None]:
! kaggle datasets download xiaotawkaggle/inhibitors

Dataset URL: https://www.kaggle.com/datasets/xiaotawkaggle/inhibitors
License(s): CC-BY-NC-SA-4.0
Downloading inhibitors.zip to /content
 79% 82.0M/104M [00:01<00:00, 79.4MB/s]
100% 104M/104M [00:01<00:00, 94.7MB/s] 


In [None]:
! unzip inhibitors.zip

Archive:  inhibitors.zip
  inflating: abbr.csv                
  inflating: cdk2.h5                 
  inflating: cdk2_test.csv           
  inflating: cdk2_train.csv          
  inflating: egfr_erbB1.h5           
  inflating: egfr_erbB1_test.csv     
  inflating: egfr_erbB1_train.csv    
  inflating: gsk3b.h5                
  inflating: gsk3b_test.csv          
  inflating: gsk3b_train.csv         
  inflating: hgfr.h5                 
  inflating: hgfr_test.csv           
  inflating: hgfr_train.csv          
  inflating: map_k_p38a.h5           
  inflating: map_k_p38a_test.csv     
  inflating: map_k_p38a_train.csv    
  inflating: pubchem_neg_sample.h5   
  inflating: pubchem_neg_sample_part_1.csv  
  inflating: pubchem_neg_sample_part_2.csv  
  inflating: pubchem_neg_sample_part_3.csv  
  inflating: tpk_lck.h5              
  inflating: tpk_lck_test.csv        
  inflating: tpk_lck_train.csv       
  inflating: tpk_src.h5              
  inflating: tpk_src_test.csv        
  in

Use the training file for Cyclin-dependent kinase 2 (CDK2), cdk2_train.csv, as the data Llama 2 will answer questions about.
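The archive follows a naming pattern: each kinase target has a `_train.csv` and a `_test.csv` file. A short sketch that groups file names by target, using a subset of the names from the unzip listing; `group_by_target` is a hypothetical helper:

```python
from collections import defaultdict

# File names copied from the unzip listing (subset).
files = [
    "cdk2.h5", "cdk2_test.csv", "cdk2_train.csv",
    "egfr_erbB1.h5", "egfr_erbB1_test.csv", "egfr_erbB1_train.csv",
    "gsk3b.h5", "gsk3b_test.csv", "gsk3b_train.csv",
]


def group_by_target(names):
    """Map each kinase target prefix to its train/test CSV files."""
    groups = defaultdict(dict)
    for name in names:
        for split in ("train", "test"):
            suffix = f"_{split}.csv"
            if name.endswith(suffix):
                # Everything before the suffix is the target name, e.g. "cdk2".
                groups[name[: -len(suffix)]][split] = name
    return dict(groups)


splits = group_by_target(files)
print(splits["cdk2"])
```

Swapping `"cdk2"` for another key would point the rest of the notebook at a different kinase.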

# **Using LangChain and Pinecone to index the dataset so that Llama 2 can retrieve from it**

In [None]:
! pip install langchain langchain-community langchain-core pinecone transformers sentence-transformers



In [None]:
import os

# Third-party integrations live in langchain-community; the older langchain.* paths for them are deprecated
from langchain_community.llms import Replicate
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain

Initialize Pinecone with API Key and Environment

https://app.pinecone.io/organizations/-O0DMW8sX5mY9OFiqyhh/projects/650d4fd9-8428-4461-b248-bdbf8e44edb7/keys

In [None]:
# pc = Pinecone(api_key="")
# index = pc.Index("llama2")

Load the CSV file

In [None]:
loader = CSVLoader(file_path="./cdk2_train.csv")
documents = loader.load()
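`CSVLoader` produces one document per CSV row, with the row rendered as `column: value` lines. The following is a rough standard-library imitation of that behaviour, not LangChain's own code. The column names in the sample are hypothetical, since the real columns of cdk2_train.csv are not shown in this notebook.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    docs = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per field, joined with single newlines.
        content = "\n".join(f"{col}: {val}" for col, val in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs


# Hypothetical two-column sample.
sample = "compound_id,label\nCID_1,1\nCID_2,0\n"
for doc in rows_to_documents(sample, "sample.csv"):
    print(doc)
```

Because every row becomes its own document, the retriever later returns individual rows rather than a view of the whole table.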

In [None]:
# Split documents into chunks of at most 1000 characters, with no overlap between chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
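`CharacterTextSplitter` splits on a separator (a blank line, `"\n\n"`, by default) and then merges the pieces back into chunks up to `chunk_size`. The simplified sketch below covers the zero-overlap case only and is not LangChain's implementation. It suggests that a row rendered with single newlines has nothing to split on, so each row would likely pass through as one chunk.

```python
def split_text(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> list[str]:
    """Split on `separator`, then greedily merge pieces into chunks of at most chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits, or the chunk is empty and an oversized piece must be kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


row = "compound_id: CID_1\nlabel: 1"  # hypothetical row text, single newlines only
print(split_text(row))
print(split_text("aaaa\n\nbbbb\n\ncccc", chunk_size=10))
```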

In [None]:
embeddings = HuggingFaceEmbeddings()

  warn_deprecated(
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Pinecone Index for Storing Vectors

In [None]:
! pip install -U pinecone-client langchain

Collecting pinecone-client
  Downloading pinecone_client-4.1.1-py3-none-any.whl (216 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone-client)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-client
Successfully installed pinecone-client-4.1.1 pinecone-plugin-interface-0.0.7


In [None]:
! pip install --upgrade langchain-pinecone



Set Pinecone API Key

https://app.pinecone.io/organizations/-O0DMW8sX5mY9OFiqyhh/projects/650d4fd9-8428-4461-b248-bdbf8e44edb7/keys

In [None]:
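from langchain_pinecone import PineconeVectorStore

os.environ['PINECONE_API_KEY'] = ''  # paste your Pinecone API key here
# Embed each chunk and upload it to the existing "llama2" index.
# The index dimension must match the embedding size (768 for the default HuggingFaceEmbeddings model).
vectordb = PineconeVectorStore.from_documents(texts, embeddings, index_name="llama2")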
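Conceptually, the vector store answers a query by comparing the query's embedding with every stored embedding and returning the closest matches. A toy sketch of that lookup, assuming a cosine metric; the metric actually used depends on how the `llama2` index was created, and `cosine` and `top_k` are hypothetical names:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k stored vectors most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy 3-dimensional vectors; real sentence embeddings have hundreds of dimensions.
toy_index = {"row-0": [1.0, 0.0, 0.0], "row-1": [0.7, 0.7, 0.0], "row-2": [0.0, 0.0, 1.0]}
print(top_k([1.0, 0.1, 0.0], toy_index, k=2))
```

A production index uses approximate search instead of scanning every vector, but the ranking idea is the same.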

# **Connecting Llama 2 to the indexed dataset**

In [None]:
# Note: this is the 13B chat model hosted on Replicate, not the 7B model tested earlier
llm = Replicate(
    model="a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5",
    input={"temperature": 0.1, "max_length": 1000}
)



In [None]:
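qa_chain = ConversationalRetrievalChain.from_llm(
    llm,
    # Retrieve the 2 most similar chunks from Pinecone for each question
    vectordb.as_retriever(search_kwargs={'k': 2}),
    # Also return the retrieved chunks so the answer can be checked against them
    return_source_documents=True
)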
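Roughly, a conversational retrieval chain works in three steps: rewrite a follow-up question into a standalone one using the chat history, retrieve matching passages, then ask the model to answer from those passages. The sketch below shows that flow with plain callables. The prompt wording and the `conversational_retrieval` function are hypothetical and do not reproduce LangChain's actual templates.

```python
from typing import Callable


def conversational_retrieval(
    question: str,
    chat_history: list[tuple[str, str]],
    llm: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
) -> dict:
    # Step 1: fold earlier turns into a standalone question (skipped on the first turn).
    if chat_history:
        turns = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
        search_question = llm(
            f"Chat history:\n{turns}\nFollow-up: {question}\nStandalone question:"
        )
    else:
        search_question = question
    # Step 2: fetch the passages most similar to that question.
    passages = retrieve(search_question)
    # Step 3: answer using the retrieved passages as context.
    context = "\n\n".join(passages)
    answer = llm(f"Context:\n{context}\n\nQuestion: {search_question}\nAnswer:")
    return {"answer": answer, "source_documents": passages}
```

The model's weights never change in this flow; everything it knows about the dataset comes from the passages placed in the prompt.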

# **Llama 2 can now retrieve rows from the dataset, and we can ask it questions about the data**

In [None]:
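chat_history = []  # list of (question, answer) pairs from earlier turns
query = "Prompt: What can you tell me about this dataset?"
# The chain retrieves the most similar chunks and passes them to the model with the question
result = qa_chain({'question': query, 'chat_history': chat_history})
print('Answer: ' + result['answer'] + '\n')
# Record this turn so follow-up questions have context
chat_history.append((query, result['answer']))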

Answer:  Hello! I'd be happy to help you with your question about the dataset.

Based on the information provided, the dataset appears to be a set of numbers arranged in a table with the following dimensions:

* Row 1: 0
* Row 2: 0
* Row 3: 8
* Row 4: 3
* Row 5: 2
* Row 6: 15
* Row 7: 14

It looks like there are no missing values in the dataset, and all the values are non-negative integers. The dataset appears to be balanced, with no obvious patterns or trends in the data.

Without more information about the context and purpose of the dataset, it's difficult to provide any further insights or interpretations. However, I'm happy to help answer any specific questions you may have or provide guidance on how to analyze the data further.

Please let me know if there's anything else you'd like to know or if you have any specific questions about the dataset!
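The answer only reflects the few rows that were retrieved for the question, so it helps to look at those rows alongside each answer. A small sketch of a helper for follow-up questions, assuming the chain is callable with a `question`/`chat_history` dict as in the cell above; `ask` is a hypothetical name:

```python
def ask(chain, question: str, chat_history: list) -> str:
    """Send one question through the chain, record the turn, and show the retrieved rows."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    for doc in result.get("source_documents", []):
        # LangChain documents expose their text as page_content; fall back to str() otherwise.
        print("source:", str(getattr(doc, "page_content", doc))[:80])
    return result["answer"]
```

For example, `ask(qa_chain, "Which columns does each row contain?", chat_history)` would print the answer's source rows and extend `chat_history` in place.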

