# **Training Llama 2 7B on NIPS 2015 research papers**

## **Install replicate to run the Llama 2 LLM using an API**

In [None]:
! pip install replicate

Collecting replicate
  Downloading replicate-0.26.1-py3-none-any.whl (40 kB)
Collecting httpx<1,>=0.21.0 (from replicate)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.21.0->replicate)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.21.0->replicate)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, replicate
Successfully installed h11-0.14.0 ht

## **Set Replicate API token**
Create an API token at https://replicate.com/account/api-tokens

In [None]:
import os

os.environ["REPLICATE_API_TOKEN"] = ""
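
Hard-coding the token in a cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming an interactive session: the hypothetical helper `ensure_token` (not part of the `replicate` package) reuses a token already present in the environment and otherwise asks for it with `getpass`, which hides the typed value.

```python
import getpass
import os


def ensure_token(name="REPLICATE_API_TOKEN", prompt_fn=getpass.getpass):
    """Return the token stored in the environment, prompting once if it is missing.

    ensure_token is a hypothetical helper written for this notebook.
    """
    token = os.environ.get(name, "").strip()
    if not token:
        # getpass hides the typed value, so it does not end up in the cell output.
        token = prompt_fn(f"Enter {name}: ").strip()
        if not token:
            raise ValueError(f"{name} must not be empty")
        os.environ[name] = token
    return token
```

Calling `ensure_token()` in a cell before `import replicate` would then replace the assignment above.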

## **Testing chatbot text generation**

In [None]:
import replicate

# Prompts
# Instruction that tells the LLM how it should behave.
pre_prompt = "You are a helpful assistant. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'."
# Question the LLM answers.
prompt_input = "What is a llama?"

# Named model_input so that the built-in input() is not shadowed.
model_input = {
    "top_p": 0.1,
    "prompt": f"{pre_prompt} {prompt_input} Assistant: ",
    "temperature": 0.1,
    "max_new_tokens": 500,
    "min_new_tokens": -1
}

# Stream the response and print each piece as it arrives.
for event in replicate.stream(
    "meta/llama-2-7b-chat",
    input=model_input
):
    print(event, end="")


 Hello! I'm here to help you with any questions you may have. A llama is a domesticated South American mammal that is closely related to the alpaca. Llamas are known for their distinctive long necks, ears, and soft, woolly coats. They are often used as pack animals in the Andes region, where they originated, and are also kept as pets in many parts of the world. Is there anything else I can help you with?
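
The request sent to the model is an ordinary dictionary, so it can be assembled by a small function and reused for other questions. The sketch below uses a hypothetical helper, `build_llama_input`; the keys and default values are copied from the cell above, while the range checks are assumptions added for illustration.

```python
def build_llama_input(pre_prompt, user_prompt, temperature=0.1, top_p=0.1,
                      max_new_tokens=500):
    """Assemble the input dictionary for a single-turn chat request.

    build_llama_input is a hypothetical helper; the keys mirror the cell above.
    """
    # Assumed sanity checks, not taken from the model's documentation.
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    return {
        "top_p": top_p,
        "prompt": f"{pre_prompt} {user_prompt} Assistant: ",
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "min_new_tokens": -1,
    }


payload = build_llama_input("You are a helpful assistant.", "What is a llama?")
```

The resulting `payload` could be passed as the `input` argument of `replicate.stream` in place of the hand-written dictionary.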

## **Using the Kaggle NIPS 2015 Papers dataset for training**

Dataset: https://www.kaggle.com/datasets/benhamner/nips-2015-papers

In [None]:
! pip install kaggle



In [None]:
! mkdir -p ~/.kaggle

Upload kaggle.json to this Colab notebook to access the Kaggle API. An API token can be created at
https://www.kaggle.com/settings

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json
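
The three shell cells above (create the directory, copy the file, restrict its permissions) can also be expressed in plain Python, which is convenient when the notebook has to be re-run. This is a sketch under the assumption that `kaggle.json` sits in the working directory; `install_kaggle_credentials` is a hypothetical helper name, and the `home` parameter exists only to make the function easy to try in a scratch directory.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(source="kaggle.json", home=None):
    """Copy the Kaggle token to <home>/.kaggle/kaggle.json with mode 600."""
    home_dir = Path(home) if home is not None else Path.home()
    target_dir = home_dir / ".kaggle"
    # exist_ok=True keeps the call from failing when the cell is run twice.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(source, target)
    # Owner read/write only, the same effect as `chmod 600`.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```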

In [None]:
! kaggle datasets download benhamner/nips-2015-papers

Dataset URL: https://www.kaggle.com/datasets/benhamner/nips-2015-papers
License(s): ODbL-1.0
Downloading nips-2015-papers.zip to /content
 72% 7.00M/9.67M [00:01<00:00, 10.1MB/s]
100% 9.67M/9.67M [00:01<00:00, 7.94MB/s]


In [None]:
! unzip nips-2015-papers.zip

Archive:  nips-2015-papers.zip
  inflating: Authors.csv             
  inflating: PaperAuthors.csv        
  inflating: Papers.csv              
  inflating: accepted_papers.html    
  inflating: database.sqlite         
  inflating: hashes.txt              


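Papers.csv will be used to train Llama 2. Training here means retrieval augmentation rather than fine-tuning: the papers are embedded and stored in a vector index, and the passages most relevant to each question are passed to the model as context.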
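
Before indexing, it can help to check which columns the file contains and how many rows it has. The following sketch uses only the standard library; `summarize_csv` is a hypothetical helper, and the raised field size limit is an assumption made because full paper texts may exceed the `csv` module's default limit.

```python
import csv


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file."""
    # Assumption: long text fields may exceed the default field size limit.
    csv.field_size_limit(10**9)
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `summarize_csv("Papers.csv")` returns the header row and the number of papers in the file.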

## **Using LangChain and Pinecone to process the dataset so that Llama 2 can be trained on it**

In [None]:
! pip install langchain langchain-community langchain-core pinecone transformers sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.11.0->sentence-transform

In [None]:
import os

# Integrations are imported from langchain_community, which is installed above;
# unused imports have been dropped.
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Replicate

Initialize Pinecone with API Key and Environment

https://app.pinecone.io/organizations/-O0DMW8sX5mY9OFiqyhh/projects/650d4fd9-8428-4461-b248-bdbf8e44edb7/keys

In [None]:
# pc = Pinecone(api_key="")
# index = pc.Index("llama2")

Load the CSV file

In [None]:
import pandas as pd

df = pd.read_csv('Papers.csv')

# Keep only the first five papers.
df = df.head(5)

# Save the reduced data frame back to Papers.csv.
# index=False avoids writing the row index as an extra unnamed column.
df.to_csv('Papers.csv', index=False)

In [None]:
loader = CSVLoader(file_path="./Papers.csv")
documents = loader.load()
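
`CSVLoader` produces one document per CSV row, with the row rendered as `column: value` lines and the source path and row number kept as metadata. The sketch below approximates that behavior with the standard library so the shape of `documents` is easier to picture; `rows_to_documents` is a hypothetical stand-in that returns plain dictionaries, not LangChain `Document` objects.

```python
import csv


def rows_to_documents(path):
    """Approximate CSVLoader: one dict per row with page_content and metadata."""
    # Assumption: long text fields may exceed the default field size limit.
    csv.field_size_limit(10**9)
    documents = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row_number, row in enumerate(csv.DictReader(handle)):
            content = "\n".join(
                f"{key.strip()}: {(value or '').strip()}"
                for key, value in row.items()
            )
            documents.append({
                "page_content": content,
                "metadata": {"source": path, "row": row_number},
            })
    return documents
```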

In [None]:
# Split the documents into chunks of roughly 500 characters with no overlap.
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
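
To make the chunking step more concrete, here is a simplified sketch of separator-based splitting. It is not LangChain's implementation: the hypothetical `split_text` function splits on a blank line (the default separator of `CharacterTextSplitter`) and greedily merges pieces while the result stays within `chunk_size`, ignoring overlap entirely.

```python
def split_text(text, chunk_size=500, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of limited size."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks


example_chunks = split_text("aaaa\n\nbbbb\n\ncccc", chunk_size=10)
```

One consequence of this design is that a piece containing no separator is never broken up, so individual chunks can exceed `chunk_size`.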



In [None]:
embeddings = HuggingFaceEmbeddings()



Pinecone Index for Storing Vectors

In [None]:
! pip install -U pinecone-client langchain

Collecting pinecone-client
  Downloading pinecone_client-4.1.1-py3-none-any.whl (216 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone-client)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-client
Successfully installed pinecone-client-4.1.1 pinecone-plugin-interface-0.0.7


In [None]:
! pip install --upgrade langchain-pinecone

Collecting langchain-pinecone
  Downloading langchain_pinecone-0.1.1-py3-none-any.whl (8.4 kB)
Collecting pinecone-client<4.0.0,>=3.2.2 (from langchain-pinecone)
  Downloading pinecone_client-3.2.2-py3-none-any.whl (215 kB)
Installing collected packages: pinecone-client, langchain-pinecone
  Attempting uninstall: pinecone-client
    Found existing installation: pinecone-client 4.1.1
    Uninstalling pinecone-client-4.1.1:
      Successfully uninstalled pinecone-client-4.1.1
Successfully installed langchain-pinecone-0.1.1 pinecone-client-3.2.2


Set Pinecone API Key

https://app.pinecone.io/organizations/-O0DMW8sX5mY9OFiqyhh/projects/650d4fd9-8428-4461-b248-bdbf8e44edb7/keys

In [None]:
from langchain_pinecone import PineconeVectorStore
os.environ['PINECONE_API_KEY'] = ''
vectordb = PineconeVectorStore.from_documents(texts, embeddings, index_name="llama2")

## **Training Llama 2**

In [None]:
llm = Replicate(
    model="a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5",
    input={"temperature": 0.1, "max_length": 1000}
)



In [None]:
qa_chain = ConversationalRetrievalChain.from_llm(
    llm,
    vectordb.as_retriever(search_kwargs={'k': 2}),
    return_source_documents=True,
)
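
With `k` set to 2, the retriever hands the two chunks most similar to the question to the LLM. The toy sketch below illustrates that idea with hand-made two-dimensional vectors and cosine similarity. Everything in it is an assumption for illustration: real embeddings have hundreds of dimensions, the similarity metric depends on how the Pinecone index was created, and the prompt template is not the one LangChain uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def retrieve(query_vector, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine(query_vector, item[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]


def build_prompt(question, chunks):
    """Place the retrieved chunks in front of the question (hypothetical template)."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy index of (vector, text) pairs.
toy_index = [
    ([1.0, 0.0], "chunk about optimization"),
    ([0.9, 0.1], "chunk about gradient descent"),
    ([0.0, 1.0], "chunk about image denoising"),
]
selected = retrieve([1.0, 0.05], toy_index, k=2)
prompt = build_prompt("How is the model optimized?", selected)
```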

## **Llama 2 has now been trained on the dataset, and we can ask it questions about the data**

In [None]:
chat_history = []
query = "Tell me about this dataset"
result = qa_chain({'question': query, 'chat_history': chat_history})
print(result['answer'])
chat_history.append((query, result['answer']))

 Hello! I'd be happy to help you with information about the dataset you provided.

Based on the information in the table and figure, this appears to be a synthetic dataset with six columns of data. The first column is labeled "Noise Rate," and the remaining five columns are labeled "01 Error." The table shows the mean and standard deviation of the 01 error over 125 trials on LS10, with grayed cells indicating the best performer at each noise rate.

From the figure, it seems that the dataset is related to the LS10 dataset, which is a benchmark dataset for image denoising tasks. The figure shows a plot of the 01 error against the noise rate, with the best performer at each noise rate indicated by the grayed cells in the table.

Without more information about the context and purpose of the dataset, it's difficult to provide more specific information or insights about the data. However, I'm happy to help answer any follow-up questions you may have based on the information provided.
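
Because each turn is appended to `chat_history`, follow-up questions can refer back to earlier answers. A minimal sketch of that pattern follows; `ask` is a hypothetical wrapper around the call used above, and `fake_chain` is a stand-in that lets the example run without Replicate or Pinecone.

```python
def ask(chain, question, chat_history):
    """Send a question with the prior turns, then record the new turn."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]


def fake_chain(inputs):
    """Stand-in for qa_chain: reports how many earlier turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} with {turns} earlier turn(s)"}


history = []
first = ask(fake_chain, "Tell me about this dataset", history)
second = ask(fake_chain, "Which paper discusses noise rates?", history)
```

In the notebook, `ask(qa_chain, query, chat_history)` would take the place of the three lines that call the chain, print the answer, and append to the history.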
