### Chat with your unstructured CSVs with Llama3 and Ollama

Some code inspired by Sascha Retter (https://blog.retter.jetzt/)

##### Chat with local Llama3 Model via Ollama in KNIME Analytics Platform — Also extract Logs into structured JSON Files
https://medium.com/p/aca61e4a690a

##### Ask Questions from your CSV with an Open Source LLM, LangChain & a Vector DB
https://www.tetranyde.com/blog/langchain-vectordb

##### Document Loaders in LangChain
https://medium.com/@varsha.rainer/document-loaders-in-langchain-7c2db9851123

##### Unleashing Conversational Power: A Guide to Building Dynamic Chat Applications with LangChain, Qdrant, and Ollama (or OpenAI’s GPT-3.5 Turbo)
https://medium.com/@ingridwickstevens/langchain-chat-with-your-data-qdrant-ollama-openai-913020ec504b


In [13]:
import os
import pandas as pd

# Document Loaders in LangChain
# https://medium.com/@varsha.rainer/document-loaders-in-langchain-7c2db9851123
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.document_loaders.csv_loader import CSVLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter

# from langchain.vectorstores import Chroma
from langchain_community.vectorstores import Chroma

# from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.embeddings import OllamaEmbeddings

# from langchain.llms import Ollama
from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

from langchain.chains import RetrievalQA


model = "llama3:instruct"  # the model must already be available locally, e.g. pulled with 'ollama run llama3:instruct'

In [28]:
question = "How does a computer network work?"
v_num_docs = 10  # how many documents the retriever should return per query

In [26]:
# Proxy configuration (optional)
# proxy = "http://proxy.my-company.com:8080"  # Replace with your proxy server and port
proxy = ""  # Leave empty when no proxy is needed
os.environ['http_proxy'] = proxy
os.environ['https_proxy'] = proxy

In [4]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

In [6]:
type(newsgroups_train)

sklearn.utils._bunch.Bunch

In [7]:
texts = newsgroups_train.data
labels = newsgroups_train.target

In [8]:
df = pd.DataFrame({'text': texts, 'label': labels})

In [9]:
df['label_name'] = df['label'].map(dict(enumerate(newsgroups_train.target_names)))
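
The mapping step above relies on `dict(enumerate(...))` to turn a list of names into an id-to-name lookup. A minimal sketch of the same idea without pandas, assuming a shortened list of target names (the full dataset has 20):

```python
# Shortened, illustrative list; the real target_names list has 20 entries.
target_names = ["alt.atheism", "comp.graphics", "comp.os.ms-windows.misc"]

# enumerate() pairs each name with its position, dict() turns the pairs into a lookup.
id_to_name = dict(enumerate(target_names))

# Numeric labels as they would appear in the 'label' column.
labels = [1, 0, 2, 1]

# Equivalent of df['label'].map(id_to_name) for a plain list.
label_names = [id_to_name[label] for label in labels]
print(label_names)
```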

In [10]:
df.head()

Unnamed: 0,text,label,label_name
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


In [11]:
# Write the DataFrame to a Parquet file
df.to_parquet('../documents/csv/newsgroups_data.parquet')

# Write the DataFrame to a CSV file
df.to_csv('../documents/csv/newsgroups_data.csv', index=False)


In [14]:
# Load one document per CSV row; source_column picks the column whose value is stored as each document's "source" metadata
# https://medium.com/@varsha.rainer/document-loaders-in-langchain-7c2db9851123
loader = CSVLoader("../documents/csv/newsgroups_data.csv", source_column="text")
documents = loader.load()

In [15]:
type(documents)

list
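
Each element of that list corresponds to one CSV row. The following is a rough sketch of how a row-per-document loader could shape its output, using only the standard library. The helper `rows_to_documents` and the plain-dict document layout are hypothetical and only approximate what `CSVLoader` returns; it is not LangChain's implementation.

```python
import csv
import io


def rows_to_documents(csv_text, source_column=None):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' (illustrative only)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # Every column is rendered as a "name: value" line in the text body.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # The source column only labels the document; it does not filter columns.
        source = row[source_column] if source_column else "in-memory.csv"
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


# Two made-up rows that mimic the columns written to newsgroups_data.csv.
sample_csv = "text,label,label_name\nhello world,7,rec.autos\nsecond post,1,comp.graphics\n"
docs = rows_to_documents(sample_csv, source_column="label_name")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

In this sketch a short column such as `label_name` serves as the source. Using the long `text` column as the source, as in the cell above, would copy the whole post into the metadata.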

In [None]:
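# Create embeddings with the local Ollama model (alternative to the HuggingFace embeddings in the next cell; if both cells run, the later assignment wins)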
embedding_model = OllamaEmbeddings(base_url="http://localhost:11434", model=model)

In [None]:
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2" # the standard embedding model for
embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_name)

In [17]:
chroma_db = Chroma.from_documents(
    documents,
    embedding=embedding_model,
    persist_directory="../data/vectorstore/newsgroups",
)

In [20]:
type(chroma_db)

langchain_community.vectorstores.chroma.Chroma

In [21]:
# load from disk to demonstrate it will also work the next time
chroma_db = Chroma(persist_directory="../data/vectorstore/newsgroups", embedding_function=embedding_model)

In [22]:
type(chroma_db)

langchain_community.vectorstores.chroma.Chroma

In [23]:
# LLM
llm = Ollama(model=model,
            verbose=True,
            callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))

print(f"Loaded LLM model {llm.model}")

Loaded LLM model llama3:instruct


In [29]:
# Build a retriever on top of the vector store
retriever = chroma_db.as_retriever(search_kwargs={"k": v_num_docs})  # return the v_num_docs most similar documents per query
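
The `k` setting controls how many of the nearest documents are handed to the LLM as context. A minimal sketch of top-k retrieval over toy vectors follows. Cosine similarity is an assumption chosen for illustration (the vector store may be configured with a different distance metric), and the helper names and three-dimensional vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in doc_vectors.items()
    ]
    # Highest similarity first.
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings; real embedding vectors have hundreds of dimensions.
toy_vectors = {
    "networking_post": [0.9, 0.1, 0.0],
    "autos_post": [0.0, 1.0, 0.1],
    "space_post": [0.1, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]
nearest = top_k(query, toy_vectors, k=2)
print(nearest)
```

A larger `k` gives the model more context but also more loosely related text, which can dilute the answer.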

In [30]:
# Initialize the RetrievalQA chain with the vector store retriever
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
)

# Use the 'invoke' method to handle the query instead of '__call__'
result = qa_chain.invoke({"query": question})

Based on the provided context, it appears that there are several online and offline sources of images, data, etc. for computer networks. Some examples include:

* Online sources:
	+ The Internet Relay Chat (IRC) system
	+ The Usenet news network
	+ Electronic mail (e-mail) systems like the ones used by the authors of the posts
* Offline sources:
	+ Books and documents on computer networking and related topics
	+ Technical manuals and guides for specific hardware and software configurations

It's worth noting that these sources are likely to be scattered across different platforms, networks, and devices, and may require some effort to access and utilize them effectively.

In terms of the question "How does a computer network work?", it seems that there is already an answer provided within the context:

In [None]:
# Print the answer text (the chain returns a dict with "query" and "result" keys)
print(result["result"])