## Semantic search

In [20]:
import chromadb
from langchain.embeddings import HuggingFaceEmbeddings

# Initialize Chroma DB client
client = chromadb.PersistentClient(path="./db")
collection = client.get_collection(name="my_collection")

# Initialize embeddings
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

# Get user input
query = input("Enter your query: ")

# Convert query to vector representation
query_vector = embeddings.embed_query(query)

# Query Chroma DB with the vector representation
results = collection.query(query_embeddings=query_vector, n_results=1, include=["documents"])

# Print results: one list of documents per query embedding
for documents in results["documents"]:
    for document in documents:
        print(document)

Spark is a complex piece of software and most guides out there are over-complicating the
installation proces. We’ll take a much simpler approach by installing the bare minimum to start,
and building from there. Our goals are as follow:
Install Java (Spark is written in Scala, which runs on the Java Virtual Machine, or JVM).
Install Spark
Install Python 3 and IPython
Launch a PySpark shell using IPython
(Optional): Install Jupyter and use it with PySpark.Appendix B: Installing PySpark locally
209
©Manning Publications Co.  To comment go to liveBook 
https://livebook.manning.com/#!/book/pyspark-in-action/discussionWhen working on Windows, you either have the option to install Spark directly on Windows, or
to use WSL (Windows Subsystem for Linux). If you want to use WSL, follow the instructions at 
 and then follow the instructions for GNU/Linux. If you want to install on plain aka.ms/wslinstall
Windows, follow the rest of this section.
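
The nested loop in the cell above is needed because the query result stores documents as a list of lists, with one inner list per query embedding. The following is a minimal sketch of flattening that structure into a single list. The helper name `flatten_documents` and the sample dictionary are hypothetical and only mimic the shape of a Chroma result.

```python
from itertools import chain


def flatten_documents(results):
    """Return all retrieved documents as one flat list of strings."""
    # results["documents"] holds one inner list per query embedding;
    # treat a missing or None entry as "no documents"
    nested = results.get("documents") or []
    return list(chain.from_iterable(nested))


# Hypothetical result shaped like a Chroma query response
sample_results = {
    "ids": [["a", "b"]],
    "documents": [["first chunk", "second chunk"]],
}

flat = flatten_documents(sample_results)
for document in flat:
    print(document)
```

With a single query embedding, as in this notebook, the flat list is the same as `results["documents"][0]`.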


## Passing it through gemini-pro

In [64]:
import os
import google.generativeai as genai
from dotenv import load_dotenv
load_dotenv()  # take environment variables from .env.

True

In [65]:
# Read the API key from the environment and pass it to the client library
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

In [71]:
model = genai.GenerativeModel('gemini-pro')

In [92]:
from langchain.embeddings import HuggingFaceEmbeddings

# Initialize Chroma DB client
client = chromadb.PersistentClient(path="./db")
collection = client.get_collection(name="my_collection")

# Initialize embeddings
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

# Example query (hard-coded here instead of read from input)
query = "Why use pyspark?"

# Convert query to vector representation
query_vector = embeddings.embed_query(query)

# Query Chroma DB with the vector representation, keeping the top 3 chunks
results = collection.query(query_embeddings=query_vector, n_results=3, include=["documents"])

In [93]:
results

{'ids': [['Jonathan Rioux - Data Analysis with Python and PySpark-Manning (2022).pdf_10',
   'Jonathan Rioux - Data Analysis with Python and PySpark-Manning (2022).pdf_11',
   'Jonathan Rioux - Data Analysis with Python and PySpark-Manning (2022).pdf_30']],
 'distances': None,
 'metadatas': None,
 'embeddings': None,
 'documents': [['providing a coherent interface between language while preserving the idiosyncrasies of the\nlanguage where appropriate. Your PySpark program will therefore by quite easy to read by a\nScala/Spark programmer, but also to a fellow Python programmer who hasn’t jumped into the\ndeep end (yet).\nPython is a dynamic, general purpose language, available on many platforms and for a variety of\ntasks. Its versatility and expressiveness makes it an especially good fit for PySpark. The\nlanguage is one of the most popular for a variety of domains, and currently is a major force in\ndata analysis and science. The syntax is easy to learn and read, and the amount of lib

In [94]:
response = model.generate_content(results["documents"][0])

In [95]:
print(response.text)

PySpark is a powerful tool for data analysis and science. It is fast, expressive, and versatile.

**PySpark is fast:**
- Spark is built on top of Hadoop, which is a distributed computing platform that can process large amounts of data in parallel.
- Spark uses an aggressive query optimizer that can generate efficient execution plans for complex queries.
- Spark also uses a lazy evaluation engine, which means that it only evaluates data when it is needed. This can save a lot of time and resources.

**PySpark is expressive:**
- PySpark provides a comprehensive set of APIs for data manipulation, graph analysis, streaming data, and machine learning.
- This makes it easy to develop complex data analysis pipelines.

**PySpark is versatile:**
- PySpark can be used to process data in a variety of formats, including CSV, JSON, Parquet, and Avro.
- It can also be used to connect to a variety of data sources, such as HDFS, Hive, and Cassandra.

PySpark is a great choice for data analysis and scie
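
Note that the `generate_content` call above receives only the retrieved chunks, so the question itself never reaches the model. The following is a minimal sketch of one way to combine the query and the chunks into a single prompt string. The function name `build_prompt` and the wording of the template are assumptions, not part of the original notebook.

```python
def build_prompt(query, documents):
    """Combine a question and retrieved chunks into a single prompt string."""
    # Number each chunk so the model can tell the passages apart
    context = "\n\n".join(
        f"[{i}] {doc.strip()}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for results["documents"][0]
prompt = build_prompt("Why use pyspark?", ["chunk one ", "chunk two"])
print(prompt)
```

In the notebook, the resulting string could then be passed as `model.generate_content(build_prompt(query, results["documents"][0]))`. Whether this improves the answer depends on the model and the retrieved chunks.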

## Additional information: Interactive chatbot

In [42]:
chat = model.start_chat(history=[])

In [None]:
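while True:
    user_input = input("You: ")
    # Leave the loop on an empty line or an explicit exit command
    if user_input.strip().lower() in {"", "exit", "quit"}:
        break
    response = chat.send_message(user_input, stream=True)
    # Print streamed chunks as they arrive, without extra line breaks
    for chunk in response:
        print(chunk.text, end="")
    print()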