In [1]:
# Import the packages used in this project
import chromadb as db  # Client library for working with the vector database
from chromadb.utils import embedding_functions  # Provides ready-made embedding functions

# Load our embedding model into memory (all-MiniLM-L6-v2)
""" Note that other embedding models are available to us in the chromadb
ecosystem. We could even decide to use the OpenAIEmbeddingFunction. Note that
using the OpenAI embedding model requires an API key issued to you. In this
tutorial, we have decided to use a completely free embedding model.
"""
embedding_model = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

In [2]:
# Creating our ChromaDB client
chroma_client = db.PersistentClient(path="/Users/greatware/Desktop/NLP Application")

In [3]:
"""
The create_collection method creates a new collection in which we can store data. When creating this collection, there are some arguments we need
to pass to this method to customize its behaviour:
1. We need to specify an embedding function that embeds any text entering the collection.
2. We can also pass metadata that states how the database should compute similarity between objects in the vector space.
"""

chroma_client.create_collection(
    name='Collection1',
    metadata={"hnsw:space": "cosine"}, # l2 is the default
    embedding_function=embedding_model
)

Collection(name=Collection1)
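
To make the "cosine" setting above more concrete, here is a minimal sketch of the quantity a cosine space reports, assuming the usual definition of cosine distance as 1 minus cosine similarity. The helper name cosine_distance and the toy vectors are hypothetical and are not part of chromadb; the real index works on the model's embeddings and uses an approximate search.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors.

    Hypothetical helper for illustration only; it is not a chromadb API.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-d vectors standing in for real embeddings
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Under this definition a smaller distance means the two texts point in a more similar direction, regardless of vector length, which is why the closest match comes first in the query results.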

In [4]:
collection_one = chroma_client.get_collection(name='Collection1')

In [5]:
datas = [
    'Data analysis is the process of inspecting and exploring data generated by a particular population to find the information needed to make decisions and draw conclusions. With the use of data in decision making, most businesses today need data analysts. So, if you want to know about the best books to learn data analysis, this article is for you. In this article, I will introduce you to some of the best books to learn data analysis.',
    'The performance of a machine learning algorithm on a particular dataset often depends on whether the features of the dataset satisfies the assumptions of that machine learning algorithm. Not all machine learning algorithms have assumptions that differentiate them from each other. So, in this article, I will take you through the assumptions of machine learning algorithms.',
    'The K-Means Clustering is a clustering algorithm capable of clustering an unlabeled dataset quickly and efficiently in just a very few iterations. In this article, I will take you through the K-Means clustering in machine learning using Python.',
    'Many machine learning algorithms can be used to solve complex problems that require a large amount of data with a large number of features, but deep learning can outperform all algorithms. So to understand where we can use deep learning techniques, in this article, I will introduce you to the applications of deep learning.',
    'A scatter plot is one of the most useful ways to analyze the relationship between two features. You must have used a scatter plot before if you are learning data science but have you ever tried to create an animated scatter plot using Python? If you want to learn how to visualize an animated scatter plot, this article is for you. In this article, I will take you through a tutorial on visualizing animated scatter plot using Python.'
    ]

metadatas = [{'title': 'Best Books to Learn Data Analysis'},
             {'title': 'Assumptions of Machine Learning Algorithms'},
             {'title': 'K-Means Clustering in Machine Learning'},
             {'title': 'Applications of Deep Learning'},
             {'title': 'Animated Scatter Plot using Python'}]

ids = ['1',
       '2',
       '3',
       '4',
       '5']


In [6]:
collection_one.add(
    documents=datas,
    metadatas=metadatas,
    ids=ids
)
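
The add call above relies on documents, metadatas and ids being parallel lists: position i in each list describes the same record. A small sketch of one way to build and check those lists from a single list of records follows. The records structure and the to_parallel_lists helper are assumptions made for illustration, not part of chromadb, and the texts are shortened stand-ins.

```python
def to_parallel_lists(records):
    """Split records into (documents, metadatas, ids) of equal length.

    Hypothetical helper for illustration; raises ValueError on duplicate ids.
    """
    ids = [record["id"] for record in records]
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within a collection")
    documents = [record["text"] for record in records]
    metadatas = [{"title": record["title"]} for record in records]
    return documents, metadatas, ids


# Shortened example records; the real texts are the article summaries above
records = [
    {"id": "1", "title": "Best Books to Learn Data Analysis", "text": "Data analysis is ..."},
    {"id": "2", "title": "Assumptions of Machine Learning Algorithms", "text": "The performance of ..."},
]

documents, metadatas, ids = to_parallel_lists(records)
print(len(documents), len(metadatas), len(ids))
```

Keeping each record in one place and deriving the three lists from it makes it harder for a title or id to drift out of step with its text.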

In [7]:
collection_one.query(
    query_texts='For data visualization, what is the role of matplotlib',
    n_results=2,
)

{'ids': [['5', '1']],
 'distances': [[0.48327692152397195, 0.7423485375197497]],
 'metadatas': [[{'title': 'Animated Scatter Plot using Python'},
   {'title': 'Best Books to Learn Data Analysis'}]],
 'embeddings': None,
 'documents': [['A scatter plot is one of the most useful ways to analyze the relationship between two features. You must have used a scatter plot before if you are learning data science but have you ever tried to create an animated scatter plot using Python? If you want to learn how to visualize an animated scatter plot, this article is for you. In this article, I will take you through a tutorial on visualizing animated scatter plot using Python.',
   'Data analysis is the process of inspecting and exploring data generated by a particular population to find the information needed to make decisions and draw conclusions. With the use of data in decision making, most businesses today need data analysts. So, if you want to know about the best books to learn data analysis, 