**Using Atlas to Visualize a Dataset of Text**

See [docs.nomic.ai](https://docs.nomic.ai) for documentation.

In [46]:
!pip install langchain nomic sentence-transformers transformers torch > /dev/null

In [None]:
import nomic
import time
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import AtlasDB
from langchain.document_loaders import TextLoader
from nomic import atlas
nomic.login('Mug83c2mM5lD-I-XEtNFAFrTtxIqNznl8SS0Obz9tApfe')  # API key for a limited demo account; create your own account at atlas.nomic.ai

In [None]:
embedd = HuggingFaceEmbeddings()

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=50,
                                          chunk_overlap=20,
                                          length_function=len)
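
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing with overlap. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` also prefers to break on separators such as newlines and spaces, so its chunks will differ. The helper name `sliding_chunks` and the sample sentence are made up for illustration.

```python
def sliding_chunks(text, chunk_size=50, chunk_overlap=20):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Simplified illustration only, not LangChain's splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = ("Atlas is a platform for visually and programmatically "
          "interacting with massive unstructured datasets.")
chunks = sliding_chunks(sample)
for chunk in chunks:
    print(repr(chunk))
```

With a 50-character window and a 20-character overlap, each new chunk starts 30 characters after the previous one, which is why short chunk sizes produce many small, partially repeated documents on the map.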

In [None]:
AD = """How Atlas Works
Atlas is a platform for visually and programmatically interacting with massive unstructured datasets of text documents, images and embeddings.
Data model

Atlas lets you store and manipulate data like a standard noSQL document engine. On upload, your data is stored in an abstraction called a Project. You can add, update, read and delete (CRUD) data in a project via API calls from the Atlas Python client.
What kind of data can I store in Atlas?

Atlas can natively store:
    Embedding vectors
    Text Documents
Our roadmap includes first class support for data modalities such as images, audio and video. You can still store images, audio and video in Atlas now but you must generate embeddings for it yourself.
Data stored in an Atlas Project is semantically indexed by Atlas. This indexing allows you to interact, view and search through your dataset via meaning instead of matching on words.
How does Atlas semantically index data?
Atlas semantically indexes unstructured data by:
    Converting data points into embedding vectors (if they aren't embeddings already)
    Organizing the embedding vectors for fast semantic search and human interpretability
If you have embedding vectors of your data from an embedding API such as OpenAI or Cohere, you can attach them during upload.
If you don't already have embedding vectors for your data points, Atlas will create them by running your data through neural networks that semantically encode your data points. For example, if you upload text documents Atlas will run them through neural networks that semantically encode text. It is often cheaper and faster to use Atlas' internal embedding models as opposed to an external model APIs.
How is Atlas different from a noSQL database?

Unlike existing data stores, Atlas is built with embedding vectors as first class citizens. Embedding vectors are representations of data that computers can semantically manipulate. Most operations you do in Atlas, under the hood, are performed on embeddings.
Atlas makes embeddings human interpretable

Despite their utility, embeddings cannot be easily interpreted because they reside in high dimensions.

During indexing, Atlas builds a contextual two-dimensional data map of embeddings. This map preserves high-dimensional relationships present between embeddings in a two-dimensional, human interpretable view.

Reading an Atlas Map

Atlas Maps lay out your dataset contextually. We will use the above map of news articles generated by Atlas to describe how to read Maps.

An Atlas Map has the following properties:

    Points close to each other on the map are semantically similar/related. For example, all news articles about sports are at the bottom of the map. Inside the sports region, the map breaks down by type of sport because news articles about a fixed sport (e.g. baseball) have more similarity to each other than with news articles about other types of sports (e.g. tennis).
    Relative distances between points correlate with semantic relatedness but the numerical distance between 2D point positions does not have meaning. For example, the observation that the Tennis and Golf news article clusters are adjacent signify a relationships between Tennis and Golf in the embedding space. You should not, however, make claims or draw conclusions using the Euclidean distance between points in the two clusters. Distance information is only meaningful in the ambient embedding space and can be retrieved with vector_search.
    Floating labels correspond to distinct topics in your data. For example, the Golf cluster has the label 'Ryder Cup'. Labels are automatically determined from the textual contents of your data and are crucial for navigating the Map.
    Topics have a hierarchy. As you zoom around the Map, more granular versions of topics will emerge.
    Maps update as your data updates. When new data enters your project, Atlas can reindex the map to reflect how the new data relates to existing data.
All information and operations that are visually presented on an Atlas map have a programmatic analog. For example, you can access topic information and vector search through the Python client.
Technical Details
Atlas visualizes your embeddings in two-dimensions using a non-linear dimensionality reduction algorithm. Atlas' dimensionality reduction algorithm is custom-built for scale, speed and dynamic updates. Nomic cannot share the technical details of the algorithm at this time.
Data Formats and Integrity
Atlas stores and transfers data using a subset of the Apache Arrow standard.
pyarrow is used to convert python, pandas, and numpy data types to Arrow types; you can also pass any Arrow table (created by polars, duckdb, pyarrow, etc.) directly to Atlas and the types will be automatically converted.
Before being uploaded, all data is converted with the following rules:
    Strings are converted to Arrow strings and stored as UTF-8.
    Integers are converted to 32-bit integers. (In the case that you have larger integers, they are probably either IDs, in which case you should convert them to strings; or they are a field that you want perform analysis on, in which case you should convert them to floats.)
    Floats are converted to 32-bit (single-precision) floats.
    Embeddings, regardless of precision, are uploaded as 16-bit (half-precision) floats, and stored in Arrow as FixedSizeList.
    All dates and datetimes are converted to Arrow timestamps with millisecond precision and no time zone. (If you have a use case that requires timezone information or micro/nanosecond precision, please let us know.)
    Categorical types (called 'dictionary' in Arrow) are supported, but values stored as categorical must be strings.
Other data types (including booleans, binary, lists, and structs) are not supported. Values stored as a dictionary must be strings.
All fields besides embeddings and the user-specified ID field are nullable.
Permissions and Privacy
To create a Project in Atlas, you must first sign up for an account and obtain an API key.
Projects you create in Atlas have configurable permissions and privacy levels.
When you create a project, it's ownership is assigned to your Atlas team. You can add people to this team to collaborate on projects together. For example, if you want to invite somone to help you tag points on an Atlas Map, you would add them to your team and give them the appropriate editing permissions on your project.
"""

In [None]:
text_split = splitter.split_text(AD)

In [47]:
text_split[0]

'How Atlas Works'

In [53]:
text_dataset = []

for indi,text in enumerate(text_split):
  text_dataset.append({"id":indi,
                       "text":text})

In [92]:
embeddings = embedd.embed_documents(text_split)
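
Before uploading, it can help to confirm that the records and the embeddings line up. The sketch below is a hypothetical helper (`check_upload` is not part of nomic or LangChain) that checks three things: one embedding per record, unique values in the ID field, and a single embedding dimension. The toy values stand in for `text_dataset` and `embeddings`.

```python
def check_upload(records, vectors, id_field="id"):
    """Return a list of problems; an empty list means the inputs line up.

    Hypothetical pre-upload check, not a nomic API.
    """
    problems = []
    if len(records) != len(vectors):
        problems.append(f"{len(records)} records but {len(vectors)} embeddings")
    ids = [record[id_field] for record in records]
    if len(set(ids)) != len(ids):
        problems.append(f"duplicate values in '{id_field}'")
    dims = {len(vector) for vector in vectors}
    if len(dims) > 1:
        problems.append(f"embeddings have mixed dimensions: {sorted(dims)}")
    return problems


# Toy stand-ins for text_dataset and embeddings (made-up values).
toy_records = [{"id": 0, "text": "How Atlas Works"},
               {"id": 1, "text": "Data model"}]
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_upload(toy_records, toy_vectors))
```

In the notebook the same call would be `check_upload(text_dataset, embeddings)`.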

In [93]:
import pandas

#load a demo dataset of 25k news articles
news_articles = pandas.read_csv('https://raw.githubusercontent.com/nomic-ai/maps/main/data/ag_news_25k.csv').to_dict('records')

In [94]:
news_articles[0]

{'id': 0,
 'text': 'Nasdaq planning \\$100m-share sale The owner of the Nasdaq index, an icon of the internet boom, is planning to sell \\$100m of shares to the public and list itself on the market it operates.',
 'label': 2}

In [91]:
from nomic import AtlasProject

#By specifying modality='embedding' you are saying you will upload your own embeddings.
project = AtlasProject(name='atlas documentation',
                       unique_id_field='id',
                       modality='embedding')


2023-07-01 13:20:53.312 | INFO     | nomic.project:_create_project:749 - Creating project `atlas documentation` in organization `kamaljp`


In [95]:
from nomic import atlas, AtlasProject
import numpy as np


#add the HuggingFace embeddings and their metadata to the Atlas project
project.add_embeddings(
    embeddings=np.array(embeddings),
    data=text_dataset
)

1it [00:01,  1.84s/it]
2023-07-01 13:24:07.175 | INFO     | nomic.project:_add_data:1371 - Upload succeeded.


In [96]:
project.create_index(name=project.name,
                     build_topic_model=True,
                     topic_label_field='text')
print(project.maps[0])

2023-07-01 13:24:17.217 | INFO     | nomic.project:create_index:1081 - Created map `atlas documentation` in project `atlas documentation`: https://atlas.nomic.ai/map/2b6d324a-a9bc-4155-ac28-9e1eb5d92ce2/5c3ac7db-6ca0-47d8-97f5-9007a3863085


atlas documentation: https://atlas.nomic.ai/map/2b6d324a-a9bc-4155-ac28-9e1eb5d92ce2/5c3ac7db-6ca0-47d8-97f5-9007a3863085


In [98]:
map = project.maps[0]

In [100]:
print(project.get_data(ids=[0,10]))

[{'id': '0', 'id_': 'AA', 'text': 'How Atlas Works'}, {'id': '10', 'id_': 'Cg', 'text': 'your data is stored in an abstraction called a'}]


In [103]:
map.topics.df

Unnamed: 0,id,topic_depth_1,topic_depth_2,topic_depth_3
0,0,Embeddings,"Atlas - Create, team, project,",Atlas
1,1,Embeddings,"Atlas - Create, team, project,",Atlas
2,2,Embeddings,Embeddings (2),OpenAI - Programmatically Interacting
3,3,Embeddings,Embeddings (2),OpenAI - Programmatically Interacting
4,4,Embeddings,Embeddings (2),OpenAI - Programmatically Interacting
...,...,...,...,...
205,205,Embeddings,"Atlas - Create, team, project,",Atlas
206,206,Embeddings,"Atlas - Create, team, project,",Atlas
207,207,Sports,Sports (2),Sports (3)
208,208,Computer Science,🤷‍♂️5🤷‍♀️,Account and permissions management


In [104]:
map.topics.hierarchy

{'Computer Science': ['🤷\u200d♂️5🤷\u200d♀️'],
 'Sports': ['Sports (2)', 'Distance'],
 'Embeddings': ['Search and Access of Topics and Documents',
  'Neural networks for semantically running encode',
  'Embeddings (2)',
  'Atlas - Create, team, project,'],
 'Convert types to Arrow': ['🤷\u200d♂️9🤷\u200d♀️'],
 'Search and Access of Topics and Documents': ['Arrow',
  'Map of topics and their labels',
  'Search and Access of Topics and Vectors'],
 'Embeddings (2)': ['Audio, Video, Image, and Data',
  'Embeddings (3)',
  'OpenAI - Programmatically Interacting'],
 '🤷\u200d♂️9🤷\u200d♀️': ['Convert Pandas DataFrame to Pyarrow',
  'Bit integers',
  'Arrow Types'],
 'Neural networks for semantically running encode': ['🤷\u200d♂️17🤷\u200d♀️'],
 'Distance': ['🤷\u200d♂️18🤷\u200d♀️'],
 'Atlas - Create, team, project,': ['Atlas'],
 '🤷\u200d♂️5🤷\u200d♀️': ['Account and permissions management', 'Stored data'],
 'Sports (2
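
The hierarchy is a plain dictionary from a topic label to the labels one level below it, so it can be printed as an indented tree. The sketch below assumes labels are unique and that the mapping has no cycles. `topic_tree_lines` is a hypothetical helper, and `sample_hierarchy` is a hand-copied subset of the output above, not the full result.

```python
def topic_tree_lines(hierarchy, indent="  "):
    """Render a {parent: [children]} mapping as indented lines.

    Assumes unique labels and no cycles (illustrative helper only).
    """
    all_children = {child for kids in hierarchy.values() for child in kids}
    # Top-level topics are keys that never appear as someone's child.
    roots = [topic for topic in hierarchy if topic not in all_children]
    lines = []

    def walk(topic, depth):
        lines.append(indent * depth + topic)
        for child in hierarchy.get(topic, []):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


# Subset of the hierarchy printed above, copied by hand.
sample_hierarchy = {
    "Sports": ["Sports (2)", "Distance"],
    "Embeddings": ["Embeddings (2)", "Atlas - Create, team, project,"],
    "Embeddings (2)": ["Embeddings (3)", "OpenAI - Programmatically Interacting"],
    "Atlas - Create, team, project,": ["Atlas"],
}
tree = topic_tree_lines(sample_hierarchy)
print("\n".join(tree))
```

In the notebook, `topic_tree_lines(map.topics.hierarchy)` would apply the same traversal to the real output, under the same assumptions.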

In [105]:
map.topics.metadata

Unnamed: 0,depth,topic_id,topic_depth_1,topic_description,topic_short_description,topic_depth_2,topic_depth_3
0,1,1,Sports,sports/distance/topics/news/map/sport/clusters...,Sports,,
1,1,2,Computer Science,permissions/audio/computers/account/collaborat...,Computer Science,,
2,1,3,Embeddings,Atlas/embeddings/embedding/vectors/search/spac...,Embeddings,,
3,1,4,Convert types to Arrow,converted/bit/precision/32/integers/floats/upl...,Convert types to Arrow,,
4,2,5,Computer Science,🤷‍♂️5🤷‍♀️,🤷‍♂️5🤷‍♀️,🤷‍♂️5🤷‍♀️,
5,2,6,Sports,sports/articles/sport/Tennis/news/region/type/...,Sports (2),Sports (2),
6,2,7,Embeddings,search/topics/Arrow/matching/zoom/granular/acc...,Search and Access of Topics and Documents,Search and Access of Topics and Documents,
7,2,8,Sports,distance/2D/Euclidean/reside/close/numerical/d...,Distance,Distance,
8,2,9,Convert types to Arrow,🤷‍♂️9🤷‍♀️,🤷‍♂️9🤷‍♀️,🤷‍♂️9🤷‍♀️,
9,2,10,Embeddings,networks/neural/semantically/running/encode,Neural networks for semantically running encode,Neural networks for semantically running encode,


In [110]:
query = "how atlas works"

query_embed = np.array([embedd.embed_query(query)])

In [111]:
map.embeddings.vector_search(queries=query_embed,
                             k = 2)

([['0', '147']], [[4.755779414722383e-08, 0.49030232429504395]])
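
The result appears to pair, for each query, a list of neighbor ids with a list of distances: chunk `'0'` ("How Atlas Works") is an almost exact match for the query, so its distance is close to zero. As a rough mental model, here is a brute-force nearest-neighbor search in plain Python. It assumes cosine distance, which may not be the metric Atlas uses, and the toy vectors are made up. `brute_force_search` is a hypothetical helper, not a nomic API.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(query, vectors_by_id, k=2):
    """Return (ids, distances) of the k vectors closest to the query."""
    scored = sorted(
        (cosine_distance(query, vector), key)
        for key, vector in vectors_by_id.items()
    )
    top = scored[:k]
    return [key for _, key in top], [dist for dist, _ in top]


# Made-up 3-dimensional vectors keyed by string ids.
toy_vectors = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.0, 1.0, 0.0],
    "2": [0.7, 0.7, 0.0],
}
ids, distances = brute_force_search([1.0, 0.0, 0.0], toy_vectors, k=2)
print(ids, distances)
```

The query equals vector `"0"`, so that id comes back first with distance zero, mirroring the near-zero first distance in the Atlas output above.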