# Data Exploration, Cleaning and Labeling in Atlas

This tutorial describes how to use Atlas to quickly label or tag a large corpus of text.

Atlas provides insight into a text corpus by organizing its documents on a map.
Semantically similar documents cluster together on the map, which enables the following
data labeling workflow:

1. Make a map of your data.
2. Use the lasso tool in Atlas to tag regions based on your domain expertise.
3. Access your tags with the `get_tags` method of an [AtlasProjection](atlas_api.md).

Tags can then be fed into a downstream machine learning model, used to clean your dataset by deleting points from your project, or
used to build new maps on subsets of your data.
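
As a rough sketch of the cleaning and subsetting steps, assume the tags arrive as a plain dictionary that maps each tag name to a list of document ids. That shape is an assumption made for illustration, not a statement about what `get_tags` returns, and the helper names `keep_tagged` and `drop_tagged` are hypothetical. The example uses only the standard library and toy data:

```python
from typing import Any, Dict, List

Document = Dict[str, Any]


def keep_tagged(documents: List[Document], tags: Dict[str, List[int]],
                tag: str, id_field: str = "id") -> List[Document]:
    """Return only the documents whose id is listed under `tag`."""
    wanted = set(tags.get(tag, []))
    return [doc for doc in documents if doc[id_field] in wanted]


def drop_tagged(documents: List[Document], tags: Dict[str, List[int]],
                tag: str, id_field: str = "id") -> List[Document]:
    """Return the documents whose id is NOT listed under `tag`."""
    unwanted = set(tags.get(tag, []))
    return [doc for doc in documents if doc[id_field] not in unwanted]


# Toy data standing in for a real corpus and real tags (both hypothetical).
documents = [
    {"id": 0, "text": "Stocks rally as markets open"},
    {"id": 1, "text": "Team wins championship final"},
    {"id": 2, "text": "asdf qwerty"},
]
tags = {"sports": [1], "junk": [2]}

sports_docs = keep_tagged(documents, tags, "sports")  # subset for a new map
cleaned_docs = drop_tagged(documents, tags, "junk")   # dataset without junk
```

The subset could then be mapped on its own, and the cleaned list could be used as training data for a downstream model.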

# Exploring and Labeling a News Dataset
In this example, we will map and label a news dataset from the Hugging Face Hub.
To start, load the [ag_news](https://huggingface.co/datasets/ag_news) dataset, randomly sample 10,000 points, and map them.

The dataset is composed of news articles scraped by an [academic news scraping engine](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) after 2004.
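
A note on sampling: `np.random.randint`, which the example that follows uses, draws indices with replacement, so the sample can contain the same article more than once. If unique documents matter for your labeling, one alternative is to sample without replacement. The sketch below uses NumPy's `Generator.choice`; the helper name `sample_unique_indices` is hypothetical, and swapping it in would select different points than the map shown in this tutorial:

```python
import numpy as np


def sample_unique_indices(n_total: int, n_sample: int, seed: int = 0) -> list:
    """Draw `n_sample` distinct indices from range(n_total), reproducibly."""
    if n_sample > n_total:
        raise ValueError("cannot sample more unique indices than rows available")
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no index appears twice.
    return rng.choice(n_total, size=n_sample, replace=False).tolist()


# 120_000 is a stand-in for len(dataset); use the real length in practice.
unique_idxs = sample_unique_indices(120_000, 10_000)
```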


In [3]:
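from nomic import atlas
import nomic
import numpy as np
from datasets import load_dataset

# Authenticate with Atlas; replace this key with your own API key.
nomic.login('7xDPkYXSYDc1_ErdTPIcoAR9RNd8YDlkS3nVNXcVoIMZ6')

np.random.seed(0)  # so your map has the same points sampled.

# Load the training split of ag_news.
dataset = load_dataset('ag_news')['train']

# Draw 10,000 random row indices (with replacement) and fetch those articles.
max_documents = 10000
subset_idxs = np.random.randint(len(dataset), size=max_documents).tolist()
documents = [dataset[i] for i in subset_idxs]

# Give every document a unique integer id for Atlas to use as its id field.
for i, document in enumerate(documents):
    document['id'] = i

# Upload the documents and build a map from their 'text' field.
project = atlas.map_text(data=documents,
                         id_field='id',
                         indexed_field='text',
                         name='News 10k Labeling Example',
                         description='10k News Articles for Labeling'
                         )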

Using custom data configuration default
Reusing dataset ag_news (/home/andriy/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548)


  0%|          | 0/2 [00:00<?, ?it/s]

2023-03-15 16:42:40.435 | INFO     | nomic.project:_create_project:946 - Creating project `News 10k Labeling Example` in organization `Atlas Demo`
2023-03-15 16:42:42.108 | INFO     | nomic.atlas:map_text:210 - Uploading text to Atlas.
100%|██████████| 10/10 [00:02<00:00,  3.39it/s]
2023-03-15 16:42:45.061 | INFO     | nomic.atlas:map_text:227 - Text upload succeeded.
2023-03-15 16:42:47.252 | INFO     | nomic.project:create_index:1259 - Created map `News 10k Labeling Example` in project `News 10k Labeling Example`: https://atlas.nomic.ai/map/038e2811-419b-457e-bc2a-bfab95898739/9ec69185-5f11-4840-b605-628e6a364cb0
2023-03-15 16:42:47.253 | INFO     | nomic.atlas:map_text:241 - News 10k Labeling Example: https://atlas.nomic.ai/map/038e2811-419b-457e-bc2a-bfab95898739/9ec69185-5f11-4840-b605-628e6a364cb0



In [4]:
project.maps[0]