## Duplicate Detection

Atlas can automatically detect likely duplicates in your dataset based on textual
similarity. To enable this, pass `duplicate_detection=True` when you create the map.
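As a rough intuition for what "textual similarity" means, the sketch below scores pairs of sentences with Python's standard-library `difflib`. This is only an illustration: the method Atlas actually uses is not described here, and the `similarity` helper and sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical sample sentences, loosely modeled on the generated stories.
a = "John ate apples in the kitchen."
b = "John ate bananas in the kitchen."
c = "A bear burst in and ate them."

# Sentences that differ by one word score much higher than unrelated ones.
print(similarity(a, b))
print(similarity(a, c))
```

Texts whose score exceeds some threshold could then be grouped into a cluster of near-duplicates, which is the kind of grouping the rest of this section inspects.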

In [10]:
import random
def story():
  # Procedurally generate a short story.
  characters = ['John', 'Mary', 'Alice', 'John', 'Micah', 'Harrison']
  places = ['the kitchen', 'the living room', 'the garden', 'the bathroom', 'the kitchen', 'the kitchen']
  verbs = ['ate', 'prayed for', 'loved', 'ate', 'ate', 'ate']
  objects = ['apples', 'bananas', 'spaghetti', 'the cake', 'apples', 'bananas', '']
  relationships = ['spouse', 'spouse', 'child', 'child']
  conclusions = [
    'Before they finished, a bear burst in and ate them.',
    'Everyone lived happily ever after.',
    'They all lived happily ever after.',
  ]
  char1 = random.choice(characters)
  char2 = random.choice(characters)
  while char1 == char2:
    char2 = random.choice(characters)
  place = random.choice(places)
  verb1 = random.choice(verbs)
  verb2 = random.choice(verbs)
  relation = random.choice(relationships)

  story = f"""
  Once upon a time, {char1} lived with their {relation} {char2}.
  {char1} {verb1} {random.choice(objects)} in {place}.
  {char2} sat in {place} and {verb2} {random.choice(objects)}.
  {random.choice(conclusions)}
  """
  return story

stories = [
  {"id": str(i), "story": story()} for i in range(200)
]

In [23]:
from nomic import atlas
proj = atlas.map_text(stories, indexed_field='story', id_field='id', duplicate_detection=True)

2023-06-20 01:18:58.566 | INFO     | nomic.project:_create_project:1116 - Creating project `productive-gong` in organization `bmschmidt`
2023-06-20 01:19:01.164 | INFO     | nomic.atlas:map_text:214 - Uploading text to Atlas.
1it [00:00,  1.09it/s]
2023-06-20 01:19:02.090 | INFO     | nomic.project:_add_data:1737 - Upload succeeded.
2023-06-20 01:19:02.091 | INFO     | nomic.atlas:map_text:230 - Text upload succeeded.
2023-06-20 01:19:03.320 | INFO     | nomic.project:create_index:1443 - Created map `productive-gong` in project `productive-gong`: https://atlas.nomic.ai/map/4309d7c3-e4a4-4e70-868a-d6fbd8d344db/05733867-c62d-46b1-aeb0-bcd1766c95b4
2023-06-20 01:19:03.321 | INFO     | nomic.atlas:map_text:246 - productive-gong: https://atlas.nomic.ai/map/4309d7c3-e4a4-4e70-868a-d6fbd8d344db/05733867-c62d-46b1-aeb0-bcd1766c95b4


In [24]:
with proj.wait_for_project_lock():
    proj.id

2023-06-20 01:19:03.681 | INFO     | nomic.project:wait_for_project_lock:1237 - productive-gong: Waiting for Project Lock Release.


In [27]:
proj.maps[0].duplicates

===Atlas Duplicates for (productive-gong: https://atlas.nomic.ai/map/4309d7c3-e4a4-4e70-868a-d6fbd8d344db/05733867-c62d-46b1-aeb0-bcd1766c95b4)===
123 deletion candidates in 77 clusters
      id     _duplicate_class  _cluster_id
0      0  retention candidate            3
1      1  retention candidate           27
2     10  retention candidate           16
3    100   deletion candidate            0
4    101   deletion candidate            1
..   ...                  ...          ...
195   95   deletion candidate            2
196   96   deletion candidate            1
197   97   deletion candidate            0
198   98   deletion candidate            7
199   99            singleton           61

[200 rows x 3 columns]

You can access the underlying data as a pandas DataFrame using the `df` attribute.

In [28]:
proj.maps[0].duplicates.df


      id     _duplicate_class  _cluster_id
0      0  retention candidate            3
1      1  retention candidate           27
2     10  retention candidate           16
3    100   deletion candidate            0
4    101   deletion candidate            1
..   ...                  ...          ...
195   95   deletion candidate            2
196   96   deletion candidate            1
197   97   deletion candidate            0
198   98   deletion candidate            7
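Once the data is in a DataFrame, ordinary pandas operations apply. As a minimal sketch, the example below counts how many rows fall into each `_duplicate_class`. It uses a small hand-built DataFrame as a stand-in for `proj.maps[0].duplicates.df`, with a few rows copied from the output above, so the counts here are not those of the real map.

```python
import pandas as pd

# Stand-in for proj.maps[0].duplicates.df (assumption: same column names,
# string IDs), built from a handful of rows shown in the output above.
df = pd.DataFrame({
    "id": ["0", "1", "10", "100", "101", "99"],
    "_duplicate_class": [
        "retention candidate", "retention candidate", "retention candidate",
        "deletion candidate", "deletion candidate", "singleton",
    ],
    "_cluster_id": [3, 27, 16, 0, 1, 61],
})

# Count rows per duplicate class.
class_counts = df["_duplicate_class"].value_counts()
print(class_counts)
```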


A list of potential deletion candidates is available from the `deletion_candidates()` method.

In [29]:
proj.maps[0].duplicates.deletion_candidates()

['100',
 '101',
 '103',
 '106',
 '107',
 '108',
 '111',
 '113',
 '114',
 '116',
 '120',
 '122',
 '123',
 '126',
 '127',
 '128',
 '129',
 '130',
 '131',
 '135',
 '136',
 '139',
 '141',
 '142',
 '143',
 '144',
 '145',
 '146',
 '147',
 '149',
 '150',
 '151',
 '153',
 '154',
 '156',
 '158',
 '159',
 '16',
 '161',
 '163',
 '165',
 '169',
 '170',
 '171',
 '172',
 '173',
 '174',
 '177',
 '18',
 '181',
 '182',
 '185',
 '186',
 '189',
 '190',
 '191',
 '192',
 '194',
 '195',
 '197',
 '198',
 '199',
 '20',
 '22',
 '23',
 '24',
 '25',
 '27',
 '28',
 '29',
 '3',
 '30',
 '32',
 '33',
 '35',
 '39',
 '40',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '49',
 '50',
 '51',
 '52',
 '53',
 '54',
 '55',
 '56',
 '57',
 '58',
 '59',
 '6',
 '60',
 '62',
 '64',
 '65',
 '67',
 '69',
 '7',
 '70',
 '71',
 '72',
 '73',
 '74',
 '75',
 '77',
 '78',
 '8',
 '80',
 '82',
 '83',
 '87',
 '88',
 '9',
 '94',
 '95',
 '96',
 '97',
 '98']
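One plausible use of this list is to filter the original records before re-uploading or further processing. The sketch below assumes the IDs are strings matching the `id` field of each record, as they are in this notebook. The `stories` and `deletion_ids` values are small hypothetical stand-ins, not the real data.

```python
# Hypothetical stand-ins: in the notebook these would be `stories` and the
# list returned by proj.maps[0].duplicates.deletion_candidates().
stories = [{"id": str(i), "story": f"story {i}"} for i in range(6)]
deletion_ids = ["3", "5"]

# Use a set for fast membership tests, then keep everything not flagged.
to_delete = set(deletion_ids)
deduplicated = [s for s in stories if s["id"] not in to_delete]

print([s["id"] for s in deduplicated])  # ['0', '1', '2', '4']
```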

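For more complicated operations, use pandas or pyarrow methods to iterate over the data.
Here, for instance, are the stories in a sample of clusters: those whose cluster ID
leaves a remainder of 8 when divided by 9 (8, 17, and so on).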

In [45]:
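# Group rows by cluster and print the stories in every ninth cluster.
for id, group in proj.maps[0].duplicates.df.groupby('_cluster_id'):
    # Keep only cluster IDs 8, 17, 26, ... to sample the output.
    if int(id) % 9 != 8:
        continue
    print("CLUSTER ID:", id)
    for _, row in group.iterrows():
        # Look up the original text by ID and flatten it onto one line.
        # Named `text` so it does not overwrite the story() function above.
        text = stories[int(row['id'])]['story'].replace("\n", " ")
        print(text)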


CLUSTER ID: 8
   Once upon a time, John lived with their spouse Harrison.   John prayed for apples in the kitchen.   Harrison sat in the kitchen and ate .   Before they finished, a bear burst in and ate them.   
   Once upon a time, John lived with their spouse Harrison.   John prayed for bananas in the kitchen.   Harrison sat in the kitchen and ate apples.   Before they finished, a bear burst in and ate them.   
   Once upon a time, John lived with their spouse Harrison.   John ate the cake in the bathroom.   Harrison sat in the bathroom and loved bananas.   Before they finished, a bear burst in and ate them.   
   Once upon a time, John lived with their spouse Harrison.   John prayed for bananas in the kitchen.   Harrison sat in the kitchen and ate the cake.   Before they finished, a bear burst in and ate them.   
CLUSTER ID: 17
   Once upon a time, Harrison lived with their spouse Mary.   Harrison prayed for apples in the garden.   Mary sat in the garden and ate apples.   Before the
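To rank clusters by size instead of selecting them by ID, one option is to count the rows per `_cluster_id` and sort the counts. The sketch below uses a small hypothetical DataFrame in place of `proj.maps[0].duplicates.df`. Only the `_cluster_id` column name is taken from the output above.

```python
import pandas as pd

# Hypothetical stand-in for proj.maps[0].duplicates.df.
df = pd.DataFrame({
    "id": [str(i) for i in range(7)],
    "_cluster_id": [0, 0, 0, 1, 1, 2, 3],
})

# Number of rows in each cluster, largest first.
cluster_sizes = df.groupby("_cluster_id").size().sort_values(ascending=False)

# The two largest clusters and their sizes.
largest = cluster_sizes.head(2)
print(largest)
```

The index of `largest` holds the cluster IDs, which can then be used to select the matching rows, for example with `df[df["_cluster_id"].isin(largest.index)]`.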
