## Remove Duplicates from Your Dataset

In [1]:
import json
from datasets import Dataset
import pandas as pd

### 1. Convert the dataset into a Hugging Face `Dataset`

- The column to deduplicate must be of text (string) type.

In [31]:
with open("/home/ubuntu/aisyah/dedupe_test2.jsonl") as fopen:
    data = [json.loads(line) for line in fopen]
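Since the column to deduplicate has to be text, it can help to check the loaded records before converting them. The following is a minimal sketch, not part of the original workflow: `non_text_rows` is a hypothetical helper, and the sample records are made up for illustration.

```python
def non_text_rows(records, column="text"):
    """Return the indices of records whose `column` is missing or not a str."""
    return [
        i for i, rec in enumerate(records)
        if not isinstance(rec.get(column), str)
    ]


# Made-up sample records (assumption: same shape as the JSONL rows above)
sample = [
    {"id": 0, "text": "SELEPAS seminggu politik negara kecoh"},
    {"id": 1, "text": None},   # null text
    {"id": 2},                 # missing column
]

bad = non_text_rows(sample)
print(bad)
```

On the real data, `non_text_rows(data)` should return an empty list before you continue; any reported rows can be dropped or cast to `str` first.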

In [6]:
data[0]

{'id': 0,
 'text': 'SELEPAS seminggu politik negara kecoh pada Februari 2020, Muhyiddin Yassin mengangkat sumpah sebagai perdana menteri pada 1 Mac susulan peletakan jawatan Dr Mahathir Mohamed dan keruntuhan kerajaan Pakatan Harapan (PH). Sepanjang memegang jawatan itu selama 16 bulan, kerajaannya yang terbentuk bersama bekas musuhnya yang kemudian menjadi sekutu, berdepan jatuh bangun, turun dan naik, terutama dalam mengendalikan pandemik Covid-19. Muhyiddin meletak jawatan perdana menteri hari ini selepas hilang majoriti di Dewan Rakyat apabila 15 ahli Parlimen Umno menarik balik sokongan. Presiden Bersatu itu pada awalnya enggan meletak jawatan, tetapi cubaan terakhirnya menghulurkan tawaran kerjasama kepada pembangkang pada Jumaat gagal, apabila semua parti pembangkang menolaknya. Muhyiddin atau mesra dengan jolokan panggilan “Abah” oleh netizen, akan menjadi perdana menteri sementara bermula hari ini hingga pelantikan baharu dibuat, menurut kenyataan Istana Negara. – 16 Ogos, 202

In [7]:
len(data)

473487

In [8]:
data_dict = {
    key: [entry[key] for entry in data]
    for key in data[0]  # Assuming all dictionaries have the same keys
}

In [10]:
dataset = Dataset.from_dict(data_dict)

In [11]:
dataset

Dataset({
    features: ['id', 'text'],
    num_rows: 473487
})

In [12]:
dataset.save_to_disk("text_dedup/test-1")


### 2. Run the command below

Make sure the working directory is correct for the script before running the command.

- `path`: path to the dataset you want to deduplicate
- `name`: relevant only when you use a dataset directly from Hugging Face
- `output`: output directory for the deduplicated dataset
- `column`: the text column from which duplicates are removed
- `threshold`: Jaccard similarity threshold
- `local`: marks the dataset as local; remove this flag if the dataset is loaded directly from Hugging Face
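To build intuition for the threshold parameter, here is a small sketch of exact Jaccard similarity over word n-grams. It is an illustration only: the script estimates this value with MinHash rather than computing it exactly, and the whitespace tokenization and the helper names `word_ngrams` and `jaccard` are assumptions made for this example.

```python
def word_ngrams(text, n=5):
    """Return the set of word n-grams (as tuples) in `text`."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    if len(tokens) < n:
        # Short texts yield a single shingle holding all tokens
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=5):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of the n-gram sets."""
    sa, sb = word_ngrams(a, n), word_ngrams(b, n)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


doc_a = "a b c d e f"
doc_b = "a b c d e g"

# With bigrams, 4 of the 6 distinct shingles are shared
sim = jaccard(doc_a, doc_b, n=2)
print(round(sim, 3))
```

Two documents count as duplicates when their (estimated) similarity reaches the threshold, so a higher threshold such as 0.95 keeps only near-identical pairs, while a lower one also catches looser rewrites.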

In [32]:
!python3 -m text_dedup.minhash \
  --path "text_dedup/test-1" \
  --split "train" \
  --cache_dir "./cache" \
  --output "MinHash_Output/Dedup_Dataset_Trial-0" \
  --column "text" \
  --batch_size 10000 \
  --threshold 0.95 \
  --local

# Defaults: ngram = 5, Jaccard similarity threshold = 0.7 (this run overrides the threshold with 0.95)

Iterating MinHashes...: 100%|███████████████████| 48/48 [00:05<00:00,  8.47it/s]
Clustering...: 100%|██████████████████████████████| 5/5 [00:00<00:00, 20.13it/s]
Duplicate pairs before filtering:                                               
Cluster 126 has duplicate pairs: [126, 470431]
Cluster 195 has duplicate pairs: [195, 450632]
Cluster 484 has duplicate pairs: [484, 448413]
Cluster 746 has duplicate pairs: [746, 367517]
Cluster 1014 has duplicate pairs: [1014, 344772]
Cluster 1294 has duplicate pairs: [1294, 468601]
Cluster 1296 has duplicate pairs: [1296, 315917]
Cluster 1298 has duplicate pairs: [1298, 462415]
Cluster 1337 has duplicate pairs: [1337, 433726]
Cluster 1486 has duplicate pairs: [1486, 396912]
Cluster 1787 has duplicate pairs: [1787, 457257]
Cluster 1794 has duplicate pairs: [1794, 358727]
Cluster 1822 has duplicate pairs: [1822, 409252]
Cluster 1998 has duplicate pairs: [1998, 322529]
Cluster 2105 has duplicate pairs: [2105, 417868]
Cluster 2143 has duplicate pai

Cluster 249993 has duplicate pairs: [249993, 249996]
Cluster 255965 has duplicate pairs: [255965, 255966]
Cluster 262698 has duplicate pairs: [262698, 415049]
Cluster 264489 has duplicate pairs: [264489, 264495]
Cluster 270451 has duplicate pairs: [270451, 270452]
Cluster 273377 has duplicate pairs: [273377, 273404]
Cluster 274251 has duplicate pairs: [274251, 297491]
Cluster 274343 has duplicate pairs: [274343, 390424]
Cluster 274764 has duplicate pairs: [274764, 431844]
Cluster 274901 has duplicate pairs: [274901, 399068]
Cluster 274906 has duplicate pairs: [274906, 430670]
Cluster 275099 has duplicate pairs: [275099, 383395]
Cluster 276750 has duplicate pairs: [276750, 302274]
Cluster 277086 has duplicate pairs: [277086, 277102]
Cluster 277268 has duplicate pairs: [277268, 277278]
Cluster 277971 has duplicate pairs: [277971, 288427]
Cluster 278608 has duplicate pairs: [278608, 404675]
Cluster 278883 has duplicate pairs: [278883, 469705]
Cluster 278933 has duplicate

Cluster 347524 has duplicate pairs: [347524, 456198]
Cluster 347967 has duplicate pairs: [347967, 401862]
Cluster 349201 has duplicate pairs: [349201, 434231]
Cluster 349208 has duplicate pairs: [349208, 432371]
Cluster 349304 has duplicate pairs: [349304, 409656]
Cluster 349384 has duplicate pairs: [349384, 396708]
Cluster 349728 has duplicate pairs: [349728, 405572]
Cluster 350574 has duplicate pairs: [350574, 380926]
Cluster 350749 has duplicate pairs: [350749, 393500]
Cluster 351149 has duplicate pairs: [351149, 448844]
Cluster 351849 has duplicate pairs: [351849, 363580]
Cluster 352287 has duplicate pairs: [352287, 375808]
Cluster 352867 has duplicate pairs: [352867, 452615]
Cluster 352873 has duplicate pairs: [352873, 394600]
Cluster 352962 has duplicate pairs: [352962, 435424]
Cluster 353429 has duplicate pairs: [353429, 354882]
Cluster 355364 has duplicate pairs: [355364, 379488, 405125]
Cluster 355979 has duplicate pairs: [355979, 443342]
Cluster 356543 has duplicate pairs: [3

The script produces the following:

- `duplicates.csv`: the document indices of each duplicate pair
- `duplicate_cluster.csv`: the text of each duplicate pair, for inspection
- the deduplicated dataset in Hugging Face dataset format, written to the directory given by the `output` argument

## View Results

In [20]:
import pandas as pd

In [23]:
df = pd.read_csv('output/duplicates.csv')

In [25]:
df.head(50)

| | Cluster | Duplicate Pair |
|---|---|---|
| 0 | 126 | [126, 470431] |
| 1 | 195 | [195, 450632] |
| 2 | 484 | [484, 448413] |
| 3 | 746 | [746, 367517] |
| 4 | 1014 | [1014, 344772] |
| 5 | 1294 | [1294, 468601] |
| 6 | 1296 | [1296, 315917] |
| 7 | 1298 | [1298, 462415] |
| 8 | 1337 | [1337, 433726] |
| 9 | 1486 | [1486, 396912] |
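The `Duplicate Pair` column is read back as a string such as `"[126, 470431]"`, so it needs parsing before the indices can be used. The sketch below uses only the standard library on an inline sample; the exact CSV header (an unnamed index column followed by `Cluster` and `Duplicate Pair`) is an assumption based on the output shown, and `load_clusters` is a hypothetical helper.

```python
import ast
import csv
import io

# Inline sample mirroring rows from the output above (header layout assumed)
sample_csv = ''',Cluster,Duplicate Pair
0,126,"[126, 470431]"
1,195,"[195, 450632]"
2,355364,"[355364, 379488, 405125]"
'''


def load_clusters(fileobj):
    """Map each cluster id to its list of member document indices."""
    clusters = {}
    for row in csv.DictReader(fileobj):
        # literal_eval safely turns the string "[1, 2]" into a list of ints
        clusters[int(row["Cluster"])] = ast.literal_eval(row["Duplicate Pair"])
    return clusters


clusters = load_clusters(io.StringIO(sample_csv))

# Members beyond the first in each cluster, i.e. the documents that would
# be dropped if exactly one document per cluster is kept (an assumption)
extra_docs = sum(len(members) - 1 for members in clusters.values())
print(clusters[126], extra_docs)
```

With pandas, the same parsing can be done on the loaded frame with `df["Duplicate Pair"].apply(ast.literal_eval)`.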


In [13]:
from datasets import load_from_disk

# Load the dataset from the saved directory
base_ds = load_from_disk("text_dedup/test-1")

In [34]:
base_ds[126]

{'id': 126,
 'text': 'PERIKATAN Nasional (PN) Johor hari ini mengesahkan tiada perbincangan mengenai pertukaran mana-mana calon termasuk calon kerusi Parlimen Pulai pada Pilihan Raya Umum ke-15 (PRU15), kata Pengarah Jabatan Pilihan Raya PN Negeri Johor Rasman Ithnain. Beliau berkata setakat ini perbincangan antara parti komponen dalam PN mengenai pemilihan calon berjalan lancar tanpa ada selisih faham atau kecil hati. “Setakat hari ini, PN tak dengar pun ada pertukaran. Kita pun baru dengar (berita ini). Memang tiada perbincangan. Itu hanya khabar angin, (saya rasa) cuma provokasi (pihak lain). “Setakat ini, nama yang ada di tangan saya, (masih) nama Deepak (Jaikishan Rewachand) (untuk kerusi Pulai),” katanya kepada pemberita di Bilik Gerakan Utama PN Negeri Johor di Kulai hari ini. Rasman berkata demikian ketika mengulas laporan sebuah akhbar tempatan berbahasa Inggeris semalam, yang memetik Pengerusi Gerakan Johor Teo Kok Chee sebagai berkata Presiden Gerakan Dominic Lau menamakan c

In [33]:
base_ds[470431]

{'id': 470431,
 'text': 'Perikatan Nasional (PN) Johor hari ini mengesahkan tiada perbincangan mengenai pertukaran mana-mana calon termasuk calon kerusi parlimen Pulai pada Pilihan Raya Umum ke-15 (PRU15), kata Pengarah Jabatan Pilihan Raya PN Negeri Johor Rasman Ithnain. Beliau berkata setakat ini perbincangan antara parti komponen dalam PN mengenai pemilihan calon berjalan lancar tanpa ada selisih faham atau kecil hati. “Setakat hari ini, PN tak dengar pun ada pertukaran. Kita pun baru dengar (berita ini). Memang tiada perbincangan. Itu hanya khabar angin, (saya rasa) cuma provokasi (pihak lain). “Setakat ini, nama yang ada di tangan saya, (masih) nama Deepak (Jaikishan Rewachand) (untuk kerusi Pulai),” katanya kepada pemberita di Bilik Gerakan Utama PN Negeri Johor hari ini. Rasman berkata demikian ketika mengulas laporan sebuah akhbar tempatan berbahasa Inggeris semalam, yang memetik Pengerusi Gerakan Johor Teo Kok Chee sebagai berkata Presiden Gerakan Dominic Lau menamakan calon l
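The two records above differ only in small details such as capitalization. To see exactly where a flagged pair differs, a word-level diff is handy. This is a minimal sketch using the standard library's `difflib`; `word_diff` is a hypothetical helper, and the short strings are excerpts adapted from the two records.

```python
from difflib import SequenceMatcher


def word_diff(a, b):
    """Return (operation, words_in_a, words_in_b) for each differing span."""
    wa, wb = a.split(), b.split()
    matcher = SequenceMatcher(None, wa, wb, autojunk=False)
    return [
        (op, wa[i1:i2], wb[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Short excerpts adapted from records 126 and 470431
case_change = word_diff("PERIKATAN Nasional hari ini",
                        "Perikatan Nasional hari ini")
dropped_words = word_diff("Johor di Kulai hari ini",
                          "Johor hari ini")
print(case_change)
print(dropped_words)
```

On the full records, `word_diff(base_ds[126]["text"], base_ds[470431]["text"])` lists every span where the pair diverges, which helps you judge whether the chosen threshold is too strict or too loose.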

In [14]:
# Load the deduplicated dataset
dedup_ds = load_from_disk("MinHash_Output/Dedup_Dataset_Trial-0")

In [15]:
# Collect the IDs from both datasets
base_df_id = set(base_ds['id'])
dedup_df_id = set(dedup_ds['id'])

# IDs present in the base dataset but missing from the deduplicated one
missing_ids = base_df_id - dedup_df_id

In [16]:
base_df = base_ds.to_pandas()
dedup_df = dedup_ds.to_pandas()

In [35]:
dedup_df

| | id | text |
|---|---|---|
| 0 | 0 | SELEPAS seminggu politik negara kecoh pada Feb... |
| 1 | 1 | PARLIMEN perlu meluluskan undang-undang bagi m... |
| 2 | 2 | MISTERI kehilangan seorang warga emas berusia ... |
| 3 | 3 | LIMPAHAN air Sungai Muar yang berlaku sejak se... |
| 4 | 4 | POLIS Kelantan telah mengenal pasti 28 lokasi ... |
| ... | ... | ... |
| 472811 | 473482 | \nKhabar angin bertiup kencang di kalangan pem... |
| 472812 | 473483 | \nKira-kira dua juta daripada tujuh juta pener... |
| 472813 | 473484 | \nPerairan Sabah didapati bebas daripada kenai... |
| 472814 | 473485 | \nPerbicaraan ke atas bekas pengurus besar Lem... |


In [36]:
len(dedup_df)

472816
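The row counts above imply that 473487 - 472816 = 671 rows were removed. As a sanity check, that figure should match the number of IDs missing from the deduplicated dataset. The sketch below shows one way to do the comparison; `removal_summary` is a hypothetical helper, and the small ID lists are made up for illustration.

```python
def removal_summary(base_ids, dedup_ids):
    """Summarize how many IDs were kept and removed by deduplication."""
    base_ids, dedup_ids = set(base_ids), set(dedup_ids)
    return {
        "base": len(base_ids),
        "kept": len(dedup_ids),
        "removed": len(base_ids - dedup_ids),
        # IDs that appear only in the deduplicated set indicate a mismatch
        "unexpected": len(dedup_ids - base_ids),
    }


# Made-up IDs standing in for base_ds['id'] and dedup_ds['id']
summary = removal_summary(range(10), [0, 1, 2, 4, 5, 7, 8, 9])
print(summary)

# Row-count difference reported in the cells above
print(473487 - 472816)
```

On the real data, `removal_summary(base_ds['id'], dedup_ds['id'])["removed"]` should equal `len(missing_ids)`, and `"unexpected"` should be 0. This assumes the `id` values are unique in the base dataset.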