# Fine-Tuning Preparation
Based on the analysis, the confidence score is correlated with the number of labels predicted, so increasing the number of predicted labels should increase the confidence score as well. However, this only applies after prediction.
The good news is that the analysis also shows a few label types with a poor ratio of high confidence scores to low confidence scores, while some low-frequency labels have a good ratio. So, to raise the confidence score even when a label is infrequent, the training data needs to be augmented. Two things can be done for this augmentation:
- get contextual texts that correspond to poor-ratio labels, which are the labels that have more low scores than high scores.
- synthesize training data for rare labels, which are low-score labels with fewer than 100 occurrences per 20,000 predictions.

Poor ratio labels:
`ORDINAL`, `CARDINAL`, `QUANTITY`, `NORP`, `WORK_OF_ART`, and `EVENT`

Rare labels:
`ORDINAL`, `NORP`
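
As a rough illustration of how the two lists above could be derived, the sketch below counts low and high scores per label. It assumes a predictions table with `label` and `score` columns and a 0.7 cut-off between low and high scores (the cut-off used in the filtering step of this notebook). Reading "rare" as fewer than 100 predictions of a label per 20,000 predictions is one interpretation of the definition above. The helper name `label_score_summary` and the toy rows are invented for the example.

```python
import pandas as pd


def label_score_summary(preds, threshold=0.7, rare_per_20k=100):
    """Count low/high scores per label and flag poor-ratio and rare labels."""
    is_low = preds["score"] < threshold
    summary = pd.DataFrame({
        "low": is_low.groupby(preds["label"]).sum(),
        "high": (~is_low).groupby(preds["label"]).sum(),
    })
    # poor ratio: more low scores than high scores
    summary["poor_ratio"] = summary["low"] > summary["high"]
    # rare (assumed reading): fewer than `rare_per_20k` predictions of the
    # label once the counts are rescaled to 20,000 predictions
    scale = 20_000 / len(preds)
    summary["rare"] = (summary["low"] + summary["high"]) * scale < rare_per_20k
    return summary


# toy predictions, invented for illustration only
toy = pd.DataFrame({
    "label": ["ORG", "ORG", "ORG", "EVENT", "EVENT", "EVENT"],
    "score": [0.90, 0.85, 0.60, 0.55, 0.65, 0.80],
})
summary = label_score_summary(toy)
print(summary)
```

On the real predictions, the rows where `poor_ratio` or `rare` is true would be the candidates for augmentation.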

----------------
-------------
## Search Preparation
First, we need to get the texts that contribute to the low confidence scores across these labels.

### Import Libraries

In [260]:
import pandas as pd

### Import Data 

In [261]:
df = pd.read_csv(r'..\prediction\results_first_line.csv')
df

Unnamed: 0,start,end,text,label,score
0,0,5,polis,ORG,0.904411
1,23,31,siasatan,EVENT,0.733970
2,32,39,program,ORG,0.677047
3,40,45,ehati,ORG,0.682098
4,46,52,wanita,PERSON,0.935187
...,...,...,...,...,...
24290,1484116,1484123,sarawak,LOC,0.679316
24291,1484139,1484157,mustafa kamal gani,PERSON,0.872470
24292,1484194,1484201,peniaga,PERSON,0.644195
24293,1484206,1484216,suri rumah,PERSON,0.686967


### Data Filtering

In [262]:
# low confidence scores
low_conf = df[df['score'] < 0.7]
low_conf

Unnamed: 0,start,end,text,label,score
2,32,39,program,ORG,0.677047
3,40,45,ehati,ORG,0.682098
6,79,85,rumput,ORG,0.588429
7,126,133,manusia,PERSON,0.561913
14,285,290,rumah,LOC,0.539215
...,...,...,...,...,...
24277,1482917,1482920,sic,ORG,0.566282
24290,1484116,1484123,sarawak,LOC,0.679316
24292,1484194,1484201,peniaga,PERSON,0.644195
24293,1484206,1484216,suri rumah,PERSON,0.686967


In [263]:
# feature selection
low_conf = low_conf.drop(columns=['start', 'end', 'score'])
low_conf

Unnamed: 0,text,label
2,program,ORG
3,ehati,ORG
6,rumput,ORG
7,manusia,PERSON
14,rumah,LOC
...,...,...
24277,sic,ORG
24290,sarawak,LOC
24292,peniaga,PERSON
24293,suri rumah,PERSON


In [264]:
# drop duplicates
low_conf = low_conf.drop_duplicates()
low_conf

Unnamed: 0,text,label
2,program,ORG
3,ehati,ORG
6,rumput,ORG
7,manusia,PERSON
14,rumah,LOC
...,...,...
24250,rangkaian kedai kopi mewah,PRODUCT
24261,peralihan endemik,EVENT
24265,penularan wabak,EVENT
24274,sic,ORG


In [265]:
# filter poor-ratio labels
low_conf_labels = low_conf[low_conf['label'].isin(['ORDINAL', 'CARDINAL', 'QUANTITY', 'NORP', 'WORK_OF_ART', 'EVENT'])]
low_conf_labels

Unnamed: 0,text,label
18,konflik,EVENT
37,pentas dunia,EVENT
56,4,QUANTITY
89,pecah rumah,EVENT
102,11,QUANTITY
...,...,...
23976,20 kilogram,QUANTITY
24023,prk,EVENT
24107,myanmar,NORP
24261,peralihan endemik,EVENT


In [266]:
# filter rare labels
rare_labels = low_conf[low_conf['label'].isin(['ORDINAL', 'NORP'])]
rare_labels

Unnamed: 0,text,label
505,27 kali,ORDINAL
2799,uighur,NORP
2803,bahasa melayu,NORP
3565,orang asli,NORP
3874,rohingya,NORP
4060,tigress,NORP
7205,warga asing,NORP
8920,indonesia,NORP
10360,melayu,NORP
10679,cina,NORP


### Get Texts for Each Label

In [267]:
# ordinal
ordinal = low_conf[low_conf['label'] == 'ORDINAL']
print(ordinal)

        text    label
505  27 kali  ORDINAL


In [268]:
# cardinal
cardinal = low_conf[low_conf['label'] == 'CARDINAL']
print(cardinal)

              text     label
318          1,005  CARDINAL
413            224  CARDINAL
419              6  CARDINAL
708    nombor satu  CARDINAL
1634             1  CARDINAL
...            ...       ...
23466       40,786  CARDINAL
23502          dua  CARDINAL
23542  26,298 undi  CARDINAL
23545  19,620 undi  CARDINAL
23882          160  CARDINAL

[81 rows x 2 columns]


In [269]:
# quantity
quantity = low_conf[low_conf['label'] == 'QUANTITY']
print(quantity)

              text     label
56               4  QUANTITY
102             11  QUANTITY
115              3  QUANTITY
131          1,005  QUANTITY
342            tan  QUANTITY
...            ...       ...
23700        6,000  QUANTITY
23710          111  QUANTITY
23792     80 orang  QUANTITY
23793           25  QUANTITY
23976  20 kilogram  QUANTITY

[193 rows x 2 columns]


In [270]:
# norp
norp = low_conf[low_conf['label'] == 'NORP']
print(norp)

                 text label
2799           uighur  NORP
2803    bahasa melayu  NORP
3565       orang asli  NORP
3874         rohingya  NORP
4060          tigress  NORP
7205      warga asing  NORP
8920        indonesia  NORP
10360          melayu  NORP
10679            cina  NORP
10739           india  NORP
12765          rakyat  NORP
12793      bangladesh  NORP
13815        palestin  NORP
15485        scotland  NORP
15736          semang  NORP
15738    melayu proto  NORP
16919     afghanistan  NORP
17500           tamil  NORP
18434    melayu islam  NORP
18754        malaysia  NORP
19447          jerman  NORP
20171        thailand  NORP
20172         kemboja  NORP
20387  etnik rohingya  NORP
22213        inggeris  NORP
24107         myanmar  NORP


In [271]:
# work of art
work_of_art = low_conf[low_conf['label'] == 'WORK_OF_ART']
print(work_of_art)

                           text        label
273                     cebisan  WORK_OF_ART
394                       indah  WORK_OF_ART
396                       video  WORK_OF_ART
416                          f1  WORK_OF_ART
582                       mayat  WORK_OF_ART
...                         ...          ...
20841                 munafik 2  WORK_OF_ART
20915                  broadway  WORK_OF_ART
21761                     skrip  WORK_OF_ART
21762             filem trilogi  WORK_OF_ART
22243  bendera malaysia gergasi  WORK_OF_ART

[87 rows x 2 columns]


In [272]:
# event
event = low_conf[low_conf['label'] == 'EVENT']
print(event)

                    text  label
18               konflik  EVENT
37          pentas dunia  EVENT
89           pecah rumah  EVENT
133             rampasan  EVENT
206             anugerah  EVENT
...                  ...    ...
23867        perhimpunan  EVENT
23907           op ihsan  EVENT
24023                prk  EVENT
24261  peralihan endemik  EVENT
24265    penularan wabak  EVENT

[474 rows x 2 columns]
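
The six cells above repeat the same filter with a different label each time. As a compact alternative, the sketch below groups the texts by label in a single pass. It assumes a frame shaped like `low_conf` (columns `text` and `label`); the helper name `texts_by_label` is illustrative, and the sample rows are copied from the outputs above.

```python
import pandas as pd

TARGET_LABELS = ["ORDINAL", "CARDINAL", "QUANTITY", "NORP", "WORK_OF_ART", "EVENT"]


def texts_by_label(frame, labels=TARGET_LABELS):
    """Return {label: [texts]} for the requested labels, keeping row order."""
    subset = frame[frame["label"].isin(labels)]
    grouped = subset.groupby("label")["text"].apply(list).to_dict()
    # labels with no matching rows still get an (empty) entry
    return {label: grouped.get(label, []) for label in labels}


sample = pd.DataFrame({
    "text": ["konflik", "27 kali", "uighur", "program", "pentas dunia"],
    "label": ["EVENT", "ORDINAL", "NORP", "ORG", "EVENT"],
})
label_texts = texts_by_label(sample)
print(label_texts["EVENT"])
```

Calling `texts_by_label(low_conf)` on the real frame would give one list of texts per label, which may be easier to loop over than six separate variables.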


------------
----------
## Data Synthesis for `ORDINAL` and `NORP` labels


---------
### Synthesize Data for `ORDINAL` label

#### Import Libraries

In [273]:
import random
import json

#### Set Fake Texts

In [274]:
ordinal_words = [
    "pertama", "kedua", "ketiga", "keempat", "kelima",
    "keenam", "ketujuh", "kelapan", "kesembilan", "kesepuluh",
    "kesebelas", "keduabelas", "ketigabelas", "keempatbelas", "kelimabelas",
    "keenambelas", "ketujuhbelas", "kelapanbelas", "kesembilanbelas", "keduapuluh",
    "terakhir", "penutup", "pembukaan",
    "ke-1", "ke-2", "ke-3", "ke-4", "ke-5", "ke-6", "ke-7", "ke-8", "ke-9", "ke-10",
    "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9.", "10."
]

patterns = [
    "sekali", "dua kali", "tiga kali", "empat kali", "lima kali",
    "enam kali", "tujuh kali", "lapan kali", "sembilan kali", "sepuluh kali",
    "sebelas kali", "dua belas kali", "tiga belas kali", "empat belas kali", "lima belas kali",
    "1 kali", "2 kali", "3 kali", "4 kali", "5 kali", "6 kali", "7 kali", "8 kali", "9 kali", "10 kali"
]

cities = [
    "Kuala Lumpur", "George Town", "Ipoh", "Shah Alam", "Petaling Jaya",
    "Johor Bahru", "Malacca City", "Kota Kinabalu", "Kuching", "Seremban",
    "Kuantan", "Alor Setar", "Miri", "Kota Bharu", "Kuala Terengganu"
]

events = [
    "Sukan SEA", "Piala Malaysia", "Festival Filem Malaysia", 
    "Pertandingan Memasak Tradisional", "Karnival Buku Kebangsaan",
    "Pameran Teknologi Digital", "Kejohanan Catur Kebangsaan", 
    "Perarakan Hari Kebangsaan", "Konvensyen Pendidikan Nasional",
    "Ekspo Pelancongan Malaysia", "Pertunjukan Fesyen Muslimah",
    "Konsert Amal Merdeka", "Pesta Makanan Jalanan", "Kejohanan E-Sukan ASEAN"
]

templates = [
    "Pertandingan tahunan {event} {ordinal1} telah berlangsung di {city}. Ini adalah kali {kali1} bandar ini menjadi tuan rumah.",
    "{event} {ordinal1} yang diadakan di {city} telah tamat. Acara ini diadakan {kali1} setiap tahun.",
    "Dalam {event} di {city}, peserta {ordinal1} mencatat prestasi cemerlang untuk kali {kali1} berturut-turut.",
    "{event} {ordinal1} berlangsung di {city} dengan penyertaan ramai. Edisi {ordinal2} akan diadakan tahun depan.",
    "Acara {event} {ordinal1} diadakan di {city} pada hari ini. Mereka telah memenangi {kali1} dalam sejarah.",
    "Sambutan {event} {ordinal1} di {city} menarik perhatian ramai. Ini adalah sambutan {ordinal2} mereka.",
    "{event} di {city} mempamerkan acara {ordinal1}. Mereka telah mengadakan acara ini {kali1} sejak 2010.",
    "{event} tahun ini di {city} menampilkan persembahan {ordinal1}. Persembahan {ordinal2} adalah yang paling popular.",
    "{event} yang dianjurkan di {city} telah melakar sejarah {ordinal1}. Kejohanan {ordinal2} akan lebih meriah.",
    "{event} di {city} mencatat kehadiran rekod {ordinal1}. Mereka menjangkakan {kali1} lebih ramai tahun depan.",
    "Edisi {ordinal1} {event} di {city} berakhir minggu lalu. Penganjur merancang {ordinal2} untuk tahun depan.",
    "Peserta dari {city} memenangi tempat {ordinal1} dalam {event}. Ini kemenangan {kali1} mereka sejak 2015.",
    "Perasmian {ordinal1} {event} di {city} berlangsung semalam. Acara akan berterusan hingga {ordinal2}."
]

#### Generate Fake Data

In [275]:
from string import Formatter

synthetic_data = []

for _ in range(1000):  # generate 1000 samples
    # component selection
    event = random.choice(events)
    city = random.choice(cities)
    template = random.choice(templates)

    # ordinal words and patterns
    ordinals = {
        "ordinal1": random.choice(ordinal_words),
        "ordinal2": random.choice(ordinal_words),
        "kali1": random.choice(patterns)
    }
    values = {"event": event, "city": city, **ordinals}

    # create text piece by piece and record each ordinal span as it is written,
    # so offsets stay correct when values repeat or contain one another
    # (e.g. "ke-1" inside "ke-10") and unused placeholders are never labelled
    text = ""
    entities = []
    for literal, field, _, _ in Formatter().parse(template):
        text += literal
        if field is None:  # trailing literal text without a placeholder
            continue
        start_idx = len(text)
        text += values[field]
        if field in ordinals:
            entities.append({
                "start": start_idx,
                "end": len(text),
                "label": "ORDINAL"
            })
    
    # add to dataset
    if entities:
        synthetic_data.append({
            "text": text,
            "entities": entities
        })
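
Because the offsets are character positions, a quick sanity check before saving can catch spans that fall outside the text, overlap, or repeat. The sketch below is one possible check, assuming records shaped like those in `synthetic_data` (a `text` plus a list of `start`/`end`/`label` dictionaries). `find_span_problems` is a hypothetical helper name and the example record is hand-written.

```python
def find_span_problems(records):
    """Return (record_index, message) pairs for suspicious entity spans."""
    problems = []
    for i, record in enumerate(records):
        text = record["text"]
        spans = sorted((e["start"], e["end"]) for e in record["entities"])
        previous_end = 0
        for start, end in spans:
            if not 0 <= start < end <= len(text):
                problems.append((i, f"span {start}-{end} is out of bounds"))
            elif start < previous_end:
                problems.append((i, f"span {start}-{end} overlaps or repeats an earlier span"))
            previous_end = max(previous_end, end)
    return problems


example = [{
    "text": "Edisi kedua Sukan SEA di Ipoh berakhir minggu lalu.",
    "entities": [
        {"start": 6, "end": 11, "label": "ORDINAL"},
        {"start": 6, "end": 11, "label": "ORDINAL"},  # duplicate on purpose
    ],
}]
print(find_span_problems(example))
```

An empty result from `find_span_problems(synthetic_data)` would suggest the spans are at least structurally consistent; it does not confirm that each span covers the intended word.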

#### Save as JSON file

In [276]:
with open('malay_ordinal.json', 'w', encoding='utf-8') as f:
    json.dump(synthetic_data, f, ensure_ascii=False, indent=2)

print("Saved as JSON file.")

Saved as JSON file.


------------
### Synthesize Data for `NORP` label

#### Import Libraries

In [277]:
import random
import json

#### Set Fake Texts

In [278]:
cities = [
    "Kuala Lumpur", "George Town", "Ipoh", "Shah Alam", "Petaling Jaya",
    "Johor Bahru", "Malacca City", "Kota Kinabalu", "Kuching", "Seremban",
    "Kuantan", "Alor Setar", "Miri", "Kota Bharu", "Kuala Terengganu"
]

nationalities = [
    "Malaysia", "Indonesia", "Singapura", "Thailand", "Filipina",
    "Vietnam", "India", "China", "Bangladesh", "Pakistan",
    "Nepal", "Myanmar", "Korea Selatan", "Jepun", "Arab Saudi"
]

ethnics = [
    "Melayu", "Cina", "India", "Kadazan", "Dusun",
    "Iban", "Bidayuh", "Orang Asli", "Bajau", "Murut",
    "Bugis", "Jawa", "Banjar", "Minangkabau", "Siam"
]

religions = [
    "Islam", "Kristian", "Buddha", "Hindu", "Sikh",
    "Taoisme", "Konfusianisme", "Animisme", "Baha'i", "Ateis"
]

politics = [
    "UMNO", "DAP", "PAS", "PKR", "BERSATU",
    "PEJUANG", "MUDA", "WARISAN", "GERAKAN", "MCA",
    "MIC", "GPS", "PSB", "PBM", "PUTRA"
]

# combine into norp entities
norp_entities = nationalities + ethnics + religions + politics

templates = [
    "Komuniti {norp} di {city} meraikan perayaan tradisi mereka.",
    "Pertubuhan {norp} mengadakan majlis amal di {city}.",
    "Perdana Menteri bertemu dengan wakil komuniti {norp} di {city}.",
    "Festival kebudayaan {norp} berlangsung di {city} minggu ini.",
    "Penganut {norp} berkumpul di {city} untuk upacara keagamaan.",
    "Delegasi dari {norp} tiba di {city} untuk rundingan diplomatik.",
    "Majlis perkahwinan campur antara {norp} dan {norp2} diadakan di {city}.",
    "Pertubuhan {norp} menyumbang kepada pembangunan masyarakat di {city}.",
    "Kumpulan {norp} mengadakan protes aman di pusat bandar {city}.",
    "Persatuan {norp} melancarkan inisiatif pendidikan di {city}.",
    "Makanan tradisional {norp} semakin popular di {city}.",
    "Kumpulan seni {norp} membuat persembahan di {city}.",
    "Majlis peringatan {norp} diadakan di {city} semalam.",
    "Komuniti {norp} di {city} menghadapi cabaran ekonomi.",
    "Pertandingan sukan antara komuniti {norp} dan {norp2} berlangsung di {city}.",
    "Penganut {norp} merayakan perayaan besar di {city}.",
    "Wakil {norp} menyuarakan kebimbangan di parlimen {city}.",
    "Kumpulan belia {norp} mengadakan kempen kesedaran di {city}.",
    "Masakan {norp} menjadi tumpuan di festival makanan {city}.",
    "Delegasi {norp} melawat {city} untuk mempromosikan pelancongan."
]

#### Generate Fake Data

In [279]:
synthetic_data = []

for _ in range(1000):  # generate 1000 samples
    # component selection
    city = random.choice(cities)
    template = random.choice(templates)
    
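    # select two distinct norp entities (norp2 is only used by templates
    # with a {norp2} placeholder); set() removes the repeated "India" entry
    norp1, norp2 = random.sample(sorted(set(norp_entities)), 2)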
    
    # create text
    text = template.format(
        norp=norp1, 
        norp2=norp2, 
        city=city
    )
    
    # record NORP entities
    entities = []
    
    # get first NORP
    start_idx = text.find(norp1)
    if start_idx != -1:
        entities.append({
            "start": start_idx,
            "end": start_idx + len(norp1),
            "label": "NORP"
        })
    
    # get second NORP (if it exists and is different)
    if norp2 != norp1:
        start_idx = text.find(norp2)
        if start_idx != -1:
            entities.append({
                "start": start_idx,
                "end": start_idx + len(norp2),
                "label": "NORP"
            })
    
    # add to dataset
    if entities:
        synthetic_data.append({
            "text": text,
            "entities": entities
        })

#### Save as JSON file

In [280]:
# save as json file
with open('malay_norp.json', 'w', encoding='utf-8') as f:
    json.dump(synthetic_data, f, ensure_ascii=False, indent=2)

print("Saved as JSON file")

Saved as JSON file


-----------
---------
## Get News Texts for `ORDINAL`, `CARDINAL`, `QUANTITY`, `NORP`, `WORK_OF_ART`, and `EVENT` labels

-----
-------
## Merge All JSON Files into a Single File

In [281]:
# ordinal
with open('malay_ordinal.json', 'r', encoding='utf-8') as f:
    ordinal_data = json.load(f)
    
# norp
with open('malay_norp.json', 'r', encoding='utf-8') as f:
    norp_data = json.load(f)

# merge & shuffle 
combined_data = ordinal_data + norp_data
random.shuffle(combined_data)

# save as json file
with open('training_data.json', 'w', encoding='utf-8') as f:
    json.dump(combined_data, f, ensure_ascii=False, indent=2)

print("JSON merged successfully!")
print(f"Ordinal : {len(ordinal_data)}")
print(f"NORP    : {len(norp_data)}")
print(f"Total   : {len(combined_data)}")
print("Saved as JSON file.")

JSON merged successfully!
Ordinal : 1000
NORP    : 1000
Total   : 2000
Saved as JSON file.
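
After merging, it can be useful to confirm the label balance of the combined file. The following sketch counts entity labels with `collections.Counter`. It assumes the record layout written above and uses a small in-memory list instead of reading `training_data.json`; the helper name `count_labels` and the two records are illustrative only.

```python
from collections import Counter


def count_labels(records):
    """Count how many entity spans carry each label."""
    return Counter(
        entity["label"]
        for record in records
        for entity in record["entities"]
    )


records = [
    {"text": "Komuniti Iban di Miri meraikan perayaan tradisi mereka.",
     "entities": [{"start": 9, "end": 13, "label": "NORP"}]},
    {"text": "Edisi kedua Sukan SEA di Ipoh berakhir minggu lalu.",
     "entities": [{"start": 6, "end": 11, "label": "ORDINAL"}]},
]
label_counts = count_labels(records)
print(label_counts)
```

Running `count_labels(combined_data)` on the merged list would show one key per distinct label string, so inconsistent spellings (for example differing case) appear as separate keys.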
