# Fine-Tuning Preparation
The analysis shows that the confidence score is correlated with the number of labels predicted, which suggests that predicting more labels should raise the confidence score as well. However, this only applies after prediction.
The analysis also shows that a few label types have a poor ratio of high-confidence to low-confidence scores, while some low-frequency labels have a good ratio. To raise the confidence score even when a label occurs infrequently, the training data needs to be augmented. Two things can be done for this augmentation:
- Get contextual texts that correspond to the poor-ratio labels.
- Synthesize training data for rare texts, that is, low-score labels with fewer than 100 occurrences per 20,000 predictions.

Poor ratio labels:
`ORDINAL`, `CARDINAL`, `QUANTITY`, `NORP`, `WORK_OF_ART`, and `EVENT`

Rare labels:
`ORDINAL`, `NORP`
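
The two label lists above come from the earlier analysis, which is not reproduced here. As a rough sketch of how such a summary could be computed from a predictions table with `label` and `score` columns: the function name, the 0.7 cut-off (borrowed from the filtering step later in this notebook), and the reading of "rare" as a label's frequency per 20,000 predictions are all assumptions, not the method actually used.

```python
import pandas as pd


def label_confidence_summary(df, threshold=0.7, rare_per_20k=100):
    """Per-label counts of high/low confidence predictions.

    `threshold` and `rare_per_20k` are assumed values, not confirmed settings.
    """
    is_high = df["score"] >= threshold
    summary = (
        df.assign(high=is_high, low=~is_high)
        .groupby("label")[["high", "low"]]
        .sum()
        .astype(int)
    )
    summary["total"] = summary["high"] + summary["low"]
    # Ratio of high- to low-confidence predictions; NaN when a label has no low scores
    summary["high_low_ratio"] = summary["high"] / summary["low"].where(summary["low"] > 0)
    # Scale each label's count to a rate per 20,000 predictions
    summary["per_20k"] = summary["total"] / len(df) * 20_000
    summary["rare"] = summary["per_20k"] < rare_per_20k
    # Worst ratios first
    return summary.sort_values("high_low_ratio")
```

Calling `label_confidence_summary(df)` on the predictions table would list the labels with the worst ratios at the top, which makes candidates for augmentation easy to spot.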

----------------
## Search Preparation
First, we need to collect the texts that received low confidence scores for these labels.

### Import Libraries

In [66]:
import pandas as pd

### Import Data 

In [67]:
# forward slashes work on both Windows and POSIX systems
df = pd.read_csv('../data/results_first_line.csv')
df

Unnamed: 0,start,end,text,label,score
0,0,5,polis,ORG,0.904411
1,23,31,siasatan,EVENT,0.733970
2,32,39,program,ORG,0.677047
3,40,45,ehati,ORG,0.682098
4,46,52,wanita,PERSON,0.935187
...,...,...,...,...,...
24290,1484116,1484123,sarawak,LOC,0.679316
24291,1484139,1484157,mustafa kamal gani,PERSON,0.872470
24292,1484194,1484201,peniaga,PERSON,0.644195
24293,1484206,1484216,suri rumah,PERSON,0.686967


### Data Filtering

In [68]:
# keep predictions with a confidence score below 0.7
low_conf = df[df['score'] < 0.7]
low_conf

Unnamed: 0,start,end,text,label,score
2,32,39,program,ORG,0.677047
3,40,45,ehati,ORG,0.682098
6,79,85,rumput,ORG,0.588429
7,126,133,manusia,PERSON,0.561913
14,285,290,rumah,LOC,0.539215
...,...,...,...,...,...
24277,1482917,1482920,sic,ORG,0.566282
24290,1484116,1484123,sarawak,LOC,0.679316
24292,1484194,1484201,peniaga,PERSON,0.644195
24293,1484206,1484216,suri rumah,PERSON,0.686967


In [69]:
# feature selection: keep only the text and label columns
low_conf = low_conf.drop(columns=['start', 'end', 'score'])
low_conf

Unnamed: 0,text,label
2,program,ORG
3,ehati,ORG
6,rumput,ORG
7,manusia,PERSON
14,rumah,LOC
...,...,...
24277,sic,ORG
24290,sarawak,LOC
24292,peniaga,PERSON
24293,suri rumah,PERSON
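
The `drop` call above discards the `start` and `end` offsets, which would be needed to recover the context around each entity. The following is a minimal sketch of how context windows could be extracted before those columns are dropped. It assumes that `start` and `end` are character offsets into a single source string; `corpus`, `add_context`, and the window size are hypothetical and do not appear in the notebook.

```python
import pandas as pd


def add_context(df, corpus, window=50):
    """Attach a text window around each entity span.

    `corpus` is a hypothetical string holding the text the model was run on;
    `start` and `end` are assumed to be character offsets into it.
    """
    out = df.copy()
    out["context"] = [
        corpus[max(0, s - window): e + window]  # clamp at the start of the corpus
        for s, e in zip(out["start"], out["end"])
    ]
    return out
```

Applied to the low-confidence rows of `df` (before the offsets are dropped), this would give one context snippet per prediction, which can then be filtered by label in the same way as the text column.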


In [70]:
# drop duplicates
low_conf = low_conf.drop_duplicates()
low_conf

Unnamed: 0,text,label
2,program,ORG
3,ehati,ORG
6,rumput,ORG
7,manusia,PERSON
14,rumah,LOC
...,...,...
24250,rangkaian kedai kopi mewah,PRODUCT
24261,peralihan endemik,EVENT
24265,penularan wabak,EVENT
24274,sic,ORG


In [71]:
# filter poor ratio labels
low_conf_labels = low_conf[low_conf['label'].isin(['ORDINAL', 'CARDINAL', 'QUANTITY', 'NORP', 'WORK_OF_ART', 'EVENT'])]
low_conf_labels

Unnamed: 0,text,label
18,konflik,EVENT
37,pentas dunia,EVENT
56,4,QUANTITY
89,pecah rumah,EVENT
102,11,QUANTITY
...,...,...
23976,20 kilogram,QUANTITY
24023,prk,EVENT
24107,myanmar,NORP
24261,peralihan endemik,EVENT


In [72]:
# filter rare labels
rare_labels = low_conf[low_conf['label'].isin(['ORDINAL', 'NORP'])]
rare_labels

Unnamed: 0,text,label
505,27 kali,ORDINAL
2799,uighur,NORP
2803,bahasa melayu,NORP
3565,orang asli,NORP
3874,rohingya,NORP
4060,tigress,NORP
7205,warga asing,NORP
8920,indonesia,NORP
10360,melayu,NORP
10679,cina,NORP
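
The rare-label texts can serve as seeds for the synthesis step described in the introduction. The sketch below fills hand-written sentence templates with entity texts and records the character span of each entity. The function name, the `{entity}` placeholder convention, and the `(text, {"entities": [...]})` output layout are assumptions; the format the training pipeline actually requires is not stated in this notebook, and each template is assumed to contain exactly one `{entity}` slot.

```python
def synthesize_examples(entities, label, templates):
    """Fill each template's {entity} slot and record the entity's character span.

    Assumes every template contains exactly one `{entity}` placeholder.
    """
    examples = []
    for template in templates:
        # The entity starts where the text before the placeholder ends
        start = len(template.split("{entity}")[0])
        for entity in entities:
            text = template.format(entity=entity)
            span = (start, start + len(entity), label)
            examples.append((text, {"entities": [span]}))
    return examples
```

For example, `synthesize_examples(rare_labels.loc[rare_labels['label'] == 'NORP', 'text'], 'NORP', templates)` would produce one annotated sentence per template and entity, where `templates` is a list of sentences that would need to be written by hand.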


### Get Texts for Each Label

In [None]:
# ordinal
ordinal = low_conf[low_conf['label'] == 'ORDINAL']
print(ordinal)

        text    label
505  27 kali  ORDINAL


In [None]:
# cardinal
cardinal = low_conf[low_conf['label'] == 'CARDINAL']
print(cardinal)

              text     label
318          1,005  CARDINAL
413            224  CARDINAL
419              6  CARDINAL
708    nombor satu  CARDINAL
1634             1  CARDINAL
...            ...       ...
23466       40,786  CARDINAL
23502          dua  CARDINAL
23542  26,298 undi  CARDINAL
23545  19,620 undi  CARDINAL
23882          160  CARDINAL

[81 rows x 2 columns]


In [None]:
# quantity
quantity = low_conf[low_conf['label'] == 'QUANTITY']
print(quantity)

              text     label
56               4  QUANTITY
102             11  QUANTITY
115              3  QUANTITY
131          1,005  QUANTITY
342            tan  QUANTITY
...            ...       ...
23700        6,000  QUANTITY
23710          111  QUANTITY
23792     80 orang  QUANTITY
23793           25  QUANTITY
23976  20 kilogram  QUANTITY

[193 rows x 2 columns]


In [None]:
# norp
norp = low_conf[low_conf['label'] == 'NORP']
print(norp)

                 text label
2799           uighur  NORP
2803    bahasa melayu  NORP
3565       orang asli  NORP
3874         rohingya  NORP
4060          tigress  NORP
7205      warga asing  NORP
8920        indonesia  NORP
10360          melayu  NORP
10679            cina  NORP
10739           india  NORP
12765          rakyat  NORP
12793      bangladesh  NORP
13815        palestin  NORP
15485        scotland  NORP
15736          semang  NORP
15738    melayu proto  NORP
16919     afghanistan  NORP
17500           tamil  NORP
18434    melayu islam  NORP
18754        malaysia  NORP
19447          jerman  NORP
20171        thailand  NORP
20172         kemboja  NORP
20387  etnik rohingya  NORP
22213        inggeris  NORP
24107         myanmar  NORP


In [None]:
# work of art
work_of_art = low_conf[low_conf['label'] == 'WORK_OF_ART']
print(work_of_art)

                           text        label
273                     cebisan  WORK_OF_ART
394                       indah  WORK_OF_ART
396                       video  WORK_OF_ART
416                          f1  WORK_OF_ART
582                       mayat  WORK_OF_ART
...                         ...          ...
20841                 munafik 2  WORK_OF_ART
20915                  broadway  WORK_OF_ART
21761                     skrip  WORK_OF_ART
21762             filem trilogi  WORK_OF_ART
22243  bendera malaysia gergasi  WORK_OF_ART

[87 rows x 2 columns]


In [None]:
# event
event = low_conf[low_conf['label'] == 'EVENT']
print(event)

                    text  label
18               konflik  EVENT
37          pentas dunia  EVENT
89           pecah rumah  EVENT
133             rampasan  EVENT
206             anugerah  EVENT
...                  ...    ...
23867        perhimpunan  EVENT
23907           op ihsan  EVENT
24023                prk  EVENT
24261  peralihan endemik  EVENT
24265    penularan wabak  EVENT

[474 rows x 2 columns]
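
The per-label cells above all repeat the same filter. As an alternative, the texts for every label of interest can be collected in one pass. This is a minimal sketch; `texts_by_label` and `POOR_RATIO_LABELS` are names introduced here for illustration only.

```python
import pandas as pd

# The poor ratio labels listed at the top of this notebook
POOR_RATIO_LABELS = ["ORDINAL", "CARDINAL", "QUANTITY", "NORP", "WORK_OF_ART", "EVENT"]


def texts_by_label(df, labels=POOR_RATIO_LABELS):
    """Map each label to its list of unique texts, in order of first appearance."""
    subset = df[df["label"].isin(labels)].drop_duplicates(subset=["text", "label"])
    grouped = subset.groupby("label", sort=False)["text"].apply(list).to_dict()
    # Keep labels with no rows as empty lists so callers can rely on every key
    return {label: grouped.get(label, []) for label in labels}
```

With this helper, `texts_by_label(low_conf)["NORP"]` would return the same texts as the `norp` cell above, as a plain Python list.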
