# Poor Confidence Analysis
This analysis aims to improve the NER model by studying the predictions it makes. The first step is to identify weaknesses or limitations revealed by the prediction results. Looking at the outputs, the confidence score is the natural area of focus.

The goal is to find out whether the low confidence scores are justifiable. For example, is the confidence score related to the volume of the dataset, the frequency of the labels, or word rarity?

This can be determined by answering these questions:
1. Is there a correlation between label frequency and confidence scores?
2. Are low-confidence predictions associated with rare vocabulary?
3. Do certain entity types show systematic confidence patterns?
4. Is confidence score correlated with word length or complexity?
5. Does context window size affect confidence?

Based on the answers to these questions, the following actions are suggested:

* If low scores correlate with rare labels: Augment training data for underrepresented classes
* If low scores correlate with rare words: Add domain-specific vocabulary to the training data
* If low scores are consistent within certain label groups: Review the annotations
* If low scores are consistent with longer words or phrases: Review span detection
* If low scores cluster in specific positions: Review the context window

If none of these patterns is clear: Review the model architecture or hyperparameter tuning

------
## Data Preparation

### Import Libraries

In [87]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Load Dataset

In [None]:
# load csv
df = pd.read_csv('results_main.csv')
df.head()

Unnamed: 0,start,end,text,label,score
0,0,13,ahli parlimen,PERSON,0.646612
1,14,18,umno,ORG,0.919545
2,44,59,utusan malaysia,ORG,0.676499
3,101,114,korea selatan,LOC,0.768719
4,115,126,kementerian,ORG,0.92251


## Data Analysis

### Q1: Is there a correlation between label frequency and confidence scores?
Compare confidence scores of high-frequency labels and low-frequency labels.
1. get the label distribution
2. set the high-frequency labels
3. set the low-frequency labels
4. visualize the mean confidence score per label
5. compare the two groups

#### Get Label Distribution

In [None]:
# drop start, end, and score columns
q1 = df.drop(['start', 'end', 'score'], axis = 1)
q1.head()

In [None]:
# normalize text
q1['text'] = q1['text'].str.lower()
q1.head()

In [None]:
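# preview unique (text, label) pairs; q1 itself is left unchanged
# (the 'Unnamed: 0' column is unique per row, so it must be excluded from the comparison)
q1.drop_duplicates(subset=['text', 'label']).head()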

In [None]:
# label distribution
q1['label'].value_counts()

In [None]:
# visualize
label_freq = q1['label'].value_counts().index

plt.figure(figsize=(18,5))
sns.countplot(x='label', data=q1, order=label_freq)
plt.xticks(rotation=45)
plt.show()

#### Get High and Low Frequency Labels

In [None]:
# overview
df.head()

In [None]:
# drop columns
q2 = df.drop(['start','end','text'], axis=1)
q2.head()

In [None]:
# list labels
q2['label'].unique()

In [None]:
# split into higher and lower frequency groups
high_label = ['PERSON','ORG','LOC']
low_label = ['PRODUCT', 'QUANTITY', 'EVENT', 'GPE',
             'CARDINAL', 'TIME', 'LAW', 'MONEY', 'PERCENT', 'WORK_OF_ART',
             'NORP', 'FAC', 'ORDINAL']

q2_high = q2[q2['label'].isin(high_label)]
q2_low = q2[q2['label'].isin(low_label)]
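
The two label lists above are written by hand. As a hedged alternative, the split could be derived from the label counts themselves. The sketch below is only an illustration: the helper name `split_labels_by_frequency` and the `min_share` cutoff are assumptions, not part of the original notebook, and the demo frame holds just the labels of the five preview rows.

```python
import pandas as pd

def split_labels_by_frequency(frame, min_share=0.10):
    """Split labels into (high, low) frequency lists.

    min_share is a hypothetical cutoff: a label counts as high-frequency
    when it covers at least this fraction of all rows.
    """
    share = frame['label'].value_counts(normalize=True)
    high = share[share >= min_share].index.tolist()
    low = share[share < min_share].index.tolist()
    return high, low

# labels of the five rows shown in the preview; in the notebook, pass q2 instead
sample = pd.DataFrame({'label': ['PERSON', 'ORG', 'ORG', 'LOC', 'ORG']})
high_auto, low_auto = split_labels_by_frequency(sample, min_share=0.5)
print('high:', high_auto)
print('low:', sorted(low_auto))
```

With the full results, the returned lists could replace `high_label` and `low_label`, provided the cutoff is chosen after inspecting `value_counts()`.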

#### Visualize on Confidence Score

In [None]:
# get mean score
label_score_mean_high = q2_high.groupby('label')['score'].mean().reset_index()
label_score_mean_low = q2_low.groupby('label')['score'].mean().reset_index()

In [None]:
# sort mean score
label_score_mean_high = label_score_mean_high.sort_values('score', ascending=False)
label_score_mean_low = label_score_mean_low.sort_values('score', ascending=False)

In [None]:
# visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# high frequency labels
bars1 = ax1.bar(label_score_mean_high['label'], 
               label_score_mean_high['score'],
               color='green')

# labeling
for bar in bars1:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2, height + 0.01,
            f'{height:.3f}', ha='center', va='bottom', fontsize=10)

# setting
ax1.set_title('High Frequency Entities')
ax1.set_xlabel('Entity Label')
ax1.set_ylabel('Average Confidence Score')
ax1.set_ylim(0.5, 0.9)
ax1.tick_params(axis='x', rotation=45)

# low frequency labels
bars2 = ax2.bar(label_score_mean_low['label'], 
               label_score_mean_low['score'],
               color='teal')

# labeling
for bar in bars2:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2, height + 0.01,
            f'{height:.3f}', ha='center', va='bottom', fontsize=10)

# setting
ax2.set_title('Low Frequency Entities')
ax2.set_xlabel('Entity Label')
ax2.set_ylabel('Average Confidence Score')
ax2.set_ylim(0.5, 0.9)
ax2.tick_params(axis='x', rotation=45)

# display
plt.tight_layout()
plt.show()
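
The bar charts compare the groups visually, but Q1 asks about a correlation. One possible way to put a number on it is a Spearman rank correlation between each label's count and its mean score. This is a minimal sketch: the helper name `label_frequency_vs_confidence` is an assumption, and the demo frame contains only the five preview rows, so its output shows that the code runs and is not a finding.

```python
import pandas as pd

def label_frequency_vs_confidence(frame):
    """Return the per-label count and mean score, plus their Spearman correlation."""
    per_label = frame.groupby('label')['score'].agg(count='count', mean_score='mean')
    # Spearman's rho equals Pearson's r computed on the ranks
    rho = per_label['count'].rank().corr(per_label['mean_score'].rank())
    return per_label.sort_values('count', ascending=False), rho

# the five rows shown in the preview; in the notebook, pass df instead
sample = pd.DataFrame({
    'label': ['PERSON', 'ORG', 'ORG', 'LOC', 'ORG'],
    'score': [0.646612, 0.919545, 0.676499, 0.768719, 0.92251],
})
per_label, rho = label_frequency_vs_confidence(sample)
print(per_label)
print(f'Spearman rho: {rho:.3f}')
```

A rank correlation is used because label counts are typically very skewed. With only a handful of labels the coefficient is noisy and should be read alongside the charts.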

### Q2: Are low-confidence predictions associated with rare vocabulary?
Compare word frequency between low-scoring entities and high-scoring ones.
1. get word frequency
2. get confidence score
3. visualize and compare
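
The steps above could be sketched as follows. Several assumptions apply: word frequency is counted over the entity texts in the results only (the training corpus is not available here), an entity's rarity is taken as the frequency of its rarest token, the `text` column has no missing values, and the `threshold` of 0.7 separating low from high scores is a hypothetical value. The helper names are illustrative.

```python
from collections import Counter
import pandas as pd

def add_min_word_freq(frame):
    """Attach the frequency of each entity's rarest token as 'min_word_freq'."""
    out = frame.copy()
    tokens = out['text'].str.lower().str.split()
    # count every token across all entity texts
    counts = Counter(tok for toks in tokens for tok in toks)
    out['min_word_freq'] = [min((counts[t] for t in toks), default=0) for toks in tokens]
    return out

def compare_by_confidence(frame, threshold=0.7):
    """Median rarest-token frequency for low- vs high-scoring entities."""
    scored = add_min_word_freq(frame)
    group = scored['score'].lt(threshold).map({True: 'low', False: 'high'})
    return scored.groupby(group)['min_word_freq'].median()

# in the notebook: compare_by_confidence(df)
```

If rare vocabulary drives low confidence, the `low` group should show a clearly smaller median than the `high` group.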

### Q3: Do certain entity types show systematic confidence patterns?
Analyze score distributions per label type (e.g., TIME vs. LOC) using box plots or violin plots.
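
Before plotting, the numbers behind a box plot (quartiles and spread per label) can be tabulated. The sketch below assumes the `label` and `score` columns of `df`; the helper name `score_spread_by_label` is illustrative.

```python
import pandas as pd

def score_spread_by_label(frame):
    """Quartiles, interquartile range, and row count of the score per label."""
    grouped = frame.groupby('label')['score']
    summary = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    summary.columns = ['q1', 'median', 'q3']
    summary['iqr'] = summary['q3'] - summary['q1']
    summary['n'] = grouped.size()
    # labels with the lowest typical confidence come first
    return summary.sort_values('median')

# in the notebook: score_spread_by_label(df)
```

For the visual counterpart, `sns.boxplot(data=df, x='label', y='score')` draws the same quartiles per label. A label with a low median and a narrow IQR suggests a systematic pattern, while a wide IQR points to inconsistent predictions.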

### Q4: Is confidence score correlated with word length or complexity?
Compute correlation between entity string length/number of tokens and confidence scores.
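
A minimal sketch of this computation, assuming whitespace tokenization and the `text` and `score` columns of `df`. The helper name `length_confidence_correlation` is illustrative, and the demo uses only the five preview rows, so its output shows that the code runs and is not a result.

```python
import pandas as pd

def length_confidence_correlation(frame):
    """Pearson correlation of the score with character length and token count."""
    feats = pd.DataFrame({
        'char_len': frame['text'].str.len(),
        'n_tokens': frame['text'].str.split().str.len(),  # whitespace tokens
        'score': frame['score'],
    })
    return feats.corr(method='pearson')['score'].drop('score')

# the five rows shown in the preview; in the notebook, pass df instead
sample = pd.DataFrame({
    'text': ['ahli parlimen', 'umno', 'utusan malaysia', 'korea selatan', 'kementerian'],
    'score': [0.646612, 0.919545, 0.676499, 0.768719, 0.92251],
})
corr = length_confidence_correlation(sample)
print(corr)
```

A clearly negative value on the full results would support reviewing span detection for longer entities.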

### Q5: Does context window size affect confidence?
Examine scores relative to entity position in sentence (beginning, middle, end).
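
The results only contain character offsets (`start`, `end`), not sentence boundaries, so the sketch below rests on a strong assumption: the offsets are relative to one text whose total length `text_len` is known. Both `text_len` and the helper names are hypothetical. Each entity's midpoint is bucketed into thirds of the text.

```python
import pandas as pd

def add_position_bucket(frame, text_len):
    """Label each entity as 'beginning', 'middle', or 'end' of the text.

    text_len is an assumed input: the character length of the text that
    the start/end offsets refer to.
    """
    out = frame.copy()
    midpoint = (out['start'] + out['end']) / 2
    out['position'] = pd.cut(
        midpoint / text_len,
        bins=[0, 1 / 3, 2 / 3, 1],
        labels=['beginning', 'middle', 'end'],
        include_lowest=True,
    )
    return out

def mean_score_by_position(frame, text_len):
    """Mean confidence score per position bucket."""
    bucketed = add_position_bucket(frame, text_len)
    return bucketed.groupby('position', observed=True)['score'].mean()

# in the notebook (source_text is hypothetical):
# mean_score_by_position(df, text_len=len(source_text))
```

If sentence-level offsets become available, `text_len` can be replaced by each sentence's length so that the buckets reflect position within the sentence rather than within the whole text.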