# Evaluate manually labeled data

Author: Steeve Huang (黃功詳)

This notebook demonstrates how well my algorithm for automatic speech recognition generation performs.
I labeled the 996 rows of data manually.

In [1]:
import pandas as pd
import numpy as np
import os
import re
from sklearn.metrics import recall_score, precision_score, f1_score

## Load Data
__gt_df__ holds the ground truth, which I labeled manually.

__2018-7-17-10-54.csv__ contains the subtitles predicted by my algorithm.

In [2]:
# read ground truth
gt_df = pd.read_csv('manual_label_result/csvs/labelled_subtitle.csv')
gt_df.head()

Unnamed: 0,id,subtitle
0,DTfHpK5W4M4-0001929-2.0,但不也是 挺好的嗎
1,QKYCPuoxj6U-0001075-2.0,白虎
2,9cIr5pINQaY-0001757-2.0,
3,aczYWgtYNYw-0002651-2.0,
4,J7Cu0sIACKo-0000566-2.0,


In [3]:
# read prediction
prediction_df = pd.read_csv('manual_label_result/csvs/2018-7-17-10-54.csv')
prediction_df.head()

Unnamed: 0,id,prediction,confidence
0,7-eVc37Q_w8-0000004-2.0,,-1.0
1,swKJqvSc3ek-0000007-2.0,你找不到人你打電話給我幹什麼,0.99
2,Dj8o7JJSFf0-0000013-2.0,我不是在怪她吧,0.98
3,NCnvTl5jPlQ-0000013-2.0,,-1.0
4,vDpOeD7URIg-0000018-2.0,真希望 得最後決後,0.975


In [4]:
# merge ground truth with prediction
merge_df = gt_df.merge(prediction_df, on='id')
merge_df.head()

Unnamed: 0,id,subtitle,prediction,confidence
0,DTfHpK5W4M4-0001929-2.0,但不也是 挺好的嗎,但不也是 挺好的嗎,0.975
1,QKYCPuoxj6U-0001075-2.0,白虎,白虎,0.98
2,9cIr5pINQaY-0001757-2.0,,,0.0
3,aczYWgtYNYw-0002651-2.0,,,-1.0
4,J7Cu0sIACKo-0000566-2.0,,,0.0
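`merge` performs an inner join by default, so any id that appears in only one of the two tables is dropped without notice. Below is a minimal sketch of how `indicator=True` could be used to check for that before evaluating. The toy frames, ids and values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for gt_df and prediction_df; every id and value here is invented.
toy_gt = pd.DataFrame({"id": ["a", "b", "c"],
                       "subtitle": ["白虎", None, "你好"]})
toy_pred = pd.DataFrame({"id": ["a", "b", "d"],
                         "prediction": ["白虎", None, "再見"],
                         "confidence": [0.98, -1.0, 0.7]})

# An outer merge with indicator=True adds a `_merge` column that records whether
# each id came from the left frame only, the right frame only, or both.
check = toy_gt.merge(toy_pred, on="id", how="outer", indicator=True)
merge_counts = check["_merge"].value_counts().to_dict()
print(merge_counts)
```

If `left_only` or `right_only` is non-zero on the real data, the inner merge evaluates fewer rows than were labeled.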


In [5]:
# remove spaces so that spacing differences do not count as mismatches
merge_df['subtitle'] = merge_df.subtitle.str.replace(' ', '', regex=False)
merge_df['prediction'] = merge_df.prediction.str.replace(' ', '', regex=False)
merge_df.head()

Unnamed: 0,id,subtitle,prediction,confidence
0,DTfHpK5W4M4-0001929-2.0,但不也是挺好的嗎,但不也是挺好的嗎,0.975
1,QKYCPuoxj6U-0001075-2.0,白虎,白虎,0.98
2,9cIr5pINQaY-0001757-2.0,,,0.0
3,aczYWgtYNYw-0002651-2.0,,,-1.0
4,J7Cu0sIACKo-0000566-2.0,,,0.0


## No threshold

Let's see what happens if we don't set any confidence threshold.

In [6]:
same_prediction = merge_df.subtitle == merge_df.prediction  # exact matches; NaN never equals NaN
both_nan = merge_df.subtitle.isna() & merge_df.prediction.isna()  # no subtitle and none predicted
has_subtitle = merge_df.subtitle.notna()
print(f"Total images: {merge_df.shape[0]}")
print(f"Total ground truth images containing subtitles: {has_subtitle.sum()}")
print(f"Total correctly predicted images that contain subtitles: {same_prediction.sum()}")
print(f"Fraction of matched images containing subtitles: {same_prediction.sum() / has_subtitle.sum()}")
print(f"Total ground truth images not containing subtitles: {(~has_subtitle).sum()}")
print(f"Total predictions matched with ground truth that doesn't contain subtitles: {both_nan.sum()}")
print(f"Total correctly predicted images: {(both_nan | same_prediction).sum()}")

Total images: 996
Total ground truth images containing subtitles: 510
Total correctly predicted images that contain subtitles: 330
Fraction of matched images containing subtitles: 0.6470588235294118
Total ground truth images not containing subtitles: 486
Total predictions matched with ground truth that doesn't contain subtitles: 452
Total correctly predicted images: 782
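
Two separate masks are needed here because `NaN == NaN` evaluates to `False` in pandas, so rows where both the label and the prediction are empty never count as a match under `==` alone. The sketch below shows this on four invented rows. In the notebook's own numbers, the combined mask gives 782 / 996 ≈ 0.785.

```python
import numpy as np
import pandas as pd

# Four invented rows: a match, an empty/empty pair, a wrong text, a false detection.
toy = pd.DataFrame({
    "subtitle":   ["白虎", np.nan, "你好", np.nan],
    "prediction": ["白虎", np.nan, "你號", "雜訊"],
})

same_text = toy.subtitle == toy.prediction                 # NaN == NaN is False
both_empty = toy.subtitle.isna() & toy.prediction.isna()   # catches the empty/empty pair
correct = same_text | both_empty
accuracy = correct.mean()
print(same_text.tolist(), both_empty.tolist(), accuracy)
```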


In [7]:
# keep only rows with a non-empty prediction; .copy() avoids SettingWithCopyWarning
merge_df_filtered = merge_df.loc[~merge_df.prediction.isna()].copy()
# label = 1 when the prediction exactly matches the ground truth
merge_df_filtered['label'] = (merge_df_filtered.subtitle == merge_df_filtered.prediction).astype(int)
records = []
for threshold in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]:
    # predicted_label = 1 when the confidence reaches the threshold
    merge_df_filtered['predicted_label'] = (merge_df_filtered.confidence >= threshold).astype(int)
    records.append({
        'threshold': threshold,
        'precision': precision_score(merge_df_filtered.label, merge_df_filtered.predicted_label),
        'recall': recall_score(merge_df_filtered.label, merge_df_filtered.predicted_label),
        'f1': f1_score(merge_df_filtered.label, merge_df_filtered.predicted_label),
    })
record_df = pd.DataFrame(records, columns=['threshold', 'precision', 'recall', 'f1'])
record_df


Unnamed: 0,threshold,precision,recall,f1
0,0.5,0.674847,1.0,0.805861
1,0.6,0.691824,1.0,0.817844
2,0.7,0.709052,0.99697,0.828715
3,0.8,0.727477,0.978788,0.834625
4,0.9,0.786802,0.939394,0.856354
5,0.95,0.827988,0.860606,0.843982


## Discussion

From the above table, a confidence threshold of 0.9 looks optimal among the values tried, since it yields the highest F-measure (0.856).
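
As a quick check of that reading, the sketch below re-enters the table values by hand, picks the row with the highest F-measure, and recomputes F1 from the precision and recall columns. The frame name `record` is only an illustration, and the numbers are the rounded values printed above, so the recomputed F1 agrees only up to rounding.

```python
import pandas as pd

# Values copied from the threshold table above (rounded as displayed).
record = pd.DataFrame({
    "threshold": [0.5, 0.6, 0.7, 0.8, 0.9, 0.95],
    "precision": [0.674847, 0.691824, 0.709052, 0.727477, 0.786802, 0.827988],
    "recall":    [1.0, 1.0, 0.99697, 0.978788, 0.939394, 0.860606],
    "f1":        [0.805861, 0.817844, 0.828715, 0.834625, 0.856354, 0.843982],
})

# Row with the highest F-measure.
best = record.loc[record["f1"].idxmax()]
best_threshold = float(best["threshold"])

# F1 is the harmonic mean of precision and recall; compare with the table.
f1_check = 2 * record.precision * record.recall / (record.precision + record.recall)
max_abs_diff = float((f1_check - record.f1).abs().max())
print(best_threshold, max_abs_diff)
```

The grid is coarse (six values), so the true optimum may lie between 0.8 and 0.95.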

In [8]:
def get_insertion(row):
    # count distinct characters that appear in the prediction but not in the ground truth
    # (a set difference, so repeated characters and character order are ignored)
    if pd.isna(row['subtitle']):
        # no ground-truth subtitle: every predicted character counts as inserted
        return len(row['prediction'])
    return len(set(row['prediction']) - set(row['subtitle']))

# assign returns a new DataFrame, which avoids SettingWithCopyWarning
merge_df_filtered = merge_df_filtered.assign(insertion=merge_df_filtered.apply(get_insertion, axis=1))
merge_df_filtered.head()


Unnamed: 0,id,subtitle,prediction,confidence,label,predicted_label,insertion
0,DTfHpK5W4M4-0001929-2.0,但不也是挺好的嗎,但不也是挺好的嗎,0.975,1,1,0
1,QKYCPuoxj6U-0001075-2.0,白虎,白虎,0.98,1,1,0
12,vDpOeD7URIg-0000517-2.0,妳竟然說我們的實力沒什麼了不起,妳竟然說我們的力沒什麼了不起,0.81,0,0,0
15,YLiiOvjBqsk-0002153-2.0,信賴友情愛情感情,信賴友情愛情感情UhhipAudi,0.665,0,0,7
17,T3icvVg_SZw-0004794-2.0,後來他家裡沒錢了,後來他家裡沒錢了,0.96,1,1,0


In [9]:
def get_deletion(row):
    # count distinct characters that appear in the ground truth but not in the prediction
    # (a set difference, so repeated characters and character order are ignored)
    if pd.isna(row['subtitle']):
        # no ground-truth subtitle: nothing can be deleted
        return 0
    return len(set(row['subtitle']) - set(row['prediction']))

# assign returns a new DataFrame, which avoids SettingWithCopyWarning
merge_df_filtered = merge_df_filtered.assign(deletion=merge_df_filtered.apply(get_deletion, axis=1))
merge_df_filtered.head()


Unnamed: 0,id,subtitle,prediction,confidence,label,predicted_label,insertion,deletion
0,DTfHpK5W4M4-0001929-2.0,但不也是挺好的嗎,但不也是挺好的嗎,0.975,1,1,0,0
1,QKYCPuoxj6U-0001075-2.0,白虎,白虎,0.98,1,1,0,0
12,vDpOeD7URIg-0000517-2.0,妳竟然說我們的實力沒什麼了不起,妳竟然說我們的力沒什麼了不起,0.81,0,0,0,1
15,YLiiOvjBqsk-0002153-2.0,信賴友情愛情感情,信賴友情愛情感情UhhipAudi,0.665,0,0,7,0
17,T3icvVg_SZw-0004794-2.0,後來他家裡沒錢了,後來他家裡沒錢了,0.96,1,1,0,0
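
Both counts rely on set differences, so they count distinct characters and ignore repeats and order. For row 15 above, the prediction appends `UhhipAudi` (nine characters), yet the insertion count is 7 because `h` and `i` each occur twice. The sketch below compares the notebook's set-based count with a positional count built on the standard library's `difflib.SequenceMatcher`. The positional version is a substituted technique shown only for comparison, not the notebook's method, and both function names are hypothetical.

```python
from difflib import SequenceMatcher


def set_based_counts(subtitle, prediction):
    """Counts as in the notebook: distinct characters found on one side only."""
    insertion = len(set(prediction) - set(subtitle))
    deletion = len(set(subtitle) - set(prediction))
    return insertion, deletion


def positional_counts(subtitle, prediction):
    """Hypothetical alternative: count characters from a difflib alignment."""
    insertion = deletion = 0
    matcher = SequenceMatcher(None, subtitle, prediction, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deletion += i2 - i1   # ground-truth characters missing from the prediction
        if tag in ("insert", "replace"):
            insertion += j2 - j1  # predicted characters absent from the ground truth
    return insertion, deletion


truth = "信賴友情愛情感情"
pred = "信賴友情愛情感情UhhipAudi"
set_result = set_based_counts(truth, pred)
pos_result = positional_counts(truth, pred)
print(set_result, pos_result)
```

The set-based figures in this notebook are therefore best read as a lower bound on the number of extra or missing characters.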


In [10]:
# same cut-off as in the threshold loop (confidence >= 0.9)
higher_threshold = merge_df_filtered.confidence >= 0.9
matched_predictions = merge_df_filtered.prediction == merge_df_filtered.subtitle
filtered_samples = higher_threshold.sum()
correct_samples = (higher_threshold & matched_predictions).sum()
total_insertions = merge_df_filtered.loc[higher_threshold, 'insertion'].sum()
total_deletions = merge_df_filtered.loc[higher_threshold, 'deletion'].sum()
total_sentence_lengths = merge_df_filtered.loc[higher_threshold, 'subtitle'].str.len().sum()
print(f"Total filtered sample: {filtered_samples}")
print(f"Total correct sample: {correct_samples}")
print(f"Correct sample rate: {correct_samples / filtered_samples}")
print(f"Average insertion: {total_insertions / filtered_samples} (characters / sentence)")
print(f"Average deletion: {total_deletions / filtered_samples} (characters / sentence)")
print(f"Average sentence length: {total_sentence_lengths / filtered_samples} (characters / sentence)")

Total filtered sample: 394
Total correct sample: 310
Correct sample rate: 0.7868020304568528
Average insertion: 0.14974619289340102 (characters / sentence)
Average deletion: 0.24619289340101522 (characters / sentence)
Average sentence length: 8.071065989847716 (characters / sentence)
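
These counts tie back to the 0.9 row of the threshold table. The sketch below redoes the arithmetic using only numbers printed in this notebook. Note that `label` marks exact matches among rows with a non-empty prediction, so the recall in the table is measured against those 330 matches rather than against all 510 labeled subtitles. The last line is an extra figure that the notebook itself does not report.

```python
# Counts taken from the notebook outputs above.
selected = 394          # predictions kept at the 0.9 confidence threshold
correct_selected = 310  # of those, exact matches with the ground truth
exact_matches = 330     # exact matches among all non-empty predictions
gt_with_subtitle = 510  # ground-truth images that contain a subtitle

precision = correct_selected / selected
recall = correct_selected / exact_matches
f1 = 2 * precision * recall / (precision + recall)

# Not reported in the notebook: share of all labeled subtitles recovered exactly.
recall_vs_ground_truth = correct_selected / gt_with_subtitle

print(round(precision, 6), round(recall, 6), round(f1, 6), round(recall_vs_ground_truth, 4))
```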


## Conclusion

The result is acceptable. 

In total, there are __510__ images containing subtitles. 

With __confidence threshold = 0.9__, my algorithm generates __394__ positive predictions, __310__ of which are correct.

The F-measure at this threshold is around __0.86__.

Average subtitle length = __8__ (characters / sentence)

Average insertion = __0.15__ (characters / sentence)

Average deletion = __0.25__ (characters / sentence)