In [1]:
MAIN_PATH = "/home/carlos/MasterDS/tfm"
JSON_DATA_PATH = '{}/data/json/'.format(MAIN_PATH)
CSV_DATA_PATH = '{}/data/csv/'.format(MAIN_PATH)

In [2]:
import sys
sys.path.insert(0, MAIN_PATH)

In [3]:
import pandas as pd
from rouge import Rouge

In [4]:
%load_ext autoreload
%autoreload 2
from scripts.metrics.summary_evaluation import SummaryEvaluation

%reload_ext autoreload

# Metrics

The goal of this notebook is to collect the metrics obtained with the different extractive summarization techniques that have been developed.

In this first part, the goal is to summarize each match directly from its events. To assess the quality of these methods, each generated summary is compared directly with the news article associated with the match (which we can treat as a summary of the match), using the following metrics:

- ROUGE: measures recall. How many sequences (unigrams, bigrams, LCS...) of the generated summary are found in the reference summary, out of the total number of sequences in the reference summary. The `rouge` package used in this notebook also reports precision and F1 for each ROUGE variant.
- BLEU: measures precision. How many sequences of the generated summary are found in the reference summary, out of the total number of sequences in the generated summary.

> Naturally - these results are complementing, as is often the case in precision vs recall. If you have many words from the system results appearing in the human references you will have high Bleu, and if you have many words from the human references appearing in the system results you will have high Rouge.
>
> Source: https://stackoverflow.com/questions/38045290/text-summarization-evaluation-bleu-vs-rouge
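To make the two definitions concrete, here is a minimal sketch of n-gram precision and recall between a candidate and a reference. It is not the implementation used by the `rouge` package or by any BLEU library (real BLEU also combines several n-gram orders and applies a brevity penalty). The function names `ngram_counts` and `overlap_scores`, the whitespace tokenization, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Return (precision, recall) of n-gram overlap; hypothetical helper."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both texts
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # denominator: candidate n-grams
    recall = overlap / max(sum(ref.values()), 1)      # denominator: reference n-grams
    return precision, recall


# Toy texts, not taken from the dataset
candidate = "goal by murray from a corner"
reference = "murray headed in a corner"
precision, recall = overlap_scores(candidate, reference)
print(precision, recall)  # 3 shared unigrams: 3/6 and 3/5
```

The same overlap count feeds both scores; only the denominator changes, which is why the two metrics pull in opposite directions as the candidate grows.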

It is important to note that this is not extractive summarization in the usual sense (traditionally, the summary would be generated from the news article itself, and the metrics computed against it). The writing style and vocabulary of events and news articles differ considerably, so the traditional metrics obtained will not be very high. What matters here is the comparison between the proposed techniques.

In [31]:
summaries_path = '{}summaries/key_events_summaries_2.csv'.format(CSV_DATA_PATH)
articles_path = '{}articles_events_processed.csv'.format(CSV_DATA_PATH)

In [32]:
pd_summaries = pd.read_csv(summaries_path)
pd_matches = pd.read_csv(articles_path)

In [33]:
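# Key columns shared by both tables
join_cols = ['json_file', 'url']
# Article text and full event log, taken from the matches table
match_cols = join_cols + ['article', 'events']
# Generated summary, taken from the summaries table
sum_cols = join_cols + ['summary_events']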

In [34]:
# Inner join: keep only the matches that have both an article and a generated summary
pd_matches_sum = pd_matches[match_cols].merge(pd_summaries[sum_cols], on=join_cols, how='inner')

### Example

In [36]:
prueba = pd_matches_sum.iloc[0]
article = prueba['article']
events = prueba['events']
summary_events = prueba['summary_events']

In [37]:
article

"A goal and an assist from Riyad Mahrez sealed a 4-1 comeback win at Brighton & Hove Albion that confirmed Manchester City as Premier League champions.\xa0\nThe Algerian, a 2015/16 champion with Leicester City, replaced Phil Foden in the starting line-up and starred as Pep Guardiola's side defended the PL title and claimed the Trophy for the fourth time.\nGlenn Murray gave Brighton a shock lead when he headed in Pascal Gross's 27th-minute corner.\nBut Sergio Aguero levelled within a minute after being played in by David Silva, before Aymeric Laporte headed City in front from Mahrez's corner on 38 minutes.\nAfter the break, Mahrez fired superbly into the top corner before Ilkay Gundogan scored a brilliant free-kick with 18 minutes remaining.\nCity finished on 98 points, one above Liverpool, who won 2-0 against Wolverhampton Wanderers on the final day."

In [38]:
events

'Attempt blocked. Ilkay Gündogan (Manchester City) right footed shot from outside the box is blocked. Assisted by Bernardo Silva. Corner,  Brighton and Hove Albion. Conceded by Aymeric Laporte. Offside, Brighton and Hove Albion. Anthony Knockaert tries a through ball, but Glenn Murray is caught offside. Foul by Anthony Knockaert (Brighton and Hove Albion). Aymeric Laporte (Manchester City) wins a free kick in the defensive half. Attempt missed. Alireza Jahanbakhsh (Brighton and Hove Albion) right footed shot from outside the box is close, but misses to the left. Assisted by Anthony Knockaert following a fast break. Attempt blocked. Sergio Agüero (Manchester City) right footed shot from outside the box is blocked. Assisted by Bernardo Silva. Foul by Yves Bissouma (Brighton and Hove Albion). Bernardo Silva (Manchester City) wins a free kick in the defensive half. Attempt missed. Sergio Agüero (Manchester City) header from the centre of the box misses to the left. Assisted by Raheem Sterl

In [39]:
summary_events

'Goal.  Brighton and Hove Albion 1, Manchester City 0. Glenn Murray (Brighton and Hove Albion) header from the left side of the six yard box to the bottom left corner. Assisted by Pascal Groß with a cross following a corner. Goal.  Brighton and Hove Albion 1, Manchester City 1. Sergio Agüero (Manchester City) left footed shot from a difficult angle on the left to the centre of the goal. Assisted by David Silva with a through ball. Goal.  Brighton and Hove Albion 1, Manchester City 2. Aymeric Laporte (Manchester City) header from very close range to the top left corner. Assisted by Riyad Mahrez with a cross following a corner. Goal.  Brighton and Hove Albion 1, Manchester City 3. Riyad Mahrez (Manchester City) right footed shot from outside the box to the top right corner. Assisted by David Silva. Goal.  Brighton and Hove Albion 1, Manchester City 4. Ilkay Gündogan (Manchester City) from a free kick with a right footed shot to the bottom right corner.'

We compare the metrics obtained with the generated summary against those obtained with the original events. We observe the following:

- The full events logically have higher recall, and the summaries higher precision. This is due to the length of the texts: precision is normalized by the length of the candidate text (events > summary), whereas recall is normalized by the length of the reference text (the article, which is always the same). Therefore, the longer the candidate, the more likely it is to obtain higher recall and lower precision.

- However, the gap between the two, especially in precision, is so large that the F1 score is higher for the summarized option. Therefore, we can only take the recall of the events as an upper bound, and the precision of the events as a lower bound.
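The length effect described above can be reproduced on toy data. The sketch below assumes unique-unigram overlap and whitespace tokenization, which is a simplification of what the `rouge` package computes; `unigram_pr` and the example strings are hypothetical.

```python
def unigram_pr(candidate, reference):
    """Precision and recall over unique unigrams; simplified illustration."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    shared = len(cand & ref)
    return shared / len(cand), shared / len(ref)


reference = "murray scored from a corner before aguero levelled"
short_candidate = "murray scored from a corner"
# Longer candidate: same content plus extra event-style text
long_candidate = short_candidate + " foul by knockaert offside aguero wins a free kick levelled"

short_p, short_r = unigram_pr(short_candidate, reference)
long_p, long_r = unigram_pr(long_candidate, reference)
print("short:", short_p, short_r)
print("long: ", long_p, long_r)
```

The longer candidate covers more of the reference, so recall goes up, but most of its extra words are not in the reference, so precision goes down.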

In [40]:
rouge = Rouge()

In [42]:
# Generated summary as hypothesis, article as reference
rouge.get_scores(summary_events, article)

[{'rouge-1': {'f': 0.3006535898073391,
   'p': 0.27710843373493976,
   'r': 0.32857142857142857},
  'rouge-2': {'f': 0.052631573983942644,
   'p': 0.048484848484848485,
   'r': 0.05755395683453238},
  'rouge-l': {'f': 0.27118643600114917,
   'p': 0.36363636363636365,
   'r': 0.21621621621621623}}]

In [43]:
# Full event log as hypothesis, same article as reference
rouge.get_scores(events, article)

[{'rouge-1': {'f': 0.07801950299630425,
   'p': 0.04358759430008382,
   'r': 0.37142857142857144},
  'rouge-2': {'f': 0.01352366454569753,
   'p': 0.007550335570469799,
   'r': 0.06474820143884892},
  'rouge-l': {'f': 0.21052631084069576,
   'p': 0.19117647058823528,
   'r': 0.23423423423423423}}]
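As a sanity check on the two outputs above, the `f` values can be recomputed from `p` and `r` with the usual harmonic mean, F1 = 2pr / (p + r). The helper `f1` is an illustrative name; the inputs are the ROUGE-1 values printed above. Small differences in the last digits are to be expected, since the library's own formula may include a smoothing term.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ROUGE-1 values copied from the outputs above
summary_f1 = f1(0.27710843373493976, 0.32857142857142857)
events_f1 = f1(0.04358759430008382, 0.37142857142857144)

print(round(summary_f1, 4))  # reported f: 0.3006...
print(round(events_f1, 4))   # reported f: 0.0780...
```

Because the harmonic mean is dominated by the smaller of the two values, the very low precision of the full events pulls their F1 far below that of the summary, even though their recall is slightly higher.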

## ROUGE

In [25]:
evaluation = SummaryEvaluation(metric='rouge')

Setting target metric to rouge


__Without preprocessing__

In [154]:
path = '{}metrics/summaries/rouge/key_events_summaries_2'.format(JSON_DATA_PATH)

In [142]:
scores_dict, avg_scores_dict = evaluation.evaluate(pd_matches_sum, path, avg=True)

Path already exists


In [144]:
avg_scores_dict

{'rouge-1': {'f': 0.14093097027721083,
  'p': 0.451322714164037,
  'r': 0.10829601513666687},
 'rouge-2': {'f': 0.03568097728062855,
  'p': 0.15954426423885243,
  'r': 0.026128361172219026},
 'rouge-l': {'f': 0.13772376215091478,
  'p': 0.510722632771135,
  'r': 0.09210327264239185}}
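How `SummaryEvaluation.evaluate` aggregates when `avg=True` is not shown here. One plausible reading is a per-metric arithmetic mean over articles, sketched below with made-up scores; `average_scores` is a hypothetical helper, not part of the project's code.

```python
from statistics import mean


def average_scores(per_article_scores):
    """Average nested {metric: {f, p, r}} dicts across articles (assumed behavior)."""
    metrics = per_article_scores[0].keys()
    return {
        metric: {
            key: mean(scores[metric][key] for scores in per_article_scores)
            for key in ("f", "p", "r")
        }
        for metric in metrics
    }


# Made-up scores for two articles, in the same nested shape as the outputs above
scores = [
    {"rouge-1": {"f": 0.30, "p": 0.28, "r": 0.33}},
    {"rouge-1": {"f": 0.10, "p": 0.40, "r": 0.07}},
]
avg = average_scores(scores)
print(avg)
```

This reading is consistent with the output above: the reported average `f` for ROUGE-1 (about 0.141) is not the harmonic mean of the reported average `p` and `r` (which would be about 0.175), as would be expected if each of the three values were averaged separately over articles.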

__With preprocessing (stopword removal and lemmatization)__

In [145]:
path = '{}metrics/summaries/rouge/key_events_summaries_2_processed'.format(JSON_DATA_PATH)

In [146]:
scores_dict, avg_scores_dict = evaluation.evaluate(pd_matches_sum, path, preprocess_text=True, avg=True)

Writing to /home/carlos/MasterDS/tfm/data/json/metrics/summaries/rouge/key_events_summaries_2_processed.pickle
Writing avg to /home/carlos/MasterDS/tfm/data/json/metrics/summaries/rouge/key_events_summaries_2_processed_avg.pickle


In [147]:
avg_scores_dict

{'rouge-1': {'f': 0.18414411021992844,
  'p': 0.5618391255639942,
  'r': 0.14011238092564743},
 'rouge-2': {'f': 0.05898801516243596,
  'p': 0.20638111609898435,
  'r': 0.044435798481286586},
 'rouge-l': {'f': 0.19663244820860556,
  'p': 0.6624929981455303,
  'r': 0.13120485256209452}}
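The internals of `preprocess_text=True` are not shown. As a rough illustration of the stopword-removal half only, the sketch below filters a tiny hand-written stopword list; the list, the regex tokenizer, and `remove_stopwords` are assumptions, and lemmatization is left out because it needs a library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a full list from a library
STOPWORDS = {"a", "an", "and", "by", "from", "in", "of", "the", "to", "with"}


def remove_stopwords(text):
    """Lowercase, tokenize on word characters, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


# Sentence taken from the example summary shown earlier
sentence = "Assisted by David Silva with a through ball."
filtered = remove_stopwords(sentence)
print(filtered)
```

Removing function words reduces both texts to content words, which is one likely reason the averaged scores above are higher than in the run without preprocessing.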

### With all summaries

In [6]:
%%time
evaluation.evaluate_all_summaries(preprocess_text=True)

Evaluating the following summaries: ['key_events_summaries_1.csv', 'key_events_summaries_3.csv', 'key_events_summaries_4.csv', 'key_events_summaries_graph_2.csv', 'key_events_summaries_graph_3.csv', 'key_events_summaries_2.csv', 'key_events_summaries_graph_5.csv', 'key_events_summaries_graph_1.csv', 'key_events_summaries_graph_4.csv']
Evaluating key_events_summaries_1.csv
Performing evaluation for 4302 articles
Writing to /home/carlos/MasterDS/tfm/data/metrics/summaries/rouge/key_events_summaries_1_processed.pickle
Writing avg to /home/carlos/MasterDS/tfm/data/metrics/summaries/rouge/key_events_summaries_1_processed_avg.pickle
Evaluating key_events_summaries_3.csv
Performing evaluation for 4302 articles
Writing to /home/carlos/MasterDS/tfm/data/metrics/summaries/rouge/key_events_summaries_3_processed.pickle
Writing avg to /home/carlos/MasterDS/tfm/data/metrics/summaries/rouge/key_events_summaries_3_processed_avg.pickle
Evaluating key_events_summaries_4.csv
Performing evaluation for 430

In [13]:
pd_metrics = evaluation.output_avg_metrics()

In [15]:
pd_metrics

| | metric | metric_type | value | experiment |
|---|---|---|---|---|
| 0 | rouge-1 | f | 0.134658 | key_events_summaries_graph_5_processed |
| 1 | rouge-1 | p | 0.101599 | key_events_summaries_graph_5_processed |
| 2 | rouge-1 | r | 0.248504 | key_events_summaries_graph_5_processed |
| 3 | rouge-2 | f | 0.033602 | key_events_summaries_graph_5_processed |
| 4 | rouge-2 | p | 0.024837 | key_events_summaries_graph_5_processed |
| ... | ... | ... | ... | ... |
| 4 | rouge-2 | p | 0.024790 | key_events_summaries_graph_4_processed |
| 5 | rouge-2 | r | 0.072010 | key_events_summaries_graph_4_processed |
| 6 | rouge-l | f | 0.235995 | key_events_summaries_graph_4_processed |
| 7 | rouge-l | p | 0.291261 | key_events_summaries_graph_4_processed |
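The table above is in long format: one row per metric, score type, and experiment. The code behind `output_avg_metrics` is not shown; a minimal sketch of how a nested average dictionary could be flattened into such rows follows. `to_long_rows` is a hypothetical helper, and the values are rounded from the preprocessed `key_events_summaries_2` run shown earlier.

```python
def to_long_rows(avg_scores, experiment):
    """Flatten {metric: {f, p, r}} into long-format rows (illustrative)."""
    return [
        {
            "metric": metric,
            "metric_type": metric_type,
            "value": value,
            "experiment": experiment,
        }
        for metric, by_type in avg_scores.items()
        for metric_type, value in by_type.items()
    ]


# Values rounded from an earlier output in this notebook
avg_scores = {
    "rouge-1": {"f": 0.184144, "p": 0.561839, "r": 0.140112},
    "rouge-2": {"f": 0.058988, "p": 0.206381, "r": 0.044436},
}
rows = to_long_rows(avg_scores, "key_events_summaries_2_processed")
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, and rows from several experiments can be concatenated for side-by-side comparison.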


In [14]:
evaluation.bound_metrics()

Performing evaluation for 4523 articles
Writing to /home/carlos/MasterDS/tfm/data/metrics/summaries/rouge/upper_bound.pickle
Writing avg to /home/carlos/MasterDS/tfm/data/metrics/summaries/rouge/upper_bound_avg.pickle


In [26]:
pd_bound = evaluation.output_avg_bound()

In [27]:
pd_bound

| | metric | metric_type | value |
|---|---|---|---|
| 0 | rouge-1 | f | 0.105305 |
| 1 | rouge-1 | p | 0.067818 |
| 2 | rouge-1 | r | 0.361353 |
| 3 | rouge-2 | f | 0.021309 |
| 4 | rouge-2 | p | 0.013826 |
| 5 | rouge-2 | r | 0.073771 |
| 6 | rouge-l | f | 0.218348 |
| 7 | rouge-l | p | 0.22875 |
| 8 | rouge-l | r | 0.232812 |
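One way to read an experiment against this bound is to express its recall as a fraction of the upper-bound recall. The sketch below uses the ROUGE-1 recall of `key_events_summaries_graph_5_processed` and of the bound, both as printed above. `fraction_of_bound` is a hypothetical helper, and this ratio is an illustrative reading rather than a metric used in the notebook.

```python
def fraction_of_bound(value, bound):
    """Share of the bound reached by a score; raises if the bound is not positive."""
    if bound <= 0:
        raise ValueError("bound must be positive")
    return value / bound


# ROUGE-1 recall values copied from the tables above
graph_5_recall = 0.248504
upper_bound_recall = 0.361353

ratio = fraction_of_bound(graph_5_recall, upper_bound_recall)
print(f"{ratio:.1%} of the upper-bound recall")
```

The same helper can be applied to precision against the lower bound, where a ratio above 1 would indicate that the summary is more precise than the full event log.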
