# Weekly Meeting 10

* Experiments for reducing the scores of negative candidates.
* Fine-tuning BART-Large instead of BART-CNN (lead bias?)

## Negative candidates

### Discriminative approach

Experimental details:

* I use the same ranking size as the discriminative model of the paper (24 candidates).


* The negative symbolic information is computed as $S^- = \bigcup\limits_{u\in T}\{v \mid (u, v) \in \textrm{E}\ \wedge\ \nexists\, u' \in T : p(u', v) \}$


* I only used the category graph in these experiments (there was not enough time to experiment with the infobox graph).
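
The set-builder definition of $S^-$ can be sketched in a few lines of Python. This is only an illustration under assumptions: $T$ is taken to be the set of input entities, the graph is an adjacency mapping, and the predicate `p` is passed in by the caller because these notes do not define it. The function name and the toy data are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Mapping, Set


def negative_symbolic_info(
    targets: Iterable[Hashable],
    edges: Mapping[Hashable, Iterable[Hashable]],
    p: Callable[[Hashable, Hashable], bool],
) -> Set[Hashable]:
    """Collect neighbours v of any u in T such that no u' in T satisfies p(u', v)."""
    target_set = set(targets)
    negatives = set()
    for u in target_set:
        for v in edges.get(u, ()):  # (u, v) in E
            # Keep v only if the predicate fails for every node of T.
            if not any(p(u2, v) for u2 in target_set):
                negatives.add(v)
    return negatives


# Toy example: the predicate is a made-up relation, used only for illustration.
edges = {"a": ["x", "y"], "b": ["y", "z"]}
related = {"a": {"z"}, "b": set()}
s_minus = negative_symbolic_info(["a", "b"], edges, lambda u, v: v in related[u])
```

Here `z` is dropped because the toy predicate holds for the pair `("a", "z")`.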

<br><ins>**Experiment 1**</ins>: use $S^-$ as negative candidates. To keep the ranking size fixed, existing negative candidates are randomly replaced by elements $s\in S^-$, but only if $s \not\in G$ (exact match). I tried different proportions of replaced negative candidates.
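
A minimal sketch of this replacement step, assuming candidates are plain strings. The function name, the fixed seed, the rounding of the replacement count, and the choice to skip elements of $S^-$ that are already among the negatives are all assumptions, not details taken from the actual implementation.

```python
import random


def replace_negatives(negatives, symbolic_negatives, gold, fraction, seed=0):
    """Swap a fraction of the negatives for elements of S^-, keeping the list length."""
    rng = random.Random(seed)
    gold_set = set(gold)
    current = set(negatives)
    # Exact-match filter against G; elements already used as negatives are skipped too.
    pool = [s for s in dict.fromkeys(symbolic_negatives)
            if s not in gold_set and s not in current]
    # Never replace more candidates than the pool can supply.
    n_replace = min(round(fraction * len(negatives)), len(pool))
    positions = rng.sample(range(len(negatives)), n_replace)
    replacements = rng.sample(pool, n_replace)
    result = list(negatives)
    for position, item in zip(positions, replacements):
        result[position] = item
    return result


new_negatives = replace_negatives(
    ["n1", "n2", "n3", "n4"], ["s1", "g1", "s2"], gold={"g1"}, fraction=0.5
)
```

When the filtered pool is smaller than the requested number of replacements, fewer candidates are swapped, so the ranking size is still preserved.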


| Replacements (%) | MAP | R@10 | MRR | Different negatives w.r.t. baseline (%) |
| --- | --- | --- | --- | --- |
| Baseline | **89.80** | **99.28** | 95.69 | 0% |
| 25% | 89.66 | 99.07 | 96.02 | 12.75% |
| 50% | 88.34 | 99.08 | 94.67 | 24.95% |
| 75% | 88.59 | 97.78 | **96.10** | 34.34% |
| 100% | 88.08 | 98.88 | 94.28 | 42.85% |


Interesting case (Abdul Rahman, Hamid Karzai): 
<br>**G** $\rightarrow$ ['men', 'afghan politicians', 'afghans']
<br>**$S^-$** $\rightarrow$ ['afghan exiles', 'arabic masculine given names', 'people from kandahar province', 'turkish masculine given names', 'presidents of afghanistan', 'iranian masculine given names', 'honorary knights grand cross of the order of st michael and st george', 'afghan sunni muslims', 'mujahideen members of the soviet–afghan war', 'karzai family', 'pakistani masculine given names', '2000s in afghanistan', '2010s in afghanistan', 'afghan expatriates in pakistan']

<ins>**Experiment 2**</ins>: use the positive symbolic information (LCA) $S^+$ as negative candidates. To keep the ranking size fixed, existing negative candidates are randomly replaced by elements $s\in S^+$, but only if $s \not\in G$ (exact match). All elements of $S^+$ are used as negative candidates.

Also, since the "positive" data augmentation experiment from the previous meeting was run with the generative system, I repeated it with the discriminative model.

| Candidate type | Information | MAP | R@10 | MRR |
| --- | --- | --- | --- | --- |
| Positives | Lowest common ancestors (k=6) | 88.20 | 99.34 | 94.09 |
| Negatives | Lowest common ancestors (k=6) | 88.34 | 98.88  | 95.30 |


<ins>**Experiment 3**</ins>: use $S^+$ (LCA) as "soft candidates". To keep the ranking size fixed, existing negative candidates are randomly replaced by elements $s\in S^+$, but only if $s \not\in G$ (exact match). To avoid class imbalance, at most 12 positive and soft-positive candidates are allowed.
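
The candidate construction with the cap of 12 could look roughly like the sketch below. It is an assumption-laden illustration: the function name is hypothetical, gold candidates are given priority over soft ones when the cap is hit, and truncation replaces the random selection for brevity. The label names are borrowed from the appendix.

```python
def build_candidates(gold, soft, negatives, ranking_size=24, max_positive=12):
    """Return (candidate, label) pairs with a bounded number of positive and soft items."""
    gold = list(dict.fromkeys(gold))[:max_positive]
    taken = set(gold)
    # Soft candidates must not exactly match a gold candidate.
    soft = [s for s in dict.fromkeys(soft) if s not in taken]
    soft = soft[: max_positive - len(gold)]
    taken.update(soft)
    # Fill the remaining slots with ordinary negatives.
    n_negatives = ranking_size - len(gold) - len(soft)
    negs = [n for n in dict.fromkeys(negatives) if n not in taken][:n_negatives]
    return (
        [(c, "aggregation") for c in gold]
        + [(c, "partial_aggregation") for c in soft]
        + [(c, "not_aggregation") for c in negs]
    )


candidates = build_candidates(
    gold=["g1", "g2"],
    soft=["g1"] + [f"s{i}" for i in range(20)],
    negatives=[f"n{i}" for i in range(30)],
)
```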

| Information | Soft-label | MAP | R@10 | MRR |
| --- | --- | --- | --- | --- |
| Lowest common ancestors (k=6) | 0.0 | 88.20 | 99.34 | 94.09 |
| Lowest common ancestors (k=6) | 0.2 | 87.45 | 98.00 | 95.66 |
| Lowest common ancestors (k=6) | 0.4 | 88.49 | 98.71 | 95.17 |
| Lowest common ancestors (k=6) | 0.6 | 88.89 | 98.71 | 95.31 |
| Lowest common ancestors (k=6) | 0.8 | 88.81 | 98.99 | 95.50 |
| Lowest common ancestors (k=6) | 1.0 | 88.34 | 98.88 | 95.30 |

Question: in all the experiments that use information from the graph, performance is very similar to the baseline without graph information. Could this information already be contained in the model parameters, or in the background/context of the input? [[Language Models as Knowledge Bases?]](https://arxiv.org/pdf/1909.01066.pdf). The information in the category and infobox graphs may also appear in the Wikipedia articles themselves, and: 1) the pre-training data of BART contains Wikipedia, 2) the backgrounds are extracted from those articles.

### Generative approach

Regardless of whether the graph information is useful, is it still worth integrating the negative candidates into the training of the generative model?

Generative: $p(y_1^I | x_1^J) = \prod_{i=1}^{I} p(y_i | y_1^{i-1}, x_1^J)$
<br>Discriminative: $p(y | x_1^J) = f_{sm}(u_J^\intercal W + b)_y$

Idea: integrate a discriminative loss alongside the generative loss.

$\mathcal{L} = \mathcal{L}_D + \mathcal{L}_G$
<br>$\mathcal{L}_G = $
<br>$\mathcal{L}_D = $
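
One possible way to fill in the two terms, offered as a sketch rather than a settled choice: $\mathcal{L}_G$ as the negative log-likelihood of the generative factorization above, and $\mathcal{L}_D$ as the binary cross-entropy between $p(1 | x_1^J)$ and a (soft) label $t \in [0, 1]$, mirroring the loss described in the appendix for the discriminative model. The symbol $t$ and the unweighted sum of the two terms are assumptions, and how $p(1 | x_1^J)$ would be obtained inside the generative model is left open.

$\mathcal{L}_G = -\sum_{i=1}^{I} \log p(y_i | y_1^{i-1}, x_1^J)$
<br>$\mathcal{L}_D = -\left[ t \log p(1 | x_1^J) + (1 - t) \log\left(1 - p(1 | x_1^J)\right) \right]$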

## Lead bias checking

I assume that the pre-trained BART without fine-tuning on CNN/DM (BART-Large) does not suffer from lead bias, so I repeated the experiment with the discriminative model (without graph information), fine-tuning BART-Large instead of BART-CNN.



| BART | MAP | R@10 | MRR |
| --- | --- | --- | --- |
| CNN | **89.80** | **99.28** | 95.69 |
| Large | 89.61 | 99.06 | 95.74 |


### Is BART-CNN biased towards the lead sentences?

I prepared some code to visualize the encoder/decoder attentions and to compute the accumulated probability for each sentence (**BARTBias** notebook).
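
As an illustration of what "accumulated probability for each sentence" could mean, the sketch below sums attention mass per source sentence. It does not reproduce the BARTBias notebook: it assumes the attention weights have already been extracted and averaged over heads and layers into a matrix with one row per decoder step and one column per source token, and that each source token has a known sentence index. The function and argument names are hypothetical.

```python
def attention_mass_per_sentence(attention, sentence_ids):
    """Average attention over decoder steps, then sum it within each source sentence.

    attention: list of rows, one per decoder step, each a distribution over source tokens.
    sentence_ids: sentence index (starting at 0) of every source token.
    """
    n_steps = len(attention)
    n_sentences = max(sentence_ids) + 1
    mass = [0.0] * n_sentences
    for row in attention:
        for weight, sentence in zip(row, sentence_ids):
            mass[sentence] += weight / n_steps
    return mass


# Two decoder steps, four source tokens split into two sentences.
attention = [[0.5, 0.3, 0.1, 0.1],
             [0.1, 0.1, 0.4, 0.4]]
mass = attention_mass_per_sentence(attention, [0, 0, 1, 1])
```

If the model were lead-biased, most of the mass would be expected to fall on the first sentence indices; in this toy input it is split evenly.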

## Appendix

Implementation details of the experiments with negative candidates:

**symbolic_algo**: ("positives") lowest_common_ancestors_graph, value_intersection_infobox. ("negatives") negatives_from_neighborhood

**symbolic_format**: defaults to input when not specified (specifying "input" has the same effect). If "target", the symbolic information is used as positives. If "negative_targets_X%", the symbolic information is used as negatives, replacing X% of the negatives.

Therefore, it is possible to use a "positive" symbolic_algo (e.g. lowest_common_ancestors_graph) as distractors with symbolic_format=negative_targets_100, to use a "positive" symbolic_algo as gold standards with symbolic_format=target, and to use a "negative" symbolic_algo as distractors with symbolic_format=negative_targets_X%.


**soft-labels**: set in create_modeling_task to distinguish the labels that come from the graph information from those that come from the corpus ("partial_aggregation", with -2 as the key value). In finetune.sh, the weight given to each label can be specified: {"not_aggregation": 0, "aggregation": 1, "partial_aggregation": 0.6}. The binary cross-entropy is computed between the p(1|x) produced by the network (position 1 of the softmax) and the soft label.
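
A minimal, framework-free sketch of the soft-label loss described above, assuming the network outputs two logits per candidate. The function names and the `eps` smoothing term are illustrative; only the label names and weights come from these notes.

```python
import math

# Label weights as given in the notes (the finetune.sh example).
LABEL_WEIGHTS = {"not_aggregation": 0.0, "aggregation": 1.0, "partial_aggregation": 0.6}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_bce(logits, label, weights=LABEL_WEIGHTS, eps=1e-12):
    """Binary cross-entropy between p(1|x) (position 1 of the softmax) and the soft label."""
    p1 = softmax(logits)[1]
    target = weights[label]
    return -(target * math.log(p1 + eps) + (1.0 - target) * math.log(1.0 - p1 + eps))


loss = soft_label_bce([0.2, 1.3], "partial_aggregation")
```

With a soft label of 0.6, the loss is minimized when p(1|x) equals 0.6 rather than 1, so graph-derived candidates are pushed less strongly towards the positive class than gold candidates.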