Changes for 2.0.0 #152

Merged: 82 commits, Jul 18, 2021

Conversation

@ozancaglayan (Collaborator) commented Mar 26, 2021

Hello,

  • Here's a detailed summary of this PR. It'll probably be quite hard to review, as the modifications to the metrics will appear as large diff blocks. As a first pass, we can go through the summary and examples below as well as the code changes. I tested this extensively, but it's still possible that some combinations of CLI flags raise errors. If we merge this, we could do a release candidate first to let people test it.

  • In terms of backward compatibility, I tried to be conservative. The fix for "Sentence BLEU yields non-intuitive scores" (#141) is definitely backward-incompatible, but I think it's the correct behavior. Another incompatible change is how signatures are formatted on the terminal.

  • Two things are hopefully handled correctly but are untested on Windows: (1) colored outputs (these should be disabled on Windows through a platform check), and (2) the multi-CPU significance tests (these should fall back to 1 CPU on Windows).

Questions:

  • Should we keep the single-system confidence (--confidence) functionality, or does it just add confusion, given that it does not provide very valuable information on its own?

  • For "Having better defaults for ChrF" (#124), should we switch to chrF++ by default or continue computing the plain old chrF for backward compatibility?

Thanks!

General

  • Improve documentation and type annotations.
  • Add README.md and CHANGELOG.md to setup.py so that they are shown on PyPI.
  • Drop Python < 3.6 support and migrate to f-strings.
  • Relax the portalocker version pinning; add regex, tabulate and numpy dependencies.
  • Drop input type manipulation through isinstance checks. If the inputs do not follow
    the expected type annotations, exceptions will be raised. Past attempts at robustness
    led to confusion and obfuscated score errors (Sentence CHRF silently accepts single
    reference as string #121). See the API sketch after this list for the expected types.
  • A variable number of references per segment is now supported for all metrics by default.
    This is still only available through the API.
  • Use colored strings in tabular outputs (multi-system evaluation mode). The colored
    output is disabled if the platform is Windows, if the output is redirected into a
    file, or if --no-color is passed.
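
For reference, here is a minimal sketch of the expected input types through the Python API (a sketch only, assuming the BLEU class and corpus_score method exposed from sacrebleu.metrics; the example sentences are made up):

from sacrebleu.metrics import BLEU

# Hypotheses: a list of segment strings.
hyps = ["The dog bit the man.", "It was not unexpected."]

# References: a list of reference streams; refs[i][j] is the i-th
# reference for the j-th segment. A bare string is no longer accepted.
refs = [
    ["The dog bit the man.", "It was not unexpected."],
    ["The dog had bit the man.", "No one was surprised."],
]

bleu = BLEU()
result = bleu.corpus_score(hyps, refs)
print(result.score)  # corpus BLEU, scaled into [0, 100]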

Tokenizers

  • Add caching to tokenizers, which seems to speed things up a bit in some cases
    (an illustrative sketch follows this list).
  • Add the regex dependency and use it in the V14International tokenizer.
    Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation (Speed up (w/ numpy) #46).
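
The tokenizer caching is essentially memoization of the tokenizer call. A minimal illustration of the idea (not the actual implementation; the class name is hypothetical):

from functools import lru_cache

class CachedTokenizer:
    @lru_cache(maxsize=None)
    def __call__(self, line: str) -> str:
        # Placeholder for the expensive regex-based tokenization of `line`.
        return " ".join(line.split())

# Identical segments (e.g. shared references across evaluations) are
# tokenized only once; subsequent calls hit the cache.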

Metrics

  • General performance improvements for BLEU and CHRF (Speed up (w/ numpy) #46).
  • Scale all metrics into the [0, 100] range (Scaling chrF and TER to 0-100 #140)
  • BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (Sentence BLEU yields non-intuitive scores #141).
  • CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  • CHRF: Added chrF+ support through the word_order argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order). The default is still chrF and not chrF++
    (Having better defaults for ChrF #124).
  • CHRF: Add the possibility to disable effective-order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained are exactly the same as the chrF++, Moses and NLTK
    implementations. I kept the effective ordering as the default since this only affects
    sentence-level scoring of very short sentences (chrF not compatible with chrF++, Moses
    and NLTK for sentence-level smoothing #144). A short API sketch of these options follows this list.
  • TER: Move tokenizer signatures to the metric itself for consistency with other metrics.
  • TER: Use string.translate for OP code mapping in one line of code.
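
As a quick illustration of the new chrF options through the Python API (a sketch; the word_order and eps_smoothing parameter names are assumed to mirror the CLI flags described above):

from sacrebleu.metrics import CHRF

hyps = ["The cat sat on the mat."]
refs = [["The cat sat on the mat."]]

chrf = CHRF()                 # plain chrF: character n-grams only
chrf_pp = CHRF(word_order=2)  # chrF++: adds word unigrams and bigrams
chrf_eps = CHRF(word_order=2, eps_smoothing=True)  # match chrF++.py/Moses/NLTK sentence scores

print(chrf_pp.corpus_score(hyps, refs).score)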

Metric API

  • Use explicit argument names and defaults for the metrics instead of passing argparse.Namespace.
  • Various sorts of refactoring in method names and arguments.
  • A base abstract Metric class is introduced to guide further metric development.
    This class defines the methods that should be implemented in the derived classes and
    offers boilerplate methods for the common functionality.
  • All metrics now receive an optional references argument at initialization time
    to process and cache the references. Further evaluations of different systems against
    the same references become faster this way, for example when using significance
    testing (see the sketch after this list).
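
A minimal sketch of the reference-caching pattern (assuming that passing None as the references to corpus_score reuses the ones cached at construction time; the system outputs are made up):

from sacrebleu.metrics import BLEU

refs = [["The dog bit the man.", "It was not unexpected."]]
system_a = ["The dog bit the man.", "It was not unexpected."]
system_b = ["The dog bites the man.", "It was not expected."]

# Tokenize and cache the references once...
bleu = BLEU(references=refs)

# ...then score several systems against the cached references.
for hyps in (system_a, system_b):
    print(bleu.corpus_score(hyps, None).score)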

Signatures

  • The signature formatting changed (mostly to remove the '+' separator, as it was interfering with chrF++).
  • The field separator is now '|', and keys and values are separated with ':' rather than '.'
    (see the parsing example after this list).
  • Boolean true/false values are shortened to yes/no; some other shortenings were applied as well.
  • The number of references is reported as var if a variable number of references is used.
  • Add effective order (yes/no) to the BLEU and chrF signatures.
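
Since the separators are now fixed to '|' and ':', a downstream script can parse a signature back into a dictionary in a couple of lines of Python, for example:

sig = "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0"
fields = dict(kv.split(":", 1) for kv in sig.split("|"))
print(fields["tok"])      # '13a'
print(fields["version"])  # '2.0.0'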

CLI

  • --input/-i can now ingest multiple systems. For this reason, the positional
    references should always precede the -i flag:
# Correct
$ sacrebleu ref -i sys
# Incorrect (will fail)
$ sacrebleu -i sys ref
  • Separate metric-specific arguments for clarity when --help is printed.
  • Allow modifying TER arguments through the CLI. We still keep the TERCOM defaults.
  • Prefix metric-specific arguments with --chrf and --ter. To maintain CLI and historical compatibility,
    I did not add --bleu prefixes to BLEU arguments.
  • When multiple metrics are requested, they are now aligned at '=' as follows:
 BLEU|case:mixed|nrefs:1|tok:13a|smooth:exp|version:2.0.0 = 23.2 55.4/28.6/17.1/10.6 (BP = 1.000 ratio = 1.059 hyp_len = 65345 ref_len = 61721)
chrF2|case:mixed|nrefs:1|nc:6|nw:0|space:no|version:2.0.0 = 52.6
  • Added a --format/-f flag. If a single system is evaluated, -f json will print the results
    in a parseable JSON format. Arguments such as --short and --score-only are ignored and
    the full information is dumped when -f json is given:
$ sacrebleu/sacrebleu.py ref -i sys
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

$ sacrebleu/sacrebleu.py ref -i sys -f json
{
 "name": "BLEU",
 "score": 20.8,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}

$ sacrebleu/sacrebleu.py ref -i sys -f json | jq .score
20.8

# Multiple metrics and JSON
$ sacrebleu/sacrebleu.py ref -i sys -f json -m bleu chrf
[
{
 "name": "BLEU",
 "score": 20.8,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
},
{
 "name": "chrF2",
 "score": 52.0,
 "signature": "nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0",
 "nrefs": "1",
 "case": "mixed",
 "eff": "yes",
 "nc": "6",
 "nw": "0",
 "space": "no",
 "version": "2.0.0"
}
]

# Iterate over the list and fetch the score key
$ sacrebleu/sacrebleu.py ref -i sys -f json -m bleu chrf | jq '.[] | .score'
20.8
52
  • In multi-system mode, json falls back to plain text. Other options such as latex,
    rst and html exist in this mode (more on this below).

Multi-system evaluation mode

  • sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way.
    Through the use of the tabulate package, the results are nicely rendered into a table
    in plain text, LaTeX, HTML or RST (cf. the --format/-f argument).

  • The systems can be given either as a list of plain text files to -i/--input or
    as a single tab-separated stream redirected into STDIN. In the former case,
    the basenames of the files are automatically used as system names.

  • If you give the same file twice, sacreBLEU will issue an error.

Explicit filenames:

$ sacrebleu/sacrebleu.py -t wmt17 -l cs-en -i newstest2017.* -m bleu chrf
sacreBLEU: Found 4 systems.
+-----------------------------------+--------+---------+
|                            System |  BLEU  |  chrF2  |
+===================================+========+=========+
|     newstest2017.online-A.0.cs-en |  25.1  |  53.4   |
+-----------------------------------+--------+---------+
|     newstest2017.online-B.0.cs-en |  27.4  |  54.5   |
+-----------------------------------+--------+---------+
|     newstest2017.PJATK.4760.cs-en |  23.2  |  52.6   |
+-----------------------------------+--------+---------+
| newstest2017.uedin-nmt.4955.cs-en |  30.9  |  56.8   |
+-----------------------------------+--------+---------+

-----------------
Metric signatures
-----------------
 - BLEU       nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
 - chrF2      nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0

Tab-separated STDIN:

$ paste newstest2017.* | sacrebleu/sacrebleu.py -t wmt17 -l cs-en -m bleu chrf
sacreBLEU: Found 4 systems.
+----------+--------+---------+
|   System |  BLEU  |  chrF2  |
+==========+========+=========+
| System 1 |  25.1  |  53.4   |
+----------+--------+---------+
| System 2 |  27.4  |  54.5   |
+----------+--------+---------+
| System 3 |  23.2  |  52.6   |
+----------+--------+---------+
| System 4 |  30.9  |  56.8   |
+----------+--------+---------+

LaTeX mode:

$ paste newstest2017.* | sacrebleu/sacrebleu.py -t wmt17 -l cs-en -m bleu chrf -f latex
sacreBLEU: Found 4 systems.
\begin{tabular}{rcc}
\toprule
   System &  BLEU  &  chrF2  \\
\midrule
 System 1 &  25.1  &  53.4   \\
 System 2 &  27.4  &  54.5   \\
 System 3 &  23.2  &  52.6   \\
 System 4 &  30.9  &  56.8   \\
\bottomrule
\end{tabular}

Single-system bootstrap confidence intervals (requires numpy) (#40 and #78)

  • 95% confidence intervals are provided only for the single-system evaluation mode.
    If you have multiple systems, we recommend using the paired tests, which provide
    both the confidence intervals and the p-values.

  • The feature is enabled by passing --confidence to the CLI. The default number
    of bootstrap resamples is 2000; this can be changed with the --confidence-n flag.
    (A sketch of the resampling idea follows the examples below.)

  • The random number generator's seed is fixed to 12345 by default. The seed
    can be modified by exporting the SACREBLEU_SEED environment variable.
    If the exported value is None (case-insensitive), the seed is left uninitialized,
    yielding non-deterministic results.

  • Unit tests are added to compare the results to Moses' significance Perl script.

Fixed seed:

$ sacrebleu/sacrebleu.py ref -i fbk --confidence -m bleu chrf
  BLEU|nrefs:1|bs:2000|seed:12345|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 26.3 (μ = 26.33 ± 0.65) 59.3/32.3/19.9/12.6 (BP = 1.000 ratio = 1.003 hyp_len = 61450 ref_len = 61287)
chrF2|nrefs:1|bs:2000|seed:12345|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0 = 54.7 (μ = 54.72 ± 0.48)

Random seed:

$ SACREBLEU_SEED=None sacrebleu/sacrebleu.py ref -i fbk --confidence -m bleu chrf ter
        BLEU|nrefs:1|bs:2000|seed:none|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 26.3 (μ = 26.30 ± 0.65) 59.3/32.3/19.9/12.6 (BP = 1.000 ratio = 1.003 hyp_len = 61450 ref_len = 61287)
      chrF2|nrefs:1|bs:2000|seed:none|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0 = 54.7 (μ = 54.72 ± 0.50)
TER|nrefs:1|bs:2000|seed:none|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 63.6 (μ = 63.63 ± 0.78)
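
For readers curious about what --confidence does under the hood: the bootstrap CI boils down to resampling segments with replacement and recomputing the corpus score many times. A minimal sketch of the idea (not the actual implementation; the helper name and the normal-approximation CI are illustrative):

import numpy as np
from sacrebleu.metrics import BLEU

def bootstrap_ci(hyps, refs, n_samples=2000, seed=12345):
    """Estimate the mean and an approximate 95% CI of corpus BLEU."""
    rng = np.random.default_rng(seed)
    bleu = BLEU()
    n = len(hyps)
    scores = []
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)  # resample segment indices with replacement
        sample_hyps = [hyps[i] for i in idx]
        sample_refs = [[stream[i] for i in idx] for stream in refs]
        scores.append(bleu.corpus_score(sample_hyps, sample_refs).score)
    scores = np.array(scores)
    return scores.mean(), 1.96 * scores.std()  # mean and ± half-width (normal approx.)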

Multi-system paired significance tests (#40 and #78)

  • When you have multiple systems to evaluate for a given test set and language pair,
    you can now use paired significance tests to obtain p-values.

  • The first system provided to --input/-i (or the first column of hypotheses
    if the pasted-STDIN method is used) will be flagged as the baseline system.
    When using --input/-i, sacreBLEU will automatically discard the baseline system
    if it appears more than once. This is useful when using shell globs.

  • Two types of paired tests are provided: bootstrap resampling (bs) and
    approximate randomization (ar). bs replicates the behavior of Moses'
    significance Perl script, whereas ar follows Multeval's approach to
    approximate randomization. The feature is enabled by passing one of
    these two methods to the --paired flag. (A sketch of the AR procedure
    follows the examples below.)

  • The default number of samples/trials for bs and ar are 2,000 and 10,000, respectively.
    This can be changed by using the --paired-n/-pan flag.

  • The bs test will also print a 95% CI around the true mean as additional information.
    To enable the same type of CIs for the AR test, pass --paired-ar-confidence-n 0,
    for example, to use the default value of 2000 resamples.

  • Verbose information printed during the tests can be disabled by --quiet.

Example of evaluating 16 WMT17 submissions with 2 metrics:

# the LIUM system will not be counted as a candidate system as it was given
# as the baseline
$ sacrebleu -t wmt17 -l en-de -i newstest2017.LIUM-NMT.4900.en-de newstest2017.* --paired bs -m bleu chrf
(Verbose messages suppressed)

+--------------------------------------------+-----------------------+-----------------------+
|                                     System |  BLEU / μ / ± 95% CI  | chrF2 / μ / ± 95% CI  |
+============================================+=======================+=======================+
| Baseline: newstest2017.LIUM-NMT.4900.en-de |  26.6 / 26.6 / 0.65   |  55.9 / 55.9 / 0.47   |
+--------------------------------------------+-----------------------+-----------------------+
|              newstest2017.C-3MA.4959.en-de |  22.7 / 22.7 / 0.61   |  52.0 / 52.0 / 0.46   |
|                                            |     (p = 0.0005)*     |     (p = 0.0005)*     |
+--------------------------------------------+-----------------------+-----------------------+
|                newstest2017.FBK.4870.en-de |  26.3 / 26.3 / 0.65   |  54.7 / 54.7 / 0.48   |
|                                            |     (p = 0.0945)      |     (p = 0.0005)*     |
+--------------------------------------------+-----------------------+-----------------------+
|                newstest2017.KIT.4950.en-de |  26.1 / 26.1 / 0.66   |  55.8 / 55.8 / 0.46   |
|                                            |     (p = 0.0105)*     |     (p = 0.1089)      |
+--------------------------------------------+-----------------------+-----------------------+
|   newstest2017.LMU-nmt-reranked.4934.en-de |  27.1 / 27.1 / 0.65   |  56.4 / 56.4 / 0.46   |
|                                            |     (p = 0.0070)*     |     (p = 0.0015)*     |
+--------------------------------------------+-----------------------+-----------------------+
|     newstest2017.LMU-nmt-single.4893.en-de |  26.6 / 26.5 / 0.66   |  55.9 / 55.9 / 0.44   |
|                                            |     (p = 0.3353)      |     (p = 0.4098)      |
+--------------------------------------------+-----------------------+-----------------------+
|              newstest2017.online-A.0.en-de |  20.8 / 20.8 / 0.59   |  52.0 / 52.0 / 0.43   |
|                                            |     (p = 0.0005)*     |     (p = 0.0005)*     |
+--------------------------------------------+-----------------------+-----------------------+
|              newstest2017.online-B.0.en-de |  26.7 / 26.7 / 0.67   |  56.3 / 56.3 / 0.45   |
|                                            |     (p = 0.3073)      |     (p = 0.0240)*     |
+--------------------------------------------+-----------------------+-----------------------+
|              newstest2017.online-F.0.en-de |  15.5 / 15.5 / 0.49   |  49.3 / 49.3 / 0.39   |
|                                            |     (p = 0.0005)*     |     (p = 0.0005)*     |
+--------------------------------------------+-----------------------+-----------------------+
|              newstest2017.online-G.0.en-de |  18.2 / 18.2 / 0.54   |  51.6 / 51.6 / 0.40   |
|                                            |     (p = 0.0005)*     |     (p = 0.0005)*     |
+--------------------------------------------+-----------------------+-----------------------+
|   newstest2017.PROMT-Rule-based.4735.en-de |  16.6 / 16.6 / 0.51   |  50.4 / 50.4 / 0.40   |
|                                            |     (p = 0.0005)*     |     (p = 0.0005)*     |
+--------------------------------------------+-----------------------+-----------------------+
|  newstest2017.RWTH-nmt-ensemble.4921.en-de |  26.0 / 26.0 / 0.66   |  55.6 / 55.6 / 0.45   |
|                                            |     (p = 0.0050)*     |     (p = 0.0120)*     |
+--------------------------------------------+-----------------------+-----------------------+
|            newstest2017.SYSTRAN.4847.en-de |  26.7 / 26.7 / 0.66   |  55.6 / 55.6 / 0.45   |
|                                            |     (p = 0.2144)      |     (p = 0.0085)*     |
+--------------------------------------------+-----------------------+-----------------------+
|           newstest2017.TALP-UPC.4834.en-de |  21.2 / 21.2 / 0.58   |  51.7 / 51.7 / 0.43   |
|                                            |     (p = 0.0005)*     |     (p = 0.0005)*     |
+--------------------------------------------+-----------------------+-----------------------+
|          newstest2017.uedin-nmt.4722.en-de |  28.3 / 28.3 / 0.70   |  57.7 / 57.7 / 0.46   |
|                                            |     (p = 0.0005)*     |     (p = 0.0005)*     |
+--------------------------------------------+-----------------------+-----------------------+
|                newstest2017.xmu.4910.en-de |  26.7 / 26.7 / 0.65   |  56.0 / 56.0 / 0.45   |
|                                            |     (p = 0.2409)      |     (p = 0.3058)      |
+--------------------------------------------+-----------------------+-----------------------+

------------------------------------------------------------
Paired bootstrap resampling test with 2000 resampling trials
------------------------------------------------------------
 - Each system is pairwise compared to Baseline: newstest2017.LIUM-NMT.4900.en-de.
   Actual system score / estimated true mean / 95% CI are provided for each metric.

 - Null hypothesis: the system and the baseline translations are essentially
   generated by the same underlying process. The p-value is roughly the probability
   of the absolute score difference (delta) between a system and the baseline occurring due to chance.

 - Assuming a significance threshold of 0.05, the null hypothesis can be rejected
   for p-values < 0.05 (marked with "*"). This means that the delta is unlikely to be
   attributable to chance, hence the system is significantly "different" from the baseline.
   Otherwise, the p-values are highlighted in red (if the terminal supports colors).

 - NOTE: Significance does not tell whether a system is "better" than the baseline but rather
   emphasizes the "difference" between the systems in terms of the replicability of the delta.

-----------------
Metric signatures
-----------------
 - BLEU       nrefs:1|bs:2000|seed:12345|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
 - chrF2      nrefs:1|bs:2000|seed:12345|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0
  • The above command takes ~45 seconds to run, which is quite acceptable. However,
    if TER is also enabled, it takes ~3.5 minutes to complete. Therefore, it is also
    possible to run the tests using multiple workers (only on Linux and Mac OS X).
    Passing 0 to --paired-jobs/-paj will launch as many workers as there are systems
    (up to the number of CPUs on the machine), whereas passing a value > 0 will set
    the number of workers in the pool explicitly. For the above example, the run takes
    ~40 seconds with 15 workers (one per candidate system, excluding the baseline).
    A sketch of the worker-pool idea follows.
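
Conceptually, the parallelization just distributes the per-system comparisons across a process pool. A minimal sketch of the idea (not the actual implementation; the function names are hypothetical):

import os
from functools import partial
from multiprocessing import Pool

def run_paired_test(system_name, baseline_hyps, refs):
    # Placeholder: compare `system_name` against the baseline with bs/ar.
    ...

def run_all(systems, baseline_hyps, refs, n_jobs=0):
    # 0 -> one worker per candidate system, capped at the CPU count.
    if n_jobs == 0:
        n_jobs = min(len(systems), os.cpu_count() or 1)
    worker = partial(run_paired_test, baseline_hyps=baseline_hyps, refs=refs)
    with Pool(n_jobs) as pool:
        return pool.map(worker, systems)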

Approximate randomization method:

# No CI estimation, just p-values
$ sacrebleu -t wmt17 -l en-de -i newstest2017.LIUM-NMT.4900.en-de newstest2017.* --paired ar -m bleu chrf

# With CI estimation using the default of 2000 bootstrap resamples
$ sacrebleu -t wmt17 -l en-de -i newstest2017.LIUM-NMT.4900.en-de newstest2017.* --paired ar --paired-ar-confidence-n 2000 -m bleu chrf
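
For clarity, the approximate randomization test randomly swaps the baseline and system hypotheses segment by segment and counts how often the score delta of a shuffled split is at least as large as the observed one. A minimal sketch of the idea (not the actual implementation; the helper name is hypothetical):

import numpy as np
from sacrebleu.metrics import BLEU

def paired_ar_pvalue(base_hyps, sys_hyps, refs, n_trials=10000, seed=12345):
    """Approximate randomization p-value for the corpus BLEU delta."""
    rng = np.random.default_rng(seed)
    bleu = BLEU()
    observed = abs(bleu.corpus_score(sys_hyps, refs).score
                   - bleu.corpus_score(base_hyps, refs).score)
    hits = 0
    for _ in range(n_trials):
        swap = rng.random(len(base_hyps)) < 0.5  # swap each pair with prob. 0.5
        shuf_a = [s if w else b for b, s, w in zip(base_hyps, sys_hyps, swap)]
        shuf_b = [b if w else s for b, s, w in zip(base_hyps, sys_hyps, swap)]
        delta = abs(bleu.corpus_score(shuf_a, refs).score
                    - bleu.corpus_score(shuf_b, refs).score)
        if delta >= observed:
            hits += 1
    return (hits + 1) / (n_trials + 1)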

SacreBLEU 2.0.0 performance tests

  • The tests are performed across 4 different systems using the API; this is
    where caching kicks in. (The reason why no-cache also speeds things up
    after the 1st evaluation is the caching in the tokenizers.) Timings are in seconds.
  • Comparing the 1st column of 1.5.1 with the 1st column of no-cache gives the vanilla
    performance difference between versions.
BLEU {}
 >    [1.5.1] 0.890 0.890 0.890 0.891  || mean: 0.890 -- median: 0.890 -- stdev: 0.000
 > [no-cache] 0.612 0.458 0.452 0.435  || mean: 0.489 -- median: 0.458 -- stdev: 0.071
 >   [cached] 0.604 0.315 0.310 0.303  || mean: 0.383 -- median: 0.315 -- stdev: 0.128
BLEU {'tokenize': 'intl'}
 >    [1.5.1] 4.048 4.039 4.039 4.072  || mean: 4.108 -- median: 4.048 -- stdev: 0.119
 > [no-cache] 0.516 0.426 0.424 0.389  || mean: 0.439 -- median: 0.426 -- stdev: 0.047
 >   [cached] 0.504 0.281 0.271 0.302  || mean: 0.339 -- median: 0.302 -- stdev: 0.095
BLEU {'tokenize': 'none', 'force': True}
 >    [1.5.1] 0.516 0.518 0.523 0.514  || mean: 0.516 -- median: 0.516 -- stdev: 0.004
 > [no-cache] 0.264 0.268 0.293 0.265  || mean: 0.273 -- median: 0.268 -- stdev: 0.012
 >   [cached] 0.257 0.158 0.153 0.152  || mean: 0.180 -- median: 0.158 -- stdev: 0.045
CHRF {}
 >    [1.5.1] 1.105 1.102 1.099 1.100  || mean: 1.101 -- median: 1.100 -- stdev: 0.002
 > [no-cache] 1.111 1.086 1.078 1.092  || mean: 1.092 -- median: 1.092 -- stdev: 0.012
 >   [cached] 1.050 0.660 0.659 0.644  || mean: 0.753 -- median: 0.660 -- stdev: 0.171
CHRF {'whitespace': True}
 >    [1.5.1] 1.221 1.222 1.223 1.218  || mean: 1.225 -- median: 1.222 -- stdev: 0.009
 > [no-cache] 1.250 1.280 1.250 1.234  || mean: 1.254 -- median: 1.250 -- stdev: 0.016
 >   [cached] 1.251 0.778 0.772 0.756  || mean: 0.889 -- median: 0.778 -- stdev: 0.209

Commits in this PR include:
- Return 0.0 BLEU if no matches occur (#141)
- Allow using epsilon smoothing (#144)
- Add multi-reference support
- Add chrF++ support through the word_order argument (#124)
- Separate out TER functionality into lib_ter.py
- Add a pre-packaged WMT17 EN-DE hyps/refs package for significance testing
- Significance testing tests: compare results to Moses and Multeval
- Add more docs for the significance part
@ozancaglayan (Collaborator Author):

Okay, I'll remove it then. For the second part, yes, we can make it 1000; my initial motivation was to make the estimation more robust, since in terms of speed there is not much difference between 1000 and 2000.

@mjpost (Owner) commented Jul 3, 2021

I'm trying to do some testing now.

How much confidence do you have in the AR and BSR implementations? Has anyone code-reviewed them? Just want to make sure we have the details right, since people will likely start using this!

@mjpost (Owner) commented Jul 3, 2021

Nitpicking now, but what do you think about removing the parens from the value in the JSON format?

  "confidence": "(μ = 42.8 ± 1.0)",

Separately, we could add confidence-mean and confidence-var fields?

(The former, with the parens, is the only thing we can't change once we release).

@ozancaglayan (Collaborator Author) commented Jul 3, 2021

The test/test_significance.py compares the results of:

  • BSR to the perl script from Moses (using a modified version of it with bs=2000 but that is not important for unit testing)
  • AR to multeval (https://github.com/jhclark/multeval)

I'm definitely not an expert but quite confident that they should be OK. If you have some people in mind, it would of course be better to let them review the code.

@mjpost (Owner) commented Jul 3, 2021

I think we can merge this, and process any additional changes on the main branch prior to the 2.0 release.

@ozancaglayan (Collaborator Author):

One last question: the README had plenty of examples demonstrating sacreBLEU with the old text output. Since the point of those examples is not the output format, do you think we can keep them by adding a note saying that textual output is assumed in the following examples?

Thank you!

@mjpost (Owner) commented Jul 3, 2021

Sure and maybe note "-f text" when you mention that?

@ozancaglayan (Collaborator Author):

okay, I am done with the README updates as well. Care to take a final look there?

@ozancaglayan (Collaborator Author):

@mjpost Mmm I can't merge this as Travis doesn't work...

@martinpopel (Collaborator):

You can click on "command line instructions" and follow the instructions. It says "If ... an automatic merge cannot be performed, you can perform a manual merge on the command line."

@ozancaglayan (Collaborator Author):

oh okay I thought that Travis would block that too

@ozancaglayan (Collaborator Author):

It still fails, maybe we should temporarily disable the limitation from repo settings?

remote: error: GH006: Protected branch update failed for refs/heads/master.
remote: error: Required status check "Travis CI - Pull Request" is expected. At least 1 approving review is required by reviewers with write access.

@ozancaglayan (Collaborator Author):

I think this is the story: https://daniel.haxx.se/blog/2021/06/14/bye-bye-travis-ci/
I can't access the settings of the repository, but we probably need to disable this and then, in the future, migrate to GitHub Actions instead of Travis CI.

@ozancaglayan merged commit 078c440 into master on Jul 18, 2021
@mjpost mentioned this pull request on Oct 19, 2022