no author information in key #41

Closed
ozancaglayan opened this issue Nov 19, 2018 · 24 comments

@ozancaglayan

Hello,

Thanks for this wonderful project, which I discovered this morning. I'm not sure whether this is related to sqlite3 lacking FTS support, but I have a problem with author names (both during search and in the returned keys):

$ bibsearch search Yinfei
$ bibsearch search "Sentence Encoder"
1. [unknown2018:universal] . 2018. "Universal Sentence Encoder for
   English". Proceedings of the 2018 Conference on Empirical Methods
   in Natural Language Processing: System Demonstrations.
   http://aclweb.org/anthology/D18-2029

Looking through the sqlite file, I see this:

D18-2029|unknown2018:universal|UNKNOWN|Universal Sentence Encoder for English|Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations|2018|@InProceedings{unknown2018:universal,
    author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",
    title = "Universal Sentence Encoder for English",
...
@mjpost
Owner

mjpost commented Nov 19, 2018

We're glad you liked it! This looks like an import problem. This query works for me, and I have the following entry in my db:

D18-2029|cer2018a:universal|Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray|Universal Sentence Encoder for English|Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations|2018|@InProceedings{cer2018a:universal,

but I have FTS support.

How did you import this entry? What happens if you run

bib add https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029.bib

@davvil, any guesses what might have caused this?

@ozancaglayan
Author

ozancaglayan commented Nov 19, 2018

Is there a way to quickly check whether sqlite is compiled with FTS support? In any case, FTS support should only come into play in later parts of the processing, no?

UPDATE: A new pybtex release came out yesterday, so I downgraded to 0.21, but this snippet from bibdb.py still fails for me: I have entry.persons but not entry.fields['author'].

    if not entry.key:
        return False
    if not entry.fields.get("author"):
        entry.fields["author"] = "UNKNOWN"

EDIT: Tried this and it seems that it has FTS support:

sqlite> WITH opts(n, opt) AS (
   ...>   VALUES(0, NULL)
   ...>   UNION ALL
   ...>   SELECT n + 1,
   ...>          sqlite_compileoption_get(n)
   ...>   FROM opts
   ...>   WHERE sqlite_compileoption_get(n) IS NOT NULL
   ...> )
   ...> SELECT opt
   ...> FROM opts
   ...> WHERE opt LIKE '%FTS%';
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5
$ rm -rf .bibsearch
$ bibsearch add https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029.bib
Added 1 entries, skipped 0 duplicates. Skipped 0 files
$ bibsearch find Yinfei
$ bibsearch print
@InProceedings{unknown2018:universal,
    author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",
    title = "Universal Sentence Encoder for English",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "169--174",
    location = "Brussels, Belgium",
    url = "http://aclweb.org/anthology/D18-2029",
    author = "UNKNOWN",
    original_key = "D18-2029"
}
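
For anyone who would rather run the FTS check from Python than from the sqlite3 shell, here is a minimal sketch. It reads the compile options of the SQLite library that Python's standard sqlite3 module is linked against, which may differ from the library behind the sqlite3 command-line tool. The helper name sqlite_fts_options is made up for illustration.

```python
import sqlite3


def sqlite_fts_options():
    """Return the FTS-related compile options of the linked SQLite library.

    Hypothetical helper (not part of bibsearch). Uses the documented
    PRAGMA compile_options, which lists one option per row.
    """
    conn = sqlite3.connect(":memory:")
    try:
        rows = conn.execute("PRAGMA compile_options").fetchall()
    finally:
        conn.close()
    # Each row is a 1-tuple such as ('ENABLE_FTS5',)
    return sorted(opt for (opt,) in rows if "FTS" in opt)


if __name__ == "__main__":
    options = sqlite_fts_options()
    if options:
        print("FTS-related compile options:", ", ".join(options))
    else:
        print("No FTS-related compile options found")
```

An empty result would suggest that the SQLite build used by Python has no FTS support compiled in.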

@davvil
Collaborator

davvil commented Nov 20, 2018 via email

@mjpost
Owner

mjpost commented Nov 20, 2018

Yes, it is quite strange to see two author fields. I agree it looks like a parsing error. Do you have the time and interest to debug this? And can you provide more details about your environment (OS, Python version, etc.)?

@ozancaglayan
Author

Oh, I didn't see the second author field above. Yes, let me dig into it a little. Sorry for repeatedly editing my previous comment; you probably did not receive separate notifications for those edits. I tried with pybtex 0.21 and 0.22 and got the same result.

@mjpost
Owner

mjpost commented Nov 20, 2018

No need to apologize—we're happy to have someone point out a bug and go through the work of trying to fix it. I think it should be easy to track down: either pybtex parsing is broken (which would be strange, since this entry is fairly standard), or our code is broken. I'm curious what pybtex.Entry items look like here after parsing.

@ozancaglayan
Author

ozancaglayan commented Nov 20, 2018

I think something very weird is going on. I also tried on my desktop and got the same issue. The problem is this: pybtex.Entry never has an author field for me, and that's why the code injects an UNKNOWN author. For me, all authors are inside entry.persons['author']. But when the code asks for a pretty print of the entry, the author = line is there. This is also how pybtex documents it; see https://docs.pybtex.org/api/parsing.html

>>> from pybtex.database import parse_file
>>> bib_data = parse_file('../examples/tugboat/tugboat.bib')
>>> print(bib_data.entries['Knuth:TB8-1-14'].fields['title'])
Mixing right-to-left texts with left-to-right texts
>>> for author in bib_data.entries['Knuth:TB8-1-14'].persons['author']:
...     print(unicode(author))
Knuth, Donald
MacKay, Pierre

This makes me wonder whether we are using two completely different versions of pybtex, e.g. an old fork that provided the authors within fields versus the one that gets installed (for me) through pip, which does not seem to provide this.

@mjpost
Owner

mjpost commented Nov 20, 2018 via email

@ozancaglayan
Author

Saved it into a local file foo.tex and then:

(base) [silver] ~ $ ipython -i `which bibsearch` -- add foo.tex 
Python 3.6.6 |Anaconda custom (64-bit)| (default, Oct  9 2018, 12:34:16) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
  0%|                                                  | [Elapsed: 00:00 ETA: ?]> /home/caglayan/git/bibsearch/bibsearch/bibsearch.py(330)_add_file()
    329         ipdb.set_trace()
--> 330         if db.add(entry):
    331             added += 1

ipdb> 'author' in entry.fields
False
ipdb> entry
Entry('inproceedings', fields=[('title', 'Universal Sentence Encoder for English'), ('booktitle', 'Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations'), ('year', '2018'), ('publisher', 'Association for Computational Linguistics'), ('pages', '169--174'), ('location', 'Brussels, Belgium'), ('url', 'http://aclweb.org/anthology/D18-2029')], persons=OrderedCaseInsensitiveDict([('author', [Person('Cer, Daniel'), Person('Yang, Yinfei'), Person('Kong, Sheng-yi'), Person('Hua, Nan'), Person('Limtiaco, Nicole'), Person('St. John, Rhomni'), Person('Constant, Noah'), Person('Guajardo-Cespedes, Mario'), Person('Yuan, Steve'), Person('Tar, Chris'), Person('Strope, Brian'), Person('Kurzweil, Ray')])]))

@ozancaglayan
Author

ozancaglayan commented Nov 20, 2018

Can you try to run this file?

#!/usr/bin/env python
import pybtex.database


BIBTEX="""\
   @InProceedings{D18-2029,
      author = {Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray},
      title = {Universal Sentence Encoder for English},
      booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
      year = {2018},
      publisher = {Association for Computational Linguistics},
      pages = {169--174},
      location = {Brussels, Belgium},
      url = {http://aclweb.org/anthology/D18-2029}
    }"""

if __name__ == '__main__':
    print('Pybtex version: ', pybtex.__version__)

    library = pybtex.database.parse_string(BIBTEX, 'bibtex')
    entry = library.entries['D18-2029']
    print('author in entry.fields?', 'author' in entry.fields)
    print('author in entry.persons?', 'author' in entry.persons)

Output:

Pybtex version:  0.22.0
author in entry.fields? False
author in entry.persons? True

@mjpost
Owner

mjpost commented Nov 20, 2018

Pybtex version:  0.21
author in entry.fields? False
author in entry.persons? True

@ozancaglayan
Author

Then how do you avoid ending up with an UNKNOWN author field? I don't get it, or maybe I'm missing something about the way the code works.

    def add(self, entry: pybtex.Entry):
        """ Returns if the entry was added or if it was a duplicate"""

        # TODO: make this a better sanity checking and perhaps report errors
        if not entry.key:
            return False
        if not entry.fields.get("author"):
            entry.fields["author"] = "UNKNOWN"

@mjpost
Owner

mjpost commented Nov 20, 2018

Yes, I don't understand either. I tried downgrading to pybtex 0.20.0 and even 0.19.0, but still get False on author in entry.fields, even when I change the bib file to one that was imported correctly just yesterday.

I'll have to look into this later tonight, or maybe @davvil has an idea. This is strange.

@ozancaglayan
Author

If you remove your already generated bibdb, can you still add this entry correctly with author information?

@mjpost
Owner

mjpost commented Nov 20, 2018 via email

@ozancaglayan
Author

Looking through the pybtex code, I see that the author field is never carried along as a string; it is parsed directly into Person objects. So I still can't see how the code paths in bibsearch that access entry.fields['author'] and parse it with parse_names could work.
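
To make the mismatch concrete, here is a rough sketch of how an author string could be rebuilt from entry.persons when entry.fields has no author. This is not the fix that was later committed to bibsearch; the helper name author_string is hypothetical, and the stand-in entry only mimics the .fields and .persons attributes seen in the ipdb output above, so the snippet runs without pybtex installed. It assumes that str() on a pybtex Person yields the "Last, First" form, as in the documentation excerpt quoted earlier.

```python
from types import SimpleNamespace


def author_string(entry, default="UNKNOWN"):
    """Return a BibTeX-style author string for an entry-like object.

    Hypothetical helper: prefers an explicit fields['author'], then falls
    back to joining persons['author'], and only then to the default.
    """
    explicit = entry.fields.get("author")
    if explicit:
        return explicit
    people = entry.persons.get("author", [])
    if people:
        # BibTeX separates multiple authors with the word "and"
        return " and ".join(str(person) for person in people)
    return default


# Stand-in for a parsed entry (assumption: plain strings replace Person objects)
entry = SimpleNamespace(
    fields={"title": "Universal Sentence Encoder for English"},
    persons={"author": ["Cer, Daniel", "Yang, Yinfei"]},
)
print(author_string(entry))  # prints: Cer, Daniel and Yang, Yinfei
```

With a check like this, the UNKNOWN placeholder would only be used when neither fields nor persons carries any author.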

@ozancaglayan
Author

@mjpost
Owner

mjpost commented Nov 27, 2018

Sorry about the delay—I'll pick this up after NAACL.

@davvil
Collaborator

davvil commented Dec 6, 2018

Again, sorry about the delay. I was able to reproduce the issue on another computer and I have committed a fix for it. Please try the current master on GitHub, which should address this issue. After the three of us do some testing, we should update the pip package ASAP.

I am also baffled as to why it worked before. Perhaps we were relying on a byproduct of the parsing itself.

@ozancaglayan
Author

It seems to work on my side for the specific example above.

$ bibsearch add https://aclanthology.coli.uni-saarland.de/papers/D18-2029/d18-2029.bib
100%|██████████████████████████████████████████████| [Elapsed: 00:00 ETA: 00:00]
Added 1 entries, skipped 0 duplicates. Skipped 0 files

$ bibsearch print
@InProceedings{D18-2029,
    author = "Cer, Daniel and Yang, Yinfei and Kong, Sheng-yi and Hua, Nan and Limtiaco, Nicole and St. John, Rhomni and Constant, Noah and Guajardo-Cespedes, Mario and Yuan, Steve and Tar, Chris and Strope, Brian and Kurzweil, Ray",
    title = "Universal Sentence Encoder for English",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "169--174",
    location = "Brussels, Belgium",
    url = "http://aclweb.org/anthology/D18-2029",
    original_key = "D18-2029"
}

$ bibsearch find Yinfei
1. [D18-2029] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole
   Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes,
   Steve Yuan, Chris Tar, Brian Strope and Ray Kurzweil. 2018.
   "Universal Sentence Encoder for English". Proceedings of the 2018
   Conference on Empirical Methods in Natural Language Processing:
   System Demonstrations. http://aclweb.org/anthology/D18-2029

@davvil
Collaborator

davvil commented Dec 7, 2018

I just fixed the key generation (it previously took the original key) and I also fixed an error when importing entries with unknown macros. It seems to work quite well now. @mjpost what do you think? Can you update PyPI?

@mjpost
Owner

mjpost commented Dec 7, 2018

Sure, can you bump the version and add to the change log? Then I'll push.

@davvil
Collaborator

davvil commented Dec 10, 2018

Done! We are now at version π.

@mjpost
Owner

mjpost commented Dec 10, 2018

Pushed to PyPI.

@mjpost mjpost closed this as completed Dec 10, 2018