Extract a sample of the BAS (Bulgarian Academy of Sciences) lexical database #1

Open
miglen opened this issue Jul 18, 2018 · 2 comments

Comments


miglen commented Jul 18, 2018

Source: http://ibl.bas.bg/lib/
Database: http://ibl.bas.bg/leksikalna-baza-danni/
Neologisms: http://ibl.bas.bg/infolex/neologisms.php

miglen changed the title from "Extract a sample of the words from the BAS electronic publications" to "Extract a sample of the BAS lexical database" on Jul 18, 2018

miglen commented Jul 18, 2018

Neologisms:

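python3 - <<'EOF'
import re
import requests

# Bulgarian alphabet (30 letters)
bg_alphabet = "абвгдежзийклмнопрстуфхцчшщъьюя"
for letter in bg_alphabet:
    # Query the neologisms search form once per letter.
    r = requests.post("http://ibl.bas.bg/infolex/neologisms.php",
                      data={'search_param': 'all', 'word': letter}, timeout=30)
    r.raise_for_status()
    # The regex expects entries of the form "<dt>N. headword <small>".
    neologisms = re.findall(r'<dt>[0-9]{1,5}\. (.+?) <small>', r.text)
    for neolog in neologisms:
        print(neolog)
EOF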
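
The regex in the script above can be checked without hitting the server by keeping the parsing separate from the fetching. This is only a minimal sketch: `extract_neologisms` is a hypothetical helper, and the sample HTML is made up to match what the regex expects, not copied from the site.

```python
import re

# Same pattern as in the script above: "<dt>N. headword <small>"
NEOLOGISM_RE = re.compile(r'<dt>[0-9]{1,5}\. (.+?) <small>')

def extract_neologisms(html):
    """Return the headwords found in one result page (hypothetical helper)."""
    return NEOLOGISM_RE.findall(html)

# Made-up markup for illustration only; the real page may differ.
sample_html = (
    '<dt>1. блог <small>м.</small></dt><dd>...</dd>'
    '<dt>2. блогър <small>м.</small></dt><dd>...</dd>'
)
print(extract_neologisms(sample_html))
```

If the real page is saved to a file, the same function can be run on its contents to see whether the pattern still matches.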


miglen commented Jul 18, 2018

Idioms: http://ibl.bas.bg/infolex/idioms.php

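python3 - <<'EOF'
import re
import requests

# Bulgarian alphabet (30 letters)
bg_alphabet = "абвгдежзийклмнопрстуфхцчшщъьюя"
for letter in bg_alphabet:
    # Query the idioms search form once per letter.
    r = requests.post("http://ibl.bas.bg/infolex/idioms.php",
                      data={'search_param': 'all', 'word': letter}, timeout=30)
    r.raise_for_status()
    # Capture the idiom text between the entry number and the closing </dt>.
    all_idioms = re.findall(r'<br/><dt>[0-9]{1,5}\. (.+?)</dt><dd>', r.text)
    for idiom in all_idioms:
        # Split each idiom into words and keep those of 3+ characters.
        for sub_idiom in re.findall(r'\w+', idiom.lower()):
            if len(sub_idiom) >= 3:
                print(sub_idiom)
EOF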
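
The same word can appear in many idioms, so the script above will print duplicates. Below is a rough sketch of collecting unique words instead. It assumes the same markup as the script's regex; `idiom_words`, its `min_len` parameter, and the sample HTML are hypothetical and only for illustration.

```python
import re

# Equivalent to the pattern used in the script above.
IDIOM_RE = re.compile(r'<br/><dt>[0-9]{1,5}\. (.+?)</dt><dd>')

def idiom_words(html, min_len=3):
    """Return the set of lowercase words of at least min_len characters."""
    words = set()
    for idiom in IDIOM_RE.findall(html):
        for word in re.findall(r'\w+', idiom.lower()):
            if len(word) >= min_len:
                words.add(word)
    return words

# Made-up markup for illustration only; the real page may differ.
sample_html = (
    '<br/><dt>1. Бия тъпана</dt><dd>...</dd>'
    '<br/><dt>2. Бия на очи</dt><dd>...</dd>'
)
print(sorted(idiom_words(sample_html)))
```

Merging the sets returned for each letter would give one deduplicated word list for the whole alphabet.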
