
Keyword extraction models for Russian

Datasets

Taken from https://github.com/mannefedov/ru_kw_eval_datasets:

habr -- HabraHabr https://habr.com/

ng -- Nezavisimaya Gazeta (Независимая Газета) http://www.ng.ru/

rt -- Russia Today https://russian.rt.com/

cl -- Cyberleninka https://cyberleninka.ru/

Preprocessing options (nltk, pymorphy):

tokenization
lemmatization
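extracting nouns and adjectives in the nominative case, keeping adjective-noun agreement: "ясная ночь" ("clear night"), not "ясный ночь" (the masculine adjective lemma next to a feminine noun)
splitting into n-grams (n=2)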
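The n-gram step above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the real pipeline relies on nltk and pymorphy, while the regex tokenizer and the function names `tokenize` and `ngrams` here are assumptions made for the example, and lemmatization is skipped entirely.

```python
import re


def tokenize(text):
    """Lowercase the text and return word tokens (hyphenated words stay whole).

    Assumption: a plain regex stands in for the nltk tokenizer.
    """
    return re.findall(r"[^\W_]+(?:-[^\W_]+)*", text.lower())


def ngrams(tokens, n=2):
    """Return all contiguous n-grams of the token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Ясная ночь над тихой рекой")
print(ngrams(tokens, n=2))
```

With n=2 every pair of adjacent tokens becomes a candidate phrase; filtering those pairs by part of speech and case would be the job of the morphological analyzer.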

Further improvements: implement udpipe for tokenization and lemmatization, combine with pymorphy to extract

Approaches

Simple TFIDF method
SCAKE graph method https://arxiv.org/pdf/1811.10831v1.pdf
NN approach (in progress)
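As a rough illustration of the TF-IDF idea, the sketch below ranks the terms of one document against a small corpus. It assumes documents are already tokenized and lemmatized, uses raw term frequency with an unsmoothed logarithmic IDF, and the function name `tfidf_keywords` is hypothetical; the weighting actually used in this repository may differ.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of docs[doc_index] by TF-IDF and return the top_k terms."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    counts = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


docs = [
    ["keyword", "extraction", "russian", "keyword"],
    ["graph", "method", "keyword"],
    ["russian", "news", "dataset"],
]
print(tfidf_keywords(docs, 0, top_k=2))
```

A term that is frequent in one document but rare in the rest of the corpus scores highest, which is why "extraction" can outrank the more frequent "keyword" in the toy corpus.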

Training and evaluation mode

usage: model_trainer.py [-h] [-d [{all,rt,ng,habr,cl}]] [-o OUTPUT_PATH]
                        [-m [{tfidf,scake}]]

Keyword extractor

optional arguments:
  -h, --help            show this help message and exit
  -d [{all,rt,ng,habr,cl}], --dataset [{all,rt,ng,habr,cl}]
  -o OUTPUT_PATH, --output OUTPUT_PATH
  -m [{tfidf,scake}], --model [{tfidf,scake}]
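For readers who want to see how such an interface is typically declared, here is an argparse sketch that produces the same option signature as the help text above. It is a reconstruction from that help output only: the default values ("all" and "tfidf") and the function name `build_parser` are assumptions, and the real model_trainer.py may be written differently.

```python
import argparse

DATASETS = ["all", "rt", "ng", "habr", "cl"]
MODELS = ["tfidf", "scake"]


def build_parser():
    """Build a parser whose options mirror the usage text shown above."""
    parser = argparse.ArgumentParser(
        prog="model_trainer.py", description="Keyword extractor"
    )
    # nargs="?" makes the value optional, which renders as [{...}] in the help.
    # The defaults below are assumptions, not taken from the project.
    parser.add_argument("-d", "--dataset", nargs="?", choices=DATASETS, default="all")
    parser.add_argument("-o", "--output", dest="output_path")
    parser.add_argument("-m", "--model", nargs="?", choices=MODELS, default="tfidf")
    return parser


args = build_parser().parse_args(["-d", "ng", "-m", "scake", "-o", "results"])
print(args.dataset, args.model, args.output_path)
```

The argument list passed to `parse_args` corresponds to an invocation such as `model_trainer.py -d ng -m scake -o results`, where "results" is only a placeholder output path.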

Results on NG dataset

TFIDF

Metric	Value
Precision	0.1385
Recall	0.2649
F1	0.1733
Jaccard	0.1014

SCAKE

(much slower than tfidf)

Metric	Value
Precision	0.1717
Recall	0.2833
F1	0.2021
Jaccard	0.1199
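Both result tables report the same four metrics. The sketch below shows one common way to compute them from a predicted and a gold keyword set and to average them over documents. Exact string matching and per-document (macro) averaging are assumptions here; the latter is suggested by the fact that the reported F1 values are not the harmonic mean of the reported precision and recall. The function names are hypothetical.

```python
def keyword_metrics(predicted, gold):
    """Precision, recall, F1 and Jaccard for one document (exact-match sets)."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    union = len(predicted | gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    jaccard = overlap / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


def macro_average(pairs):
    """Average each metric over a non-empty list of (predicted, gold) pairs."""
    scores = [keyword_metrics(p, g) for p, g in pairs]
    return {name: sum(s[name] for s in scores) / len(scores) for name in scores[0]}


print(keyword_metrics(["a", "b", "c", "d"], ["a", "b", "e"]))
```

Under macro averaging, F1 is computed per document and then averaged, so the averaged F1 can be lower than the harmonic mean of the averaged precision and recall.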

NN

The model is implemented and can be trained, but it does not perform well yet. Further investigation is needed.

Implementation mode

The file "main.py" contains a text that is normalized; keywords are then extracted from it by the different approaches in parallel, using Celery.
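The fan-out pattern described here (normalize once, then run several extractors at the same time) can be illustrated without a task queue. The sketch below deliberately swaps Celery for the standard library's `concurrent.futures`, because Celery needs a broker and worker processes to run; the normalization and both extractors are toy placeholders, and none of the names come from main.py.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def normalize(text):
    """Placeholder normalization: lowercase and split on whitespace."""
    return text.lower().split()


def by_frequency(tokens, top_k=2):
    """Placeholder extractor: the most frequent tokens."""
    counts = Counter(tokens)
    return sorted(counts, key=lambda t: (-counts[t], t))[:top_k]


def by_length(tokens, top_k=2):
    """Placeholder extractor: the longest distinct tokens."""
    return sorted(set(tokens), key=lambda t: (-len(t), t))[:top_k]


EXTRACTORS = {"frequency": by_frequency, "length": by_length}


def extract_all(text):
    """Normalize the text once, then run every extractor concurrently."""
    tokens = normalize(text)
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, tokens) for name, fn in EXTRACTORS.items()}
        return {name: future.result() for name, future in futures.items()}


print(extract_all("graph graph method extraction"))
```

In a Celery setup each extractor would instead be a registered task dispatched to workers, but the overall shape (one shared normalized input, one result per approach) stays the same.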

lilaspourpre/kw_extraction

Folders and files

Latest commit

History

Repository files navigation

Keyword extraction models for Russian

Datasets

Preprocessing options (nltk, pymorphy):

Approaches

Training and evaluation mode

Results on NG dataset

TFIDF

SCAKE

NN

Implementation mode

About

Resources

Stars

Watchers

Forks

Languages