Python impl for TextRank

A pure Python implementation of TextRank, based on the Mihalcea 2004 paper. This work leads toward integration with the Williams 2016 talk on text summarization.

Modifications to the original Mihalcea algorithm include:

fixed bug; see Java impl, 2008
use of lemmatization instead of stemming
verbs included in the graph (but not in the resulting keyphrases)
normalized keyphrase ranks used in summarization
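
For orientation, the core TextRank idea from the Mihalcea paper (a word co-occurrence graph ranked with PageRank) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in textrank.py: the function names `build_graph` and `pagerank`, the window size, and the sample tokens are all assumptions made for the example, and the tokens are assumed to be lemmatized already.

```python
from collections import defaultdict


def build_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                # Undirected, unweighted edge between co-occurring words.
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected, unweighted graph."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_ranks = {}
        for node in graph:
            # Each neighbor shares its rank equally among its own neighbors.
            incoming = sum(ranks[nb] / len(graph[nb]) for nb in graph[node])
            new_ranks[node] = (1.0 - damping) / n + damping * incoming
        ranks = new_ranks
    return ranks


# Hypothetical, already-lemmatized tokens; verbs such as "solve" stay in the graph.
tokens = ["linear", "constraint", "system", "linear",
          "equation", "system", "solve", "equation"]
ranks = pagerank(build_graph(tokens))

for word, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{score:.4f}\t{word}")
```

In this sketch every node has at least one neighbor, so the ranks sum to 1.0 after each iteration. Keeping verbs in the graph but dropping them from the final keyphrases would be a filtering step applied after ranking.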

Dependencies and Installation

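This code depends on several other Python projects, all of which appear in the install commands below: TextBlob (with textblob-aptagger), NLTK, NetworkX, statistics, datasketch, graphviz, and matplotlib.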

To install:

conda config --add channels https://conda.binstar.org/sloria
conda install textblob
pip install -U git+https://github.com/sloria/textblob-aptagger.git@dev
sudo python -m nltk.downloader punkt
sudo python -m nltk.downloader wordnet
sudo python -m textblob.download_corpora
pip install networkx
pip install statistics
pip install datasketch -U
pip install graphviz
pip install matplotlib

Example Usage

Run a test case based on the Mihalcea paper:

./stage1.py dat/mih.json > out1.json
./stage2.py out1.json > out2.json

That test case should produce the following ranked keyphrases:

0.0956	types systems
0.0627	nonstrict inequations
0.0622	minimal supporting set
0.0596	mixed types
0.0571	strict inequations
0.0568	natural numbers
0.0568	minimal set
0.0545	linear diophantine equations
0.0539	linear constraints
0.0528	corresponding algorithms
0.0474	upper bounds
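
If you want to post-process a ranking like the one above, a small parser is enough. The sketch below assumes the lines are available as plain text in the form shown (a rank, whitespace, then the phrase); the helper name `parse_ranks` and the 0.06 cutoff are hypothetical and not part of this project.

```python
def parse_ranks(text):
    """Parse lines of '<rank><whitespace><phrase>' into (rank, phrase) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only once so multi-word phrases stay intact.
        rank, phrase = line.split(None, 1)
        pairs.append((float(rank), phrase))
    return pairs


# Three lines copied from the expected output above.
sample = "0.0956\ttypes systems\n0.0627\tnonstrict inequations\n0.0474\tupper bounds\n"
pairs = parse_ranks(sample)

# Keep only phrases at or above an arbitrary, illustrative cutoff.
top = [phrase for rank, phrase in pairs if rank >= 0.06]
print(top)
```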

Run another test based on Williams, using text from a Wired article:

./stage1.py dat/ars.json > out1.json
./stage2.py out1.json > out2.json
./stage3.py out1.json out2.json > out3.json
./stage4.py out2.json out3.json > out4.md

This produces the following summary:

excerpts: After more than four hours of tight play and a rapid-fire endgame, Google's artificially intelligent Go-playing computer system has won a second contest against grandmaster Lee Sedol, taking a two-games-to-none lead in their historic best-of-five match in downtown Seoul. The surprisingly skillful Google machine, known as AlphaGo, now needs only one more win to claim victory in the match. The Korean-born Lee Sedol will go down in defeat unless he takes each of the match's last three games. Lee Sedol is widely-regarded as the top Go player of the last decade, after winning more international titles than all but one other player. Although AlphaGo topped Lee Sedol in the match's first game on Wednesday afternoon, the outcome of Game Two was no easier to predict.

keywords: second game; all-important match; more win; seasons hotel; grandmaster lee sedol; alphago technique; wednesday afternoon; skillful google machine; downtown seoul; saturday afternoon; first time; first game; lee sedol

These results show a summarization similar to slide 30 of the talk; however, this approach is more amenable to:

bootstrapping work with new documents about a specific topic
producing results ready for use in a search engine or recommender system
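
One simple way to picture how normalized keyphrase ranks can drive an extractive summary: score each sentence by the ranks of the keyphrases it contains, then keep the highest-scoring sentences in document order. The sketch below illustrates that general idea only and is not necessarily what stage3.py and stage4.py do; the function names, the sample sentences, and the raw rank values are hypothetical.

```python
def normalize(ranks):
    """Scale keyphrase ranks so that they sum to 1.0."""
    total = sum(ranks.values())
    if total == 0:
        return dict(ranks)
    return {phrase: rank / total for phrase, rank in ranks.items()}


def top_sentences(sentences, ranks, limit=2):
    """Score sentences by the keyphrases they contain and keep the best ones."""
    weights = normalize(ranks)
    scored = []
    for position, sentence in enumerate(sentences):
        text = sentence.lower()
        score = sum(w for phrase, w in weights.items() if phrase in text)
        scored.append((score, position))
    # Highest score first; earlier sentences win ties.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:limit]
    # Restore document order so the excerpt reads naturally.
    return [sentences[position] for _, position in sorted(best, key=lambda pair: pair[1])]


# Hypothetical input: made-up sentences and raw ranks, for illustration only.
sentences = [
    "AlphaGo won the first game against Lee Sedol.",
    "The match took place in downtown Seoul.",
    "Commentators were surprised.",
]
ranks = {"lee sedol": 0.3, "first game": 0.2, "downtown seoul": 0.1}

excerpt = top_sentences(sentences, ranks, limit=2)
print(" ".join(excerpt))
```

Because the weights are normalized, sentence scores stay comparable across documents with different numbers of keyphrases.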

TODO: Stay tuned for more...

NB: the output is encoded in case the input contains characters that could not otherwise be handled, so it may require some post-processing for your use case.

Integrate sent2vec encoder
LSH for building doc-to-doc graph of semantic similarity (per chapter-ish)
Docker container for managing the installation/dependencies
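
As background for the LSH item above, the signature step behind locality-sensitive hashing for text is MinHash: two documents with similar sets of word shingles tend to get similar signatures. The sketch below is a standard-library illustration of that one step, not this project's code and not the datasketch API; the function names, shingle size, and signature length are assumptions. A full LSH index would additionally group signatures into bands and buckets.

```python
import hashlib


def shingles(text, size=3):
    """Return the set of word n-grams (shingles) in `text`."""
    words = text.lower().split()
    count = max(1, len(words) - size + 1)
    return {" ".join(words[i : i + size]) for i in range(count)}


def seeded_hash(item, seed):
    """Derive one of many hash functions by prefixing the seed to the input."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(items, num_perm=64):
    """Keep the minimum hash value per seeded hash function."""
    return [min(seeded_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical documents, for illustration only.
doc_a = "the go playing system won a second game against the grandmaster"
doc_b = "the go playing system won a second game against the champion"
doc_c = "linear constraints over natural numbers and strict inequations"

sig_a, sig_b, sig_c = (minhash(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimate_jaccard(sig_a, sig_b), estimate_jaccard(sig_a, sig_c))
```

Documents whose estimated similarity exceeds some threshold would then be linked by an edge in the doc-to-doc graph.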

Kudos

@htmartin @williamsmj @mattkohl @HarshGrandeur @mnowotka
