Merged
3 changes: 3 additions & 0 deletions .gitignore
@@ -10,3 +10,6 @@ dist
venv
htmlcov
*.swp
+venv36
+metadoc/extract/data/*
+.pytest_cache
8 changes: 7 additions & 1 deletion README.md
@@ -107,6 +107,13 @@ python metadoc/__install__.py
python serve.py => serving @ 6060
```

+## Test
+```shell
+py.test -v tests
+```
+If you happen to run into an error with OSX 10.11 concerning a lazy bound library in PIL,
+just remove `/PIL/.dylibs/liblzma.5.dylib`.
+
## Todo
* Page concatenation is needed in order to properly calculate wordcount and reading time.
* Authenticity heuristic with sharecount deviance detection (requires state).
@@ -118,4 +125,3 @@ python serve.py => serving @ 6060
Metadoc stems from a pedigree of nice libraries like [libextract](https://github.com/datalib/libextract), [langdetect](https://github.com/Mimino666/langdetect) and [nltk](https://github.com/nltk/nltk).
Metadoc leans on [this](https://github.com/hankcs/AveragedPerceptronPython) perceptron implementation inspired by Matthew Honnibal.
Metadoc is work-in-progress and maintained by [@___paul](https://twitter.com/___paul)

2 changes: 1 addition & 1 deletion metadoc/__init__.py
@@ -181,7 +181,7 @@ def _request_url(self):

req = requests.get(url, headers={
'Accept-Encoding': 'identity, gzip, deflate, *',
-'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
+'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'
})

if req.status_code != 200:
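A minimal sketch of the request setup touched by this hunk, for anyone who wants to inspect the outgoing headers without hitting the network. It uses the standard library's `urllib.request` instead of `requests` (a deliberate swap to keep the example dependency-free). `build_request`, `is_ok`, `CRAWLER_UA`, and the example URL are hypothetical names for illustration, not part of metadoc.

```python
import urllib.request

# Header values copied from the hunk above; the constant name is hypothetical.
CRAWLER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def build_request(url, user_agent=CRAWLER_UA):
    """Build (but do not send) a GET request carrying the headers from the diff."""
    return urllib.request.Request(url, headers={
        "Accept-Encoding": "identity, gzip, deflate, *",
        "User-Agent": user_agent,
    })


def is_ok(status_code):
    """Mirror the `status_code != 200` guard: only a plain 200 counts as success."""
    return status_code == 200


req = build_request("https://example.com/article")
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

Keeping the user agent as a parameter makes it easy to compare how a given site responds to a crawler string versus a browser string before settling on one.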
2 changes: 1 addition & 1 deletion serve.py
@@ -58,7 +58,7 @@ def full_article():
abort(404)

metadoc = Metadoc(url=url, html=html)
-payload = metadoc.query_all()
+payload = metadoc.query()

return json.dumps(payload)
