Merge pull request #18 from psolbach/fix/consent-walls
spoof User Agent to circumvent consent pages
mborho committed Jun 12, 2018
2 parents 7bd1a01 + 22be1b9 commit 381a02b
Showing 4 changed files with 12 additions and 3 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -10,3 +10,6 @@ dist
 venv
 htmlcov
 *.swp
+venv36
+metadoc/extract/data/*
+.pytest_cache
8 changes: 7 additions & 1 deletion README.md
@@ -107,6 +107,13 @@ python metadoc/__install__.py
 python serve.py => serving @ 6060
 ```
 
+## Test
+```shell
+py.test -v tests
+```
+If you happen to run into an error with OSX 10.11 concerning a lazy bound library in PIL,
+just remove `/PIL/.dylibs/liblzma.5.dylib`.
+
 ## Todo
 * Page concatenation is needed in order to properly calculate wordcount and reading time.
 * Authenticity heuristic with sharecount deviance detection (requires state).
@@ -118,4 +125,3 @@ python serve.py => serving @ 6060
 Metadoc stems from a pedigree of nice libraries like [libextract](https://github.com/datalib/libextract), [langdetect](https://github.com/Mimino666/langdetect) and [nltk](https://github.com/nltk/nltk).
 Metadoc leans on [this](https://github.com/hankcs/AveragedPerceptronPython) perceptron implementation inspired by Matthew Honnibal.
 Metadoc is work-in-progress and maintained by [@___paul](https://twitter.com/___paul)
-
2 changes: 1 addition & 1 deletion metadoc/__init__.py
@@ -181,7 +181,7 @@ def _request_url(self):
 
 req = requests.get(url, headers={
 'Accept-Encoding': 'identity, gzip, deflate, *',
-'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
+'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'
 })
 
 if req.status_code != 200:
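For readers skimming the hunk above, here is a minimal standalone sketch of the same idea: send the crawler User-Agent string instead of a browser one when fetching a page. The names `GOOGLEBOT_UA`, `build_headers`, and `fetch_html` are hypothetical and not part of metadoc; the `timeout` argument and the `RuntimeError` are also assumptions, since the diff does not show what `_request_url` does after the status check.

```python
# Header values copied from the diff; everything else is illustrative.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def build_headers(user_agent=GOOGLEBOT_UA):
    """Return the request headers from the diff, with a swappable User-Agent."""
    return {
        "Accept-Encoding": "identity, gzip, deflate, *",
        "User-Agent": user_agent,
    }


def fetch_html(url, timeout=10):
    """Hypothetical helper: GET `url` with the crawler User-Agent, return the body."""
    # Third-party dependency, imported here so build_headers() works without it.
    import requests

    resp = requests.get(url, headers=build_headers(), timeout=timeout)
    if resp.status_code != 200:
        # Assumption: the real method's handling of non-200 responses is not shown.
        raise RuntimeError("unexpected status %d for %s" % (resp.status_code, url))
    return resp.text
```

Keeping the header construction in its own function makes it easy to swap the User-Agent back to a browser string for sites that treat crawlers differently.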
2 changes: 1 addition & 1 deletion serve.py
@@ -58,7 +58,7 @@ def full_article():
 abort(404)
 
 metadoc = Metadoc(url=url, html=html)
-payload = metadoc.query_all()
+payload = metadoc.query()
 
 return json.dumps(payload)
 
