This repository has been archived by the owner on Mar 5, 2022. It is now read-only.
No result with Google's experimental layout #103
I'll take a look tomorrow.
By the way, I'm thinking about adding some dev scripts to facilitate development. Current idea:

Content of `devbin/parse`:

```python
#!/usr/bin/env python3

import argparse
import json

import googler


def main():
    argparser = argparse.ArgumentParser(description='Parse Google responses.')
    argparser.add_argument('-N', '--news', action='store_true',
                           help='parse as Google News responses')
    argparser.add_argument('files', nargs='+', metavar='FILE',
                           help="HTML file with Google's response body")
    args = argparser.parse_args()
    for fn in args.files:
        with open(fn, encoding='utf-8') as fp:
            htmlparser = googler.GoogleParser(news=args.news)
            htmlparser.feed(fp.read())
        results_object = [r.jsonizable_object() for r in htmlparser.results]
        print(json.dumps(results_object, indent=2, sort_keys=True, ensure_ascii=False))


if __name__ == '__main__':
    main()
```

Invocation:

```
> devbin/parse -h
usage: parse [-h] [-N] FILE [FILE ...]

Parse Google responses.

positional arguments:
  FILE        HTML file with Google's response body

optional arguments:
  -h, --help  show this help message and exit
  -N, --news  parse as Google News responses
```

```
> devbin/parse /Volumes/ramdisk/googler-response-2qmy246b  # Good
[
  {
    "abstract": "HELLO! Online brings you the latest celebrity & royal news from the UK & around the world, magazine exclusives, celeb babies, weddings, pregnancies and ...",
    "title": "HELLO! Online: celebrity & royal news, magazine, babies, weddings ...",
    "url": "http://www.hellomagazine.com/"
  },
  {
    "abstract": "Hello, it's me. I was wondering if after all these years you'd like to meet. To go over everything. They say that time's supposed to heal ya. But I ain't done much ...",
    "title": "ADELE LYRICS - Hello - A-Z Lyrics",
    "url": "http://www.azlyrics.com/lyrics/adele/hello.html"
  },
  {
    "abstract": "hello connects you with people and content around your passions. Show the world who you are, express what you love, and create meaningful connections.",
    "title": "hello network",
    "url": "http://www.hello.com/"
  },
  {
    "abstract": "\"Hello\" is a song by English singer Adele. It was released on 23 October 2015 by XL Recordings as the lead single from her third studio album, 25 (2015). Adele ...",
    "title": "Hello (Adele song) - Wikipedia, the free encyclopedia",
    "url": "https://en.wikipedia.org/wiki/Hello_(Adele_song)"
  },
  {
    "abstract": "Watch Hello by Adele online at vevo.com. Discover the latest music videos by Adele on Vevo.",
    "title": "Hello - Adele - Vevo",
    "url": "http://www.vevo.com/watch/adele/Hello/GBH481500074"
  },
  {
    "abstract": "Make your phone smarter with Hello. Built by Messenger just for Android, Hello combines info from Facebook with the contact info on your phone so it's easy to ...",
    "title": "Hello — Caller ID & Blocking - Android Apps on Google Play",
    "url": "https://play.google.com/store/apps/details?id=com.facebook.phone&hl=en"
  },
  {
    "abstract": "HELLO! 1779396 likes · 92804 talking about this. The official Facebook page for HELLO! magazine & http://www.hellomagazine.com/ FOLLOW US @...",
    "title": "HELLO! - Facebook",
    "url": "https://www.facebook.com/hello/"
  },
  {
    "abstract": "Listen to Hello in full in the Spotify app. Play on Spotify. ℗ 2015 XL Recordings Ltd., under exclusive license to Columbia Records, a Division of Sony Music ...",
    "title": "Hello by Adele on Spotify",
    "url": "https://open.spotify.com/album/1Eo1pg2D4beFgf3HFTJTMc"
  }
]
```

```
> devbin/parse /Volumes/ramdisk/googler-response-unjwqp8g  # Bad
[]
```
👍
zmwangx added a commit to zmwangx/googler that referenced this issue on Jun 5, 2016
Fixes jarun#103.

By comparing https://git.io/vofW5 (experimental layout in jarun#103) and https://git.io/voJgB (traditional layout), we see two things in common:

- The `mnr-c` class is used for cards, regardless of whether the card is a smart result presumably produced by Google's deep neural network;
- The `g-blk` class (presumably standing for g-block) coupled with the `g` class (result wrapper) is used exclusively for smart cards, either at the top or on the right (sample query: paris).

Therefore, we now test for the presence of `g` instead of requiring `g`-only, and exclude `g-blk`.
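
For readers skimming the thread, the rule described in the commit message can be sketched roughly as follows. This is only an illustration of the class test, not the code from the commit; the function name `is_result_wrapper` is made up for this example, and it assumes the `class` attribute is available as a plain string.

```python
def is_result_wrapper(class_attr):
    """Return True if a <div> with this class attribute looks like a result wrapper.

    Hypothetical helper for illustration only. It accepts any element whose
    class list contains `g`, and rejects elements that also carry `g-blk`
    (used for smart cards, per the commit message).
    """
    # The class attribute is a whitespace-separated list of tokens;
    # a missing attribute is treated as an empty list.
    classes = (class_attr or '').split()
    return 'g' in classes and 'g-blk' not in classes


if __name__ == '__main__':
    for sample in ('g', 'g card-section', 'g g-blk mnr-c', 'mnr-c'):
        print(repr(sample), is_result_wrapper(sample))
```

Splitting on whitespace rather than doing a substring check keeps unrelated classes that merely contain the letter `g` (such as `g-blk` on its own) from matching.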
As we all know, Google experiments with stuff all the time. Here's a layout that they've been experimenting with lately, which I've seen a few times.

Sample response: https://git.io/vofW5

Our parser can't parse any result from this because the result wrapper here is `<div class="g card-section">`, and we've been limiting ourselves to precisely `<div class="g">` in order to get rid of the occasional top card.

I'll see what I can do when I have time. (The basic idea is to open up the class restriction, and somehow still filter out the top card? I thought about enforcing nonempty abstracts, but that's not a good criterion because if you google `google`, for instance, the second result is Google's Twitter account and there's no abstract for that. If we can't discern the top card from everything else, I guess we'll have to include it after all.)
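
To make the failure mode concrete, here is a minimal reproduction using only the standard library's `html.parser`. It is a simplified stand-in, not googler's actual parser: the `ExactClassCounter` class, the `count_exact` helper, and the two markup snippets are all made up for this sketch, and only the class values come from the report above.

```python
from html.parser import HTMLParser


class ExactClassCounter(HTMLParser):
    """Counts <div> tags whose class attribute equals `wanted` exactly."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; compare the whole class
        # string, which is the restrictive behaviour described above.
        if tag == 'div' and dict(attrs).get('class') == self.wanted:
            self.count += 1


def count_exact(html, wanted='g'):
    """Feed `html` to the counter and return the number of exact matches."""
    parser = ExactClassCounter(wanted)
    parser.feed(html)
    return parser.count


# Hypothetical, heavily trimmed result wrappers for the two layouts.
TRADITIONAL = '<div class="g"><h3>Result</h3></div>'
EXPERIMENTAL = '<div class="g card-section"><h3>Result</h3></div>'

if __name__ == '__main__':
    print(count_exact(TRADITIONAL))   # 1
    print(count_exact(EXPERIMENTAL))  # 0
```

The exact string comparison finds the traditional wrapper but skips the experimental one, which matches the empty result list seen with the experimental layout.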