OO refresh #99

zmwangx · 2016-05-30T00:31:14Z

This is a huge WIP, intended to group and modularize googler's various functional units, hence lifting googler from its quick-and-dirty origin. (Starting out quick-and-dirty is totally fine, but at some point we need to rethink the structure in order to not dissolve into a pile of unmaintainable mess when we add more features. That's just my opinion.)

The guiding principle here is to break down big functional units and dissociate state info so that every small blob of code should make as much sense as possible locally, instead of relying on memorizing the effect of another blob that is two hundred lines away. (Bad example: url = url.replace('start=%d&' % oldstart, 'start=%d&' % start, 1), which relies on storing a snapshot of start in oldstart many lines before actually using it; this sort of disconnectedness could be a source of bugs when part of the program is updated.) When we do need persistent state info, they should be packaged into well-defined and well-documented OO interfaces, complete with relevant state manipulation helpers.

A brief and incomplete checklist:

GooglerArgumentParser: absorb type guards;
GoogleUrl: URL constructor;
GoogleConnection: handle connection opening/renewal and page fetching (including redirection) (should absorb google_get, new_connection and a big chunk of fetch_results);
GooglerCmd: OO command line interpreter and executioner (model on cmd.Cmd?).

Classes should be global-free (except constants and logger):

GoogleUrl
GoogleConnection
GoogleParser
Result — lucky should be dealt with outside Result; colors, columns etc. that affect the output should be class variables; ~~json should be dealt with inside Result~~; oh, and think of a better name for urlindex (urltable maybe?)...
GooglerCmd
GooglerArgumentParser

I'm busy recently so it will take a while.

The finished parts could be reviewed and/or regression-tested and I'd like to have feedback.

zmwangx · 2016-05-30T00:33:39Z

By the way, this is laying groundwork for both "Option-related improvements" and "Refactor and write unit tests" on the todo list.

jarun · 2016-05-30T05:05:23Z

I'm busy recently so it will take a while.

Please take your time. This is massive improvement over what we have today.

zmwangx · 2016-05-30T05:37:31Z

Pushed another commit, implementing GoogleConnection. That wraps up this week's work.

jarun · 2016-05-30T05:40:14Z

👍 get some rest.

zmwangx · 2016-05-30T05:42:42Z

👍 ~~get some rest.~~ Pick up another piece of work.

FTFY 😛

zmwangx · 2016-05-30T05:44:16Z

Seems that Google raised their bar of suspicious activity, or we have some unknown competitor testing Google search on Travis at the same time...

jarun · 2016-05-30T05:48:36Z

Seems that Google raised their bar of suspicious activity

We can experiment with the cookie stuff our user share on Gitter. I'll check it next weekend.

zmwangx · 2016-05-30T05:53:51Z

We can experiment with the cookie stuff our user share on Gitter.

You mean grab a cookie from the browser?

I don't think spamming hundreds requests with cookies is not suspicious from Google's point of view;
Even if it does prevent Google's block, it only works for so long before it expires;
I would be hesitant to make my cookies (signed in) public, because that may open up a channel for session hijacking attacks against my Google account.

It doesn't hurt to try though.

jarun · 2016-05-30T06:01:43Z

No no... This one: https://docs.python.org/3.0/library/http.cookiejar.html

Now we are all-Python3. No double maintenance.
If we do it, we handle ourselves.

jarun · 2016-05-30T06:03:31Z

I don't think spamming hundreds requests with cookies is not suspicious from Google's point of view

We can check this by firing 100 requests from the browser. I can't do it right now.

zmwangx · 2016-05-30T06:06:01Z

Oh, you mean just keeping the cookies? Not sure that would help. I would say keeping the cookies would be more suspicious, because you're clearly spamming from the same session, rather than having multiple people searching from the same IP, which is more excusable.

What I had in mind was cookies from a signed-in session, because I've never been greeted by a reCAPTCHA when I'm signed in, but occasionally do when I search from Incognito/Private, especially when using my PIA VPN.

zmwangx · 2016-05-30T06:07:15Z

What I had in mind was cookies from a signed-in session

Obviously we don't want to introduce signing in.

Anyway, it's not a big deal. Also, my plans for testing would render this spamming a thing of the past.

jarun · 2016-05-30T06:13:08Z

Not sure that would help.

This is what I wanna test, from a fresh, cookie-cleared browser session.

Obviously we don't want to introduce signing in.

We can't. Google has added restrictions recently. Many third-party clients can't anymore.

zmwangx · 2016-05-30T06:14:26Z

We can't.

Plus that hardly helps our users. Only helps with automated testing.

zmwangx · 2016-05-30T06:15:42Z

This is what I wanna test, from a fresh, cookie-cleared browser session.

And I just tried.

for i in {1..500}; do echo -n $'\r'$i; chrome-cli open 'https://google.com/search?q=hello' >/dev/null; sleep 1; chrome-cli close; done

All pages are opened in a fresh Incognito session (same session). Got reCAPTCHA'ed on the ~100th request.

jarun · 2016-05-30T06:16:18Z

Got ReCAPTCHA'ed on the ~100th request.

LoL!!! Doesn't help then.

This is Python 3, dude!

- Make previously global type guard functions into static methods of GooglerArgumentParser (cleaner global namespace, easier-to-locate type guards); - Add two new type guards: positive_int: for --num; nonnegative_int: for --start (technically a breaking change, realistic no one should ever type in a negative start index or a non-positive number of results per page). - Drop Python 2 legacy (super).

The performance gain of compiling r'\s+' is negligible, which is completely outweighed by the awkwardness of the variable and the dev time wasted in looking for the definition of another variable (although it could be easily located by automated tools).

The URL construction interface is now fully OO. It is to some extend modeled on urllib.parse.ParseResult for familiarity, but it takes care of all aspects of URL construction (even pagination) and defines a easy-to-use and flexity interface with properties and methods. See the docstring of the GoogleUrl class for details, as well as the docstrings of individual user-facing methods. As a side effect, we have reduced the number of globals, which are now grouped into three categories: state carriers (which, as the name suggests, carry global states), output/user interaction options, and cosmetic options.

move GooglerArgumentParser to the end of classes. - Logically GoogleUrl should precede GoogleParser because we can't parse anything if we haven't constructed the request URL in the first place. Once we have GoogleConnection (OO version of the current conn global), it will go between GoogleUrl and GoogleParser. - GooglerArgumentParser is a class.

GoogleConnection absorbed new_connection, google_get, and the connecting and fetching parts of fetch_results. The new OO interface is nicely abstract and fully documented. Note that we have now registered an exit handler for closing the connection. This way we can exit the program at any point using sys.exit instead of our kind of awkward quit_program. There's one breaking change which should not have much of a consequence: we no longer modify the host segment of the global URL (now managed by google_url) after a redirection. This is partly stipulated by the OO interface (GoogleConnection class should not have access to the global google_url) and partly because users should get what they specified when they ask for the URL (with the 'o' command, for instance), not what Google forced on them with a redirection.

1. I'm trying to eliminate globals. `debug' is a non-idiomatic global, whereas `logger' is idiomatic. 2. I cringe at `debugp'. 3. Having to maintain home-grown messaging functions sucks. Also moving import tempfile and import json to where they're actually used for maintainability. Doing the same to webbrowser because it's also an optional component that's only used once.

We already have a global variable `colors' which we can test for truthiness, so another boolean global is totally unnecessary. Note that colors will eventually be absorbed into class variables of classes that rely on it (Result and GooglerCmd), but for now it is kept.

immediately after constants. These global helpers are called from classes, so logically they should appear first.

The `columns' global variable has been eliminated (when I say eliminated I mean it can be wrapped into a main function and is referenced nowhere else). `colors' is currently used in `read_next_command' but will be eliminated once we pack up the REPL. Also restructured the opt handling part a bit to make them more organized and more sensible.

instead of Result.print_entry. Opening the first result really has nothing to do with "print entry". While we're at it, we - Make --lucky imply --noprompt. (Previously, there was a pointless blank line printed due to not in noninteractive mode.) - Print a warning of "No results" when no results were found unless in JSON mode (where an empty array should be obvious enough).

The name "urlindex" is so weird, considering it is a dict of index: url pairs.

This way we - Better adapt to terminal resize; - Get rid of one kind of awkward class variable. We also document the class variables of the Result class. Note that os.get_terminal_size is *fast*. I did some benchmarking: termsize.py: #!/usr/bin/env python3 import os import sys try: count = int(sys.argv[1]) except: count = 1 for _ in range(count): try: columns, lines = os.get_terminal_size() except: pass With Python 3.5.1 on OS X 10.11.5 on my Mid-2015 15'' rMBP with 2.5 GHz quad-core Core i7, 1E7 calls take this long: zsh: > time (./termsize.py 10000000) ( ./termsize.py 10000000; ) 5.02s user 3.22s system 99% cpu 8.263 total > time (./termsize.py 10000000 &>/dev/null </dev/null) ( ./termsize.py 10000000 &> /dev/null < /dev/null; ) 9.13s user 2.50s system 99% cpu 11.667 total > unset COLUMNS LINES; time (./termsize.py 10000000) ( ./termsize.py 10000000; ) 5.18s user 3.25s system 99% cpu 8.458 total bash: > time (./termsize.py 10000000) real 0m8.459s user 0m5.186s sys 0m3.246s > time (./termsize.py 10000000 &>/dev/null </dev/null) real 0m13.320s user 0m10.363s sys 0m2.899s > unset COLUMNS LINES; time (./termsize.py 10000000) real 0m8.177s user 0m5.006s sys 0m3.152s

There's one breaking change: - The `n' command now navigates to the next page even when the current page seems to be empty. The reason is not being able to find any result (that matches our criteria) on the page doesn't necessarily mean it is the last page, especially when --count is small. Other improvements other than the obvious include - Global variables are finally eliminated, and program no longer has side effects; - Argument parsing code has been factored into a function parse_args() to be forward-looking; - All exceptions are now captured. In non-debug mode we only print the error message if there's an otherwise uncaught exception; in debug mode we re-raise to produce a traceback.

Previously I forgot to handle the `site' key in GoogleUrl.update(). Also ordered keys in update (we need some easy-to-follow order so that we won't need to worry about where to insert a new key in the future).

zmwangx · 2016-06-05T07:13:18Z

Lazy import done. Rebasing was kind of a nightmare.

Any other comments or replies to my comments?

zmwangx · 2016-06-05T07:21:06Z

Rebasing this on master won't be fun, because that cyan change would result in amending almost every single commit. If you're happy with this branch, I'll do a manual merge instead, which is easy.

jarun · 2016-06-05T08:16:38Z

Lazy import done. Rebasing was kind of a nightmare.

I guessed it but I really crave for speed.

I'll do a manual merge instead, which is easy.

Go as you like :)

zmwangx · 2016-06-05T08:17:38Z

but I really crave for speed.

Needed to rebase anyway. No problem.

zmwangx · 2016-06-05T08:18:29Z

Merged.

jarun · 2016-06-05T08:33:45Z

👍

jarun · 2016-06-09T06:08:35Z

I'm planning to revert this commit. Earlier I was thinking of supporting all options at the prompt and this commit would help with that but I have dumped the plan.

This commit takes re-directions to a new level. Just to figure out where program args (opts) are being parsed I had to go though 4 functions where opts is being passed around!

Let's keeps things simpler. So much for object-oriented.

jarun · 2016-06-09T07:03:16Z

OK, that came in with GooglerCmd.

jarun · 2016-06-09T10:19:45Z

And we couldn't live without re???

zmwangx · 2016-06-09T11:50:14Z

Earlier I was thinking of supporting all options at the prompt and this commit would help with that but I have dumped the plan.

I think it really makes sense to support that. But if you don't want to, we can dump GooglerCmd. Other classes can stay.

This commit takes re-directions to a new level.

Eliminating side effects come with a price. Otherwise something like devbin/parse would not work (or it could work, but you have global everywhere, which would be insane).

Just to figure out where program args (opts) are being parsed I had to go though 4 functions where opts is being passed around!

Dynamic URL construction makes sense even if you don't want to support dynamically updating options during the execution. Previously you need to understand all URL manipulation code scattered around just to make sure your redirection code won't break anything (and you actually can't). Now the complexity has been contained, and you don't have to understand anything about URL construction unless you need to.

Plus, tracebacks in the OO setting are arguably more helpful. You have the entire detailed callstack, while you could also quickly isolate the issue, rather than worrying about states.

zmwangx · 2016-06-09T11:51:42Z

In short, if you confirm that the plan to support dynamic options has been dropped, then I'll go ahead and try to eliminate GooglerCmd, because that's the only class that's adding complexity (in addition to wrapping a few globals) if we don't want to actually achieve something with it.

zmwangx · 2016-06-09T12:21:25Z

Forgot to reply to re comment.

Using re to replace consecutive Unicode whitespace

Results in cleaner queries;
Makes it more flexible for the users who might type in whitespace characters other than the space (e.g., when you use a Chinese input method).

But it is by no means necessary.

jarun · 2016-06-09T12:49:50Z

Other classes can stay.

For generating a URL (which could have been done easily with string manipulation and class variables) we have used dictionary, multiple levels of API calls to the extent that it becomes boring to understand the code... it's everywhere! This has now become a Python-internal DS usage demonstration rather than a utility one would want to go through and contribute to. Why couldn't we keep things simpler?

Dynamic URL construction makes sense even if you don't want to support dynamically updating options during the execution.

I'll get back to this but if this is what it takes to re-form and generate a URL, I'm not sure I'll carry it along.

googler doesn't have to be that complex a utility. We don't want to turn it into a full-fledged enterprise database management system loaded with cannons.

zmwangx · 2016-06-09T13:09:49Z

we have used dictionary

Because URL query is a dictionary. It's the natural data structure to use.

multiple levels of API calls

To construct, you call the constructor with options; to update, you call update; to get the URL, you access the url property, or full() (same thing, just for clarity in certain use cases) or relative() depending on use case. Navigation is handled too. What is multi-leveled? If you don't need dynamic manipulation of options, just ignore some of the methods. You don't have to use all methods, and existence of a few possibly extraneous methods doesn't make the class more complex.

googler doesn't have to be that complex a utility.

Number of lines isn't equal to complexity. Complexity is nothing if it's contained and split up.

zmwangx added do not merge WIP labels May 30, 2016

zmwangx self-assigned this May 30, 2016

zmwangx force-pushed the oo-refresh branch 3 times, most recently from 74b59c6 to cbcd55c Compare June 2, 2016 00:43

zmwangx added 6 commits June 1, 2016 17:45

googler: Switch to new style classes

e0f5583

This is Python 3, dude!

zmwangx force-pushed the oo-refresh branch 2 times, most recently from d30b4a2 to ca33fda Compare June 2, 2016 21:25

zmwangx added 9 commits June 5, 2016 00:04

googler: Lift open_url and printerr to the top of file...

97917f7

immediately after constants. These global helpers are called from classes, so logically they should appear first.

googler: Improve the Result class and rename urlindex to urltable

43ff816

The name "urlindex" is so weird, considering it is a dict of index: url pairs.

googler: Bring back --site option's functionality

baedc3b

Previously I forgot to handle the `site' key in GoogleUrl.update(). Also ordered keys in update (we need some easy-to-follow order so that we won't need to worry about where to insert a new key in the future).

zmwangx force-pushed the oo-refresh branch from d32d102 to baedc3b Compare June 5, 2016 07:12

zmwangx merged commit baedc3b into jarun:master Jun 5, 2016

zmwangx deleted the oo-refresh branch June 5, 2016 08:18

zmwangx removed do not merge WIP labels Jun 5, 2016

mberg2007 mentioned this pull request Jun 9, 2017

--news is not working #187

Closed

lock bot locked and limited conversation to collaborators Nov 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OO refresh #99

OO refresh #99

zmwangx commented May 30, 2016 •

edited

Loading

zmwangx commented May 30, 2016

jarun commented May 30, 2016

zmwangx commented May 30, 2016

jarun commented May 30, 2016

zmwangx commented May 30, 2016

zmwangx commented May 30, 2016

jarun commented May 30, 2016 •

edited

Loading

zmwangx commented May 30, 2016

jarun commented May 30, 2016

jarun commented May 30, 2016

zmwangx commented May 30, 2016 •

edited

Loading

zmwangx commented May 30, 2016

jarun commented May 30, 2016

zmwangx commented May 30, 2016

zmwangx commented May 30, 2016 •

edited

Loading

jarun commented May 30, 2016

zmwangx commented Jun 5, 2016

zmwangx commented Jun 5, 2016

jarun commented Jun 5, 2016

zmwangx commented Jun 5, 2016

zmwangx commented Jun 5, 2016

jarun commented Jun 5, 2016

jarun commented Jun 9, 2016

jarun commented Jun 9, 2016

jarun commented Jun 9, 2016

zmwangx commented Jun 9, 2016

zmwangx commented Jun 9, 2016

zmwangx commented Jun 9, 2016

jarun commented Jun 9, 2016

zmwangx commented Jun 9, 2016

OO refresh #99

OO refresh #99

Conversation

zmwangx commented May 30, 2016 • edited Loading

zmwangx commented May 30, 2016

jarun commented May 30, 2016

zmwangx commented May 30, 2016

jarun commented May 30, 2016

zmwangx commented May 30, 2016

zmwangx commented May 30, 2016

jarun commented May 30, 2016 • edited Loading

zmwangx commented May 30, 2016

jarun commented May 30, 2016

jarun commented May 30, 2016

zmwangx commented May 30, 2016 • edited Loading

zmwangx commented May 30, 2016

jarun commented May 30, 2016

zmwangx commented May 30, 2016

zmwangx commented May 30, 2016 • edited Loading

jarun commented May 30, 2016

zmwangx commented Jun 5, 2016

zmwangx commented Jun 5, 2016

jarun commented Jun 5, 2016

zmwangx commented Jun 5, 2016

zmwangx commented Jun 5, 2016

jarun commented Jun 5, 2016

jarun commented Jun 9, 2016

jarun commented Jun 9, 2016

jarun commented Jun 9, 2016

zmwangx commented Jun 9, 2016

zmwangx commented Jun 9, 2016

zmwangx commented Jun 9, 2016

jarun commented Jun 9, 2016

zmwangx commented Jun 9, 2016

zmwangx commented May 30, 2016 •

edited

Loading

jarun commented May 30, 2016 •

edited

Loading

zmwangx commented May 30, 2016 •

edited

Loading

zmwangx commented May 30, 2016 •

edited

Loading