By the way, this is laying groundwork for both "Option-related improvements" and "Refactor and write unit tests" on the todo list. |
Please take your time. This is massive improvement over what we have today. |
Pushed another commit, implementing |
👍 get some rest. |
FTFY 😛 |
Seems that Google raised their bar of suspicious activity, or we have some unknown competitor testing Google search on Travis at the same time... |
We can experiment with the cookie stuff our user share on Gitter. I'll check it next weekend. |
You mean grab a cookie from the browser?
It doesn't hurt to try though. |
No no... This one: https://docs.python.org/3.0/library/http.cookiejar.html Now we are all-Python3. No double maintenance. |
We can check this by firing 100 requests from the browser. I can't do it right now. |
Oh, you mean just keeping the cookies? Not sure that would help. I would say keeping the cookies would be more suspicious, because you're clearly spamming from the same session, rather than having multiple people searching from the same IP, which is more excusable. What I had in mind was cookies from a signed-in session, because I've never been greeted by a reCAPTCHA when I'm signed in, but occasionally do when I search from Incognito/Private, especially when using my PIA VPN. |
Obviously we don't want to introduce signing in. Anyway, it's not a big deal. Also, my plans for testing would render this spamming a thing of the past. |
This is what I wanna test, from a fresh, cookie-cleared browser session.
We can't. Google has added restrictions recently. Many third-party clients can't anymore. |
Plus that hardly helps our users. Only helps with automated testing. |
And I just tried.

    for i in {1..500}; do echo -n $'\r'$i; chrome-cli open 'https://google.com/search?q=hello' >/dev/null; sleep 1; chrome-cli close; done

All pages are opened in a fresh Incognito session (same session). Got reCAPTCHA'ed on the ~100th request. |
LoL!!! Doesn't help then. |
74b59c6 to cbcd55c
This is Python 3, dude!
- Make previously global type guard functions into static methods of GooglerArgumentParser (cleaner global namespace, easier-to-locate type guards);
- Add two new type guards:
  - positive_int: for --num;
  - nonnegative_int: for --start
  (technically a breaking change, but realistically no one should ever type in a negative start index or a non-positive number of results per page).
- Drop Python 2 legacy (super).
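A minimal sketch of what the commit message above describes, assuming the type guards are plain `argparse` type callables attached to the parser class as static methods. Only the names `GooglerArgumentParser`, `positive_int`, `nonnegative_int`, `--num` and `--start` come from the commit message; the bodies, defaults and error messages are illustrative guesses, not the actual googler code.

```python
import argparse


class GooglerArgumentParser(argparse.ArgumentParser):
    """Sketch: an argument parser that carries its own type guards."""

    @staticmethod
    def positive_int(arg):
        """Type guard for strictly positive integers (e.g. --num)."""
        try:
            n = int(arg)
        except ValueError:
            n = 0  # fall through to the common error below
        if n <= 0:
            raise argparse.ArgumentTypeError('%s is not a positive integer' % arg)
        return n

    @staticmethod
    def nonnegative_int(arg):
        """Type guard for integers >= 0 (e.g. --start)."""
        try:
            n = int(arg)
        except ValueError:
            n = -1  # fall through to the common error below
        if n < 0:
            raise argparse.ArgumentTypeError('%s is not a non-negative integer' % arg)
        return n


# Hypothetical usage; the defaults here are assumptions.
parser = GooglerArgumentParser(prog='googler-sketch')
parser.add_argument('--num', type=parser.positive_int, default=10)
parser.add_argument('--start', type=parser.nonnegative_int, default=0)
args = parser.parse_args(['--num', '5', '--start', '0'])
```

Because the guards are static methods, they can be referenced either through the class or through a parser instance, and nothing leaks into the module namespace.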
The performance gain of compiling r'\s+' is negligible, which is completely outweighed by the awkwardness of the variable and the dev time wasted in looking for the definition of another variable (although it could be easily located by automated tools).
The URL construction interface is now fully OO. It is to some extent modeled on urllib.parse.ParseResult for familiarity, but it takes care of all aspects of URL construction (even pagination) and defines an easy-to-use and flexible interface with properties and methods. See the docstring of the GoogleUrl class for details, as well as the docstrings of individual user-facing methods. As a side effect, we have reduced the number of globals, which are now grouped into three categories: state carriers (which, as the name suggests, carry global states), output/user interaction options, and cosmetic options.
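To make the idea concrete, here is a rough sketch of a URL object that keeps its state as attributes and rebuilds the query string on demand, so pagination never has to patch a URL string in place. The class name `GoogleUrlSketch`, its attributes, and the methods `full`, `next_page` and `prev_page` are hypothetical stand-ins chosen for illustration; the real `GoogleUrl` interface is documented in its own docstrings and may differ.

```python
from urllib.parse import urlencode


class GoogleUrlSketch:
    """Hypothetical, simplified stand-in for an OO URL constructor."""

    def __init__(self, keywords, num=10, start=0):
        # Attribute names loosely follow urllib.parse.ParseResult.
        self.scheme = 'https'
        self.netloc = 'www.google.com'
        self.path = '/search'
        self._keywords = list(keywords)
        self._num = num
        self._start = start

    @property
    def query(self):
        """Query string, rebuilt from the current state on every access."""
        params = {'q': ' '.join(self._keywords), 'num': self._num}
        if self._start:
            params['start'] = self._start
        return urlencode(params)

    def full(self):
        """Return the complete URL as a string."""
        return '%s://%s%s?%s' % (self.scheme, self.netloc, self.path, self.query)

    def next_page(self):
        """Advance the start index by one page."""
        self._start += self._num

    def prev_page(self):
        """Go back one page; refuse to go before the first page."""
        if self._start == 0:
            raise ValueError('Already at the first page.')
        self._start = max(0, self._start - self._num)


url = GoogleUrlSketch(['hello', 'world'], num=5)
first = url.full()
url.next_page()
second = url.full()
```

Since the query is derived from state rather than stored, there is no need to remember an old start index in order to replace it later.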
move GooglerArgumentParser to the end of classes.
- Logically GoogleUrl should precede GoogleParser because we can't parse anything if we haven't constructed the request URL in the first place. Once we have GoogleConnection (OO version of the current conn global), it will go between GoogleUrl and GoogleParser.
- GooglerArgumentParser is a class.
GoogleConnection absorbed new_connection, google_get, and the connecting and fetching parts of fetch_results. The new OO interface is nicely abstract and fully documented. Note that we have now registered an exit handler for closing the connection. This way we can exit the program at any point using sys.exit instead of our kind of awkward quit_program. There's one breaking change which should not have much of a consequence: we no longer modify the host segment of the global URL (now managed by google_url) after a redirection. This is partly stipulated by the OO interface (GoogleConnection class should not have access to the global google_url) and partly because users should get what they specified when they ask for the URL (with the 'o' command, for instance), not what Google forced on them with a redirection.
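The exit-handler idea can be sketched with the standard `atexit` module. This is an assumption about the mechanism, not a copy of `GoogleConnection`: the class name `ConnectionSketch`, the `closed` property and the timeout value are made up for the example, and no request is actually sent.

```python
import atexit
import http.client


class ConnectionSketch:
    """Hypothetical wrapper that closes its connection at interpreter exit."""

    def __init__(self, host, port=443, timeout=45):
        # HTTPSConnection does not open a socket until a request is made.
        self._conn = http.client.HTTPSConnection(host, port, timeout=timeout)
        # Registered once; runs on normal exit, including sys.exit().
        atexit.register(self.close)

    @property
    def closed(self):
        return self._conn is None

    def close(self):
        """Close the connection; safe to call more than once."""
        if self._conn is not None:
            self._conn.close()
            self._conn = None


conn = ConnectionSketch('www.google.com')
```

With the handler registered, any code path can call `sys.exit` directly and the cleanup still runs, which is what removes the need for a dedicated quit helper.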
d30b4a2 to ca33fda
1. I'm trying to eliminate globals. `debug' is a non-idiomatic global, whereas `logger' is idiomatic.
2. I cringe at `debugp'.
3. Having to maintain home-grown messaging functions sucks.

Also moving import tempfile and import json to where they're actually used for maintainability. Doing the same to webbrowser because it's also an optional component that's only used once.
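For readers unfamiliar with the pattern, a small sketch of a module-level `logger` replacing a `debug` flag, plus a lazily imported optional module. The logger name, the format string, and the helpers `set_debug` and `open_in_browser` are assumptions for illustration and do not claim to match googler's code.

```python
import logging

# One-time setup; the format string is an arbitrary choice.
logging.basicConfig(format='[%(levelname)s] %(message)s')
logger = logging.getLogger('googler_sketch')


def set_debug(enabled):
    """Toggle debug output by adjusting the logger level, not a global flag."""
    logger.setLevel(logging.DEBUG if enabled else logging.WARNING)


def open_in_browser(url):
    """Open a URL; webbrowser is imported only when this is actually called."""
    import webbrowser
    return webbrowser.open(url)


set_debug(True)
logger.debug('Debug output is on.')
set_debug(False)
logger.debug('This message is suppressed.')
```

Call sites simply use `logger.debug(...)` unconditionally, so there is no `if debug:` check to maintain.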
We already have a global variable `colors' which we can test for truthiness, so another boolean global is totally unnecessary. Note that colors will eventually be absorbed into class variables of classes that rely on it (Result and GooglerCmd), but for now it is kept.
immediately after constants. These global helpers are called from classes, so logically they should appear first.
The `columns' global variable has been eliminated (when I say eliminated I mean it can be wrapped into a main function and is referenced nowhere else). `colors' is currently used in `read_next_command' but will be eliminated once we pack up the REPL. Also restructured the opt handling part a bit to make it more organized and more sensible.
instead of Result.print_entry. Opening the first result really has nothing to do with "print entry". While we're at it, we
- Make --lucky imply --noprompt. (Previously, a pointless blank line was printed because we were not in noninteractive mode.)
- Print a warning of "No results" when no results were found, unless in JSON mode (where an empty array should be obvious enough).
The name "urlindex" is so weird, considering it is a dict of index: url pairs.
This way we
- Better adapt to terminal resize;
- Get rid of one kind of awkward class variable.

We also document the class variables of the Result class.

Note that os.get_terminal_size is *fast*. I did some benchmarking:

termsize.py:

    #!/usr/bin/env python3

    import os
    import sys

    try:
        count = int(sys.argv[1])
    except:
        count = 1

    for _ in range(count):
        try:
            columns, lines = os.get_terminal_size()
        except:
            pass

With Python 3.5.1 on OS X 10.11.5 on my Mid-2015 15'' rMBP with 2.5 GHz quad-core Core i7, 1E7 calls take this long:

zsh:

    > time (./termsize.py 10000000)
    ( ./termsize.py 10000000; )  5.02s user 3.22s system 99% cpu 8.263 total
    > time (./termsize.py 10000000 &>/dev/null </dev/null)
    ( ./termsize.py 10000000 &> /dev/null < /dev/null; )  9.13s user 2.50s system 99% cpu 11.667 total
    > unset COLUMNS LINES; time (./termsize.py 10000000)
    ( ./termsize.py 10000000; )  5.18s user 3.25s system 99% cpu 8.458 total

bash:

    > time (./termsize.py 10000000)
    real 0m8.459s
    user 0m5.186s
    sys 0m3.246s
    > time (./termsize.py 10000000 &>/dev/null </dev/null)
    real 0m13.320s
    user 0m10.363s
    sys 0m2.899s
    > unset COLUMNS LINES; time (./termsize.py 10000000)
    real 0m8.177s
    user 0m5.006s
    sys 0m3.152s
There's one breaking change:
- The `n' command now navigates to the next page even when the current page seems to be empty. The reason is that failing to find any result (that matches our criteria) on the page doesn't necessarily mean it is the last page, especially when --count is small.

Improvements other than the obvious include:
- Global variables are finally eliminated, and the program no longer has side effects;
- Argument parsing code has been factored into a function parse_args() to be forward-looking;
- All exceptions are now captured. In non-debug mode we only print the error message if there's an otherwise uncaught exception; in debug mode we re-raise to produce a traceback.
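The exception policy in the last item could look roughly like the following. The wrapper name `run`, its signature, the `[ERROR]` prefix and the return codes are assumptions made for the example; the commit message only specifies the behavior (print the message in non-debug mode, re-raise in debug mode).

```python
import sys


def run(main, debug=False):
    """Call main(); report otherwise-uncaught exceptions according to mode."""
    try:
        main()
    except Exception as e:
        if debug:
            # Debug mode: re-raise so that a full traceback is produced.
            raise
        # Non-debug mode: print only the error message.
        print('[ERROR] %s' % e, file=sys.stderr)
        return 1
    return 0


def failing_main():
    """Stand-in for a program entry point that hits an unexpected error."""
    raise RuntimeError('something went wrong')


status = run(failing_main)
```

Catching `Exception` rather than using a bare `except` leaves `SystemExit` and `KeyboardInterrupt` alone, so `sys.exit` keeps working as usual.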
Previously I forgot to handle the `site' key in GoogleUrl.update(). Also ordered keys in update (we need some easy-to-follow order so that we won't need to worry about where to insert a new key in the future).
Lazy import done. Rebasing was kind of a nightmare. Any other comments or replies to my comments? |
Rebasing this on master won't be fun, because that cyan change would result in amending almost every single commit. If you're happy with this branch, I'll do a manual merge instead, which is easy. |
I guessed it but I really crave for speed.
Go as you like :) |
Needed to rebase anyway. No problem. |
Merged. |
👍 |
I'm planning to revert this commit. Earlier I was thinking of supporting all options at the prompt and this commit would help with that, but I have dumped the plan. This commit takes redirections to a new level. Just to figure out where program args (opts) are being parsed I had to go through 4 functions where opts is being passed around! Let's keep things simpler. So much for object-oriented. |
OK, that came in with GooglerCmd. |
And we couldn't live without re??? |
I think it really makes sense to support that. But if you don't want to, we can dump
Eliminating side effects comes with a price. Otherwise something like
Dynamic URL construction makes sense even if you don't want to support dynamically updating options during the execution. Previously you need to understand all URL manipulation code scattered around just to make sure your redirection code won't break anything (and you actually can't). Now the complexity has been contained, and you don't have to understand anything about URL construction unless you need to. Plus, tracebacks in the OO setting are arguably more helpful. You have the entire detailed callstack, while you could also quickly isolate the issue, rather than worrying about states. |
In short, if you confirm that the plan to support dynamic options has been dropped, then I'll go ahead and try to eliminate |
Forgot to reply to Using
But it is by no means necessary. |
For generating a URL (which could have been done easily with string manipulation and class variables) we have used a dictionary and multiple levels of API calls, to the extent that it becomes boring to understand the code... it's everywhere! This has now become a demonstration of Python's internal data structures rather than a utility one would want to go through and contribute to. Why couldn't we keep things simpler?
I'll get back to this but if this is what it takes to re-form and generate a URL, I'm not sure I'll carry it along.
|
Because URL query is a dictionary. It's the natural data structure to use.
To construct, you call the constructor with options; to update, you call
Number of lines isn't equal to complexity. Complexity is nothing if it's contained and split up. |
This is a huge WIP, intended to group and modularize `googler`'s various functional units, hence lifting googler from its quick-and-dirty origin. (Starting out quick-and-dirty is totally fine, but at some point we need to rethink the structure in order to not dissolve into a pile of unmaintainable mess when we add more features. That's just my opinion.)

The guiding principle here is to break down big functional units and dissociate state info so that every small blob of code should make as much sense as possible locally, instead of relying on memorizing the effect of another blob that is two hundred lines away. (Bad example: `url = url.replace('start=%d&' % oldstart, 'start=%d&' % start, 1)`, which relies on storing a snapshot of `start` in `oldstart` many lines before actually using it; this sort of disconnectedness could be a source of bugs when part of the program is updated.) When we do need persistent state info, they should be packaged into well-defined and well-documented OO interfaces, complete with relevant state manipulation helpers.

A brief and incomplete checklist:

- `GooglerArgumentParser`: absorb type guards;
- `GoogleUrl`: URL constructor;
- `GoogleConnection`: handle connection opening/renewal and page fetching (including redirection) (should absorb `google_get`, `new_connection` and a big chunk of `fetch_results`);
- `GooglerCmd`: OO command line interpreter and executioner (model on `cmd.Cmd`?).

Classes should be global-free (except constants and `logger`):

- `GoogleUrl`
- `GoogleConnection`
- `GoogleParser`
- `Result`: `lucky` should be dealt with outside `Result`; `colors`, `columns` etc. that affect the output should be class variables; `json` should be dealt with inside `Result`; oh, and think of a better name for `urlindex` (`urltable` maybe?)...
- `GooglerCmd`
- `GooglerArgumentParser`

I'm busy recently so it will take a while.
The finished parts could be reviewed and/or regression-tested and I'd like to have feedback.