Scrape fails be javascript (Mouser/Farnell) & web module enhancement #242
Comments
I could not reproduce this error; for me the result was just fine: test.xlsx.
It looks like the root cause of this issue is that Mouser returns a captcha after scraping some parts. I did some further experiments and figured out that if you handle cookies correctly, the captcha can be avoided. The following (partial?) solution worked for me if the
See attached mouser.py.txt for an example implementation of those hacks. Now, the question is how to cleanly implement the following in KiCost's code:
@mmmaisel thanks for the debugging and discussion. We (@devbisme and I) know that these multiple sequential scrapes could be interpreted as a robot and blocked by the distributors. (Since the last code modifications in #223, something could be implemented on
@hildogjr yes, fake_browser (as a class) would be a good place for storing various things like the cookiejar, the selected user_agent, the domain name, the urllib opener, scrape retries and other things. At this point, I think the code needs some redesign, because currently KiCost first iterates over all parts (possibly distributed over several threads) and then iterates over all known distributors for a single part (with a random user-agent and without cookies from previous requests). This needs to be changed to iterate over distributors first, then parts, to keep the state information. Maybe it would also make sense to change all distributor modules to classes which hold all required state information and common values like the fake_browser object, currency settings and maybe others. This would also reduce the number of function arguments that have to be passed around. While I'm not a Python specialist, I can help implement the changes if desired.
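A minimal sketch of what such a stateful fake_browser class could look like, assuming Python 3 and only the standard library. The class name `FakeBrowser`, the `fetch` method, the `USER_AGENTS` list and the retry policy are hypothetical and not taken from KiCost's code:

```python
import random
import time
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Hypothetical user-agent pool; the real list may differ.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
]


class FakeBrowser:
    """Keeps one user-agent and one cookie jar for all requests to a domain."""

    def __init__(self, domain, retries=3, user_agent=None):
        self.domain = domain
        self.retries = max(1, retries)
        # Pick the user-agent once, so every request looks like the same browser.
        self.user_agent = user_agent or random.choice(USER_AGENTS)
        # Cookies set by earlier responses are sent back on later requests.
        self.cookiejar = CookieJar()
        self.opener = build_opener(HTTPCookieProcessor(self.cookiejar))
        self.opener.addheaders = [("User-Agent", self.user_agent)]

    def fetch(self, url, delay=1.0):
        """Return the page body as text, retrying on network errors."""
        last_error = None
        for _ in range(self.retries):
            try:
                with self.opener.open(url, timeout=30) as response:
                    return response.read().decode("utf-8", errors="replace")
            except OSError as err:  # URLError and HTTPError derive from OSError
                last_error = err
                time.sleep(delay)
        raise last_error
```

One instance per distributor would then carry the cookies and user-agent across all part lookups for that distributor.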
@mmmaisel nice to hear some tips. I am not an expert either, but any standardization that helps the next user / developer to add and correct features is welcome. If @devbisme (owner of the package) agrees and gives us some directives, we could use the In my view, the point of this implementation is (in addition to your tips) to allow enhancements #4 and #65 (in beta on Digikey). Also, if it allows searching for a part by "value/footprint/tolerance/and so on" and picking, for example, the cheapest one, it could allow me to solve #17 in the future (for resistors and capacitors at least). Yes, I agree, the distributor modules are now the "bottleneck" of KiCost.
@mmmaisel, what do you think:
Goals are:
- Add stateful user-agent and cookie handling in fake_browser
- Scrape parts in distributor, then part order
- Use a class inheritance approach for distributor modules; this allows adding state information and reduces the number of variables passed around
- One scraping thread per distributor, simplify locking

Implemented in this commit:
- Used class approach for distributors and fake_browser
- Parts are scraped in distributor -> part order
- One (IO limited) Python thread per distributor
- Simplified locking
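The "distributor -> part order" and "one thread per distributor" goals can be sketched roughly as follows. This is only an illustration under assumptions: `Distributor`, `DummyDistributor`, `scrape_part` and `scrape_all` are hypothetical names, and the dummy class stands in for a real scraper:

```python
import threading


class Distributor:
    """Base class; subclasses hold their own state (browser, currency, ...)."""

    name = "generic"

    def __init__(self):
        self.seen = []  # stand-in for per-distributor state such as cookies

    def scrape_part(self, part):
        raise NotImplementedError


class DummyDistributor(Distributor):
    """Fake distributor used only to demonstrate the control flow."""

    def __init__(self, name):
        super().__init__()
        self.name = name

    def scrape_part(self, part):
        self.seen.append(part)
        return "{}:{}".format(self.name, part)


def scrape_all(distributors, parts):
    """Run one thread per distributor; each thread walks all parts in order."""
    # Each thread writes only to its own sub-dictionary, so no lock is needed.
    results = {dist.name: {} for dist in distributors}

    def worker(dist):
        for part in parts:
            results[dist.name][part] = dist.scrape_part(part)

    threads = [threading.Thread(target=worker, args=(d,)) for d in distributors]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because every distributor object is touched by exactly one thread, its state (cookies, user-agent) stays consistent without extra locking.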
I started implementing the changes in my local fork. Currently implemented are:
The code is reaching a new level of organization, nice. I will test it to keep us updated about compatibility and other issues. Tip: distinguish the py2 / py3 imports as
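A minimal sketch of an explicit version check, assuming the goal is to avoid `try`/`except ImportError` for the py2/py3 split; the imported names are illustrative and not necessarily the ones used in KiCost:

```python
import sys

# Branch on the interpreter version instead of catching ImportError,
# so a genuinely missing module still raises a visible error.
if sys.version_info >= (3, 0):
    from urllib.parse import urlencode, quote_plus
    from urllib.request import Request, build_opener
else:
    from urllib import urlencode, quote_plus
    from urllib2 import Request, build_opener
```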
This makes it possible to raise the import error (instead of using it to deal with the py2/3 difference) and may be more Pythonic, as already done in some revised files, line 70. One more piece of guidance about the dependencies used:
@mmmaisel, it appears that the Farnell issue was fixed on your branch.
Traceback (most recent call last):
Fixed on #242, merging soon.
Scraping some Mouser parts fails with "Unknown error for from mouser" while other parts or distributors just work fine. This was tested with git revision e7e4243 on Ubuntu 16.04 64bit (Python 2.7) and Ubuntu 18.04 64bit (Python 3.6).
KiCost was invoked with the following command:
PDB shows that the variables "html" and "tree" in distributors/mouser/mouser.py:178 contain some JavaScript garbage instead of the intended HTML source code.
The following minimal KiCad xml demonstrates the error:
test.xml.txt
Terminal output:
term_output.txt