crappyspider VS casperjs, healthchecks and sitemaps #17
Comments
crappyspider is not intended for functional testing: it is primarily a reusable tool to quickly provide a map of the site. This map can be consumed by any other tool, like a screenshot script, a healthcheck, etc. It also features a lightweight "healthcheck" by reporting the HTTP status of each visited page, since the crawler already collects that while discovering URLs. One advantage of making it an external tool is that it can be used on sites built with a different stack, like the corporate website. We will still need selenium, casperjs, or whatever for functional testing; that is a much bigger project than this modest script :)

@benoitbryon, regarding your thoughts: moving the config file related to a web application into that application's repository is a very viable alternative to my previous idea of creating a dedicated repo. 👍

I believe we cannot crawl once on a given environment and then reuse the resulting map on another: we cannot guarantee the datasets will be the same. Creating data on the fly, as I understand you suggest, would be awesome, but once again it is out of the scope of this project, whose ambitions are quite limited. There is indeed a task force to set up regarding anonymized/fake data generation. Until then, crappyspider can prove useful, so give it a chance :)
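To give an idea of how such a map could be consumed downstream, here is a minimal sketch. It assumes the crawler emits a JSON list of entries with a URL and its HTTP status; the actual output format of crappyspider may differ, and the file name is hypothetical.

```python
import json

# Hypothetical consumer of the crawler's output map.
# Assumed format: a JSON list of {"url": ..., "status": ...} entries.
with open("crawl_output.json") as f:
    visited = json.load(f)

# Lightweight "healthcheck": flag every page that did not answer 200.
broken = [page for page in visited if page["status"] != 200]
for page in broken:
    print("KO {status} {url}".format(**page))

# The same map can feed other tools, e.g. a screenshot script.
urls_to_screenshot = [page["url"] for page in visited if page["status"] == 200]
```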
We can do it quite easily. We once did it in PostBox for healthchecks. It is a matter of
Isn't a simple dataset enough to have a working sitemap? All you need is one working URL for each URL pattern, so the "check" dataset can be really tiny.
Notice that my proposal is not to re-create fake data on the fly every time you need it.
So, some notes after the talk about crappyspider... As I understood from the talk:
As an example: given some known data and the URL patterns defined in your application, then for a given application version you will always get the same list of URLs. A hypothetical illustration:
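The concrete data and URL patterns from the original comment are not preserved here, so the names below are made up purely to illustrate the point.

```python
# Hypothetical illustration: two organizations with known slugs...
organizations = [{"slug": "acme"}, {"slug": "initech"}]

# ...and Django-style URL patterns in the application:
#   /organizations/
#   /organizations/<slug>/
#   /organizations/<slug>/members/

# Given this data and this version of the application, the URL list is
# always the same, whoever (or whatever) computes it:
expected_urls = ["/organizations/"]
for org in organizations:
    expected_urls.append("/organizations/{slug}/".format(**org))
    expected_urls.append("/organizations/{slug}/members/".format(**org))

# ['/organizations/', '/organizations/acme/', '/organizations/acme/members/',
#  '/organizations/initech/', '/organizations/initech/members/']
```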
Then one question is: wouldn't it be simpler to manage the list of URLs directly inside the application? I mean, what is the value added by crappyspider's initial crawling? Imagine the following scenario:
As said in previous comments, as a bonus, the fixtures can be used for healthchecks/smoketests. For example, "user can log in" would be a nice smoketest.
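A minimal sketch of such a "user can log in" smoketest, using the requests library; the login URL, form field names and credentials are hypothetical and would come from the known fixture account discussed above.

```python
import requests

# Hypothetical base URL and credentials for the healthcheck account.
BASE_URL = "https://example.com"

session = requests.Session()
response = session.post(
    BASE_URL + "/login/",
    data={"username": "healthcheck", "password": "s3cret"},
    allow_redirects=False,
)

# A successful login typically redirects (302) to a landing page;
# adapt the assertion to the application's actual behaviour.
assert response.status_code in (200, 302), "login failed: %s" % response.status_code
```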
I think the "dynamic" URL feature (that is what crappyspider does by crawling a website) could be useful to assert that every URL is tested. But I also think this feature is not a big deal if we do some TDD:
Notice that I'm no longer speaking about sitemaps: the URL list is a subset of the website. It is not the full list of all the URLs in the website; it is a list of URLs that are valid in a given context (typically an authenticated user).
In fact, I think crappyspider could be useful for public websites, or websites you do not maintain, where you can have a simple list of URLs and do not need to bother about data. But here at Novapost we are planning to use it for private websites (where authentication is required) and the URL list depends on data.
I'm closing this issue for now. The crawling feature of this script does the trick for our "screenshotting" needs at the moment.
Here is a summary of some points discussed on IRC, for the record.
I'm starting with what I noticed... feel free to discuss or mark as "wontfix" ;)
As I understood, this project (crappyspider) is meant to:
I think the URL list should be moved to the repositories of the web applications. This could be done with #12. It could also be done using some "sitemap" feature in the web application's code. It is the best way to ensure the list of URLs is valid and updated as part of the application's source code.
One argument for Scrapy versus selenium/casperjs is speed: we do not need a browser. With speed in mind, I think the list of URLs can be unique for a given version of the web application. I mean, we do not have to scan the application on every environment. We can run it once (in DEV or INTEGRATION?) and then use it on other environments.
But then how do we support dynamic URLs that contain slugs/PKs? Let's use the same tools as for healthchecks... Healthchecks read and use the live database, i.e. in PROD, real data is involved, not fake data like in tests. If we had to write some healthcheck that checks "a user can log in", then we would have to get_or_create() a known user sample: a real account, but used for healthchecks only. Then we make sure this data is available in every environment (a kind of fixture). We can do that too in this crappyspider project.
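A sketch of that get_or_create() idea for a healthcheck-only account, assuming a Django project; the username, email and password here are hypothetical placeholders.

```python
from django.contrib.auth import get_user_model


def ensure_healthcheck_user():
    """Make sure a known, real account exists for healthchecks/crawling."""
    User = get_user_model()
    user, created = User.objects.get_or_create(
        username="healthcheck",
        defaults={"email": "healthcheck@example.com", "is_active": True},
    )
    if created:
        # Hypothetical known password, shared with the healthcheck config.
        user.set_password("known-healthcheck-password")
        user.save()
    return user
```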
As I explained above, I think the list of URLs could be an almost static list maintained as part of the code. So I wonder if Scrapy is the adequate tool for this purpose; it may be a bit overkill when, typically, we just want one valid "live" URL for each pattern in urls.py. The list is finite, and it is updated along with urls.py.
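A sketch of such a code-maintained list, assuming a recent Django project: one concrete URL per pattern, built with reverse() so it stays in sync with urls.py. The view names and sample kwargs are hypothetical.

```python
from django.urls import reverse

# Hypothetical mapping: one sample set of kwargs per named URL pattern.
URL_SAMPLES = [
    ("organization-list", {}),
    ("organization-detail", {"slug": "acme"}),
    ("organization-members", {"slug": "acme"}),
]


def live_urls():
    """Return one valid "live" URL for each URL pattern in urls.py."""
    return [reverse(name, kwargs=kwargs) for name, kwargs in URL_SAMPLES]
```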
Then, about Scrapy itself as a way to check a web application, I would recommend casperjs, selenium, or any other tool that is more JS-developer or designer friendly:
Last but not least, I think we have to check 405, 302 or 301 status codes, and POST/PUT/DELETE actions. Again, I think casperjs/selenium could be better than Scrapy for that.
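For illustration only, such checks could also be expressed directly in Python with requests; the URLs and expected status codes below are hypothetical examples, not part of this project.

```python
import requests

BASE_URL = "https://example.com"  # hypothetical application under test

# A redirect that should stay a 301/302, not be followed blindly.
r = requests.get(BASE_URL + "/old-path/", allow_redirects=False)
assert r.status_code in (301, 302), r.status_code

# A read-only endpoint should reject unsafe methods with 405.
r = requests.delete(BASE_URL + "/organizations/acme/")
assert r.status_code == 405, r.status_code
```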