
crappyspider VS casperjs, healthchecks and sitemaps #17

Closed
benoitbryon opened this issue Oct 30, 2014 · 9 comments

@benoitbryon

Here is a summary of some points discussed on IRC, for the record.
I'm starting with what I noticed... feel free to discuss or mark as "wontfix" ;)

As I understood it, this project (crappyspider) is meant to:

  • build a list of URLs of a web application, for use in various "checks".
  • quickly check that URLs return HTTP 200 OK
  • not check every URL, but at least one per "URL pattern" (i.e. checking /user/123/ is enough; we do not have to check every /user/{id}/ such as /user/456/ and /user/789/).

I think the URL list should live in the web applications' repositories. This could be done with #12, or using some "sitemap" feature in the web application's code. It is the best way to ensure the list of URLs is valid and kept up to date as part of the application's source code.

One argument for scrapy over selenium/casperjs is speed: we do not need a browser. With speed in mind, I think the list of URLs is unique for a given version of the web application. I mean, we do not have to scan the application in every environment. We can run the crawl once (in DEV or INTEGRATION?) and then reuse the list in other environments.

But then how do we support dynamic URLs that contain slugs/PKs? Let's use the same tools as for healthchecks... Healthchecks read and use the live database, i.e. in PROD, real data is involved, not fake data as in tests. If we had to write a healthcheck that verifies "a user can log in", we would get_or_create() a known user sample: a real account, but one used for healthchecks only. Then we make sure this data is available in every environment (a kind of fixture). We could do the same in this crappyspider project.
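To make that concrete, here is a minimal sketch assuming a Django application; the helper name, username and defaults are made up for illustration:

```python
# Sketch only, not actual project code: make sure a known "healthcheck"
# account exists, creating it the first time and reusing it afterwards.
from django.contrib.auth import get_user_model


def ensure_healthcheck_user():
    User = get_user_model()
    user, created = User.objects.get_or_create(
        username="healthcheck-user",
        defaults={"email": "healthcheck@example.com"},
    )
    if created:
        # In practice the password would come from each environment's
        # healthcheck/crawler configuration, not be hardcoded here.
        user.set_password("change-me-per-environment")
        user.save()
    return user
```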

As I explained above, I think the list of URLs could be an almost static list maintained as part of the code. So I wonder whether scrapy is the adequate tool for this purpose; it may be a bit overkill when, typically, we just want one valid "live" URL for each pattern in urls.py. The list is finite, and it is updated along with urls.py.

Then, about scrapy itself as a way to check a web application: I would rather recommend casperjs, selenium, or any other tool that is more JS-developer or designer friendly:

  • I think we need functional tests (with browser)
  • I think functional tests should be written by the ones who make the user interface, or by the ones who ask for the features (product owners). Typically, not the Django developers.
  • I think front-end developers and designers will appreciate casperjs/selenium more than crappyspider.

Last but not least, I think we also have to check 405, 302 or 301 status codes, and POST/PUT/DELETE actions. Again, I think casperjs/selenium could be better suited than scrapy for that.

@amessinger
Contributor

crappyspider is not intended for functional testing: it is primarily a reusable tool to quickly provide a map of the site.

This map can be consumed by any other tool, like a screenshot script, a healthcheck, ...

It also features a lightweight "healthcheck" by reporting the HTTP status of each visited page, since the crawler has to request every page anyway to discover URLs.
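For illustration only (this is not crappyspider's actual code), a bare-bones Scrapy spider showing how a crawler can report the HTTP status of every page it visits while following links; the start URL is a placeholder:

```python
import scrapy


class StatusSpider(scrapy.Spider):
    """Report the URL and HTTP status of every visited page while following links."""

    name = "status"
    start_urls = ["http://someapp.local/"]
    # Let non-200 responses reach parse() instead of being filtered out.
    handle_httpstatus_list = [301, 302, 403, 404, 500]

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```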

One advantage of making it an external tool is that it could be used on sites built with a different stack, like the corporate website.

We will still need selenium, casperjs, whatever, for functional testing. It is a much bigger project than this modest script :)

@benoitbryon, regarding your thoughts:

Moving the config file related to a web application into that application's repository is a very viable alternative to my previous idea of creating a dedicated repo. 👍

I believe we cannot crawl once in a given environment and then reuse the created map in another: we cannot guarantee the datasets will be the same.
Simply consider the development of the crawler: it is based on a someapp.local domain which features a very limited dataset. If I wanted to reuse that map on another domain, I could not guarantee that article X or employee Z would still exist.

Creating data on the fly, as I understand you suggest, would be awesome. But once again, it is out of the scope of this project, whose ambitions are quite limited. There is indeed a task force to set up regarding anonymized/fake data generation. Until then, crappyspider can prove useful; give it a chance :)

@benoitbryon
Author

Creating data on the fly, as I understand you suggest, would be awesome

We can do it quite easily. We once did it in PostBox for healthchecks. It is a matter of get_or_create() utilities.

anonymized / fake data generation

Isn't a simple dataset enough to have a working sitemap? All you need is one working URL for each URL pattern, so the "check" dataset can be really tiny.
As an example, if you have an /articles/<article-slug>/ URL pattern, you create article X and that's all: you have the /articles/x/ URL. Of course, we have to choose names/slugs/ids carefully so that they do not clash with real-world names, but that's pretty simple.
Simple enough to be created and maintained manually ;)

@benoitbryon
Author

we have to choose names/slugs/ids carefully so that they do not clash with real-world names, but that's pretty simple.

  1. Choose a slug that does not exist and will certainly never exist, say "healthcheck-slug".
  2. Create the instance in the database (a get_or_create() sketch is shown below).
  3. Later, if a real-life user wants to register "healthcheck-slug" in the database, they cannot, because the slug is a unique key. They just choose another slug.
  4. No problem.
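A minimal sketch of steps 1-2, assuming a Django model with a unique slug field (the Article model and its fields are hypothetical):

```python
# Sketch only: reserve a slug that no real user will ever need, so that the
# check URL /articles/healthcheck-slug/ always resolves.
from myapp.models import Article  # hypothetical model with a unique "slug" field


def ensure_healthcheck_article():
    article, _created = Article.objects.get_or_create(
        slug="healthcheck-slug",
        defaults={"title": "Healthcheck article"},
    )
    return article
```

Because the slug column is unique, running this is idempotent, and step 3 is enforced by the database: a real user trying to take "healthcheck-slug" is refused and simply picks another slug.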

@benoitbryon
Author

Creating data on the fly

Note that my proposal is not to re-create fake data on the fly every time you need it.
My proposal is to create known data once in every environment (right after site installation or upgrade), then maintain it and use it forever.
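As a sketch of how that "create once per environment, right after installation or upgrade" step could look in a Django project (the command name and helper imports are hypothetical, reusing the helpers sketched above):

```python
# myapp/management/commands/ensure_check_data.py -- sketch only.
from django.core.management.base import BaseCommand

from myapp.healthchecks import (  # hypothetical helpers from earlier comments
    ensure_healthcheck_article,
    ensure_healthcheck_user,
)


class Command(BaseCommand):
    help = "Create or update the known data used by healthchecks and URL checks."

    def handle(self, *args, **options):
        ensure_healthcheck_user()
        ensure_healthcheck_article()
        self.stdout.write("Healthcheck data is in place.")
```

The deployment pipeline would run such a command after every install or upgrade.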

@benoitbryon
Author

So, some notes after the talk about crappyspider...

As I understood from the talk:

  • login is performed using the credentials of a real user...
  • which implies this user exists in the database
  • which implies this user has enough data to illustrate all the features you want to check. As an example, in order to have the URL of a user's item in crappyspider's results, you need to make sure the user (the one you log in as) has items attached to their account.
  • which implies you manage a set of fixtures (a user account, items attached to that account)
  • as a result, given the same set of data, the list of URLs is always the same for a given version of your application

As an example, if you have the following data:

  • user = johndoe
  • user items = [one, two]

And the following URL patterns in your application:

  • /user/{user}/
  • /items/
  • /items/{item}/

Given the data and the application version, you will always get:

  • /user/johndoe/
  • /items/
  • /items/one/
  • /items/two/

Then one question is: wouldn't it be simpler to manage the list of URLs directly inside the application? I mean, what value is added by crappyspider's initial crawl?

Imagine the following scenario:

  • in the application's code:
    • maintain a set of "data useful for checks", i.e. data that can live in production
    • along with that data, maintain the list of URLs
  • you do not need additional software (i.e. crappyspider) to get the list of URLs
  • use the list of URLs to perform checks such as HTTP status codes (see the sketch at the end of this comment). You may use crappyspider there.

As said in previous comments, as a bonus, the fixtures can be used for healthchecks/smoketests. As an example, "user can log in" would be a nice smoketest.
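A minimal sketch of that last step, assuming the application ships a hand-maintained list of check paths and that the requests library is available (all names below are hypothetical):

```python
import requests

# Maintained alongside urls.py, using the known fixtures (johndoe, one, two).
CHECK_PATHS = [
    "/user/johndoe/",
    "/items/",
    "/items/one/",
    "/items/two/",
]


def check_status(base_url):
    """Return the paths that did not answer 200 OK, with their status codes."""
    session = requests.Session()
    failures = []
    for path in CHECK_PATHS:
        response = session.get(base_url + path, allow_redirects=False)
        if response.status_code != 200:
            failures.append((path, response.status_code))
    return failures


# Example usage against some environment:
# print(check_status("https://someapp.example.com"))
```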

@benoitbryon
Author

I think the "dynamic" URL feature (that is what crappyspider does by crawling a website) could be useful to assert that every URL is tested. But I also think this feature is not a big deal if we do some TDD:

  • my job is to develop a new feature, say "users own items"
  • the URL list is part of the specification. I mean, as part of the feature's expectations, I know that, given user "johndoe" and item "one", I will have URL "/johndoe/items/" and URL "/johndoe/items/one/".
  • so the first thing I do is update the URL list, then include it in tests (see the sketch below), then implement the feature
  • so the URL list is always up to date when I merge into master. Maintaining the list of URLs is part of my definition of done.
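A sketch of the "include it in tests" step, assuming a Django test suite and a fixture file providing user "johndoe" and items "one" and "two" (file, credentials and module names are made up):

```python
from django.test import TestCase

# The list is maintained alongside urls.py as part of the feature work.
CHECK_URLS = [
    "/johndoe/items/",
    "/johndoe/items/one/",
]


class UrlListTest(TestCase):
    # Hypothetical fixture providing user "johndoe" and items "one" and "two".
    fixtures = ["check_data.json"]

    def test_every_listed_url_responds(self):
        # Log in as the known user since these URLs require authentication.
        self.client.login(username="johndoe", password="known-test-password")
        for url in CHECK_URLS:
            response = self.client.get(url)
            self.assertEqual(response.status_code, 200, url)
```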

@benoitbryon
Author

Note that I'm no longer talking about sitemaps: the URL list is a subset of the website. It is not the full list of all the URLs on the website; it is a list of URLs that are valid in a given context (typically for an authenticated user).

@benoitbryon
Author

In fact, I think crappyspider could be useful for public websites, or for websites you do not maintain, where you can use a simple list of URLs and do not need to bother about data. But here at Novapost we are planning to use it on private websites (where authentication is required), and the URL list depends on data.
I think that, in our use case, the URL list, fixtures and application are too tightly coupled, so we'd better maintain them together in the application's repository.

@amessinger
Contributor

I'm closing this issue for now. The crawling feature of this script does the trick for our "screenshotting" needs at the moment.
We might consider switching to another solution in the future; it would be quite easy to do once such a solution exists.
