Skip to content
This repository has been archived by the owner on Aug 11, 2022. It is now read-only.

Enable automated running of dedupe #4761

Closed
bclinkinbeard opened this issue Feb 23, 2014 · 21 comments
Closed

Enable automated running of dedupe #4761

bclinkinbeard opened this issue Feb 23, 2014 · 21 comments
Milestone

Comments

@bclinkinbeard
Copy link

The nested versus flat dependency tree argument is literally the only thing people regularly bring up as a reason to use Bower over npm. The only reply ever given is to use dedupe, but there is no way to automate it. If npm is going to significantly improve adoption for client side use, I think this needs to change. You can't make the argument that people should use a different tool than they currently are while rejecting their top complaint about the tool being offered.

Currently, the primary hurdle to automating dedupe is that running a command like npm install --save foo does not trigger the postinstall hook. (I personally think that should be considered a bug, but that is neither here nor there.) There also seem to be issues with the npm update hook being inconsistent but I've yet to document their exact nature.

What it comes down to is there needs to be a way to run a command whenever the contents of node_modules changes for any reason (as a result of an npm command). This could be used to run dedupe for people worried about such things, or even something like lockdown.

I want npm to "win" as badly as anyone, but refusing to address this issue just seems like a non-starter.

/cc @isaacs @domenic @timoxley @kristoferjoseph @techwraith @robdodson @IgorMinar

@domenic
Copy link
Contributor

domenic commented Feb 24, 2014

This makes sense to me, although I would approach it more from first principles. You are going from the particular to the general in a direction I'm not that excited about. (After all, people can use their favorite watch utility: mywatch node_modules/**/* "$my_command_here" or whatever.)

What I would be more interested in is making sure npm always minimizes duplication. That is, npm install first downloads and analyzes the entire deep dependency graph, then figures out the minimal number of package installs that satisfies that. That would mean making dedupe unnecessary in most cases.

@isaacs has mentioned that there was a conscious choice in the past not to do this, presumably because the two-stage download is inefficient in some cases. I would ask if he wants to reconsider.

Otherwise, I'd suggest making this a config you could set on all package-modifying operations. npm install --dedupe or npm update --dedupe or similar would run npm dedupe afterward, at first approximation.


An alternative, more ambitious idea:

Right now we kind of have the worst of both worlds: it sometimes dedupes automatically, if it can, and it's convenient. But usually not. This inconsistency means you get neither the benefits of complete-deduping, nor the benefits of complete isolation. (I know @defunctzombie is a fan of the latter, and has in fact written npm-install to carry that out.)

It might be nice to make npm do only one of these two modes, I guess controlled by a flag. Which the default would be is unclear to me; I'd prefer dedupe, but isolated would be closer to the current default.

@bclinkinbeard
Copy link
Author

I like the config flag idea, though that precludes automating the use of something like lockdown. A hook is more general purpose and opens the door to more workflow customizations we may not have considered.

I'm also hesitant to endorse any of the more ambitious suggestions, simply because they seem likely to delay any resolution by quite a bit of time.

@domenic
Copy link
Contributor

domenic commented Feb 24, 2014

I really think the general solution for the general problem you describe is to just use a watch utility.

@bclinkinbeard
Copy link
Author

I think a watch utility on a project of any significant size is going to run into "too many open files" issues and be a regular source of headaches. It's also a barrier to entry of being able to say "just use npm... oh, and set up a watch utility, configure it to do x and y in response to these events, etc.". It will have the same issues hook scripts do now, which is not knowing when an operation has really completed. A watch utility receiving 73 update events and 284 add events is far from useful.

@defunctzombie
Copy link

Localized dependencies make module maintenance and dependency management wonderful and I never understood why npm dedupes anything at all during install time. This seems like a premature optimization at best and leads to broken, non-deterministic installs at worst (yes, I have faced these issues myself). IMO npm should not do any sort of deduplication at any time and should always install the full tree (note, this does not mean that modules have to be downloaded over and over, we already have caching to solve this). Deduplication should be opt-in and not even something npm needs to do, deferring to external tools.

To answer the comment about bower, I again don't see the issue. People are complaining that the same module version is installed several times? This is a job for the package tool (browserify already handles de-duping the same code which is a far better approach than some sort of arbitrary version de-duplication). The safety net with isolated modules and dependencies allows us to move away from the world of (install this giant framework on your page to use my module) and helps keep modules small, maintainable, and tested. I will also add that I think the notion of some duplicate code in client js is really a small nitpick trying to spread into FUD as to why npm can't work for client side modules. The reality is that it works fine and the few instances where you run into some duplicate code, it doesn't matter! If it does start to matter, you can take a look at your deps and see where you might have issues, but avoiding a great module and dependency system just to save 1k is not gonna make your break your hobby project or your production website.

@bclinkinbeard
Copy link
Author

For the record, I completely agree with you that the concerns over duplication are overblown. The example always cited is multiple (conflicting?) versions of jQuery, leading to a bundle larger than you ideally want to serve, but I think the chances of that happening, especially without someone noticing, are very slim. 

That being said, it is something that comes up a lot and simply providing a hook that can be used seems like a decent way to appease people who feel like it is a reason to avoid npm.

@ryanflorence
Copy link
Contributor

  • As npm gains adoption for client-side packaging, modules will get smaller and iterate faster like those for the server, then we will start to see surprising amounts of JS being delivered to the browser just like the seemingly endless scrolling of npm install.
  • the jQuery case is real, people who are using jQuery from npm are just more clueful today. However, in the future, if it becomes commonplace, some modules depending on jQuery will unfortunately be tied to a specific version w/o reason, and novice devs won't even know its a problem, and probably won't realize how big their app.js's file is.

I've been screwing around with dedupe on some random projects of mine and have noticed it drops the size of node_modules nearly 75%.

However, it is a compelling case that browserify handles this. Perhaps it is a job for the bundling script, not the package installer.

@bclinkinbeard
Copy link
Author

LOL, Bower appears to have shipped this feature today. bower/bower#718

@timoxley
Copy link
Contributor

@bclinkinbeard I had a quick look into what would be involved in implementing the additional hooks, seems easy enough, maybe start with this, add tests, make names better, add remove hooks, and send a PR – Might be a waste of time but it's easier to discuss concrete implementations rather than hypotheticals.

@isaacs
Copy link
Contributor

isaacs commented Feb 25, 2014

FWIW: I am ok with deduping in the process of installation. It's basically a rewrite of lib/install.js, which is a sensitive thing to go around rewriting, but it's a rewrite that can certainly be valuable.

It'd be best to approach as a multi-stage process. The first step builds the tree, doing lots of http/git stuff, then it prunes and dedupes, then fetch all the things and lay them down on the file system, then walk through the tree three times to run preinstall/install/postinstall scripts.

The other benefit is that we could conceivably do a bunch of the fs stuff synchronously, and have less locking and other monkey business.

@isaacs
Copy link
Contributor

isaacs commented Feb 25, 2014

Also, then we could have progress bars, which is pretty much even theoretically impossible with the current approach.

@kristoferjoseph
Copy link

For me dedupe on install isn't as big of a deal as dedupe on publish. I would rather rest assured that this will be done at publish time.

@kristoferjoseph
Copy link

Running dedupe and lock at publish time would be ideal for all the use cases I have.
This could also open an opportunity to prompt a module developer to make educated decisions on what they really want bundled with their published library.

@domenic
Copy link
Contributor

domenic commented Feb 25, 2014

I don't understand dedupe on publish; published packages don't contain dependencies, so what would that accomplish?

@timoxley
Copy link
Contributor

@kristoferjoseph

Running dedupe and lock at publish time would be ideal for all the use cases I have.

You can already do anything at publish time via the prepublish script: https://www.npmjs.org/doc/misc/npm-scripts.html

This could also open an opportunity to prompt a module developer to make educated decisions on what they really want bundled with their published library.

https://www.npmjs.org/doc/misc/npm-developers.html#Keeping-files-out-of-your-package

@bclinkinbeard
Copy link
Author

@isaacs Does that mean you would prefer not to create a new hook and/or change the conditions under which postinstall runs? Or does integrating dedupe not necessarily preclude those as well? I don't know that I have a preference either way at this point, just curious.

@kristoferjoseph
Copy link

That will teach me to try and comment on an issue from the train. I conflated the need to educate package publishers with issues bower proponents made for needing to not include multiple versions of a library. I.e. Compiling two versions of jQuery into your bundle etc.
Deduplicating modules by default at install time would be useful for optimizing files downloaded to disk. +1000

@stuartpb
Copy link
Contributor

I'd complain that tree-flattening/deduping will lead to problems if I require ("module/node_modules/submodule"), but I guess that would be a natural consequence I'd have to face for not just adding "submodule" to my package's dependencies in the first place. (Might be worth checking how many modules in the npm registry do this atm.)

If a tree-flattening dedupe were made part of installation, I can still kinda foresee issues with people writing code that requires modules that they didn't specify as dependencies, but were coincidentally installed to their module's root as a shared dependency of all of the dependencies they did specify.

(ie. a lists b 0.4.x and c 0.5.x as dependencies, b 0.4.1 and c 0.5.3 both list d 0.3.x as a dependency, dedupe installs d 0.3.7 alongside b and c in node_modules, and then a has require("d") in its code).

Then, later, a conflict introduced by dependency changes in an explicit dependency's patch bump makes that shared junction disappear (eg. b 0.4.2 updates to d 0.6.x, so then d 0.6.8 is installed under node_modules/b and d 0.3.7 is installed under node_modules/c, and node_modules/d is no longer installed)...

This could warrant adding a warning to the core Node require function, something like WARNING: requiring module from node_modules not specified as a dependency in package.json, meaning Node would now have two points where it knows about package.json.

@edef1c
Copy link
Contributor

edef1c commented Sep 16, 2014

@stuartpb If you require from a node_modules that isn't yours, that's user error. We're not going to support that (in fact, for the Windows tree flattening thing, we're looking for ways to actively break it so people don't come to depend on it)

The issue with tree-flattening is more subtle, but I'd say that's an issue of education mainly.
Don't ls node_modules, npm ls.

Meddling with node's module loader comes with the usual caveat of it being frozen, though node core folk might be more open to adding a warning like that.

@stuartpb
Copy link
Contributor

Meddling with node's module loader comes with the usual caveat of it being frozen, though node core folk might be more open to adding a warning like that.

If not, another alternative would be to have a "nanny" bot on the npm registry that scans require calls in all published modules (like browserify) and emails the author with a warning when it finds something wrong like this. (Granted, that means it wouldn't catch problems in unpublished projects, like apps, but it'd still prevent problems like this from finding their way into places where they could spread.)

@iarna
Copy link
Contributor

iarna commented Dec 12, 2014

This is going to be implemented under #6912. As such, I'm going to close this ticket and any further discussion should occur over there.

@iarna iarna closed this as completed Dec 12, 2014
@npm npm locked and limited conversation to collaborators Jun 24, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests