
conflicting file extensions #138

Closed
famuvie opened this issue Aug 5, 2015 · 17 comments

Comments

@famuvie
Contributor

famuvie commented Aug 5, 2015

I wrote a draft shapefile reader to be used with ProjectTemplate (related with this question in the mailing list).

However I find an extension conflict issue with shapefiles:

  • a shapefile is specified using at least three actual files, with extensions .shp, .shx, and .dbf
  • the last of these (.dbf) is recognized by ProjectTemplate as an XBASE format
  • it is therefore read using package foreign before the .shp file (due to alphabetical ordering)
  • when the time comes for the .shp extension, the base name of the file is already found in memory and the file is skipped

This is a design problem in .load.data(): it implicitly assumes that each recognized file corresponds to one object, which is not the case for shapefiles.
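To illustrate the ordering problem (the file names here are hypothetical), the components of one shapefile sort so that the .dbf component is encountered first:

```r
# Hypothetical components of one shapefile in a data/ directory.
files <- c("parcels.shp", "parcels.shx", "parcels.dbf")

# Alphabetical ordering puts the .dbf first, so the XBASE reader
# claims the "parcels" variable name before the .shp reader runs.
sort(files)
#> [1] "parcels.dbf" "parcels.shp" "parcels.shx"
```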

@KentonWhite
Owner

A shapefile reader is pretty great! One thought would be to override the .dbf extension using .add.extension. The new .dbf reader could check for a matching shapefile. If there is a matching shapefile, skip the .dbf file; if no matching shapefile is found, load the .dbf file as usual.
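A minimal sketch of that override. The reader signature (data.file, filename, variable.name) and the .add.extension registration are assumptions based on this thread, not a verified ProjectTemplate API:

```r
# Sketch of a .dbf override that defers shapefile components to a .shp reader.
shapefile.aware.dbf.reader <- function(data.file, filename, variable.name) {
  # Does a sibling .shp file exist for this base name?
  shp.file <- sub("\\.dbf$", ".shp", filename, ignore.case = TRUE)
  if (file.exists(shp.file)) {
    # Part of a shapefile: skip it and let the .shp reader load all components.
    return(invisible(NULL))
  }
  # A genuine XBASE file: fall back to the original (unexported) reader.
  ProjectTemplate:::dbf.reader(data.file, filename, variable.name)
}

# Registration (only meaningful with ProjectTemplate loaded):
# .add.extension("dbf", shapefile.aware.dbf.reader)
```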

We created the ability to add new readers as separate packages to address the design challenges of .load.data. When dealing with so many different general data types it is necessary to make some assumptions, such as each file corresponding to a single object. Looking at all the possible edge cases (such as shapefiles), it became clear that the required conditional logic would make ProjectTemplate difficult to maintain over time.

Basically, every time a new data type is added, we would need to check whether it interferes with a currently supported data type. If it does, we would need to decide precedence for loading. This becomes an N-squared problem.

With the current release we made a design decision that new data types can be supported as separate packages. The current data types are the core and additional ones can be added by the user. Not only can people now create their own readers without worrying that other users' tool chains will be broken, they are free to override the behavior of existing types, as I've suggested with the .dbf extension.

Please let me know if you have trouble overriding the .dbf extension or if it doesn't work as it should.

When your shapefile package is ready please let me know and I'll add it to the website.

@famuvie
Contributor Author

famuvie commented Aug 6, 2015

Yes! That's an excellent solution.
It works like a charm.

On the other hand, I had to perform a couple of non-recommended operations that give rise to NOTEs in an R CMD check:

  • I had to call ProjectTemplate:::dbf.reader in the overriding function, for the case of true dbf files.
  • I had to assign(..., envir = .GlobalEnv) in the shapefile reader.

Calling another package's functions through ::: and assigning to the global environment are both bad practices.
Is there a correct way of doing this?

Thanks!

@KentonWhite
Owner

Glad it works! I agree that the hacks are not very elegant. Here are some thoughts -- they may or may not work.

Rather than calling ProjectTemplate:::dbf.reader in the overriding function, can you assign the old function to a new variable before overriding? Something like old.dbf.reader <- dbf.reader? The benefit is that if other readers have already modified dbf.reader, you will use the modified reader and not the original one.

Right now there is no way to avoid referencing .GlobalEnv in an assign call. Having .GlobalEnv scattered throughout the package has caused us issues with CRAN. We are working on alternative methods, but they are not ready for release.

While working on the alternative, we have reached a compromise with CRAN where we have replaced .GlobalEnv with .TargetEnv. Currently .TargetEnv simply points to .GlobalEnv, but provides us with the internal hook for moving towards a better solution.

If you don't want .GlobalEnv in your code, you are welcome to use .TargetEnv. The variable is not exported at this time so you would need to reference it as ProjectTemplate:::.TargetEnv -- not the most satisfactory method but still better than .GlobalEnv.
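In a custom reader, that swap would look roughly like this. The variable name and result below are stand-ins for the reader's actual arguments, and the code guards against ProjectTemplate being absent, falling back to the global environment as the thread describes .TargetEnv currently pointing there anyway:

```r
# Sketch: how a custom reader would hand its result back to ProjectTemplate.
variable.name <- "parcels"        # stand-in for the reader's argument
shp.data <- data.frame(id = 1:3)  # stand-in for the loaded shapefile data

# Direct assignment, which triggers CRAN NOTEs:
#   assign(variable.name, shp.data, envir = .GlobalEnv)

# The compromise described above: use ProjectTemplate's indirection,
# which currently resolves to .GlobalEnv (still via :::, as it is not exported).
target <- if (requireNamespace("ProjectTemplate", quietly = TRUE)) {
  ProjectTemplate:::.TargetEnv
} else {
  globalenv()  # fallback so the sketch runs without ProjectTemplate installed
}
assign(variable.name, shp.data, envir = target)
```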

@famuvie
Contributor Author

famuvie commented Aug 24, 2015

Thank you for your suggestions.
Substituting the reference to .GlobalEnv with ProjectTemplate:::.TargetEnv works fine, although again I need to make use of :::.
On the other hand, dbf.reader is also not exported from ProjectTemplate, so I cannot simply assign the function to a new variable without using :::, which is what I was trying to avoid.

In any case, I hope it helps someone else using ProjectTemplate and shapefiles.

@krlmlr
Collaborator

krlmlr commented Aug 24, 2015

How about creating your own .TargetEnv in your package?
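A sketch of that suggestion, keeping the indirection in the reader package's own namespace so no ::: call is needed (the helper name is illustrative, not an existing API):

```r
# Package-local target environment, mirroring ProjectTemplate's compromise.
# It can later be pointed somewhere else without touching any reader code.
.TargetEnv <- globalenv()

# Hypothetical helper a reader package could use internally.
assign.to.target <- function(variable.name, value) {
  assign(variable.name, value, envir = .TargetEnv)
}
```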

As for the dbf.reader, I'm not sure if this should be the responsibility of your package to call it. Perhaps ProjectTemplate should try calling "the next reader" itself, if it detects that a reader could not cope with a certain input file.

@famuvie
Contributor Author

famuvie commented Aug 24, 2015

You mean a .TargetEnv also pointing to .GlobalEnv?

I agree with your second thought. But it requires a structural change in how ProjectTemplate loads datasets. I don't dare make such a modification myself.

@krlmlr
Collaborator

krlmlr commented Aug 24, 2015

If you decide to give it a try anyway, I'll be happy to review the pull request. I think this is closely related to #140.

@famuvie
Contributor Author

famuvie commented Aug 24, 2015

Indeed. Haven't you implemented such a more flexible approach in your package LoadMyData?
If it works, why not import this one feature back into ProjectTemplate?

@krlmlr
Collaborator

krlmlr commented Aug 24, 2015

As I said -- I'm not happy with LoadMyData, I think the whole idea is flawed.

Perhaps LoadMyData could be really useful if it produced source code instead of actually executing code. Then, ProjectTemplate could add this source code to an R file and execute this file. The benefit here is that it gives the user full control over the data loading process, while at the same time providing a sensible default that works in many cases. This could be achieved with a few minor modifications.

For detecting file types, we could use yihui/mime, which in turn could be expanded to detect file types based on content (rather than the extension) using the file utility.

@KentonWhite
Owner

@famuvie I've played around with the shapefile reader and like how it works! I hadn't considered that dbf.reader is not exported from ProjectTemplate. A thought -- we could expose a function .get.extension that returns the current reader associated with the extension. Then, when overriding the previous reader, one could call .get.extension to retrieve the previous reader and store it in a local variable to be used as a fallback. Would that solution be more elegant?
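A self-contained sketch of how such a .get.extension could pair with .add.extension. The registry and both functions here are stand-ins for ProjectTemplate internals, not existing exports:

```r
# Stand-in for ProjectTemplate's reader registry.
.reader.registry <- new.env(parent = emptyenv())

.add.extension <- function(extension, reader) {
  assign(extension, reader, envir = .reader.registry)
}

# The proposed accessor: return the reader currently registered
# for an extension, so an override can keep it as a fallback.
.get.extension <- function(extension) {
  get(extension, envir = .reader.registry, inherits = FALSE)
}

# Override pattern: capture the previous reader before replacing it.
.add.extension("dbf", function(...) "original dbf reader")
previous.reader <- .get.extension("dbf")
.add.extension("dbf", function(...) {
  # ... shapefile check would go here; otherwise fall back:
  previous.reader(...)
})
```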

@famuvie
Contributor Author

famuvie commented Aug 25, 2015

Yes, I think the .get.extension() function would be a reasonable solution.
In the long term, though, I agree with @krlmlr that the whole dataset-loading process should be refactored.

@famuvie
Contributor Author

famuvie commented Dec 17, 2016

Hi @KentonWhite, any progress on this issue? Shall we open a new one?

@connectedblue
Contributor

connectedblue commented Dec 19, 2016

@famuvie What about introducing a convention that directory names can have extensions too, and a generic reader passes all the files in the directory to a specific reader for data loading?

So, there could be a directory called xxx.shp containing all the relevant files for the xxx shapefile object that are passed safely and without conflicting with other files in the top level data directory.

Some thought would have to be given to the recursive load config parameter, but I'm sure there's a solution there.

This would be a generic solution for other types of multi-file objects as well. For example, I'm interested in text analysis, where a corpus object is usually initialised by loading many training files from a directory. I'd be interested in collaborating on a generic solution that could handle both shapefiles and corpus files.
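A sketch of the proposed convention: a directory named xxx.shp is handed, as a whole, to the reader registered for "shp". The function name and dispatch mechanism are illustrative, not ProjectTemplate code:

```r
# Dispatch one data/ entry (file or extension-named directory) to a reader.
load.data.entry <- function(path, readers) {
  ext <- tools::file_ext(path)   # works for directory names too
  reader <- readers[[ext]]
  if (dir.exists(path)) {
    # Multi-file object: the reader receives every file in the directory.
    reader(list.files(path, full.names = TRUE))
  } else {
    # Plain single-file case, preserving backwards compatibility.
    reader(path)
  }
}
```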

@famuvie
Contributor Author

famuvie commented Dec 19, 2016

Seems good to me, @connectedblue. Still, the implementation should allow single files in data as well, for backwards compatibility and for usability.

@connectedblue
Contributor

@famuvie agreed. I have my hands full at the moment with a data project that needs completing by the end of the year, so I don't want to get distracted. However, we could work on it in the new year if you're up for it?

@KentonWhite
Owner

I thought I had exposed this behaviour. My main computer is in the shop this week; I'll check once I have it back. If the behaviour isn't exposed, I'll add it in.

@famuvie
Contributor Author

famuvie commented Dec 20, 2016

Great! Thanks, folks.
