Reduce repository size #1

Closed
carlschroedl opened this issue Jan 13, 2016 · 22 comments

Comments

@carlschroedl
Collaborator

GitHub has a soft repo size limit of 1GB. Though this is a small project, this repository is already approaching that limit. Contributors to the large repo size include checked-in:

  1. dependencies (jars)
  2. data (.zip, shapefiles, etc)
  3. compiled artifacts (the autoredistrict jar)
  4. release artifacts (autoredistrict.zip on the GH releases entries)

I have experience with tooling that we can use to address all four. We could resolve these issues by...

  1. using formal dependency management tools like maven, gradle, ivy, etc. I have used maven on more than 50 projects. If you would prefer a different tool, I'd consider learning it.
  2. moving the data files to git-lfs, or hosting them on an ftp server. Though git-lfs is new, I'm using it on a fledgling work project, so I understand some of the pitfalls.
  3. Using formal dependency management tools like maven, we can upload compiled artifacts to publicly accessible locations so that others can download the autoredistrict jar for purposes of running it or re-using it from other code. I do this at work for all of our java projects.
  4. If you want to include autoredistrict.zip in your github releases, we can automatically attach them when we make the tag by running Travis CI. I've used Travis CI before. I haven't used the github releases deployment step, but it looks easy.

What do you think?

@happyjack27
Owner

I'm all for just deleting the data sources directory, which is probably most of the repository size. Also don't need the jar or the zip.

I want to keep linking and compiling as simple as possible, though. So I'd rather just include the jar/class/java files than have to use a dependency management tool.

We can delete the unnecessary crap like data sources - that was more for my convenience when starting it out and debugging. Not for prime time. I considered just making a completely separate repository for data sources, but I don't think even that's very valuable now that the program has automated download and aggregate in the file menu.

So... Delete unnecessary crap, then see where it's at. If it's still too big or eventually becomes too big, revisit. But I'd rather just make it a non-issue if possible. One thing that's turned me off about projects is all the linking and compiling configuration. If we have to, we have to. But if we don't have to, why not keep it simple?

So delete crap and see where that leaves us.

@happyjack27
Owner

So to reply itemized:

  1. Keep
  2. Delete with prejudice
  3. Delete
  4. Delete older releases. Maybe keep the 2 or 3 newest.
  5. See how above works, go from there
  6. Just delete, methinks. Or, if you want, make an FTP. Doesn't matter to me
  7. Same as 1.
  8. Agreed. Just me being lazy.

@happyjack27
Owner

Sorry for replying in pieces.

Re Travis CI - not necessary, I just select all, send to zip. No need to commit the .zip or .jar to the repository though. Maybe add it to exclude. Just me being lazy.
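
A minimal sketch of the "add it to exclude" idea, assuming the built artifacts are named autoredistrict.jar and autoredistrict.zip (the exact file names are guesses from this thread) and that the rules live in a top-level .gitignore. The helper name ignore_artifacts is made up for illustration. Only the named files are ignored, so checked-in dependency jars stay tracked.

```shell
# ignore_artifacts FILE...: run from the repository root.
# Hypothetical helper: adds each name to .gitignore and stops tracking it,
# while leaving the file itself on disk.
ignore_artifacts() {
  for f in "$@"; do
    printf '%s\n' "$f" >> .gitignore
    # --cached removes the file from the index only;
    # --ignore-unmatch makes this a no-op if it was never tracked.
    git rm -q --cached --ignore-unmatch -- "$f"
  done
  git add .gitignore
}
```

Usage would look like `ignore_artifacts autoredistrict.jar autoredistrict.zip` followed by a normal commit. Note that this only keeps new copies out; versions already committed remain in the history.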

Re Maven v other dependency tools - Maven is fine, seems to me it's popular. I really don't have experience with dependency management tools.

And finally, how does all that sound - what are your thoughts?

And thanks!

@happyjack27
Owner

Okay I deleted all the unneeded stuff. still says the repository is half a gig, but when i add up the folder sizes, it should only be a few MB. what gives? any ideas?

i noticed the .git folder on my hard drive is over a gig! related?

Also turns out i can't figure out how to delete old releases. if you can, be my guest.

@carlschroedl
Collaborator Author

Sounds good! I'll help out with things you agreed you'd like to delete.

@carlschroedl
Collaborator Author

Since git is a distributed version control system, the default behavior is for every collaborator to have every version of every file. When you run git clone, it's kind of like rsyncing the root repository directory on an svn server, only a bit easier and faster. So when you git rm a file, that file hangs around in the repo history (the .git directory). Similarly, the file would still be around in an older revision directory in svn (I think? Sorry, it's been a while since I used svn). To get past this limitation in git, we have to rewrite history. On a project with many collaborators, rewriting history would be a very scary thing; but since we are just starting out, it's probably ok.
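
The point about git rm leaving data behind in .git can be seen in a throwaway repo. This is only a sketch: the temp directory, the file name big.bin, and the demo identity are all made up for illustration.

```shell
set -e
demo=$(mktemp -d)                 # scratch directory for a hypothetical demo repo
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit a 1 MiB file of random (incompressible) data.
head -c 1048576 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add big file"
blob=$(git rev-parse HEAD:big.bin)   # object id of the file's contents

# Delete the file in a new commit.
git rm -q big.bin
git commit -q -m "remove big file"

# The working tree no longer contains big.bin...
ls
# ...but its contents are still stored in .git, reachable from the first commit.
git cat-file -t "$blob"           # prints "blob"
git cat-file -s "$blob"           # prints 1048576
git count-objects -vH             # on-disk size of the object store
```

That retained blob is why the folder sizes add up to a few MB while .git stays large.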

Long story short, BFG is the tool you need:
https://rtyley.github.io/bfg-repo-cleaner/

@carlschroedl
Collaborator Author

I could run BFG to clean the large files out of the repo history, but I think I would need rights to force push to your fork, which is a kinda scary proposition since at this point I'm still pretty much an internet rando :).

@happyjack27
Owner

i was going to ask if you could do it. i've used it once to fix the repo after committing a >100MB shapefile before i knew that that breaks the repo. not too experienced with it other than that, though. i do feel comfortable giving you rights to the repo. not a "rando" to me. (and i got local copies anyways.) (though i don't look forward to the fabled git hot potato... i'm more used to svn, where it's much easier to resolve conflicts) what would i need to do to give you the rights you need to clean it?

@carlschroedl
Collaborator Author

You would need to add me as a collaborator so that I get push rights. You can manage that in this repo's "Settings" page and the "Collaboration" sub-page, or by visiting this url:
https://github.com/happyjack27/autoredistrict/settings/collaboration

@carlschroedl
Collaborator Author

After I finished removing the big files from the history, I would force push the rewritten history to your github fork. Collaborators would need to re-clone the repo to ensure we didn't have different versions of history.

@happyjack27
Owner

meaning i'd need to re-clone it, right?

@carlschroedl
Collaborator Author

Haha, yeah. You and whoever "jimbrill" is.

@happyjack27
Owner

that's also me. he loaned me his mac so i could write some iphone
software. now i own it.

@carlschroedl
Collaborator Author

XD! Haha. ok.

@happyjack27
Owner

you're now a collaborator.

@carlschroedl
Collaborator Author

Thanks! The 'data sources' dir is definitely the low-hanging fruit, so I'm trying that first via:

java -jar bfg-1.12.8.jar --delete-folders 'data sources' autoredistrict.git

@carlschroedl
Collaborator Author

That command knocks repo size down to about 70 MB. Looks like the remainder of that is largely src/resources/Wards_111312_ED_110612.json, so I'll scrub that next.

@carlschroedl
Collaborator Author

I removed it from the history via:

java -jar bfg-1.12.8.jar --delete-files 'Wards_111312_ED_110612.json' autoredistrict.git
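
BFG rewrites the commits but does not by itself delete the now-unreferenced data from disk; its documentation recommends following a run with a reflog expiry and an aggressive gc inside the cleaned repo. A sketch of that follow-up step is below. The wrapper name prune_after_bfg is made up for illustration, and this thread does not show whether exactly these commands were run.

```shell
# prune_after_bfg DIR: expire all reflog entries and garbage-collect, so that
# objects no longer referenced after a history rewrite are removed from disk.
# Runs in a subshell so the caller's working directory is unchanged.
prune_after_bfg() {
  (
    cd "$1" &&
      git reflog expire --expire=now --all &&
      git gc --quiet --prune=now --aggressive
  )
}
```

For the mirror clone used above, that would be `prune_after_bfg autoredistrict.git`, run before pushing.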

To the best of my limited abilities (mostly just building off of this blog post), I've analyzed the remaining objects in the repo's history. I've also poked around the file system a bit on the most recent commit on master. If there are data that could be further deleted, they are pretty small. I've force-pushed to my fork. You can verify that my fork is smaller and retains the relevant files and history entries by cloning my fork to your workstation. It should be much faster, but still have what we need.

If you approve, I can force push to your fork.
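
One way to look for large objects remaining in the history is sketched below. This is not necessarily the method from the blog post mentioned above; it is a common plain-git approach, and the function name largest_blobs is made up for illustration.

```shell
# largest_blobs [N]: print "size path" for the N largest blobs reachable
# from any ref in the current repository (default N = 10), biggest first.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |       # keep file contents, drop commits and trees
    cut -d' ' -f2- |           # strip the type; paths with spaces stay intact
    sort -n -r |
    head -n "${1:-10}"
}
```

Run from inside the clone, for example `cd autoredistrict.git && largest_blobs 5`. Sizes are uncompressed byte counts, so they overstate the space used by files that compress well.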

@carlschroedl
Collaborator Author

As in, I probably /can/ push to your fork right now, but I'd like to get the 👍 from you first :)

@happyjack27
Owner

looks fine. i guess just launch ui.Applet and make sure it launches fine, and if so you can go ahead and push. worst case scenario i have local copies.

@carlschroedl
Collaborator Author

I pushed the slimmer repo up to your fork:

> git push
Counting objects: 6531, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (1113/1113), done.
Writing objects: 100% (6531/6531), 51.07 MiB | 522.00 KiB/s, done.
Total 6531 (delta 5336), reused 6531 (delta 5336)
To git@github.com:happyjack27/autoredistrict.git
 + 1b347d8...1e3d3b9 master -> master (forced update)
 + 6d36f20...7349976 1.0 -> 1.0 (forced update)
 + 9d4473d...fa2d04b 1.02 -> 1.02 (forced update)
 + bc88229...41c359b 1.03 -> 1.03 (forced update)
 + 3976efe...fc0a9b0 1.04 -> 1.04 (forced update)
 + e9d6f67...3bca46c 1.05 -> 1.05 (forced update)
 + 95c131b...38e22c2 1.06 -> 1.06 (forced update)
 + 8d14ad2...766edfe 1.07 -> 1.07 (forced update)
 + d1db825...4db8e2b 1.08 -> 1.08 (forced update)
 + 04be2c3...0dfe0f2 1.09 -> 1.09 (forced update)
 + 446dabc...63f35f0 1.1 -> 1.1 (forced update)
 + 40b445d...8af0044 1.10 -> 1.10 (forced update)
 + ece0827...1590954 1.11 -> 1.11 (forced update)
 + 6526fb0...e638281 1.12 -> 1.12 (forced update)
 + 8993d79...5860c52 1.13 -> 1.13 (forced update)
 + 93a58f6...40236b4 1.14 -> 1.14 (forced update)
 + 9a8fc6e...ab8e4f3 1.15 -> 1.15 (forced update)
 + b7b3b54...6b6c215 1.16 -> 1.16 (forced update)
 + bfb67bb...4b30353 1.17 -> 1.17 (forced update)

Please verify and close.

@happyjack27
Owner

  • backed-up local repo,
  • deleted local repo and disconnected
  • re-cloned
  • re-built (automatically)
  • launched application
    successful.

Closing issue. Thanks!
