Open Science Utility Belt #7

Open
bkatiemills opened this issue Jul 30, 2015 · 53 comments

@bkatiemills (Member)

Open Science 101

This session series introduces the practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.

Help Us Develop This Curriculum

We're trying to answer these questions:

  • What skills are needed to practice open science?
  • What's missing / What's unnecessary? (aiming for < 12 sessions over a semester)
  • What's out there now? References! If you've seen related material, send it over.

Let us know your thoughts in the comments!

Sessions

  1. Introduction: what & why
    • What skills will be part of this series?
      • working openly through the entire process (not just warehousing things on the web afterwards) in order to leverage collaboration
      • emphasizing legibility of research outputs for the sake of reuse & reproducibility
    • Why do these things matter?
      • lit review on citation benefits, efficacy benefits, retraction scandals & efficiency.
    • Sources: Working Open guide, TBD
  2. Open Data I: Standards & Legibility
    • What is an ontology?
    • How to effectively use data standards and make data legible?
    • Sources: TBD
  3. Open Data II: Clean Data
    • What is 'clean' vs 'dirty' data, and why do they matter?
      • how to keep data organized and easy to reuse at a later date (including in-house reuse); consider metadata, storage and formats.
    • Best practices for making a reusable dataset when no standard exists.
    • Sources: TBD
  4. Collaboration I: Version Control
    • Basic git, with an emphasis on getting to GitHub as a platform for sharing & collaboration.
    • Source: TBD
  5. Collaboration II: Roadmapping
    • How to lay out a project for effective collaboration.
    • Source: Working Open guide.
  6. Collaboration III: Code Review
    • How to set expectations for good contributions that lead to easy-to-review code
    • How to make the code review process fast and efficient
    • Source: Working Open guide, Code Review Teaching Kit
  7. Code Wrangling I: Sustainable Coding
    • Effective use of documentation.
    • Producing end-to-end analysis automation scripts (R, Python, Shell, or make); understanding of how a well-made automation script serves as 'living documentation'.
    • Sources: TBD
  8. Code Wrangling II: Testing
    • Writing test suites to ensure code quality & build trust to support reuse.
    • Sources: this lesson in Python, TBD in R.
  9. Code Wrangling III: Code Packaging
    • Making & distributing packages to support reuse & collaboration.
      • discussion of useful formalisms for organizing data & code in packages / repos
    • Sources: this lesson in Python, and this lesson in R
  10. Publishing & Communication I: Citation & Discoverability
    • Software & data citation
      • DOIs
      • comments on how this addresses discoverability of code & data
    • Authoring for the Web
      • markdown / knitr
      • metadata
    • Sources: Working Open guide, TBD
  11. Publishing & Communication II: The Research Cycle
    • Strategies for opening the entire research process:
      • Grant process
      • Online lab notebooks
      • blogging, twitter & social media
      • protocol publishing
      • study pre-registration
  12. Publishing & Communication III: Licensing
    • open access publishing
      • comments on impact on science in the Global South / decoupling access from privilege
    • Why are licenses necessary?
    • What can they do? What can't they do?
    • Which ones are the most important and how do they work?
    • How to choose a license, and the intersection of licensing and copyright
    • The importance of agreeing on a license explicitly and early on a collaboration
    • sources: TBD
  13. Change Making
@kaythaney (Member)

Thanks for this, Bill! Would be worth adding links to the Working Open guide here, and seeing if you could line up with some of the language and key categories there to strengthen / augment that work. That also may help with some of the verbiage issues (like "Programming" as a header - not sure that's the best term here, crisp up the language and minimize jargon).

Great start, and more comments to come!

@bkatiemills (Member, Author)

yep, these lessons are going to pull heavily from the WOG, once we agree on the curriculum. Changed 'programming' -> 'code wrangling'.

@abbycabs (Member)

abbycabs commented Aug 5, 2015

Suggested language for the beginning:


Open Science 101

This session series introduces practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.

Help Us Develop This Curriculum

We're trying to answer these questions:

  • What skills are needed to practice open science?
  • What's missing / What's unnecessary? (aiming for < 12 sessions over a semester)
  • What's out there now? References! If you've seen related material, send it over.

Let us know your thoughts in the comments!

Sessions

...


Going through actual sessions now :) Really excited for this work!

@bkatiemills (Member, Author)

From the twitterverse:

@ttimbers suggests data storage & archiving - how to find data associated with a study, how to organize your own data for future reuse; also metadata, storage and formats.
@minisciencegirl suggests organizing data, with useful naming schemes, structure etc.

@taddallas

+1 for @minisciencegirl 's suggestion about naming schemes

Is there any reason to have the Open Data sections after the Collaboration sections? It might flow better (in my mind) from data wrangling into setting the wrangled data free (i.e. Open Data), then diving into collaborations/workflows/code review/version control. This change could also make the transition into publishing easier, as collaborations may lead to publications ... 🙏

@abbycabs (Member)

abbycabs commented Aug 5, 2015

I'm thinking along the same lines as @taddallas re:flow. Teaching packages before git seems off to me.

@bkatiemills (Member, Author)

The ordering of 'Code Wrangling', 'Collaboration', 'Open Data' and 'Publishing & Communication' is actually just the order I thought of them in :)

So, how about the order:

  • 'Open Data'
  • 'Collaboration'
  • 'Code Wrangling'
  • 'Publishing & Communication'

@taddallas

Looks good to me!

One tiny thing: I noticed that much of the material uses Python. Perhaps it would be worthwhile to also show some R examples, as some pretty solid tools for Open Science are built around R (e.g. reproducible analyses and manuscript writing with R Markdown, testing with testthat, etc.). This point is null if the course is designed to be Python-specific, or if you think there'd be too much overlap with the R utility belt you already have.

@bkatiemills (Member, Author)

Nope, R-flavoured implementations of these techniques are definitely something we want! Which gets used depends on the audience, but we definitely want both options for the Code Wrangling section. The current examples are Python for no other reason than I speak Python. That said, I think there was a packages-in-R lesson from UBC recently that I can dig up and link here - if you have a good lesson for testing in R, send it on by!

@blahah

blahah commented Aug 5, 2015

The only thing that is conspicuously missing to my mind is licenses - they are fundamental to open science and are relevant to all the sections above. I would think these are the most important aspects of licenses to cover:

  • Why are licenses necessary?
  • What can they do? What can't they do?
  • Which ones are the most important and how do they work?
  • How to choose a license
  • The importance of agreeing on a license explicitly and early on a collaboration

@bkatiemills (Member, Author)

@blahah - totally agree, added your points to an additional section under 'publishing & communication' - thanks! One thing that would be super helpful in that section, is ideas for hands-on activities, and engaging ways to introduce things like licenses as well as code and data citation; definitely A-list important stuff, but runs the risk of turning into a really dry lecture about DOIs and copyright.

@blahah

blahah commented Aug 5, 2015

I think a nice way to introduce licenses and citation is by doing a set of small hands-on data mining tasks. Introducing some frustrating scenarios that are solved by proper licensing and good data citation should be memorable. We just need a paper with great data but no license, and a paper that does something good with someone else's data but doesn't cite it properly.

@noamross

noamross commented Aug 5, 2015

This may be expanding the scope a bit, but some topics that would have been helpful for me early on, before I really did much coding or had a solid project together, would have been:

  • Keeping and organizing a open, digital lab notebook
  • Searching, collecting, reading and annotating content for re-usability and collaboration

@bkatiemills (Member, Author)

@noamross could that first point fit with the social media unit?

I'd love to hear your ideas on your second point - to be honest, content aggregation is a pretty weak part of my own game, I've never found a method I really liked.

@noamross

noamross commented Aug 6, 2015

Yes, lab notebook could go in social media, but there's a fair amount of the topic that isn't explicitly social: metadata/tagging of notes, formats and organization for searching, plain-text for posterity, etc.

On collecting content, I'm similar. I have a semi-working system of Mendeley + a collection of tagged plain-text notes, but I'm not sure how well it works in terms of collaboration. @cboettig and I once wrote a review together where we built an annotated bibliography using markdown + bibtex, but it felt more like a one-use hack than a system. Ideas from others would be welcome.

@Daniel-Mietchen

Great suggestions so far.

I agree on the "importance of agreeing on a license explicitly and early", and thus think this should come at the beginning of the course and not at the end. As @blahah mentioned, this should work well after some moments of reuse-rights-related frustration, which unfortunately remain all too easy to create.

One aspect that I am missing is an overview of where things are or are not open along the research cycle - we are making progress with making research outputs more widely available, but the research process is still mostly closed (save for a few open notebooks), and funding is basically a dark corner (very few proposals are open, and basically no funding decisions).

@dsalo

dsalo commented Aug 6, 2015

Working with collaborators who don't necessarily Get It about the whole "open" thing. This is one of the top questions I get whenever I talk open with people.

DOIs, and how they are not magic but are important. Data citation. Data journals and other data-publication venues. Data-use tracking and metrics, and how to use them to make a tenure case or a grant proposal stronger.

Where to get help shoring up your weak spots -- nobody can do everything!

Basic digital hygiene: backups, basic security, basic digital preservation (why "I'll put it on my website!" is a lousy idea long-term).

Navigating openness vs. privacy in human-subjects and other sensitive research.

How to use Excel, if you must, without making everyone else hate you. What to use when Excel stops being useful (stats packages, relational databases).

@tgardner4

Would love to see design of experiments, multiple testing corrections, and quality engineering (reducing variability) of experiments in the curriculum. (Happy to contribute on these subjects.)

@wolass

wolass commented Aug 6, 2015

To publishing:
Digital object identifiers - their importance in citing and version control. (NOT RESTRICTED TO CrossRef's DOI)

@noamross I'm using the knitcitations from @cboettig on a daily basis. So if this is the result of your cooperation it certainly wasn't a one-time hack :)

I would underline the importance of learning markdown and using knitr when collaborating on scientific projects.
The most important skills for me were:

  1. Statistics (Coursera courses)
  2. R programming
  3. Markdown
  4. Learning the pipeline: Markdown to Word and PDF, using the knitr package in RStudio with knitcitations and BibTeX
  5. Using Mendeley as bibliography database with quick search option (deadly useful)
  6. Putting my results on the OSF.io project page and sharing them with collaborators
  7. LaTeX <- but this is sth extra

@Celyagd

Celyagd commented Aug 6, 2015

Here is the program we are developing for the MOOCSciNum "Research Practices in the Digital Age", with a strong focus on open research practices. Here is the enrollment page.
It's in French, but we hope that participants will help us translate it into English.

Cheers

Célya

Course Outline

Session 0: Research in the digital age: what transformations? (introductory session)
[Interview] Digital technology and research
[Screencast] Presentation of the MOOC

Session 1: Building on existing scientific resources
[Interview] Libraries and digital technology: what challenges and what roles to play?
[Screencast] Managing your bibliography, alone or in a group, with Zotero

Session 2: Collecting/producing scientific data
[Interview] Digital technology and health data collection
[Screencast] Daydream: an example of online data collection

Session 3: Processing/analyzing scientific data
[Interview] Data and digital technology: what "real" transformations?
[Screencast 1] Research in neurogenetics: an example of using Python and GitHub
[Screencast 2] Data analysis in epidemiology with R

Session 4: Archiving/sharing scientific data: health data, sensitive data
[Interview 1] From shared data to open data in research
[Interview 2] Health data, sensitive data: what rights? What protections?
[Screencast] Sharing anonymized medical data

Session 5: Sharing your scientific results: writing and publishing
[Interview 1] Publishing your research in the digital age: Open Access
[Interview 2] Copyright and Creative Commons licenses: some useful clarifications before publishing
[Screencast 1] Depositing an article in HAL

Session 6: Being part of a scientific community
[Interview] Evaluating and being evaluated: a look at the "machinery" of evaluation and how it is evolving
[Screencast 1] Publicizing your research activities: a comparison of Zenodo and Figshare
[Screencast 2] Communicating about your research: "online" presence.

Session 7: New forms of interaction in research and ethical issues
[Interview 1] Opening up the research process: from citizen science to participatory research
[Interview 2] Research ethics in the digital age
[Screencast 1] Scientific blogs

@abbycabs (Member)

abbycabs commented Aug 6, 2015

Google translate + my eyes (not perfect, please make improvements).

My comments: This (below) is a broader overview of research & access. I think something more hands-on would work better in study group sessions. But I do think looking at the broader picture is a good move - thanks for adding DOIs for code and licensing info.


Course Outline

Session 0: Research in the digital age: what is different? (Introductory session)
[Interview] Digital and Research
[Screencast] Overview MOOC

Session 1: Building on existing scientific resources
[Interview] Library and Digital: What challenges and roles exist?
[Screencast] Generate a bibliography alone or in groups with Zotero

Session 2: Collect / produce scientific data
[Interview] Digital and health data collection
[Screencast] Daydream: an example of online data collection

Session 3: Edit / analyze scientific data
[Interview] Data and digital: what are "real" transformations?
[Screencast 1] Neurogenetics research: example of using Python and Github
[Screencast 2] Epidemiology Data Analysis with R

Session 4: Archive / share scientific data: health data, sensitive data
[Interview 1] Shared data and open data in research
[Interview 2] Health data, sensitive data: which rights? What protections?
[Screencast] Sharing anonymized medical data

Session 5: Sharing research results: write and publish
[Interview 1] Publishing your research in the digital age: Open Access
[Interview 2] Copyright and Creative Commons licenses: some useful clarifications before publishing
[Screencast 1] Uploading an article in HAL

Session 6: Being part of a scientific community
[Interview] Evaluate, be evaluated: return to the "machinery" of the evaluation and its evolutions
[Screencast 1] Publicize your research activity: comparison of Zenodo and Figshare
[Screencast 2] Communicating your research: "online" presence

Session 7: New forms of interaction in research and ethical issues
[Interview 1] Opening the research process: citizen science and participatory research
[Interview 2] Research ethics in the digital age
[Screencast 1] Scientific blogs

@noamross

noamross commented Aug 6, 2015

@tgardner4 I agree that these topics are fundamental but also think they are somewhat out of scope for a ~12 lesson group-study on open science. There are, however, some important connections between experimental design and open science that could be addressed, such as:

  • How can an open scientific process facilitate checking and quality of methods?
  • How to maximize the transparency and auditability of shared experimental designs and statistical methods.
  • How to think about data quality at different stages of data publication
  • How and where to include quality checking information in your data and metadata

@tgardner4

@noamross I agree with your points. What I outlined is a study group unto its own. I think you propose a nice solution though - an intro to the topic (and perhaps a pointer to a separate study group dedicated to a full treatment). By including it in the open science curriculum - even as an intro - you would teach participants that these are core issues that can't be overlooked in proper scientific pursuit.

I would also suggest that the original open science outline described above (the very first post in this thread) is heavily tilted toward a view that open science = coding + publishing. Absent from this curriculum is anything about experimentation or the scientific process. My suggestions are a reaction to this gap. When I hear "science" my mind goes to experimentation. When I hear "open science" I think: "collaboration on the design, execution and sharing of experiments & results." Code and publishing are a necessary, but not sufficient, portion of the scientific process.

@abbycabs (Member)

abbycabs commented Aug 7, 2015

Taking comments from @tgardner4 @noamross and more, there might be more clarity if we change the title to:

Open Science & Data: open research practices when working with scientific data

This could be a follow up series after a broader 'Introduction to Open Science'.

@blahah

blahah commented Aug 7, 2015

@tgardner4 I agree with all your points - these are essential skills and they are not taught sufficiently well in general science courses. I still think that they are not part of open science specifically, and are too many steps removed from the core toolset of open science to feature heavily in the curriculum. Having a short discussion of the relevant aspects where they are directly related (for example in reusing open data), then linking out to a resource which would be developed separately from this curriculum seems like a good way to go.

@dsalo

dsalo commented Aug 7, 2015

@BillMills There's vastly less written about this than I would like. :/ Sometimes introductory project-management techniques are a way in. I have a slidedeck I'd be happy to share with you if you think it would help; otherwise, maybe the way to approach it is an unconference-style discussion, maybe with a plausible case study as an example.

Another way that can work is using a "horror story" as a discussion seed. In this context, I might use a story about a data-ownership snafu (see examples at http://pinboard.in/u:dsalo/t:horrorstories/t:dataownership ) but obviously almost any interpersonal issue specific to open science can be made to work, if there's an available horror story.

@blahah

blahah commented Aug 7, 2015

@acabunoc that title sounds like it fits the content better

@tgardner4

I agree - that’s a better title - more reflective of the content.

Tim

On Aug 7, 2015, at 4:09 PM, Richard Smith-Unna notifications@github.com wrote:

@acabunoc https://github.com/acabunoc that title sounds like it fits the content better



@ctb

ctb commented Aug 8, 2015

A few cents from wandering through this discussion --

"How to set expectations for good contributions that lead to easy-to-review code" - phrasing sets off alarm bells. Also, testing comes 4 sessions later, which is the wrong way 'round - how do you review code that you can't trust? ;). Tests are a prerequisite for code review. Code coverage analysis is missing, also. I would suggest a checklist among other things.

Code packaging? Nix it, IMO. Or move it much later. (Definitely well after testing.) Reasons: it has a lot of sysadminy type stuff that most people won't know or care about.

Soooooooooo you're saying DOIs enhance discoverability of code? Sounds like a theoretical point that doesn't actually work to me. Fine to mention it but DOIs + code are not terribly useful yet.

publishing and communications: copyright vs license should be in there, no?

Automation & scripting is missing from the entire discussion, and yet it's key to sharing any kind of workflow. <=> reproducibility, which is underemphasized.

In my experience, selling scientists on this stuff is 80% of the battle, once they show up. (Technical skills is the next 80% ;). More and stronger motivation. Very few people seem to worry about incorrect results (oddly enough) so efficiency and reputation is a good focus.

Twitter probably belongs in social media, too. Lurking, favoriting, retweeting, subtweeting.

Looks great overall - I think there are probably many ways through all this material, but this is a nice collection of things to consider for any such course!

@bkatiemills (Member, Author)

@tgardner4 yes! Including a pointer to your content and then following it up as its own series of lessons is an A+ solution, 100% on-board with that.

@acabunoc sorry - which title do you want to replace with that, @tgardner4's or mine? Happy to comply either way, let me know.

@dsalo: please link me to your slide-deck! I'm really keen for this content, but I need some help bringing it into focus.

@ctb:

  • twitter, licensing v. copyright, and order of testing v. packaging: amended as per your suggestion, thanks!
  • would keep packaging if rephrased - want to encourage people to break their work out into small, reusable parts rather than big plates of spaghetti, but open to other ways to address this.
  • would appreciate some help rephrasing the importance of contribution guidelines to address those alarm bells; want to touch on lessons learned here and here, but always open to better phrasing.
  • Automation: absolutely agree, need to think about where & how to introduce this, but it should definitely go in. Perhaps folding in @sjackman 's make lesson?

@ctb

ctb commented Aug 8, 2015

On Sat, Aug 08, 2015 at 02:27:02PM -0700, Bill Mills wrote:

  • would keep packaging if rephrased - want to encourage people to break their work out into small, reusable parts rather than big plates of spaghetti, but open to other ways to address this.

modularity, maybe? I don't know how this works in R but in Python it is
super easy to do syntactically, w/module globals lowering the cost (i.e.
unlike Java you don't need to make everything a class to have some
privacy ;)

  • would appreciate some help rephrasing the importance of contribution guidelines to address those alarm bells; want to touch on lessons learned here and here, but always open to better phrasing.

I was thinking about this after my comment. How about tying it together
with a slightly more advanced lesson on git/github/pull requests?

  • make many small contributions, as separate branches;
  • push to github, examine diffs carefully before merging;
  • add checklist with basic things, adhere to checklist;
  • as soon as possible, automate workflow and examine workflow artifacts
    (plots, summary info) before each merge (add to checklist)
  • add a two-person sign off rule for merges when project grows beyond one
    person;
  • if you have automated unit/functional tests, examine code coverage
    periodically to target new tests;
  • set up continuous integration (unit/functional/workflow-level) if you can
    (this is probably too advanced)
  • Automation: absolutely agree, need to think about where & how to introduce this, but it should definitely go in. Perhaps folding in @sjackman 's make lesson?

+0.5, or start with some shell/R/Python scripts that do soup-to-nuts analysis
(load/transform data/make graph/output summary) and then tack on assert
statements. can be done inside knitr too, I think?
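
A minimal sketch of such a soup-to-nuts script in Python (all file and column names here are hypothetical, and the toy data generated at the top stands in for a real input file): load, sanity-check with assert statements, summarize, and write the result.

```python
import csv
import statistics

# Stand-in for a real data file: in an actual analysis this would already exist.
with open("measurements.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample", "value"])
    writer.writerows([["a", 1.0], ["b", 2.0], ["c", 3.0]])

# Load
with open("measurements.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Tacked-on assert statements act as lightweight quality checks
assert rows, "no data loaded"
values = [float(r["value"]) for r in rows]
assert all(v >= 0 for v in values), "negative measurement found"

# Transform / summarize
summary = {"n": len(values), "mean": statistics.mean(values)}

# Output summary; the script itself is a readable record of the whole pipeline
with open("summary.txt", "w") as f:
    f.write(f"n={summary['n']} mean={summary['mean']:.2f}\n")

print(summary)  # → {'n': 3, 'mean': 2.0}
```

Because every step from raw file to summary lives in one script, rerunning the analysis (or auditing it) is a single command.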

@sjackman

An end-to-end R/Python script is more important than shell/make. If you plan to teach shell scripts, I would teach Makefile scripts soon after. I consider Makefile scripts to be structured, self-documenting shell scripts, and better suited to data analysis than pure shell scripts.
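
To illustrate the "structured, self-documenting shell script" idea, a minimal data-analysis Makefile might look like this (a sketch; all file names are hypothetical):

```make
# Each rule records exactly how an output was produced:
# the target, its inputs, and the command connecting them.
all: figure.png

clean.csv: raw.csv clean.py
	python clean.py raw.csv > clean.csv

figure.png: clean.csv plot.py
	python plot.py clean.csv figure.png

.PHONY: all
```

Running `make` rebuilds only what is out of date, and the file doubles as living documentation of the pipeline.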

@ctb

ctb commented Aug 11, 2015

On Mon, Aug 10, 2015 at 06:17:34PM -0700, Shaun Jackman wrote:

An end-to-end R/Python script is more important than shell/make

+1

@sjackman

Here's a small example of using make for a data analysis pipeline, from a one-hour introduction to make that I created for the Scientific Programming Study Group at SFU: https://github.com/sjackman/makefile-example

@sjackman

I am adamant however that introductory make is more important than advanced sh.

@ctb

ctb commented Aug 14, 2015

On Thu, Aug 13, 2015 at 08:26:22AM -0700, Shaun Jackman wrote:

I am adamant however that introductory make is more important than advanced sh.

Interesting. The former is the right way to do things for workflows, the
latter is important for personal efficiency...

@sjackman

I tend to record all my analyses big and small in a Makefile script, so workflows and personal efficiency are nearly one and the same for me. Where you draw the line between basic shell and advanced shell is clearly pretty fuzzy though. To clarify, I would teach make before I taught shell features useful for writing large shell scripts, such as shell functions and parsing options.

@quantheory

Speaking from the perspective of an "early career" type who spent a lot of time as a software engineer post-undergrad, shell scripting should be deliberately minimized. I don't want to knock it too much, since there are lots of neat "tips n' tricks" for Bash, and probably other shells, especially for the command line. If nothing else, I've certainly won some benefits from fancy .bashrc/.bash_profile scripts.

But maintenance and debugging tend to be quite painful for shell scripts. A well-tested script in a web-focused language like Perl is better, but a script in Python/Ruby really wins on syntax and debug-ability. (I don't happen to have any R experience, partly due to the fields I've worked in.)

Build systems are probably required reading at some point. The tricky bit is that build systems are hard to work with in most edge cases. Autotools and CMake make some things easier than bare Makefiles, but they tend to fall down for cross-compiling and for HPC, particularly if you have platform-specific optimization flags. (What percentage of scientists think deeply about whether they are "respecting" user-specified CFLAGS in their Makefiles/CMakeLists/whatever?) For a course this short, and for such a broad audience as "students/scientists", it's hard to see anything to recommend, except for some basic knowledge of make.
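
The CFLAGS point can be made concrete with a one-line make idiom (a generic sketch, not tied to any particular project): `?=` assigns a default only when the variable isn't already set, so flags the user supplies survive.

```make
# ?= sets a default only if CFLAGS is not already defined
# (e.g. in the environment), and command-line overrides like
# `make CFLAGS="-O0 -g"` always win, so user flags are respected.
CFLAGS ?= -O2 -Wall

analysis: analysis.c
	$(CC) $(CFLAGS) -o analysis analysis.c
```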

@dsalo

dsalo commented Aug 17, 2015

@BillMills Here you go, quite short but I hope pithy: https://speakerdeck.com/dsalo/changing-workflows

@sjackman

Nice slides, Dorothea. I do my own academic writing of manuscripts in Markdown stored on GitHub (e.g. UniqTag), and I've faced resistance from senior colleagues who will consider using only Word with track changes and e-mail/Dropbox. I've taken to exporting Markdown to DOCX with Pandoc, soliciting edits with track changes, and then incorporating those changes back into the original Markdown. Anyone else face a similar situation?

@cboettig

@sjackman Yes, I've often been in a similar situation and I've used that same strategy. (Which is also handy if the journal only accepts Word).

Alternatively, you can also just paste the source (e.g. LaTeX / md / Rmd) into a word document and send that. I've found this avoids some pitfalls of the pandoc conversion to Word (though this is improving), particularly where equations are concerned. This strategy was introduced to me by a senior colleague who has long worked in LaTeX while collaborating successfully with many Word-only folks. We've both found collaborators are perfectly happy to ignore the markup and just read + track-changes the text. Of course Word is a terrible text-editor that may play havoc with some character encodings, so you cannot always copy-paste the changes whole cloth.

Neither approach is ideal of course. Several collaborators always return documents to me marked in pen anyway, so the question of output format becomes irrelevant. Paper is the great interoperable standard. In the end, manually writing in the changes, as required by any of these approaches, doesn't take that much time and does force you to pay close attention.

@blahah

blahah commented Aug 18, 2015

@dsalo that deck is outstanding! Great stuff. Could you put a license on it? I would only disagree with one point: in my experience graduate students are ideal agents of change in scientific practise.

@sjackman Yeah, this plagues me in almost every project. It usually goes something like: me and some other collaborators work together on a github repo, authorea page, overleaf, or similar. Then at some point a senior collaborator insists we all start using word with track changes, destroys the automated reference management, and will only work by emailing copies back and forth. It's then a huge effort to restore the paper to a nice format at the end. Another kicker is when they insist on manually editing figure images, rather than letting me edit the code and regenerate, because "it's more efficient for them". 😡 The only solution I can think of is to stop working with those people, which is what I'm trying to do.

@bkatiemills (Member, Author)

Great comments again, all! A few scattered responses:

@ctb @sjackman & other build/workflow management enthusiasts: I think you hit on something important with focusing on end-to-end automation of an analysis; superficially this is a convenience strategy, but more deeply this is a communication strategy for helping others have a hope of reproducing an analysis. This definitely belongs in this curriculum (perhaps without diving toooo far down the make rabbithole). Reviewing the curriculum so far, much of the Sustainable Coding section is redundant with the packaging / modularity unit that follows; I'll re-write this unit momentarily to reflect your conversation.

@dsalo What a fantastic slide deck! And yes - sometimes with a new project, people just need to come into work on Monday and find out everything got version controlled over the weekend :) Your comments make me want to add a 'change making' unit as the last session in the course, but I'm struggling to keep things to size - I'm just going to add it on anyway for now, and we'll see how things evolve once the curriculum actually gets written; I suspect some topics will move / transform as that work gets done.

@blahah

blahah commented Aug 23, 2015

@BillMills I vote that 'change making' should be a separate course

@bkatiemills (Member, Author)

@blahah ha, it could fill one - but it would be nice to wrap up this course with something that sets people on a course of action with the new things they learned. Do you think there's a useful way to go about this in only one session?

@ivanhanigan

For the section on 3. Open Data II: Clean Data

  • how to keep data organized and easy to reuse at a later date (including in-house reuse)

I recommend some 'convention over configuration' advice, and links to evidence based recommended filing systems. My faves are:

a 2008 book recommended folder structure for statistical programmers

\ProjectAcronym
    \- History starting YYYY-MM-DD
    \- Hold then delete 
    \Admin
    \Documentation 
    \Posted
         \Paper 1
             \Correspondence 
             \Text
             \Analysis
    \PrePosted 
    \Resources 
    \Write 
    \Work

Simple R analysis

This concept was originally introduced by Josh Reich as the LCFD framework, on Stack Overflow here http://stackoverflow.com/a/1434424, and encoded in the makeProject R package http://cran.r-project.org/web/packages/makeProject/makeProject.pdf.

# choose your project dir
setwd("~/projects")   
library(makeProject)
makeProject("makeProjectDemo")

# gives
/makeProjectDemo/
    /code/*.R
    /data/
    /DESCRIPTION
    /main.R

# in main.R you put
source("code/load.R")
source("code/clean.R")
source("code/func.R")
source("code/do.R")

More complicated R framework for data analysis

/project/
    /cache/
    /config/
    /data/
    /diagnostics/
    /doc/
    /graphs/
    /lib/
        /helpers.R
    /logs/
    /munge/
    /profiling/
        /01_profile.R
    /reports/
    /src/
        /01_EDA.R
        /02_clean.R
        /03_do.R
    /tests/
        /01_tests.R
    /README
    /TODO

For metadata I like EML

@bkatiemills (Member, Author)

Thanks, @ivanhanigan, this is great stuff! We talked about similar things at Study Group Journal Club at UBC the other week - we read this paper and this other paper which touch on related topics - definitely all things to include. Thanks again for the notes!
