Open Science Utility Belt #7
Thanks for this, Bill! Would be worth adding links to the Working Open guide here, and seeing if you could line up with some of the language and key categories there to strengthen / augment that work. That also may help with some of the verbiage issues (like "Programming" as a header - not sure that's the best term here, crisp up the language and minimize jargon). Great start, and more comments to come! |
yep, these lessons are going to pull heavily from the WOG, once we agree the curriculum. Changed 'programming' -> 'code wrangling'. |
Suggested language for the beginning:
Open Science 101
This session series introduces practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.
Help Us Develop This Curriculum
We're trying to answer these questions:
Let us know your thoughts in the comments!
Sessions...
Going through actual sessions now :) Really excited for this work! |
From the twitterverse: @ttimbers suggests data storage & archiving - how to find data associated with a study, how to organize your own data for future reuse; also metadata, storage and formats. |
+1 for @minisciencegirl's suggestion about naming schemes.
Is there any reason to have the Open Data sections after Collaboration sections? It might flow better (in my mind) from data wrangling into setting the wrangled data free (i.e. Open Data), then to dive into collaborations/workflows/code review/version control. This change could also make the transition into publishing easier, as collaborations may lead to publications ... 🙏 |
I'm thinking along the same lines as @taddallas re:flow. Teaching packages before git seems off to me. |
The ordering of 'Code Wrangling', 'Collaboration', 'Open Data' and 'Publishing & Communication' is actually just the order I thought of them in :) So, how about the order:
|
Looks good to me! One tiny thing: I noticed that much of the material uses Python. Perhaps it would be worthwhile to also show some R examples, as some pretty solid tools for Open Science are built around R (e.g. reproducible analyses and manuscript writing with R Markdown, testing with |
Nope, R-flavoured implementations of these techniques are definitely something we want! Which will get used depends on the audience, but we definitely want both options for the Code Wrangling section. The current examples are Python for no other reason than I speak Python. That said, I think there was a packages in R lesson from UBC recently I can dig up and link here - if you have a good lesson for testing in R, send it on by! |
The only thing that is conspicuously missing to my mind is licenses - they are fundamental to open science and are relevant to all the sections above. I would think these are the most important aspects of licenses to cover:
|
@blahah - totally agree, added your points to an additional section under 'publishing & communication' - thanks! One thing that would be super helpful in that section, is ideas for hands-on activities, and engaging ways to introduce things like licenses as well as code and data citation; definitely A-list important stuff, but runs the risk of turning into a really dry lecture about DOIs and copyright. |
I think a nice way to introduce licenses and citation is by doing a set of small hands-on data mining tasks. Introducing some frustrating scenarios that are solved by proper licensing and good data citation should be memorable. We just need a paper with great data but no license, and a paper that does something good with someone else's data but doesn't cite it properly. |
This may be expanding the scope a bit, but some topics that would have been helpful for me early on, before I really did much coding or had a solid project together, would have been:
|
@noamross could that first point fit with the social media unit? I'd love to hear your ideas on your second point - to be honest, content aggregation is a pretty weak part of my own game, I've never found a method I really liked. |
Yes, lab notebook could go in social media, but there's a fair amount of the topic that isn't explicitly social: metadata/tagging of notes, formats and organization for searching, plain-text for posterity, etc. On collecting content, I'm similar. I have a semi-working system of Mendeley + a collection of tagged plain-text notes, but I'm not sure how well it works in terms of collaboration. @cboettig and I once wrote a review together where we built an annotated bibliography using markdown + bibtex, but it felt more like a one-use hack than a system. Ideas from others would be welcome. |
Great suggestions so far. I agree on the "importance of agreeing on a license explicitly and early", and thus think this should come at the beginning of the course and not at the end. As @blahah mentioned, this should work well after some moments of reuse-rights-related frustration, which unfortunately remain all too easy to create. One aspect that I am missing is an overview of where things are or are not open along the research cycle - we are making progress with making research outputs more widely available, but the research process is still mostly closed (save a few open notebooks), and funding is basically a dark corner (very few proposals are open, and basically no funding decisions). |
Working with collaborators who don't necessarily Get It about the whole "open" thing. This is one of the top questions I get whenever I talk open with people.
DOIs, and how they are not magic but are important.
Data citation.
Data journals and other data-publication venues.
Data-use tracking and metrics, and how to use them to make a tenure case or a grant proposal stronger.
Where to get help shoring up your weak spots -- nobody can do everything!
Basic digital hygiene: backups, basic security, basic digital preservation (why "I'll put it on my website!" is a lousy idea long-term).
Navigating openness vs. privacy in human-subjects and other sensitive research.
How to use Excel, if you must, without making everyone else hate you.
What to use when Excel stops being useful (stats packages, relational databases). |
Would love to see design of experiments, multiple testing corrections, and quality engineering (reducing variability) of experiments in the curriculum. (Happy to contribute on these subjects.) |
To publishing: @noamross I'm using the knitcitations from @cboettig on a daily basis. So if this is the result of your cooperation it certainly wasn't a one-time hack :) I would underline the importance of learning markdown and using knitr when collaborating on scientific projects.
|
Here is the program that we are developing for the MOOCSciNum "research practices in the digital age", with a strong focus on open research practices. Here is the enrollment page.
Cheers
Célya
Plan du cours
Séance 0 : Recherche à l'ère du numérique : quelles transformations ? (séance d'introduction)
Séance 1 : S'appuyer sur des ressources scientifiques existantes
Séance 2 : Collecter/produire des données scientifiques
Séance 3 : Traiter/analyser des données scientifiques
Séance 4 : Archiver/partager des données scientifiques : données de santé, données sensibles
Séance 5 : Partager ses résultats scientifiques : écrire et publier
Séance 6 : Faire partie d'une communauté scientifique
Séance 7 : Nouvelles formes d’interaction en recherche et enjeux éthiques |
Google translate + my eyes (not perfect, please make improvements).
My comments: This (below) is a broader overview of research & access. I think something more hands-on would work better in study group sessions. But I do think looking at the broader picture is a good move - thanks for adding DOI for code, licensing info.
Course Outline
Session 0: Research in the digital age: what is different? (Introductory session)
Session 1: Building on existing scientific resources
Session 2: Collect / produce scientific data
Session 3: Edit / analyze scientific data
Session 4: Archive / share scientific data: health data, sensitive data
Session 5: Sharing research results: write and publish
Session 6: Being part of a scientific community
Session 7: New forms of interaction in research and ethical issues |
@tgardner4 I agree that these topics are fundamental but also think they are somewhat out of scope for a ~12 lesson group-study on open science. There are, however, some important connections between experimental design and open science that could be addressed, such as:
|
@noamross I agree with your points. What I outlined is a study group unto its own. I think you propose a nice solution though - an intro to the topic (and perhaps a pointer to a separate study group dedicated to a full treatment). By including it in the open science curriculum - even as an intro - you would teach participants that these are core issues that can't be overlooked in proper scientific pursuit. I would also suggest that the original open science outline described above (the very first post in this thread) is heavily tilted toward a view that open science = coding + publishing. Absent from this curriculum is anything about experimentation or the scientific process. My suggestions are a reaction to this gap. When I hear "science" my mind goes to experimentation. When I hear "open science" I think: "collaboration on the design, execution and sharing of experiments & results." Code and publishing are a necessary, but not sufficient, portion of the scientific process. |
Taking comments from @tgardner4 @noamross and more, there might be more clarity if we change the title to: Open Science & Data: open research practices when working with scientific data This could be a follow up series after a broader 'Introduction to Open Science'. |
@tgardner4 I agree with all your points - these are essential skills and they are not taught sufficiently well in general science courses. I still think that they are not part of open science specifically, and are too many steps removed from the core toolset of open science to feature heavily in the curriculum. Having a short discussion of the relevant aspects where they are directly related (for example in reusing open data), then linking out to a resource which would be developed separately from this curriculum seems like a good way to go. |
@BillMills There's vastly less written about this than I would like. :/ Sometimes introductory project-management techniques are a way in. I have a slidedeck I'd be happy to share with you if you think it would help; otherwise, maybe the way to approach it is an unconference-style discussion, maybe with a plausible case study as an example. Another way that can work is using a "horror story" as a discussion seed. In this context, I might use a story about a data-ownership snafu (see examples at http://pinboard.in/u:dsalo/t:horrorstories/t:dataownership ) but obviously almost any interpersonal issue specific to open science can be made to work, if there's an available horror story. |
@acabunoc that title sounds like it fits the content better |
I agree - that’s a better title - more reflective of the content. Tim
|
A few cents from wandering through this discussion --
"How to set expectations for good contributions that lead to easy-to-review code" - phrasing sets off alarm bells. Also, testing comes 4 sessions later, which is the wrong way 'round - how do you review code that you can't trust? ;). Tests are a prerequisite for code review. Code coverage analysis is missing, also. I would suggest a checklist among other things.
Code packaging? Nix it, IMO. Or move it much later. (Definitely well after testing.) Reasons: it has a lot of sysadminy type stuff that most people won't know or care about.
Soooooooooo you're saying DOIs enhance discoverability of code? Sounds like a theoretical point that doesn't actually work to me. Fine to mention it but DOIs + code are not terribly useful yet.
publishing and communications: copyright vs license should be in there, no?
Automation & scripting is missing from the entire discussion, and yet it's key to sharing any kind of workflow. <=> reproducibility, which is underemphasized.
In my experience, selling scientists on this stuff is 80% of the battle, once they show up. (Technical skills is the next 80% ;). More and stronger motivation. Very few people seem to worry about incorrect results (oddly enough) so efficiency and reputation is a good focus.
Twitter probably belongs in social media, too. Lurking, favoriting, retweeting, subtweeting.
Looks great overall - I think there are probably many ways through all this material, but this is a nice collection of things to consider for any such course! |
@tgardner4 yes! Including a pointer to your content and then following it up as its own series of lessons is an A+ solution, 100% on-board with that. @acabunoc sorry - which title do you want to replace with that, @tgardner4's or mine? Happy to comply either way, let me know. @dsalo: please link me to your slide-deck! I'm really keen for this content, but I need some help bringing it into focus. @ctb:
|
On Sat, Aug 08, 2015 at 02:27:02PM -0700, Bill Mills wrote:
modularity, maybe? I don't know how this works in R but in Python it is I was thinking about this after my comment. How about tying it together
+0.5, or start with some shell/R/Python scripts that do soup-to-nuts analysis |
An end-to-end R/Python script is more important than shell/make. If you plan to teach shell scripts, I would teach Makefile scripts soon after. I consider Makefile scripts to be structured, self-documenting shell scripts, and better suited to data analysis than pure shell scripts. |
On Mon, Aug 10, 2015 at 06:17:34PM -0700, Shaun Jackman wrote:
+1 |
Here's a small example of using make for a data analysis pipeline that I used for teaching a one hour introduction to make that I created for the Scientific Programming Study Group at SFU: https://github.com/sjackman/makefile-example |
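For readers who have not used make, the core rule that makes it attractive for analysis pipelines can be sketched in a few lines of Python: a target is rebuilt only if it is missing or older than one of its dependencies. This is a rough illustration under stated assumptions, not how make is implemented and not taken from the linked repository; the file names `raw.csv` and `summary.txt` and the helpers `needs_rebuild`, `build` and `demo` are hypothetical.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def build(target, dependencies, recipe):
    """Run recipe only when the target is out of date; report whether it ran."""
    if needs_rebuild(target, dependencies):
        recipe()
        return True
    return False


def demo():
    """Build a tiny two-file pipeline twice inside a temporary directory."""
    with tempfile.TemporaryDirectory() as workdir:
        raw = os.path.join(workdir, "raw.csv")          # hypothetical input
        summary = os.path.join(workdir, "summary.txt")  # hypothetical output
        with open(raw, "w") as handle:
            handle.write("x\n1\n2\n3\n")

        def summarise():
            # Skip the header line, then average the remaining values.
            with open(raw) as src:
                values = [int(line) for line in src.read().split()[1:]]
            with open(summary, "w") as out:
                out.write(f"mean={sum(values) / len(values)}\n")

        first = build(summary, [raw], summarise)   # target missing, so it runs
        second = build(summary, [raw], summarise)  # target up to date, skipped
        with open(summary) as handle:
            return first, second, handle.read().strip()
```

Calling `demo()` runs the recipe on the first build and skips it on the second, which is the behaviour a Makefile gives you for free once targets and dependencies are declared.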
I am adamant however that introductory |
On Thu, Aug 13, 2015 at 08:26:22AM -0700, Shaun Jackman wrote:
Interesting. The former is the right way to do things for workflows, the |
I tend to record all my analyses big and small in a Makefile script, so workflows and personal efficiency are nearly one and the same for me. Where you draw the line between basic shell and advanced shell is clearly pretty fuzzy though. To clarify, I would teach |
Speaking from the perspective of an "early career" type who did a lot of time as a software engineer post-undergrad, shell scripting should be deliberately minimized. I don't want to knock it too much, since there are lots of neat "tips n' tricks" for Bash, and probably other shells, especially for the command line. If nothing else, I've certainly won some benefits from fancy But maintenance and debugging tend to be quite painful for shell scripts. A well-tested script in a web-focused language like Perl is better, but a script in Python/Ruby really wins on syntax and debug-ability. (I don't happen to have any R experience, partly due to the fields I've worked in.) Build systems are probably required reading at some point. The tricky bit is that build systems are hard to work in most edge cases. Autotools and CMake make some things easier over bare Makefiles, but they tend to fall down for cross-compiling and for HPC, particularly if you have platform-specific optimization flags. (What percentage of scientists think deeply about whether they are "respecting" user-specified CFLAGS in their Makefiles/CMakeLists/whatever?) For a course this short, and for such a broad audience as "students/scientists", it's hard to see anything to recommend, except for some basic knowledge of make. |
@BillMills Here you go, quite short but I hope pithy: https://speakerdeck.com/dsalo/changing-workflows |
Nice slides, Dorothea. I do my own academic writing of manuscripts in Markdown stored on GitHub (e.g. UniqTag), and I've faced resistance from senior colleagues who will consider using only Word with track changes and e-mail/Dropbox. I've taken to exporting Markdown to DOCX with Pandoc, soliciting edits with track changes, and then incorporating those changes back into the original Markdown. Anyone else face a similar situation? |
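As a rough sketch of that round trip, the two Pandoc invocations could be wrapped as below. The file names are hypothetical, and converting the edited DOCX back to a separate Markdown file for diffing is an assumption about the workflow rather than something described in the comment above. `-o` and `--track-changes=accept` are standard Pandoc options; the helper only runs Pandoc if it is found on the PATH.

```python
import shutil
import subprocess


def to_docx_command(markdown_path, docx_path):
    """Command that exports a Markdown manuscript to DOCX for Word users."""
    return ["pandoc", markdown_path, "-o", docx_path]


def from_docx_command(docx_path, markdown_path):
    """Command that converts an edited DOCX back to Markdown.

    --track-changes=accept treats tracked insertions and deletions as
    accepted, so the result reflects the collaborator's edits.
    """
    return ["pandoc", docx_path, "--track-changes=accept", "-o", markdown_path]


def run_if_available(command):
    """Run the command only when its executable is installed."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True
```

Writing the converted text to a new file (say `manuscript.edited.md`, a hypothetical name) and diffing it against the original keeps the hand-written Markdown as the source of truth.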
@sjackman Yes, I've often been in a similar situation and I've used that same strategy. (Which is also handy if the journal only accepts Word). Alternatively, you can also just paste the source (e.g. LaTeX / md / Rmd) into a word document and send that. I've found this avoids some pitfalls of the pandoc conversion to Word (though this is improving), particularly where equations are concerned. This strategy was introduced to me by a senior colleague who has long worked in LaTeX while collaborating successfully with many Word-only folks. We've both found collaborators are perfectly happy to ignore the markup and just read + track-changes the text. Of course Word is a terrible text-editor that may play havoc with some character encodings, so you cannot always copy-paste the changes whole cloth. Neither approach is ideal of course. Several collaborators always return documents to me marked in pen anyway, so the question of output format becomes irrelevant. Paper is the great interoperable standard. In the end, manually writing in the changes, as required by any of these approaches, doesn't take that much time and does force you to pay close attention. |
@dsalo that deck is outstanding! Great stuff. Could you put a license on it? I would only disagree with one point: in my experience graduate students are ideal agents of change in scientific practise. @sjackman Yeah, this plagues me in almost every project. It usually goes something like: me and some other collaborators work together on a github repo, authorea page, overleaf, or similar. Then at some point a senior collaborator insists we all start using word with track changes, destroys the automated reference management, and will only work by emailing copies back and forth. It's then a huge effort to restore the paper to a nice format at the end. Another kicker is when they insist on manually editing figure images, rather than letting me edit the code and regenerate, because "it's more efficient for them". 😡 The only solution I can think of is to stop working with those people, which is what I'm trying to do. |
Great comments again, all! A few scattered responses: @ctb @sjackman & other build/workflow management enthusiasts: I think you hit on something important with focusing on end-to-end automation of an analysis; superficially this is a convenience strategy, but more deeply this is a communication strategy for helping others have a hope of reproducing an analysis. This definitely belongs in this curriculum (perhaps without diving toooo far down the @dsalo What a fantastic slide deck! And yes - sometimes with a new project, people just need to come into work on Monday and find out everything got version controlled over the weekend :) Your comments make me want to add a 'change making' unit as the last session in the course, but I'm struggling to keep things to size - I'm just going to add it on anyway for now, and we'll see how things evolve once the curriculum actually gets written; I suspect some topics will move / transform as that work gets done. |
@BillMills I vote that 'change making' should be a separate course |
@blahah ha, it could fill one - but it would be nice to wrap up this course with something that sets people on a course of action with the new things they learned. Do you think there's a useful way to go about this in only one session? |
For the section on 3. Open Data II: Clean Data
I recommend some 'convention over configuration' advice, and links to evidence based recommended filing systems. My faves are: a 2008 book recommended folder structure for statistical programmers
Simple R analysis
This concept was originally introduced by Josh Reich as the LCFD framework on the Stack Overflow website (http://stackoverflow.com/a/1434424), and is encoded in the makeProject R package (http://cran.r-project.org/web/packages/makeProject/makeProject.pdf).
# choose your project dir
setwd("~/projects")
library(makeProject)
makeProject("makeProjectDemo")
# gives
/makeProjectDemo/
/code/*.R
/data/
/DESCRIPTION
/main.R
# in main.R you put
source("code/load.R")
source("code/clean.R")
source("code/func.R")
source("code/do.R")
More complicated R framework for data analysis
For metadata I like EML
|
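Since the thread asks for both R and Python flavours, the same load / clean / func / do split could look roughly like this in Python. This is a single-file sketch with made-up data, not a port of makeProject; in a real project each step would presumably live in its own module, mirroring `code/load.R`, `code/clean.R`, `code/func.R` and `code/do.R`. The names `RAW`, `mean_by_site` and the column names are hypothetical.

```python
import csv
import io

# Hypothetical raw data; a real project would read a file from data/ instead.
RAW = "site,count\nA,3\nB,\nA,5\nB,4\n"


def load(text):
    """load: read the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """clean: drop rows with a missing count and convert counts to int."""
    return [
        {"site": row["site"], "count": int(row["count"])}
        for row in rows
        if row["count"]
    ]


def mean_by_site(rows):
    """func: a reusable analysis helper, here the mean count per site."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["site"], []).append(row["count"])
    return {site: sum(counts) / len(counts) for site, counts in grouped.items()}


def do(text):
    """do: run the whole analysis end to end, like main.R sourcing each step."""
    return mean_by_site(clean(load(text)))
```

Keeping each stage as a separate function means the end-to-end run is one call, `do(RAW)`, while the individual steps stay easy to test and review on their own.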
Thanks, @ivanhanigan, this is great stuff! We talked about similar things at Study Group Journal Club at UBC the other week - we read this paper and this other paper which touch on related topics - definitely all things to include. Thanks again for the notes! |
Open Science 101
This is a session series introducing practical skills needed to get started in open science. A Mozilla Science Study Group can use this series to introduce open science over an academic semester.
Help Us Develop This Curriculum
We're trying to answer these questions:
Let us know your thoughts in the comments!
Sessions