
Creating a Bioconductor package


Complete Documentation for Creating a Bioconductor R package

This resource can be used as a reference when writing code for, or when building, an R package for submission to Bioconductor. These notes also apply to creating a general R package, with or without submission to CRAN, but note that Bioconductor enforces more stringent checks than CRAN does.

1. R coding best practices

As most people will generally have written their R code (most likely scattered across different R scripts) before they decide to turn it into a comprehensive package, there are some general best-practice guidelines we should address.

Do's

  • Qualify functions from external packages with the package name - package::some.function() - when calling them in your R code.

  • Split your functions across separate R scripts. This will make nearly all parts of the subsequent R package development lifecycle that little bit less painful.

  • Have examples for 80%+ of functions which are exported (#' @export in roxygen2)

  • Keep the number of characters per line at 80 or less (Bioconductor doesn't require every line to comply, but aim to have most under the limit).

  • Keep function size at or below 50 lines if possible (This isn't enforced in Bioconductor/R checks but is a good idea).

  • Ensure indentation is done with 4 spaces (you can set this in Tools->Global Options in RStudio). This is a Bioconductor convention and is a lot less painful if you adopt it as you code rather than retrofitting it later.

    • If you don't do this, the styler package has a function to help update your code to the correct indentation (note it doesn't work 100% of the time, but it's very good):

      styler::style_pkg(transformers = styler::tidyverse_style(indent_by = 4))

    • When shortening lines to go below 80 characters, you can break onto the next line inside () but not inside anything else, e.g. {} []. For example:

                apply(something,
                      tosomething)

    • Also don't break the line at a %in%, as this will cause an error with the tab check. For example, this is fine:

                mypkg::myPackageData[mypkg::myPackageData$data_type %in% a_vector, ]

      but this will fail:

                mypkg::myPackageData[mypkg::myPackageData$hgnc_symbol %in%
                    a_vector, ]

  • The following can be used to check these formatting issues when writing an R package, listing the offending lines per function: BiocCheck:::checkFormatting("link/to/myPackage", nlines=Inf)

Don'ts

  • Don't use class() to test an object's class; instead use methods::is() or the equivalent type check, e.g. is.numeric()

  • Don't use T/F; these are ordinary variables that can be reassigned, so they don't act the same in every situation. Use TRUE/FALSE.

  • Don't use 1:length(a); it's error-prone, since it yields c(1, 0) when a is empty. Use seq_len(length(a)).

  • Function names should not contain . for Bioconductor submissions, as it suggests S3 dispatch. Use camelCase or underscores.
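
To make these points concrete, here is a small sketch contrasting the discouraged and preferred forms (the function and its behaviour are made up for illustration):

    # Discouraged: class() comparison, 1:length(), print(), and a dot in the
    # name (print.values looks like an S3 print method for class "values")
    print.values <- function(a) {
        if (class(a) == "numeric") {             # breaks for objects with >1 class
            for (i in 1:length(a)) print(a[i])   # 1:length(a) is c(1, 0) when a is empty
        }
    }

    # Preferred: is.numeric(), seq_len(), message(), camelCase
    printValues <- function(a) {
        if (is.numeric(a)) {
            for (i in seq_len(length(a))) message(a[i])
        }
    }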

2. Creating an R package

There are multiple R packages developed to make the process of creating your own R package easier. The ones I would advise are:

NOTE: There is now (released in 2021) an equivalent package to usethis for Bioconductor: biocthis. I haven't used this myself yet, but if you are starting from scratch with your package I think it could be very useful and save you a lot of time!

There are also some great guides for setting up your R package which have extensive detail on each aspect so it would be redundant to reiterate them here. I advise one/both of the following:

These resources should get you to the point where you have the correct R package folder structure, vignette and R scripts. I would also advise using GitHub to store your package repository. This has many benefits and is nearly essential for creating a proper development cycle for R package development. It is also particularly important for future version development once the package is live on Bioconductor.

The remaining aspects of your package that we will consider more closely are checking your package on different R versions and operating systems, unit tests, code coverage and NEWS files.

2.1 Testing with R versions & differing operating systems

There are multiple approaches to doing this, but probably the easiest is to use GitHub Actions set up from usethis in R; see GitHub Action setup and badges. Through some very simple R commands you can get external builds, build badges and code coverage badges added to your package. This is the method preferred by the community.
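
As a minimal sketch (assuming the current usethis API - check its documentation for the exact workflow names), the setup can be as simple as:

    # From inside your package directory:
    usethis::use_github_action("check-standard")  # R CMD check on Linux, macOS and Windows
    usethis::use_github_action("test-coverage")   # run covr and upload to Codecov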

Another decent option is Travis. Travis integrates with your GitHub repository so that once you push your changes, checks are automatically run and reported. A good introduction to Travis, along with its use with GitHub and code coverage reporting (more on this below), is given at JEFworks lab MIT.

2.2 Unit testing

R packages - Unit Testing by Hadley Wickham and JEFworks lab MIT cover this area quite well so these should be used, but some commands to note are:

    # Add unit testing infrastructure
    usethis::use_testthat()
    # Run the written tests
    devtools::test()

NOTE - when you create a test file it must start with "test-", e.g. test-my_first_function.R

More generally, it is worth having a separate test R script for each function R script in your package. It is helpful to keep individual tests general, but ensure that together they cover all functionality of your package. Note you need test coverage of 80%+ for Bioconductor.
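
For illustration, a minimal test file might look like the following (the function name and its expected behaviour are hypothetical):

    # tests/testthat/test-my_first_function.R
    test_that("my_first_function doubles its input", {
        expect_equal(my_first_function(1), 2)
        expect_error(my_first_function("not a number"))
    })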

Note that if your tests are too time/RAM consuming, it might be best to use long tests; see here for more info on implementations and use MungeSumstats as a reference.
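
A hedged sketch of the long tests layout (based on the Bioconductor long tests documentation - double-check the linked docs):

    myPackage/
      longtests/testthat/test-expensive_function.R  # long-running tests live here
      .BBSoptions                                   # contains the line: RunLongTests: TRUE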

2.3 Code coverage

A good way to check whether your tests cover all functionality of your package is to use a code coverage tool. I recommend Codecov, which integrates with your GitHub repository and gives you a detailed code coverage report each time you push to GitHub. You can use this to identify the sections of each function which aren't currently covered by your tests, making the development process far easier.

Note you can run code coverage checks in R using:

devtools::test_coverage()

You can create a GitHub Action using a yaml file so your code coverage test (along with an R CMD check) runs when you push to GitHub. See the MungeSumstats GHA yaml for an example of how to do this.
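
A hypothetical, stripped-down coverage workflow might look like this (the action versions, runner and dependency handling are assumptions - the MungeSumstats yaml is a fuller, battle-tested example):

    # .github/workflows/coverage.yml
    on: push
    jobs:
      coverage:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - uses: r-lib/actions/setup-r@v2
          - name: Run code coverage and upload to Codecov
            run: |
              install.packages(c("covr", "remotes"))
              remotes::install_local(dependencies = TRUE)
              covr::codecov()
            shell: Rscript {0}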

Notes on GitHub Actions checks

There can be problems when using github actions:

  • Taking too long to install packages can be a problem for GH Actions checks, because if they time out they fail entirely.
  • When running checks via your GH Actions workflow, these checks run on a server elsewhere. This means they don't have access to your local environment variables, including GITHUB_PAT, which lets you install remote R packages much faster. Brian Schilder came up with a solution to this.
  • Speed up any packages installed from GitHub by first going to your repo's Settings --> Secrets and creating a new secret named PAT_GITHUB. Note the reversal from the local variable name (GITHUB_PAT), since GH doesn't let you name Secrets starting with "GITHUB".
  • Then, add these as global variables in your GHA yaml so it can be used in all subsequent installations (see here for an example):
    env:
      R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
      GITHUB_PAT: ${{ secrets.PAT_GITHUB }}

2.4 NEWS file

A NEWS.md is a markdown file used to inform users of all changes in your package with version updates. This, of course, is more important to keep up-to-date once your package is live on Bioconductor but should also be included when first submitting your package. There is a function in R to create this file: usethis::use_news_md()

Or else simply create the file yourself and put it in the main package directory, in the same location as your DESCRIPTION and NAMESPACE files. I like to use Documentation as a reference for what should be included and how it should be laid out.
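
As a rough sketch, a NEWS.md in a format that utils::news() can parse looks something like this (the package name and entries are placeholders):

    # myPackage 0.99.1

    ## Bug fixes
    * Fixed the edge case where ...

    # myPackage 0.99.0

    ## New features
    * Initial version for Bioconductor submission.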

You can check if your format is correct by running:

utils::news(package="<name of your package>")

3. R package development lifecycle

As you continue to add features to your package, or even as you fight to keep the never-ending swarm of bugs at bay, you will find that you run the same set of commands again and again to check your progress. The approach to doing this I like, which you can copy and adapt as necessary, is:

  • Make the change to (hopefully) fix the bug/add the new feature

  • Set the working directory to inside your package setwd("path/to/myPackage/")

  • Document the changes with devtools devtools::document()

  • Run the CRAN checks to ensure the bug is fixed or the new features don't produce any errors: devtools::check()

  • devtools::check() will run using CRAN standards by default, but the check button in the Build tab of RStudio does not. To enforce CRAN standards using RStudio's check button, click the gear icon labelled More --> Configure Build Tools... and add --as-cran to the Check Package... box.

(Screenshot: RStudio's Configure Build Tools dialog.)

  • If you pass the check, run Bioconductor's own check function: BiocCheck::BiocCheck("path/to/myPackage/"). More information on these checks is available here

  • The issues that often show up in both of these checks are addressed in 1. R coding best practices

  • If the package passes all necessary checks and you are happy there are no more changes, you can build the package and the vignettes: devtools::build(vignettes = TRUE)

  • Also build your README.Rmd file to update README.md and ensure it looks okay: devtools::build_readme()

  • If your package has a website on the GitHub repository with links to your vignette(s), you should also run the following to update the vignettes available on the website:

        devtools::build_vignettes()
        devtools::build_site()

  • Lastly, push your changes to your GitHub repository (if using) and:

    • Check Travis for any fails based on different OS or R versions
    • Check codecov for coverage percentage
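
Pulled together, one iteration of this cycle looks something like the following sketch (paths are placeholders):

    setwd("path/to/myPackage/")
    devtools::document()                          # update Rd files and NAMESPACE
    devtools::check()                             # R CMD check with CRAN standards
    BiocCheck::BiocCheck("path/to/myPackage/")    # Bioconductor-specific checks
    devtools::build(vignettes = TRUE)             # build the package and vignettes
    devtools::build_readme()                      # re-render README.Rmd to README.md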

A note on both the CRAN and Bioconductor checks: Any errors or warnings must be resolved before submission whereas notes are up to the reviewer but should be minimised as much as possible. Also, for Bioconductor, all CRAN tests must also be passed.

A note on CRAN checks - R CMD Check: To be accepted onto Bioconductor, the R CMD check (the CRAN check: devtools::check()) must run in under 15 minutes total. This includes the vignette, tests, examples etc. This restriction is necessary because the check is run nightly. Aim to reduce runtime by using smaller datasets that still cover all functionality of your code for tests and examples. If this isn't possible for your tests, have a look into long tests, which are available in Bioconductor, run weekly and have a less strict time restriction.

4. Data size for Bioconductor

One issue which is quite common is that the data (.Rda files or other types) are above the maximum size for submission to Bioconductor. Data in R packages can be used within the core functionality of the package or, more commonly, used for examples in the vignette. The maximum size for an R package being submitted to Bioconductor is 5MB which, when working with bioinformatic data, does not get you very far.

This issue will show up in the BiocCheck::BiocCheck() output as something like:

    $warning
    [1] "The following files are over 5MB in size: '...

This is a big issue since you can't submit a package with data files exceeding this limit. There are two main routes around this which I will address.

4.1 Reduce data size

The first, and by far the easier, option is to reduce the size of the data used by the package. This could be through compressing files or simply removing the parts of the files which aren't used; for example, removing columns from a dataset which aren't necessary.

All efforts should be made to do this since the alternative entails far more work.

4.2 Create an ExperimentHub data package for your statistical package

NOTE - This section may now be out of date, as Bioconductor is moving away from the old approach of data packages. The new advice is:

We would prefer the data be downloaded and distributed through the AnnotationHub as we are moving away from traditional data packages. Please see HubPub and the vignette on how to create a hub package https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html

I haven't looked into it, so I don't know how different it is to the advice below.

If you cannot reduce the size of the data files which are necessary for the vignette or package function, an alternative is to create an ExperimentHub package for your data. There is a resource describing how to do this, but it isn't that intuitive for first-time users, so I will explain the process below.

  • You need to create a new package with the correct folder architecture, as you did for your actual package. You will need a vignette explaining the data along with any associated data functions (if you need a function to load a dataset). The necessary files are described here

  • Also create a repository for this data package on GitHub or another source where others can access it. Make it public; this is important for a later step.

  • What's unintuitive about this data package is that it won't actually hold your data; it will instead hold a metadata.csv file which acts as a pointer to your data, which is stored on AWS S3.

Storing your data in AWS S3 for your data package

  • Firstly, create a metadata.csv file holding information about your data. Follow the steps here for each column. Some of the columns are not well explained, however, so I'll expand on the ones I had trouble with:

    • BiocVersion - this is the current devel version of Bioconductor or, in other words, the current release version with the second number incremented by 1. So if the Bioconductor release is 3.12, the current devel is 3.13
    • RDataPath - this is just the name of your package and file name formatted as: myDataPackage/mydatafile.rda
  • Once this is made, put the metadata.csv file in the following location: myPackage/inst/extdata/. Then test the format with the following R functions:

        ExperimentHubData::makeExperimentHubMetadata("/path/to/myPackage")
        AnnotationHubData::makeAnnotationHubMetadata("/path/to/myPackage")
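
    For orientation, a metadata.csv might look roughly like this one-row example (the column names follow the ExperimentHubData documentation, but all values are placeholders - treat the linked docs as authoritative):

        Title,Description,BiocVersion,Genome,SourceType,SourceUrl,SourceVersion,Species,TaxonomyId,Coordinate_1_based,DataProvider,Maintainer,RDataClass,DispatchClass,RDataPath
        mydatafile,"Transcript lengths and GC content for ...",3.13,GRCh38,TSV,http://www.ensembl.org,1.0,Homo sapiens,9606,TRUE,Ensembl,Jane Doe <jane@example.com>,data.frame,Rda,myDataPackage/mydatafile.rda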

  • Also create a script named make-data.R, located in myPackage/inst/scripts/. This contains information on how the datasets were derived. Think of it like the description, source and code parameters in a data.R file in a normal package. A draft from this script for one dataset is below:

## datasetName
## A dataset containing .... 
# Code:
# listMarts(host="www.ensembl.org")
# human <- useMart(host="www.ensembl.org", "ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")
# ensembl_transript_lengths_GCcontent = getBM(attributes=c("transcript_length","percentage_gene_gc_content","ensembl_gene_id"), mart= human)
# save(ensembl_transript_lengths_GCcontent,file="datasetName.Rda")
  • The next step is uploading your data to AWS S3. This is done so the data isn't stored locally in the data package, making it lightweight to install. The instructions are given here but, in short:

    • Download the AWS CLI here locally. You may be prompted after download, but there is no need to create an account.
    • On Mac/Linux, open the terminal and run aws configure --profile AnnotationContributor to create the .aws/config file. You need to enter details for at least one field, so I entered the following and left the others blank (just press enter):

          output = text
          region = eu-west-2
    • Email hubs@bioconductor.org asking for temporary access to an S3 bucket where you can upload your data. They will give you a public and private key.
    • Update your .aws/config with these keys by running vim ~/.aws/config.
    • Refer to here for how it should look when done
    • Now upload your data to AWS S3 by running aws --profile AnnotationContributor s3 cp myData/folder s3://annotation-contributor/myDataPackage --recursive --acl public-read
    • Reply to the same email in which you got your keys letting them know you have uploaded your data and that your metadata file is created. You will also need to give them the link to your data package repository (remember that github repository you definitely created?) for them to access your metadata file.
    • The response will detail how to access your data using ExperimentHub and should be something like the following:

          library(ExperimentHub)
          eh = ExperimentHub()
          query(eh, "myDataPackage")
    • Note - The data is only available in the devel versions of Bioconductor and R, as these are the versions under which the package will be reviewed and released.
    • R devel can be downloaded online
    • To update Bioconductor to the devel version, run: BiocManager::install(version = "devel")
  • Now that the data is uploaded, ensure you have the zzz.R file added to your data package from the tutorial. This allows you to call your data by dataName() once your data package is loaded. Make sure to update all references to your data throughout both your data and statistical packages to this nomenclature (including in the vignette).

    • Note that the format given for the zzz.R file in that tutorial isn't complete and you should use the following template instead:

         #' @import ExperimentHub ExperimentHubData
         #' @importFrom utils read.csv
         .onLoad <- function(libname, pkgname) {
             fl <- system.file("extdata", "metadata.csv", package=pkgname)
             titles <- read.csv(fl, stringsAsFactors=FALSE,
                                fileEncoding = "UTF-8-BOM")$Title
             createHubAccessors(pkgname, titles)
         }
      
  • If you have not already done so, remove the actual data files from your myDataPackage folder hierarchy, as they are now referenced from ExperimentHub. Note you will need to add ExperimentHub and ExperimentHubData as dependencies in your DESCRIPTION file, for example as sketched below.
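
    A hedged sketch of the relevant DESCRIPTION lines:

        Imports:
            ExperimentHub,
            ExperimentHubData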

  • You will also need to remove your data.R file, which references all the data that is now in AWS S3. You now need to create an .Rd file in your package's man directory referencing this data. The specifics of this aren't covered clearly in any documentation, so use the following template and give your file an informative name such as myDataPackage-Package.Rd. Note that this assumes you are using the zzz.R file to load datasets by calling them like a function:

      \name{myDataPackage}
      \docType{data}
      \alias{myDataPackage}
      \alias{myDataPackage-package}
      \alias{myDataset1}
      \alias{myDataset2}
      \alias{myDatasetn}
      \title{myDataPackage title line - same as in DESCRIPTION file}
      \description{
        myDataPackage description paragraph - same as in DESCRIPTION file
      }
      \usage{
      myDataset1(metadata=FALSE)
      myDataset2(metadata=FALSE)
      myDatasetn(metadata=FALSE)
      }
      \arguments{
        \item{metadata}{
      	\code{logical} value indicating whether metadata only should be returned
      	or if the resource should be loaded. Default behavior(metadata=FALSE) 
      	loads the data.
        }
      }
      \examples{
        data <- myDataset1()
      }
      \value{ These accessor functions return differing dataset types}
      \source{These datasets have been sourced from various repositories, see 
      the ExperimentHub database for details}
      \keyword{datasets}
    
  • Use devtools::check() and BiocCheck::BiocCheck("path/to/myDataPackage/") to check that there are no issues when building your data package. The data package should then be ready for submission to Bioconductor.

  • Note that you should update your statistical package to use your data package as if it were already available on Bioconductor. The R package development lifecycle described in section 3 will fail, since the data package isn't actually available yet. To get around this, locally install the data package from GitHub and you will be able to run all checks. This will still cause a fail on Travis, as it won't be able to install your data package; this is fine and can be ignored for now.

Clearing git history of large data files

Unfortunately, all the previous work to move the large data files to a separate package doesn't clear them from the git history of the statistical package. This is something we may need to do to stop errors when we run the BiocCheck. If you get a warning message similar to the following, it is more than likely this issue:

$warning [1] "The following files are over 5MB in size: '.git/objects/pack/pack-db4454be53c159209494f1bc01ed91f01c0be0e0.pack'"

The workaround is to delete the history of these large data files from the statistical package. Note that these won't be retrievable using git, so make sure to back them up separately before doing these steps. The steps below, taken from here, walk through this process:

  • Download BFG repo-cleaner

  • Clone a fresh copy of your repo, using the --mirror flag:

     git clone --mirror https://github.com/<username>/<repo_name>.git
    
  • Run the BFG to clean your repository up; note we want to remove everything over 5MB, the Bioconductor size limit:

     java -jar path/to/bfg.version.no.jar --strip-blobs-bigger-than 5M <repo_name>.git
    
  • Now change directory into your git package and garbage collect

     cd <repo_name>.git
     git reflog expire --expire=now --all && git gc --prune=now --aggressive
    
  • Finally, push the changes, clone a new repository and rerun the BiocCheck; this error should be gone:

     git push
    

This approach using the BFG didn't work for me, as there were still large .rda (R data) files remaining in the git history. If this happens to you, go to your package folder on the command line and use the following approach. Note again, this deletes from git history and is not reversible, so be careful:

  • Identify the names of the largest objects in the .git pack file (not necessarily in size order; in my case they were almost all .rda files):

     git rev-list --objects --all | grep -f <(git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | cut -f 1 -d " " | tail -15)

  • Filter files with the .rda extension (or whatever large files are to be deleted). Be more careful here if there are .rda files you want to retain, but this won't be the case if you are using a separate data package as described here. (I got a pretty scary looking warning from git, but it has worked out okay for me in the past.)

     git filter-branch --index-filter 'git rm --cached --ignore-unmatch *.rda' -- --all

  • Apply the removal to the repo:

     rm -Rf .git/refs/original
     rm -Rf .git/logs/
     git gc --aggressive --prune=now

  • Now you can check the new size of the pack folder: du -h .git/objects/pack

  • Mine was under 5 MB so it was finally ready for submission to Bioconductor.

Speeding up loading your data

You may have issues with the runtime of your R CMD check (see the note at the end of section 3). One way to speed up your examples/vignette/tests is to speed up the data loading process. Currently, with a call like datapackage::myData(), the ExperimentHub database is queried on every call, which can take a few seconds. We can get around this by caching the ExperimentHub object in a package-level environment, so we only query it once for the whole session - in our case, the R CMD check.

To do this, firstly delete the zzz.R file, since we will create a function for each dataset ourselves. Then create a utils.R script in your data package containing the following:

#' @import ExperimentHub

.my_internal_global_variables <- new.env(parent=emptyenv())

.get_eh <- function() get("eh", envir=.my_internal_global_variables)
.set_eh <- function(value) assign("eh", value,
                                  envir=.my_internal_global_variables)

get_ExperimentHub <- function()
{
    eh <- try(.get_eh(), silent=TRUE)
    if (inherits(eh, "try-error")) {
        eh <- ExperimentHub::ExperimentHub()
        .set_eh(eh)
    }
    eh
}

# Internal functions to call the data quickly;
# doesn't require multiple calls to eh
#' myDataset
#'
#' \code{myDataset} returns the myDataset dataset
#' @return myDataset dataset
#' @examples myDataset()
#' @export
myDataset <- function()
{
    eh <- get_ExperimentHub()
    eh[["EHID"]] # ExperimentHub ID of myDataset
}

This creates the cached ExperimentHub object on the first call, and there is a properly documented function for each dataset (make sure to replicate this for all your datasets). The end result is that you still use the datapackage::myData() functionality, but it will be sped up.

Note you also need to update the .Rd file in your package's man directory referencing the data that you created (see section 4.2). Remove all references to the individual datasets here, namely the \usage section and the per-dataset \alias tags:

   	\name{myDataPackage}
   	\docType{data}
   	\alias{myDataPackage}
   	\alias{myDataPackage-package}
   	\title{myDataPackage title line - same as in DESCRIPTION file}
   	\description{
   	  myDataPackage description paragraph - same as in DESCRIPTION file
   	}
   	\arguments{
   	  \item{metadata}{
   		\code{logical} value indicating whether metadata only should be returned
   		or if the resource should be loaded. Default behavior(metadata=FALSE) 
   		loads the data.
   	  }
   	}
   	\examples{
   	  data <- myDataset1()
   	}
   	\value{ These accessor functions return differing dataset types}
   	\source{These datasets have been sourced from various repositories, see 
   	the ExperimentHub database for details}
   	\keyword{datasets}

Other notes on Bioconductor packages

There are some other nuances to creating ExperimentHub/statistical packages for submission to Bioconductor, which I will list here:

  • Don't use LazyData: true in your DESCRIPTION. I was told this in a review note, as the Bioconductor team have rarely found it useful and it can slow package installation.
  • Bioconductor recommends a table of contents in vignettes. This should be an option with toc: true
  • Any location the package saves files to should be exposed as a parameter and, in all tests, vignettes and examples, set to a temporary directory with tempdir()
  • Don't use print(); use message() instead. Be careful though: keep print if you are printing a plot (like a ggplot: print(ggplot()) is okay), and also note that vectors need to be separated by spaces:
     #no space
     message(c(1,2,3))
     
     #looks better
     message(paste(c(1,2,3),collapse=', '))
    

5. Submitting to Bioconductor

So you are finally at the stage where you can submit your package to Bioconductor to be reviewed. To do this simply create a new Issue on the Bioconductor/Contributions GitHub repo, following the posted directions and replying to reviewer's comments.

If you had to create a data package to accompany your statistical package (see section 4.2), you should create an issue for your data package first. This is necessary as the build of your statistical package will require the data package. Once the issue is created and the status shows up as review in progress, submit your statistical package to the same issue by commenting:

AdditionalPackage: https://github.com/username/repositoryname

This is described in more detail here. Note that the Bioconductor devel mailing list is a great resource where you can email questions about your R Bioconductor package development and anyone in the community can reply and help. Think of it as Stack Overflow for Bioconductor development.

Before submitting to Bioconductor for the first time:

  1. In your DESCRIPTION file, set Version: 0.99.0.
  2. Make sure you've set up SSH keys with your GitHub account first (a sketch for generating one is after this list). Otherwise bioc-issue-bot will reply:

Add SSH keys to your GitHub account. SSH keys are used to control access to accepted Bioconductor packages. See these instructions to add SSH keys to your GitHub account. Once you add your SSH keys to your GitHub account, please resubmit your issue. We require SSH keys to be associated with the github username.

  3. Register the SSH key with Bioconductor via the BiocCredentials app. You'll need to create a new account the first time.
  4. Run git fetch --all
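
If you don't yet have an SSH key, the standard way to generate one (per GitHub's instructions; substitute your own email) is:

    # Generate an ed25519 key pair; the public half is ~/.ssh/id_ed25519.pub
    ssh-keygen -t ed25519 -C "your_email@example.com"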

What to expect after submitting to Bioconductor

Details on what to expect after submission can be found here. In summary:

  1. Initially, bioc-issue-bot will label the submission with 1.awaiting moderation. During this time, a Bioconductor team member will have a brief look at your package. If there are any issues with the package (e.g. vignette has no content), the moderator will comment and let you know. You should fix this, push the changes and comment to let them know that you've solved the issue.
  2. Once this is cleared, a moderator will be assigned to your package and the package will be labelled 2. Review in progress. At this stage, you must set up a remote to push to git.bioconductor.org; read this on setting up remotes to Bioconductor. This remote is called "upstream", e.g. as sketched below.
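
A minimal sketch of setting up the remote (replace <myPackage> with your package name; the URL pattern matches the one used later in this guide):

    git remote add upstream git@git.bioconductor.org:packages/<myPackage>.git
    git fetch --all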
  3. Throughout the 2. Review in progress stage, you may face errors in your package(s); warnings and errors can be raised by BiocCheck and/or by the moderator. These need to be resolved, the version of your package bumped, and the changes pushed to the Bioconductor git repository (upstream) and then to your git repository's master branch (in that order - this is important, as you can get errors otherwise). Use the same R development lifecycle described in section 3 and this help page describing how to bump versions and push to Bioconductor to trigger another build. Essentially, the extra step is to bump the version (0.99.X) and push upstream to the Bioconductor git and to your master branch with the following commands, triggering a new build:

NOTE: ⚠️ Since March 8th 2023, Bioc has changed the name of their upstream development branch from master to devel (see here for details). While Bioc currently recommends renaming your GitHub master branch to devel, we disagree with this strategy as it is counter to GitHub conventions and is likely to cause confusion amongst developers and users. Instead, we recommend using the main:devel syntax to map your GitHub branch master to the upstream Bioc devel branch.

git push upstream master:devel # git.bioconductor 
git push origin master # personal git

If you get the error error: failed to push some refs to 'git@git.bioconductor.org:packages/<package_name>.git' try replacing master:devel with main:devel, as suggested in Step 5 here.

Troubleshooting Bioconductor write access

First, have a read through the following guide and FAQ and see if there's anything you may have missed.

If the problem persists, here are solutions for two different scenarios.

Scenario 1

When trying to run git fetch --all, you might get the following error.

Fetching origin
Fetching upstream
/home/rstudio/.ssh/config: line 3: Bad configuration option: usekeychain
/home/rstudio/.ssh/config: terminating, 1 bad configuration options
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error: Could not fetch upstream

Solution:
This can be resolved by adding IgnoreUnknown UseKeychain to your ~/.ssh/config file, as shown here.

Scenario 2

Here's another error you might encounter when running git fetch --all. This means Bioconductor can't find the SSH key that you registered with GitHub before you submitted your package to Bioconductor (following these steps), or simply that the SSH key is not in the expected location (e.g. if you created it from within a Docker container, but now you're trying to use it in your Terminal outside of that Docker container).

Fetching origin
Fetching upstream
git@git.bioconductor.org: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
error: Could not fetch upstream

Solution:

Find the key that you previously registered with GitHub. You can list which keys you have available with ls -al ~/.ssh. In the example below, it is named id_ed25519.pub, but yours might have a slightly different name. Now copy the relevant key to the new location that you want to have Bioconductor write access from, for example:

#### Within the Docker container's Terminal ####
# Copy the SSH key to some mounted volume that you can easily find from outside of Docker
scp ~/.ssh/id_ed25519.pub ~/Desktop
#### Within the default Terminal outside of Docker ####
# Copy the SSH key to the root ssh folder
scp ~/Desktop/id_ed25519.pub ~/.ssh/

Lastly, you need to edit the ~/.ssh/config file outside of the Docker (or whatever environment you've decided to copy it into):

nano ~/.ssh/config

Now add the following bit of code to ~/.ssh/config:

Host git.bioconductor.org
	HostName git.bioconductor.org
	IdentityFile ~/.ssh/id_ed25519
	User git

You should now have write access to Bioconductor from outside of Docker!

6. Bioconductor - Future development

Newly added package

When the package is first added to a new Bioconductor release, you have to sync your GitHub repository. The Bioconductor release (e.g. RELEASE_3_13) becomes a separate upstream branch to which you need to sync. The information is here but, in short:

  • First fetch updates and new upstream releases:
 git fetch --all

You should see RELEASE_3_13 here

  • Since this is your first time using upstream RELEASE_3_13, you have to create a branch for it:
git checkout -b RELEASE_3_13 upstream/RELEASE_3_13
  • Next merge with that release:
 git checkout RELEASE_3_13
 git merge upstream/RELEASE_3_13
 git merge origin/RELEASE_3_13
 git push upstream RELEASE_3_13
 git push origin RELEASE_3_13
  • You also have to update the master branch, as its version will have been updated to an odd devel number (see below):
 git checkout master
 git merge origin/master
 git merge upstream/devel
 git push upstream devel
 git push origin master
  • You should see version 1.0.0 in RELEASE_3_13 and 1.1.0 in the master branch

See here for further details and here for reference to version numbering specifically but, in short: Bioconductor has bi-annual releases, so any updates to packages should be made in the devel version, which has an odd y in its x.y.z version number (e.g. 1.1.0); push a bumped subversion (e.g. 1.1.1) to Bioconductor to be made live in the next release. Any bug fixes to the live version, which has an even y (e.g. 1.0.0), similarly increment the subversion (e.g. to 1.0.1).

Bug fix

If there is a bug in the code, this will need to be fixed on both the current release and the devel branch. Note that the devel branch is your master branch on GitHub, and the current release will be the release branch (RELEASE_X_Y, e.g. RELEASE_3_13). The easiest way to make a bug fix is to check out each branch separately, make the fix, and bump the version by one integer on the z of x.y.z (e.g. from 1.1.0 to 1.1.1 on master, and from 1.0.0 to 1.0.1 on RELEASE_3_13). Then push upstream and push to origin:

For the master branch,

 git checkout master
 #FIX BUG
 #Commit changes
 git push upstream devel
 git push origin master

For the release branch,

 git checkout <RELEASE_X_Y>
 #FIX BUG
 #Commit changes
 git push upstream <RELEASE_X_Y>
 git push origin <RELEASE_X_Y>

These resources are very helpful for bug fixing and version numbering.

Future development

As you work on future enhancements to the code, you should push changes to your master branch as well as upstream to the devel branch, along with a single-integer bump of the z in the x.y.z version number. This would be something like:

git checkout master
git push upstream devel
git push origin master

These changes will then be visible in the nightly devel builds and will be in Bioconductor on the next release date (every 6 months). See these resources for help with version numbering.

Rebasing branches

Let's say you've made a bunch of edits to your master branch. But now you want to propagate all those same changes to some other branch. You could switch branches and rewrite all of those same edits you just did on master, but this is tedious and prone to errors/inconsistencies.

Instead, a more efficient approach is to simply copy all the changes you made in your master branch into your other branch (in this example, the RELEASE_3_15 branch):

## After committing a bunch of changes on the *master* branch...
git switch RELEASE_3_15
git rebase master -X ours

The -X ours option indicates that any conflicts between the two branches should be resolved automatically in favour of master's version of the code (during a rebase, the "ours"/"theirs" labels are swapped relative to a merge, so "ours" refers to the branch being rebased onto - here, master). This way, all the changes that were made in master are integrated into RELEASE_3_15 without manual conflict resolution.

View Bioconductor status once pushed

To view the build and check results of your package on the three Bioconductor machines (Windows, Linux and Mac), go to https://bioconductor.org/checkResults/. Note that any errors or warnings will need to be fixed and re-pushed to get your changes live on Bioconductor.

When you lose the upstream connection because you deleted your local repo but you want to push to Bioconductor

# First push your changes to master from whatever branch you were using
git clone <your repo>
# Now we need to add the upstream branches
git remote add upstream git@git.bioconductor.org:packages/MungeSumstats
git fetch --all
# Now you can see all upstream Bioconductor releases and devel
git push upstream devel
# Should work