## Dealing with data

- Passing data around
  - recommendation: do NOT use pickle (Python) or RDS (R)
    - can only be accessed with Python/R
    - neither is backwards compatible by default
    - cannot be diffed with git nor viewed on GitHub
  - do use a plain tabular format (CSV/TSV) or a domain-specific format (e.g. VCF/FASTA/PDB):
    - can be opened in any relevant software
    - easy to track changes
- Versioning data
   - do not store data in the same git repository
     - large files make git difficult to work with after some time
     - you may want to keep data more private than the analyses/summaries
   - have a separate repository for data
     - do not push it to an external service provider (GitHub, GitLab, etc.)
     - do push it to a dedicated server within your organization
   - or, use artifact versioning
     - [quilt](https://github.com/quiltdata/quilt) - versioned data portal for AWS, petabyte-scale, notebook-oriented
     - [scrapbook](https://github.com/nteract/scrapbook) - records the cell outputs (graphics, tables) separately from the notebook
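
As a rough illustration of the plain tabular recommendation above, here is a minimal sketch that writes records to TSV and reads them back using only the Python standard library. The records, column names, and helper names (`write_tsv`, `read_tsv`) are made up for this example.

```python
import csv
import io

# Hypothetical example records; the column names are illustrative only.
rows = [
    {"sample_id": "S1", "metabolite": "glucose", "level": "5.1"},
    {"sample_id": "S2", "metabolite": "glucose", "level": "4.7"},
]


def write_tsv(records, handle):
    """Write a list of dicts as tab-separated text with a header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=list(records[0]),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)


def read_tsv(handle):
    """Read tab-separated text back into a list of dicts."""
    return [dict(record) for record in csv.DictReader(handle, delimiter="\t")]


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_tsv(rows, buffer)
text = buffer.getvalue()

buffer.seek(0)
restored = read_tsv(buffer)
```

Unlike pickle, a text format does not store types: every value comes back as a string, so numeric columns need an explicit conversion after loading.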

## Dealing with patient data

- You most likely work with anonymized data to begin with
  - you (of course) keep the data locally, probably password-protected and outside of the repository (or at least listed in a `.gitignore` file),
  - and you do not print out the entire dataset at once, just to be certain
- But would it be ok if the data became public at any point without proper vetting?
  - anything you push to an online repository is at a higher risk, even if it's private
- Case study: a patient from a minority background + recruited at the other, smaller hospital you work with + after six pregnancies.
  - could this patient be flagged as an outlier? If yes, why?
     - possibly for any of the three characteristics, let alone the sum of them
  - would this patient be easier to identify than others?
     - how many persons with such characteristics **and** the disease you study live there?
  - if both answers are yes, you may want to make an extra effort to avoid revealing any additional information, e.g. which metabolites they had up/down
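
One way to approach the "easier to identify" question is to count how many records share each combination of characteristics. The sketch below does this on made-up records. The field names, the threshold `k`, and the helper `rare_combinations` are assumptions for illustration, not part of any particular library.

```python
from collections import Counter

# Made-up records; the field names are illustrative assumptions.
patients = [
    {"id": "P1", "background": "majority", "hospital": "main", "pregnancies": 1},
    {"id": "P2", "background": "majority", "hospital": "main", "pregnancies": 1},
    {"id": "P3", "background": "majority", "hospital": "main", "pregnancies": 2},
    {"id": "P4", "background": "majority", "hospital": "main", "pregnancies": 2},
    {"id": "P5", "background": "minority", "hospital": "small", "pregnancies": 6},
]

# Characteristics that could identify a person when combined.
QUASI_IDENTIFIERS = ("background", "hospital", "pregnancies")


def rare_combinations(records, keys, k=2):
    """Return ids of records whose combination of `keys` is shared by
    fewer than `k` records in the dataset."""

    def combination(record):
        return tuple(record[key] for key in keys)

    counts = Counter(combination(record) for record in records)
    return [r["id"] for r in records if counts[combination(r)] < k]


# Only P5 has a combination that nobody else in the dataset shares.
flagged = rare_combinations(patients, QUASI_IDENTIFIERS)
```

This only counts within the dataset. How many people in the wider population share the same characteristics is a separate question that the data alone cannot answer.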

### Mitigation strategies
- Proper inspection of diffs is one way to control what is being committed to your repository; even anonymized data should be handled with great care.
- If you want to check for any accidental change in the data without printing it out (or adding it to version control), it might be better to use a checksum, as demonstrated with the MD5 example\*

\*) while checksums are not unique, to get one collision you would need to generate random data at a rate of ["6 billion per second for 100 years"](https://stackoverflow.com/a/288519). It might be that there are more pressing issues to worry about on such time scales.
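
A minimal sketch of such a checksum check, using `hashlib` from the Python standard library. The helper name `file_md5`, the file name, and the file contents are made up for this example.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so that
    large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file (hypothetical name and content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.tsv")
    with open(path, "wb") as handle:
        handle.write(b"sample_id\tlevel\nS1\t5.1\n")
    before = file_md5(path)

    # Simulate an accidental modification of the data file.
    with open(path, "ab") as handle:
        handle.write(b"S2\t4.7\n")
    after = file_md5(path)

# The digests differ, so the change is detected without printing the data.
changed = before != after
```

Only the digest needs to be printed or recorded, never the data itself. MD5 is adequate for spotting accidental changes. If deliberate tampering is a concern, `hashlib.sha256` can be swapped in with the same interface.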