Copy files from 1e to 2e
jeroenjanssens committed Jul 8, 2020
1 parent 87e4c88 commit 4487dc5
Showing 1,207 changed files with 205,011 additions and 9 deletions.
2 changes: 1 addition & 1 deletion book/1e/_bookdown.yml
@@ -1,6 +1,6 @@
book_filename: "book"
repo: https://github.com/jeroenjanssens/data-science-at-the-command-line
-output_dir: "../_book_output"
+output_dir: "../../www/static/1e/"
link-citations: true
language:
ui:
Binary file modified book/1e/book.rds
Binary file not shown.
3 changes: 2 additions & 1 deletion book/1e/index.Rmd
@@ -54,6 +54,7 @@ knitr::include_graphics('images/cover.png', dpi = NA)

If you find this content useful, please consider supporting the work by either:

* Sponsoring the author on [GitHub Sponsors](https://github.com/sponsors/jeroenjanssens/).
* Buying the book on [Amazon](https://www.amazon.com/Data-Science-Command-Line-Time-Tested/dp/1491947853) or [bol.com](https://www.bol.com/nl/p/data-science-at-the-command-line/9200000031673818).
* Writing a review on [Amazon](https://www.amazon.com/Data-Science-Command-Line-Time-Tested/dp/1491947853) or [Goodreads](https://www.goodreads.com/book/show/22967424-data-science-at-the-command-line).
* Starring the [Github repository](https://github.com/jeroenjanssens/data-science-at-the-command-line) or [Docker image](https://hub.docker.com/u/datascienceworkshops/).
@@ -66,7 +67,7 @@ knitr::include_graphics("images/data-science-workshops.svg")
```
</a>

-Did you know that the author gives in-company training about this topic and other topics such as R and Python? If you and your colleagues would like to learn from Jeroen in person, please contact [Data Science Workshops B.V.](https://datascienceworkshops.com) for more information.
+Did you know that the author provides training about this topic and other topics such as R and Python? If you and your colleagues would like to learn from Jeroen in person, please contact [Data Science Workshops B.V.](https://www.datascienceworkshops.com) for more information.


```{r include=FALSE}
60 changes: 53 additions & 7 deletions book/1e/packages.bib
@@ -3,27 +3,73 @@ @Manual{R-base
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
-year = {2019},
+year = {2020},
url = {https://www.R-project.org/},
}

@Manual{R-bookdown,
title = {bookdown: Authoring Books and Technical Documents with R Markdown},
author = {Yihui Xie},
-year = {2018},
-note = {R package version 0.9},
+year = {2020},
+note = {R package version 0.20},
url = {https://CRAN.R-project.org/package=bookdown},
}

@Manual{R-knitr,
title = {knitr: A General-Purpose Package for Dynamic Report Generation in R},
author = {Yihui Xie},
-year = {2019},
-note = {R package version 1.22},
+year = {2020},
+note = {R package version 1.29},
url = {https://CRAN.R-project.org/package=knitr},
}

@Manual{R-rmarkdown,
title = {rmarkdown: Dynamic Documents for R},
author = {JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone},
-year = {2019},
-note = {R package version 1.12},
+year = {2020},
+note = {R package version 2.3},
url = {https://CRAN.R-project.org/package=rmarkdown},
}

@Book{bookdown2016,
title = {bookdown: Authoring Books and Technical Documents with {R} Markdown},
author = {Yihui Xie},
publisher = {Chapman and Hall/CRC},
address = {Boca Raton, Florida},
year = {2016},
note = {ISBN 978-1138700109},
url = {https://github.com/rstudio/bookdown},
}

@Book{knitr2015,
title = {Dynamic Documents with {R} and knitr},
author = {Yihui Xie},
publisher = {Chapman and Hall/CRC},
address = {Boca Raton, Florida},
year = {2015},
edition = {2nd},
note = {ISBN 978-1498716963},
url = {https://yihui.org/knitr/},
}

@InCollection{knitr2014,
booktitle = {Implementing Reproducible Computational Research},
editor = {Victoria Stodden and Friedrich Leisch and Roger D. Peng},
title = {knitr: A Comprehensive Tool for Reproducible Research in {R}},
author = {Yihui Xie},
publisher = {Chapman and Hall/CRC},
year = {2014},
note = {ISBN 978-1466561595},
url = {http://www.crcpress.com/product/isbn/9781466561595},
}

@Book{rmarkdown2018,
title = {R Markdown: The Definitive Guide},
author = {Yihui Xie and J.J. Allaire and Garrett Grolemund},
publisher = {Chapman and Hall/CRC},
address = {Boca Raton, Florida},
year = {2018},
note = {ISBN 9781138359338},
url = {https://bookdown.org/yihui/rmarkdown},
}

2 changes: 2 additions & 0 deletions book/2e/.Rbuildignore
@@ -0,0 +1,2 @@
^.*\.Rproj$
^\.Rproj\.user$
59 changes: 59 additions & 0 deletions book/2e/00-preface.Rmd
@@ -0,0 +1,59 @@
# Preface {-}

Data science is an exciting field to work in. It’s also still very young. Unfortunately, many people, and especially companies, believe that you need new technology in order to tackle the problems posed by data science. However, as this book demonstrates, many things can be accomplished by using the command line instead, and sometimes in a much more efficient way.

Around five years ago, during my PhD program, I gradually switched from using Microsoft Windows to Linux. Because it was a bit scary at first, I started by installing both operating systems side by side (known as dual-boot). The urge to switch back to Microsoft Windows faded, and at some point I was even tinkering around with Arch Linux, which allows you to build up your own custom Linux machine from scratch. All you’re given is the command line, and it’s up to you what you want to make of it. Out of necessity I quickly became very comfortable using the command line. Eventually, as spare time got more precious, I settled down with a Linux distribution known as Ubuntu because of its ease of use and large community. However, the command line is still where I’m spending most of my time.

It actually wasn’t too long ago that I realized that the command line is not just for installing software, system configuration, and searching files. I started learning about command-line tools such as `cut`, `sort`, and `sed`. These are examples of command-line tools that take data as input, do something to it, and print the result. Ubuntu comes with quite a few of them. Once I understood the potential of combining these small tools, I was hooked.
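
As a rough illustration of what combining these small tools looks like, the sketch below chains `sed`, `cut`, and `sort` with pipes. The sample data and the variable names are made up for this example and are not taken from the book.

```shell
# Hypothetical sample data (not from the book): a header row followed by
# name,city records.
data='name,city
alice,Utrecht
bob,Amsterdam
carol,Utrecht'

# sed deletes the header line, cut keeps the second comma-separated field,
# and sort -u sorts the remaining values and removes duplicates.
result=$(printf '%s\n' "$data" | sed '1d' | cut -d, -f2 | sort -u)
printf '%s\n' "$result"
```

Each tool only reads standard input and writes standard output, so any step can be swapped out or extended without touching the others.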

After my PhD, when I became a data scientist, I wanted to use this approach to do data science as much as possible. Thanks to a couple of new, open-source command-line tools including `xml2json`, `jq`, and `json2csv` I was even able to use the command line for tasks such as scraping websites and processing lots of JSON data. In September 2013, I decided to write a blog post titled *Seven Command-line Tools for Data Science*, which is available at <http://www.jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html>. To my surprise, the blog post got quite a bit of attention and I received a lot of suggestions for other command-line tools. I started wondering whether this blog post could be turned into a book. I’m pleased that, some ten months later, with the help of many talented people (see the acknowledgments below), the answer is yes.

I am sharing this personal story not so much because I think you should know how this book came about, but because I want you to know that I had to learn about the command line as well. Because the command line is so different from using a graphical user interface, it can seem scary at first. But if I can learn it, then you can as well. No matter what your current operating system is and no matter how you currently work with data, after reading this book you will be able to do data science at the command line. If you’re already familiar with the command line, or even if you’re already dreaming in shell scripts, chances are that you’ll still discover a few interesting tricks or command-line tools to use for your next data science project.

## What to Expect from This Book {-}

In this book, we’re going to obtain, scrub, explore, and model data - a lot of it. This book is not so much about how to become *better* at those data science tasks. There are already great resources available that discuss, for example, when to apply which statistical test or how data can be best visualized. Instead, this practical book aims to make you more *efficient* and *productive* by teaching you how to perform those data science tasks at the command line.

While this book discusses over 80 command-line tools, it’s not the tools themselves that matter most. Some command-line tools have been around for a very long time, while others will be replaced by better ones. There are even command-line tools that are being created as you’re reading this. In the past nine months, I have discovered many amazing command-line tools. Unfortunately, some of them were discovered too late to be included in the book. In short, command-line tools come and go. But that’s OK.

What matters most is the underlying idea of working with tools, pipes, and data. Most of the command-line tools do one thing and do it well. This is part of the UNIX philosophy, which makes several appearances throughout the book. Once you become familiar with the command line, know how to combine command-line tools, and can even create new ones, you have developed an invaluable skill.
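
To make the idea of creating a new tool concrete, here is a minimal sketch of a shell function that wraps a pipeline of existing tools. The name `most_frequent` is hypothetical and is not one of the tools discussed in the book; it only assumes the standard `sort`, `uniq`, `head`, and `sed` utilities.

```shell
# Hypothetical helper, not a tool from the book: print the N most frequent
# lines read from standard input (default 3), most frequent first.
most_frequent() {
  sort | uniq -c | sort -rn | head -n "${1:-3}" | sed 's/^ *//'
}

# Because it reads standard input and writes standard output,
# it can be placed in a pipeline like any other command.
result=$(printf '%s\n' b a b c b a | most_frequent 2)
printf '%s\n' "$result"
```

The function does one thing, counting and ranking lines, and leaves parsing and formatting to whatever comes before or after it in the pipeline.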

## How to Read This Book {-}

In general, you’re advised to read this book in a linear fashion. Once a concept or command-line tool has been introduced, chances are that we employ it in a later chapter. For example, in Chapter 9, we make heavy use of `parallel`, which is introduced extensively in Chapter 8.

Data science is a broad field that intersects many other fields such as programming, data visualization, and machine learning. As a result, this book touches on many interesting topics which unfortunately cannot be discussed at full length. Throughout the book, there are suggestions for additional reading. It’s not required to read this material in order to follow along with the book, but if you are interested, know that there’s much more to learn.

## Who This Book Is For {-}

This book makes just one assumption about you: that you work with data. It doesn’t matter which programming language or statistical computing environment you’re currently using. The book explains all the necessary concepts from the beginning.

It also doesn’t matter whether your operating system is Microsoft Windows, macOS, or some form of Linux. The book comes with a Docker image, which is an easy-to-install virtual environment. It allows you to run the command-line tools and follow along with the code examples in the same environment in which this book was written. You don’t have to waste time figuring out how to install all the command-line tools and their dependencies.

The book contains some code in Bash, Python, and R so it’s helpful if you have some programming experience, but it’s by no means required to follow along with the examples.


## Acknowledgments {-}

First of all, I’d like to thank Mike Dewar and Mike Loukides for believing that my blog post [Seven Command-Line Tools for Data Science](http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html), which I wrote in September 2013, could be expanded into a book.

Special thanks to my technical reviewers Mike Dewar, Brian Eoff, and Shane Reustle for reading various drafts, meticulously testing all the commands, and providing invaluable feedback. Your efforts have improved the book greatly. The remaining errors are entirely my own responsibility.

I had the privilege of working together with three amazing editors, namely: Ann Spencer, Julie Steele, and Marie Beaugureau. Thank you for your guidance and for being such great liaisons with the many talented people at O’Reilly. Those people include: Laura Baldwin, Huguette Barriere, Sophia DeMartini, Yasmina Greco, Rachel James, Ben Lorica, Mike Loukides, and Christopher Pappas. There are many others whom I haven’t met because they are operating behind the scenes. Together they ensured that working with O’Reilly has truly been a pleasure.

This book discusses over 80 command-line tools. Needless to say, without these tools, this book wouldn’t have existed in the first place. I’m therefore extremely grateful to all the authors who created and contributed to these tools. The complete list of authors is unfortunately too long to include here; they are mentioned in the Appendix. Thanks especially to Aaron Crow, Jehiah Czebotar, Christopher Groskopf, Dima Kogan, Sergey Lisitsyn, Francisco J. Martin, and Ole Tange for providing help with their amazing command-line tools.

Eric Postma and Jaap van den Herik, who supervised me during my PhD program, deserve a special thank you. Over the course of five years they have taught me many lessons. Although writing a technical book is quite different from writing a PhD thesis, many of those lessons proved to be very helpful in the past nine months as well.

Finally, I’d like to thank my colleagues at YPlan, my friends, my family, and especially my wife Esther for supporting me and for pulling me away from the command line at just the right times.

## Dedication {-}

*To my wife, Esther. Without her encouragement, support, and patience, this book would surely have ended up in `/dev/null`.*


## About the Author {-}


[Jeroen Janssens](http://jeroenjanssens.com/) is the founder and CEO of [Data Science Workshops](https://datascienceworkshops.com), which provides on-the-job training and coaching in data visualisation, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. He can be found on [Twitter](https://twitter.com/jeroenhjanssens/), [LinkedIn](http://www.linkedin.com/in/jeroenjanssens), and [GitHub](https://github.com/jeroenjanssens).
