# What is Data Science?
---

Many areas of modern scientific research and development rely on increasingly large and complex data sets. Discovery and application in science thus rely on the ability to manage these large data sets and extract meaning from them. In other words, science now relies heavily on **data science**, which has been variously defined as “…an umbrella term to describe the entire complex and multistep processes used to extract value from data” (Wing, 2019), and the ability to “bring structure to large quantities of formless data and make analysis possible” (Davenport & Patil, 2012, p.73).

**In modern kinesiological research, data science is an increasingly necessary and valued skill.** Data from techniques like EMG, motion capture, and MRI are complex and multidimensional. Being able to understand, manipulate, and visualize the structure of these complex datasets is a necessary skill for performing the research. On top of this, it is increasingly clear that very large data sets, often built collaboratively by many labs, are frequently required to make reliable inferences about scientific processes.

## Is data science just a trendy name for statistics?
While data science and statistics are overlapping fields, statistics is generally focused on the specific task of testing hypotheses based on data. Data science more broadly includes the storage, manipulation, visualization, filtering, and preparation of data that is typically required prior to statistical analysis. In fact, we will, by necessity, devote more time in this course to the steps that precede data analysis than to actual algorithms (although there will be some of those as well). Data science also encompasses statistics, as well as machine learning; whereas statistics generally involves deriving conclusions from existing data, machine learning involves making predictions from a data set that will generalize to other data. Since statistics is covered in other courses in the kinesiology curriculum, this course focuses instead on the other “front-end” aspects of data science described above. Other areas of data science, including software development and “back-end” data science (engineering, hardware, databases), will not be covered.

This highlights a mindset that differs quite dramatically in data science, as compared to the basic statistics taught in undergraduate curricula. Data science includes practices that are more exploratory. In experimentally-oriented disciplines such as exercise physiology or biomechanics, statistics are a natural approach to deriving meaning from data. This is because data typically come from experiments, in which the researcher(s) systematically and intentionally manipulated certain variables. A good experiment is **hypothesis-driven**, meaning that the researcher has predictions in advance as to how the data will systematically vary with the experimental manipulations. These predictions are usually based on past experimental findings, or models of the process being studied. Statistics are fundamentally embedded in data science (indeed, the concept of "data science" as a discipline emerged from the field of statistics), but data science can be thought of as a larger set of practices that includes statistics, machine learning, data cleaning and transformations, and visualization. Many of these approaches are more **exploratory** than hypothesis-driven. That is, rather than looking for a specific, predicted pattern, the data scientist explores the data to find systematic patterns that may emerge from it. For example, researchers using techniques like motion capture have attempted to use machine learning algorithms to detect subtle alterations in walking mechanics, as a means of one day being able to "classify" clinically asymptomatic people as being at higher risk of developing neurologic disease, such as Parkinson's.

## Tools for Data Science
Central to data science is the ability to use scientific programming languages, such as Python, MATLAB, and R. This ability includes a strong understanding of the fundamentals of at least one programming language, and the ability to extend one’s knowledge through continuous learning and problem-solving. This course teaches Python, a mature and widely-used language in modern scientific research and data science more broadly. However, many of the fundamentals of scientific programming and data science are common to many languages. Thus, having learned Python, you will be well-prepared to learn new languages in the future, as necessary.

Another important facet of data science is that it is a **team endeavour**. On the one hand, it is founded on open, shared software developed by widely distributed teams of contributors. On the other hand, the practice of data science typically involves teams of individuals with complementary skillsets, due to both the size and the complexity of many projects. In science, these teams often comprise students and faculty members in collaborating labs distributed around the world. Team members with different skillsets can also teach each other new things, often through demonstration in a shared project. This class prepares you for such collaboration by developing and coaching your teamwork skills, as well as teaching you how to use software platforms that support such collaboration.

The skills learned in this class will benefit students working in a wide range of areas of kinesiology. As well, the class will provide an introductory foundation in data science that can be applied to a range of areas beyond kinesiology, in academia, industry, and government.

# Reproducibility

A fundamental principle of empirical (experimental) science is **reproducibility**. Scientific results should not be flukes; they should be based on documented and replicable processes. When we report the results of an experiment, we typically present a written description of the methods, as well as written and graphical reports of the results. In principle, these descriptions should be sufficient for a reader to reproduce the experiment, and hopefully get similar results. Of course, in kinesiological research, each experiment typically involves a new sample of participants, so even if the experiment is reproduced exactly, and the data analyzed identically, we can expect some variability in the results because we sampled a different set of individuals. However, if we take a copy of the original data, we should be able to produce the same results by following the documented procedures. This is one of the fundamental principles of **open science**.

In practice, this is harder than it sounds, especially if the analysis is done manually, for example by copying and pasting values in a spreadsheet. Unless the analysis was documented very closely, it's possible that methodological differences will arise. For example, a Methods section might state that the mean reaction time (RT) was calculated for each participant and then averaged across participants, but this doesn't specify that this was done by a lab volunteer cutting and pasting numbers while simultaneously watching YouTube and chatting with another student in the lab. Even if the procedures were precisely documented, replicating the process would still be just as tedious and error-prone, and very likely even the errors would be different, since they are random occurrences.

Science should not be this way; science should be accurate, precise, reliable, and reproducible. For this to be true, we need high standards of control and automation to ensure that data is handled consistently and reproducibly. If only we could replace those flaky lab volunteers with machines that did precisely what we intended...


# Scientific Programming Languages

According to [Wikipedia](https://en.wikipedia.org/wiki/Programming_language), a programming language is “a formal language, which comprises a set of instructions that produce various kinds of output”, where “formal languages” are characterized by hierarchical organization in which letters are combined to form words, which are in turn combined into larger units according to rules called a **syntax** (or grammar). In general, programming languages are used to write instructions for computers to perform. There are thousands of programming languages in existence, which [Wikipedia attempts to catalogue on this page](https://en.wikipedia.org/wiki/List_of_programming_languages).

There is nothing special about “scientific” programming languages to distinguish them from other programming languages, except that they are used for scientific purposes. Some languages, however, have become particularly widespread in scientific applications. Different languages are discussed below, but first, to address the cliffhanger left at the end of the preceding section: programming languages provide a way to standardize and automate data analysis so that it is reproducible. Since programs are written sets of instructions stored in a file, the same set of instructions can be applied to every data file in a study, and if the programs used to analyze the data are shared with others, then others should be able to reproduce the original results. There is, of course, no guarantee that the original program was free of errors (“bugs”), but the fact that the instructions are written and saved means that they can be audited by others, which makes finding errors much easier (or possible at all) relative to a manual task performed by humans.

Computer programs can also be written in ways to “batch” work, meaning that they can scale easily. For example, once a program has been written to process one data file in a desired way (e.g., compute an individual’s mean reaction times for each condition), that program can be placed in a “loop” that applies the same process to every data file in a study. While running the program on each data file takes a certain amount of time, meaning more data will take longer to analyze, computer programs typically perform these kinds of routine tasks far faster than humans (often literally in the blink of an eye), and far more reliably. Where humans might make random errors, computer programs do not: if the program contains an error, it will systematically make the same error on every data file it processes. While errors are obviously not desirable, a systematic error is typically easier to detect and correct than random errors.
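
As a minimal sketch of this batch idea, the example below writes two small, made-up data files to a temporary folder and then uses a loop to apply the same function to each of them. The file names, the column names (`condition`, `rt`), and all of the values are hypothetical, invented purely for illustration; reaction times are assumed to be in milliseconds.

```python
import csv
import tempfile
from pathlib import Path
from statistics import mean

# Hypothetical data: (condition, reaction time in ms) for two participants.
fake_data = {
    "participant_01.csv": [("easy", 400), ("easy", 420), ("hard", 600), ("hard", 640)],
    "participant_02.csv": [("easy", 380), ("easy", 400), ("hard", 560), ("hard", 580)],
}


def mean_rt_by_condition(csv_path):
    """Return {condition: mean reaction time} for one participant's data file."""
    rts = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            rts.setdefault(row["condition"], []).append(float(row["rt"]))
    return {condition: mean(values) for condition, values in rts.items()}


with tempfile.TemporaryDirectory() as folder:
    # Write the made-up files so that the example is self-contained.
    for file_name, rows in fake_data.items():
        with open(Path(folder) / file_name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["condition", "rt"])
            writer.writerows(rows)

    # The "loop": the same instructions are applied to every data file.
    results = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        results[csv_path.name] = mean_rt_by_condition(csv_path)

for file_name, condition_means in results.items():
    print(file_name, condition_means)
```

Adding a third participant would only require a third file; the function and the loop would stay exactly the same. Likewise, if `mean_rt_by_condition` contained a bug, every file would be affected in the same way, which is what makes such errors systematic rather than random.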


# Which language to use?

Although there are thousands of programming languages, a relatively small number of them are widely used. Which languages are the most popular or commonly used depends on the discipline and area of use. Different languages are designed for very different purposes, and in many cases new work builds on older work, so the use of a particular language in a particular setting will tend to cause that language to propagate within that setting. For example, some languages are well-suited to building interactive web sites, while others may be more suitable for building apps for mobile devices, and others for writing code to be embedded in hardware devices. One of the most long-standing and representative indices of programming language popularity is the [TIOBE web site](https://www.tiobe.com/tiobe-index/), whose ratings are based on “the number of skilled engineers world-wide, courses and third party vendors. Popular search engines such as Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube and Baidu are used to calculate the ratings.” As of April 2020, the 10 most popular languages were (in order): Java, C, Python, C\++, C#, Visual Basic, JavaScript, PHP, SQL, and R. (Don’t worry if you haven’t heard of all of these; there won’t be a test! This is merely to give you a sense of the “lay of the land”, and expose you to the names of languages you’re likely to come across in the future.)

In data science, a few languages are particularly widely used. The internet is rife with clickbait-y pages such as “Top languages every data scientist should know”; while these may rely on questionable methodologies, a general survey of such pages reveals a fairly consistent set of languages, including Python, R, MATLAB, C, Java, SQL, Julia, and Scala. 

It is important to know that there is no single “best” programming language, either for programming in general or for kinesiology in particular. Indeed, many scientists have workflows that include multiple languages. For example, some of my own lab's research involves a robotic device that is controlled through MATLAB/Simulink programs. For these studies, we collect the data through MATLAB, but rely on Python to process the data and to perform the statistics. However, other neuromechanics labs may use different workflows, such as MATLAB for processing and SPSS for statistics.

So the punch line is, you should use the language that is best-suited for the task at hand. In the example of my lab’s workflow, we use Python for data processing because I prefer Python as a language to work in for some of the reasons described below. And we perform the statistics using Python as well, because I would like to keep the language consistent across as many aspects of our data analysis pipeline as possible, and I have also written [an open, publicly available Python library](https://github.com/hyosubkim/bayesian-statistics-toolbox) (a library is simply a set of tools written to extend functionality and perform particular tasks) for running the sorts of statistical models we prefer. 


# Why Python for this Course?

With all that said, I have chosen to use Python in this course, for several reasons. Firstly, it is quickly becoming one of the most widely used languages in modern science generally, and it is being picked up in greater and greater numbers each year within kinesiology as well. It is also one of the most, if not *the* most, widely used programming languages in non-academic data science applications (i.e., industry), so it will serve you well in the future whether you extend your academic career or decide to go into industry, health care, government, etc. Secondly, it is an extremely well-designed language, with syntax and structure that many find much easier to understand than those of other high-level languages, like R or MATLAB. Thirdly, it is an **open-source** language, meaning that it is free to obtain and use. This is also true of R, and indeed most widely-used programming languages, but MATLAB is closed-source: it is developed and sold by MathWorks, a private company. While UBC pays for a site license that allows all students and staff to use it, this is still a potential limitation for you in the future, and it is also inconsistent with a core principle of this class, which is to use and promote open-source tools and resources as much as possible.


<div>
<img src="images/python-logo-master-v3-TM.png" width="500"/>
</div>


This class will teach you to understand and use the [Python](https://www.python.org) programming language, along with a set of libraries commonly used in data science broadly. I do not expect that you’ve ever written a line of code in any programming language before, so in learning to use Python, you will also be learning to program. Using Python for data science and programming are not exactly the same thing: programming describes a broader range of skills than using a programming language to do data science. As well as learning the "words" (commands) and "grammar" (how to define and combine commands) of a programming language, programming encompasses particular ways of thinking. One important programming skill is **operationalization**: analyzing and breaking down problems, and identifying the sequence of steps to solve them. Another is paying close attention to the details of how you write and format your code (all of a sudden, not indenting a line is not just a violation of that annoying APA Style guide, but causes your code to function in a totally different way, or not at all!).
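
To make the point about indentation concrete, here is a small sketch. The two functions below are hypothetical and contain exactly the same lines; the only difference is how far the `return` line is indented, which determines whether it belongs to the loop or runs after the loop has finished.

```python
def total_after_loop(values):
    """Add up all values; `return` runs once, after the loop finishes."""
    total = 0
    for value in values:
        total = total + value
    return total


def total_inside_loop(values):
    """Same lines, but `return` is indented into the loop body,
    so the function exits during the first pass through the loop."""
    total = 0
    for value in values:
        total = total + value
        return total


step_counts = [10, 20, 30]  # made-up numbers for illustration
print(total_after_loop(step_counts))   # sum of all three values
print(total_inside_loop(step_counts))  # only the first value was added
```

Neither function produces an error message, which is what makes this kind of mistake easy to miss: the second version simply returns a different answer.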

Python was originally written by [Guido van Rossum](https://en.wikipedia.org/wiki/Guido_van_Rossum) and first released in 1991. Its name has nothing to do with snakes, but rather was derived from the famous comedy sketch troupe Monty Python. Python was developed as an [**open-source**](https://en.wikipedia.org/wiki/Open-source_software) project. This means several things. Firstly, it is made available for free, with anyone being granted permission to use, examine, modify, and share the [**source code**](https://en.wikipedia.org/wiki/Source_code) (the code that runs when you run a Python command). Secondly, many people have contributed to the development of the language, typically without receiving any payment (though some developers may have contributed to Python in the context of working for a company that relied on the language, or simply embraced values of supporting the open-source community). Van Rossum was the lead developer for the project until 2018, and the development of the language is now guided by an elected five-person steering council (on which Van Rossum served during its first term). Like virtually every active programming language, Python is under continual development, to fix bugs, improve its efficiency, and extend its abilities. Python has gone through three major versions, each with many minor releases. Development is guided by officially reviewed and approved [Python Enhancement Proposals](https://www.python.org/dev/peps/) (PEPs). Some PEPs also serve as official guidelines. For example, PEP 20 is [The Zen of Python](https://www.python.org/dev/peps/pep-0020/), which espouses core values of the language, while PEP 8 is [The Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/), which we will return to later and often, as it provides guidelines (e.g., how to indent, as mentioned above) that make code consistent and easy to read. Note that indentation itself affects how Python interprets your code; PEP 8 only recommends a consistent way of doing it.

In a very general and nontechnical sense, programming languages can be characterized as falling on a continuum from “higher level” to “lower level” (or, perhaps more simply, from easier to use and learn to harder to use and learn). Python falls closer to the “high level” end of this spectrum, relative to languages like C or Java. This often means it takes less code to perform a particular function, because more things are baked in "for free" in Python that one might have to explicitly write code for in C. As a result, Python is simpler and more elegant to read and write. Indeed, PEP 20 enshrines certain core values of the language, such as:

*Beautiful is better than ugly.*

*Explicit is better than implicit.*

*Simple is better than complex.*

*Readability counts.*

These values contribute to making Python relatively easy to learn and use, compared to other programming languages. At the same time, programs written in Python (if written properly, and especially when they rely on optimized libraries) can run quickly and efficiently, so the "overhead" relative to using a lower-level language is often small in practice. Python has been widely adopted by communities in many areas of science, and in data science, because of this (and the fact that it's free). Many add-on packages (**libraries**) have been written to extend Python's functionality in various ways, including a large number of libraries specifically for scientific applications.
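
The PEP 20 values quoted above ship with Python itself: importing the standard-library module `this` prints the full text of The Zen of Python. The sketch below captures that printed text so that only the lines quoted in this section are displayed. It assumes `this` has not already been imported in the current session, since the text is only printed the first time the module is imported.

```python
import contextlib
import io

# Importing `this` prints the Zen of Python; capture the printout in a buffer.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    import this  # imported only for the text it prints

zen_text = buffer.getvalue()

# Show only the aphorisms quoted in this section.
quoted = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Readability counts.",
]
for line in zen_text.splitlines():
    if line in quoted:
        print(line)
```

In everyday use, simply typing `import this` at a Python prompt is enough to see the whole text.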


---
This section was adapted from Aaron J. Newman's [Data Science for Psychology and Neuroscience - in Python](https://neuraldatascience.io/intro.html).