# What is Data Science?
---

Many areas of modern scientific research and development rely on increasingly large and complex data sets. Discovery and application in science thus rely on the ability to manage these large data sets and extract meaning from them. In other words, kinesiology now relies heavily on **data science**, which has been variously defined as “…an umbrella term to describe the entire complex and multistep processes used to extract value from data” (Wing, 2019), and the ability to “bring structure to large quantities of formless data and make analysis possible” (Davenport & Patil, 2012, p.73).

**In modern kinesiological research, data science is an increasingly necessary and valued skill.** Data from techniques like EMG, motion capture, and MRI are complex and multidimensional. Being able to understand, manipulate, and visualize the structure of these complex data sets is necessary for performing the research. On top of this, it is increasingly clear that very large data sets, often built collaboratively by many labs, are frequently required to make reliable inferences about scientific processes.

## Is data science just a trendy name for statistics?
While data science and statistics are overlapping fields, statistics is generally focused on the specific task of testing hypotheses based on data. Data science more broadly includes the storage, manipulation, visualization, filtering, and preparation of data that is typically required prior to statistical analysis. In fact, we will, by necessity, devote more time in this course to the steps that come before data analysis than to analysis algorithms themselves (although we will cover some of those as well). Data science also encompasses statistics and machine learning: whereas statistics generally involves deriving conclusions from existing data, machine learning involves making predictions from a data set that will generalize to other data. Since statistics is covered in other courses in the kinesiology curriculum, this course focuses instead on the other “front-end” aspects of data science described above. Other areas of data science, including software development and “back-end” data science (engineering, hardware, databases), will not be covered.

The mindset of data science also differs quite dramatically from that of the basic statistics taught in undergraduate curricula, because data science includes practices that are more exploratory. In experimentally oriented disciplines such as exercise physiology or biomechanics, statistics are a natural approach to deriving meaning from data. This is because data typically come from experiments, in which the researcher(s) systematically and intentionally manipulated certain variables. A good experiment is **hypothesis-driven**, meaning that the researcher has predictions in advance as to how the data will systematically vary with the experimental manipulations. These predictions are usually based on past experimental findings, or models of the process being studied. Statistics are fundamentally embedded in data science (indeed, the concept of "data science" as a discipline emerged from the field of statistics), but data science can be thought of as a larger set of practices that includes statistics, machine learning, data cleaning and transformations, and visualization. Many of these approaches are more **exploratory** than hypothesis-driven. That is, rather than looking for a specific, predicted pattern, the data scientist explores the data to find whatever systematic patterns may emerge. For example, researchers using techniques like motion capture have attempted to use machine learning algorithms to detect subtle alterations in walking mechanics, with the goal of one day being able to "classify" clinically asymptomatic people as being at higher risk of developing neurological diseases such as Parkinson's disease.

## Tools for Data Science
Central to data science is the ability to use scientific programming languages, such as Python, MATLAB, and R. This ability includes a strong understanding of the fundamentals of at least one programming language, and the ability to extend one’s knowledge through continuous learning and problem-solving. This course teaches Python, a mature and widely used language in modern scientific research and data science more broadly. However, many of the fundamentals of scientific programming and data science are shared across languages. Thus, having learned Python, you will be well-prepared to learn new languages in the future, as necessary.

Another important facet of data science is that it is a **team endeavour**. On the one hand, it is founded on open, shared software developed by widely distributed teams of contributors. On the other hand, the practice of data science typically involves teams of individuals with complementary skillsets, due to both the size and the complexity of many projects. In science, these teams often comprise students and faculty members in collaborating labs distributed around the world. Team members with different skillsets can also teach each other new things, often through demonstration in a shared project. This class prepares you for such collaboration by developing and coaching your teamwork skills, as well as teaching you how to use software platforms that support such collaboration.

The skills learned in this class will benefit students working in a wide range of areas of kinesiology. As well, the class will provide an introductory foundation in data science that can be applied to a range of areas beyond kinesiology, in academia, industry, and government.

<div>
<img src="images/python-logo-master-v3-TM.png" width="500"/>
</div>


This class will teach you to understand and use the [Python](https://www.python.org) programming language, along with a set of libraries commonly used in data science. I do not expect that you’ve ever written a line of code in any programming language before, so in learning to use Python, you will also be learning to program. Using Python for data science and programming are not exactly the same thing: programming describes a broader range of skills than using a programming language to do data science. As well as learning the "words" (commands) and "grammar" (how to define and combine commands) of a programming language, programming encompasses particular ways of thinking. One important programming skill is **operationalization**: analyzing and breaking down problems, and identifying the sequence of steps to solve them. Another is paying close attention to the details of how you write and format your code (all of a sudden, not indenting a line is not just a violation of that annoying APA Style guide, but causes your code to function in a totally different way, or not at all!).
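To make the point about indentation concrete, here is a minimal sketch. The step counts and the goal value are made up purely for illustration, and the variable names are hypothetical. The two loops contain almost the same lines; the only meaningful difference is how far the counting line is indented, which changes what that line belongs to.

```python
# Hypothetical daily step counts, invented for illustration only
step_counts = [4200, 11250, 9800]
goal = 10000

# Version A: the counting line is indented under `if`,
# so it runs only on days when the goal was met.
days_goal_met = 0
for steps in step_counts:
    if steps >= goal:
        days_goal_met += 1

# Version B: the counting line is indented one level less,
# so it belongs to the `for` loop and runs on every day.
days_counted = 0
for steps in step_counts:
    if steps >= goal:
        print("Goal met with", steps, "steps")
    days_counted += 1

print(days_goal_met)  # counts only the days at or above the goal
print(days_counted)   # counts every day in the list
```

With these example values, Version A ends at 1 and Version B ends at 3. Neither version is "wrong" as far as Python is concerned; each simply does what its indentation says.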

Python was originally written by [Guido van Rossum](https://en.wikipedia.org/wiki/Guido_van_Rossum) and first released in 1991. Its name has nothing to do with snakes, but rather was derived from the famous comedy sketch troupe Monty Python. Python developed as an [**open-source**](https://en.wikipedia.org/wiki/Open-source_software) project. This means several things. Firstly, it is made available for free, with anyone being granted permission to use, examine, modify, and share the [**source code**](https://en.wikipedia.org/wiki/Source_code) (the code that runs when you run a Python command). Secondly, many people have contributed to the development of the language, typically without receiving any payment (though some developers may have contributed to Python in the context of working for a company that relied on the language, or simply embraced values of supporting the open-source community). Van Rossum was the lead developer for the project until 2018; since then, development of the language has been guided by a five-person steering council, on which Van Rossum served during its first term in 2019. Like virtually every active programming language, Python is under continual development to fix bugs, improve its efficiency, and extend its abilities. Python has gone through three major versions, each with many minor releases. Development is guided by officially reviewed and approved [Python Enhancement Proposals](https://www.python.org/dev/peps/) (PEPs). Some PEPs also serve as official guidelines. For example, PEP 20 is [The Zen of Python](https://www.python.org/dev/peps/pep-0020/), which espouses core values of the language, while PEP 8 is [The Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/), which we will return to later and often, as it provides conventions (e.g., using four spaces for each level of indentation) that make code consistent and easy to read. Note that PEP 8 covers style only; the rule that indentation determines how lines of code are grouped, as mentioned above, is part of the Python language itself.

In a very general and nontechnical sense, programming languages can be characterized as falling on a continuum from “higher level” to “lower level” (or, perhaps more simply, from easier to use and learn to harder to use and learn). Python falls closer to the “high level” end of this spectrum, relative to languages like C or Java. This often means it takes less code to perform a particular function, because more things are baked in "for free" in Python that one might have to explicitly write code for in C. As a result, Python is often simpler and more elegant to read and write. Indeed, PEP 20 enshrines certain core values of the language, such as:

*Beautiful is better than ugly.*

*Explicit is better than implicit.*

*Simple is better than complex.*

*Readability counts.*

These values contribute to making Python relatively easy to learn and use, compared to other programming languages. At the same time, well-written Python programs run efficiently enough for most scientific work, particularly when they rely on libraries whose computationally intensive routines are implemented in faster, lower-level languages, so the "overhead" relative to using a lower-level language is often small in practice. Python has been widely adopted by communities in many areas of science, and in data science, because of this combination of ease and efficiency (and the fact that it's free). Many add-on packages (**libraries**) have been written to extend Python's functionality in various ways, including a large number of libraries specifically for scientific applications.
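As a small illustration of what "less code" and "baked in" can look like, the sketch below computes an average three ways. The heart rate values are invented for illustration, and the variable names are hypothetical. The first approach spells out every step, roughly as one would have to in a lower-level language; the second uses Python's built-in `sum` and `len` functions; the third imports `statistics`, a module that ships with Python. Add-on libraries are brought in with the same `import` statement once they are installed.

```python
import statistics

# Hypothetical resting heart rates in beats per minute (illustration only)
heart_rates = [62, 71, 58, 66, 73]

# Approach 1: spell out every step by hand
total = 0
count = 0
for rate in heart_rates:
    total += rate
    count += 1
mean_by_hand = total / count

# Approach 2: use functions that are built into Python
mean_builtin = sum(heart_rates) / len(heart_rates)

# Approach 3: use a function from an imported module
mean_library = statistics.mean(heart_rates)

print(mean_by_hand, mean_builtin, mean_library)
```

All three approaches give the same average for these values (66 beats per minute); the difference is in how much code you have to write and read to get there.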


---
This section was adapted from Aaron J. Newman's [Data Science for Psychology and Neuroscience - in Python](https://neuraldatascience.io/intro.html).