This is the first tutorial in a series on using the Python programming language for science. By the end of the series you will have a simple-to-use, reproducible workflow that starts with raw data and ends with the final version of your submitted manuscript. Along the way we'll learn how to set up and organize projects on your computer, use version control to track changes to them, use Python to wrangle and clean raw data, conduct exploratory analysis using Jupyter Notebooks and the Python package Pandas, and finally create interactive, publication-quality graphs using another Python package named Plotly.

Goals of this tutorial:

1) Introduce Python and install it on your computer

2) Introduce Git and install it on your computer

3) Prepare your computer to do science by setting up a Python environment and project directory and linking it to Git

<div class="alert alert-block alert-info">
<h1><b>Section 1:</b> Getting Started with Python</h1>
</div>


<div class="alert alert-block alert-info">
<h2><b>Section 1.1:</b> Introduction</h2>
</div>


<b>What is Python? </b>
<br>
Python is a programming language. It's open-source, which means it's completely free and its source code can be viewed by anyone. Open-source projects are awesome because everyone around the world can contribute to their development. This means there are thousands of free Python applications already available for you to use.

<b> How is Python different from Matlab?</b>
<br>
If you're making the switch from Matlab to Python (you're making the right decision!) you can read this article to get a better idea of how the two compare, including the advantages and drawbacks of each: https://pyzo.org/python_vs_matlab.html. 

Here's another comparison (albeit biased towards Python): https://realpython.com/matlab-vs-python/

<b> The Bottom Line </b>
<br>
For now, one important distinction between the two should be made. Matlab is a programming language, a collection of packages written in the Matlab language, and an integrated development environment (IDE) all wrapped into one (The IDE is the thing you write code in when using Matlab). Python is ONLY the language. In order to do science with Python, you need to choose and install your own packages and an IDE. This sounds complicated and is one of the reasons people are afraid to make the transition from Matlab to Python.

Thankfully, the brave scientists that have made this transition before us have graciously developed ways for newcomers to easily make the transition. They have put together a list of packages we need to use Python for science, some recommended IDEs, and the Python language itself into one installable package. All together, this is known as a Python distribution. For this tutorial, we will be using the Anaconda Python Distribution and its package manager Conda.


Why Python? https://python-graph-gallery.com/
Jupyter Notebooks:
Plotly and Plotly Express https://medium.com/plotly/introducing-plotly-express-808df010143d

http://seismo-live.org/

<b>What is a package?</b>
<br>
A package is a piece of code in a given language (e.g., Matlab or Python) that someone else has written and that can be run on your computer. So basically it's a piece of software. An example of a Matlab package is ODESOLVER. An example of a Python package is NumPy.
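
To make the idea concrete, here is a minimal sketch of using a package. It relies on `statistics`, a module that ships with Python itself, so nothing extra needs to be installed. The temperature values are made up for illustration.

```python
# Import code that someone else wrote (here, Python's built-in statistics module).
import statistics

# Hypothetical measurements, made up for this example.
temperatures_c = [20.5, 21.0, 19.5, 22.0]

# We never wrote a function for the mean; the package provides it.
mean_temperature = statistics.mean(temperatures_c)
print(mean_temperature)
```

Third-party packages such as NumPy are used the same way (`import numpy`) once they are installed.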


<b> What is Anaconda? </b>
<br>
Anaconda is an open-source Python distribution for scientists. It comes with a bunch of pre-installed Python packages that are useful to scientists (e.g., Numpy, matplotlib, etc.). This is great because it means you don't have to install them all yourself. The whole purpose of Anaconda is to simplify package management and deployment.

<b> What is Conda? </b>
<br>
Remember two seconds ago when I said the whole purpose of Anaconda is to simplify package management and deployment? This is achieved using Conda. Conda is part of the Anaconda distribution. It's a package manager in that it takes care of installing, updating, and removing packages.

<div class="alert alert-block alert-info">
<h2><b>Section 1.2:</b> Installing Anaconda</h2>
</div>


<b> Installation </b>
1. Go to Software Center and search for Anaconda. 
2. Click install.
<br>

That's it!

<div class="alert alert-block alert-info">
<h2><b>Section 1.3:</b> Familiarizing Yourself with Anaconda</h2>
</div>


<div class="alert alert-block alert-info">
<h3><b>Section 1.3.1: </b>Environments</h3>
</div>


As previously mentioned, open-source software is free for everyone. This means everyone can see the source code of Python and work together to create their own improvements and packages and is one reason why Python has grown so rapidly. Due to its rapid growth, new versions of various Python packages get created quite frequently. This can sometimes complicate things. Read below to see why.


<img src='https://lh3.googleusercontent.com/vYpEI2rOFVYEeWTNWmaxUciCii_CYSGp5wpDxc-GYhV-7yoyrdDki2Vzr0bVVHgnK8hNscSm06b7f7_2I1njyGcu2nKPL094ujDPFv6gRRqM9vVht_5wpGHGajEQm2Whz1IWYKRq'>
<figcaption> Image Source: https://www.zibtek.com/blog/the-incredible-growth-of-python/</figcaption>



<b> The Problem: Unresolved Dependencies </b>

The list of programs and versions that your project needs in order to work is called its dependencies.
Now imagine you have two projects, ProjectA and ProjectB. ProjectA requires Plotly version 3.0 while ProjectB requires Plotly version 4.0. Having two (or more) versions of the same package can quickly screw things up as your computer might not know which version of Plotly to run for a given project.

<b> The Solution: Environments </b>
<br>

Environments avoid this problem by creating a nice little space, separate from everything else, in which a unique set of packages can be installed. This means you can have an environment for ProjectA and a separate one for ProjectB. Your ProjectA environment can have Plotly 3.0 while your ProjectB environment can have Plotly 4.0, and neither of them will mess up the other one.
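
If you're ever unsure which environment your code is running in, Python can tell you. The sketch below is one way to check, using only the standard library. It assumes Conda sets the `CONDA_DEFAULT_ENV` variable when an environment is activated, and the function name `describe_environment` is made up for this example.

```python
import os
import sys


def describe_environment():
    """Return basic facts about the Python environment running this code."""
    return {
        # Folder that holds this environment's Python and its packages.
        "prefix": sys.prefix,
        # Full path of the Python interpreter currently running.
        "executable": sys.executable,
        # Conda sets this variable when an environment is activated;
        # it is None if Conda has not activated anything.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

Run it from two different environments and the `prefix` line should point to two different folders, which is exactly the separation described above.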

Environments are also helpful for sharing your code with collaborators. Imagine you make a program that cleans up some raw data, organizes it, and plots it. Then you decide you want to share it with a colleague. You send the code to them but when they try to run it, it doesn't work because they don't have all the correct versions of packages installed (unresolved dependencies!). This headache can be avoided using environments. With one command you can create a file which will list all the packages and their versions present in your environment. Then when your colleague receives your code they can simply enter a command which will look at the text file and automatically create a new environment with all the required dependencies! 
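
To see what an "unresolved dependency" looks like in practice, here is a rough sketch that compares the versions a project expects with what is actually installed. It is not how Conda does it internally. The function name `check_dependencies` and the requirement dictionary are hypothetical, and the version string is simply borrowed from the ProjectA example.

```python
from importlib import metadata


def check_dependencies(required):
    """Compare required versions with what is installed in this environment.

    `required` maps package names to the exact version a project expects.
    Returns a dict of problems; an empty dict means everything matches.
    """
    problems = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (need {wanted})"
            continue
        if installed != wanted:
            problems[name] = f"have {installed}, need {wanted}"
    return problems


# Hypothetical requirement, borrowed from the ProjectA example.
project_a = {"plotly": "3.0"}
print(check_dependencies(project_a))
```

An empty dictionary means the environment matches the project. Anything else is a dependency your colleague would need to fix before running your code.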

We'll see more on this later, but for now you should hopefully understand what environments are and why they're important.

<b>Note on usage:</b> Data scientists like to have an individual environment for each project. Since we physical scientists write less code, it's usually sufficient to have one environment for all your projects. For example, I just have a single environment named 'science.' However, if you know you'll be writing lots of code with different dependencies, it might be better to have multiple environments. The choice is yours!


<div class="alert alert-block alert-info">
<h3><b>Section 1.3.2:</b> Anaconda Navigator</h3>
</div>

So environments have packages, and packages are installed using the Conda package manager. How do we use the Conda package manager? Well there are a few ways, but for now we'll use the Anaconda Navigator. The Anaconda Navigator is a graphical user interface (GUI) that allows you to communicate with Conda to control all of your different Anaconda environments and their associated packages. Let's take a quick tour.

<b>Home</b> Displays various applications. I don't know what most of these are. The ones that are relevant to us are Notebook and Spyder. 

<b>Jupyter Notebook</b> is an amazing tool. From their webpage:

>"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more."

Sounds awesome, right? It is. We will be learning how to use it soon. Once you get the hang of it, you'll be using it for all your data analysis.

<b>Spyder</b> is another IDE. If you're familiar with Matlab, Spyder will feel very similar. At the beginning you'll probably be more comfortable with Spyder, but as you get more comfortable with coding in Python, you may find Jupyter Notebook is better. Alternatively, you can use both: develop and test your code in Spyder and then run it in Jupyter. Whatever works best for you!
    
    

<b>Environments</b>
This gives you a list of all your environments and the packages that are installed within them. Remember, we installed Anaconda because it makes package management SIMPLE. At the beginning, we are given a default base environment with a bunch of default packages already installed. You could in theory just use this environment, but it's better to create a new environment and work from there.
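
The package list that Navigator shows can also be produced from Python itself. Here's a small sketch using the standard library's `importlib.metadata`. The helper name `installed_packages` is made up for this example, and it only sees the environment that is running the code.

```python
from importlib import metadata


def installed_packages():
    """Return a sorted list of (name, version) pairs for this environment."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip any broken installs that lack a name
            packages.add((name, dist.version))
    return sorted(packages, key=lambda pair: pair[0].lower())


# Show the first ten packages, in the same name==version style used by pip.
for name, version in installed_packages()[:10]:
    print(f"{name}=={version}")
```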

That's really all you need to know about Anaconda Navigator!

<div class="alert alert-block alert-info">
<h2><b>Section 1 Summary</b></h2>
</div>


This first section might feel overwhelming as you've just learned a bunch of new terms. You may have even googled some of these terms and found out that there's a bunch of other terms related to them. You might be freaking out and considering running back to your Matlab Safe Space. Don't worry, you'll get familiar with all of this soon enough. For now just focus on what we've learned so far. The rest will come eventually.
    
<b>To recap:</b>
    
<li> Python is an open-source programming language.
<li> Packages are pieces of code written in Python that allow you to do things (like science).
<li> Anaconda is a free distribution of Python that comes with pre-installed packages that make doing science with Python easy.
<li> An environment is an isolated workspace that has various versions of packages (dependencies) loaded.
<li> Conda is a package manager that we can use to control which packages are installed in our various environments.
<li> We can use Anaconda Navigator to communicate with Conda to organize our environments and their installed packages.
<br>
<br>
  If you'd like to read more in depth on all of this, click the following link: https://docs.anaconda.com/anaconda/user-guide/getting-started/

<div class="alert alert-block alert-info">
<h1><b>Section 2:</b> Setting up your Science Workspace</h1>
</div>


Science is like cooking. I mean, I guess it is. I don't really cook. But if I did, I'd want a clean kitchen with all the necessary tools and ingredients in easy to find spaces so I wouldn't burn my casserole while I was busy looking for a spatula. 

Now that we have all our Python tools and ingredients, it's time to organize them on our computer so we can make some bomb-ass science-food.

https://towardsdatascience.com/get-your-computer-ready-for-machine-learning-how-what-and-why-you-should-use-anaconda-miniconda-d213444f36d6

<div class="alert alert-block alert-info">
<h2><b>Section 2.1:</b> Version Control with Git </h2>
</div>


You've probably heard of Git or Github. Physical scientists like us don't use them nearly as much as software engineers, but I think we should.
<br>
If this has ever happened to you:
<img src="http://www.phdcomics.com/comics/archive/phd101212s.gif">
<br>
Then you will appreciate Git. After you use Github you'll never go back.

Git, like any other version control system, allows you to track changes to your files in a simple, organized manner. In this tutorial, we'll be using Github, an online service that hosts Git repositories. Since you can privately save your projects to the Github servers, you can also use it as a means of easily sharing your projects with colleagues or between your own computers.

<div class="alert alert-block alert-info">
<h3><b>Section 2.1.1:</b> Getting Started with Github </h3>
</div>


<li>First, create a Github account by going to www.github.com and clicking the green 'Sign Up for Github' button.
<li>After you have signed in to Github, download the Github GUI at https://desktop.github.com/. 
<li>Open the Github GUI on your desktop and log in to your Github account.

<b>Note: </b>Git is normally used from the command line but here we'll be using Github's desktop GUI. If you'd like to read about using Git from the command line, read this: https://dont-be-afraid-to-commit.readthedocs.io/en/latest/git/commandlinegit.html

Now let's go through the basics of how to use Git.

<div class="alert alert-block alert-info">
<h3><b>Section 2.1.2:</b> Understanding the Git Workflow  </h3>
</div>

<img src='https://nceas.github.io/sasap-training/materials/reproducible_research_in_r_fairbanks/images/git-flowchart.png'>
<figcaption> There's quite a bit more, but if you can understand just this part then you're going to be able to get a lot of use out of Git. </figcaption>

Your workspace is a project folder on your computer. If it's 'git enabled' then whenever you make changes, these changes will be listed in the staging index. You can then <b>commit</b> those changes to your local repository. Finally, you can <b>push</b> changes from your local repository to the remote repository on the Github server. If you're on a different computer, you can then <b>fetch</b> and pull those changes to have the newest version on that computer.
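
If the diagram feels abstract, it can help to think of the four areas as four containers that changes move between. The class below is a toy model written for this tutorial, not real Git. The name `ToyRepo` and everything inside it are made up to mirror the workspace, staging index, local repository, and remote repository.

```python
class ToyRepo:
    """A toy model of the four areas in the diagram (not real Git)."""

    def __init__(self):
        self.workspace = {}   # files as they currently are on disk
        self.staged = {}      # changes selected for the next commit
        self.local = []       # commit history on this computer
        self.remote = []      # commit history on the server

    def edit(self, filename, text):
        """Change a file in the workspace."""
        self.workspace[filename] = text

    def stage(self, filename):
        """Copy the current version of a file into the staging index."""
        self.staged[filename] = self.workspace[filename]

    def commit(self, message):
        """Save the staged changes, with a message, in the local history."""
        self.local.append((message, dict(self.staged)))
        self.staged.clear()

    def push(self):
        """Copy the local history to the remote."""
        self.remote = list(self.local)


repo = ToyRepo()
repo.edit("notes.txt", "first draft")
repo.stage("notes.txt")
repo.commit("Add first draft of notes")
repo.push()
print(repo.remote)
```

Notice that nothing reaches the remote until `push` is called, and nothing is recorded in the history until `commit` is called. Real Git stores far more than this, but the order of the steps is the same.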

<b>Let's make a basic project quickly and go over this workflow once.</b>
<li>First, make this project a new repo.
<br>
<li>Make a new text file.
<br>
<li>Add some text.
<br>
<li>Now look at the Github GUI.
<br>
Those changes are shown in the Staging Index.
<br>
You now commit those changes. This basically means "I've made all the changes I wanted to make and am ready to save a record of these changes." Since Git keeps a record of all your previous commits, if your code suddenly stops working, you can go back in time and load a previous commit that was working.
<br>
Make your commit messages useful! 
<br>
Then you push those changes to the repository.
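
For the curious, the same stage, commit, and push steps exist as Git commands on the command line. The sketch below only builds and prints those commands so you can see them without running anything. The function name `git_save_commands` and the commit message are made up, and `git push` assumes the repository is already linked to a remote.

```python
import shlex


def git_save_commands(message):
    """Return the Git commands that stage, commit, and push all changes."""
    return [
        ["git", "add", "--all"],             # stage every change
        ["git", "commit", "-m", message],    # record them in the local repository
        ["git", "push"],                     # send the commits to the remote
    ]


for command in git_save_commands("Add some text to the new file"):
    print(shlex.join(command))
```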

<img src='https://imgs.xkcd.com/comics/git_commit.png'>

## Now let's download this tutorial's repository to learn how to clone other people's projects.

<b>Further Reading:</b> If you want to use Github to work on a project with a collaborator, you'll need to learn about branches. Github has an awesome set of guides that you can read here: https://guides.github.com/ 

<div class="alert alert-block alert-info">
<h2><b>Section 2.2:</b> Creating Your First Conda Environment</h2>
</div>


Now it's time to set up our environment using Conda. Remember, an environment is a directory that contains a specific collection of packages. Also recall that we can control Conda a few ways: the terminal/command line, Anaconda Prompt, or Anaconda Navigator. Since the process of controlling Conda from the command line is slightly different depending on whether you're using Windows or a Mac, we'll use the Anaconda Navigator as it's the same for both operating systems.

<b>Note:</b> If you'd like to learn how to use Conda from the command line (and you really should!) read here: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html#before-you-start

<li>Go to Anaconda Navigator and click Environments.

<li>Create a new Environment by clicking "Create" at the bottom.
<br>

This is the environment we will use from now on. Make sure it's active when doing science.
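
If you later move to the command line, creating an environment is a single `conda create` command. The sketch below just assembles and prints that command rather than running it. The helper name `conda_create_command` is made up, the environment name 'science' is borrowed from the earlier note on usage, and the Python version is an assumption you should change to suit your needs.

```python
import shlex


def conda_create_command(env_name, python_version):
    """Build the `conda create` command for a new environment."""
    return ["conda", "create", "--name", env_name, f"python={python_version}"]


# 'science' matches the single-environment example mentioned earlier;
# the Python version is an assumption, pick whichever you need.
command = conda_create_command("science", "3.11")
print(shlex.join(command))
```

After creating the environment this way, `conda activate science` makes it the active one.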

<div class="alert alert-block alert-info">
<h2><b>Section 2.3:</b> Organizing your Projects with Cookiecutter Data Science</h2>
</div>


Cookiecutter Data Science: "A Logical, reasonably standardized, but flexible project structure for doing and sharing data science work."

If you've ever made a figure or plot and then lost it somewhere on your computer, then you need to organize your project!

Cookiecutter is a script for creating project directories. The Cookiecutter Data Science is a template we can start with.

<b>Homepages:</b> 
<br>
https://cookiecutter.readthedocs.io/en/latest/readme.html - Cookiecutter
<br>
https://drivendata.github.io/cookiecutter-data-science/ - Template

<b>Why should you use Cookiecutter?</b>

1) Helps present-day-you with data analysis.
<br>
2) Helps future-you with future data analysis.
<br>
3) Helps collaborators. 
<br>
4) Organization is a great skill to have for future jobs.

<div class="alert alert-block alert-info">
<h3><b>Section 2.3.1:</b> Installing Cookiecutter</h3>
</div>


Cookiecutter is a Python package. 
<br>

Hopefully that sentence meant something to you.
<br>

Remember, we install Python packages using the Conda package manager. This means we need to go back to the Anaconda Navigator to install the Cookiecutter package. Most Python packages can be found on Conda's default channel, but sometimes they aren't there. When this is the case, we need to add a new channel.
<br>
<br>
<li>Go to Channels. 
<li>Click 'Add...', enter "conda-forge", then click Update Channels.
<li>Click Update Index. Now all the packages in the conda-forge repository are available to download.
<li>Search for Cookiecutter, then click Apply.

<b> Note: </b> When you learn how to use the command prompt, you can simply type the following in the terminal:

conda install -c conda-forge cookiecutter


<b>Note:</b> This probably won't work on your desktop computer which kind of sucks. I don't know a way around this.
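
Either way, it's worth checking that the install actually worked in the environment you're using. Here's a quick sketch that asks Python whether a package can be imported. The helper name `is_installed` is made up for this example.

```python
from importlib import util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return util.find_spec(module_name) is not None


print("cookiecutter installed:", is_installed("cookiecutter"))
```

If this prints `False`, double-check that the environment where you installed Cookiecutter is the one that's currently active.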

<b>Now lets create a new project using Cookiecutter.</b>

To-do: Add instructions for customizing and creating your own.

<li> Create a projects directory.
<li>We will now create a project with a directory tree inside this project folder using cookiecutter. This can be done directly using templates on the github repository. You can also make your own if you'd like. Here we will use https://github.com/drivendata/cookiecutter-data-science
<li> In terminal type: cookiecutter https://github.com/drivendata/cookiecutter-data-science
<li> Follow the steps.
    
    
https://cookiecutter.readthedocs.io/en/latest/usage.html
<br>
Voila!

Let's go over some important folders. This convention comes from a paper: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424



data/raw --> Raw instrument data. These files are whatever your instrument directly outputs. Once you put the data in this folder, you should never have to touch it again!
Too often, raw data gets altered. Then you come back later and don't remember what you've done to it.

data/interim --> Data that has been transformed. Usually into a CSV of some sort which can then be processed later. This is the data that you will import and do actions on. Such as sorting, correlations, etc.

data/processed --> This should be as close as possible to the version of the data that you're going to plot. Your script will import this csv, minimal operations will be done on it before plotting it.

We will go through an example soon. You can change this to meet your needs. 
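
As a taste of how the raw and interim folders work together, here is a minimal sketch of a script that reads a raw file and writes a CSV without ever modifying the original. It assumes the instrument writes whitespace-separated columns, which may not match your instrument. The function name `raw_to_interim` and the file names are hypothetical.

```python
import csv
from pathlib import Path


def raw_to_interim(raw_file, interim_file):
    """Copy whitespace-separated raw readings into a CSV, leaving raw untouched."""
    rows = []
    # The raw file is only ever opened for reading.
    with open(raw_file, newline="") as source:
        for line in source:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append(fields)

    # Create the output folder if it doesn't exist yet, then write the CSV.
    Path(interim_file).parent.mkdir(parents=True, exist_ok=True)
    with open(interim_file, "w", newline="") as target:
        csv.writer(target).writerows(rows)
    return len(rows)
```

A call might look like `raw_to_interim("data/raw/run1.txt", "data/interim/run1.csv")`. Because the raw file is never written to, you can always rerun the script and get the same interim data back.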

references --> manuals, SOPs, publications relevant to your project.

reports --> This is stuff that will be used for PowerPoint presentations
reports/figures --> Place figures here that you will then use in the reports
reports/powerpoints --> Powerpoints are saved here.


src --> Any code relevant to your project. not just python code, but other languages as well.
src/data --> Python scripts for preparing your data for analysis. For example, a script that reads in your instrument raw data, then turns it into a CSV and places that CSV in your data/interim folder for future analysis.
src/data/analysis --> any code you want for analysis. For example, a correlation script or something.
src/visualization --> Code for visualizing data


notebooks --> Jupyter notebooks. These are very important! This is a record of your data analysis. You can also use notebooks to test code, explore data, create figures, create reports.

At the beginning you'll probably be saving code snippets in src/ folder and then copying those into your notebooks during analysis. As you get better at coding, you'll eventually stop needing to reference the code in src and simply be able to write it yourself directly into the notebook. It's still good to have the src folder up to date for future use. 

requirements.txt --> The environment file that tells your computer what dependencies are needed for this project to work.
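
If Cookiecutter won't install on your machine, you can still build a similar folder layout by hand. The sketch below creates only the folders discussed in this section. It is not what Cookiecutter does internally, the real template creates more than this, and the names `FOLDERS` and `make_skeleton` are made up for this example.

```python
from pathlib import Path

# Folders discussed in this section; the real template creates more than these.
FOLDERS = [
    "data/raw",
    "data/interim",
    "data/processed",
    "references",
    "reports/figures",
    "src/data",
    "src/visualization",
    "notebooks",
]


def make_skeleton(project_root):
    """Create the project folders under `project_root` and return their paths."""
    root = Path(project_root)
    created = []
    for folder in FOLDERS:
        path = root / folder
        # exist_ok=True means rerunning this never fails or deletes anything.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_skeleton("my_project")` (a hypothetical project name) creates the folders inside `my_project`, and it's safe to run again later because existing folders are left alone.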

Let's create our first jupyter notebook and save it to the project folder. That's the end of this tutorial. The next tutorial will show you how to use Jupyter and Pandas for data analysis.

Finally, let's create a yml file. conda env export > environment.yml

See the official conda docs for more on this. But just know it's good to have.
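
If you'd rather do the export from inside a script or notebook, here is one possible sketch. It runs `conda env export` through Python's `subprocess` module and saves the output. The function name `export_environment` is made up, and it assumes Conda can be found on your PATH; it simply returns `False` if not.

```python
import shutil
import subprocess


def export_environment(output_file="environment.yml"):
    """Write `conda env export` output to a file; return True on success."""
    conda = shutil.which("conda")
    if conda is None:
        return False  # Conda is not on the PATH
    result = subprocess.run(
        [conda, "env", "export"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return False
    with open(output_file, "w") as handle:
        handle.write(result.stdout)
    return True
```

A colleague who receives the resulting file can rebuild the environment with `conda env create -f environment.yml`.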

Jupyter Notebook: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks

You can explore all of these notebooks. You can easily download them to your computer using Github.

In Tutorial 2, we'll start to learn how to do these kinds of things using Pandas.