# Code Prep

Welcome! This notebook walks you through tools to install and skills to practice before the summer.

---

# Terminal

[[recommended reading](http://blog.teamtreehouse.com/introduction-to-the-mac-os-x-command-line)]

The terminal lets you interact with your system by typing commands. You can do many of the things you can do with the GUI (manage directories, create files, etc.), but some tasks are better suited to the terminal. The recommended reading linked above goes into more detail; the most important commands are summarized here.

- `pwd`: print working directory
- `ls`: list contents of directory
- `cd <DIRECTORY>`: change directory
- `rm <FILE>`: remove file
- `rm -rf <DIRECTORY>`: remove a directory and everything inside it, without asking for confirmation
    - Be EXTREMELY CAREFUL with this command. Deleted files do not go to the Trash; they are gone forever.
- `cp <SRC> <DEST>`: copy source to destination
- `mv <SRC> <DEST>`: move source to destination

Try changing into different directories on your machine and listing their contents. `ls -a` lists all files, including files whose names start with a period, which are usually hidden in the Finder. Note: `~` is a shortcut for your home directory.
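
If you want to practice these operations without risking real files, the same ideas can be tried from Python inside a throwaway directory. This is only a sketch: the file names (`notes.txt`, `.hidden`, and so on) are made up for illustration, and the terminal commands above remain the thing to learn.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing important can be deleted.
sandbox = Path(tempfile.mkdtemp())

# Create a visible file and a hidden file (its name starts with a period).
(sandbox / "notes.txt").write_text("hello\n")
(sandbox / ".hidden").write_text("secret\n")

# Like `ls -a`: iterdir() returns hidden files too.
all_names = sorted(p.name for p in sandbox.iterdir())

# Like plain `ls`: skip names that start with a period.
visible_names = [name for name in all_names if not name.startswith(".")]

# Like `cp <SRC> <DEST>` followed by `mv <SRC> <DEST>`.
shutil.copy(str(sandbox / "notes.txt"), str(sandbox / "copy.txt"))
shutil.move(str(sandbox / "copy.txt"), str(sandbox / "moved.txt"))

# Like `rm <FILE>`.
(sandbox / "notes.txt").unlink()
remaining = sorted(p.name for p in sandbox.iterdir())

# Like `rm -rf <DIRECTORY>`: removes the directory and everything in it.
shutil.rmtree(str(sandbox))

print(all_names)
print(visible_names)
print(remaining)
print(Path.home())  # the directory that `~` refers to
```

Because everything happens inside the temporary directory, the final `rmtree` call only deletes files the example created itself.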

You can also run programs and scripts from the terminal. Say you wrote a Python script, `hello.py`. You can run that script in the terminal with `python hello.py`.
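
As an illustration, `hello.py` could look something like the sketch below. The function name `greeting` and the message are made up for this example; any valid Python file can be run the same way.

```python
# hello.py: a hypothetical minimal script.


def greeting(name="world"):
    """Return a short greeting for `name`."""
    return f"Hello, {name}!"


# This block runs only when the file is executed directly
# (for example with `python hello.py`), not when it is imported.
if __name__ == "__main__":
    print(greeting())
```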

Most of the terminal commands on macOS carry over to Linux, because both are Unix-like systems that largely follow the POSIX standard.

# [Anaconda](https://www.continuum.io/anaconda-overview)

If you are using a Mac or Linux computer, Python is most likely installed on your machine. However, we will not be using this Python. Instead, we will download the Anaconda Python distribution. Anaconda is meant for data processing and scientific computing, and it makes it easy to install packages.

Another benefit of using Anaconda instead of the system's default Python is that it gives us an added layer of safety. In the worst case, you can delete the Anaconda folder if things go wrong (e.g., installing a package breaks things), and your system is back to normal.

## Packages

Anaconda includes many Python and R packages. You can read through the list [here](https://docs.continuum.io/anaconda/pkg-docs) to get an idea of what is out there. The Python packages you should look into are [numpy](http://www.numpy.org/), [pandas](http://pandas.pydata.org/), [scipy](https://docs.scipy.org/doc/scipy/reference/), [statsmodels](http://www.statsmodels.org/stable/index.html), and [matplotlib](https://matplotlib.org/).

NumPy is the backbone of many Python packages for scientific computing. Pandas gives us R-like data structures (e.g., DataFrames), among other things. SciPy has many useful functions (e.g., for statistics). StatsModels has many functions for statistics and has R-like syntax. And Matplotlib is the major plotting library in Python. If you ever have a question about how to use a function in one of these libraries, take a look at its documentation. As an example, look at the [documentation](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.ttest_ind.html) for the independent t-test function in SciPy.
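
To give a feel for how a documented function is used in practice, here is a minimal sketch that calls `scipy.stats.ttest_ind` on two small samples. The numbers are made up purely for illustration and do not come from any dataset in this notebook.

```python
import numpy as np
from scipy import stats

# Two small made-up samples (not real data), purely for illustration.
group_a = np.array([5.1, 4.9, 5.4, 5.0, 5.2])
group_b = np.array([6.3, 6.5, 6.1, 6.4, 6.6])

# ttest_ind returns the t statistic and the two-sided p-value.
result = stats.ttest_ind(group_a, group_b)
t_stat, p_value = result

print(t_stat, p_value)
```

The t statistic is negative here because the mean of the first sample is lower than the mean of the second.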

These packages are all open-source. You can find their code on GitHub. (For example, [pandas GitHub repo](https://github.com/pandas-dev/pandas))


## Installation

While many of Anaconda's packages are useful, they can take up a lot of space. It also isn't necessary to have packages installed that you will probably never use. So we will install Miniconda, a lightweight version of Anaconda. It only includes Python and `conda` (a package manager).

The instructions to install Miniconda are at https://conda.io/miniconda.html. You should install the Python 3.6 version for your operating system. If you are on a Mac or Linux machine, the installation consists of downloading the bash installer and executing it in the terminal with the command

    bash <FILE>
    
where `<FILE>` is the name of the bash installer file you downloaded. Make sure you point to the file's location. If the file is in your Downloads folder, you would write `bash ~/Downloads/<FILE>`. Alternatively, you could `cd` into the Downloads folder and then run `bash <FILE>`.

The installer will ask whether you want to add Miniconda to the `~/.bashrc` file (`~/.bash_profile` on macOS). Say yes to this (the default is no). This makes Miniconda's Python the default Python in any new terminal window you open. After installation, you can delete the Miniconda installer file.
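
One way to check which Python you are actually running is to ask the interpreter itself. This is a small sketch, assuming you start `python` from a new terminal window after the installation; the exact path printed depends on where Miniconda was installed.

```python
import sys

# Full path of the Python interpreter that is currently running.
# After installing Miniconda, this would typically point inside the Miniconda folder.
print(sys.executable)

# Version of that interpreter as a (major, minor) pair.
print(sys.version_info[:2])
```

If the path still points at the system Python, the change to your shell startup file has probably not been picked up yet.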

## Install some packages

You can install packages with the command `conda install`. Let's install numpy, pandas, scipy, statsmodels, matplotlib, and jupyter. (What's Jupyter? See below)

    conda install numpy pandas scipy statsmodels matplotlib jupyter

If you ever need help with a command in the terminal, you can give the command a `-h` or `--help` flag to show a help message.

    conda -h
    conda install -h
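
After installing, a quick way to confirm that Python can find the new packages is sketched below. The helper name `is_installed` is made up for this example; `jupyter` is left out of the list because it is started from the terminal rather than imported in your code.

```python
import importlib.util

# Packages installed with `conda install` above.
packages = ["numpy", "pandas", "scipy", "statsmodels", "matplotlib"]


def is_installed(name):
    """Return True if Python can find a package called `name`."""
    return importlib.util.find_spec(name) is not None


for name in packages:
    status = "ok" if is_installed(name) else "MISSING"
    print(f"{name:12s} {status}")
```

Any package reported as `MISSING` can be installed with another `conda install` command.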

# [Jupyter](http://jupyter.org/)

[[try online](https://try.jupyter.org/)]

This document is a Jupyter notebook. You can write and execute code, include images and documentation, and much more. I often use it to view and analyze data. To start the Jupyter Notebook, run the command `jupyter notebook` in the terminal. After a moment, a new window or tab should open in your browser.

The notebook supports many languages, including Python and R. When you're in a notebook, click the plus button to add a new cell, and run a cell by pressing Shift+Enter or by clicking the play button.

# [Pandas](http://pandas.pydata.org/)

[[recommended reading](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)] <-- don't worry about understanding all of this

A very quick intro to pandas.

You can find documentation for each function online. For example, here is the [documentation] for the `DataFrame.mean()` method. In the Jupyter Notebook, you can press shift+tab when your cursor is on the function to bring up documentation.

In [None]:
import pandas as pd

# First, read in a CSV file, and set the first column to be the index.
# (pandas has many other read_* functions.)
# Notice that you can read a file from a URL. You can also read files on your
# computer.

url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv"
df = pd.read_csv(url, index_col=0)
# View the first 5 rows.
df.head()
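
The same `read_csv` call works on sources other than a URL. The sketch below uses a small made-up CSV held in a string, so that it runs without a network connection or a file on disk; the names `csv_text` and `small_df` and the values are invented for illustration.

```python
import io

import pandas as pd

# A tiny made-up CSV, standing in for a file on disk.
csv_text = """name,mpg,cyl
car_a,21.0,6
car_b,22.8,4
car_c,18.7,8
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
small_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(small_df.shape)
print(small_df.loc["car_b", "mpg"])
```

For a real file on your computer, you would pass its path as a string in place of the `StringIO` object.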

In [None]:
# Index parts of the dataframe with .loc[].
# The general syntax is df.loc[<row_indexer>, <col_indexer>].
# <row_indexer> and <col_indexer> can be lists.

# Get a column.
df.loc[:, 'mpg']

# Get a row.
df.loc['Mazda RX4', :]

# Get a cell.
df.loc['Mazda RX4', 'mpg']

# Get multiple columns.
df.loc[:, ['mpg', 'cyl', 'wt']]

# Create new column (full of zeros)
df.loc[:, 'new_col'] = 0

# Multiply the values in a column by 2.
df.loc[:, 'mpg'] *= 2
# Add 2 to the values.
df.loc[:, 'mpg'] += 2

# Note: `a += b` in this case is equivalent to `a = a + b`.
# These lines change `df` itself, so later cells see the modified 'mpg' values.

In [None]:
# Get descriptive statistics.

# Get the mean of each column.
df.mean(axis=0)
# ... of each row.
df.mean(axis=1)

df.min()
df.max()
df.median()
df.std(); # standard deviation
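
Several of these statistics can also be computed in one call with `describe()`. The sketch below uses a small made-up dataframe (`toy`) so that it runs on its own; the same call works on `df`.

```python
import pandas as pd

# Made-up numbers, only to keep the example self-contained.
toy = pd.DataFrame({"mpg": [21.0, 22.8, 18.7, 24.4],
                    "cyl": [6, 4, 8, 4]})

# describe() returns count, mean, std, min, quartiles, and max per column.
summary = toy.describe()
print(summary)

# Individual values can be pulled out with .loc, as in the cells above.
print(summary.loc["mean", "cyl"])
```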

In [None]:
# Correlate all columns of the dataframe with each other.
df.corr()

# Correlate one column with another column.
r = df.loc[:, 'mpg'].corr(df.loc[:, 'cyl'])
print(r)

In [None]:
# Boolean indexing! This is a very powerful feature of pandas.
# You can get parts of dataframes or series using conditional statements.

# Get all rows where mpg is greater than 20.
df.loc[df['mpg'] > 20, :]

# If you are curious about boolean indexing, look at the output of this line:
df['mpg'] > 20
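
Conditions can also be combined. The sketch below uses a small made-up dataframe (`toy`, with invented row labels) to show the `&`, `|`, and `~` operators; the same pattern applies to `df`.

```python
import pandas as pd

# Made-up values and row labels, only for illustration.
toy = pd.DataFrame(
    {"mpg": [21.0, 22.8, 18.7, 24.4], "cyl": [6, 4, 8, 4]},
    index=["car_a", "car_b", "car_c", "car_d"],
)

# Use & for "and", | for "or", and ~ for "not".
# Wrap each condition in parentheses.
efficient_fours = toy.loc[(toy["mpg"] > 20) & (toy["cyl"] == 4), :]
not_fours = toy.loc[~(toy["cyl"] == 4), :]

print(efficient_fours)
print(not_fours)
```

The parentheses matter: `&` and `|` bind more tightly than comparisons such as `>`, so leaving them out usually raises an error.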

In [None]:
# Pandas also has plotting functions (which use matplotlib in the backend)
# The following line tells matplotlib to plot images in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt

df['mpg'].plot.hist(title="Histogram of MPG")
plt.show()

df['hp'].plot.bar(title="Vehicle Horsepower")
plt.show()
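
Scatter plots follow the same pattern as the histogram and bar plot above. This is a minimal sketch on a made-up dataframe (`toy`); the column names `wt` and `mpg` mirror the mtcars columns, but the values are invented. It assumes matplotlib is installed, since pandas uses it for plotting.

```python
import pandas as pd

# Made-up values, only for illustration.
toy = pd.DataFrame({"wt": [2.6, 2.3, 3.4, 3.2],
                    "mpg": [21.0, 22.8, 18.7, 24.4]})

# plot.scatter needs the names of the x and y columns.
# It returns a matplotlib Axes object that can be customized further.
ax = toy.plot.scatter(x="wt", y="mpg", title="MPG vs. weight")
ax.set_xlabel("Weight")
ax.set_ylabel("Miles per gallon")

# In a notebook with `%matplotlib inline`, the figure appears below the cell.
# Outside a notebook, call plt.show() from matplotlib.pyplot to display it.
```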

# Coding challenge 1

Using the [Iris Dataset](https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv):

- Read in the data.
- Make a scatter plot of petal length in cm, and for bonus points, color the points by flower species.
- Save a dataframe of the rows with sepal length in cm between 5 and 6 to a new CSV file.
- Run an independent t-test comparing sepal length in cm between setosa and virginica.
- Run a t-test of petal length in cm of setosa above median and below median.
- Run an ANOVA of petal length in cm among all three flower species.
- Make some plots that you think are meaningful.

# R

[[recommended reading](http://www.r-tutor.com/r-introduction)]  
[[long intro from Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf)]

Full disclosure, I have very little experience with R. The popular IDE for R is [RStudio](https://www.rstudio.com/) (but you can also use R in Jupyter Notebook). You can install R from `conda`, or you can install R from [here](https://cloud.r-project.org/). You will probably want to use the latter method because it seems like `conda` support for R packages is still in its preliminary stages.

# Coding challenge 2

Do [Coding challenge 1](#Coding-challenge-1) in R.

Hint: the Iris dataset is available by default in R as the variable `iris`. The command `df <- iris` will set the Iris dataset to the variable `df`.

---

# Other things

The sections below are useful but are __not__ necessary for the summer. Focus on the Python packages I mentioned above and learning how to get around in R. Once you feel comfortable with those, explore the items below.

## Good coding practices

[[recommended reading (use arrow keys)](https://edeno.github.io/Better-Science-Code/#/)]  
[[Python style guide](https://www.python.org/dev/peps/pep-0008/)]

## git and GitHub

[[reading for git and GitHub](https://help.github.com/articles/git-and-github-learning-resources/)]  
[[try git online](https://try.github.io/)]

`git` and GitHub are not necessary for the summer, so prioritize learning about and using the Python packages I listed above. That said, both are very useful. `git` is a version control system, and [GitHub](https://github.com/) is a place to store and share your code. You can use `git` to push code to GitHub or pull code from it. You can take a look at my [GitHub profile](https://github.com/kaczmarj) and view any of my repositories.

You can make your own GitHub account and host your code there.