# Before You Start

### Python and Jupyter Notebooks
Python is a programming language. A Jupyter Notebook is an interactive, browser-based environment for writing and running Python code. This is a Jupyter Notebook! Text and graphics can be displayed alongside the code, which allows for easy visualization and code editing. Anaconda is a Python distribution and package manager for these related tools; it is used both to launch Jupyter Notebook on your computer and to download helpful packages and code libraries. We'll go into this more later.

Instructions for downloading and installing Anaconda are provided as a PDF under Recitation 7 in bcourses.

After installation, Jupyter Notebook can be launched from the command line:

- Command Prompt (Windows): http://www.howtogeek.com/235101/10-ways-to-open-the-command-prompt-in-windows-10/
- Terminal App (Mac): http://ss64.com/osx/ OR http://www.dummies.com/how-to/content/how-to-use-basic-unix-commands-to-work-in-terminal.html

After opening the Command Prompt or Terminal, use the cd command to navigate to your working folder, e.g. cd users\Andrew\CP257 (on a Mac, separate folder names with forward slashes instead of backslashes).

Then, enter the command: jupyter notebook

This should open the tool in your web browser. You can either open a notebook already saved at this location, or create a new notebook.

### What is covered in this tutorial
This tutorial gives a basic overview of how common data structures are created and referenced in Python, with a focus on how to reference and update specific data points in each structure.

# Data Structures

### Lists
Lists are ordered, one-dimensional sequences that can hold elements of any data type. To write one, separate the items with commas and enclose them in square brackets.

In [1]:
a = [1,2,3,'Hello']
a

We can call individual elements of a list using their index. IMPORTANT: Python indexes start with 0!

In [2]:
a[0]

Using negative indexes counts back from the last element.

In [3]:
a[-1]

Use a colon to select a range of elements (a slice). The slice includes the lower index but stops before the upper one, so a[0:2] returns the elements at indexes 0 and 1.

In [4]:
a[0:2]
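
As a small sketch of how slices behave, the example below uses a made-up list called `letters` (not part of the original notebook). It shows that the stop index is excluded and that either bound can be left out:

```python
# Hypothetical list used only for illustration.
letters = ['a', 'b', 'c', 'd', 'e']

first_two = letters[0:2]     # indexes 0 and 1; index 2 is excluded
from_two = letters[2:]       # index 2 through the end of the list
last_two = letters[-2:]      # the final two elements
every_other = letters[::2]   # every second element, starting at index 0

print(first_two, from_two, last_two, every_other)
```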

We can create an empty list, and add elements to it:

In [5]:
b = []
b

In [6]:
b.append('Goodbye')
b
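
Since this tutorial focuses on updating specific data points, here is a minimal sketch of a few other ways to change a list in place. The list `greetings` and its contents are invented for illustration:

```python
# Hypothetical list used only for illustration.
greetings = ['Hello', 'Goodbye']

greetings[1] = 'See you later'    # overwrite the element at index 1
greetings.append('Welcome')       # add one element to the end
greetings.extend(['Hi', 'Hey'])   # add several elements at once
removed = greetings.pop(0)        # remove and return the element at index 0

print(greetings)
print(removed)
```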

### Arrays
If you have used MATLAB, arrays will be familiar. Arrays allow us to vectorize numeric data for ease of calculation. If you wanted to divide all the numbers in a list by 2, you would have to iterate over each element. With an array, a single expression operates on every element at once, which is usually much faster.

#### Libraries
To use arrays in Python, we need the NumPy library, the first of several libraries covered in this tutorial. Libraries beyond Python's standard library provide a variety of functions and methods that you can use without having to write the code yourself.

In [7]:
import numpy as np

In [8]:
a = list(range(1,6))
a

[1, 2, 3, 4, 5]

In [9]:
b = np.array(a)
b

array([1, 2, 3, 4, 5])

In [10]:
b/10

array([0.1, 0.2, 0.3, 0.4, 0.5])

Try a/10 on the original list: it raises a TypeError, because division is not defined for lists.
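
The difference between lists and arrays can be sketched as follows. The names `a_scaled` and `b_scaled` are introduced here for illustration only; `a` and `b` are rebuilt the same way as in the cells above:

```python
import numpy as np

a = list(range(1, 6))   # [1, 2, 3, 4, 5]
b = np.array(a)

# A list has to be processed one element at a time.
a_scaled = [x / 10 for x in a]

# An array applies the operation to every element at once.
b_scaled = b / 10

# Dividing the list itself is not defined, so Python raises a TypeError.
try:
    a / 10
except TypeError as err:
    print('TypeError:', err)
```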

NumPy's zeros function creates an array of a given size filled with zeros:

In [11]:
c = np.zeros(9)
c

array([0., 0., 0., 0., 0., 0., 0., 0., 0.])

The reshape function rearranges the same elements into any number of dimensions, as long as the total number of elements stays the same:

In [12]:
d = c.reshape(3,3)
d

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [13]:
e = np.zeros(27).reshape(3,3,3)
e

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])
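
A brief sketch of the constraint on reshape: the new shape must hold exactly as many elements as the original. The array `flat` below is made up for illustration, and `-1` asks NumPy to work out one dimension for you:

```python
import numpy as np

flat = np.arange(12)          # 12 elements: 0 through 11

grid = flat.reshape(3, 4)     # 3 rows x 4 columns
auto = flat.reshape(2, -1)    # -1 lets NumPy infer the second dimension (6)

print(grid.shape, auto.shape)

# 12 elements cannot fill a 5 x 5 array, so NumPy raises a ValueError.
try:
    flat.reshape(5, 5)
except ValueError as err:
    print('ValueError:', err)
```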

We can also replace any value in an array using its index:

In [14]:
e[0,2,2] = 5
e

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 5.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])
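
Indexing can also update several values at once. The sketch below uses a hypothetical 3 x 3 array `m` to show assignment to a whole row, a whole column, and every element matching a condition:

```python
import numpy as np

m = np.zeros((3, 3))   # hypothetical 3 x 3 array of zeros

m[0, :] = 1            # set every value in the first row
m[:, 2] = 7            # set every value in the last column
m[m == 0] = -1         # set every remaining zero using a boolean mask

print(m)
```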

Calculations on arrays are performed element by element by default, rather than as matrix operations from linear algebra.

In [15]:
b

array([1, 2, 3, 4, 5])

In [16]:
f = b*2
f

array([ 2,  4,  6,  8, 10])

Linear algebra calculations, such as matrix multiplication and transposition, require specific functions.

In [17]:
g = np.array([[1,2,3],[3,4,5],[5,6,7]])
g

array([[1, 2, 3],
       [3, 4, 5],
       [5, 6, 7]])

In [18]:
g.dot(g)

array([[22, 28, 34],
       [40, 52, 64],
       [58, 76, 94]])

In [19]:
np.transpose(g)

array([[1, 3, 5],
       [2, 4, 6],
       [3, 5, 7]])
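
For comparison, NumPy also offers shorthand for these two operations. The sketch below rebuilds `g` and contrasts element-wise multiplication with the `@` operator and the `.T` attribute; the variable names on the left are chosen here for illustration:

```python
import numpy as np

g = np.array([[1, 2, 3], [3, 4, 5], [5, 6, 7]])

elementwise = g * g      # multiplies matching entries
matrix_product = g @ g   # matrix multiplication, same result as g.dot(g)
transposed = g.T         # shorthand for np.transpose(g)

print(elementwise)
print(matrix_product)
print(transposed)
```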

### Dataframes
Dataframes are two-dimensional tabular data structures, very similar to an Excel spreadsheet. They are similar to arrays in that they are indexed in more than one dimension, but they have some additional capabilities, since rows and columns can also be named with headers.

To explore the concept of dataframes, let's open a simple prepared one in CSV format. We'll need the pandas library to read data from a CSV file and organize it into a dataframe.

In [20]:
import numpy as np
import pandas as pd

In [21]:
df = pd.read_csv('rain.csv')
df

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3
5,jun,1.2
6,jul,0.8
7,aug,0.7
8,sep,
9,oct,3.9


The head function is a useful way to display the first few rows of the dataframe. If you don't put an argument in the parentheses, it defaults to five rows, or you can pass the number of rows to display as an argument.

In [22]:
df.head()

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3


There are a few ways to access specific values in dataframes. The loc indexer selects by label: the row's index label and the column's name. Here the row labels happen to be integers.

In [23]:
df.loc[6,'rainfall_inches']

0.8

The iloc indexer uses integer positions only, for both rows and columns.

In [24]:
df.iloc[6,1]

0.8
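
The distinction between labels and positions is easier to see when the row labels are not integers. The dataframe `demo` below is invented for illustration and is not the contents of rain.csv:

```python
import pandas as pd

# Hypothetical data, not the contents of rain.csv.
demo = pd.DataFrame(
    {'month': ['jan', 'feb', 'mar'], 'rainfall_inches': [5.3, 5.4, 4.8]},
    index=['a', 'b', 'c'],   # custom row labels to make the difference visible
)

by_label = demo.loc['b', 'rainfall_inches']   # row label 'b', column name
by_position = demo.iloc[1, 1]                 # second row, second column

print(by_label, by_position)
```

Both expressions point at the same cell; loc finds it by name and iloc by position.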

Comparing a column against a value returns a boolean Series: True for each row that meets the condition and False otherwise.

In [25]:
df['rainfall_inches'] > 5

0      True
1      True
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
Name: rainfall_inches, dtype: bool

You can pass this boolean Series back to the dataframe to keep only the rows where it is True.

In [26]:
df[df['rainfall_inches'] > 5]

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
11,dec,5.9
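
Filters can also be combined, and missing values need some care. The following is a minimal sketch on made-up data (the dataframe `demo` is an assumption, not rain.csv). Conditions are joined with `&`, each wrapped in parentheses, and a missing value compares as False:

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the contents of rain.csv.
demo = pd.DataFrame({
    'month': ['jan', 'feb', 'mar', 'apr'],
    'rainfall_inches': [5.3, np.nan, 4.8, 0.7],
})

mask = demo['rainfall_inches'] > 1   # boolean Series; NaN compares as False
wet = demo[mask]                     # rows where the mask is True

# Combine two conditions with &; each condition needs its own parentheses.
mid = demo[(demo['rainfall_inches'] > 1) & (demo['rainfall_inches'] < 5)]

# Select the rows where the value is missing.
missing = demo[demo['rainfall_inches'].isna()]

print(wet, mid, missing, sep='\n\n')
```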


You can use either loc or iloc to change individual values.

In [27]:
df.iloc[6,1] = 1
df.iloc[6,1]

1.0
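
The same idea extends to updating many rows at once by combining loc with a boolean filter. This is a sketch on invented data; `demo` and `below_five` are hypothetical names:

```python
import pandas as pd

# Hypothetical data, not the contents of rain.csv.
demo = pd.DataFrame({
    'month': ['jan', 'feb', 'mar'],
    'rainfall_inches': [5.3, 5.4, 4.8],
})

# Change one cell, identified by row label and column name.
demo.loc[1, 'rainfall_inches'] = 6.0

# Change every row that meets a condition.
below_five = demo['rainfall_inches'] < 5
demo.loc[below_five, 'rainfall_inches'] = 0.0

print(demo)
```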

#### Merging Dataframes
Often when manipulating data, you will want to join two dataframes based on a common identifier, like a FIPS code. The merge function allows us to do this.

As an example, let's set up two variables and pretend that the rain data is actually two different dataframes that we wish to merge.

In [28]:
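# Copy the dataframe so that df1 and df2 are independent objects, not two names for df
df1 = df.copy()
df2 = df.copy()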

In [29]:
df1.head()

Unnamed: 0,month_2014,rainfall_inches
0,jan,5.3
1,feb,5.4
2,mar,4.8
3,apr,4.7
4,may,3.3


In this case, we may want to merge based on the listed month, for example so we could compare January rainfall between 2014 and 2015.

There are a few arguments to consider here. The first two arguments are the dataframes you wish to merge: df1 is the "left" dataframe and df2 is the "right" one.

The left_on and right_on arguments indicate which column in each dataframe holds the identifier used for matching.

The how argument gives us a few options for the merge:

"right" keeps every row of df2 and matches data from df1 to it. Rows of df1 whose identifier does not appear in df2 are dropped, and identifiers found only in df2 get "NaN" in the df1 columns.

"left" is the opposite: it keeps every row of df1 and matches data from df2 to it. Rows of df2 whose identifier does not appear in df1 are dropped.

"outer" indicates that no data will be dropped. If either dataframe contains identifiers that are not present in the other, they will be preserved, but missing data will be listed as "NaN".

"inner" indicates that only identifiers present in both dataframes will be preserved. Any other data will be dropped.

In [30]:
df3 = pd.merge(df1,df2,left_on="month_2014",right_on="month_2014",how = "right")

In [31]:
df3.head()

Unnamed: 0,month_2014,rainfall_inches_x,rainfall_inches_y
0,jan,5.3,5.3
1,feb,5.4,5.4
2,mar,4.8,4.8
3,apr,4.7,4.7
4,may,3.3,3.3
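
Because df1 and df2 hold the same months here, every how option gives the same rows. To see the options differ, here is a minimal sketch with two small invented dataframes whose months only partly overlap. The names `left` and `right`, the column names, and all of the values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: 'jan' appears only on the left, 'apr' only on the right.
left = pd.DataFrame({'month': ['jan', 'feb', 'mar'], 'rain_2014': [5.3, 5.4, 4.8]})
right = pd.DataFrame({'month': ['feb', 'mar', 'apr'], 'rain_2015': [4.1, 3.9, 2.2]})

# Run the same merge once for each value of how.
merged = {
    how: pd.merge(left, right, on='month', how=how)
    for how in ['left', 'right', 'inner', 'outer']
}

# Show which months survive each kind of merge.
for how, frame in merged.items():
    print(how, sorted(frame['month']))
```

When the identifier column has the same name in both dataframes, the single on argument can stand in for the left_on and right_on pair.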


We can also easily write dataframes to CSV format, either to save them as a checkpoint or for further manipulation in Excel. Try to resist the urge to edit the data in Excel: it is better to do your data manipulation in code so you have a record of everything you do.

In [32]:
df3.to_csv('rain_merged.csv')
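
One detail worth knowing: by default, to_csv also writes the row index as an unnamed first column, and reading that file back produces a column called `Unnamed: 0`. The sketch below demonstrates this on invented data, writing to in-memory buffers rather than real files; `demo` and the buffer names are hypothetical:

```python
import io
import pandas as pd

# Hypothetical data, not the contents of rain.csv.
demo = pd.DataFrame({'month': ['jan', 'feb'], 'rainfall_inches': [5.3, 5.4]})

# Default behavior: the row index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
round_trip = pd.read_csv(with_index)   # index returns as 'Unnamed: 0'

# index=False leaves the row index out of the file.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(round_trip.columns))
print(list(clean.columns))
```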