Lesson 02: Working With Data
===============

As scientists, we often use Python to organize, analyze, and display data in a variety of ways. We'll learn some of the basic techniques you'll need in order to read a data file, ask questions of the data in it, display data in a way that makes sense, and store that information. We're going to take a problem-based approach, and will use data from a real study: [_Pliocene Warmth, Polar Amplification, and Stepped Pleistocene Cooling Recorded in NE Arctic Russia_ by J. Brigham-Grette _et al._](https://www.dropbox.com/s/moscw2rr6ksqgw1/Brigham-Grette2014_Pliocene_Pleistocene_Russia.pdf?dl=0). This is what climatologists call a _multiproxy_ study - a study in which multiple indicators of past climates (proxies) are put together to reconstruct characteristics of that past climate. In the next tutorial, we'll recreate some of Brigham-Grette's graphs and make a few of our own. This time, we'll just explore the data.

Goals of this Notebook
-----------------------
1. Read, store, access, and save data using Python, NumPy, and Pandas
2. Choose data structures appropriate for common tasks, including scalar data types, strings, lists, arrays, and data frames
3. Understand basic Python error messages
4. Load and use Python packages
5. Perform simple mathematical and data organization operations
6. View the contents of variables

Variables
----------
A key piece of the puzzle is understanding how information is stored in a computer's _memory_. We'll sidestep the somewhat more complex issue of how information is stored on your computer's hard disk for now. 

Python stores information in what are called _variables_. I like to think of variables as bins in which Python puts information. Every variable holds a certain type of data. Some variables contain single numbers: integers (`0`, `1`, `2`, etc.; they can be negative) or decimal numbers (e.g. `1.2345`). Others contain characters (e.g. `a`, `3`, `$`, `N`). _Sequences_ contain multiple items. Multiple characters make up a _string_. Both strings and characters are identified by single or double quotation marks (e.g. `"this is a string"`; `'Another f#&*ing string.'`; `"X"` is a single character, which Python treats as a string of length one). _Lists_, another type of sequence, are what they sound like: lists of data of a variety of types, identified by brackets (e.g. `['a', 1, 2.5, -637, 'Hubert']`). Before going much further, let's play with a few data types.

#### Now Try This
1. Run the code block below. Notice that you can use one variable to define the contents of another variable. Also notice that the `print` command outputs the contents of a variable.
2. Modify the code block so that instead of adding `a + b`, you add `a + c`. What happens when you try to add variables with two different types of data? What potentially helpful information does Python give you? Fix the mistake by returning the line to its original state.
3. Modify the code block so that `d` is equal to `a` minus 0.5. You are now subtracting a decimal from an integer. What happens? Do you get the same error for adding integers and decimals? 

In [None]:
a = 1
b = 0
print(a)

c = 'cat'
print(c)

# The result of this line is computed but not stored or printed
a + b

d = a - 5
print(d)

# A list can mix several data types
mylist = [1, 2, -3, 'dog']
print('My List:', mylist)
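
If you are ever unsure what kind of data a variable holds, the built-in `type()` function will tell you, and the built-in `str()`, `int()`, and `float()` functions convert between types. Here is a minimal sketch; the variable names and values are made up for illustration and are separate from the exercise above.

```python
# Illustrative names and values only.
count = 3         # an integer
ratio = 0.25      # a decimal (floating-point) number
label = 'core'    # a string

print(type(count))   # <class 'int'>
print(type(ratio))   # <class 'float'>
print(type(label))   # <class 'str'>

# Convert a number to a string before joining it to other text.
as_text = label + ' ' + str(count)
print(as_text)       # core 3

# Convert text that looks like a number into an actual number.
as_number = float('2.5') + count
print(as_number)     # 5.5
```

Checking types this way is often the quickest route to understanding an error message that mentions `int`, `float`, or `str`.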


Packages
---------
We often want to do things with data that are difficult or complicated to do with standard Python. Other clever programmers have often accomplished these tasks, and made the tools to do them available as _packages_. Packages are somewhat like plug-ins in, say, MS Word: they are extra features that you install after you've set up the main application. 

There are two steps to using a Python package. First, you need to download the package and put it in the right place on your computer. There are two ways to install packages, both from the command line. Suppose you wanted to install the [_Pandas_](https://pandas.pydata.org/) data analysis and management package. (Note that this was likely installed already when you first set up Anaconda.) It's better to use Anaconda's installer, `conda`. To do that, type:
```bash
% conda install pandas
```
Alternatively, you could use the `pip` installer:
```bash
% pip install pandas
```
This might put packages in a different location on your computer than `conda` does, so it's better to use `conda` when you can.

The second step is to tell the Python interpreter to use the package. We do this in the next block of code using the `import` command.

#### Now Try This
1. Run the next block of code. Make sure there are no error messages.
2. Every command from the Pandas package now starts with the prefix "pd." So `pd.DataFrame` refers to the `DataFrame` command in Pandas.
3. You'll also want to use the [_NumPy_](http://www.numpy.org/) scientific computing package. Add a line to the code block below to import `numpy` as `np`.

In [1]:
import pandas as pd
pd.show_versions()
import numpy as np


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.26.1
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


NumPy Arrays
-------------
NumPy is an improvement over many of the mathematical and data-processing tools available in standard Python. One of the main improvements is in NumPy's ability to handle _arrays_ quickly and easily. Arrays, like lists, are sequences. Unlike lists, each array only holds one data type, usually numbers. Arrays can be multi-dimensional, meaning that they can hold 2-D grids of numbers, 3-D blocks, or even more complex structures with multiple dimensions if you need them. The command `np.array()` converts a list to an array.
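
As a rough illustration of the list-to-array conversion, here is a short sketch. The names `readings`, `arr`, and `grid` and their values are made up for this example.

```python
import numpy as np

readings = [2.0, 4.5, 6.0]     # a plain Python list (made-up values)
arr = np.array(readings)       # the same values as a 1-D NumPy array

# A list of lists becomes a 2-D array (rows x columns).
grid = np.array([[1, 2], [3, 4], [5, 6]])

print(arr.ndim)    # 1
print(grid.ndim)   # 2

# Multiplying a list repeats it; multiplying an array acts on every element.
print(readings * 2)   # the list repeated twice
print(arr * 2)        # every element doubled
```

The last two lines show one reason arrays are preferred for numerical work: arithmetic applies to the values themselves.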

#### Now Try This
1. Run the code block below. How do you know that B is an array of decimal numbers, whereas A contains integers? What happens when one element of an array is a decimal number?
2. What happens when you add A and B? What happens when you multiply A and B?


You can access a particular element of a list or an array using _indexing_. In indexing, you tell Python which item in a list or an array you want to print, change, or otherwise use. To access the first element of array A, you would type `A[0]`, because the numbering of elements starts at 0. So `A` has items numbered from 0 to 2. In multidimensional arrays, you specify elements using two or more "coordinates": for example `B[1,1]`.

#### Now Try This
1. What do you guess element [2,1] of array B will be? Print out element [2,1] of array B.
2. What error message do you see when you ask Python for A[3]?
3. You can access part of an array by *slicing*, or designating a range of elements. For example, A[0:2] will give an array with the first two elements of A (A[0] and A[1], where the elements are *below 2*). How would you access the bottom right 2x2 grid from array B? 
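
Indexing and slicing follow the same rules on any array, so they can be practiced on a made-up example first. The array `M` below is hypothetical and is not one of the arrays used in the exercises.

```python
import numpy as np

# A made-up 3x4 array for illustration.
M = np.array([[10, 11, 12, 13],
              [20, 21, 22, 23],
              [30, 31, 32, 33]])

print(M[0, 0])       # first row, first column: 10
print(M[2, 3])       # third row, fourth column: 33
print(M[1])          # the whole second row
print(M[:, 0])       # the whole first column
print(M[0:2, 1:3])   # rows 0-1 and columns 1-2 (the end index is excluded)
print(M[-1, -1])     # negative indexes count from the end: 33
```

A bare `:` means "everything along this dimension," which is handy for pulling out whole rows or columns.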

Array Attributes
-----------------

In addition to their contents, arrays have a set of useful _attributes_, pieces of information about them that you might want to know. Attributes are accessed by a dot and the name of the attribute, with no parentheses. Some of the most useful attributes are listed [here](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html). 

#### Now Try This
1. Find the `shape` (number of rows and columns), `size` (number of elements), and `dtype` (data type) of both arrays B and A. For example, for the shape, type `A.shape`.
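
Here is a small sketch of reading attributes from a made-up array `C` (not one of the arrays in the exercise). Attributes are written without parentheses, because they are stored values, not commands to run.

```python
import numpy as np

C = np.array([[1.5, 2.5], [3.5, 4.5], [5.5, 6.5]])  # made-up values

print(C.shape)   # (3, 2): three rows, two columns
print(C.size)    # 6 elements in total
print(C.ndim)    # 2 dimensions
print(C.dtype)   # float64
```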


In [2]:
A = np.array([1, 2, 3])
print (A)

B = np.array([[1, 2, 3],[4, 5, 6],[7,8,9.0]])
print (B)

[1 2 3]
[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]]


Pandas Data Frames
-------------------
Pandas is a step above NumPy when it comes to analyzing large datasets with multiple columns of different kinds of information. This is great for Excel files with sample IDs, measurements, times, dates, etc. Data frames are kind of like data tables in publications. Each column has one kind of data in it.

To create a data frame from an array, you just use the `pd.DataFrame()` command:

In [5]:
df_B=pd.DataFrame(B)
print(df_B)

     0    1    2
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0


It's more helpful if the columns in your data frame have meaningful names. Let's name them using the `columns` attribute:

In [7]:
df_B.columns=['first','second','third']
print(df_B)

   first  second  third
0    1.0     2.0    3.0
1    4.0     5.0    6.0
2    7.0     8.0    9.0
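
Column names can also be supplied when the data frame is created, and individual columns can be renamed later with `.rename()`. This is a sketch with made-up values and hypothetical column names.

```python
import numpy as np
import pandas as pd

values = np.array([[1.0, 2.0], [3.0, 4.0]])   # made-up values

# Name the columns at creation time.
df = pd.DataFrame(values, columns=['depth', 'temp'])

# Rename a single column; other columns are left alone.
df = df.rename(columns={'temp': 'temperature'})
print(df)
```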


To access a column, you specify its name in the same way you specify an element in an array: using the `[]` brackets.

In [9]:
print(df_B['third'])

0    3.0
1    6.0
2    9.0
Name: third, dtype: float64


Technically, each column is called a pandas `Series`. We'll just call them columns.

If you want to access a particular element in a data frame, use `.loc[]` or `.iloc[]` as follows:

In [15]:
print(df_B.loc[1,'second'])

5.0


In [18]:
print(df_B.iloc[0,2])

3.0


What's the difference? `.loc[]` lets you use the names (labels) of rows and columns, whereas `.iloc[]` uses integer positions (like NumPy does with arrays).

You can use slices with `.loc[]` and `.iloc[]` as well. Note that they work differently: an `.iloc[]` slice leaves out its end position, as in NumPy, whereas a `.loc[]` slice includes its end label.

In [20]:
print(df_B.iloc[1:3,2])

1    6.0
2    9.0
Name: third, dtype: float64


In [24]:
print(df_B.loc[0:1,'first':'second'])

   first  second
0    1.0     2.0
1    4.0     5.0
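
The difference between the two kinds of slices is easier to see when the row labels are not numbers. In this sketch, the data frame and its row labels `'a'`, `'b'`, `'c'` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'first': [1.0, 4.0, 7.0], 'second': [2.0, 5.0, 8.0]},
    index=['a', 'b', 'c'],   # row labels are strings here, not positions
)

# Label-based: rows 'a' through 'b', with the end label included.
by_label = df.loc['a':'b', 'first']

# Position-based: rows 0 and 1; position 2 is excluded.
by_position = df.iloc[0:2, 0]

print(by_label)
print(by_position)
print(by_label.equals(by_position))   # True
```

Both requests return the same two rows, even though one slice ends at `'b'` and the other at `2`.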


Reading Data Files in Pandas
------------------------------
One of the great things about pandas is its ability to read files and manipulate them as data frames. This allows you to do the kinds of tasks you do in Excel, but faster. Let's read the Lake Elgygytgyn data file into memory as a data frame. We do this using the `read_csv()` command. This command takes several *parameters* - aspects of the command you need to specify (designated in parentheses):
1. The file name (`'elgygytgyn2013_reconstruction.csv'`). This is the first parameter you specify.
2. The number of rows to skip. This is a *named* parameter: you need to specify "`skiprows=`" to tell pandas to skip this many lines at the beginning of the file. Other useful named parameters include `sep` (column separator; useful if you have something other than commas between the columns in a file), `comment` (set this if you want to skip lines starting with a certain character) and `skipfooter` (skip this many lines at the end).
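
To see how these parameters behave, here is a sketch that reads a tiny made-up "file" held in memory. The contents, the `;` separator, and the column names are invented for this example. `io.StringIO` simply stands in for a file name.

```python
import io
import pandas as pd

# A made-up file: two descriptive lines, a header row, a comment, then data.
text = """Example core data
Collected for illustration only
age;temp
# values below are invented
10;5.5
20;6.5
30;7.0
"""

df = pd.read_csv(
    io.StringIO(text),   # stands in for a file name
    skiprows=2,          # skip the two descriptive lines at the top
    sep=';',             # columns are separated by semicolons, not commas
    comment='#',         # ignore lines that start with '#'
)
print(df)
```

After the skipped lines, pandas treats the first remaining line as the column names.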

In [32]:
elgy=pd.read_csv('elgygytgyn2013_reconstruction.csv',skiprows=12)
print(elgy)

     Age [ka BP]  Mean Temperature, warmest month [deg C]  \
0       2150.300                                 5.640586   
1       2152.460                                 9.207488   
2       2155.000                                10.511760   
3       2156.820                                 7.843209   
4       2157.310                                10.720896   
5       2159.330                                11.119363   
6       2160.320                                10.217688   
7       2162.620                                11.517836   
8       2165.180                                11.804156   
9       2167.740                                12.404994   
10      2169.170                                12.837006   
11      2175.220                                12.327336   
12      2177.140                                12.168423   
13      2179.050                                12.470264   
14      2182.240                                12.741446   
15      2183.930        

...that's a big file. We can just see the first few lines of the data frame using `.head()`:

In [34]:
print(elgy.head())

   Age [ka BP]  Mean Temperature, warmest month [deg C]  \
0      2150.30                                 5.640586   
1      2152.46                                 9.207488   
2      2155.00                                10.511760   
3      2156.82                                 7.843209   
4      2157.31                                10.720896   

   Minimum Mean Temperature, warmest month [deg C]  \
0                                         2.678636   
1                                         1.885963   
2                                         7.962232   
3                                         1.301907   
4                                         8.203799   

   Maximum Mean Temperature, warmest month [deg C]  \
0                                        12.178636   
1                                        14.785963   
2                                        15.062232   
3                                        14.201907   
4                                        14.503799

Let's say we wanted to know the range in precipitation between the minimum and maximum for each sample. All you have to do is:

In [38]:
elgy['Maximum Precipitation, annual mean [mm]']-elgy['Minimum Precipitation, annual mean [mm]']

0      197.186
1      197.186
2      111.502
3      149.475
4       99.186
5       87.184
6       74.261
7       99.186
8       92.166
9      269.873
10      40.444
11     134.590
12     116.858
13     131.453
14     119.995
15      94.146
16     116.858
17     128.358
18      70.549
19      70.549
20      70.549
21     159.945
22      92.166
23     111.502
24      74.261
25     111.246
26      91.009
27     131.453
28     131.453
29      91.009
        ...   
347    212.447
348    194.943
349    194.943
350    194.943
351    194.439
352    479.941
353    569.157
354    479.941
355    278.598
356    262.454
357    453.349
358    224.566
359    491.771
360    468.380
361    479.941
362    453.349
363    479.941
364     15.383
365     85.117
366    425.121
367     15.383
368    236.090
369    117.646
370    381.268
371    426.442
372    246.233
373    408.329
374    451.828
375    479.941
376    461.086
Length: 377, dtype: float64

In [40]:
elgy['Precipitation Range (mm)']=elgy['Maximum Precipitation, annual mean [mm]']-elgy['Minimum Precipitation, annual mean [mm]']
print(elgy['Precipitation Range (mm)'].head())

0    197.186
1    197.186
2    111.502
3    149.475
4     99.186
Name: Precipitation Range (mm), dtype: float64


...and if we wanted the average, standard deviation, min, and max of all the precipitation ranges?

In [44]:
print('Average: ',elgy['Precipitation Range (mm)'].mean())
print('Std Dev: ',elgy['Precipitation Range (mm)'].std())
print('Min: ',elgy['Precipitation Range (mm)'].min())
print('Max: ',elgy['Precipitation Range (mm)'].max())

Average:  205.96519628647218
Std Dev:  166.0245460289239
Min:  15.383
Max:  582.756
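
Pandas can also report several of these statistics at once with `.describe()`. Here is a sketch on a made-up column; the name `rng` and its values are hypothetical.

```python
import pandas as pd

rng = pd.Series([10.0, 20.0, 30.0, 40.0], name='range')   # made-up values

summary = rng.describe()   # count, mean, std, min, quartiles, and max
print(summary)

# Individual statistics can be pulled out of the summary by name.
print(summary['mean'])   # 25.0
print(summary['max'])    # 40.0
```

Note that `.std()` in pandas computes the sample standard deviation (dividing by n - 1) by default.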


What if we wanted the average amount of tree and shrub pollen in this reconstruction for samples younger than 2.5 million years? We'd have to select the rows of the data frame where `Age [ka BP]` is less than 2500:

In [53]:
print(elgy.loc[elgy['Age [ka BP]']<2500.,'Trees & Shrubs [%]'].mean())

56.8791720872578


We can make our request even more ridiculous. What if we wanted to average only the samples in which the warmest month was over 10$^\circ$ C?

In [54]:
print(elgy.loc[(elgy['Age [ka BP]']<2500.)&(elgy['Mean Temperature, warmest month [deg C]']>10.),'Trees & Shrubs [%]'].mean())

62.35154920481818
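
Conditions like these are easier to read when each one is built separately. A comparison on a column produces a column of `True`/`False` values, and `&` keeps only the rows where both are `True`. The sketch below uses a small made-up data frame with hypothetical column names.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [100, 200, 300, 400],
    'temp': [12.0, 8.0, 11.0, 15.0],
    'pollen': [60.0, 40.0, 50.0, 70.0],
})  # made-up values

young = df['age'] < 350     # True, True, True, False
warm = df['temp'] > 10.0    # True, False, True, True

# Keep the 'pollen' values only where both conditions are True.
selected = df.loc[young & warm, 'pollen']
print(selected.mean())      # 55.0
```

When the comparisons are written inline, each one needs its own parentheses, because `&` is evaluated before `<` and `>`.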


#### Now Try This
1. Find the average tree/shrub pollen content between 2.5 and 3.0 million years ago.
2. What's *Picea* (look it up)? What are the maximum and minimum values in that column?
3. Which ages have had over 1% *Picea* content?
4. When did *Picea* reach its maximum?