## Data lesson 3

Today we will finish up multi-dimensional arrays and then learn about plotting with matplotlib.

In [None]:
# Add import statements
import numpy as np

#### **Finishing up multi-dimensional arrays**

In [None]:
# Creating the 1-D arrays we used last time
a1 = np.linspace(0,10,11)
a2 = np.arange(0,11,1)
print(a1)
print(a2)

Arrays can also have multiple dimensions.

Below we make a new 2D array, defined using our previous 1D arrays.

The first row of the 2D array will hold the values contained in `a1`.  The second and third rows will hold the values from multiplying or adding `a1` and `a2`.

In [None]:
array2d = np.array([a1, a1*a2, a1+a2])
array2d

Arrays have attributes that provide information about their dimensions and size.

*What does each of the following do?*

In [None]:
print(array2d.ndim)
print(array2d.shape)
print(array2d.size)

There are two valid syntaxes for indexing in a two-dimensional array.  

In both cases, the first index refers to the row, and the second index to the column.

The first method (one set of brackets with a comma) is usually preferred for efficiency.
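
As a minimal sketch of why the comma form tends to be preferred, the cell below uses a small hypothetical array called `demo` (not part of the lesson data). Both forms return the same element, but the two-bracket form first pulls out a whole row and only then picks one element from it.

```python
import numpy as np

# Hypothetical 3x4 array used only for illustration
demo = np.arange(12).reshape(3, 4)

# Brackets + comma: a single indexing operation
single_step = demo[1, 2]

# Two sets of brackets: demo[1] pulls out the whole row first,
# then [2] picks one element out of that row
row = demo[1]
two_step = row[2]

print(single_step, two_step)  # both are 6
```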

In [None]:
# Brackets + comma
array2d[1,5]

In [None]:
# Two sets of brackets
array2d[1][5]

You can select an entire row or an entire column using a colon `:`

In [None]:
# all columns of row 1
array2d[1,:]

In [None]:
# all rows of column 1
array2d[:,1]

*Retrieve the element in column 4 and row 2 of array2d*

In [None]:
# Add code here

*Add 2 to all elements in column 2 of array2d*

In [None]:
# Add code here

The array functions we saw last class also work for 2D arrays.
 - `np.sum(a)` sum of all values in the array
 - `np.min(a)` find the minimum value in the array
 - `np.argmin(a)` find position (index) of the minimum value, counted in the flattened array
 - `np.max(a)` find maximum value in the array
 - `np.argmax(a)` find position (index) of the maximum value, counted in the flattened array
 - `np.unique(a)` returns the unique elements of the array, sorted
 - `np.sort(a)` sorts the values from the minimum to the maximum (for a 2D array, each row is sorted separately)
 - `np.mean(a)` and `np.std(a)` compute mean and standard deviation of array values
 - `np.median(a)` computes the median value of an array.
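
A few of these functions behave in ways worth knowing for 2D arrays. The sketch below uses a small hypothetical array called `demo`; `np.unravel_index` and the `axis` keyword are standard NumPy features that this lesson has not introduced, so treat this as an optional preview.

```python
import numpy as np

# Hypothetical 2x3 array used only for illustration
demo = np.array([[4, 9, 2],
                 [7, 1, 8]])

# By default, argmax reports the position in the flattened array
flat_pos = np.argmax(demo)

# unravel_index converts that flat position into (row, column)
row, col = np.unravel_index(flat_pos, demo.shape)

# The axis keyword restricts a calculation to one direction
column_sums = np.sum(demo, axis=0)  # one sum per column
row_sums = np.sum(demo, axis=1)     # one sum per row

print(flat_pos, (row, col))
print(column_sums, row_sums)
```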
 
*Find the position of the maximum value in array2d*

In [None]:
# Add code here

*Find the sum of all elements in array2d*

In [None]:
# Add code here

#### **Other data types**

We will not go into detail here, but be aware that other data types exist.

* Booleans: used for Boolean logic (True/False)
* Tuples: like a list but immutable
* Sets: like a list but unordered, and each element can only appear once
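
A quick sketch of how these three types behave. The variable names here are made up for illustration.

```python
# Booleans: the result of a comparison is True or False
is_large = 5 > 3

# Tuples: indexed like a list, but cannot be modified
point = (1.5, 2.5)
try:
    point[0] = 0.0
    tuple_changed = True
except TypeError:
    tuple_changed = False  # item assignment is not allowed

# Sets: duplicate values are dropped
elements = set(["H", "O", "H", "C"])

print(is_large, tuple_changed, len(elements))
```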

#### **Plotting basics**

Data visualization is an incredibly important part of science!

We will be learning how to make plots with the matplotlib library.  

You can make virtually any kind of plot with matplotlib, and it is highly customizable.

Most of the functionality we need is in the `pyplot` module, which has the common nickname `plt`

In [None]:
# import pyplot
import matplotlib.pyplot as plt

First we will make some fake data to plot.

We will pretend we've measured the 200-500 nm absorbance spectrum of a sample.
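
The fake spectrum is a Gaussian peak with a height of 0.1, a center at 300 nm, and a width (standard deviation) of 10 nm, plus random noise of up to 0.02. As a rough check, the sketch below builds the same peak without the noise and finds where its maximum sits. The names `wavelength` and `peak` are hypothetical and chosen so that they do not overwrite the lesson variables.

```python
import numpy as np

# Same wavelength grid as the lesson data: 200, 201, ..., 499
wavelength = np.arange(200, 500, 1)

# Noise-free Gaussian: height 0.1, center 300, standard deviation 10
peak = 0.1 * np.exp(-(wavelength - 300)**2 / (2 * 10**2))

# Locate the maximum of the peak
peak_position = wavelength[np.argmax(peak)]
peak_height = np.max(peak)

print(peak_position, peak_height)
```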

In [None]:
# Run this cell to make fake data
xdata = np.arange(200,500,1)
ydata = 0.1*np.exp(-(xdata-300)**2/(2*10**2))
ydata += 0.02*np.random.random(len(xdata))

The most straightforward plotting function is just called `plot()`.

It accepts the x and y variables of the plot as its first 2 arguments.

*Fill in the correct arguments to plot `xdata` versus `ydata`*

In [None]:
# Add arguments to the function call
plt.plot()

There are many ways we can customize the style of the plot.

These are specified with optional arguments in the `plot()` function.

While the data are passed as **positional** arguments, these optional settings are **keyword** arguments: if we want to include them, we specify them by name in the form `keyword=value`.

The `marker` keyword can take the following values; try a few out and see what they look like.
* 'o'
* '*'
* 'p'
* '^'
* 's'

*What does `marker` control?*

In [None]:
# Add in a marker keyword
plt.plot(xdata, ydata, marker= )

The `ls` keyword can take the following values; try a few out and see what they look like.
* '-'
* '--'
* '-.'
* ':'

*What does `ls` control?*

In [None]:
# Add in a ls keyword
plt.plot(xdata, ydata, ls= )

Last example for now: the `c` keyword can take the following values; try a few out and see what they look like.
* 'r'
* 'k'
* 'b'
* 'g'

*What does `c` control?*

In [None]:
# Add in a c keyword
plt.plot(xdata, ydata, c= )

Here is a list of a few common keywords you can use with `plot`:

* `linestyle` or `ls`: line style
* `marker`: marker style
* `color` or `c`: color
* `linewidth` or `lw`: line width
* `markersize` or `ms`: marker size

For each of these, there are additional values available besides what we introduced here.  You can check out the matplotlib documentation to see all of the options.
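
As a short sketch of how the long and short keyword names relate, the cell below draws two lines with the same style settings, once with the long names and once with the short ones. The data are hypothetical and only there to give the lines something to show.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data used only for illustration
x = np.linspace(0, 10, 11)

plt.figure(figsize=(6, 4))

# Long keyword names
plt.plot(x, x**2, linestyle='--', marker='s', color='r',
         linewidth=2, markersize=8)

# Short keyword names with the same style settings
plt.plot(x, 2 * x**2, ls='--', marker='s', c='r', lw=2, ms=8)
```

Either spelling works; the long names are easier to read, while the short ones save typing.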

For data visualization, it is usually important to label the axes of your plots.  

This is done using `plt.xlabel()` and `plt.ylabel()`.  These will be their own lines of code.

To set an x axis label, simply put the string you want inside the parentheses: `plt.xlabel("x axis")`

*Fill in x and y labels on our plot*

In [None]:
plt.plot(xdata, ydata, color='g', marker='o')
plt.xlabel()
plt.ylabel()

You can also change the dimensions of your figure as follows. 

`plt.figure(figsize=(width,height))`

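This line must come before the `plt.plot()` line; if it comes after, it starts a new, empty figure instead of resizing the one you just plotted.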

*Run the cell below, then try using different width and height values*

In [None]:
plt.figure(figsize = (10,4))
plt.plot(xdata, ydata, color='g', marker='o')
plt.xlabel("Wavelength (nm)")
plt.ylabel("Absorbance")

You can also set the limits of the x and y axes to highlight regions of interest. 

This is done with `plt.xlim(xstart,xstop)` and `plt.ylim(ystart,ystop)`.

*Set the x limits on your plot to zoom in on the absorption feature at 300 nm.*

In [None]:
# Add x and y limits
plt.figure(figsize = (10,4))
plt.plot(xdata, ydata, color='g', marker='o')
plt.xlabel("Wavelength (nm)")
plt.ylabel("Absorbance")

plt.xlim()

You can save a copy of your figure using `plt.savefig()`

*Try adding the following line to your cell above:*
`plt.savefig('fake_spectrum.png')`

Be sure to check where it saved to within your DataHub environment
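
If you are not sure where the file went, one option is to print its full path. This is a minimal sketch that assumes a plain file name (no folder), in which case the file is saved in the current working directory. The data and the file name `example_figure.png` are made up for illustration.

```python
import os
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data used only for illustration
x = np.linspace(0, 10, 11)

plt.figure(figsize=(6, 4))
plt.plot(x, x**2)

# The file name here is an arbitrary example
filename = 'example_figure.png'
plt.savefig(filename)

# A plain file name is saved relative to the current working directory
print(os.path.abspath(filename))
```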

#### **Other types of plots**

Besides the line plots we've seen already, there are many other kinds of plots that you can make in Python.

This cheat sheet includes a helpful summary: https://matplotlib.org/cheatsheets/_images/cheatsheets-1.png

We will go through a few examples that are most likely to be useful to you.

We can generate **scatter plots** using `plt.scatter()`.

In some cases this is not too different from a line plot.

In [None]:
plt.scatter(xdata,ydata)

In other cases, scatter plots are more useful, for example when showing correlations between different attributes.

Let's say we have measured the concentrations of three different chemicals in river samples.  We'll use scatter plots to look for correlations among them.

In [None]:
# Run this cell to generate some fake data
conc1 = np.random.randint(20,30,10)
conc2 = conc1*0.7 + 4*np.random.random(10)
conc3 = np.max(conc2) - 0.8*conc2 - 3*np.random.random(10)
conc3[-1] = 15

*Make some scatter plots to look for correlations between the concentrations of the 3 chemicals.*

In [None]:
# Add code here

A cool feature of scatter plots is you can use color as a third "axis".  Let's see this for our three-chemical scatter plot.

In the cell below, we will use the keyword argument `c` to set a value for each point that is used to assign it a color.

We are also adding a **colorbar** to help interpret what the colors mean.

In [None]:
plt.scatter(conc1, conc2, c=conc3)
plt.xlabel("Chemical 1 (mmol)")
plt.ylabel("Chemical 2 (mmol)")

# Adding the colorbar and its label
cbar = plt.colorbar()
cbar.set_label("Chemical 3 (mmol)")

**Fine tuning:**

There are many other choices of **colormaps** available.  The cheat sheet linked above includes some other colormap options.  This is done using the keyword argument `cmap`.

We also have an extreme value in `conc3` that is making it difficult to see color differences for the other points.  To fix this, we can set minimum and maximum values to avoid "stretching" the colormap too far.  This is done using the keyword arguments `vmin` and `vmax`.
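
To see why one extreme value causes trouble, it helps to look at the arithmetic. By default the colors are spread linearly between the smallest and largest values passed to `c`. The sketch below mimics that scaling with plain NumPy on hypothetical numbers; `scale_to_unit` is a made-up helper for illustration, not a matplotlib function.

```python
import numpy as np

# Hypothetical values: four similar points and one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 15.0])

def scale_to_unit(v, vmin, vmax):
    """Linearly map v onto 0-1, clipping anything outside vmin..vmax."""
    return np.clip((v - vmin) / (vmax - vmin), 0, 1)

# Default behaviour: limits come from the data, so the extreme
# value squeezes the other points into a narrow part of the range
default_scaled = scale_to_unit(values, values.min(), values.max())

# With a chosen maximum, the similar points spread over the full range
limited_scaled = scale_to_unit(values, 1.0, 4.0)

print(default_scaled)
print(limited_scaled)
```

In the first case the four similar points all land in the bottom quarter of the color range; in the second they span all of it, and the extreme value simply takes the color at the top of the scale.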

*Use the cheat sheet to pick a different colormap.  Pass this as an argument to `plt.scatter()` using `cmap= `.  Is it easier to interpret certain colormaps than others?*

*Now set a maximum value to the colormap.*

In [None]:
plt.scatter(conc1, conc2, c=conc3)
plt.xlabel("Chemical 1 (mmol)")
plt.ylabel("Chemical 2 (mmol)")

# Adding the colorbar and its label
cbar = plt.colorbar()
cbar.set_label("Chemical 3 (mmol)")

**Histograms** are another useful type of plot for visualizing distributions of data.

We generate a histogram using `plt.hist(data)`, where `data` is the dataset.

Below we'll make some data for student guesses of the number of gummy bears on a plate.

In [None]:
# Run this cell to make fake data
gummy_guesses = np.random.normal(48, 12, 150)

*Make a histogram of the distribution of guesses in the sample.  Be sure to label your axes!*

In [None]:
# Add code here

Each bar in a histogram represents the number of values that fall within a specific range.  The y values of a histogram will depend on the bin size!

Sometimes it can be helpful to change the size of the bins to better visualize the data.  We can do this with the keyword argument `bins`.

`bins` accepts either:
* a single integer representing the number of evenly-spaced bins
* a list explicitly defining the bin edges
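
To make the two options concrete, the sketch below counts a small hypothetical dataset both ways. It uses `np.histogram`, NumPy's counting function, which accepts the same two forms of `bins` and returns the counts and bin edges as arrays instead of drawing a plot.

```python
import numpy as np

# Hypothetical data used only for illustration
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# An integer asks for that many equal-width bins between min and max
counts_a, edges_a = np.histogram(data, bins=4)

# A list gives the bin edges explicitly; n edges define n-1 bins
counts_b, edges_b = np.histogram(data, bins=[0, 2, 4, 6, 10])

print(counts_a, edges_a)
print(counts_b, edges_b)
```

The same ten values give different bar heights in the two cases, which is why the choice of bins matters when reading a histogram.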

*Try increasing and decreasing the number of bins in your histogram*

In [None]:
# Add code here

*Now we will define an explicit set of bin edges*

In [None]:
bin_edges = np.linspace(20,80,12)

# Add code to make your histogram with the bin edges defined

You can also change the color just as you would for a line plot, using the `color` keyword.

*Try changing the color of your histogram plot*

#### **Overplotting data**

Let's suppose we have two measurements we want to show in the same figure. 

All we need to do is call multiple plotting functions in the same cell.

In [None]:
# Run this cell to make an additional fake spectrum
ydata2 = 0.06*np.exp(-(xdata-310)**2/(2*10**2))
ydata2 += 0.02*np.random.random(len(xdata))

*Make a figure including the line plot for our original spectrum, `ydata`, as well as our new spectrum, `ydata2`.  Note that you can use the same `xdata` for both cases.*

In [None]:
# Add code here

Now that we have multiple lines on the figure, it is helpful to have a **legend**.

To make a legend, we need to associate a label with each line.  We first add the keyword argument `label=""` to each `plt.plot()` call.  We then add a new line `plt.legend()` at the bottom of the cell.
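
Here is a minimal sketch of the pattern, using two hypothetical curves rather than the spectra so that the exercise below is still yours to do.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data used only for illustration
x = np.linspace(0, 10, 50)

plt.figure(figsize=(6, 4))

# Each plot call gets its own label
plt.plot(x, x, label="linear")
plt.plot(x, x**2 / 10, label="quadratic")

# legend() collects the labels of the lines drawn so far
plt.legend()
```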

*Add labels for spectrum 1 and spectrum 2 to the plot*

In [None]:
# Add code here