First, make sure the notebook is aware of the workshop data sets

In [None]:
!git clone https://github.com/icomse/5th_workshop_MachineLearning.git
import os
os.chdir('5th_workshop_MachineLearning/data')

### Python's Numerical Ecosystem

In addition to Python's built-in modules like the ``math`` module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python.
Some of the most important ones are:

#### [``numpy``](http://numpy.org/): Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data.
If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.

#### [``scipy``](http://scipy.org/): Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more.
We will not look closely at Scipy today, but we will use its functionality later in the course.

#### [``pandas``](http://pandas.pydata.org/): Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a *Data Frame*.
If you've used the [R](http://rstats.org) statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

#### [``matplotlib``](http://matplotlib.org): Visualization in Python

Matplotlib started out as a Matlab plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly). We will get to this next time. 

#### [``sklearn``](https://scikit-learn.org): scikit-learn, a flexible and easy ML and data analysis package

scikit-learn is an easy-to-use, general-purpose Python package for machine learning. It is NOT the highest-performing package (for large neural networks you would want GPU-accelerated frameworks like Keras or PyTorch), but it puts most of the useful approaches and models together in a single package that can run almost anywhere. You probably won't use it for analyzing millions of data points or developing new methods, but for small and medium-sized jobs, it works great.


# `numpy`, your MATLAB-like friend

The `numpy` package (module) is used in almost all numerical computation using Python. It is a package that provides high-performance vector, matrix, and higher-dimensional data structures for Python. It is implemented in C and Fortran, so when calculations are vectorized (formulated with vectors and matrices), performance is very good.

`numpy` has two advantages over MATLAB: it is a library that is *part* of Python (so you can use any other Python functionality alongside it), and it is *free*, unlike MATLAB, which is rather expensive (the university pays your subscription while you are here, but many companies do not have subscriptions!). On its own, `numpy` is **not** as fully capable as MATLAB, and a number of complex tasks are a bit harder to do in Python + numpy than in MATLAB. However, combined with the other Python modules, you can do pretty much anything you can in MATLAB. `numpy` is the 'vanilla' implementation of MATLAB-style data types; you need `scipy`, `matplotlib`, `pandas`, and other tools to do the rest.

To use `numpy` you need to import the module, using for example:

In [None]:
import numpy as np

We'll call this often enough that it's worth renaming it `np` to save typing three extra characters. In the `numpy` package, the terminology used for vectors, matrices, and higher-dimensional data sets is *array*.



## Creating `numpy` arrays

There are a number of ways to initialize new numpy arrays, for example from

* a Python list or tuple
* using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, `zeros`, etc.
* reading data from files

### Creating arrays from python lists

For example, to create new vector and matrix arrays from Python lists we can use the `numpy.array` function, which takes the Python list, and returns a new array with those entries.

In [None]:
# a vector: the argument to the array function is a Python list
l = [1,2,3,4] # a Python list
print(type(l)) # see, a list
v = np.array(l)  # create a numpy array with the same elements as this list
print(type(v)) # a new type
print(type(l)) # l is still the same: v was created from l, but l wasn't changed.

In [None]:
l

In [None]:
v

In [None]:
# a matrix: the argument to the array function is a nested Python list
# it's still type "array"
M = np.array([[1, 2], [3, 4]])
print(M)

The `v` and `M` objects are both of the type `ndarray` that the `numpy` module provides.

In [None]:
type(v), type(M)

The difference between the `v` and `M` arrays is only their shapes. We can get information about the shape of an array by using the `ndarray.shape` property.

In [None]:
v.shape  # a length-4 array.  It's ALMOST the same thing as 
         # a 4x1 matrix; in some rare cases, the difference matters. 

In [None]:
M.shape

The number of elements in the array is available through the `ndarray.size` property:

In [None]:
M.size

Equivalently, we could use the functions `numpy.shape` and `numpy.size`; these functions just return the properties of the objects.

In [None]:
np.shape(M)

In [None]:
np.size(M)

So far the `numpy.ndarray` looks awfully much like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type? 

There are several reasons:

* Python lists are very general. They can contain any kind of object. They are dynamically typed (i.e. the type of data can change). They do not support mathematical functions such as matrix and dot multiplications, etc. Implementing such functions for Python lists would not be very efficient because of the dynamic typing.
* Numpy arrays are **statically typed** and **homogeneous**. The type of the elements is determined when the array is created -- all integers, all floats, or all booleans.  A Python list, by contrast, can hold a bunch of different things.
* Numpy arrays are memory efficient.
* Because of the static typing, fast implementation of mathematical functions such as multiplication and addition of `numpy` arrays can be implemented in a compiled language, like C and Fortran.  So if you use numpy functions with numpy arrays, it is usually **much, much** faster than Python alone.
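
The differences listed above can be seen in a few lines. This is a small illustrative sketch; the variable names are made up for the example and are not used elsewhere in the notebook.

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.array(py_list)

# Multiplying a list by an integer repeats the list ...
repeated = py_list * 2          # [1, 2, 3, 1, 2, 3]
# ... while multiplying an array multiplies every element.
doubled = arr * 2               # array([2, 4, 6])

# A list can mix types; an array converts everything to one common dtype.
mixed_list = [1, 2.5, True]
mixed_arr = np.array(mixed_list)   # every element becomes a float64
print(repeated, doubled, mixed_arr.dtype)
```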

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array contains:

In [None]:
M.dtype

We get an error if we try to assign a value of the wrong type to an element in a numpy array. A Python list doesn't care, since it can hold any kind of object, but numpy does because it has already allocated a fixed amount of memory for data of one type.

In [None]:
M[0,0] = "hello"

The problem was that `M` was created as an array of integers, and we are trying to put a string into it!

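If we want, we can explicitly define the type of the array data when we create it, using the `dtype` keyword argument. `np.array` infers the type from the data (a list of Python integers gives an integer array), while functions such as `zeros` and `ones` create floats by default, so you frequently have to do this if you want integers or booleans.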

In [None]:
M = np.array([[1, 2], [3, 4]], dtype=complex)
M

In [None]:
M = np.array([[1, 2], [3, 4]],dtype='int')

In [None]:
M.dtype

Common data types that can be used with `dtype` are: `int`, `float`, `complex`, `bool`, `object`, etc.

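We can also explicitly define the bit size of the data types, for example: `int64`, `int16`, `float128`, `complex128`. This can be useful if we know we need very high precision data, or if we want to save memory.  `float64` is the default floating-point type; the default integer type depends on the platform and NumPy version (usually `int64`, but `int32` on some systems), and `float128` is not available on every platform.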
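
As a quick illustration of what the bit size means in practice, the sketch below compares the memory used per element for two integer types and converts between types with `astype`. The array names are only for this example.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int16)
big = np.array([1, 2, 3], dtype=np.int64)
print(small.itemsize, big.itemsize)   # bytes per element: 2 and 8

# astype returns a converted copy; the original array keeps its dtype
as_float = small.astype(np.float32)
print(as_float.dtype, small.dtype)
```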

### Using array-generating functions

For larger arrays it is impractical to initialize the data manually, using explicit Python lists. Instead we can use one of the many functions in `numpy` that generate arrays of different forms. Some of the more common are:

#### arange

In [None]:
# create a range of numbers
x = np.arange(0, 10, 1) # arguments: start, stop, step
x

In [None]:
x = np.arange(-1, 1, 0.1)
x

#### linspace and logspace
What do `linspace` and `logspace` do?  Use the `?` to find out.
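
Once you have read the help text, a minimal sketch like the following can be used to check your understanding. The specific endpoints and point counts are arbitrary choices for illustration.

```python
import numpy as np

# linspace: a fixed NUMBER of evenly spaced points, both endpoints included
lin = np.linspace(0, 1, 5)      # 0, 0.25, 0.5, 0.75, 1

# logspace: points evenly spaced in the exponent, from 10**0 to 10**3
log = np.logspace(0, 3, 4)      # 1, 10, 100, 1000
print(lin, log)
```

Note the contrast with `arange`, where you choose the step size rather than the number of points.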

#### random data

In [None]:
# uniform random numbers in [0,1): 0 is included, 1 is not
np.random.rand(5,5)

In [None]:
# standard normal distributed random numbers
np.random.randn(5,5)

**Try it yourself**:  What other sorts of arrays of random numbers can you create?

In [None]:
dir(np.random)

In [None]:
np.random.randint(low=-10,high=10,size=(5,5))

In [None]:
np.random.lognormal(size=(5,5))

In [None]:
np.random.logistic(size=(5,5))

#### diag

In [None]:
# a diagonal matrix
np.diag([1,2,3])

In [None]:
# diagonal with offset from the main diagonal
np.diag([1,2,3], k=1) 

What do the commands `zeros` and `ones` do?

In [None]:
np.ones([5,5])

### Hacking time!

1. Create a 6 by 5 matrix of zeros and store in variable `zeros65`
1. Create an array of 11 equally spaced points from 0 to 20, inclusive.
1. Read the documentation for <tt>vstack</tt> and <tt>hstack</tt>. Then create the following array in Python and store it in variable <tt>M</tt>. Hint: How can you use <tt>vstack</tt>, <tt>identity</tt>, and <tt>zeros</tt> together?

$$ M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0\end{bmatrix}$$

In [None]:
zeros65 = np.zeros([6,5])
zeros65

In [None]:
np.linspace(0,20,11)

In [None]:
A = np.eye(3)
B = np.zeros([2,3])
C = np.vstack((A,B))
print(C)
np.shape(C)

## File I/O

### Comma-separated values (CSV)

A very common file format for data files is comma-separated values (CSV), or related formats such as TSV (tab-separated values). To read data from such files into Numpy arrays we can use the `numpy.genfromtxt` function, and using `numpy.savetxt` we can store a Numpy array to a text file. By default both functions separate values with whitespace; pass `delimiter=','` to each if you need true comma-separated files. For example:

In [None]:
M = np.random.rand(3,3)
M

In [None]:
np.savetxt("random-matrix.csv", M)

In [None]:
data = np.genfromtxt('random-matrix.csv')
data

In [None]:
data[0,0]

In [None]:
np.savetxt("random-matrix.csv", M, fmt='%.5f') # fmt specifies the format

In [None]:
data = np.genfromtxt('random-matrix.csv')
data
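
The cells above rely on the default whitespace delimiter. If the file really has to be comma-separated, a sketch along the following lines should work; the temporary directory and the file name `matrix.csv` are assumptions made so the example does not touch the workshop files.

```python
import os
import tempfile
import numpy as np

M = np.arange(6, dtype=float).reshape(2, 3)

# write into a temporary directory so no workshop files are overwritten
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "matrix.csv")
    np.savetxt(path, M, delimiter=",", fmt="%.3f")   # commas between values
    with open(path) as f:
        first_line = f.readline().strip()
    loaded = np.genfromtxt(path, delimiter=",")      # must match on reading

print(first_line)
print(loaded)
```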

### Numpy's native binary file format

This format is useful when storing and reading back numpy array data: the data is stored in binary at its full in-memory precision, rather than as space-inefficient text. Use the functions `numpy.save` and `numpy.load`:

In [None]:
np.save("random-matrix.npy", M)

In [None]:
newmat = np.load("random-matrix.npy")
newmat

## More properties of the numpy arrays

In [None]:
M.itemsize # bytes per element

In [None]:
M.ndim # number of dimensions

In [None]:
M.nbytes # number of bytes; the actual size in memory.

## Manipulating arrays

### Indexing

We can index elements in an array using square brackets and indices, the same as with a list.

`v` is a vector and has only one dimension, so it is accessed with one index, like a list:

In [None]:
v[1]

`M` is a matrix, or a 2 dimensional array.  We access the entries using TWO indices, separated by a comma

In [None]:
M[1,1]

If we omit an index of a multidimensional array, it returns the whole row (or, in general, an N-1 dimensional array).

In [None]:
M

We can look at an entire row of `M` using `:` instead of a column index. Remember that indexing starts at 0, so index 1 is the second row:

In [None]:
M[1,:] # row 1

Or an entire column (again, index 1 is the second column):

In [None]:
M[:,1] # column 1

If we type `M[1]`, what will it return?

We can assign new values to elements in an array using indexing:

In [None]:
M[0,0] = 1

In [None]:
M

We can also assign entire rows and columns.  Note that we are actually making multiple assignments at a time! Numpy loops over the assignments under the hood.

In [None]:

M[1,:] = 0
M[:,2] = -1

In [None]:
M

You can also reshape an array after creating it, changing the dimensions, as long as the total number of elements remains the same.

In [None]:
A = np.array([2,4,6,8])
print("A is now a vector",A,"\n")
A = A.reshape(2,2)
print("A is now a matrix\n",A)
print("\nNow let's shape A back to a vector")
print(A.reshape(4))
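
Two details of `reshape` are worth a short sketch: one dimension may be given as `-1`, in which case numpy infers it, and a shape with the wrong total number of elements is rejected. The array below is an arbitrary example.

```python
import numpy as np

A = np.arange(12)          # 12 elements: 0, 1, ..., 11
B = A.reshape(3, -1)       # -1 lets numpy infer the missing dimension (here 4)
print(B.shape)

# 12 elements cannot fill a 5x3 array, so numpy raises a ValueError
try:
    A.reshape(5, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)
```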

### Index slicing

"Index slicing" is the technical name for the syntax `M[lower:upper:step]` to extract *part* of an array:

In [None]:
A = np.array([1,2,3,4,5,6])
A

In [None]:
A[1:3]

In [None]:
A[1::2]

Array slices are *mutable*: if they are assigned a new value, the original array from which the slice was extracted is modified:

In [None]:
A[1:4] = [-2,-3,-4]
A
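
A related point that often surprises people: a slice stored in a variable is a *view* of the original array, not a copy, which differs from how Python lists behave. The following sketch (with made-up variable names) shows the difference.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
view = a[1:4]                 # a slice is a view onto the same memory
view[0] = 99                  # ... so this also changes a[1]

independent = a[1:4].copy()   # an explicit copy owns its own data
independent[1] = -1           # a is not affected by this

lst = [1, 2, 3, 4, 5, 6]
sub = lst[1:4]                # slicing a list makes a new list
sub[0] = 99                   # lst is unchanged
print(a, lst, np.shares_memory(a, view))
```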

We can omit any of the three parameters in `M[lower:upper:step]`:

In [None]:
A[::] # lower, upper, step all take the default values

In [None]:
A[::2] # step is 2, lower and upper defaults to the beginning and end of the array

In [None]:
A[:3] # first three elements

In [None]:
A[3:] # elements from index 3

Negative indices count from the end of the array (positive indices from the beginning):

In [None]:
A = np.array([1,2,3,4,5])

In [None]:
A[-1] # the last element in the array

In [None]:
A[-3:] # the last three elements; -3 is the 3rd to last, then all of them after that.

Index slicing works exactly the same way for multidimensional arrays:  

In [None]:
A = np.array([[n+m*10 for n in range(5)] for m in range(5)]) # good review for list comprehension: what is this doing?
A

In [None]:
# a block from the original array
A[1:4, 1:4]

In [None]:
# taking every other row and column
A[::2, ::2]

### Fancy indexing

"Fancy indexing" is the name for when an array or list is used **in the place** of an index: 

In [None]:
row_indices = [0, 2, 4]
A[row_indices,:]

In [None]:
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[:,col_indices]

What is this doing??

In [None]:
A[row_indices,col_indices]
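
After you have thought about it: when two index lists are given, numpy pairs them up element by element. The sketch below rebuilds the same `A` and also shows `np.ix_`, which is one way to get the full block of rows and columns instead of the pairs.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
row_indices = [0, 2, 4]
col_indices = [1, 2, -1]

# pairs: A[0,1], A[2,2], A[4,-1]
pairs = A[row_indices, col_indices]

# np.ix_ builds an index that selects every combination of rows and columns
block = A[np.ix_(row_indices, col_indices)]
print(pairs)
print(block)
```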

We can also use **index masks**: if the index mask is a Numpy array of data type `bool`, then an element is selected (`True`) or not (`False`) depending on the value of the index mask at the position of each element: 

In [None]:
B = np.array([n for n in range(5)])
B

In [None]:
row_mask = np.array([True, False, True, False, False])
B[row_mask]

In [None]:
# same thing
row_mask = np.array([1,0,1,0,0], dtype=bool)
B[row_mask]

This feature is very useful to conditionally select elements from an array, using for example comparison operators:

In [None]:
x = np.arange(0, 10, 0.5)
x

In [None]:
x>5

In [None]:
(x < 7.5)

In [None]:
mask = (x > 5) * (x < 7.5)
mask

In [None]:
x[mask]

Put it all together:

In [None]:
x[(x > 5) * (x < 7.5)]  # think about what this is doing!
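
Multiplying two boolean arrays acts as a logical AND. A more conventional spelling uses the element-wise operators `&` (and), `|` (or), and `~` (not); the sketch below shows that `&` gives the same result as `*`. The parentheses are required because `&` binds more tightly than the comparisons.

```python
import numpy as np

x = np.arange(0, 10, 0.5)

between = x[(x > 5) & (x < 7.5)]          # AND: both conditions hold
outside = x[(x < 1) | (x > 9)]            # OR: either condition holds
not_between = x[~((x > 5) & (x < 7.5))]   # NOT: everything else
print(between, outside, not_between.size)
```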

### where

The index mask can be converted to position indices using the `where` function:

In [None]:
indices = np.where(mask)
indices

In [None]:
x[indices] # this indexing is equivalent to the mask indexing x[mask]

### diag

With the diag function we can also extract the diagonal and subdiagonals of an array:

In [None]:
np.diag(A)

In [None]:
np.diag(A, -1)

### Hacking time!
With a partner, generate a `numpy` array from 1 to 2013 that contains only numbers that are not divisible by 7. **Use only array commands** (i.e. don't use `for` or something else to loop over the elements one by one).

## Linear algebra

**Vectorizing code** is the key to writing efficient numerical calculation with Python and Numpy. That means that as much as possible of a program should be formulated in terms of matrix and vector operations, like matrix-matrix multiplication.

### Scalar-array operations

We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers. Can you describe what the following do?

In [None]:
v1 = np.arange(0, 5)

In [None]:
v1 * 2

In [None]:
v1 + 2

In [None]:
A * 2, A + 2

### Element-wise array-array operations

When we add, subtract, multiply and divide arrays with each other, the default behaviour is **element-wise** operations:

In [None]:
A

In [None]:
A * A # element-wise multiplication

In [None]:
v1 * v1

If we multiply arrays with different but compatible shapes, numpy *broadcasts* the smaller array: here each row of `A` is multiplied *element-wise* by `v1`.  Is this the same as matrix multiplication?

In [None]:
A.shape, v1.shape

In [None]:
A * v1
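
To make the comparison concrete, here is a small sketch with a 2x2 matrix (chosen only to keep the numbers easy to check by hand) that contrasts element-wise multiplication with a true matrix-vector product.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
v = np.array([10, 100])

elementwise = A * v               # each ROW of A times v, element by element
matvec = A @ v                    # matrix-vector product: one number per row
by_rows = A * v[:, np.newaxis]    # turn v into a column to scale each row instead

print(elementwise)   # [[ 10 200] [ 30 400]]
print(matvec)        # [210 430]
print(by_rows)       # [[ 10  20] [300 400]]
```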

What does the following do?

In [None]:
x = np.array([1,2.5,5])
y = np.array([4,6,7,9.0])
print(x+y)

### Matrix algebra

What about matrix multiplication? There are two ways. We can use the `dot` function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments: 

In [None]:
np.dot(A, A)

In [None]:
np.dot(A, v1)

In [None]:
np.dot(v1, v1)

You can also use the `@` operator for matrix multiplication:

In [None]:
print(A @ A)
print(A @ v1)
print(v1 @ v1)

In [None]:
print(v1.shape)
print(A.shape)

### Matrix computations

The `linalg` library contains lots of useful linear algebra routines; pretty much anything you would need.

In [None]:
dir(np.linalg)

#### Inverse

In [None]:
M = np.diag([1,2,3,4])
Minv = np.linalg.inv(M)
print(Minv)

`Minv @ M` is $M^{-1} M$, `M @ Minv` is $M M^{-1}$, both of which are equal to the identity matrix $I$.

In [None]:
print(Minv @ M)
print(M @ Minv)

#### Determinant

In [None]:
np.linalg.det(M)

## Hacking Time!

Create a 10 x 10 random matrix `M`, a 10x1 random vector `b`, and solve for the vector `a` such that  `Ma = b`. Check that you did indeed solve the matrix equation.  What `numpy.linalg` functions are available to do this?
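
One possible approach, once you have tried it yourself, is sketched below using `np.linalg.solve`. The seeded generator `np.random.default_rng(0)` is an assumption made here so the result is reproducible; the `np.random.rand` calls used earlier would work as well.

```python
import numpy as np

rng = np.random.default_rng(0)    # seeded so the example is reproducible
M = rng.random((10, 10))
b = rng.random(10)

# solve M a = b directly; this is generally preferred over computing inv(M) @ b
a = np.linalg.solve(M, b)

# check the solution by substituting it back into the equation
residual_ok = np.allclose(M @ a, b)
print(residual_ok)
```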

### Data processing

Often it is useful to store datasets in `numpy` arrays. `numpy` provides a number of functions to calculate statistics of datasets in arrays.  This data set is available to be downloaded.

In [None]:
data = np.genfromtxt('stockholm_td_adj.dat')

The data format is: 
 - column 1: year
 - column 2: month
 - column 3: day
 - column 4: daily average temperature
 - column 5: low
 - column 6: high
 - column 7: location

In [None]:
data.shape

In [None]:
data[70000,:]

#### mean

In [None]:
# the temperature data is in column index 3 (the fourth column, since counting starts at 0)
np.mean(data[:,3])

The daily mean temperature in Stockholm over the last 200 years has been about 6.2 C.

#### standard deviations and variance

In [None]:
np.std(data[:,3]), np.var(data[:,3])

#### min and max

In [None]:
# lowest daily average temperature
print(data[:,3].min())
print(min(data[:,3]))  # Python's built-in min gives the same answer, but loops in Python and is much slower

In [None]:
# highest daily average temperature
data[:,3].max()

## Hacking time!!
Test the random number generator `np.random.randn` to check that it is indeed generating random numbers that are normally distributed with standard deviation 1 and mean 0.  You don't need to do statistical tests, just make sure that it converges to the right means and standard deviations as the number of random numbers increases.  

How could you use vectorized methods to create random numbers that are normally distributed and have mean 500 and standard deviation 100?  Hint: you shouldn't need to call the random number generator again, you can operate on the data you already generated.
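
A sketch of one way to approach both questions is given below. It assumes a seeded generator (`np.random.default_rng(42)`) instead of `np.random.randn` purely for reproducibility, and the sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)       # seeded for reproducibility
z = rng.standard_normal(200_000)      # mean 0, standard deviation 1

# sample statistics approach 0 and 1 as the sample grows
print(z[:100].mean(), z[:100].std())
print(z.mean(), z.std())

# shift and scale the SAME numbers: no second call to the generator needed
scaled = 500 + 100 * z
print(scaled.mean(), scaled.std())
```

Multiplying by 100 stretches the standard deviation, and adding 500 moves the mean without changing the spread.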

### Computations on subsets of arrays

We can compute with subsets of the data in an array using indexing, fancy indexing, and the other methods of extracting data from an array (described above).

For example, let's go back to the temperature dataset:

The data format is: year, month, day, daily average temperature, low, high, location.

If we are interested in the average temperature only in a particular month, say February, then we can create an index mask and use it to select only the data for that month:

In [None]:
np.unique(data[:,1]) # the month column takes values from 1 to 12

In [None]:
mask_feb = data[:,1] == 2

In [None]:
# the temperature data is in column index 3 (the fourth column, since counting starts at 0)
np.mean(data[mask_feb,3])

With these tools we have very powerful data processing capabilities at our disposal. For example, extracting the average temperature for each month of the year takes only a few lines of code: 

In [None]:
months = np.arange(1,13)
monthly_mean = [np.mean(data[data[:,1] == month, 3]) for month in months]

for i, m in enumerate(monthly_mean):
    print(i+1,m)
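
To see exactly what the mask-based selection does, it can help to run it on a tiny table whose answer is easy to check by hand. The five rows below are invented for illustration (columns: year, month, day, temperature) and are not taken from the Stockholm data set.

```python
import numpy as np

# made-up rows: year, month, day, temperature
toy = np.array([
    [2000, 1, 1, -3.0],
    [2000, 1, 2, -1.0],
    [2000, 2, 1,  0.0],
    [2000, 2, 2,  2.0],
    [2001, 1, 1, -5.0],
])

months = np.unique(toy[:, 1])          # months present in the table
# for each month: build a boolean mask, select the temperature column, average
means = {int(m): float(toy[toy[:, 1] == m, 3].mean()) for m in months}
print(means)
```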

## Hacking time!
Compute the lowest temperature ever measured during each month in Stockholm.

### Calculations with higher-dimensional data

When functions such as `min`, `max`, etc. are applied to a multidimensional array, it is sometimes useful to apply the calculation to the entire array, and sometimes only on a row or column basis. Using the `axis` argument we can specify how these functions should behave: 

In [None]:
m = np.random.rand(3,4)
m

In [None]:
# global max
m.max()

In [None]:
# max in each column
m.max(axis=0)

In [None]:
# max in each row
m.max(axis=1)
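
Because the cells above use random numbers, it can be hard to verify which axis is which. A small fixed array, as in this sketch, makes the rule easier to see: the `axis` you name is the one that gets collapsed.

```python
import numpy as np

m = np.array([[1, 5, 3],
              [4, 2, 6]])       # shape (2, 3)

col_max = m.max(axis=0)         # collapse the rows: one value per column
row_max = m.max(axis=1)         # collapse the columns: one value per row
col_sum = m.sum(axis=0)
row_mean = m.mean(axis=1)
print(col_max, row_max, col_sum, row_mean)
```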

## Copy and "deep copy"

To achieve high performance, assignments in Python do not copy the underlying objects. 

"*Why would Python do this, besides to just confuse me!?!*" you ask.

Imagine a NumPy array has millions of elements. It is best for efficient memory usage if functions do not make multiple copies of millions of elements.

This matters, for example, when objects are passed to functions: Python passes a reference to the same object rather than copying its data, which avoids an excessive amount of memory copying when it is not necessary. 

This looks different from what we saw previously for passing floats, integers, booleans, and strings to functions: those types are immutable, so a function can never change the caller's value. 

**Summary:**
* Floats, integers, booleans, strings: You do NOT need to worry about `.copy()`. These types are immutable, so nothing a function does to its argument can change your original value.
* Lists, arrays, dictionaries, and other mutable objects: You need to worry about `.copy()`.

In [None]:
A = np.array([[1, 2], [3, 4]])
A

In [None]:
# now B is referring to the same array data as A 
B = A 

In [None]:
# changing B affects A
B[0,0] = 10
B

In [None]:
A

If we want to avoid this behavior and get a new, completely independent object `B` copied from `A`, then we need to do a so-called "deep copy" using the `copy` method:

In [None]:
B = A.copy()

In [None]:
# now, if we modify B, A is not affected
B[0,0] = -5
B

In [None]:
A
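
The same issue shows up when arrays are passed to functions. The two helper functions in this sketch are hypothetical and exist only to illustrate the difference between modifying the caller's array in place and working on a copy.

```python
import numpy as np

def zero_first_in_place(arr):
    """Modifies the array it is given: the caller sees the change."""
    arr[0] = 0
    return arr

def zero_first_copy(arr):
    """Works on a copy: the caller's array is left alone."""
    out = arr.copy()
    out[0] = 0
    return out

a = np.array([1, 2, 3])
b = zero_first_copy(a)
a_after_copy_version = a.tolist()     # still [1, 2, 3]

c = zero_first_in_place(a)
a_after_in_place = a.tolist()         # now [0, 2, 3]
print(a_after_copy_version, a_after_in_place, c is a)
```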

## Plotting and visualization

### Use the notebook commands to render matplotlib figures inline with the notebook cells

In [None]:
import matplotlib.pyplot as plt
import matplotlib
# next line is a notebook command that makes figures print in the notebook.
%matplotlib inline

### `Matplotlib` basics
Matplotlib is a simple plotting tool that allows you to plot the arrays from NumPy. We already saw an example above.  Matplotlib is designed to be intuitive, easy to use, and mimic MATLAB syntax.

As an example, let's plot the cardinal sine function: $$\mathrm{sinc}(x) = \frac{\sin(x)}{x}$$

In [None]:
#Create the data to plot (note, it is not plotting a function, but 2 arrays)
x = np.linspace(-100,100,1000)
y = np.sin(x)/x

#plot x versus y
plt.plot(x,y)

#label the y axis - always label your axes!
plt.ylabel("sinc(x) [normalized units]");

#label the x axis
plt.xlabel("x [cm]")

#give the plot a title
plt.title("Line plot of the sinc function")

#set the axis limits: [xmin, xmax, ymin, ymax]
plt.axis([-100,100,-.4,1.05])

# Save the figure in the current drive as a pdf or png 
plt.savefig("figure.pdf")
plt.savefig("figure.png")
plt.show()

Take a moment to figure things out.  Let's figure out how to do the following:
* How do you change the x range to be -40 to 40?
* How do you change the y range to be -0.5 to 1.5?
* How do you change the font size of various elements to 18?
* How do you change the colors and transparency of the line?
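
After experimenting, you can compare with the sketch below, which shows one of several ways to make these changes. It uses the object-style interface (`fig, ax = plt.subplots()`), and the color and transparency values are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-100, 100, 1000)
y = np.sin(x) / x

fig, ax = plt.subplots()
ax.plot(x, y, color="green", alpha=0.5)   # alpha sets the transparency (0 to 1)
ax.set_xlim(-40, 40)                      # x range
ax.set_ylim(-0.5, 1.5)                    # y range
ax.set_xlabel("x", fontsize=18)
ax.set_ylabel("sinc(x)", fontsize=18)
ax.set_title("Line plot of the sinc function", fontsize=18)
ax.tick_params(labelsize=18)              # font size of the tick labels
```

In a notebook with inline plotting the figure appears at the end of the cell; in a script you would add `plt.show()`.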

### Customizing Plots

You can also change the plot type.

In [None]:
x = np.linspace(-3,3,100)
y = np.exp(-x**2)
fig = plt.figure(figsize=(4,3), dpi=200)
plt.plot(x,y,"ro"); #red dots on the plot
plt.ylabel("e^-x^2 [arb. units]");
plt.xlabel("x [cm]")
plt.title("Plot of e^-x^2")
plt.axis([-3,3,0,1.05])
plt.show()

In [None]:
x = np.linspace(-3,3,100)
y = np.exp(-x**2)
fig = plt.figure(figsize=(4,3), dpi=200)
plt.plot(x,y,color='k',marker='o'); #black dots and a line on the plot
plt.ylabel("e^-x^2 [arb. units]");
plt.xlabel("x [cm]")
plt.title("Plot of e^-x^2")
plt.axis([-3,3,0,1.05])
plt.show()

There are many options you can adjust for the plot.  For a full list, see http://matplotlib.org/users/pyplot_tutorial.html

Here are some more examples. One of the things you'll notice is that in matplotlib you can include LaTeX mathematics in your labels by enclosing it in dollar signs.  In LaTeX to use Greek letters you use a backslash before the name of the letter, and other usually obvious characters.  To find out how to do anything with LaTeX, just Google it.

In [None]:
x = np.linspace(-3,3,100)
y = np.exp((-x**2)/2)
fig = plt.figure(figsize=(4,3), dpi=200)
plt.plot(x,y,color='y',marker="*", linewidth=0.5); 
plt.ylabel("$e^{-\\Omega x^2}$ [arb. units]");
plt.xlabel("x (cm)")
plt.title("Plot of $e^{-\\frac{x^2}{2}}$")
plt.axis([-3,3,0,1.05])
plt.show()

In [None]:
x = np.linspace(-3,3,100)
y = np.exp(-x**2)
fig = plt.figure(figsize=(4,3), dpi=200)
plt.plot(x,y,marker="",linewidth=1,color="maroon"); 
plt.ylabel("$e^{-x^2}$ [arb. units]");
plt.xlabel("x [cm]")
plt.title("Plot of $e^{-x^2}$")
plt.axis([-3,3,0,1.05])
plt.show()

### Plotting multiple lines

You can also plot multiple lines on a plot. If you include a `label` parameter for each line, invoking the `legend` command puts a legend on the plot.

In [None]:
x = np.linspace(-3,3,100)

# Line 0
y = np.exp(-x**2)
plt.plot(x,y,marker="", color="r", 
         linestyle="--", label=r"$\sigma^2 = 1$"); 

# Line 1
y1 = np.exp(-x**2/2)
plt.plot(x,y1,color="blue", 
         label=r"$\sigma^2 = 2$")

# Line 2
y2 = np.exp(-x**2/4)
plt.plot(x,y2,color="black", 
         marker = "+", label=r"$\sigma^2 = 4$")

# Add labels, title, legend, etc.
# (raw strings, r"...", keep the LaTeX backslashes from being read as escape sequences)
plt.ylabel(r"$e^{-x^2/\sigma^2}$ (arb units)");
plt.xlabel("x (cm)")
plt.title(r"Plot of $e^{-x^2/\sigma^2}$")
plt.legend(loc='center')
plt.axis([-3,3,0,1.05])
plt.show()

### Hacking time!

Create a plot with both $\sin(x)$ and $\cos(x)$ from $-2\pi$ to $2\pi$ with the following formatting rules:

* Dashed red line for sine.
* Solid blue line for cosine.
* Legend located in the lower left corner.
* Label the axes with $x$ and $f(x)$
* Add grid lines.

Let's try some other types of plots.

In [None]:
y = np.array([45, 35, 20])
mylabels = ["Bananas", "Coconuts", "Grapes"]
plt.pie(y, labels = mylabels)
plt.show() 

In [None]:
plt.pie(y, labels = mylabels, shadow=True, explode=(0,0.1,0))
plt.show() 

In [None]:
points = np.random.randn(50000)
plt.hist(points)
plt.show()

### Hacking time.

How can you improve this histogram plot?  What options are there?
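
As a starting point for that discussion, here is a sketch with a few commonly adjusted options: more bins, normalization to a probability density, and labeled axes. The seeded generator and the specific styling values are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)            # seeded for reproducibility
points = rng.standard_normal(50000)

fig, ax = plt.subplots()
# bins: number of bars; density=True rescales so the total area is 1
counts, edges, patches = ax.hist(points, bins=50, density=True,
                                 color="steelblue", edgecolor="white", alpha=0.8)
ax.set_xlabel("value")
ax.set_ylabel("probability density")
ax.set_title("Histogram of 50,000 standard normal samples")
```

With `density=True` the histogram can be compared directly against the normal probability density curve.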

Subplots can also be useful. What we have to do is to create "axes" first, and then plot within each of the axes. The plots are then created using methods of each `ax` object.

In [None]:
# First create some toy data:
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)

In [None]:
# Create just a figure and only one subplot
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Simple plot')
plt.show()

In [None]:
# Create two subplots and unpack the output array immediately
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
fig.suptitle('Sharing Y axis')
ax1.plot(x, y)
ax1.set_title('This is the left graph',size=10)
ax2.scatter(x, y, color='r',marker='*')
ax2.set_title('This is the right graph',size=10)
plt.show()

In [None]:
# Create four polar axes and access them through the returned array
fig, axs = plt.subplots(2, 2, subplot_kw=dict(polar=True))
axs[0,0].plot(x, y)
axs[1,1].scatter(x, y,s=2,c='red')
plt.show()

Fancier plots are also possible.  We'll introduce whatever you need as we go.

If you ever want to do something fancy in Matplotlib, try starting with one of these examples: https://matplotlib.org/tutorials/index.html

What about trying other plot styles?  We can do this by calling `matplotlib.style.use(...)`.  Let's try the `ggplot` style that looks like the ggplot2 default style from R.

In [None]:
import matplotlib
matplotlib.style.use('ggplot')

In [None]:
x = np.linspace(-3,3,100)
y = np.exp(-x**2)
plt.plot(x,y); 
plt.ylabel("$e^{-x^2}$ [arb. units]");
plt.xlabel("x [cm]")
plt.title("Plot of $e^{-x^2}$")
plt.axis([-3,3,0,1.05])
plt.show()

You can find the list of matplotlib styles [here](https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html).

In [None]:
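# Uncomment one line at a time to try a few styles.
#matplotlib.style.use('fivethirtyeight')
#matplotlib.style.use('dark_background')
# Note: the seaborn styles were renamed in Matplotlib 3.6; on older versions use 'seaborn-white'
matplotlib.style.use('seaborn-v0_8-white')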
x = np.linspace(-3,3,100)
y = np.exp(-x**2)
plt.plot(x,y); 
plt.ylabel("$e^{-x^2}$ [arb. units]");
plt.xlabel("x [cm]")
plt.title("Plot of $e^{-x^2}$")
plt.axis([-3,3,0,1.05])
plt.show()

In [None]:
#And a bonus style called a bit differently
with plt.xkcd():
    x = np.linspace(-3,3,100)
    y = np.exp(-x**2)
    plt.plot(x,y); 
    plt.ylabel("$e^{-x^2}$ [arb. units]");
    plt.xlabel("x [cm]")
    plt.title("Plot of $e^{-x^2}$")
    plt.axis([-3,3,0,1.05])
    plt.show()

### Plots with error bars

In [None]:
x = np.linspace(-3,3,40)
y = np.exp(-x**2)
fig = plt.figure(figsize=(4,3),dpi=100)
plt.errorbar(x,y,yerr=np.abs(0.2*(1-y)),capsize=2.5,elinewidth=0.5,ecolor='gray');
plt.ylabel("Some data [arb. units]");
plt.xlabel("x [cm]")
plt.title("Plot of some data that looks like $e^{-x^2}$, with errors")
plt.axis([-3,3,0,1.05])
plt.show()

**Hacking:** Given the data below: 
1. On the same plot, plot two *scatter* plots with error bars in x and y in different colors.  Include a legend. Change the marker and the error bar properties.
2. On two different vertically aligned subplots, plot scatter plots with error bars. Label the individual plots with the data series name. Make the error bar style different from the plot above.

In [None]:
lenp = 20
x = np.random.rand(lenp)
xerr = 0.07*np.ones(lenp) 
y1 = np.log(x)+0.3*np.random.rand(lenp) #Data series A
y2 = 4*np.sin(4*x)+0.3*np.random.rand(lenp)  #Data series B
yerr = 0.5*np.linspace(1,2,lenp)*x


## Pandas, the data science package

#### 1. Creating and loading data with ``pandas``

#### 2. Cleaning and manipulating data with ``pandas``

#### 3. Graphing data with ``pandas`` and ``matplotlib`` 

## 1. Loading data with ``pandas``

With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis.

### Loading Data with Pandas

In [None]:
import numpy as np
import pandas as pd  # we use it a lot, import under a shortened name, using "import as"
import matplotlib.pyplot as plt  # we will do some plotting.
%matplotlib inline

## Pandas

Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in Pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn. 

Two primary components are `Series` and `DataFrame`. The pandas `Series` object can be interpreted as an enhanced numpy 1D array and the pandas `DataFrame` as an enhanced numpy 2D array. 


In [None]:
sr = pd.Series([1, 3, 5, np.nan, 6, 8]) 
sr

NaNs are used to indicate missing data; *pandas methods such as `mean()` and `sum()` automatically skip them*, whereas plain numpy functions propagate the NaN into the result unless you use the NaN-aware variants (e.g. `np.nanmean`).  This is very useful if you only have partial or messy data.
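
The difference in how missing values are treated can be checked with the same numbers as in the series above. This is a small sketch; the variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

values = [1, 3, 5, np.nan, 6, 8]
arr = np.array(values)
sr = pd.Series(values)

numpy_mean = arr.mean()          # nan: the missing value propagates
numpy_nanmean = np.nanmean(arr)  # NaN-aware variant skips it
pandas_mean = sr.mean()          # pandas skips NaN by default
n_valid = sr.count()             # number of non-missing entries
print(numpy_mean, numpy_nanmean, pandas_mean, n_valid)
```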

The main difference is that pandas series and dataframes have *explicit* indices, while numpy arrays have implicit indexing.  This means the indices can be any set of labels, ordered any way one wants.

In [None]:
labels = np.arange(0,20,2)

In [None]:
somet = pd.Series(np.exp(-0.5*labels),index=labels)
somet

The ``head()`` and ``tail()`` methods show us the first and last rows of the data

In [None]:
somet.head()

In [None]:
somet.tail()

Series can be referenced like dictionaries, using the index labels as keys.

In [None]:
somet[4]  # using the key

And like arrays (using slice notation for the positions, not the index labels):

In [None]:
somet.iloc[3] # iloc is integer (position-based) location; note the square brackets.

In [None]:
somet.iloc[2:7]
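
Since the labels of this series are themselves integers, it is easy to confuse label and position. The sketch below rebuilds the same kind of series and uses the explicit accessors `loc` (by label) and `iloc` (by position) side by side.

```python
import numpy as np
import pandas as pd

labels = np.arange(0, 20, 2)                    # labels 0, 2, 4, ..., 18
s = pd.Series(np.exp(-0.5 * labels), index=labels)

by_label = s.loc[4]           # the entry whose LABEL is 4
by_position = s.iloc[2]       # the entry at POSITION 2 (the same entry here)

label_slice = s.loc[2:6]      # label slices include both ends: labels 2, 4, 6
position_slice = s.iloc[2:6]  # position slices exclude the end: positions 2..5
print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index))
```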

When there is no chance of misinterpretation, we can use slice notation directly without using `iloc`; integer slices are then interpreted by position.

In [None]:
somet[2:7]  

Pandas has many other attributes and methods.

In [None]:
print(somet.is_monotonic_decreasing)
print(somet.is_unique)

In [None]:
somet.argmin()

We can operate on series like an array with numpy element-wise functions, using the `apply` method.

In [None]:
somet.apply(np.arccos)
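
As a small aside, numpy element-wise functions can usually also be called on a series directly. The sketch below, which rebuilds the same series, suggests that both routes give the same values and keep the index.

```python
import numpy as np
import pandas as pd

labels = np.arange(0, 20, 2)
somet = pd.Series(np.exp(-0.5 * labels), index=labels)

via_apply = somet.apply(np.arccos)  # calls the function through apply
direct = np.arccos(somet)           # numpy ufunc applied to the whole series

print(type(direct))                     # still a pandas Series
print(np.allclose(via_apply, direct))   # same values either way
```

The direct call is typically faster on large series because it operates on the whole array at once.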

Series objects are frequently used for time series data in economics and statistics. In general, pandas "feels" more natural for database-like data (e.g. csv, excel, and sql files), whereas numpy "feels" more natural for numeric processing (e.g. numerical arrays, images, etc.). You can do many of the same things in both libraries; you can even create pandas data frames from numpy arrays and vice versa.

In [None]:
npa = np.array(somet)  # convert back to numpy
npa

In [None]:
pd.Series(npa)
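
One thing to watch in this round trip is that the numpy array holds only the values, so the original labels are lost. A minimal sketch, using the `to_numpy()` method as an alternative to `np.array(...)`; the names `default_index` and `restored` are illustrative.

```python
import numpy as np
import pandas as pd

labels = np.arange(0, 20, 2)
somet = pd.Series(np.exp(-0.5 * labels), index=labels)

npa = somet.to_numpy()  # values only; the labels are not carried along

default_index = pd.Series(npa)               # gets fresh labels 0, 1, ..., 9
restored = pd.Series(npa, index=somet.index) # reattach the original labels

print(list(default_index.index))
print(restored.equals(somet))
```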

**Play around**: What other methods can you find to operate on pandas series?

### DataFrames

A DataFrame is the more common way to interact with data using pandas.  It's essentially an enhanced 2D array, again with indices.  The difference between a dataframe and a 2D array is that in numpy, ALL the elements need to be the same data type, whereas in a DataFrame, only the elements within each column need to share a data type. Different columns can be different data types.

Operationally, we can create a dataframe using a dictionary of series, arrays, or lists that are all *the same size*.

In [None]:
t = np.linspace(0,1.0,10)
x = np.exp(t)
y = np.exp(-t)
g = [chr(int(9*i)+65) for i in x]   # map each value of x to an uppercase letter

In [None]:
df = pd.DataFrame({'t':t,'x':x,'y':y,'g':g})
df

Each column of a dataframe is a series.

In [None]:
type(df['x'])

It contains different data types

In [None]:
df.dtypes
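
To make the contrast with numpy concrete, here is a small sketch with made-up values. A numpy array built from mixed values is coerced to a single type, while a dataframe keeps one type per column.

```python
import numpy as np
import pandas as pd

# numpy coerces every element to one common type (here, strings)
mixed = np.array([1.5, 2, 'J'])
print(mixed.dtype)

# a dataframe keeps a separate data type for each column
small_df = pd.DataFrame({'x': [1.5, 2.0], 'n': [1, 2], 'g': ['J', 'K']})
print(small_df.dtypes)
```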

`head()` and `tail()` still work.

In [None]:
df.head()

The same goes for `tail()`:

In [None]:
df.tail()

We can set one of the columns to be the indices

In [None]:
df_labeled = df.set_index('g')

We can select rows by index label or by location

In [None]:
df_labeled.iloc[1:4]

In [None]:
df_labeled.loc['O']

Some complex selection options! What do these do?

In [None]:
df_labeled.loc['R':]

In [None]:
df_labeled.loc['O':'V',['t','x']]
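
A small sketch of the same kinds of selection on a made-up frame (the values and the four labels are hypothetical), which may help when working out what the cells above return. The key point is that label slices with `loc` include both endpoints.

```python
import pandas as pd

# hypothetical values, indexed by single-letter labels
small_labeled = pd.DataFrame(
    {'t': [0.0, 0.5, 1.0, 1.5], 'x': [1.0, 2.0, 3.0, 4.0]},
    index=['J', 'O', 'R', 'V'],
)

row = small_labeled.loc['O']                # one row, returned as a Series
from_r = small_labeled.loc['R':]            # row R and everything after it
block = small_labeled.loc['O':'R', ['t']]   # rows O through R, column t only

print(list(from_r.index))
print(block)
```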

There are a lot of `pandas` methods!

In [None]:
df.mean()

**Play around**. What do the methods `describe()`, `sort_values()`, `sort_index()` and `transpose()` do?

More commonly, we import data into dataframes from files. We can use the ``read_csv`` function to read comma-separated-value data. You will need to specify the correct directory that you put the file in.

In [None]:
data = pd.read_csv('HCEPDB_100K_cleaned.csv')
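
If you want to experiment with ``read_csv`` without a file on disk, it also accepts file-like objects. The sketch below uses a few made-up rows (not taken from the real data set) and the `index_col` argument to choose the index while reading.

```python
import io
import pandas as pd

# hypothetical CSV text standing in for a file on disk
csv_text = """id,mass,pce
10,300.5,0.0
11,412.1,3.2
12,388.9,5.1
"""

small = pd.read_csv(io.StringIO(csv_text), index_col='id')
print(small.shape)
print(small)
```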

This is a selection of the Harvard Clean Energy Project Database of computed quantum mechanical properties of a very large database of molecules.  The non-obvious ones are:

   * SMILES_str = text representation of the molecules.
   * PCE = Power Conversion Efficiency
   * E_HOMO = Energy of the highest occupied energy level
   * E_LUMO = Energy of the lowest unoccupied energy level
   * VOC = Open circuit voltage
   * JSC = Short circuit current

In [None]:
data.head()

In [None]:
data.tail()

The ``shape`` attribute shows us the number of rows and columns:

In [None]:
data.shape

The ``columns`` attribute gives us the column names

In [None]:
data.columns

The ``index`` attribute gives us the row labels

In [None]:
data.index

Like series, we can use our own indices, instead of the integer labels that were used. 
Let's make our ``id`` column the ``index``

In [None]:
data.set_index('id',inplace=True)  # inplace means to edit THIS dataframe, not create a new one
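
For comparison, here is a sketch of the non-inplace form on a small made-up frame (the names `frame`, `indexed`, and `restored` are illustrative). Without `inplace=True`, `set_index` returns a new dataframe and leaves the original untouched, and `reset_index` reverses the operation.

```python
import pandas as pd

# hypothetical values
frame = pd.DataFrame({'id': [10, 11, 12], 'mass': [300.5, 412.1, 388.9]})

indexed = frame.set_index('id')   # returns a new dataframe
print(list(frame.columns))        # the original still has its id column
print(indexed.index.name)

restored = indexed.reset_index()  # moves the index back into a column
print(restored.equals(frame))
```

One practical consequence of `inplace=True` in a notebook is that rerunning the cell fails, since the ``id`` column is no longer there the second time.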

Now let's revisit the ``data.index``

In [None]:
data.index

and the data shape

In [None]:
data.shape

View it with head again:

In [None]:
data.head()

In [None]:
data.tail()

The ``dtypes`` attribute gives the data types of each column:

In [None]:
data.dtypes

## More manipulating dataframes with ``pandas``

Here we'll cover some key features of manipulating data with pandas

Access columns by name using square-bracket indexing, like a dictionary

In [None]:
data['mass']

Mathematical operations on columns happen *element-wise* (note 18.01528 is the molar mass of H$_2$O in g/mol):

In [None]:
data['mass'] / 18.01528

Columns can be created (or overwritten) with the assignment operator.
Let's create a ``mass_ratio_H2O`` column with the mass ratio of each molecule to H$_2$O

In [None]:
data['mass_ratio_H2O'] = data['mass'] / 18.01528

In [None]:
data.head()

In preparation for grouping the data, let's bin the molecules by their molecular mass. For that, we'll use ``pd.cut``, with the `10` below being the number of bins (check `pd.cut?` for the details). There are a variety of ways to express the bins.

In [None]:
data['mass_group'] = pd.cut(data['mass'], 10)

In [None]:
pd.cut?
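
As a sketch of a few of those ways, using made-up masses: an integer asks for that many equal-width bins, a list gives explicit bin edges, and `labels` names the resulting groups. The edges and label names below are arbitrary choices for illustration.

```python
import pandas as pd

mass = pd.Series([150.0, 250.0, 350.0, 450.0, 550.0])  # hypothetical masses

equal_width = pd.cut(mass, 2)                        # two equal-width bins
explicit = pd.cut(mass, bins=[100, 300, 500, 700])   # bins from explicit edges
named = pd.cut(mass, bins=[100, 300, 500, 700],
               labels=['light', 'medium', 'heavy'])  # same bins, with names

print(explicit)
print(list(named))
```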

### Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at *value counts* and the basics of *group-by* operations.

#### Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data.
We'll start with value counts.

``value_counts`` returns the number of times each unique value occurs in a series.

We can use it, for example, to break down the molecules by their mass group that we just created:

In [None]:
data['mass_group'].value_counts()  # Series method; top-level pd.value_counts is deprecated in recent pandas

What happens if we try this on a (theoretically) continuous valued variable?

In [None]:
data['mass'].value_counts()
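
For a continuous variable, counting exact values is rarely informative because almost every value is unique. One option, sketched here on made-up masses, is the `bins` argument of the `value_counts` method, which groups the values into equal-width intervals before counting.

```python
import pandas as pd

mass = pd.Series([150.2, 151.7, 249.9, 250.1, 390.4, 391.0])  # hypothetical

exact = mass.value_counts()         # every value is unique, so each count is 1
binned = mass.value_counts(bins=3)  # counts per equal-width interval

print(exact)
print(binned)
```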

We can do a little data exploration with this to look for 0s in columns.  Here, let's look at the power conversion efficiency (``pce``)

In [None]:
data['pce'].value_counts()
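
The other tool mentioned at the start of this section is the *group-by* operation. The following is only a minimal sketch on a made-up frame that reuses the column names ``mass``, ``pce``, and ``mass_group``: split the rows by group, then aggregate each group separately.

```python
import pandas as pd

# hypothetical values; only the column names mirror the real data set
toy = pd.DataFrame({
    'mass': [150.0, 160.0, 450.0, 470.0, 480.0],
    'pce': [0.0, 2.0, 4.0, 5.0, 6.0],
})
toy['mass_group'] = pd.cut(toy['mass'], 2)

# observed=True keeps only the groups that actually contain rows
mean_pce = toy.groupby('mass_group', observed=True)['pce'].mean()
counts = toy.groupby('mass_group', observed=True).size()

print(mean_pce)
print(counts)
```

Here `size()` gives the same information as `value_counts` on the grouping column, while `mean()` shows how any other aggregate can be computed per group.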

## Hacking time
Play around with the data set.  What new Pandas functions can you find to operate on it? Post interesting ones into chat.

## Visualizing data with ``pandas``

Of course, looking at tables of data is not very intuitive.
Fortunately Pandas has many useful plotting functions built-in, all of which make use of the ``matplotlib`` library to generate plots.

Now we can simply call the ``plot()`` method of any series or dataframe (it's a method of the dataframe!) to get a reasonable view of the data. We'll get into the details of `matplotlib` next week.

In [None]:
data['mass_group'].value_counts().plot(kind='bar')  # bar chart of the number of molecules per mass group

### Other plot types

Pandas supports a range of other plotting types; you can find these by using the <TAB> autocomplete on the ``plot`` method:

In [None]:
data['mass'].plot.hist(bins=10)  # histogram of the masses, using 10 bins

Let's make some nicer plots.  Let's start with PCE (power conversion efficiency) vs HOMO (highest occupied molecular orbital) energy.

In [None]:
plt.rcParams['agg.path.chunksize'] = 10000  
#we'll need this because of some size issues with some plots.

To make it a bit faster, let's take a sample of the data

In [None]:
data_sample = data.sample(frac=0.1)
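
Note that `sample` draws a different random subset each time the cell runs. If you want the same subset every time, a fixed `random_state` should do it. A sketch on a made-up frame (the seed value 0 is arbitrary):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'pce': np.arange(100.0)})  # hypothetical data

# the same seed gives the same rows on every run
sample_a = toy.sample(frac=0.1, random_state=0)
sample_b = toy.sample(frac=0.1, random_state=0)

print(sample_a.shape)
print(sample_a.equals(sample_b))
```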

In [None]:
data_sample.shape

In [None]:
data_sample.plot.scatter('pce', 'e_homo_alpha')

This thing is UGLY! Let's see if we can't pretty it up. The first thing to know is that the `plot.XXX` methods return a matplotlib `Axes` object that we can modify by calling certain methods on it. Remember you can always use the Jupyter notebook tab completion after an object to find out what methods are available.

In [None]:
data_sample.plot?

There are several things we can do to make the plot look nicer, such as putting it on a grid, making the axis labels more accurate, increasing the figure size, and making the points semi-transparent.

In [None]:
p_vs_homo_plt = data_sample.plot('pce', 'e_homo_alpha', kind='scatter', figsize=(6,6), alpha = 0.05)
p_vs_homo_plt.set_xlabel('PCE')
p_vs_homo_plt.set_ylabel('$E_{HOMO}$')
p_vs_homo_plt.set_title('Photoconversion Efficiency vs. HOMO energy')
p_vs_homo_plt.grid()

## Hacking time
Play around with plots of the data set.  What patterns can you find when you relate the data elements to each other via plotting?

### The pandas visualization tools documentation is really good:
* [docs here](https://pandas.pydata.org/pandas-docs/stable/visualization.html)

One thing that is very useful is a scatterplot matrix to show the relationship between variables.  Let's make one now.  Be patient as this makes a lot of plots!

In [None]:
pd.plotting.scatter_matrix(data_sample, figsize=(9,9), alpha=0.05)

That's . . . a lot of data! But it does give us a quick overview of the relationship between all the variables in the data frame.

In [None]:
data_sample.head()

OK, moving on, let's look at making density plots.  These show the probability density of particular values for a variable.  Notice how we used a different way of specifying the plot type.

In [None]:
data_sample['pce'].plot(kind='kde')

Let's plot the kde on top of the histogram.  The key here is to use a secondary axis.  First we save the plot object to `ax` then pass that to the second plot. We'll see some easier methods soon . . . 

In [None]:
ax = data_sample['pce'].plot(kind='hist')
data_sample['pce'].plot(kind='kde', ax=ax, secondary_y=True)

## Seaborn for fun and pretty pictures!
Matplotlib is great for basic scatter plots, bar plots, time series, etc.  But if we want to do really fancy plots, we need to look to other tools like Seaborn.  This is a super quick intro to seaborn. If you don't have seaborn, you can install it with `conda install seaborn` then import it. Seaborn is set up to work with Pandas, whereas Matplotlib was never really designed for dataframes.

In [None]:
import seaborn as sns

We'll make two different contour / surface plots.
* Basic contour plot
* Density plot

Examples roughly taken from [here](https://python-graph-gallery.com/1136-2/).

In [None]:
sns.set_style('white') # set some defaults

In [None]:
ds2 = data_sample.sample(frac=0.2) # let's downsample a bit more for speed purposes.

In [None]:
sns.kdeplot(x=ds2['pce'], y=ds2['e_homo_alpha'])

In [None]:
sns.kdeplot(x=ds2['pce'], y=ds2['e_homo_alpha'], cmap='Reds', fill=True, bw_method=0.15)  # fill replaces the deprecated shade argument

Let's check out some other options that make things much easier than before.

In [None]:
sns.displot(data_sample['pce'],kde=True)

### Hacking time!
What other things can you do with `displot`? What other seaborn plots can you find, in both 1D and 2D?