<a href="https://colab.research.google.com/github/lydiahsu/SHP_Fall_2019/blob/master/Intro_Notebooks_and_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
Python is a general-purpose programming language that prioritizes programmer efficiency, ease of use, and readability. Its flexibility makes it a powerful tool for a diverse range of applications, from web development to scientific computing. Python has also become a standard tool for modern machine learning, in large part because of its wide range of cutting-edge libraries. For this reason, we require that you use Python and become familiar with a small number of these libraries. In this tutorial, I will introduce the basic concepts necessary to make the transition to Python from another high-level programming language (R, MATLAB, Julia, etc.).

# The Virtual Machine
When you open a notebook in Colab, the resources used are hosted on a virtual machine. Prefacing a line in a code cell with the "!" character lets you execute basic Unix-style commands on the command line. For example, we can explore the machine's file system and available resources.

In [0]:
!df -h
!lscpu
!free -m

Filesystem      Size  Used Avail Use% Mounted on
overlay          49G   25G   22G  54% /
tmpfs            64M     0   64M   0% /dev
tmpfs           6.4G     0  6.4G   0% /sys/fs/cgroup
tmpfs           6.4G  8.0K  6.4G   1% /var/colab
/dev/sda1        55G   27G   29G  48% /etc/hosts
shm             6.0G  4.0K  6.0G   1% /dev/shm
tmpfs           6.4G     0  6.4G   0% /proc/acpi
tmpfs           6.4G     0  6.4G   0% /proc/scsi
tmpfs           6.4G     0  6.4G   0% /sys/firmware
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:            0
CPU MHz:             2200.000
BogoMIPS:            4400.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:     

# Creating A Working Environment On Your Virtual Machine
The availability of Unix-style commands means that we can use typical Python tools like pip to manage the packages installed in our Python environment. Fortunately, Colab is oriented towards data science, so most of the packages we will need in this course come pre-installed, as the listing below shows.

In [0]:
!pip list

If at any point you need to install a package which is not already available, this can be done from a code cell as follows:

In [0]:
!pip install bokeh

# Basic Python

## Data Types & Operations

**Integers & Floats**

Python is a dynamically typed language and mathematical operations work exactly as you would expect. 

In [0]:
x = 3
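print(x, type(x))  # Prints "3 <class 'int'>"
print(x + 1)   # Addition; prints "4"
print(x - 1)   # Subtraction; prints "2"
y = 2.5
print(type(y)) # Prints "<class 'float'>"
print(y * 2)   # Multiplication; prints "5.0"
print(x / 2)   # Division; prints "1.5"
print(y ** 2)  # Exponentiation; prints "6.25"
print(type(y*x))  # Prints "<class 'float'>"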
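
Division is the one operator that may surprise readers coming from Python 2 or from languages with integer division. The following is a minimal sketch, assuming Python 3; the variable names are illustrative and not used elsewhere in this notebook.

```python
# Assumes Python 3, where "/" is always true division.
n = 7
true_quotient = n / 2     # true division returns a float
floor_quotient = n // 2   # floor division keeps the int type
remainder = n % 2         # modulo
print(true_quotient, floor_quotient, remainder)
print(type(true_quotient), type(floor_quotient))

# Floor division rounds toward negative infinity, not toward zero.
negative_floor = -7 // 2
print(negative_floor)
```

Use `//` whenever the result will serve as an index or a count, since `/` produces a float even when both operands are integers.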

**Booleans**

Python implements all of the standard logical operators for Boolean variables, but it uses English keywords rather than symbols to enhance readability.

In [0]:
t, f = True, False
print(type(t)) # Prints "<class 'bool'>"
print(t and f) # Logical AND; prints "False"
print(t or f)  # Logical OR; prints "True"
print(not t)   # Logical NOT; prints "False"
print(t != f)  # Logical XOR; prints "True"

**Strings**

Strings in Python are immutable. However, string variables are easily reassigned, which can make them appear mutable.

In [0]:
a = "Foo"  # the variable 'a' points to the string "Foo"
b = a  # 'b' points to the same string "Foo" that 'a' points to
a = a + a  # 'a' now points to the new string "FooFoo",
print(a, b) # but b still points to the old "Foo"

String objects also provide a number of useful methods. Because strings are immutable, each method returns a new string based on the current value rather than modifying the original.

In [0]:
s = "hello"
print(s.capitalize())  # Capitalize a string; prints "Hello"
print(s.upper())       # Convert a string to uppercase; prints "HELLO"
print(s.rjust(7))      # Right-justify a string, padding with spaces; prints "  hello"
print(s.center(7))     # Center a string, padding with spaces; prints " hello "
print(s.replace('l', '(ell)'))  # Replace all instances of one substring
                                # with another; prints "he(ell)(ell)o"
print('  world '.strip())  # Strip leading and trailing whitespace; prints "world"

Additionally, Python provides string formatting, which can be used to substitute variable values into a string in place of each {} placeholder.

In [0]:
print("The string variable 'a' contains {} characters.".format(len(a)))
print("The value of the {} variable 'x' is {}".format(type(x), x))
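
The same substitutions can also be written with f-strings, which embed expressions directly in the string literal. This is a small sketch assuming Python 3.6 or later; `word`, `count`, and `ratio` are illustrative names, not variables from the cells above.

```python
# Illustrative names; assumes Python 3.6+ for f-string support.
word = "FooFoo"
count = 3

with_format = "'{}' has {} characters; count = {}".format(word, len(word), count)
with_fstring = f"'{word}' has {len(word)} characters; count = {count}"
print(with_fstring)
print(with_format == with_fstring)  # both styles build the same string

# Format specifications work in both styles, e.g. two decimal places.
ratio = 2 / 3
rounded = f"{ratio:.2f}"
print(rounded)
```

Both styles are interchangeable; this notebook keeps using `.format` for consistency.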

# Containers
Python includes several built-in container types including lists, dictionaries, sets, and tuples.

**Lists**

A list is the Python equivalent of an array, but is resizable and can contain elements of different types. Note that list indices start at zero and that negative indices count backwards from the end of the list, whereas in R a negative index excludes that element.

In [0]:
xs = [3, 1, 2]   # Create a list
print(xs[0], xs[2])
print(xs[-1])  # print(xs[len(xs)-1])

In [0]:
xs[2] = 'foo'    # Lists can contain elements of different types
print(xs)

In [0]:
xs.append('bar') # Add a new element to the end of the list
print(xs)

In [0]:
x = xs.pop()     # Remove and return the last element of the list
print(x, xs)

**Slicing**

Indexing a contiguous range of elements is referred to as "slicing". It works much as you would expect after using R or MATLAB, except that indices start at zero and the end index is exclusive.

In [0]:
nums = list(range(5)) # range is a built-in function that creates a sequence of integers
print(nums)           # Prints "[0, 1, 2, 3, 4]"
print(nums[2:4])      # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"
print(nums[2:])       # Get a slice from index 2 to the end; prints "[2, 3, 4]"
print(nums[:2])       # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"
print(nums[:])        # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"
print(nums[:-1])      # Slice indices can be negative; prints "[0, 1, 2, 3]"
nums[2:4] = [8, 9]    # Assign a new sublist to a slice
print(nums)           # Prints "[0, 1, 8, 9, 4]"
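
Slices also accept an optional third component, the step, and slicing a list always produces a new list. The sketch below illustrates both points; `letters` is a hypothetical list introduced only for this example.

```python
letters = ['a', 'b', 'c', 'd', 'e']

every_other = letters[::2]        # start:stop:step with a step of 2
reversed_letters = letters[::-1]  # a negative step walks backwards
shallow_copy = letters[:]         # slicing a list creates a new list

shallow_copy[0] = 'z'             # does not affect letters
print(every_other)
print(reversed_letters)
print(letters[0], shallow_copy[0])
```

Keep the copying behavior in mind: slices of numpy arrays, covered later in this notebook, behave differently.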

**For Loops & List Comprehensions**

You can loop over the elements of a list like this:

In [0]:
animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print(animal)

If you want access to the index of each element within the body of a loop, use the built-in enumerate function.

In [0]:
animals = ['cat', 'dog', 'monkey']
for idx, animal in enumerate(animals):
    print('#{}: {}'.format(idx + 1, animal))

When programming, frequently we want to transform one type of data into another. As a simple example, consider the following code that computes square numbers.

In [0]:
nums = [0, 1, 2, 3, 4]
squares = []
for x in nums:
    squares.append(x ** 2)
print(squares)

You can make this code simpler using a list comprehension.

In [0]:
squares = [x ** 2 for x in nums]
print(squares)

List comprehensions can also contain conditions.

In [0]:
nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print(even_squares)

**Dictionaries**

A dictionary stores (key, value) pairs, similar to a Map in Java or an object in JavaScript. You can use it like this:

In [0]:
d = {'cat': 'feline', 'dog': 'canine'}  # Create a new dictionary with some data
print(d['cat'])                         # Get an entry from a dictionary; prints "feline"
print('cat' in d)                       # Check if a dictionary has a given key; prints "True"

In [0]:
d['bear'] = 'ursa'    # Set an entry in a dictionary
print(d['bear'])     

In [0]:
print(d['monkey'])     # KeyError: 'monkey' not a key of d

In [0]:
print(d.get('monkey', 'N/A'))  
print(d.get('bear', 'N/A'))    

In [0]:
del d['bear']                # Remove an element from a dictionary
print(d.get('bear', 'N/A'))  # 'bear' is no longer a key; prints "N/A"
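
The cells above show that a missing key raises a KeyError unless `get` is used. As a rough summary, here are three common ways to handle a possibly missing key, plus `pop` as an alternative to `del`. The `sounds` dictionary is a made-up example, not data from this tutorial.

```python
sounds = {'cat': 'meow', 'dog': 'woof'}

# Option 1: test membership before indexing.
if 'cow' in sounds:
    print(sounds['cow'])

# Option 2: catch the KeyError raised by a missing key.
try:
    cow_sound = sounds['cow']
except KeyError:
    cow_sound = 'unknown'

# Option 3: get() with a default value.
fish_sound = sounds.get('fish', 'unknown')

# pop() removes a key and returns its value; the default avoids a KeyError.
removed = sounds.pop('dog', None)
print(cow_sound, fish_sound, removed, sounds)
```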

**For Loops & Dictionary Comprehensions**

It is easy to iterate over the keys in a dictionary:

In [0]:
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    print('A {} has {} legs'.format(animal, d[animal]))

If you want access to keys and their corresponding values, use the items method:

In [0]:
for animal, legs in d.items():
    print('A {} has {} legs'.format(animal, legs))

Dictionary comprehensions: These are similar to list comprehensions, but allow you to easily construct dictionaries. For example:

In [0]:
even_num_to_square = {x: x ** 2 for x in range(7) if x % 2 == 0}
print(even_num_to_square[4])
print(even_num_to_square)

**Sets**

A set is an unordered collection of distinct elements. As a simple example, consider the following:

In [0]:
animals = {'cat', 'dog'}
print('cat' in animals)   # Check if an element is in a set; prints "True"
print('fish' in animals)  # prints "False"

In [0]:
animals.add('fish')      # Add an element to a set
print('fish' in animals)  # prints "True"
print(len(animals))       # Number of elements in a set; prints "3"

In [0]:
animals.add('cat')       # Adding an element that is already in the set does nothing
print(len(animals))       
animals.remove('cat')    # Remove an element from a set
print(len(animals))
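
Beyond membership tests, sets support the usual algebraic operations. This is a brief sketch with two hypothetical sets, `pets` and `farm`; the results are sorted before printing because sets have no defined order.

```python
pets = {'cat', 'dog', 'fish'}
farm = {'dog', 'cow'}

both = pets & farm        # intersection
either = pets | farm      # union
only_pets = pets - farm   # difference
print(sorted(both), sorted(either), sorted(only_pets))

# discard() is like remove(), but does not raise if the element is absent.
pets.discard('horse')
print(len(pets))
```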

**For Loops & Set Comprehensions**

Loops: Iterating over a set has the same syntax as iterating over a list; however since sets are unordered, you cannot make assumptions about the order in which you visit the elements of the set:

In [0]:
animals = {'cat', 'dog', 'fish'}
for idx, animal in enumerate(animals):
    print('#{}: {}'.format(idx + 1, animal))

Set comprehensions: Like lists and dictionaries, we can easily construct sets using set comprehensions:

In [0]:
from math import sqrt
print({int(sqrt(x)) for x in range(30)})

**Tuples**

A tuple is in many ways similar to a list; one of the most important differences is that tuples can be used as keys in dictionaries and as elements of sets, while lists cannot. Here is a trivial example:

In [0]:
d = {(x, x + 1): x for x in range(10)}  # Create a dictionary with tuple keys
t = (5, 6)                              # Create a tuple
print(type(t))
print(d[t])
print(d[(1, 2)])

Additionally, tuples are immutable, which sets their behavior apart from lists. Attempting to assign to an element raises a TypeError:

In [0]:
t[0] = 1  # Raises TypeError: tuples do not support item assignment
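
Because a tuple cannot be changed in place, the usual pattern is to build a new tuple. The sketch below shows that pattern along with tuple unpacking, and shows why a list cannot serve as a dictionary key; all names are illustrative.

```python
point = (5, 6)
try:
    point[0] = 1                 # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

# Build a new tuple instead of modifying the old one.
moved = (1,) + point[1:]
# Unpacking assigns each element to its own name.
first, second = moved
print(moved, first, second)

# Lists are unhashable, so they cannot be used as dictionary keys.
list_key_allowed = True
try:
    bad = {[1, 2]: 'value'}
except TypeError:
    list_key_allowed = False
print(list_key_allowed)
```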

# Functions
Python functions are defined using the def keyword. For example:

In [0]:
def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

for x in [-1, 0, 1]:
    print(sign(x))

We will often define functions to take optional keyword arguments, like this:

In [0]:
def hello(name, loud=False):
    if loud:
        print('HELLO, {}!'.format(name.upper()))
    else:
        print('Hello, {}.'.format(name))

hello('Bob')
hello('Fred', loud=True)
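
Functions can also return several values at once by returning a tuple, which combines naturally with keyword arguments. `describe` below is a hypothetical helper written for this illustration, not a function used elsewhere in the course.

```python
def describe(values, scale=1.0):
    """Return the scaled minimum and maximum of a non-empty list."""
    scaled = [v * scale for v in values]
    return min(scaled), max(scaled)   # packed into a tuple


low, high = describe([3, 1, 2])                          # default scale
low_scaled, high_scaled = describe([3, 1, 2], scale=10)  # keyword argument
print(low, high, low_scaled, high_scaled)
```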

# Classes
Python has supported object-oriented programming since its inception. As such, the syntax for defining and using classes in Python is straightforward.

In [0]:
class Greeter:

    # Constructor
    def __init__(self, name):
        self.name = name  # Create an instance variable

    # Instance method
    def greet(self, loud=False):
        if loud:
            print('HELLO, {}!'.format(self.name.upper()))
        else:
            print('Hello, {}.'.format(self.name))

g = Greeter('Fred')  # Construct an instance of the Greeter class
g.greet()            # Call an instance method
g.greet(loud=True)   # Call an instance method
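
Two further class features come up often in library code: special methods such as `__repr__`, and inheritance. The following is a small sketch using two hypothetical classes, `Tally` and `EvenTally`, which are unrelated to the Greeter example.

```python
class Tally:
    """Hypothetical example class that keeps a running count."""

    def __init__(self, start=0):
        self.count = start

    def increment(self, step=1):
        self.count += step
        return self.count

    def __repr__(self):
        # Controls how an instance is displayed by print() and the notebook
        return 'Tally(count={})'.format(self.count)


class EvenTally(Tally):
    """Subclass that overrides one method and inherits the rest."""

    def increment(self, step=2):
        return super().increment(step)


tally = Tally()
tally.increment()
even_tally = EvenTally(10)
even_tally.increment()
print(tally, even_tally, isinstance(even_tally, Tally))
```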

# Numpy
Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. If you are already familiar with MATLAB, you might find this tutorial useful to get started with Numpy.

To use Numpy, we first need to import the numpy package. All of the following commands will work, but the first is commonly accepted as best practice, so we will use it:

In [0]:
import numpy as np  # use functions from numpy by calling 'np.func'
# import numpy  # use functions from numpy by calling 'numpy.func'
# from numpy import func  # imports the function 'numpy.func' into the namespace as 'func'
# from numpy import *  # imports the entire numpy namespace

**Arrays**

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can initialize numpy arrays from nested Python lists, and access elements using square brackets:

In [0]:
a = np.array([1, 2, 3])  # Create a rank 1 array
print(type(a), a.shape, a[0], a[1], a[2])
a[0] = 5                 # Change an element of the array
print(a)

In [0]:
b = np.array([[1,2,3],[4,5,6]])   # Create a rank 2 array
print(b)

In [0]:
print(b.shape)
print(b[0, 0], b[0, 1], b[1, 0])

In [0]:
a = np.zeros((2,2))  # Create an array of all zeros
print(a)

In [0]:
b = np.ones((1,2))   # Create an array of all ones
print(b)

In [0]:
c = np.full((2,2), 7) # Create a constant array
print(c)

In [0]:
d = np.eye(2)        # Create a 2x2 identity matrix
print(d)

In [0]:
e = np.random.random((2,2)) # Create an array filled with random values
print(e)

**Array Indexing**

Numpy offers several ways to index into arrays.

Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

In [0]:
# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]
print(b)




A slice of an array is a view into the same data, so modifying it will modify the original array.

In [0]:
print(a[0, 1])  
b[0, 0] = 77    # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])
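
If you need a slice that can be modified without touching the original array, take an explicit copy. The sketch below contrasts the two cases; `grid`, `view`, and `independent` are illustrative names, and `np.shares_memory` is used only to confirm which arrays share data.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
view = grid[:2, 1:3]                 # a view: shares memory with grid
independent = grid[:2, 1:3].copy()   # a copy: owns its own data

independent[0, 0] = -1               # leaves grid untouched
print(grid[0, 1])
view[0, 0] = 77                      # writes through to grid
print(grid[0, 1])
print(np.shares_memory(grid, view), np.shares_memory(grid, independent))
```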

You can also mix integer indexing with slice indexing. However, doing so will yield an array of lower rank than the original array. Note that this is quite different from the way that MATLAB handles array slicing.

In [0]:
# Create the following rank 2 array with shape (3, 4)
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)

The following cell shows several ways of accessing the data in the middle row of the array. Mixing integer indexing with slices yields an array of lower rank, while using only slices yields an array of the same rank as the original array.

In [0]:
row_r1 = a[1, :]    # Rank 1 view of the second row of a  
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
row_r3 = a[[1], :]  # Rank 2 copy of the second row of a (integer array indexing)
print(row_r1, row_r1.shape) 
print(row_r2, row_r2.shape)
print(row_r3, row_r3.shape)

We can make the same distinction when accessing columns of an array.

In [0]:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape)
print("\n", col_r2, col_r2.shape)

Integer array indexing: When you index into numpy arrays using slicing, the resulting array view will always be a subarray of the original array. In contrast, integer array indexing allows you to construct arbitrary arrays using the data from another array. Here is an example:

In [0]:
a = np.array([[1,2], [3, 4], [5, 6]])

# An example of integer array indexing.
# The returned array has shape (3,) and contains [1 4 5]
print(a[[0, 1, 2], [0, 1, 0]])

# The above example of integer array indexing is equivalent to this:
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))

In [0]:
# When using integer array indexing, you can reuse the same
# element from the source array:
print(a[[0, 0], [1, 1]])

# Equivalent to the previous integer array indexing example
print(np.array([a[0, 1], a[0, 1]]))

One useful trick with integer array indexing is selecting or mutating one element from each row of a matrix:

In [0]:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print(a)

In [0]:
# Create an array of indices
b = np.array([0, 2, 0, 1])

# Select one element from each row of a using the indices in b
print(a[np.arange(4), b])  # Prints "[ 1  6  7 11]"

In [0]:
# Mutate one element from each row of a using the indices in b
a[np.arange(4), b] += 10
print(a)

**Boolean array indexing:** 

Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition. Here is an example:

In [0]:
a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)  # Find the elements of a that are bigger than 2;
                    # this returns a numpy array of Booleans of the same
                    # shape as a, where each slot of bool_idx tells
                    # whether that element of a is > 2.

print(bool_idx)

In [0]:
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print(a[bool_idx])

# We can do all of the above in a single concise statement:
print(a[a > 2])
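
Boolean masks can be combined with `&` and `|`, and they can appear on the left-hand side of an assignment. This is a minimal sketch with illustrative names. Note that each comparison must be wrapped in parentheses, because `&` binds more tightly than the comparison operators.

```python
import numpy as np

values = np.array([[1, 2], [3, 4], [5, 6]])

mask = (values % 2 == 0) & (values > 2)   # even and greater than 2
print(values[mask])

clipped = values.copy()
clipped[clipped > 4] = 4      # assign to every element matching the mask
print(clipped)

n_selected = int(mask.sum())  # each True counts as 1
print(n_selected)
```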

**Datatypes**

Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:

In [0]:
x = np.array([1, 2])  # Let numpy choose the datatype
y = np.array([1.0, 2.0])  # Let numpy choose the datatype
z = np.array([1.0, 2.0], dtype=np.int64)  # Force a particular datatype

print(x.dtype, y.dtype, z.dtype)
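
An existing array can be converted to another datatype with `astype`, and arithmetic between integer and floating-point values promotes the result to a float type. The sketch below uses illustrative names; note that converting floats to integers truncates toward zero rather than rounding.

```python
import numpy as np

floats = np.array([1.7, -2.7])
truncated = floats.astype(np.int64)   # conversion truncates toward zero
print(truncated, truncated.dtype)

ints = np.array([1, 2])
mixed = ints + 0.5                    # result is promoted to a float dtype
print(mixed, mixed.dtype)
```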

**Array Math**

Basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the numpy module:

In [0]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.  8.]
#  [10. 12.]]
print(x + y)
print(np.add(x, y))

In [0]:
# Elementwise difference; both produce the array
# [[-4. -4.]
#  [-4. -4.]]
print(x - y)
print(np.subtract(x, y))

In [0]:
# Elementwise product; both produce the array
# [[ 5. 12.]
#  [21. 32.]]
print(x * y)
print(np.multiply(x, y))

In [0]:
# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

In [0]:
# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

Note that unlike MATLAB, * is elementwise multiplication, not matrix multiplication. We instead use the dot function to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices. dot is available both as a function in the numpy module and as an instance method of array objects:

In [0]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

In [0]:
# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))

In [0]:
# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))
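
In Python 3.5 and later, the `@` operator offers a third spelling of the same matrix products. This short sketch assumes a numpy version that supports `@` (any recent release does); the array names are illustrative.

```python
import numpy as np

m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6], [7, 8]])
vec = np.array([9, 10])

print(m1 @ m2)    # same result as m1.dot(m2)
print(m1 @ vec)   # same result as m1.dot(vec)
print(np.array_equal(m1 @ m2, np.dot(m1, m2)))
print(np.array_equal(m1 * m2, m1 @ m2))   # the elementwise product differs
```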

Numpy provides many useful functions for performing computations on arrays; one of the most useful is sum:

In [0]:
x = np.array([[1,2],[3,4]])

print(np.sum(x))          # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0))  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"
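
A useful way to remember the `axis` argument is that it names the dimension that gets collapsed. The sketch below uses a non-square array, a hypothetical `table`, so that the resulting shapes make this visible. The same argument works for other reductions such as `np.mean`.

```python
import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

col_sums = np.sum(table, axis=0)    # collapses the rows: one sum per column
row_sums = np.sum(table, axis=1)    # collapses the columns: one sum per row
row_means = np.mean(table, axis=1)
print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
print(row_means)
```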

# Matplotlib
Matplotlib is a plotting library whose interface is similar to MATLAB's plotting functions (hence the name).

In [0]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12, 8)

In [0]:
%matplotlib inline

**Standalone Plots**

The most important function in matplotlib is plot, which allows you to plot 2D data. Here is a simple example:

In [0]:
# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

# Plot the points using matplotlib
plt.plot(x, y)
plt.show()

With just a little bit of extra work we can easily plot multiple lines at once, and add a title, legend, and axis labels:

In [0]:
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()

**Subplots**

You can plot different things in the same figure using the subplot function. Here is an example:

In [0]:
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()

# Scipy.Stats
Scipy is a scientific computing library with many useful submodules. Many of you coming from R will find the scipy.stats module useful, as it contains statistical functions that can sample from and calculate the pdfs of common distributions. The following cell illustrates this with the folded normal distribution.

In [0]:
from scipy.stats import foldnorm
fig, ax = plt.subplots(1, 1)
c = 1.95 # Shape parameter

# Calculate & print the moments
mean, var, skew, kurt = foldnorm.stats(c, moments='mvsk')
print("mean:{}\nvar: {}\nskew:{}\nkurt:{}".format(mean, var, skew, kurt))

# Plot the probability density function (pdf):
x = np.linspace(foldnorm.ppf(0.01, c), foldnorm.ppf(0.99, c), 100)
ax.plot(x, foldnorm.pdf(x, c), 'r-', lw=5, alpha=0.6, label='foldnorm pdf')

# Generate random numbers and compare the histogram:
r = foldnorm.rvs(c, size=1000)
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)  # 'normed' was removed in Matplotlib 3.1
ax.legend(loc='best', frameon=False)
ax.set_title('Folded Normal Histogram')
# Display Plot
plt.show()

# Scikit-Learn
Scikit-Learn is an open-source machine learning library that implements many standard off-the-shelf machine learning algorithms and utility functions. Even if you are not using any of the models from this library, you will find much of the built-in functionality useful for running your own experiments. For example, the following cell plots cross-validated predictions from a linear regression against their true values.

In [0]:
from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model

lr = linear_model.LinearRegression()
# load_boston was removed in scikit-learn 1.2, so this cell uses the
# diabetes regression dataset instead.
X, y = datasets.load_diabetes(return_X_y=True)

# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, X, y, cv=10)

fig, ax = plt.subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
ax.set_title('Cross Validated Predictions')
plt.show()
